XAPIAN
Code an in-app search engine with Xapian
David Bolton shows how to add a search engine to your Python applications using the Xapian open source code library.
OUR EXPERT
David Bolton is very careless and is always losing things. No one needs a programmable search engine more than him to help him find his stuff and get it back.
In the Getting Started With Xapian boxout (see page 96), we indexed one of the CSV files, and that generated files in the db folder. We’ll come back to that shortly, but first let’s explain some of the Xapian concepts. A database holds the details of the things you want to search. It’s not a relational database but a set of specially formatted files on disk. Adding a document to a database is called indexing, and that’s what we did in the Getting Started box.
First some Xapian concepts
When you search the database, it returns a list of documents. Document here just means an item returned from a search. Documents are not fixed in size; a document might be arbitrary text, a web page or a small paragraph – it’s just whatever you added.
Internally, this is stored as a blob plus a numeric id that identifies it in a database. By default, the numeric id is an unsigned 32-bit integer, and zero is not a valid id. You can also use a non-numeric term as an id. You’ll see this in code where a prefix of ‘Q’ is used. For instance, in the delete1.py example, the following code loops through all identifiers, builds an identifying term for each, and then deletes matching documents:
for identifier in identifiers:
    idterm = u'Q' + identifier
    db.delete_document(idterm)
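The ‘Q’ id-term pattern can be tried out without a real index. What follows is only a minimal sketch: the helper names make_id_term and delete_by_identifiers are hypothetical (they are not part of Xapian), and the only Xapian call used is delete_document, which accepts a unique term on a writable database. The db argument is assumed to be a xapian.WritableDatabase, although any object with that method will work.

```python
def make_id_term(identifier):
    """Build the unique-id term for an identifier by adding the 'Q' prefix."""
    return "Q" + str(identifier)


def delete_by_identifiers(db, identifiers):
    """Delete the documents indexed under each identifier's 'Q' term.

    db is assumed to be a xapian.WritableDatabase (an assumption of this
    sketch); only its delete_document(term) method is used.
    Returns the list of terms that were passed to delete_document.
    """
    deleted_terms = []
    for identifier in identifiers:
        idterm = make_id_term(identifier)
        db.delete_document(idterm)  # removes every document carrying this term
        deleted_terms.append(idterm)
    return deleted_terms
```

For a deletion to match anything, the same term has to be attached to the document when it is indexed, for example with doc.add_boolean_term(idterm) before the document is stored.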
Document data
The data stored in a document is called document data. Xapian imposes no schema on it, but compresses it if it can. Documents can be up to 100MB in size.
Finally, each document has terms. You can use these when searching a database to return documents that match the term(s). Terms are generated as you index (add) a document to a database. Typically, one term is generated for each word in a piece of text, usually after applying some form of normalisation (such as converting all the characters to lower case). There are many useful strategies for producing terms.
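To make the idea of term generation concrete, here is a rough pure-Python sketch of the simplest strategy described above: one term per word, normalised to lower case. The function name simple_terms is hypothetical, and this is not how Xapian does it internally; in a real application you would normally let Xapian’s TermGenerator class do this work, which can also apply stemming.

```python
import re


def simple_terms(text):
    """Split text into words and lower-case each one to produce terms."""
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # such as the hyphen in "in-app" separates words.
    return [word.lower() for word in re.findall(r"\w+", text)]


print(simple_terms("Code an in-app Search engine with Xapian"))
# ['code', 'an', 'in', 'app', 'search', 'engine', 'with', 'xapian']
```

Because both the indexed text and the search query pass through the same normalisation, a search for ‘xapian’ would match a document containing ‘Xapian’.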