
XAPIAN

Code an in-app search engine with Xapian

David Bolton shows how to add a search engine to your Python applications using the Xapian open source code library.

OUR EXPERT

David Bolton is very careless and is always losing things. No one needs a programmable search engine more than him to help him find his stuff and get it back.

In the Getting Started With Xapian boxout (see page 96), we indexed one of the CSV files, and that generated files in the db folder. We’ll come back to that shortly, but first let’s explain some of the Xapian concepts. A database holds the details of the things you want to search. It’s not a relational database but a set of specially formatted files on disk. Adding a document to a database is called indexing, and that’s what we did in the Getting Started boxout.

First some Xapian concepts

When you search the database, it returns a list of documents. Document here just means an item returned from a search. Documents are not fixed in size: a document might be arbitrary text, a web page or a small paragraph. It’s just whatever you added.
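As a rough sketch of what a search looks like in code, the function below parses a query string and returns one summary line per matching document. It assumes the Xapian Python bindings are installed, that a database already exists at the path you pass in, and that it was indexed with an English stemmer. The names search and describe_match and the summary format are illustrative, not taken from the article’s examples.

```python
try:
    import xapian  # Xapian's Python bindings (not in the standard library)
except ImportError:
    xapian = None  # the formatting helper below still works without Xapian


def describe_match(rank, docid, data):
    """Format one search hit as a line of text (hypothetical helper)."""
    if isinstance(data, bytes):
        data = data.decode("utf-8", errors="replace")
    return "%d: #%03d %s" % (rank + 1, docid, data)


def search(db_path, querystring, offset=0, pagesize=10):
    """Return a summary line for each document matching querystring."""
    db = xapian.Database(db_path)  # open the database read-only

    # Turn the user's text into a query (assumes an English-stemmed index).
    queryparser = xapian.QueryParser()
    queryparser.set_stemmer(xapian.Stem("en"))
    queryparser.set_stemming_strategy(queryparser.STEM_SOME)
    query = queryparser.parse_query(querystring)

    enquire = xapian.Enquire(db)
    enquire.set_query(query)

    # Each match carries its rank, its numeric id and the document itself.
    return [
        describe_match(match.rank, match.docid, match.document.get_data())
        for match in enquire.get_mset(offset, pagesize)
    ]
```

Calling search("db", "watch") would then return at most ten lines, best match first, each showing the document’s numeric id and its stored data.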

Internally, a document is stored as a blob plus a numeric id that identifies it in a database. The numeric id can be any 32-bit integer. You can also use a non-numeric term as an id. You’ll see this in code where a prefix of ‘Q’ is used. For instance, in the delete1.py example, the following code loops through all identifiers, builds an identifying term for each, and then deletes matching documents:

    for identifier in identifiers:
        idterm = u'Q' + identifier
        db.delete_document(idterm)
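For a delete by term to find anything, each document needs its ‘Q’ term attached when it is indexed. The sketch below shows one way that side might look, assuming every record has a unique identifier and that db is an open xapian.WritableDatabase. The helper names make_idterm and upsert are hypothetical.

```python
try:
    import xapian  # Xapian's Python bindings (not in the standard library)
except ImportError:
    xapian = None  # make_idterm() below still works without Xapian


def make_idterm(identifier):
    """Prefix an identifier with 'Q' to form its unique-id term."""
    return "Q" + str(identifier)


def upsert(db, identifier, text):
    """Add a document, or replace the one that already has this identifier."""
    doc = xapian.Document()
    doc.set_data(text)
    idterm = make_idterm(identifier)
    doc.add_boolean_term(idterm)      # tag the document with its id term
    db.replace_document(idterm, doc)  # replaces any document with that term
    return idterm
```

Because replace_document() is keyed on the id term, indexing the same record twice updates the existing document rather than creating a duplicate.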

Document data

The data in the document is called document data. There’s no schema to it but Xapian compresses it if it can. Documents can be up to 100MB in size.
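Because there is no schema, one common approach is to serialise whatever fields you want to display later and store the result as the document data. The sketch below uses JSON, which is only one possible choice. The field names and helper names are made up for illustration.

```python
import json

try:
    import xapian  # Xapian's Python bindings (not in the standard library)
except ImportError:
    xapian = None  # the pack/unpack helpers still work without Xapian


def pack_fields(fields):
    """Serialise a dict of fields to a JSON string for use as document data."""
    return json.dumps(fields, sort_keys=True)


def unpack_fields(data):
    """Turn stored document data (str or bytes) back into a dict."""
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return json.loads(data)


def make_document(fields):
    """Create a Xapian document whose data is the JSON-encoded fields."""
    doc = xapian.Document()
    doc.set_data(pack_fields(fields))
    return doc
```

When a search returns a document, passing the result of its get_data() call to unpack_fields() recovers the original fields.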

Finally, each document has terms. You can use these when searching a database to return documents that match the term(s). Terms are generated as you index (add) a document to a database. A term is often generated for each word in a piece of text, usually by applying some form of normalisation (such as changing all the characters to be lower case). There are many useful strategies for producing terms.
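To make the idea of normalisation concrete, here is a tiny pure-Python illustration that splits text into words and lower-cases each one. It is not Xapian’s own algorithm, and simple_terms is a made-up name; it only shows the kind of transformation involved.

```python
import re


def simple_terms(text):
    """Split text into lower-cased word terms (illustration only)."""
    return [word.lower() for word in re.findall(r"\w+", text)]


# Both spellings of "lost" normalise to the same term, so a search
# for either one would match this text.
print(simple_terms("Lost Keys, lost WALLET"))
```

In real code, Xapian’s TermGenerator class does this job through its index_text() method, and it can also apply stemming.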

This article is from Linux Format, February 2025.
