
Dump your paper docs with perfect OCR

Nick Peers reveals how to extract editable text from images and printed materials with the help of optical character recognition software.

OUR EXPERT

Nick Peers loves his family history, and loves it even more now he can convert family biographies into editable text files without manually transcribing them first.

Anyone who’s been faced with the arduous task of transcribing text from either printed material or a digital scan of printed text will no doubt have heard of optical character recognition (OCR) technology. OCR is one of the earliest examples of machine learning, whereby a computer model is trained to recognise shapes on a digital image and translate those shapes into text characters. Once the shape of each letter is identified and translated into editable text, words followed by sentences, paragraphs and entire tracts of text can be extracted from the digital scan.

OCR has roots going back decades, and while commercial engines perform increasingly miraculous conversions – not just on typed text, but also handwriting – open source engines continue to develop alongside them. Linux is blessed with several OCR engines, all with roots in commercial products, but now open sourced and completely free to use. The best known of these – which we’ll focus on in this tutorial – is Tesseract (https://github.com/tesseract-ocr/tesseract), a command-line OCR engine that can be used on its own or paired with a number of graphical front-ends to perform OCR across a variety of usage scenarios, from extracting editable text directly from scanned documents to converting PDFs, image files, screen grabs and even image-based subtitle tracks in media files.

Before going further, check the box opposite for a quick look at Tesseract and two of its main open-source rivals – note, you can install all three at once and try different ones to see which produces the best results.

Choose which part of the screen to capture and NormCap OCRs its contents and places them on the clipboard.

Marks, set, scan!

If your scanned image isn’t perfect, use ScanTailor Advanced to make it clearer for Tesseract to read and translate.

The obvious place to start is by installing the underlying Tesseract OCR engine. It exists in various forms – including standalone AppImage and Snap – but you can also install it via its own repository to ensure it’s the latest version (5, as opposed to 4 in most universal repositories):

$ sudo add-apt-repository ppa:alex-p/tesseract-ocr5
$ sudo apt update
$ sudo apt install tesseract-ocr

This installs both the Tesseract engine and the trained data that enables it to recognise English text. You can add more languages using sudo apt install tesseract-ocr-lan, substituting lan with the relevant three-letter language code, such as fra, spa or deu.
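As a rough sketch of how an extra language might be used once installed, the helper functions below check that a language's trained data is present and then pass it to Tesseract with the -l option. The function names has_lang and ocr_lang, and the TESSERACT override variable, are made up for this illustration and are not part of Tesseract itself; --list-langs and -l are standard Tesseract options.

```shell
# Name or path of the tesseract binary. The TESSERACT variable is an
# assumption of this sketch, handy if you keep several builds around.
TESSERACT="${TESSERACT:-tesseract}"

# has_lang CODE: succeed if trained data for CODE (such as fra) is installed.
# --list-langs prints one language code per line after a header line.
has_lang() {
    "$TESSERACT" --list-langs 2>&1 | grep -qx "$1"
}

# ocr_lang CODE IMAGE OUTBASE: OCR IMAGE using language CODE and write the
# recognised text to OUTBASE.txt.
ocr_lang() {
    if ! has_lang "$1"; then
        echo "No trained data for '$1': try sudo apt install tesseract-ocr-$1" >&2
        return 1
    fi
    "$TESSERACT" "$2" "$3" -l "$1"
}
```

With the French data installed, ocr_lang fra lettre.jpg lettre should produce lettre.txt. Tesseract also accepts several languages joined with a plus sign, such as -l eng+fra, for documents that mix languages.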

With the command-line engine installed, you can now perform OCR from the command line on a wide range of files – prior to 2016, Tesseract only worked with TIFF files, but now it can handle most popular formats, including PNG, GIF and JPEG. Basic command-line usage is as follows:

$ tesseract image.jpg textfile
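Tesseract adds the .txt extension itself, so the command above writes its results to textfile.txt. Because it handles one image per call, a folder of scans needs a loop. The following is a minimal sketch of one way to do that; the function name ocr_dir and the TESSERACT override variable are invented for this example, and it assumes the scans are PNG or JPEG files.

```shell
# Name or path of the tesseract binary (the TESSERACT variable is an
# assumption of this sketch, not something Tesseract reads itself).
TESSERACT="${TESSERACT:-tesseract}"

# ocr_dir DIR: OCR every PNG and JPEG file in DIR, writing the text next to
# each image (scan01.png becomes scan01.txt). Images that already have a
# matching .txt file are skipped, so the function is safe to re-run.
ocr_dir() {
    for img in "$1"/*.png "$1"/*.jpg "$1"/*.jpeg; do
        [ -e "$img" ] || continue          # the pattern matched no files
        base="${img%.*}"                   # strip the file extension
        [ -e "$base.txt" ] && continue     # already converted
        "$TESSERACT" "$img" "$base" || echo "Failed: $img" >&2
    done
}
```

After pasting the function into a terminal, ocr_dir ~/scans would convert everything in that folder. To preview a single result without creating a file, use stdout as the output name: tesseract image.jpg stdout.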

Full documentation can be found under Command Line Usage at the program’s online documentation (https://tesseract-ocr.github.io/tessdoc/), but instead of persevering with the terminal, let’s examine a selection of GUI tools that provide a user-friendly frontend to Tesseract while also expanding its capabilities.

This article is from Linux Format, January 2024.
