This new release of Tesseract is fantastic news for anyone who scans or photographs printed documents. The software is a cornerstone of optical character recognition (OCR) on Linux, the process that turns a scanned image or photograph into fully editable text. We have a long track record of using OCR in Linux. Early Tesseract versions yielded mediocre results and fell short of ABBYY FineReader, until Tesseract 4 was released in late 2018. Three years on, the new Tesseract 5.0 improves both recognition quality and speed. The latest version delivers faster performance thanks to “fast floats”: the program now uses 32-bit floats instead of 64-bit doubles for its LSTM model training and text recognition.
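To get a feel for what swapping doubles for floats means, here is a minimal Python sketch. It is not Tesseract's own code: the `weights` list is a made-up stand-in for model data, and the standard-library `array` module is used only because it stores values as raw C floats or doubles. It shows the two effects of the change: the data takes half the space, at the cost of a small rounding error.

```python
from array import array

# Hypothetical stand-in for a model's weights; real Tesseract models are far larger.
weights = [0.1 * i for i in range(1000)]

as_double = array("d", weights)  # stored as 64-bit C doubles
as_float = array("f", weights)   # stored as 32-bit C floats


def payload_bytes(values: array) -> int:
    """Bytes used by the raw numeric data, ignoring Python object overhead."""
    return values.itemsize * len(values)


double_bytes = payload_bytes(as_double)
float_bytes = payload_bytes(as_float)

# Largest rounding error introduced by keeping the values as 32-bit floats.
max_error = max(abs(d - f) for d, f in zip(as_double, as_float))

print(f"double: {double_bytes} bytes, float: {float_bytes} bytes")
print(f"largest rounding error: {max_error:.2e}")
```

Halving the size of every value means twice as many fit into the processor's cache and vector registers, which is the usual reason such a change speeds things up. For character recognition the lost precision is generally too small to matter.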
LSTM stands for Long Short-Term Memory, a type of neural network that drives Tesseract’s recognition engine and lets it learn new characters. You’re most likely to need such training when digitising hand-written documents or older papers featuring typed text. An hour spent helping Tesseract learn a new font will pay off with plenty of saved time later on. Tesseract 5.0 is advertised as being much faster at both training and recognition while using less system memory. Although we didn’t stress-test Tesseract’s memory usage, we certainly noticed that the recognition phase now takes substantially less time than it used to. Don’t hesitate to update to the latest code!
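If you would like to check the speed-up on your own scans, a rough timing sketch in Python might look like the following. It assumes the `tesseract` executable is on your PATH and the English language data is installed. The helper names `build_command` and `timed_ocr` are our own invention, not part of Tesseract.

```python
import shutil
import subprocess
import time


def build_command(image_path: str, lang: str = "eng") -> list[str]:
    """Build a Tesseract command that prints the recognised text to stdout."""
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def timed_ocr(image_path: str, lang: str = "eng") -> tuple[str, float]:
    """Run OCR on one image and return (recognised text, seconds taken)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    start = time.perf_counter()
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise an error if Tesseract exits with a failure code
    )
    return result.stdout, time.perf_counter() - start
```

Calling `text, seconds = timed_ocr("scan.png")`, where `scan.png` is a placeholder for one of your own images, under both the old and the new Tesseract gives a simple like-for-like comparison. Run it a few times and average the results, as a single measurement can be skewed by disk caching.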