MIT ports Tesseract OCR to JavaScript

Port from developers at MIT supports dozens of languages and makes it easier and cheaper to build image-processing applications

With their JavaScript port of the Tesseract optical character recognition engine, developers at MIT are looking to provide convenience and lower costs in building image-processing applications.

Tesseract.js, released this month, supports more than 60 languages, automatic text orientation, and script detection. Running either in a browser or on a server via Node.js, it features a simple interface for reading paragraph, word, and character bounding boxes.

"We've seen people use it to build Web applications for scanning receipts, for motivational poster applications, and in general it's useful for anything where user-supplied pictures with text on them need to be recognized or edited," said co-developers Kevin Kwok and Guillermo Webster, students at MIT.

The developers believed there were practical reasons people might want JavaScript-based OCR. "The first reason is convenience -- the C++ version of Tesseract can be tricky to install, and nearly impossible for people with rare setups or limited privileges," the developers said. The advantage of a pure JavaScript library is that it can run on almost any system with a JavaScript interpreter.

"The second reason is that for some applications, it's just too expensive or painful to set up a server to offload image processing onto," they said. "Tesseract.js lets you offload the computationally expensive task of text recognition to the client, allowing your service to scale to arbitrarily many users without having to figure out how to set up -- and to pay for -- compute clusters doing OCR."

Tesseract.js is built on top of the Tesseract engine. Using the Emscripten compiler, the developers cross-compiled the Tesseract library to create tesseract.js-core and added a system to automatically download and persist language files. Computation is done in a separate thread to boost application performance.
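
The download-and-persist idea can be illustrated with a short sketch. It is not the library's actual code: the names createLanguageLoader, download, and store are hypothetical, and the store is assumed to be a simple Map-like object. The loader fetches a language file once, keeps it, and lets simultaneous requests share a single download.

```javascript
// Hypothetical sketch: none of these names come from Tesseract.js itself.
// `download(lang)` returns a Promise for the language data; `store` is any
// object with Map-like has/get/set methods that outlives a single call.
function createLanguageLoader(download, store) {
  const inFlight = new Map(); // downloads that have started but not finished

  return function loadLanguage(lang) {
    // Persisted copy available: no network access needed.
    if (store.has(lang)) {
      return Promise.resolve(store.get(lang));
    }
    // Another caller is already fetching this language: share that request.
    if (inFlight.has(lang)) {
      return inFlight.get(lang);
    }
    const request = download(lang).then(
      function (data) {
        store.set(lang, data); // persist for later calls
        inFlight.delete(lang);
        return data;
      },
      function (error) {
        inFlight.delete(lang); // allow a later retry
        throw error;
      }
    );
    inFlight.set(lang, request);
    return request;
  };
}
```

In a browser, the store would more likely be backed by persistent storage with an asynchronous interface, so a real implementation would need to adapt the lookup step accordingly.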

"We tried to make the actual API layer that developers interact with as smooth and painless as possible," the students said. "After a developer includes the script in their project, they only have to write the line: Tesseract.recognize(myImage).then(function (result) { console.log(result) })." No boilerplate code is required for initialization, and there is no need for manual management of pointers.

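The one-line call quoted above returns a promise, so a failure path can be attached to it in the usual way. The following is a minimal sketch under stated assumptions: recognizeAndReport and fakeRecognize are hypothetical helpers, not part of Tesseract.js, and the recognize function is passed in as a parameter so the example runs without the library installed.

```javascript
// Hypothetical helper, not part of Tesseract.js. `recognize` is expected to
// behave like the call quoted in the article: it takes an image and returns
// a promise for a result object. `log` receives a status label and a value.
function recognizeAndReport(recognize, image, log) {
  return recognize(image).then(
    function (result) {
      log('recognized', result);
      return result;
    },
    function (error) {
      // Report the failure and resolve to null rather than rejecting.
      log('failed', error);
      return null;
    }
  );
}

// Stand-in for the real recognizer, used only to make the sketch runnable.
// The shape of its result is invented for illustration.
function fakeRecognize(image) {
  return Promise.resolve({ source: image });
}

recognizeAndReport(fakeRecognize, 'receipt.png', function (status, value) {
  console.log(status, value);
});
```

With the library loaded, the first argument could be a small wrapper such as function (img) { return Tesseract.recognize(img); }.
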
The developers, though, say some users have been disappointed with Tesseract.js after a few test runs, in part because it is geared toward documents rather than photographs. "One of these reasons is that Tesseract was designed first and foremost for scanning documents -- it really shines when it's given high contrast, high resolution paper documents. But with photographs, it tends to get confused." For now, the developers recommend pre-processing images before feeding them into Tesseract.js to improve the contrast, scale up the resolution, and remove background noise. They are also looking into providing these functions as part of Tesseract.js itself, as well as adding support for more file formats.
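
Two of the recommended pre-processing steps, improving contrast and scaling up resolution, can be sketched in plain JavaScript. This is an illustration only, not code from the Tesseract.js project: the function names are hypothetical, the input is assumed to be RGBA bytes in the layout used by a canvas ImageData object, and the upscaling uses simple nearest-neighbour copying where a real pipeline might prefer smoother interpolation. Noise removal is not covered.

```javascript
// Hypothetical helpers for illustration; none of these are Tesseract.js APIs.

// Collapse RGBA bytes (4 per pixel) into one luminance value per pixel.
function toGrayscale(rgba) {
  const gray = new Uint8ClampedArray(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[4 * i];
    const g = rgba[4 * i + 1];
    const b = rgba[4 * i + 2];
    gray[i] = 0.299 * r + 0.587 * g + 0.114 * b; // common luma weights
  }
  return gray;
}

// Linearly stretch values so the darkest pixel becomes 0 and the lightest 255.
function stretchContrast(gray) {
  let min = 255;
  let max = 0;
  for (const v of gray) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  const out = new Uint8ClampedArray(gray.length);
  if (max === min) {
    out.set(gray); // flat image: nothing to stretch
    return out;
  }
  for (let i = 0; i < gray.length; i++) {
    out[i] = ((gray[i] - min) * 255) / (max - min);
  }
  return out;
}

// Enlarge a grayscale image by an integer factor, repeating each pixel.
function upscaleNearest(gray, width, height, factor) {
  const outWidth = width * factor;
  const outHeight = height * factor;
  const out = new Uint8ClampedArray(outWidth * outHeight);
  for (let y = 0; y < outHeight; y++) {
    const srcY = Math.floor(y / factor);
    for (let x = 0; x < outWidth; x++) {
      out[y * outWidth + x] = gray[srcY * width + Math.floor(x / factor)];
    }
  }
  return { data: out, width: outWidth, height: outHeight };
}
```

The processed pixels would still have to be turned back into an image, for example by drawing them onto a canvas, before being handed to the OCR call.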

Copyright © 2016 IDG Communications, Inc.