Getting Started with Optical Character Recognition (OCR) using TesseractOCR and Java

Optical Character Recognition (OCR) is a technology that recognizes text in images or scanned documents and converts it into editable, searchable data. In this article, we will cover the basics of OCR and learn how to use the open-source OCR engine Tesseract OCR with Java.

Overview of Tesseract OCR

Tesseract OCR is a widely used open-source OCR engine. It supports over 100 languages and runs on Windows, macOS, and Linux, which makes it suitable for many applications. Let's see how you can integrate Tesseract into your Java applications.

Installing Tesseract

First, you'll need to install the Tesseract engine and its trained data files on your system. You can find installation instructions for your operating system on the official Tesseract documentation website (tesseract-ocr.github.io/tessdoc/Installatio..).

To confirm Tesseract has been installed, run this in your terminal:

tesseract -v

Running Tesseract in the terminal

Now that you have Tesseract installed, you may want to use it directly from the command line for quick text extraction. Let's explore how to do this.

Navigate to your image directory

This step assumes you have an image you'd like to extract text from. If you don't, you can download image.png from the code repository. Then navigate to the folder where your image is located:

cd /path/to/your/image/folder

Replace /path/to/your/image/folder with the actual path to your image folder.

Run Tesseract

To extract text from your image, use the following command in your terminal:

tesseract image.png -

The trailing - tells Tesseract to write its output to standard output, so the extracted text is printed directly to the terminal.

By following these steps, you can quickly use Tesseract OCR from the terminal to extract text from images, which can be particularly useful for small-scale text extraction tasks.
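If you ever want to drive this same terminal command from a Java program without a wrapper library, the standard ProcessBuilder class is one option. The following is only a minimal sketch and is not taken from the article's repository: the class name TesseractCli and its methods are hypothetical, and it assumes the tesseract executable is on your PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: shells out to the tesseract command-line tool.
final class TesseractCli {

    // Builds the same command shown above: tesseract <image> -
    // The trailing "-" makes tesseract write the text to standard output.
    static List<String> buildCommand(String imagePath) {
        return List.of("tesseract", imagePath, "-");
    }

    // Runs the command and returns whatever tesseract wrote to standard output.
    static String extractText(String imagePath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore diagnostic messages
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) {
        // Assumption: image.png is in the current directory unless a path is given.
        String image = args.length > 0 ? args[0] : "image.png";
        try {
            System.out.println(extractText(image));
        } catch (IOException | InterruptedException e) {
            System.err.println("Could not run tesseract: " + e.getMessage());
        }
    }
}
```

This approach starts a new process for every image, so it suits occasional scripting more than high-volume use.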

Now that we've covered running Tesseract in the terminal, let's proceed to set up Tesseract in Java to leverage its OCR capabilities within your Java applications.

Using Tesseract in Java applications

To use Tesseract in our project, we need to add the tess4j Maven dependency. Tess4j is a Java JNA wrapper for the Tesseract OCR API that supports multiple image formats, such as JPEG, GIF, and PNG. It also supports PDF documents.

First, add the tess4j dependency to your pom.xml file:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.1</version>
</dependency>

Now that your environment is set up, you can start using Tesseract in your Java application. Below is a basic example of how to perform OCR with TesseractOCR:

import net.sourceforge.tess4j.*;
import java.io.File;

public class TesseractTest {

    public static void main(String[] args) {

        // Load the image as a file from its path
        File image = new File("src/main/resources/image.png");

        // Create a Tesseract instance from the tess4j library
        ITesseract tesseract = new Tesseract();

        // Set the path to the directory that holds the trained data files
        tesseract.setDatapath("src/main/resources/tessdata/");

        try {
            String result = tesseract.doOCR(image);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println("Error during OCR: " + e.getMessage());
        }

    }
}

Here, we’ve created a Tesseract instance and set its datapath to the directory that contains the osd.traineddata and eng.traineddata files.
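A missing trained data file is a common reason for OCR to fail at this point, so it can help to check the directory before calling doOCR. The sketch below uses only the standard library. The class TessdataCheck and its methods are hypothetical names introduced for illustration and are not part of tess4j.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check; not part of tess4j.
final class TessdataCheck {

    // Trained data files are named <language>.traineddata inside the data directory.
    static Path trainedDataFile(Path tessdataDir, String language) {
        return tessdataDir.resolve(language + ".traineddata");
    }

    // Returns the language codes whose .traineddata file is not present.
    static List<String> missingLanguages(Path tessdataDir, String... languages) {
        List<String> missing = new ArrayList<>();
        for (String language : languages) {
            if (!Files.isRegularFile(trainedDataFile(tessdataDir, language))) {
                missing.add(language);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Same directory that is passed to setDatapath in the example above.
        Path tessdata = Path.of("src/main/resources/tessdata");
        List<String> missing = missingLanguages(tessdata, "eng", "osd");
        if (missing.isEmpty()) {
            System.out.println("All trained data files found in " + tessdata);
        } else {
            System.out.println("Missing trained data for: " + missing);
        }
    }
}
```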

Now we can run the code, and the extracted text is printed to the console.

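By default, the OCR engine uses English when processing images. However, we can set a different language by adding the following after creating the Tesseract instance. The matching trained data file (here, tur.traineddata) must be present in the datapath directory: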

// The language is set to Turkish for the image being used
tesseract.setLanguage("tur");

Now, when we run the code, the Turkish text in the image is recognized more accurately.
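If an image mixes languages, Tesseract accepts several language codes joined with a plus sign, for example tur+eng. The helper below is a small sketch of building that value. LanguageSpec is a hypothetical name, and passing the result to setLanguage assumes tess4j forwards the string to the engine unchanged, which you should confirm for your version.

```java
import java.util.List;

// Hypothetical helper for building a multi-language value.
final class LanguageSpec {

    // Joins language codes with '+', the separator Tesseract uses for multiple languages.
    static String of(List<String> languageCodes) {
        if (languageCodes.isEmpty()) {
            throw new IllegalArgumentException("At least one language code is required");
        }
        return String.join("+", languageCodes);
    }

    public static void main(String[] args) {
        String spec = of(List.of("tur", "eng"));
        System.out.println(spec);
        // With tess4j on the classpath, the value would then be passed on:
        // tesseract.setLanguage(spec);
    }
}
```

Each listed language needs its own trained data file in the datapath directory.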

Conclusion

In this article, we’ve explored the Tesseract OCR engine and used the Tesseract command-line tool to process images. Then, we’ve looked at tess4j, a Java wrapper that integrates Tesseract into Java applications.

The code implementations are available on GitHub. If you have any questions or improvements, please let me know in the comments below.