OCR support for additional languages

By default, the "Create invisible text via OCR" Fixup supports English and German. This article describes how to add support for additional languages.

The "Create invisible text via OCR" Fixup internally uses the Tesseract engine. Language files (trainings) for many languages can be obtained from a GitHub repository that is maintained by the Tesseract community:

https://github.com/tesseract-ocr/tessdata_fast
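
Fetching a language training from this repository can be scripted. The following is a minimal sketch, not an official procedure: it assumes that each file is named after its language code (for example "fra.traineddata") and is served from the repository's "main" branch. The function names are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: the trainings are served from the "main" branch of tessdata_fast.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"


def traineddata_url(lang_code: str) -> str:
    """Build the download URL for a language code such as 'fra' or 'ita'."""
    return f"{BASE_URL}/{lang_code}.traineddata"


def download_traineddata(lang_code: str, target_dir: Path) -> Path:
    """Download one language training into target_dir and return its path."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{lang_code}.traineddata"
    # Network access is required here; errors from urlretrieve are not caught.
    urlretrieve(traineddata_url(lang_code), str(target))
    return target
```

Calling download_traineddata("fra", Path("downloads")) would, under these assumptions, place the French training in a local "downloads" folder, from where it can be copied to the locations described below.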

Making language trainings available to pdfToolbox Desktop

To use these files in pdfToolbox Desktop, they have to be placed in the "OCR" folder inside the preferences folder. The easiest way to get to this folder is through the Switchboard.

In the Switchboard, go to "Text" and then "OCR".

Then click the options item at the bottom and select "Manage language trained data". This opens the folder or, if no language trainings have been used before, creates it.

You may now put language trainings into this folder.

Referencing language trainings in a Fixup

After a language training is installed, it can be used in a Fixup.

Specify only the languages that are actually needed: both performance and recognition quality are better with fewer languages.

If more than one language training is used, the most accurate one should be placed at the top of the list, because the engine gives it priority.

Using language trainings in pdfToolbox Server/CLI or SDK

When you export a Profile that uses the "Create invisible text via OCR" Fixup, the language trainings are not exported with it. Instead, they have to be installed in the instance of pdfToolbox Server/CLI or pdfToolbox SDK that you are using. All of these applications have a subfolder named "etc" in their program folder.

Inside "etc" you will find the folder "OCRTool", and inside that the folder "tessdata". Any language trainings that you want to use in pdfToolbox Server/CLI or pdfToolbox SDK have to be placed in this "tessdata" folder.
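
On a server, this copy step can be automated. The sketch below assumes that the language trainings are files ending in ".traineddata" that have already been downloaded to a local folder, and that the "etc/OCRTool/tessdata" folder already exists in the program folder. The function name and both folder arguments are hypothetical; adjust them to your installation.

```python
import shutil
from pathlib import Path


def install_trainings(source_dir: Path, program_dir: Path) -> list:
    """Copy all *.traineddata files from source_dir into
    <program_dir>/etc/OCRTool/tessdata and return the copied paths."""
    tessdata = program_dir / "etc" / "OCRTool" / "tessdata"
    if not tessdata.is_dir():
        # Fail early instead of creating the folder in the wrong place.
        raise FileNotFoundError(f"tessdata folder not found: {tessdata}")
    installed = []
    for training in sorted(source_dir.glob("*.traineddata")):
        # copy2 keeps file timestamps; existing files of the same name are overwritten.
        installed.append(Path(shutil.copy2(training, tessdata / training.name)))
    return installed
```

The function deliberately raises an error when the target folder is missing, since a missing "tessdata" folder usually means that program_dir does not point to the program folder.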

Note that English and German are preinstalled. They are used even if no language setting is specified in the Profile. However, if you process more German text than English, you should still specify the language to improve results.