OCR support for additional languages
By default, the "Create invisible text via OCR" Fixup supports English and German. This article describes how to add support for additional languages.
The "Create invisible text via OCR" Fixup uses the Tesseract engine internally. Language files (trainings) for many languages are available from a GitHub repository maintained by the Tesseract community.
To use these files in pdfToolbox Desktop, put them into the "OCR" folder inside the preferences folder. The easiest way to reach this folder is through the Switchboard.
Go to Text, OCR.
Then click the options item at the bottom and select "Manage language trained data". This opens the folder or, if no language trainings have been used before, creates it.
You may now put language trainings into this folder.
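To check which language trainings are present in a folder, a small script can list them. The following is a minimal sketch, assuming that trainings are Tesseract files with the ".traineddata" extension and that the language code is the file name without that extension. The function name installed_languages is hypothetical, and the folder path has to be supplied by you (for example, the folder opened by "Manage language trained data").

```python
from pathlib import Path


def installed_languages(ocr_folder):
    """Return the sorted language codes found in ocr_folder.

    Assumption: each training is a file named <code>.traineddata,
    as is the convention for Tesseract language files.
    """
    folder = Path(ocr_folder)
    if not folder.is_dir():
        # Folder does not exist yet, so nothing is installed there.
        return []
    return sorted(
        path.stem
        for path in folder.glob("*.traineddata")
        if path.is_file()
    )
```

Calling installed_languages with the path of the folder returns a list of codes such as those used in the training file names; an empty list means no trainings were found there.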
After a language training is installed, it can be used in a Fixup.
Performance and quality are better when only the languages that are actually needed are specified.
If more than one language training is used, place the most accurate one at the top of the list, since the engine gives it priority.
When you export a Profile that uses the "Create invisible text via OCR" Fixup, the language trainings are not exported with it. Instead, they have to be installed in the instance of pdfToolbox Server/CLI or pdfToolbox SDK that you are using. Each of these applications has a subfolder named "etc" in its program folder.
Inside "etc" you will find the folder "OCRTool", which in turn contains the folder "tessdata". Any language trainings that you want to use in pdfToolbox Server/CLI or pdfToolbox SDK have to be put into this "tessdata" folder.
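When several trainings have to be installed on a server, the copy step can be scripted. The following is a minimal sketch, assuming the trainings are ".traineddata" files and the folder layout is etc/OCRTool/tessdata below the program folder as described above. The function name install_trainings and both parameters are hypothetical; the location of the program folder depends on your installation and has to be supplied by you.

```python
import shutil
from pathlib import Path


def install_trainings(source_dir, program_dir):
    """Copy all *.traineddata files from source_dir into
    <program_dir>/etc/OCRTool/tessdata and return the copied file names.

    Assumption: program_dir is the program folder of a pdfToolbox
    Server/CLI or SDK installation (path chosen by the caller).
    """
    target = Path(program_dir) / "etc" / "OCRTool" / "tessdata"
    if not target.is_dir():
        # Refuse to create the folder: a missing folder most likely
        # means that program_dir points to the wrong place.
        raise FileNotFoundError(f"tessdata folder not found: {target}")

    copied = []
    for source in sorted(Path(source_dir).glob("*.traineddata")):
        shutil.copy2(source, target / source.name)
        copied.append(source.name)
    return copied
```

The function deliberately raises an error instead of creating a missing "tessdata" folder, so that a wrong program folder is noticed rather than silently filled with files that the engine would never read.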
Note that English and German are preinstalled. They are used even if no language setting is specified in the Profile. However, if you process more German text than English, you should still specify the language to improve results.