Create invisible text via OCR
A customer scans a paper document, which is then available only in a "primitive" electronic form (an image, for example in TIFF format). The text in such a file is not searchable.
pdfToolbox 12 uses the open-source Tesseract technology to create an output PDF with searchable text from an input document, which can be an image or a PDF. The visual appearance of the input document is preserved: the resulting PDF contains an overlay with the searchable text, and this overlay has no visual representation of its own.
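For readers who want to see the underlying idea outside pdfToolbox, the following is a minimal sketch that calls the standalone Tesseract command-line tool to write a PDF with an invisible text layer. It is not how pdfToolbox invokes Tesseract internally. The function names and the file names in the usage note are hypothetical, and the sketch assumes a `tesseract` executable is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for a standalone Tesseract run.

    The trailing "pdf" config asks Tesseract to write <output_base>.pdf,
    in which the page image is kept and the recognized text is invisible.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def make_searchable_pdf(image_path, output_base, language="eng"):
    """Run Tesseract and return the path of the PDF it is expected to write."""
    # Assumption: Tesseract is installed separately and reachable on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, language),
        check=True,  # raise CalledProcessError if Tesseract reports a failure
    )
    return Path(str(output_base) + ".pdf")
```

A call such as `make_searchable_pdf("scan.tif", "scan_ocr", "deu")` would, under these assumptions, produce `scan_ocr.pdf` next to the script.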
- Image Resolution:
- Apply to: Apply the Fixup to, for example, all pages, or only to text that cannot be mapped to Unicode (as in the screenshot above)
- Language: Results are much better if the language is specified. If the language field is empty, all installed languages (how to install languages) are used. This field expects 3-character ISO language codes.
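When the Language value is generated by a script rather than typed in by hand, a small check can catch malformed codes early. The sketch below is an illustration only: the function name is hypothetical, and the assumption that several codes are separated by commas or spaces is ours, not something stated for pdfToolbox. It only checks the 3-character shape of each code, not whether that language is actually installed.

```python
import re

# Shape check only: exactly three ASCII letters, e.g. "eng" or "deu".
_THREE_LETTER_CODE = re.compile(r"[a-z]{3}")


def parse_language_field(value):
    """Split a language field into a list of 3-character codes.

    An empty field returns an empty list, which corresponds to the
    "use all installed languages" behaviour described above.
    Assumption: multiple codes are separated by commas and/or whitespace.
    """
    codes = []
    for raw in re.split(r"[,\s]+", value):
        code = raw.strip().lower()
        if not code:
            continue  # skip empty pieces from leading or trailing separators
        if not _THREE_LETTER_CODE.fullmatch(code):
            raise ValueError(f"not a 3-character language code: {raw!r}")
        if code not in codes:
            codes.append(code)  # keep first occurrence, preserve order
    return codes
```

For example, `parse_language_field("eng, deu")` yields the two default languages as a list, while a 2-character code such as `"en"` is rejected with a `ValueError`.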
Double check OCR results
You can check the results of the OCR in "Test" mode.
As stated in the Fixup's comment, the Fixup supports English (eng) and German (deu) text by default. You can install further languages as explained here.
Action: OCR
The same can be achieved using a Switchboard Action. Simply look for the Action 'OCR':