Partial OCR (filtering page content)

You can think of OCR as adding semantics to characters and words that are otherwise only shapes (glyphs). Whether a character has such semantics is usually described in terms of its Unicode representation: a character (glyph) has semantics if a Unicode code point is known for it.

In most PDFs, at least most characters have a Unicode representation, unless the PDF was created by scanning paper. However, some PDF creators are not as thorough as they should be and skip certain characters. In such cases, a full OCR would also add invisible Unicode text for characters that already have a Unicode representation. This is not a big problem for text search or for copying text out of the PDF. However, if you create a full text extract from the PDF, you will end up with duplicated paragraphs with the same content. It is better to OCR only those text portions that do not already have Unicode. The Process Plan introduced in this article does exactly that.
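The duplication effect can be illustrated with a small toy model. This is only a sketch: `TextRun`, `has_unicode` and `extract` are hypothetical names invented for the illustration and are not part of any PDF tool. It also assumes perfect OCR recognition and ignores what the extraction engine emits for the original text without Unicode.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    glyphs: str        # what a reader sees on the page
    has_unicode: bool  # True if the PDF maps the glyphs to Unicode


def extract(runs, ocr_all):
    """Return the lines a text extract would contain after OCR (toy model)."""
    lines = []
    for run in runs:
        if run.has_unicode:
            # The original text is already extractable.
            lines.append(run.glyphs)
        if ocr_all or not run.has_unicode:
            # Invisible OCR text added for this run (assumes perfect recognition).
            lines.append(run.glyphs)
    return lines


page = [TextRun("Invoice 2024", True), TextRun("Total: 42 EUR", False)]
full = extract(page, ocr_all=True)      # full OCR: the first line appears twice
partial = extract(page, ocr_all=False)  # partial OCR: every line appears once
```

With full OCR, the run that already had Unicode shows up twice in the extract; with partial OCR, each run contributes exactly one line.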

Step 1: OCR text that does not have Unicode

The "Create invisible text via OCR" Fixup has an "Apply to" filter. If it is used, only those page objects that the filter finds are rendered into the intermediate image that serves as the basis for the OCR.

Since we want to OCR all text that does not already have Unicode, we use this filter:

[Screenshot: "Apply to" filter of the Fixup]

The "No Unicode representation" filter is set up as follows:

[Screenshot: settings of the "No Unicode representation" filter]

This adds additional, invisible "characters" for all text that has no Unicode representation, and these added characters do have Unicode.

Step 2: Outline non-Unicode text

When you want to prepare for a text extract (and that is the main purpose of this approach, as explained above), you should also remove all text without Unicode. Otherwise you will see some "invalid unicode" indicators in your text extract:

[Screenshot: text extract with "invalid unicode" indicators]
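If you want to check an existing text extract for such indicators, a short script can help. The following is a minimal sketch that assumes the indicators appear as the Unicode replacement character (U+FFFD) or as private-use code points. How your extraction tool actually marks text without Unicode may differ, and `find_unmapped` is a hypothetical helper name.

```python
import unicodedata


def find_unmapped(text):
    """Return (index, char) pairs for characters that look like unmapped glyphs."""
    hits = []
    for i, ch in enumerate(text):
        # Assumption: unmapped glyphs show up as U+FFFD (replacement character)
        # or as private-use code points (Unicode general category "Co").
        if ch == "\ufffd" or unicodedata.category(ch) == "Co":
            hits.append((i, ch))
    return hits


sample = "Total: \ufffd\ufffd EUR \ue001"
positions = [i for i, _ in find_unmapped(sample)]
```

An empty result for a page suggests that its extracted text contains none of the assumed indicator characters.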

Thanks to the OCR from step 1, the same text is now also present with Unicode, but it is better to get rid of the original text without Unicode as well.

Therefore, the second step converts the glyphs of the non-Unicode text into outlines. The extraction engine then no longer treats them as text.
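The effect of both steps on a text extract can be sketched with a simplified page model. Everything in it is an assumption made for illustration: `PageObject`, `ocr_missing`, `outline_missing` and `extract_text` are hypothetical names, OCR is assumed to recognize the text perfectly, and U+FFFD stands in for whatever "invalid unicode" indicator the extraction engine emits.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PageObject:
    content: str
    kind: str = "text"        # "text" or "outline"
    has_unicode: bool = True
    visible: bool = True


def ocr_missing(objs):
    """Step 1: add invisible Unicode text for every text object without Unicode."""
    out = list(objs)
    for o in objs:
        if o.kind == "text" and not o.has_unicode:
            out.append(PageObject(o.content, visible=False))
    return out


def outline_missing(objs):
    """Step 2: turn text objects without Unicode into outlines."""
    return [
        replace(o, kind="outline") if o.kind == "text" and not o.has_unicode else o
        for o in objs
    ]


def extract_text(objs):
    """Toy extraction: only text objects count; unmapped text yields U+FFFD."""
    lines = []
    for o in objs:
        if o.kind != "text":
            continue  # outlines are plain vector shapes, not text
        lines.append(o.content if o.has_unicode else "\ufffd" * len(o.content))
    return lines


page = [PageObject("Heading"), PageObject("Body", has_unicode=False)]
after_step1 = extract_text(ocr_missing(page))
after_step2 = extract_text(outline_missing(ocr_missing(page)))
```

In this model, the extract after step 1 still contains a line of indicator characters next to the OCR text, while the extract after step 2 contains only clean Unicode text.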
