Step by step - Learn how to use callas products

Process Plan: Determine text in custom area

This Process Plan can be used to extract the text found in a specified area of a page. There are two variations of this Process Plan:

1. Determine text in area: Can be used for PDF files containing "regular" text

(earliest version with full support is pdfToolbox 14)

2. Determine text in area with OCR: Can be used for PDF files that do not contain "regular" text objects (e.g. a scanned page)

Determine text in area with OCR.kfpx

(earliest version with full support is pdfToolbox 15)

Testfile_extract_text_OCR.pdf

Both Process Plans are essentially the same. "Determine text in area with OCR" has three additional steps at the beginning:

Remove existing OCR text
Convert page into an image
Create a new OCR text

These steps are only needed to create clean OCR text that can be extracted reliably. Once the engine has analyzed the text in the defined area, the original file is picked up again for the remaining steps.

The next steps are the same for both variants:

This Check uses the property "Text on page" with a RegEx to determine non-whitespace text in a specified area on the page. The regular expression matches any string that contains at least one non-whitespace character.
If text is found in the specified area, it is returned as a string in the JavaScript object. If no text is found, the string "no text found" is returned.
For demonstration purposes: Places a green rectangle around the search area to visualize where the text was extracted (see result PDF below).
For demonstration purposes: Places the extracted text in green color on the page at the same location where it was extracted (see result PDF below).
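
The matching and fallback logic described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not part of the Process Plan. The pattern `\S` and the function name `text_or_fallback` are assumptions; the exact expression used in the Check is not shown here, only its behavior (match any string with at least one non-whitespace character).

```python
import re

# Assumed pattern: succeeds as soon as one non-whitespace character is present.
# The actual expression used in the Check may be written differently.
NON_WHITESPACE = re.compile(r"\S")

# Fallback string named in the description above.
FALLBACK = "no text found"


def text_or_fallback(area_text: str) -> str:
    """Return the area text unchanged if it has any non-whitespace
    character, otherwise return the fallback string."""
    if NON_WHITESPACE.search(area_text):
        return area_text
    return FALLBACK


print(text_or_fallback("Invoice 2024-001"))
print(text_or_fallback("   \n\t"))
```

An area that holds only spaces, tabs, or line breaks is treated the same as an empty area, which is why the fallback string is returned in the second call.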

Limitations

These Process Plans can only extract text from a single page. They are not designed to extract text from multiple pages.
