Process Plan: Determine text in custom area

This Process Plan can be used to derive text in a specified area. There are two variations of this Process Plan:

1. Determine text in area: Can be used for PDF files containing "regular" text

2. Determine text in area with OCR: Can be used for PDF files that do not contain "regular" text objects (e.g. scanned pages)

The earliest version with full support for “Determine text in area.kfpx” is pdfToolbox 14.

The earliest version with full support for “Determine text in area with OCR.kfpx” is pdfToolbox 15.

Both Process Plans are essentially the same. "Determine text in area with OCR" has three additional steps at the beginning:

  1. Remove existing OCR text
  2. Convert page into an image
  3. Create a new OCR text

These steps are needed only to create clean OCR text that can be extracted reliably. Once the engine has analyzed the text in the defined area, the original file is picked up again, so the OCR preparation does not change it.

The next steps are the same for both variants:

  1. A Check uses the property "Text on page" with a regular expression to find non-whitespace text in a specified area of the page. The regular expression matches any string that contains at least one non-whitespace character.
  2. If text is found in the specified area, it is returned as a string in the JavaScript object. If no text is found, the string "no text found" is returned.
  3. For demonstration purposes: Places a green rectangle around the search area to visualize where the text was extracted (see result PDF below).
  4. For demonstration purposes: Places the extracted text in green color on the page at the same location where it was extracted (see result PDF below).
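
The logic of steps 1 and 2 can be illustrated outside pdfToolbox with a minimal Python sketch. It is not the Process Plan's actual implementation: the `TextFragment` and `Area` types, the `text_in_area` function, and the sample coordinates are hypothetical, and the pattern `\S` is an assumption, chosen because it matches any string with at least one non-whitespace character as described above.

```python
import re
from dataclasses import dataclass

# Assumed pattern: matches any string with at least one non-whitespace character.
NON_WHITESPACE = re.compile(r"\S")


@dataclass
class TextFragment:
    """Hypothetical piece of page text with its position in points."""
    text: str
    x: float
    y: float


@dataclass
class Area:
    """Hypothetical search rectangle, measured in points."""
    left: float
    bottom: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        # True if the point lies inside the rectangle (edges included).
        return (self.left <= x <= self.left + self.width
                and self.bottom <= y <= self.bottom + self.height)


def text_in_area(fragments, area):
    """Return the text found inside the area, or the fallback string."""
    hits = [f.text for f in fragments
            if area.contains(f.x, f.y) and NON_WHITESPACE.search(f.text)]
    return " ".join(hits) if hits else "no text found"


# Sample page: one real text fragment, one whitespace-only fragment,
# and one fragment outside the search area.
page = [
    TextFragment("Invoice 4711", 50, 700),
    TextFragment("   ", 60, 690),
    TextFragment("Footer", 50, 20),
]
header = Area(left=40, bottom=650, width=300, height=100)
empty = Area(left=400, bottom=400, width=10, height=10)

print(text_in_area(page, header))
print(text_in_area(page, empty))
```

The whitespace-only fragment is skipped by the regular expression, so an area that holds only blanks produces the same "no text found" result as an area with no text objects at all.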

Limitations

These Process Plans can only extract text from a single page. They are not designed to extract text across multiple pages.