The Check property 'Text on page' identifies pages that contain or do not contain specified text. Please keep in mind that this property cannot be combined with any other object properties.
Text search using the Check property 'Text on page' works on the basis of the pdfToolbox Action --extracttext, which extracts text from a PDF file.
- Matching criteria: Selects how the text is compared, using one of the search criteria from the drop-down. Some examples:
  - begins with
  - contains text with Regex
  - ends with
  - equal to
  - matches with Regex
- Text to search: Enter the text or regular expression to search for, as appropriate for the selected 'Matching criteria' (a Regex in the screenshot below)
- Search in custom area: Defines the area of the page in which the text is searched. Important to note:
  - If you want to search the whole page, "Search in custom area" must be deactivated
  - Positive and negative numbers are allowed
  - If a 0 (zero) is entered in "Width" or "Height", this is interpreted as "no value" and the original width and height of the page box are kept
- Units: Defines the 'unit' of the custom page dimensions in
Choose the operator carefully, because the different operators are meant for different use cases. For example:
For the word 'Black' in a PDF and the regex 'Ba*', the operators behave as follows:
- 'Contains text with Regex': gives a hit (the word 'Black' contains a 'B' followed by zero or more 'a's)
- 'Matches with Regex': does not give a hit, because the regex does not match the entire word 'Black'
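
The difference between the two operators can be tried out outside pdfToolbox. The following is only a minimal sketch using Python's `re` module as a stand-in: it assumes that 'Contains text with Regex' behaves like a search anywhere in the text and that 'Matches with Regex' requires the whole text to match. The function names are hypothetical, and the regex engine used by pdfToolbox may differ in details.

```python
import re


def contains_regex(pattern: str, text: str) -> bool:
    """Assumed analogue of 'Contains text with Regex': a match anywhere in the text counts."""
    return re.search(pattern, text) is not None


def matches_regex(pattern: str, text: str) -> bool:
    """Assumed analogue of 'Matches with Regex': the pattern must cover the entire text."""
    return re.fullmatch(pattern, text) is not None


word = "Black"

# 'Ba*' finds the leading 'B' (followed by zero 'a's), so "contains" succeeds
print(contains_regex("Ba*", word))

# 'Ba*' cannot account for 'lack', so a full match fails
print(matches_regex("Ba*", word))

# A pattern that covers the whole word, such as 'B.*', matches in full
print(matches_regex("B.*", word))
```

Under these assumptions, a pattern meant for 'Matches with Regex' has to describe the complete text, for example by ending in `.*`, whereas for 'Contains text with Regex' a distinctive fragment is enough.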
This Check property can be used with the Action 'Split PDF at mark' (corresponding CLI parameter: --splitatmark).