Modifying structure of bookmarks and DPart
pdfToolbox can extract existing DPart metadata or bookmark structures as JSON so that you can modify them and inject them back into the PDF document. This functionality is available in Process Plans, where you extract the structures using a Quick Check and import them via the “Apply structures” action. It is also available directly on the command line with pdfToolbox CLI, using the following commands:
--modifystructures --extract
--modifystructures --apply
If you modify a document, such as converting a single-page booklet to reader spreads, adding new pages, or deleting existing pages, most of the time the bookmark structure is not updated and can no longer be used. Here is an example:
- Document with single pages and existing bookmark structure
- After conversion to reader spreads, the existing bookmarks won't work anymore because the bookmark structure is not updated accordingly.
To update the bookmark structure, follow the steps below. The modified test file with the "old" bookmark structure can be downloaded here:
Use the following command on the CLI to extract the existing bookmark structure to JSON:
pdfToolbox <sample.pdf> --modifystructures --extract=bookmarks
To modify the extracted bookmark structure, it is important to understand the different properties of the JSON objects:
"bookmarks": Array that contains a list of all bookmarks found in the PDF file.
"level": Indicates the nesting level of each bookmark as typically displayed in a PDF viewing program.
"name": Indicates the name of the bookmark.
"page": Specifies the page number that defines the destination of the bookmark.
"open": True if the bookmark is initially open. This only has an effect for a bookmark hierarchy.
The bookmark structure should now be adjusted to match the specific use case. In our example, the page numbers have to be divided by 2 so that the bookmarks refer to the correct pages in the reader spreads document. In addition, the name of the first bookmark was changed from "Introduction" to "Manual content".
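The adjustment described above can also be scripted. The following is a minimal Python sketch, not part of pdfToolbox. It assumes that the extracted JSON holds a flat "bookmarks" array whose entries carry the "level", "name", "page" and "open" properties, and that page numbers are 1-based. The function names, the sample data and the cover_single switch are hypothetical. How a page number maps to a spread depends on how the spreads were built, so both common variants are shown.

```python
import json


def spread_page(page, cover_single=True):
    """Map a 1-based single-page number to a 1-based reader-spread number."""
    if cover_single:
        # Page 1 stays alone; pages 2-3 become spread 2, pages 4-5 spread 3, ...
        return page // 2 + 1
    # Pages 1-2 become spread 1, pages 3-4 spread 2, ...
    return (page + 1) // 2


def update_bookmarks(structure, renames=None, cover_single=True):
    """Return a copy of the structure with remapped pages and renamed entries."""
    renames = renames or {}
    updated = []
    for bookmark in structure["bookmarks"]:
        entry = dict(bookmark)  # copy so the extracted data stays untouched
        entry["page"] = spread_page(entry["page"], cover_single)
        entry["name"] = renames.get(entry["name"], entry["name"])
        updated.append(entry)
    return {**structure, "bookmarks": updated}


# Hypothetical extract; a real file may contain other values or further keys.
extracted = json.loads("""
{"bookmarks": [
  {"level": 0, "name": "Introduction", "page": 1, "open": true},
  {"level": 1, "name": "Setup", "page": 4, "open": false},
  {"level": 0, "name": "Reference", "page": 7, "open": false}
]}
""")

result = update_bookmarks(extracted, renames={"Introduction": "Manual content"})
print(json.dumps(result, indent=2))
```

Check the remapped page numbers against the converted document before applying the JSON, since the sketch cannot know whether the first page was kept as a single page.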
DPart is page-based metadata in a PDF file, stored in a hierarchy that groups pages into “document parts” (which in turn can group pages into "records"). It is organized as a tree so that PDF processors can more easily identify pages or page ranges that have certain properties. pdfToolbox can display such metadata (as described in Display DPart Metadata).
When the page structure in a document is modified by any application, such as adding new pages or deleting existing pages, quite often the DPart structure is not updated.
In the example below (taken from the PDF/VT sample files at https://pdfa.org/resource/cal-poly-pdfvt-test-suite/) there is a PDF file that has 10 "document parts" / "records". Each "record" consists of 4 pages: the first two pages are a brochure and the last two pages are luggage tags.
Let us assume that we need to delete the luggage tags from each record, which means that the DPart metadata will no longer fit.
To update the DPart structure, follow the steps below. The modified test file with the "old" DPart metadata structure can be downloaded here:
Use the following command on the CLI to extract the existing DPart structure to JSON:
pdfToolbox <sample.pdf> --modifystructures --extract=dpart
To modify the extracted DPart structure, it is important to understand the different entries:
"dpartroot" : The root node of the hierarchy.
"dparts": Can refer to a specific range of one or more PDF page objects, identified by their start and end keys.
"dpm": The Document Part Metadata Dictionary contains the actual metadata related to the different parts of the document. The DPM refers to the DPart node in which it is defined and its pages.
"start": The start key is the number of the first PDF page to which the DPart node refers.
"end" (optional): If a DPart node refers to not just one page but to a range of pages there is an end property (in addition to the start property) indicating the last page to which the node refers.
At the end there are two more entries:
"nodenamelist" (optional): Specifies the names/meaning of the DPart node levels in the tree hierarchy. These names can provide some explanation and could be used by software to display information. In our example we have three levels: the root, the record level and the pages.
"recordlevel" (optional): Indicates on which level of the DPart hierarchy the record boundaries are; in our example, recordlevel is 1. Since the root is level zero, this indicates that the first level of the DPart hierarchy refers to the records. In our example, these are ranges of four pages each, specified via their start and end entries.
The DPart metadata for the luggage tags has been manually removed from the records. Now the records consist only of the two brochure entries. The page numbers ("start" keys) have also been adjusted:
Use the following command on the CLI to inject the DPart structure into the PDF:
pdfToolbox --modifystructures --apply=<JSON file> <PDF file>
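When the extract and apply steps are called from a script, the two command lines shown in this section can be assembled as argument lists. The following Python sketch only builds the commands and does not run them. The helper names are hypothetical, the executable is assumed to be reachable as pdfToolbox on the search path, and the argument order simply mirrors the examples above.

```python
import shlex


def extract_command(pdf_path, structure):
    """Build the argument list for extracting "bookmarks" or "dpart" as JSON."""
    if structure not in ("bookmarks", "dpart"):
        raise ValueError("structure must be 'bookmarks' or 'dpart'")
    return ["pdfToolbox", pdf_path, "--modifystructures", f"--extract={structure}"]


def apply_command(pdf_path, json_path):
    """Build the argument list for injecting a modified JSON file into a PDF."""
    return ["pdfToolbox", "--modifystructures", f"--apply={json_path}", pdf_path]


# Print the commands in a form that can be pasted into a shell.
print(shlex.join(extract_command("sample.pdf", "dpart")))
print(shlex.join(apply_command("sample.pdf", "dpart.json")))
```

Each list can be passed to subprocess.run(..., check=True) to execute the step. Passing a list rather than a single string avoids quoting problems with file names that contain spaces.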
Now each record has only two pages, and the DPart metadata for all luggage tags in the DPart structure is deleted:
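The manual edit used in this example, removing the luggage tag nodes from each record and renumbering the "start" keys, could also be scripted before the JSON file is applied. The following Python sketch rests on several assumptions that are not confirmed by this description: that "dpartroot" is an object whose "dparts" array holds the records, that each record has its own "dparts" array of page nodes, and that page numbers are 1-based. The sample data, the "Kind" entry inside "dpm" and the function names are hypothetical, so the key paths will need adapting to the JSON that pdfToolbox actually writes.

```python
import copy
import json


def keep_leading_pages(structure, keep_per_record=2):
    """Keep the first page nodes of each record and renumber "start"/"end"."""
    result = copy.deepcopy(structure)  # do not mutate the extracted data
    next_page = 1  # assumption: page numbers are 1-based
    for record in result["dpartroot"]["dparts"]:
        kept = record["dparts"][:keep_per_record]
        for node in kept:
            # A node without "end" covers exactly one page.
            length = node.get("end", node["start"]) - node["start"] + 1
            node["start"] = next_page
            if "end" in node:
                node["end"] = next_page + length - 1
            next_page += length
        record["dparts"] = kept
    return result


def make_record(first_page):
    """Build a hypothetical four-page record: two brochure and two tag pages."""
    kinds = ["Brochure", "Brochure", "LuggageTag", "LuggageTag"]
    return {"dparts": [{"start": first_page + i, "dpm": {"Kind": kind}}
                       for i, kind in enumerate(kinds)]}


# Hypothetical two-record extract; layout and "dpm" contents are assumptions.
extracted = {
    "dpartroot": {"dparts": [make_record(1), make_record(5)]},
    "nodenamelist": ["Root", "Record", "Page"],
    "recordlevel": 1,
}

trimmed = keep_leading_pages(extracted)
print(json.dumps(trimmed, indent=2))
```

The sketch relies on the luggage tags always being the last two page nodes of a record, as in the sample file. For documents with a less regular layout, the nodes to drop would have to be selected by their metadata instead of by position.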