"direct" data structures and output
Filtering for data substructures or elements inside the "direct" data structure usually requires intimate knowledge of PDF syntax. A few examples are given below:
$.direct.Root: false
$.direct.Info: true
$.direct.ID: true
$.direct.Encrypt: true
Using a filter expression like "$.direct: true" or "$.direct.Root: true" risks creating massive amounts of output data, especially for non-trivial PDF files. Typically, either of these two filter expressions will produce output files that may be several times the size of the original PDF file, even though actual content data, such as page descriptions or image data, is never included.
If you actually intend to use these two filter expressions anyway, try them first on small and simple PDFs.
The "direct" block output
The "direct" block is a more or less direct translation of PDF syntax into JSON syntax.
- For requesting the Catalog root object, "$.direct.Root:true" must be used
- For requesting entries in the trailer dictionary, such as Info or ID, use "$.direct.Info:true" and "$.direct.ID:true"
For stream dictionaries, the stream portion will be omitted.
In the Quick Check configuration, specific parts of the PDF data structure can be requested by using the respective entry names in a concatenated path expression. For example, in order to request the ExtGState dictionary for pages in a PDF, the following filter expression could be used (which only works if the Page objects are direct children of the Kids element):
$.direct.Root.Pages.Kids.Resources.ExtGState: true
PDFs can include pages in very different ways. Pages may appear as Kids entries directly under the Pages key, but, like in real life, Kids can have Kids, and these can again have Kids. This makes it unpredictable where the pages of interest can be found in the PDF data structure. Of course one could simply retrieve all data below the topmost Pages entry, but this would create massive output for any multi-page PDF file that is not very small, and would also place an undue burden on the JavaScript code that has to parse and interpret the collected data.
Future versions of pdfToolbox will offer more elegant ways to walk nested trees of arrays, but for now the current approach has to be accepted as a known limitation.
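Until then, the nested Kids structure has to be walked in JavaScript. The following is a minimal sketch, assuming that everything below the topmost Pages entry has been retrieved and parsed into an object. The function name collectPages and the sample data are hypothetical and only mirror the shape of the output shown in this section.

```javascript
// Recursively collect all leaf Page objects below a Pages node of the
// "direct" output. Intermediate nodes have "Type": "Pages" and list their
// children in "Kids"; leaves have "Type": "Page".
function collectPages(node, pages = []) {
  if (!node || typeof node !== "object") return pages;
  if (node.Type === "Page") {
    pages.push(node);
  } else if (Array.isArray(node.Kids)) {
    for (const kid of node.Kids) collectPages(kid, pages);
  }
  return pages;
}

// Hypothetical sample data with one extra level of Kids nesting.
const sample = {
  direct: {
    Root: {
      Pages: {
        Type: "Pages",
        Count: 2,
        Kids: [
          { Type: "Page", Resources: { ExtGState: { GS0: { CA: 1.0 } } } },
          {
            Type: "Pages",
            Count: 1,
            Kids: [
              { Type: "Page", Resources: { ExtGState: { GS1: { CA: 0.6 } } } },
            ],
          },
        ],
      },
    },
  },
};

const pages = collectPages(sample.direct.Root.Pages);

// Names of all ExtGState entries found directly on the collected pages.
const extGStateNames = pages.flatMap((page) =>
  Object.keys(page.Resources?.ExtGState ?? {})
);

console.log(pages.length, extGStateNames);
```

Note that this sketch only looks at Resources stored on the Page objects themselves. In PDF, Resources can also be inherited from a parent Pages node, which would need additional handling.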
Currently there is no mechanism to retrieve data inside stream objects. Usually this is not much of a problem: Quick Check is not the right approach to, for example, retrieve raw image data. There is, however, at least one type of stream data that can be useful in the context of Quick Check: raw XMP metadata. For now, retrieving it is not supported. Depending on user demand, we may add extended capabilities in future versions of pdfToolbox. If this is of interest to you, please get in touch via our support email address and let us know why this would matter to you.
Example of complete "direct" output from a simple 1 page PDF
{
"direct": {
"Root": {
"Metadata" : {
"Type" : "Metadata",
"Length" : 51198,
"Subtype" : "XML"},
"OCProperties" : {
"D" : {
"Name" : "D",
"ON" : [
{
"Name" : "Image layer",
"Type" : "OCG",
"Intent" : [ "View", "Design"],
"Usage" : {
"CreatorInfo" : {
"Creator" : "Adobe Illustrator 22.1",
"Subtype" : "Artwork"}}},
{
"Name" : "Text layer",
"Type" : "OCG",
"Intent" : [ "View", "Design"],
"Usage" : {
"CreatorInfo" : {
"Creator" : "Adobe Illustrator 22.1",
"Subtype" : "Artwork"}}}],
"Order" : [],
"RBGroups" : []},
"OCGs" : []},
"OutputIntents" : [
{
"Info" : "U.S. Web Coated (SWOP) v2",
"DestOutputProfile" : {
"Length" : 557168,
"N" : 4
},
"OutputCondition" : "",
"OutputConditionIdentifier" : "CGATS TR 001",
"RegistryName" : "http://www.color.org",
"S" : "GTS_PDFX",
"Type" : "OutputIntent"
}
],
"Type" : "Catalog",
"Pages" : {
"Type" : "Pages",
"Count" : 1,
"Kids" : [
{
"Type" : "Page",
"BleedBox" : [ 0.000000, 0.000000, 400.000000, 300.000000],
"Contents" : {
"Length" : 931,
"Filter" : "FlateDecode"},
"CropBox" : [ 0.000000, 0.000000, 400.000000, 300.000000],
"Group" : {
"S" : "Transparency",
"CS" : "DeviceCMYK",
"I" : "false",
"K" : "false"},
"MediaBox" : [ 0.000000, 0.000000, 400.000000, 300.000000],
"Resources" : {
"ColorSpace" : {
"CS0" : [
"ICCBased",
{
"Length" : 2574,
"Filter" : "FlateDecode",
"N" : 3
}
]
},
"ExtGState" : {
"GS0" : {
"Type" : "ExtGState",
"AIS" : "false",
"BM" : "Normal",
"CA" : 1.000000,
"OP" : "false",
"OPM" : 1,
"SA" : "true",
"SMask" : "None",
"ca" : 1.000000,
"op" : "false"},
"GS1" : {
"Type" : "ExtGState",
"AIS" : "false",
"BM" : "Normal",
"CA" : 0.600006,
"OP" : "false",
"OPM" : 1,
"SA" : "true",
"SMask" : "None",
"ca" : 0.600006,
"op" : "false"}},
"Properties" : {},
"Shading" : {
"Sh0" : {
"AntiAlias" : "false",
"Coords" : [ 0.000000, 0.000000, 1.000000, 0.000000],
"Domain" : [ 0.000000, 1.000000],
"Extend" : [ "true", "true"],
"Function" : {
"Domain" : [ 0.000000, 1.000000],
"Bounds" : [],
"Encode" : [ 0.000000, 1.000000],
"FunctionType" : 3,
"Functions" : [
{
"N" : 1.651740,
"Domain" : [ 0.000000, 1.000000],
"FunctionType" : 2,
"C0" : [ 0.749020, 0.145098, 0.250980],
"C1" : [ 0.000000, 0.352941, 0.725490]}]},
"ShadingType" : 2}},
"XObject" : {
"Fm0" : {
"Length" : 9272,
"Subtype" : "Form",
"Group" : {
"S" : "Transparency",
"Type" : "Group",
"I" : "false",
"K" : "false"},
"Resources" : {
"ExtGState" : {},
"Font" : {
"TT0" : {
"Type" : "Font",
"Subtype" : "TrueType",
"BaseFont" : "LQUSKK+SourceSansPro-Bold",
"Encoding" : "WinAnsiEncoding",
"FirstChar" : 32,
"FontDescriptor" : {
"Type" : "FontDescriptor",
"Ascent" : 974,
"CapHeight" : 660,
"Descent" : -383,
"Flags" : 32,
"FontBBox" : [ -231, -383, 1223, 974],
"FontFamily" : "Source Sans Pro",
"FontFile2" : {
"Length" : 5645,
"Filter" : "FlateDecode",
"Length1" : 13038
},
"FontName" : "LQUSKK+SourceSansPro-Bold",
"FontStretch" : "Normal",
"FontWeight" : 700,
"ItalicAngle" : 0,
"StemV" : 148,
"XHeight" : 496
},
"LastChar" : 120,
"Widths" : [ 208, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 528, 528, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 556, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 573, 0, 573, 0, 341, 0, 0, 0, 0, 0, 286, 0, 0, 555, 573, 0, 0, 0, 0, 0, 0, 0, 514]}},
"ProcSet" : [
"PDF",
"Text"
]
},
"BBox" : [ 0.000000, 78.644500, 400.000000, 17.500000],
"Matrix" : [ 1.000000, 0.000000, 0.000000, 1.000000, 0.000000, 0.000000]
}
}
},
"TrimBox" : [ 0.000000, 0.000000, 400.000000, 300.000000]}]}
},
"Info": {
"CreationDate" : "D:20180329161202+02'00'",
"Creator" : "Adobe Illustrator CC 22.1 (Macintosh)",
"GTS_PDFXVersion" : "PDF/X-4",
"ModDate" : "D:20180329161202+02'00'",
"Producer" : "Adobe PDF library 15.00",
"Title" : "simple pdfToolbox 10 sample file",
"Trapped" : "False"
},
"ID": [ "97ef49f0c10247839eb4d4e368588648", "c330dc533cf143f096cd2e246bfb39dc" ]
},
"status": {
"time_needed_sec" : 0.016667,
"result" : "complete"
}
}
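As a rough illustration of how such output might be consumed in JavaScript, the sketch below parses a shortened, hypothetical excerpt of the result above. It relies on two things visible in the sample: the "status" block reports "result" as "complete", and PDF booleans appear as the strings "true" and "false" rather than as JSON booleans. The helper pdfBool is an assumption made for this example, and other possible values of "result" are not covered here.

```javascript
// Hypothetical, shortened excerpt shaped like the example output above.
const resultText = `{
  "direct": {
    "Root": {
      "Pages": { "Kids": [ { "Group": { "I": "false", "K": "false" } } ] }
    },
    "Info": { "Producer": "Adobe PDF library 15.00", "Trapped": "False" }
  },
  "status": { "time_needed_sec": 0.016667, "result": "complete" }
}`;

const result = JSON.parse(resultText);

// The sample output reports "complete" in the status block.
const isComplete = result.status.result === "complete";

// PDF booleans are written as strings in the sample, so convert explicitly.
function pdfBool(value) {
  return value === true || value === "true";
}

const group = result.direct.Root.Pages.Kids[0].Group;
console.log(isComplete, result.direct.Info.Producer, pdfBool(group.I));
```

Converting explicitly matters because the non-empty string "false" is truthy in JavaScript, so a plain `if (group.I)` test would give the wrong answer.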
Hints and tricks
If a dot (.) or colon (:) occurs in a filter path identifier, then this character must be escaped with a preceding backslash (\), e.g.:
- $.direct.Root.Private.Test\:2\:Colon : true
- $.direct.Root.Private.Test\.2\.Points : true
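When filter expressions are generated by a script, the escaping rule above can be applied automatically. The following is a minimal sketch; escapeKey and buildFilter are hypothetical helper names, not part of pdfToolbox, and only the dot and colon cases described above are handled.

```javascript
// Escape dots and colons in a single path identifier with a backslash.
function escapeKey(key) {
  return key.replace(/[.:]/g, "\\$&");
}

// Build a filter expression line from individual path identifiers.
function buildFilter(keys, enabled = true) {
  return "$." + keys.map(escapeKey).join(".") + ": " + enabled;
}

console.log(buildFilter(["direct", "Root", "Private", "Test:2:Colon"]));
// $.direct.Root.Private.Test\:2\:Colon: true
console.log(buildFilter(["direct", "Root", "Private", "Test.2.Points"]));
// $.direct.Root.Private.Test\.2\.Points: true
```

Passing the identifiers as separate array elements keeps the dots that separate path segments apart from dots that belong to a key name.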