PDF/A-3: The Better Container for Electronic Documents
Summary:
The Portable Document Format, or PDF, is the most commonly used document format for business applications. PDF/A-3 permits the embedding of files of any format. This article gives an overview of the advantages of PDF/A-3 as an electronic container.
The Portable Document Format, or PDF, is the most commonly used document format for business applications. Back in 1990, Adobe's co-founder Dr. John Warnock published a white paper essentially describing the need for the PDF format:
What industries badly need is a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks. These documents should be viewable on any display and should be printable on any modern printers. If this problem can be solved, then the fundamental way people work will change.
Dr. John Warnock, Co-founder Adobe
In a nutshell, the advantage of PDF documents is their multi-platform compatibility. A document such as a tax form or invoice can be sent to any recipient, who is then able to read, complete or print it. Because of the available editing restrictions (and later electronic signatures), a PDF has been handled very much like printed paper and has received a similar status in business processes.
Electronic Data Interchange, better known as EDI, has existed since the early 1970s to communicate data between applications.
Legal restrictions and user experience require most data (for example invoices) to be human-readable. In theory, the PDF document is the perfect format to replace printed paper. It is easy to send, easy to read on all machines, can be searched and is good for archiving processes. But the machine-readable data is missing.
The software industry tried to solve this issue by recognizing content in PDF documents (very similar to OCR processes) to give documents a context (Is this document an invoice?) and to match content with expected fields (invoice number, addresses, products, ...).
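The recognition approach described above can be illustrated with a minimal Python sketch. It assumes the text of a page has already been extracted; the field names and regular expressions are hypothetical and only stand in for the kind of rules such software has to maintain.

```python
import re

# Hypothetical patterns; real invoices vary widely in layout and wording.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"Total\s*:?\s*\$?\s*([0-9]+(?:[.,][0-9]{2})?)", re.IGNORECASE
    ),
}


def recognize_fields(page_text: str) -> dict:
    """Return one entry per expected field; fields that were not found map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page_text)
        result[name] = match.group(1) if match else None
    return result


def looks_like_invoice(page_text: str) -> bool:
    """Crude context check: a text counts as an invoice if an invoice number was found."""
    return recognize_fields(page_text)["invoice_number"] is not None


english = "Invoice No: INV-2041\nTotal: $129.50"
german = "Rechnung Nr. 2041\nSumme 129,50"

print(recognize_fields(english))
print(looks_like_invoice(german))
```

The same rules that work for the English sample find nothing in the German one, which is the weakness of guessing structure from the human-readable rendering.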
PDF/A-3 introduced a significant change compared to its predecessors. PDF/A-3 (ISO 19005-3:2012) permits the embedding of files of any format (including XML, MS Word and proprietary binary formats). This change allows the progression from electronic paper to an electronic container that holds both the human-readable and the machine-readable versions of a document.
Now, the human-readable version can be ignored by applications reading the data of the document. Applications can extract the machine-readable portion of the PDF document in order to process it. A PDF/A-3 document can contain an unlimited number of embedded documents for different processes. According to the specification, software applications can extract embedded files without explicit knowledge of the PDF document itself.
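As a rough illustration of the container idea, the following Python sketch models a document with several embedded files and processes only the machine-readable one. The `EmbeddedFile` class, the XML element names and the sample data are all hypothetical; they are not part of the PDF/A-3 specification or of any particular library.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class EmbeddedFile:
    """Hypothetical stand-in for a file taken out of a PDF/A-3 container."""
    file_name: str
    mime_type: str
    data: bytes


def find_machine_readable(files, mime_type="application/xml"):
    """Return the first embedded file with the given MIME type, or None."""
    return next((f for f in files if f.mime_type == mime_type), None)


def read_invoice(xml_bytes: bytes) -> dict:
    """Parse the assumed invoice schema: <Invoice><Number/><Total/></Invoice>."""
    root = ET.fromstring(xml_bytes)
    return {
        "number": root.findtext("Number"),
        "total": float(root.findtext("Total")),
    }


# One container can carry several attachments for different processes.
attachments = [
    EmbeddedFile(
        "source.docx",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        b"...",
    ),
    EmbeddedFile(
        "invoice.xml",
        "application/xml",
        b"<Invoice><Number>INV-2041</Number><Total>129.50</Total></Invoice>",
    ),
]

payload = find_machine_readable(attachments)
invoice = read_invoice(payload.data)
print(invoice)
```

The sketch never looks at the page content: the application reads structured data directly. It deliberately starts from attachments that are already in memory and leaves out the step of extracting them from the PDF file itself.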
Technically, that is not an easy process. With TX Text Control X19 (29.0), we will provide PDF features to create documents with embedded files and also to extract embedded files from these electronic containers. The following code creates a PDF document with an embedded XML document:
When the document is opened in Adobe Acrobat Reader, the embedded files are listed in the Attachments sidebar tab.
TX Text Control can be used not only to create those documents, but also to import them and extract the embedded files. The following code shows how to extract the XML data from the PDF document:
The EmbeddedFiles property contains an array of all files embedded in the PDF document. TX Text Control can be used to cover the complete PDF document workflow, from creating the document to processing incoming documents in business applications. Combined with the powerful template-based document creation engine, TX Text Control provides developers with a complete solution for handling PDF documents in business processes.