The Fourth Paradigm of Document Capture, part 2

by Duff Johnson
Friday, April 8, 2011

In the previous article I offered a synopsis of the document capture paradigms up to and including electronic document capture. Here in part 2, I explain the current and explore the future paradigms of document capture.

The Third Paradigm of Document Capture: PDF

Second Paradigm thinking wasn't limited to scanning paper or previously-imaged microfilm. Computer Output to Laser Disc (COLD) became popular in the late 1980s and early 1990s as a “straight to storage” solution for electronically-generated documents such as bank and insurance statements. Most COLD technologies represented the document with images (usually TIFF files) and database records.

As it turns out, electronic documents are more than just images and metadata.

In the 1980s, Adobe Systems was a humming factory for a vast array of technologies and concepts. Among other accomplishments, the company drove the publishing industry from a set of highly technical specialities right into the computer on your desk. Adobe got many requests from government and industry for electronic document technology, but it was a request from the IRS for a reliable, cross-platform electronic document format that caught the company's eye.

In 1993, Adobe Systems announced PDF (Portable Document Format) along with its soon-to-be flagship PDF creation, management and manipulation software, Adobe Acrobat. The “electronic document” was born.

Why PDF?

There are many reasons why the PDF format has proven so durable. I cover these reasons in detail in a separate article, so here I'll keep it to the bullet points:
As it turns out, these are the attributes that allow an electronic document to resemble paper closely enough that people are willing to trust it as they do paper.

The Fourth Paradigm of Document Capture: Semantics

Once page and text have been captured, users can view and search for documents. What's left to capture? The answer: semantics.

What are “semantics”?

In the electronic document context, semantic information describes the relationships between elements of content. Those familiar with HTML will recognize the concept right away because semantics are an inextricable part of HTML. Here's an example. The browser's interpretation appears first, followed by the HTML source:

A second-level heading in the current page.
A paragraph of text.
<H2>A second-level heading in the current page.</H2>
<P>A paragraph of text.</P>
<UL>
<LI>The first List Item in an Unordered List (bullets instead of numbers)</LI>
<LI>The second List Item in the Unordered List</LI>
</UL>

Pretty simple, right? The tags within the brackets express logical relationships that help software interpret and display the actual content in a pleasing, easy-to-read fashion. Tags accomplish two things:
For example, an <H2> tag signifies that the text enclosed by the tag is to be understood as a second-level heading. This fact allows the reader (and other consumers, such as software) to characterize the text as important: a chapter, or perhaps a section heading. Likewise, a <UL> tag, along with its subordinate <LI> tags, denotes a list of items, to be distinguished from simple paragraphs of text.

Semantics, in other words, allow one to distinguish a document from a stream of words.

How Semantics Work in PDF
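Before turning to PDF, the HTML case can be made concrete. The following is a minimal Python sketch, not taken from the original article, of how software might use tags to tell a heading from a paragraph or a list item. It relies only on the standard library's `html.parser` module; the names `OutlineParser`, `outline`, `headings` and `SAMPLE` are hypothetical, chosen for illustration, and the sketch assumes simple, non-nested markup like the example above.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collects (tag, text) pairs for headings, paragraphs and list items."""

    # Tags this sketch treats as meaningful units (an illustrative choice).
    TRACKED = {"h1", "h2", "h3", "p", "li"}

    def __init__(self):
        super().__init__()
        self.items = []       # (tag, text) pairs in document order
        self._current = None  # tracked tag currently being read, if any
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase, so <H2> arrives as "h2".
        if tag in self.TRACKED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.items.append((tag, "".join(self._buffer).strip()))
            self._current = None


def outline(html):
    """Return the tagged units of an HTML fragment as (tag, text) pairs."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.items


def headings(items):
    """Pick out only the heading text, which plain text alone cannot reveal."""
    return [text for tag, text in items if tag.startswith("h")]


# The HTML example from the article, as one string.
SAMPLE = (
    "<H2>A second-level heading in the current page.</H2>"
    "<P>A paragraph of text.</P>"
    "<UL><LI>The first List Item in an Unordered List (bullets instead of numbers)</LI>"
    "<LI>The second List Item in the Unordered List</LI></UL>"
)

if __name__ == "__main__":
    for tag, text in outline(SAMPLE):
        print(f"{tag}: {text}")
```

Without the tags, the same program would see only a run of words and could not say which of them form the heading. That is the distinction between a document and a stream of words.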
Since 2000, PDF files may include tags, and they even look rather similar to HTML tags (see the image to the right). PDF tags perform exactly the same sort of role as HTML tags: they define the logical reading order of the content. A tagged PDF contains information that can help all manner of consumers (blind users, those copying text, search engines and others) navigate and understand their documents.

Today, the vast majority (probably over 99%) of PDF files are created without semantic information. Even when their software can include semantics in PDF files, most users don't know how to implement correct semantics when they write documents, let alone when they convert them into some other format.

What's the new Paradigm?

In 2011, we'll see the publication of ISO 19005-2, part two of PDF/A. We'll also see the approval (if not the publication) of ISO 14289-1, PDF/UA.

Why should document capture people care?
Between these two standards and ISO 32000 itself, the technical underpinnings of the Fourth Paradigm of Document Capture are complete. It is now possible to capture not just images of documents, not just their text, but also the structure and organization of the content. From search engines to screen readers, from tables to footnotes, the humble PDF will become even easier to use and to re-use than ever before.

How can we help?

Appligent Document Solutions was the first company in the world to offer commercial PDF tagging services to capture document semantics in addition to document text. PDF tags work with any sort of PDF file, including electronic-source content, scanned documents, forms and more. Contact us for more information.