The Fourth Paradigm of Document Capture, part 2

by Duff Johnson

Friday, April 8, 2011

In the previous article I offered a synopsis of the document capture paradigms up to and including electronic document capture. Here in part 2, I explain the current and explore the future paradigms of document capture.

The Third Paradigm of Document Capture: PDF

Second Paradigm thinking wasn't limited to scanning paper or previously-imaged microfilm. Computer Output to Laser Disc (COLD) became popular in the late 1980s and early 1990s as a “straight to storage” solution for electronically-generated documents such as bank and insurance statements. Most COLD technologies represented the document with images (usually TIFF files) and database records. As it turns out, electronic documents are more than just images and metadata.

In the 1980s, Adobe Systems was a humming factory for a vast array of technologies and concepts. Among other accomplishments, the company drove the publishing industry from a set of highly technical specialities right into the computer on your desk. Adobe got many requests from government and industry for electronic document technology, but it was a request from the IRS for a reliable, cross-platform electronic document format that caught their eye.

In 1993, Adobe Systems announced PDF (Portable Document Format) along with its soon-to-be flagship PDF creation, management and manipulation software: Adobe Acrobat. The “electronic document” was born.

Why PDF?

There are many reasons why the PDF format has proven so durable. I cover these reasons in detail in a separate article, so here I'll keep it to the bullet points:

  • Easy to make from any source

  • Authentically represents the original

  • Portable, and free to view and print

  • Flexible and powerful, with many features

  • Relatively secure (and secure-able)

  • Non-proprietary

As it turns out, these are the attributes that allow an electronic document to resemble paper closely enough that people are willing to trust it as they do paper.

The Fourth Paradigm of Document Capture: Semantics

Once page and text have been captured, users can view and search for documents. What's left to capture? The answer: semantics.

What are “semantics”?

In the electronic document context, semantic information describes the relationships between elements of content. Those familiar with HTML will recognize the concept right away because semantics are an inextricable part of HTML. Here's an example, with the browser's interpretation shown first and the markup that produces it shown after:

A second-level heading in the current page.

A paragraph of text.

  • The first List Item in an Unordered List (bullets instead of numbers)

  • The second List Item in the Unordered List

<H2>A second-level heading in the current page.</H2>

<P>A paragraph of text.</P>

<UL><LI>The first List Item in an Unordered List (bullets instead of numbers)</LI>

<LI>The second List Item in the Unordered List</LI></UL>

Pretty simple, right? The tags within the brackets express logical relationships that help software interpret and display the actual content in a pleasing, easy-to-read fashion. Tags accomplish two things:

  1. Organize content into the correct logical reading order, and

  2. Specify the role or function of the text in the document

For example, an <H2> tag signifies that the text enclosed by the tag is to be understood as a 2nd level heading. This fact allows the reader (and other consumers, such as software) to characterize the text as important... a chapter, or perhaps a section heading. Likewise, a <UL> tag, along with its subordinate <LI> tags, denotes a list of items, to be distinguished from simple paragraphs of text.
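The two functions of tags described above can be illustrated with a minimal sketch, assuming Python and its standard-library html.parser module. The class name RoleCollector, the helper collect_roles and the SAMPLE string are hypothetical names chosen for this illustration. The sketch walks the example markup and records each piece of text together with the tag that encloses it, in the order the content appears.

```python
from html.parser import HTMLParser

# The example markup from this article (SAMPLE is an illustrative name).
SAMPLE = (
    "<H2>A second-level heading in the current page.</H2>"
    "<P>A paragraph of text.</P>"
    "<UL><LI>The first List Item in an Unordered List</LI>"
    "<LI>The second List Item in the Unordered List</LI></UL>"
)


class RoleCollector(HTMLParser):
    """Collects (role, text) pairs in the order the content appears."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.items = []  # (role, text) pairs in reading order

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Text outside any tag has no role, so it is skipped here.
        if text and self.stack:
            self.items.append((self.stack[-1], text))


def collect_roles(markup):
    """Return the (role, text) pairs found in the given markup."""
    parser = RoleCollector()
    parser.feed(markup)
    parser.close()
    return parser.items


if __name__ == "__main__":
    for role, text in collect_roles(SAMPLE):
        print(f"{role:>2}: {text}")
```

Feeding the same words without any tags yields nothing to collect, which is the difference between a document and a stream of words.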

Semantics, in other words, allow one to distinguish a document from a stream of words.

How Semantics Work in PDF

[Image: A tag tree from a PDF]

PDF was originally designed to ensure reliable on-screen and in-print appearance. Searchability, text extraction and content re-use weren't priorities. While it was always possible to extract text from PDFs, the means of including semantic information along with text, graphics, annotations and other content in the document was only added to commercially-available software beginning in 2000.

Since 2000, PDF files may include tags, and they even look rather similar to HTML tags (see the tag tree image). PDF tags perform exactly the same sort of role as HTML tags: they define the logical reading order of the content. A tagged PDF contains information that can help all manner of consumers (blind users, those copying text, search engines and others) navigate and understand their documents.
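To make the idea of a tag tree concrete, here is a rough Python sketch. It is not a PDF parser and uses no PDF library. The StructElem class, the reading_order function and the TREE data are all hypothetical stand-ins, built by hand to mirror the HTML example earlier in this article. The tag names used (Document, H2, P, L, LI) are standard PDF structure types, but the tree is deliberately simplified. The point is only that a depth-first walk of the tree yields the content in its logical reading order, with each piece carrying its role.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class StructElem:
    """A hypothetical stand-in for one node of a PDF tag tree."""
    tag: str        # structure type, e.g. "H2", "P", "L", "LI"
    text: str = ""  # content attached directly to this node
    kids: List["StructElem"] = field(default_factory=list)


def reading_order(node: StructElem, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Depth-first walk: yields (depth, tag, text) in logical reading order."""
    yield depth, node.tag, node.text
    for kid in node.kids:
        yield from reading_order(kid, depth + 1)


# Illustrative tree mirroring the HTML example; not read from a real file.
TREE = StructElem("Document", kids=[
    StructElem("H2", "A second-level heading in the current page."),
    StructElem("P", "A paragraph of text."),
    StructElem("L", kids=[
        StructElem("LI", "The first List Item"),
        StructElem("LI", "The second List Item"),
    ]),
])

if __name__ == "__main__":
    for depth, tag, text in reading_order(TREE):
        print("  " * depth + tag + (": " + text if text else ""))
```

A consumer such as a screen reader or a search engine could use a walk like this to announce "heading", "paragraph" or "list item" alongside the text, rather than guessing from the position of words on the page.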

Today, the vast majority (probably over 99%) of PDF files are created without semantic information. Even if their software is capable of including semantics in their PDF files, most users don't know how to implement correct semantics when they write documents, let alone convert them into some other format.

What's the new Paradigm?

In 2011, we'll see the publication of ISO 19005-2, part two of PDF/A. We'll also see the approval (if not the publication) of ISO 14289-1, PDF/UA. Why should document capture people care?

  • PDF/A specifies constraints and quality-control measures for PDF files to ensure they will operate reliably anytime in the future.
  • PDF/UA defines the correct usage of the features in PDF that allow for document semantics to be stored and retrieved in addition to the raw text and graphics.

Between these two standards and ISO 32000 itself, the technical underpinnings of the Fourth Paradigm of Document Capture are complete. It is now possible to capture not just images of documents, not just their text, but also the structure and organization of the content.

From search-engines to screen-readers, from tables to footnotes, the humble PDF will become even easier to use and to re-use than ever before.

How can we help?

Appligent Document Solutions was the first company in the world to offer commercial PDF tagging services to capture document semantics in addition to document text. PDF tags work with any sort of PDF file, including electronic-source content, scanned documents, forms and more. Contact us for more information.

