PDF Document Management Software, Services & Support

The Fourth Paradigm of Document Capture

by Duff Johnson

Thursday, March 31, 2011

Few know it yet, but a new paradigm for document capture begins in 2011. This is the year in which PDF/A and PDF/UA become cornerstone International Standards for electronic documents in the decades and centuries ahead. This is the year in which document capture begins its fourth paradigm - semantics.

My old company started life as an imaging service bureau. In 1996, our very first job involved scanning several hundred pages of typed poetry and converting the results into a useful Microsoft Word file. I think we charged $400 (which didn't come close to covering the OCR software), but we learned a bit about document capture.

This two-part article argues for a definable set of paradigms in the development of document capture, and suggests that PDF/A and PDF/UA will enable a brave new world in which documents and data exchange freely, improving document longevity, search, reuse and accessibility into the future.

Where We're Coming From

Paper was (and often remains) the indispensable medium for documents. Paper is, for the most part, indisputable. It either exists, or it doesn't. It has this or that writing on it (whether currency or contracts or music). Paper is portable and convenient, or rather, seemed convenient. Dead trees have enjoyed a very long run as the medium of documents, but the utility of paper, relative to the alternatives, has been declining for decades. From micrographics to PDF/A, the set of technologies providing alternatives to paper has long established one thing: To replace paper, you had better be able to act like paper. Each of the paradigms of document capture (the art, science and industry of converting source content into reliable, useful documents) has respected that fact.

The First Paradigm of Document Capture: Micrographics

Measured in paper consumption terms, some organizations generate metric tons of documents. The really big organizations generate tons of documents on a daily basis. The (seemingly) simple act of storing and retrieving these documents can pose a major organizational and financial burden. It was for this reason that the micrographics (microfilm and microfiche) industry was born in the years following World War II.

The idea was (and remains) simple: It's easier to store, view and share pictures of documents than it is to store the documents themselves. If you are willing to trust these little pictures as if they were the original document, then you can safely shred the paper and start saving money.

Some organizations still store pictures of their documents on film or fiche, which is eye-readable even without electricity. Ever since the advent of the personal computer, however, it's become popular to capture pictures using a digital camera instead. The type of digital camera most often used for this purpose is better known as a scanner.

The Second Paradigm of Document Capture: Digital Imaging and OCR

Scanning allows documents to be captured to a digital form rather than to another physical medium.

Compared with previous analogue technologies (paper and micrographics), the digital age developed very, very quickly. Almost as soon as imaging became popular for storage, retrieval and sharing of business, research, technical and other documents, developers began to produce software to analyze and convert those images into text for search or reuse purposes.

While Optical Character Recognition (OCR) was first made commercially available in the late 1970s, it took until the late 1990s for the accuracy, speed and cost to begin to make sense for large-scale capture.

OCR represents a crucial step beyond a reproducible document image. When locating interesting content was purely a function of metadata and human indexing, there was simply no way to find specific content without reading every page yourself, a problem that's unimaginable in the age of search engines. These days, reasonably high-quality and fast OCR is approaching commodity pricing. It's now routine for scanned documents to be OCRed without much extra thought. Google will even OCR your scanned documents for free simply because you leave them on a web server!

In the 1980s, just as electronic imaging was getting going, others were asking themselves: Why scan or image a paper document to store it? Why not just create it electronically in the first place?

It was this idea that animated Adobe's "Camelot" project, eventually resulting in PDF, the third paradigm of document capture.

-------------

Read Part 2 of "The Fourth Paradigm of Document Capture"


