skip navigation

PDF Document Management Software, Services & Support

Server Desktop Services Support Why Us? About Us

The Latest

SecurSign 5 Now Available! Includes Signature Validation to Detect Tampering.
Lansdowne, PA (July 13, 2011)
Encrypt, digitally sign and verify digital signatures on PDF documents.

Redax 5: Advanced Redaction for PDF Documents
Tuesday, March 22, 2011
The latest Redax adds new patterns, regular expressions and more!

Redax Enterprise Server 3 Ships!
Thursday, January 6, 2011
New Redaction Engine, Powerful New Markup Options and More!

Survey: Server Based PDF Applications
Tuesday, December 7, 2010
The 2010 Survey asked about PDF server application development.

5 PDF Readers Compared
Tuesday, November 30, 2010
Expanding on our previous review, we've included Nitro's Reader and Adobe's new Reader X.

PDF Form Aids Sales Team Collaboration
Friday, November 26, 2010
Take a document, add a dash of JavaScript, a sprinkling of PDF know-how, and serve.

Section 508 Center for PDF now online!
Wednesday, November 17, 2010
A key resource for document authors, content managers and Section 508 coordinators concerned with PDF accessibility.

Case Study: MedWiz Technologies
Friday, November 12, 2010
It's a lot more than simply knowing JavaScript; it's the background, the analysis and the thought that counts.

Section 508 Coordinators Conference
Monday, November 1, 2010
Join us at the annual Federal agencies conference to discuss Section 508.

PDF/UA Introduced at ATIA
Thursday, October 28, 2010
Unveils ISO 14289 for accessible PDF at the AT industry's conference in Chicago

REVIEW: PDF Accessibility Checker
Tuesday, October 26, 2010
We test Switzerland's latest contribution to PDF accessibility - and we like it!

Duff Johnson named Vice-Chair of the PDF/A Competence Center
Thursday, October 14, 2010
Joins PDF/A Competence Center Board at the PDF/A Conference in Rome, Italy

Each PDF Page is a Painting

TalkPDF225x100_noDJ.png

Why PDF “reading order” is irrelevant to accessibility

by Duff Johnson

Monday, November 8, 2010

Introduction

This article attempts to explain the concept of "reading order" in PDF files. Why is this necessary?

  1. End users are often frustrated by inconsistent and often illegible results when attempting to read PDF files on mobile devices, search for PDF content online, or when using assistive technology (AT) to read.
  2. Content authors and managers tasked with ensuring accessibility or Section 508 compliance in PDF documents often focus on objects rather than tags, thus missing the mark.
  3. Software developers are (understandably) confused by “reading order” as presented in the current PDF Reference (ISO 32000), the technical description of PDF.

Many have come to use the term "reading order" as functionally synonymous with the logical order imposed by tags, but this interpretation is incorrect.

I’ve tried to make this article comprehensible and useful to all; you’ll be the judge of my success. A technical annex is included at the end for those who want to see what "reading order" really means in PDF. Feel free to check my credentials at the end of the article.

Perhaps you prefer PDF poetry?

each PDF page is a painting

individual brushstrokes

merge with others on the canvas of a page

painters consider their brushstrokes very carefully

viewers see only the final product

The PDF Paintbrush

When you create a PDF, you’re painting a picture. Your paintbrush is the is the result of a combination of the software used to create the source document and the software you’ve chosen to convert your source document into the universal electronic document format we all know as PDF.

Like the painter's brushstrokes, each character, each line and each image is fundamentally independent, but they can interact with each other to produce particular visual effects. On the PDF page, objects are connected by a coordinate system and not a lot else. There’s no logical, semantic connection between the letters comprising a word; characters simply happen at a series of locations on the rendered page.

As originally designed, PDF is fundamentally a system for painting objects onto a page, plus a whole lot of other features we aren't talking about right now! There's no innate concept of words, sentences, paragraphs, columns, headings, images, tables, lists, footnotes - any of the semantic structures that distinguish a "document" from a meaningless heap of letters, shapes and colors. PDF is fundamentally about how the document appears on the page, not how it looks when abstracted from the page.

When a PDF includes instructions to paint more than one object in the same spot (it happens all the time), the items stack on top of each other, with the last item painted appearing on the top of the stack. Unlike watercolors, each brushstroke only appears to blend with the others if one or more of them is semi-transparent.

Another example: A PDF creator may choose to paint all the Times-Roman text on the page first, then come back and paint the text that appears in other fonts. Since it’s a painting, the order doesn’t really matter anymore than it matters whether Monet painted his water lilies from left-to-right or from right-to-left, or from the inside-out, for that matter.

If we think that these objects have meaning, that’s because we impose semantics on the objects as we read. If you encounter a word that starts at the end of one column and ends at the top of the next, your mind stitches the two together without conscious thought. Likewise, if you see a line of 16 point text followed by a paragraph of 12 point text, you naturally assume the 16 point text was a heading.

Ok, it's all very well to paint a picture - but what if we want to copy and paste the text, or reflow it for display on a mobile phone? What if the “consumer” is actually a search-engine trying to index the document? What if the user is blind or otherwise disabled, and requires special Assistive Technology (AT) devices to read and to operate the computer?

HTML is Different, not Better

In conventional HTML, reading order and logical order are inherently aligned. HTML tags carry all the semantic information (<P>, <H1>, <H2> etc). If the goal is accessibility, what more could you want, right?

HTML (especially with CSS) has its own accessibility challenges, but at the end of the day, HTML is just text. PDF, at least technically, is not nearly so easy.

On the other hand, tagged PDF is an accessible vehicle for just about any document, regardless of source. If you can print it, you can make a PDF. Pretty much any PDF may be tagged to become an accessible PDF. That's hard to beat.

Universal Accessibility

What does it mean to say that an electronic document is “accessible”? If a document’s contents are structured and organized such that the meaning of the document is available to every consumer, then we can say that the document is accessible.

It’s not about file format. Word, HTML, PDF, Excel, Flash... they all have capabilities and limitations as file-formats for electronic documents. In most cases, each format can be made accessible, but it never happens by accident. Accessibility requires intention, and the difficulty of achieving real accessibility tends to vary as a function of the complexity of the content.

In the PDF format, accessibility is assured by adding “tags” - markers that identify the correct order of objects and the semantics of the document. Tags strongly resemble the HTML tags on which they were modeled.

What’s the “correct order”? There may be more than one; after all, there’s no “correct” way to read a newspaper. The idea of "correct order" is simply that whichever order the author selects for their PDF, it must make sense. It’s not OK, for example, to mix two separate articles together simply because the columns of text are adjacent - but it's perfectly legitimate to do so in the "reading order" (as the example in the technical annex makes clear).

Conclusion

PDF tags and PDF tags alone define the logical order of the document’s content, and thus, its accessibility. To the extent a PDF is tagged, it might be accessible. To determine whether it is, in fact, accessible, the tags need to be checked, and if necessary, corrected to ensure correct logical order and usage.

Users seeking to ensure their PDFs are accessible should focus on the tags. The "reading order" of the content on the PDF page just isn't a factor in accessibility, as we demonstrate below.

Technical Annex:
What “Reading Order” in PDF really means

The term “reading order” might lead one to think that it is relevant to accessibility, but it's not, notwithstanding the confusing representation of the issue in ISO 32000-1:2008, Section 14.8.2.3.

In PDF, “reading order” refers simply to the order in which the computer reads the file. It has nothing whatsoever to do with “logical order”, the sequence people use, which is defined in PDF by tags.

Section 14.8.2.3 will be modified in a new part of ISO 32000 to clear up this confusion over the significance of reading order when re-using PDF page content for accessibility or other purposes.

You can buy an official copy of ISO 32000-1:2008 directly from ISO, or download an authorized copy for free from Adobe Systems. 

Demonstration

PDF is capable of extraordinary complexity, sophistication and accuracy in rendering content. From typography to transparencies, from alpha channel to z-order, the range of possibilities in generating the file's reading order is effectively infinite, even for the same content!

The following image represents an example of content as rendered on a PDF page. Simple though it is, this example nonetheless demonstrates how reading order and logical order are utterly distinct in a PDF file.

An example of text spanning two columns. The text reads: The quick brown fox jumps over the lazy dog.

What follows is one possible example of actual PDF code for the above text. This code has been dramatically simplified to make things as clear as possible. The emphasis indicates the rendered text (see the image above) as it occurs in the PDF's "reading order".

q 1 0 0 -1 0 432 cm
0 g 0 G
BT
14 0 0 -14 72 84 Tm /F1.0 1 Tf (The quick) Tj
14 0 0 -14 147.6 84 Tm (the lazy) Tj
14 0 0 -14 72 100 Tm (brown fox) Tj
14 0 0 -14 147.6 100 Tm (dog.) Tj
14 0 0 -14 72 116 Tm (jumps over) Tj
ET Q

Of course, the "reading order" is this case is semantically incorrect, because the PDF creation software "painted" each line of text across the page, crossing the columns as it did so. Nonetheless, this example is 100% legitimate PDF, as per ISO 32000-1:2008.

If the example code given above included container information (not included to make the example more readable to non-developers) and tags, it would conform to the forthcoming ISO 14289-1 (PDF/Universal Accessibility), even though the "reading order" makes no sense.

If a PDF viewer cannot consume tags, you'll get your text in the above order.  That's NTDE (Not The Desired Effect), as we like to say.

If the PDF is correctly tagged and the viewing software supports tags for content extraction and reuse, the text would appear in correct logical order and with appropriate semantics (in this case, a simple paragraph) as follows:

<p>The quick brown fox jumps over the lazy dog.</p>

And that's why we can safely and responsibly ignore reading order when considering accessibility in PDF.

If you are unhappy with your results extracting content for reuse, using assistive technology, or otherwise consuming PDFs, be sure your software supports tagged PDF.

Key TakeAways

  1. A PDF is accessible without reference to its "reading order", but by reference to the tags.
  2. If the PDF has no tags, or the tags are incorrect, that PDF is not accessible or reliably reusable.
  3. If the creation, viewing or extraction software cannot create or use PDF tags (as appropriate), that software doesn't support accessible PDF.

Credentials

There's considerable misinformation regarding PDF accessibility. The resulting confusion is evident on a number of government websites and even in Adobe's Acrobat Professional software. As elsewhere in the accessibility world, opinions are often strongly held and fiercely defended. For this reason, it seems like a good idea to establish my credentials for this discussion. ISO 14289-1 isn't published yet, but I can tell you now that it doesn't even mention PDF "reading order". I'm not offering an opinion here; these are simply the facts.

I’ve been in the business of making PDF files accessible since Acrobat 5 was released in 2000, longer than anyone else except the developers at Adobe Systems who created the technology in the first place. Since 2005, I’ve chaired AIIM’s PDF/UA (Universal Accessibility) committee through scores of meetings as we drafted the forthcoming International Standard for accessible PDF, which is planned for publication in 2011 as ISO 14289-1. I'm also a long-time educator on PDF accessibility in articles, blog posts and seminars around the world. (Learn more about Appligent Document Solutions' efforts on behalf of ISO standards for PDF.)

Comments

Add a comment.

Posted by duffjohnson on Nov 12, 2010

Dubey, thanks for the question.

If you are extracting text from scanned PDFs, it's probably best to keep the problem at that level, and not get into tagging the PDFs.  There are a wide variety of software vendors with different packages (software, APIs, etc) for recognizing and extracting text from images. I don't want to get into recommending this or that vendor; the question is a bit out-of-scope for me to address here.

Posted by Dubey [117.192.250.174] on Nov 12, 2010

thanks a lot for very informative blog.

please pardon me for asking a question on this blog.  i am trying to find a way to extract some useful information from scanned(pdf ) copy of few documents. physical copies of  these documents contain text only(numbers and words) printed characters and 1-2 hand written words(more like a singed bills/cheques).

First of all i want to know whether is it possible or not, as it involves two processes first scanning(painting) second extracting?

do current scanner/OCRs/ICRs put tags for ordering? how much dpi and what kind of OCRs i need to use? what kind of soft libraries or tools i can use to automate extracting process?

 

Any direction/any suggestion will be very helpful to me

 

 

 

Posted by Karen McCall [98.229.70.112] on Nov 10, 2010

I have this argument all the time with both people using adaptive technology who insist on viewing PDF documents using  the print stream rather than by Tags, or by document authors and repair technicians who only use the Order Panel to “check reading order.” This is one of the reasons I say that in a perfectly accessible document the end-user settings can break the accessibility which in turn is why I won’t work on anything remotely defined to me as making a PDF or other document “JAWS compliant” or any other adaptive technology “compliant,” and why I “ranted” on my PDF web page about this.

It is also why I spend time talking about tri-fold and four fold brochures and making sure the Tags render the content in the way you would read a brochure rather than the way things are laid out on the 8.5 X 11 page. I usually talk about letting go of the “visual” and “accessing the content”. It's a way to try and get people to understand the rendering of content.

Cheers, Karen

 

Posted by duffjohnson on Nov 10, 2010

Screen shot of a page showing page-spanning content, multiple columns, headings and articles on the same page.Thank you for the comments.

First, I don’t assert that accessibility is strictly for screen-readers; I identify several types of possible “consumers” of PDF content, including non-human users such as search-engines.

As I’ve said, the sole purpose of “reading order” in PDF is, quite literally, to paint - to create a visual effect.  It should have been called “painting order”. The naming problem is the root cause of all the confusion - including at Adobe - about the role (if any) “reading order” should play in features such as Reflow.

Universal Accessibility is about the ability to abstract the content for a purpose OTHER than “painting”.  This could be a screen-reader, a screen-magnifier, a special keyboard or mouse-equivalent, search engine, a user performing a copy-paste.

While it is undeniably useful for some users in certain (very limited) circumstances, Acrobat’s “reflow” feature - as it exists today - does not qualify as Assistive Technology for the precise reason that it does not follow the tags, as ISO/DIS 14289-1 requires for both conforming PDF viewers and assistive technology.

For users requiring a screen-magnifier, there are PDF tags-consuming screen-magnifiers on the market, for example, aisquared’s ZoomText and Freedom Scientific’s MAGic.

Since today's Reflow mode does use PDF reading order, it cannot handle content which flows across pages, or includes multiple “flows” within a page (see the example).

As I noted, a forthcoming "part" of the PDF Reference, ISO 3200-2, will include a clarification of "reading order" so that developers don't confuse the concept with "logical order" again.

Lastly, to Ted’s statement: 

The “logical order” of the content needs to be correct/appropriate (but not necessarily the same) in both.

 This is correct - but you have to remember what "correct/appropriate" really means in the case of PDF content:

  • Appropriate use of “reading order” - painting
  • Appropriate use of “logical reading order” (tags) -abstraction from the original page

Posted by Ted Page [86.143.179.167] on Nov 10, 2010

Just to reinforce Sébastien’s point about reflow view, I’m afraid it’s incorrect to state that “PDF tags and PDF tags alone define the logical order of the document’s content”.

In fact it’s common for the “logical order” of a PDF’s content to be different in its tags and reflow views. Sometimes this is desirable (for example, for footnotes) but more often it’s a problem that needs correcting. The “logical order” of the content needs to be correct/appropriate (but not necessarily the same) in both.

Lastly, I don’t think I can bring myself to stop using the phrase “reading order” to describe the order in which the content is read by an assistive technology or by someone using reflow view. To do so, I believe would cause confusion, even if there is a parallel and different meaning of the phrase for those examining PDF from a more technical perspective.

Posted by Sébastien Delorme [78.120.171.244] on Nov 10, 2010

Hello,

Thank you for this very interesting article (and picky as I like it).

I have been working a few years on the accessibility of PDF documents and I regulary use the
wrongly term "reading order". I totally agree that we should differentiate "reading order" to "logical order imposed by tags".

But there is a point with which I absolutely do not agree: "
And that's why we can safely and responsibly ignore reading order when considering accessibility in PDF".

Accessibility is not just for screen readers users.

For instance, Some PDF readers, including Adobe Reader, allow a linearization of PDF (reflow display mode in Adobe Reader). This feature optimizes both text magnification on the screen and changes to text and background color (for example, in this view, you can magnify text size and it remains in the document pane, so you do not need to use the horizontal scroll bar to read it).

 

However, this method relies on the "reading order" (order in which the computer reads the file). Ensure a logical "Reading order" is essential for accessibility.

So "reading order" is not irrelevant to accessibility but
only irrelevant for screen readers.
L
ogical order imposed by tags and order in which the computer reads the file are both very important for accessibility.

 

Regards
Sébastien Delorme

This feature optimizes both text magnification on the screen and changes to text and background colour.

  Add a comment.

Server Desktop Services Support Why Us? About Us
AppendPDF
AppendPDF Pro
FDFMerge
FDFMerge Lite
pdfHarmony
Redax Enterprise Server
SecurSign
StampPDF Batch
APCrypt
APJavaScript
APSplit
APGetInfo
pdfAPilot Server 2
Redax
StampPDF plugin
StampPDF DE
AppendPDF DE
APSplit DE
PDF Forms
Designer/XFA Forms
PDF JavaScript
PDF Accessibility
Section 508
Publication Scanning
CD/DVD-ROMs
Custom Development
Software Support Policy
Technical Support
Product Documentation
FAQs
Sample Scripts
PDF Glossary
Contact Support

Talking PDF
Appligent Labs
Customers
Testimonials
Case Studies
Cost Effectiveness
Innovation
PDF Standards
Experience
Mission
History
People
Partners
Contact Us
News & Events
Site Accessibility
Site Index
 
Site Accessibility | Email the WebAdmin
Valid HTML 4.01! Section 508 Compliance logo