Each PDF Page is a Painting
Why PDF “reading order” is irrelevant to accessibility
by Duff Johnson
Monday, November 8, 2010
Introduction
This article attempts to explain the concept of "reading order" in PDF files. Why is this necessary?
Many have come to use the term "reading order" as functionally synonymous with the logical order imposed by tags, but this interpretation is incorrect.
I’ve tried to make this article comprehensible and useful to all; you’ll be the judge of my success. A technical annex is included at the end for those who want to see what "reading order" really means in PDF. Feel free to check my credentials at the end of the article.
Perhaps you prefer PDF poetry?
each PDF page is a painting
individual brushstrokes merge with others
on the canvas of a page
painters consider their brushstrokes very carefully
viewers see only the final product
The PDF Paintbrush
When you create a PDF, you’re painting a picture. Your paintbrush is the combination of the software used to create the source document and the software you’ve chosen to convert that document into the universal electronic document format we all know as PDF.
Like the painter's brushstrokes, each character, each line and each image is fundamentally independent, but they can interact with each other to produce particular visual effects. On the PDF page, objects are connected by a coordinate system and not a lot else. There’s no logical, semantic connection between the letters comprising a word; characters simply happen at a series of locations on the rendered page.
As originally designed, PDF is fundamentally a system for painting objects onto a page, plus a whole lot of other features we aren't talking about right now! There's no innate concept of words, sentences, paragraphs, columns, headings, images, tables, lists, footnotes - any of the semantic structures that distinguish a "document" from a meaningless heap of letters, shapes and colors. PDF is fundamentally about how the document appears on the page, not how it looks when abstracted from the page.
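To make the "brushstroke" idea concrete, here is a minimal Python sketch. It is not how a real PDF content stream is encoded; `PaintOp`, `render` and `extract_in_paint_order` are hypothetical names invented for this illustration. Each glyph is an independent operation tied only to a coordinate, so the painted result looks the same whatever order the glyphs are emitted in, while a naive reading of the operations in emission order does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaintOp:
    """One hypothetical 'brushstroke': a single glyph placed at (col, row)."""
    glyph: str
    col: int
    row: int


def render(ops, width=5, height=1):
    """Paint every glyph onto a blank grid and return the visible rows."""
    grid = [[" "] * width for _ in range(height)]
    for op in ops:
        grid[op.row][op.col] = op.glyph
    return ["".join(row) for row in grid]


def extract_in_paint_order(ops):
    """Read the glyphs in the order they were painted, ignoring position."""
    return "".join(op.glyph for op in ops)


# The same five glyphs, emitted left to right...
word = [PaintOp(g, i, 0) for i, g in enumerate("PAINT")]
# ...and emitted in an arbitrary order chosen by the producing software.
shuffled = [word[3], word[0], word[4], word[2], word[1]]

print(render(word)[0])                   # PAINT
print(render(shuffled)[0])               # PAINT
print(extract_in_paint_order(shuffled))  # NPTIA
```

The viewer sees "PAINT" in both cases; only a consumer that relies on emission order notices the difference.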
When a PDF includes instructions to paint more than one object in the same spot (it happens all the time), the items stack on top of each other, with the last item painted appearing on the top of the stack. Unlike watercolors, each brushstroke only appears to blend with the others if one or more of them is semi-transparent.
Another example: A PDF creator may choose to paint all the Times-Roman text on the page first, then come back and paint the text that appears in other fonts. Since it’s a painting, the order doesn’t really matter, any more than it matters whether Monet painted his water lilies from left-to-right or from right-to-left, or from the inside-out, for that matter.
If we think that these objects have meaning, that’s because we impose semantics on the objects as we read. If you encounter a word that starts at the end of one column and ends at the top of the next, your mind stitches the two together without conscious thought. Likewise, if you see a line of 16 point text followed by a paragraph of 12 point text, you naturally assume the 16 point text was a heading.
OK, it's all very well to paint a picture - but what if we want to copy and paste the text, or reflow it for display on a mobile phone? What if the “consumer” is actually a search-engine trying to index the document? What if the user is blind or otherwise disabled, and requires special Assistive Technology (AT) devices to read and to operate the computer?
HTML is Different, not Better
In conventional HTML, reading order and logical order are inherently aligned. HTML tags carry all the semantic information (<P>, <H1>, <H2>, etc.). If the goal is accessibility, what more could you want, right?
HTML (especially with CSS) has its own accessibility challenges, but at the end of the day, HTML is just text. PDF, at least technically, is not nearly so easy. On the other hand, tagged PDF is an accessible vehicle for just about any document, regardless of source. If you can print it, you can make a PDF.
Pretty much any PDF may be tagged to become an accessible PDF. That's hard to beat.
Universal Accessibility
What does it mean to say that an electronic document is “accessible”? If a document’s contents are structured and organized such that the meaning of the document is available to every consumer, then we can say that the document is accessible.
It’s not about file format. Word, HTML, PDF, Excel, Flash... they all have capabilities and limitations as file-formats for electronic documents. In most cases, each format can be made accessible, but it never happens by accident. Accessibility requires intention, and the difficulty of achieving real accessibility tends to vary as a function of the complexity of the content.
In the PDF format, accessibility is achieved by adding “tags” - markers that identify the correct order of objects and the semantics of the document. Tags strongly resemble the HTML tags on which they were modeled.
What’s the “correct order”? There may be more than one; after all, there’s no “correct” way to read a newspaper. The idea of "correct order" is simply that whichever order the author selects for their PDF, it must make sense. It’s not OK, for example, to mix two separate articles together in the logical order simply because the columns of text are adjacent - but it's perfectly legitimate to do so in the "reading order" (as the example in the technical annex makes clear).
Conclusion
PDF tags and PDF tags alone define the logical order of the document’s content, and thus, its accessibility. To the extent a PDF is tagged, it might be accessible. To determine whether it is, in fact, accessible, the tags need to be checked, and if necessary, corrected to ensure correct logical order and usage.
Users seeking to ensure their PDFs are accessible should focus on the tags. The "reading order" of the content on the PDF page just isn't a factor in accessibility, as we demonstrate below.
Technical Annex:
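The following Python sketch is one way to picture the two-article case described above. It is not the author's original annex example, and it simplifies heavily: the `Tag` class, the integer content IDs and the sample headlines are all assumptions made for illustration, with role names loosely borrowed from PDF's standard structure types (Document, Art, P). The content is painted row by row across two adjacent columns, so the paint order interleaves the articles, while the tag tree keeps each article together.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """Hypothetical, simplified structure element: a role plus children.

    Children are either nested Tag objects or integer content IDs that
    point at pieces of painted content.
    """
    role: str
    kids: list = field(default_factory=list)


# Painted content in paint order: (content ID, text). Lines from the
# left-hand article (A) and the right-hand article (B) alternate.
content_stream = [
    (0, "A1: Council approves"),
    (1, "B1: Local team wins"),
    (2, "A2: the new budget."),
    (3, "B2: the regional cup."),
]

# The tag tree groups the same content by article instead.
tag_tree = Tag("Document", [
    Tag("Art", [Tag("P", [0, 2])]),
    Tag("Art", [Tag("P", [1, 3])]),
])


def paint_order_text(stream):
    """Return the text exactly as it was painted."""
    return [text for _, text in stream]


def logical_order_text(root, stream):
    """Walk the tag tree depth-first and collect the referenced content."""
    by_id = dict(stream)
    out = []

    def walk(node):
        for kid in node.kids:
            if isinstance(kid, Tag):
                walk(kid)
            else:
                out.append(by_id[kid])

    walk(root)
    return out


print(paint_order_text(content_stream))
print(logical_order_text(tag_tree, content_stream))
```

The first list mixes the two articles line by line; the second presents article A in full, then article B. A consumer that follows the tags never sees the interleaving.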
Comments
Dubey, thanks for the question.
If you are extracting text from scanned PDFs, it's probably best to keep the problem at that level, and not get into tagging the PDFs. There are a wide variety of software vendors with different packages (software, APIs, etc) for recognizing and extracting text from images. I don't want to get into recommending this or that vendor; the question is a bit out-of-scope for me to address here.
Thanks a lot for the very informative blog.
Please pardon me for asking a question on this blog. I am trying to find a way to extract some useful information from scanned (PDF) copies of a few documents. The physical copies of these documents contain only printed text (numbers and words) and one or two handwritten words (more like signed bills or cheques).
First of all, I want to know whether this is possible at all, as it involves two processes: first scanning (painting), then extracting.
Do current scanners/OCRs/ICRs add tags for ordering? How much dpi and what kind of OCR do I need to use? What kind of software libraries or tools can I use to automate the extraction process?
Any direction or suggestion will be very helpful to me.
I have this argument all the time, both with people using adaptive technology who insist on viewing PDF documents using the print stream rather than by Tags, and with document authors and repair technicians who only use the Order Panel to “check reading order.” This is one of the reasons I say that, even in a perfectly accessible document, the end-user settings can break the accessibility. That in turn is why I won’t work on anything remotely defined to me as making a PDF or other document “JAWS compliant” or any other adaptive technology “compliant,” and why I “ranted” on my PDF web page about this.
It is also why I spend time talking about tri-fold and four fold brochures and making sure the Tags render the content in the way you would read a brochure rather than the way things are laid out on the 8.5 X 11 page. I usually talk about letting go of the “visual” and “accessing the content”. It's a way to try and get people to understand the rendering of content.
Cheers, Karen
First, I don’t assert that accessibility is strictly for screen-readers; I identify several types of possible “consumers” of PDF content, including non-human users such as search-engines.
As I’ve said, the sole purpose of “reading order” in PDF is, quite literally, to paint - to create a visual effect. It should have been called “painting order”. The naming problem is the root cause of all the confusion - including at Adobe - about the role (if any) “reading order” should play in features such as Reflow.
Universal Accessibility is about the ability to abstract the content for a purpose OTHER than “painting”. This could be a screen-reader, a screen-magnifier, a special keyboard or mouse-equivalent, a search engine, or a user performing a copy-paste.
While it is undeniably useful for some users in certain (very limited) circumstances, Acrobat’s “reflow” feature - as it exists today - does not qualify as Assistive Technology for the precise reason that it does not follow the tags, as ISO/DIS 14289-1 requires for both conforming PDF viewers and assistive technology.
For users requiring a screen-magnifier, there are PDF tags-consuming screen-magnifiers on the market, for example, Ai Squared’s ZoomText and Freedom Scientific’s MAGic.
Since today's Reflow mode does use PDF reading order, it cannot handle content which flows across pages, or includes multiple “flows” within a page (see the example).
As I noted, a forthcoming "part" of the PDF Reference, ISO 32000-2, will include a clarification of "reading order" so that developers don't confuse the concept with "logical order" again.
Lastly, to Ted’s statement:
The “logical order” of the content needs to be correct/appropriate (but not necessarily the same) in both.
This is correct - but you have to remember what "correct/appropriate" really means in the case of PDF content:
Just to reinforce Sébastien’s point about reflow view, I’m afraid it’s incorrect to state that “PDF tags and PDF tags alone define the logical order of the document’s content”.
In fact it’s common for the “logical order” of a PDF’s content to be different in its tags and reflow views. Sometimes this is desirable (for example, for footnotes) but more often it’s a problem that needs correcting. The “logical order” of the content needs to be correct/appropriate (but not necessarily the same) in both.
Lastly, I don’t think I can bring myself to stop using the phrase “reading order” to describe the order in which the content is read by an assistive technology or by someone using reflow view. To do so, I believe, would cause confusion, even if there is a parallel and different meaning of the phrase for those examining PDF from a more technical perspective.
Hello,
Thank you for this very interesting article (and picky as I like it).
I have been working for a few years on the accessibility of PDF documents, and I regularly use the term "reading order" incorrectly. I totally agree that we should differentiate "reading order" from the "logical order imposed by tags".
But there is a point with which I absolutely do not agree: "And that's why we can safely and responsibly ignore reading order when considering accessibility in PDF".
Accessibility is not just for screen readers users.
For instance, some PDF readers, including Adobe Reader, allow a linearization of the PDF (reflow display mode in Adobe Reader). This feature optimizes both text magnification on the screen and changes to text and background color (for example, in this view, you can magnify the text and it remains in the document pane, so you do not need to use the horizontal scroll bar to read it).
However, this method relies on the "reading order" (the order in which the computer reads the file). Ensuring a logical "reading order" is therefore essential for accessibility.
So "reading order" is not irrelevant to accessibility but only irrelevant for screen readers.
Logical order imposed by tags and order in which the computer reads the file are both very important for accessibility.
Regards
Sébastien Delorme