Why PDF “reading order” is irrelevant to accessibility
This article attempts to explain the concept of “reading order” in PDF files. Why is this necessary?
- End users are often frustrated by inconsistent and often illegible results when attempting to read PDF files on mobile devices, search for PDF content online, or when using assistive technology (AT) to read.
- Content authors and managers tasked with ensuring accessibility or Section 508 compliance in PDF documents often focus on objects rather than tags, thus missing the mark.
- Software developers are (understandably) confused by “reading order” as presented in the current PDF Reference (ISO 32000), the technical description of PDF.
Many have come to use the term “reading order” as functionally synonymous with the logical order imposed by tags, but this interpretation is incorrect.
I’ve tried to make this article comprehensible and useful to all; you’ll be the judge of my success. A technical annex is included at the end for those who want to see what “reading order” really means in PDF. Feel free to check my credentials at the end of the article.
The PDF Paintbrush
When you create a PDF, you’re painting a picture. Your paintbrush is the is the result of a combination of the software used to create the source document and the software you’ve chosen to convert your source document into the universal electronic document format we all know as PDF.
Like the painter’s brushstrokes, each character, each line and each image is fundamentally independent, but they can interact with each other to produce particular visual effects. On the PDF page, objects are connected by a coordinate system and not a lot else. There’s no logical, semantic connection between the letters comprising a word; characters simply happen at a series of locations on the rendered page.
As originally designed, PDF is fundamentally a system for painting objects onto a page, plus a whole lot of other features we aren’t talking about right now! There’s no innate concept of words, sentences, paragraphs, columns, headings, images, tables, lists, footnotes – any of the semantic structures that distinguish a “document” from a meaningless heap of letters, shapes and colors. PDF is fundamentally about how the document appears on the page, not how it looks when abstracted from the page.
When a PDF includes instructions to paint more than one object in the same spot (it happens all the time), the items stack on top of each other, with the last item painted appearing on the top of the stack. Unlike watercolors, each brushstroke only appears to blend with the others if one or more of them is semi-transparent.
Another example: A PDF creator may choose to paint all the Times-Roman text on the page first, then come back and paint the text that appears in other fonts. Since it’s a painting, the order doesn’t really matter anymore than it matters whether Monet painted his water lilies from left-to-right or from right-to-left, or from the inside-out, for that matter.
If we think that these objects have meaning, that’s because we impose semantics on the objects as we read. If you encounter a word that starts at the end of one column and ends at the top of the next, your mind stitches the two together without conscious thought. Likewise, if you see a line of 16 point text followed by a paragraph of 12 point text, you naturally assume the 16 point text was a heading.
Ok, it’s all very well to paint a picture – but what if we want to copy and paste the text, or reflow it for display on a mobile phone? What if the “consumer” is actually a search-engine trying to index the document? What if the user is blind or otherwise disabled, and requires special Assistive Technology (AT) devices to read and to operate the computer?
What does it mean to say that an electronic document is “accessible”? If a document’s contents are structured and organized such that the meaning of the document is available to every consumer, then we can say that the document is accessible.
It’s not about file format. Word, HTML, PDF, Excel, Flash… they all have capabilities and limitations as file-formats for electronic documents. In most cases, each format can be made accessible, but it never happens by accident. Accessibility requires intention, and the difficulty of achieving real accessibility tends to vary as a function of the complexity of the content.
In the PDF format, accessibility is assured by adding “tags” – markers that identify the correct order of objects and the semantics of the document. Tags strongly resemble the HTML tags on which they were modeled.
What’s the “correct order”? There may be more than one; after all, there’s no “correct” way to read a newspaper. The idea of “correct order” is simply that whichever order the author selects for their PDF, it must make sense. It’s not OK, for example, to mix two separate articles together simply because the columns of text are adjacent – but it’s perfectly legitimate to do so in the “reading order” (as the example in the technical annex makes clear).
PDF tags and PDF tags alone define the logical order of the document’s content, and thus, its accessibility. To the extent a PDF is tagged, it might be accessible. To determine whether it is, in fact, accessible, the tags need to be checked, and if necessary, corrected to ensure correct logical order and usage.
Users seeking to ensure their PDFs are accessible should focus on the tags. The “reading order” of the content on the PDF page just isn’t a factor in accessibility, as we demonstrate below.
The term “reading order” might lead one to think that it is relevant to accessibility, but it’s not, notwithstanding the confusing representation of the issue in ISO 32000-1:2008, Section 184.108.40.206.
In PDF, “reading order” refers simply to the order in which the computer reads the file. It has nothing whatsoever to do with “logical order”, the sequence people use, which is defined in PDF by tags.
Section 220.127.116.11 will be modified in a new part of ISO 32000 to clear up this confusion over the significance of reading order when re-using PDF page content for accessibility or other purposes.
PDF is capable of extraordinary complexity, sophistication and accuracy in rendering content. From typography to transparencies, from alpha channel to z-order, the range of possibilities in generating the file’s reading order is effectively infinite, even for the same content!
The following image represents an example of content as rendered on a PDF page. Simple though it is, this example nonetheless demonstrates how reading order and logical order are utterly distinct in a PDF file.
What follows is one possible example of actual PDF code for the above text. This code has been dramatically simplified to make things as clear as possible. The emphasis indicates the rendered text (see the image above) as it occurs in the PDF’s “reading order”.
q 1 0 0 -1 0 432 cm
0 g 0 G
14 0 0 -14 72 84 Tm /F1.0 1 Tf (The quick) Tj
14 0 0 -14 147.6 84 Tm (the lazy) Tj
14 0 0 -14 72 100 Tm (brown fox) Tj
14 0 0 -14 147.6 100 Tm (dog.) Tj
14 0 0 -14 72 116 Tm (jumps over) Tj
Of course, the “reading order” is this case is semantically incorrect, because the PDF creation software “painted” each line of text across the page, crossing the columns as it did so. Nonetheless, this example is 100% legitimate PDF, as per ISO 32000-1:2008.
If the example code given above included container information (not included to make the example more readable to non-developers) and tags, it would conform to the forthcoming ISO 14289-1 (PDF/Universal Accessibility), even though the “reading order” makes no sense.
If a PDF viewer cannot consume tags, you’ll get your text in the above order. That’s NTDE (Not The Desired Effect), as we like to say.
If the PDF is correctly tagged and the viewing software supports tags for content extraction and reuse, the text would appear in correct logical order and with appropriate semantics (in this case, a simple paragraph) as follows:
<p>The quick brown fox jumps over the lazy dog.</p>
And that’s why we can safely and responsibly ignore reading order when considering accessibility in PDF.
If you are unhappy with your results extracting content for reuse, using assistive technology, or otherwise consuming PDFs, be sure your software supports tagged PDF.
- A PDF is accessible without reference to its “reading order”, but by reference to the tags.
- If the PDF has no tags, or the tags are incorrect, that PDF is not accessible or reliably reusable.
- If the creation, viewing or extraction software cannot create or use PDF tags (as appropriate), that software doesn’t support accessible PDF.
There’s considerable misinformation regarding PDF accessibility. The resulting confusion is evident on a number of government websites and even in Adobe’s Acrobat Professional software. As elsewhere in the accessibility world, opinions are often strongly held and fiercely defended. For this reason, it seems like a good idea to establish my credentials for this discussion. ISO 14289-1 isn’t published yet, but I can tell you now that it doesn’t even mention PDF “reading order”. I’m not offering an opinion here; these are simply the facts.
I’ve been in the business of making PDF files accessible since Acrobat 5 was released in 2000, longer than anyone else except the developers at Adobe Systems who created the technology in the first place. Since 2005, I’ve chaired AIIM’s PDF/UA (Universal Accessibility) committee through scores of meetings as we drafted the forthcoming International Standard for accessible PDF, which is planned for publication in 2011 as ISO 14289-1. I’m also a long-time educator on PDF accessibility in articles, blog posts and seminars around the world. (Learn more about Appligent Document Solutions’ efforts on behalf of ISO standards for PDF.)
By Duff Johnson