PDF and HTML: Objects and Semantics | Talking PDF

Let’s step back a moment, and get metaphysical about electronic content. Can we do that?

Whatever else they might possess, electronic documents always possess two things: objects and semantics.

Document “objects” are the physical characters, images, lines, bullets and other features that consume ink or toner when printed.
Document “semantics” define the logical relationships between the aforementioned objects. Most users never think about such things, but they use document semantics to read nonetheless.

[Figure: a screenshot showing an HTML rendition of a portion of the current page. Text objects are in black type; semantic markup is in contrasting red type.]

If you can read this, you already know that placing specific characters in a specific sequence forms a “word”. Organize words into lines and group the lines together, and conventionally-abled users will see “paragraphs”.

Change the typeface and size of a specific line of text, and most users will look there for a “heading”.

Organize some words or numbers into a grid and users will see headings, rows and columns, and will thus understand it as tabular information (a “table”).

Unless Facebook, Twitter and the briefest of emails are your only outlets for the written word, you create “semantic structures” to organize your “objects” more or less every time you write. And if you can read, you use semantics to organize the objects you see.

Web content managers think about these things in more concrete terms, as HTML semantics. Most people don’t think about semantics much at all, a key reason why so much electronic content is inaccessible to users with disabilities.

However the concept is expressed in your preferred authoring software, semantics are chosen by the author. In HTML and PDF, semantics are called “tags”.

If you place <p> and </p> tags around a stream of characters, that means “paragraph”.

Change the paragraph tags to <h2> and </h2> and you’ve made a “heading”.

Tables are denoted with tags such as <table>, <tr> (for “table row”), <th> (for “table header”), and so on. Other tags allow the author to denote lists, images, links and more.
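To make the split between objects and semantics concrete, here is a minimal sketch in Python that uses the standard library's `html.parser` module to pull a small HTML fragment apart. The fragment, the class name `ObjectSemanticsSplitter` and the function `split_objects_and_semantics` are all invented for illustration; only the tags themselves come from the discussion above.

```python
from html.parser import HTMLParser


class ObjectSemanticsSplitter(HTMLParser):
    """Collects tags (semantics) and text runs (objects) in separate lists."""

    def __init__(self):
        super().__init__()
        self.semantics = []  # start-tag names, in document order
        self.objects = []    # non-empty text runs, in document order

    def handle_starttag(self, tag, attrs):
        # Every start tag is a semantic statement made by the author.
        self.semantics.append(tag)

    def handle_data(self, data):
        # Text between tags is what would consume ink when printed.
        text = data.strip()
        if text:
            self.objects.append(text)


# Hypothetical sample content, not taken from any real page.
SAMPLE = """
<h2>Quarterly results</h2>
<p>Sales rose in every region.</p>
<table>
  <tr><th>Region</th><th>Sales</th></tr>
  <tr><td>North</td><td>120</td></tr>
</table>
"""


def split_objects_and_semantics(html):
    """Return (semantics, objects) for an HTML fragment."""
    parser = ObjectSemanticsSplitter()
    parser.feed(html)
    parser.close()
    return parser.semantics, parser.objects


if __name__ == "__main__":
    semantics, objects = split_objects_and_semantics(SAMPLE)
    print("Semantics:", semantics)
    print("Objects:  ", objects)
```

Strip the tags away and the same six text runs remain, but nothing in them says which one is the heading or which ones belong to the table. That missing information is exactly what the author supplies by choosing tags.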

PDF borrowed all these concepts from HTML over ten years ago.

Once semantics are understood, the way they apply to PDF files becomes easier to understand, appreciate and manipulate. Content navigation, text extraction and search-engine optimization (SEO) get easier. Conformance with accessibility standards such as the Web Content Accessibility Guidelines (WCAG) 2.0 and the forthcoming ISO 14289-1 (PDF/Universal Accessibility) becomes possible.
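As a rough illustration of why semantics make navigation and text extraction easier, the sketch below builds a simple outline from heading tags alone. It works on HTML rather than on a tagged PDF, on the assumption that the principle carries over; the sample markup and the names `OutlineBuilder` and `build_outline` are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineBuilder(HTMLParser):
    """Builds a list of (level, text) pairs from heading tags."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading being read, if any
        self._parts = []    # text fragments collected inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Only text inside a heading contributes to the outline.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Join fragments and normalize runs of whitespace.
            text = " ".join("".join(self._parts).split())
            self.outline.append((self._level, text))
            self._level = None


def build_outline(html):
    """Return the heading outline of an HTML fragment."""
    parser = OutlineBuilder()
    parser.feed(html)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    # Hypothetical sample content.
    sample = (
        "<h1>Objects and Semantics</h1><p>Intro text.</p>"
        "<h2>Tags in HTML</h2><p>More text.</p>"
        "<h2>Tags in <em>PDF</em></h2>"
    )
    for level, text in build_outline(sample):
        print("  " * (level - 1) + text)
```

Without the heading tags, a program would have to guess at structure from type size or position. With them, a navigation outline falls out of a few lines of code.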


Appligent Document Solutions provides PDF tagging services for producing accessible, Section 508 compliant documents.

by Duff Johnson
