A lot of people sat up and took notice when Google announced their book-scanning initiative. And not for nothing; when a company as powerful and innovative as Google says they are going to do something, it’s usually worth watching.
Per my earlier promise, I’ve been sniffing around this new Google site. From the PDF Perspective, then, a brief review of Google Book Search.
The end product of a massive scanning project, Google Book Search is intended to eventually span millions of books. For many works in the public domain, Google makes complete cover-to-cover scans of the book available to users as images in an online viewer and also… you guessed it, as a PDF.
The Imaging Work
Overall, the scanning quality is average, perhaps very slightly above average. The black and white pages from each book are captured with JBIG2 compression, and are overlaid by a clever grayscale “screen” to produce the “patina” of an old document. Nice touch: it keeps the file size very low indeed while preserving at least some of the “atmospherics” of an old book. Google managed to suppress edge artifacts for the most part, but I’ve certainly noticed errors which should have been caught during imaging… about 1 in 300 pages or so has a boo-boo of some sort. Not too bad, but not too good either. For the price they are doubtless paying (and charging) for the service, I’m sure Google thinks it’s just fine the way it is.
The online viewer displays an image of each page in your browser window, complete with buttons to move forward or backwards through pages, or to go to a specific page. If you’re looking at the page as the result of a text search, your search term is highlighted, although this works less well than it should; the highlight is usually “off”.
The book’s own Table of Contents is provided via adjacent links, as is information about the publisher and current editions available in print.
The downloadable PDF files
The first thing to say about the files I’ve downloaded from Google Book Search is that they are very “lightweight”: from 8 to 20 KB per page in size for “black and white” pages. Very nice… but in their zeal to produce the SMALLEST possible PDF files, the Googlistas left something important (actually two somethings) OUT.
- There’s no searchable text! Users who want to locate a word or phrase are out of luck. OK, they want you to do your searching online, not offline… fair enough. But if you were thinking about doing something offline that involves text search or extraction, you’d better reconsider.
- The OCR engine used to generate the text needed to support the full-text search feature online is so-so at best. I suspect it was selected for speed and robustness rather than quality. In fact, I’ll go further, and guess that Google wrote their own OCR engine. Either way, they could have done better.
- There aren’t any bookmarks! Users who might prefer to actually NAVIGATE a 300-page book rather than simply turn pages are also… you guessed it… out of luck.
- Since they don’t include text, the files aren’t (and can’t be) tagged, and are completely inaccessible to disabled users.
- File properties are left at Acrobat defaults. Clearly the presentation of the PDF (i.e., the end-user experience) doesn’t overly concern the Googlistas.
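If you want to run the same spot-checks on a file of your own, here is a rough sketch in Python using only the standard library. To be clear about the assumptions: the function name `inspect_pdf_bytes` and the report keys are made up for illustration, and the approach is a crude heuristic that scans the raw bytes for PDF name tokens (`/Font`, `/Outlines`, `/StructTreeRoot`, `/JBIG2Decode`) rather than actually parsing the file. It will miss anything tucked away inside compressed object streams, and finding a font only suggests that a text layer might exist. A real PDF library would give more trustworthy answers.

```python
import re


def inspect_pdf_bytes(data: bytes) -> dict:
    """Heuristic report on raw PDF bytes (hypothetical helper, not a real parser)."""

    def has(name: bytes) -> bool:
        # Match the exact name token, e.g. /Font but not /FontDescriptor.
        return re.search(re.escape(name) + rb"(?![A-Za-z0-9])", data) is not None

    # Count page objects: "/Type /Page", excluding the "/Pages" tree nodes.
    pages = len(re.findall(rb"/Type\s*/Page(?![A-Za-z0-9])", data))

    return {
        "pages": pages,
        # No fonts at all strongly suggests an image-only file with no text layer.
        "has_fonts": has(b"/Font"),
        # Bookmarks live in the document outline.
        "has_bookmarks": has(b"/Outlines"),
        # Tagged PDFs carry a structure tree root in the catalog.
        "is_tagged": has(b"/StructTreeRoot"),
        # JBIG2-compressed images declare this filter.
        "uses_jbig2": has(b"/JBIG2Decode"),
        # Average size per page in KB, or None if no pages were found.
        "kb_per_page": round(len(data) / 1024 / pages, 1) if pages else None,
    }
```

To try it, read a downloaded file in binary mode (for example `open(path, "rb").read()`) and pass the bytes to `inspect_pdf_bytes`. On a file like the ones described above, I would expect the bookmark and tagging checks to come back false, but treat any result from a byte scan like this as a hint rather than a verdict.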
Overall, the service is, of course, free, so whining about it most likely won’t change anything. It’s a good thing too… I recently found a fascinating “Glossary of Words Pertaining to the Dialect of Mid-Yorkshire” from the 1870s.
If I could ask them to change ONE thing, it would be this: It’s clear that Google is capturing the necessary metadata when they scan the book (how else do they create links for a table of contents on their site?), so it’s really mysterious why they don’t go ahead and slap that data into each PDF in the form of Bookmarks. Who knows? If Google Googles this blog post, maybe they’ll fix it!
by Duff Johnson