Archive for October, 2009

OCR, Image/Text PDFs and the Mac

This week I’ve been staring at a collection of just over 29,000 PDFs: image-only copies of thousands of documents created with “...the software that came with the scanner.”

My task? Figuring out the right tools and workflow to get these PDFs through an OCR process so we can unlock the content and make them more accessible. A number of these documents will end up in our MARS system, so exposing the text to the PDFBox indexing code that ships with DSpace is critical. (As an aside, I’ve heard that Xpdf is a really nice replacement for PDFBox, but I haven’t had time to tip it into our DSpace install yet.)

I don’t have a precise OCR accuracy threshold in mind, but I assume that if we can hit the mid-90% range, retrieval won’t suffer.
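One rough way to put a number on "accuracy" is to hand-correct a small sample of pages and compare the OCR output against it, character by character. The sketch below does that with Python's standard difflib module. The function name char_accuracy and the sample strings are made up for illustration, and this is only one of several reasonable definitions of accuracy, not a standard measure.

```python
from difflib import SequenceMatcher


def char_accuracy(reference: str, ocr_text: str) -> float:
    """Fraction of reference characters that the OCR output reproduced, in order."""
    if not reference:
        # Nothing to get wrong: perfect only if the OCR output is empty too.
        return 1.0 if not ocr_text else 0.0
    # autojunk=False keeps difflib from discarding frequent characters on long pages.
    matcher = SequenceMatcher(None, reference, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: hand-corrected text versus raw OCR output.
    corrected = "The quick brown fox jumps over the lazy dog"
    raw_ocr = "The qu1ck brown f0x jumps over the 1azy dog"
    print(f"character accuracy: {char_accuracy(corrected, raw_ocr):.1%}")
```

Averaging this over a few dozen sampled pages would give a ballpark figure to hold against the mid-90% target.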

I have seen a 2001 study by a group from Harvard University Library that found that 96.6% of searches will succeed on uncorrected OCR’d text. Also worth a look is Rose Holley’s recent article in D-Lib Magazine (“How Good Can It Get?”). She offers a number of interesting ideas on improving OCR accuracy in a large-scale digitization project. For some reason, most of the literature on OCR accuracy and retrieval seems to focus on scientific literature, where accuracy appears to make very little difference to retrieval. [article behind paywall] [freely viewable version]

An ideal workflow would look something like this: fill a directory with image-only PDFs and point some sort of OCR process toward it. The final product would be yet another directory that contains “image-over-text” versions of the original PDFs (wherein the OCR’d text resides ‘inside’ the PDF as an extra ‘layer’ of content).
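Here is a minimal sketch of that directory-in, directory-out loop. It is tool-agnostic on purpose: the actual OCR step is passed in as a function, since I haven't settled on a product yet. The names batch_ocr and fake_ocr are hypothetical, and fake_ocr just copies the file so the loop can be tried without any OCR software installed.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def batch_ocr(src_dir: Path, dst_dir: Path,
              ocr_one: Callable[[Path, Path], None]) -> List[Path]:
    """Run ocr_one(src, dst) on every PDF in src_dir, writing results to dst_dir."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.iterdir()):
        # Scanner software is inconsistent about case (.pdf versus .PDF).
        if not src.is_file() or src.suffix.lower() != ".pdf":
            continue
        dst = dst_dir / src.name
        if dst.exists():
            continue  # already done, so an interrupted run can be resumed
        ocr_one(src, dst)
        written.append(dst)
    return written


def fake_ocr(src: Path, dst: Path) -> None:
    """Placeholder for the real OCR step (hypothetical): copies the file unchanged."""
    shutil.copyfile(src, dst)


if __name__ == "__main__":
    # Assumed directory names; substitute the real ones.
    done = batch_ocr(Path("image-only"), Path("image-over-text"), fake_ocr)
    print(f"processed {len(done)} PDFs")
```

Skipping files that already exist in the output directory matters at this scale: with 29,000 PDFs, a run is likely to be interrupted at least once.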

I’m trying out Mac-based solutions first, knowing that if it ends up being a Windows-based workflow we’ll likely use OmniPage, a product we already use with our ATIZ bookscanner.



Javascript speed

I’ve long thought that if you wanted the fastest browser experience on a Mac, you went with the nightly Webkit build from http://nightly.webkit.org/.

So I was surprised today when I happened on the SunSpider JavaScript benchmark site and put several browsers through their paces.

One caveat: this test measures only the core JavaScript engine, not other browser APIs or features. The results (smaller is better):

Machine: Mac Pro (dual 2.8 GHz quad-core); Mac OS X 10.6.1

Firefox 3.5.3               1036.8ms  (32-bit)
WebKit Nightly (r49008)      434.8ms  (64-bit)
Google Chrome (4.0.212.1)    434.4ms  (32-bit)
Safari 4.0.3                 364.6ms  (64-bit)
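To make the gaps easier to read, the quick sketch below expresses each total as a multiple of the fastest one. The numbers are copied straight from the results above; the variable names are just for illustration.

```python
# SunSpider totals in milliseconds, copied from the results above.
results_ms = {
    "Firefox 3.5.3": 1036.8,
    "WebKit Nightly (r49008)": 434.8,
    "Google Chrome (4.0.212.1)": 434.4,
    "Safari 4.0.3": 364.6,
}

# Lower is better, so the fastest browser has the smallest total.
fastest = min(results_ms, key=results_ms.get)

# Each browser's time as a multiple of the fastest browser's time.
slowdown = {name: ms / results_ms[fastest] for name, ms in results_ms.items()}

for name, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{name:<27} {factor:.2f}x")
```

By this measure the nightly and Chrome are within a fraction of a percent of each other, and Firefox takes a bit under three times as long as Safari.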
