OCR, Image/Text PDFs and the Mac

This week I’ve been staring at a collection of just over 29,000 PDFs. Image-only copies of thousands of documents created with “..the software that came with the scanner.”

My task? Figuring out the right tools and workflow to get these PDFs through an OCR process so we can unlock the content and make them more accessible. A number of these documents will end up in our MARS system, so exposing the text to the PDFBox indexing code that ships with DSpace is critical (as an aside, I’ve heard that Xpdf is a really nice replacement for PDFBox but I haven’t had time to tip it into our DSpace install yet).

I don’t have a precise OCR accuracy threshold in mind but assume if we can hit the mid-90% range we’ll find that retrieval doesn’t suffer.

I have seen a 2001 study by a group from Harvard University Library that found that 96.6% of searches will succeed on uncorrected OCR’d text. Also worth a look is Rose Holley’s recent article in D-Lib Magazine (“How Good Can It Get?”), which offers a number of interesting ideas on improving OCR accuracy in a large-scale digitization project. For some reason, most of the literature on OCR accuracy and retrieval seems to focus on scientific literature, where accuracy appears to make very little difference. [ article behind pay wall ] [ freely viewable version ]
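To check whether a given OCR tool lands in that mid-90% range, one option is to hand-correct a few sample pages and compare them with the raw OCR output. The following is a minimal sketch of a character-level accuracy measure using Python's standard difflib. It is only one simple way to score OCR output, not the method used in the Harvard study or in Holley's article, and the function name is my own.

```python
from difflib import SequenceMatcher


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Fraction of ground-truth characters that also appear, in order, in the OCR text."""
    if not ground_truth:
        # Nothing to recognize: perfect only if the OCR output is empty too.
        return 1.0 if not ocr_text else 0.0
    # autojunk=False keeps difflib from discarding very common characters on long pages.
    matcher = SequenceMatcher(None, ground_truth, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ground_truth)


if __name__ == "__main__":
    truth = "hello world"
    ocr = "he1lo world"  # a typical OCR slip: the letter l read as the digit 1
    print(f"{character_accuracy(truth, ocr):.1%}")
```

Averaging this score over a handful of representative pages should be enough to tell whether a tool is in the right neighborhood before committing 29,000 files to it.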

An ideal workflow would look something like this: fill a directory with image-only PDFs and point some sort of OCR process toward it. The final product would be yet another directory that contains “image-over-text” versions of the original PDFs (wherein the OCR’d text resides ‘inside’ the PDF as an extra ‘layer’ of content).
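The directory-in, directory-out workflow described above can be sketched in a few lines of Python. This is an outline under assumptions, not a finished tool: the OCR step itself is passed in as a function (`ocr_one`, a hypothetical name) because I have not settled on an OCR engine, and the stand-in used here just copies the file. Already-processed files are skipped so a long run can be restarted.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def batch_ocr(src_dir: str, dest_dir: str,
              ocr_one: Callable[[Path, Path], None]) -> List[Path]:
    """Run ocr_one(source_pdf, target_pdf) for every PDF in src_dir.

    Returns the list of files written during this run.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    # Note: "*.pdf" is case-sensitive on case-sensitive filesystems.
    for pdf in sorted(src.glob("*.pdf")):
        target = dest / pdf.name
        if target.exists():
            continue  # already done on an earlier run, so the batch is resumable
        ocr_one(pdf, target)
        written.append(target)
    return written


def copy_placeholder(source: Path, target: Path) -> None:
    """Stand-in for a real OCR step: copies the PDF unchanged."""
    shutil.copyfile(source, target)


# Example call (directory names are placeholders):
# batch_ocr("image_only_pdfs", "image_over_text_pdfs", copy_placeholder)
```

Swapping `copy_placeholder` for a function that calls a real OCR engine is the only change needed once a tool has been chosen.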

I’m trying out Mac-based solutions first, knowing that if it ends up being a Windows-based workflow we’ll likely use OmniPage (a product we already use with our ATIZ bookscanner).


Javascript speed

I’ve long thought that if you wanted the fastest browser experience on a Mac, you went with the nightly Webkit build from http://nightly.webkit.org/.

So I was surprised today when I happened on the SunSpider JavaScript benchmark site and put several browsers through their paces.

One caveat: this test measures only the core JavaScript engine, not other browser APIs or features. The results (smaller number is better):

Machine: MacPro (dual 2.8 quad-core); OS 10.6.1

Firefox 3.5.3 1036.8ms (32-bit)
Webkit Nightly (r49008) 434.8ms (64-bit)
Google Chrome (4.0.212.1) 434.4ms (32-bit)
Safari 4.0.3 364.6ms (64-bit)
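To put those timings in proportion, here is a small sketch that expresses each result as a multiple of the fastest one. The numbers are the ones listed above; the function name is my own.

```python
from typing import Dict


def relative_to_fastest(results_ms: Dict[str, float]) -> Dict[str, float]:
    """Express each timing as a multiple of the fastest (smallest) timing."""
    fastest = min(results_ms.values())
    return {name: ms / fastest for name, ms in results_ms.items()}


results_ms = {
    "Firefox 3.5.3": 1036.8,
    "Webkit Nightly (r49008)": 434.8,
    "Google Chrome (4.0.212.1)": 434.4,
    "Safari 4.0.3": 364.6,
}

if __name__ == "__main__":
    ratios = relative_to_fastest(results_ms)
    # Print fastest first.
    for name in sorted(ratios, key=ratios.get):
        print(f"{name:<28}{results_ms[name]:>8.1f} ms  {ratios[name]:.2f}x")
```

On these numbers Firefox takes roughly 2.8 times as long as Safari, while the Webkit nightly and Chrome are within a fraction of a millisecond of each other.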


Fix it till it breaks – into 64 bits

I like to fix things till they break. Today’s post is a cautionary tale for that admittedly small niche of sysadmins running OSX server on XServes upgraded in place from Leopard to Snow Leopard…

For the past few weeks I’ve been tweaking the JSP interface of our MARS system and doing some overdue “authority control” cleanup on subjects and authors. That’s been going so well that late this afternoon I decided to take a crack at updating a few packages originally installed via MacPorts back when the server was running Leopard Server (the in-place Snow Leopard upgrade didn’t disturb the code in /opt/local, the destination for MacPorts installs).

I pulled down version 3.2 of Apple’s Developer Tools (to ensure 10.6 compatibility) and went to work. In no time at all I had upgraded ant, maven, postgres, bison, wget, openssl and a host of other dependencies. Rebooted and the fun began. First up, Postgres:

FATAL: incorrect checksum in control file

Never saw that before.

Found a web posting on a Linux site explaining that this could easily happen if you tried to open a database with a 64-bit version of postgres when it had been closed by a 32-bit version. Then it hit me. Of course, on an XServe, Snow Leopard server defaults to 64-bit builds. Under Leopard, I had built a 32-bit version of Postgres.

Recommended solution from the Linux posting: forget about it. Only solution is to open the database under a 32-bit version of Postgres and then dump the data, reimporting it into a new database created by a 64-bit version.
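For the record, I did not end up going the dump-and-reimport route (see below), but the sequence that posting describes would look roughly like this. It is a sketch under assumptions: the function only builds the command lines rather than running them, the directory arguments are hypothetical placeholders, and starting and stopping the two servers between steps is left out. The tools named (pg_dumpall, initdb, psql) and their -f, -D and -d options are standard PostgreSQL ones.

```python
from typing import List


def migration_commands(old_bin: str, new_bin: str,
                       new_data_dir: str, dump_file: str) -> List[List[str]]:
    """Build the commands for moving a cluster from a 32-bit to a 64-bit build.

    old_bin / new_bin are the bin directories of the 32-bit and 64-bit builds.
    """
    return [
        # 1. With the 32-bit server running on the old data, dump every database.
        [f"{old_bin}/pg_dumpall", "-f", dump_file],
        # 2. After stopping the 32-bit server, create a fresh cluster with the 64-bit build.
        [f"{new_bin}/initdb", "-D", new_data_dir],
        # 3. With the 64-bit server running on the new cluster, replay the dump.
        [f"{new_bin}/psql", "-d", "postgres", "-f", dump_file],
    ]


if __name__ == "__main__":
    # All four paths below are made-up examples.
    for argv in migration_commands("/opt/pg32/bin", "/opt/pg64/bin",
                                   "/opt/pgdata64", "/tmp/all_databases.sql"):
        print(" ".join(argv))
```

The catch, of course, is step 1: it requires a working 32-bit build, which is exactly what I no longer had.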

I backed out my 64-bit upgrades, then manually uncommented the “build_arch i386” line in macports.conf to force 32-bit builds...then started rebuilding 32-bit versions of all the code. That fixed most everything but not Postgres. I still had at least one load library mismatch crashing that compilation.

As a last ditch effort, I tarred up the entire /opt/local tree and did a nearly full replacement from a sparse image clone of the machine’s boot drive that I made with SuperDuper just before doing the Snow Leopard upgrade (meaning all that code was 32-bit). I didn’t disturb /opt/local/var/db (that’s where my postgres database lived) but deleted and then restored these three directories from the sparse image backup:

  • /opt/local/lib
  • /opt/local/bin
  • /opt/local/share

Rebooted…success!

To enable use of the “port” command on this box, I then reinstalled the Snow Leopard version of MacPorts (restoring selected parts of /opt/local from the backup broke the port command). That went smoothly and “port” now works.

My takeaway: Say ‘no’ to that little voice in your head that suggests you should “improve” a system that’s running well…and don’t ever say anything bad about sparse image backups.


iPhone / iTouch / Android enabled

Got an iPhone 3GS the other day (clearly I’m behind the mobile curve but then I hate talking on the phone so it took me a while to “get it”).

Anyway, after just a few days with the thing, I realize I need to begin tweaking some of the library’s web-based content.

First (easy) step? A touch-device friendly theme for this weblog. I also applied it to the library’s news blog. Thus far it’s working well and it couldn’t be any easier to implement: just drop the code in your plugins folder and activate it. The plugin detects mobile devices automatically and serves the touch theme when appropriate.

http://www.bravenewcode.com/wptouch/

tip: To do a screen capture on your 3G/3GS iPhone, press down the home key then hit the “top” button. Screen flashes and image goes into your photos folder.


MARS update (final)

Completed upgrading the new MARS (DSpace) server from 10.5.8 to 10.6 (Snow Leopard Server). Went very smoothly. Here’s the sequence:

  • run SuperDuper to clone boot drive to second drive in XServe
  • shutdown server, pull boot drive and set aside (for disaster recovery)
  • move second drive to boot drive spot
  • boot off cloned drive
  • insert new disc
  • click on “install Snow Leopard Server”
  • enter serial number when prompted
  • reboot
  • all systems running normally
  • shutdown second time and reinsert original boot drive in second drive slot
  • after running updated OS for two days, will clone new boot drive to second drive

One quirk I’ve noticed: haven’t seen any mention of this elseweb but the /etc/rc.local that was faithfully launching Tomcat under 10.5.8 doesn’t seem to work under 10.6. Launching Tomcat manually now until I figure out enough launchd magic to reenable autostarting (my last session with Lingon wasn’t that successful).
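For anyone in the same spot, here is a minimal sketch of what a launchd job for Tomcat might contain, generated with Python's standard plistlib. Label, ProgramArguments and RunAtLoad are standard launchd keys. The label and the Tomcat path are hypothetical, and I have not tested this on the MARS server. It assumes catalina.sh is started with "run" so that Tomcat stays in the foreground, which is what launchd expects of a job.

```python
import plistlib


def build_tomcat_plist(catalina_sh: str, label: str = "edu.example.tomcat") -> bytes:
    """Return an XML launchd property list that starts Tomcat at boot.

    catalina_sh: full path to Tomcat's catalina.sh (hypothetical on any given box).
    label: a unique job name; the default here is a made-up placeholder.
    """
    job = {
        "Label": label,
        # "run" keeps Tomcat in the foreground so launchd can supervise it.
        "ProgramArguments": [catalina_sh, "run"],
        # Start the job as soon as launchd loads it, i.e. at boot.
        "RunAtLoad": True,
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    # The path below is an example only.
    print(build_tomcat_plist("/usr/local/tomcat/bin/catalina.sh").decode("utf-8"))
```

Saved as a .plist file in /Library/LaunchDaemons, output like this is the sort of thing Lingon writes for you; environment settings such as JAVA_HOME would likely need to be added for a real install.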

I’ll also begin testing whether this new OS might solve the problem(s) that left our Manakin interface too unstable for everyday use.

You can check on that interface via this link:

http://digilib.gmu.edu:8080/xxmlui


MARS update

Finally have a tolerable JSP (JavaServer Pages) interface up and running. I’m no web designer but after firing up CSS Edit and reading (then re-reading) the 1.5.2 DSpace manual, I was able to chase away most of the “boxy-ness” of the default JSP interface. Some things have changed since I last understood DSpace internals (roughly the 1.2 release) but it was similar enough that in no time at all I was back in that “find DSpace and replace with MARS” groove.

I’m certainly ready to say the JSP interface is more stable than Manakin; it seems to run smoother and use fewer resources too. Downside? It’s more work to bend it into something visually interesting and it is clearly less flexible.

Given the experience of the past few weeks,  today I’m thankful for stable.

http://mars.gmu.edu


Catching Up

Yes, those are in fact cobwebs hanging from the corners.  It’s not that I’ve been too busy to write something, just distracted, I guess. Vacation…a few other short trips…couple of deadlines…a digression into the twittersphere…it’s pretty easy to fall behind on one’s blogging…

Award winning

A few weeks ago we received one of the 2009 Campus Technology Innovation Awards. I wasn’t really expecting that (heard there were 349 nominees) so it came as a nice surprise even as it managed to burn up the last of my 15 minutes of fame. (I lost the bulk in June when this AP story appeared in over 2,000 places. My favorite was this one which nicely showcased my sudden-onset facility with Spanish).
