Mechanical Turk as Collection Development Tool

Poking around Amazon’s Mechanical Turk today, I found this “HIT” (Human Intelligence Task) available to webworkers.

The author/publisher is offering $4.00 if you request the book from your library (which I guess they hope will trigger a wave of purchases). I don’t know why it surprised me to see that this sort of thing happens…


Mason Tweets

Earlier today a tweet from Dan Cohen pointed me to an interesting service offered by NC State:

http://twitter.ncsu.edu/

They were nice enough to offer a link to their Zend Framework-based PHP code on the site, so I spent a few minutes today building a Mason tweet aggregator. It still needs a bit of work, and I appreciate the fact that it has that Web 1.0 look that seems to come so effortlessly to me, but it does work and I’ll eventually get around to “styling” it.

http://gmutant.gmu.edu/tweet


OCR, Image/Text PDFs and the Mac

This week I’ve been staring at a collection of just over 29,000 PDFs: image-only copies of thousands of documents created with “…the software that came with the scanner.”

My task? Figuring out the right tools and workflow to get these PDFs through an OCR process so we can unlock the content and make them more accessible. A number of these documents will end up in our MARS system, so exposing the text to the PDFBox indexing code that ships with DSpace is critical (as an aside, I’ve heard that Xpdf is a really nice replacement for PDFBox but I haven’t had time to tip it into our DSpace install yet).

I don’t have a precise OCR accuracy threshold in mind but assume if we can hit the mid-90% range we’ll find that retrieval doesn’t suffer.

I have seen a 2001 study by a group from Harvard University Library that found that 96.6% of searches will succeed on uncorrected OCR’d text. Also worth a look: Rose Holley’s recent article in D-Lib Magazine (“How Good Can It Get?”). She offers a number of interesting ideas on improving OCR accuracy in a large-scale digitization project. For some reason, it seems that most of the literature on OCR accuracy and retrieval focuses on scientific literature, where it appears to make very little difference. [ article behind pay wall ] [ freely viewable version ]

An ideal workflow would look something like this: fill a directory with image-only PDFs and point some sort of OCR process toward it. The final product would be yet another directory that contains “image-over-text” versions of the original PDFs (wherein the OCR’d text resides ‘inside’ the PDF as an extra ‘layer’ of content).
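Here’s a rough Python sketch of that directory-in, directory-out loop. It is only a skeleton: the `ocr_one` argument is a placeholder for whatever OCR tool ends up winning (I haven’t picked one), and the function name and folder arguments are made up for illustration.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def ocr_directory(
    src_dir,
    dst_dir,
    ocr_one: Callable[[Path, Path], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run ocr_one(input_pdf, output_pdf) for every PDF in src_dir.

    ocr_one is a stand-in for the real OCR step (hypothetical); it is
    expected to write an image-over-text PDF to output_pdf.
    Returns (names processed, [(name, error message), ...]).
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    done, failed = [], []
    pdfs = sorted(p for p in src.iterdir() if p.suffix.lower() == ".pdf")
    for pdf in pdfs:
        out = dst / pdf.name
        if out.exists():
            # Already converted on an earlier run, so a restart can resume.
            continue
        try:
            ocr_one(pdf, out)
            done.append(pdf.name)
        except Exception as exc:
            # Drop any half-written output so the next run retries this file.
            if out.exists():
                out.unlink()
            failed.append((pdf.name, str(exc)))
    return done, failed
```

With 29,000 files, the skip-if-already-done check matters more than it looks: a crash at file 20,000 shouldn’t mean starting over.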

I’m trying out Mac-based solutions first, knowing that if it ends up being a Windows-based workflow we’ll likely use OmniPage (a product we already use with our ATIZ bookscanner).


JavaScript speed

I’ve long thought that if you wanted the fastest browser experience on a Mac, you went with the nightly Webkit build from http://nightly.webkit.org/.

So I was surprised today when I happened on the SunSpider JavaScript benchmark site and put several browsers through their paces.

One caveat: this test measures only the core JavaScript engine, not other browser APIs or features. The results (smaller number is better):

Machine: MacPro (dual 2.8 quad-core); OS 10.6.1

Browser                     Time (ms)   Build
Firefox 3.5.3               1036.8      32-bit
Webkit Nightly (r49008)     434.8       64-bit
Google Chrome (4.0.212.1)   434.4       32-bit
Safari 4.0.3                364.6       64-bit
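
To put those numbers in proportion, here’s a quick Python snippet that divides each time by the fastest one. The figures are the ones above; the function name is just something I made up for the example.

```python
# SunSpider totals from the run above, in milliseconds (smaller is better).
results_ms = {
    "Firefox 3.5.3": 1036.8,
    "Webkit Nightly (r49008)": 434.8,
    "Google Chrome (4.0.212.1)": 434.4,
    "Safari 4.0.3": 364.6,
}


def relative_to_fastest(timings):
    """Return each browser's time as a multiple of the fastest time."""
    fastest = min(timings.values())
    return {name: ms / fastest for name, ms in timings.items()}


ratios = relative_to_fastest(results_ms)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:28s} {ratio:.2f}x")
```

Firefox takes roughly 2.8 times as long as Safari on this test, while Chrome and the Webkit nightly land within a fraction of a percent of each other.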


Fix it till it breaks – into 64 bits

I like to fix things till they break. Today’s post is a cautionary tale for that admittedly small niche of sysadmins running OS X Server on XServes upgraded in place from Leopard to Snow Leopard…

For the past few weeks I’ve been tweaking the JSP interface of our MARS system and doing some overdue “authority control” cleanup on subjects and authors. That’s been going so well that late this afternoon I decided to take a crack at updating a few packages originally installed via MacPorts back when the server was running Leopard Server (the in-place Snow Leopard upgrade didn’t disturb the code in the /opt/local destination for MacPorts installs).

I pulled down version 3.2 of Apple’s Developer Tools (to ensure 10.6 compatibility) and went to work. In no time at all I had upgraded ant, maven, postgres, bison, wget, openssl and a host of other dependencies. Rebooted and the fun began. First up, Postgres:

FATAL: incorrect checksum in control file

Never saw that before.

Found a web posting on a Linux site explaining that this could easily happen if you tried to open a database with a 64-bit version of postgres when it had been closed by a 32-bit version. Then it hit me. Of course, on an XServe, Snow Leopard server defaults to 64-bit builds. Under Leopard, I had built a 32-bit version of Postgres.

Recommended solution from the Linux posting: forget about it. Only solution is to open the database under a 32-bit version of Postgres and then dump the data, reimporting it into a new database created by a 64-bit version.
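
For the record, here is roughly what that dump-and-reimport route would look like, sketched as a Python function that only builds the command lines (it doesn’t run anything). The directory arguments and the dump file name are placeholders I invented; `pg_ctl`, `pg_dumpall`, `initdb` and `psql` are the standard Postgres tools, but I didn’t go this route myself, so treat the sequence as an untested outline.

```python
def migration_commands(old_bin, new_bin, old_data, new_data,
                       dump_file="all_databases.sql"):
    """Build the commands for a 32-bit to 64-bit dump and reimport.

    old_bin / new_bin: directories holding the 32-bit and 64-bit
    Postgres binaries (hypothetical locations supplied by the caller).
    old_data / new_data: the existing and the fresh data directories.
    """
    return [
        # 1. Start the old cluster with the binaries that created it.
        [f"{old_bin}/pg_ctl", "-D", old_data, "-w", "start"],
        # 2. Dump every database, plus roles, to a plain SQL file.
        [f"{old_bin}/pg_dumpall", "-f", dump_file],
        # 3. Stop the old cluster.
        [f"{old_bin}/pg_ctl", "-D", old_data, "-w", "stop"],
        # 4. Create an empty cluster with the new binaries.
        [f"{new_bin}/initdb", "-D", new_data],
        # 5. Start the new cluster.
        [f"{new_bin}/pg_ctl", "-D", new_data, "-w", "start"],
        # 6. Replay the dump into the new cluster.
        [f"{new_bin}/psql", "-d", "postgres", "-f", dump_file],
    ]
```

Printing the list first and running each step by hand (as the account that owns the database files) seems safer than scripting the whole thing blind.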

I backed out my 64-bit upgrades, then manually uncommented the “build_arch i386” line in macports.conf to force 32-bit builds… then started rebuilding 32-bit versions of all the code. That fixed most everything but not Postgres. I still had at least one load library mismatch crashing that compilation.
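
Since a commented-out line is easy to overlook, here’s a tiny Python check for which build_arch setting is actually live in the config text. The function name is mine, and it assumes the simple “key value” lines with “#” comments that macports.conf uses; it returns the first uncommented setting it finds.

```python
def active_build_arch(conf_text):
    """Return the first uncommented build_arch value, or None if unset."""
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        parts = stripped.split()
        if parts[0] == "build_arch" and len(parts) >= 2:
            return parts[1]
    return None


# Example with made-up file contents:
sample = """
# CPU architecture to compile for
#build_arch x86_64
build_arch i386
"""
print(active_build_arch(sample))
```

In practice you’d read the text from wherever macports.conf lives on your box and pass it in.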

As a last-ditch effort, I tarred up the entire /opt/local tree and did a nearly full replacement from a sparse image clone of the machine’s boot drive that I made with SuperDuper just before doing the Snow Leopard upgrade (meaning all that code was 32-bit). I didn’t disturb /opt/local/var/db (that’s where my postgres database lived) but deleted and then restored these three directories from the sparse image backup:

  • /opt/local/lib
  • /opt/local/bin
  • /opt/local/share

Rebooted…success!
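
I did the restore by hand, but the same delete-then-restore step could be sketched in Python like this. The function name and arguments are invented for the example; the point is that only the named subdirectories get replaced and everything else under the live tree (var/db in my case) is left alone.

```python
import shutil
from pathlib import Path


def restore_subdirs(backup_root, live_root, names=("lib", "bin", "share")):
    """Replace selected subdirectories of live_root with copies from backup_root.

    Anything not listed in names is left untouched.
    """
    backup_root, live_root = Path(backup_root), Path(live_root)

    # Check the backup before deleting anything from the live tree.
    missing = [n for n in names if not (backup_root / n).is_dir()]
    if missing:
        raise FileNotFoundError(f"not found in backup: {missing}")

    for name in names:
        target = live_root / name
        if target.exists():
            shutil.rmtree(target)
        # symlinks=True copies symlinks as links rather than following them.
        shutil.copytree(backup_root / name, target, symlinks=True)
```

Checking the backup first is the important part: you don’t want to discover a missing directory after the live copy is already gone.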

To enable use of the “port” command on this box, I then reinstalled the Snow Leopard version of MacPorts (restoring selected parts of /opt/local from the backup broke the port command). That went smoothly and “port” now works.

My takeaway: Say ‘no’ to that little voice in your head that suggests you should “improve” a system that’s running well…and don’t ever say anything bad about sparse image backups.


iPhone / iTouch / Android enabled

Got an iPhone 3GS the other day (clearly I’m behind the mobile curve but then I hate talking on the phone so it took me a while to “get it”).

Anyway, after just a few days with the thing, I realize I need to begin tweaking some of the library’s web-based content.

First (easy) step? A touch-device friendly theme for this weblog. I also applied it to the library’s news blog. Thus far it’s working well and it couldn’t be any easier to implement: just drop the code in your plugins folder and activate it. The plugin detects a mobile device automatically and serves the touch theme when appropriate.

http://www.bravenewcode.com/wptouch/

Tip: To do a screen capture on your 3G/3GS iPhone, hold down the home key, then hit the “top” button. The screen flashes and the image goes into your photos folder.


MARS update (final)

Completed upgrading the new MARS (DSpace) server from 10.5.8 to 10.6 (Snow Leopard Server). Went very smoothly. Here’s the sequence:

  • run SuperDuper to clone boot drive to second drive in XServe
  • shut down server, pull boot drive and set aside (for disaster recovery)
  • move second drive to boot drive spot
  • boot off cloned drive
  • insert new disc
  • click on “install Snow Leopard Server”
  • enter serial number when prompted
  • reboot
  • all systems running normally
  • shut down a second time and reinsert original boot drive in second drive slot
  • after running updated OS for two days, will clone new boot drive to second drive

One quirk I’ve noticed: haven’t seen any mention of this elseweb, but the /etc/rc.local that was faithfully launching Tomcat under 10.5.8 doesn’t seem to work under 10.6. Launching Tomcat manually now until I figure out enough launchd magic to reenable autostarting (my last session with Lingon wasn’t that successful).
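
As a starting point for that launchd magic, here’s a Python sketch that generates a minimal job definition with the standard plistlib module. Label, ProgramArguments and RunAtLoad are real launchd keys; the label and the Tomcat path in the example are placeholders, not what’s on our server, and I haven’t yet tested this against MARS.

```python
import plistlib


def tomcat_launchd_plist(label, catalina_sh):
    """Return launchd job XML (as bytes) that starts Tomcat at boot.

    label: reverse-DNS style job name (hypothetical, pick your own).
    catalina_sh: full path to Tomcat's catalina.sh (an assumption about
    your install). "run" keeps Tomcat in the foreground, which is what
    launchd expects of the jobs it manages.
    """
    job = {
        "Label": label,
        "ProgramArguments": [catalina_sh, "run"],
        "RunAtLoad": True,
    }
    return plistlib.dumps(job)


# Example with placeholder values:
xml_bytes = tomcat_launchd_plist(
    "edu.example.tomcat", "/path/to/tomcat/bin/catalina.sh"
)
print(xml_bytes.decode("utf-8"))
```

The output would be saved as a .plist file under /Library/LaunchDaemons and loaded with launchctl. If Tomcat needs JAVA_HOME or similar set, launchd’s EnvironmentVariables key is probably where that goes.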

I’ll also begin testing whether this new OS might solve the problem(s) that left our Manakin interface too unstable for everyday use.

You can check on that interface via this link:

http://digilib.gmu.edu:8080/xxmlui
