Archive for February, 2007

Update on PKP Harvester

Late last month (Jan 22, 2007) the Public Knowledge Project released an updated version (2.0.1) of their PKP OAI Harvester software. I wrote about this tool a few months ago, but hadn’t really spent much time under the hood once I got the 2.0.0 release working in November. In the intervening months, I’ve discovered that I need to beef up my OAI harvesting chops, so a new PKP release was timely.

Unpacking the software, I quickly abandoned the “upgrade” option (the documentation didn’t track with what I was seeing on the screen) and just did a clean install. In less than 15 minutes I had a working copy of 2.0.1 running.

As you probably know, OAI (Open Archives Initiative) is all about achieving interoperability between systems via the exchange of metadata. Put another way (minus buzzwords), it’s just an agreed-upon way for one system (a harvester) to ask another system (a provider) about the contents of the provider’s database and make sense of the answer. The harvester asks the question via a specially crafted URL and the provider responds with an XML file. There’s more to it than that, but that simple description captures the essence of why the OAI-PMH (Protocol for Metadata Harvesting) exists.
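To make the "specially crafted URL" a little more concrete, here is a minimal sketch of how such a request is assembled. The base URL is a placeholder rather than a real repository, and the function name is mine; `ListRecords` and `oai_dc` are the standard OAI-PMH verb and the Dublin Core metadata prefix.

```shell
#!/bin/sh
# Build an OAI-PMH request URL from a base URL, a verb, and a metadata prefix.
# oai_request_url is a hypothetical helper name used only for illustration.
oai_request_url() {
    printf '%s?verb=%s&metadataPrefix=%s\n' "$1" "$2" "$3"
}

# Placeholder base URL (an assumption, not a real provider).
oai_request_url "http://repository.example.edu/oai/request" ListRecords oai_dc
```

Pasting a URL like this into a browser (with a real provider's base URL) returns the XML response the harvester would parse.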

As an aside, the next iteration (OAI-ORE) has the potential to get really interesting. OAI-ORE (Open Archives Initiative-Object Reuse and Exchange) asks why we should be content with just exchanging metadata. Why not develop a protocol that enabled exchange of actual digital objects from various repositories—creating the opportunity for new and potentially more interesting intellectual products? Scholarly mashups? A functioning OAI-ORE infrastructure won’t arrive anytime soon (e.g., at the first meeting of the OAI-ORE Technical Committee last month, one goal was “reaching a shared problem statement”) but it is a promising idea and a project worth tracking.

But returning to the topic at hand, the nice thing about the PKP package is that it comes not only with an OAI harvesting module but also a MySQL backend to store and index the parsed information as well as a template-based user interface for the search and retrieval function. Point the harvesting module at a couple of OAI-compliant sites and in no time at all you’ve built your own local version of OAIster.

I began building our database by calling up the Administrative module on the Harvester’s web-based admin interface and filling out a form for our MARS system (base OAI URL, metadata format, index method, and so on). I clicked the “Update Metadata Index” button and harvesting began. At about 450 records, the process stopped with an XML error displayed in my browser. Dorothea found the record in MARS and noticed an errant control character embedded in one of its metadata fields. She cleaned that up and let me know that XML display errors almost always mean some sort of garbage in the data. When I restarted the harvester from scratch, PKP retrieved just over 1300 records.
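For anyone who hits the same kind of error, here is a rough sketch of how stray control characters could be stripped from a file of exported metadata. This is not what Dorothea did (she fixed the record in MARS itself); it only illustrates which bytes XML 1.0 rejects. The function name is made up for the example.

```shell
#!/bin/sh
# Delete control characters that XML 1.0 does not allow: everything below
# 0x20 except tab (0x09), newline (0x0A), and carriage return (0x0D).
# strip_xml_control_chars is a hypothetical helper name.
strip_xml_control_chars() {
    tr -d '\000-\010\013\014\016-\037'
}

# Example: the stray 0x01 byte is dropped; the tab and newline survive.
printf 'Title\001 with\ta stray byte\n' | strip_xml_control_chars
```

Fixing the source record is still the better cure, since the bad byte would otherwise come back on the next harvest.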

Next step was to identify OAI-compliant servers operated by higher education institutions in Virginia (my goal was to build a sort of regional gateway). I didn’t know of any OAI registries so I started my quest at OAIster. Spent a few minutes working through their 700+ OAI contributors but quickly saw that OAI base URLs weren’t included. A targeted lazyweb request (an email to the OAIster “contact us” link) was immediately productive. Within minutes I received a very helpful reply from Kat Hagedorn, the OAIster Metadata Harvesting Librarian, pointing me to the OAI-PMH Data Provider Registry operated by Tom Habing at the Grainger Engineering Library of the University of Illinois.

The registry is a great resource. Not only does it gather information on 1400+ repositories, it offers additional services as well. There’s an RSS feed of recent additions/edits as well as an SRU service. For example, this query returns information about our MARS system (the URL is too long to display inline):

SRU query for Mason

Armed with this information I located a few sites in Virginia and began building the index. As it was harvesting, I kept trying to figure out why the item count from our MARS system was 1300 (when our current handle count is closer to 2,000). Couldn’t figure it out so I deleted the MARS collection from within PKP and reharvested, wondering if it would hit 1300 again. Nope…this time it stopped at 939. Hmmm.

 

RTFM

While there’s no mention on the web-based “Administration screen”, the documentation warns that for larger sites you need to run the harvest.php program from the server’s command line—to eliminate the possibility of a web timeout. Tried it and all MARS items were harvested (1900 items). I can now turn my attention to making a few tweaks in the user interface (out of the box, the PKP software does not report the contributing repository when displaying matches).

Our test system is here:

http://furbo.gmu.edu/OAIharvester

YAZ PROXY/SRU/VOYAGER/SOLARIS

Looking for a reader who has experience with the YAZ proxy on Solaris. I have a few questions about configuring the software to provide SRU services to the Voyager Z39.50 interface. Hit the “Email Me” link over on the right if you don’t mind a couple of quick questions.


Snow day and the digital library

We’re enjoying a snow day today so I’m trying to catch up with some research I’ve been putting off–an environmental scan on digital libraries (what the term means these days, what sorts of support infrastructure others are building, what staffing levels look like, what services are contemplated or already offered, current standards and practices, and so on). Pulling many megabytes of reports, white papers, conference presentations and the like into my DEVONthink database but I haven’t yet hit upon the essence of what these documents are trying to tell me. I’ll keep hoping there is a unifying thread in there somewhere.

Digging out this information did give me a chance to find and re-read Vannevar Bush’s “As We May Think” article from 1945. Really an amazing vision when you consider what the world around him looked like: no networks, digital computers, or digital storage devices. Much of the article digresses into discussions of crazy analog devices (desktops with glowing surfaces that display microfilmed information) but when he begins describing the “memex” device, the clarity of his vision is scary. For example, here’s his description of how the memex machine would organize information:

“...associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another.”

It’s a good thing Ted Nelson invented “hypertext” eighteen years later or the world might have completely forgotten about this concept.


Back to MARS

[Image: MarsEdit icon*]
For the past year and a half, Dorothea has been our go-to person for MARS (our DSpace installation) and she’s handled most of the sysadmin duties on our XServe as well. She’ll be leaving us at the end of this month (heading back to Madison to straighten out their statewide Minds@UW institutional repository). OK, I’m kidding about the “straightening out” part but unfortunately the rest is quite true.

This means, of course, that DSpace falls back into my lap until we go through a successful “search and hire” process for Dorothea’s replacement (stay tuned for information).

To help me ease back into all things DSpace, today I decided to try to fix a problem that’s been hanging around for months. About once every seven days, idle Postgres connections accumulate in such numbers that Postgres bumps up against its “active process limit.” DSpace, unable to grab a new connection to the Postgres database, instead turns its attention to spamming me with multiple error-message emails. The behavior seems to correspond to the frequency with which the Google crawlers visit us (their deep-burrowing crawl triggers multiple connections that never seem to terminate).

Doing my pre-coding homework, I launched DEVONagent to see if someone else had already solved this problem. Not really, but I did find a workaround hack that Cory Snavely posted on the DSpace-Tech mailing list a few weeks ago. He suggested building a cron script that uses pgrep to find any processes with “idle in transaction” in the process description, pipes the result to wc to count them, and then uses pkill to kill off the oldest one if there are more than 20.

Sounded like a plan. In fact, it’s a great script if you’re running DSpace/Postgres on Solaris or most versions of Linux. We’re on OS X so of course I get to think different(ly).

For starters, pgrep and pkill don’t ship with OS X Server. Annoying (they’ve been on Solaris since Solaris 7) but not a showstopper. I found a port for Darwin (OS X) on SourceForge (proctools) and compiled it.

First try didn’t work because I forgot that my desktop Mac Pro is Intel and our server runs on G5s. Moving the source over and compiling on the server fixed that problem. My next surprise was that this ported version of pkill doesn’t support the “-o” flag that Cory was using to kill off the oldest idle process. So, my version kills off the newest match. (Tip: don’t add the “-v” switch to “-n” thinking it will flip newest to oldest. Using -nv kills off all EXCEPT the newest matching process, which could get sort of dangerous.)

At any rate, I ended up with a little script using test, pgrep and pkill and decided to use launchd instead of cron to call it (that’s the new way, after all). That added a 45-minute excursion into launchd documentation and multiple iterations of testing, but I finally got it working (thanks to the Lingon utility).

I think I’ll make one last modification to my script once I’m sure it’s working well enough to leave in place: have it email me once a week to remind me that it’s running. I can just imagine that in a few weeks I’ll forget all about it and someday the fact that the newest postgres idle process seems to disappear every 60 seconds will have me scratching my head.

Here’s the money line in the OS X version of the script:

/bin/test `/usr/local/bin/pgrep -f '127.0.0.1' | \
/usr/bin/wc -l ` -gt 20 && /usr/local/bin/pkill -n -f '127.0.0.1'
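If the one-liner is hard to read, the same logic can be spelled out roughly like this. It is only a sketch: the paths, the '127.0.0.1' pattern, and the limit of 20 come from the line above, while the function names are invented for readability. Nothing is killed unless you actually call the last function.

```shell
#!/bin/sh
# Values taken from the one-liner above; adjust for your own server.
PGREP=/usr/local/bin/pgrep
PKILL=/usr/local/bin/pkill
PATTERN='127.0.0.1'
LIMIT=20

# Succeeds (exit 0) only when the count is strictly greater than the limit.
over_limit() {
    [ "$1" -gt "$2" ]
}

# Count processes whose full command line matches the pattern.
count_matches() {
    "$PGREP" -f "$PATTERN" | wc -l | tr -d ' '
}

# Kill the newest matching process, but only when over the limit.
reap_newest_if_needed() {
    count=$(count_matches)
    if over_limit "$count" "$LIMIT"; then
        "$PKILL" -n -f "$PATTERN"
    fi
}

# The scheduled job would call reap_newest_if_needed; it is deliberately
# not invoked here so that sourcing this sketch has no side effects.
```

Because only one process is removed per run, a job that fires every 60 seconds trims the backlog gradually rather than all at once.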

I’m logging the results of my script so I can better track how the idle processes rise and fall. Here’s what the log looks like:

07Feb2007:15:40::Idle: 7
07Feb2007:15:41::Idle: 7
07Feb2007:15:42::Idle: 7
07Feb2007:15:43::Idle: 7
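For the curious, log lines in that shape can be produced with something like the following. This is a sketch, not my exact script: the function names are made up, and the count of 7 is hard-coded where a real job would pass in the pgrep count.

```shell
#!/bin/sh
# Format one log entry in the style shown above: DDMonYYYY:HH:MM::Idle: N
format_log_line() {
    printf '%s::Idle: %s\n' "$1" "$2"
}

# Timestamp in the same layout as the sample log, e.g. 07Feb2007:15:40
log_stamp() {
    date '+%d%b%Y:%H:%M'
}

# Example with a fixed count; a real job would append this to a log file.
format_log_line "$(log_stamp)" 7
```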

Will be interesting to see what happens when Google starts bumping up the idle processes and my script is killing them off, one every 60 seconds. With luck it will work well enough to keep the server below the error-spam threshold.

I’ll be sure to update this post if it turns out this brute-force, symptomatic treatment creates any new problems (or fails to fix the problem it’s been tasked to solve).

* thanks to MarsEdit (a great off-line blog editor) for use of the logo.


Three quick notes

A new release of Parallels for Mac is out (RC2, Build 3150). It adds USB 2.0 support, a full-featured virtual CD/DVD drive and Coherence (the ability to run a Windows app without having to see Windows). For recent switchers, there’s a new “Transporter RC2” bundled in as well, which lets you migrate an existing Windows installation into a Parallels virtual machine.

http://parallels.com

WordPress 2.1 breaks ecto

If you use ecto to manage your WordPress blog, you might have noticed that ecto quit working with the release of WordPress 2.1.  I found mention of a “fix” out on the net so you can either wait until WordPress makes the correction (no ETA on this one) or do it yourself:

xmlrpc.php

Lines 978-982 probably look like this:

$categories[] = array(
    'categoryName' => get_cat_name($catid),
    'categoryId' => $catid,
    'isPrimary' => $isPrimary
);

change it to this:

$categories[] = array(
    'categoryName' => get_cat_name($catid),
    'categoryId' => (string)$catid, // the only change: cast the ID to a string
    'isPrimary' => $isPrimary
);

It should start working again…

 

Nerdiversion

Made a small tech breakthrough recently—creating a ringtone for my Razr from a song in my iTunes library.  Why? I could say it’s because my daughter told me the ring of my phone was so lame but I’m used to ignoring that sort of abuse.  The official reason is I wanted to experiment with Bluetooth on my laptop and a Bluetooth-enabled phone was what I had handy.

iTunes to Ringtone

The two pieces you need are WireTapPro ($19) (or Audio Hijack Pro ($32)) and a sound editor like Audacity (open source) or Fission ($32). I used WireTapPro (a utility for capturing your Mac’s audio output and sending it to a file).

Step by step:

  1. Launch WireTapPro and configure it to capture audio output as a mono mp3 with a sample rate of 22.050 kHz.
  2. Select a name and location for the resulting output file.
  3. Launch iTunes.
  4. Cue up the song you want and, just before you reach the part you want to capture, hit the “record” button on WireTapPro.
  5. Let the song play for about 30 seconds (that’s as long as most phones will ring).
  6. Hit the “stop” button on WireTapPro.
  7. Optional: Use Audacity (or Fission) to edit the file WireTapPro created. Chances are you didn’t hit the precise beginning or end of the snippet you wanted to capture. Either of these utilities will help you remove unwanted lead-in or lead-out.

Moving this mp3 file to your phone depends on your phone’s Bluetooth configuration choices, but it’s pretty much a simple file transfer.

One other tip: You can use this method on songs in your personal iTunes library or troll the iTunes music store and sample from the preview snippets they offer for every song in the store.

 
