Archive for November, 2006

desktop federation

A year or so ago I gave a presentation to our university’s President’s Library Planning Task Force, a group of faculty charged with helping describe where the library needs to be in 2010. The focus of my contribution was technology in the library: where we’ve been and where we’re headed. In the course of that, I took a fairly lengthy detour to talk about federated searching, not because we make much use of it but because I wanted the group to better understand the reasons we haven’t embraced it more fully. After all, that single search box is a seductive thing.

No need to rehash the arguments here (a QuickTime video of the presentation slides is available in our MARS repository if you have both excess bandwidth and nothing better to do) but the theme was a simple one: federated searching just doesn’t work very well. Not everything gets searched; compromises are made which tend to dumb down the underlying target systems; true de-duping is virtually impossible; and the value of relevancy ranking falls away, since it’s based only on the metadata the search returned and not on the underlying content. I was essentially talking about the difference between what results from “Just in Time” searching as opposed to “Just in Case” indexing.

I still think these problems and limitations are present in the metasearch products marketed to libraries but I’m beginning to see a future for metasearching that I was missing. Instead of relying on a centrally-hosted metasearch intermediary, why not pull the search piece back to the desktop where more flexible and powerful tools can be employed? In effect, this local application becomes your personal agent–searching the net in the background as you go about other tasks. If well designed, it can then handle de-duplication of results, rank them based on relevancy and present them to you in ways that transcend what a browser-based system can offer.

DEVONagent

For the past week I’ve been reviewing an application that moves us toward that goal. It’s an OS X application (still sporting a “Panther” interface) but what it runs on is far less interesting than what it does.

Note to Windows users: head over to Copernic for a similar product. Ironically, Copernic was once the go-to product for this sort of thing on the Mac but they abandoned the platform when OS X arrived.

Out of the box, DEVONagent is a desktop metasearch engine for the open web: Google, Yahoo! and Metacrawler but also newsgroups, blogs and so on. You set the level of depth you want and enter your search. You can specify a fast scan (e.g., the top 100 results from just a few search engines) or go deep (you set the maximum number of hits per source across a long list of target options) with complex boolean logic fully supported. DEVONagent runs the search, gathers the results, throws away junk hits, eliminates the “404s”, de-dupes the remaining content and then ranks the results. That alone is impressive but it’s really only the start.
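I have no idea how DEVONagent actually merges and ranks its hits, but the gather, de-dupe and rank steps are easy to picture with a small sketch. Everything here is my own assumption for illustration: the function names, the URL normalization rules and the reciprocal-rank scoring are hypothetical and not taken from the product.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query because it often selects the page.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def merge_results(result_lists):
    """Merge ranked hit lists from several engines into one de-duplicated list.

    Each hit is a (url, title) pair and its position in the list is its rank.
    The score is a simple reciprocal-rank sum (an assumed scheme), so a page
    returned by several engines outranks one returned by a single engine.
    """
    merged = {}
    for hits in result_lists:
        for rank, (url, title) in enumerate(hits, start=1):
            key = normalize_url(url)
            entry = merged.setdefault(
                key, {"url": url, "title": title, "score": 0.0}
            )
            entry["score"] += 1.0 / rank
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)
```

Feed it one list per search engine and the page that several engines agree on floats to the top. Note that this works only on URLs and rank positions; ranking on the underlying content, which is what a desktop agent can afford to do, would need the fetched pages themselves.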

It also builds an interactive topic map that shows the primary concepts from the now merged set of results. Click a topic in the map and you see lines drawn to others that are closely related. The text box below the map changes with each click, displaying excerpts from matching pages in the result set, each time highlighting the relevant topic in context. Click a new topic in the map and everything changes.

DEVONagent also offers the ability to save individual documents or the entire set to a local database (called the internal archive) for future use. Pages can be converted to RDF documents if desired.

If you already use DEVONthink (a personal information management package), the “send to DEVONthink” button on the toolbar comes in quite handy. I’m a Yojimbo user (can’t seem to give up the .mac database syncing Yojimbo provides) so I removed the DEVONthink button but the idea of integrating this sort of program with a local database manager is a good one. I get matches into Yojimbo via printing (selecting “PDF to Yojimbo” as my printer). It’s also possible to use the “Launch URL” service to send the link to Firefox for inclusion in a Zotero database.

Deep Web

But wouldn’t it be great if DEVONagent could also search things like JSTOR, First Search and other “restricted” content—like those centrally-hosted federated search systems I complained about earlier? Yes, and an XML plugin architecture is built into DEVONagent for just that purpose. It’s also here that the program comes up a bit short. Unfortunately, the “build your own plugin” function is poorly documented and much harder to configure and use than it should be. I was surprised to find that you can’t just go to DEVONtechnologies and download various XML plugins (in the way you can grab connection files at endnote.com). Reading through the support forums, I realize that this is increasingly what users are asking for so I guess there’s hope. Perhaps DEVONtechnologies thinks bringing order to the chaos of open web searching is sufficient achievement but I’d argue this could be a killer application with just a bit more work.

I did finally manage to build and use a JSTOR plugin but only after benefitting from a post in a DEVON forum and finally guessing the correct file extension to use when saving my plugin (turns out DEVONagent was looking for .plist instead of .xml). I was using TextMate to build my XML file instead of Apple’s Property List Editor so I didn’t get the automatic .plist extension.

Note to DEVONtechnologies: If your program is going to ignore plugins that don’t carry a particular file extension you should probably mention the extension to use at least once in the documentation.

Deep Web / Proxied

What I haven’t been able to do is use DEVONagent to search a site like JSTOR when it’s behind a proxy/authentication server. I hope to solve this problem soon (and I’m close) but it’s uncharted territory. Like many libraries, we use EZproxy to deliver content to authenticated off-campus users and the proxy-by-port-number scheme appears to be complicating things. I’ve had some luck using DEVONagent’s built-in web browser (built with Apple’s Cocoa WebKit) to authenticate with EZproxy before running an agent search but things still don’t work quite right. I’ll post a fix if/when I figure this out.

Conclusion

I don’t want to end this discussion on a negative note. DEVONagent is a great program and after using it for a few days I can honestly report that I’m going directly to Google much less often. Not only am I bypassing advertisements and sponsored links, I’m able to do other things while I let my “agent” handle it. There are still many parts of the program that I’ve not yet explored/mastered and a couple more video tutorials I want to watch on the DEVONtechnologies site but I’m confident I’ll continue to rely on this program. I also highly recommend the downloadable PDF documentation from the developer’s site—it will definitely improve your use of the program.

A fully-functional 30 day demo of DEVONagent is available. Should you decide to register the product, it sells for $49.95. Educational users can receive a 25% discount.


Don’t accept cookies from strangers

Were I in the advertising business, I’d have a different opinion, but I don’t like third-party cookies. “Third-party” because the cookie comes from a server other than the one you originally visited. They’re an invasion of privacy. How?

You visit a web page that has an online banner ad. The ad resides on a different server and when it comes your way a cookie is sent along with it. Next time a page contains an inline advertisement from the same online advertising company (e.g., DoubleClick), your browser will send that cookie back along with the request for the ad. By choosing a unique URL for each advertisement and comparing it with the cookie’s payload and the referrer link, the advertiser knows which pages you’ve viewed and can draw reasonably accurate “maps” of where you’ve been. Over time, the advertiser can build an anonymous profile and further microtarget the “attack”.
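If the mechanics are hard to picture, here is a toy model of the exchange just described. It is only a sketch: the AdServer and Browser classes, the "uid-N" cookie format and the example page names are all hypothetical, and no real ad network is being modeled.

```python
import itertools


class AdServer:
    """Toy model of a third-party ad server that profiles browsers by cookie."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.profiles = {}  # cookie id -> list of pages where an ad was shown

    def serve_ad(self, cookie, referrer):
        """Handle one ad request; return the cookie the browser should keep."""
        if cookie is None:
            cookie = f"uid-{next(self._ids)}"  # first contact: issue a new ID
        self.profiles.setdefault(cookie, []).append(referrer)
        return cookie


class Browser:
    """Toy browser that either keeps or discards third-party cookies."""

    def __init__(self, accept_third_party=True):
        self.accept_third_party = accept_third_party
        self.ad_cookie = None

    def visit(self, page_url, ad_server):
        # The page embeds an ad, so the browser also contacts the ad server,
        # sending any stored cookie plus the page URL as the referrer.
        cookie = ad_server.serve_ad(self.ad_cookie, referrer=page_url)
        if self.accept_third_party:
            self.ad_cookie = cookie
```

A browser that accepts the cookie ends up with every visited page filed under one ID. A browser that refuses it looks like a brand-new visitor each time, so the ad server is left with a pile of unconnected one-page profiles.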

Many browsers offer a way to block third-party cookies (e.g., in Camino, it’s a setting that says “Accept cookies only from sites you visit”) but in the latest release of Firefox (2.0), the preference for blocking third-party cookies has been removed from the user interface. Is this a concession to the .com changes that mozilla.org went through (e.g., don’t upset the web advertisers)? I hope not but fortunately there is a way to fix the problem.

In your location bar (where you’d ordinarily type a URL), enter

about:config

then in the Filter box type cookie

then look for this value:

network.cookie.cookieBehavior

Double-click that line and for the integer value enter 1

Here are the possible values and their effect:

0 – all cookies allowed (default)
1 – only cookies from the originating server are allowed
2 – no cookies are allowed

http://kb.mozillazine.org/Network.cookie.cookieBehavior


Harvesting tips

Today’s post consists of a simple UNIX tip and a note about an interesting piece of software. Given that graphic, let’s start with the tip…

Recursive grep

Most anyone comfortable with the command line knows how to use grep to find a file that contains a particular bit of text—typically to see the line(s) where the match(es) occur. The next level of complexity (and the thing that was giving me trouble today until I figured this out) is searching recursively through a directory tree with grep to find files that reside in subdirectories below your starting point. To save you the trouble of searching this out, here’s one way to do it:

find [StartPoint] -type f -print0 | xargs -0 grep [LookFor]

To start in the current directory and then also search all files below that directory for the string “Harvester” you’d enter:

find . -type f -print0 | xargs -0 grep "Harvester"

Basically you’re using “find” to get the names of all files in the directory you’re in (and those descending from it) and then using a pipe to feed these names to xargs, which gathers them into batches and passes each batch to grep as arguments. The -print0 and -0 options keep file names that contain spaces intact. If your version of grep supports it, grep -r [LookFor] [StartPoint] does the same job in one step. You might want to send the output to a text file if you’re searching for something common, since the output could be quite lengthy (just add a > and a path and filename to the end of the command string shown).
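If you’d rather do the same search from a script, the idea translates directly. This is a minimal sketch, not a replacement for grep: the function name grep_tree is my own, and it does plain substring matching only (no regular expressions).

```python
import os


def grep_tree(start, needle):
    """Yield (path, line_number, line) for every line containing needle."""
    for dirpath, _dirnames, filenames in os.walk(start):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" lets us read binary-ish files without crashing
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and keep walking
```

Calling list(grep_tree(".", "Harvester")) gives the matches for the current directory and everything below it.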

Basic stuff but I’m sure I’ll look back here in a few months to remember the syntax. As an aside, any time you spend studying the “find” command will repay you many times over in keystrokes and time saved.

OAI Harvester

This next “tip” is a great illustration of what the government can accomplish when it decides to help libraries become more powerful stewards of the digital realm. Unfortunately (for my American readers), I’m talking about the Canadian government but given the proximity we can hope this enlightenment might eventually begin to trickle down.

Here’s an excerpt from the PKP website that explains what they’re about:

The Public Knowledge Project is a federally funded research initiative at the University of British Columbia and Simon Fraser University on the west coast of Canada. It seeks to improve the scholarly and public quality of academic research through the development of innovative online environments. PKP has developed free, open source software for the management, publishing, and indexing of journals and conferences. Open Journal Systems and Open Conference Systems increase access to knowledge, improve management, and reduce publishing costs.

We’ve installed their Open Journal Systems (OJS) package and have been quite happy with it. It offers a simple but well-designed platform for hosting the “backoffice” aspects of e-journal publishing as well as managing the presentation of the content for readers. Recommended.

Over the past month or so I’ve been thinking about ways to incorporate OAI services to build different front-ends to digital storage/archiving systems like DSpace, EAD collections, and so on. Following a tip from a colleague, I downloaded and installed another piece of software from the PKP group: the OAI Harvester. Why this amazing app isn’t listed on the “Tools” page maintained by the openarchives.org site escapes me.

It’s a LAMP application (Linux, Apache, MySQL, PHP) but runs quite nicely as a MAMP installation on OS X Server (happily using the MySQL, Apache and PHP installations that ship with 10.4.x). Also plays well with the APC cache I talked about the other day. It took only a few minutes to install, a couple more minutes to configure and was almost immediately useful.

After typing in the OAI query URLs for two DSpace installations (Mason and WRLC), I waited and within a minute or two, I had a nice “union” catalog of the content from both systems. I then added another neighbor’s collections (University of Maryland). Here’s what it looks like (I’m tweaking/testing this installation so it may or may not be running when you follow the link):

http://furbo.gmu.edu/OAIharvester
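For anyone curious what a harvester does under the hood, the core of it is small. The sketch below is not the PKP code (that’s PHP, and far more complete); it just shows the standard OAI-PMH ListRecords request and how Dublin Core titles can be pulled from a response. The function names are mine, and a real harvester would also follow resumption tokens and handle deleted records.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


def build_union(responses):
    """Merge records from several repositories, keyed by OAI identifier."""
    union = {}
    for xml_text in responses:
        for identifier, title in parse_titles(xml_text):
            union[identifier] = title
    return union
```

Fetch the URL from list_records_url() for each repository, hand the response bodies to build_union(), and you have the bare bones of a “union” catalog.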

Right now I’m playing around with “branding” the setup with some Mason-specific mods (which explains why I was trying to grep the string “Harvester” out of the various sub-directories) and generally learning how the package is put together. I may next expand my test database to cover OAI-compliant systems within Virginia and see how that goes. To find the appropriate OAI-query URLs, I’ll start at the OAI provider registry maintained by the University of Illinois at Urbana-Champaign.

Once I figure out all the ways to fix and break this package, and then sort out a group of OAI providers that it makes sense to collect, I think we’ll roll out a production version of this software.

http://pkp.sfu.ca/?q=harvester


APC (Alternative PHP Cache)

A quick update to close (I hope) the book on PHP caching add-ons. As I mentioned before, I’ve been having a lot of trouble getting long-term use out of eAccelerator. Today I downloaded and installed the latest version of APC (Alternative PHP Cache) and thus far it’s working well.

I made a discovery during the compilation and configuration process for APC that may well explain the problems I was having with eAccelerator. Turns out a BSD-derivative UNIX (which describes Mac OS X Server) defaults to a maximum shared memory segment size of 4 megabytes. That’s small compared to Linux 2.2 kernels, which have a default maximum of 32 MB. When I configured eAccelerator, I stupidly used 16 MB for my shared memory segment size then later increased it to 32 MB (more must be better, right?). Doh! I now assume it ran for weeks until it crossed the 4 MB threshold then overwrote memory shared by some other process: “bye bye” server.

If you’re curious, you can issue this command on your Mac to see the value of the shared memory segment size.

/usr/sbin/sysctl -a | grep shmmax

which should return:

kern.sysv.shmmax: 4194304

You’ll see the same value on a machine running either Mac OS X or OS X Server (10.4 in both cases).

Having gotten suddenly smarter, I set APC’s shared memory segment size to the correct value of 4 megabytes and instructed the cache to use up to 8 segments (yielding a 32 MB cache). Based on the readout I’m getting from the monitoring utility that ships with APC, I’ll never use anywhere near 32 MB for the cache. Here are the settings from /etc/php.ini:

extension=apc.so
apc.enabled=1
apc.shm_segments=8
apc.optimization=0
apc.shm_size=4
apc.ttl=7200
apc.user_ttl=7200
apc.num_files_hint=1024
apc.mmap_file_mask=/tmp/apc.XXXXXX
apc.enable_cli=1
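The arithmetic behind those two shm settings is worth spelling out, since getting it wrong is what I suspect bit me with eAccelerator. A tiny sketch (the function name is made up, and it assumes apc.shm_size is given in megabytes, as it is in the settings above):

```python
MB = 1024 * 1024


def apc_cache_plan(shm_size_mb, shm_segments, kernel_shmmax_bytes):
    """Check one cache segment against the kernel limit and total the cache."""
    segment_bytes = shm_size_mb * MB
    return {
        # Each individual segment must fit within kern.sysv.shmmax.
        "segment_fits": segment_bytes <= kernel_shmmax_bytes,
        # Total cache is the per-segment size times the number of segments.
        "total_mb": shm_size_mb * shm_segments,
    }
```

With the values above, apc_cache_plan(4, 8, 4194304) reports that each 4 MB segment fits the kernel limit and that the total comes to 32 MB. A single 16 MB segment, which is what I first gave eAccelerator, does not fit.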

What sort of performance measures are we seeing with APC? After just over 6 hours of operation:

  • Cache hits: 95443
  • Cache misses: 77
  • Request rate: 4.29 cache requests per second
  • Cache size: 4.4 MB
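To put those counters in perspective, here is the back-of-the-envelope math as a short sketch. The function name is hypothetical, and it assumes the reported request rate is an average over the whole run:

```python
def cache_stats(hits, misses, requests_per_second):
    """Derive hit rate and implied uptime from the cache counters."""
    total = hits + misses
    return {
        "hit_rate_pct": round(100.0 * hits / total, 2),
        # Total requests divided by the average rate gives seconds of uptime.
        "uptime_hours": round(total / requests_per_second / 3600, 2),
    }


stats = cache_stats(hits=95443, misses=77, requests_per_second=4.29)
print(stats)
```

That works out to a hit rate of roughly 99.9%, and the request rate implies a bit more than six hours of uptime, which squares with the figure quoted above.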

The hits vs. misses numbers are astounding, but they’re also very much driven by this server’s particular workload (more on that later). The last value is what I want to talk about first. We’ve crossed the 4 MB threshold and it only took 6 hours. If the shared memory overrun theory is correct, how could eAccelerator have run for a couple of weeks before hitting that limit? Good question. My current guess is that eAccelerator’s cache files aren’t the same size as APC’s (I know, for example, that eAccelerator was optimizing the opcodes while APC doesn’t do that) thus it took longer to hit the 4 MB limit.

Getting back to the hits vs. misses: it’s really because of the application we’re running. This server hosts a modified version of the Scout Portal Toolkit (SPT) and the structure of that app differs from your typical server’s mix of PHP scripts. SPT executes a massive include stanza every time a page loads. Net effect: within the first 2 minutes of operation our cache has probably seen 98% of the PHP code it’s going to see. After an hour, there’s only minuscule growth in the size of the cached code. Combining APC with the Scout CWIS truly calls for a YMMV.

One final takeaway: if you run SPT, you owe it to yourself and your users to investigate a PHP cache.


eAccelerator and OSX

OK…it was fun while it lasted but today I’m saying goodbye to eAccelerator…a PHP caching solution I was recommending a couple of months ago. Might be a great product in other environments but I’m having trouble with it on our Xserve (Mac OS X Server 10.4.8).

When I first compiled and installed eAccelerator the performance was great. Stayed great right up until the moment ten days later when the server just wandered off into the weeds—had to hit the power button to bring it back.

System logs showed nothing (of course, when memory gets corrupted they rarely do). I realized the server was about a point release behind the Xcode developer package on my desktop so I updated the server version and recompiled eAccelerator. It sprang to life and ran a full three weeks this time before flaking out. I was ready to give up on it and then noticed there had been a new release of the eAccelerator code so I built that version (0.9.5) and selected very safe, conservative compiler settings. Ran without incident for maybe three more weeks and then zzzzzzzzzzzzzz.

I admit defeat.

I just uninstalled the eAccelerator cache and will now research building an OS X version of the APC cache. My colleague who had the good fortune to meet Rasmus Lerdorf (creator of PHP) a few weeks ago told me he asked about caching and APC is what Rasmus uses.


Business Models

Portico

Attended a presentation today by a representative of Portico (a “dark archive” of e-journal content brought to us by the folks who created JSTOR). It’s an interesting project, trying to solve some of the same problems that LOCKSS addresses.

I find the technology behind LOCKSS more interesting but mention Portico here because we’re talking about business models. How about this one—the people who pay 20% of the costs of the service (publishers) get to make 100% of the rules on how the people paying 80% (libraries) will use it.

As I listened to the presentation, I tried to think of what a service like Portico might look like if the publishing industry just designed it themselves. Had to finally admit I might just be looking at it.

OK, you’re right, they probably would drop the part about kicking in 20% of the costs.

Novell and Microsoft

The first time these two tangled, NetWare crushed Microsoft’s LAN Manager. Then a few years later, Windows NT effectively eliminated NetWare. Round three finds Novell and Microsoft announcing today that they’ll cooperate on getting Windows and Novell’s SuSE Linux to interoperate.

Here’s an IT press blurb on the agreement and the official joint press release.

There are three major pieces to the deal:

Virtualization – Microsoft and Novell will jointly develop a compelling virtualization offering for Linux and Windows. [Guess it has to be really compelling since other companies like Parallels and VMware already do this quite well]

Web Services for managing physical and virtual servers – they’ll work to tie Active Directory together with eDirectory. [Not to be too snarky, but can we hope the combined weight of these two tied together will be enough to sink them both for good?]

Document format compatibility – they’ll work out how OpenOffice and Microsoft Office documents can be shared.

The agreement will run until 2012. Wonder if anyone will care by then?

I run a couple of SuSE Linux servers (including the one hosting this weblog) so I’m hoping any tech transfer that occurs flows from SuSE to Windows and not the other way around.
