Archive for January, 2007

Take out the trash

Earlier this month we moved our library’s primary webserver to a new machine, another in a succession of servers since we began the site in 1994. Checking the document root after this latest migration, I noticed we had a few files lying about with timestamps of 1998! Guess it really is time to clean out some of the accumulated cruft.

Unfortunately, the idea of going through the site, directory by directory, removing files that are no longer linked anywhere just takes more time than it seems to be worth (after all, disk space is cheap and there are always other things that need to be worked on).

However, today I thought of a relatively quick way to figure out what needs whacking:

wget -mr http://library.gmu.edu

wget is a free GNU utility that makes it simple to retrieve files via HTTP. The “-m” switch means to mirror the site (in this case, start with the library’s webserver root directory and grab all files recursively; “-m” already implies recursion, so the extra “r” is redundant but harmless). Since it’s using the HTTP protocol, it only grabs files it can reach by following links, which means the cruft (files not linked to an active page) falls away. If you run wget from your desktop machine, you end up with the site replicated on your local drive.

wget isn’t distributed as part of OS X, but you can grab and compile a copy via DarwinPorts (remember to install the Developer Tools from your OS X install disk before working with DarwinPorts). There are also precompiled install packages available for OS X if you’re not a Fink or DarwinPorts user. Most Linux distros offer wget in the default install, and there are Windows ports as well.

I ran the “wget -mr http://library.gmu.edu” command in my home directory and ended up with a directory named “library.gmu.edu” and within it, the entire “linked-in” site laid out just as it is on the production server. How much digital flotsam was there?

size of /htdocs on “live” server: 543,604,000 bytes
size of wget mirror (only linked files): 190,316,000 bytes

Yikes! Roughly 350 megabytes of orphaned content? Well, not really. I found .htaccess-protected directories that didn’t get linked in (you can configure wget to traverse these as well if desired), a couple of php directories that are actually called by other servers (105 megabytes), and several “backstage/testing” directories (100 megabytes), which means there are still roughly 30 megabytes of cruft (old backup files, inactive pages, abandoned directories, and the like) hanging about. I’ll focus on removing that.
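
For anyone wanting to reproduce figures like the two quoted above, here is a minimal sketch, not necessarily how those numbers were produced. It assumes a POSIX shell; the function name tree_bytes is made up for illustration. It totals the bytes in every regular file under a directory, and the last line simply repeats the subtraction behind the “roughly 350 megabytes” figure.

```shell
#!/bin/sh
# tree_bytes DIR: print the total size in bytes of all regular files under DIR.
# (Hypothetical helper name; it reads every file, so it is slow on big trees.)
tree_bytes() {
    # cat streams every file's contents and wc -c counts the bytes.
    # The $(( ... )) wrapper strips the padding some wc versions add.
    echo $(( $(find "$1" -type f -exec cat {} + | wc -c) ))
}

# The difference between the two sizes quoted above:
echo $(( 543604000 - 190316000 ))   # prints 353288000
```

Running tree_bytes against the server’s document root and again against the local mirror directory would give the two numbers to compare. “du -sk” is quicker, but it reports disk blocks used (in kilobytes) rather than file bytes, so its totals will differ somewhat.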

The fast way to do this would seem to be:

  1. use wget on a 2nd machine to pull down all the linked-in files from the original site
  2. make a tar archive of those files on the 2nd machine
  3. move the tar file over to the original server
  4. delete all files in the original web documents directory
  5. copy the tar file over to the original web documents directory and untar it

Voilà! A clean website.

Fast, yes, but not recommended.

Step 4 (where we delete the original /htdocs directory) would also likely remove important files that the wget operation missed (e.g., .htaccess configuration files, php source files, .asp files if you use those, and so on). wget only ever sees what the server sends over HTTP, so it captures a script’s output, never its source.

The better (but slower) method is to pull down a site via wget, then compare the contents of this copy to the original site, deleting files as it becomes clear they’re not needed. I’ll offer one tip: rather than deleting a file, rename it with a leading underscore (e.g., myfile.html becomes _myfile.html). Later, it’s trivial to delete these files en masse (e.g., rm _*), but if you find you need them, they’re easily renamed and put back into service.
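
The compare-and-rename routine can be sketched in a few lines of shell. This is only a sketch under assumptions: list_unlinked and park_file are names invented here, and it presumes the wget mirror and the live document root (or a copy of it) are both reachable as local directories. list_unlinked prints files that exist in the live tree but not in the mirror; park_file applies the leading-underscore rename.

```shell
#!/bin/sh
# list_unlinked LIVE_DIR MIRROR_DIR
# Print files (as relative paths) present under LIVE_DIR but absent from MIRROR_DIR.
list_unlinked() {
    live_list=$(mktemp) || return 1
    mirror_list=$(mktemp) || return 1
    # Build sorted lists of relative paths; LC_ALL=C keeps sort and comm in agreement.
    ( cd "$1" && find . -type f | LC_ALL=C sort ) > "$live_list"
    ( cd "$2" && find . -type f | LC_ALL=C sort ) > "$mirror_list"
    # comm -23 keeps only the lines unique to the first list.
    LC_ALL=C comm -23 "$live_list" "$mirror_list"
    rm -f "$live_list" "$mirror_list"
}

# park_file PATH
# Rename PATH so its file name gains a leading underscore
# (myfile.html becomes _myfile.html in the same directory).
park_file() {
    mv "$1" "$(dirname "$1")/_$(basename "$1")"
}
```

Something like “list_unlinked /htdocs ~/library.gmu.edu > candidates.txt” would produce a list to review by hand (both paths are just examples). Expect false positives: wget saves a directory URL as index.html, so a live index.php will look unlinked, and the .htaccess and script files mentioned earlier will show up too. That is why the list is for review rather than for feeding straight into park_file.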

One other thing that’s kinda neat. If you want to pull down a site and then convert the links so the site runs locally (really good for testing or burning a CD-ROM), try this:

wget -mr --convert-links http://TopLevelURL

Not quite as functional as programs devoted to web archiving (e.g., for the Mac, I like DeepVacuum which is basically a GUI for wget*) but for a free command line utility, it’s pretty amazing. Spend a bit of time with the wget documentation and you will be rewarded with a useful tool.

*OK, if you want to get technical about it, your browser is also a GUI for wget but it’s not nearly as functional.

WordPress 2.1

Just a quick note to encourage WordPress bloggers that haven’t yet upgraded to release 2.1 to move it up on the to-do list. The web-based editing function is *much* improved.

 

 


Notre Dame’s IDR report

University Libraries at Notre Dame has released a report on Phase I of their Institutional Digital Repository pilot project. Developed by the 29 members of the IDR team, the report weighs in at just over 75 pages (I’m trying to resist the analogy of artisans and cathedral construction).

Much here is either specific to Notre Dame’s environment or well known to anyone actively involved in building or using or promoting an IR. Nevertheless, it’s interesting to see once again that everyone’s grappling with variations of the same issues. Having Notre Dame make their experience widely available is certainly helpful and appreciated.

I recommend Appendix G. (IDR Marketing Plan – Draft). This section makes the point that marketing an IR service isn’t a separate phase of the project—it’s a critical piece of a successful operation’s day-to-day activity.

http://www.library.nd.edu/idr/documents/idr-final-report.shtml


A Personal Content Management System

I have a mechanically-inclined friend who doesn’t get too upset when the unexpected car repair hits; he says it’s just an opportunity to buy a new tool. I envy his cheerful outlook and realize that I’m not there yet, but I am making progress. For example, the other day when I found out I’ll have to chair a task force I didn’t just see the down side (you know, the part where it’s your job to turn an open-ended charge into a compelling strategic vision and a vague sense that “we probably should be doing something” into a tightly-focused action plan). I also realized:

“Hey, this is an opportunity to try out some new software.”

The choice of what to try was easy enough. A few months ago I posted a note touting a desktop federated search product called DEVONagent which I’m still using and now consider my go-to tool for in-depth web research. Given that this new project will begin with a wide-ranging survey of what others are doing or thinking about, I decided it was a good time to try DEVONthink. If DEVONagent is all about finding information, then DEVONthink is about making some sense of it.

So what is DEVONthink?

On one level, it’s a free-form database for documents (Word files, RTF, text, etc.), PDFs, images, web pages, MP3s, QuickTime files, RSS feeds, and more. Each database you create can have groups and, within these groups, a collection of all sorts of files, webpages, text notes, etc. When you add an item to your database (e.g., either via “drag and drop” or an application’s Services menu), you can either move the item into the appropriate folder/group or let DEVONthink apply its AI engine to automatically make the classification (yes, you can correct or change it if necessary). I have just over 100 items in my database (scattered across seven or eight groups), and despite the relatively small sample size, I’ve yet to see DEVONthink’s “auto-classify” fail to figure out the proper group for an item. This auto-classification and the other semantic tools built into DEVONthink just get stronger as your database grows.

The AppleScript/Automator support in DEVONthink Pro extends import options. For example, you can set up “watch” folders and attach specific AppleScripts to each. One possibility: set up a PDF import folder and attach the text conversion script to it. When you drop a PDF into this folder, DEVONthink will convert the PDF to Rich Text and import the resulting item into your database.

Combining DEVONagent with DEVONthink takes database building to the next level. Run a search with DEVONagent, then use the built-in browser within DA to step through the results. If you see a ‘keeper’ just push the “send to DEVONthink” button and decide whether you want to store it as a PDF, a complete web archive (preserving all graphic content as well) or perhaps just a URL. Since DEVONagent has stripped out duplicates, spam, junk links and advertisements from the results set, you can quickly get just the important “hits” into your local DEVONthink database.

But DEVONthink is more than a database; it’s also a feature-rich work environment. There’s an integrated browser (built with Safari’s WebKit), an RSS reader, an RTF editor, a PDF viewer and more. A number of scripts and Automator actions also simplify the task of working with the database and performing actions like backup. DEVONthink Pro also offers a database-wide concordance.

The AI capability within DEVONthink makes searching a breeze: find a document (exact text or fuzzy searching supported) then use it to identify related documents via the “see also” button on the results page. I found a web posting by Steven Johnson (from January 2005) that offers a really nice explanation of how semantic searching works within DEVONthink. The program’s evolved since that time but the ideas are still well worth reading and fully applicable.

I’ve only been using DEVONthink Pro in a serious way for a week and I continue to discover new things about the program. That doesn’t mean it’s poorly designed or difficult to use—just that it is an incredibly powerful piece of software that rewards exploration. For example, today I discovered it’s possible to export a group (or an entire database) as a website. I can imagine that would be very useful if I decide I need to share some of these documents with others. I tried a sample using a group of documents about ILL and electronic delivery. The site you end up with is pretty crude—just a directory listing of the files—but it wouldn’t take much work to build an intervening page that gave the documents some context. According to the documentation, you can do this document linking within DEVONthink using a wiki-style syntax—I’ll admit I haven’t seen the need for that yet. If this sounds like something you could use, a sample database on the DEVONthink website shows how an entire website could be built within DEVONthink and then exported to a server.

There are three versions of DEVONthink: Personal, Professional and Pro Office. DEVONtechnologies offers a “feature comparison” chart that helps differentiate the products.

I started with DEVONthink Personal and finally decided to upgrade to the Professional version to pick up the multiple database support and AppleScripting capability. DEVONtechnologies offers liberal “full-feature demo” use for each product (150 hours of non-contiguous runtime) so it’s easy enough to grab a copy, give it a thorough testing and see if it meets your needs. If you’re affiliated with an educational institution, you can request and receive a 25% discount. DEVONtechnologies offers great technical support, has an active “users forum” and seems to be on a very regular update cycle.

Current version of DT Pro is 1.3 beta 2. Universal Binary. Requires Mac OS X 10.3.9 or higher (10.4 recommended).

http://www.devonthink.com


Voyager’s back

Just a quick note to explain today’s OPAC downtime.

We needed to apply a jumbo patch update to our Voyager system, fixing a number of bugs and (we hope) not introducing too many new ones. The patch got off to a slow start until a permissions problem was discovered then zoomed along until we hit what I knew was going to be a problem—separating off our Web OPAC server from the machine running Voyager and Oracle. The Voyager documentation assumes you’ll run everything on the same machine but for performance reasons, we use a separate server to handle the web traffic. Thanks to error logs, I finally figured out that several Oracle load libraries were missing and moved them over from the primary server.

Presto…OPAC returns.


Library website moves, etc.

Yesterday’s modifications to our campus DNS completed the incremental move of our library’s web site to new servers. Spent several hours “tailing” the log files to see what did and didn’t make the transition, but problems were minor: mostly system-level fixes for forms processing (e.g., tweaking sendmail) and Apache configuration (virtual host configs are a bit different in Apache 2.2).
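
A rough way to shortcut some of that log watching, offered as a sketch rather than what was actually run here: count the requests that came back 404 and list the most-requested missing paths first. It assumes Apache’s common or combined log format (request path in the seventh whitespace-separated field, status code in the ninth); top_404s is a made-up name.

```shell
#!/bin/sh
# top_404s LOGFILE
# Print "count path" for every request path that returned 404, busiest first.
top_404s() {
    awk '$9 == 404 { print $7 }' "$1" |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'   # normalise the padding uniq -c adds
}
```

Piping the result through “head -20” keeps the list short. Paths that show up with high counts right after a migration are good candidates for files that didn’t make the trip.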

The move took our site off an old Sun E250 (Solaris 8/Apache 1.3.6) and placed it on a newer V240 (Solaris 10/Apache 2.2.2). Most of the heavy lifting on the library’s site occurs on an Apple Xserve (our “database portal,” our “e-journal finder,” and a MySQL server), which we transitioned during Fall 2006, so this move of the largely static portion of our site was chiefly done to get off older hardware before it began failing.

The change was so smooth that naturally I decided this afternoon to move off the final, much smaller virtual host that was still using the old hardware. Didn’t work. Don’t yet know why but thankfully I was able to get the campus DNS reset to the old values for that vhost and things are again working.

No classes next week so I have plenty of time to start the ‘fix it till it breaks’ cycle again on Monday.

Nice (and free) OS X disk utility

Stumbled across a simple but useful utility the other day for quickly measuring the size in bytes of a given folder and all subfolders and files within it.

The files and folders are automatically sorted by size which brings new efficiencies to your “delete some stuff to make more room” efforts. You can read about the utility and download a copy from ID-Design’s website.

Thanks to Clyde Deda for making this program available. Requires OS X 10.3.9 or later (Universal Binary).

WhatSize

XPad

Another application that just transitioned from Shareware to Freeware (thanks to a spat between the developer and MacZot) is XPad (billed as the ‘ultimate notepad’). This one appears to be PowerPC only.

You can grab a copy at http://getxpad.com.
