Take out the trash
Earlier this month we moved our library’s primary webserver to a new machine, another in a succession of servers since we began the site in 1994. Checking the document root after this latest migration, I noticed we had a few files lying about with timestamps of 1998! Guess it really is time to clean out some of the accumulated cruft.
Unfortunately, the idea of going through the site, directory by directory, removing files that are no longer linked anywhere just takes more time than it seems to be worth (after all, disk space is cheap and there are always other things that need to be worked on).
However, today I thought of a relatively quick way to figure out what needs whacking:
wget -mr http://library.gmu.edu
wget is a free GNU utility that makes it simple to retrieve files via HTTP.
The “-m” switch means to mirror the site (in this case, start with the library’s webserver root directory and grab all files recursively). Mirroring already turns on recursion, so the extra “-r” is redundant but harmless. Since it’s using the HTTP protocol, it only grabs files that are referenced in URLs, which means the cruft (files not linked to an active page) falls away. If you run wget from your desktop machine, you end up with the site replicated on your local drive.
wget isn’t distributed as part of OS X but you can grab and compile a copy via DarwinPorts (remember to install the Developer tools from your OS X install disk before working with DarwinPorts). There are also precompiled install packages available for OS X if you’re not a Fink or DarwinPorts user. Most Linux distros offer wget in the default install and there are Windows ports as well.
I ran the “wget -mr http://library.gmu.edu” command in my home directory and ended up with a directory named “library.gmu.edu” and within it, the entire “linked-in” site laid out just as it is on the production server. How much digital flotsam was there?
size of /htdocs on “live” server: 543,604,000 bytes
size of wget mirror (only linked files): 190,316,000 bytes
Yikes! Roughly 350 megabytes of orphaned content? Well, not really. I found .htaccess-protected directories that didn’t get linked in (you can configure wget to traverse these as well if desired), a couple of php directories that are actually called by other servers (105 megabytes), and several “backstage/testing” directories (100 megabytes). So there are still roughly 30 megabytes of cruft (old backup files, inactive pages, abandoned directories, and the like) hanging about. I’ll focus on removing that.
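If you want to reproduce that size comparison on your own site, here is a minimal Python sketch. The function name tree_size is my own invention for illustration, and it assumes you can read both the live document root and the wget mirror from the same machine (or run it once on each).

```python
import os

def tree_size(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link does not count its target a second time.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

# With the figures quoted above, the gap between the two trees is:
# 543_604_000 - 190_316_000 = 353_288_000 bytes (the "roughly 350 megabytes").
```

Calling tree_size on the live /htdocs directory and again on the mirror directory, then subtracting, gives the upper bound on how much unlinked material there is to review.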
The fast way to do this would seem to be:
- use wget on a 2nd machine to pull down all the linked-in files from the original site
- make a tar archive of those files on the 2nd machine
- move the tar file over to the original server
- delete all files in the original web documents directory
- copy the tar file over to the original web documents directory and untar it
Voila! A clean website.
Fast, yes, but not recommended.
Step 4 (where we delete the original /htdocs directory) would also likely remove important files that the wget operation missed (e.g., .htaccess configuration files, php source files, .asp files if you use those, and so on).
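To get a feel for how much would be lost, you could list those files before touching anything. The sketch below is only an illustration: the name at_risk_files and the suffix list are my assumptions, so adjust them for whatever server-side languages your site actually uses. It flags dotfiles (which are typically neither linked nor served) and server-side scripts (which wget receives as rendered output rather than as source).

```python
import os

# Assumption: these suffixes mark server-side source files in this docroot.
SERVER_SIDE_SUFFIXES = (".php", ".asp")

def at_risk_files(docroot):
    """List files under docroot that a wget mirror would likely miss or alter."""
    risky = []
    for dirpath, _dirnames, filenames in os.walk(docroot):
        for name in filenames:
            is_dotfile = name.startswith(".")  # e.g. .htaccess
            is_script = name.lower().endswith(SERVER_SIDE_SUFFIXES)
            if is_dotfile or is_script:
                full = os.path.join(dirpath, name)
                risky.append(os.path.relpath(full, docroot))
    return sorted(risky)
```

Anything this returns is a file you would have to restore by hand after the "fast" approach.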
The better (but slower) method is to pull down a site via wget then compare the contents of this copy to the original site, deleting files as it’s clear they’re not needed. I’ll offer one tip: rather than deleting a file, rename it with a leading underscore (e.g., myfile.html becomes _myfile.html). Later, it’s trivial to delete these files en masse (e.g., rm _*) but if you find you need them, they’re easily renamed and put back into service.
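The compare-then-rename routine can be sketched in a few lines of Python. This is a rough illustration, not a finished tool: the function names are hypothetical, and it assumes the live document root and the wget mirror are both reachable as local directories. It only produces a list of candidates; files such as .htaccess or scripts called from other servers will show up too, so review the list before setting anything aside.

```python
import os

def relative_files(root):
    """Return the set of file paths under root, expressed relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found

def find_unlinked(live_root, mirror_root):
    """List files present in the live tree but absent from the wget mirror."""
    return sorted(relative_files(live_root) - relative_files(mirror_root))

def set_aside(live_root, rel_path):
    """Rename a file with a leading underscore instead of deleting it."""
    directory, name = os.path.split(rel_path)
    new_rel = os.path.join(directory, "_" + name)
    os.rename(os.path.join(live_root, rel_path),
              os.path.join(live_root, new_rel))
    return new_rel
```

A typical session would call find_unlinked on the two directories, read through the result, and call set_aside only on the entries you have confirmed are no longer needed.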
One other thing that’s kinda neat. If you want to pull down a site and then convert the links so the site runs locally (really good for testing or burning a CD-ROM), try this:
wget -mr --convert-links http://TopLevelURL
Not quite as functional as programs devoted to web archiving (e.g., for the Mac, I like DeepVacuum which is basically a GUI for wget*) but for a free command line utility, it’s pretty amazing. Spend a bit of time with the wget documentation and you will be rewarded with a useful tool.
*OK, if you want to get technical about it, your browser is also a GUI for wget but it’s not nearly as functional.
WordPress 2.1
Just a quick note to encourage WordPress bloggers who haven’t yet upgraded to release 2.1 to move it up on the to-do list. The web-based editing function is *much* improved.

opportunity to buy a new tool. I envy his cheerful outlook and realize that I’m not there yet but I am making progress. For example, the other day when I found out I’ll have to chair a task force I didn’t just see the down side (you know, the part where it’s your job to turn an open-ended charge into a compelling strategic vision and a vague sense that “we probably should be doing something” into a tightly-focused action plan). I also realized:
QuickTime files, RSS feeds, and more. Each database you create can have groups and within these groups a collection of all sorts of files, webpages, text notes, etc. When you add an item to your database (e.g., either via “drag and drop” or an application’s Services menu), you can either move the item into the appropriate folder/group or let DEVONthink apply its AI engine to automatically make the classification (yes, you can correct or change if necessary). I have just over 100 items in my database (scattered across seven or eight groups), and despite the relatively small sample size, I’ve yet to see DEVONthink’s “auto-classify” fail to figure out the proper group for an item. This auto-classification and the other semantic tools built into DEVONthink just get stronger as your database grows.
The AI capability within DEVONthink makes searching a breeze: find a document (exact text or fuzzy searching supported) then use it to identify related documents via the “see also” button on the results page. I found a 
