Ever since I saw Jeffrey Zeldman’s comment on tag clouds (“Tag Clouds Are The New Mullets“), I’ve been interested in getting one. A tag cloud? No a mullet (just kidding). What I’ve been interested in is the way that tag clouds help you visualize the relative importance of particular content at a site. A list of subjects (tags) ranked by the number of times they occur would give the same information but like a regular haircut, not much visual impact.
With that bit of background, the other day I decided to build a tag cloud that captured and visually ranked the subject headings used in our catalog (710,000+ unique subjects). I realized that this wouldn’t be as informative as a years-consuming conspectus of the collection but what the heck, it would be visually compelling.
Step one involved running an SQL+ query against our Voyager system to produce a list of all the subject headings in our catalog and the number of bibliographic records attached to each of them. Here’s the SQL piece if you’re interested (it’s Voyager-specific and includes sub-headings):
from bib_index, bib_master
where bib_master.suppress_in_opac = ‘N’
and bib_master.bib_id = bib_index.bib_id
and substr(index_code,1,1) = ’6′
group by normal_heading
That created a text file listing the subjects and the count, separated by a ‘~’ delimiter. I wrote a short PERL program to take that list and create a ready-to-import file for MySQL, building a new database that I could manipulate via PHP.
open(OUTFILE,”>subjex.sql”);
print OUTFILE “use subjex;\n”;
print OUTFILE “CREATE TABLE subjex (\n”;
print OUTFILE “ID int(11) DEFAULT ’0′ NOT NULL auto_increment,\n”;
print OUTFILE “Subject char(230),\n”;
print OUTFILE “Count int(6),\n”;
print OUTFILE “PRIMARY KEY (ID)\n”;
print OUTFILE “);\n\n”;
$increment = 0;
$InputLine = <INFILE>;
while ($InputLine ne “”) {
$inc++;
@Chunks = split(/~/,$InputLine);
print OUTFILE ‘INSERT INTO subjex VALUES (‘.$inc.’,”‘.$Chunks[0].’”,”‘.substr($Chunks[1],0,5).’”);’.”\n”;
$InputLine = <INFILE>;
}
close OUTFILE;
close INFILE;
Two more commands…
# mysqladmin create subjex
# mysql < subjex.sql
…and I had a database. To make sure it was working right, I tried a few simple queries. In the course of this “research” I made what I think must be an important discovery:
| INTERNATIONAL ECONOMIC RELATIONS | 1388 |
| ECONOMICS | 1524 |
| CALIFORNIA MAPS TOPOGRAPHIC | 1528 |
| UNITED STATES RACE RELATIONS | 1559 |
| VIDEO RECORDINGS FOR THE HEARING IMPAIRED | 1645 |
| BUDGET UNITED STATES | 1859 |
| FEATURE FILMS | 3037 |
| ELECTRONIC BOOKS | 32887 |
OK, so I see we have quite a few Electronic Books but is that really a subject? Or how about our second most popular term Feature Films? Would a book on feature films that got made into a movie be filed under Feature Films—Feature Films? I love recursion but there’s a time and place for it.
I think I’m going to have to do a bit more work before I move to the final piece of my OPAC tag cloud…or maybe I’ll just finish the project and see how it looks. All that’s left is a PHP script that builds the webpage of alphabetized subject headings using a CSS stylesheet and the frequency data to scale the font up or down. I’m not yet sure how I’ll develop the necessary AI to toss font-scaling skews like Electronic Books but now that Web 2.0 has made everyone comfortable with a perpetual beta, I guess I don’t have to worry about that right away…
By Family Man Librarian » Questions about tagging December 19, 2005 - 11:48 am
[...] I love the concept of developing a tag cloud, and am intrigued by the work of some to tie this into the library online catalog. See, e.g., the work being done at the University Libraries at George Mason University using subject headings extracted from a Voyager system. This is just one example; there are many others. What I am not clear on, though, is whether this is really the answer or solution to the inherent disparity in tags used by different people. Does this disparity not really matter? Of course, there has always been disparity in library cataloging, even when using subject thesauri like LCSH. [...]