What are they e-reading? (2011)

Based on usage, here are the ten most popular e-books in our Safari Online Books collection for 2011.  Each title includes [number of accesses during the year].   In total, an e-book was pulled from our virtual shelves 356,563 times last year.   The top 10:

  • Program Development in Java: Abstraction, Specification, and Object-Oriented Design [26,372]
  • Internet & World Wide Web: How to Program (4th edition) [20,417]
  • Effective Java: 2nd edition [15,853]
  • Computer Security: Art and Science [8,165]
  • Joomla! 1.5:  A User’s Guide: Building a Successful Joomla! Powered Website, 2nd Edition [6,914]
  • Test Driven: Practical TDD and Acceptance TDD for Java Developers [4,131]
  • Head First Servlets and JSP, 2nd edition [4,061]
  • Head First HTML with CSS & XHTML [3,791]
  • Head First Design Patterns [3,465]
  • OCA Oracle Database 11g: Administration: Exam Guide (Exam 1Z0-052) [2,713]

And the least popular e-book? Hard to say.  We had 1,054 titles that saw only one virtual access during the year and 2,983 that just gathered cosmic dust.  5,097 (roughly 63%) of the titles in this particular e-book collection had at least one access.

 

Posted in General

What are they e-reading?

I often wonder who reads the e-books we link into our catalog. While I love reading on my Kindle, I can’t go more than a few “pages” into a web-based e-book before my head starts splitting. Of course, I have to overcome years of paper-based reading to be satisfied with these webified versions, so perhaps my reaction isn’t typical.

We offer just over 8,070 Safari e-books in our catalog and during the first two months of this year 1,370 of those titles were accessed (roughly 17% usage). So, what’s hot?

Top ten titles for 2011 (with year-to-date access in parentheses):

  • Program Development in Java: Abstraction, Specification, and Object-Oriented Design (3587)
  • Computer Security: Art and Science (3202)
  • CCIE Professional Development Routing TCP/IP (1180)
  • Head First Design Patterns (1146)
  • Internet & World Wide Web: How to Program, 4th ed. (1062)
  • Learning SAS® by Example: A Programmer’s Guide (877)
  • Effective Java™, Second Edition (772)
  • CompTIA A+® Certification All-in-One Exam Guide, 7th Edition (671)
  • Head First HTML with CSS & XHTML (660)
  • Test Driven: Practical TDD and Acceptance TDD for Java Developers (658)

Least popular title that had at least one use thus far in 2011?

  • Zend Framework in Action

Posted in Library Tech

What happens to the mid-major library?

I was listening to the latest Digital Campus podcast on my way home yesterday when the discussion began to hit close to home…

Talking about a day in the maybe not-so-distant future when most books are available in one e-form or another:

Dan Cohen: “I wonder what will happen to libraries of the size we have here at Mason, you know, the one to two million volumes, pretty much recent collection (the past 100 years), doesn’t have a deep catalog of rare books? What happens in a world of all digital book content to that kind of library?

I still get the Library of Congress or Harvard or the University of Michigan, but it’s hard to give a rationale for why a library like Fenwick here at Mason sticks around. It’s a lot of heating, a lot of physical plant…and it’s a lot of people. And I love libraries, but aside from that fact, the sort of Upstairs/Downstairs ‘Well this is where the poor people go to get their sad, old printed books’ …you know, what happens to it? Even now it’s not a place where people start their research…”

Dan, I’ve been asking myself some form of that question for at least ten years.

While I surely have a salary-driven bias, I’ve always assumed there will still be something we’ll call a library when the e-future arrives. But I sometimes wonder–will it have evolved from today’s library or have been created as a replacement for it?  Thinking about how we get from here to there, I worry:

  • Will we, as a profession, spend too much energy chasing improvement in the transactional metrics of success (items circulated, reference questions asked, gatecount, etc.)? Once tried-and-true measures of library utility, they’re in irrevocable and ever-accelerating decline. Shouldn’t we accept that and begin redeploying resources in pursuit of new opportunities?
  • Will we recognize and be able to exploit transformational moments as they appear? Or will we pass on them as “not something libraries traditionally do?” Put another way, how far is it from “heart of the university” to “vestigial organ?”

Today one ‘mid-major’ library has roughly the same collection as the next one on the list.  That sameness, combined with the trend toward outsourcing what were once considered core enterprise-level services  (e.g., campus email systems moving to a vendor-supplied cloud), seems a dangerous mix for the library. Let me share my own “worst case” scenario:

The world has gotten past the friction that limits universal satisfaction with today’s e-readers and e-content. Into that environment steps a large ‘web-scale’ vendor…offering the university a subscription that provides e-access to all e-content along with a strategically-priced bundle of e-reference services.

Think it can’t happen? The ears of that cat are already peeking out of the bag. Consider a product like Summon. The ProQuest business model for Summon is surely based on two facts:

  • the leased (or licensed if you prefer) e-content of each library is roughly the same
  • it actually resides on the servers of vendors outside the library

Why not engage in a bit of corporate cooperation and then sell access to a cloud-based index of that content over and over to each and every one of those libraries?  To the degree that you can vertically integrate content leases with the search mechanism–well, that’s what they call “lock-in” gravy.

So, back to Dan’s question. What does ‘the library’ do for a second act? I’m guessing:

  • Our footprint (buildings and staff) will be much, much smaller
  • We’ll offer very fast and ubiquitous networking on site and focus on high-end equipment/tools for manipulating and reworking digital content
  • We’ll offer on-demand services like “find and print” or “find and import” so users can build their own libraries
  • We’ll develop special tools and services to aggregate e-content in locally relevant ways (a 21st century analog to the old “finding aid”)
  • We’ll put much more emphasis on supporting teaching and learning
  • We’ll focus as much energy on data-driven research as we do today on the bibliographic-driven counterpart
  • We’ll offer more service and financial support for the front-end of the scholarly communication process (e.g., paying fees for campus authors in OA journals, helping authors secure their rights and protect the value of their intellectual investment, etc.)
  • We’ll still be doing the “special collections and archives” thing as that will be a large part of what differentiates libraries

We’ll surely still find that students are starting their research elseweb…

Posted in General

Life imitates Art

Speaking of librarianship and technology (as happens here from time to time), I want to highlight a couple of signs from the rally down on the Mall this past weekend. Here’s a frame from some video I shot:

Thanks to a tweet from Dorothea, I now know the inspiration for that one…

Years of wacky vt100 emulation have given me an appreciation for my other favorite (sorry, don’t have a photo of it but this pretty much captures the idea):

Posted in General

What different sort algorithms sound like

from andrut:

This particular audibilization is just one of many ways to generate sound from running sorting algorithms. Here on every comparison of two numbers (elements) I play (mixing) sin waves with frequencies modulated by values of these numbers. There are quite a few parameters that may drastically change resulting sound – I just chose parameteres that imo felt best.
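To make the idea concrete, here is a minimal sketch in Python of turning comparisons into tones. It is not andrut's code: the choice of bubble sort, the function name `comparison_tones`, and the `base_hz` and `step_hz` parameters are all assumptions made for illustration. It only records the pair of frequencies that would be mixed at each comparison; actually playing the sine waves is left out.

```python
def comparison_tones(values, base_hz=220.0, step_hz=20.0):
    """Bubble-sort a copy of values and log one frequency pair per comparison.

    The linear value-to-frequency mapping is an assumption, not the
    mapping used in the original audibilization.
    """
    data = list(values)
    tones = []
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            a, b = data[i], data[i + 1]
            # Each comparison yields two frequencies, one per compared element.
            tones.append((base_hz + step_hz * a, base_hz + step_hz * b))
            if a > b:
                data[i], data[i + 1] = b, a
    return data, tones
```

Different algorithms compare different pairs in a different order, which is why each one would produce its own recognizable sound.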

Posted in Desktop Software, General, Library Tech

Fun with the 245 tag

Over the past few months I have, on more than one occasion, found myself making a full extract of the bibliographic (MARC) records in our library’s catalog. Turns out, this sort of thing happens frequently when you run your own ILS locally but also belong to a consortium where individual members are busily adding different “discovery” layers to the underlying catalog that they all share.

Some of this work is constant; some comes in spurts. Nightly, for example, I have a script that updates an AquaBrowser instance our consortium operates and then sends a second copy of those changes to Serials Solutions so another member’s Summon instance will reflect more current information. Less frequently, I respond to a request for an extract to populate trial instances of another product someone else is considering.

Let’s not even mention the skunkworks instance of VuFind that I run as a sort of library geek hobby.

Yesterday, I decided to take one of those extract files and try a text mining experiment on my desktop Mac. To begin, I ran the file of nearly 1.7 million MARC records through a Windows VM so MarcEdit could produce a plain text version of the data.

As a first cut, I extracted all the 245 tags (titles) from the file.

grep "=245 " MasonBibs.txt > 245tags.txt [return]

which yielded:

=245  10$aAdolescence.$cPrepared by the society's committee. Edited ...
=245  10$aGuidance in educational institutions.
=245  14$aThe teaching of reading: a second report.
=245  10$aHighways into the Upper Amazon Basin.
=245  10$aProust et le roman,$bessai sur les formes et techniques ...
=245  10$aCreative management in banking.

Interesting, but clearly more processing was needed. With a short perl script I removed the tag labels, subfield codes and most punctuation. Seconds later, I had a 169MB text file that looked like this short excerpt (a 245 tag on each line):

Adolescence Prepared by the society s committee Edited by Nelson B Henry
Guidance in educational institutions
The teaching of reading a second report
Highways into the Upper Amazon Basin
Proust et le roman essai sur les formes et techniques du roman dans
Creative management in banking
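For anyone who wants to reproduce that cleanup step, here is a minimal sketch in Python rather than Perl. It is not the script used above. The function name `clean_245` is hypothetical, and it assumes the MarcEdit text layout shown earlier: the tag, two spaces, two indicator characters, then `$`-prefixed subfield codes.

```python
import re

def clean_245(line: str) -> str:
    """Reduce one '=245' line from the text dump to bare title words."""
    # Drop the tag label, the two spaces after it, and the two indicators.
    text = re.sub(r"^=245\s{2}..", "", line.rstrip("\n"))
    # Replace subfield codes such as $a, $b, $c with a space.
    text = re.sub(r"\$[a-z0-9]", " ", text)
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(text.split())
```

Replacing punctuation with a space (rather than deleting it) is what turns "society's" into "society s", as in the excerpt above.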

A second perl script normalized the capitalization then split out and counted the words. I used the “%08d” construct in the “printf” statement to ensure I’d have a list sortable by usage when the script finished.

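#!/opt/local/bin/perl
use strict;
use warnings;

# Count how often each word appears across all title lines.
my %count_of;
while (my $line = <>) {
  $line =~ tr/A-Z/a-z/;                  # normalize capitalization
  foreach my $word (split ' ', $line) {  # split on whitespace, ignoring leading blanks
    $count_of{$word}++;
  }
}

# Zero-padded counts keep the output sortable as plain text.
for my $word (sort keys %count_of) {
  printf "%08d : %s\n", $count_of{$word}, $word;
}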

countwords.pl < 245tags.out > wordlist.txt [return]

Here’s an excerpt of the output:

00000002 : salinewater
00000001 : saling
00000001 : salingar
00000064 : salinger
00000002 : salinghi
00000002 : salinian
00000001 : salinisation
00000004 : salinities

Final step was to sort the 453,672 words/lines in this file by the number of occurrences:

sort < wordlist.txt > 245tags_sorted.txt [return]
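A quick illustration of why the zero padding matters for a plain text sort, sketched in Python with a few counts borrowed from this post. Without padding, the strings sort character by character, so a seven-digit count lands ahead of a one-digit count.

```python
counts = [64, 2, 1343112, 4]

# Plain text sort of unpadded numbers compares character by character.
unpadded = sorted(str(n) for n in counts)

# Padding every count to eight digits makes text order match numeric order.
padded = sorted(f"{n:08d}" for n in counts)

print(unpadded)  # ['1343112', '2', '4', '64']
print(padded)    # ['00000002', '00000004', '00000064', '01343112']
```

The padded list comes out in ascending order, so the most frequent words sit at the end of the sorted file.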

Voilà! I now know that these are the four most common words used in titles represented in our catalog:

the (1,343,112)
of (1,190,200)
and (918,245)
by (522,495)

then two outliers:

resource (450,200)
electronic (448,118)

and then back to prepositions and other unsurprising terms:

in (363,788)
a (346,909)
to (286,701)
on (252,793)
for (229,914)
edited (155,248)
states (126,221)
from (125,065)
with (124,319)
united (123,889)
committee (86,990)

Obviously, there’s not much of interest here and the point of the post is really to share the methodology and code snippets for anyone interested in running other experiments. However, I did find it odd that the words “electronic” and “resource” reached what I’d consider stopword status. Could we really be moving that close to the digital library I’ve been working toward for all these years?

Well, I’d like to think so but I’m guessing it’s the fact that not too long ago we loaded 306,000+ records from the Lexis-Nexis US Serial Set and that has skewed the frequency count for several terms. My sense is that a large load of these sorts of specialized records also has a negative effect on most users; that is, it helps build an ever-larger haystack for those seeking a needle of information that has nothing to do with that particular set of records.

Of course, that’s a problem we need to solve with better search tools, not by restricting the scope of our content.

Beyond developing a workflow that might yet yield an interesting outcome, there was one small spinoff benefit to this little wordcount experiment. Skimming the list of words that appeared only once across all titles, I was able to easily spot a number of misspellings.

For example, if you look at that little excerpt of my original word count file, you’ll see:

00000001 : salinisation

I checked the catalog and yes, it’s a misspelling (although the variant title and subject headings saved the “word anywhere” searcher on this one):

Title:             Salinisation of land and water resources : human causes, extent...
Variant Title:     Salinization of land and water resources
Primary Material:  Book
Subject(s):        Salinization --Control.
                   Salinization --Control --Case studies.
                   Soil salinization.
                   Water salinization.

Now I’m wondering if there’s a way I can use this tag-extraction work with a spell-checker to assist in some automated way with our neverending quest for perfect metadata.
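One rough way that might work, sketched in Python: take the words that occur exactly once and flag the ones missing from a reference word list. The function name `singleton_suspects` is hypothetical, and the reference list is an assumption (a system dictionary file, a spell-checker's word list, or similar). The input is expected in the "count : word" format of the word count file above.

```python
def singleton_suspects(wordlist_lines, known_words):
    """Return words seen exactly once that are not in known_words."""
    suspects = []
    for line in wordlist_lines:
        count, sep, word = line.partition(" : ")
        if not sep:
            continue  # skip lines that are not in "count : word" form
        word = word.strip()
        if int(count) == 1 and word not in known_words:
            suspects.append(word)
    return suspects
```

Proper names, foreign-language titles, and regional spellings would all show up as suspects, so the output is a list for a human to review, not a list of confirmed errors.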

Posted in Desktop Software, General, Library Tech

what they’re reading…

Extracted the titles from readings our faculty have placed in our e-reserves system this semester, removed terms like “Chapter” and “Ch.”, normalized case and fed the result into Wordle.
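The preparation step is simple enough to sketch in a few lines of Python. This is an illustration only, not the procedure actually used: the function name `prepare_for_wordle` is hypothetical, and the pattern covers just the two terms named above.

```python
import re

# Only the two terms named in the post; a real run would likely need more.
NOISE = re.compile(r"\bchapter\b|\bch\.", re.IGNORECASE)

def prepare_for_wordle(titles):
    """Strip the noise terms, lowercase, and return one block of text."""
    cleaned = []
    for title in titles:
        text = NOISE.sub(" ", title)
        cleaned.append(" ".join(text.lower().split()))
    return "\n".join(cleaned)
```

The resulting text can then be pasted into Wordle.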

Just curious, I performed the same operations on readings from Fall 2005:

I’m beginning to suspect it’s our social scientists who make greatest use of our e-reserves service.

Posted in Library Tech