Part of me is telling me it’s madness to even touch this topic, but I am going to have to let out a few niggling gut feelings about it that just won’t go away.
I used to advise would-be repository managers that best practice would be to include controlled subject vocabulary terms with the rest of the metadata for each repository object. But of course I was realistic enough to know that staffing and time constraints would make that a dead horse not worth flogging. Besides, in the present environment well-chosen keywords are all one really needs to get by.
Aside on keywords
But I did advise that best practice would also mean having someone to vet keywords provided by authors doing their own deposits. Even the keywords often listed on a published article or thesis should be vetted. Some author-selected keywords are scarcely adequate as finding aids for anyone but the few who know that particular author’s preferences. Social and uncontrolled tagging works best when the tags
- really do cover the content of the paper, and not just one or two aspects of it,
- and when they are “socialized” enough to be the words most meaningful to a relatively wide audience.
I am sure many editors/cataloguers working with repositories will know that not all authors can be relied on to provide the most useful keywords. A bit of professional tinkering sometimes goes a long way towards best practice.
Controlled vocabularies currently used in repositories
A few repositories do use controlled vocabularies like LCSH. But most of the ones I know of rely on classification codes that were designed for research reporting, principally for funding purposes. In Australia that is the RFCD code (Research Fields, Courses and Disciplines), one of the three codes under the ASRC umbrella. An easy summary of and introduction to these is on the Australian Research Council website.
But the first thing any cataloguer used to “real subject vocabularies” will notice is that these codes are not, and were never designed to be, descriptive subjects in the normal sense of the word. They were designed to be research reporting codes, exactly what they are called. Academics know them well enough, and it is quite right that they are entered with the publications and other papers deposited into academic institutional repositories.
I’m not sure if they are used very much for subject searching, but that doesn’t matter. Keywords can take care of that, and besides, the codes are still necessary data in a repository that contains a large measure of publicly funded research and related contributions.
Having these codes in repositories is also a good idea for another reason that I will come to at the end of this post.
But what of the real controlled subject vocabularies?
Controlled vocabularies and the future
If Web 3.0 is the future, then everything I read about it tells me that controlled vocabularies are being positioned for a potential that was inconceivable when they were first designed.
Web 3.0 is about the semantic web. A great, easy introduction to what this is all about is a SlideShare presentation by Freek Brijl titled Web 3.0 Explained with a Stamp. Brijl takes a question like:
I want all the red stamps, designed in Europe, but used in the U.S.A., between 1880 and 1990
and shows how semantic web methods (mainly RDF) can navigate the complexities of such a question to produce a better answer than is currently possible.
At present, with repositories in particular being flat content databases to be mined (federated or not), the first two words in the above question, “red stamps”, pose a problem that guarantees a lot of unwanted hits. We are sure to pick up
- green stamps from Khmer Cambodia,
- yellow stamps about the Red Sea,
- blue stamps with pictures of red dragons,
- and white stamps celebrating a Red Cross anniversary,
- plus, correctly, red stamps!
That’s just from navigating the first two words in our query.
But if that word “red”, meaning the colour, had a unique URI, and if the search query pressed the right button to light up that particular URI identifying the “red” we want, then we could immediately restrict our search to genuinely red stamps.
And this is where the controlled vocabularies come in again. LCSH is structured so that it is impossible to confuse “red” the colour per se with the political symbols, place names, etc.
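To make the difference concrete, here is a minimal sketch in Python of matching on a concept URI rather than on the bare string “red”. All the URIs and record titles are invented for illustration, not real identifiers:

```python
# Hypothetical concept URIs -- stand-ins for real controlled-vocabulary URIs.
RED_COLOUR = "http://example.org/concept/red-colour"
RED_SEA    = "http://example.org/concept/red-sea"
RED_CROSS  = "http://example.org/concept/red-cross"
STAMPS     = "http://example.org/concept/postage-stamps"

# Each record is tagged with the URIs of the concepts it is actually about.
records = [
    {"title": "Yellow stamps about the Red Sea",      "concepts": {STAMPS, RED_SEA}},
    {"title": "White stamps for a Red Cross jubilee", "concepts": {STAMPS, RED_CROSS}},
    {"title": "Red stamps of the 1880s",              "concepts": {STAMPS, RED_COLOUR}},
]

# Keyword search: every title containing the string "red" matches.
keyword_hits = [r["title"] for r in records if "red" in r["title"].lower()]
print(keyword_hits)  # all three records -- two of them unwanted

# Concept search: only records tagged with the URI for red-the-colour match.
concept_hits = [r["title"] for r in records
                if {STAMPS, RED_COLOUR} <= r["concepts"]]
print(concept_hits)  # ['Red stamps of the 1880s']
```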
And a Library of Congress report, “Response to On the Record: Report of the Library of Congress Working Group on the Future of Bibliographic Control” (dated 1 June 2008, although its pages sometimes refer to 2007 events in the future tense!), refers to work already done to bind each LCSH term to a unique URI. By itself, this doesn’t mean much. But it is part of a larger effort that lies at the heart of the semantic web, and of the ability for users to ask, and retrieve sensible results for, the question above about certain red stamps.
If Web 2.0 is about fancy linking of databases, Web 3.0 is about linking concepts. Forget about directly searching for the word “red” in all the databases. Send that search for “red” through the appropriate controlled vocabulary, via a broader concept, “colour”, that immediately filters out all the geographical and institutional concepts.
That can only be done through a controlled vocabulary structure.
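In code, that idea might look something like the sketch below: a toy vocabulary (not a real thesaurus) in which each concept records its broader concept, so the query keeps only the sense of “red” whose broader concept is “colour”:

```python
# A toy controlled vocabulary with invented entries. Each concept carries a
# label and a pointer to its broader concept, as a real thesaurus would.
vocabulary = {
    "red-colour": {"label": "Red",       "broader": "colour"},
    "red-sea":    {"label": "Red Sea",   "broader": "geographical-name"},
    "red-cross":  {"label": "Red Cross", "broader": "organization"},
}

def resolve(label, wanted_broader):
    """Return the ids of concepts whose label matches the search word AND
    whose broader concept is the one we mean -- here, 'colour'."""
    return [cid for cid, c in vocabulary.items()
            if label.lower() in c["label"].lower()
            and c["broader"] == wanted_broader]

print(resolve("red", "colour"))  # ['red-colour'] -- places and organizations filtered out
```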
And work has been under way on exactly this, with the Simple Knowledge Organization System, known as SKOS.
The Library of Congress has been working with SKOS to enable its controlled subject vocabulary to be built in to Web 3.0, the semantic web. SKOS is about linking the concepts in controlled vocabularies (not only LCSH) in order to enrich the web search experience through RDF. See the SKOS website for further details. In the nuttiest of shells, however, RDF “simply” means linking up something on the web, via its unique URI, with something else on the web, via its unique URI, through a specific “verb”, like “is a subproperty of” or “is associated with” or “is a result of”, etc.
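Stripped to the bone, that triple model can be sketched in a few lines of Python. The URIs below are invented, and the home-made “verbs” stand in for real RDF and SKOS properties such as skos:broader, skos:related and rdfs:subPropertyOf:

```python
# RDF in the nuttiest of shells: everything is a (subject, verb, object)
# statement, and each slot here is a URI. All URIs below are made up.
EX = "http://example.org/"

triples = [
    (EX + "concept/red-colour", EX + "verb/broader",   EX + "concept/colour"),
    (EX + "concept/red-colour", EX + "verb/relatedTo", EX + "concept/red-stamps"),
    (EX + "concept/colour",     EX + "verb/broader",   EX + "concept/visual-property"),
]

def objects_of(subject, verb):
    """Follow one 'verb' from a subject to whatever it points at."""
    return [o for s, v, o in triples if s == subject and v == verb]

print(objects_of(EX + "concept/red-colour", EX + "verb/broader"))
# ['http://example.org/concept/colour']
```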
The Library of Congress appears to be working towards migrating its controlled subject vocabulary into that future. See particularly recommendations 3 and 4 in the LC Response report above. See also the blurb and slide presentation by Alistair Miles.
So what about the here and now?
Web 3.0 is still a few years away. And many repository managers don’t have the resources to build into their workflows the time to add full controlled subject vocabularies for each deposit. Just monitoring keywords for best practice is onerous enough for most.
One: I’d be interested in doing a study on cost-benefit ratios for adding controlled vocabularies to repository workflows:
- Some libraries are entering certain repository deposits into their library catalogues anyway, with controlled subject vocabularies added for discovery purposes. I know at least one library that enters all major theses into its catalogue and sets up a link to the full text that happens to be housed in its repository . . .
- And rising costs are reducing the number of hard copies of both journals and books now purchased . . .
- and suppliers are providing their own raw catalogue data with each product sold to the library . . .
- . . . so is there possibly more time for cataloguers to work on repository records?
Two: But perhaps more realistic in many scenarios: so long as repositories house some sort of controlled vocabulary, even one designed not for subject discovery but for research reporting for funding purposes, they are probably in a position to sync into Web 3.0 and the semantic web when it comes to them. It would “only” be a matter of someone’s time to crosswalk the RFCD codes referred to above to LCSH options, coupled at that time with their unique URIs. All those “not elsewhere classified” entries in the RFCD list will mean a bit more work for those in the trenches. But that’s not as bad as each cataloguer having to transfer the whole lot one by one. Anyway, before then, maybe Australian Standards will have seen the wisdom of preparing for Web 3.0 and have begun their own tasks necessary for this adaptation.
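To show how small the mechanical part of that crosswalk could be, here is a sketch of the idea in Python. The RFCD codes, LCSH heading and URI below are purely illustrative, not a real mapping:

```python
# A one-off crosswalk table from RFCD codes to LCSH headings plus their
# unique URIs. All entries are made up for illustration.
crosswalk = {
    "280100": {"lcsh": "Information storage and retrieval systems",
               "uri": "http://example.org/lcsh/info-storage-retrieval"},
    "289999": None,  # a "not elsewhere classified" code: needs a human
}

def lcsh_for(rfcd_code):
    """Look up the LCSH heading and URI for an RFCD code; flag codes that
    have no clean mapping (the bit of work left for those in the trenches)."""
    entry = crosswalk.get(rfcd_code)
    if entry is None:
        return "REVIEW: map by hand"
    return (entry["lcsh"], entry["uri"])

print(lcsh_for("280100"))  # ('Information storage and retrieval systems', '...')
print(lcsh_for("289999"))  # 'REVIEW: map by hand'
```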