Continuing here to share my “education in metadata”. Since I’m still discussing the early months, the post may also be of interest to other people beginning their metadata journeys . . . .
An easy MARC intro for a cataloguer into metadata
One good thing about having a library cataloguing background, when it came to investigating repositories and seeing how they might be able to talk to each other, was knowing the blunter as well as some of the finer points of MARC. Some have argued that MARC is not really a metadata schema itself but a format for encoding metadata. That is one reason it works particularly well as a crosswalk for carrying discrete bits of data from one schema to another.
I had cut my teeth with an EPrints repository, but when it finally osmosed into me that I had to mind my big P’s and my little p’s (learning that EPrints with a big P was a trademark for a particular software package, and eprints with a little p was as often as not used as a generic term for electronic print collections, except when different users said that was only half true), it also became apparent that no institution could assume that once it chose a particular repository it would always have that particular brand of repository.
Currently the different configurations and software bases of repositories mean choosing one is like choosing a car: each make of car has some things one likes but not all, etcetera. And once one make and model is selected for one’s needs now, in a few years one’s circumstances are likely going to mean a different car would be more appropriate, or even necessary.
So given the current state of repository software, one needs to be sure that if one stores all one’s digital resources in one particular make of repository now, one will be able, with little effort or complication, to transfer that data to another repository in the future.
I have since learned that some promoters of proprietary repositories promise their repositories will do this for clients, but it’s a topic that’s quickly glossed over, and a little digging shows the promise may not be all it seems. Yes, the data may all be put in a neat package and given back to the library when it ceases to use a particular repository, but in at least one case I know of it will be stripped of its configurations and be a complete mess. The soft boiled egg will be returned scrambled, leaving repository managers to find the staff, workflows and money to reconstruct the original soft boiled egg.
So one has to expect that one’s institution will one day, or just might one day, want to migrate data from one repository to another. And MARC is one handy tool, an old well-known favourite of librarians, that is complex and versatile enough to carry out most migration tasks. By mapping data in a repository to MARC one can then map from that MARC record to the new data format of the new repository.
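To make the two-step idea concrete, here is a minimal sketch of a crosswalk that goes from one repository's fields to MARC and then from MARC to a second repository's fields. The field names on both sides (`date_issued`, `dc_title` and so on) are invented for illustration, and the MARC record is reduced to simple (tag, subfield, value) triples rather than a real MARC file.

```python
# Hypothetical source-repository field names mapped to MARC tag/subfield pairs.
SOURCE_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("100", "a"),  # simplification: further authors would go in 700
    "date_issued": ("260", "c"),
    "keyword": ("650", "a"),
}

# Hypothetical target-repository field names, keyed by MARC tag/subfield.
MARC_TO_TARGET = {
    ("245", "a"): "dc_title",
    ("100", "a"): "dc_creator",
    ("260", "c"): "dc_date",
    ("650", "a"): "dc_subject",
}


def to_marc(source_record):
    """Turn a source record into a list of (tag, subfield, value) triples."""
    fields = []
    for name, values in source_record.items():
        if name not in SOURCE_TO_MARC:
            continue  # unmapped data is dropped here; a real migration should log it
        tag, code = SOURCE_TO_MARC[name]
        for value in values:
            fields.append((tag, code, value))
    return fields


def from_marc(marc_fields):
    """Turn (tag, subfield, value) triples into a target-repository record."""
    target = {}
    for tag, code, value in marc_fields:
        name = MARC_TO_TARGET.get((tag, code))
        if name is not None:
            target.setdefault(name, []).append(value)
    return target


def migrate(source_record):
    """Source repository -> MARC -> target repository."""
    return from_marc(to_marc(source_record))


example = {
    "title": ["An example thesis"],
    "creator": ["Citizen, Jane"],
    "keyword": ["Metadata", "Repositories"],
    "local_note": ["has no MARC mapping in this sketch"],
}
migrated = migrate(example)
```

The `local_note` field shows where granularity matters: anything without a mapping on both sides of the MARC half-way point does not arrive in the new repository.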
That is one solution. Librarians know what MARC is capable of conveying as discrete units of data. And thus how granular the migration process can be from one repository to another.
Moving beyond MARC
But there is more. The demands being placed upon repositories are leading them into data configurations that have no traditional counterpart in the normal library cataloguing of resources.
For example: it may be desirable, eventually obligatory, for a repository to differentiate preprints from postprints; articles submitted for publication, articles published, and articles written but not published or submitted; an author’s (different?) affiliations with different publications; whether a particular article or paper, as opposed to a journal or other collation, is peer-reviewed or not; and more.
MARC is good: it can and will continue to do most things. But it will not be able to do everything, at least not easily. One other possibility in certain circumstances is METS, a container for carrying a number of different schema-clumps of data. MARC cannot easily link multiple affiliations to respective authors. But the MODS schema can. And a MODS record, a MARC record, or both can be carried by a METS package. But not all repositories will support the embedded structure carried by MODS. At least not yet. Plan for change.
But MARC is still going to be good for a while yet for one of the most vital day-to-day purposes of repositories: Open Archives Initiative (OAI) harvesting. That is, migrating the data that is essential for one’s resources being discovered on the internet.
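As a rough illustration of how MODS ties each affiliation to its own author, and of how METS can carry that MODS record, here is a small sketch using Python's standard XML library. The names, affiliations and the `dmd1` ID are made up, and the output is only a fragment: a complete METS document needs further sections (such as a structural map) that are left out here.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("mods", MODS_NS)


def build_mets(title, authors):
    """Wrap a small MODS record in a METS descriptive metadata section.

    authors is a list of (name, affiliation) pairs.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets")
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": "dmd1"})
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")

    mods = ET.SubElement(xml_data, f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title

    # Each author gets its own <name> element, so the affiliation sits
    # inside the same element as the name it belongs to.
    for name, affiliation in authors:
        name_el = ET.SubElement(mods, f"{{{MODS_NS}}}name", {"type": "personal"})
        ET.SubElement(name_el, f"{{{MODS_NS}}}namePart").text = name
        ET.SubElement(name_el, f"{{{MODS_NS}}}affiliation").text = affiliation
    return mets


package = build_mets(
    "An example article",
    [("Citizen, Jane", "University A"), ("Doe, John", "Institute B")],
)
ns = {"mods": MODS_NS}
pairs = [
    (n.find("mods:namePart", ns).text, n.find("mods:affiliation", ns).text)
    for n in package.iter(f"{{{MODS_NS}}}name")
]
print(ET.tostring(package, encoding="unicode"))
```

The nesting is the point: because the affiliation lives inside the author's own element, there is no ambiguity about who belongs where.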
Grappling with standards clashes and confusions over harvesting
Part of our repository evaluation process in RUBRIC was to compare the repositories’ OAI harvesting functionality. The easiest way to do this was to migrate data from one repository to another with MARC. Only a few data elements (title, author, date, subject, etc.) are used for OAI harvesting, and these are easily handled by MARC.
MARC was not the end product as it usually is in a catalogue record. For our repository purposes it was only the half-way house between data destinations. The final destination for harvesting is the Simple Dublin Core schema. Its elements are basic. Cataloguers used to MARC will first wonder how on earth such a blunt instrument can be useful for anything. But recall that library searchers do mostly search by title, author, subject, date, anyway. The more complex data is not lost. It can be kept in reserve (not in Simple Dublin Core but some other schema) till a particular resource is discovered. And then it can go into action and show its stuff as required.
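A sketch of that last hop, from MARC to Simple Dublin Core, might look like the following. The mapping table is an illustrative choice loosely in the spirit of common MARC-to-Dublin-Core crosswalks, not a copy of any official one, and the MARC record is again simplified to (tag, subfield, value) triples with an invented URL.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)

# Illustrative mapping only: a real crosswalk covers many more tags.
MARC_TO_DC = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("700", "a"): "contributor",
    ("650", "a"): "subject",
    ("260", "c"): "date",
    ("856", "u"): "identifier",
}


def marc_to_simple_dc(marc_fields):
    """Build an oai_dc:dc element from (tag, subfield, value) triples."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for tag, code, value in marc_fields:
        element = MARC_TO_DC.get((tag, code))
        if element is None:
            continue  # richer MARC data has no home in Simple Dublin Core
        ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value
    return root


sample_fields = [
    ("245", "a", "An example thesis"),
    ("100", "a", "Citizen, Jane"),
    ("650", "a", "Metadata"),
    ("260", "c", "2007"),
    ("300", "a", "xii, 240 leaves"),  # dropped: not in the mapping
    ("856", "u", "https://repository.example.edu/123/"),
]
dc_record = marc_to_simple_dc(sample_fields)
print(ET.tostring(dc_record, encoding="unicode"))
```

The physical description in field 300 simply falls away, which is the bluntness a cataloguer notices first; the fuller record stays behind in the repository.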
This (OAI harvesting) was one of the hardest aspects of repository functions to come to terms with. Contradictions abounded. The Dublin Core standard insisted that the DC term “identifier” must only refer to the digital resource itself and not to the metadata record describing that resource. So why did EPrints use the DC term “identifier” to point to the metadata record instead? I sent off an email to EPrints to ask. They were as helpful as they could be, but we were all talking from different perspectives and not immediately understanding what each other knew/didn’t know/needed to know. I think I spoke to a technician who knew that by using “identifier” for the metadata page EPrints worked. Blank! when I tried to raise the DCMI stipulation. “Will look into it for next version.”
It was not easy for a newbie like myself to immediately discover that although OAI-PMH harvesting used the DC schema, this OAI protocol had to make its own rules in how to use that DC schema to make harvesting work. I asked harvesters for an explanation also. For some reason I interpreted their responses as something like an ad hoc “making do” with how things worked.
The problem was not with the DC schema. It was with the repository institutions. The institutions did not, from the pureness of their hearts, simply want users having direct access to their resources. Reputations, and the necessary finances and career advancements that reputations attracted, were part of the game. Users had to be directed first to the institution’s repository, with full professional header branding, as the gateway to the resources. The user needed to be taken first to the metadata page — where the branding decorating the details of the resource hit them in the eyes.
Repositories could not be sold in universities otherwise. So the OAI-PMH harvesting had to break the DCMI regulation for the use of the “identifier” element.
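For anyone who has not seen a harvest up close, the sketch below builds an OAI-PMH ListRecords request and pulls the identifiers out of a response. The repository address and the sample response are invented for illustration; nothing is fetched over the network. Note that the record carries two different identifiers: the protocol's own one in the header, and the Dublin Core one, which in this made-up record points at the repository's metadata page rather than at the file itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request for the given OAI-PMH base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def identifiers(response_xml):
    """Return (OAI header identifier, [dc:identifier values]) per record."""
    root = ET.fromstring(response_xml)
    results = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        header_id = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        dc_ids = [el.text for el in record.iter(f"{{{DC_NS}}}identifier")]
        results.append((header_id, dc_ids))
    return results


# Invented, heavily trimmed response for illustration only.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu:123</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis</dc:title>
          <dc:identifier>https://repository.example.edu/123/</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

url = list_records_url("https://repository.example.edu/oai")
found = identifiers(SAMPLE_RESPONSE)
```

A harvester that follows the `dc:identifier` here lands the user on the branded metadata page first, which is exactly the behaviour the institutions wanted.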
I wish I had understood that from the very beginning! As it was, this basic piece of information only came to me when it came time for the RUBRIC tech team to test the nitty gritty of OAI harvesting with VTLS’s VITAL repository.
There are several other issues that arise from the simplicity of the simple DC for harvesting, yet to be discussed.
But I can see a quite different set of issues relating to interoperability on the horizon. OAI-PMH and DC will still be basic tools, especially if/when URIs can be assigned to each piece of data, one for an author name, another for an affiliation, etc.
One can begin to see the importance of choosing a repository with an established and proficient support base, and of having a planning agenda to sustain whatever is chosen until the next choice has to be made, whether to upgrade or change platforms completely.