Metadata | Metalogger

April 9, 2010

Never throw good old stuff away – you never know when you will need it again

Filed under: Dublin Core,MARC,Metadata — Neil Godfrey @ 8:58 pm

So after having migrated the bare essentials of MARC records to Dublin Core, in particular for digital resources, it is now seen as a good idea if end-users could browse by broad subject categories. Hang on, wasn’t that what Melvil Dewey thought, too? What did we do with all those Dewey numbers? Uh oh . . . . . woops!

Image by Klara Kim via Flickr

(http://www.flickr.com/photos/klara/2925416747/sizes/l/)

Comments Off

June 12, 2009

Maintaining the distinction between metadata and content files : datasets

Filed under: Metadata — Neil Godfrey @ 9:00 pm
Tags: CDWA, cultural heritage, dataset, datasets, EAD, ISAD, ISAD(G), MODS

Case study: an archival body responsible for cultural preservation

Case study: An archival body has collated text documents, photographs, plans, and more to assist with the preservation of national heritage buildings and monuments. They would now like to have these digitized and stored in a library and generally made publicly accessible.

This gives rise to some interesting conceptual nuances in figuring out the best way to structure the data in a library (repository) record.

Textual and pictorial metadata

The archiving agency has a text form that they fill in to accompany the other photographs, volumes, plans etc. But this text form once filled in is not metadata about those accompanying photographs, volumes, plans, etc. The text form contains information about the history, ownership, occupancy, style, architectural and structural details, gazette info, original architects, builders, etc etc about the monument or building.

So all the data that is held by the archiving agency — text forms, photographs, etc — is data about the real building or monument “out there”, in the street somewhere.

In fact, what the archivists have is a combination of textual and pictorial metadata about real buildings.

The text information and the digitized volumes and photographs all describe the real monuments, either textually or pictorially. The “intellectual content” or “work” for preservation is the real building itself.

The real target content

The digitized graphic data and text data about the monuments are datasets. These datasets are sets of data about the real monuments. Although they are archival materials, they are not archival content in the same sense that an historic treaty or letter or diary are archival content in their own right. The archival agency here uses their data (both text and graphic) to inform them about the real monuments “out there”.

Compare a history book. A history book can be about events that happened in the real world, but the history book itself is the primary targeted content of a library repository. Users read it as intellectual content in its own right, and to engage with the ideas of the historian. This is not the same as with the above content used for the preservation of cultural heritage buildings. In this case, the archived content can be said to be metadata informing users about, and offering tools for, the intellectual content of the cultural heritage in the real world.

The archived photographs, for example, are a form of metadata, that is pictorial metadata, about real monuments. These photographs are not the same as artistic or historic photographs that one might archive for their archival value in their own right. The archival photographs need to be archived or preserved as vital metadata about the real monuments – the real content of interest to which the photographic metadata points.

Schema considered for handling this kind of dataset:

ISAD(G): General International Standard Archival Description and EAD: Encoded Archival Description

These schema appear at first glance to be the obvious choices for encoding archival data. The EAD has been developed from ISAD. But caution is to be advised:

The structure of these schema is hierarchical. That is, it is built around Fonds and Series. This is perfectly fine for nondigital document filing. And it is also fine for digital storage systems that are prepared to “sit as is” in perpetuity. For local display purposes, here and now and the foreseeable future with minimal technological changes, this will probably work very well.
But hierarchical structures in metadata are fraught with difficulties. Hierarchical structures are rarely sustainable across different software platforms. They are also difficult to crosswalk with other metadata schema. (The EAD website does offer a crosswalk to MARC 21, but some of the MARC 21 fields used (e.g. 351 – for hierarchical information) are not common and are not easily mappable to other schema such as MODS or Dublin Core.) For this reason alone EAD is not an ideal schema for long term preservation and interoperability purposes.
Hierarchical structures also work against OAI searching. The context of any hits can be lost if that information is at a different hierarchical level, with the result that the hit can appear to be either meaningless or misleading via an OAI or DC search.
For the above sorts of reasons, the trend in OAI and the digital library/repository communities is towards nonhierarchical, flat protocols. EAD may be fine for local usage and search and display purposes, but it has the limitation of being left behind in a silo, while our intent is to be sharing data as widely as possible according to our L2010 vision.
ISAD and EAD are designed for addressing collections where the primary content for direct preservation (archiving) are documents such as historical papers and images etc. As explained above, this is not the case with the PMB data. ISAD and EAD are not designed for datasets. Nor are they designed to meet the requirements of OAI protocol searches.

VRA: Visual Resources Association and TEI: Text Encoding Initiative

VRA also suffers from some of the limitations of EAD:

VRA is designed for addressing collections where the primary content for direct preservation (archiving) are documents such as historical, cultural or artistic images etc. As explained above, this is not the case with the PMB data. VRA is not designed for datasets.
The properties supported by VRA are too limited to support the range of potential properties required by PMB.

TEI is designed for manuscripts and large volumes of text.

CDWA: Categories for the Description of Works of Art, and CDWA Lite

CDWA covers the many of the properties required for descriptive (and some preservation) metadata. It meets the basic descriptive requirements of PMB for their data. However, it also has limitations:

I have doubts about how widely supported it would be by software platforms and search and display interfaces, at least without much local IT tweaking. And if systems could be massaged to support it now, there will remain the questions of future sustainability as software and hardware requirements and practices change.
CDWA Lite looks easier to use than the full CDWA, but unfortunately the reason it is “Lite” is because most of the properties that pertain to buildings and real-life monuments have been removed. It is more useful for museum and gallery artworks.

MODS: Metadata Object Description Schema

MODS has the advantage of being able to express the CDWA and other properties specific to PMB requirements in a basically flat structure. It is also highly interoperable with other schema, including Dublin Core. It is an ideal base for creating OAI compliant DC for basic search. It is also very widely recognized and supported by the Library of Congress, and known well enough to be read and understood by most data service providers.

A major advantage of using MODS is that it will be the simplest way of leaving the door open for OAI-ORE as well as OAI-PMH searches and retrievals of data. If the controlled vocabularies used in the archival records and MODS properties are eventually compliant with KOS and RDF models, then semantic web interrogation of the database has the potential to explore and discover relationships and knowledge otherwise lost from view.

Draft MODS-DC schema for data about cultural monuments, not about the monuments themselves

Draft of MODS-DC schema for data about cultural heritage monuments, not about the monuments themselves: http://tinyurl.com/lcuhyy

Comments (2)

March 25, 2009

MACAR focus and other possibilities?

Filed under: Metadata — Neil Godfrey @ 2:09 pm

Since moving to the National and Public library scene the extent to which MACAR‘s focus has been on the sorts of resources housed in academic repositories has become very apparent. MACAR included non-academic library representatives but they were outnumbered by those from the academic scene. Now I have joined the National Library Board of Singapore (NLB) I have had to face up to just how limited in focus MACAR’s past work has been. On the other hand, maybe progress best kicks off when it is made a small step at a time.

At the NLB a preliminary list of resource types for the national and public library sectors has been prepared. It contains over 50 resource type terms. Some of those will almost certainly be rationalized, but it is clear a list of resource types in a national library is longer than the current MACAR resource type list. And such a list is only the beginning of working towards best practice OAI (ORE?) metadata applications.

With some of the MACAR members now working with ANDS, and MACAR coming under the umbrella of CAUL, one might expect MACAR to be entrenched further with research and academic library requirements. I wonder if there are any realistic chances of national libraries cooperating with each other and even with academic libraries, and if there are real benefits to be gained by their doing so. Since moving into the national and public library sector I think there are real potentials for users if it could happen. To expose and share the specialist heritage and wider cultural and digitized newspaper collections that such libraries house, especially alongside libraries dedicated to research and education, can only be a Good Thing.

Comments Off

October 17, 2008

The Politics of Research: another reason Research codes (RFCD / FOR. . .) are NOT a Subject substitute

Filed under: Metadata,Repositories — Neil Godfrey @ 1:04 pm

There are many reasons many university institutional research repositories use reporting codes (in Australia RFCD and FOR) as a convenient substitute for a subject search index. “Out of the box” technical configurations of repository software, costs of adding new configurations and adjusting existing portals to accommodate them, status quo situations with harvesters and costs of entering and maintaining controlled vocabularies are the most obvious ones.

Today I had my first collective meeting with library, legal, R&D and academic reps at our university as a first step towards garnering broader institutional support for a research repository. While we were discussing some details of the data required for deposit into the repository one of the academics gave me a thorough lesson in what those research codes meant in his circle.

A project on forest ecology could fit under research codes for forestry or ecology. In deciding which one to use for a grant application, it will be natural to consider which one is the more likely to lead to the better financial support. Foresters would obviously be more sympathetic to a forest ecology project that offered improved timber yields, and ecologists more favourable to one reducing logging quotas.

Technically librarians or repository editors can, of course, simply add more codes to cover all bases, but by doing that they will be compromising any preservation and authentication functions the repository might have. The fact that a research paper was reported and funded as a “forestry” project, and not as an “ecology” one, may well be considered a vital part of the record.

While it has been easy to think of research codes as a “good enough” substitute for subject indexes in repositories, this particular academic mentor left no room for ambiguity:

I can’t imagine anyone using research codes to search for subjects!

Comments Off

June 28, 2008

Controlled subject vocabularies in repositories – no real problem(?)

Filed under: Metadata,Repositories — Neil Godfrey @ 7:42 pm
Tags: keywords, LCSH, SKOS

Part of me is telling me it’s madness to even touch this topic but I am going to have to let out a few niggling gut feelings about it that just won’t go away.

I used to advise would-be repository managers that best-practice would be to include with each repository object a controlled subject vocabulary with the rest of the metadata. But of course I was realistic enough to know that staffing and time constraints would make that a dead horse not worth the effort flogging. Besides, in the present environment well chosen keywords are all one really needs to get by.

Aside on keywords

But I did advise that best practice would also mean that there should be someone to vet keywords provided by authors doing their own deposits. Even the keywords often listed on a published article or thesis should be vetted. Some of the keywords selected by authors were scarcely adequate as finding aids for all but a few who knew that particular author’s preferences. Social and uncontrolled tagging works best when the tags

really do cover the content of the paper, and not just one or two aspects of it,
and when they are “socialized” enough to be the words most meaningful to a relatively wide audience.

I am sure many editors/cataloguers working with repositories will know that not all authors can be relied on to provide the most useful keywords. A bit of professional tinkering is sometimes good for best-practice.

Controlled vocabularies currently used in repositories

A few repositories do use controlled vocabularies like LCSH. But most of the ones I know of rely on classification codes that were designed for research reporting principally for funding purposes. In Australia that is the RFCD code (Research Fields, Courses and Disciplines), one of the 3 codes under the ASRC umbrella. An easy summary of or intro to these is on the Australian Research Council website.

But the first thing any cataloguer used to “real subject vocabularies” is that these codes are not, and were never designed to be, descriptive subjects in the normal sense of the word. They were designed to be research reporting codes, exactly what they are called. Academics know them well enough, and it is quite right that they are entered with the publications and other papers that are entered into the academic institutional repositories.

I’m not sure if they are used very much for subject searching, but that doesn’t matter. Keywords can take care of that, and besides, the codes are still necessary data in a repository that contains a large measure of publicly funded research and related contributions.

Having these codes in repositories is also a good idea for another reason that I will come to at the end of this post.

But what of the real controlled subject vocabularies?

Controlled vocabularies and the future

If Web 3.0 is the future then everything I read about it tells me that controlled vocabularies are being positioned for a potential inconceivable when they were first designed.

Web 3.0 is about the semantic web. A great easy intro to explaining what this is all about is a slideshare presentation by Freek Brijl titled Web 3.0 Explained with a Stamp. Brijl takes a question like:

I want all the red stamps, designed in Europe, but used in the U.S.A., between 1880 and 1990

and shows how with semantic web methods (mainly RDF) can navigate the complexities of such a question to produce a better answer than currently possible.

At present, with the repositories in particular being a flat content database to be mined (federated or not), the first two words in the above question, “red stamps”, will pose a problem that guarantees we will get a lot of unwanted hits. We are sure to pick up

green stamps from Khmer Cambodia,
yellow stamps about the Red Sea,
blue stamps with pictures of red dragons,
and white stamps celebrating a Red Cross anniversary,
plus, correctly, red stamps!

That’s just from navigating the first two words in our query.

But if that word “red” meaning the colour, had a unique URI identifier, and if the search query pressed the right button to light up that particular URI identifying the “red” we want, then we could immediately have really restricted our search to “red stamps”.

And this is where the controlled vocabularies come in again. The LCSH is so structured that it is impossible to confuse the “red” colour per se with the political symbols, place names, etc.

And a Library of Congress report, “Response to On the Record: Report of the Library of Congress Working Group on the Future of Bibliographic Control” (dated 1 June 2008 — although its pages sometimes refer to 2007 events in the future tense!), refers to the work that has been done that binds each LCSH term to a unique URI. By itself, this doesn’t mean much. But it’s part of a larger work that lies at the heart of the semantic web, and the ability for users to ask and retrieve sensible results for the question above about certain red stamps.

If Web 2.0 is about fancy linking of databases, Web 3.0 is about linking concepts. Forget about directly searching for the words “red” in all the databases. Send that search for “red” through the appropriate controlled vocabulary, through a broader concept, “colour”, that immediately filtered out all the geographical and institutional concepts.

That can only be done through a controlled vocabulary structure.

And the work has been underway with a simple knowledge organization system known as SKOS.

The Library of Congress has been working with SKOS to enable its controlled subject vocabulary to be built in to Web 3.0 — the semantic web. SKOS is about linking concepts, controlled vocabularies (not only LCSH), in order to enrich a web search experience through RDF. See the SKOS website for further details. In the nuttiest of shells, however, RDF “simply” means linking up something on the web via its unique URI identifier with something else on the by its unique URI identifier, through a specific “verb” command, like, “is a subproperty of” or “is associated with” or “is a result of”, etc.

Library of Congress appears to be working towards migrating their controlled subject vocabulary into the future. See particularly recommendations 3 and 4 in the LC Response report above. See also the blurb and slide presentation by Alistair Miles.

So what about the here and now?

Web 3.0 is still a few years away. And many repository managers don’t have the resources to build into their workflows the time to add full controlled subject vocabularies for each deposit. Just monitoring keywords for best practice databases is onerous enough for most.

Two responses:

One: I’d be interested in doing a study on cost-benefit ratios for adding controlled vocabularies to repository workflows:

Some libraries are entering certain repository deposits into their library catalogues anyway. With controlled subject vocabularies added for discovery purposes. I know at least one library that enters all major theses into their catalogue and sets up a link to the full text that happens to be housed in their repository. . . .
And costs are reducing the number of hard copies of both journals and books now purchased, , , ,
and suppliers are providing their own raw catalogue data with each product sold to the library, , , ,
. . . So is there possibly more time for cataloguers to work on repository records?

Two: But perhaps more realistic in many scenarios, so long as repositories house some sort of controlled vocabulary, even if it is not one designed for subject discovery, but for research reporting for funding purposes, then they are probably in a position to sync in to Web 3.0 and the semantic web when it comes to them. It would “only” be a matter of someone’s time to crosswalk the RFCD codes referred to above to LCSH options, coupled at that time with their unique URI’s. All those “not elsewhere classified” entries in the RFCD list will mean a bit more work for those in the trenches. But that’s not as bad as each cataloguer having to transfer the whole lot one by one. Anyway, before then, maybe Australian Standards will have seen the wisdom of preparing for the Web 3 and have begun their own tasks necessary for this adaptation.

Comments (2)

June 3, 2008

4 articles: state of the art, standards, harmonization and (research data) reuse issues

Filed under: Metadata — Neil Godfrey @ 12:54 pm

Catching up with some long-overdue reading and have found 4 articles of particular interest for their informative descriptions of some current issues in metadata.

State of the Art and Future Directions

An April 2008 JISC document about Metadata for digital libraries by Richard Gartner

Standards

Anyone who is a regular Ariadne reader will already know about an updating on what is happening in the various standards worlds:

Sarah Currier’s overview of current initiatives in standards for educational metadata:

Metadata for Learning Resources: An Update on Standards Activity for 2008

From the article:

This article does not aim to be a critical look at such initiatives; it simply aims to give readers with an interest in educational metadata a snapshot of the current landscape, both for easy reference, and to encourage wider participation in these standards activities. The major areas of development covered in this article are:

LOM Next: plans for the next version of the IEEE LOM.

The Joint DCMI/IEEE LTSC (Learning Technology Standards Committee) Taskforce: bringing together the two major metadata standards used for learning resources, and providing an RDF translation for the LOM.

DC-Education Application Profile (DC-Ed AP): a modular application profile purely looking at educational aspects of resources, based on community requirements.

The United Kingdom’s Joint Information Systems Committee Learning Materials Application Profile (JISC LMAP) scoping study : working alongside a number of similar projects looking at application profiles for repositories in other areas, e.g. images.

International Standards Organisation Metadata for Learning Resources (ISO MLR): based primarily in Canada, this international standards body is devising a new international standard for educational metadata, in response to perceived limitations of the IEEE LOM.

The European Commission’s PROLEARN Harmonisation of Metadata project: a study into the issues and challenges of achieving harmonisation in metadata, given the heterogeneous landscape

Harmonization (including metadata reuse issues across different abstract models)

The second is Harmonization of Metadata Standards (January 21, 2008 ) edited by Mikael Nilsson:

Contents:

1. INTRODUCTION
2. THE NOTION OF METADATA
3. METADATA STANDARDS
4. HARMONIZATION
4.1 Abstract Model standards
4.2 Vocabulary standards
4.2.1 Element vocabularies
4.2.2 Value vocabularies
4.3 Syntax standards
4.4 Application profiles
5. CHALLENGES FOR HARMONIZATION
5.1 Application Profiles in DC and LOM
5.2 Identifying and reusing elements
5.3 Requirements for application profiles
5.4 Summary of obstacles
6. ADDRESSING THE HARMONIZATION ISSUES

Reuse of database/resources for research

Another looking at resource reuse issues in research and database repositories is a chapter titled Ontology Reuse Metadata Model from a 2007 digital dissertation by Elena Pâslaru-Bontaş, A Contextual Approach to Ontology Reuse.

Comments (2)

June 2, 2008

AACR2 “Statement of Responsibility” vs DC Creator

Filed under: Dublin Core,Metadata — Neil Godfrey @ 4:25 pm

As librarians the “statement of responsibility” has always worked beautifully as both the source for access points and as a description of the resource. And authority files have been the logical solution to name variations.

The Dublin Core property, “creator”, is not so easily managed, as anyone who begins working with repositories soon discovers. Author names and affiliations are problematic: How to handle multiple affiliations and periodically changing affiliations? How to store or link to authors’ personal and institutional webpages, blogs and wikis, email addresses, name variations and other contact details?

Attempting to think in terms of “statement of responsibility”, even simply trying to quantitatively expand the AACR2 concept, soon breaks down in the IR environment.

The reason is that the IR world is attempting to deal with the creator as two distinct concepts that can never be equivalent.

On the one hand, there is the creator as a name on a published document. The name there is “a statement of responsibility”, and it is also a description of the document. “This book is one that is written by A.B.”

But on the other hand, we want to treat the creator as a person who sits in an office at a computer and who has a work history and number of ongoing professional activities and access points. In this case, the creator can never be “a statement of responsibility”, apart from some joking remarks I am sure can be found there.

So what is the relationship then between the two concepts of “creator”?

The former is an expression of the latter. (But I am not not not using the term “expression” as applied in the FRBR model.) The name on a title page of a document is an expression of the person in the office. Each new publication of that person in the office will carry another title-page expression of that personal creator.

Sometimes a personal creator will use a different form of their name; sometimes they will write with a different affiliation; sometimes they will be a leading or contact author among many authors, other times they will be a sole or minor author.

So a single real person creator can and often does have multiple creator expressions. It might seem like a nit picky splitting of hairs, since our intuitions can tell us when me mean which is which without any fuss. But computers can’t be programmed with intuition yet, so nitpicky breakdowns are necessary to explain it to machines to they can understand it and do what we want them to do for us.

AACR2 only had to worry about the expressions of the creator and call them “the statement of responsibility”. It is only trying to manage collections of paper or audio-visual and even some digital resources but not authors in offices.

If AACR2 can be said to be for 2 dimensional resource collections, maybe we are now looking towards virtual 3-D collections. At least when the organizational and financial props come into place. Because maybe what the above implies is the need for a community recognized access point of not “document” or “image” or “dataset” resources etc, but of “creator” resources, or simply a database of, or access point for, creators.

(And org and $$ props need to be coupled with community support, which calls for leadership and educational efforts . . . . Another story.)

Although I am not using the word “expression” in the way the FRBR model applies it, the FRBR model does nonetheless deal with the issue by assigning the dc term “creator” to a scholarly work entity and separating out the “agent” (with personal names and details) as another entity.

DCMI Terms and the DC abstract model (DCAM) explain this more technically.

The DCMI term ‘creator’ is a “property” term. It is defined as “an entity primarily responsible for making the resource”.

Other property terms include “title”, “subject”, “type”, “publisher” and heaps more.

In addition to “property terms” DCMI also uses “class terms”. These include Agent, BibliographicResouce, MediaType, and lots more but not as many as the property terms.

The DCAM shows that a property can belong to any number of different classes. The “creator” property can logically belong to both the Agent and the BibliographicResource classes.

The creator is thus a discrete property that can belong to other classes but, can never be confused with those other classes.

The old “statement of responsibility” is clearly nothing more than just what it says, a descriptive statement on a title page or cover. It has served libraries well and will deserve an honorable memorial.

Comments Off

May 12, 2008

Meta-reflections 3: two cases for interoperability, and a role for MARC

Filed under: Metadata,Repositories — Neil Godfrey @ 10:16 pm

Continuing here to share my “education in metadata” — and since I’m still discussing the early months, the post may be of interest to other peoples beginning their metadata journeys too . . . .

An easy MARC intro for a cataloguer into metadata

One good thing about having a library cataloguing background when it came to investigating repositories and seeing how they might be able to talk to each other, was knowing the blunter as well as some of the finer points of MARC. Some have argued that MARC is not really a metadata schema itself but a format for encoding metadata. And that’s one reason it works particularly well as a crosswalk for discrete bits of data from one schema to another.

Migration

I had cut my teeth with an EPrints repository, but when it finally osmosed into me that I had to mind my big P’s and my little p’s (learning that ePrints with a big P was a TM for a particular software package, and eprints with a little p was as often as not used as a generic term for electronic print collections, except when different users said that was only half true), it also became apparent that no institution could assume that once it chose a particular repository it would always have that particular brand of repository.

Currently the different configurations and software bases of repositories mean choosing one is like choosing a car: each make of car has some things one likes but not all, etcetera. And once one make and model is selected for one’s needs now, in a few years one’s circumstances are likely going to mean a different car would be more appropriate, or even necessary.

So given the current state of repository software, one needs to be sure that if one stores all one’s digital resources into one particular make of repository now, one will be able, with little effort or complication, transfer that data to another repository in the future.

I have since learned that some promoters of proprietary repositories promise their repositories will do this for clients, but it’s a topic that’s quickly glibbed over and a little digging shows the promise may not be all it seems. Yes, the data may all be put in a neat package and given back to the library when they cease to use a particular repository, but at least in one case I know it will be stripped of its configurations and be a complete mess. The soft boiled egg will be returned scrambled, leaving repository managers finding staff, workflows and money to reconstruct the original soft boiled egg.

So one has to expect that one’s institution will one day, or just might one day, want to migrate data from one repository to another. And MARC is one handy tool, an old well-known favourite of librarians, that is complex and versatile enough to carry out most migration tasks. By mapping data in a repository to MARC one can then map from that MARC record to the new data format of the new repository.

That is one solution. Librarians know what MARC is capable of conveying as discrete units of data. And thus how granular the migration process can be from one repository to another.

Moving beyond MARC

But there is more. The demands being placed upon repositories are leading them into data configurations that have no traditional counterpart in the normal library cataloguing of resources.

For example: it may be desirable, eventually obligatory, for a repository to differentiate preprints from postprints; articles submitted for publication, articles published, and articles written but not published or submitted; an author’s (different?) affiliations with different publications; whether a particular article or paper, as opposed to a journal or other collation, is peer-reviewed or not; and more.

MARC is good, it can and will continue to do most things. But it will not be able to do everything, at least not easily. One other possibility in certain circumstances is METS, a container for carrying a number of different schema-clumps of data. MARC cannot easily link multiple affiliations to respective authors. But the MODS schema can. And either a MODS schema and/or a MARC one can be carried by a METS package. But not all repositories will support the embedded structure carried by MODS. At least not yet. Plan for change.

But MARC is still going to be good for a while yet with a most vital day to day purpose of the repositories — Open Archive (OAI) Harvesting. That is, migrating the data that is essential for one’s resources being discovered on the internet.

Grappling with standards clashes and confusions over harvesting

Part of our repository evaluation process in RUBRIC was to compare their OAI harvesting functionalities. The easiest way to do this was to migrate data from one repository to another with MARC. Only a few data elements (title, author, date, subject etc) are used for OAI harvesting, and these are easily handled by MARC.

MARC was not the end product as it usually is in a catalogue record. For our repository purposes it was only the half-way house between data destinations. The final destination for harvesting is the Simple Dublin Core schema. Its elements are basic. Cataloguers used to MARC will first wonder how on earth such a blunt instrument can be useful for anything. But recall that library searchers do mostly search by title, author, subject, date, anyway. The more complex data is not lost. It can be kept in reserve (not in Simple Dublin Core but some other schema) till a particular resource is discovered. And then it can go into action and show its stuff as required.

This (OAI harvesting) was one of the hardest aspects of repository functions to come to terms with. Contradictions abounded. The Dublin Core standard insisted that the DC term “identifier” must only refer to the digital resource itself and not to the metadata record describing that resource. So why did EPrints use the DC term “identifier” to point to the metadata record instead? I sent off an email to EPrints to ask. They were as helpful as they could be, but we were all talking from different perspectives and not immediately understanding what each other knew/didn’t know/needed to know. I think I spoke to a technician who knew that by using “identifier” for the metadata page EPrints worked. Blank! when I tried to raise the DCMI stipulation. “Will look into it for next version.”

It was not easy for a newbie like myself to immediately discover that although OAI-PMH harvesting used the DC schema, this OAI protocol had to make its own rules in how to use that DC schema to make harvesting work. I asked harvesters for an explanation also. For some reason I interpreted their responses as something like an ad hoc “making do” with how things worked.

The problem was not with the DC schema. It was with the repository institutions. The institutions did not, from the pureness of their hearts, simply want users having direct access to their resources. Reputations, and the necessary finances and career advancements that reputations attracted, were part of the game. Users had to be directed first to the institution’s repository, with full professional header branding, as the gateway to the resources. The user needed to be taken first to the metadata page — where the branding decorating the details of the resource hit them in the eyes.

Repositories could not be sold in universities otherwise. So the OAI-PMH harvesting had to break the DCMI regulation for the use of the “identifier” element.

I wish I understood that from the very beginning! As it was, this basic piece of information only came to me when it came time for the RUBRIC tech team to test the nitty gritty of OAI harvesting with VTLS’s VITAL repository.

There are several other issues that arise from the simplicity of the simple DC for harvesting, yet to be discussed.

But I can see a quite different set of issues relating to interoperability on the horizon. OAI-PMH and DC will still be basic tools, — especially if/when URI’s can be assigned to each piece of data, one for an author name, another for an affiliation, etc.

One can begin to see the importance of choosing a repository with an established and proficient support base, and to have a planning agenda to sustain whatever is chosen until the next choice has to be made, whether to upgrade or change platforms completely.

Comments (2)

May 10, 2008

Meta-reflections 2

Filed under: Metadata,Repositories — Neil Godfrey @ 11:04 am
Tags: Eprints

I did not come into RUBRIC cold. I was involved in the planning stages of the implementation of an EPrints repository, and then as a cataloguer I was the one responsible for testing how the data entries and outputs worked.

The point was to understand how and where the repository data stood in relation to the rest of the library’s resources, and in relation to the needs and interests of the various university departments and academics who would be using the repository.

We began low-key. Not with award winning research papers or datasets embedded with configurations for their re-use. Rather, it was decided to make our first entries in the repository all the fourth year engineering projects. New students wanted to consult these, and lecturers encouraged the study of them. We would soon learn that once in the repository they would acquire a sizable audience beyond our university, too.

But low-key soon made its voice heard loudly if not immediately very clearly. It was a low key beginning in one respect, but to make it all work I quickly found myself visiting the head and other academics in the engineering department in order to confidently assess the metadata requirements for this sort of resource material. Dewey, USMARC, AACR2, the LOC and OCLC sites all left me high and dry with some of my the metadata needs for this new type of database.

On the technical side, some of the files accompanying some the projects were simply too big or formatted in a way that would not fit in the repository, let alone be viewable to users. Some files were uselessly battened down with password-only access. Some MSWord documents that we wanted to convert to PDF contained formatting that made that process extremely difficult and time-consuming. Most of these issues were solvable or postponable at that stage.

But on the metadata side, academics used a standard research classification code to classify their works. But when I looked into that code (the RFCD component of ASRC) it emerged that this research classification served a different purpose from the descriptive subject classifications such as LCSH. It was not a descriptive discovery classification scheme at all, but a classification scheme for grants and policy reviews. Yet it was the schema known among and used by academic institutions. For starters, there were too many “XYZ not elsewhere classified” entries in it to be of real use as a true subject finding aid.

It appeared from a library cataloguer’s perspective that with our repository we were trying to make a square peg fit into a round hole.

I was yet to learn that the repository was not going to be simply a quantitative extension of traditional library resource services. The repository was not just an enhancement of our services. It was going to be something quite qualitatively different, heading in a direction hitherto alien to libraries, and we needed to start thinking of ways traditional library services themselves could start working with it.

The outcomes from a metadata perspective of this exploratory testing were three documents in particular. The first was a manual of best practice for metadata entry procedures for our EPrints 2.(X) repository. Not just a do/don’t list, but a guideline that explained rationales so that the principles could be applied with a bit of thought and understanding for whoever was to do the data entry, and to know how to work through the inevitable curly questions that come along with data entry.

But that was the easy bit. Much of the fun and frustration of this exploratory work was trying to figure out how to handle all the exceptions to the rules.

Did the exceptions matter? Was it important to standardize repository and library catalogue data? If there was no immediate need, what of the future? What function(s) did each bit of data serve and to what extent could the software take care of our needs, and to what extent were data entry guidelines required?

To help work through these questions I compiled a list of all the potential metadata issues, from the big and fat to the nit-picky or even possibly illusory. This was constructed in 3 columns. The first was a description of each issue. The second was a comparison of that issue with what was understood from normal library practice. And the third was for comments on what the real or imaginary implications might be in the longterm for each “issue”.

This helped focus and prioritize the issues, and eventually I produced a more formal thesaurus of metadata exceptions for USQ’s Eprints repository.

The attempted goal in compiling those documents was to sift through known and true library standards and best practices and to see how these could be applied, if at all, to repository metadata. And where no normal library data standards did apply then to try to assess what a standard or best practice should look like. This sometimes meant attempting to extrapolate from a rule in cataloguing and seeing if it could or should be justified in the repository context.

I have no doubt that given what I have learned since I would change some, possibly much, of what I wrote in those documents. But one has to start somewhere.

I was still very green when it came to broader issues such as OAI harvesting. And we were only working at this stage with pdf documents.

But will elaborate in future meta-reflections . . . . .

Comments Off

May 9, 2008

Meta-reflections 1

Filed under: Metadata,Repositories,Repositories 101 — Neil Godfrey @ 11:41 pm

Before I start my next job I’ve decided it’s time to look back on the things I’ve learned, where I’ve succeeded and where I did not (and why), in working as a “metadata specialist” for two years and 2 months with RUBRIC, and before that working in various capacities (mostly as a cataloguer/editor) with an EPrints repository.

RUBRIC was a project funded by a national government department to assist university libraries to implement repositories. The acronym stands for Regional Universities Building Repository Infrastructure Collaboratively.

There was a “RUBRIC Central”, consisting of a technical team, a planning and PR person, and project manager and myself, the metadata specialist, and this Central team worked as per the acronym collaboratively with repository managers in different universities. How this worked can be seen in a PowerPoint presentation by Kate Watson and Vicki Picasso, both involved repository managers, titled The RUBRIC Project: the benefits of collaboration through partnerships. This is linked on the IDEA 2006 page.

So my introduction into the nitty gritty of metadata and repositories with RUBRIC was in the context of working with several universities, each with different library practices, priorities and infrastructure, most of whom did not yet have repositories. The requirement was to help each get started with a “first generation” repository. The couple that did have repositories were still in their neophyte stages and welcoming of professional assistance to help establish themselves on the right footing.

The first step was to evaluate different repository systems so each university could better assess the one most suited to their situation. We investigated DSpace, VTLS’s VITAL and the University of Queensland’s Fez (two Fedora based repositories), and I also had had experience with EPrints.

This evaluation meant working closely with the technical team of programmers. This involvement consisted of weekly meetings where goals and needs were raised and assessed, and sitting beside them in the same workroom so we could regularly talk the issues through together.

It also, most importantly, meant regular communication and prioritization and troubleshooting through TRAC, wikis and teleconferences.

So the technical skills of the programmers was combined with my library cataloguing skills and understanding of library terminology needs.

So I was put in a rare and enviable position of being able to begin studying metadata issues in the context of:

a variety of repository systems
different universities with different needs and resources
close daily liaison with computer technicians and regular liaison with project partners
and a well resourced and coordinated team and collaborative partnership that kept us all focussed on the issues

Unfortunately the weather can change and towards the end of the project some of the above started to fade, and in hindsight I can see where I could have been more alert to have worked to compensate or correct that, and how I could have improved on some of what I attempted, but that’s hindsight. I’ll discuss that side of it when the time comes for that particular meta-reflection.

It was a great start. I needed to understand metadata not just for one institution, but for eight universities, and how metadata would or could work in four different repository systems, with a marriage of technical and metadata specialists.

Subsequent discussions with more experienced repository managers who were not part of the RUBRIC project began to help me understand just how well positioned I was to help new repository managers avoid certain pitfalls and risks — and to help me acquire an expanding awareness of what future needs would be and the broader environment our partners would be entering.

Next post will begin to discuss the first metadata specifics that needed exploration and solutions.

Comments (1)

April 9, 2010

June 12, 2009

Case study: an archival body responsible for cultural preservation

The real target content

Schema considered for handling this kind of dataset:

ISAD(G): General International Standard Archival Description and EAD: Encoded Archival Description

VRA: Visual Resources Association and TEI: Text Encoding Initiative

CDWA: Categories for the Description of Works of Art, and CDWA Lite

MODS: Metadata Object Description Schema

Draft MODS-DC schema for data about cultural monuments, not about the monuments themselves

March 25, 2009

October 17, 2008

June 28, 2008

Aside on keywords

Controlled vocabularies currently used in repositories

Controlled vocabularies and the future

So what about the here and now?

June 3, 2008

State of the Art and Future Directions

Standards

Harmonization (including metadata reuse issues across different abstract models)

Reuse of database/resources for research

June 2, 2008

May 12, 2008

An easy MARC intro for a cataloguer into metadata

Migration

Moving beyond MARC

Grappling with standards clashes and confusions over harvesting

Next

May 10, 2008

May 9, 2008

Blogroll