Metalogger

October 10, 2008

Are repositories set to be left out in the cold?

Filed under: Dublin Core,Harvesting,Repositories — Neil Godfrey @ 1:01 pm

Repositories and their harvesters have a rule of their own that violates Dublin Core standards. Because of this, are repositories and harvesters on target for a massive retroversion or major set of patches if they are to be a part of the semantic web? (I don’t know, but I’d like to be sure about the answer.)

Once again at a Dublin Core conference I listened to some excellent presentations on the functionality and potential applications of Dublin Core, but this time I had to see if I could poop the party and ask at least one speaker why the nice theory and applications everywhere simply did not work with the OAI harvesting of repositories.

I like to think that standards have good rationales. The web, present and future (e.g. the semantic web), is predicated upon internationally recognized standards like Dublin Core. According to the DCMI site the fifteen element descriptions of Simple Dublin Core have been formally endorsed by:

  • ISO Standard 15836-2003 of February 2003 [ISO15836]
  • ANSI/NISO Standard Z39.85-2007 of May 2007 [NISOZ3985]
  • IETF RFC 5013 of August 2007 [RFC5013]

But there is one area where there is a clear conflict between DCMI element definitions and OAI-PMH protocols. The DC usage guide explains the identifier element:

4.14. Identifier

Label: Resource Identifier

Element Description: An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Examples of formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

Guidelines for content creation:

This element can also be used for local identifiers (e.g. ID numbers or call numbers) assigned by the Creator of the resource to apply to a particular item. It should not be used for identification of the metadata record itself.

Contrast the OAI-PMH protocol:

A unique identifier unambiguously identifies an item within a repository; the unique identifier is used in OAI-PMH requests for extracting metadata from the item. Items may contain metadata in multiple formats. The unique identifier maps to the item, and all possible records available from a single item share the same unique identifier.

The same protocol explains that an item is clearly distinct from the resource and points to metadata about the resource:

  • resource – A resource is the object or “stuff” that metadata is “about”. The nature of a resource, whether it is physical or digital, or whether it is stored in the repository or is a constituent of another database, is outside the scope of the OAI-PMH.
  • item – An item is a constituent of a repository from which metadata about a resource can be disseminated. That metadata may be disseminated on-the-fly from the associated resource, cross-walked from some canonical form, actually stored in the repository, etc.
  • record - A record is metadata in a specific metadata format. A record is returned as an XML-encoded byte stream in response to a protocol request to disseminate a specific metadata format from a constituent item.
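To make the item/record distinction concrete, here is a minimal Python sketch of how a harvester might ask for two records of the same item. The `GetRecord` verb and the `identifier` and `metadataPrefix` parameters come from OAI-PMH itself; the base URL, the item identifier and the second metadata prefix (`marc21`) are made up for illustration and assume a repository that happens to offer that format.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord request URL for one item in one metadata format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,           # identifies the item, not the resource
        "metadataPrefix": metadata_prefix,  # selects which record of that item
    })
    return f"{base_url}?{query}"


# Hypothetical repository and item identifier, for illustration only.
BASE_URL = "http://repository.example.org/oai"
ITEM_ID = "oai:repository.example.org:318"

# One item identifier, two requests, two different records.
for prefix in ("oai_dc", "marc21"):
    print(get_record_url(BASE_URL, ITEM_ID, prefix))
```

The point of the sketch is only that the identifier in the request stays the same while the metadata format changes, which is what "all possible records available from a single item share the same unique identifier" amounts to in practice.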

I wrote about this clash of standards and protocols in another post last year. One response was to direct readers to Best Practices for OAI Data Provider Implementations and Shareable Metadata.

The working result for many repositories is a crazy inconsistency. Within a single Dublin Core record for OAI harvesting the same element name, identifier, can actually be used to identify different things:

<oai_dc:dc
     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
     http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   <dc:title>Using Structural Metadata . . . </dc:title>
   <dc:creator>Dushay, Naomi</dc:creator>
   <dc:subject>Digital Libraries</dc:subject>
   <dc:description>[Abstract here]</dc:description>
   <dc:description>23 pages including 2 appendices</dc:description>
   <dc:date>2001-12-14</dc:date>
   <dc:type>e-print</dc:type>
   <dc:identifier>http://eprints.repository.edu/318/</dc:identifier>
   <dc:identifier>1-85636-082-X</dc:identifier>
 </oai_dc:dc>

In this OAI DC the first identifier identifies the splash page for the resource in the repository. The second identifier identifies the resource itself. It works for now, between agreeable partners. But how sustainable is such a contradiction? What is the point of standards?
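A harvester that receives a record like this has to guess which identifier is which. The Python sketch below, using only the standard library, pulls out every dc:identifier from a trimmed copy of the record above and sorts them with a crude rule: anything that parses as an http(s) address is assumed to be a splash page, everything else is assumed to identify the resource. That rule is my own assumption for illustration, not something either standard prescribes, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DC_NS = "http://purl.org/dc/elements/1.1/"

# Trimmed copy of the record shown above.
RECORD = """<oai_dc:dc
     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title>Using Structural Metadata . . . </dc:title>
   <dc:identifier>http://eprints.repository.edu/318/</dc:identifier>
   <dc:identifier>1-85636-082-X</dc:identifier>
</oai_dc:dc>"""


def looks_like_web_address(value: str) -> bool:
    """Heuristic only: treat http(s) URLs with a host as web addresses."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def split_identifiers(xml_text: str):
    """Return (web_addresses, other_identifiers) found in dc:identifier elements."""
    root = ET.fromstring(xml_text)
    values = [(el.text or "").strip() for el in root.iter(f"{{{DC_NS}}}identifier")]
    web = [v for v in values if looks_like_web_address(v)]
    other = [v for v in values if not looks_like_web_address(v)]
    return web, other


web, other = split_identifiers(RECORD)
print("web addresses (splash page?):", web)
print("other identifiers (the resource?):", other)
```

Nothing in the record itself says that the URL is a splash page rather than the resource; the guess lives entirely in the harvester's code, which is exactly the kind of private agreement between partners that worries me.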

As far as I understand the issue, this breakdown in the application of the Dublin Core standard is the result of institutional repositories needing their own branding to come between users and the resources they are seeking. Without that branding they would scarcely have the institutional support that enables them to exist in the first place.

Surely there must be other ways for harvesters to be aware of the source of any particular resource harvested and hence there must be other ways they can meet the branding requirement. Surely there is a way to retrieve an identified resource (not an identified metadata page about the resource) and to display it with some branding banner that will alert users to the repository — and related files and resources — where it is archived. Yes?

I mention “related files and resources” along with the branding page, but maybe this is a separate issue. Where a single resource consists of multiple files, is the metadata page a valid proxy for that resource anyway? Or is there another way of displaying these?

Australia has had the advantage of a national metadata advisory body, MACAR. The future of MACAR into next year is still under discussion, but such an issue would surely be an ideal focus for such a body — to examine how this clash impacts the potentials of repositories today and in the future. A national body like MACAR has a lot more leverage for pioneering changes if and where necessary.

What should be done?

What can be done?


But is there more? More confusion of terms?

In having another look at the DCMI site for this post I noticed something else in the latest DC Element Set description page:

Term Name: identifier
URI: http://purl.org/dc/elements/1.1/identifier
Label: Identifier
Definition: An unambiguous reference to the resource within a given context.
Comment: Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.

DCMI recommends that an identifier be “a string”. In the context of RDF and the semantic web my understanding of “string” is a dead-end set of letters as opposed to a resolvable URI or “thing”. But the DC Usage Guide “explains” that an applicable formal identification system allowed here can also be a URI. So what don’t I understand about the difference between strings and (RDF) things, now?


12 Comments

  1. Neil, I’ve also noticed that metadata/resource conflation issue. Re your last q. about URIs, I think the answer is that a URI can be a string or a thing, depending on the element: in elements with literal values (like Identifier) a URI is a string, while in elements with non-literal values it represents a thing.
    Irvin

    Comment by Irvin Flack — October 13, 2008 @ 1:53 pm

  2. Hi Irvin. Literals and nonliterals are still relatively new to me — my latest understanding was that a literal value itself is a dead end, cannot go anywhere. But does not a URI by definition go or take a user somewhere else? Probably this explanation was too simplistic. Is this not correct? What am I misunderstanding? Thanks for any clarification.

    Comment by neilgodfrey — October 13, 2008 @ 2:23 pm

  3. I think the URI in the value of dc.identifier is a special case: it’s just a set of characters rather than a link to another description. In properties with non-literal values the URI is used to represent the resource that is the value of that property, eg the author in the case of dc.creator. And then that resource (the author in that example) in turn can be the subject of another description — with the URI as the link between the two descriptions.

    I just had a thought about why this would make sense: with dc.identifier the resource that the URI identifies is the described resource itself. So it would be redundant for _that_ URI to be the subject of another description. So that’s why DC models dc.identifier as having a literal value and why a URI in that field is a string. That’s my interpretation, anyway.

    Douglas Campbell made a comment related to this last month on the DC.Architecture list (in regards to the new draft dc xml encoding http://dublincore.org/documents/2008/09/01/dc-ds-xml/) :

    http://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0809&L=DC-ARCHITECTURE&T=0&F=&S=&P=1398

    “* 4.4.1 – in the sentence “Note that a literal value surrogate can not contain a value URI or a vocabulary encoding scheme URI” you might consider adding to the end “, but it may contain a URI as a string if that is the value of the literal.”, just to clarify that situation.”

    Comment by Irvin Flack — October 14, 2008 @ 12:19 pm

  4. […] all the messages that people send over it are coherent. There are ugly bits. Particularly its use of non-standard Dublin Core identifiers. OAI-PMH is, basically A Good Thing. That’s not to say that it’s painless, though. The […]

    Pingback by PT’s blog » What the OAI-ORE protocol can do for you — October 14, 2008 @ 1:48 pm

  5. Folks:

    I think there are two problems with the assumptions behind this discussion. Firstly: OAI-PMH. I’ve done a lot of teaching of OAI-PMH to librarians, and the fact that the same words mean different things in libraryland and OAI-land is a huge problem. One important thing is that an OAI identifier is about the wrapper around the metadata description ONLY–it is not about the resource being described and does not clash in that sense with DC:identifier. That said, there are a lot of variations in the way people interpret and use the standards that make life difficult for people seeking to use the resulting metadata.

    If you look at the photo of David at: http://www.oaforum.org/tutorial/english/page3.htm#section3 you’ll see part of the problem (I hope). “Item” and “record” are defined and used very differently than they are in libraries, and it’s really important to understand the difference if you are going to use OAI as a data provider or a harvester of data. There used to be a pretty good diagram of all this in the old OAI documentation, but it’s been removed, I’m afraid, and instead we have a lot more examples (which don’t help, IMO).

    BTW, OAI has addressed the notion of branding, partly because NSDL was very interested in it–it’s part of the “About” container. I think your thought that data providers route users in particular ways because of branding concerns is not quite the case. There are a lot of reasons, but we found branding not to be that big an issue (but that may be because we dealt with it explicitly from the start). Frankly, most of the problem was ignorance, and bad applications.

    The DC identifier IS about the resource being described, but there is no assumption that it’s a URL, URI, or anything you can immediately use “out of the box.” OAI-PMH (or something like it) is pretty essential if you’re trading DC records around, since DC explicitly does not specify any administrative metadata and OAI-PMH fills that bill pretty well.

    The situation you describe in the posting, where there’s a splash page and an internal identifier in the same description is very common, and not even the worst kind of scenario. When I was involved in aggregating metadata for NSDL some years ago, we would sometimes get people who gave us the same URL in every description, because they wanted to send everyone through their webpage so they could count them as users. In that environment a splash page where the user could choose between a number of formats (sort of a work record in the FRBR sense) was reasonably functional. In NSDL we had to do some serious evaluation and normalization to get stuff to work. Librarians aren’t used to that, and it does rankle, I’ll admit it.

    An aside: One of the reasons DCMI is getting into Application Profiles big time is because of this ambiguity of expectation, but it’s going to be a while before people start doing those properly, thus making life easier for harvesters!

    I won’t tackle the “literals/non-literals” issue but would instead refer you to Andy Powell’s recent blog posting on the issue, which might help clarify things a bit: http://efoundations.typepad.com/efoundations/2008/08/the-importance.html

    I agree with the comment that OAI-PMH is A Good Thing, but like most things built by one community and used by another, it’s important to get things really clear going in.

    Regards,
    Diane

    Comment by Diane Hillmann — October 22, 2008 @ 4:06 am

  6. Hi Diane, do excuse the delay in opening the gate for your post — the urls had defaulted the comment to be spam-vetted before going live —

    I have been flat out finishing up a repository job at Murdoch University (Perth, Western Australia) while preparing to leave for a new job o/seas, and look forward to getting back to catching up with Irvin’s and Diane’s comments as soon as my world settles closer to normal again.

    Comment by neilgodfrey — October 25, 2008 @ 10:03 am

  7. Finally catching up with blogging again! —

    Re literals and non-literals, the article by Andy Powell that Diane links to above did finally answer my questions about the concepts, and I see where my understanding was wrong in my initial post now. Actually Andy’s article (and the W3C doc he cites) returned me to my original understanding of the difference, only this time with a bit more precision and clarity. Less likely to get waylaid by an incomplete explanation again in future. To highlight the particular sections that hit the nail on the head for me:

    Literals are used to identify values such as numbers and dates by means of a lexical representation. Anything represented by a literal could also be represented by a URI, but it is often more convenient or intuitive to use literals.

    (from http://www.w3.org/TR/rdf-concepts/)

    Andy’s words:

    So, in Dublin Core metadata, each value is either “a literal” (a literal value) or “a physical, digital or conceptual entity” (a non-literal value) and the choice of which to use is largely one of convenience.

    Comment by neilgodfrey — December 10, 2008 @ 10:19 pm

  8. Hi Diane,

    Not sure if you’ll read this after such a long delay, but here goes anyway – – –

    You said that the OAI identifier is about the wrapper, which is also clear from the OAI protocols I cited. But the rest of the same OAI record is about the resource addressed by the metadata and not about the wrapper. It is only this particular DC property that is used to refer to the wrapper and only when it has a certain value.

    Would it not be more consistent if the OAI-PMH used some other OAI element with the basic DC — something like a kind of OAI-PMH Application Profile — to enable one identifier to point to the wrapper while the pure DC identifier keeps itself pure by pointing to the resource? (The wrapper analogy possibly risks further confusion — if METS is a wrapper then maybe OAI DC is more like a label?)

    P.S. — one more query arising from your comment — What are some of the other reasons (apart from the requirement for institutional support and hence the branding one) you suggested institutions use for harvesting the splash page in preference to the resource itself? Thanks.

    Comment by neilgodfrey — December 10, 2008 @ 10:37 pm

  9. […] but comments I have added in response to another by Diane Hillman that addressed a discussion re Are Repositories Set to be Left Out in the Cold? (or at least maybe set for a bit of a reconfig one […]

    Pingback by back again — first to continue a discussion on oai and dc « Metalogger — December 10, 2008 @ 10:47 pm

  10. Neil:

    If you take a look at a real OAI record (here’s one from the UIUC OAI Registry–a great resource, btw): http://gita.grainger.uiuc.edu/registry/SampleRecord.asp?cid=1112902 you’ll see that the OAI identifier and the dc:identifier are from different namespaces. This making of distinctions is what namespaces are all about, right? These identifiers have vastly different purposes, as you’ve noted, and keeping them straight is really really important.

    As for the splash page issue … if you look at http://arxiv.org, you’ll note that they, too, use those kind of pages, basically to enable users to make a choice of format for the document from that page. This is the Physics Archive, in case you’re not familiar with them (they’re the granddaddy of all domain repositories), and they basically build their metadata counter to the DC one-to-one rule so that they can create one record for all the various formats. It simplifies things for their system, but not something I generally recommend, though it’s certainly less onerous than doing the same thing with physical and digital resources.

    Comment by Diane Hillmann — December 12, 2008 @ 6:19 am

  11. Hi Diane. Glad you were able to respond still despite the huge time gap with me getting back to this discussion.

    The specific example of the OAI record you link to in fact raises many biggish questions about repositories. Few repositories that I am familiar with in Australia would use an off-site html page link as the primary identifier of the resource. Many would prefer to use dc:relation for this, but this is because the primary idea of a repository is to be a storer and preserver and display base for an institution’s resources. This is not the same as supplying additional web pages with html links to a resource off-site while having no actual archived resource within the repository itself. Normally (at least among the institutional repositories with which I am most familiar) the general purpose is to archive a resource for preservation and other purposes, and if there is an html link identifying another version or copy of this resource elsewhere, then that html link will be assigned to a dc relation of the archived resource. Besides, many repository software solutions until now have not been able to store html documents in the first place. The html version of a document will normally be a related resource to what is actually archived in the repository. Links to html documents are more commonly simply added to normal catalogue records; special reasons are needed for them to be located in repositories as such, I would think. A repository record linking to an off-site html page for the resource it describes is not strictly acting as a “repository of resources” but as yet another link page to other html pages. Fair enough if that’s a policy, but it’s not a core repository function as normally understood and for which repository systems were initially designed.

    More commonly, the OAI records of a repository that archives documents and wishes to direct users to them will do so by making the oai dc:identifier the same as the identifier for the splash page. . . . and hence we return to my original problem.

    I use square brackets so they won’t be deleted by WordPress in this space . . . .

    [record]
    [header]
    [identifier]oai:eprints.gla.ac.uk:3089[/identifier]
    [datestamp]
    [metadata]
    [oai_dc:dc xsi . . . . ]
    [dc:title] . . .
    [dc:creator] . . .
    [dc:identifier]http://eprints.gla.ac.uk/3089[/dc:identifier]
    . . . . . .

    This is the pattern normally found — the oai identifier is taken from the splash-page identifier which is embedded in the metadata for the resource, not the splash page itself.

    But your example would suggest that this does not have to be the case – that the harvester should be able to take the oai identifier from another source apart from the simple (oai) dc record itself.

    Comment by neilgodfrey — December 12, 2008 @ 2:00 pm

  12. Neil:

    I think there are a lot of acceptable patterns, and some are more usable than others in an automated environment. One of the tricks is to be able to identify the patterns and use them appropriately in an application environment.

    I think we’re a long way from having these issues settled down, and indeed there is now new interest in solutions that aren’t OAI-PMH. So we’ll see. My experience as an aggregator suggests that there’s not much point in requiring people to do things a particular way. You just have to deal with what you get, and perhaps take the opportunities that arise to educate your data providers.

    Diane

    Comment by Diane Hillmann — March 18, 2009 @ 4:19 am

