September 2, 2007

how repository display configurations can clash with oai harvesters

Filed under: Harvesting,Repositories — Neil Godfrey @ 3:14 am

The basic metadata supporting OAI harvesting is Simple Dublin Core. A data provider (repository) that intends to be compliant with the requirements of OAI harvesting will produce an unqualified DC datastream as a minimum requirement.

At least one repository solution, VTLS’s VITAL, is designed to use the simple DC data as the basis for the repository’s metadata splash page, which carries the repository institution’s branding and directs the user to the archived document being requested. (As far as I know at this point this is not an issue with open source repositories such as EPrints and DSpace.)

This means the repository is attempting to use a single datastream for two different purposes. That becomes a problem if the oai-dc data is constructed in a way that meets one purpose (e.g. the oai harvester), but that particular dc construct is not what we want for the other function (e.g. the portal display of the metadata page). That metadata display page with the link to the deposited document would be better linked to some other datastream — such as a MARC or MODS or VRA or anything OTHER than the OAI-DC data configuration.

This means that a repository manager must be very clear about exactly what they want a service provider (SP) to display from the repository. For example, does one want the service provider to display the repository’s metadata splash page for each document, so that public users are first directed to the metadata details for a particular record, where institutional branding is also prominent, and from there link to the full article or document? Or does one want to cut out the repository branding and descriptive metadata page and allow the SP to take a user directly to the article? What the SP does will depend on how certain data is entered into the oai-dc datastream.

When an SP receives a request for a particular article in your repository, it will rely on the oai-dc record to “identify” that particular article. It thus looks for a dc.identifier value with a resolvable URI link.

This means that:

  1. If the URI value in a dc.identifier is the link to the repository’s metadata page, complete with the full descriptive metadata record of the article, institution’s branding, and link to the full text of the document, then the SP will direct users to this repository page.
  2. If, however, the URI value in the dc.identifier is the link directly to the article itself, possibly offsite at, say, a publisher’s site, then the SP will bypass the repository metadata page and direct users directly to the article wherever it is located.
  3. If there are dc.identifier values that are non-resolvable text strings, such as an ISSN, the SP will ignore these for this purpose.
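The three cases above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the code of any real harvester: the function names, the example URLs, and the rule that "resolvable" means an http or https URI with a host are all my assumptions, as is the choice to take the first resolvable value.

```python
from urllib.parse import urlparse


def is_resolvable_uri(value):
    """Assumption: only http(s) URIs with a host count as resolvable."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def pick_landing_link(identifiers):
    """Return the first resolvable dc.identifier value, or None.

    Non-resolvable strings such as an ISSN are skipped (case 3).
    """
    for value in identifiers:
        if is_resolvable_uri(value):
            return value.strip()
    return None


# Hypothetical dc.identifier values from one harvested oai-dc record.
record_identifiers = [
    "ISSN 1234-5678",
    "http://repository.example.edu/handle/1234/56",
    "http://publisher.example.com/article.pdf",
]

landing = pick_landing_link(record_identifiers)
print(landing)
```

Under these assumptions the user lands wherever the first resolvable dc.identifier points, so what is placed in that element decides whether the repository page is shown or bypassed.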

Normally a repository can and will be configured so that documents deposited into it will generate in the oai-dc a dc.identifier value that is a handle or link to the repository metadata page first. 

But if the repository contains only a link to an offsite copy of the document, and if this is also entered into a dc.identifier field, the SP will direct users away from the repository to the off-site document. No problem, perhaps, for the self-effacing repository manager who wishes to serve the user more than the reputation of the institution supporting the repository, but not politically savvy if one of the very arguments presented to fund the repository in the first place was that it would increase the institution’s exposure to the world. That mediating branding page is normally pretty important.

There are two or three ways around this but each has drawbacks.

One can enter an offsite link to dc.relation instead of dc.identifier. This reserves the dc.identifier field for the repository default metadata page link — normally machine generated by the repository itself.
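As a rough sketch of what this first workaround produces, the Python below builds a minimal oai-dc record with the repository page in dc.identifier and the offsite copy in dc.relation. The two namespace URIs are the standard oai_dc and Dublin Core ones; the function name, title and URLs are made up for illustration, and a real record would carry more elements.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_oai_dc(title, repository_page_url, offsite_url):
    """Build a minimal oai_dc record (hypothetical helper)."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(root, f"{{{DC_NS}}}title").text = title
    # dc.identifier is reserved for the repository's own metadata page.
    ET.SubElement(root, f"{{{DC_NS}}}identifier").text = repository_page_url
    # The offsite copy is recorded as a relation instead.
    ET.SubElement(root, f"{{{DC_NS}}}relation").text = offsite_url
    return root


record = build_oai_dc(
    "An example article",
    "http://repository.example.edu/handle/1234/56",
    "http://publisher.example.com/article.pdf",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

An SP that only looks at dc.identifier would then send users to the repository page, and the publisher link survives in the record as a related resource.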

Another solution is to enter the offsite link into the dc.format so that it would look like this:

<dc.format>PDF </dc.format>

Either of these solutions will cause a problem for the manager whose repository is dependent on mapping its portal display options from that same oai-dc record.

It will mean that a portal display link to a deposited record within the repository itself will be mapped from dc.identifier, while a portal display link to an offsite record will be mapped from dc.relation or dc.format.

So the consistency issue arises if one’s repository depends on mapping its display from the same data that is used for oai harvesting.

 An offsite link will have to be treated the same way for display purposes (same portal display label terms) as all other values entered in other dc.relation or dc.format fields.

That can cause headaches. One wants to show an onsite link and an offsite link to an article in a similar way for users. What is important to them is that they can see at a glance a consistent way to get to the article regardless of where it is stored. One does not want to present an offsite link in a way that makes it look like a link not to the article described but to some “relation” of it, for example. One can rename “relation” for the portal display, but whatever display name is chosen will then apply to every other value mapped from dc.relation in the oai-dc record.
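A small Python sketch may make the labelling problem concrete. It assumes a portal that assigns one display label per DC element, which is how I understand a display mapped straight from the oai-dc record to behave; the label text, function name and values are invented for the example.

```python
# Assumption: the portal can assign only one display label per DC element.
DISPLAY_LABELS = {
    "identifier": "Link to article",
    "relation": "Related resource",
}


def render_links(dc_values):
    """Render (element, value) pairs using the per-element label."""
    return [
        f"{DISPLAY_LABELS[element]}: {value}"
        for element, value in dc_values
        if element in DISPLAY_LABELS
    ]


# Hypothetical record whose only full text is at the publisher's site.
offsite_record = [
    ("identifier", "http://repository.example.edu/handle/1234/57"),
    ("relation", "http://publisher.example.com/article.pdf"),
    ("relation", "Part 2 of a series"),
]

lines = render_links(offsite_record)
for line in lines:
    print(line)
```

The publisher link and the series note come out under the same label, so the user cannot tell at a glance which one leads to the article itself. Renaming the dc.relation label to something like "Link to article" would simply move the problem to the series note.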

And whatever solution is decided upon will need to consider preservation and sustainability questions. One day the records will be migrated to some other software — what will happen to any such solutions then? What systems will ensure consistency over time within a repository, and what issues will arise in the broader world of databases needing to be able to talk to each other in the future?

Repository designs need to allow for record and metadata displays to be configured independently of the data used for oai harvesting.


  1. I think offsite URIs should appear in dc.relation (along with any other relationships to other data, however expressed) because the repository metadata is meant to describe and identify the material IN the repository. When the metadata links to data stored elsewhere I think that’s a relationship matter since, as DCMI states for dc.relation – “A related resource. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system.”

    The material IN the repository goes in dc.identifier because it is “An unambiguous reference to the resource within a given context.
    Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.” and I think the particular repository is the “given context”.

    I wouldn’t think that dc.format is the place for such things at all.

    Comment by simon — September 3, 2007 @ 5:35 am

  2. The problem I see with this is that inconsistency is built in to the oai-dc data when the repository sees itself as the primary user service, and thus wishes to draw a user’s attention to itself.

    So in the oai-dc we have listed:

    the title of the document,
    the creator of the document,
    the subject of the document,
    the description of the document,
    the date of the document,
    the format of the document,
    the identifier of the repository metadata page!

    Consistency would demand that we use the identifier of the document. The only reason for the inconsistency, or at least a primary one, is the branding issue. A primary purpose of the repository is to promote the institution to which it belongs.

    The dc.relation is normally used for a document that is related to the primary one identified in the oai-dc record, such as a previous edition or a second part or series title of the primary document. It can also be a varying format of the article (e.g. an html version of a pdf document). See

    A dc.identifier can also be a doi or bibliographic citation which can apply as much to an offsite as the repository location of the same document. So it seems inconsistent to say there is an exception in the case of a URL.

    The option of using dc.format as a container for the offsite link was presented in a D-Lib Magazine article in part addressing this issue:

    Comment by neilgodfrey — September 3, 2007 @ 7:33 am

  3. Point taken re the tensions between metadata and resource identification. Presumably where more than one resource is attached to a single metadata record you would recommend a dc.identifier for each resource? The branding issue is only part of the problem though. It depends on what the harvester is seeking – is it a resource harvester or a metadata harvester?

    I still think that dc.relation is the place for links to resources beyond the local environment. recommends dc.relation be used for “The URI of each available format of the eprint. If necessary, repeat this element for multiple formats. Also repeat this element if the eprint is available from other locations, for example from the publisher’s Web site.”

    I don’t understand your sentence about dc.identifier: “So it seems inconsistent to say there is an exception in the case of a URL.” I didn’t imply that as far as I can recall.

    Did the Van de Sompel article lead to the promised specification? It seems to me that their concern was resource harvesting as opposed to metadata harvesting.

    Comment by simon — September 4, 2007 @ 5:22 am

  4. Hi Simon,

    Good points. There are really two functions that are closely intertwined here as I understand it: 1. Metadata harvesting, and 2. Resource discovery & retrieval — as you point out in reference to the Van de Sompel article. I’m not sure if what follows is really addressing your concerns, but I’ll give it a go.

    The OAI protocol is for metadata (as opposed to resource) harvesting, but for the purpose of resource discovery. I know that sounds elementary, and I’m appreciating more all the time the value of clarifying these most basic concepts in my own mind. And the Dublin Core element set of metadata is for resource discovery.

    So the OAI compliant Service Provider harvests the DC metadata — but not for its own sake, but for the purpose of resource discovery by the user. Hence the oai-dc contains, in theory, all the metadata required to discover and retrieve the resource itself. So the dc.identifier in oai philosophy can be expected to be as much the direct identifier of the resource itself as is the dc.title, dc.description, etc. (I don’t take the “given context” to refer to the repository or location of the resource, but in this case to the context of the related metadata information — the identifier of the “thing” with the dc title, and dc creator etc in that datastream.)

    That’s what I meant by the inconsistency in the case of the dc.identifier value in the oai-dc data. The oai-dc contains the title, creator and description of the document or resource, not of the metadata. The only exception in the case of repositories is with the dc.identifiers that contain a URI to the metadata page itself.

    At the recent DC-2007 Conference some of the “experts/leaders” in the OAI and DCMI worlds at first could not see any “issue” with an OAI compliant Service Provider bypassing the Repository and taking users directly to the document from the harvested oai-dc data — that in their mind was exactly what it was supposed to do, and preferable in theory to going via the repository middleman.

    Others with their feet closer to the repository trenches spoke of this as “the branding issue” — hence my term.

    Fez does put the direct uri link to the resource/document in the dc.format element. So it’s not unknown.

    I’m not recommending either dc.relation or dc.format as alternatives. I can see why you favour dc.relation for the publisher site and many others agree. I don’t know of any better alternative at this point.

    Your reference to the eprints article also makes more sense of the dc.relation — as “related” formats of the primary one that is “identified” as archived in the repository — at least as I recall that article. But open to correction. The issue we have with repositories like the VITAL one, however, is that the primary resource to be identified has no copy in the repository archive from which to refer. The primary resource referenced is the one on the publisher site. It is not theoretically related to anything else. So some opt for “dc.format” and some for “dc.relation”.

    So long as the Service Provider is aware and wants to work with the link to the document resource itself. Some SPs will want to full text search the document.

    Comment by neilgodfrey — September 4, 2007 @ 7:27 am

  5. I think we are on the same wavelength!

    This discussion demonstrates that, in an OAI-PMH environment, repository managers need to take care in where resource “identifiers” are placed, as their choices will affect what is discoverable as a result – metadata or resources.

    You are right. The choice will be influenced by “branding”, and the branding is likely to be decided within an institution, but not necessarily at the repository manager level. There was discussion of this issue at the RUBRIC Project Foster Day in February 2007, and I recall a divergence of views between the partners. So it’s not surprising that DC-2007 revealed the same thing.

    The difficulty for the practitioner is making the right choices to support both metadata and resource harvesting. But OAI-PMH is pretty strict in what it says at – “Note that the identifier described here is not that of a resource. The nature of a resource identifier is outside the scope of the OAI-PMH. To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose.”

    This suggests to me that dc.identifier is NOT the place for any resource IDs, if I am to differentiate between metadata harvesting and resource discovery. But perhaps I should have realised that before.

    Comment by simon — September 5, 2007 @ 2:16 am

  6. The OAI is unambiguous, as you rightly state.

    Just a minor detail: Dublin Core, the primary tool of OAI, is equally unambiguous about how to use the identifier element:

    4.14. Identifier

    Label: Resource Identifier

    Element Description: An unambiguous reference to the resource within a given context. . . . . .

    Guidelines for content creation: . . . . It should not be used for identification of the metadata record itself.

    So OAI uses a resource identifier as a metadata identifier! Wouldn’t life be boring without our ability to create little inconsistencies like this! 🙂

    Comment by neilgodfrey — September 5, 2007 @ 3:24 am

  7. Have you seen the Best Practices for OAI Data Provider Implementations and Shareable Metadata? See It discusses some of these issues and problems…

    This was developed by members of the Digital Library Federation and the National Science Digital Library who have been active in metadata aggregations and in providing metadata via oai pmh. An updated version will be published soon out of the DLF – and we’ll also be updating the wiki (this is reliant on me finding time to do that…).

    Comment by Sarah Shreeves — September 9, 2007 @ 10:30 pm

  8. Thanks, Sarah. Much appreciate the link! Looks like I had to learn the long-way around.

    Comment by neilgodfrey — September 11, 2007 @ 8:11 am

  9. […] wrote about this clash of standards and protocols in another post last year. One response was to direct readers to Best Practices for OAI Data Provider Implementations and […]

    Pingback by Are repositories set to be left out in the cold? « Metalogger — October 10, 2008 @ 1:02 pm
