Metalogger

June 15, 2010

Who uses Dublin Core?

Filed under: Dublin Core — Neil Godfrey @ 9:04 pm

To get some idea, I checked the OAI-PMH data provider list and the OAIster contributor list.

Why OAIster? From http://www.lib.umich.edu/digital-library-production-service-dlps/oaister/

OAIster is a union catalog of digital resources.

In [2009], OAIster transitioned to OCLC. OAIster metadata is available:

  • As a freely available, discretely searchable database at http://oaister.worldcat.org/.
  • Included within WorldCat.org, also freely available and searchable with other WorldCat items.
  • Through FirstSearch, as a separate database available to subscribers to the FirstSearch service.

From the OCLC home of OAIster:

OAIster is a union catalog of millions of records representing open archive digital resources that was built by harvesting from open archive collections worldwide using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Today, OAIster boasts more than 23 million records representing digital resources from more than 1,100 contributors. Additionally, the OAIster records are included in search results for those libraries with WorldCat Local and WorldCat Local “quick start.”

OAIster is one of the largest aggregations of records pointing to open archive collections in the world. Contributors to OAIster are required to use the OAI-Protocol for Metadata Harvesting (OAI-PMH), which requires the use of simple Dublin Core.

Search hits on OAIster are as follows (from the University of Michigan Library site):

The same page (http://www.lib.umich.edu/digital-library-production-service-dlps/oaister-statistics) shows graphs for the increase in records and repositories harvested.

Listed among the OAIster 1100 oaistercontributors are institutions from the following 58 countries: (more…)

Advertisements

April 9, 2010

October 10, 2008

Are repositories set to be left out in the cold?

Filed under: Dublin Core,Harvesting,Repositories — Neil Godfrey @ 1:01 pm

Repositories and their harvesters have a rule of their own that violates Dublin Core standards. Because of this, are repositories and harvesters on target for a massive retroversion or major set of patches if they are to be a part of the semantic web? (I don’t know, but I’d like to be sure about the answer.)

Once again at a Dublin Core conference I listened to some excellent presentations on the functionality and potential applications of Dublin Core, but this time I had to see if I could poop the party and ask at least one speaker why the nice theory and applications everywhere simply did not work with the OAI harvesting of repositories.

I like to think that standards have good rationales. The web, present and future (e.g. the semantic web) is  predicated upon internationally recognized standards like Dublin Core. According to the DCMI site the fifteen element descriptions of Simple Dublin Core have been formally endorsed by:

  • ISO Standard 15836-2003 of February 2003 [ISO15836]
  • ANSI/NISO Standard Z39.85-2007 of May 2007 [NISOZ3985]
  • IETF RFC 5013 of August 2007 [RFC5013]

But there is one area where there is a clear conflict between DCMI element definitions and OAI-PMH protocols. The DC usage guide explains the identifier element:

4.14. Identifier

Label: Resource Identifier

Element Description: An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Examples of formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

Guidelines for content creation:

This element can also be used for local identifiers (e.g. ID numbers or call numbers) assigned by the Creator of the resource to apply to a particular item. It should not be used for identification of the metadata record itself.

Contrast the OAI-PMH protocol:

A unique identifier unambigiously identifies an item within a repository; the unique identifier is used in OAI-PMH requests for extracting metadata from the item. Items may contain metadata in multiple formats. The unique identifier maps to the item, and all possible records available from a single item share the same unique identifier.

The same protocol explains that an item is clearly distinct from the resource and points to metadata about the resource:

  • resource – A resource is the object or “stuff” that metadata is “about”. The nature of a resource, whether it is physical or digital, or whether it is stored in the repository or is a constituent of another database, is outside the scope of the OAI-PMH.
  • item – An item is a constituent of a repository from which metadata about a resource can be disseminated. That metadata may be disseminated on-the-fly from the associated resource, cross-walked from some canonical form, actually stored in the repository, etc.
  • record – A record is metadata in a specific metadata format. A record is returned as an XML-encoded byte stream in response to a protocol request to disseminate a specific metadata format from a constituent item.

I wrote about this clash of standards and protocols in another post last year. One response was to direct readers to Best Practices for OAI Data Provider Implementations and Shareable Metadata.

The working result for many repositories is a crazy inconsistency. Within a single Dublin Core record for OAI harvesting the same element name, identifier, can actually be used to identify different things:

<oai_dc:dc
     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
     http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   <dc:title>Using Structural Metadata . . . </dc:title>
   <dc:creator>Dushay, Naomi</dc:creator>
   <dc:subject>Digital Libraries</dc:subject>
   <dc:description>[Abstract here]</dc:description>
   <dc:description>23 pages including 2 appendices</dc:description>
   <dc:date>2001-12-14</dc:date>
   <dc:type>e-print</dc:type>
   <dc:identifier>http://eprints.repository.edu/318/</dc:identifier>
   <dc:identifier>1-85636-082-X</dc:identifier>
 </oai_dc:dc>

In this OAI DC the first identifier identifies the splash page for the resource in the repository. The second identifier identifies the resource itself. It works for now, between agreeable partners. But how sustainable is such a contradiction? What is the point of standards?

As far as I understand the issue, this breakdown in the application of the Dublin Core standard is the result of institutional repositories needing their own branding to come between users and the resources they are seeking. Without that branding they would scarcely have the institutional support that enables them to exist in the first place.

Surely there must be other ways for harvesters to be aware of the source of any particular resource harvested and hence there must be other ways they can meet the branding requirement. Surely there is a way to retrieve an identified resource (not an identified metadata page about the resource) and to display it with some branding banner that will alert users to the repository — and related files and resources — where it is archived. Yes?

I mention “related files and resources” along with the branding page — but maybe this is a separate issue. Where a single resource consists of multiple files then is the metadata page a valid proxy for that resouce anyway? Or is there another way of displaying these?

Australia has had the advantage of a national metadata advisory body, MACAR. The future of MACAR into next year is still under discussion, but such an issue would surely be an ideal focus for such a body — to examine how this clash impacts the potentials of repositories today and in the future. A national body like MACAR has a lot more leverage for pioneering changes if and where necessary.

What should be done?

What can be done?


But is there more? more confusion of terms?

In having another look at the DCMI site for this post I noticed something else in the latest DC Element Set description page:

Term Name: identifier
URI: http://purl.org/dc/elements/1.1/identifier
Label: Identifier
Definition: An unambiguous reference to the resource within a given context.
Comment: Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.

DCMI recommends that an identifier be “a string”. In the context of RDF and the semantic web my understanding of “string” is a dead-end set of letters as opposed to a resolvable uri or “thing”. But the DC Usage Guide “explains” that an applicable formal identification system allowed here can also be a URI.  So what don’t I understand about the difference between strings and (RDF) things, now?


June 2, 2008

July 14, 2007

June 26, 2007

dc.source — an attempt to clarify why it is not something else (updated 1.40 pm)

Filed under: Dublin Core,MARC,Repositories — Neil Godfrey @ 2:41 am

Librarians and their clients are used to thinking of sources as citations. And this carries over into confusion in the Dublin Core metadata world.

We are used to thinking of a bibliographic or cited “source” for an article, but in MARC “source” can mean an actual institution or donor who provided the material (tag 037 for “source of acquisition) and in DCMI it can mean the page or book from which an article was scanned.

Like any term the word “source” is used differently depending on perspectives of users.

The following DCMI links may help clarify the DC meaning and use of their term “source” in their elements.

http://dublincore.org/documents/usageguide/glossary.shtml

http://dublincore.org/documents/usageguide/elements.shtml#source

The DCMI definition of source is “Information about a second resource from which the present resource is derived”, and they give 2 examples:

  1. a page from which a picture was copied;
  2. a call number of a book from which a pages were scanned.

In repositories especially we are depositing works by authors that are subsequently published in journals etc. So strictly speaking in this case the author is the source, though obviously we use ‘creator’ or ‘contributor’ in this case. And the publishing journal title is a subsequent related title. That journal might be a “source” of info for a student later on, but it is not the “source” of the original article itself, which is what repositories are dealing with, and which DC is attempting to isolate with this term.

And the whole thing gets more confusing when, as one of our partners commented, one uploads a postprint, a publisher’s version of a document. Is not the publishing journal title then the ‘source’ while this would not be the case with a pre-print. Obviously this gets damn messy if we are going to be martinets about semantics. We naturally want a single source whether the document is a preprint or a postprint. But the confusion of this particular example also demonstrates why the publishing “journal title” cannot be the actual dc.source. And this leads in to the MARC mapping from the host or publishing journal. . . .

Relation to the MARC 773 (or 787) tag

The mapping of the Host Item Entry MARC 773 to dc.relation is based on the standard LOC and DC crosswalks for these. One example is at http://www.loc.gov/marc/dccross_199911.html

This conforms with the standard DCMI definition of relation. Note that the LOC standard description for 773 is “host item” and that too indicates a “relationship” to the document being archived in the repository. A more complete 773 field with page references for the article identified in the subfields technically turns the 773 into a “dc.identifier”. But no need to go there for now.

Language codes in repositories: English, eng, en or en-aus?

Filed under: Dublin Core,Harvesting,MARC,Repositories — Neil Godfrey @ 2:20 am

Collating here a few thoughts that have arisen out of a range of questions and puzzles about language codes that have arisen over past year or so, inc reference to MARC mapping . . . .

Portal display

Firstly, in an essentially monolingual repository I can’t see a reason to include the language note in the portal display. To cover the exceptions when articles in languages other than English will be archived then surely the simplest add on is to enter a separate note field (originally entered in a MARC 546 in cases where repositories rely on migrating MARC records?) to make this clear. Though surely the title and abstract details themselves that are on the main display normally will tell users the language anyway. (The 546 field is a perfect place to enter “English” if one wants.)

Secondly, libraries used to using the MARC 546 field for language description as their main language identifying element may be running a risk if they rely on data in these fields to be migrated to a Dublin Core element. 546 is a free text field for language notes, not strictly for coded language values. The MARC language codes are entered in either the 008/35-37 fixed field or the 041 field or both. 546 potentially contains descriptive notes in any uncontrolled format.

eng, en, en-aus — what’s the difference?

But what of the variations one sees in standard codes for language? Frex, English can be entered as en, eng or en-aus.

eng, en and en-aus are all valid ISO/RFC standard formats for identifying the English language or English language as used in Australia.

The 3 letter code ISO 639 standard was largely derived from the MARC language codes. So default MARC entries that may appear in the 008/35-37 will be valid ISO 639 language codes.

But there is also a 2 letter ISO 639 standard code.

The reason for the difference is that the shorter code was designed for “terminologies, lexicography and linguistics” and the subsequent 3 letter code was developed for “bibliographic and terminology needs”.

For practical purposes machines harvesting repositories are not going to know the difference; they’ll read both.

See http://www.loc.gov/standards/iso639-2/faq.htm for the LOC FAQ site giving more detailed explanations.

Function of the language element

The primary function of the language element is to facilitate refined searching. International service providers obviously will best achieve this by recognizing standardized formats of data. Hence the value of having the ‘eng’ in MARC 008/35-37 and/or the ‘eng’ or ‘en-aus’ etc. in the MARC 041 to map as values for the dc.language element.