October 10, 2008

Are repositories set to be left out in the cold?

Filed under: Dublin Core,Harvesting,Repositories — Neil Godfrey @ 1:01 pm

Repositories and their harvesters have a rule of their own that violates Dublin Core standards. Because of this, are repositories and harvesters on target for a massive retroversion or major set of patches if they are to be a part of the semantic web? (I don’t know, but I’d like to be sure about the answer.)

Once again at a Dublin Core conference I listened to some excellent presentations on the functionality and potential applications of Dublin Core, but this time I had to see if I could poop the party and ask at least one speaker why the nice theory and applications everywhere simply did not work with the OAI harvesting of repositories.

I like to think that standards have good rationales. The web, present and future (e.g. the semantic web) is  predicated upon internationally recognized standards like Dublin Core. According to the DCMI site the fifteen element descriptions of Simple Dublin Core have been formally endorsed by:

  • ISO Standard 15836-2003 of February 2003 [ISO15836]
  • ANSI/NISO Standard Z39.85-2007 of May 2007 [NISOZ3985]
  • IETF RFC 5013 of August 2007 [RFC5013]

But there is one area where there is a clear conflict between DCMI element definitions and OAI-PMH protocols. The DC usage guide explains the identifier element:

4.14. Identifier

Label: Resource Identifier

Element Description: An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Examples of formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

Guidelines for content creation:

This element can also be used for local identifiers (e.g. ID numbers or call numbers) assigned by the Creator of the resource to apply to a particular item. It should not be used for identification of the metadata record itself.

Contrast the OAI-PMH protocol:

A unique identifier unambigiously identifies an item within a repository; the unique identifier is used in OAI-PMH requests for extracting metadata from the item. Items may contain metadata in multiple formats. The unique identifier maps to the item, and all possible records available from a single item share the same unique identifier.

The same protocol explains that an item is clearly distinct from the resource and points to metadata about the resource:

  • resource – A resource is the object or “stuff” that metadata is “about”. The nature of a resource, whether it is physical or digital, or whether it is stored in the repository or is a constituent of another database, is outside the scope of the OAI-PMH.
  • item – An item is a constituent of a repository from which metadata about a resource can be disseminated. That metadata may be disseminated on-the-fly from the associated resource, cross-walked from some canonical form, actually stored in the repository, etc.
  • record – A record is metadata in a specific metadata format. A record is returned as an XML-encoded byte stream in response to a protocol request to disseminate a specific metadata format from a constituent item.

I wrote about this clash of standards and protocols in another post last year. One response was to direct readers to Best Practices for OAI Data Provider Implementations and Shareable Metadata.

The working result for many repositories is a crazy inconsistency. Within a single Dublin Core record for OAI harvesting the same element name, identifier, can actually be used to identify different things:

   <dc:title>Using Structural Metadata . . . </dc:title>
   <dc:creator>Dushay, Naomi</dc:creator>
   <dc:subject>Digital Libraries</dc:subject>
   <dc:description>[Abstract here]</dc:description>
   <dc:description>23 pages including 2 appendices</dc:description>

In this OAI DC the first identifier identifies the splash page for the resource in the repository. The second identifier identifies the resource itself. It works for now, between agreeable partners. But how sustainable is such a contradiction? What is the point of standards?

As far as I understand the issue, this breakdown in the application of the Dublin Core standard is the result of institutional repositories needing their own branding to come between users and the resources they are seeking. Without that branding they would scarcely have the institutional support that enables them to exist in the first place.

Surely there must be other ways for harvesters to be aware of the source of any particular resource harvested and hence there must be other ways they can meet the branding requirement. Surely there is a way to retrieve an identified resource (not an identified metadata page about the resource) and to display it with some branding banner that will alert users to the repository — and related files and resources — where it is archived. Yes?

I mention “related files and resources” along with the branding page — but maybe this is a separate issue. Where a single resource consists of multiple files then is the metadata page a valid proxy for that resouce anyway? Or is there another way of displaying these?

Australia has had the advantage of a national metadata advisory body, MACAR. The future of MACAR into next year is still under discussion, but such an issue would surely be an ideal focus for such a body — to examine how this clash impacts the potentials of repositories today and in the future. A national body like MACAR has a lot more leverage for pioneering changes if and where necessary.

What should be done?

What can be done?

But is there more? more confusion of terms?

In having another look at the DCMI site for this post I noticed something else in the latest DC Element Set description page:

Term Name: identifier
Label: Identifier
Definition: An unambiguous reference to the resource within a given context.
Comment: Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.

DCMI recommends that an identifier be “a string”. In the context of RDF and the semantic web my understanding of “string” is a dead-end set of letters as opposed to a resolvable uri or “thing”. But the DC Usage Guide “explains” that an applicable formal identification system allowed here can also be a URI.  So what don’t I understand about the difference between strings and (RDF) things, now?

May 16, 2008

Reflecting on falling through the cracks, and segmented leadership in the Australian repository scene

Filed under: Harvesting,Repositories — Neil Godfrey @ 1:01 am

My last post recollecting the time it took to learn about the difference between a DCMI rule and an OAI-PMH rule for the meaning of dc:identifier — a difference that only made sense in the context of the politics of what repositories are about — in hindsight looks very embarrassing. It’s obvious when you know.

But it was not obvious to everyone I spoke to who is closely tied up with the DCMI community. And asking strangers via email questions about something complex which one is learning from scratch can be fraught with the cloudiness of not quite understanding what the real issue is, and therefore how to frame the question, and the two parties not quite understanding the frames of reference of each other.

The answer only finally became obvious after several face to face encounters at a conference, and then finally finding “the key person” — a harvester woman! — to talk to, with pen and paper and lots of doodling diagrams. Till then, some who were specialists in their particular area were saying the conflict ought not to exist, and that it needed to be fixed. I was beginning to think I knew more of the issues and questions than the veterans, if not the answers. But it was really only a question of finding which one of the scores of people in the room to single out for this particular question. A question relating to DC did not necessarily mean that a DC specialist theorist would know how to answer it.

Lesson: in something as complex and new as repositories and their related activities such as harvesting, we cannot rely on the normal channels of communication and learning that work for well-established protocols and systems, as we have with normal library functions. I found massive background reading was essential, and even then there turned out to be gaps that were only filled by direct personal exchanges.

It’s a team sport, with all players needing to share their experiences and issues, and to get together often (not exclusively virtually either) to plan and discuss what they are doing, hassling over, etc. Then the simple and obvious things really are.

But that’s hardly the optimum way of operating — it’s too easy for one to fall through cracks along the way and wait to be picked up and dusted off.

That was when I was involved with simple “first generation” repositories. Deposit an object, retrieve an object, with all the preservation and authentication bits in between.

There were other issues too even at this basic level. Some harvesters complained that the data they were picking up from repositories included a lot of “noise”. Sometimes a maverick repository would use a DC element for data unrelated to its real purpose. In other cases multiple terms would be used to describe the one type of resource (e.g. periodical, newspaper, journal). And in other cases there would be too many of the same DC elements coming through (e.g. Date) without any obvious differentiation (e.g. date published, date copyrighted, date awarded, etc).

None of those was or is insuperable. Why not (relatively) simply set up a program that will enable the harvester to streamline the data it receives — so that the known common alternatives (e.g. periodical, magazine) were all dumped under the resource type “journal” or whatever the desired appropriate standard? Or in the case of multiple undifferentiated dc elements like dc:date, then would it be too difficult for a specialist harvester to take the initiative and introduce a slightly modified dc schema (a DC application profile, possibly one already in successful use elsewhere) for, say, theses?  There are other work-arounds, but a business-case / cost-benefit study should help assess the best alternative for the long-term.

One reason they have not been resolved here until now may be, I think, because Australia has lacked a coordinating or leadership body relating to these areas. Ad hoc team-work has its limits. Australia has had a number of bodies — ARROW, the National Library’s Discovery Service and the Australasian Digital Thesis Program and other libraries who  relate only to one or two of these — working on their own remits without a real coordinating vision.

Each of these bodies grew up like Topsy. There is now MACAR, and that body is looking at recommending metadata standards for repositories. What it is working on is important. But it does not have the resources to meet often enough and to make its presence felt strongly enough, or to address comprehensively enough the key issues affecting all stakeholders, to be seen as a coordinating leader providing the vision and programs needed to smooth out the issues each separate body feels are part of the way things must be for the foreseeable future. Leadership in Australian libraries has traditionally come from the National Library. What I missed when learning of the multi-faceted issues of repositories and metadata was something like a National LIbrary coordinating leadership in this area. Such a nationally recognized body (or one with clear  sponsorship by the National Library) might have had the means to lead in smoothing out the respective issues faced by each discrete part of the repository-harvesting picture.

But now there are other developments on the horizon that appear to have the potential to augment the very purposes and functionings of repositories. Till now Australian repositories have mainly been storage bins for single objects, sometimes multi part or multi file objects. They are often promoted as vehicles to showcase an institution’s (and an individual academic’s) scholarly output. But the next stage may be to use repositories as tools for research as and with the needs of end users being the main rationale.

Going beyond first generation repositories, — in scholarly communities a single work can consist of many parts — a text discussion, datasets relating to the text, specialized types of images that are not only illustration but the very source and object of analysis. The sort of idea now being worked out is that of a user being able to draw out a representation of such an image from one repository and compare it with data harvested from another research repository in a single operation.

Developers are currently testing ways to harvest not just the representations of single pdf or jpeg objects from repositories, but to harvest, say,  URI’s assigned to selected parts different objects across a number of repositories. In simplest terms, it may be possible, for example, to “harvest” or “create” a complete journal edition from the multiple journal articles scattered across a range of repositories. Okay, why. But think through the possibilities once it is understood that’s the sort capability we want to establish.

The implications are vast. Different types of repositories from a range of institutions need to be part of a framework that has the sort of consensus that will make this possible. The technological infrastructure. The institutional support for each repository and the agreement on standards and policies that will support this, and the growth of a research community program that will facilitate the use of all this.

The spadework for some of this has now begun with OAI-PMHs sponsorship in the OAI-ORE project. Proposals for an Australian Data Commons are now being tabled. With effort, planning and maturing these early-day visions and testings will generate their own leadership.

But in the meantime university library repositories have proved how responsive they are willing to be to a national leadership plan and vision. They all focussed on “what to do now” with the immediate future in mind when that was clearly spelled out as RQF. Okay, money and a bit of compulsion were at play there too. But they did not exercise their collective freedom to dig in and protest.

Libraries like authorities, whether AACR2 or the National Library. And being able to confidently adapt an authority to one’s own institutional requirements, without sacrificing anything important, makes some tasks worthwhile. In that sense, the authorities are seen as friendly guides towards the vision, with whom they willingly cooperate.

Till now changes have been happening so fast that there has scarcely been time for an acknowledged leadership in these areas to emerge. Everyone is grappling to learn their areas — and sometimes something can fall through the cracks for a time, waiting to be rescued along the way. The leaderships that do exist are in segmented areas.

September 2, 2007

how repository display configurations can clash with oai harvesters

Filed under: Harvesting,Repositories — Neil Godfrey @ 3:14 am

The basic metadata supporting OAI harvesting is Simple Dublin Core. A data provider (repository) that intends to be compliant with the requirements of OAI harvesting will produce an unqualified DC datastream as a minimum requirement.

At least one repository solution, VTLS’s VITAL, is designed to use the simple DC data as the basis for the repository’s metadata splash page that contains the repository institution’s branding and is used to direct the user to the archived document that the user is requesting. (As far as I know at this point this is not an issue with open source repositories such as Eprints and Dspace.)

This means the repository is attempting to use a single datastream for two different purposes. That becomes a problem if the oai-dc data is constructed in a way that meets one purpose (e.g. the oai harvester), but that particular dc construct is not what we want for the other function (e.g. the portal display of the metadata page). That metadata display page with the link to the deposited document would be better linked to some other datastream — such as a MARC or MODS or VRA or anything OTHER than the OAI-DC data configuration.

This means that a repository manager must be very clear about exactly what it wants a service provider (SP) to display from its repository. For example, does one want the service provider to display the repository’s metadata splash page for each document, so that public users will be first directed to their metadata details for a particular record, where institutional branding also is prominent, and from there link to the full article or document? Or does one want to cut out one’s repository branding and descriptive metadata page and allow the SP to take a user directly to the article. What the SP will do will depend on how certain data is entered into the oai-dc datastream.

When an SP receives a request for a particular article in your repository, it will rely on the oai-dc record to “identify” that particular article. It thus looks for a dc.identifier value with a resolvable URI link.

This means that:

  1. If the URI value in a dc.identifier is the link to the repository’s metadata page, complete with the full descriptive metadata record of the article, institution’s branding, and link to the full text of the document, then the SP will direct users to this repository page.
  2. If, however, the URI value in the dc.identifier is the link directly to the article itself, possibly offline at, say, a publisher’s site, then the SP will bypass the repository metadata page and direct users directly to the article wherever it is located.
  3. If there are dc.identifier values that are non-resolvable text strings such as an ISSN the SP will ignore these for this purpose.

Normally a repository can and will be configured so that documents deposited into it will generate in the oai-dc a dc.identifier value that is a handle or link to the repository metadata page first. 

But if the repository contains only a link to an offsite copy of the document, and if this is also entered into a dc.identifier field, the SP will direct users away from the repository to the off-site document. No problem, perhaps, for the self-effacing repository manager who wishes to serve the user more than the reputation of the institution supporting the repository, but not politically savvy if one of the very arguments presented to fund the repository in the first place was that it would increase the institution’s exposure to the world. That mediating branding page is normally pretty important.

There are two or three ways around this but each has drawbacks.

One can enter an offsite link to dc.relation instead of dc.identifier. This reserves the dc.identifier field for the repository default metadata page link — normally machine generated by the repository itself.

Another solution is to enter the offsite link into the dc.format so that it would look like this:

<dc.format>PDF </dc.format>

Either of these solutions will cause a problem for the manager whose repository is dependent on mapping its portal display options from that same oai-dc record.

It will mean that a portal display link to a deposited record within the repository itself will be mapped from dc.identifier, while a portal display link to an offsite record will be mapped from dc.relation or dc.format.

So the consistency issue arises if one’s repository depends on mapping its display from the same data that is used for oai harvesting.

 An offsite link will have to be treated the same way for display purposes (same portal display label terms) as all other values entered in other dc.relation or dc.format fields.

That can cause headaches. One wants to show an onsite link and an offsite link to an article in a similar way for users. What is important to them is that they can see at a glance a constant way to get to the article regardless of where it is stored. One does not want to present an offsite link in a way that looks like it is not a link to the article described, but to some “relation” of it, for example. One can rename “relation” for the portal display, but it means that whatever display name is chosen for the display must be constant for all other display names mapped from that same dc.relation in the oai-dc field.

And whatever solution is decided upon will need to consider preservation and sustainability questions. One day the records will be migrated to some other software — what will happen to any such solutions then? What systems will ensure consistency over time within a repository, and what issues will arise in the broader world of databases needing to be able to talk to each other in the future?

Repository designs need to allow for record and metadata displays to be configured independently of the data used for oai harvesting.

July 16, 2007

July 14, 2007

June 26, 2007

Language codes in repositories: English, eng, en or en-aus?

Filed under: Dublin Core,Harvesting,MARC,Repositories — Neil Godfrey @ 2:20 am

Collating here a few thoughts that have arisen out of a range of questions and puzzles about language codes that have arisen over past year or so, inc reference to MARC mapping . . . .

Portal display

Firstly, in an essentially monolingual repository I can’t see a reason to include the language note in the portal display. To cover the exceptions when articles in languages other than English will be archived then surely the simplest add on is to enter a separate note field (originally entered in a MARC 546 in cases where repositories rely on migrating MARC records?) to make this clear. Though surely the title and abstract details themselves that are on the main display normally will tell users the language anyway. (The 546 field is a perfect place to enter “English” if one wants.)

Secondly, libraries used to using the MARC 546 field for language description as their main language identifying element may be running a risk if they rely on data in these fields to be migrated to a Dublin Core element. 546 is a free text field for language notes, not strictly for coded language values. The MARC language codes are entered in either the 008/35-37 fixed field or the 041 field or both. 546 potentially contains descriptive notes in any uncontrolled format.

eng, en, en-aus — what’s the difference?

But what of the variations one sees in standard codes for language? Frex, English can be entered as en, eng or en-aus.

eng, en and en-aus are all valid ISO/RFC standard formats for identifying the English language or English language as used in Australia.

The 3 letter code ISO 639 standard was largely derived from the MARC language codes. So default MARC entries that may appear in the 008/35-37 will be valid ISO 639 language codes.

But there is also a 2 letter ISO 639 standard code.

The reason for the difference is that the shorter code was designed for “terminologies, lexicography and linguistics” and the subsequent 3 letter code was developed for “bibliographic and terminology needs”.

For practical purposes machines harvesting repositories are not going to know the difference; they’ll read both.

See for the LOC FAQ site giving more detailed explanations.

Function of the language element

The primary function of the language element is to facilitate refined searching. International service providers obviously will best achieve this by recognizing standardized formats of data. Hence the value of having the ‘eng’ in MARC 008/35-37 and/or the ‘eng’ or ‘en-aus’ etc. in the MARC 041 to map as values for the dc.language element.

June 5, 2007

March 28, 2007

Thesis types in repositories

Filed under: E-Theses and ETD conference,Harvesting — Neil Godfrey @ 6:53 am

Australian-NZ repositories need to be able to separate their ADT material into a single Set for ADT harvesting.

Repository Metadata for thesis so far is generally in the following places:

Marc: 502

DC: type or DC:description

The ETD standard metadata is to use and qualifiers.

It’s probably a good idea to have ETD-MS in a thesis record for more than just ADT purposes. And if ETD-MS is in there, then ideally that could inform the RDF of the types of theses that are required for the ADT Set.

But how to get the ETD-MS data?

Most 502 MARC fields I know enter the generic “Thesis” instead of the name of the Thesis in the first part.

502 $a Thesis (PhD) — USQ, 2005

Can we get a program to read a MARCXML 502 field to translate the (PhD) part to be turned into “Doctor of Philosophy” as the value for the element, and then again into “doctoral” for the element?

What work would be involved? Thinking of additional namespaces and things in the data.

Next Page »