April 9, 2010

August 5, 2008

Australasian Digital Thesis (ADT) data map to MARC

Filed under: E-Theses and ETD conference,MARC — Neil Godfrey @ 10:09 am

Sharing here a crosswalk between Australasian Digital Thesis data and MARC. This has been done many times before, including by moi, but I am making my latest stab at it available for the benefit of others who are still in the early days of (Australian) repository implementation.

November 20, 2007

More MODS advantages over MARC

Filed under: MARC,MODS — Neil Godfrey @ 1:34 pm

I have recently had opportunities to work with MODS in creating templates for research and publications data external to the university repository environment.

Multiple note fields can be added each with its own customized display label in MODS. One can add multiple 500 general note fields in MARC but MARC does not recognize a $i display text to differentiate them.


<note displayLabel="Objectives" />
<note displayLabel="Background" />
<note displayLabel="Methodology" />
<note displayLabel="Progress" />
<note displayLabel="Implication" />

In the same way displayLabels can be used to distinguish quite different “title” types — e.g. the titles of research program and subprogram as opposed to the title of specific research activity.
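As a minimal sketch of the labelled notes above, the following Python builds them with the standard library's ElementTree. The helper name `build_notes`, the placeholder note text and the omission of the MODS namespace are all assumptions made for illustration; only the `note` element and its `displayLabel` attribute come from the post.

```python
import xml.etree.ElementTree as ET


def build_notes(labelled_notes):
    """Return a <mods> element holding one <note> per (label, text) pair.

    Hypothetical helper: the MODS namespace is left out to keep the sketch short.
    """
    mods = ET.Element("mods")
    for label, text in labelled_notes:
        note = ET.SubElement(mods, "note", {"displayLabel": label})
        note.text = text
    return mods


# Placeholder text only; the labels are taken from the list in the post.
record = build_notes([
    ("Objectives", "objectives text"),
    ("Background", "background text"),
    ("Methodology", "methodology text"),
])
labels = [n.get("displayLabel") for n in record.findall("note")]
print(ET.tostring(record, encoding="unicode"))
```

Each note keeps its own label, so a display stylesheet can tell them apart without inspecting the note text.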

And multiple affiliations and addresses can be nested with each personal name in MODS. MARC allows only one — NR (not repeatable) — $u subfield for an affiliation or address in each 100 or 700 field.

<name type="personal">
<namePart />
<affiliation />
<affiliation />
</name>

Those are TWO HUGE advantages as anyone trying to squeeze nontraditional data into digital archives will appreciate.
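To make the affiliation point concrete, here is a small Python sketch that nests several affiliations under one personal name. The function name `build_personal_name` and the sample name and affiliation strings are hypothetical, and the MODS namespace is again left out.

```python
import xml.etree.ElementTree as ET


def build_personal_name(name, affiliations):
    """Return a MODS-style <name type="personal"> with repeated <affiliation>."""
    name_el = ET.Element("name", {"type": "personal"})
    ET.SubElement(name_el, "namePart").text = name
    for affiliation in affiliations:
        # One <affiliation> per entry; MARC 100/700 would allow a single $u.
        ET.SubElement(name_el, "affiliation").text = affiliation
    return name_el


# Invented sample values, for illustration only.
person = build_personal_name(
    "Surname, Forename",
    ["Department A, University X", "Research Centre B"],
)
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
```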

June 26, 2007

dc.source — an attempt to clarify why it is not something else (updated 1.40 pm)

Filed under: Dublin Core,MARC,Repositories — Neil Godfrey @ 2:41 am

Librarians and their clients are used to thinking of sources as citations. And this carries over into confusion in the Dublin Core metadata world.

We are used to thinking of a bibliographic or cited “source” for an article, but in MARC “source” can mean an actual institution or donor who provided the material (tag 037 for “source of acquisition”) and in DCMI it can mean the page or book from which an article was scanned.

Like any term the word “source” is used differently depending on perspectives of users.

The following DCMI links may help clarify the DC meaning and use of their term “source” in their elements.

The DCMI definition of source is “Information about a second resource from which the present resource is derived”, and they give 2 examples:

  1. a page from which a picture was copied;
  2. a call number of a book from which pages were scanned.

In repositories especially we are depositing works by authors that are subsequently published in journals etc. So strictly speaking in this case the author is the source, though obviously we use ‘creator’ or ‘contributor’ in this case. And the publishing journal title is a subsequent related title. That journal might be a “source” of info for a student later on, but it is not the “source” of the original article itself, which is what repositories are dealing with, and which DC is attempting to isolate with this term.

And the whole thing gets more confusing when, as one of our partners commented, one uploads a postprint, a publisher’s version of a document. Is not the publishing journal title then the ‘source’, while this would not be the case with a preprint? Obviously this gets damn messy if we are going to be martinets about semantics. We naturally want a single source whether the document is a preprint or a postprint. But the confusion of this particular example also demonstrates why the publishing “journal title” cannot be the actual dc.source. And this leads into the MARC mapping from the host or publishing journal. . . .

Relation to the MARC 773 (or 787) tag

The mapping of the Host Item Entry MARC 773 to dc.relation is based on the standard LOC and DC crosswalks for these. One example is at

This conforms with the standard DCMI definition of relation. Note that the LOC standard description for 773 is “host item” and that too indicates a “relationship” to the document being archived in the repository. A more complete 773 field with page references for the article identified in the subfields technically turns the 773 into a “dc.identifier”. But no need to go there for now.
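The 773 mapping can be sketched in a few lines of Python. This is an illustration under assumptions, not the crosswalk itself: the function name `host_item_to_dc_relation`, the choice of $t (title) and $g (related parts) as the subfields to carry across, and the sample values are all mine.

```python
def host_item_to_dc_relation(subfields):
    """Build a dc.relation string from one MARC 773 field.

    `subfields` is an ordered list of (code, value) pairs. Keeping only
    $t and $g is an assumption of this sketch; other subfields are ignored.
    """
    wanted = ("t", "g")
    parts = [
        value.strip()
        for code, value in subfields
        if code in wanted and value.strip()
    ]
    return ", ".join(parts)


# Invented sample field, for illustration only.
relation = host_item_to_dc_relation([
    ("t", "Journal of Examples"),
    ("g", "Vol. 1, no. 2 (2007), p. 3-9"),
    ("w", "(ignored control number)"),
])
print(relation)
```

The host journal ends up as a related title, which is the point of the mapping: it never lands in dc.source.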

Language codes in repositories: English, eng, en or en-aus?

Filed under: Dublin Core,Harvesting,MARC,Repositories — Neil Godfrey @ 2:20 am

Collating here a few thoughts arising from a range of questions and puzzles about language codes that have come up over the past year or so, including reference to MARC mapping . . . .

Portal display

Firstly, in an essentially monolingual repository I can’t see a reason to include the language note in the portal display. To cover the exceptions, when articles in languages other than English are archived, surely the simplest add-on is to enter a separate note field (originally entered in a MARC 546 in cases where repositories rely on migrating MARC records?) to make this clear. Though surely the title and abstract details that are normally on the main display will tell users the language anyway. (The 546 field is a perfect place to enter “English” if one wants.)

Secondly, libraries used to using the MARC 546 field for language description as their main language identifying element may be running a risk if they rely on data in these fields to be migrated to a Dublin Core element. 546 is a free text field for language notes, not strictly for coded language values. The MARC language codes are entered in either the 008/35-37 fixed field or the 041 field or both. 546 potentially contains descriptive notes in any uncontrolled format.
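A rough Python sketch of that caution follows. It takes the coded value from 041 $a or 008/35-37 and never from 546. The record layout (a plain dict of tag to value), the function name `coded_language` and the preference for 041 over 008 are assumptions made for this example.

```python
def coded_language(record):
    """Return a coded language value from a MARC-like dict, or None.

    Hypothetical record shape: "008" is the 40-character fixed field as a
    string; "041" and "546" are lists of (subfield code, value) pairs.
    """
    # 041 $a holds language codes; preferring it over 008 is an assumption.
    for code, value in record.get("041", []):
        if code == "a" and value.strip():
            return value.strip().lower()
    # 008/35-37 are character positions 35 to 37, so the slice is [35:38].
    fixed = record.get("008", "")
    if len(fixed) >= 38:
        lang = fixed[35:38].strip().lower()
        if lang and lang != "|||":  # "|||" means no attempt to code
            return lang
    # 546 is free text, so it is deliberately never consulted.
    return None


fixed_eng = " " * 35 + "eng" + " d"
print(coded_language({"008": fixed_eng, "546": [("a", "Text in English")]}))
print(coded_language({"546": [("a", "In English and French")]}))
```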

eng, en, en-aus — what’s the difference?

But what of the variations one sees in standard codes for language? For example, English can be entered as en, eng or en-aus.

eng, en and en-aus are all valid ISO/RFC standard formats for identifying the English language or English language as used in Australia.

The 3-letter code standard, ISO 639-2, was largely derived from the MARC language codes. So default MARC entries that may appear in the 008/35-37 will be valid ISO 639-2 language codes.

But there is also a 2-letter ISO 639 standard code (ISO 639-1).

The reason for the difference is that the shorter code was designed for “terminologies, lexicography and linguistics” and the subsequent 3 letter code was developed for “bibliographic and terminology needs”.

For practical purposes machines harvesting repositories are not going to know the difference; they’ll read both.
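For a service that wants to treat en, eng and en-aus as the same language when searching, one possible approach is to reduce each value to its primary subtag and map 2-letter codes to their 3-letter equivalents. The sketch below assumes a tiny hand-written lookup table (`TWO_TO_THREE`) and a hypothetical function name; a real service would need the full code lists.

```python
# Tiny illustrative subset of 2-letter to 3-letter (bibliographic) codes.
TWO_TO_THREE = {"en": "eng", "fr": "fre", "de": "ger"}


def normalise_language(value):
    """Reduce en, eng or en-aus style values to a 3-letter code, else None."""
    # Keep only the primary language subtag, so "en-aus" becomes "en".
    primary = value.strip().lower().replace("_", "-").split("-")[0]
    if len(primary) == 2:
        return TWO_TO_THREE.get(primary)
    if len(primary) == 3:
        return primary
    # Free text such as "English" is not a code and is rejected.
    return None


for raw in ("en", "eng", "en-aus", "English"):
    print(raw, "->", normalise_language(raw))
```

Note that dropping the region subtag loses the "as used in Australia" distinction, which may or may not matter for a given search service.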

See the LOC FAQ site for more detailed explanations.

Function of the language element

The primary function of the language element is to facilitate refined searching. International service providers obviously will best achieve this by recognizing standardized formats of data. Hence the value of having the ‘eng’ in MARC 008/35-37 and/or the ‘eng’ or ‘en-aus’ etc. in the MARC 041 to map as values for the dc.language element.

June 8, 2007

MODS and MARC — what losses are there in crosswalking? do they matter?

Filed under: MARC,MODS,Repositories — Neil Godfrey @ 12:04 am

Is there any real disadvantage with using MODS as opposed to MARC for repository data storage?

Yes, there is some loss of data if one tries to walk from MARC to MODS. But why and when would one want to ever make that journey? But more importantly, what loss exactly does occur? The thought of losing data sounds fraught with horrendous potential to cataloguers so it pays to see exactly what is lost and then decide that the question is not whether there is loss of data but whether it matters — in the context of institutional repositories.

Check the title data as shown on the mapping guide at

245 $a$f$g$k maps to <titleInfo><title>
245 $b maps to <subTitle>
245 $n (and $f$g$k following $n) maps to <partNumber>
245 $p (and $f$g$k following $p) maps to <partName>
245 ind2 is not 0 maps to <nonSort>

So the granularity lost here is what is found in $f, $g and $k, which are:

$f Inclusive dates (NR)

The time period during which the entire content of the described materials was created.

$g Bulk dates (NR)

The time period during which the bulk of the content of the described materials was created.

$k Form (R)

A term that is descriptive of the form of the described materials, determined by an examination of their physical character, the subject of their intellectual content, or the order of information within them.

How often are those used in academic libraries? What will fall apart if they are combined in the one field with MODS?
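To see what the loss looks like in practice, here is a rough Python sketch of the 245 mapping listed above. The function name `map_245` and the sample subfield values are invented, and subfields outside the listed mapping (such as $c) are simply skipped in this sketch.

```python
def map_245(subfields):
    """Map ordered (code, value) pairs from a 245 field to MODS title parts.

    $f, $g and $k are appended to whichever of title, partNumber or
    partName they follow, so their separate identity is not kept.
    """
    out = {"title": [], "subTitle": [], "partNumber": [], "partName": []}
    current = "title"
    for code, value in subfields:
        if code == "b":
            out["subTitle"].append(value)
            continue
        if code == "a":
            current = "title"
        elif code == "n":
            current = "partNumber"
        elif code == "p":
            current = "partName"
        elif code not in ("f", "g", "k"):
            continue  # not covered by this sketch
        out[current].append(value)
    return {key: " ".join(values) for key, values in out.items() if values}


# Invented sample values, for illustration only.
papers = map_245([
    ("a", "Research papers,"),
    ("f", "1950-1980"),
    ("g", "(bulk 1960-1970)"),
])
print(papers)
```

The dates are still there for a human reader; what disappears is the ability to address them as separate data elements.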

What about the personal author field? (I’m not including the corporate author field because I do not yet know how the equivalent of a 110 would be entered into a repository that is for the purpose of archiving the works of academics. If the repository is to showcase the work of its academics, what room is there for a corporately authored document? Libraries store books by conferences and corporate bodies. Repository databases store the individual works of each author.)

100, 700 maps to <name> with type="personal"
100 maps to <role><roleTerm> with type="text" (use the text "creator" if desired, to maintain indication of "main entry")

100, 700 $a$q maps to <namePart>
100, 700 $d maps to <namePart> with type="date"
100, 700 $b$c maps to <namePart> with type="termsOfAddress"
100, 700 $e maps to <role><roleTerm> with type="text"
100, 700 $4 maps to <role><roleTerm> with type="code"
100, 700 $u maps to <affiliation> under <name>

This means we lose the first MARC indicator that defines whether the personal name is to be a Forename (0), Surname (1) or Family Name (3). Is that going to be a problem in the repository’s archive?

It also means that the fuller form of the name ($q) is not going to be demarcated from the initial entry of the name. If this is really an issue I am sure a stylesheet can be written to recognize the brackets surrounding that name and its granularity will still be preserved anyway.
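The post suggests a stylesheet for this; as a stand-in, here is the same idea as a Python regular expression. It assumes the fuller form is the last thing in the name part and sits inside one pair of round brackets, which is how $q is normally entered. The function name and sample name are hypothetical.

```python
import re

# Trailing "(fuller form)" with optional closing punctuation after it.
FULLER_FORM = re.compile(r"^(?P<name>.*?)\s*\((?P<fuller>[^()]+)\)\s*[,.]?\s*$")


def split_fuller_form(name_part):
    """Return (name, fuller_form); fuller_form is None when no brackets are found."""
    match = FULLER_FORM.match(name_part)
    if match:
        return match.group("name").strip(), match.group("fuller").strip()
    return name_part.strip(), None


# Invented sample name, for illustration only.
print(split_fuller_form("Smith, J. A. (John Andrew)"))
print(split_fuller_form("Smith, John"))
```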

$b for a name’s numeration is melded with $c, the titles and other words associated with the name. The $b only applies when the entry is a Forename anyway, which is not granularized either. I don’t personally see a problem with placing III’s and Sir’s in the one data entry field.
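The personal name mapping listed earlier can be sketched the same way. Again this is only an illustration: `map_personal_name` is a hypothetical helper, the MODS namespace is omitted, and the first indicator is not even passed in, which is exactly the loss discussed above.

```python
import xml.etree.ElementTree as ET


def map_personal_name(tag, subfields):
    """Map a MARC 100 or 700 field, given as (code, value) pairs, to <name>."""
    def values(codes):
        return [value for code, value in subfields if code in codes]

    name = ET.Element("name", {"type": "personal"})

    def add_name_part(codes, part_type=None):
        text = " ".join(values(codes))
        if text:
            attrs = {"type": part_type} if part_type else {}
            ET.SubElement(name, "namePart", attrs).text = text

    add_name_part(("a", "q"))                    # $q is merged with $a
    add_name_part(("d",), "date")
    add_name_part(("b", "c"), "termsOfAddress")  # $b is melded with $c
    for code, role_type in (("e", "text"), ("4", "code")):
        for value in values((code,)):
            role = ET.SubElement(name, "role")
            ET.SubElement(role, "roleTerm", {"type": role_type}).text = value
    if tag == "100":
        # Optional: keeps an indication of "main entry".
        role = ET.SubElement(name, "role")
        ET.SubElement(role, "roleTerm", {"type": "text"}).text = "creator"
    for value in values(("u",)):
        ET.SubElement(name, "affiliation").text = value
    return name


king = map_personal_name("100", [
    ("a", "Richard"), ("b", "III,"), ("c", "King of England,"), ("d", "1452-1485"),
])
print(ET.tostring(king, encoding="unicode"))
```

The III and the title share one termsOfAddress name part, as the paragraph above describes.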

No time to make a complete comparison. Maybe in a future post I can explore more. But title and main author are key elements and I don’t see any real issue in “lossed” data over using MODS in place of MARC with these.