Since the ETD-2007 conference I have followed up several discussions with stakeholders in repositories and electronic theses in Asia, Europe and North America, attempting to assess best practice for Australian repositories within the international context. Scholars are going to want to know what has been produced globally. The trick has been working out how best to balance national requirements against future-proofing for international interoperability. To date I have advised that the best we can do at the moment is to stick with a granular scheme like MARCXML or MODS until some clarity emerges from the cloud of different ETD metadata schemas, current harvesting options, and the various consortia, national and international efforts to impose some order on the various types of theses.
I’ve since read the report produced by the UK’s JISC, the National Library of Sweden and the SURFfoundation in the Netherlands. It covers a pilot project that attempted to harvest repositories across 5 nations to evaluate issues of interoperability. The following recommendations are the best I think can be made for any Australian repository that wants to save itself a lot of hassle in future over the exposure of its digital theses. They are made with the global picture in mind and with a view to what can reasonably be expected in the short term.
Many of these recommendations are from literature cited within the report and not specified directly in the report itself:
- Title of thesis: Use colons to separate subtitles. Repeat for multiple titles.
- Author/creator: Invert the name (surname, forename, prefix). If initial and full name are available, enter as: Jannsen, J. (John). Generational suffixes (Jr., Sr.) follow the family name. If in doubt, give the name as it appears – do not invert. Omit titles like Dr, IR, etc.
- Do not enter an author’s affiliation in the same element field as the author/creator. (But note MARC allows affiliation in the $u subfield of the 100 field – do not enter name and org in the one subfield.)
- Do not confuse author/creator with publisher or contributor.
- Subject: can be keywords, keyword phrases or classification codes/schemes. Use the first occurrence of the dc.subject element for the keywords. Either enter one keyword or keyword phrase per dc.subject element or separate a string of keywords/keyword phrases with a semi-colon. Avoid keywords that are too general – opt for the most specific. If the subject is a person or an organization use the same form of the name of the person or org that you would use if they were in the Author/creator element. In cases of controlled or standard classification schemes, encode each term in a separate element. Use the capitalization in the scheme.
- Description: can be abstract, table of contents, other ….
- Publisher: Use this for the commercial or noncommercial publisher – not for the institution the author is affiliated with or otherwise associated with the creator. In cases of organizations where there is a hierarchy – list parts of the hierarchy from largest to smallest, separated by full stops. If unclear about hierarchy, enter as it appears.
- Contributor: Use ONLY for a supervisor and not for other types of contributors (e.g. jury members, editors, data collectors).
- Date: Recommended best practice is ISO 8601 [W3CDTF] – follows YYYY-MM-DD format, with plain hyphens. (The YYYY is mandatory; the MM and DD are optional.) Use only ONE Date – and this for the date the thesis is “published”. Additions like Zulu Time in DSpace should not be part of the metadata.
- Type: The DRIVER thesaurus includes Doctoral thesis, Master thesis and Bachelor thesis. These are the 3 recommended terms.
- Format: Use a controlled vocabulary (e.g. internet media types [MIME] defining computer media formats). It is also recommended to have one dc:format element “text” if it is a text file.
- Language: ISO 639-1 – that is, the 2 letter code: en for English
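A few of the rules above are mechanical enough to check before records are exposed for harvesting. What follows is only a rough sketch in Python, under my own assumptions: the function names (normalise_date, split_subjects, format_author) are invented for illustration and come from neither the report nor DRIVER, and the placement of the name prefix after the forename is my reading of "surname, forename, prefix".

```python
import re
from datetime import date

# YYYY is mandatory; -MM and -DD are optional (W3CDTF date-only forms).
_DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def normalise_date(value):
    """Return YYYY, YYYY-MM or YYYY-MM-DD, or None if the value does not fit."""
    # Drop a trailing time part such as the "T00:00:00Z" that DSpace may add.
    day_part = value.strip().split("T", 1)[0]
    match = _DATE_RE.match(day_part)
    if match is None:
        return None
    year, month, day = match.groups()
    try:
        # date() raises ValueError for impossible months or days.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return None
    return day_part


def split_subjects(value):
    """Split a semi-colon separated keyword string into one value per element."""
    return [part.strip() for part in value.split(";") if part.strip()]


def format_author(surname, forename="", initials="", prefix="", suffix=""):
    """Build an inverted name; prefix placement is an assumption (see lead-in)."""
    # Generational suffixes follow the family name.
    family = f"{surname} {suffix}".strip()
    if initials and forename:
        given = f"{initials} ({forename})"
    else:
        given = initials or forename
    given = f"{given} {prefix}".strip()
    return f"{family}, {given}" if given else family


print(normalise_date("2008-02-15T00:00:00Z"))
print(split_subjects("metadata; electronic theses ; OAI-PMH"))
print(format_author("Jannsen", forename="John", initials="J."))
```

Note that titles such as Dr are simply never passed in, and a name whose structure is in doubt should bypass format_author and be entered as it appears.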
At another level:
- Use metadata format with high granularity and high detail (add extra fields for Eprints or extra qualified DC fields in DSpace)
- For mapping these to a simple schema like DC, follow a mapping used by active communities – e.g. the LOC mapping guides; the EPRINTS mapping to DC
- Allow for separate elements for thesis specific data:
  - a field to say it is a “doctoral thesis”;
  - a field to tell when it was “published”;
  - a field to indicate the supervisor;
  - name of the degree;
  - level of the degree;
  - country the degree was given in (in case of cultural differences on the value of the degree);
  - discipline of the doctoral thesis;
  - optionally, the date on which an embargo, if any, is to be lifted (not to be confused with dc.date).
- Where MARC or MODS is the base schema, entering the above data will ideally mean adding additional fields in MARC and MODS, each holding a small piece of data.
- Do not use a specific ETD schema such as the North American ETD-MS or the UK’s UKETD_DC. Neither is widely (globally) harvested and each has limitations. (Arguably the best thesis schema is a German one, which also has limited application at present.) Keep things generic with a granular schema such as MARC or MODS (or extra fields in Eprints or Qualified DC in DSpace).
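To make the "granular record, simple mapping" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the ThesisRecord field names are hypothetical, the sample data is invented, and the mapping is not the LOC or EPRINTS mapping, only a toy stand-in for one. The point it shows is that the thesis-specific data is held in separate fields, while the flattened simple DC output carries only what simple DC can express.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ThesisRecord:
    # Hypothetical granular record; field names are illustrative only.
    title: str
    author: str
    supervisor: str
    published: str             # YYYY, YYYY-MM or YYYY-MM-DD
    thesis_type: str           # e.g. "Doctoral thesis"
    degree_name: str
    degree_level: str
    degree_country: str
    discipline: str
    language: str              # ISO 639-1 two-letter code
    embargo_until: Optional[str] = None


def to_simple_dc(record: ThesisRecord) -> List[Tuple[str, str]]:
    """Flatten the granular record to (element, value) pairs in simple DC."""
    # Degree name, level, country, discipline and the embargo date are
    # deliberately not mapped here: they stay in the granular record only,
    # and the embargo date is never emitted as dc:date.
    return [
        ("dc:title", record.title),
        ("dc:creator", record.author),
        ("dc:contributor", record.supervisor),  # supervisor only
        ("dc:date", record.published),          # the one "published" date
        ("dc:type", record.thesis_type),
        ("dc:language", record.language),
    ]


# Invented sample data, for illustration only.
sample = ThesisRecord(
    title="Harvesting theses: an example",
    author="Jannsen, J. (John)",
    supervisor="Smith, A. (Anne)",
    published="2008",
    thesis_type="Doctoral thesis",
    degree_name="PhD",
    degree_level="Doctoral",
    degree_country="AU",
    discipline="Information science",
    language="en",
    embargo_until="2010-01-01",
)

for element, value in to_simple_dc(sample):
    print(f"{element}: {value}")
```

The flattening is lossy by design, which is the argument for keeping the richer record as the master copy and generating simple DC from it.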
I have not given a rationale or authority for these recommendations here — many can be found in the report or its cited documents. I will be doing a more thorough presentation of these in the coming weeks though not on this blog. This is just the dot-points.
The SURF report can be found at http://www.surffoundation.nl/smartsite.dws?ch=ENG&id=13191 (Some of the recommendations above are from DRIVER cited in the report.)
The above are future-proofing recommendations. Some elements such as the Supervisor are not harvested by the ADT service, at least not yet. But that element’s day will come.