June 29, 2007

ETD Uppsala conference update 7

Filed under: E-Theses and ETD conference — Neil Godfrey @ 5:57 am

Other miscellaneous notes, no longer attributable:

It is important when studying surveys etc. to observe what people DO, not just what they say they do. For example, people may not really want an interface like a Google box; they may really want a search box broken down into structured categories, e.g. one column for authors, an adjacent column for titles, another for types, another for year, and so on. People may well prefer structure, with tabs to tick etc., unlike Web of Science’s navigation.

It is important to be able to take users to the data itself so they can display it in their own preferred way (Web 2.0; cf. iTunes, Greasemonkey).

The resource type “thesis” in a metadata schema will need subdivisions covering not just the type of thesis (e.g. research, professional, coursework) but also whether it is a pdf, a scanned pdf, a thesis by publication, or multimedia (one student in the US is now doing hers on a wiki).
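A rough sketch of how those two axes (kind of thesis, and form it arrives in) might be kept apart in a record. The class and field names are my own illustration and are not taken from any existing schema; the values are just the examples from the note above.

```python
from dataclasses import dataclass
from enum import Enum


class ThesisKind(Enum):
    # Kinds of thesis mentioned in the note above
    RESEARCH = "research"
    PROFESSIONAL = "professional"
    COURSEWORK = "coursework"


class ThesisForm(Enum):
    # Forms of delivery mentioned in the note above
    PDF = "pdf"
    SCANNED_PDF = "scanned pdf"
    BY_PUBLICATION = "thesis by publication"
    MULTIMEDIA = "multimedia"


@dataclass(frozen=True)
class ThesisType:
    """Hypothetical two-part resource type: what kind of thesis, in what form."""
    kind: ThesisKind
    form: ThesisForm

    def label(self) -> str:
        # One display string built from both subdivisions
        return f"thesis / {self.kind.value} / {self.form.value}"


example = ThesisType(ThesisKind.RESEARCH, ThesisForm.SCANNED_PDF)
print(example.label())
```

Keeping the two as separate fields means a harvester could ask for "research" theses without caring whether they were scanned or born digital.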

From Last Day of Conference:

A session on plagiarism addressed mainly academic “cheating” rather than third party property issues and such. Another session focussed on catching up with getting all the other old print theses online — the logistics and strategies for coping with scanning these and adding them to a repository collection. One I regrettably missed but have since followed up via email (at least made personal contact at Uppsala) was a case study in Belgium of a library coping with changes that need to be made to the theses metadata over time, and how new policies and metadata issues in relation to etd’s are handled in repositories. Look forward to reading that paper in depth and reporting on the experiences it discusses.

In humanities at least, is there a need for a database separate from a thesis database, the separate one being for the massive supporting evidence underlying the theses?

And just to make it simpler, we should be preparing for cases of dissertations that are co-authored — with parts of the dissertations being re-prints from published journals….

These posts reflect, of course, my own experiences of the conference and not the totality of what was covered. So many nuggets come up at such conferences that do not lend themselves easily to this sort of note-taking, though I have tried a few times to include them — I know many more will come to me over time as specific contexts jog memory and that will be time for making more notes no doubt. Many of such nuggets come from informal discussions, question and answer sessions, and other asides…. One of the biggest benefits was simply in meeting others from around the world, all continents, who are involved in working towards the same things — and thus knowing where we at RUBRIC and Australia do fit in with the larger picture. This is invaluable for better knowing how to interpret many of the articles one reads, and the various policies and practices both locally here and elsewhere, and to keep in mind a practical vision of what is required for the goals of meeting the tech changes and requirements this imposes on metadata (my specialty of course) and other aspects of repository management.

I have much to follow up on now — and have already blanketed the globe with follow up emails to certain other delegates, some of whom I met there and others I may have met — and to examine afresh the metadata requirements of Australian ETDs — not to forget getting a larger view of other related repository issues as well. I have made references to specifics throughout the posts, and expect to share some of the followup work here in future posts.

And many at least now have heard of RUBRIC, too, both from personal contacts and more formal discussions following the sessions, not to forget of course the presentation of Peter Sefton! Now that was a real hit with many subsequent mentions in the sessions.  Many commented with envy that there was an organization like RUBRIC that would send a metadata delegate to such a conference almost as a matter of policy — to be on the cutting edge in order to deliver the best services possible. So I should thank RUBRIC management (past and present) for making it possible for me to attend.

ETD Uppsala conference update 6

Filed under: E-Theses and ETD conference — Neil Godfrey @ 4:36 am

Forgot to add earlier that the TDL uses the Manakin interface module, which is worth comparing with the normal DSpace view. In the next session I was able to discuss with others (e.g. Craig Thomas of MIT) their use of Manakin for improving DSpace’s functionality, and how they found it in ‘real life’.

Else Nygren addressed the differences between old and new ways of learning, which I found very interesting. She spoke of the problems of mixing Metalib with Google habits, the need to find out the habits of users, and the need to make content accessible across cultural and cognitive barriers. Asked afterwards who the “new users” are that a repository should be alert for, Else spoke of interested young people, not university students. I’m sure there are interested people among groups other than the young, too. I don’t see that public accessibility, open access, will be of interest exclusively to students and academics.

One of the most interesting sessions from my particular metadata perspective was Session 5’s “Discovery and Access” segment, and I made the most of using the Q and A session at the end of it. Sharon Reeves discussed user generated metadata for etd’s in Canada, in particular for the national LAC (Library and Archives Canada). Austin McLean of Proquest read Dr Livia Vasas’s (unable to attend in person) paper, and John Hagen of West Virginia libraries spoke on Building Effective Discovery Tools for Academic Promotion and Tenure Evidence. UMI’s PQDT (Proquest’s progenitor database?) apparently pays authors royalties on sales of copies of online theses? LAC uses ETD-MS — cataloguers don’t look at the record so there are no controlled vocabularies. (Compare this with the controlled subject vocabularies I noticed in other networks of repositories in the U.S.) There was a table in her presentation showing the relationship between MARC and ETD-MS which I must see in detail as soon as it is available. I was curious to know why ETD-MS was chosen by Canada (it has not been adopted by ADT in Australia reportedly because it is not yet a universally recognized or adopted standard.) I also wanted to know if it was chosen over comparisons with other metadata schema.

The other main query I had was the problem of reconciling different (international differences) meanings of the terms “doctoral” and “masters” etc. Not all doctoral theses are research theses in all countries, although that term might be the definition that explains it IS a research thesis in, say, the Netherlands. Clearly we cannot rely on or expect a common terminology. The differences in the terms are culturally and politically rooted. It is up to additional metadata fields to clarify the natures of each thesis type.

Place this in the context of the value of ETD-MS. I don’t think that that schema does justice to this problem. The global solution has not yet arrived, but this did highlight for me the importance of building the required granularity into the metadata schema now — whether through a MODS application or other. This is going to have to be a priority that I will want to work on and make a proposal for others here in Australia.

But while my time was with this session I was missing out on comparative developments in India and Japan. Clearly Australia needs to be in step with Asia as much as Europe given much of our research focus. But I am currently following up personal contacts made with some of the delegates from these countries.

Also missed was the DissOnline Portal presentation by Natascha Schumann of Germany’s National Library, a topic I’d really need to tackle with input from Peter Sefton of ICE-RS; also EthOS in the UK, but since briefly meeting Susan Copeland I have followed up the metadata issues and schema involved there, and will be making use of those in evaluating Australian needs.

The afternoon session was also a bit of a head spinner for me. There was a session on the power of pdf files now to embed video and sound files, thus enabling interactive simulations within pdf’s. But subsequent discussions with others showed a strong divide, and some necessary cautions, over this technology. Joan Cheverie of Georgetown Uni spoke on social science data and etd’s, and Austin of Proquest also made an appearance in this context, though there was no apparent linkage between the 2 institutions. These in part made reference to their use of controlled vocabularies, a topic of some interest to me at different levels, in contrast to the presentations either side of this one. Concept maps in NDLTD were discussed by Edward Fox. The limitations of Scirus, for one, in not listing the department awarding the thesis, were commented on. This underscored for me the impossibility of standard schema and terminologies, and the need for interoperable (read, in part, granular) local or national schema for future-proofing our databases. Again I found worthwhile the opportunity at the concluding Q and A to ask their views on the relative benefits of controlled vocabularies in the context of the available technologies. I know that many will find infrastructure constraints deciding this issue for them, but I did find myself leaning again, and further, towards maintaining controlled vocabs if at all possible.

Again, there were sessions I missed, and I look forward to catching up with some of those discussing situations in Italy and elsewhere. It is a plus to have made contact with the personnel involved, and to know that communication has begun with some whom I have since begun to follow up. The abstracts at least are online at the conference site at this stage, and probably email addresses too for others interested.

Some of the keynote speakers did succeed in their intention to be provocative, but some of the delegates felt they were being too much so, and if taken at their word one might be left with the impression that repositories have no place at all. But the balance that needs a place before going that far is the work of integrated systems, such as ICE and other systems working towards this world and in use in Sweden and elsewhere. Peter Murray-Rust’s presentations, for example, should be read in tandem with Peter Sefton’s.

June 28, 2007

ETD Uppsala conference update 5

Filed under: E-Theses and ETD conference — Neil Godfrey @ 10:16 am

Greg Crane spoke of the need for, and inevitability of, moving beyond book-imitation pdf files. He used Perseus Classics Online as an example of the potential we should be aiming towards, where texts contain multiple links for each word: to dictionaries, to other related texts, to commentaries. The potential impact will move us beyond the slow and limited intake of information that currently comes from reading lines at a time and then moving on to other texts: a 2 dimensional process, as opposed to the 3 dimensional or more organic structure possible with the sort of thing we now see at Perseus.

I don’t know the technical structure behind Perseus, but I know Perseus well enough to see it as one model for a future online database. As for metadata implications, what it is calling for is work on ontologies and the semantic web (I suspect Perseus is not based on that at present, but I could be wrong, and I see Greg has an article online discussing the Perseus project in more depth that I must read). That means RDF ideally, rather than traditional schema such as MODS or MARC or DC, though the RDF structured content could generate such schema when needed. (My thoughts arising from Greg’s presentation.)

Next session I attended covered Emory University’s work (Martin Halbert) on integrating IR’s upon Fedora, and building Web 2.0 web services on top of the Fedora repositories for ETD submission and admin and user/public dissemination processes. The approach is to balance flexibility and standards to achieve interoperability. I have requested a copy of the paper presented for this to investigate in more detail the metadata issues behind this balance of flexibility and standards.

I was intrigued by Adam Mikeal’s presentation on the Texas Digital Library. This is a consortium of libraries that deposit their ETD’s with the TDL — a federated collection of ETD’s apparently similar to our original Australasian Digital Theses Program. The metadata application used is a MODS application for theses, not ETD-MS. I had a brief discussion with Adam afterwards and have since received more info on the schema used. Keen to follow this through and see how it might be adapted for Australian needs.

An Indian presentation followed that pointed in a similar direction as the way the TDL is going — a centralized ETD repository — a national database collection. There are several ETD repositories in India but the IR scene is not uniform, hence the hopes for the national db to fill the need.

By attending that group of sessions I missed RUBRIC colleague Peter Sefton’s presentation, but, well, I have heard Peter discuss aspects of the Integrated Content Environment for Research and Scholarship (ICE-RS) piecemeal a number of times: in this context, it’s about writing and publishing a thesis, multimedia format, in pdf/html, with versioning controls in the process, and preservation and descriptive metadata . . . But check out the full story in his own presentation at USQ Eprints repository.

One can’t attend all simultaneous presentations and another I would have loved to have attended was another discussing how SURF (The Netherlands), JISC (UK) and DIVA (Sweden) have begun a project to harvest ETD’s from repositories internationally.

Where is Australia here? But having at least shaken hands with some of these people, and “being there”, gives one some hope that follow up contacts can begin to work towards making things happen for the Australian-New Zealand ADT program. Earlier this year I was appalled when email correspondence indicated that Australian repositories (Arrow Discovery Service) are a nonentity in the UK and Europe, and a bit player in some OAIster or SCIRUS harvesters. Will have to begin email links now between the Europeans and ADT here to see where we can move, and if that fails, to see what foundations can be laid to propel future collaboration between Australian IR’s (ETD’s being the driving force?) and the “world”.

Another presentation I missed while attending one that presented MODS for e-theses, was Ana Pavani’s (Brazil) “Looking at ETD’s from Different Points of View”. This promised a discussion of the considerable efforts put into metadata sets and union catalogue creation for the discovery of e-theses. I have already emailed Ana for more details to catch up here. In another presentation it was clear that ETD’s have the potential in many quarters, whether housed in separate collections or part of the rest of an IR, to promote the university or granting institutions given the right structures and metadata and recovery systems.

I also regretted not being able to attend NDLTD presentations, but I did meet several people from NDLTD and its UK and European sub-projects, and look forward to replies from emails I have since sent back to them to resume contact, and to continue online engagement in what is happening re international cooperative potentials for harvesting of ETD’s. (Again, where has Australia been till now!!)

To be contd….

June 27, 2007

ETD Uppsala conference update 4 — Australian/ADT requirements?

Filed under: E-Theses and ETD conference — Neil Godfrey @ 6:36 am

This is not really an update on the etd conference but a spinoff of thoughts from there, specifically about our Australian-Australasian situation.

We don’t need to adopt one of the theses metadata schema currently used in the US or Europe but we should develop something compatible with those while meeting our own needs.

The US ETD-MS could be seen as a minimal thesis schema, the simple dublin core with a handful of additional etd elements added. But the UKETD-DC is a much richer thesis schema. It is an application of simple DC, some DC refinements (qualified elements), and about 10 “local refinements” such as publisher.institution for the awarding institution, publisher.department for the author affiliation, and publisher.commercial for a publisher.

There is also the French schema (TEF — theses electroniques francaises) which incorporates DC, DCterms, METS, METSRIGHTS, as well as TEF thesis specific elements. Germany is revamping their html based MetaDiss into XMetaDiss to be xml based, and compatible with ETD-MS.

We should be contacting reps from GUIDE (Guiding Universities in Doctoral E-Theses) — a working group of the NDLTD focussed on European doctoral e-theses and NDLTD et al to be doing the equivalent in Australia and the ADT program.

Maybe the ADT program needs to be extended with a subbranch to look at harvesting other thesis types from repositories too?

I’m looking forward to studying the various e-thesis schema more closely with a view to Australian needs, and proposing something more concrete asap.

And not just the metadata schema — but a closer look at the multiple long term requirements for preservation and extensibility for theses in the broader Australian context.

June 26, 2007

MODS for Theses

Filed under: E-Theses and ETD conference,MODS — Neil Godfrey @ 2:57 am

I hope to discuss this more fully in a later post but am making available here a MODS application profile for theses in repositories.

Thanks to Adam Mikeal (from the Texas Digital Library consortium) for forwarding me this. Though it is no doubt also online elsewhere.

MODS application profile for theses

dc.source — an attempt to clarify why it is not something else (updated 1.40 pm)

Filed under: Dublin Core,MARC,Repositories — Neil Godfrey @ 2:41 am

Librarians and their clients are used to thinking of sources as citations. And this carries over into confusion in the Dublin Core metadata world.

We are used to thinking of a bibliographic or cited “source” for an article, but in MARC “source” can mean an actual institution or donor who provided the material (tag 037 for “source of acquisition”) and in DCMI it can mean the page or book from which an article was scanned.

Like any term the word “source” is used differently depending on perspectives of users.

The following DCMI links may help clarify the DC meaning and use of their term “source” in their elements.

The DCMI definition of source is “Information about a second resource from which the present resource is derived”, and they give 2 examples:

  1. a page from which a picture was copied;
  2. a call number of a book from which pages were scanned.

In repositories especially we are depositing works by authors that are subsequently published in journals etc. So strictly speaking in this case the author is the source, though obviously we use ‘creator’ or ‘contributor’ in this case. And the publishing journal title is a subsequent related title. That journal might be a “source” of info for a student later on, but it is not the “source” of the original article itself, which is what repositories are dealing with, and which DC is attempting to isolate with this term.

And the whole thing gets more confusing when, as one of our partners commented, one uploads a postprint, a publisher’s version of a document. Is not the publishing journal title then the ‘source’, while this would not be the case with a pre-print? Obviously this gets damn messy if we are going to be martinets about semantics. We naturally want a single source whether the document is a preprint or a postprint. But the confusion of this particular example also demonstrates why the publishing “journal title” cannot be the actual dc.source. And this leads in to the MARC mapping from the host or publishing journal. . . .

Relation to the MARC 773 (or 787) tag

The mapping of the Host Item Entry MARC 773 to dc.relation is based on the standard LOC and DC crosswalks for these. One example is at

This conforms with the standard DCMI definition of relation. Note that the LOC standard description for 773 is “host item” and that too indicates a “relationship” to the document being archived in the repository. A more complete 773 field with page references for the article identified in the subfields technically turns the 773 into a “dc.identifier”. But no need to go there for now.
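As a minimal sketch of the mapping just described, assuming the 773 field has already been parsed into a plain dictionary of subfields: the function name, the dictionary layout and the sample values are all hypothetical, and a real migration would use whatever MARC library and crosswalk the repository already relies on. It only uses 773 $t (title), $g (related parts) and $x (ISSN).

```python
from __future__ import annotations


def host_item_to_relation(subfields: dict[str, str]) -> str:
    """Build one dc.relation string from MARC 773 (host item entry) subfields.

    Uses $t (title), $g (related parts, e.g. volume and pages) and $x (ISSN).
    Missing or empty subfields are skipped.
    """
    parts = []
    for code in ("t", "g"):
        value = subfields.get(code, "").strip()
        if value:
            parts.append(value)
    issn = subfields.get("x", "").strip()
    if issn:
        parts.append(f"ISSN {issn}")
    return ", ".join(parts)


# Hypothetical 773 field for an article deposited in a repository
field_773 = {
    "t": "Journal of Examples",
    "g": "Vol. 3, no. 2 (2007), p. 10-20",
    "x": "1234-5678",
}

# The host journal goes to dc.relation, not dc.source
record = {"dc.relation": [host_item_to_relation(field_773)]}
print(record)
```

The point of the sketch is only where the value lands: the host journal is recorded as a relationship to the deposited document, whichever version of the article was uploaded.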

Language codes in repositories: English, eng, en or en-aus?

Filed under: Dublin Core,Harvesting,MARC,Repositories — Neil Godfrey @ 2:20 am

Collating here a few thoughts arising out of a range of questions and puzzles about language codes that have come up over the past year or so, including reference to MARC mapping . . . .

Portal display

Firstly, in an essentially monolingual repository I can’t see a reason to include the language note in the portal display. To cover the exceptions when articles in languages other than English will be archived then surely the simplest add on is to enter a separate note field (originally entered in a MARC 546 in cases where repositories rely on migrating MARC records?) to make this clear. Though surely the title and abstract details themselves that are on the main display normally will tell users the language anyway. (The 546 field is a perfect place to enter “English” if one wants.)

Secondly, libraries used to using the MARC 546 field for language description as their main language identifying element may be running a risk if they rely on data in these fields to be migrated to a Dublin Core element. 546 is a free text field for language notes, not strictly for coded language values. The MARC language codes are entered in either the 008/35-37 fixed field or the 041 field or both. 546 potentially contains descriptive notes in any uncontrolled format.
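To illustrate the point, here is a small sketch of a migration step that takes coded values only from 008/35-37 and 041 $a and never from the free-text 546. It assumes the 008 field is available as its raw 40-character string and the 041 $a values as a list; the function name is hypothetical, and I have assumed that only three-letter MARC codes are wanted.

```python
from __future__ import annotations


def coded_languages(field_008: str | None, subfield_041a: list[str]) -> list[str]:
    """Collect three-letter language codes from MARC 008/35-37 and 041 $a.

    The 546 language note is deliberately not consulted: it is free text.
    Blank or fill-character values in the 008 (e.g. '|||') are ignored.
    """
    codes: list[str] = []
    if field_008 and len(field_008) >= 38:
        # Positions 35-37 are counted from zero, so this is the slice [35:38]
        code = field_008[35:38].strip().lower()
        if len(code) == 3 and code.isalpha():
            codes.append(code)
    for value in subfield_041a:
        value = value.strip().lower()
        if len(value) == 3 and value.isalpha() and value not in codes:
            codes.append(value)
    return codes


# Placeholder 008 with only the language positions filled in
sample_008 = " " * 35 + "eng" + " d"
print(coded_languages(sample_008, ["eng", "fre"]))
```

Anything a cataloguer typed into the 546 would then stay a display note, and dc.language would only ever receive values from the coded fields.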

eng, en, en-aus — what’s the difference?

But what of the variations one sees in standard codes for language? For example, English can be entered as en, eng or en-aus.

eng, en and en-aus are all valid ISO/RFC standard formats for identifying the English language or English language as used in Australia.

The 3 letter code ISO 639 standard was largely derived from the MARC language codes. So default MARC entries that may appear in the 008/35-37 will be valid ISO 639 language codes.

But there is also a 2 letter ISO 639 standard code.

The reason for the difference is that the shorter code was designed for “terminologies, lexicography and linguistics” and the subsequent 3 letter code was developed for “bibliographic and terminology needs”.

For practical purposes machines harvesting repositories are not going to know the difference; they’ll read both.

See for the LOC FAQ site giving more detailed explanations.
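For anyone who would rather not leave it to the harvester, a rough sketch of folding the variants into a single three-letter form before export. The lookup table is a tiny hand-picked sample, not the full code list, and the function name is hypothetical. Anything after a hyphen (the regional part, as in en-aus) is simply dropped, and three-letter values are passed through without being checked against the standard.

```python
from __future__ import annotations

# Tiny sample of two-letter to three-letter (bibliographic) codes.
# A real table would be loaded from the published code list.
ALPHA2_TO_ALPHA3 = {
    "en": "eng",
    "fr": "fre",
    "de": "ger",
    "sv": "swe",
}


def to_three_letter(value: str) -> str | None:
    """Fold 'en', 'eng' or 'en-aus' style values into a three-letter code.

    Returns None when the value is not recognized, e.g. a free-text
    word such as 'English'.
    """
    primary = value.strip().lower().replace("_", "-").split("-")[0]
    if len(primary) == 3:
        # Passed through as-is; not validated against the full list
        return primary
    return ALPHA2_TO_ALPHA3.get(primary)


for raw in ("en", "eng", "en-aus", "English"):
    print(raw, "->", to_three_letter(raw))
```

A free-text value like "English" comes back as None, so it can be flagged for a person to look at.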

Function of the language element

The primary function of the language element is to facilitate refined searching. International service providers obviously will best achieve this by recognizing standardized formats of data. Hence the value of having the ‘eng’ in MARC 008/35-37 and/or the ‘eng’ or ‘en-aus’ etc. in the MARC 041 to map as values for the dc.language element.

June 17, 2007

ETD Uppsala conference update 3

Filed under: E-Theses and ETD conference — Neil Godfrey @ 6:42 am

Keeping up with daily updates as planned has proved physically impossible. Will have to catch up as soon as I catch up on sleep.

It’s been an interesting conference in many ways, and I have met delegates from South, Central and North America, Africa, southern Asia and most parts of Europe, Japan and Russia. I have no excuse now for not knowing where Australia fits in with the rest of the world, and for not placing questions of practice in the broader context.

Generally the best parts come from informal discussions after the different sessions. It is impossible to hear all the presentations one wants to, and I will have to catch up with some of those on the web afterwards. Have heaps to follow up on and many new contacts to work with and bounce issues off in future.

Many have been very impressed with an organization like RUBRIC for sending me to this conference, especially when I explained to them that it was their policy to keep us on the cutting edge of the issues.

June 15, 2007

ETD conference 2007

Filed under: E-Theses and ETD conference — Neil Godfrey @ 11:21 am

10th International Symposium on Electronic Theses and Dissertations Uppsala, Sweden, June 13-16, 2007


ETD Uppsala Conference Update2

Filed under: E-Theses and ETD conference — Neil Godfrey @ 6:50 am

Thoughts: theses as “resource types” are too diverse to ever find a truly adequate thesaurus breakdown to cover them all. But the research theses may be the only subtype of thesis that is truly in demand as a distinct harvestable type. And if harvesters are seeking these out they are likely to zero in on the national bodies that are increasingly collating these records. Suspect there is no need to go any further than the research type of thesis as the only desirable breakdown in a thesaurus. Or not even in a thesaurus of standard type terms. It could be left to the metadata (and sets) to have those research theses taken care of by the national bodies, e.g. ADT, and for harvesters to single those out for that type. Will discuss this further with others, including European and US contacts I have made so far at the conference.

As for metadata schema covering theses metadata, I also want to follow up the MODS-Thesis profile developed by the Texas based consortium. That may be the way to go: keeping it consistent with a schema that is already well supported sounds best. But will have a looksee as soon as I can see a copy of it, and compare it with the other schema for theses. And then see how it can be incorporated into a complete package that takes care of OAI harvesting for all records: general IR deposits (including theses as one of the rest), with research theses (ADT in Australia) being pulled out of that.

Conference speakers:

Greg Crane: need to prepare to move beyond book imitation pdf’s. Compare Perseus classics collection online — rich cross annotations of primary sources — “multimedia” — to become more common theses types in future.

To be contd.
