July 6, 2007

Why bother with a specialist etd schema anyway?

Filed under: E-Theses and ETD conference — Neil Godfrey @ 3:53 am

Just to be contrary (not really, merely expressing thoughts in flux; that’s what blogs are for, right?), why bother with an ETD metadata schema at all? Why treat theses any differently from any other resource in repositories? Obviously there are some specific differences that need attention when it comes to images or videos in comparison with text or PDF files, but this is a function of format. The concept “thesis” is of course not a format but an idea based on intellectual content.

By all means maintain the uniquely “thesis” metadata in a repository record (awarding institution, degree name and level, etc), but for harvesting purposes, is there any need to go beyond what is already available through simple Dublin Core data?

Simple Dublin Core enables the harvesting of all scholarly repositories by OAI-PMH compliant service providers. The dc.type element can be used to limit searches or harvesting to specific subgroups in a repository (e.g. theses!). And anyone searching a particular topic among scholarly works need not have their search limited to a particular type of thesis; it can also embrace scholarly research reports, other discussion papers and peer-reviewed findings. Is that not a more valuable service to users than a limited “research thesis only” database?
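As a rough illustration of the dc.type point, here is a minimal Python sketch of how a service provider might narrow harvested simple DC records to theses. The sample response, its titles, and the function name `titles_by_type` are invented for the example; only the OAI-PMH and Dublin Core namespace URIs are standard. Real repositories use varying type vocabularies, so treat this as a sketch under those assumptions rather than a working harvester.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, heavily trimmed ListRecords response (not from a real repository).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata><oai_dc:dc>
      <dc:title>Metadata practice in repositories</dc:title>
      <dc:type>Thesis</dc:type>
    </oai_dc:dc></metadata></record>
    <record><metadata><oai_dc:dc>
      <dc:title>Survey of deposit workflows</dc:title>
      <dc:type>Research report</dc:type>
    </oai_dc:dc></metadata></record>
    <record><metadata><oai_dc:dc>
      <dc:title>Harvesting scholarly works</dc:title>
      <dc:type>Text</dc:type>
      <dc:type>thesis</dc:type>
    </oai_dc:dc></metadata></record>
  </ListRecords>
</OAI-PMH>"""


def titles_by_type(xml_text, wanted=None):
    """Return record titles, optionally keeping only records with a matching dc:type.

    Matching is case-insensitive and exact; a record may carry several dc:type values.
    """
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iterfind(".//oai:record", NS):
        types = {
            (t.text or "").strip().lower()
            for t in record.iterfind(".//dc:type", NS)
        }
        if wanted is None or wanted.lower() in types:
            title = record.findtext(".//dc:title", default="", namespaces=NS)
            titles.append(title.strip())
    return titles


if __name__ == "__main__":
    print(titles_by_type(SAMPLE, "thesis"))  # theses only
    print(titles_by_type(SAMPLE))            # every scholarly work, whatever its type
```

Calling the function without a type gives the broad topic search across all scholarly works, while passing a type gives the narrower thesis-only view. The exact-match test only works as well as the consistency of the dc.type values that data providers actually enter.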

These questions may sound rhetorical but they are not.

Thought is being given in Europe and the US to improving ETD-MS, UKETD_DC and other schemas. How much of this is necessary or possible? “Doctoral” can refer to a research thesis or a coursework thesis. Research theses can be doctoral or masters. There is no international uniformity in the meaning of thesis types. Can we go far wrong if we focus instead on simply making the repository databases of scholarly works available in the most efficient ways, rather than on discrete (nationally and even regionally distinct) terms within those databases?



  1. I’m ruminating on your post, so I’m looking much harder at the dc:type element, and perhaps the genre element (which would be 008/24 (m) theses, which is where I would look for it in MARC, though I don’t see that mapped in your comparison doc).
    The MODS type element is the high-level ‘text’ (I think some cells are still off in the comparison doc). I don’t see where in simple DC the theses element would be clearly and simply encapsulated (as opposed to UKETD_DC). At LAC, the docs say that the harvesting uses DC or ETD-MS, but there is a note that ETD-MS is not supported by DSpace, so a further crosswalk is required. Though I can’t see where ETD-MS might need further improvement, I don’t quite see yet how simple DC could do the trick (enlightenment is welcomed).

    Comment by Mia — July 6, 2007 @ 3:17 pm

  2. Hi Mia,

    My interest is in finding solutions that will work across a range of repository solutions in a variety of university library environments. (RUBRIC is a support group to assist universities establish repositories — and that has involved testing and supporting a range of solutions.)

    Where libraries are importing theses from MARC catalogue records, many may be able to rely on the 008/24. But I am thinking more of ongoing data entry / document deposit contexts. To this end, if one is using a MARC datastream in a repository (such as a Fedora-based repository, which most of our partners are currently using), then the 655 MARC field seems the best default field for the “type” element (using a defined thesaurus of terms). There can be multiple “type” fields too. (But on both these points do feel free to provoke revisionist thought! The repository world is still a long way from the concrete being set.)

    As for the DC element “type”, the only relevant DCMI term listed is “text”, which is clearly insufficient for scholarly works. But the DCMI definition of “type” does include “genre” (“Type includes terms describing general categories, functions, genres, or aggregation levels for content.”) I’ve struggled over the definitions of the different concepts and compared these with actual widespread practice among both repository data providers and OAI service providers. My conclusion is that the use of DC.Type for more narrowly defined types of scholarly resources (“dc.type thesis”, “dc.type book”, etc.) is widely enough practiced, understood and recognized in the OA environment (including Eprints, DSpace and Fedora-based repositories) to work well, both for Sets creation and for more general OAI harvesting and searching.

    Is this view short sighted? Are we running into problems up ahead, in your view?

    Comment by neilgodfrey — July 6, 2007 @ 11:50 pm

  3. Hi Neil,

    As I’ve only started to turn my attention to this particular set of issues, I haven’t yet developed any alternative views (either short or far!). I’ve seen a few analyses of harvested data where the dc:type contains values “Electronic Thesis or Dissertation” and “Thesis”. Perhaps those are trivial differences when clobbering the data together later (but what about nifty values like “yes” or “other” (Mark Jordan’s EDT2006 ppt), etc.?).

    In the U Illinois Mellon study, an Appendix on normalization of the Type element references DCT1 as being used at many of the 39 repositories that were harvested. The paper just predates DCT2.

    It seems to me that if IR materials have to be separated so that they can be later harvested as “theses” (by defining a Set, for example, as you posted elsewhere), then isn’t some basic thesis descriptor lacking in the original data capture/input stage (like if it’s captured as simple DC, or has to be extracted as such?)

    I also can’t quite connect the dots on what/where is the quasi-official list of genre terms, unless it’s DCT2 (?). Please forgive me if these observations are elementary. And thanks.

    Comment by Mia — July 7, 2007 @ 5:57 pm

  4. In Australia there are currently meetings of different groups to attempt to agree on and propose common metadata standards for the scholarly content in the IR world. If each “region” with broader recognition as working according to some sort of authority can do something similar then I suppose we will have a basis for consistent crosswalks and interoperability.

    Ideal would be for an international standard from the beginning but is that really possible? — this link illustrates some of the problem:

    Even re what is happening in Australia, if hopefully fairly representative groups do agree on metadata standards they can only at best recommend their proposals, seek consensus, and hope that they will be followed and recognized as de facto standards. From there they have the potential to become official standards.

    My discussion about Sets for thesis types was intended for the context of singling out a certain type of thesis that qualifies for the national Australasian Digital Thesis Program (an online collection of research level theses) and the specific requirements of ADT for harvesting these from repositories. There are specific limitations in the metadata in this case that go back to the original way the ADT software (pre repository) and metadata requirements were designed. Including a special Set for these based on a dc.type value would not be a problem, since anything in the dc.type field could form a logical Set. But I don’t mean to suggest that Sets should be the way around all harvesting. Only that in this case, it would not hurt to have a setup where Sets could be based on dc.type, and by doing so the ADT requirements are met. So it’s not a case of IR materials “having to be separated” by Sets etc., but a harmless workable expedient for maintaining the ADT database.

    Not sure if this is really becoming as clear as mud. Just ask or tell me if it is.

    As for the more general question of theses metadata in IRs — by all means the richer and more granular the better for various reasons, but to what extent do those reasons extend into harvesting? For example, are users really going to want to search only for “doctoral” theses on a particular topic, knowing that “doctoral” can mean anything from research to coursework thesis, and that in some cases a “masters” thesis can also be a research level thesis? Aren’t users more likely to search scholarly databases on a particular topic, with a view to selecting the most relevant whether it is a research report, masters thesis, book or doctoral thesis?

    There is no standard list of genre terms for scholarly works. Again, to see what is there in just one part of the world have a look at the spreadsheet I prepared.

    Elementary questions are what we are all still grappling with. The questions you raise are all worth revisiting at every stage of what we are attempting to do. Elementary questions are the only ones I think I can understand 😉

    Comment by neilgodfrey — July 9, 2007 @ 1:01 am

  5. Hi again Mia,

    I did not do justice to your query about the DCTypes in my previous response. The DC subtypes for Text are problematic for scholarly works found in IRs. As alluded to in my previous response, the Text.Thesis.Doctoral breakdown is not adequate even for doctoral theses at an international level, since Doctoral can refer to a range of types of theses (coursework, professional, research, etc.). It is, of course, a valid value for a description of a record, however. The breakdown into Text.Journal and Text.Magazine will be problematic, especially in IRs where deposits are at the individual article level anyway. The Text.Article definition covers essays, stories, preprints and other “short” written forms. Again we run into difficulties with breakdowns for archives of scholarly materials.

    There is Text.Proceedings, but no room for conference posters. And does “Proceedings” refer to a collection of conference papers (as we would find in a monograph on a library shelf), or can it also refer to a single conference paper that would be deposited into the IR?

    So DC text subtypes don’t currently meet the requirements of university IRs, and if the principal purpose of DC is harvesting, then it seems much less problematic to rely on unqualified DC for that as much as possible.

    But let me know if I’m misunderstanding some of your query and comment.

    Comment by neilgodfrey — July 9, 2007 @ 2:25 am

  6. Hi Neil,

    Not at all – this provides me with yet much further food for thought.

    An impressive array of variations in your spreadsheet. If ETD data need to be all things to all people, then I will need to go further in my mental model of what is ‘sufficient’ at the IR macro level, which in turn rather rules out simple DC in my view.

    If, as you say, the principal purpose of DC is harvesting, then everything changes (for me, anyway). My understanding was that the minimalist principles of unqualified DC were meant to permit and encourage a broad range of non-specialists at the end-user data capture/input stages, in a variety of situations and settings, essentially to lower the cost of and barriers to some standardized metadata creation. That initial naivety was revisited pretty early on (wasn’t it?) by introducing qualified DC; then other schemes (MODS, etc.) started to emerge due to the lack of granularity, and now there is a multiplicity of crosswalks, etc. Application profiles are just starting to become somewhat clearer to me, though, so perhaps that is the next phase, I don’t know.

    The parent institution would want to derive different values from ETDs in an IR, and those goals are different from what a scholar would want to derive. Most scholars qua scholars when doing scholarly research, aren’t searching theses (much less masters level theses, major research reports, or other pseudo-theses entities done by students for course or degree requirements), so I completely concur with your observations in that regard.

    On the other hand, most scholars are also thesis advisors, so on second thought there may be plenty of overlap between their “non-scholarly” needs to search specifically for theses (possibly as in: which theses have I supervised; and so on) and the parent institution’s. So perhaps we do need to have the most expansive view of any harvested data, since there will be many pivot points and matrices which we cannot now anticipate.

    Even in the university sector, where there are plenty of experienced IT staff, systems people, and librarians well schooled in various aspects of this very limited domain of ‘the thesis entity’, there is less than universal and consistent application of DC, qualified or not (Dushay and Hillman’s work at NSDL, for example). (Forget other conundrums like serials entities, confusions between type and format, etc.)

    I’m afraid my comments are rather more of a rambling set of observations than questions, really; so thanks for your indulgence.

    Comment by Mia — July 9, 2007 @ 8:43 pm

  7. Just a quick comment on the fly (…. more later….) but you touch on the question of what belongs in the public domain and what needs to be reserved as local data. Should thesis supervisors be routinely added as public notes to all theses? What of the case (I’m thinking of a recent case that came up here) where the thesis author had strong conflicts with a supervisor and sees nothing but technical validity (or not even that) in having that supervisor associated with their thesis? (Presumably the supervisor’s name was nonetheless included in the text of the thesis itself, — but the point is that the copyright holder, the author, strongly opposed on various grounds having a particular supervisor appearing as an access point associated with their thesis.) I have since encountered others who have doubts about the advisability of making the supervisor a routinely public access point to all theses.

    Comment by neilgodfrey — July 9, 2007 @ 9:10 pm
