July 14, 2007


  2. Neil,

    Have you ever run across situations where no dc:type is specified? Or cases where far fewer than the simple dc set are utilized? It seems to me that there are more situations where ‘repeatable’ might need to be limited, but has anyone introduced any ‘mandatory’ (i.e., not optional) elements?

    Comment by Mia — July 17, 2007 @ 11:11 am

  3. I know of no situations where dc:type is avoided as a matter of policy. But in my experience most repositories do include a few mandatory elements, such as creator, title and type. There is an Australian body currently toying with making four DC elements — creator, title, description and type — mandatory for a scholarly exchange database (not repositories). UQ’s Fez and USQ’s EPrints have been configured for mandatory elements, varying for each dc:type value, and all the DSpace repositories I know demand a title as a minimum (and prefer dc:contributor to dc:creator).
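
Purely as an illustration of what such a policy amounts to in practice: a repository or harvester can check per-type mandatory elements in a few lines. The per-type rules below are invented for the sketch, not any repository’s real configuration.

```python
# Illustrative per-dc:type mandatory-element rules -- invented for this
# sketch, not UQ's, USQ's, or any repository's actual configuration.
MANDATORY = {
    "Text": {"creator", "title", "type"},
    "Thesis": {"creator", "title", "type", "description"},
}
DEFAULT_MANDATORY = {"title"}  # a DSpace-style bare minimum

def missing_elements(record):
    """Return the mandatory DC elements absent from a simple-DC record.

    `record` maps element names to lists of values, e.g.
    {"title": ["..."], "type": ["Thesis"]}.
    """
    dc_type = (record.get("type") or [None])[0]
    required = MANDATORY.get(dc_type, DEFAULT_MANDATORY)
    return {element for element in required if not record.get(element)}

record = {"title": ["Shareable metadata"], "type": ["Thesis"],
          "creator": ["Godfrey, N."]}
print(missing_elements(record))  # {'description'}
```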

    If you don’t mind manipulating a spreadsheet, there is a comparative table, titled “metadata spreadsheet” (found beneath the similarly titled “metadata types spreadsheet” link), which lays out which DC elements are mandatory and which are recommended, optional, etc. for various repository defaults and harvesters. Blanks indicate where a DC element is not used at all.

    The links are in the metadata chapter site of the RUBRIC Toolkit. I should write up a summary of some of these spreadsheets in more readable form.

    Comment by neilgodfrey — July 17, 2007 @ 10:38 pm

  4. Thanks, this is interesting. I probably shouldn’t be surprised by the inconsistency, er, variety? — but I am, and I have to admit I find it, well, somewhat disappointing.

    Comment by Mia — July 18, 2007 @ 3:41 pm

  5. “Inconsistency”, “variety”, . . . how about “flexibility”? 😉 A basic principle of the DCMI is that the schema elements should be not only repeatable but also optional. Are not the inconsistencies we observe actually the DC principles of optionality and repeatability in practice?

    Dublin Core is a very blunt instrument and institutional repositories will inevitably vary in the terms that best express their individual situations and national requirements. To my mind OA harvesting does work because the bluntness of DC, its weakness, is also its strength.

    But I am more than ready to admit I am too close to see what you are seeing. So do please feel free to express reasons for disappointment in detail.

    Comment by neilgodfrey — July 18, 2007 @ 8:32 pm

  6. I myself argue all the time for flexibility, so that is familiar territory indeed 🙂 Yes, of course. Having the biggest picture possible is important. Simplicity is very important to get any traction. Universities are highly political organizations, and compete for funds and status not only between departments and faculties, but also with infrastructure services, such as the library and the parent IT and HR departments. So, there is that overarching political dimension.

    We know that despite the apparent simplicity of simple DC, there is just nothing straightforward about getting human beings to codify information for machines. It turns out it isn’t simple after all.

    Some time back Tennant put together that piece on the ‘bitter harvest’: all those different date formats encountered in the data recorded by only five institutions. Having examined a fair number of documents and data sets over the years, it’s easy to recognize the signs of human footprints. There are good reasons for those footprints.
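
The date problem at least is mechanically tractable up to a point. As a sketch only (the pattern list is my own guess at a few formats one meets in harvested DC, not anything from Tennant’s article), a harvester might try to coerce free-text dc:date values to W3CDTF like this:

```python
from datetime import datetime

# Candidate dc:date patterns mapped to the W3CDTF precision they carry.
# An invented, illustrative list; real harvested data is far messier.
PATTERNS = {
    "%Y-%m-%d": "%Y-%m-%d",
    "%d %B %Y": "%Y-%m-%d",
    "%d/%m/%Y": "%Y-%m-%d",
    "%B %Y": "%Y-%m",   # month precision only
    "%Y": "%Y",         # year precision only
}

def to_w3cdtf(value):
    """Try to coerce a free-text dc:date value to a W3CDTF/ISO 8601 date.

    Returns None for unparseable values, which can then be flagged for
    human review instead of being silently mangled.
    """
    for pattern, out_format in PATTERNS.items():
        try:
            parsed = datetime.strptime(value.strip(), pattern)
        except ValueError:
            continue
        return parsed.strftime(out_format)
    return None

for raw in ["2007-07-14", "14 July 2007", "July 2007", "circa 1970"]:
    print(raw, "->", to_w3cdtf(raw))
```

Anything the patterns miss (“circa 1970”) stays a human problem, which is rather the point.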

    Optional/repeatable is the flip side of mandatory/limited. Several excellent articles on shareable metadata in the last few months shed light on some of these issues (e.g., Sharon Reeves’ article in First Monday). The July issue of D-Lib has a terrific array of articles I just burned through (including one that evaluates the non-use of Cornell’s DSpace).

    Perhaps I should start thinking about harvesting as a kind of “best bets” approach. Maybe it’s enough to “approximate” an answer to any query, use that info (however sparse) to assemble some plausible components (parts of which could be compound information objects), and provide some kind of facets/pivot points to ‘reach in’ further.

    Just keeping this all simmering on the back burner…

    Comment by Mia — July 18, 2007 @ 11:35 pm

  7. Whoa, too much detail now! 🙂 — Yep, it can look disheartening. I would like to think that the Cornell experience and others like it serve as a warning model for marketing and for broader policy and institutional administrative issues. But on the technical side it also points in part to some of the weaknesses of structuring repositories around Collections. There are many issues to be addressed at the data and service provider levels, at the repository structure level itself, and in many cases at policy and organizational levels too. And then there is the importance of controlled subject (and some other) vocabularies — which always seems to run up against the argument from the trenches about the qualifications and/or time of those entering the data.

    As for the data provider and repository structure side of things, I’m very interested in RDF potentials and applications in a semantic web environment. I have just had Open Library’s ThingDB brought to my attention too, which I want to look at more closely.

    I think my original question was prompted by another question more immediately on my mind right now, one that is discussed in other posts: the specific question of a standard resource type vocabulary as a value set for the dc:type element. Is there a Use Case for developing such a standard thesaurus? (I’m again thinking of our local Australian situation.) It goes without saying that there needs to be consistency of terms within a repository, but under what circumstances, and for whom, would a single standard set of resource terms for OAI harvesting be a benefit? (The question sounds like a no-brainer posed in the raw like that, but there are reasons for asking it that I won’t repeat in this comment box.) And does the recent D-Lib article on Type-Consistent Digital Objects have a bearing on this question? But I’m cheating — bringing in a question from another post to this one.

    Comment by neilgodfrey — July 19, 2007 @ 7:43 am

  8. I just had a look at that article (Saidis & Delis) and though I don’t possess (and won’t be acquiring) the technical background required to have its many mysteries revealed to me, I can certainly see much merit in adopting a Prototype. Things then conform (and must conform) and are verified according to the prototype, and future modifications are made to IT — that idea is quite appealing.

    I am all for a standard resource type vocabulary. At least I think I am, especially when I look at the 365 terms in the OAI mapping: I see some common non-English terms like monografia and livre, plurals, abbreviations (diss), etc. Why does the list stop at 365 terms? Is it because we haven’t yet encountered the 366th term, and when we do, it gets added to the list? Why are a handful of non-English terms represented, and what about other terms, other languages, and so on?

    So, wouldn’t we all benefit from a single standard set of resource terms?
    There has to be agreement everywhere. In order to achieve agreement, we have to keep finding (and someone has to keep articulating) ever-higher-order levels of abstraction. Isn’t that what DOPs is trying to address? (Am I cheating back? I have to keep skimming on the surface as I’m not a deep-sea diver.)

    Comment by Mia — July 19, 2007 @ 12:46 pm

  9. I let my other question take over for a moment — the “flexibility/variance/inconsistencies” we see in the display I pointed to earlier demonstrate, to my mind, OA harvesting working quite successfully with repositories. Google searches bring the academics’ papers to the top of the results. And the Google hits and higher citation counts of papers are among the strongest selling points in establishing the IRs. But I admit I am speaking from a different regional context than the one you referred to in the survey discussing the Cornell et al. experiences.

    The OAIster list does not include all the variations they receive, and I would question the terms they don’t include. There are really not as many as 365 distinct resource types in their map — many are the same term in different languages. But it does show, at least to me, that it makes no difference if one repository uses “article” and another “articles”. Such discrepancies are not a problem for the technology. And we are addressing the resource types known and understood within a specialist community, the scholarly one. Slight variations in terminology (article or journal article, working or discussion paper) are readily understood across the different geographical bases because everyone is talking essentially the same language. We are not addressing the complexity of meanings and terminologies that a whole library has to address.

    Repository managers can always negotiate with harvesters too. Admittedly I am speaking from the experience of one region, but the various resource types entered in the different repository solutions would pose no problems for harvesting. If a new or dubious term should be considered, it is no problem to phone or email the harvesters and check its compatibility. But again, my primary question is: what is a Use Case that relates to harvested materials? Under what circumstances would someone search just one resource type, how often, why, and who? (I can find many use cases within institutions, but my question is about harvested searches.)

    I can imagine and know of people searching across repository harvesters, or Google, for a particular author or topic. And those looking for a topic want to know everything that is available in order to decide what is most useful. They don’t want to limit a topic search to just a particular type of thesis, or reports only, or just book chapters.

    But I am more than happy to be persuaded otherwise. That’s why I have been asking for Use Cases that will demonstrate the value of a national resource type standard thesaurus.

    I would not be asking this question if the various repository technologies were not already well established. The use case value has to be balanced against the realities of the technologies in existence, the supports for those technologies and the needs they meet with their current systems. MARC was developed around existing standards, but can we really imagine EPrints and DSpace and others reconfiguring their systems to meet a new standard of terms? How would they handle existing stakeholders? Even if they all did in one nation or region, how would that affect harvesters anyway? They would still have to handle international variations. That’s why I wonder whether the simplest solution, if one is needed, is a crosswalk at the harvester level.
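
A harvester-level crosswalk is, after all, just a mapping from each repository’s local terms onto one canonical set. As a sketch only (both vocabularies here are invented for illustration, not OAIster’s actual mapping):

```python
# Hypothetical local-term -> canonical-term crosswalk, applied at harvest
# time. Both vocabularies are invented for illustration.
CROSSWALK = {
    "article": "journal article",
    "articles": "journal article",
    "journal article": "journal article",
    "monografia": "book",
    "livre": "book",
    "diss": "thesis",
    "working paper": "working paper",
    "discussion paper": "working paper",
}

def canonical_type(local_value):
    """Map a harvested dc:type value onto the canonical vocabulary.

    Unknown terms pass through unchanged, and could be logged for the
    phone-or-email negotiation with the repository mentioned above.
    """
    return CROSSWALK.get(local_value.strip().lower(), local_value)

print(canonical_type("Articles"))  # journal article
print(canonical_type("Livre"))     # book
```

The maintenance burden then sits with the handful of harvesters rather than with every repository.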

    I am not meaning to sound nihilistic. I am more than willing to be persuaded otherwise (indeed this current view of mine is a quite recent one, something of an embarrassing deconversion from my previous view) — but would like to be aware of Use Cases to reconvert me.

    Comment by neilgodfrey — July 19, 2007 @ 8:10 pm

  10. Absolutely; we shouldn’t be inventing solutions for scenarios that we merely postulate might exist, or that, when they do exist, occur 0.01% of the time. Those can be instructive and entertaining exercises, but they can also be incredible time-wasters. It’s true that we frequently fall into that trap, so it’s helpful that you are hammering away at this one. There has to be demonstrable, quantifiable, arguable need.

    Comment by Mia — July 20, 2007 @ 2:48 pm

  11. And just as I was stretching my non-tech head to breaking point to get it around Type-Consistent Digital Objects in D-Lib someone points me to this book, Everything is Miscellaneous. Already our basic repository solutions built around 2-D metadata schema instead of 3-D RDF (which can fill out any metadata schema values any time anyway) are starting to look clunky.

    Comment by neilgodfrey — July 20, 2007 @ 8:58 pm
