Case study: an archival body responsible for cultural preservation
An archival body has collated text documents, photographs, plans, and more to assist with the preservation of national heritage buildings and monuments. They would now like to have these digitized, stored in a library repository, and made publicly accessible.
This gives rise to some interesting conceptual nuances in figuring out the best way to structure the data in a library (repository) record.
Textual and pictorial metadata
The archiving agency fills in a text form to accompany the photographs, volumes, plans, etc. Once filled in, however, this form is not metadata about those accompanying materials. It contains information about the monument or building itself: its history, ownership, occupancy, style, architectural and structural details, gazette information, original architects, builders, and so on.
So all the data that is held by the archiving agency — text forms, photographs, etc — is data about the real building or monument “out there”, in the street somewhere.
In fact, what the archivists have is a combination of textual and pictorial metadata about real buildings.
The text information and the digitized volumes and photographs all describe the real monuments, either textually or pictorially. The “intellectual content” or “work” for preservation is the real building itself.
The real target content
The digitized graphic data and text data about the monuments are datasets. These datasets are sets of data about the real monuments. Although they are archival materials, they are not archival content in the same sense that a historic treaty, letter, or diary is archival content in its own right. The archival agency here uses its data (both text and graphic) to inform it about the real monuments “out there”.
Compare a history book. A history book can be about events that happened in the real world, but the history book itself is the primary targeted content of a library repository. Users read it as intellectual content in its own right, and to engage with the ideas of the historian. This is not the same as with the above content used for the preservation of cultural heritage buildings. In this case, the archived content can be said to be metadata informing users about, and offering tools for, the intellectual content of the cultural heritage in the real world.
The archived photographs, for example, are a form of metadata, that is pictorial metadata, about real monuments. These photographs are not the same as artistic or historic photographs that one might archive for their archival value in their own right. The archival photographs need to be archived or preserved as vital metadata about the real monuments – the real content of interest to which the photographic metadata points.
Schemas considered for handling this kind of dataset:
ISAD(G): General International Standard Archival Description and EAD: Encoded Archival Description
These schemas appear at first glance to be the obvious choices for encoding archival data. EAD has been developed from ISAD. But caution is advised:
- The structure of these schemas is hierarchical. That is, they are built around Fonds and Series. This is perfectly fine for non-digital document filing. It is also fine for digital storage systems that are prepared to “sit as is” in perpetuity. For local display purposes, here and now and for the foreseeable future with minimal technological changes, this will probably work very well.
- But hierarchical structures in metadata are fraught with difficulties. They are rarely sustainable across different software platforms. They are also difficult to crosswalk with other metadata schemas. The EAD website does offer a crosswalk to MARC 21, but some of the MARC 21 fields used (e.g. 351, for hierarchical information) are not common and are not easily mappable to other schemas such as MODS or Dublin Core. For this reason alone EAD is not an ideal schema for long-term preservation and interoperability purposes.
- Hierarchical structures also work against OAI searching. The context of any hits can be lost if that information is at a different hierarchical level, with the result that the hit can appear to be either meaningless or misleading via an OAI or DC search.
- For these reasons, the trend in OAI and the digital library/repository communities is towards non-hierarchical, flat protocols. EAD may be fine for local usage, search, and display purposes, but it risks being left behind in a silo, whereas our intent is to share data as widely as possible according to our L2010 vision.
- ISAD and EAD are designed for addressing collections where the primary content for direct preservation (archiving) consists of documents such as historical papers and images. As explained above, this is not the case with the PMB data. ISAD and EAD are not designed for datasets. Nor are they designed to meet the requirements of OAI protocol searches.
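The loss of context described in the list above can be illustrated with a minimal Python sketch. It assumes a hypothetical fonds/series/item hierarchy (all titles and field names are invented for illustration, not taken from ISAD, EAD, or the PMB data) and shows one way a flat record can carry its ancestors' titles with it, so that a search hit on the item alone still makes sense.

```python
from __future__ import annotations

from typing import Iterator

# Hypothetical hierarchical description: fonds -> series -> item.
# Titles and keys are invented for illustration only.
fonds = {
    "title": "Heritage buildings survey",
    "children": [
        {
            "title": "Old Town Hall",
            "children": [
                {"title": "Front elevation, photograph", "children": []},
                {"title": "Floor plan", "children": []},
            ],
        }
    ],
}


def flatten(node: dict, ancestors: tuple[str, ...] = ()) -> Iterator[dict]:
    """Yield one flat record per leaf item, copying ancestor titles into it."""
    if not node["children"]:
        # A leaf: record its own title plus the path that gave it meaning.
        yield {"title": node["title"], "context": " / ".join(ancestors)}
    for child in node["children"]:
        yield from flatten(child, ancestors + (node["title"],))


flat_records = list(flatten(fonds))
for record in flat_records:
    print(record)
```

In the hierarchical form, a hit on "Floor plan" says nothing about which building it belongs to unless the harvester also walks up the tree. In the flat form, each record is self-describing.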
VRA: Visual Resources Association and TEI: Text Encoding Initiative
VRA also suffers from some of the limitations of EAD:
- VRA is designed for addressing collections where the primary content for direct preservation (archiving) consists of documents such as historical, cultural, or artistic images. As explained above, this is not the case with the PMB data. VRA is not designed for datasets.
- The properties supported by VRA are too limited to support the range of potential properties required by PMB.
TEI is designed for manuscripts and large volumes of text.
CDWA: Categories for the Description of Works of Art, and CDWA Lite
CDWA covers many of the properties required for descriptive (and some preservation) metadata. It meets the basic descriptive requirements of PMB for their data. However, it also has limitations:
- I have doubts about how widely supported it would be by software platforms and search and display interfaces, at least without much local IT tweaking. And if systems could be massaged to support it now, there will remain the questions of future sustainability as software and hardware requirements and practices change.
- CDWA Lite looks easier to use than the full CDWA, but unfortunately the reason it is “Lite” is that most of the properties that pertain to buildings and real-life monuments have been removed. It is more useful for museum and gallery artworks.
MODS: Metadata Object Description Schema
MODS has the advantage of being able to express the CDWA and other properties specific to PMB requirements in a basically flat structure. It is also highly interoperable with other schemas, including Dublin Core. It is an ideal base for creating OAI-compliant DC for basic search. It is also very widely recognized, is supported by the Library of Congress, and is known well enough to be read and understood by most data service providers.
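As a rough illustration of how a MODS record can serve as a base for simple DC, the following Python sketch maps a few MODS elements to unqualified Dublin Core elements. The sample record and its values are invented placeholders, not PMB data, and the three-entry crosswalk is a deliberately simplified assumption that only loosely follows the kind of mapping the Library of Congress publishes; a real crosswalk would cover many more elements.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record; the values are placeholders for illustration.
SAMPLE_MODS = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Photographs of the Old Town Hall</title></titleInfo>
  <typeOfResource>still image</typeOfResource>
  <subject><topic>Old Town Hall</topic></subject>
  <subject><topic>Neoclassical architecture</topic></subject>
</mods>
"""

# Simplified, assumed crosswalk: MODS path -> unqualified DC element name.
CROSSWALK = {
    "mods:titleInfo/mods:title": "title",
    "mods:typeOfResource": "type",
    "mods:subject/mods:topic": "subject",
}


def mods_to_dc(mods_xml: str) -> dict:
    """Return a dict of DC element name -> list of values found in the MODS."""
    root = ET.fromstring(mods_xml.strip())
    dc = {}
    for path, dc_element in CROSSWALK.items():
        values = [el.text.strip() for el in root.findall(path, MODS_NS) if el.text]
        if values:
            dc[dc_element] = values
    return dc


dc_record = mods_to_dc(SAMPLE_MODS)
print(dc_record)
```

Because the MODS record is essentially flat, each DC value can be read off with a short path, without needing to reconstruct context from a parent level.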
A major advantage of using MODS is that it will be the simplest way of leaving the door open for OAI-ORE as well as OAI-PMH searches and retrievals of data. If the controlled vocabularies used in the archival records and MODS properties are eventually compliant with KOS and RDF models, then semantic web interrogation of the database has the potential to explore and discover relationships and knowledge otherwise lost from view.
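To make the OAI-PMH side concrete, here is a small Python sketch that builds a harvesting request URL. The `verb`, `metadataPrefix`, and `set` parameters and the `oai_dc` prefix come from the OAI-PMH protocol; the base URL, the set name, and the helper function are hypothetical and stand in for whatever the repository actually exposes. No network request is made.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (no request is sent)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set; the set name is an assumption.
        params["set"] = set_spec
    return f"{BASE_URL}?{urlencode(params)}"


url = list_records_url(set_spec="monuments")
print(url)
```

A harvester would fetch this URL and receive simple DC records, which is why deriving clean DC from the MODS records matters for discovery.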
Draft MODS-DC schema for data about cultural monuments, not about the monuments themselves
The draft schema is available at: http://tinyurl.com/lcuhyy