Metalogger

October 7, 2011

Repository Building Diary. Day 1

Filed under: Repositories — Neil Godfrey @ 10:17 am

Let’s say the “day” here represents a stage of maybe a few weeks. The university where I have been asked to take responsibility for the digital collections (research reporting, showcasing the university’s intellectual output, digitization of rare cultural resources as well as short-term learning materials, and coordination of research data management generally) has just had some very nice news that gives my job a little extra shine. It has been ranked very respectably within the top 400 universities (the top 4% of world universities) in the Times Higher Education Rankings for 2011-12. Not that it really matters that much, of course. I recall once working in a university that one year scored at or near the bottom of a national “recommended universities guide”, and we were all soberly informed that the rankings were rubbish and meant absolutely nothing. Two years later, I think it was, the same university scored top of the same list. Suddenly we had free champagne, a 15-piece band playing and self-congratulatory speeches all round. 😉

Still, there is an incredible amount of cultural research and history waiting to be digitally preserved here and made open to the world. If we had not scored in the top 400 it would have done nothing to dampen my enthusiasm to showcase a lot of the important research and heritage being produced and stored in Australia’s “top end”.

Someone asked me which of my jobs over the last few years I have liked the best and I had to say that I have loved them all (well, mostly) despite the inevitable difficulties each one brings with it. Each one has its own challenges and that’s good for me. Keeps me learning. Every game is different.

This one has started out well because the Fez-Fedora repository is in very early days of development and the early version of the Greenstone repository for e-learning course materials is no longer worth maintaining, and there is very, very little content currently open-access at all, so we have a chance to start from the ground floor. It is a great opportunity to shape things from early days before systems get too big to manoeuvre easily.

Most universities start out their repositories with something easy that shows quick results while workflows are being fine-tuned, and that something is generally deposits of digitized theses, usually higher degree theses.

But you need a few things up front to be able to do this. You need staff or volunteers to be able to enter or edit the content and records. You need a means of ensuring that the theses are digitized in an appropriate format. You need a repository system that can be easily managed and configured to meet your needs — needs concerning permissions, access, embargo periods, OAI harvesting, etc. And you need to be allowed to enter theses into your repository in the first place!

It’s not much good starting out by entering theses contrary to the written rules, policies and guidelines of the university itself. That’s an invitation to walk into legal disaster. The institutional research and examinations guidelines required that the library negotiate with each student who had authored a higher degree thesis for permission to deposit their thesis in the repository and make it open access. This was contrary to my experience in other academic institutions. I have known of cases where students have attempted to have their research theses withheld from public access for up to five years and more, but where the (deputy) vice chancellor of research has stepped in and told them a flat No. It is research to which the public has a right of access. It goes online.

The guidelines we were faced with meant having to seek permission from every author with a very lengthy and complex form detailing all the rights and wrongs and legalities of what their and our rights would be. And nothing could be deposited until we received a reply from authors going back who knows how many years. It meant we had virtually no responses and no backlog of theses ready to go online at all.

Of course there are the normal exceptions — third party copyright issues, commercial publication embargoes, cultural and other personal sensitivities, etc — that are fair and reasonable grounds for withholding certain theses from open access. That’s not the problem. We ask the authors to assure us that there are no such issues. We make reasonable efforts to contact authors and other related parties, and we are always clear that we make theses public in good faith and will remove any thesis if we subsequently learn it should not be online.

Well, my first task, one of them, was to go through the appropriate channels to seek changes to the guidelines for the university’s research and examinations practices. I emailed details of all the arguments for the change, and included a file of all the Australian universities that currently have some mandate or practice of virtually routine Open Access deposit of higher degree theses in their institutional repositories — about 80% of them — including a CAIRSS survey result to support the same. And bingo! The library director called me in after meeting the responsible committee and informed me they had all agreed to the change in the guidelines!

It’s nice to start out with a small but very important first step. Now that the guidelines are changed we can draft a very brief note for authors, and the default will be for us to deposit the theses and make them publicly accessible — contingent on the normal qualifications mentioned above, of course.

Now we can start to build up a huge backlog of theses to be entered so we can then as a next step justify a request for a staff increase to get them in the system! 😉


November 21, 2008

Repository-publishing-reporting workflow integration

Filed under: Repositories — Neil Godfrey @ 1:24 pm

Having tried a few times, without success, to register at http://techessence.info/node/104 to add a comment to Roy Tennant’s referral to my repository comparison, I am left to make my one comment here.

Roy rightly comments that the picture is bigger than I have indicated here (my blog comparison is in fact a truncated rump of something I prepared for another institution to help them investigate one specific issue), but I would suggest that his illustration of Digital Commons’ extra capabilities vis-à-vis DSpace is one example of the many factors that seem to me to cloud decision making in some institutions.

A peer review/publishing system tied in with a repository, to save on the number of finger clicks needed to process publishing, reporting and other administrative work, is a good thing.

But a scholarly publishing workflow system does not have to be the preserve of a single enterprise solution. One example from Australia: the Integrated Content Environment for Research and Scholarship (ICE-RS) is a Federal Government funded project to create a collaborative authoring and publishing (cum repository deposit!) tool that is open-source, and capable of integration with repository systems. It does not handle peer-review, though there are thoughts afloat for integrating ICE with the Public Knowledge Project’s Open Journal Systems software for peer review.

Moreover, the number of clicks required of authors can be reduced for repository ingest purposes where there is collaboration and integration with a university’s research office and reporting system. Some universities have been able to have current papers (including preprints) deposited in their repositories directly from a research system. The University of Sydney has a workflow line between their IRMA research reporting system and their DSpace repository. Murdoch University is engaged in establishing a similar workflow between their IRMA system and Fedora repository.
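
As an illustration of how such a hand-off can be scripted without an enterprise suite, here is a minimal sketch of a deposit from a research reporting system into a SWORD-enabled repository (DSpace supports the SWORD deposit protocol). The endpoint, credentials and package file are hypothetical, and this is not the IRMA integration itself:

# Hypothetical sketch: a research reporting system pushing a paper
# into a SWORD-enabled repository. Endpoint, credentials and package
# are placeholders; this is not the actual IRMA workflow.
import base64
import urllib.request

SWORD_COLLECTION = "http://repository.example.edu/sword/deposit/research-papers"
USER, PASSWORD = "research-office", "secret"

with open("paper-318-package.zip", "rb") as f:
    package = f.read()

request = urllib.request.Request(SWORD_COLLECTION, data=package, method="POST")
request.add_header("Content-Type", "application/zip")
# SWORD v1 packaging hint: a METS-based DSpace submission package.
request.add_header("X-Packaging", "http://purl.org/net/sword-types/METSDSpaceSIP")
token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
request.add_header("Authorization", "Basic " + token)

with urllib.request.urlopen(request) as response:
    # A successful deposit returns an Atom entry describing the new item.
    print(response.status, response.read()[:200])

The point is simply that the deposit end is an open protocol, so a research office system rather than the author can drive the ingest.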

Further, Roy’s allusion to the “highly popular DSpace” repositories may be seen as further testimony to the success and sustainability of open-source solutions in a world of highly competitive and highly expensive enterprise products.

More on ICE-RS:

Australian govt archive news release

Slide presentation

ICE: Integrated Content Management

Integrated approach to preparing, publishing, presenting and preserving theses

ASK-OSS post — courseware publishing system

Why ICE works

ICE developers blog

More on Dr Peter Sefton’s blog

October 19, 2008

INFORMAL comparison of some institutional repository solutions

Filed under: Repositories — Neil Godfrey @ 2:47 pm

Over the last few years I have worked closely with a number of different institutional repository solutions, both open-source and enterprise products. There are several I have not had personal experience with, but I have taken opportunities to speak with a wide number of users of these products, too, as well as with representatives and producers of those solutions. I have also sought input from other users of repositories I am personally familiar with in an attempt to balance out my own personal impressions. The following comparison is based on feedback primarily from managers of the systems — whether they have live production systems or have done extensive testing on systems they expect to take live soon.

The purpose of this comparison is to give an intro-level guide for institutions interested in “what (else) is out there”.

  • The comparisons are not a systematic point-by-point balanced presentation. Anyone interested in a serious in-depth comparison or study of any particular repository solution would need to speak to other users themselves, as well as the producers or agents of the solutions.
  • It is also restricted to the repositories I know from my experiences in Australia.
  • I have not referenced costs or specific institutions here.
  • Nor have I attempted a serious comparison of the IT architecture across the systems.
  • The main focus is on support provided/needed and functionality of each product.

Digital Commons

http://www.bepress.com/ir/

Digital Commons is a “presentation repository”, not a “preservation repository”. The emphasis in a repository designed primarily to showcase an institution’s research is on an attractive and compelling interface for users, including self-submitters. Digital Commons is a hosted solution (i.e. hosted in California). There is no hardware to purchase, install or maintain, and an institution can begin to upload papers immediately after setup. Purchase and maintenance are on a renewable one-year or fixed multi-year basis.

Digital Commons cannot be synchronized with another preservation repository for migration purposes. A preservation repository, unlike Digital Commons, will record and preserve authentication, versioning, rights, structural and descriptive metadata. In Digital Commons such data will not be preserved, so it cannot support a migration/exit strategy to a preservation repository.

Support

  • Three universities in Australia using Digital Commons reported that the service from BePress is “very good”.
  • The setup period is about three weeks. All report that Digital Commons is easy to set up.
  • Phone hook-ups were used for training and instruction at the beginning. Web demonstrations accompanied these.
  • Requests by a university to change the front page appearance were responded to quickly and changes made efficiently.
  • Another institution has requested many ‘fine tuning’ modifications to their instance of Digital Commons, and all these requests have been met “pretty quickly”. One institution wanted the format of citations changed and BePress effected this change quickly for them.
  • One institution that has used Digital Commons for more than 2 years said they had never had any down-time with it.
  • Nightly uploads of new documents.

Functionality

  • Documents can be set to open or closed access
  • Different authentications can be set up for different users
  • Self submission is possible
  • Workflow stages can be configured “to some extent”, so that a central library service can monitor self-submitted documents for quality control and copyright issues
  • Embargo functionality
  • Different types of media files can be deposited (e.g. mp3, pdf, video)
  • OAI harvesting
  • PRPs (Personal Researcher Pages) – this was a strong selling point at one institution. In addition to the central epubs repository, links can take one to a PRP of an author, and this PRP can contain a list of their publications, be used as their homepage, and a point from which to access their documents. The library instance of Digital Commons “harvests” these PRPs and includes links back to the PRPs on the document pages.
  • Documents organized by collections
  • Able to hide preparatory work on a document being uploaded until it is ready to go live.
  • Reporting and statistics

DigiTool

http://www.exlibrisgroup.com/category/DigiToolOverview

An Ex Libris product, DigiTool is a Digital Asset Management System (DAMS). It is designed primarily for teaching functions, and its repository capacities consist of a set of additional modules. The DAMS is primarily designed for teachers to share their digital objects (images, course notes and notices, indexes of resources, exam papers etc.). Much of this data is ephemeral.

DigiTool is not a hosted solution, but there is a community of users, a consortium client of Ex Libris, that does support a “hosted server” – UNILINC: http://www.unilinc.edu.au/services/hosting.html

Support

Users of DigiTool report that it definitely requires their own local IT support to configure it appropriately for specific institutional needs.

Functionality

  • Functionality depends on the modules purchased. Modules are available for:
    • academic self-submission
    • collection management (for arranging objects, adding thumbnails and descriptive metadata)
    • a JPEG 2000 viewer (an optional plugin)
    • OAI interoperability (harvesting)
  • The Ex Libris demo of DigiTool says it is scalable, and users of DigiTool all spoke of it being able to do much more than their immediate requirements.
  • Ex Libris also advertises that it “supports interoperability through open architecture”.
  • Embargo periods
  • Self-submission via the Deposit Module. The Deposit Module provides an interface and workflow which enables submission of objects and metadata by non-staff users.
  • Workflow also allows for authorized staff to control, edit, and approve/decline the submitted material.
  • Different levels of authentication: user/patron authentication is handled by the Local User management function or via LDAP.
  • Copyright: this can be managed by manual assignment of access rights to the object.
  • Objects can be assigned with access rights permissions.
  • The following formats are supported for load into DigiTool: MARCXML, DCXML, MODSXML, CSV, METS  — Given the claim that DigiTool is based on open architecture, one should expect the data stored would be migratable to other systems.
  • ExLibris advertises that DigiTool supports preservation standards such as PREMIS and the OAIS (Trusted Repositories) standard model.
  • DigiTool’s “interoperability module” for OAI harvesting does not output OAI-compliant Dublin Core. This is not a problem for harvesting by the NLA’s Discovery Service, because DS have configured their service provider to read and harvest their DigiTool feeds. But seamless OAI harvesting cannot be guaranteed with other service providers. DigiTool users are expected to have been informed of this issue by USQ Repository Services. I have not been able to learn if this is unique to DigiTool or is also an issue with other proprietary solutions discussed in this report.
  • Some users see DigiTool’s deposit procedure as “klunky” in trying to get it to do what they want. Editing of objects can require hours (overnight) to take effect; citations need to be created separately since they are not automatically generated (unlike EPrints, where they are automatic); and multiple key strokes are required for some “simple” operations such as moving an object from open to restricted access.
  • One user said that the upcoming version of DigiTool “promises” to be able to give them the ability to handle hierarchical structures. “We think it will do what we want.”

Reasons for adoption

Most institutions that have adopted it, or are considering doing so, have said that their primary reason was integration with their other Ex Libris products. Some specifically added that it was policy for them to favour enterprise solutions over open-source solutions.

DSpace

http://www.dspace.org/ and http://libraries.mit.edu/dspace-mit/index.html

DSpace is an open source solution developed by MIT. It has a large and active community of users. At least 450 registered DSpace repositories worldwide are evidence of DSpace’s robustness, ease of implementation, simplicity of maintenance and ongoing use, and low-cost.

Support

  • A large and active community of supporters with experience and expertise available to draw on
  • Thorough online documentation for IT staff and managers for customization and implementation
  • Step by step online tutorials
  • Online assistance
  • The amount of local IT support required for the implementation of DSpace depends on the extent of configuration changes an institution wishes to make.
  • DSpace provides a module, Manakin, which enables the configuration of much more “original” interfaces without “intensive long term” IT support.
  • Institutions with basic, largely “out of the box” configurations report that they can do “in the main” without local IT support. The trade-off is that a few “minor issues” (e.g. maintaining correct indexing records when changing the location of an object from one collection to another) persist.

Functionality

  • DSpace manages objects in a hierarchical, collections-based structure. Collections (or hierarchies of collections) display alphabetically on the main page.
  • This Collections based organization, with inbuilt workflow and authentication capabilities, enables different faculties or departments to manage their own deposits and structure of their collections. Workflows can be set up to still provide for central quality control and final editing by the library.
  • Descriptive metadata for the objects has a flat structure, which means that for objects with multiple authors from different affiliations there is no automatic guarantee that the data can be transferred intact from one repository to another. Ensuring this requires IT support to set up, say, a METS package that encapsulates the data in its original relationships for successful migration (see the sketch after this list).
  • Workflows and authentications are supported.
  • Embargo periods are supported (metadata page displays but the attached document becomes public at a preset date)
  • Objects can be made inactive to be hidden from public view.
  • Different mime types are supported, including video and audio.
  • DSpace is integrated with Research Management systems in several universities.
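
On the flat-metadata point above, here is a minimal sketch of the METS/MODS packaging idea, with invented object and author data. It is not DSpace’s own export code, just an illustration of how a structured wrapper keeps each author tied to the right affiliation:

# Minimal sketch: wrap author-affiliation pairs in a METS/MODS
# package so the pairings survive migration. Hypothetical data;
# not DSpace's own export code.
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS)
ET.register_namespace("mods", MODS)

authors = [("Smith, Jane", "University A"), ("Lee, Bob", "University B")]

mets = ET.Element(f"{{{METS}}}mets")
dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="MODS")
xmldata = ET.SubElement(wrap, f"{{{METS}}}xmlData")
mods = ET.SubElement(xmldata, f"{{{MODS}}}mods")

# Each MODS <name> keeps an author and affiliation together, unlike a
# flat Dublin Core field list where the pairing is lost.
for name, affiliation in authors:
    n = ET.SubElement(mods, f"{{{MODS}}}name", type="personal")
    ET.SubElement(n, f"{{{MODS}}}namePart").text = name
    ET.SubElement(n, f"{{{MODS}}}affiliation").text = affiliation

print(ET.tostring(mets, encoding="unicode"))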

EPrints

http://www.eprints.org/

EPrints is an open source solution developed and supported by the University of Southampton. EPrints is “easy to install, easy to configure, and needs minimal maintenance. Once installed, it simply works without fuss. Over a year, no maintenance has been required to the UTas server apart from updates.” (Arthur Sale, UTas)

All EPrints administrators I have contacted have spoken well of its simplicity and stability. It is widely seen as an ideal repository solution for initial implementation in a university with limited financial resources and IT support.

“EPrints is a mature software package, with an established community. It offers a complete solution for managing a research repository for Open Access. EPrints can be put to other uses, but for other uses such as image repositories alternative software might be more appropriate. . . . However, the software is under active development and it is particularly useful as an Open Access document repository.” (http://rubric.edu.au/repositories/eprints.htm)

Support

“Many institutions do not have the resources necessary to build or maintain an institutional repository. The EPrints Services team offers a complete range of advice and consultancy to support institutions who have adopted, or who are looking to adopt, the EPrints solution. We can provide as much or as little support as you need to create and maintain a professional repository.” – EPrints site

This assistance is gratis to those implementing and maintaining an EPrints repository.

I contacted at least half a dozen universities using EPrints, and all expressed unqualified praise for the level and timeliness of support from Southampton. This praise came from the IT staff who have had to liaise with Southampton as well as from repository managers.

Patches and upgrades are released regularly. Users have remarked on the ease with which these are installed and the robustness of their maintenance.

Functionality

  • EPrints supports OAI-PMH harvesting protocols.
  • Plug-ins have been developed to support specific research reporting requirements and the emerging SWAP (Scholarly Works Application Profile), which is pioneering interoperability and semantic web developments for scholarly works.
  • Self-submission (with a simple self-submission interface that is quick and easy to learn) is supported.
  • Workflows can be configured for editors, submitters, and monitoring staff with different permissions.
  • Objects can be removed from open access.
  • Batch import (e.g. of ADT records) is supported.
  • Peer review status, publication status, copyright and other administrative information, and citation generation and statistics by objects and author are all part of the “out of the box” package.
  • EPrints supports text (in particular pdf) and image files, including multiple files per object.
  • Flat metadata structure. So when there are multiple authors with different affiliations there is no guarantee that the right author-affiliation matches will be maintained in future migrations to other repositories. Ensuring this requires some workarounds that are available, but they need IT support to implement. In raising the flat metadata structure issue, it should also be mentioned that EPrints is developing an RDF module that converts its metadata into “triples” (subject-predicate-object). RDF (Resource Description Framework) is the basis of the emerging Web 3.0 (Semantic Web) and enables data to be converted into multiple schemas, including complex hierarchical structures (see the sketch after this list).
  • EPrints has recently begun to support preservation metadata through the work of its Preserv project, and the following preservation functions have been implemented in EPrints 3:

1. A history module to record changes to an object and actions performed on an object
2. METS/DIDL plugins to package and disseminate data for delivery to external preservation services

  • Slow indexing issues in EPrints have been rectified with the EPrints 3 version.
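
As a footnote to the RDF module point in the list above, here is a minimal sketch of how author-affiliation links survive as triples. The URIs and data are hypothetical, and this is not the EPrints module itself:

# Minimal sketch of the RDF idea: triples keep each author tied to
# the right affiliation. Hypothetical URIs; not the EPrints module.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")
g = Graph()
paper = URIRef("http://example.org/eprint/318")

for person, name, org in [
    (EX.author1, "Smith, Jane", "University A"),
    (EX.author2, "Lee, Bob", "University B"),
]:
    g.add((paper, EX.hasAuthor, person))       # subject-predicate-object
    g.add((person, FOAF.name, Literal(name)))
    g.add((person, EX.affiliation, Literal(org)))

print(g.serialize(format="turtle"))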

EPrints development has prioritized robust functionality, and this limited the file types supported in the earlier versions of the “out of the box” EPrints. But successive releases are allowing a wider variety of mime-types to be supported.

One limitation up till now (version 3.0) with EPrints has been the failure of the Embargo facility to publish objects automatically on their release dates. These need to be published manually.

Dates for theses appear as “date published” instead of “date completed” in at least one institution. It is not clear to me if this is a configuration issue that is resolvable with IT and/or Southampton support.

Users

269 archives worldwide are known to be running EPrints.

One manager recommended EPrints because it was “easy to set up, easy to configure, easy to use, and has a very large, open community supporting it and using it . . . and not all that expensive . . . for small institutions such as these, it was ideal.” (USQ-RS Manager in personal email)

Equella

http://www.thelearningedge.com.au/

Equella is developed by The Learning Edge International (a Tasmanian-based company).

Equella is primarily a “Learning Content Management System”. Learning Edge also describe it as a repository, but speak of it as a teaching repository tool. It is “a fully integrated Digital Repository and Content Authoring Tool”. It can be used as a collaborative lesson planning tool as well as a repository. It can plug into Blackboard and WebCT.

Users have described its administration module as well developed. “A non-IT person can set it up with a graphic user interface for collections and permissions and configuration.”

Support

  • One institution needed “some” local IT support to set up Equella. They do not need local support for much more than activating periodic patches that Equella sends out now.
  • Another also said that the “setting up” of Equella was the most complex part, but that this included the preparatory work. Equella is so flexible that many decisions need to be made in advance about what exactly is wanted and what the best and proper policies should be. Once this work had been recorded it was relatively easy to set up. All the development work is done by Learning Edge.
  • One institution said that the initial setup involved two days’ training, which was described as “very adequate”.

Functionality

  • self-submission (including students depositing higher degree theses)
  • different levels of authentication (easy to set up – a simple switch; manages various levels of permissions among academics – very flexible)
  • workflow systems (staged steps to check for copyright compliance with OAKLIST, Sherpa, etc)
  • digital rights management (can aggregate objects to groups for specific workflows and permissions)
  • embargo periods
  • OAI-PMH harvesting (e.g. to Google Scholar)
  • records can be pulled out of Research Master and sent back into Research Master
  • one user noted that the user frontend is boring and uninspiring but the functionality behind it is “great fun”; they are relying on their own IT people to rectify the web appearance soon.


Users

Equella has about 18 clients in Australia, including several Tasmanian, Queensland, South Australian and Victorian TAFEs and Education Departments. One institution chose Equella to handle 4 primary tasks:

  • RQF reporting (now ERA)
  • as a backend to the Moodle CMS
  • to be the base of the e-reserve collection
  • developing a university research repository

Open source solutions were not an option for them because of the limitations of the brief.

Another chose it because:

  • it has a well defined administration module, so a non-IT person can set it up with a graphic user interface with various collections and permissions, and they did not want to align with a particular library system

Fez

http://dev-repo.library.uq.edu.au/wiki/index.php/Main_Page

Fez is a front end to the Fedora repository software. It is developed by the University of Queensland Library as an open source project hosted on SourceForge. See http://sourceforge.net/projects/fez/

Project Overview: http://espace.library.uq.edu.au/documentation/

Fez is part of the Australian Partnership for Sustainable Repositories (APSR). Fez is one of the deliverables of the APSR eScholarship testbed in the University of Queensland Library.

Support

  • The University of Queensland developers do offer support for other users of Fez. They should be contacted for details.
  • To implement Fez, an IT officer with the following basic knowledge set would be required:
    • MySQL
    • Fedora
    • Fez – written in PHP, but also CSS and html (Smarty html templating)
    • Apache
    • (related other prerequisite software)
    • an understanding of how it all fits together
  • This would also involve that person (obviously) having programmer-level access to the server on which Fez runs (something to be considered, depending on your IT department’s policies).

RUBRIC recommendation as at February 2007:

For institutions wanting to run a general purpose repository Fez is a promising choice, provided that the technical resources are available to manage it. Contact the developers, as they may be able to offer support under a formal arrangement.

Functionality

  • Fez is a rapidly maturing repository software application.
  • Fez is built around constructs known as “Communities” and “Collections”.
  • Supports self-archiving
  • Workflow authentications and authorizations. These are configurable through a GUI interface.
  • Security based on FezACML to describe user roles and rights on a per object basis or through parent collection or community security inheritance.
  • Security at object granularity.
  • Statistics (e.g. Downloads per Author, per Community, per Collection, per Subject, etc).
  • Preservation metadata extraction
  • OAI service provider for harvesting
  • Supports migration to and from other repository systems (DSpace, EPrints, VITAL)

VITAL

http://www.vtls.com/products/vital

VITAL provides every feature–ingesting, storing, indexing, cataloging, searching and retrieving–required for handling large text and rich content collections. VITAL takes advantage of technology standards such as RDF, XML, TEI, EAD and Dublin Core to easily describe and index an assortment of electronic resources. VITAL leverages the benefits of open-source solutions such as Apache, MySQL, McKOI and FEDORA™. VITAL conforms to common Internet data communications standards such as TCP/IP, HTTP, SOAP and FTP. Additional standards utilized include WSDL Web Services, OAI-PMH, Dublin Core, MARCXML, JHOVE, MIX (Metadata for Images in XML Schema), and SRU.  (from the VTLS site)

A PREMIS datastream is also generated at ingest.

Support

Australia’s community of VITAL users has been able to coordinate support through ARROW. 2008 is the final year of the ARROW project. ARROW is expected to be replaced by a CAUL-sponsored body, CAIRSS, with a remit of supporting the Australian university repository community more generally. VTLS has acknowledged that their past record in prior testing of new product versions, and their follow-up support, could have been better. At a recent ARROW meeting in Brisbane, a VITAL representative assured users that VTLS itself would upgrade users’ current versions to 3.4, due for release around the end of October. In the past, patches have not always been available for recognized issues (including significant ones such as certain PDF files not being indexable) and users have had to wait for new version releases. VTLS does have an online “hotline” for logging such issues.

Functionality

The VITAL product promises much. It has the advantages of a Fedora base, which enables the storage of a wide variety of content types, and greater sophistication in their management and support for any standard metadata schema (hierarchical or flat).

  • Documents can be set to active or inactive (public or hidden) — although at the moment of deposit they necessarily default to active
  • Self submission is possible through VALET
  • Workflow stages can be configured so that a central library service can monitor self-submitted documents for quality control and copyright issues
  • No embargo functionality
  • Different types of media files can be deposited (e.g. mp3, pdf, video)
  • OAI harvesting
  • Statistics — although some institutions opt to hide this function because the stats file corrupts at regular intervals or the statistics reset to zero; this is said to be fixed in version 3.4
  • Copyright: this can be managed by manual assignment of access rights to the object.
  • Supports migration to and from other repository systems (DSpace, EPrints, VITAL) with METS.
  • Ability to search the full-text content of PDF, DOC, RTF and other document formats — although there are currently security issues with this function, such as public searches not being completely cut off from “hidden objects”
  • Ability to display multi-page documents.
  • Integrated editors for easy editing of metadata.
  • Customizable templates for display of content.

October 10, 2008

Are repositories set to be left out in the cold?

Filed under: Dublin Core,Harvesting,Repositories — Neil Godfrey @ 1:01 pm

Repositories and their harvesters have a rule of their own that violates Dublin Core standards. Because of this, are repositories and harvesters on target for a massive retroversion or major set of patches if they are to be a part of the semantic web? (I don’t know, but I’d like to be sure about the answer.)

Once again at a Dublin Core conference I listened to some excellent presentations on the functionality and potential applications of Dublin Core, but this time I had to see if I could poop the party and ask at least one speaker why the nice theory and applications everywhere simply did not work with the OAI harvesting of repositories.

I like to think that standards have good rationales. The web, present and future (e.g. the semantic web) is predicated upon internationally recognized standards like Dublin Core. According to the DCMI site the fifteen element descriptions of Simple Dublin Core have been formally endorsed by:

  • ISO Standard 15836-2003 of February 2003 [ISO15836]
  • ANSI/NISO Standard Z39.85-2007 of May 2007 [NISOZ3985]
  • IETF RFC 5013 of August 2007 [RFC5013]

But there is one area where there is a clear conflict between DCMI element definitions and OAI-PMH protocols. The DC usage guide explains the identifier element:

4.14. Identifier

Label: Resource Identifier

Element Description: An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Examples of formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

Guidelines for content creation:

This element can also be used for local identifiers (e.g. ID numbers or call numbers) assigned by the Creator of the resource to apply to a particular item. It should not be used for identification of the metadata record itself.

Contrast the OAI-PMH protocol:

A unique identifier unambiguously identifies an item within a repository; the unique identifier is used in OAI-PMH requests for extracting metadata from the item. Items may contain metadata in multiple formats. The unique identifier maps to the item, and all possible records available from a single item share the same unique identifier.

The same protocol explains that an item is clearly distinct from the resource and points to metadata about the resource:

  • resource – A resource is the object or “stuff” that metadata is “about”. The nature of a resource, whether it is physical or digital, or whether it is stored in the repository or is a constituent of another database, is outside the scope of the OAI-PMH.
  • item – An item is a constituent of a repository from which metadata about a resource can be disseminated. That metadata may be disseminated on-the-fly from the associated resource, cross-walked from some canonical form, actually stored in the repository, etc.
  • record – A record is metadata in a specific metadata format. A record is returned as an XML-encoded byte stream in response to a protocol request to disseminate a specific metadata format from a constituent item.

I wrote about this clash of standards and protocols in another post last year. One response was to direct readers to Best Practices for OAI Data Provider Implementations and Shareable Metadata.

The working result for many repositories is a crazy inconsistency. Within a single Dublin Core record for OAI harvesting the same element name, identifier, can actually be used to identify different things:

<oai_dc:dc
     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
     http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   <dc:title>Using Structural Metadata . . . </dc:title>
   <dc:creator>Dushay, Naomi</dc:creator>
   <dc:subject>Digital Libraries</dc:subject>
   <dc:description>[Abstract here]</dc:description>
   <dc:description>23 pages including 2 appendices</dc:description>
   <dc:date>2001-12-14</dc:date>
   <dc:type>e-print</dc:type>
   <dc:identifier>http://eprints.repository.edu/318/</dc:identifier>
   <dc:identifier>1-85636-082-X</dc:identifier>
 </oai_dc:dc>

In this OAI DC the first identifier identifies the splash page for the resource in the repository. The second identifier identifies the resource itself. It works for now, between agreeable partners. But how sustainable is such a contradiction? What is the point of standards?
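
To see the ambiguity from the harvester’s side, here is a minimal sketch that lists every dc:identifier in each harvested record. The endpoint URL is a placeholder, not a real repository:

# Minimal sketch of the harvester's view: list every dc:identifier
# in each record. The base URL is a placeholder endpoint.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://eprints.repository.edu/oai2"  # hypothetical endpoint
url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

for record in tree.iter(OAI + "record"):
    item_id = record.find(OAI + "header/" + OAI + "identifier").text
    dc_ids = [e.text for e in record.iter(DC + "identifier")]
    # One OAI item identifier, but possibly several dc:identifier
    # values meaning different things (splash page, ISBN, ...).
    print(item_id, "->", dc_ids)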

As far as I understand the issue, this breakdown in the application of the Dublin Core standard is the result of institutional repositories needing their own branding to come between users and the resources they are seeking. Without that branding they would scarcely have the institutional support that enables them to exist in the first place.

Surely there must be other ways for harvesters to be aware of the source of any particular resource harvested and hence there must be other ways they can meet the branding requirement. Surely there is a way to retrieve an identified resource (not an identified metadata page about the resource) and to display it with some branding banner that will alert users to the repository — and related files and resources — where it is archived. Yes?

I mention “related files and resources” along with the branding page — but maybe this is a separate issue. Where a single resource consists of multiple files, is the metadata page a valid proxy for that resource anyway? Or is there another way of displaying these?

Australia has had the advantage of a national metadata advisory body, MACAR. The future of MACAR into next year is still under discussion, but such an issue would surely be an ideal focus for such a body — to examine how this clash impacts the potentials of repositories today and in the future. A national body like MACAR has a lot more leverage for pioneering changes if and where necessary.

What should be done?

What can be done?


But is there more? More confusion of terms?

In having another look at the DCMI site for this post I noticed something else in the latest DC Element Set description page:

Term Name: identifier
URI: http://purl.org/dc/elements/1.1/identifier
Label: Identifier
Definition: An unambiguous reference to the resource within a given context.
Comment: Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.

DCMI recommends that an identifier be “a string”. In the context of RDF and the semantic web, my understanding of “string” is a dead-end set of letters, as opposed to a resolvable URI or “thing”. But the DC Usage Guide “explains” that a formal identification system applicable here can also be a URI. So what don’t I understand about the difference between strings and (RDF) things, now?
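
To make the strings-versus-things contrast concrete, here is a minimal sketch in RDF terms, reusing the two identifiers from the example above (both hypothetical):

# Minimal sketch of "strings vs things" in RDF terms. Both values
# may legally appear as dc:identifier; only one is a followable URI.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

g = Graph()
paper = URIRef("http://example.org/eprint/318")

# A "thing": a resolvable URI a semantic web agent can follow.
g.add((paper, DC.identifier, URIRef("http://eprints.repository.edu/318/")))
# A "string": a dead-end literal, here an ISBN.
g.add((paper, DC.identifier, Literal("1-85636-082-X")))

print(g.serialize(format="turtle"))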


October 2, 2008

Online journals and institutional repositories – comparing their potential impacts on research methods (and journal publications)

Filed under: Repositories — Neil Godfrey @ 10:55 pm

While discussing the potential impact of institutional repositories on journal publications, an academic alerted me to an article — Electronic Publication and the Narrowing of Science and Scholarship by James A. Evans — discussing research into the impact of the online journal culture on research methods in the world of scholarship.

The article is interesting for several reasons, but I want to address its conclusions in relation to research repositories – something the research did not set out to address. I think such a comparison is worthwhile to the extent that it helps clarify how institutional repositories and online journal databases each potentially impact research methods and publishing companies’ futures in different ways. I also can’t resist a comment on my thoughts about where we are headed after the current world of online databases, whether of journals or research repositories.

The opening abstract of the article notes:

Online journals . . .  are used differently than print — scientists and scholars tend to search electronically and follow hyperlinks rather than browse or peruse . . . .

IR comparison comment: IRs, on the other hand, do facilitate browsing by keywords, titles, authors, year, resource type (that is, text or still image or video . . . )  — not just searching as per online journals like JSTOR or EBSCO.

Evans’ research into the impact of online journal databases found that:

As deeper backfiles became available, more recent articles were referenced; as more articles became available, fewer were cited and citations become more concentrated within fewer articles.

Costs and benefits of online journal databases

His interpretation of this paradox:

  1. as online searching replaces browsing in print, there is greater avoidance of older and less relevant literature;
  2. hyperlinking through an online archive puts experts in touch with consensus about the most important prior work – what work is broadly discussed and referenced;
  3. thus online search bypasses many marginally related articles that are still skimmed by print researchers.

Findings and ideas that do not become consensus quickly will be forgotten quickly.

This research ironically intimates that one of the chief values of print library research is poor indexing. Poor indexing – indexing by titles and authors, primarily within core journals – likely had unintended consequences that assisted the integration of science and scholarship. By drawing researchers through unrelated articles, print browsing and perusal may have facilitated broader comparisons and led researchers into the past.

Evans sees this as one more step away from the contextualized monograph:

the contextualized monograph, like Newton’s Principia or Darwin’s Origin of the Species, to the modern research article. The Principia and Origin, each produced over the course of more than a decade, not only were engaged in current debates, but wove their propositions into conversations with astronomers, geometers, and naturalists from centuries past.

Thus the higher efficiency with which arguments can be framed with the assistance of online searching and hyperlinking, “the more focused – and more narrow – past and present” they will be.

It is not a strictly fair comparison, but this does remind me of a time I was inspired with fresh insights into a topic I was investigating simply from the serendipitous luck of accidentally noticing a title on a tangential topic placed just one library shelf above the one which held the exact classification numbers I was directed to browse. No online search – nor any print citation index – could ever enable me to repeat that particular stroke of luck.

In other words, any technological change will charge some costs against the way we used to do things. Maybe the advent of the printing press led some researchers to miss the chance discoveries they made in the days they had to rely on personal travel to where certain hand-copied books were known to be stored. But each change also brings its own new avenues for broader comparisons and insights from unintended serendipity. If this is not currently happening on a large enough scale to impact the statistical research of Evans’ article, then it is reassuring to know that this is not the end of the story.

But no doubt Evans also would acknowledge that the broader conversations involved in works like Principia and Origin were themselves scarcely the outcome of unintended consequences of the relatively poor indexing of the print media. Today’s researchers and scholars, I suspect, are under far more institutional pressures to specialize, produce and publish at a certain rate than regularly experienced by Newton and Darwin.

Innovation also follows demands and needs, including those of researchers. Online journal and e-book databases and catalogs are not the only points to which the electronic media are leading.

Comparing IRs with online journals, their functions and potential impacts

IRs are finding their way into a growing number of universities. (See my list of current Australian IRs and links to other registries.) These IRs do support broader opportunities for browsing, not just searching, by uncontrolled keywords and sometimes controlled vocabularies, resource types, authors and titles. Browsing is one of the alleyways the Evans article laments is missing with electronic searches. That is not the case with most IRs. Nor are IRs restricted to the localized browsing of a single institution’s research repository. They can be browsed collectively through harvesters such as OAIster, Australia’s Discovery Service, etc.

IRs are something other than a high-tech attempt at a more efficient journal database. They are quite different. They are a means of individual academics – and their institutions — making visible and accessible their collective works. They are a means of both showcasing and preserving personal and institutional research, and also of making publicly funded research instantly and openly accessible to all.

IRs offer the capacity to more easily interrogate the discussions of individual authors over time. This is surely a potentially useful alternative to journal databases that are structured around topics. Researchers do publish across many journals and the history of a single researcher’s output can be immediately apparent in an IR even though they have been published across a wide range of journals.

Not only immediately apparent, but immediately accessible in those repositories built on the open access principle that publicly funded research should be publicly available. Most online journal databases restrict access to those who belong to an institution with the appropriate subscriptions. If a journal is neither online nor held in print by an institution, then one must wait days for the interlibrary loan process before accessing the article. In an open access repository it is instantly available. No waiting time or tedious paperwork/online-form processes to negotiate.

Journal titles that originally hosted these publications are also referenced, thus in some cases also raising awareness of journals that might otherwise have been less widely known.

Where it’s all headed?

That’s the present. And the web is still in its infancy. Our wheels are still revving in the first generation of the Web, Web 1.0. Online journal databases, and institutional repositories too, are nothing more than a mass of web pages or documents waiting to be accessed. They are little more than a “more efficient” form of the print media and print indexing. In the case of IRs, they have the bonus of allowing more innovative and extensive browsing, too. Web 2.0 is a cute next step allowing social networking, which a growing number of scholars are finding really is more than simply cute. But the next evolutionary step, Web 3.0, is beginning to mutate.

This will be the semantic web, where information will be meaningfully contextualized in the way early (19th century!) information managers and innovators (I’m thinking of Charles Cutter), who knew only the print medium, originally intended. The semantic web will mean an online world where all the varying information topics (not just web pages or pdf files) have their URI namespaces, and where it will be possible for users to search through them via meaningful relationship enquiries (not just “X links to Y” but “how X links to Y”; and both X and Y can be interrogated within their ontological relationship to each other – is one a subset of the other? or is one an echo of the other in a different discipline?). Not that Cutter envisaged the semantic web, of course. But he did seek a way of organizing information that was more meaningful and useful than the classification systems we ended up with in libraries.

Part of this will be the exchange and reuse of objects within datasets and research databases (ORE). Datasets from different fields and disciplines can scarcely “talk” to each other today because of their different measuring and conceptual modelling. Any overlapping concepts are known to few outside those familiar with both disciplines, and even those few may rarely be able to make use of such overlaps because of their varying languages. The next stage of web development will see a working towards the ability to interrogate meaningfully, and select and re-use in other contexts the specific information we seek, as well as the ability to explore major and minor side-avenues. We will not be restricted to searching or browsing pages that have been prepared for searching and browsing (and “mere” hyperlinking) by others.

If we are moving away from the “humanist” benefits of inefficient print indexing, we are, I believe, moving towards an even greater scope for creatively exploring the total chaos of information.

September 16, 2008

Getting there

Filed under: Repositories — Neil Godfrey @ 9:01 pm

Golly gosh. No sooner do I go and start a new job at Murdoch University in Perth than another job with even greener grass beckons. Murdoch needed some help with getting their repository off the ground and I was free to move and liked the idea of helping out, so I’m really quite thrilled to find we’ve been able to make some real progress in my all too short stay here.

When I arrived the repository was simply broken. Perth is “the most isolated city in the world” and I wondered if that was partly to blame for the predicament, despite all the bridging of distance that technology is supposed to bring us. I suppose most people who come to find work here are looking for the much bigger bucks offered by the mining companies. University libraries won’t compete with their offers.

Anyway, the first job was to fix the technical side so we at least had a repository that worked, if not perfectly. By “worked” I mean something that allowed a simple pdf file or jpeg image to be uploaded, with descriptive, rights and other metadata attached, and with a preservation and a Dublin Core datastream tossed in as well. And that that package was searchable and displayable.

Some of the broken bits that prevented even that level of progress turned out to be nothing more than not knowing where to find a particular window that was deviously hiding behind a larger one. Damn I hate computer software that shows no mercy for the neophytes!

Other issues took a bit more persistent and methodical diagnosis. Outside tech help had to be called in to discover an incorrect url namespace in a configuration. Thanks USQ Repository Services tech team! Another issue turned out to be related to a sneaky comma gatecrashing an alternate id value field. And the remaining issues are simple crosswalk or indexing matters.

All is now good technically. It works. And records look like they are supposed to look in the repository, or are in the final finishing touches of being so.

But there was one matter I had been postponing while focusing on getting the thing to actually work. I could never quite understand a certain discussion I heard regularly that related to publisher journal article links. It turned out that we were trying too hard to start with too much, and some directions travelled were not necessary at all. Not surprising, given all the handicaps — acute shortages in the IT department, and limited access to the skills required for a crucial understanding of other vital systems and procedures — faced by the library, which had held up the repository for so long.

I figured that trying to get everything in place before really starting the repository was only going to continue to hold up everything. Let’s just start crawling before getting ready to take our first walk. Don’t try to arrange to have 400 academics all come on board from the first weeks. Let’s just get 4 to start with. Even one will do. Suddenly we were working with real data. No longer trying to set up the whole she-bang before the get-go. And with real data, and with real academics who clearly were responsive to the idea of the repository, we could begin to see other fine-tuning that needed to be done. And we were responding to real needs as experienced by the real personalities and institutional climate of the institution.

Now we are crawling fast. Soon we will be able to walk — go into live production. And we will have a nice core of library and academic personnel enthusiastically working together to draw on and to build the core from which it will grow.

What we’ve had to discard is a lot of effort that was trying, I think, to set up too much before we started. And especially when I believe that much of that effort was really bypassing the immediate goals of the repository — open access to the university’s research. That was only delaying the start. Trying to bite off something too big, like a python struggling to gullet a cow. All that effort in trying to crush the beast to a manageable size before it can ever get anything near its throat. It will get there. But not very fast.

I’m excited about all the small things I’ve been able to get done so far, and to help set in train. I won’t be here long enough to see the repository fully bloom, but it’s great to think I’ll be leaving with things set on a positive course here.

Another issue related to the repository development has been the library’s relations with R&D here, or the lack of such a relationship. Developing a library research assistance strategy is soooo much easier with a repository thrown into the mix. Librarians can stop hitting their heads against the wall of trying to get academics to do things in ways that the librarians can see are better, and instead offer some real goodies: the greater exposure that repositories have been demonstrated to bring, and being in a position to be conduits for even further information goodies relating to national research databases and object reuse and exchange, etc. A research repository has the potential, at least with the right strategies for building institutional support, to get libraries and R&D people and authors to collaborate for the benefit of all.

And one other thought I can’t escape from my experience here so far. All the progress in troubleshooting issues relating to the implementation of the proprietary repository has come from (a) consultation with the community of other users of the system; (b) tech support from USQ-Repository Services; and (c) my own testing and investigations. In the process of compiling a report on repository options for the library and university, it became increasingly clear that those in this industry who say of open source repository solutions, “You get what you pay for, i.e. nothing”, are unaware of the realities of open-source repository development and support. But that’s another topic entirely.

August 20, 2008

Repository – Research Office relations

Filed under: Repositories — Neil Godfrey @ 5:55 pm

Thought I’d share here the results of a bit of informal asking around about what has made for good working relationships between managers of institutional repositories in universities and their research office departments.

One of the first ingredients mentioned for success in institutions that boasted of excellent IR-RO relations was that IR managers were represented on important institutional research and higher degrees working groups and committees. That representation was stressed as very important by the IR managers I spoke to who had successful RO relationships.

Conversely, key RO people are found on IR management committees helping steer the repository projects.

Another common ingredient was that IR managers made very early contact with their RO people, before any external pressures to do so (e.g. RQF reporting) came into effect. A cooperative working relationship to assist with sharing of resources was started early, and it was based on a willingness to mutually assist one another professionally. IR managers would ask RO for assistance in finding key contacts to approach for making IR deposits, and this matured into a mutual sharing of information for each other’s benefit. Sometimes the IR would have records, or vital metadata links, that were of benefit to and sought out by the RO.

I assume that such a relationship presupposes a respect for the contributions each has to make.

There also appears to be a mutual respect for each other’s data requirements, and a willingness on the part of the librarian IR managers to work with the data they receive from research departments and systems. That is, there is no conflict over data standards. RO has one set of needs, and the IR another. So IRs with successfully managed relationships with their research departments will set up their own data checks for their own purposes. They may notify RO of some discrepancies, but will leave it up to RO to do their own thing. Besides, librarians may not always have it “right” in this field. Where IRs may have freely accepted an academic’s RFCD codes, for example, it may well be the case that RO really does sometimes know better in specific instances.

One other common attribute was the high regard in which the IRs were held in these universities. For some this was relatively easy — such as when the IR was the brainchild coming from VC or deputy VC level. For others, harder work on the part of the librarians to push through the initial inertia barrier may have been necessary.

So in sum:

  1. IR managers on RO committees
  2. RO key personnel on IR committees
  3. Start early — not simply as a response to external pressure
  4. Mutual sharing of data (and mutual respect for what each other can contribute)
  5. Acceptance of each other’s unique roles and functions (avoiding criticism of the other’s data and standards)

#3 of course is an historical factor and can’t be changed. But #4 may always be an area with potential for planned growth and improvement on the part of IR managers. Being up to date with research requirements, knowing how one’s own skills and IR functions can assist ROs in a changing environment (e.g. ERA now being introduced), and working to build a research support strategy — these were a couple of areas one successful IR manager suggested to me as ways repository folk can maintain a useful role in partnership with RO folk.

I thank those I spoke to for their helpful feedback. I now feel ready to make a plan for building up IR-RO relations myself! 🙂

August 4, 2008

Australian university repositories (research and publications) – updated 12/08/08

Filed under: Repositories — Neil Godfrey @ 12:46 pm

Updated 12th August, 08

I have compiled the following from a combination of the ROAR list, the ARROW list and web searches against a list of Australian universities. It is more up to date than the current ROAR list, but I have also restricted my list to university research and publications repositories. There’s also a detailed list in the GoogleGroup InstitutionalRepositoriesCommunity-ANZ, which I discovered after compiling this list.

Since posting it here others have directed me to OpenDOAR and the Arrow Discovery Service lists, so I am linking them here in an effort at completeness.

And the EDNA List of Australian Institutional Repositories.

My initial limited sources served my immediate purpose, which was:

  • to get an overview of the extent of use of the different repository platforms in different types of Australian universities (well-funded, less well-funded, research- or teaching-emphasis),
  • and then to follow through by exploring who was moving from one platform to another and to hopefully make contact and discuss reasons and comparisons. Some of that info is included here, too.

Digital Commons users

For Digital Commons info see
http://www.bepress.com/ir/
http://leven.comp.utas.edu.au/AuseAccess/pmwiki.php?n=Software.ProQuest

Bond University
e-publications@bond

Edith Cowan University
Research Online @ ECU

Southern Cross University
ePublications@SCU

University of Wollongong
Research Online

DigiTool users

For DigiTool info see
http://www.exlibrisgroup.com/category/DigiToolOverview

Charles Sturt University
CRO (CSU Research Output)

The University of Melbourne*
University of Melbourne ePrints Repository (UMER)

DSpace users

For DSpace info see
http://www.dspace.org/
http://www.dspace.org/introduction/index.html

Australian National University* (was ePrints, moved to DSpace)
Demetrius

Flinders University~
Flinders Academic Commons

Griffith University~
Griffith Research Online

Swinburne University of Technology
Swinburne Image Bank

The University of Sydney*
Sydney eScholarship Repository

The University of Adelaide*
Adelaide Research and Scholarship

University of Technology, Sydney
UTS iRepository

EPrints users

For EPrints info see
http://www.eprints.org/
http://eprints.soton.ac.uk/

Curtin University
espace@Curtin

James Cook University~
JCU ePrints

Queensland University of Technology
QUT ePrints

University of Southern Queensland
USQ ePrints

University of Tasmania
UTas ePrints

Victoria University
VU Eprint Repository

Fez

For Fez info see
http://dev-repo.library.uq.edu.au/wiki/index.php/Main_Page
http://www.library.uq.edu.au/escholarship/

The University of Queensland* (was ePrints, moved to Fedora/Fez)
UQ eSpace

VITAL

For VITAL info see
http://www.vtls.com/products/vital

Central Queensland University
aCQUIRe

Macquarie University
Macquarie University Research Online

Monash University*
Monash University ARROW Repository

Swinburne University of Technology
Swinburne Research Bank

University of New England
e-publications@UNE

The University of Newcastle, Australia~
NOVA

University of South Australia
arrow@UniSA

The University of New South Wales*
UNSWorks

University of the Sunshine Coast
Coast Research Database

University of Western Sydney
UWS Research Repository

Other types of repositories

Deakin University — has a “teaching and learning research repository” DTLRR

RMIT University — has a Learning Objects repository

Other software — see the “Other Softwares” ROAR list

Govt of SA Health Publications uses Digital Commons — SA Health Publications

Australian Govt Dept of Defence uses DSpace — DSTO Publications Online

ALIA did use an EPrints repository for about 95 publications but this is not currently online

ADT runs the ETD-db — ADT

Others – some pending

Universities not listed above, some of which are in the process of preparing to go live with production repositories soon:

Australian Catholic University
Charles Darwin University (Fez pending)
La Trobe University~ (VITAL pending?)
Murdoch University~ (VITAL pending)
University of Ballarat (VITAL pending?)
University of Canberra
University of Notre Dame Australia
University of Western Australia* (DigiTool pending)

Those I’ve marked with “?” I have not yet personally confirmed.

Don’t hesitate to let me know of any updates/oversights!
