Metalogger

October 19, 2008

INFORMAL comparison of some institutional repository solutions

Filed under: Repositories — Neil Godfrey @ 2:47 pm

Over the last few years I have worked closely with a number of different institutional repository solutions, both open-source and enterprise products. There are several I have not had personal experience with, but I have taken opportunities to speak with a wide range of users of these products, as well as with representatives and producers of those solutions. I have also sought input from other users of repositories I am personally familiar with, in an attempt to balance out my own impressions. The following comparison is based on feedback primarily from managers of the systems — whether they have live production systems or have done extensive testing on systems they expect to take live soon.

The purpose of this comparison is to give an introductory-level guide for institutions interested in “what (else) is out there”.

  • The comparisons are not a systematic point-by-point balanced presentation. Anyone interested in a serious in-depth comparison or study of any particular repository solution would need to speak to other users themselves, as well as the producers or agents of the solutions.
  • It is also restricted to the repositories I know from my experiences in Australia.
  • I have not referenced costs or specific institutions here.
  • Nor have I attempted a serious comparison of the IT architecture across the systems.
  • The main focus is on support provided/needed and functionality of each product.

Digital Commons

http://www.bepress.com/ir/

Digital Commons is a “presentation repository”, not a “preservation repository”. As a repository designed primarily to showcase an institution’s research, its emphasis is on an attractive and compelling interface for users, including self-submitters. Digital Commons is a hosted solution (i.e. hosted in California). There is no hardware to purchase, install or maintain. An institution can begin to upload papers immediately after setup. Purchase and maintenance is on a renewable one-year or limited number of years basis.

Digital Commons cannot be synchronized with another preservation repository for migration purposes. A preservation repository, unlike Digital Commons, will record and preserve authentication, versioning, rights, structural and descriptive metadata. In Digital Commons such data will not be preserved for migration/exit strategy purposes to a preservation repository.
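
To make that distinction a little more concrete, here is a rough sketch (my own illustration, nothing to do with any BePress export format) of the kinds of metadata an exit-strategy package for a single object would need to carry into a preservation repository. A presentation repository typically records little beyond the descriptive portion.

    # Hypothetical sketch of an exit-strategy export record for one object.
    # None of these field names come from Digital Commons; they simply
    # illustrate the metadata categories a preservation repository keeps.
    export_record = {
        "descriptive": {"title": "Example paper", "creator": "A. Author"},
        "rights":      {"licence": "All rights reserved", "embargo_until": None},
        "structural":  {"files": ["paper.pdf", "appendix.pdf"],
                        "order": ["paper.pdf", "appendix.pdf"]},
        "provenance":  [{"event": "ingest", "agent": "repository staff",
                         "date": "2008-10-19"}],
        "versions":    [{"version": 1, "file": "paper.pdf"}],
    }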

Support

  • Three universities in Australia using Digital Commons reported that the service from BePress is “very good”.
  • Setup takes about three weeks. All report that Digital Commons is easy to set up.
  • Phone hook-ups were used for training and instruction at the beginning. Web demonstrations accompanied these.
  • Requests by a university to change the front page appearance were responded to quickly and changes made efficiently.
  • Another institution has requested many ‘fine tuning’ modifications to their instance of Digital Commons, and all these requests have been met “pretty quickly”. One institution wanted the format of citations changed and BePress effected this change quickly for them.
  • One institution that has used Digital Commons for more than two years said it had never had any downtime.
  • Nightly uploads of new documents.

Functionality

  • Documents can be set to open or closed access
  • Different authentications can be set up for different users
  • Self submission is possible
  • Workflow stages can be configured “to some extent”, so that a central library service can monitor self-submitted documents for quality control and copyright issues
  • Embargo functionality
  • Different types of media files can be deposited (e.g. mp3, pdf, video)
  • OAI harvesting
  • PRPs (Personal Researcher Pages) – this was a strong selling point at one institution. In addition to the central epubs repository, links can take one to a PRP of an author, and this PRP can contain a list of their publications, be used as their homepage, and serve as a point from which to access their documents. The library instance of Digital Commons “harvests” these PRPs and includes links back to the PRPs on the document pages.
  • Documents organized by collections
  • Able to hide preparatory work on a document being uploaded until it is ready to go live.
  • Reporting and statistics

DigiTool

http://www.exlibrisgroup.com/category/DigiToolOverview

An Ex Libris product, DigiTool is a Digital Asset Management System (DAMS). It is designed primarily for teaching functions, and its repository capacities consist of a set of additional modules. The DAMS is primarily designed for teachers to share their digital objects (images, course notes and notices, indexes of resources, exam papers etc.). Much of this data is ephemeral.

DigiTool is not a hosted solution, but there is a community of users, a consortium client of Ex Libris, that does support a “hosted server” – UNILINC: http://www.unilinc.edu.au/services/hosting.html

Support

Users of DigiTool report that it definitely requires their own local IT support to configure it appropriately for specific institutional needs.

Functionality

  • Functionality depends on the modules purchased. Modules are available for:
  • academic self-submission
  • collection management (for arranging objects, adding thumbnails and descriptive metadata)
  • a JPEG 2000 viewer (an optional plugin)
  • OAI interoperability (harvesting)
  • The Ex Libris demo of DigiTool says it is scalable, and users of DigiTool all spoke of it being able to do much more than their immediate requirements.
  • Ex Libris also advertises that it “supports interoperability through open architecture”.
  • Embargo periods
  • Self-submission via the Deposit Module. The Deposit Module provides an interface and workflow which enables submission of objects and metadata by non-staff users.
  • Workflow also allows for authorized staff to control, edit, and approve/decline the submitted material.
  • Different levels of authentication: user/patron authentication is handled by the Local User management function or via LDAP.
  • Copyright: this can be managed by manual assignment of access rights to the object.
  • Objects can be assigned with access rights permissions.
  • The following formats are supported for load into DigiTool: MARCXML, DCXML, MODSXML, CSV, METS  — Given the claim that DigiTool is based on open architecture, one should expect the data stored would be migratable to other systems.
  • ExLibris advertises that DigiTool supports preservation standards such as PREMIS and the OAIS (Trusted Repositories) standard model.
  • DigiTool’s “interoperability module” for OAI harvesting does not produce OAI-compliant Dublin Core. This is not a problem for harvesting by the NLA’s Discovery Service, because the DS has configured its service provider to read and harvest their DigiTool feeds. But seamless OAI harvesting cannot be guaranteed with other service providers. DigiTool users are expected to be informed of this issue by USQ-Repository Services. I have not been able to learn whether this is unique to DigiTool or is also an issue with other proprietary solutions discussed in this report.
  • Some users find DigiTool’s deposit procedure “clunky” when trying to get it to do what they want. Editing of objects can take hours (overnight) to take effect; citations need to be created separately since they are not automatically generated; and multiple keystrokes are required for some “simple” operations such as moving an object from open to restricted access.
  • One user said that the upcoming version of DigiTool “promises” to be able to give them the ability to handle hierarchical structures. “We think it will do what we want.”
  • Citations need to be specifically created in DigiTool – they are not automatic as in EPrints.

Reasons for adoption

Most institutions that have adopted it, or are considering doing so, said that their primary reason was to integrate it with their other Ex Libris products. Some specifically added that it was policy for them to favour enterprise solutions over open-source solutions.

DSpace

http://www.dspace.org/ and http://libraries.mit.edu/dspace-mit/index.html

DSpace is an open source solution developed by MIT. It has a large and active community of users. At least 450 registered DSpace repositories worldwide are evidence of DSpace’s robustness, ease of implementation, simplicity of maintenance and ongoing use, and low-cost.

Support

  • A large and active community of supporters with experience and expertise available to draw on
  • Thorough online documentation for IT staff and managers for customization and implementation
  • Step by step online tutorials
  • Online assistance
  • The amount of local IT support required for the implementation of DSpace depends on the extent of configuration changes an institution wishes to make.
  • DSpace provides a module, Manakin, which enables the configuration of much more “original” interfaces without “intensive long term” IT support.
  • Institutions with basic, largely “out of the box” configurations report that they can “in the main” do without local IT support. The trade-off is that a few “minor issues” (e.g. maintaining correct indexing records when moving an object from one collection to another) persist.

Functionality

  • DSpace manages objects in a hierarchical, collections-based structure. Collections (or hierarchies of collections) display alphabetically on the main page.
  • This Collections based organization, with inbuilt workflow and authentication capabilities, enables different faculties or departments to manage their own deposits and structure of their collections. Workflows can be set up to still provide for central quality control and final editing by the library.
  • Descriptive metadata for the objects has a flat structure, which means that for objects with multiple authors from different affiliations there is no automatic guarantee that the data can be transferred intact from one repository to another. Preserving those relationships requires IT support to set up, say, a METS package that encapsulates the data in its original relationships for successful migration (see the sketch after this list).
  • Workflows and authentications are supported.
  • Embargo periods are supported (metadata page displays but the attached document becomes public at a preset date)
  • Objects can be made inactive to be hidden from public view.
  • Different mime types are supported, including video and audio.
  • DSpace is integrated with Research Management systems in several universities.
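
As an illustration of the flat-metadata point above (a sketch of my own, not DSpace code), a hierarchical schema such as MODS can keep each author nested together with their affiliation, and a METS package can then carry that structure between repositories:

    # Minimal, illustrative sketch: pairing each author with their affiliation
    # in a MODS fragment, which a METS dmdSec could then wrap for migration.
    import xml.etree.ElementTree as ET

    MODS = "http://www.loc.gov/mods/v3"
    ET.register_namespace("mods", MODS)

    authors = [("Smith, Jane", "University A"), ("Lee, Bob", "University B")]

    mods = ET.Element("{%s}mods" % MODS)
    for name, affiliation in authors:
        person = ET.SubElement(mods, "{%s}name" % MODS, type="personal")
        ET.SubElement(person, "{%s}namePart" % MODS).text = name
        # the affiliation stays nested under the same name element, so the
        # author-affiliation pairing survives migration intact
        ET.SubElement(person, "{%s}affiliation" % MODS).text = affiliation

    print(ET.tostring(mods, encoding="unicode"))

Flat Dublin Core, by contrast, would list the authors and the affiliations as separate repeating fields with no reliable link between them.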

EPrints

http://www.eprints.org/

EPrints is an open source solution developed and supported by the University of Southampton. EPrints is “easy to install, easy to configure, and needs minimal maintenance. Once installed, it simply works without fuss. Over a year, no maintenance has been required to the UTas server apart from updates.” (Arthur Sale, UTas)

All EPrints administrators I have contacted have spoken well of its simplicity and stability. It is widely seen as an ideal repository solution for initial implementation in a university with limited financial resources and IT support.

“EPrints is a mature software package, with an established community. It offers a complete solution for managing a research repository for Open Access. EPrints can be put to other uses, but for other uses such as image repositories alternative software might be more appropriate. . . . However, the software is under active development and it is particularly useful as an Open Access document repository.” (http://rubric.edu.au/repositories/eprints.htm)

Support

“Many institutions do not have the resources necessary to build or maintain an institutional repository. The EPrints Services team offers a complete range of advice and consultancy to support institutions who have adopted, or who are looking to adopt, the EPrints solution. We can provide as much or as little support as you need to create and maintain a professional repository.” – EPrints site

This assistance is gratis to those implementing and maintaining an EPrints repository.

I contacted at least half a dozen universities using EPrints, and all expressed unqualified praise for the level and timeliness of support from Southampton. This praise came from both IT staff who have had to liaise with Southampton and from repository managers.

Patches and upgrades are released regularly. Users have remarked on the ease with which these are installed and the robustness of their maintenance.

Functionality

  • EPrints supports OAI-PMH harvesting protocols.
  • Plug-ins have been developed to support specific research reporting requirements and the emerging SWAP (Scholarly Works Application Profile), which is pioneering interoperability and semantic web developments for scholarly works.
  • Self-submission (with a simple self-submission interface that is quick and easy to learn) is supported.
  • Workflows can be configured for editors, submitters, and monitoring staff with different permissions.
  • Objects can be removed from open access.
  • Batch import (e.g. of ADT records) is supported.
  • Peer review status, publication status, copyright and other administrative information, and citation generation and statistics by objects and author are all part of the “out of the box” package.
  • EPrints supports text (in particular pdf) and image files, including multiple files per object.
  • Flat metadata structure. So when there are multiple authors with different affiliations there is no guarantee that the right author-affiliation matches will be maintained in future migrations to other repositories. Workarounds are available, but they need IT support to implement. In raising the flat metadata structure issue, it should also be mentioned that EPrints is developing an RDF module that converts its metadata into “triples” (subject-predicate-object); see the sketch after this list. RDF (Resource Description Framework) is the basis of the emerging Web 3.0 (Semantic Web) and enables data to be converted into multiple schemas, including complex hierarchical structures.
  • EPrints has recently begun to support preservation metadata through the work of its Preserv project, and the following preservation functions have been implemented in EPrints 3:

1. A history module to record changes to an object and actions performed on an object
2. METS/DIDL plugins to package and disseminate data for delivery to external preservation services

  • Slow indexing issues in EPrints have been rectified with the EPrints 3 version.
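
To give an idea of what the RDF conversion mentioned above could look like, here is a toy sketch of my own (the property names and URIs are invented, and this is not the EPrints module itself):

    # Hypothetical sketch of author-affiliation data expressed as RDF triples.
    from rdflib import Graph, Namespace, URIRef, Literal

    EX = Namespace("http://repository.example.edu/terms/")   # invented vocabulary
    g = Graph()

    paper = URIRef("http://repository.example.edu/id/eprint/318")
    author = URIRef("http://repository.example.edu/id/person/42")

    # subject - predicate - object: the author is a node of its own,
    # so its affiliation travels with it rather than floating free
    g.add((paper, EX.hasAuthor, author))
    g.add((author, EX.fullName, Literal("Smith, Jane")))
    g.add((author, EX.affiliation, Literal("University A")))

    print(g.serialize(format="turtle"))

Because the author is a node in its own right, the affiliation is attached to that node, which is exactly what a flat list of fields cannot guarantee.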

EPrints development has prioritized robust functionality, and this limited the file types supported in earlier “out of the box” versions of EPrints. Successive releases, however, are allowing a wider variety of mime types to be supported.

One limitation up to now (version 3.0) with EPrints has been the failure of the embargo facility to publish objects automatically on their release dates; they need to be published manually.

Dates for theses appear as “date published” instead of “date completed” in at least one institution. It is not clear to me if this is a configuration issue that is resolvable with IT and/or Southampton support.

Users

269 archives worldwide are known to be running EPrints.

One repository services manager who recommended EPrints for smaller institutions explained that it was “easy to set up, easy to configure, easy to use, and has a very large, open community supporting it and using it  . . . and not all that expensive .  .  . for small institutions such as these, it was ideal.” (USQ-RS Manager in personal email)

Equella

http://www.thelearningedge.com.au/

Equella is developed by The Learning Edge International (a Tasmania-based company).

Equella is primarily a “Learning Content Management System”. Learning Edge also describe it as a repository, but speak of it as a teaching repository tool. It is “a fully integrated Digital Repository and Content Authoring Tool”. It can be used as a collaborative lesson planning tool as well as a repository. It can plug into Blackboard and WebCT.

Users have described its administration module as well developed. “A non-IT person can set it up with a graphic user interface for collections and permissions and configuration.”

Support

  • One institution needed “some” local IT support to set up Equella. They now need local support for little more than applying the periodic patches that Learning Edge sends out.
  • Another also said that the “setting up” of Equella was the most complex part, but that this included the preparatory work: Equella is so flexible that many decisions need to be made in advance about what exactly is wanted and what the policies should be. Once this work had been done, it was relatively easy to set up. All the development work is done by Learning Edge.
  • One institution said that the initial setup involved two days’ training, which was described as “very adequate”.

Functionality

  • self-submission (including higher degree thesis students)
  • different levels of authentication (easy to setup – a simple switch; manages various levels of permissions among academics – very flexible)
  • workflow systems (staged steps to check for copyright compliance with OAKLIST, Sherpa, etc)
  • digital rights management (can aggregate objects to groups for specific workflows and permissions)
  • embargo periods
  • OAI-PMH harvesting (e.g. to Google Scholar)
  • records can be pulled out of Research Master and sent back into Research Master
  • one user noted that the user front end is boring and uninspiring, although the functionality behind it is “great fun”; they are relying on their own IT people to improve the web appearance soon.


Users

Equella has about 18 clients in Australia, including several Tasmanian, Queensland, South Australian and Victorian TAFEs and Education Departments. One institution chose Equella to handle four primary tasks:

  • RQF reporting (now ERA)
  • as a backend to Moodle CMS
  • to be the base of the e-reserve collection
  • developing a university research repository

Open source solutions were not an option for that institution because of the limitations of the brief.

Another chose Equella because:

  • it has a well-defined administration module, so a non-IT person can set it up with a graphical user interface with various collections and permissions, and they did not want to align with a particular library system

Fez

http://dev-repo.library.uq.edu.au/wiki/index.php/Main_Page

Fez is a front end to the Fedora repository software. It is developed by the University of Queensland Library as an open source project hosted on SourceForge. See http://sourceforge.net/projects/fez/

Project Overview: http://espace.library.uq.edu.au/documentation/

Fez is part of the Australian Partnership for Sustainable Repositories (APSR). Fez is one of the deliverables of the APSR eScholarship testbed in the University of Queensland Library.

Support

  • The University of Queensland developers do offer support for other users of Fez; they should be contacted for details.
  • To implement Fez an IT officer with the following basic knowledge set would be required:
    • MySQL
    • Fedora
    • Fez – written in PHP, but also CSS, HTML (Smarty HTML templating)
    • Apache
    • other related prerequisite software
    • an understanding of how it all fits together
  • This would also involve that person (obviously) having programmer level access to the server on which Fez ran (something to be considered, depending on your IT department’s policies).

RUBRIC recommendation as at February 2007:

For institutions wanting to run a general purpose repository Fez is a promising choice, provided that the technical resources are available to manage it. Contact the developers, as they may be able to offer support under a formal arrangement.

Functionality

  • Fez is a rapidly maturing repository software application.
  • Fez is built around constructs known as “Communities” and “Collections”.
  • Supports self-archiving
  • Workflow authentications and authorizations. These are configurable through a GUI.
  • Security based on FezACML to describe user roles and rights on a per object basis or through parent collection or community security inheritance.
  • Security at object granularity.
  • Statistics (eg Downloads per Author, per Community, per Collection, per Subject etc).
  • Preservation metadata extraction
  • OAI service provider for harvesting
  • Supports migration to and from other repository systems (DSpace, EPrints, VITAL)

VITAL

http://www.vtls.com/products/vital

VITAL provides every feature–ingesting, storing, indexing, cataloging, searching and retrieving–required for handling large text and rich content collections. VITAL takes advantage of technology standards such as RDF, XML, TEI, EAD and Dublin Core to easily describe and index an assortment of electronic resources. VITAL leverages the benefits of open-source solutions such as Apache, MySQL, McKOI and FEDORA™. VITAL conforms to common Internet data communications standards such as TCP/IP, HTTP, SOAP and FTP. Additional standards utilized include WSDL Web Services, OAI-PMH, Dublin Core, MARCXML, JHOVE, MIX (Metadata for Images in XML Schema), and SRU.  (from the VTLS site)

A PREMIS datastream is also generated at ingest.
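
For readers who have not met PREMIS, the following is a simplified, hand-written sketch of the sort of event entry such a datastream records; VITAL’s actual output will differ in its details, and the identifier values here are invented.

    # Simplified, illustrative PREMIS event entry (element names follow the
    # PREMIS 2 schema; the values are made up for this example).
    import xml.etree.ElementTree as ET

    PREMIS = "info:lc/xmlns/premis-v2"
    ET.register_namespace("premis", PREMIS)

    event = ET.Element("{%s}event" % PREMIS)
    ident = ET.SubElement(event, "{%s}eventIdentifier" % PREMIS)
    ET.SubElement(ident, "{%s}eventIdentifierType" % PREMIS).text = "local"
    ET.SubElement(ident, "{%s}eventIdentifierValue" % PREMIS).text = "ingest-0001"
    ET.SubElement(event, "{%s}eventType" % PREMIS).text = "ingestion"
    ET.SubElement(event, "{%s}eventDateTime" % PREMIS).text = "2008-10-19T14:00:00Z"

    print(ET.tostring(event, encoding="unicode"))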

Support

Australia’s community of VITAL users has been able to coordinate support through ARROW. 2008 is the final year of the ARROW project. ARROW is expected to be replaced by a CAUL sponsored body, CAIRRS, with a remit of supporting the Australian university repository community more generally.

VTLS has acknowledged that their past record in prior testing of new product versions, and their follow-up support, could have been better. At a recent ARROW meeting in Brisbane, a VITAL representative assured users that VITAL itself would upgrade users’ current versions to 3.4, due for release around the end of October. In the past, patches have not always been available for recognized issues (including significant ones such as certain PDF files not able to be indexed) and users have had to wait for new version releases. VTLS does have an online “hotline” for logging such issues.

Functionality

The VITAL product promises much. It has the advantages of a Fedora base, which enables the storage of a wide variety of content types, and greater sophistication in their management and support for any standard metadata schema (hierarchical or flat).

  • Documents can be set to active or inactive (public or hidden) — although at the moment of deposit they necessarily default to active
  • Self submission is possible through VALET
  • Workflow stages can be configured so that a central library service can monitor self-submitted documents for quality control and copyright issues
  • No embargo functionality
  • Different types of media files can be deposited (e.g. mp3, pdf, video)
  • OAI harvesting
  • Statistics — though some institutions opt to hide this function because the stats file regularly corrupts or the statistics reset to zero; this is said to be fixed in version 3.4
  • Copyright: this can be managed by manual assignment of access rights to the object.
  • Supports migration to and from other repository systems (e.g. DSpace, EPrints) with METS.
  • Ability to search the full-text content of PDF, DOC, RTF and other document formats — although there are currently security issues with this function, such as public searches not being completely cut off from “hidden objects”
  • Ability to display multi-page documents.
  • Integrated editors for easy editing of metadata.
  • Customizable templates for display of content.


October 10, 2008

Are repositories set to be left out in the cold?

Filed under: Dublin Core,Harvesting,Repositories — Neil Godfrey @ 1:01 pm

Repositories and their harvesters have a rule of their own that violates Dublin Core standards. Because of this, are repositories and harvesters on target for a massive retroversion or major set of patches if they are to be a part of the semantic web? (I don’t know, but I’d like to be sure about the answer.)

Once again at a Dublin Core conference I listened to some excellent presentations on the functionality and potential applications of Dublin Core, but this time I had to see if I could poop the party and ask at least one speaker why the nice theory and applications everywhere simply did not work with the OAI harvesting of repositories.

I like to think that standards have good rationales. The web, present and future (e.g. the semantic web) is  predicated upon internationally recognized standards like Dublin Core. According to the DCMI site the fifteen element descriptions of Simple Dublin Core have been formally endorsed by:

  • ISO Standard 15836-2003 of February 2003 [ISO15836]
  • ANSI/NISO Standard Z39.85-2007 of May 2007 [NISOZ3985]
  • IETF RFC 5013 of August 2007 [RFC5013]

But there is one area where there is a clear conflict between DCMI element definitions and OAI-PMH protocols. The DC usage guide explains the identifier element:

4.14. Identifier

Label: Resource Identifier

Element Description: An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Examples of formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

Guidelines for content creation:

This element can also be used for local identifiers (e.g. ID numbers or call numbers) assigned by the Creator of the resource to apply to a particular item. It should not be used for identification of the metadata record itself.

Contrast the OAI-PMH protocol:

A unique identifier unambiguously identifies an item within a repository; the unique identifier is used in OAI-PMH requests for extracting metadata from the item. Items may contain metadata in multiple formats. The unique identifier maps to the item, and all possible records available from a single item share the same unique identifier.

The same protocol explains that an item is clearly distinct from the resource and points to metadata about the resource:

  • resource – A resource is the object or “stuff” that metadata is “about”. The nature of a resource, whether it is physical or digital, or whether it is stored in the repository or is a constituent of another database, is outside the scope of the OAI-PMH.
  • item – An item is a constituent of a repository from which metadata about a resource can be disseminated. That metadata may be disseminated on-the-fly from the associated resource, cross-walked from some canonical form, actually stored in the repository, etc.
  • record – A record is metadata in a specific metadata format. A record is returned as an XML-encoded byte stream in response to a protocol request to disseminate a specific metadata format from a constituent item.
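
The practical difference is easiest to see in a harvesting request: the item’s unique identifier goes into the OAI-PMH request itself, while the dc:identifier elements travel inside the returned record and describe the resource. A rough sketch (the endpoint and identifier below are invented, modelled on the example record quoted further down):

    # Illustrative only: the repository URL and oai identifier are made up.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    base_url = "http://eprints.repository.edu/cgi/oai2"   # hypothetical endpoint
    item_id = "oai:eprints.repository.edu:318"            # identifies the *item*

    params = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": "oai_dc",
        "identifier": item_id,   # the protocol-level identifier, selecting the item
    })

    # The response body is an XML record like the one quoted below, whose
    # dc:identifier elements point at the splash page and/or the resource itself.
    response = urlopen(base_url + "?" + params).read()
    print(response[:500])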

I wrote about this clash of standards and protocols in another post last year. One response was to direct readers to Best Practices for OAI Data Provider Implementations and Shareable Metadata.

The working result for many repositories is a crazy inconsistency. Within a single Dublin Core record for OAI harvesting the same element name, identifier, can actually be used to identify different things:

<oai_dc:dc
     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
     http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   <dc:title>Using Structural Metadata . . . </dc:title>
   <dc:creator>Dushay, Naomi</dc:creator>
   <dc:subject>Digital Libraries</dc:subject>
   <dc:description>[Abstract here]</dc:description>
   <dc:description>23 pages including 2 appendices</dc:description>
   <dc:date>2001-12-14</dc:date>
   <dc:type>e-print</dc:type>
   <dc:identifier>http://eprints.repository.edu/318/</dc:identifier>
   <dc:identifier>1-85636-082-X</dc:identifier>
 </oai_dc:dc>

In this OAI DC the first identifier identifies the splash page for the resource in the repository. The second identifier identifies the resource itself. It works for now, between agreeable partners. But how sustainable is such a contradiction? What is the point of standards?

As far as I understand the issue, this breakdown in the application of the Dublin Core standard is the result of institutional repositories needing their own branding to come between users and the resources they are seeking. Without that branding they would scarcely have the institutional support that enables them to exist in the first place.

Surely there must be other ways for harvesters to be aware of the source of any particular resource harvested and hence there must be other ways they can meet the branding requirement. Surely there is a way to retrieve an identified resource (not an identified metadata page about the resource) and to display it with some branding banner that will alert users to the repository — and related files and resources — where it is archived. Yes?

I mention “related files and resources” along with the branding page — but maybe this is a separate issue. Where a single resource consists of multiple files, is the metadata page a valid proxy for that resource anyway? Or is there another way of displaying these?

Australia has had the advantage of a national metadata advisory body, MACAR. The future of MACAR into next year is still under discussion, but such an issue would surely be an ideal focus for such a body — to examine how this clash impacts the potentials of repositories today and in the future. A national body like MACAR has a lot more leverage for pioneering changes if and where necessary.

What should be done?

What can be done?


But is there more? More confusion of terms?

In having another look at the DCMI site for this post I noticed something else in the latest DC Element Set description page:

Term Name: identifier
URI: http://purl.org/dc/elements/1.1/identifier
Label: Identifier
Definition: An unambiguous reference to the resource within a given context.
Comment: Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.

DCMI recommends that an identifier be “a string”. In the context of RDF and the semantic web my understanding of “string” is a dead-end set of letters, as opposed to a resolvable URI or “thing”. But the DC Usage Guide “explains” that an applicable formal identification system allowed here can also be a URI. So what don’t I understand about the difference between strings and (RDF) things, now?
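
In RDF terms the difference looks something like this (a toy sketch of my own, reusing the hypothetical record and ISBN from the earlier example): a literal is just a label, whereas a URI reference is a node that other statements can attach to and that software can follow.

    # Toy illustration of "string" versus "thing" for dc:identifier.
    from rdflib import Graph, Namespace, URIRef, Literal

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    g = Graph()
    record = URIRef("http://eprints.repository.edu/318/")   # splash page from the example above

    # As a plain literal, the ISBN is a dead-end string: nothing else can link to it.
    g.add((record, DC.identifier, Literal("1-85636-082-X")))

    # As a URI reference, the identifier is a "thing" in its own right: a node
    # that further statements (format, holdings, relationships) can be made about.
    g.add((record, DC.identifier, URIRef("urn:isbn:1-85636-082-X")))

    print(g.serialize(format="turtle"))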


October 2, 2008

Online journals and institutional repositories – comparing their potential impacts on research methods (and journal publications)

Filed under: Repositories — Neil Godfrey @ 10:55 pm

While discussing the potential impact of institutional repositories on journal publications, an academic alerted me to an article — Electronic Publication and the Narrowing of Science and Scholarship by James A. Evans — discussing research into the impact of the online journal culture on research methods in the world of scholarship.

The article is interesting for several reasons, but I want to address its conclusions in relation to research repositories – something that the research did not intend to address. I think such a comparison is worthwhile to the extent that it helps clarify how, as I see it, institutional repositories and online journal databases each have different potential impacts on research methods and on publishing companies’ futures. I also can’t resist a comment on where I think we are headed beyond the current world of online databases, whether of journals or research repositories.

The opening abstract of the article notes:

Online journals . . .  are used differently than print — scientists and scholars tend to search electronically and follow hyperlinks rather than browse or peruse . . . .

IR comparison comment: IRs, on the other hand, do facilitate browsing by keywords, titles, authors, year, resource type (that is, text or still image or video . . . )  — not just searching as per online journals like JSTOR or EBSCO.

Evans’ research into the impact of online journal databases found that:

As deeper backfiles became available, more recent articles were referenced; as more articles became available, fewer were cited and citations become more concentrated within fewer articles.

Costs and benefits of online journal databases

His interpretation of this paradox:

  1. as online searching replaces browsing in print, there is greater avoidance of older and less relevant literature;
  2. hyperlinking through an online archive puts experts in touch with consensus about the most important prior work – what work is broadly discussed and referenced;
  3. thus online search bypasses many marginally related articles that are still skimmed by print researchers.

Findings and ideas that do not become consensus quickly will be forgotten quickly.

This research ironically intimates that one of the chief values of print library research is poor indexing. Poor indexing – indexing by titles and authors, primarily within core journals – likely had unintended consequences that assisted the integration of science and scholarship. By drawing researchers through unrelated articles, print browsing and perusal may have facilitated broader comparisons and led researchers into the past.

Evans sees this as one more step away from the contextualized monograph:

the contextualized monograph, like Newton’s Principia or Darwin’s Origin of the Species, to the modern research article. The Principia and Origin, each produced over the course of more than a decade, not only were engaged in current debates, but wove their propositions into conversations with astronomers, geometers, and naturalists from centuries past.

Thus the higher efficiency with which arguments can be framed with the assistance of online searching and hyperlinking, “the more focused – and more narrow – past and present” they will be.

It is not a strictly fair comparison, but this does remind me of a time I was inspired with fresh insights into a topic I was investigating simply from the serendipitous luck of accidentally noticing a title on a tangential topic placed just one library shelf above the one which held the exact classification numbers I was directed to browse. No online search – nor any print citation index – could ever enable me to repeat that particular stroke of luck.

In other words, any technological change will charge some costs against the way we used to do things. Maybe the advent of the printing press led some researchers to miss the chance discoveries they made in the days they had to rely on personal travel to where certain hand-copied books were known to be stored. But each change also brings its own new avenues for broader comparisons and insights from unintended serendipity. If this is not currently happening on a large enough scale to impact the statistical research of Evans’ article, then it is reassuring to know that this is not the end of the story.

But no doubt Evans also would acknowledge that the broader conversations involved in works like Principia and Origin were themselves scarcely the outcome of unintended consequences of the relatively poor indexing of the print media. Today’s researchers and scholars, I suspect, are under far more institutional pressures to specialize, produce and publish at a certain rate than regularly experienced by Newton and Darwin.

Innovation also follows demands and needs, including those of researchers. Online journal and e-book databases and catalogs are not the only points to which the electronic media are leading.

Comparing IRs with online journals, their functions and potential impacts

IRs are finding their way into a growing number of universities. (See my list of current Australian IRs and links to other registries.) These IRs do support broader opportunities for browsing, not just searching, by uncontrolled keywords and sometimes controlled vocabularies, resource types, authors and titles. Browsing is one of the alleyways the Evans article laments is missing with electronic searches. That is not the case with most IRs. Nor are IRs restricted to the localized browsing of a single institution’s research repository: they can be browsed collectively through harvesters such as OAIster, Australia’s Discovery Service, etc.

IRs are something other than a high-tech attempt at a more efficient journal database. They are quite different. They are a means for individual academics – and their institutions – to make their collective works visible and accessible. They are a means of both showcasing and preserving personal and institutional research, and also of making publicly funded research instantly and openly accessible to all.

IRs offer the capacity to more easily interrogate the discussions of individual authors over time. This is surely a potentially useful alternative to journal databases that are structured around topics. Researchers publish across many journals, and the history of a single researcher’s output can be immediately apparent in an IR even when it is scattered across a wide range of journals.

Not only immediately apparent, but immediately accessible in those repositories generated by the open access principle that holds that publicly funded research should be publicly available. Most online journal databases restrict access to those who belong to an institution with the appropriate subscriptions. If a journal is neither online nor held in print by an institution then one must wait days for the interlibrary loan process before accessing the article. In an open access repository it is instantly available. No waiting time or tedious paperwork/online-form processes to negotiate.

Journal titles that originally hosted these publications are also referenced, thus in some cases also raising awareness of journals that might otherwise have been less widely known.

Where it’s all headed?

That’s the present. And the web is still in its infancy. Our wheels are still revving in the first generation of the Web, Web 1.0. Online journal databases, and institutional repositories too, are nothing more than a mass of web pages or documents waiting to be accessed. They are little more than a “more efficient” form of the print media and print indexing. In the case of IRs, they have the bonus of allowing more innovative and extensive browsing, too. Web 2.0 is a cute next step allowing social networking, which a growing number of scholars are finding really is more than simply cute. But the next evolutionary step, Web 3.0, is beginning to mutate.

This will be the semantic web, where information will be meaningfully contextualized in the way that early (19th century!) information managers and innovators (I am thinking of Charles Cutter), who knew only the print medium, originally intended. The semantic web will mean an online world where all the varying information topics (not just web pages or pdf files) have their own URI namespaces, and where it will be possible for users to search through them via meaningful relationship enquiries (not just “X links to Y” but “how X links to Y”; and both X and Y can be interrogated within their ontological relationship to each other – is one a subset of the other? or is one an echo of another in a different discipline?). Not that Cutter envisaged the semantic web, of course. But he did seek a way of organizing information in a way that was more meaningful and useful than the classification systems that we ended up with in libraries.

Part of this will be the exchange and reuse of objects within datasets and research databases (ORE). Datasets from different fields and disciplines can scarcely “talk” to each other today because of their different measuring and conceptual modelling. Any overlapping concepts are known to few outside those familiar with both disciplines, and even those few may rarely be able to make use of such overlaps because of their varying languages. The next stage of web development will see a working towards the ability to interrogate meaningfully, and select and re-use in other contexts the specific information we seek, as well as the ability to explore major and minor side-avenues. We will not be restricted to searching or browsing pages that have been prepared for searching and browsing (and “mere” hyperlinking) by others.

If we are moving away from the “humanist” benefits of inefficient print indexing, we are, I believe, moving towards an even greater scope for creatively exploring the total chaos of information.