One of the reasons for the development of the OAI protocol for harvesting repositories was that the alternative, federated searching of repositories, could not handle the inconsistency in the metadata across those repositories. In other words, the OAI protocol is itself in part an answer to the problem of inconsistent repository metadata.
The OAI online tutorial explains why repositories are harvested rather than cross-searched directly. At a 1999 meeting in Santa Fe, the two methods, harvesting and cross-searching, were considered. It was concluded that:
Digital library experience suggested that cross searching does not scale well, at least partly because the search service degrades to the level of the slowest and least reliable server in the cross search set. . . . The more servers are cross-searched, the higher are the chances of encountering one or more slow or unreliable servers.
There is also the problem of knowing which target servers to use in any particular cross search. Collection descriptions – where they are available at all – may be inconsistent across repositories, were not designed for machine-to-machine communication and require time-consuming examination by end-users.
Differences in query language syntax and search attribute variation (between servers and over time) introduce barriers of complexity, either for the end user or the cross-search software, or both.
Ranked merging of results from distributed servers presents further technical and user-interface problems, and different size and types of targets can skew results. A browse interface is very difficult to build when the metadata to be browsed is distributed across a number of repositories.
It was suggested that a solution would be to get all the metadata records together in one place.
(cited from the OAI online tutorial)
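To make "getting all the metadata records together in one place" a little more concrete, here is a minimal sketch of what a harvester does with one page of an OAI-PMH ListRecords response. The verb, the oai_dc metadata prefix, the resumptionToken and the XML namespaces come from the OAI-PMH specification; the repository address, the sample record and the function names are my own assumptions for illustration, and nothing is fetched over the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request URL for a repository (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (records, next_token) from one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "type": rec.findtext(".//dc:type", default="", namespaces=NS),
        })
    token = root.findtext(
        ".//oai:resumptionToken", default="", namespaces=NS).strip()
    return records, (token or None)


# Made-up response page standing in for what a repository would return.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo-a.example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
          <dc:type>scholarly article</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_page(SAMPLE)
print(records)
print(list_records_url("https://repo-a.example/oai", next_token))
```

A real harvester would repeat this for every repository, following the resumption token until none is returned, and store the records locally. The slow or unreliable server then only delays the harvest, not the end user's search.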
Repositories are so easy even machines can understand them
Repositories store the output of a focused community. As I said in my previous post, it matters little if one university speaks of a scholarly article and another speaks of a scholarly journal article. Each knows what the other means. It is not too hard for technology to aggregate records using both terms. If there were no such inconsistency of terminology in the first place, the argument for federated searching of repositories, as opposed to harvesting them, would have been stronger.
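As a rough illustration of how little it takes to aggregate records that use both terms, here is a small sketch. The synonym table, the record layout and the function names are assumptions of mine, not part of any OAI specification; a real aggregator would maintain a much longer mapping.

```python
from collections import defaultdict

# Hypothetical mapping from the terms repositories use to one shared term.
TYPE_SYNONYMS = {
    "scholarly article": "article",
    "scholarly journal article": "article",
}


def normalize_type(term):
    """Lower-case, collapse whitespace, then map known synonyms."""
    key = " ".join(term.lower().split())
    return TYPE_SYNONYMS.get(key, key)


def aggregate_by_type(records):
    """Group record titles under their normalized type."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_type(rec["type"])].append(rec["title"])
    return dict(groups)


# Made-up harvested records from repositories with different vocabularies.
harvested = [
    {"title": "Paper from university A", "type": "Scholarly article"},
    {"title": "Paper from university B", "type": "scholarly journal article"},
    {"title": "A doctoral thesis", "type": "Thesis"},
]

print(aggregate_by_type(harvested))
```

Because the records already sit in one place after harvesting, this kind of clean-up happens once, at the aggregator, rather than at query time against every server.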