Dataset Integration Hack

The problem
How can several distributed, Open Access and openly licensed datasets be integrated so that they can be served via a metadata portal from a single web service?

The datasets: Open Access Classical Data

Platform
OAI-PMH server and DC metadata. (JN, MR, JMV: more info please?)

An OAI-PMH server is a natural fit when metadata records need to be served to clients. It lets clients request those records selectively in a variety of ways (importantly including 'records modified since date'), and a server can translate records automatically from one format to another.
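
As a rough illustration of the 'records modified since date' case, the sketch below builds an OAI-PMH ListRecords request URL. The verb and the metadataPrefix, from and set arguments come from the OAI-PMH protocol; the function name and the example.org base URL are assumptions made for this example only.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # 'from' restricts the response to records changed on or after this date.
        params["from"] = from_date
    if set_spec is not None:
        # 'set' restricts the response to one named group of records.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint: substitute the base URL of a real server.
    print(build_list_records_url("http://example.org/oai",
                                 from_date="2010-11-01"))
```

Changing metadata_prefix is how a client would ask for the same records in another format, provided the server offers it.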

If there are metadata records out in the wild, the harvester can fetch those and merge them into its set of records. This is particularly useful if any of the datasets change over time, since only the modified records need to be fetched and updated. If those records aren't being exposed via OAI-PMH, or don't change, then a custom harvester/converter is needed.
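
The merge step can be sketched as follows, assuming the harvester keeps a simple in-memory mapping from OAI identifier to datestamp. The function name, the store layout and the sample identifiers are hypothetical; only the XML namespace and the header structure (identifier, datestamp, status="deleted") are taken from OAI-PMH.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up response fragment: one updated record and one deleted record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:example.org:record-1</identifier>
      <datestamp>2010-11-02</datestamp>
    </header></record>
    <record><header status="deleted">
      <identifier>oai:example.org:record-2</identifier>
      <datestamp>2010-11-03</datestamp>
    </header></record>
  </ListRecords>
</OAI-PMH>"""


def merge_harvest(store, response_xml):
    """Merge the records of one ListRecords response into `store`.

    `store` maps OAI identifier -> datestamp; a real harvester would
    also keep the metadata payload of each record.
    """
    root = ET.fromstring(response_xml)
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            # The provider reports the record as withdrawn.
            store.pop(identifier, None)
        else:
            store[identifier] = datestamp
    return store


if __name__ == "__main__":
    local = {"oai:example.org:record-1": "2010-10-01",
             "oai:example.org:record-2": "2010-10-01"}
    print(merge_harvest(local, SAMPLE_RESPONSE))
```

Because the request can be limited to records changed since the last harvest, only the modified entries pass through this loop.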

Dublin Core metadata is not the only metadata format that an OAI-PMH server can serve. It is worth spending some time to determine what information is required and use or create an appropriate schema; if it isn't Dublin Core, then a fallback DC can be provided as well.
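
One possible way to provide the fallback DC is a simple crosswalk from the richer schema to Dublin Core elements. In the sketch below the local field names (name, editor, findspot, licence, url) are invented for illustration and are not an agreed schema; the two namespace URIs are the standard oai_dc and Dublin Core ones. A record served by a real server would also carry the xsi:schemaLocation attribute, omitted here for brevity.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical crosswalk: field of a richer local schema -> DC element.
CROSSWALK = {
    "name": "title",
    "editor": "creator",
    "findspot": "coverage",
    "licence": "rights",
    "url": "identifier",
}


def to_oai_dc(record):
    """Serialise the mappable fields of `record` as an oai_dc XML string."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field, dc_name in CROSSWALK.items():
        if field in record:
            child = ET.SubElement(root, f"{{{DC}}}{dc_name}")
            child.text = record[field]
    # Fields without a crosswalk entry are dropped from the fallback.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_oai_dc({"name": "Sample record",
                     "licence": "Example open licence",
                     "date_range": "1-100"}))
```

Anything the crosswalk cannot express is lost in the DC version, which is the reason to keep the richer format as the primary one.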

JOAI is a Java implementation of an OAI-PMH data provider and harvester. It is easy to set up and experiment with, and is also suitable for production use.

Extraction
Metadata will be extracted on a case-by-case basis from the source data, with additional global parameters provided from local knowledge as required. Ideally, and eventually, individual datasets would provide their own OAI service to expose this metadata. (We may try to illustrate this with IAph and IRT at some point.)
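
A per-dataset extraction script might look roughly like the sketch below. It assumes, purely for illustration, that a source file is TEI XML with a title in its header; other datasets will need their own extraction logic. The global parameter values and the function name are placeholders, not agreed settings.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical global parameters supplied from local knowledge.
GLOBAL_DEFAULTS = {
    "publisher": "Example Project",
    "rights": "Example open licence",
}

# Made-up source file used only to exercise the function.
SAMPLE_SOURCE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample   inscription</title>
  </titleStmt></fileDesc></teiHeader>
</TEI>"""


def extract_metadata(source_xml, defaults=GLOBAL_DEFAULTS):
    """Return a flat metadata dict for one source file."""
    root = ET.fromstring(source_xml)
    # Start from the global parameters, then add what the source provides.
    metadata = dict(defaults)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    if title:
        # Collapse runs of whitespace left over from the XML layout.
        metadata["title"] = " ".join(title.split())
    return metadata


if __name__ == "__main__":
    print(extract_metadata(SAMPLE_SOURCE))
```

Keeping the global parameters in one place means they can be corrected once and the script re-run over the whole dataset.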

Harvesting
Each dataset will, in effect, be turned into a data provider by exposing its extracted metadata in accordance with OAI-PMH.

Schema
Dublin Core, served via OAI-PMH

What's next?

 * Set up an OAI-PMH server.
 * Create sample metadata for each dataset (ideally by writing scripts, so that the process is reproducible).
 * Discuss the viability of CKAN for our purposes.
 * Provide a description of how we generate the agreed metadata for each dataset.
 * Next meeting will be on 17/11/2010 1pm-2pm (CCH, seminar room).