Dataset Integration Hack

The problem

How to integrate several distributed but Open Access and Open Licensed datasets so that they can be served via a metadata portal from a single web service.

The datasets: Open Access Classical Data

Platform

OAI-PMH server and DC metadata. (JN, MR, JMV: more info please?)

An OAI-PMH server is the natural solution when metadata records need to be served to clients. It allows clients to select those records in a variety of ways (importantly including 'records modified since date') and, depending on the server, to have records translated automatically from one format to another.

If there are metadata records out in the wild, the harvester can fetch those and merge them into its set of records. This is particularly useful if any of the datasets change over time, since only the modified records need to be fetched and updated. If those records aren't being exposed via OAI-PMH, or don't change, then a custom harvester/converter is needed.
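As a rough illustration of fetching only the modified records, the sketch below builds an OAI-PMH ListRecords request with a 'from' date and reads the record headers out of the response. The base URL, the function names and the returned dictionary layout are assumptions made for this example; only the verb, the argument names and the XML namespace come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL (base_url is a hypothetical endpoint)."""
    if resumption_token:
        # When continuing a partial list, resumptionToken is the only
        # argument sent besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # 'from' restricts the harvest to records changed since that date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record headers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    records = []
    for header in root.iter(OAI_NS + "header"):
        records.append({
            "identifier": header.findtext(OAI_NS + "identifier"),
            "datestamp": header.findtext(OAI_NS + "datestamp"),
            # Deleted records are flagged with status="deleted" on the header.
            "deleted": header.get("status") == "deleted",
        })
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return records, (token or None)
```

A harvester would call list_records_url with the date of its last successful run, fetch the URL, and keep requesting with the returned resumption token until none is given.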

Dublin Core metadata is not the only metadata format that an OAI-PMH server can serve. It is worth spending some time to determine what information is required and to use or create an appropriate schema; if it isn't Dublin Core, then a fallback DC record can be provided as well (the protocol requires every repository to offer unqualified Dublin Core, oai_dc, in any case).

JOAI is a Java implementation of an OAI-PMH data provider and harvester that is easy to set up and experiment with, and is also suitable for production use.

Metadata

Extraction

Metadata will be extracted on a case-by-case basis from the source data, with additional global parameters provided from local knowledge as required. Ideally, and eventually, individual datasets would provide their own OAI service to expose this metadata. (We may try to illustrate this with IAph and IRT at some point.)
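One possible way to combine per-record extraction with the global parameters is sketched below: dataset-level defaults supplied from local knowledge are filled in first, and any value actually extracted from the source data overrides them. The function name, the field names and the sample values are hypothetical and only serve the illustration.

```python
def build_record(extracted, dataset_defaults):
    """Merge extracted fields over dataset-wide defaults.

    extracted        -- fields pulled from one source item (may have gaps)
    dataset_defaults -- global parameters supplied from local knowledge
    """
    record = dict(dataset_defaults)  # copy, so the defaults stay untouched
    for field, value in extracted.items():
        # Empty extracted values do not overwrite a known default.
        if value not in (None, "", []):
            record[field] = value
    return record


# Hypothetical defaults for one dataset and one partially extracted item.
defaults = {"publisher": "Example Project", "rights": "CC BY", "language": "en"}
item = {"title": "Inscription 1", "language": "", "rights": "CC0"}
merged = build_record(item, defaults)
```

Keeping the defaults in one place per dataset means the extraction script for each dataset only has to handle the fields that genuinely vary from record to record.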

Harvesting

Each dataset will, in essence, be turned into a data provider by exposing its extracted metadata in accordance with OAI-PMH.

Schema

OAI-PMH in Dublin Core


Tag	How do we generate it?
dc:title	title of resource
dc:creator	harvest (or known?)
dc:subject	??
dc:description	if any free prose
dc:publisher	harvest
dc:contributor	harvest if given
dc:date	harvest
dc:type	photograph | commentary | database | linked data | other
dc:format	filetypes?
dc:identifier	URI and/or URL?
dc:source	??
dc:language	= modern language
dc:relation	??
dc:coverage	??
dc:rights	= license (in spreadsheet)
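To make the table concrete, here is a minimal sketch of serialising one record as an oai_dc XML fragment. The two namespace URIs are the standard ones for oai_dc and the Dublin Core element set; the function name, the input dictionary layout and the sample values are assumptions for this example, and elements still marked '??' above are simply omitted when no value is supplied.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen unqualified Dublin Core elements, in the order of the table.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Use the conventional prefixes in the serialised output.
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def to_oai_dc(record):
    """Serialise a dict of DC fields (string or list of strings) to XML."""
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    for name in DC_ELEMENTS:
        values = record.get(name)
        if values is None:
            continue  # undecided or missing fields are left out
        if isinstance(values, str):
            values = [values]  # DC elements are repeatable
        for value in values:
            child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample record.
sample_xml = to_oai_dc({
    "title": "Example dataset",
    "type": "database",
    "format": ["text/xml", "image/jpeg"],
    "rights": "CC BY",
})
```

A script of this kind per dataset would also serve the goal of making the sample metadata reproducible.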

What's next?

Set up an OAI-PMH server.
Create sample metadata for each dataset (ideally by writing scripts, so that the process is reproducible).
Discuss the viability of CKAN for our purposes.
Provide a description of how we generate the metadata we agreed on for each dataset.
Next meeting will be on 17/11/2010 1pm-2pm (CCH, seminar room).
