Wednesday 13 July 2011

Project update and following in our footsteps

As the COMET project comes to a close, we are working through the final piece of ownership analysis to identify more data for RDF conversion and publication.

We've loaded sample records with FAST and VIAF and are in discussion with OCLC about the best way to model them.

In the interim, we've been asked to blog briefly about helping others to 'follow in our footsteps'. We ourselves were very much following the work done by the Open Bibliography project, even if we had a slightly different focus and toolset. There was a reason for this: one of the aims of COMET, at least in my mind, was to see how easy it would be for an average library systems team to attempt the impressive work seen on projects such as Open Bibliography, work done by those who already had considerable experience of linked data and open licensing.

Here are a few tips based on our experiences.

1) Be aware of your licensing. Whilst there is no good reason not to share data, some vendors have explicitly prohibited it. We hope to have a better summary of our work examining our contracts up soon, but the main thing to look for is explicit contractual agreements from vendors that prohibit re-sharing.

Otherwise, you then have to choose an appropriate license. We've ended up 'chunking' our data so that whatever can be placed in the public domain under PDDL will be; the rest would require some form of attribution license. (There is a rough sketch of this kind of chunking at the end of this tip.)

Thankfully, few other libraries should have collections of data as complex as Cambridge's, with most relying on one or two vendors.
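
To illustrate the kind of 'chunking' mentioned above, here is a minimal sketch that splits a file of MARC21 records into separate output files according to the cataloguing source in field 040 $a. The field choice, file names and agency codes are hypothetical; the actual rules have to come from your own contracts.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;
use MARC::File::USMARC;

# Hypothetical sketch: split MARC21 records into license 'chunks' based
# on the cataloguing source in 040 $a. The field choice and agency codes
# are illustrative only -- base your own rules on your own contracts.
my $batch  = MARC::Batch->new('USMARC', 'records.mrc');
my $pddl   = MARC::File::USMARC->out('pddl.mrc');         # public domain chunk
my $attrib = MARC::File::USMARC->out('attribution.mrc');  # attribution-only chunk

# Hypothetical list of cataloguing agencies whose records can be PDDL
my %open_sources = map { $_ => 1 } qw(UkCU DLC);

while (my $record = $batch->next()) {
    my $f040   = $record->field('040');
    my $source = $f040 ? ($f040->subfield('a') || '') : '';
    if ($open_sources{$source}) {
        $pddl->write($record);
    } else {
        $attrib->write($record);
    }
}
$pddl->close();
$attrib->close();
```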

2) Think about the backend and issues of scaling before you start. We approached COMET with an exploratory hat on: the world of triplestores and SPARQL was new to us, and we were not sure how much data we would be able to publish. The ARC2 datastore we eventually chose was great to develop with, but ultimately unable to adequately store our entire data output. For libraries with smaller datasets (under half a million records, or around 16 million triples), it's well worth a look. (At least we are in good company here; I've noticed that the DBpedia backend does not provide access to everything...)
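
Before committing to a particular store, a quick back-of-envelope estimate of triple volume is worthwhile. The sketch below simply counts records in a MARC file and multiplies by an assumed average of 30 triples per record (roughly in line with the figures above); the real average should be measured from a sample run of your own converter.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

# Back-of-envelope scaling check: count records and project triple volume.
# The triples-per-record average is an assumption -- measure it from a
# sample conversion of your own data.
my $triples_per_record = 30;
my $batch = MARC::Batch->new('USMARC', 'records.mrc');

my $count = 0;
$count++ while $batch->next();

printf "%d records => roughly %d triples\n",
    $count, $count * $triples_per_record;
```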

3) Take a look at our tools. We have a Perl MARC21 to RDF generation utility ready to go. We chose Perl as it is often used by systems librarians to 'munge' and export data. Our mapping is customisable, and the baseline triples it produces are easy to load. We've based a lot of the final output on work done by the British Library in modelling the British National Bibliography.
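
The utility does rather more than this, but the fragment below gives a flavour of the basic approach: read MARC21 with MARC::Batch/MARC::Record and emit simple N-Triples for the title and subject headings. The URI pattern and the Dublin Core properties are illustrative rather than a description of our actual mapping.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

# Minimal illustration of MARC21 -> N-Triples conversion. The URI pattern
# and Dublin Core property choices are illustrative only, not the exact
# mapping used by the COMET utility.
sub nt_escape {
    my $s = shift;
    $s =~ s/\\/\\\\/g;
    $s =~ s/"/\\"/g;
    return $s;
}

my $batch = MARC::Batch->new('USMARC', 'records.mrc');
while (my $record = $batch->next()) {
    my $f001 = $record->field('001') or next;          # control number
    my $uri  = '<http://example.org/resource/' . $f001->data() . '>';

    # Title (245) as a dcterms:title literal
    if (my $title = $record->title()) {
        printf "%s <http://purl.org/dc/terms/title> \"%s\" .\n",
            $uri, nt_escape($title);
    }

    # Topical subject headings (650 $a) as dcterms:subject literals
    for my $field ($record->field('650')) {
        my $subject = $field->subfield('a') or next;
        printf "%s <http://purl.org/dc/terms/subject> \"%s\" .\n",
            $uri, nt_escape($subject);
    }
}
```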

4) RDF vocabulary modelling is itself something of a burden: you can give it a lot of thought and concern, try numerous different schemas, and still not be sure of the usefulness of your output. Our advice is to focus on useful elements such as subject entries and identifiers. Be careful with the structure; too many links and nodes can lead to data that is 'linked to death'.
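
As a purely illustrative example of what 'focus on useful elements' can look like, a subject entry can be expressed as a direct link from the record to an established identifier (a FAST URI, say) plus a plain literal, rather than through chains of intermediate nodes. The URIs below are made up.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

# Both URIs are illustrative; the FAST identifier is not a real lookup.
<http://example.org/resource/12345>
    dcterms:subject <http://id.worldcat.org/fast/0000000> ;
    dcterms:subject "Economic history" .
```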

Don't expect to get it right first time.
