Thursday 28 July 2011

Where exactly DOES a record come from?

Early on in the COMET project, Hugh Taylor assembled a complex document attempting to describe the problems inherent in understanding the origin of a Marc-encoded bibliographic record. It also included a thorough analysis of Cambridge University Library data and the number and nature of the vendor codes contained therein.

We've updated this document with information on the various contracts and agreements associated with each vendor code to reflect the final work. The next problem was how to make sense of it all.

A major sticking point is related to Marc21 and its usage.

In our Marc data, we have four separate fields (015, 038, 994, 035) that could indicate ownership, some of which may be repeated multiple times in a record. There is, to my knowledge, no mechanism in Marc21 or AACR2 to indicate which field, and thus which vendor code, takes precedence over the others (although cataloguers have some 'community knowledge' in this area).

Furthermore, many vendors have changed both the code and the field they use over time. Most codes rely on prefixes; some are simply unhelpful strings of numbers.
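As a rough illustration of the prefix problem, here is a minimal Python sketch (not the project's Perl script) that splits a control number into a source code and a number. It assumes the common convention of a parenthesised prefix, as in "(OCoLC)12345678"; the function name and return shape are made up for this example, and values without a prefix simply come back with no code.

```python
import re

# Assumed convention: the source code sits in parentheses at the start of the
# value, followed by that source's own identifier, e.g. "(OCoLC)12345678".
PREFIX_PATTERN = re.compile(r"^\((?P<code>[^)]+)\)\s*(?P<number>.+)$")


def split_control_number(value):
    """Return (code, number); code is None when no prefix is present."""
    cleaned = value.strip()
    match = PREFIX_PATTERN.match(cleaned)
    if match is None:
        # A bare string of digits tells us nothing about where it came from.
        return None, cleaned
    return match.group("code"), match.group("number")
```

A value with no recognisable prefix is exactly the "unhelpful string of numbers" case: the number survives, but the origin has to be worked out some other way.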

In terms of practicalities, we need to ensure that:

1) Records from vendors who explicitly and contractually prohibit re-sharing in any format are excluded; this includes most ebook and ejournal records. Otherwise, there is no good reason not to share a record, although its origin may have an impact on license choice

2) In our case, records from OCLC are segmented due to a need to publish data from that vendor under an attribution license

3) Data from vendors who prefer non-Marc output to be shared openly, but want Marc21 output restricted, is segmented (RLUK and the BNB in our case)

4) Data produced in house (usually that with no identifier) can be segmented for clarification

5) Everything else from smaller or specialist record vendors is segmented together with a view to publishing openly

We've had to make some decisions over which field and vendor takes precedence based largely on this order of importance. To do this, we came up with a rough decision tree regarding record ownership:



The above JPG is also available as a scalable vector graphics file created in MS Visio.
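For readers who prefer code to charts, the order of importance in the numbered list above could be sketched in Python roughly as follows. This is not the released Perl script and does not reproduce the chart branch by branch; it only follows the list. The vendor code sets and segment names are placeholders invented for illustration, and the function assumes the vendor codes have already been pulled out of the record.

```python
# Placeholder vendor codes, for illustration only; real codes vary by vendor
# and by field, and would come from the contracts analysis.
NO_RESHARE = {"EBOOKVENDOR", "EJOURNALVENDOR"}
ATTRIBUTION = {"OCoLC"}
MARC_RESTRICTED = {"RLUK", "BNB"}


def segment_record(vendor_codes):
    """Pick one segment for a record, checking the strictest case first."""
    codes = set(vendor_codes)
    if codes & NO_RESHARE:
        return "excluded"            # 1) re-sharing contractually prohibited
    if codes & ATTRIBUTION:
        return "attribution"         # 2) must be published with attribution
    if codes & MARC_RESTRICTED:
        return "open_non_marc_only"  # 3) open, but not as Marc21
    if not codes:
        return "in_house"            # 4) no identifier, assumed in house
    return "open_other"              # 5) everything else
```

Checking the most restrictive condition first means a record carrying several vendor codes always lands in the strictest segment that applies, which is the cautious choice when precedence is otherwise undefined.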


One of my final tasks on COMET was to take this decision tree and turn it into a script to export record data for our final exercises in data publishing. I've also released a Perl script as output for COMET on our code page. (A warning and an apology: this script is as ugly as the situation it attempts to resolve. It was pulled together at the last minute and could really do with a rethink.)

Both the script and the chart reflect Cambridge's specific and current situation, but should hopefully be useful for those wishing to replicate this activity.

As a personal opinion, I see this confusion regarding ownership as a key barrier that prevents libraries from openly sharing their data.

Furthermore, it is important that we do NOT see a repeat of this problem with the next set of record container and delivery standards.

It's my worry that stacking attribution statements in records at the bibliographic level could lead to similar problems down the road. Attribution at a data-set level, with some indication of the relationship between a record and a data-set, seems more practical.
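To make the data-set idea a little more concrete, here is a minimal Python sketch of what that relationship might look like. All of the class names, field names and values are hypothetical; the point is only that a record carries a single link to its data-set, and the license and attribution statement live in one place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    license: str
    attribution: str


@dataclass(frozen=True)
class Record:
    record_id: str
    dataset: Dataset  # the only provenance link the record itself carries


def attribution_for(record):
    """Look the statement up on the data-set rather than in the record."""
    return f"{record.dataset.attribution} ({record.dataset.license})"


# Hypothetical example values.
example_set = Dataset("example-segment", "example-license", "Example Vendor")
example_record = Record("rec-0001", example_set)
```

With this shape, changing a license or an attribution statement means editing one data-set description rather than touching every record in it.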

A standardization of practice across the library community with regard to licensing could help ease this pain in the future.

Because we always need more standards.
