Sep 16 2009

Being Structured and Having Semantics is not Enough

Published by Ian Davis at 9:02 pm under Random Stuff

Jonathan Rochkind describes a very typical decision sequence when working with MARC data:

Frequently the answer to “How do I get this piece of data I want” is along the lines of: “Well, it’ll be in this field, UNLESS this other field is X, in which case it’ll be in field Y, UNLESS field Y is being used for Z to try to and figure out if Z look at fixed fields a, b, and c, the different combinations of all three of which determine that, but there’s no guarantee they’re filled out correct. Oh, and that’s assuming it’s a post-1972 record, in older records they did things entirely differently and put the data over in field N. Oh, and ALL of that is assuming this is AACR2 data, the corpus also includes Rare Books and Manuscripts data, and those guys do things entirely differently, although it’s still in MARC, you’ve got to look in this OTHER field for it. First check fixed field q to see if it’s RBM data, and hope fixed field q is right. Oh, and don’t forget to check if it’s encoded in UTF-8 or MARC-8 by checking this other fixed field, which we know is wrong most of the time.”

For non-librarians, MARC is a structured data format whose fields have well-defined and rather precise semantics. The missing pieces are that the data structure is not self-describing and there is no strategy for discovering the rules (apart from a human reading the documentation and then encoding the rules in code).
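To make "encoding the rules in code" concrete, here is a minimal Python sketch of the kind of cascade the quote describes. Everything in it is an assumption made for illustration: the record is a plain dict, and keys such as `fixed_q`, `field_n`, `field_y`, `other_flag` and `main_field` are made-up placeholders echoing the quote, not real MARC tags. The 1972 cutoff test and the rule for "Y is being used for Z" are likewise guesses, since the quote doesn't spell them out.

```python
from typing import Optional

# A record is modelled as a plain dict. Every key used below is a
# hypothetical placeholder taken from the quote, not a real MARC tag.
Record = dict


def is_rare_books(record: Record) -> bool:
    # "First check fixed field q to see if it's RBM data, and hope it's right."
    return record.get("fixed_q") == "rbm"


def field_y_used_for_z(record: Record) -> bool:
    # Assumed rule: Y is being used for Z when fixed fields a, b and c all say so.
    return all(record.get(k) == "z" for k in ("fixed_a", "fixed_b", "fixed_c"))


def find_value(record: Record) -> Optional[str]:
    """Return the wanted value, or None if the rules can't locate it."""
    if is_rare_books(record):
        # Rare Books and Manuscripts data lives in a different field.
        return record.get("other_field")

    year = record.get("year")
    if year is not None and year < 1972:
        # Older records put the data in field N (cutoff assumed here).
        return record.get("field_n")

    if record.get("other_flag") == "X":
        if field_y_used_for_z(record):
            # The quote doesn't say where the data goes in this case.
            return None
        return record.get("field_y")

    # The default case: "it'll be in this field".
    return record.get("main_field")
```

Nothing in the record itself tells a program that any of these branches exist, or in what order to try them. That knowledge lives only in the function, which is exactly the problem.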


2 Responses to “Being Structured and Having Semantics is not Enough”

  1. [...] This post was mentioned on Twitter by LeeFeigenbaum, infopeep and semanticaweb. semanticaweb said: Being Structured and Having Semantics is not Enough: Jonathan Rochkind describes a very typical decision sequenc.. http://bit.ly/3wyb1Z [...]

  2. Jonathan Rochkind on 16 Sep 2009 at 11:25 pm

    Thanks for the quote. The point I was trying to get across is that it’s not JUST the lack of documentation here that’s a problem though, it’s the baroqueness of the logic for figuring out what’s what.

    Although the two aren’t unrelated, I guess. The baroqueness arises in part because there aren’t any clearly written rules in one place; they evolve piecemeal over time, one hacky part at a time.

    I’m not even sure it’s accurate to say that, even for librarians, MARC fields have well-defined and precise semantics that are widely understood and agreed upon.