Being Structured and Having Semantics is not Enough

Jonathan Rochkind describes a very typical decision sequence when working with MARC data:

Frequently the answer to "How do I get this piece of data I want" is along the lines of: "Well, it'll be in this field, UNLESS this other field is X, in which case it'll be in field Y, UNLESS field Y is being used for Z; to try and figure out if Z, look at fixed fields a, b, and c, the different combinations of all three of which determine that, but there's no guarantee they're filled out correctly. Oh, and that's assuming it's a post-1972 record; in older records they did things entirely differently and put the data over in field N. Oh, and ALL of that is assuming this is AACR2 data; the corpus also includes Rare Books and Manuscripts data, and those guys do things entirely differently, although it's still in MARC, you've got to look in this OTHER field for it. First check fixed field q to see if it's RBM data, and hope fixed field q is right. Oh, and don't forget to check if it's encoded in UTF-8 or MARC-8 by checking this other fixed field, which we know is wrong most of the time."

For non-librarians: MARC is a structured data format whose fields have well-defined and rather precise semantics. The missing pieces are that the data is not self-describing and that there is no machine-readable way to discover the rules (apart from a human reading the documentation and then encoding the rules by hand).
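In code, a rule cascade like the one Rochkind describes ends up looking something like the sketch below. The field names and conditions are invented for illustration (they are not real MARC semantics); the point is that every branch encodes a rule that lives only in external documentation, not in the data itself.

```python
def extract_date(record):
    """Return the publication date from a toy MARC-like record dict.

    The lookup rules below are hypothetical, mimicking the shape of the
    cascade in the quote above: which field holds the datum depends on
    other fields, whose reliability is itself uncertain.
    """
    # Rule 1: pre-1972 records put the data somewhere else entirely.
    if record.get("era") == "pre-1972":
        return record.get("field_N")
    # Rule 2: field Y may be repurposed; a fixed field tells us (maybe).
    if record.get("field_X") == "special":
        if record.get("fixed_q") == "RBM":  # and hope fixed_q is right
            return record.get("other_field")
        return record.get("field_Y")
    # Default case.
    return record.get("date")

record = {"field_X": "special", "field_Y": "1999", "fixed_q": "AACR2"}
print(extract_date(record))  # -> 1999
```

None of these branches can be derived from the record itself; each one has to be hand-written from documentation, which is exactly the problem the post is pointing at.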

