
MARC Transliteration


22 December 2005 by Ian Davis

I got some great suggestions from my earlier posting, which really shows the value of writing about stuff earlier rather than later.
One comment in particular, from Harry Chen, got me started on a particular track:

Here is what I typically do. I first map the syntactic format of the data into RDF, and then build external inference rules to produce a more expressive semantic model of the original data.

This seemed like good advice since it gets the data into the realm of declarative rules very quickly. The alternative is to build lots of transformation logic in code which, while efficient, limits the audience and reach of the technique. So I thought I’d try to transliterate a MARC record to some RDF representation of its structure. The trick is to get the RDF description to be expressive enough to be useful to a rules engine. I also wanted a representation that would allow round-tripping of MARC records.

My starting point was a MARC record from the British Library. I can’t include it here because it contains control characters for separating records, fields and subfields, namely ASCII 0x1d, 0x1e and 0x1f. However, here’s a common human-readable representation:

=LDR  00651nam a2200193u  4500
=001  004148830
=003  Uk
=005  20040502102300.0
=008  040422m19539999|||\||0||eng|
=015  $a2672340153
=019  s$aoagh06952
=040  $aUk$cUk
=082  04$a942.31
=245  02$aA History of Wiltshire. Edited by R. B. Pugh and E. Crittall. [With maps and illustrations.].
=260  $bOxford University Press for the Institute of Historical Research: London, 1957, 53- . fol.
=440
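
To make the role of those three control characters concrete, here is a minimal Python sketch of splitting raw MARC data on them. It is not the code used for this post: the function names are made up for illustration, the two blank indicators in the example are an assumption, and decoding the subfield text as UTF-8 is also an assumption (older records are often in other encodings).

```python
RECORD_END = b"\x1d"   # record terminator
FIELD_END = b"\x1e"    # field terminator
SUBFIELD = b"\x1f"     # subfield delimiter


def split_records(stream: bytes) -> list[bytes]:
    """Split a byte stream into raw records on the record terminator."""
    return [r for r in stream.split(RECORD_END) if r]


def parse_data_field(raw: bytes) -> tuple[str, list[tuple[str, str]]]:
    """Split one data field into its indicators and (code, value) subfields.

    `raw` is the field content without its trailing 0x1e. The text before
    the first 0x1f is the indicators; each later chunk starts with a
    one-character subfield code. Hypothetical helper, UTF-8 assumed.
    """
    indicators, *chunks = raw.rstrip(FIELD_END).split(SUBFIELD)
    subfields = [
        (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
        for chunk in chunks
        if chunk
    ]
    return indicators.decode("ascii"), subfields


# The 040 field from the record above, assuming two blank indicators.
field_040 = b"  \x1faUk\x1fcUk\x1e"
indicators, subfields = parse_data_field(field_040)
```

Because the subfields come back as an ordered list, the original order can be written out again, which is what round-tripping needs.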

9 thoughts on “MARC Transliteration”

  1. Bruce says:

    Dumb question Ian: when you write “The order of all fields is preserved”, is that a product of using the Collection parseType?

  2. Andrew Houghton says:

    So basically it reinvents MARC-XML in RDF. I’m not an RDF expert, but some points about the implementation:

    1) it needs to use rdf:parseType instead of parseType; the RDF validator throws warnings about deprecated usage.
    2) it’s unclear why it uses md:value; can someone give a reason why rdf:value would not be appropriate in this context?
    3) it’s unclear why there are two additional levels, e.g., md:data and md:RecordData; can someone give a reason why they exist?

    Andy.

  3. James Brunskilll says:

    I’m not quite sure what the advantage to using RDF is, but I thought you might like to know that there is already a MARC XML format: http://www.loc.gov/marc/marcxml.html and software that can convert MARC into MARCXML: http://oregonstate.edu/~reeset/marcedit/html/

  4. iand says:

    Andrew, yes it’s functionally the same as MARCXML since they’re both alternate representations of MARC. This post is part of a series detailing investigations of how to build a semantic representation of MARC grounded in the Web. The previous posting gives more background on the approach. To answer your questions specifically:

    1) ooops. My mistake, and I will fix it.
    2) rdf:value has almost no semantics, whereas md:value allows me to assign the semantics I require, such as cardinality.
    3) this is an artifact of the RDF model. Essentially I am modelling a MARC record (marc:Record) as a separate entity. The MARC data is modelled as a property of that record (md:data). I can assign other properties to the same record such as dc:rights or foaf:maker. I might decide to have several md:data properties pointing to md:RecordData entities from different catalogers.

    One thing I’m almost sure will happen is a separation between the description of the catalog record and the description of the item being cataloged. A property of marc:Record will be the likely place to associate the two. The original MARC record contains data about the process of cataloging, the catalog record and the item being cataloged. I’d like to see those things separated in the RDF expression.

  5. iand says:

    Bruce, yes the rdf:parseType="Collection" is a shorthand for writing the long-winded RDF collection syntax. It guarantees the order of the items and terminates the list so you know there can be no other members from other sources.

    The longhand way of writing one of those collections would be:

      <md:subfields>
        <rdf:Description>
          <rdf:first>
            <md:SubField md:code="a" md:value="Uk" />
          </rdf:first>
          <rdf:rest>
            <rdf:Description>
              <rdf:first>
                <md:SubField md:code="c" md:value="Uk" />
              </rdf:first>
              <rdf:rest rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#nil" />
            </rdf:Description>
          </rdf:rest>
        </rdf:Description>
      </md:subfields>
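
The first/rest/nil structure of an RDF collection behaves like a linked list, which is why order is preserved and the list is closed. As a rough Python sketch, using plain dictionaries as stand-ins for list nodes rather than a real RDF library (the helper names `make_list` and `walk_list` are invented for this illustration):

```python
RDF_NIL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#nil"


def make_list(items):
    """Build a first/rest chain ending in rdf:nil (dicts stand in for nodes)."""
    node = RDF_NIL
    for item in reversed(items):
        node = {"first": item, "rest": node}
    return node


def walk_list(node):
    """Follow rest links until rdf:nil, collecting the first values in order."""
    items = []
    while node != RDF_NIL:
        items.append(node["first"])
        node = node["rest"]
    return items


# The two subfields of the 040 field, as (code, value) pairs.
subfields = make_list([("a", "Uk"), ("c", "Uk")])
ordered = walk_list(subfields)
```

Reaching rdf:nil is what tells a consumer that the list is complete, so no further members can be added from another source.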

  6. Andrew Houghton says:

    Some additional comments on the modeling.

    The MARC leader and control fields can contain multiple sequential blanks that are significant. The existing modeling for the control fields uses XML attributes and, according to the XML specification, multiple sequential spaces in attributes are to be compressed to one space by XML parsers. Even the xml:space attribute cannot override this behaviour since it applies only to element content. This is one of the reasons why the MARC-XML schema models the leader, control fields and subfields as elements with content rather than attributes. Note that in your example the 008 field contains a backslash (\) where spaces should appear.

    While MARC does break down many metadata elements into subfields, some subfields are coded, e.g., subfield w of the 4XX/5XX authority format. Other subfields contain encoded information where each piece should probably be modeled as a separate metadata property, e.g., the 300 subfield a of the bibliographic format: “7 sound discs (ca. 8 hr.)”. The “7 sound discs” and “(ca. 8 hr.)” represent two different metadata properties about the physical description that are combined into one subfield: the former is the number of sound discs that comprise the item and the latter is the approximate number of listening hours across those 7 sound discs.
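
As an illustration of the 300 subfield a example, the two pieces could be pulled apart with a regular expression. This is only a sketch for values shaped like the one quoted; the property names `extent` and `duration` and the function `split_physical_description` are invented here, and real 300$a values are far more varied than this pattern handles.

```python
import re

# Matches "<extent> (<parenthetical>)", e.g. "7 sound discs (ca. 8 hr.)".
PATTERN = re.compile(r"^\s*(?P<extent>[^()]+?)\s*\((?P<duration>[^()]*)\)\s*$")


def split_physical_description(value):
    """Split a 300$a value into extent and parenthetical duration, if present."""
    match = PATTERN.match(value)
    if match is None:
        # No trailing parenthetical: treat the whole value as the extent.
        return {"extent": value.strip(), "duration": None}
    return {
        "extent": match.group("extent"),
        "duration": match.group("duration").strip(),
    }


parts = split_physical_description("7 sound discs (ca. 8 hr.)")
```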

  7. iand says:

    Actually my reading of the XML specification says that multiple spaces in an attribute’s value are preserved. See http://www.w3.org/TR/REC-xml/#AVNormalize for the details. Spaces in element content are compressed.

    It’s my error leaving the backslashes in the 008 field. I’ll change them to spaces.

  8. Andrew Houghton says:

    I’m not sure how you came to your conclusion on XML attribute normalization after reading the XML specification, but it clearly says in the paragraph after the 3-point normalization algorithm:

    If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

    The word MUST in the XML specification is interpreted as described in IETF RFC 2119, which basically says that it’s an absolute requirement for an XML processor to implement.

    Although in my mind, the XML specification is ambiguous for a normalized attribute value that contains only a sequence of #x20 characters. Should the XML parser return an empty value or one #x20 character? Every XML parser I have tested with returns a single #x20. I suspect that the phrases “leading” or “trailing” in the XML specification imply that there is at least one non-#x20 character, hence the second clause causes the value to be reduced to one #x20 character.

  9. iand says:

    Hi Andrew. If you read on from that section you’ll find a sentence stating: “All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA.”
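
The undeclared-attribute case is easy to check with a non-validating parser. Here is a small Python sketch using the standard library's ElementTree (which sits on top of expat); the element name, attribute name and value are made up for illustration, and no DTD is supplied, so the attribute is treated as CDATA.

```python
import xml.etree.ElementTree as ET


def read_attribute(xml_text, name):
    """Parse a one-element document and return the named attribute value."""
    return ET.fromstring(xml_text).get(name)


# Hypothetical control field with three significant blanks in its value.
doc = '<ControlField value="0404   eng" />'
value = read_attribute(doc, "value")

# Leading and trailing blanks in an undeclared attribute, also for illustration.
padded = read_attribute('<ControlField value="  x  " />', "value")
```

This only covers attributes with no declaration. If a DTD declared the attribute with a type other than CDATA, the extra normalization step quoted above would apply.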
