Transliteration or Interpretation

When creating RDF schemas based on existing data formats you soon hit the inevitable decision point: do you simply produce an RDF description of the syntactic format, or do you interpret it and produce a semantic model of the format? If you stick close to the syntactic form of the original then conversion to RDF is often easy, usually just a simple mapping of tokens in the original format to property URIs in the RDF form. The original vCard in RDF note takes this approach - most vCard terms have simple equivalents in the RDF schema. This works but feels awkward because some of the semantics are being ignored or represented in sub-optimal ways.

Norman Walsh's recent rework of the vCard schema takes a much more interpretive approach. It keeps most of the mapping but introduces new relations such as workAdr to specify the role of an address within an individual vCard. This interpretation makes the resulting data semantically richer and more expressive.

I'm facing a similar decision in my current RDF work. I'm looking at representing MARC records in RDF. MARC originated in the 1970s and is a compact format for exchanging bibliographic and other library data. Obviously, given its age, it's not XML or anything close. Some work on representing it in RDF has been done by a team at DERI, resulting in marcont, an ontology for MARC.

The approach they have taken is more along the lines of transliteration than interpretation. For example, MARC defines a field which, for Book records, consists of a number of data elements defined as character offsets. Character 17 indicates whether the book is a biography and the letter (a, b, c, d, # or |) indicates what kind of biography. The marcont ontology defines a corresponding Book class with an isBiography property with a cardinality of 1 containing the letter (or some representation of it). The ontology mirrors the syntactic structure of the record, but is it expressing the semantics?
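To make the contrast concrete, here's a minimal Python sketch of what a more interpretive mapping of that element might look like. The code meanings follow my reading of the MARC documentation, and the property names (a boolean isBiography plus a biographyType) are hypothetical illustrations, not drawn from marcont.

```python
# Sketch: interpreting the raw biography code letter rather than storing it.
# Code meanings follow the MARC Biography element; property names are invented.

BIOGRAPHY_CODES = {
    " ": None,                               # '#' in MARC notation: not a biography
    "a": "autobiography",
    "b": "individual biography",
    "c": "collective biography",
    "d": "contains biographical information",
    "|": None,                               # no attempt to code
}

def interpret_biography(code):
    """Turn the one-letter code into explicit semantics."""
    kind = BIOGRAPHY_CODES.get(code)
    return {"isBiography": kind is not None, "biographyType": kind}
```

A consumer of the interpreted form can ask "is this a biography?" directly, without knowing the letter codes at all.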

In my current research project I'm exploring whether a more semantically rich representation of MARC would give benefits in terms of queryability via SPARQL. My suspicion is that it would, but getting the rich model is not easy. Here's another example.

Field 008 in MARC contains a pair of dates. Character position 06 gives the semantics of the date data held in positions 07-10 and 11-14. If the character at position 06 is 's' then the first date represents a single known date and the second date area should be blank. However, if the character at position 06 is 'q' then the date is questionable: the first area gives the lower bound and the second gives the upper bound of the range. If the character at position 06 is 'e' then the first date area contains the year of a definite date while the second area encodes the month and day (mmdd)!
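As a rough illustration of the decoding involved, here's a Python sketch covering just the three type codes described above (MARC defines several more, which this deliberately ignores):

```python
# Sketch: decoding the 008 date area for the 's', 'q' and 'e' type codes only.

def decode_dates(field_008):
    date_type = field_008[6]        # position 06
    date1 = field_008[7:11]         # positions 07-10
    date2 = field_008[11:15]        # positions 11-14
    if date_type == "s":            # single known date; date2 is blank
        return {"type": "single", "date": date1}
    if date_type == "q":            # questionable date: bounds of a range
        return {"type": "questionable", "earliest": date1, "latest": date2}
    if date_type == "e":            # detailed date: year, then mmdd
        return {"type": "detailed", "year": date1,
                "month": date2[:2], "day": date2[2:]}
    raise ValueError("unhandled date type: %r" % date_type)
```

Note how much of the meaning lives in the branching, not in the raw character data.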

A straight transliteration of this format might yield a property for dateType and two properties date1 and date2. However, the meaning of date1 and date2 depends on dateType, so any query across the data would have to involve all three properties. A semantic interpretation of the record might instead introduce separate properties for the types of dates, e.g. singleKnownDate, definiteDate or questionableDateRange. These would be easier to query and less prone to accidental misinterpretation (e.g. is '1123' a year or mmdd?).
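The interpretation step itself might look something like this sketch, where the output keys stand in for the richer properties (the names echo the ones suggested above and are purely illustrative):

```python
# Sketch: emitting interpretation-style properties instead of the raw
# dateType/date1/date2 triple. Property names are invented for illustration.

def to_semantic_properties(date_type, date1, date2):
    if date_type == "s":
        return {"singleKnownDate": date1}
    if date_type == "q":
        return {"questionableDateRange": (date1, date2)}
    if date_type == "e":
        # date2 is mmdd, so '1123' here is unambiguously a month and a day
        return {"definiteDate": "%s-%s-%s" % (date1, date2[:2], date2[2:])}
    # fall back to the transliterated form for unhandled type codes
    return {"dateType": date_type, "date1": date1, "date2": date2}
```

A query for records with a single known date then touches one property with one unambiguous meaning, rather than filtering date1 by the value of dateType.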

There's huge value in interpretations but they're costly to analyse and, being somewhat subjective, can provoke much debate. The data also needs to go through a much more detailed conversion process. Transliteration is easier, quicker and cheaper but far less satisfying.

