Archive for December, 2005

Dec 22 2005

MARC Transliteration

Published by Ian Davis under Uncategorized

I got some great suggestions from my earlier posting, which really shows the value of writing about stuff earlier rather than later. One comment in particular from Harry Chen got me started on a particular track:

Here is what I typically do. I first map the syntactic format of the data into RDF, and then build external inference rules to produce a more expressive semantic model of the original data.

This seemed like good advice since it gets the data into the realm of declarative rules very quickly. The alternative is to build lots of transformation logic in code which, while efficient, limits the audience and reach of the technique. So I thought I’d try to transliterate a MARC record to some RDF representation of its structure. The trick is to get the RDF description to be expressive enough to be useful to a rules engine. I also wanted a representation that would allow round-tripping of MARC records.
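To make the idea of a purely structural transliteration concrete, here is a minimal sketch in Python. It is not the representation developed in this post: the `http://example.org/marc-syntax#` namespace, the property names and the blank-node naming scheme are all assumptions made for illustration. It records the position of every field and subfield so that the original ordering could be rebuilt, which is what round-tripping needs.

```python
# Hypothetical namespace and property names, used only for illustration.
EX = "http://example.org/marc-syntax#"


def field_to_triples(record_id, position, tag, indicators, subfields):
    """Describe one MARC variable field as a list of (subject, predicate, object) tuples.

    Positions are kept explicitly so the original field and subfield order
    can be reconstructed from the triples.
    """
    record_node = f"_:{record_id}"
    field_node = f"_:{record_id}_f{position}"
    triples = [
        (record_node, EX + "field", field_node),
        (field_node, EX + "tag", tag),
        (field_node, EX + "position", position),
        (field_node, EX + "indicators", indicators),
    ]
    for i, (code, value) in enumerate(subfields):
        sub_node = f"{field_node}_s{i}"
        triples += [
            (field_node, EX + "subfield", sub_node),
            (sub_node, EX + "code", code),
            (sub_node, EX + "position", i),
            (sub_node, EX + "value", value),
        ]
    return triples


# Example: a field with tag 082, indicators "04" and one subfield $a.
triples = field_to_triples("r1", 6, "082", "04", [("a", "942.31")])
```

Nothing here says what a tag or subfield means; that is left to whatever rules are layered on top.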

My starting point was a MARC record from the British Library. I can’t include it here because it contains control characters for separating fields and subfields, namely ASCII 0x1d, 0x1e and 0x1f. However, here’s a common human-readable representation:

=LDR  00651nam a2200193u  4500
=001 004148830
=003 Uk
=005 20040502102300.0
=008 040422m19539999|||\\||\\00||eng|
=015 \$a2672340153
=019 s$aoagh06952
=040 \$aUk$cUk
=082 04$a942.31
=245 02$aA History of Wiltshire. Edited by R. B. Pugh and E. Crittall. [With maps and illustrations.].
=260 \$bOxford University Press for the Institute of Historical Research: London, 1957, 53- . fol.
=440
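For reference, the control characters mentioned above work as follows in a raw record: 0x1e terminates each field, 0x1f introduces each subfield and is followed by a one-character subfield code, and 0x1d terminates the record. The sketch below is a minimal Python illustration of splitting the data portion of one variable field. The function name and the return shape are assumptions, and it ignores the leader and directory that a real parser would read first.

```python
RECORD_TERMINATOR = "\x1d"   # ends the whole record
FIELD_TERMINATOR = "\x1e"    # ends each field
SUBFIELD_DELIMITER = "\x1f"  # starts each subfield, followed by its code


def split_subfields(field_data):
    """Split one variable data field into (indicators, [(code, value), ...]).

    `field_data` is the raw text of a single field, with or without its
    trailing field terminator.
    """
    head, *chunks = field_data.rstrip(FIELD_TERMINATOR).split(SUBFIELD_DELIMITER)
    subfields = [(chunk[0], chunk[1:]) for chunk in chunks if chunk]
    return head, subfields


# Example: indicators "04" followed by subfield $a.
indicators, subfields = split_subfields("04\x1fa942.31\x1e")
```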


Dec 20 2005

Transliteration or Interpretation

Published by Ian Davis under Uncategorized

When creating RDF schemas based on existing data formats you soon hit the inevitable decision point of whether to simply produce an RDF description of the syntactic format or to interpret it and produce a semantic model of the format. If you stick close to the syntactic form of the original then conversion to RDF is often easy, usually just a simple mapping of tokens in the original format to property URIs in the RDF form. The original vCard in RDF note takes this approach: most vCard terms have simple equivalents in the RDF schema. This works but feels awkward because some of the semantics are being ignored or represented in sub-optimal ways.

Norman Walsh’s recent rework of the vCard schema takes a much more interpretive approach. It keeps most of the mapping but introduces new relations such as workAdr to specify the role of an address within an individual vCard. This interpretation makes the resulting data semantically richer and more expressive.

I’m facing a similar decision with my current RDF work. I’m looking at representing MARC records in RDF. MARC originated in the 1970s and is a compact format for exchanging bibliographic and other library data. Obviously, given its age, it’s not XML or anything close. Some work on representing it in RDF has been done by a team at DERI, resulting in marcont, an ontology for MARC.

The approach they have taken is more along the transliteration lines than interpretation. For example, MARC defines a field which, for Book records, consists of a number of data elements defined as character offsets. Character 17 indicates whether the book is a biography and the letter (a, b, c, d, # or |) indicates what kind of biography. The marcont ontology defines a corresponding Book class with an isBiography property with cardinality of 1 containing the letter (or some representation of it). The ontology is mirroring the syntactic structure of the record, but is it expressing the semantics?
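The difference can be sketched in a few lines of Python. This is an illustration only: the `http://example.org/biblio#` namespace and the class names are made up, and the code meanings follow my reading of the MARC 21 biography element (a = autobiography, b = individual biography, c = collective biography, d = contains biographical information, # = no biographical material, | = no attempt to code). Only `isBiography` is a name taken from the discussion above.

```python
# Hypothetical namespace and class names, used only for illustration.
EX = "http://example.org/biblio#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

BIOGRAPHY_CLASSES = {
    "a": "Autobiography",
    "b": "IndividualBiography",
    "c": "CollectiveBiography",
    "d": "WorkWithBiographicalInformation",
}


def transliterate_biography(book, code):
    """Mirror the record: keep the raw code letter as a property value."""
    return [(book, EX + "isBiography", code)]


def interpret_biography(book, code):
    """Express the meaning: type the book, or say nothing when the code
    means 'not a biography' (#) or 'not coded' (|)."""
    cls = BIOGRAPHY_CLASSES.get(code)
    return [(book, RDF_TYPE, EX + cls)] if cls else []
```

With the first form a query has to know that "b" means an individual biography; with the second it can simply ask for things of that type.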

In my current research project I’m exploring whether a more semantically rich representation of MARC would give benefits in terms of queryability via SPARQL. My suspicion is that it would, but getting the rich model is not easy. Here’s another example.

Field 008 in MARC contains a pair of dates. The character at position 06 gives you the semantics of the date data held in positions 07-10 and 11-14. If the character at position 06 is ‘s’ then the first date represents a single known date and the second date area should be blank. However, if the character at position 06 is ‘q’ then it represents a questionable date: the first area gives the lower bound and the second area gives the upper bound of the range. If the character at position 06 is ‘e’ then the first date area contains the year of a definite date while the second area encodes the month and day (mmdd)!

A straight transliteration of this format might yield something like a property for dateType and two properties date1 and date2. However, the meaning of date1 and date2 depends on dateType, so any queries across the data would have to involve all three properties. A semantic interpretation of the record might introduce separate properties for the types of dates, e.g. singleKnownDate, definiteDate or questionableDateRange. These would be easier to query and less prone to accidents involving misinterpretation of the properties (e.g. is ‘1123’ a year or mmdd?).
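As a rough sketch of what that interpretation step could look like, the Python below reads the three date-related areas of an 008 field and returns one semantically named property for the three date types described above. The property names are illustrative rather than taken from any published schema, the ISO-style formatting of the definite date is an assumption, and any date type not covered here simply falls back to the transliterated form.

```python
def interpret_008_dates(field_008):
    """Turn the date areas of a MARC 008 field into a semantically named property.

    Position 06 holds the date type, 07-10 the first date area and
    11-14 the second date area (positions are zero-based).
    """
    date_type = field_008[6]
    date1 = field_008[7:11]
    date2 = field_008[11:15]

    if date_type == "s":
        # Single known date: only the first area is meaningful.
        return {"singleKnownDate": date1}
    if date_type == "q":
        # Questionable date: lower and upper bound of a range.
        return {"questionableDateRange": (date1, date2)}
    if date_type == "e":
        # Definite date: year in the first area, mmdd in the second.
        return {"definiteDate": f"{date1}-{date2[:2]}-{date2[2:]}"}
    # Anything else: fall back to the plain transliteration.
    return {"dateType": date_type, "date1": date1, "date2": date2}


# Example: '1123' is unambiguously month and day once the type is known.
result = interpret_008_dates("040422e20051123")
```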

There’s huge value in interpretations, but they’re costly to analyse and, being somewhat subjective, can be the subject of much debate. Also, the data needs to go through a much more detailed conversion process. Transliteration is easier, quicker and cheaper but far less satisfying.


Dec 16 2005

Change Policies for Namespaces

Published by Ian Davis under Uncategorized

A good practice from the current namespace state draft from the W3C TAG:

Specifications that define namespaces SHOULD explicitly state their policy with respect to changes in the names defined in that namespace.

I need to implement this in the various vocab.org schemas.


Dec 16 2005

SOAP Destined to A Life of Obscurity

Published by Ian Davis under Uncategorized

This piece from Dare Obasanjo, hot on the heels of the UDDI public registry closure, adds weight to my suspicion that SOAP is finally being sidelined into a niche activity.

When I worked on the XML team, I used to interact regularly with the Indigo folks. At the time, I got the impression that they had two clear goals (i) build the world’s best Web services framework built on SOAP & WS-* and (ii) unify the diverse distributed computing offerings produced by Microsoft. As I spent time on my new job I realized that the first goal of Indigo folks didn’t jibe with the reality of how we built services. Despite how much various evangelists and marketing folks have tried to make it seem otherwise, SOAP based Web services aren’t the only Web service on the planet. Technically they aren’t even the most popular. If anything the most popular Web services is RSS which for all intents and purposes is a RESTful Web service. Today, across our division we have services that talk SOAP, RSS, JSON, XML-RPC and even WebDAV. The probability of all of these services being replaced by SOAP-based services is 0.
