Tue, Sep 13, 2005

Crisis

I experienced something of a system shock at the DC2005 conference today. I sat in on the Architecture Working Group meeting and as events unfolded I suddenly realised that, without a radical change, I could well be witnessing the beginning of the end of the RDF project.

We were discussing the progress of the Dublin Core RDF task force and there were a number of agenda items under discussion. We didn't get past the first item though - it was so hairy and ugly that no-one could agree on the right approach. The essence of the problem is best illustrated by the dc:creator term. The current definition says "An entity primarily responsible for making the content of the resource." The associated comment states "Typically, the name of a Creator should be used to indicate the entity" and this is exactly the most common usage. Most people, most of the time, use a person's name as the value of this term. That's the natural mode if you write it in an HTML meta tag and it's the way tens or hundreds of thousands of records have been written over the past six years. In that model the value of dc:creator is simply a literal string holding the creator's name.

Of course, us RDFers, with our penchant for precision and accuracy, take issue with the notion of using a string to denote an "entity". Is it an entity or the name of an entity? Most of us prefer to add some structure to dc:creator, perhaps using a foaf:Person as the value. It lets us make more assertions about the creator entity. In that model the value of dc:creator is a node in its own right, with the name as one of its properties.
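To make the two models concrete, here is a minimal sketch that writes each one out as a set of triples in plain Python. No RDF library is assumed; the document URI, the creator's name, the blank node label "_:c" and the Literal helper class are all made up for illustration.

```python
from dataclasses import dataclass

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

DOC = "http://example.org/doc"  # hypothetical resource being described


@dataclass(frozen=True)
class Literal:
    """A literal value, kept distinct from URI and blank-node identifiers."""
    value: str


# Model 1: the value of dc:creator is the creator's name, a literal.
literal_model = {
    (DOC, DC_CREATOR, Literal("Jane Doe")),
}

# Model 2: the value of dc:creator is a node (the blank node "_:c" here)
# that carries its own properties.
structured_model = {
    (DOC, DC_CREATOR, "_:c"),
    ("_:c", RDF_TYPE, FOAF_PERSON),
    ("_:c", FOAF_NAME, Literal("Jane Doe")),
}


def creator_values(graph, resource):
    """Return every object of dc:creator for the given resource."""
    return [o for s, p, o in graph if s == resource and p == DC_CREATOR]
```

Calling creator_values on each graph hands back a Literal in the first case and a node identifier in the second, which is the whole disagreement in miniature.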

The problem, if it isn't immediately obvious, is that in RDF and RDFS it's impossible to specify that a property can have a literal value but not a resource, or vice versa. When I ask "what is the email address of the creator of this resource?" what should the (non-OWL) query engine return when the value of creator is a literal? It isn't a new issue, and is discussed in depth on the FOAF wiki.
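The email question can be sketched as a naive two-hop lookup: follow dc:creator, then foaf:mbox. This is only an illustration of a query with no inference and no special cases, not how any particular engine works; the graphs, the mailbox and the Literal helper class are invented for the example.

```python
from dataclasses import dataclass

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"

DOC = "http://example.org/doc"  # hypothetical resource being described


@dataclass(frozen=True)
class Literal:
    """A literal value, kept distinct from URI and blank-node identifiers."""
    value: str


def objects(graph, subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def creator_emails(graph, resource):
    """Follow dc:creator, then foaf:mbox, with no inference or special cases."""
    emails = []
    for creator in objects(graph, resource, DC_CREATOR):
        # A literal is never the subject of a triple, so a literal creator
        # contributes nothing here.
        emails.extend(objects(graph, creator, FOAF_MBOX))
    return emails


structured = {
    (DOC, DC_CREATOR, "_:c"),
    ("_:c", FOAF_MBOX, "mailto:jane@example.org"),  # made-up mailbox
}
literal_only = {
    (DOC, DC_CREATOR, Literal("Jane Doe")),
}

print(creator_emails(structured, DOC))    # one mailbox
print(creator_emails(literal_only, DOC))  # empty list
```

The second call comes back empty even though the record does name a creator. The query cannot tell "no email known" apart from "creator modelled as a string".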

There are several proposals for dealing with this. The one that seemed to get the most support was to recommend the latter approach and make the first illegal. That means making hundreds of thousands of documents invalid. A second approach was to endorse current practice and change the semantics of the dc:creator term to explicitly mean the name of the creator and invent a new term (e.g. creatingEntity) to represent the structured approach.

Danbri referred us to work he had done after the last DC meeting in 2004 on a SPARQL query to convert between the two forms. Discussion then moved on to special case processing for particular properties, along the lines of "if you see a dc:creator property with a literal value then you should insert a blank node and hang the literal off that". Note that I'm paraphrasing: no-one actually said this, but it was the intent.
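The special-case rule paraphrased above might look something like the following sketch. It is not Danbri's SPARQL query. It also assumes the literal gets hung off the new blank node with foaf:name, which nobody in the discussion specified. The function name, blank node labels and sample data are all hypothetical.

```python
from dataclasses import dataclass
from itertools import count

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"  # assumed target property

DOC = "http://example.org/doc"  # hypothetical resource being described


@dataclass(frozen=True)
class Literal:
    """A literal value, kept distinct from URI and blank-node identifiers."""
    value: str


def lift_literal_creators(graph):
    """Rewrite literal dc:creator values as blank nodes carrying the literal.

    Assumes the input graph contains no blank nodes labelled "_:creatorN".
    """
    fresh = count(1)
    result = set()
    # Sorting only makes the generated blank-node labels reproducible.
    for s, p, o in sorted(graph, key=repr):
        if p == DC_CREATOR and isinstance(o, Literal):
            node = f"_:creator{next(fresh)}"
            result.add((s, p, node))
            result.add((node, FOAF_NAME, o))
        else:
            # Everything else, including structured creators, passes through.
            result.add((s, p, o))
    return result


record = {
    (DOC, DC_TITLE, Literal("Crisis")),
    (DOC, DC_CREATOR, Literal("Jane Doe")),
}
lifted = lift_literal_creators(record)
```

Note that the rule has to be written per property. Nothing in the graph itself says that dc:creator needs this treatment and dc:title does not.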

That's when my crisis struck. I was sitting at the world's foremost metadata conference in a room full of people who cared deeply about the quality of metadata and we were discussing scraping data from descriptions! Scraping metadata from Dublin Core! I had to go check the dictionary entry for oxymoron just in case that sentence was there! If professional cataloguers are having these kinds of problems with RDF then we are fucked.

It says to me that the looseness of the model is introducing far too much complexity as evidenced by the difficulties being experienced by the Dublin Core community and the W3C HTML working group. A simpler RDF could take a lot of this pain away and hit a sweet spot of simplicity versus expressivity.

Where can complexity be removed?

The graph is fundamental and I think the base triple model - the simplest construct that could possibly make a graph - is right. But there are too many types of nodes: URIs, blanks, literals, literals with language, datatyped literals, XML literals. What if there were only two: literals and URIs? The rest are a distraction. URIs are cheap and to be honest I've never understood why referring to me with a URI like http://purl.org/NET/iand/foaf#ian is frowned on, especially in light of the httpRange-14 decision.

What if it were possible to define two types of properties: those that had literal values and those that had URI values? Wouldn't that make mappings from HTML and Dublin Core simpler and easier to validate?
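As a rough sketch of what that validation could look like, assume each property is declared as either literal-valued or URI-valued and a checker simply compares every triple against the declaration. The declaration table, the creatingEntity URI and the sample record are hypothetical; no existing RDF vocabulary provides this mechanism.

```python
from dataclasses import dataclass

LITERAL, URI = "literal", "uri"

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"
CREATING_ENTITY = "http://example.org/terms/creatingEntity"  # hypothetical term

DOC = "http://example.org/doc"      # hypothetical document
JANE = "http://example.org/jane"    # hypothetical person


@dataclass(frozen=True)
class Literal:
    """A literal value; anything else in object position is treated as a URI."""
    value: str


# Hypothetical declarations: each property takes exactly one kind of value.
PROPERTY_KINDS = {
    DC_TITLE: LITERAL,
    FOAF_NAME: LITERAL,
    FOAF_MBOX: URI,
    CREATING_ENTITY: URI,
}


def violations(graph):
    """Return the triples whose object kind contradicts the declaration."""
    bad = []
    for s, p, o in graph:
        expected = PROPERTY_KINDS.get(p)
        if expected is None:
            continue  # undeclared properties are not checked
        actual = LITERAL if isinstance(o, Literal) else URI
        if actual != expected:
            bad.append((s, p, o))
    return sorted(bad, key=repr)


record = {
    (DOC, DC_TITLE, Literal("Crisis")),             # fine
    (DOC, CREATING_ENTITY, Literal("Jane Doe")),    # literal where a URI is required
    (JANE, FOAF_NAME, Literal("Jane Doe")),         # fine
    (JANE, FOAF_MBOX, Literal("jane@example.org")), # literal where a URI is required
}
```

With declarations like these, a record either validates or it doesn't, and the checker needs no knowledge of any individual property beyond its declared kind.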

What if we jilted the ugly sisters of rdf:Bag, rdf:Alt and rdf:Seq and took reification out back and shot it? How many tears would be shed?

What if we junked classes, domains and ranges? Would anyone notice? The key concept in RDF is the relationship, the property.

The result would be a subset of RDF, RDF-lite perhaps. All instances of RDF-lite would be valid RDF-full but the converse wouldn't hold. SPARQL would still work and so, I suspect, would the OWL machinery despite the omission of classes. RDF diffs would be trivial without blank nodes, allowing efficient synchronisation of triple stores. Signing of triples would also be possible without requiring the hoops of canonicalisation to be jumped through.
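The diff claim can be illustrated with a small sketch. When every node is a URI or a literal, a triple means the same thing in both stores, so a diff is plain set difference. With blank nodes you would first have to work out which locally labelled nodes correspond to each other. The function names and sample data below are made up for the example.

```python
from dataclasses import dataclass

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"

JANE = "http://example.org/people#jane"  # hypothetical person URI


@dataclass(frozen=True)
class Literal:
    """A literal value, kept distinct from URIs."""
    value: str


def diff(old, new):
    """Return (added, removed) triples between two blank-node-free graphs."""
    return new - old, old - new


def apply_patch(graph, added, removed):
    """Bring a store up to date by applying a diff."""
    return (graph - removed) | added


old_store = {
    (JANE, FOAF_NAME, Literal("Jane Doe")),
    (JANE, FOAF_MBOX, "mailto:old@example.org"),
}
new_store = {
    (JANE, FOAF_NAME, Literal("Jane Doe")),
    (JANE, FOAF_MBOX, "mailto:new@example.org"),
}

added, removed = diff(old_store, new_store)
synced = apply_patch(old_store, added, removed)
```

Only the changed mailbox triple travels in each direction, and applying the patch to the old store reproduces the new one exactly.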

Maybe it's necessary to take a few steps back to find the true path to the summit.

Permalink: http://blog.iandavis.com/2005/09/crisis/

Internet Alchemy
