Fragmentation

I'm troubled by this well-written essay by Xiaoshu Wang, in particular this part:

This example showed that the identity of URI is never ambiguous. What is ambiguous is our mental assignment of the URI's identity. Similarly, in the Mr. Hayes' example, if we say that "http://www.ihmc.us/users/phayes/PatHayes" denotes a person and a representation of http://www.ihmc.us/users/phayes/PatHayes is a web page, no confusion would have been created. And in the image example, if we say that http://dfdf.inesc-id.pt/tr/doc/web-arch/img/fig2 denotes an idea and one of its representations is a picture, no ambiguity will arise either.

I'm troubled because I don't think I can disagree with it. In fact I think it might be the only sane interpretation possible. The fact that it runs counter to the W3C's Architecture of the World Wide Web and the whole of the Linked Data best practice is kind of a worry for me.

I'm also troubled by this statement from timbl on the subject of fragment identifiers:

There are three possible attitudes:

1) don't mix HTML and RDF, HTML will always have anchors. I think that this doesn't meet the need.

2) Do mix RDF and HTML, allow one file to define both anchors and arbitrary things. Don't let the same fragid be used for both an anchor and a thing.

3) Do mix them, and by the way, allow the same fragid to be used as an ID for an anchor and an ID for a thing, with RDF clients and HTML clients doing different things. I think that this path leads to madness, as in a script, for example, I may want to use a URI to refer to one or the other unambiguously. It also makes it impossible for HTML+RDF clients.

Actually I'm more than troubled, I'm really concerned by this. The fact that writing off option 3 blows eRDF and RDFa to smithereens is a non-issue; we can rework things to fix that. What I'm really concerned about is the growing evidence that fragment identifiers on the web are a broken technology and are to be avoided if not deprecated.

The nub of the problem is that the meaning of a fragment identifier is determined by the type of the representation. For example, what is the meaning of http://www.w3.org/People/Berners-Lee/card#i? The answer depends on what representation you obtain when you issue an HTTP GET against http://www.w3.org/People/Berners-Lee/card (the fragment itself is never sent to the server). You might get some HTML, in which case the URI represents a fragment of an HTML document, or you might get some RDF, in which case the URI could represent a person. Clearly HTML fragments and people are disjoint sets: no person is an HTML fragment and vice versa.
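To make that concrete, here is a minimal Python sketch of how a client ends up with two readings of the same URI. The `FRAGMENT_MEANING` table and the `interpret` function are hypothetical names invented for illustration, and the two descriptions are my paraphrase of what each media type does with a fragment, not text from any spec. Only `urllib.parse.urldefrag` is a real standard library call.

```python
from typing import Tuple
from urllib.parse import urldefrag

# Hypothetical lookup table (an assumption for illustration): how a client
# reads a fragment identifier, keyed by the media type it was served.
FRAGMENT_MEANING = {
    "text/html": "an element inside the HTML document",
    "application/rdf+xml": "whatever thing the RDF graph describes with that URI",
}


def interpret(uri: str, media_type: str) -> Tuple[str, str]:
    """Return (URL actually requested, what the fragment is taken to mean)."""
    # The client strips the fragment before making the request, so the
    # server never sees it and cannot use it to pick a representation.
    url, fragment = urldefrag(uri)
    if not fragment:
        return url, "no fragment: the URI refers to the resource as a whole"
    meaning = FRAGMENT_MEANING.get(media_type, "undefined for this media type")
    return url, "#" + fragment + " denotes " + meaning


if __name__ == "__main__":
    card = "http://www.w3.org/People/Berners-Lee/card#i"
    for served_type in ("text/html", "application/rdf+xml"):
        print(served_type, "->", interpret(card, served_type))
```

Both calls request the same URL, yet the fragment comes back with a different meaning each time, which is exactly the ambiguity described above.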

That principle is enshrined and embodied in dozens of RFCs and recommendations that define the meanings of fragment identifiers for various formats such as SVG and XPointer. The problem isn't even a new one, as this note written by timbl in 2001 points out:

Fragment IDs and Content negotiation - known bug. If content negotiation occurs across types which do NOT share a fragment ID specification, then rigidly there has been an error. In practice, HTML was the only type (in 1997) which allowed fragment IDs anyway, and other types ignore it. Also, as falling back from a pointer to a specific view to a pointer to the whole document has been considered effective fallback procedure, so no harm was done. Now (2001) it becomes more of a problem. (There have been proposals to add the requested fragment identifier to the HTTP request to fix this.)

This is echoed in the webarch document:

representation providers must not use content negotiation to serve representation formats that have inconsistent fragment identifier semantics. This situation also leads to URI collision (§2.2.1).

Read strictly, this forbids serving RDF and HTML from the same URI. Content negotiation is blamed here, but in reality the problem is caused by the assumption that the fragment identifiers are somehow associated with the underlying resource rather than the representation.
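The webarch rule can be sketched as a simple check over the media types a server is willing to negotiate between. This is only an illustration under assumptions: the `FRAGMENT_SEMANTICS` grouping below is my own simplification (in particular, lumping the two HTML types together and the two RDF types together), and the function names are hypothetical.

```python
# Assumed, simplified grouping of media types by how they interpret
# fragment identifiers. The labels are invented for this sketch.
FRAGMENT_SEMANTICS = {
    "text/html": "html-anchor",
    "application/xhtml+xml": "html-anchor",
    "application/rdf+xml": "rdf-node",
    "text/turtle": "rdf-node",
}


def fragment_semantics_in_use(variants):
    """Return the set of distinct fragment semantics among the variants.

    Media types missing from the table are treated as having their own,
    unshared semantics, which is the cautious reading of the rule.
    """
    return {FRAGMENT_SEMANTICS.get(v, "unspecified:" + v) for v in variants}


def is_safe_to_negotiate(variants):
    """True if every variant reads fragment identifiers the same way."""
    return len(fragment_semantics_in_use(variants)) <= 1


if __name__ == "__main__":
    print(is_safe_to_negotiate(["text/html", "application/xhtml+xml"]))
    print(is_safe_to_negotiate(["text/html", "application/rdf+xml"]))
```

Under these assumptions the HTML-plus-RDF pairing fails the check, which is the case the rule rules out.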

It's interesting that another warning on the dangers of fragment identifiers comes to a conclusion spookily similar to Xiaoshu's. This time it's from Aaron Swartz, back in 2001 again:

Fragments in Web Architecture only makes sense when referring to a representation of a resource, not a resource itself. URI references worked in HTML where the set context was of surfing between web pages (representations), and human users could deal with some breakage. However, as we move into formats like RDF, clarity and precision become increasingly important and URI fragments just don't work.

The whole httpRange-14 issue was supposed to resolve this (and I accepted it at the time) but I've come to the conclusion that it is fundamentally flawed. It encourages people to use URIs with fragments to represent resources, when the fragment changes meaning depending on the representation served from that URI. The URI represents the resource and the fragment identifies a portion of a representation obtained from that URI. Web pages are one type of representation, and thinking of them in that way avoids all this nonsense about Information Resources, which I now believe is a patch required to cover up a crack in the architecture.

Permalink: http://blog.iandavis.com/2007/11/fragmentation/
