Dec 05 2007

303 Asymmetry

Published by Ian Davis under Uncategorized

I mentioned a while back that I wanted to talk more about the descriptions vs representations issue. A recent message by timbl provided the impetus to do so. In that message he says the following:

Try thinking of it this way instead. You are going to serve some representation on the web, for this thing. Are those going to be (a) ABOUT the thing, or (b) the CONTENT of this thing denoted by the URI? If the former, you must use # or 303. If the later, you can serve the representations with 200 from that URI. You see, 200 means (basically) “Here comes the content of the document you asked for” and 303 means “Here is the URI of document ABOUT the thing you asked for.

That seems a pretty good characterisation of the 303 decision. Information resources can serve up representations of themselves; other resources cannot, so you have to make do with descriptions. (Regular readers will already know that I don’t fully agree with this model, but it’s the accepted one in the SemWeb community.)

There appears to be an asymmetry about this though and I think it’s a limitation of the model.

Suppose I have a resource “R” with URI http://example.org/R. If it is an “Information Resource” then I can arrange things so that a GET request for its text/html representation responds with a 200 and the HTML in the body of the response. I could also arrange for a request for its application/rdf+xml representation to respond with a 303 status and the URI of another information resource “RDESC” (e.g. http://example.org/RDESC). In this example the 303 response means that “R” cannot be represented as RDF, but there’s an alternative RDF document that is a description of R. The user can then re-issue the request on http://example.org/RDESC to obtain that description.

Now, I can arrange for http://example.org/RDESC to return an RDF representation in a 200 response. But, here’s the asymmetry. How can I allow the user to obtain a description of RDESC? The representation I send back is the content of RDESC, not its description. I can’t use the media type to distinguish the type of request any more.
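The arrangement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `respond` function, the placeholder bodies and the simplified single-value Accept handling are all hypothetical, and only the two URIs and the status codes come from the example above.

```python
# Hypothetical routing for the two example URIs; not a real server.
R = "http://example.org/R"
RDESC = "http://example.org/RDESC"


def respond(uri, accept):
    """Return (status, headers, body) for a GET with one Accept media type."""
    if uri == R:
        if accept == "text/html":
            # R is an information resource: serve its own content.
            return 200, {"Content-Type": "text/html"}, "HTML content of R"
        if accept == "application/rdf+xml":
            # R has no RDF representation: point at a description instead.
            return 303, {"Location": RDESC}, ""
        return 406, {}, ""
    if uri == RDESC:
        if accept == "application/rdf+xml":
            # This body is the content of RDESC (a description of R).
            # Nothing here can point to a description of RDESC itself.
            return 200, {"Content-Type": "application/rdf+xml"}, "RDF describing R"
        return 406, {}, ""
    return 404, {}, ""
```

The media type already selects between content and description for R, so it has no room left to make the same selection for RDESC, which is the asymmetry.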

In case you think this is an artificial distinction, it’s not. We’re dealing with it right now at Talis (and have been for a number of months). We give access to a number of RDF graphs via HTTP where naturally the user wants to obtain the content of the graph as the response to a GET. We also serve RDF descriptions of those graphs containing some configuration information. There’s no standard way to link the two things together so that users can select either the description or the content using HTTP.

Now, I happen to think that there is an interesting solution to this asymmetry. Suppose we created a new HTTP header called “resource-description” whose value was the URI of a description of the given resource. Note that it’s a description of the resource, not of any representation that is being sent as part of the response. The asymmetry goes away because this gives you a method of pointing to the description regardless of the status code and/or content negotiation going on in the request.

Things get even more interesting if you allow multiple resource-description headers: what a great way to cross link to other people’s descriptions of your resource.
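To make the idea concrete, here is a minimal Python sketch of how a server might attach the header and how a client might read it back. The header is only a proposal, so nothing here is a real API: the helper names and the example.org and example.com URIs are hypothetical, and headers are modelled as a plain list of (name, value) pairs so that the header can repeat.

```python
HEADER = "Resource-Description"  # the proposed header; not a registered one


def add_descriptions(headers, description_uris):
    """Return a new header list with one proposed header per description URI."""
    return headers + [(HEADER, uri) for uri in description_uris]


def find_descriptions(headers):
    """Collect every description URI, matching the name case-insensitively."""
    return [value for name, value in headers if name.lower() == HEADER.lower()]


# A 200 response carrying the content of RDESC plus links to two
# descriptions of RDESC: one local, one published by somebody else.
response_headers = add_descriptions(
    [("Content-Type", "application/rdf+xml")],
    [
        "http://example.org/RDESC-config",       # hypothetical URI
        "http://example.com/notes-about-RDESC",  # hypothetical URI
    ],
)
```

Because the links travel in the headers, they are available whatever the status code or negotiated media type turns out to be.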

I seem to recall something similar to this being proposed a few years back but my Googling doesn’t turn anything up. Going back further takes us to Patrick Stickler’s attempts to solve the description problem using URIQA, which took the approach of introducing another verb, MGET, to obtain the description. This was almost universally disliked, but the underlying problem has remained unsolved in the meantime.

And given where my head has been for the past few weeks, I have to ask what decision would have been taken on the httpRange-14 issue if this header had already existed. Would we instead be returning 406 responses when we cannot supply a suitable representation for a resource, or even a 204? Both of those could work with the header pointing to an appropriate description of the resource.

Update: it seems that timbl and I were touched by the same muse tonight: Alternative to 303 response: Description-ID: header


Nov 29 2007

It’s OK to use URIs with Fragments in RDF

Published by Ian Davis under Uncategorized

I’ve been doing some more digging on my fragmentation and shadow web themes and came across something I hadn’t really seen before or, if I have, has been completely wiped from my mind. The RDF Concepts document contains a whole section on fragment identifiers which is worth reproducing:

RDF uses an RDF URI Reference, which may include a fragment identifier, as a context free identifier for a resource. RFC 2396 [URI] states that the meaning of a fragment identifier depends on the MIME content-type of a document, i.e. is context dependent.

These apparently conflicting views are reconciled by considering that a URI reference in an RDF graph is treated with respect to the MIME type application/rdf+xml [RDF-MIME-TYPE]. Given an RDF URI reference consisting of an absolute URI and a fragment identifier, the fragment identifer identifies the same thing that it does in an application/rdf+xml representation of the resource identified by the absolute URI component. Thus:

  • we assume that the URI part (i.e. excluding fragment identifier) identifies a resource, which is presumed to have an RDF representation. So when eg:someurl#frag is used in an RDF document, eg:someurl is taken to designate some RDF document (even when no such document can be retrieved).
  • eg:someurl#frag means the thing that is indicated, according to the rules of the application/rdf+xml MIME content-type as a “fragment” or “view” of the RDF document at eg:someurl. If the document does not exist, or cannot be retrieved, or is available only in formats other than application/rdf+xml, then exactly what that view may be is somewhat undetermined, but that does not prevent use of RDF to say things about it.
  • the RDF treatment of a fragment identifier allows it to indicate a thing that is entirely external to the document, or even to the “shared information space” known as the Web. That is, it can be a more general idea, like some particular car or a mythical Unicorn.
  • in this way, an application/rdf+xml document acts as an intermediary between some Web retrievable documents (itself, at least, also any other Web retrievable URIs that it may use, possibly including schema URIs and references to other RDF documents), and some set of possibly abstract or non-Web entities that the RDF may describe.

This provides a handling of URI references and their denotation that is consistent with the RDF model theory and usage, and also with conventional Web behavior. Note that nothing here requires that an RDF application be able to retrieve any representation of resources identified by the URIs in an RDF graph.

I’ve been thinking about this for a couple of days and I’m still not entirely sure what to make of it. What it appears to be saying is that RDF ignores the Web Architecture principle that fragment identifiers are given meaning by the representation that is retrieved.

So this ensures that RDF is self-consistent. I can refer to anything I like using a fragment identifier in my URI and I’m guaranteed not to have my intended meaning upset by anything messy like a network operation. This alleviates one of my major concerns at using these kinds of URIs in RDF, but at what cost? If anything this increases my concerns over the shadow web since by circumventing the web architecture it sets RDF further away from today’s web of documents. For example, when I use “http://www.w3.org/TR/webarch/#media-type-fragid” as a URI in my RDF, it probably doesn’t refer to the thing you think it does. You, as a human (if you are), get to see a representation of that section of the document when you click on the link, but an RDF-aware agent must treat that URI as though rdf/xml had been retrieved. Unfortunately there isn’t any RDF there and the Web Architecture actually forbids you from serving up both HTML and RDF documents at the same URI.
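The two readings of a fragment can be contrasted in a small Python sketch. `urldefrag` is the standard library call for splitting off a fragment; the `fragment_context` function and its parameters are hypothetical names used only for illustration, and the sketch models which media type governs the fragment rather than performing any retrieval.

```python
from urllib.parse import urldefrag


def fragment_context(uri, retrieved_media_type, used_in_rdf):
    """Return (document URI, fragment, media type that gives the fragment meaning)."""
    # Only the document URI is ever sent to the server; the fragment stays
    # on the client side.
    document_uri, fragment = urldefrag(uri)
    if used_in_rdf:
        # Per the RDF Concepts text quoted above, the fragment is read as if
        # application/rdf+xml had been retrieved, whatever the server sent.
        media_type = "application/rdf+xml"
    else:
        # Conventional reading: the retrieved representation decides.
        media_type = retrieved_media_type
    return document_uri, fragment, media_type


uri = "http://www.w3.org/TR/webarch/#media-type-fragid"
browser_view = fragment_context(uri, "text/html", used_in_rdf=False)
rdf_view = fragment_context(uri, "text/html", used_in_rdf=True)
```

Both calls share the same document URI and fragment, but they disagree about which media type's rules apply, which is the gap between the RDF agent and the human reader.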

What does that mean? How are we supposed to interpret that? One interpretation is that it really doesn’t matter what you do outside of RDF. You can throw up all kinds of other representation formats and it won’t affect yours or anyone else’s RDF. They might use the same identifiers, and occasionally, coincidentally they may identify the same things, but in general RDF is partitioned into its own little world. RDF can only link to RDF.

How can RDF co-exist with other formats on the Web if it ignores their semantics? If you just want the Semantic Web to be built using RDF then you probably don’t care. But if, like me, you want to see an inclusive Semantic Web built from a mix of RDF, microformats, topic maps, RDDL and all the other ways to express semantics, then it’s a very very big problem. I don’t want two webs competing for attention, I want one strong one.

Hence the title of this post. It is OK to use URIs with fragments in RDF, but only if you don’t particularly care about relating to the existing web. If you do care then avoid fragments at all costs. Use standard URIs and stick 303 redirects on them if you need to. It’ll work and the whole web will be better for it.


Nov 21 2007

Isn’t The Web Built From Links?

Published by Ian Davis under Uncategorized

If my shadow web post hasn’t convinced you then try this thought experiment:

You want to link from your webpage to Tim Berners-Lee’s URI <http://www.w3.org/People/Berners-Lee/card#i>, except you can’t because that link points to something that can never contain the #i fragment in its HTML. It can only ever link to RDF because Tim is relying on RDF’s semantics for the meaning of the #i fragment. Tough luck if you can’t read RDF or don’t want to have to learn.


Nov 21 2007

Reformulating the Web Architecture

Published by Ian Davis under Uncategorized

So, accepting that URIs with fragments are generally a broken piece of architecture for the Semantic Web and that information resources are not adding any real substance, here’s how I see the Web Architecture being reformulated for use with the Semantic Web:

  1. A hashless URI should be allowed to denote any resource whatsoever. Documents, books, people, galaxies and unicorns. There is no ambiguity here, the URI denotes a single thing. More than one URI can denote the same thing, so I can have a URI that denotes the city of London, and Danny can have a different URI that also denotes London.
  2. A representation of a resource can be obtained by issuing an HTTP GET on a URI. The representation is a sequence of bits that somehow stands in for the resource the URI denotes. Content negotiation can be used to select an appropriate format for the representation, without changing the actual resource being denoted. Perhaps my URI denoting London can respond with an HTML document containing essential facts and figures about the city, a JPEG aerial photograph, an SVG streetmap or a sound recording of the sounds encountered while in the city itself. None of these things are London, but they all can stand in for it in some limited fashion. I could retrieve them all to obtain a better sense of London itself, but I cannot actually obtain London using HTTP.
  3. URIs containing hashes are constrained in what they may denote and have an inherent ambiguity due to their reliance on the particular representation obtained. Their denotations vary depending on the URI plus a set of HTTP headers used during the request.
  4. There is no such thing as an “Information Resource”. All resources are made equal. However for many resources, the only representation available happens to be identical to the resource itself. Still, you cannot obtain the actual resource using HTTP, but you can get a copy in the form of a representation. The majority of HTML documents on the web behave in this manner, a single representation that is a copy of the resource itself.

These aren’t huge changes and they’re backwards compatible with the existing web. On the other hand they greatly reduce the reliance on fragment identifiers and they encourage people to use real unambiguous URIs to refer to things other than documents, weaving the Semantic Web right into today’s Web.
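Point 2 can be sketched as a toy content negotiator in Python. Everything named here is an assumption made for illustration: the London URI, the placeholder bodies, the choice of audio/mpeg for the sound recording, and the simplified Accept parsing, which takes the first listed type it can serve and ignores q-values.

```python
LONDON = "http://example.org/london"  # hypothetical hashless URI denoting the city

# Every entry stands in for London; none of them is London.
REPRESENTATIONS = {
    "text/html": "HTML page of facts and figures",
    "image/jpeg": "JPEG aerial photograph",
    "image/svg+xml": "SVG street map",
    "audio/mpeg": "recording of city sounds",  # media type is an assumption
}


def negotiate(accept_header):
    """Return (status, media type, body) for a GET on LONDON."""
    for item in accept_header.split(","):
        # Drop parameters such as ";q=0.5" and surrounding whitespace.
        media_type = item.split(";")[0].strip()
        if media_type in REPRESENTATIONS:
            return 200, media_type, REPRESENTATIONS[media_type]
    # Nothing acceptable is available for this resource.
    return 406, None, None
```

The URI and the thing it denotes stay fixed throughout; only the representation varies with the request headers.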

For background, you might like to read my earlier posts on this subject:


Nov 21 2007

What are Information Resources Good For?

Published by Ian Davis under Uncategorized

As is probably obvious from my recent posts (e.g. Fragmentation and Is the Semantic Web Destined to be a Shadow?), I’m thinking about the TAG’s httpRange-14 decision again, and about a large part of the Architecture of the World Wide Web. The more I think about it, the more I come to believe that Xiaoshu Wang’s formulation is the only one that makes any kind of architectural sense. The foundation of all these issues rests in what was once known as the URI crisis, which boiled down to the question: what kind of things can an HTTP URI identify? The TAG took this up as httpRange-14, which was resolved back in 2005 by introducing the notion of a special class of resources called “information resources”. In his essay What do HTTP URIs Identify? Tim Berners-Lee wrote:

The authors of document <http://www.w3.org/2000/10/rdf-tests/rdfcore/Manifest.rdf> certainly thought that they could use “http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest” to identify an abstract thing which is a type of software test. Now they have a choice as to what to make the server return for them when I ask for it. It returns 404 “doesn’t match anything we have available”. It can’t really, because HTTP doesn’t allow one to return a class, only a document. And if it were to return a document, then I wouldn’t be able to refer to that document without accidentally referring to the class of negative parser tests.

It seems to me that this essay contains a simple mistake which colours the whole httpRange-14 debate and resolution. It says that the URI can’t return anything because “HTTP doesn’t allow one to return a class, only a document”. That’s true, but it does allow you to return a representation of the class which is a document (or potentially an image, or spoken word audio file). HTTP can never return a resource, just representations of them, i.e. things that stand in for the resource. The final point in that quote suffers from that confusion: of course you can’t use the URI of the resource to refer to its representation. Possibly you could mint a new URI to denote it, but there is no standard vocabulary that I’m aware of that can relate a representation to its resource parameterized by the HTTP request headers and the time. There probably needs to be.

The resolution to all this was to introduce a class of resources that are basically the same as their representations: information resources. The only way to detect if you have an information resource is to GET its URI. If it responds with a 2xx response then the resource identified by that URI is an information resource. Any other response code means it might or might not be an information resource. This rule has a corollary: if you have a non-information resource then you must not respond with a 2xx response, instead you should use 303 to point to an information resource that somehow gives information about the non-information resource.
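The detection rule and its corollary are small enough to write down directly. This Python sketch assumes nothing beyond the paragraph above; the function names and the returned labels are hypothetical.

```python
def classify(status):
    """What a client may conclude about a URI from the status of a GET."""
    if 200 <= status < 300:
        return "information resource"
    # A 303, a 404 or anything else settles nothing either way.
    return "unknown"


def status_for(is_information_resource):
    """The corollary, seen from the publisher's side."""
    # A non-information resource must not answer with a 2xx, so it
    # redirects to an information resource that describes it.
    return 200 if is_information_resource else 303
```

Note that `classify(status_for(False))` gives "unknown": a client that receives the 303 learns only that it has been pointed elsewhere, not what kind of thing it originally asked about.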

Information resources are defined in AWWW:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transfered in a message. In the case of this document, the message payload is the representation of this document.

It’s hard to understand what benefit the introduction of information resources brings to the Web architecture. It definitely has drawbacks. For a start it forces all non-information resources off the web entirely – they’re not allowed to respond with 200 status codes to GET requests. It encourages non-information resources to use URIs containing fragment identifiers which, as I’ve pointed out, are a very broken piece of architecture and are leading to the formation of a sort of shadow web.

They are also notoriously hard to define. Consider the following resources, asking of each whether all its essential characteristics can be conveyed in a message:

  • A cat – no
  • A description of a cat – yes
  • A digital photo of a cat – yes
  • A 35mm film frame containing the image of a cat – no
  • A web page about cats – I hope so!
  • A website about cats – maybe, I guess you could tar it up and serve it from http://www.allaboutcats.com
  • The DNA of a cat – probably no
  • A recording of a cat’s mew – yes, unless it’s an analogue recording in which case we have a digital approximation of the analogue recording
  • A cat’s mew – no
  • A book about cats – probably not, the book is an abstract work of which there can be multiple editions, revisions, translations, abridgments etc.
  • The class of all cats – I say yes, timbl says no. I can convey the precise definition of a class in a message as an RDF or OWL schema so that seems to satisfy the criteria
  • The members of the class of all cats – no
  • A database of cats – yes
  • A card catalogue of cats – no
  • The name of a cat – yes
  • A taxonomy of cat species – yes
  • The cat character from Shrek 2 and 3 – yes

Remind me again, what’s the point of having this distinction in types of resources….?
