Thu, Nov 4, 2010

Is 303 Really Necessary?

A few months back I threw out a question on Twitter: what breaks on the web if we use status code 200 instead of 303 for our Linked Data? I saw a resurgence of this on Twitter today, which prompted me to finally write up my thoughts in a medium that allows more than 140-character soundbites.

For those new to this debate, the current practice in the Linked Data community is to divide the world into two classes: things that are definitely "Information Resources" and things that might be information resources or might be something else entirely, like a planet or a toucan. I have written on the subject of information resources before (such as what are information resources good for and 303 Asymmetry). Simplistically, information resources can be considered to be electronic documents such as web pages, spreadsheets, images and the like. The only way to tell for sure whether a URI denotes an information resource is to dereference it. If you get a status code of 200 then the URI denotes an information resource. No meaning is ascribed to any other status code, and in those cases the URI might denote an information resource or it might not.

Why, you might ask, is all this emphasis placed on information resources? The answer is that the overwhelming use of the web is to serve up electronic documents, predominantly HTML. The Linked Data people want to use the web's infrastructure to store information about other things (planets and toucans) and use HTTP URIs to denote those things. Because toucans aren't electronic documents it has been assumed that we need to distinguish the toucan itself from the document containing data about the toucan. One of the central dictums of Linked Data is that URIs can only denote one thing at a time: that means the URI for the toucan needs to be different from the URI for the document about the toucan. We connect the two together in two ways:

when someone issues an HTTP GET to the toucan's URI the server responds with a 303 status code redirecting the user to the document about the toucan
when someone issues an HTTP GET to the document's URI the server responds with a 200 status code and an RDF document containing triples that refer to the toucan's URI
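The two behaviours above can be sketched as a tiny WSGI application. This is only an illustration under stated assumptions, not a recommended server setup: the paths /toucan and /doc, the http://example.org/ns# namespace behind the ex: prefix, and the Turtle body are all hypothetical.

```python
# Hypothetical description document; the ex: namespace is assumed for illustration.
DOC_BODY = b"""@prefix ex: <http://example.org/ns#> .
<http://example.org/toucan> ex:owner <http://example.org/anna> .
<http://example.org/doc> ex:owner <http://example.org/fred> .
"""


def app(environ, start_response):
    """WSGI app modelling the 303 pattern for a single real-world thing."""
    path = environ.get("PATH_INFO", "/")
    if path == "/toucan":
        # The toucan is not a document, so redirect to its description.
        start_response("303 See Other", [("Location", "http://example.org/doc")])
        return [b""]
    if path == "/doc":
        # The description is an information resource, so serve it directly.
        start_response("200 OK", [("Content-Type", "text/turtle")])
        return [DOC_BODY]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def call(path):
    """Invoke the app in-process and return (status, headers, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

The call helper exists only so the behaviour can be inspected without opening a socket; the same app could be served with wsgiref.simple_server.make_server from the standard library. Note that a client wanting the toucan's data always pays for two requests.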

That is the current state of affairs for situations where people want to use HTTP URIs to denote real world things. (There is another approach which uses URIs with a fragment, e.g. http://example.com/doc#foo, which avoids this 303 redirect, but it has its own problems as I point out here and here.)

There are several disadvantages to the 303 redirect approach:

it requires an extra round-trip to the server for every request
only one description can be linked from the toucan's URI
the user enters one URI into their browser and ends up at a different one, causing confusion when they want to reuse the URI of the toucan. Often they use the document URI by mistake.
it's non-trivial to configure a web server to issue the correct redirect, and to do so only for the things that are not information resources.
the server operator has to decide which resources are information resources and which are not without any precise guidance on how to distinguish the two (the official definition speaks of things whose "essential characteristics can be conveyed in a message"). I enumerate some examples here but it's easy to get to the absurd.
it cannot be implemented using a static web server setup, i.e. one that serves static RDF documents
it mixes layers of responsibility - there is information a user cannot know without making a network request and inspecting the metadata about the response to that request. When the web server ceases to exist then that information is lost.
the 303 response can really only be used with things that aren't information resources. You can't serve up an information resource (such as a spreadsheet) and 303 redirect to metadata about the spreadsheet at the same time.
having to explain the reasoning behind using 303 redirects to mainstream web developers simply reinforces the perception that the semantic web is baroque and irrelevant to their needs.

The one clear advantage it has is:

It's easy to distinguish the toucan from the description of the toucan

Given there are more disadvantages than advantages the natural assumption has to be that the single advantage vastly outweighs the cost of the disadvantages to server operators and consumers of Linked Data.

I am far from convinced that it does.

Firstly, I have never needed to distinguish these things and secondly, if I ever did, then RDF itself makes that trivial by its self-describing nature. The document retrieved can contain triples that assert the nature of the thing denoted by the requested URI.
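As a rough sketch of what "self-describing" means here, a consumer can look at the rdf:type triples in the retrieved document rather than at the HTTP status code. The triples are modelled as plain tuples; rdf:type and foaf:Document are real terms, while the ex:Toucan class and the example.org URIs are assumptions made for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_DOCUMENT = "http://xmlns.com/foaf/0.1/Document"
EX_TOUCAN = "http://example.org/ns#Toucan"  # hypothetical class

# Triples as (subject, predicate, object), as they might appear in the response.
triples = [
    ("http://example.org/toucan", RDF_TYPE, EX_TOUCAN),
    ("http://example.org/doc", RDF_TYPE, FOAF_DOCUMENT),
]


def types_of(uri, triples):
    """Return the set of classes asserted for uri in the given triples."""
    return {o for s, p, o in triples if s == uri and p == RDF_TYPE}


def is_declared_document(uri, triples):
    """True if the data itself says that uri denotes a document."""
    return FOAF_DOCUMENT in types_of(uri, triples)
```

With this data, is_declared_document is false for the toucan's URI and true for the document's URI, and that information survives even if the server that published it goes away.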

So, back to my original question: what exactly would break on the web if we dropped the requirement to issue a 303 redirect when the user requested the URI of our toucan? What if we simply responded with a 200 status code and the description document?

It's pretty clear what we would gain: all of those disadvantages above would be eliminated.

At first glance it looks like we would be left with the problem of distinguishing the toucan from its description. However, the description document can still retain its own URI. We link the toucan to its document by using a new property ex:isDescribedBy. This property has exactly the same semantics as the 303 redirect except it is active at the data layer and not the network layer. That means that we still keep the advantage of distinguishing the toucan from its document.

As an example here's how one could declare the owner of the toucan and the owner of the description document to be different individuals. Under the current state of affairs it's simple because the toucan and the document have different URIs and no RDF is ever emitted from the toucan's URI:

GET /toucan responds with a 303 to /doc

GET /doc responds with 200 and a representation containing some RDF which includes the triples <http://example.org/toucan> ex:owner <http://example.org/anna> and <http://example.org/doc> ex:owner <http://example.org/fred>

Under my new scheme:

GET /toucan responds with 200 and a representation containing some RDF which includes the triples <http://example.org/toucan> ex:owner <http://example.org/anna> and <http://example.org/toucan> ex:isDescribedBy <http://example.org/doc>

GET /doc responds with 200 and a representation containing some RDF which includes the triple <http://example.org/doc> ex:owner <http://example.org/fred>

There would be no requirement for the toucan's response to include the ex:isDescribedBy property. If the owner of the server has no additional information about the description document then there is no point in linking to it.

It's important to note that the data in the response to the GET on /toucan should be taken at face value. Any triples referencing the /toucan URI refer to the thing denoted by that URI, not to the representation retrieved from it. (As an aside this is consistent with current HTTP semantics which does not name individual representations).
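The exchange under the new scheme can be modelled in a few lines. This is a sketch only: the get function stands in for a real HTTP request, triples are plain tuples, and the http://example.org/ns# namespace behind the ex: prefix is an assumption.

```python
EX = "http://example.org/ns#"  # hypothetical namespace behind the ex: prefix
TOUCAN = "http://example.org/toucan"
DOC = "http://example.org/doc"

# Both URIs answer 200; each response carries (subject, predicate, object) triples.
RESPONSES = {
    "/toucan": (200, [
        (TOUCAN, EX + "owner", "http://example.org/anna"),
        (TOUCAN, EX + "isDescribedBy", DOC),  # optional link to the description
    ]),
    "/doc": (200, [
        (DOC, EX + "owner", "http://example.org/fred"),
    ]),
}


def get(path):
    """Stand-in for an HTTP GET: return (status, triples) for a path."""
    return RESPONSES.get(path, (404, []))


def objects(subject, predicate, triples):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Triples about the toucan's URI are taken at face value: they describe the toucan.
status, triples = get("/toucan")
toucan_owner = objects(TOUCAN, EX + "owner", triples)
description_uris = objects(TOUCAN, EX + "isDescribedBy", triples)

# Following the data-layer link plays the role the 303 redirect used to play.
_, doc_triples = get("/doc")
doc_owner = objects(DOC, EX + "owner", doc_triples)
```

The toucan and its description still have different owners and different URIs, but a client that only cares about the toucan needs a single request.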

As far as I can see this approach doesn't break the web; it just provides a bunch of clear advantages. It's simpler, more efficient, more extensible and has clearer semantics than the current 303 approach, and it removes the onus on server operators to decide what is and isn't an information resource.

If there really are no disadvantages and no breakage in the web, we really ought to evangelise to get this approach accepted as standard practice. That includes doing the following:

define a stable URI for ex:isDescribedBy (the POWDER property describedby seems close but makes assumptions about the type of description pointed to)
lobby the W3C TAG to deprecate their finding on httpRange-14
update the how to publish linked data tutorial
update Tabulator and other linked data browsers to understand the new semantics
convert existing linked datasets to use 200 instead of 303
perhaps lobby the new RDF working group to write this approach up as a note or recommendation

But before all that, perhaps there really are areas of the web architecture that break under this approach. If you spot any, let me know in the comments.

Update Nov 5

I posted a link to this blog on the public-lod@w3.org mailing list which generated lots of discussion: http://markmail.org/thread/mkoc5kxll6bbjbxk

To aid that discussion I've also created a small demo of the idea. Here is the URI of a toucan:

http://iandavis.com/2010/303/toucan

Here is the URI of a description of that toucan:

http://iandavis.com/2010/303/toucan.rdf

As you can see both these resources have distinct URIs. I created a new property http://vocab.org/desc/schema/description to link the toucan to its description. The schema for that property is here:

http://vocab.org/desc/schema

(BTW I looked at the POWDER describedby property and it's clearly designed to point to one particular type of description, not a general RDF one. I also looked at http://ontologydesignpatterns.org/ont/web/irw.owl and didn't see anything suitable.)

Here is the URI Burner view of the toucan resource and of its description document:

I'd like to use this demo to focus on the main thrust of my question: does this break the web and if so, how?

Permalink: http://blog.iandavis.com/2010/11/is-303-really-necessary/
