Wed, Nov 21, 2007

What are Information Resources Good For?

As is probably obvious from my recent posts (e.g. Fragmentation and Is the Semantic Web Destined to be a Shadow?), I'm thinking about the TAG's httpRange-14 decision again and, with it, a large part of the Architecture of the World Wide Web. The more I think about it, the more I come to believe that Xiaoshu Wang's formulation is the only one that makes any kind of architectural sense. The foundation of all these issues lies in what was once known as the URI crisis, which boiled down to the question: what kind of things can an HTTP URI identify? The TAG took this up as httpRange-14, which was resolved back in 2005 by introducing the notion of a special class of resources called "information resources". In his essay What do HTTP URIs Identify? Tim Berners-Lee wrote:

The authors of document <http://www.w3.org/2000/10/rdf-tests/rdfcore/Manifest.rdf> certainly thought that they could use "http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest" to identify an abstract thing which is a type of software test. Now they have a choice as to what to make the server return for them when I ask for it. It returns 404 "doesn't match anything we have available". It can't really, because HTTP doesn't allow one to return a class, only a document. And if it were to return a document, then I wouldn't be able to refer to that document without accidentally referring to the class of negative parser tests.

It seems to me that this essay contains a simple mistake which colours the whole httpRange-14 debate and resolution. It says that the URI can't return anything because "HTTP doesn't allow one to return a class, only a document". That's true, but it does allow you to return a representation of the class which is a document (or potentially an image, or spoken word audio file). HTTP can never return a resource, just representations of them, i.e. things that stand in for the resource. The final point in that quote suffers from that confusion: of course you can't use the URI of the resource to refer to its representation. Possibly you could mint a new URI to denote it, but there is no standard vocabulary that I'm aware of that can relate a representation to its resource parameterized by the HTTP request headers and the time. There probably needs to be.
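To make the resource/representation split concrete, here is a minimal sketch in Python. It is a toy model, not real server code: the `Representation` record, the `get` function and both response bodies are invented for illustration, and only the class URI comes from the quote above. It shows one URI yielding different documents, each of which is only pinned down by the resource plus the request headers plus the time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# URI taken from the quoted essay; everything else here is hypothetical.
CLASS_URI = "http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest"


@dataclass(frozen=True)
class Representation:
    resource_uri: str       # the resource this document stands in for
    accept: str             # request header that selected this variant
    retrieved_at: datetime  # when it was served
    content_type: str
    body: str


def get(resource_uri: str, accept: str, now: datetime) -> Representation:
    """Toy server: same resource, different documents depending on Accept."""
    if accept == "application/rdf+xml":
        content_type = "application/rdf+xml"
        body = f'<rdfs:Class rdf:about="{resource_uri}"/>'  # made-up payload
    else:
        content_type = "text/plain"
        body = "A negative parser test is a type of software test."  # made-up payload
    return Representation(resource_uri, accept, now, content_type, body)


now = datetime(2007, 11, 21, tzinfo=timezone.utc)
rdf = get(CLASS_URI, "application/rdf+xml", now)
text = get(CLASS_URI, "text/plain", now)
```

Both `rdf` and `text` carry the same `resource_uri`, yet they are different documents. Neither of them is the class itself, and the class URI names neither of them.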

The resolution to all this was to introduce a class of resources that are basically the same as their representations: information resources. The only way to detect if you have an information resource is to GET its URI. If it responds with a 2xx response then the resource identified by that URI is an information resource. Any other response code means it might or might not be an information resource. This rule has a corollary: if you have a non-information resource then you must not respond with a 2xx response. Instead you should use 303 to point to an information resource that somehow gives information about the non-information resource.
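The rule is small enough to write down as a function. This is a minimal sketch of the status-code reading described above, nothing more: the name `classify_by_status` and the strings it returns are my own, not anything from the TAG resolution.

```python
def classify_by_status(status_code: int) -> str:
    """Apply the httpRange-14 reading of a GET response status code."""
    if 200 <= status_code < 300:
        # A 2xx response is the only positive signal the rule gives you.
        return "information resource"
    if status_code == 303:
        # Tells you nothing about the resource itself, only where to look
        # for an information resource that describes it.
        return "unknown; follow Location for a description"
    # Every other code (404, 500, ...) leaves the question open.
    return "unknown"
```

Note that there is no branch that returns "non-information resource": the rule can confirm that something is an information resource, but it can never confirm that something isn't.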

Information resources are defined in AWWW:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term "resource" is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transferred in a message. In the case of this document, the message payload is the representation of this document.

It's hard to understand what benefit the introduction of information resources brings to the Web architecture. It definitely has drawbacks. For a start it forces all non-information resources off of the web entirely - they're not allowed to respond with 200 status codes to GET requests. It also encourages the use of URIs containing fragment identifiers for non-information resources. As I've pointed out, fragment identifiers are a very broken piece of architecture and are leading to the formation of a sort of shadow web.

Information resources are also notoriously hard to define. Consider the following resources, asking of each one whether all its essential characteristics can be conveyed in a message:

A cat - no
A description of a cat - yes
A digital photo of a cat - yes
A 35mm film frame containing the image of a cat - no
A web page about cats - I hope so!
A website about cats - maybe, I guess you could tar it up and serve it from http://www.allaboutcats.com
The DNA of a cat - probably no
A recording of a cat's mew - yes, unless it's an analogue recording, in which case we have a digital approximation of the analogue recording
A cat's mew - no
A book about cats - probably not, the book is an abstract work of which there can be multiple editions, revisions, translations, abridgments etc.
The class of all cats - I say yes, timbl says no. I can convey the precise definition of a class in a message as an RDF or OWL schema so that seems to satisfy the criteria
The members of the class of all cats - no
A database of cats - yes
A card catalogue of cats - no
The name of a cat - yes
A taxonomy of cat species - yes
The cat character from Shrek 2 and 3 - yes

Remind me again, what's the point of having this distinction in types of resources....?

Permalink: http://blog.iandavis.com/2007/11/what-are-information-resources-good-for/


Internet Alchemy
