Back to Basics with Linked Data and HTTP
<blockquote><p>In the Semantic Web, it is not the Semantic which is novel, it is the Web</p></blockquote>
That quote, attributed to Chris Welty of IBM, is the one that best captures my outlook on the Semantic Web and Linked Data. The Web has connected people to information at an unprecedented rate and scale, and comparisons to the impact of Caxton’s press, however trite, are justified. For the majority of people using the Web it’s a rich place full of stories, pictures, shops and encyclopedias, but we Web technologists can see that all of those marvelous things are enabled by the use of URIs, HTTP and various machine-readable data formats.
HTTP itself is pretty simple: it tells you how you can use a specific type of URI to look up some information. It doesn’t try to tell you what that information means, but it provides plenty of clues about the provenance, format and timeliness of that information. HTTP just provides the transfer mechanism for messages between a client and a server, and it’s very good at its job. It’s very natural to assume that the information received from a URI request is in some way related to the thing identified by that URI. Once you have that mechanism, it’s obvious that if you want to assign an identifier to something and publish information relating to it to the largest number of people, then you’re going to pick an HTTP URI.
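To make that concrete, here’s a minimal sketch in Python (standard library only; the URL is a placeholder): the protocol hands back clues about the format and freshness of the payload, but says nothing about what the payload means.

```python
from urllib.request import urlopen

# Placeholder URL: any HTTP URI whose owner publishes something there.
with urlopen("http://example.org/galaxies/andromeda") as resp:
    print(resp.status)                        # e.g. 200: the request succeeded
    print(resp.headers.get("Content-Type"))   # the format of the information
    print(resp.headers.get("Last-Modified"))  # a clue about its timeliness
    print(resp.headers.get("Server"))         # a clue about its provenance
    body = resp.read()                        # the information itself; HTTP is
                                              # silent about what it means
```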
The HTTP specification uses the word “representation” to denote the relationship between a URI and the thing it identifies as in: "the information you received is a representation of the thing identified by that URI". The spec doesn't define what representation means any further than that.
The nature of that relationship and the meaning of “representation” has been the subject of a huge amount of debate in the Semantic Web community, spread over half a decade, and has resulted in a number of new terms such as “information resource” and some conventions such as “303 redirects” to resolve perceived problems. After all that debate we are left with something that isn’t technically broken but has not been universally hailed as essential to the fabric of the Web. The majority of the wider Web community are content with publishing information using URIs, simply ignoring these strange conventions and distinctions that the Semantic Web community find so important.
I’ve always been of the opinion that this debate could have been avoided by keeping the responsibilities of each component of the Web separate and clean. The separation should be: identify things using URIs, transfer information using HTTP, and encode meaning in the data formats used in the transfer. Instead, we have special interpretations of certain parts of URIs and special interpretations of certain HTTP status codes to confer special meaning on the information being transferred.
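A small sketch of that separation, assuming the rdflib library and illustrative URIs and vocabulary terms: the URI names the thing, HTTP moves the bytes, and everything the publisher wants to say lives in the payload itself.

```python
from rdflib import Graph

# Hypothetical Turtle payload, as it might be fetched from
# http://example.org/galaxies/andromeda: all of the meaning is carried
# by the data format, not by the status code or the shape of the URI.
payload = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<http://example.org/galaxies/andromeda>
    rdfs:label "Andromeda Galaxy" ;
    dct:description "A spiral galaxy about 2.5 million light years away." .
"""

g = Graph()
g.parse(data=payload, format="turtle")
for s, p, o in g:
    print(s, p, o)
```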
I think it’s time to stop blurring those responsibilities so I'm going back to basics.
- Plain old resources — I don’t find the distinction between information resources and non-information resources to be a useful one when compared with the complexity of deciding which is which, so I’m going to stop using that terminology. From now on everything that has a URI is just a plain old resource, just like it says in the HTTP spec.
- The meaning is in the message, not the protocol — I don’t think it’s useful to overload HTTP with notions of descriptions, documents and content. I think that classification is best conveyed in the body of the messages used by HTTP, not in the protocol itself. This means I won’t assume I can get a description of a resource or a copy of it by using its URI. Instead, when I use GET on a URI I am simply looking up the information that the owner of that URI desires to send. Now, it might be the case that some other information I already have says that when I look up information from a URI I should treat it as a description of the identified resource. That’s entirely fine and a good separation of concerns (the first sketch after this list shows one way that can work). I may also discover some other evidence that the received information is not a description after all but something else. That’s cool too, because any system I am using that can tell me these things should also be able to help me determine inconsistencies and discrepancies in the information I am collecting.
- Use the protocol to manage and route information — If I want to allow other people to change the information that can be looked up using any of my URIs, then I’ll enable support for the POST or PUT methods. I can even allow the DELETE method to prevent any future information being transmitted to users (the second sketch after this list illustrates both). Because I’m leaving the interpretation of the information to the message body, I don’t need to worry whether someone is updating the content or the description of the resource. I also have the full range of redirects and other HTTP machinery available to help people find the information I have that’s related to my resource.
- Place trust in the information returned by a URI — the information received when a user accesses a URI has a special position because only the owner of that URI gets to control what is sent. That’s useful if I want to look up what the owner thinks is important in relation to their URI, and I should place more trust in that information than in information from other sources.
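Here is the first sketch, assuming rdflib again, with placeholder URIs and FOAF’s primaryTopic as just one possible convention for saying “this payload is about that thing”: whether I treat what I fetched as a description is decided by statements in the body, not by the protocol.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF

doc = URIRef("http://example.org/doc/andromeda")         # the URI I looked up
thing = URIRef("http://example.org/galaxies/andromeda")  # the thing I care about

g = Graph()
g.parse("http://example.org/doc/andromeda", format="turtle")  # fetch and parse

# One possible convention: the payload itself says what it is about.
if (doc, FOAF.primaryTopic, thing) in g:
    print("The payload tells me it describes the galaxy.")
else:
    print("No such statement; I take the payload on its own terms.")
```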
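And the second sketch, standard library only, with a placeholder URI I’m pretending to own (a real server would also want authentication and has to allow these methods): the protocol manages what information a URI returns, while the meaning stays in the Turtle body.

```python
from urllib.request import Request, urlopen

uri = "http://example.org/galaxies/andromeda"  # a URI I control (placeholder)

new_information = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<http://example.org/galaxies/andromeda> rdfs:label "Andromeda Galaxy (M31)" .
""".encode("utf-8")

# PUT: replace the information that future lookups of this URI will return.
put = Request(uri, data=new_information, method="PUT",
              headers={"Content-Type": "text/turtle"})
with urlopen(put) as resp:
    print(resp.status)

# DELETE: stop sending any information for this URI at all.
delete = Request(uri, method="DELETE")
with urlopen(delete) as resp:
    print(resp.status)
```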
That's it really. No messing around with special status codes and redirects based on hard-to-pin-down concepts. No special types of URI that differ in meaning depending on what software you use. Just standard HTTP. When someone enters a URI in their browser or application, they get useful related information back. Moreover, the URI in their browser's address bar is one they can use to refer to that resource in any context. They can bookmark it, send it in an email, use it in a SPARQL query or even write some of their own RDF with it. I like that kind of simplicity.