Mar 04 2009

Tom Ilube Explains the Semantic Web at Davos

Published by Ian Davis under Random Stuff

A great presentation by Tom Ilube, explaining the Semantic Web at the Davos economic forum this year. Succinct, articulate and pitched at just the right level.

Mar 02 2009

The Semantic Web Acid Test

Published by Ian Davis under Random Stuff

Tom Heath writes a cracking post on the current attempts by a few people to brand web applications that happen to perform text analysis as “Semantic Web”. For me, this nails it:

I certainly notice plenty of unjustified attempts at present to co-opt the term Semantic Web, now that it’s no longer a dirty word, and drive it off down some dodgy alleyway. Some of these products, services or companies may be applications or services that use some semantic technology and are delivered over the Web, but that doesn’t make them Semantic Web applications, services or companies. Anything claiming the Semantic Web label needs to get its hands dirty with Linked Data somewhere along the way. That’s just how it is.

Tom’s right. These attempts to label some pretty run-of-the-mill web applications as Semantic Web suggest to me that the marketers see the Semantic Web meme as carrying some useful currency. The problem they face is that the Semantic Web has some well-defined principles that can be used as tests. Here’s the first test: if you see one of these applications, find one of its pages describing something that’s useful to you (e.g. a place or a person) and ask yourself “what’s the URI of the thing this page is describing?”.

Apr 09 2008

Identity Theft: It’s Not Your Problem

Published by Ian Davis under Uncategorized

I spotted this today: a group of people upset by the ease with which their personal information can be accessed. This information was already available to the public but distributed across many locations and physical formats:

“Who knew it was going to get posted on the Web? It’s shocking,” said one House Democratic chief of staff, who requested anonymity to discuss her personal finances. “Now that anybody can look it up on the Web, I don’t know if I like it anymore.”

Her forms for 2006, which were filed last spring, included her home address and 32 pages of detailed statements about bank accounts under the name of her husband and daughter. That prompted her to raise concerns about identity theft at a chiefs of staff meeting in March.

These people are upset but their anger is misdirected. The problem isn’t that this is private information; after all, these are government employees being held to account. Nor is the Web to blame for making this information trivial to access. The blame lies squarely with how easily access to this information can be used to commit fraud. The reaction, though, is a startling illustration of how the banking industry has subtly shifted responsibility for financial security from itself to its customers. They use the phrase “identity theft” like it’s our fault! They should be focusing on fraud prevention rather than throwing up smokescreens.

The Web opens and connects enormous quantities of data from all over the world and, as the Semantic Web gains momentum, we’ll see more connections and exposure of many more types and sources of data. Hiding this data from other people isn’t an option. Taking control of your own data and having the tools and services that let you find out where and why it’s being used is. But we should also expect every organization to accept responsibility for fraud prevention and guarding their customers against abuse. Pretending it’s someone else’s fault is just abdication of that responsibility.

Dec 05 2007

303 Asymmetry

Published by Ian Davis under Uncategorized

I mentioned a while back that I wanted to talk more about the descriptions vs representations issue. A recent message by timbl provided the impetus to do so. In that message he says the following:

Try thinking of it this way instead. You are going to serve some representation on the web, for this thing. Are those going to be (a) ABOUT the thing, or (b) the CONTENT of this thing denoted by the URI? If the former, you must use # or 303. If the later, you can serve the representations with 200 from that URI. You see, 200 means (basically) “Here comes the content of the document you asked for” and 303 means “Here is the URI of document ABOUT the thing you asked for.

That seems a pretty good characterisation of the 303 decision. Information resources can serve up representations of themselves, other resources cannot so you have to make do with descriptions. (Regular readers will already know that I don’t fully agree with this model, but it’s the accepted one in the SemWeb community)

There appears to be an asymmetry about this though and I think it’s a limitation of the model.

Suppose I have a resource “R” with URI http://example.org/R. If it is an “Information Resource” then I can arrange things so that a GET request for its text/html representation responds with a 200 and the HTML in the body of the response. I could also arrange for a request for its application/rdf+xml representation to respond with a 303 status and the URI of another information resource “RDESC” (e.g. http://example.org/RDESC). In this example the 303 response means that “R” cannot be represented as RDF, but there’s an alternative RDF document that is a description of R. The user can then re-issue the request on http://example.org/RDESC to obtain that description.

Now, I can arrange for http://example.org/RDESC to return an RDF representation in a 200 response. But, here’s the asymmetry. How can I allow the user to obtain a description of RDESC? The representation I send back is the content of RDESC, not its description. I can’t use the media type to distinguish the type of request any more.
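To make the asymmetry concrete, here is a minimal Python sketch of the arrangement described above. It is only an illustration: the respond function and its simplified handling of the Accept value (a single media type, no q-values) are assumptions of mine, and the URIs are the example ones from the text.

```python
from typing import Dict, Tuple

# Example URIs from the text above.
R = "http://example.org/R"
RDESC = "http://example.org/RDESC"


def respond(uri: str, accept: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a GET on uri asking for one media type."""
    if uri == R:
        if accept == "text/html":
            # R can be represented as HTML, so its content is served directly.
            return 200, {"Content-Type": "text/html"}
        if accept == "application/rdf+xml":
            # R cannot be represented as RDF; redirect to a description of it.
            return 303, {"Location": RDESC}
        return 406, {}
    if uri == RDESC:
        if accept == "application/rdf+xml":
            # This is the content of RDESC, not a description of RDESC.
            return 200, {"Content-Type": "application/rdf+xml"}
        return 406, {}
    return 404, {}
```

Asking for RDF at R yields a 303 pointing to RDESC, but asking for RDF at RDESC can only ever yield RDESC's own content. The media type has already been used up, so nothing in this scheme lets the client ask for a description of RDESC.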

In case you think this is an artificial distinction, it’s not. We’re dealing with it right now at Talis (and have been for a number of months). We give access to a number of RDF graphs via HTTP where naturally the user wants to obtain the content of the graph as the response to a GET. We also serve RDF descriptions of those graphs containing some configuration information. There’s no standard way to link the two things together so that users can select either the description or the content using HTTP.

Now, I happen to think that there is an interesting solution to this asymmetry. Suppose we created a new HTTP header called “resource-description” whose value was the URI of a description of the given resource. Note that it’s a description of the resource, not of any representation that is being sent as part of the response. The asymmetry goes away because this gives you a method of pointing to the description regardless of the status code and/or content negotiation going on in the request.

Things get even more interesting if you allow multiple resource-description headers: what a great way to cross link to other people’s descriptions of your resource.
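As a rough sketch of how that might look on the server side, the Python below attaches one resource-description header per description URI to any response, whatever its status code. The header is the proposal from this post, not an existing HTTP header, and the function name and the description URIs in the usage example are made up for illustration.

```python
from typing import List, Tuple

Header = Tuple[str, str]


def with_descriptions(
    status: int, headers: List[Header], description_uris: List[str]
) -> Tuple[int, List[Header]]:
    """Add one (proposed, non-standard) resource-description header per URI."""
    extra = [("resource-description", uri) for uri in description_uris]
    # Headers are kept as a list of pairs so the same name can repeat.
    return status, headers + extra


# Hypothetical usage: the content of RDESC plus links to two descriptions of it.
status, headers = with_descriptions(
    200,
    [("Content-Type", "application/rdf+xml")],
    ["http://example.org/RDESC-about", "http://example.com/notes-on-RDESC"],
)
```

Because the header describes the resource rather than the representation, the same call would work unchanged for a 200, a 303 or any other response.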

I seem to recall something similar to this being proposed a few years back but my Googling doesn’t turn anything up. Going back further takes us to Patrick Stickler’s attempts to solve the description problem using URIQA, which took the approach of introducing another verb, MGET, to obtain the description. This was almost universally disliked, but the underlying problem has remained unsolved in the meantime.

And given where my head has been for the past few weeks, I have to ask what decision would have been taken on the httpRange-14 issue if this header had already existed. Would we instead be returning 406 responses when we cannot supply a suitable representation for resources, or even a 204? Both of those could work with the header pointing to an appropriate description of the resource.

Update: it seems that timbl and I were touched by the same muse tonight: Alternative to 303 response: Description-ID: header

Nov 26 2007

Platform at SWIG-UK, Bristol

Published by Ian Davis under Uncategorized

Last Friday I gave a presentation on the Talis Platform to the SWIG meeting, kindly hosted by HP Labs Bristol. I’ve posted the slides up at our n2 developer community site. They’re not much to look at but I wrote them to be informative rather than suggestive. Although I love seeing beautifully spare presentations being given, they frustrate me when I want to go back and see what the speaker said and find a picture of a butterfly on a flower :)

Nad’s written a good summary over on his blog and managed to capture the questions I was asked at the end too, which is nice to see after the event. The whole day was brilliant with lots of chances to natter and catch up with everyone. I met lots of new people too and everyone seemed to be doing something interesting. I even managed to harangue Stuart Williams off of the W3C TAG on my recent Web architecture posts. Andy said about 50 people were attending despite there being no marketing of the event at all which is a good indicator of the rising popularity of the Semantic Web. Talis are planning to host a SWIG meeting like this in the middle of next year – hopefully we can get more people from the Midlands interested.

There were plenty of other cool presentations on the day too. Nad and Rob were blogging but couldn’t post them live for some reason so check out their sites and Nodalities too in the coming days. I particularly enjoyed Leigh’s talk on Facet, another templating framework for RDF, this time in bog-standard Java allowing the use of JSP and/or Velocity; Richard’s talk on Sindice (which I didn’t get to see at ISWC and for us ignorant Brits is pronounced “sin-dee-chee”); and Graham’s talk on image publication. All great stuff!

Nov 21 2007

Isn’t The Web Built From Links?

Published by Ian Davis under Uncategorized

If my shadow web post hasn’t convinced you then try this thought experiment:

You want to link from your webpage to Tim Berners-Lee’s URI <http://www.w3.org/People/Berners-Lee/card#i>, except you can’t because that link points to something that can never contain the #i fragment in its HTML. It can only ever link to RDF because Tim is relying on RDF’s semantics for the meaning of the #i fragment. Tough luck if you can’t read RDF or don’t want to have to learn.

Nov 21 2007

Reformulating the Web Architecture

Published by Ian Davis under Uncategorized

So, accepting that URIs with fragments are generally a broken piece of architecture for the Semantic Web and that information resources are not adding any real substance, here’s how I see the Web Architecture being reformulated for use with the Semantic Web:

  1. A hashless URI should be allowed to denote any resource whatsoever. Documents, books, people, galaxies and unicorns. There is no ambiguity here, the URI denotes a single thing. More than one URI can denote the same thing, so I can have a URI that denotes the city of London, and Danny can have a different URI that also denotes London.
  2. A representation of a resource can be obtained by issuing an HTTP GET on a URI. The representation is a sequence of bits that somehow stands in for the resource the URI denotes. Content negotiation can be used to select an appropriate format for the representation, without changing the actual resource being denoted. Perhaps my URI denoting London can respond with an HTML document containing essential facts and figures about the city, a JPEG aerial photograph, an SVG streetmap or a sound recording of the sounds encountered while in the city itself. None of these things are London, but they all can stand in for it in some limited fashion. I could retrieve them all to obtain a better sense of London itself, but I cannot actually obtain London using HTTP.
  3. URIs containing hashes are constrained in what they may denote and have an inherent ambiguity due to their reliance on the particular representation obtained. Their denotations vary depending on the URI plus a set of HTTP headers used during the request.
  4. There is no such thing as an “Information Resource”. All resources are made equal. However for many resources, the only representation available happens to be identical to the resource itself. Still, you cannot obtain the actual resource using HTTP, but you can get a copy in the form of a representation. The majority of HTML documents on the web behave in this manner, a single representation that is a copy of the resource itself.
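The London example in point 2 can be sketched in a few lines of Python. This is an illustration under assumptions, not a real server: the file names are invented, audio/mpeg is my guess at a media type for the sound recording, and the negotiate function ignores q-values and wildcards for brevity.

```python
from typing import Dict, Optional, Tuple

# Representations available at one hashless URI denoting the city of London.
# None of them is London; each stands in for it in some limited fashion.
REPRESENTATIONS: Dict[str, str] = {
    "text/html": "facts-and-figures.html",
    "image/jpeg": "aerial-photo.jpg",
    "image/svg+xml": "streetmap.svg",
    "audio/mpeg": "street-sounds.mp3",
}


def negotiate(accept_header: str) -> Optional[Tuple[str, str]]:
    """Pick the first listed media type that can be served (q-values ignored)."""
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip()
        if media_type in REPRESENTATIONS:
            return media_type, REPRESENTATIONS[media_type]
    # Nothing acceptable: a server would typically answer 406 here.
    return None
```

Whichever representation comes back, the URI and the resource it denotes stay the same. Only the stand-in changes.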

These aren’t huge changes and they’re backwards compatible with the existing web. On the other hand they greatly reduce the reliance on fragment identifiers and they encourage people to use real unambiguous URIs to refer to things other than documents, weaving the Semantic Web right into today’s Web.

For background, you might like to read my earlier posts on this subject:

Nov 21 2007

What are Information Resources Good For?

Published by Ian Davis under Uncategorized

As is probably obvious from my recent posts (e.g. Fragmentation and Is the Semantic Web Destined to be a Shadow?), I’m thinking about the TAG’s httpRange-14 decision again and a large amount of the Architecture of the World Wide Web. The more I think about it, the more I come to believe that Xiaoshu Wang’s formulation is the only one that makes any kind of architectural sense. The foundation of all these issues rests in what was once known as the URI crisis and was boiled down to the question: what kind of things can an HTTP URI identify? The TAG took this up as httpRange-14 which was resolved back in 2005 by introducing the notion of a special class of resources called “information resources”. In his essay What do HTTP URIs Identify? Tim Berners-Lee wrote:

The authors of document <http://www.w3.org/2000/10/rdf-tests/rdfcore/Manifest.rdf> certainly thought that they could use “http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest” to identify an abstract thing which is a type of software test. Now they have a choice as to what to make the server return for them when I ask for it. It returns 404 “doesn’t match anything we have available”. It can’t really, because HTTP doesn’t allow one to return a class, only a document. And if it were to return a document, then I wouldn’t be able to refer to that document without accidentally referring to the class of negative parser tests.

It seems to me that this essay contains a simple mistake which colours the whole httpRange-14 debate and resolution. It says that the URI can’t return anything because “HTTP doesn’t allow one to return a class, only a document”. That’s true, but it does allow you to return a representation of the class which is a document (or potentially an image, or spoken word audio file). HTTP can never return a resource, just representations of them, i.e. things that stand in for the resource. The final point in that quote suffers from that confusion: of course you can’t use the URI of the resource to refer to its representation. Possibly you could mint a new URI to denote it, but there is no standard vocabulary that I’m aware of that can relate a representation to its resource parameterized by the HTTP request headers and the time. There probably needs to be.

The resolution to all this was to introduce a class of resources that are basically the same as their representations: information resources. The only way to detect if you have an information resource is to GET its URI. If it responds with a 2xx response then the resource identified by that URI is an information resource. Any other response code means it might or might not be an information resource. This rule has a corollary: if you have a non-information resource then you must not respond with a 2xx response, instead you should use 303 to point to an information resource that somehow gives information about the non-information resource.
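Written out as code, the rule and its corollary are tiny. The Python below is just my paraphrase of the paragraph above, with function names of my own choosing, and it glosses over everything except the status code.

```python
def infer_from_status(status: int) -> str:
    """What a client may conclude about a resource from the status of a GET."""
    if 200 <= status < 300:
        return "information resource"
    # 303, 404 and every other code tell us nothing either way.
    return "unknown"


def permitted_status(is_information_resource: bool) -> int:
    """The corollary: non-information resources must not answer with a 2xx."""
    return 200 if is_information_resource else 303
```

Note the one-way nature of the inference: a 2xx settles the question, while any other response leaves it open.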

Information resources are defined in AWWW:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transfered in a message. In the case of this document, the message payload is the representation of this document.

It’s hard to understand what benefit the introduction of information resources brings to the Web architecture. It definitely has drawbacks. For a start it forces all non-information resources off of the web entirely – they’re not allowed to respond with 200 status codes to GET requests. It encourages the use of URIs containing fragment identifiers for non-information resources which, as I’ve pointed out, are a very broken piece of architecture and are leading to the formation of a sort of shadow web.

Information resources are also notoriously hard to define. Consider the following resources, asking of each whether all its essential characteristics can be conveyed in a message:

  • A cat – no
  • A description of a cat – yes
  • A digital photo of a cat – yes
  • A 35mm film frame containing the image of a cat – no
  • A web page about cats – I hope so!
  • A website about cats – maybe, I guess you could tar it up and serve it from http://www.allaboutcats.com
  • The DNA of a cat – probably no
  • A recording of a cat’s mew – yes, unless it’s an analogue recording in which case we have a digital approximation of the analogue recording
  • A cat’s mew – no
  • A book about cats – probably not, the book is an abstract work of which there can be multiple editions, revisions, translations, abridgments etc.
  • The class of all cats – I say yes, timbl says no. I can convey the precise definition of a class in a message as an RDF or OWL schema so that seems to satisfy the criteria
  • The members of the class of all cats – no
  • A database of cats – yes
  • A card catalogue of cats – no
  • The name of a cat – yes
  • A taxonomy of cat species – yes
  • The cat character from Shrek 2 and 3 – yes

Remind me again, what’s the point of having this distinction in types of resources….?

Nov 21 2007

Is the Semantic Web Destined to be a Shadow?

Published by Ian Davis under Uncategorized

Current practice in the Semantic Web community is leading to the creation of a shadow web that is becoming disconnected from the web of documents. This fracturing is being caused by the W3C’s decision to restrict the types of resources that can be addressed directly with HTTP.

Rob McCool points out in his Rethinking the Semantic Web article that much of the W3C’s Semantic Web activity goes to promoting the creation of separate RDF documents, creating a “shadow web” largely invisible and inaccessible to the bulk of users. Because few humans traverse and explore this shadow web, and because the documents require significant technical understanding of the RDF model, there is no significant ability to validate or affirm the relevance of the metadata being expressed. In some respects the nascent web of data is experiencing a golden age where all data is created with the best intentions. However, my prediction is that were this web of data to become visible in a major search engine it would become another vector for spam to attack search results. Today’s spammers are very much more evolved after a long arms race with the Web search engines, and the semantic web community is vanishingly small compared to that of the wider Web. It is an open question as to whether this shadow web could ever survive this hostile environment. In the W3C’s classic layer cake diagram of the Semantic Web, the topmost layer is “trust”, and its positioning indicates that it will be the last component to be built, once all the mechanics are in place.

My belief is that trust must be considered far earlier and that it largely comes from usage and the wisdom of the crowds, not from technology. Trust is a social problem and the best solution is one that involves people making informed judgements on the metadata they encounter. To make an effective evaluation they need to have the ability to view and explore metadata with as few barriers as possible. In practice this means that the web of data needs to be as accessible and visible as the web of documents is today and it needs to interweave transparently. A separate, dry, web of data is unlikely to attract meaningful attention, whereas one that is a full part of the visible and interactive web that the majority of the population enjoys is far more likely to undergo scrutiny and analysis. This means that HTML and RDF need to be much more connected than many people expect. In fact I think that the two should never be separate and it’s not enough that you can publish RDF documents, you need to publish visible, browseable and engaging RDF that is meaningful to people. Tabular views are a weak substitute for a rich, readable description.

Keeping metadata visible and auditable by humans is one of the key principles of the microformats movement. Tantek described the process as one where:

Authors readily saw mistakes themselves and corrected them (because presentation matters). Readers informed authors of errors the authors missed, which were again corrected. This feedback led to an implied social pressure to be more accurate with hyperlinks thus encouraging authors to more often get it right the first time. When authors/sites abused visible hyperlinks, it was obvious to readers, who then took their precious attention somewhere else. Visible data like hyperlinks with the positive feedback loop of user/market forces encouraged accuracy and accountability. This was a stark contrast from the invisible metadata of meta keywords, which, lacking such a positive feedback loop, through the combination of gaming incentives and natural entropy, deteriorated into useless noise.

This is akin to the many eyes principle of the open source movement. Making metadata a visible and integral part of the web page was the principal motivation that led me to develop embedded RDF and is an important consideration in the design of RDFa. The importance of the existing web to the nascent Semantic Web is also underlined by the W3C’s recent standardization of GRDDL which allows pre-existing documents to be transformed into RDF.

However, there is a problem with this coexistence and it’s forced by the W3C TAG’s notions of Information Resources and the httpRange-14 decision on the types of resources that can be addressed with HTTP. As I pointed out in my recent Fragmentation post, there is strong pressure towards using URIs with fragment IDs to represent “non-information resources”.

The dogma that URIs without fragment identifiers must be restricted to document-like resources pushes people into using URIs like http://www.w3.org/People/Berners-Lee/card#i to denote things that aren’t documents. However, there’s a big problem with this: in reality the fragment identifier is associated with representations of resources, not the resources themselves. The fragment identifies a portion of a representation obtained from a URI, and its meaning changes depending on the type of representation.

In the Web Architecture it is impossible to get a representation of http://www.w3.org/People/Berners-Lee/card#i so you have to get a representation of http://www.w3.org/People/Berners-Lee/card and hope it contains information about the resource you’re looking for. However, the Web Architecture also forbids you from serving up both HTML and RDF documents at that URI that refer to the same fragment id. You can have a machine readable RDF version or a human readable HTML version but not both at the same time. Ever. Unless, that is, you really did mean to refer to an HTML document fragment. If you’re a mere mortal reader, rather than an RDF guru, then you can’t find out what http://www.w3.org/People/Berners-Lee/card#i denotes because of that single hash character!
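The mechanics behind this are easy to see with Python's standard library. The sketch below uses urllib.parse.urldefrag to split a URI the way a client must before making a request; the wrapper function name is my own.

```python
from typing import Tuple
from urllib.parse import urldefrag


def split_for_request(uri: str) -> Tuple[str, str]:
    """Split a URI into the part sent to the server and the fragment kept client-side."""
    result = urldefrag(uri)
    return result.url, result.fragment


document_uri, fragment = split_for_request(
    "http://www.w3.org/People/Berners-Lee/card#i"
)
# document_uri is what gets requested; the fragment never reaches the server
# and is only interpreted against whatever representation comes back.
```

The server only ever sees the hashless URI, so it has no way of knowing which fragment the client cares about, and the client can only give the fragment a meaning once it knows the media type of the response.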

The inevitable consequence of this dogma is the statement I opened with: current practice is leading to the creation of a shadow web that is becoming disconnected from the web of documents. Pushing the web of data further away from people is very dangerous with far reaching consequences for the success (or not) of the Semantic Web, especially when the spammers get involved.

Nov 16 2007

Fragmentation Reprise

Published by Ian Davis under Uncategorized

I got some good responses to my recent post that poked holes in the semantic web architecture. In comments, Peter Murray asked:

Is the nub of the nub of the problem that the client may not know what kind of representation it might get back when dereferencing a URI? If the client does know the type of representation, or if asking for a particular type of representation results in an error from the server, then it can assume to know what the fragment identifiers mean. Right?

While that is true, it’s not relevant to this argument. The core of the problem is that hashed URIs are inherently ambiguous. Their meaning depends on how you access them, which is nuts. It’s as though a word had different meanings depending on whether you read it in a book or have it read out to you.

Bill de hÓra points to a similar writeup he did a month or so back (which I should have remembered and linked to when I wrote mine!). It contains this classic de hÓra line:

You really don’t want your absolute naming system for a planet to be driven by an arbitrary feature of a markup format

Danny weighed in with a long post that darted onto the representation vs description issue about which I also have a bit to say… but in another post :)
