Is the Semantic Web Destined to be a Shadow?

Current practice in the Semantic Web community is leading to the creation of a shadow web that is becoming disconnected from the web of documents. This fracturing is being caused by the W3C's decision to restrict the types of resources that can be addressed directly with HTTP.

Rob McCool points out in his Rethinking the Semantic Web article that much of the W3C's Semantic Web activity goes to promoting the creation of separate RDF documents, creating a "shadow web" largely invisible and inaccessible to the bulk of users. Because few humans traverse and explore this shadow web, and because the documents require significant technical understanding of the RDF model, there is no significant ability to validate or affirm the relevance of the metadata being expressed. In some respects the nascent web of data is experiencing a golden age where all data is created with the best intentions. However, my prediction is that were this web of data to become visible in a major search engine, it would become another vector for spam to attack search results. Today's spammers are far more evolved after a long arms race with the Web search engines, and the Semantic Web community is vanishingly small compared to that of the wider Web. It is an open question whether this shadow web could ever survive such a hostile environment. In the W3C's classic layer cake diagram of the Semantic Web, the topmost layer is "trust", and its positioning indicates that it will be the last component to be built, once all the mechanics are in place.

My belief is that trust must be considered far earlier and that it largely comes from usage and the wisdom of the crowds, not from technology. Trust is a social problem and the best solution is one that involves people making informed judgements on the metadata they encounter. To make an effective evaluation they need to be able to view and explore metadata with as few barriers as possible. In practice this means that the web of data needs to be as accessible and visible as the web of documents is today, and the two need to interweave transparently. A separate, dry web of data is unlikely to attract meaningful attention, whereas one that is a full part of the visible and interactive web that the majority of the population enjoys is far more likely to undergo scrutiny and analysis. This means that HTML and RDF need to be much more connected than many people expect. In fact I think that the two should never be separate: it's not enough that you can publish RDF documents; you need to publish visible, browseable and engaging RDF that is meaningful to people. Tabular views are a weak substitute for a rich, readable description.

Keeping metadata visible and auditable by humans is one of the key principles of the microformats movement. Tantek described the process as one where:

Authors readily saw mistakes themselves and corrected them (because presentation matters). Readers informed authors of errors the authors missed, which were again corrected. This feedback led to an implied social pressure to be more accurate with hyperlinks thus encouraging authors to more often get it right the first time. When authors/sites abused visible hyperlinks, it was obvious to readers, who then took their precious attention somewhere else. Visible data like hyperlinks with the positive feedback loop of user/market forces encouraged accuracy and accountability. This was a stark contrast from the invisible metadata of meta keywords, which, lacking such a positive feedback loop, through the combination of gaming incentives and natural entropy, deteriorated into useless noise.

This is akin to the many eyes principle of the open source movement. Making metadata a visible and integral part of the web page was the principal motivation that led me to develop embedded RDF and is an important consideration in the design of RDFa. The importance of the existing web to the nascent Semantic Web is also underlined by the W3C's recent standardization of GRDDL, which allows pre-existing documents to be transformed into RDF.

However, there is a problem with this coexistence, and it's forced by the W3C TAG's notions of Information Resources and the httpRange-14 decision on the types of resources that can be addressed with HTTP. As I pointed out in my recent Fragmentation post, there is strong pressure towards using URIs with fragment IDs to represent "non-information resources".

The dogma that URIs without fragment identifiers must be restricted to document-like resources pushes people into using URIs with fragment identifiers to denote things that aren't documents. However, there's a big problem with this: in reality the fragment identifier is associated with representations of resources, not the resources themselves. The fragment identifies a portion of a representation obtained from a URI, and its meaning changes depending on the type of representation.
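To make that concrete, here is a minimal sketch of how one fragment can point at two different things depending on the representation it is resolved against. Everything in it is an assumption made for illustration: the fragment "alice", the two tiny documents, and the helper names resolve_in_html and resolve_in_rdf are all hypothetical, and the lookups are deliberately naive rather than a full implementation of either media type's fragment rules.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

FRAGMENT = "alice"  # hypothetical fragment id, chosen for illustration

# Two hypothetical representations that could be served from the same URI.
HTML_DOC = '<html><body><h1>People</h1><p id="alice">Alice maintains this site.</p></body></html>'

RDF_DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#alice">
    <foaf:name>Alice</foaf:name>
  </foaf:Person>
</rdf:RDF>"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class IdFinder(HTMLParser):
    """Records the tag name of the first element whose id matches the fragment."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.tag = None

    def handle_starttag(self, tag, attrs):
        if self.tag is None and dict(attrs).get("id") == self.fragment:
            self.tag = tag


def resolve_in_html(doc, fragment):
    """In HTML the fragment selects a piece of the document: an element."""
    finder = IdFinder(fragment)
    finder.feed(doc)
    return finder.tag


def resolve_in_rdf(doc, fragment):
    """In RDF/XML the same fragment names whatever the graph describes."""
    root = ET.fromstring(doc)
    for node in root:
        if node.get(f"{{{RDF_NS}}}about") == "#" + fragment:
            return node.tag
    return None


html_meaning = resolve_in_html(HTML_DOC, FRAGMENT)
rdf_meaning = resolve_in_rdf(RDF_DOC, FRAGMENT)
print(html_meaning)
print(rdf_meaning)
```

Against the HTML document the fragment lands on a paragraph element; against the RDF document it lands on a node typed as a foaf:Person. Same characters after the hash, two quite different referents.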

In the Web Architecture it is impossible to get a representation of a URI that includes a fragment identifier, so you have to get a representation of the URI without the fragment and hope it contains information about the resource you're looking for. However, the Web Architecture also forbids you from serving up both HTML and RDF documents at that URI that refer to the same fragment id. You can have a machine readable RDF version or a human readable HTML version but not both at the same time. Ever. Unless you really did mean to refer to an HTML document fragment. If you're a mere mortal reader, rather than an RDF guru, then you can't find out what the URI denotes because of that single hash character!
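A small sketch of why the hash URI itself can never be fetched: the client splits the fragment off before making the request, so the server only ever sees the document URI. The URI below is hypothetical and the helper name request_target is my own; urldefrag is the standard library function that performs the split.

```python
from urllib.parse import urldefrag

# Hypothetical hash URI meant to denote a person rather than a document.
thing_uri = "http://example.org/people#alice"


def request_target(uri):
    """Return (URI actually requested over HTTP, fragment kept by the client)."""
    url, fragment = urldefrag(uri)
    return url, fragment


requested, kept = request_target(thing_uri)
print(requested)  # the only URI the server is ever asked for
print(kept)       # stays on the client, to be interpreted against the response
```

Whatever comes back from the requested URI, HTML or RDF, is the only context in which the retained fragment can be interpreted.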

The inevitable consequence of this dogma is the statement I opened with: current practice is leading to the creation of a shadow web that is becoming disconnected from the web of documents. Pushing the web of data further away from people is very dangerous, with far-reaching consequences for the success (or not) of the Semantic Web, especially when the spammers get involved.

