Archive for the 'Random Stuff' Category

Jan 18 2010

Test blog post from posterous

Published by Ian Davis under Random Stuff

Hmmm, a reply

Comments Off

Jan 17 2010

The Solution!

Published by Ian Davis under Random Stuff

Comments Off

Jan 17 2010

Test blog post from posterous

Published by Ian Davis under Random Stuff

This is a simple test!

Comments Off

Dec 22 2009

New Blog URL: blog.iandavis.com

Published by Ian Davis under Random Stuff

I’ve moved this blog from iandavis.com/blog to blog.iandavis.com because I’m thinking about moving to a hosted blog system rather than self-run. I think I’ve put all the necessary redirects in but please let me know if you think I’ve missed something.

Comments Off

Oct 20 2009

More Than The Minimum

Published by Ian Davis under Random Stuff

The Linked Data design note lists four practices that lay the foundations of a web of connected data:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

These practices are now well known and implemented in hundreds of datasets. However, I think it is important to realise that these are the minimum requirements for a web of data about real-world things, a starter-kit if you like. They are not the final word on what will make up a rich web of data and there are many more things we, as data publishers, could be doing.

For example, rule 3 suggests that you provide some useful RDF when someone looks up the URI of a resource. That doesn’t mean you can’t publish more RDF about that URI at a different location. If you want to assert some additional information about http://dbpedia.org/page/Decentralization then just publish a document on your site. Crucially you don’t have to persuade dbpedia.org to add your triples to their database. There is no rule that the data found at a URI is the only relevant data about that thing – it’s just one privileged portion of the total data.

That leads nicely to another example. Rule 1 suggests that you use URIs to refer to real-world things. It says nothing about how or when you should create them. The convention so far has been to mint new URIs for things rather than try to find a pre-existing URI. That’s an acceptable practice in the bootstrapping phase, where the data is sparse, but it is saving up a big integration problem for the future. I think we should be encouraging people to reuse well-known identifiers such as those in dbpedia and geonames in preference to creating new ones.

4 responses so far

Sep 21 2009

Linked Data Spam Vectors

Published by Ian Davis under Random Stuff

One of the defining characteristics of a successful information system is its level of exploitation by spammers. Successful systems will attract those who wish to hijack that success for their own ends, most often by the use of deception. I think we can safely say that the Linked Data web is not yet successful because it has not attracted that level of spam attack. With Google and Yahoo’s adoption of RDFa I think we’ll see that change in the next twelve months. In anticipation of future attacks I thought it would be a useful exercise to try to predict some of the possible vectors for introducing spam into a Linked Data system. Some of the spam vectors are going to be traditional, rooted in the existing web. However, some are going to be new, made possible by the open world aspects and low cost integration possibilities provided by Linked Data.

Without too much effort I identified seven obvious attack vectors for spammers. I’m sure collectively we can extend this list and then examine the methods of control to ensure that they don’t become a problem for the Linked Data web. Here’s what I came up with:

False Labelling

With this vector of attack the spammer simply asserts labelling triples that promote their message. Linked data systems often display the objects of these triples when labelling resources. If the spammer targets popular subject URIs then there is a higher chance of their message appearing for users of the Linked Data system. For example:

dbpedia:London rdfs:label "Buy more Wensleydale" .
<http://danbri.org/foaf.rdf#danbri> foaf:name "Wensleydale fan" .
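
To see why this works, here is a minimal Python sketch of a naive aggregator. It assumes triples are plain (subject, predicate, object) tuples, and the `trusted` and `spam` lists and the `labels_for` helper are hypothetical names invented for illustration, not part of any real Linked Data library:

```python
# Triples are modelled as plain (subject, predicate, object) tuples.
# Both sources below are hypothetical and exist only for illustration.
trusted = [
    ("dbpedia:London", "rdfs:label", "London"),
]
spam = [
    ("dbpedia:London", "rdfs:label", "Buy more Wensleydale"),
]

# Properties this sketch treats as "labelling" properties (an assumption).
LABEL_PROPERTIES = ("rdfs:label", "foaf:name")


def labels_for(subject, triples):
    """Return every label asserted for a subject, whoever asserted it."""
    return [o for s, p, o in triples if s == subject and p in LABEL_PROPERTIES]


# A naive aggregator pools everything it crawls without tracking the source.
merged = trusted + spam
candidate_labels = labels_for("dbpedia:London", merged)
print(candidate_labels)
```

Once the triples are pooled, nothing distinguishes the spammer's label from the genuine one, so any display code that picks a label from this list may pick the spam.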

Misdirection

Here the attacker asserts triples using properties that are commonly used to provide links to human-readable content. In the attack, the triple objects are resources that contain the attacker’s message. Systems that use these properties may inadvertently display links to the spammer’s site and content:

dbpedia:London rdfs:seeAlso <http://example.com/buycheese> .

dbpedia:Tim_Berners-Lee foaf:isPrimaryTopicOf <http://example.com/buycheese> .

<http://sws.geonames.org/3333196/> mo:wikipedia <http://example.com/buycheese> .

Schema Pollution

In this attack all of the instance data is innocuous but some of the properties used in the data are labelled with the spammer’s message. When rendering data for human use, many linked data systems will look for schema information to label unknown predicates. This attack causes those systems to display the spammer’s message:

ex:thing dc:title "New study finds that mice can learn to sing." ;
         a foaf:Document ;
         dc:subject "mouse behaviour" ;
         ex:prop "Journal of mouse psychology" .

ex:prop a rdf:Property ;
        rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .

This attack can be combined with False Labelling, attempting to inject a message into a commonly used schema:

dc:title rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .
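
A rough Python sketch of the rendering step that this attack targets, assuming triples are plain tuples. The `predicate_label` and `render` helpers are hypothetical and stand in for whatever display logic a real system uses:

```python
# Instance data is harmless; only the schema carries the spam message.
instance_data = [
    ("ex:thing", "dc:subject", "mouse behaviour"),
    ("ex:thing", "ex:prop", "Journal of mouse psychology"),
]
schema_data = [
    ("ex:prop", "rdfs:label", "Lowest Wensleydale prices at bargaincheeseshop.com"),
]


def predicate_label(predicate, schema):
    """Look up a human-readable label for a predicate in the schema triples."""
    for s, p, o in schema:
        if s == predicate and p == "rdfs:label":
            return o
    return predicate  # no label found: fall back to the raw predicate name


def render(subject, data, schema):
    """Render each property of a subject as a 'label: value' line."""
    return [f"{predicate_label(p, schema)}: {o}" for s, p, o in data if s == subject]


rendered = render("ex:thing", instance_data, schema_data)
for line in rendered:
    print(line)
```

The value "Journal of mouse psychology" is displayed under the spammer's chosen label even though no instance triple contains the message.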

Identity Assumption

This vector relies on the common practice of minting URIs in one URI space and using owl:sameAs to connect the resource to identical resources in other URI spaces. The attacker simply describes a resource that conveys their message and then uses owl:sameAs to make it identical to popular resources. Most Linked Data systems recognise owl:sameAs and aggregate all triples about any subjects declared to be identical.

ex:thing dc:title "Wensleydale: the mature, smooth cheese you will love." ;
         owl:sameAs dbpedia:The_Beatles ;
         owl:sameAs dbpedia:Lady_Gaga ;
         owl:sameAs dbpedia:True_Blood ;
         owl:sameAs dbpedia:Harry_Potter .
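
The aggregation step can be sketched in Python with a small union-find over owl:sameAs links. This is only an illustration under simplifying assumptions (triples as tuples, no reasoning beyond owl:sameAs), and `find`, `smush` and `describe` are hypothetical helper names:

```python
from collections import defaultdict

triples = [
    ("ex:thing", "dc:title", "Wensleydale: the mature, smooth cheese you will love."),
    ("ex:thing", "owl:sameAs", "dbpedia:The_Beatles"),
    ("ex:thing", "owl:sameAs", "dbpedia:Lady_Gaga"),
    ("dbpedia:The_Beatles", "rdfs:label", "The Beatles"),
]


def find(parent, x):
    """Return the representative of x's identity set (with path compression)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def smush(triples):
    """Merge all subjects linked by owl:sameAs and pool their properties."""
    parent = {}
    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(parent, s)] = find(parent, o)
    merged = defaultdict(set)
    for s, p, o in triples:
        if p != "owl:sameAs":
            merged[find(parent, s)].add((p, o))
    return parent, merged


def describe(subject, triples):
    """Return every (predicate, object) pair known about a subject after merging."""
    parent, merged = smush(triples)
    return merged[find(parent, subject)]


print(describe("dbpedia:The_Beatles", triples))
```

After merging, a lookup on either popular resource returns the spammer's title alongside any legitimate properties.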

Bait and Switch

In this vector, the spammer uses content negotiation to provide enticing linked data to machines and spam messages to humans. When a Linked Data system fetches a URI it indicates that it requires machine-readable data by sending an appropriate HTTP Accept header. Web browsers under the control of a human will send a different value for the header, so servers can distinguish machines from humans and send different information. The spammer can configure their server to send innocuous Linked Data to machines from URIs that, when visited by humans, display the spammer’s message. (See my earlier post “Is the semantic web destined to be a shadow?” for some of the consequences of this separation of machine and human content.)
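
A minimal sketch of the server-side switch, written in Python. The `respond` function is hypothetical, it only recognises `text/turtle`, and it ignores q-values, so it is a simplification of real content negotiation rather than a faithful implementation:

```python
def respond(accept_header):
    """Hypothetical spammer's handler: pick a response body from the Accept header.

    Returns a (content_type, body) pair. Only text/turtle is treated as a
    machine request here; a real server would handle more RDF media types.
    """
    requested = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if "text/turtle" in requested:
        # Machines get harmless-looking data.
        return "text/turtle", 'ex:thing dc:title "An innocuous title" .'
    # Everything else (typically a browser) gets the spam page.
    return "text/html", "<p>Lowest Wensleydale prices at bargaincheeseshop.com</p>"


crawler = respond("text/turtle")
browser = respond("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
print(crawler[0], browser[0])
```

A crawler that only ever requests Turtle has no way of seeing what a human visitor to the same URI will be shown.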

Misattribution

Under this attack, the spammer attributes their message to someone they hope the recipient will trust. Linked Data systems may ingest this data and display the quotation with the source, inadvertently misleading their users:

ex:1 a bibo:Quote ;
   bibo:content "I always buy Wensleydale from bargaincheeseshop.com and so should you" ;
   dc:creator "Sergey Brin" .

Data URI Embedding

In this attack vector the data itself is innocuous but the URIs used by the attacker use the data: scheme to embed the spam message. If these URIs are displayed to the user of a Linked Data system then they may click on them and trigger the message display.

dbpedia:London rdfs:seeAlso
        <data:text/html;charset=utf-8;base64,PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4=> .
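
The payload hidden in that URI can be inspected with a few lines of Python. The `decode_data_uri` helper is a hypothetical convenience written for this sketch, and it assumes the payload is UTF-8 text:

```python
import base64
import urllib.parse

uri = (
    "data:text/html;charset=utf-8;base64,"
    "PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+"
    "bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4="
)


def decode_data_uri(uri):
    """Return the text carried by a data: URI (assumes a UTF-8 payload)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:"):
        raise ValueError("not a data: URI")
    if header.endswith(";base64"):
        return base64.b64decode(payload).decode("utf-8")
    # Non-base64 data: URIs carry percent-encoded text.
    return urllib.parse.unquote(payload)


decoded = decode_data_uri(uri)
print(decoded)
```

A system that decodes data: URIs before displaying or following them at least has the chance to inspect what they contain.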

Conclusion

I think there might be an attack vector using reification of the spam triples – but I’m not sure it’s really different to any of those listed above. Also, I covered owl:sameAs as a vector but OWL provides many other ways to infer identity between resources, such as owl:FunctionalProperty, owl:InverseFunctionalProperty etc.

Most of these attack vectors can be countered through a whitelist provenance system, but such systems are not easy to scale. Also, some of the attacks may be difficult to detect if the vast majority of the data from a spammer is actually useful. One particular property of RDF, that duplicate triples can be ignored, makes it easy to bury spam inside billions of legitimate triples: simply take a copy of dbpedia and add a few spam triples. A casual inspection of the dataset will more than likely just see the dbpedia triples, but a Linked Data system that already has those triples will ignore them and just add the spam triples.
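
The duplicate-triple effect is easy to illustrate if a graph is treated as a Python set of tuples. The data below is made up for the example, and a handful of triples stands in for the billions in a real dataset:

```python
# What the consuming system already holds (a stand-in for a copy of dbpedia).
existing = {
    ("dbpedia:London", "rdfs:label", "London"),
    ("dbpedia:Paris", "rdfs:label", "Paris"),
    ("dbpedia:Berlin", "rdfs:label", "Berlin"),
}

# The spammer republishes the same data with one extra triple buried in it.
spam_triple = ("dbpedia:London", "rdfs:seeAlso", "http://example.com/buycheese")
incoming = set(existing) | {spam_triple}

# Merging is a set union, so the duplicates vanish and only the spam is new.
new_triples = incoming - existing
merged = existing | incoming
print(new_triples)
```

Looking at the difference between what was ingested and what was already known, rather than at the published dataset as a whole, is one way such additions might be surfaced for review.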

6 responses so far

Sep 17 2009

New SVG Version of FOAF Logo

Published by Ian Davis under Random Stuff

I needed a better SVG version of the FOAF logo, so I made one. I added it to the main FOAF logo page with a few notes on the colours. Like the other images there, this is gifted to the public domain; please use it widely.

I also uploaded this image to wikimedia commons

2 responses so far

Sep 16 2009

Being Structured and Having Semantics is not Enough

Published by Ian Davis under Random Stuff

Jonathan Rochkind describes a very typical decision sequence when working with MARC data:

Frequently the answer to “How do I get this piece of data I want” is along the lines of: “Well, it’ll be in this field, UNLESS this other field is X, in which case it’ll be in field Y, UNLESS field Y is being used for Z to try to and figure out if Z look at fixed fields a, b, and c, the different combinations of all three of which determine that, but there’s no guarantee they’re filled out correct. Oh, and that’s assuming it’s a post-1972 record, in older records they did things entirely differently and put the data over in field N. Oh, and ALL of that is assuming this is AACR2 data, the corpus also includes Rare Books and Manuscripts data, and those guys do things entirely differently, although it’s still in MARC, you’ve got to look in this OTHER field for it. First check fixed field q to see if it’s RBM data, and hope fixed field q is right. Oh, and don’t forget to check if it’s encoded in UTF-8 or MARC-8 by checking this other fixed field, which we know is wrong most of the time.”

For non-librarians, MARC is a structured data format whose fields have well-defined and rather precise semantics. The missing pieces are that the data structure is not self-describing and there is no strategy for discovering the rules (apart from a human reading the documentation and then encoding the rules in code).

2 responses so far

May 12 2009

Microdata Experiment

Published by Ian Davis under Random Stuff

I read the new HTML5 microdata proposal tonight and thought I’d see what it would take to convert my existing homepage, which is currently marked up using eRDF. The result is here and it was surprisingly painless to do the conversion. You can try it out using this demo service. The spec is still changing so I don’t know how long my experiment will remain valid (it changed from using property to itemprop attributes while I was converting my HTML!)

2 responses so far

Mar 04 2009

Tom Ilube Explains the Semantic Web at Davos

Published by Ian Davis under Random Stuff

A great presentation by Tom Ilube, explaining the Semantic Web at the Davos economic forum this year. Succinct, articulate and pitched at just the right level.

Comments Off
