Linked Data Spam Vectors

21 September 2009 by Ian Davis

One of the defining characteristics of a successful information system is its level of exploitation by spammers. Successful systems will attract those who wish to hijack that success for their own ends, most often by the use of deception. I think we can safely say that the Linked Data web is not yet successful because it has not attracted that level of spam attack. With Google and Yahoo’s adoption of RDFa I think we’ll see that change in the next twelve months. In anticipation of future attacks I thought it would be a useful exercise to try to predict some of the possible vectors for introducing spam into a Linked Data system. Some of the spam vectors are going to be traditional, rooted in the existing web. However, some are going to be new, made possible by the open-world aspects and low-cost integration possibilities provided by Linked Data.

Without too much effort I identified seven obvious attack vectors for spammers. I’m sure collectively we can extend this list and then examine the methods of control to ensure that they don’t become a problem for the Linked Data web. Here’s what I came up with:

False Labelling

With this vector of attack the spammer simply asserts labelling triples that promote their message. Linked data systems often display the objects of these triples when labelling resources. If the spammer targets popular subject URIs then there is a higher chance of their message appearing for users of the Linked Data system. For example:

dbpedia:London rdfs:label "Buy more Wensleydale" .
<http://danbri.org/foaf.rdf#danbri> foaf:name "Wensleydale fan" .
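
One possible defence is to accept labels only from documents served by the same host as the subject URI. The following is a minimal sketch of that idea, not a complete provenance system. It assumes the consuming system records the source document of every triple; the quad layout, the source URLs and the function name authoritative_labels are all hypothetical.

```python
from urllib.parse import urlparse

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
LABEL_PREDICATES = {RDFS_LABEL, FOAF_NAME}

# Hypothetical quads: (subject, predicate, object, source document URL).
QUADS = [
    ("http://dbpedia.org/resource/London", RDFS_LABEL, "London",
     "http://dbpedia.org/data/London.rdf"),
    ("http://dbpedia.org/resource/London", RDFS_LABEL, "Buy more Wensleydale",
     "http://example.com/spam.rdf"),
]


def authoritative_labels(subject, quads):
    """Return labels for `subject` asserted by documents on the subject's own host."""
    subject_host = urlparse(subject).hostname
    return [
        obj
        for s, p, obj, source in quads
        if s == subject
        and p in LABEL_PREDICATES
        and urlparse(source).hostname == subject_host
    ]


labels = authoritative_labels("http://dbpedia.org/resource/London", QUADS)
```

Host matching is a crude test of authority. It would reject legitimate third-party labels and would do nothing against spam hosted on the same domain as the target, so treat it as a starting point only.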

Misdirection

Here the attacker asserts triples using properties that are commonly used to provide links to human-readable content. In the attack, the triple objects are resources that contain the attacker’s message. Systems that use these properties may inadvertently display links to the spammer’s site and content:

dbpedia:London rdfs:seeAlso <http://example.com/buycheese> .

dbpedia:Tim_Berners-Lee foaf:isPrimaryTopicOf <http://example.com/buycheese> .

<http://sws.geonames.org/3333196/> mo:wikipedia <http://example.com/buycheese> .

Schema Pollution

In this attack all of the instance data is innocuous but some of the properties used in the data are labelled with the spammer’s message. When rendering data for human use, many linked data systems will look for schema information to label unknown predicates. This attack causes those systems to display the spammer’s message:

ex:thing dc:title "New study finds that mice can learn to sing." ;
         a foaf:Document ;
         dc:subject "mouse behaviour" ;
         ex:prop "Journal of mouse psychology" .
         
ex:prop a rdf:Property ;
        rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .

This attack can be combined with False Labelling, attempting to inject a message into a commonly used schema:

dc:title rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .
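
A display layer could reduce its exposure to this attack by taking predicate labels only from a locally curated table and otherwise falling back to the local name of the predicate URI. This is a rough sketch under that assumption; the table TRUSTED_SCHEMA_LABELS and the function predicate_label are hypothetical names.

```python
# Hypothetical, locally curated labels for well-known predicates.
TRUSTED_SCHEMA_LABELS = {
    "http://purl.org/dc/elements/1.1/title": "Title",
    "http://purl.org/dc/elements/1.1/subject": "Subject",
}


def predicate_label(uri, trusted=TRUSTED_SCHEMA_LABELS):
    """Label a predicate without consulting schema data from unknown sources."""
    if uri in trusted:
        return trusted[uri]
    # Fall back to the part of the URI after the last '#' or '/'.
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:] or uri


known = predicate_label("http://purl.org/dc/elements/1.1/title")
unknown = predicate_label("http://example.com/prop")
```

The unknown predicate is shown as "prop" rather than with whatever rdfs:label the publisher chose to attach to it. The cost is a less friendly display for legitimate but unfamiliar schemas.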

Identity Assumption

This vector relies on the common practice of minting URIs in one URI space and using owl:sameAs to connect the resource to identical resources in other URI spaces. The attacker simply describes a resource that conveys their message and then uses owl:sameAs to make it identical to popular resources. Most Linked Data systems recognise owl:sameAs and aggregate all triples about any subjects declared to be identical.

ex:thing dc:title "Wensleydale: the mature, smooth cheese you will love." ;
         owl:sameAs dbpedia:The_Beatles ;
         owl:sameAs dbpedia:Lady_Gaga ;
         owl:sameAs dbpedia:True_Blood ;
         owl:sameAs dbpedia:Harry_Potter .
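
One conceivable mitigation is to merge two resources only when a document on each resource's own host asserts the link. The sketch below assumes the system tracks the source document of each triple; the quads, the source URLs and the function endorsed_same_as are hypothetical, and the rule is far stricter than what most Linked Data systems do today.

```python
from urllib.parse import urlparse

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def host(uri):
    return urlparse(uri).hostname


def endorsed_same_as(quads):
    """Return owl:sameAs pairs asserted by documents on the hosts of both ends."""
    endorsers = {}
    for s, p, o, source in quads:
        if p == SAME_AS:
            # Record which hosts have published this (unordered) pair.
            endorsers.setdefault(frozenset((s, o)), set()).add(host(source))
    return {
        pair
        for pair, hosts in endorsers.items()
        if {host(uri) for uri in pair} <= hosts
    }


# Hypothetical data: one link published from both sides, one spam link.
QUADS = [
    ("http://dbpedia.org/resource/London", SAME_AS,
     "http://sws.geonames.org/2643743/", "http://dbpedia.org/data/London.rdf"),
    ("http://sws.geonames.org/2643743/", SAME_AS,
     "http://dbpedia.org/resource/London", "http://sws.geonames.org/2643743/about.rdf"),
    ("http://example.com/thing", SAME_AS,
     "http://dbpedia.org/resource/The_Beatles", "http://example.com/spam.rdf"),
]

accepted = endorsed_same_as(QUADS)
```

Only the link published from both sides survives. The spammer's one-sided claim about The Beatles is dropped, but so would be the many legitimate links that are only ever published by one party.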

Bait and Switch

In this vector, the spammer uses content negotiation to provide enticing linked data to machines and spam messages to humans. When a Linked Data system fetches a URI it indicates that it requires machine-readable data by sending an appropriate HTTP Accept header. Web browsers under the control of a human will send a different value for the header, so servers can distinguish machines from humans and send different information. The spammer can configure their server to send innocuous Linked Data to machines while serving the spammer’s message to humans who visit the same URI. (See my earlier post “Is the semantic web destined to be a shadow?” for some of the consequences of this separation of machine/human content.)
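
To make the mechanism concrete, here is a toy model of such a server together with one naive check a crawler might run: fetch both representations and flag the URI if none of the literals in the RDF view appear in the HTML view. Everything here is an assumption made for illustration, including the function names cloaking_server and looks_cloaked and the regular expression used to pull literals out of the Turtle.

```python
import re


def cloaking_server(accept_header):
    """Toy server: choose a response body from the Accept header alone."""
    wants_rdf = any(
        media_type in accept_header
        for media_type in ("text/turtle", "application/rdf+xml")
    )
    if wants_rdf:
        return ('<http://example.com/thing> '
                '<http://purl.org/dc/elements/1.1/title> "Singing mice study" .')
    return "<html><body><h1>Lowest Wensleydale prices</h1></body></html>"


def looks_cloaked(fetch):
    """Flag a resource when no literal from its RDF view appears in its HTML view."""
    rdf_view = fetch("text/turtle")
    html_view = fetch("text/html")
    literals = re.findall(r'"([^"]*)"', rdf_view)
    return bool(literals) and not any(lit in html_view for lit in literals)


suspicious = looks_cloaked(cloaking_server)
```

A real crawler would pass a function that issues two HTTP requests with different Accept headers. The literal-overlap heuristic is easy to defeat and will misfire on honest sites whose HTML is worded differently from their data, so it is at best a signal for further review.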

Misattribution

Under this attack, the spammer attributes their message to someone they hope the recipient will trust. Linked Data systems may ingest this data and display the quotation with its claimed source, inadvertently misleading their users:

ex:1 a bibo:Quote ;
   bibo:content "I always buy Wensleydale from bargaincheeseshop.com and so should you" ;
   dc:creator "Sergey Brin" .

Data URI Embedding

In this attack vector the data itself is innocuous but the URIs used by the attacker use the data: scheme to embed the spam message. If these URIs are displayed to the user of a Linked Data system then the user may click on them and trigger the message display:

dbpedia:London rdfs:seeAlso 
        <data:text/html;charset=utf-8;base64,PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4=> .
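
A simple guard is to render as links only those URIs whose scheme is on an allow-list, and to decode data: URIs solely for inspection. The sketch below assumes an allow-list of http and https; the names ALLOWED_SCHEMES, is_displayable_link and decode_data_uri are hypothetical.

```python
import base64
from urllib.parse import unquote_to_bytes, urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_displayable_link(uri):
    """Only URIs with an allow-listed scheme are rendered as clickable links."""
    return urlparse(uri).scheme.lower() in ALLOWED_SCHEMES


def decode_data_uri(uri):
    """Decode the payload of a data: URI for inspection, never for rendering."""
    header, _, payload = uri[len("data:"):].partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    return unquote_to_bytes(payload)


SPAM_URI = (
    "data:text/html;charset=utf-8;base64,"
    "PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+"
    "bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4="
)

clickable = is_displayable_link(SPAM_URI)
payload = decode_data_uri(SPAM_URI)
```

Decoding the payload of the example above reveals an HTML anchor pointing at http://example.com/buycheese, which is exactly what the user would see after clicking.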

Conclusion

I think there might be an attack vector using reification of the spam triples, but I’m not sure it’s really different to any of those listed above. Also, I covered owl:sameAs as a vector but OWL provides many other ways to infer identity between resources, such as owl:FunctionalProperty, owl:InverseFunctionalProperty etc.

Most of these attack vectors can be countered through a whitelist provenance system, but such systems are not easy to scale. Also, some of the attacks may be difficult to detect if the vast majority of the data from a spammer is actually useful. One particular property of RDF, that duplicate triples can be ignored, makes it easy to bury spam inside billions of legitimate triples: simply take a copy of dbpedia and add a few spam triples. A casual inspection of the dataset will more than likely just see the dbpedia triples, but a Linked Data system that already has those triples will ignore them and just add the spam triples.
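
The set semantics that hide the spam can also be used to expose it. As a minimal sketch, assuming triples are held as plain tuples and the store's existing contents are available for comparison, a system could compute exactly which triples an incoming dataset would add and queue only those for review. The function novel_triples and the sample data are hypothetical.

```python
def novel_triples(incoming, known):
    """Return the triples in `incoming` that the store does not already hold."""
    return set(incoming) - set(known)


known = {
    ("dbpedia:London", "rdfs:label", "London"),
    ("dbpedia:Paris", "rdfs:label", "Paris"),
}
# A copy of the known data with a single spam triple mixed in.
incoming = known | {("dbpedia:London", "rdfs:label", "Buy more Wensleydale")}

to_review = novel_triples(incoming, known)
```

Here to_review contains only the spam triple. At the scale of billions of triples the comparison would need an indexed store rather than in-memory sets, and the review step itself remains the hard part.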

10 thoughts on “Linked Data Spam Vectors”

  1. Tweets that mention Internet Alchemy » Linked Data Spam Vectors -- Topsy.com says:

    [...] This post was mentioned on Twitter by jansc, infopeep and Andrew Plumb. Andrew Plumb said: Nice summary of current spam morphology by @iand: Linked Data Spam Vectors http://bit.ly/cZfHS #fb [...]

  2. Michael Hausenblas (mhausenblas) 's status on Monday, 21-Sep-09 13:59:55 UTC - Identi.ca says:

    [...] http://iandavis.com/blog/2009/09/linked-data-spam-vectors #linkeddata #spam [...]

  3. Linked Data | Healthcare Semantic Architectures says:

    [...] Data Spam Vectors http://iandavis.com/blog/2009/09/linked-data-spam-vectors Categories: Linked Data, RDF, Semantic Web Tags: Linked Data Comments (0) Trackbacks (0) [...]

  5. Eric Hellman says:

    Perhaps you’ve finally identified the business model for linked data providers- data for free, spam filtering as for-fee service.

  6. Ian Davis says:

    Eric, maybe :) There is a reason why Talis segregates data into individually access-controlled stores…

  7. Dan Brickley says:

    Funny you should post this, I’ve just come back from dinner with Tom Baker of Dublin Core … where I was saying (re issue of trusting diverse namespaces) that we should enumerate a list of all the attacks we can think of. This looks like a good start.

    I think we will at some point see some OWL attacks, since it is so easy to hide a non-obvious inference chain.

    I worry also about time-shift attacks, when attackers gain control of a popular namespace, … one whose contents are sometimes trustingly injected into contexts where they’re mixed with instance data that is believed. For example, some apps I’m sure deref DC and FOAF namespaces. So a time-shift attack would put something temporarily into the schema, only for a short period of time, when it is expected to be re-loaded into the target app; after which the mischief would be reverted.

    I observed something similar happen to my homepage when my personal site was hacked. This kind of attack also works on OpenID indirection markup, quite nastily… though I’ve never heard of that being used in the wild.

  8. [...] more directly, Ian Davis (CTO of Talis) enumerated seven different possible linked data exploits in Linked Data Spam Vectors.  Following on this we have Marie-Claire Jenkins’ Semantic web spam: SemSpam, which seems to [...]

  9. [...] I am a Semantic Web enthusiast I am also open to points against it, like this one raised by CJ and this one by Ian.  I believe it’s VERY important to know how the semantic web can be exploited for spamming [...]
