Sep 21 2009

Linked Data Spam Vectors

Published by Ian Davis under Random Stuff

One of the defining characteristics of a successful information system is its level of exploitation by spammers. Successful systems will attract those who wish to hijack that success for their own ends, most often by the use of deception. I think we can safely say that the Linked Data web is not yet successful because it has not attracted that level of spam attack. With Google and Yahoo’s adoption of RDFa I think we’ll see that change in the next twelve months. In anticipation of future attacks I thought it would be a useful exercise to try to predict some of the possible vectors for introducing spam into a Linked Data system. Some of the spam vectors are going to be traditional, rooted in the existing web. However, some are going to be new, made possible by the open world aspects and low cost integration possibilities provided by Linked Data.

Without too much effort I identified seven obvious attack vectors for spammers. I’m sure collectively we can extend this list and then examine the methods of control to ensure that they don’t become a problem for the Linked Data web. Here’s what I came up with:

False Labelling

With this vector of attack the spammer simply asserts labelling triples that promote their message. Linked data systems often display the objects of these triples when labelling resources. If the spammer targets popular subject URIs then there is a higher chance of their message appearing for users of the Linked Data system. For example:

dbpedia:London rdfs:label "Buy more Wensleydale" .
<http://danbri.org/foaf.rdf#danbri> foaf:name "Wensleydale fan" .
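
As a rough illustration of why this works, here is a minimal Python sketch of a label lookup over merged data. Everything in it is an assumption made for the example: triples are plain tuples tagged with a hypothetical source URL, and the allow-list check is just one possible control, not a description of any particular Linked Data system.

```python
# Hypothetical illustration: each entry is (subject, predicate, object, source),
# where source is the place the triple was fetched from.
LABEL_PROPERTIES = ("rdfs:label", "foaf:name")

quads = [
    ("dbpedia:London", "rdfs:label", "London", "http://dbpedia.org"),
    ("dbpedia:London", "rdfs:label", "Buy more Wensleydale", "http://spam.example"),
]


def naive_labels(subject, quads):
    """Return every label asserted for subject, regardless of who asserted it."""
    return [o for s, p, o, _source in quads
            if s == subject and p in LABEL_PROPERTIES]


def trusted_labels(subject, quads, trusted_sources):
    """Return only the labels asserted by sources on an allow-list."""
    return [o for s, p, o, source in quads
            if s == subject and p in LABEL_PROPERTIES and source in trusted_sources]


print(naive_labels("dbpedia:London", quads))
print(trusted_labels("dbpedia:London", quads, {"http://dbpedia.org"}))
```

The naive lookup returns the spammer's label alongside the genuine one; the second function only helps if the system has kept track of where each triple came from.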

Misdirection

Here the attacker asserts triples using properties that are commonly used to provide links to human-readable content. In the attack, the triple objects are resources that contain the attacker’s message. Systems that use these properties may inadvertently display links to the spammer’s site and content:

dbpedia:London rdfs:seeAlso <http://example.com/buycheese> .

dbpedia:Tim_Berners-Lee foaf:isPrimaryTopicOf <http://example.com/buycheese> .

<http://sws.geonames.org/3333196/> mo:wikipedia <http://example.com/buycheese> .

Schema Pollution

In this attack all of the instance data is innocuous but some of the properties used in the data are labelled with the spammer’s message. When rendering data for human use, many linked data systems will look for schema information to label unknown predicates. This attack causes those systems to display the spammer’s message:

ex:thing dc:title "New study finds that mice can learn to sing." ;
         a foaf:Document ;
         dc:subject "mouse behaviour" ;
         ex:prop "Journal of mouse psychology" .

ex:prop a rdf:Property ;
        rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .

This attack can be combined with False Labelling, attempting to inject a message into a commonly used schema:

dc:title rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .
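
The rendering step that this attack targets can be sketched in a few lines of Python. This is a hypothetical renderer written for illustration, assuming triples are simple tuples of prefixed names; real systems will differ in how they find schema labels.

```python
# Triples taken from the example above, as (subject, predicate, object) tuples.
triples = [
    ("ex:thing", "dc:subject", "mouse behaviour"),
    ("ex:thing", "ex:prop", "Journal of mouse psychology"),
    ("ex:prop", "rdfs:label", "Lowest Wensleydale prices at bargaincheeseshop.com"),
]


def render(subject, triples):
    """Render 'label: value' lines, labelling predicates from any rdfs:label found."""
    labels = {s: o for s, p, o in triples if p == "rdfs:label"}
    lines = []
    for s, p, o in triples:
        if s == subject and p != "rdfs:label":
            # Fall back to the raw predicate name when no label is known.
            lines.append(f"{labels.get(p, p)}: {o}")
    return lines


for line in render("ex:thing", triples):
    print(line)
```

The instance data about ex:thing never mentions cheese, yet the rendered output does, because the label for ex:prop was taken on trust.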

Identity Assumption

This vector relies on the common practice of minting URIs in one URI space and using owl:sameAs to connect the resource to identical resources in other URI spaces. The attacker simply describes a resource that conveys their message and then uses owl:sameAs to make it identical to popular resources. Most Linked Data systems recognise owl:sameAs and aggregate all triples about any subjects declared to be identical.

ex:thing dc:title "Wensleydale: the mature, smooth cheese you will love." ;
         owl:sameAs dbpedia:The_Beatles ;
         owl:sameAs dbpedia:Lady_Gaga ;
         owl:sameAs dbpedia:True_Blood ;
         owl:sameAs dbpedia:Harry_Potter .
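
To make the aggregation step concrete, here is a small Python sketch that merges descriptions along owl:sameAs links using a union-find structure. It is an assumption about how a system might do this, not a description of any specific implementation, and the function name smush is invented for the example.

```python
from collections import defaultdict


def smush(triples):
    """Group property values by identity, following owl:sameAs links."""
    parent = {}

    def find(x):
        # Find the representative of x's identity group, compressing paths.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # First pass: union every pair of resources declared identical.
    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)

    # Second pass: attach all other triples to the group representative.
    merged = defaultdict(set)
    for s, p, o in triples:
        if p != "owl:sameAs":
            merged[find(s)].add((p, o))
    return find, merged


spam_title = "Wensleydale: the mature, smooth cheese you will love."
triples = [
    ("ex:thing", "dc:title", spam_title),
    ("ex:thing", "owl:sameAs", "dbpedia:The_Beatles"),
    ("ex:thing", "owl:sameAs", "dbpedia:Lady_Gaga"),
    ("dbpedia:The_Beatles", "rdfs:label", "The Beatles"),
]

find, merged = smush(triples)
print(merged[find("dbpedia:Lady_Gaga")])
```

Because owl:sameAs is transitive, a single spam resource also ends up declaring all of its targets identical to each other, so the damage goes beyond the injected title.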

Bait and Switch

In this vector, the spammer uses content negotiation to provide enticing linked data to machines and spam messages to humans. When a Linked Data system fetches a URI it indicates that it requires machine-readable data by sending an appropriate HTTP Accept header. Web browsers under the control of a human will send a different value for the header, so servers can distinguish machines from humans and send different information. The spammer can configure their server to send innocuous Linked Data to machines but display the spam message when the same URI is visited by a human. (See my earlier post Is the semantic web destined to be a shadow? for some of the consequences of this separation of machine/human content.)
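
The server-side decision is simple enough to sketch in Python. This is a deliberately naive illustration: the function names and response bodies are made up, and it ignores q-values and wildcards that a real content negotiation implementation would have to handle.

```python
# Media types that commonly signal a request for RDF rather than a web page.
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
    "text/n3",
}


def wants_rdf(accept_header):
    """Return True if any media type listed in the Accept header is an RDF type."""
    media_types = [part.split(";")[0].strip().lower()
                   for part in accept_header.split(",")]
    return any(media_type in RDF_MEDIA_TYPES for media_type in media_types)


def respond(accept_header):
    """Choose a body the way a bait-and-switch server might (hypothetical bodies)."""
    if wants_rdf(accept_header):
        return "innocuous Linked Data about cheese"
    return "<h1>Lowest Wensleydale prices at bargaincheeseshop.com</h1>"


print(respond("application/rdf+xml;q=0.9, text/turtle"))
print(respond("text/html,application/xhtml+xml,*/*;q=0.8"))
```

One possible check on the consuming side, offered only as a suggestion, is to fetch the same URI with a browser-like Accept header and compare what comes back.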

Misattribution

Under this attack, the spammer attributes their message to someone they hope the recipient will trust. Linked Data systems may ingest this data and display the quotation with the source, inadvertently misleading their users:

ex:1 a bibo:Quote ;
   bibo:content "I always buy Wensleydale from bargaincheeseshop.com
                 and so should you" ;
   dc:creator "Sergey Brin" .

Data URI Embedding

In this attack vector the data itself is innocuous but the URIs used by the attacker use the data: scheme to embed the spam message. If these URIs are displayed to the user of a Linked Data system then they may click on them and trigger the message display.

dbpedia:London rdfs:seeAlso
        <data:text/html;charset=utf-8;base64,PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4=> .
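
A system that wants to inspect such URIs before displaying them could decode the payload first. The following Python sketch uses only the standard library; the helper name decode_data_uri is invented here, and it handles only the base64 form used in the example above.

```python
import base64


def decode_data_uri(uri):
    """Split a base64 data: URI into (media type with parameters, decoded text)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    header, _, payload = uri[len("data:"):].partition(",")
    if not header.endswith(";base64"):
        raise ValueError("only base64 payloads are handled in this sketch")
    media_type = header[: -len(";base64")]
    return media_type, base64.b64decode(payload).decode("utf-8")


uri = ("data:text/html;charset=utf-8;base64,"
       "PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+"
       "bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4=")

media_type, text = decode_data_uri(uri)
print(media_type)
print(text)
```

Decoded, the URI in the example turns out to be a small HTML fragment containing a link to http://example.com/buycheese.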

Conclusion

I think there might be an attack vector using reification of the spam triples, but I’m not sure it’s really different from any of those listed above. Also, I covered owl:sameAs as a vector but OWL provides many other ways to infer identity between resources, such as owl:FunctionalProperty and owl:InverseFunctionalProperty.

Most of these attack vectors can be countered through a whitelist-based provenance system, but such systems are not easy to scale. Also, some of the attacks may be difficult to detect if the vast majority of the data from a spammer is actually useful. One particular property of RDF, that duplicate triples can be ignored, makes it easy to bury spam inside billions of legitimate triples: simply take a copy of dbpedia and add a few spam triples. A casual inspection of the dataset will more than likely just see the dbpedia triples, but a Linked Data system that already has those triples will ignore them and just add the spam triples.
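
The burying trick follows directly from treating an RDF graph as a set of triples. Here is a toy Python sketch with a handful of tuples standing in for billions of triples; the data and the function name are made up for the example.

```python
def novel_triples(incoming, known):
    """Return the triples in incoming that the system does not already hold."""
    return incoming - known


# Triples the system already holds (a stand-in for a full copy of dbpedia).
known = {
    ("dbpedia:London", "rdfs:label", "London"),
    ("dbpedia:Paris", "rdfs:label", "Paris"),
}

# The spammer republishes the same data with one extra triple mixed in.
incoming = known | {
    ("dbpedia:London", "rdfs:seeAlso", "http://example.com/buycheese"),
}

print(novel_triples(incoming, known))
```

The same set difference could be turned to the defender's advantage: looking only at what a dataset adds to data you already trust makes the spam much harder to hide.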


Jul 21 2009

The Linked Data Brand

Published by Ian Davis under Opinion

Paul Miller has kicked off a twitstorm with his simple question: does linked data require RDF? My contention is that Linked Data does absolutely require RDF. This is not a technical issue and it’s not one of zealots or pragmatists: it’s a marketing and branding issue.

The term Linked Data was coined to brand a specific class of practices: namely assigning HTTP URIs to arbitrary things and making those URIs respond with RDF relating the things to other things.

Here very few of the ‘things’ are documents; instead they are people, places, objects and concepts.

That deliberately excludes many other practices of publishing data on the web such as atom feeds, spreadsheets, APIs and even many existing RDF use cases.

The purpose of giving things a brand is to engender recognition, familiarity and trust. When you open a can of Pepsi you know exactly what you are going to get. You know you will get a great user experience whatever Apple product you buy. When you buy Lego you can rely on all the pieces fitting together.

The Linked Data brand makes similar promises of quality and consistency. When you consume Linked Data you know it will be RDF so your tools will work correctly. You also know that the data will be using HTTP URIs to refer to real-world things so you can determine what the data is about. You can trust that you’re not suddenly going to be given some XML in a proprietary schema or CSV with text headings you will have to guess the meaning of.

The Semantic Web community has been notorious for its poor marketing over the past decade. Now just when it seems the community has found the right balance between technology and mass appeal it feels like people are trying to rip away that success for their own purposes. That is deliberately emotive language because brands are all about emotion.

I don’t want to see the Linked Data brand weakened because it destroys trust. That’s why I pushed back on Twitter. As all involved know I am a huge advocate of making more data available on the web for reuse. It makes me glad whenever I see people invest their time in publishing data in any format, but my heart sings when I see more Linked Data.

There are many situations where there are better approaches than Linked Data; for example, I would rather have a MIDI file than the RDF version. In many circumstances I would be glad of a spreadsheet: simple and convenient.

But we should not confuse these forms of data publishing with Linked Data. That would sow confusion and be counterproductive. The coming web of data will be a rich and varied space full of content and data in every format imaginable. A large part of that we will call Linked Data and when you encounter it you will be justified in expecting RDF and HTTP.

I welcome anyone who wants to share data on the web in any way. But play fair and use the Linked Data brand only when what you publish follows the Linked Data rules.


Mar 04 2009

Why Open Data Is More Important than Open Source

Published by Ian Davis under Ideas

Last week I delivered the keynote for the final day of code4lib 2009. This was a particular honour because, unlike many conferences, the keynote speakers are proposed and voted on by the code4lib community. So, rather than keynote speakers being used to draw people to the conference, the community draws the speakers to them.

I chose to present on a topic that is close to my heart, to my company’s vision and, I hoped, of interest to the audience: freedom. For this conference I chose a specific expression of freedom that I thought would be of particular interest to a community deeply entrenched in metadata. The title of my keynote was “If you love something… set it free”.

The conference organisers videoed all the sessions so hopefully I can link to a more informative version. As usual with this kind of presentation I wrote copious prompt notes in case I dried up, then found I couldn’t read them and carried on regardless :)

I hope to come back to various points raised in my presentation over time, but right now I want to focus on one area that has sparked a good deal of debate (such as here, here and here with much twittering too). Right in the middle of the presentation I offered three conjectures, the first of which was “data outlasts code”, which led me to then assert that therefore open data is more important than open source. This appears to be controversial.

First, it’s important to note what I did not say. I did not say that open source is not important. On the contrary, I said that open source was extremely important and that it has sounded the death knell for proprietary software. Later speakers at the conference referred to this statement as controversial too :) (What I actually meant to say was that open source has sounded the death knell for proprietary software models.) I also mentioned that open source and free software have a long history and that open data is where open source was 25 years ago (I am using the terms open source and free software interchangeably here).

I also did not say that code does not last nor that algorithms do not last. Of course they last, but data lasts longer. My point was that code is tied to processes usually embodied in hardware whereas data is agnostic to the hardware it resides on. The audience at the conference understand this already: they are archivists and librarians and they deal with data formats like MARC which has had superb longevity. Many of them deal with records every day that are essentially the same as they were two or three decades ago. Those records have gone through multiple generations of code to parse and manipulate the data.

In a recent post Egon Willighagen criticised my conjecture:

Ian Davis was recently quoted saying open data is more important than open source, which was pulled (out of context) from this presentation. The context was (a slide earlier): Data outlasts code.

As far as I can see, this is utter nonsense, even within context of the slide (see also this discussion on FriendFeed). Obviously, within the context of Ian it does makes sense, and I hope he will respond in his blog and explain why he thinks Open Data is more special.

Without code, you have no way of accessing the data. Ask anyone to recover from a hard disk failure. In ODOSOS (Open Standards, Open Data, Open Source) they are all equal. You need them all for progress. You cannot single out one as being more important than another. Why would you anyway? Politics is all I can think of… All three combine and ensure our science is more efficient.

I think the flaw in this argument is this statement: “Without code, you have no way of accessing the data.” It’s true that you need code to access data, but critically it doesn’t have to be the same code from year to year, decade to decade, century to century. Any code capable of reading the data will do, even if it’s proprietary. You can also recreate the code whereas the effort involved in recreating the data could be prohibitively high. This is, of course, a strong argument for open data formats with simple data models: choosing CSV, XML or RDF is going to give you greater data longevity than PDF, XLS or PST because the cost of recreating the parsing code is so much lower.

Here’s the central asymmetry that leads me to conclude that open data is more important than open source: if you have data without code then you could write a program to extract information from the data, but if you have code without data then you have lost that information forever.

Consider also the rise of software as a service. It really doesn’t matter whether the code these services are built on is open source or not if you cannot access the data they manage for you. Even if you reproduce the service completely, using the same components, your data is buried away out of your reach. However, if you have access to the data then you can achieve continuity even if you don’t have access to the underlying source of the application. I’ll say it again: open data is more important than open source.

Of course we want open standards, open source and open data. But in one or two hundred years which will still be relevant? Patents and copyrights on formats expire, hardware platforms and even their paradigms shift and change. Data persists, open data endures.

The problem we have today is that the open data movement is in its infancy when compared to open source. We have so far to go, and there are many obstacles. One of the first steps to maturity is to give people the means to express how open their data is, how reusable it is. The Open Data Commons is an organisation explicitly set up to tackle the problem of open data licensing. If you are publishing data in any way you ought to check out their licences and see if any meet with your goals. If you licence your data openly then it will be copied and reused and will have an even greater chance of persisting over the long term.

Hopefully I have given plenty of background to my open data conjecture. I’m eager to hear what you think so please comment or email me directly. You may find these links relevant too:


Mar 02 2009

The Semantic Web Acid Test

Published by Ian Davis under Random Stuff

Tom Heath writes a cracking post on the current attempts by a few people to brand web applications that happen to perform text analysis as “Semantic Web”. For me, this nails it:

I certainly notice plenty of unjustified attempts at present to co-opt the term Semantic Web, now that it’s no longer a dirty word, and drive it off down some dodgy alleyway. Some of these products, services or companies may be applications or services that use some semantic technology and are delivered over the Web, but that doesn’t make them Semantic Web applications, services or companies. Anything claiming the Semantic Web label needs to get its hands dirty with Linked Data somewhere along the way. That’s just how it is.

Tom’s right. These attempts to label some pretty run-of-the-mill web applications as Semantic Web suggest to me that the marketers are seeing the Semantic Web meme as carrying some useful currency. The problem they face is that the Semantic Web has some well-defined principles that can be used as tests. Here’s the first test: if you see one of these applications, find one of its pages describing something that’s useful to you (e.g. a place or a person) and ask yourself “what’s the URI of the thing this page is describing?”.


Nov 10 2008

Paget Iteration 2

Published by Ian Davis under Projects

A few weeks ago I released a small PHP framework for publishing linked data (see my earlier post Publishing Linked Data with PHP). Since then I have made a lot of changes to the code and ended up completely changing the application flow.

Previously all the behaviour was specified by a configuration array with a dispatcher class. I found that was limiting the flexibility I needed and the “simple” configuration array was becoming decidedly complex. The Dispatcher class has been replaced by a new UriSpace class which is responsible for identifying the resources identified by a group of URIs. Applications can create classes derived from UriSpace to encapsulate the behaviour of their resources. Resources are split into three categories: documents that can be served straight up, abstract resources and descriptions of abstract resources. The last two are where the interesting bits of Paget lie. An application will typically override the get_description method to return a custom description derived from ResourceDescription. This class does all the hard work of finding triples about the requested abstract resource.

A class derived from ResourceDescription can override several methods to customise the RDF returned:

get_resources
This method returns an array of resource URIs that the description will consider when generating its RDF. The default behaviour is simply to chop the file extension off of the description’s URI. So, the description at http://iandavis.com/id/me.rdf will have a resource of http://iandavis.com/id/me.
get_generators
This returns a list of generators that seed the triples in the description. The ResourceDescription class calls each generator’s add_triples method once for each resource returned by the get_resources method. Paget has some pre-defined generators that can read triples from a local file or from a platform store. The default behaviour is to do nothing.
get_augmentors
This returns a list of augmentors that add triples to the description. Paget comes with a few built-in augmentors to augment with RDF from a platform store, annotate properties with human readable labels and even do some limited inferencing. By default the simple property labeller is returned as an augmentor.
get_label
This just calculates a sensible label for the description that could be used in the title of a web page or a link. The default behaviour is to look for an rdfs:label, dc:title or foaf:name for the primary resource in the description (which is the first one returned by get_resources). Applications could override this to use whatever heuristics make sense for their data.
get
This is the dispatch point for HTTP GET requests. At a later date I hope to handle other methods too, but for now Paget is a read-only system.
get_html
This is called by the get method to generate an HTML representation of the description. By default it uses Paget’s SimpleHtmlRepresentation class but this is the point at which most customisations will take place for rendering linked data.
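
The two simplest defaults described above, get_resources and get_label, can be illustrated outside of PHP. The following is a Python sketch of the described behaviour only, not Paget’s actual code: the function names are invented, the order in which the label properties are tried is an assumption, and so is the fallback to the URI when no label is found.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Assumed priority order; the description above lists these properties
# without saying which one wins.
LABEL_PROPERTIES = ("rdfs:label", "dc:title", "foaf:name")


def resource_uri_for(description_uri):
    """Chop the file extension off a description URI to get the resource URI."""
    parts = urlsplit(description_uri)
    path, _extension = posixpath.splitext(parts.path)
    return urlunsplit(parts._replace(path=path))


def pick_label(resource, triples):
    """Return the first label found for resource, or the URI itself if none."""
    for label_property in LABEL_PROPERTIES:
        for s, p, o in triples:
            if s == resource and p == label_property:
                return o
    return resource


primary = resource_uri_for("http://iandavis.com/id/me.rdf")
triples = [(primary, "foaf:name", "Ian Davis")]
print(primary)
print(pick_label(primary, triples))
```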

The HTML output from Paget has been revised too. The basic layout of the page is handled by the SimpleHtmlRepresentation class but some type-specific logic has been broken out into a number of “widgets”. There’s one for OWL ontologies, RDF classes and properties and a general one that can render any RDF data. The HTML representation chooses an appropriate widget based on the type of the primary resource being rendered. I’m thinking about adding widgets for people and various other common classes. This is all very early and experimental. Ideally I would like the page to adapt itself completely dynamically based on the underlying data. Switching on the class of a resource is rather simplistic, but it will do as a starter.

Here’s an example of how I’m using Paget in my personal data space http://iandavis.com/id/me. All the data is held in a Talis Platform store. I handle requests to http://iandavis.com/id/ with some .htaccess rules that ensure every request is handled by a file called index.php which contains the code hooking the space up to Paget. In index.php I create a subclass of UriSpace called StoreBackedUriSpace that maps the URIs beneath http://iandavis.com/id/ to resources and their descriptions. That class creates instances of StoreBackedResourceDescription that use a StoreDescribeGenerator to fetch the descriptions from the platform store. The entire code for index.php (less PHP includes etc) is shown here:


// Maps every URI in this space to a store-backed description.
class StoreBackedUriSpace extends PAGET_UriSpace {
  function get_description($uri) {
    return new StoreBackedResourceDescription($uri);
  }
}

// Seeds each description with triples fetched from the Talis Platform store.
class StoreBackedResourceDescription extends PAGET_ResourceDescription {
  function get_generators() {
    return array( new PAGET_StoreDescribeGenerator("http://api.talis.com/stores/iand") );
  }
}

// Hand the incoming request to the URI space.
$space = new StoreBackedUriSpace();
$space->dispatch();

That’s basically the pattern for publishing data using Paget: derive a class from UriSpace and override the get_description method to return a custom ResourceDescription. I do that to publish some vocabularies on vocab.org such as Bio and Whisky. The UriSpace for those locations returns a resource description class that uses the FileGenerator class to read the schemas from local RDF documents and the simple property labeller and the simple inferencer to augment the results. My other deployment, at placetime.com, uses a custom resource description for each type of resource with custom generators that create the raw triples based on the requested URI.

So far it seems that Paget is flexible enough to deal with these varied scenarios of data publishing. The next step is to start looking at editing of the data and providing more application functionality.
