The Real Challenge for RDF is Yet to Come

One often overlooked advantage that RDF offers is its deceptively simple data model. This data model trivializes merging of data from multiple sources and does it in such a way that data about the same things gets collated and de-duplicated. In my opinion this is the most important benefit of using RDF over other open data formats.
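
To make the merging claim concrete, here is a minimal sketch in plain Python rather than a real RDF library. It assumes each source is a set of (subject, predicate, object) tuples that use shared URIs and contain no blank nodes; the example.org URIs and the helper names merge and describe are invented for illustration.

```python
# Hypothetical base URI; not a real vocabulary.
EX = "http://example.org/"

# Two independent sources that happen to describe the same book.
source_a = {
    (EX + "book/1", EX + "title", "Weaving the Web"),
    (EX + "book/1", EX + "author", EX + "person/tbl"),
}
source_b = {
    (EX + "book/1", EX + "title", "Weaving the Web"),  # same fact as in source_a
    (EX + "book/1", EX + "published", "1999"),
    (EX + "person/tbl", EX + "name", "Tim Berners-Lee"),
}


def merge(*graphs):
    """Merge graphs by set union; identical triples collapse into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Collect every (predicate, object) pair known about one subject."""
    return {(p, o) for s, p, o in graph if s == subject}


merged = merge(source_a, source_b)
book_facts = describe(merged, EX + "book/1")
```

Because both sources name the book with the same URI, the union collates its title, author and publication date under one subject, and the repeated title triple is stored only once. Blank nodes and differing identifiers for the same thing need extra work that this sketch ignores.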

Paradoxically this excellent feature is also a significant factor in the slow adoption of RDF. The reason is that RDF is a general solution to the problem of merging disparate types and sources of data. If you don’t have that problem then RDF will always look inefficient, verbose and obtuse to you. Even if you are merging data today you’re most likely only doing it from a few known sources and it’ll be easier to write some custom code to do it for you.

I’ve heard many other arguments for the slow adoption of RDF over the years, ranging from perceived deficiencies in the RDF model through obtuse XML formats and all the way up to techies being blamed for being bad at explaining RDF. I’ve been guilty of complaining long and hard about blank nodes and status codes. In reality, none of these things has any impact on the rate of adoption of RDF because people won’t use it until they have the problem it solves.

This is a typical characteristic of technical paradigm shifts. No-one thought they had the problem of not being able to speak to anyone they liked wherever they were until the cellphone arrived and shifted expectations.

Right now, no-one realises they have the problem of not being able to merge and combine data from thousands of different primary sources. Most people aren’t thinking about it and those that do are facing an economic barrier, not a technical one. We know a general technical solution exists but the benefit/cost ratio needs to be high enough to warrant using a general solution over a custom one and today the costs of integrating data at scale are too high for most even given the massive benefits that could be possible.

There are organisations that are already merging and regularising data from thousands of sources but they’re paying the cost side of the equation just to get at the massive value they can generate. Government security agencies do it as do the largest consumer retailers and logistics firms. Google has a mission to organise the world’s information and we’re right in the middle of the big data startup era where founders and VCs are predicting huge returns from combining data at scale. Today these activities are seen as competitive advantage. That means there’s not much appetite for promoting a general solution that drives the costs close to zero and lets everyone get in on the game. A very few pioneers, notably Garlik, are using open standards like RDF to merge their data.

The primary cost that RDF itself adds to the benefit/cost equation is the effort you need to spend mapping all the data you need into the RDF triple model. You really need to persuade everyone else to publish RDF so you can consume it cheaply. This is where the Linked Data project has been very successful: however you choose to measure it there is a vast amount of data already in RDF. There is also a growing body of data in other formats such as microdata that are pretty close to the RDF model and fairly cheap to consume. This aspect of the benefit/cost ratio is on the right trajectory.
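
As a rough illustration of that mapping effort, the sketch below turns one flat record into triples in plain Python. The function name record_to_triples, the example.org base URIs and the sample row are all assumptions made up for this example; a real mapping also has to settle on shared vocabularies and datatypes, which is where most of the cost lies.

```python
def record_to_triples(record, id_field, base, vocab):
    """Map one flat record (a dict) to (subject, predicate, object) triples.

    The subject URI is minted from the record's identifier, and each
    remaining field becomes a predicate in a hypothetical vocabulary.
    Fields with no value produce no triple.
    """
    subject = base + str(record[id_field])
    return {
        (subject, vocab + field, value)
        for field, value in record.items()
        if field != id_field and value is not None
    }


# Invented sample row, as it might come out of a relational table.
row = {"id": 42, "name": "Acme Ltd", "city": "Birmingham", "fax": None}

triples = record_to_triples(
    row,
    id_field="id",
    base="http://example.org/company/",
    vocab="http://example.org/vocab/",
)
```

Each non-empty column becomes one triple about the subject minted from the row's id, so the missing fax number simply produces nothing instead of a null.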

A larger proportion of the costs in the equation have nothing to do with the data model or format used. They are costs inherent in the problem space itself: how do you know, for example, that the data you are depending on is accurate, available when you need it and legally yours to use? Until those issues of trust and quality are tackled effectively then people will stick to their small scale merges from a few trusted sources. This problem is huge and the solution can’t be rushed. It’s the area that Kasabi has in its sights and we’re going to do our best to help people overcome the fear inherent in using random data from untrusted parties.

My hope is that Kasabi plays the role of the cellphone and radically shifts expectations. Once people realise they have the problem of diverse data integration at scale the focus will then shift back to the technology. That’s where we’ll see competing approaches to solve the problem emerge and RDF will face its stiffest challenge of all: direct competition on its home territory.

About Ian Davis

British entrepreneur and CEO of Kasabi. Primary interests are open data, the semantic web and decentralization.

7 Responses to The Real Challenge for RDF is Yet to Come

  1. Although you cite one reason for RDF as “a general solution to the problem of merging disparate types and sources of data”, the other strength for single data sources, namely flexibility, does have use cases: WordPress itself (this blog!) has key-value tables for user and post data in addition to the fixed wide tables (including, if I recall right, multiple values for the same key). Similar extension mechanisms can be found in other large open-source systems, and I’ve used this kind of data representation method for many years. So one question I’d ask is, given there is a use case, and people do recognise it, why haven’t these *single data set* systems changed to RDF, as it is a better match than an RDBMS? Some of this may be historical / momentum of SQL – in which case things like SQL-emulation over RDF may help (with rules at the back end to ensure constraints that RDBMSs give for ‘free’). Or are there more fundamental issues with RDF itself (e.g. need for minting unique ids everywhere, status of literals)?

    For “the problem of merging disparate types and sources of data” I guess one can ask the same question ahead of time: what are the really critical issues of “merging disparate types and sources of data” and what are the pros and cons of RDF for this problem? That is, once people recognise the problem, will RDF be the best technology for them? The focus on URIs seems critical here (even if a barrier for other uses), but without sameAs or value-based equivalence support being standard in all triple stores, it doesn’t really do the job: you can throw all your data together, but it is not really interlinked. Reconciliation is the key. Furthermore, for “disparate sources of data” trivial provenance (not via reification!) is crucial … give those triples URIs ;-)

    There may also be more general ‘usability’ (as in developer usability) issues. When I first saw triples proposed for data representation (way pre-web!), I thought “interesting theoretical representation, but would never use it in practice”; that is, like binary representation, good low-level semantics. I have warmed to having this more upfront over the years, but issues such as sequence representation, sub-structure, etc. are still do-able, but messy (the ‘Turing equivalent’ fallacy in CS: being able to do something using a technique doesn’t mean you would want to use it). All data representations have resistances, so this is not an RDF-specific issue, but as a touchstone, I imagine programming where RDF was my *only* data representation: how difficult would it be? This is about the underlying semantics of RDF, sort of the counter of the XML representation problems. While RDF triples may be useful base semantics, I would want some ‘syntactic sugar’ on top of them and probably some tweaks to the base semantics as well (like some non-string semantics of rdf:_n).

    In summary, you are right that until the need becomes clear RDF won’t take off, but we also need to be very sure that when the need does arrive it will be found to be the right solution.

  2. Pingback: Geek Reading August 18, 2011 | Regular Geek

  3. Well written, especially at the beginning.

    Regarding the problem of data quality from the information providers, I approached some time ago the Info Service Ontology (http://purl.org/ontology/is/core#) which is intended to describe information services (these are not only Linked-Data-based ones) and the quality of their data.

  4. mauvedeity says:

    Alan,

    Ian rightly points out that the killer advantage of RDF is merging disparate data sources. As such, if you have just one database, which is the whole world for your application, then you absolutely don’t have any kind of data merging problem. As such, the tried-and-tested SQL database solution is absolutely the most common solution, as the applications are designed with it in mind. As such, SQL probably is the right solution for a WordPress instance, or something similar.

    But as soon as you have more than one application needing that data store – that’s when you have that data sources problem, and that’s when moving to RDF will pay dividends.

    As the world moves away from data silos to building with the assumption that data will be remixed, repurposed, and reinterpreted, that’s when we’ll see RDF being embraced.

  5. Great post; something I find frustrating sometimes in discussions about RDF/Linked Data being worthwhile or not, is the focus on the technology, to the neglect of its purpose: connecting data.

    It’s more of a pity that connecting data has yet to become the norm, rather than that RDF has yet to be the norm. (Though RDF is currently the most serious solution for integrating data).

  6. Pingback: Länksprutning – 12 September 2011 – Månhus

  7. Pingback: Putting the Links into Linked Data | Talis Consulting | World leading expertise in Linked Data and the Semantic Web
