The Real Challenge for RDF is Yet to Come
One often overlooked advantage that RDF offers is its deceptively simple data model. This data model makes merging data from multiple sources trivial, and does it in such a way that data about the same things gets collated and de-duplicated. In my opinion this is the most important benefit of using RDF over other open data formats.
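To make that concrete, here's a minimal sketch using Python's rdflib library; the ex: vocabulary and URIs are invented for illustration. Two sources describe the same resource, merging is plain set union, and the duplicated triple collapses automatically:

```python
# A minimal sketch of RDF merging with rdflib.
# The ex: vocabulary and URIs are invented for illustration.
from rdflib import Graph

source_a = """
@prefix ex: <http://example.org/> .
ex:alice ex:name "Alice" ;
         ex:email "alice@example.org" .
"""

source_b = """
@prefix ex: <http://example.org/> .
ex:alice ex:name "Alice" ;
         ex:homepage <http://example.org/alice> .
"""

g1 = Graph().parse(data=source_a, format="turtle")
g2 = Graph().parse(data=source_b, format="turtle")

# Merging is set union: the ex:name triple appearing in both
# sources collapses into one, and everything known about
# ex:alice is collated under the same identifier.
merged = g1 + g2
print(len(g1), len(g2), len(merged))   # 2 2 3
print(merged.serialize(format="turtle"))
```

No custom reconciliation code is needed: because both sources use the same identifier for the same thing, the collation falls out of the data model itself.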
Paradoxically, this excellent feature is also a significant factor in the slow adoption of RDF. The reason is that RDF is a general solution to the problem of merging disparate types and sources of data. If you don't have that problem then RDF will always look inefficient, verbose and obtuse to you. Even if you are merging data today, you're most likely only doing it from a few known sources, and it'll be easier to write some custom code to do it for you.
I've heard many other arguments for the slow adoption of RDF over the years, ranging from perceived deficiencies in the RDF model, through obtuse XML formats, all the way up to techies being blamed for being bad at explaining RDF. I've been guilty of complaining long and hard about blank nodes and status codes. In reality, none of these things have any impact on the rate of adoption of RDF because people won't use it until they have the problem it solves.
This is a typical characteristic of technical paradigm shifts. No-one thought they had the problem of not being able to speak to anyone they liked wherever they were until the cellphone arrived and shifted expectations.
Right now, no-one realises they have the problem of not being able to merge and combine data from thousands of different primary sources. Most people aren't thinking about it, and those who do are facing an economic barrier, not a technical one. We know a general technical solution exists, but the benefit/cost ratio needs to be high enough to warrant using a general solution over a custom one. Today the costs of integrating data at scale are too high for most, even given the massive benefits that could be possible.
There are organisations that are already merging and regularising data from thousands of sources, but they're paying the cost side of the equation just to get at the massive value they can generate. Government security agencies do it, as do the largest consumer retailers and logistics firms. Google has a mission to organise the world's information, and we're right in the middle of the big data startup era, where founders and VCs are predicting huge returns from combining data at scale. Today these activities are seen as a competitive advantage. That means there's not much appetite for promoting a general solution that drives the costs close to zero and lets everyone get in on the game. A very few pioneers, notably Garlik, are using open standards like RDF to merge their data.
The primary cost that RDF itself adds to the benefit/cost equation is the effort you need to spend mapping all the data you need into the RDF triple model. You really need to persuade everyone else to publish RDF so you can consume it cheaply. This is where the Linked Data project has been very successful: however you choose to measure it, there is a vast amount of data already in RDF. There is also a growing body of data in other formats, such as microdata, that are pretty close to the RDF model and fairly cheap to consume. This aspect of the benefit/cost ratio is on the right trajectory.
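For a feel of where that mapping effort goes, here's a hypothetical sketch, again with an invented ex: vocabulary, that lifts a plain record into triples. Every field forces a decision about identifiers and predicates, and that decision-making, multiplied across every source, is the cost:

```python
# A hypothetical sketch of the mapping cost: lifting a plain
# record into the triple model. The ex: vocabulary is invented;
# in practice you'd reuse terms from shared vocabularies.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")

record = {"id": "alice", "name": "Alice", "email": "alice@example.org"}

g = Graph()
subject = EX[record["id"]]   # decide on a stable identifier
for field, value in record.items():
    if field == "id":
        continue
    # decide which predicate each field maps to
    g.add((subject, EX[field], Literal(value)))

print(g.serialize(format="turtle"))
```

If the publisher had already made those decisions and shipped RDF, the consumer's cost drops to almost nothing, which is why persuading others to publish matters so much.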
A larger proportion of the costs in the equation have nothing to do with the data model or format used. They are costs inherent in the problem space itself: how do you know, for example, that the data you are depending on is accurate, available when you need it, and legally yours to use? Until those issues of trust and quality are tackled effectively, people will stick to their small-scale merges from a few trusted sources. This problem is huge and the solution can't be rushed. It's the area that Kasabi has in its sights, and we're going to do our best to help people overcome the fear inherent in using random data from untrusted parties.
My hope is that Kasabi plays the role of the cellphone and radically shifts expectations. Once people realise they have the problem of diverse data integration at scale, the focus will shift back to the technology. That's where we'll see competing approaches emerge, and RDF will face its stiffest challenge of all: direct competition on its home territory.