Mar 04 2009

Why Open Data Is More Important than Open Source

Published by Ian Davis at 2:56 am under Ideas

Last week I delivered the keynote for the final day of code4lib 2009. This was a particular honour because, unlike many conferences, the keynote speakers are proposed and voted on by the code4lib community. So, rather than keynote speakers being used to draw people to the conference, the community draws the speakers to them.

I chose to present on a topic that is close to my heart, to my company’s vision and, I hoped, of interest to the audience: freedom. For this conference I chose a specific expression of freedom that I thought would be of particular interest to a community deeply entrenched in metadata. The title of my keynote was “If you love something… set it free”.

The conference organisers videoed all the sessions, so hopefully I can link to a more informative version. As usual with this kind of presentation, I wrote copious prompt notes in case I dried up, then found I couldn’t read them and carried on regardless :)

I hope to come back to various points raised in my presentation over time, but right now I want to focus on one area that has sparked a good deal of debate (such as here, here and here with much twittering too). Right in the middle of the presentation I offered three conjectures, the first of which was “data outlasts code”, which led me to assert that open data is therefore more important than open source. This appears to be controversial.

First, it’s important to note what I did not say. I did not say that open source is not important. On the contrary, I said that open source was extremely important and that it has sounded the death knell for proprietary software. Later speakers at the conference referred to this statement as controversial too :) (What I actually meant to say was that open source has sounded the death knell for proprietary software models.) I also mentioned that open source and free software have a long history and that open data is where open source was 25 years ago (I am using the terms open source and free software interchangeably here).

I also did not say that code does not last nor that algorithms do not last. Of course they last, but data lasts longer. My point was that code is tied to processes usually embodied in hardware whereas data is agnostic to the hardware it resides on. The audience at the conference understand this already: they are archivists and librarians and they deal with data formats like MARC which has had superb longevity. Many of them deal with records every day that are essentially the same as they were two or three decades ago. Those records have gone through multiple generations of code to parse and manipulate the data.

In a recent post Egon Willighagen criticised my conjecture:

Ian Davis was recently quoted saying open data is more important than open source, which was pulled (out of context) from this presentation. The context was (a slide earlier): Data outlasts code.

As far as I can see, this is utter nonsense, even within context of the slide (see also this discussion on FriendFeed). Obviously, within the context of Ian it does makes sense, and I hope he will respond in his blog and explain why he thinks Open Data is more special.

Without code, you have no way of accessing the data. Ask anyone to recover from a hard disk failure. In ODOSOS (Open Standards, Open Data, Open Source) they are all equal. You need them all for progress. You cannot single out one as being more important than another. Why would you anyway? Politics is all I can think of… All three combine and ensure our science is more efficient.

I think the flaw in this argument is this statement: “Without code, you have no way of accessing the data.” It’s true that you need code to access data, but critically it doesn’t have to be the same code from year to year, decade to decade, century to century. Any code capable of reading the data will do, even if it’s proprietary. You can also recreate the code whereas the effort involved in recreating the data could be prohibitively high. This is, of course, a strong argument for open data formats with simple data models: choosing CSV, XML or RDF is going to give you greater data longevity than PDF, XLS or PST because the cost of recreating the parsing code is so much lower.
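To make the cost argument concrete, here is a minimal sketch of a CSV reader written from scratch in Python, without relying on any existing CSV library. The function names and the sample records are invented for illustration, and the sketch assumes one record per line (it does not handle line breaks inside quoted fields). The point is only that code of this size can be rewritten whenever it is needed, while the records it reads cannot.

```python
def parse_csv_line(line, delimiter=",", quote='"'):
    """Split one CSV line into fields, honouring quoted fields."""
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    current.append(quote)  # a doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                current.append(ch)
        elif ch == quote:
            in_quotes = True  # opening quote
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


def parse_csv(text):
    """Read CSV text with a header row into a list of dictionaries."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = parse_csv_line(lines[0])
    return [dict(zip(header, parse_csv_line(ln))) for ln in lines[1:]]


# Hypothetical sample data, made up for this example.
SAMPLE = (
    "title,author,year\n"
    '"Data, and Code",A. Librarian,1978\n'
    '"The ""Open"" Catalogue",B. Archivist,1967\n'
)

records = parse_csv(SAMPLE)
for record in records:
    print(record["year"], record["title"])
```

A comparable reader for a binary format such as XLS or PST would need the format's specification and far more code, which is the difference in recreation cost I am pointing at.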

Here’s the central asymmetry that leads me to conclude that open data is more important than open source: if you have data without code then you could write a program to extract information from the data, but if you have code without data then you have lost that information forever.

Consider also the rise of software as a service. It really doesn’t matter whether the code these services are built on is open source or not if you cannot access the data they manage for you. Even if you reproduce the service completely, using the same components, your data is buried away out of your reach. However, if you have access to the data then you can achieve continuity even if you don’t have access to the underlying source of the application. I’ll say it again: open data is more important than open source.

Of course we want open standards, open source and open data. But in one or two hundred years which will still be relevant? Patents and copyrights on formats expire, hardware platforms and even their paradigms shift and change. Data persists, open data endures.

The problem we have today is that the open data movement is in its infancy when compared to open source. We have so far to go, and there are many obstacles. One of the first steps to maturity is to give people the means to express how open their data is, how reusable it is. The Open Data Commons is an organisation explicitly set up to tackle the problem of open data licensing. If you are publishing data in any way you ought to check out their licences and see if any meet your goals. If you license your data openly then it will be copied and reused and will have an even greater chance of persisting over the long term.

Hopefully I have given plenty of background to my open data conjecture. I’m eager to hear what you think so please comment or email me directly. You may find these links relevant too:

9 Responses to “Why Open Data Is More Important than Open Source”

  1. Egon Willighagen on 04 Mar 2009 at 7:49 am

    Hi Ian,

    thanx for getting back on this and your detailed explanation of your perspective.

    From my perspective, your arguments seem to be built on two assumptions:

    1. data is static
    2. code is cheap to build

    About 2: in science this is typically not the case. Surely, writing an XML parser is easy enough, and writing a simple text editor likewise. However, much of the scientific code output is the outcome of several years of development. Believe me, it is not cheap to rebuild scientific software. The market of scientific programmers is so small that the cost of rebuilding software is quite high. This is unlike desktop software, where there is a huge user market and redoing it is trivial.

    Unfortunately, it is common to not distinguish between desktop software and scientific software. People typically do not understand what makes development of new scientific algorithms more difficult than writing a desktop tool. A choice of CSV versus XML is a non-issue in scientific software design; what matters is a proper representation of the scientific problem, which requires a lot of domain knowledge and a lot of testing against that knowledge. These scientific algorithms easily outlast a lot of the data around.

    In short, the development of scientific software which reflects new scientific insights is certainly not by default cheaper than measuring the data again.

    About 1: there is another reason why data does not outlast code. Street map data of 100 years ago surely has some historical value, but I leave it to the reader how useful it is to current problems. Likewise, chemical data measured 50 years ago is certainly not as useful now as it was then: measurements we make now are so much more accurate and more precise. Infrared (IR) spectroscopy was the main identification method in organic chemistry until NMR came around. The latter practically made IR obsolete, though it is often still used to back up the presence of a certain chemical fragment. I’m sure this is even more the case in biological sciences.

    Old scientific data is nice, but to use it, you just have to redo your experiment (it’s exponentially cheaper to do the measuring now than when it was originally done), gaining much more information. Who cares about the speed of light measured 100 years ago? That data simply did not last. Scientific data does not last (and I do not think telephone numbers of 5 years ago are useful either). The assumption that data is static and lasts is flawed. The assumption that measuring things is prohibitively expensive is flawed too. Well, if you don’t have code, it surely is.

    Bottom line is that either can be less or more expensive to reproduce: the data and the code. At least in science.

    Egon

  2. robert on 04 Mar 2009 at 8:40 am

    fully agree with your “data outlasts code” statement. even in my own 5 years of programming in a digital library environment, i’ve seen code bit-rot and get replaced, while we are still working with the same data(bases).

    i’d even go a bit further and propose that whenever you have to create code to access/manage some data, this must not take longer than – say – 1/10 of the time necessary to create the data. and maybe we should put a hard limit of 5 years on that, too, because in most parts of the programming world, that’s about the time it takes to make technologies obsolete.

  3. Steve Tolley on 04 Mar 2009 at 12:14 pm

    An interesting and perhaps stretching view given the likely audience for the comments. There certainly seems to be a growing sense of utility in computing provision and as such more fundamental social needs are being debated. SaaS and cloud approaches will hopefully focus more brain power on the philosophical issues of informational purpose in a broader context and over a longer time line.

    You seem more at the evangelist end of the CTO spectrum, and this debate will hopefully become more commonplace and thought provoking.

    Looking forward to more semantic serendipity!

  4. Jonathan Rochkind on 04 Mar 2009 at 2:18 pm

    I’m actually shocked that this is a controversial claim among librarians and library workers. It seems so obviously true to me, _especially_ from the experience of working with library metadata. As you note, we’re still working with data files that have been essentially untouched for years. AACR was published in 1967, and AACR2 in 1978, and most of our catalogs include both AACR and even pre-AACR records in them still.

    Software comes and goes, but data sticks around. The data is actually more expensive to produce or replace than software.

    One of the unfortunate things about much of our data is that it was sort of fitted to the idiosyncrasies of a particular piece of software that was going to be used to display it when it was created. Values were put in certain fields because they would then be displayed (or not displayed) by certain software, in an idiosyncratic ad hoc way. But the data has long outlasted the software whose behaviors it was molded to.

    You’d think catalogers would be pleased to have computer programmers acknowledging that what they do (data control and generation) is more important than what we do!

  5. robert on 04 Mar 2009 at 7:15 pm

    oh, what i should have added: each generation of code should leave the data in a better state than before.

  6. Mike Linksvayer on 05 Mar 2009 at 1:50 am

    Rocks outlive humans. In other words, the argument seems facile.

    What “open” are you referring to in “open data” in the context of this conjecture? As you say, legal restrictions expire, eventually. Do you mean data in open formats and open standards? Those are defined by a legal component (lack of patent encumbrance, which is more certain to expire if present anyway than are copyright restrictions), documentation, and sometimes an open source reference implementation.

    If pressed I’d argue the whole “open” stack requires free/open source software (even if all software isn’t), thus in a sense “open” software is more important than “open” data (whether data is more important than software is a somewhat different argument). For something along those lines (briefly, open formats/standards was a losing battle without source code … the article at the link tries to argue that we’re in danger of losing the battle again as code moves to servers), see http://lists.canonical.org/pipermail/kragen-tol/2006-July/000818.html … note that this is from a person who considers closed formats unethical http://lists.canonical.org/pipermail/kragen-tol/2002-July/000725.html :)

    Or maybe by “open data” you merely mean data which is accessible, e.g. downloadable. Your comments toward the end contrasting having lost your data and having lost your code seem to indicate that this is what you mean. OK, but this again seems rather facile, at least without some context. For example, you could argue that it is more important for governments to make their data available than it is for governments to use open source software.

  7. Michael Roessler on 07 Mar 2009 at 12:26 am

    Data is in the eye of the beholder.

    In my opinion, the following is the most valuable statement among your slides and deserves much attention and discussion:

    “Much of the value in our data will be unexpected and unintended, therefore we should engineer for serendipity.” – Ian Davis

    Data may be important, code may be important, but the ability to incorporate into engineering certain methodologies to enhance the ability to read and understand significance in data is magic – especially in a manner that does not excessively limit potential meaning in data through preconceptions. Data is in the eye of the beholder, at least as related to using data for decision-making.

    How do we ask questions that help us to engineer for serendipity? This, in my opinion, is a hugely significant question for all of us that we ought not obscure through an open data vs open source argument. I think this question should be investigated on its own.

    How do we add value to data by engineering for serendipity?

    This fits in with another comment you made:

    “Network effects arise when the act of participation makes the entire network more useful for everyone.” – Ian Davis

    This is an excellent statement! Most of us know how the example of social networking fits in to this statement. We know that the more people who participate in social networking, the more valuable that network can become for each of us. We know how the web fits in, as for example, your presenting these ideas on a public web page makes the web more useful for each of us. But not enough of us understand networking effects in raw data, especially within a corporate business environment. Not enough of us understand that keeping data in a spreadsheet within one department may add insufficient value to the organization because there can be few serendipitous value creations since few people will be exposed to the data. Placing the spreadsheet on an intranet might not be the most efficient method of producing the network effects that are possible.

    Combining network effects with engineering for serendipity is a good recipe for changing the world by making data not only open, but also useful and positioning it squarely in the eye of the beholder.

    I’d very much like to pursue this further.

  8. Nick Gall on 09 Mar 2009 at 10:17 pm

    Ian, you’ll be happy to know that the adage “data outlasts code” has been around for a while. Here is a quote from 1994: “Awareness is growing throughout the world that data outlasts computer programs, in the same way that an aircraft manual long outlasts any typewriter, …” (see http://books.google.com/books?filter=0&um=1&q=%22long+outlasts+any+typewriter%22&btnG=Search+Books ).

    I also just saw this quote in a circa 1997 Oracle MDM whitepaper: “It has been said that data outlasts applications.” (see http://www.oracle.com/master-data-management/oracle-master-data.pdf ).

  9. Jonah on 13 Mar 2009 at 3:50 am

    Well put. Now I am even more sorry I missed code4lib this year.

    I have been making this argument lately by raising the concern that the free software/culture movement is being outflanked by a land grab for data. Free software is only one corner piece of this puzzle – to complete the jigsaw we need the corners of free data, in a free format. If I can get my data back out in a free format, do I really care which calendar service I am using (if the data is meant to be public and shared, that is)? As the applications become commodities, data is king.

    http://alchemicalmusings.org/2006/03/12/saints-in-the-church-of-writely/