Archive for the 'Ideas' Category

Oct 05 2009

A Wild Alternative to Postcode Data

Published by Ian Davis under Ideas

So the Royal Mail are targeting threats to their monopoly on the postcode data. Naturally the blogosphere is outraged, and most arguments take the stance that this is data created by a publicly owned body and that it should belong to the nation. Morally that may be true, but politically it is a very different story. Successive governments have encouraged organisations like the Royal Mail, Ordnance Survey and British Library to recoup a certain level of their costs through data licensing. We now know that stance is untenable in the face of the disruption to costs of production and distribution brought by the Web, but dinosaurs take a long time to adapt.

There have been several attempts to circumvent the Royal Mail’s monopoly by crowdsourcing the data. FreeThePostcode is one approach which has geocoded about 8000 postcodes out of 1.6M. This is after several years’ effort, so it’s not clear that this is a viable approach. I’m not even sure that the Post Office would have no claim on it even if the data is completely crowdsourced. Postcodes aren’t natural facts. They are artificial, created and assigned by the Post Office. I don’t know if that makes a real difference, but there’s enough doubt in my mind to make me worry about it.

I wonder if trying to replicate the database is simply the wrong approach. Consider OpenStreetMap: they didn’t set out to replicate the Ordnance Survey’s maps, they set out to build an entirely new map, one free from IPR claims. Their map can be used like the Ordnance Survey maps but they are entirely independent of them.

Here’s my wild idea: create a new postcode system from scratch.

It could be very simple and, because it would be open data from the start, it could have a real connection to the web from day one. Maybe it could be based on some algorithmic coding of data from OpenStreetMap, and we could make it as granular as we like, even down to the exact house. Open data would allow hundreds of derived services to exist that are stifled by the grasp of the Post Office today.
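To make "algorithmic coding" a little more concrete, here is a minimal sketch of one possible scheme. It is not a proposal for the actual system: it simply borrows the well-known geohash technique, which repeatedly halves the longitude and latitude ranges and packs the resulting bits into base-32 characters. The function name `open_postcode` and the default length are assumptions made up for this illustration, and the coordinates could come from any source, OpenStreetMap included.

```python
# Illustrative only: a geohash-style encoder standing in for an "open postcode".
# The name open_postcode and the default length are hypothetical choices.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def open_postcode(lat, lon, length=9):
    """Encode a latitude/longitude pair as a geohash string of `length` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0          # accumulates 5 bits before emitting one character
    bit_count = 0
    use_lon = True    # geohash alternates bits, starting with longitude

    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


# Longer codes refine shorter ones, so granularity is just a matter of length.
print(open_postcode(57.64911, 10.40744, 5))  # u4pru
```

A useful property of this kind of code is that a shorter code is always a prefix of a longer one for the same point, so the same scheme could cover a district with a few characters and a single house with a few more.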

Whenever you write a postcode on a letter add the open postcode on the next line – no harm done to anyone and a little bit more value added. If the open postcode were a number then it could be printed as a barcode on letters – a simple innovation that the closed attitude of the Post Office has prevented from happening.

A wild idea….!


Mar 04 2009

Why Open Data Is More Important than Open Source

Published by Ian Davis under Ideas

Last week I delivered the keynote for the final day of code4lib 2009. This was a particular honour because, unlike many conferences, the keynote speakers are proposed and voted on by the code4lib community. So, rather than keynote speakers being used to draw people to the conference, the community draws the speakers to them.

I chose to present on a topic that is close to my heart, to my company’s vision and, I hoped, of interest to the audience: freedom. For this conference I chose a specific expression of freedom that I thought would be of particular interest to a community deeply entrenched in metadata. The title of my keynote was “If you love something… set it free”.

The conference organisers videoed all the sessions so hopefully I can link to a more informative version. As usual with this kind of presentation I wrote copious prompt notes in case I dried up, then found I couldn’t read them and carried on regardless :)

I hope to come back to various points raised in my presentation over time, but right now I want to focus on one area that has sparked a good deal of debate (such as here, here and here with much twittering too). Right in the middle of the presentation I offered three conjectures, the first of which was that data outlasts code, which led me to assert that open data is therefore more important than open source. This appears to be controversial.

First, it’s important to note what I did not say. I did not say that open source is not important. On the contrary, I said that open source was extremely important and that it has sounded the death knell for proprietary software. Later speakers at the conference referred to this statement as controversial too :) . (What I actually meant to say was that open source has sounded the death knell for proprietary software models.) I also mentioned that open source and free software have a long history and that open data is where open source was 25 years ago (I am using the terms open source and free software interchangeably here).

I also did not say that code does not last, nor that algorithms do not last. Of course they last, but data lasts longer. My point was that code is tied to processes usually embodied in hardware, whereas data is agnostic to the hardware it resides on. The audience at the conference understand this already: they are archivists and librarians and they deal with data formats like MARC, which has had superb longevity. Many of them deal with records every day that are essentially the same as they were two or three decades ago. Those records have gone through multiple generations of code to parse and manipulate the data.

In a recent post Egon Willighagen criticised my conjecture:

Ian Davis was recently quoted saying open data is more important than open source, which was pulled (out of context) from this presentation. The context was (a slide earlier): Data outlasts code.

As far as I can see, this is utter nonsense, even within context of the slide (see also this discussion on FriendFeed). Obviously, within the context of Ian it does makes sense, and I hope he will respond in his blog and explain why he thinks Open Data is more special.

Without code, you have no way of accessing the data. Ask anyone to recover from a hard disk failure. In ODOSOS (Open Standards, Open Data, Open Source) they are all equal. You need them all for progress. You cannot single out one as being more important than another. Why would you anyway? Politics is all I can think of… All three combine and ensure our science is more efficient.

I think the flaw in this argument is this statement: “Without code, you have no way of accessing the data.” It’s true that you need code to access data, but critically it doesn’t have to be the same code from year to year, decade to decade, century to century. Any code capable of reading the data will do, even if it’s proprietary. You can also recreate the code whereas the effort involved in recreating the data could be prohibitively high. This is, of course, a strong argument for open data formats with simple data models: choosing CSV, XML or RDF is going to give you greater data longevity than PDF, XLS or PST because the cost of recreating the parsing code is so much lower.
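To give a feel for how low that cost can be, here is a minimal sketch of a CSV reader written from scratch, with no libraries at all. It is an illustration rather than a complete implementation: the function name `parse_csv` is made up for this example, it assumes comma separators and double-quoted fields with doubled quotes as escapes, and it makes no attempt to validate malformed input. In practice you would reach for an existing parser, but the point is that you could rebuild one in an afternoon if you had to.

```python
# Illustrative only: a from-scratch CSV reader to show how little code is
# needed to recover data from a simple open format. parse_csv is a
# hypothetical name; malformed input is not validated.

def parse_csv(text):
    """Parse CSV text into a list of rows, each a list of strings."""
    rows, row, field = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"')   # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False   # closing quote
            else:
                field.append(ch)        # commas and newlines are data here
        elif ch == '"':
            in_quotes = True
        elif ch == ",":
            row.append("".join(field))
            field = []
        elif ch == "\n":
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        elif ch != "\r":                # tolerate Windows line endings
            field.append(ch)
        i += 1
    if field or row:                    # last line without a trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows


print(parse_csv('id,title\n1,"Hello, ""world"""\n'))
```

Writing the equivalent for a binary format like XLS or PST would mean first reverse engineering or obtaining the format's specification, which is where the real cost lies.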

Here’s the central asymmetry that leads me to conclude that open data is more important than open source: if you have data without code then you could write a program to extract information from the data, but if you have code without data then you have lost that information forever.

Consider also the rise of software as a service. It really doesn’t matter whether the code these services are built on is open source or not if you cannot access the data they manage for you. Even if you reproduce the service completely, using the same components, your data is buried away out of your reach. However, if you have access to the data then you can achieve continuity even if you don’t have access to the underlying source of the application. I’ll say it again: open data is more important than open source.

Of course we want open standards, open source and open data. But in one or two hundred years which will still be relevant? Patents and copyrights on formats expire, hardware platforms and even their paradigms shift and change. Data persists, open data endures.

The problem we have today is that the open data movement is in its infancy when compared to open source. We have so far to go, and there are many obstacles. One of the first steps to maturity is to give people the means to express how open their data is, how reusable it is. The Open Data Commons is an organisation explicitly set up to tackle the problem of open data licensing. If you are publishing data in any way you ought to check out their licences and see if any meet with your goals. If you license your data openly then it will be copied and reused and will have an even greater chance of persisting over the long term.

Hopefully I have given plenty of background to my open data conjecture. I’m eager to hear what you think so please comment or email me directly. You may find these links relevant too:
