Mon, Aug 10, 2009

Representing Time in RDF Part 1

Way back in 2006 I wrote a blog post concerning the modelling of time in RDF (see Refactoring Bio With Einstein Part 3: Temporal Invariants). That post also provoked some discussion in the blogosphere. Although I haven't written anything on the subject for the past three years, I haven't stopped thinking about it. In fact I've been working quite hard on the problem, mainly by modelling real data, especially geographical information. This is the first of a series of blog posts describing my experiments. I'd like to thank Leigh Dodds and Jeni Tennison, who gave me valuable feedback on an earlier version of this write-up.

In a comment to my blog post Chris Mungall made an excellent point about the importance of solving the time problem:

However, it's also seems clear to me that this is a recipe for trouble for the semantic web. Surely all real-world data that concerns non-trivial applications such as science and electronic health records, or any kind of human activity _must_ take time into account? Which ever hack you make to account for time, it has to propagate through all your ontologies. An ontology that treats the world as time-slices can't interoperate with one that has a standard view of objects and processes. It may be just about workable, but I can't see it being anything other than tremendously complicated. We'll essentially end up with layering 3-place relations on top of RDF in an extremely inelegant way.

This is not made clear when people are lured into the semantic web with examples of toy ontologies about pizzas that live floating in some mathematical space untroubled by time. Unless more is done to address these issues (and I commend this article for tackling this) the semantic web will face a huge backlash when people start realising they have to warp their ontologies and refactor their instance data to deal with time in order to represent real entities. Why is there no best practices document on representing instances that vary in time (that is, all real-world instances)? I do find it curious that more people aren't making noise about this problem - I can only conclude that there's a dearth of serious applications using RDF or OWL for instance data.

Those comments are still true today, and in fact they are being accentuated by the wide availability of data brought about by the successful Linked Data project. For example, dbpedia, freebase and geonames all have descriptions of London (in England) and their URIs are all declared to be owl:sameAs one another.

These descriptions assert population figures for London of 7,355,400, 7,421,209 and 7,512,400 respectively. Since all these resources are owl:sameAs one another, I have three different populations for exactly the same thing with no temporal context (to be fair, both freebase and dbpedia do attempt to assign a date, but they both say it's for the year 2006). Perhaps they are all correct but were taken at different times, or perhaps they actually refer to slightly different definitions of "London". Whatever the cause, the effect is that the data is not particularly useful. It would be helpful for them to indicate when the measurement of population took place. This is not intended as a criticism of the LOD project but to demonstrate that simplistic modelling of data that ignores time can quickly produce unhelpful results.
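The effect can be sketched in a few lines of Python. This is only an illustration using plain tuples rather than a real triple store: the URIs and the ex:population property are placeholders made up for the example, and only the three figures come from the descriptions above. Once the owl:sameAs links are followed, all three values end up attached to one thing with nothing to distinguish them.

```python
# Illustrative triples. The URIs and ex:population are placeholders,
# not the identifiers actually used by the three datasets.
triples = [
    ("dbpedia:London", "ex:population", 7355400),
    ("freebase:London", "ex:population", 7421209),
    ("geonames:London", "ex:population", 7512400),
    ("dbpedia:London", "owl:sameAs", "freebase:London"),
    ("freebase:London", "owl:sameAs", "geonames:London"),
]


def build_canonicaliser(triples):
    """Return a function mapping each URI to one representative of its
    owl:sameAs equivalence class (a small union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)
    return find


def merged_values(triples, prop):
    """Collect every value of `prop` per sameAs equivalence class."""
    find = build_canonicaliser(triples)
    merged = {}
    for s, p, o in triples:
        if p == prop:
            merged.setdefault(find(s), set()).add(o)
    return merged


populations = merged_values(triples, "ex:population")
for values in populations.values():
    print(sorted(values))
```

A single resource comes back carrying three population values, and nothing in the data says which one applies when.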

As another example of the kind of data that I should be able to model in RDF, consider the city of Istanbul. During its long and varied history it has been named Byzantium, New Rome, Constantinople and Stamboul (see Wikipedia's page on the names of Istanbul). At various times it has been the capital city of the Roman Empire, the Byzantine/Eastern Roman Empire (twice), the Latin Empire, the Ottoman Empire and modern Turkey, and of course its extent has varied considerably over that period of time too.

No existing geographic ontology can model that variation in properties accurately enough for me to write a query to return the name of that city during the Sixth Crusade.

My main requirements for modelling time are:

to be able to query the properties of and relations between entities at any point in time
to be able to sequence data in relative terms such as before, after and during
not to extend the RDF triple model beyond possibly allowing named graphs
not to require changes to existing RDF schemas
to avoid duplication of data

In this post I'm going to explore some of the possible solutions to this modelling problem. I take four main approaches:

Conditions, which were my invention to model the state of being of an individual at a point in time (basically time slices like CYC sub abstractions).
Named graphs, with one graph containing time interval information about the other graphs.
Reification of all triples and attaching time interval information to the reified statements.
N-ary relations representing contexts.
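To give a feel for the second of these, here is a minimal Python sketch of the named-graph idea applied to the Istanbul example. It is one possible reading under stated assumptions, not the model worked out in the later posts: the graph names, the ex:validFrom and ex:validUntil properties and the ex:city URI are all hypothetical, the years are rough and illustrative, and only three of the city's names are included.

```python
# Each named graph holds triples that were true together.
# All URIs and property names here are hypothetical placeholders.
graphs = {
    "g:byzantium": [("ex:city", "ex:name", "Byzantium")],
    "g:constantinople": [("ex:city", "ex:name", "Constantinople")],
    "g:istanbul": [("ex:city", "ex:name", "Istanbul")],
}

# A separate graph describes the other graphs. Years are rough and
# illustrative only; a missing bound means the interval is open-ended.
metadata = [
    ("g:byzantium", "ex:validUntil", 330),
    ("g:constantinople", "ex:validFrom", 330),
    ("g:constantinople", "ex:validUntil", 1930),
    ("g:istanbul", "ex:validFrom", 1930),
]


def interval_of(graph_name):
    """Read the (start, end) years for a graph from the metadata graph."""
    start = end = None
    for s, p, o in metadata:
        if s == graph_name and p == "ex:validFrom":
            start = o
        elif s == graph_name and p == "ex:validUntil":
            end = o
    return start, end


def values_at(subject, predicate, year):
    """Return objects of matching triples from graphs valid in `year`.
    Intervals are treated as half-open: start <= year < end."""
    results = []
    for name, triples in graphs.items():
        start, end = interval_of(name)
        if (start is None or start <= year) and (end is None or year < end):
            results.extend(
                o for s, p, o in triples if s == subject and p == predicate
            )
    return results


# The Sixth Crusade began in 1228.
names_in_1228 = values_at("ex:city", "ex:name", 1228)
print(names_in_1228)
```

The triples themselves stay unchanged; all of the temporal information lives in the graph that describes the other graphs.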

I'm going to illustrate the various approaches using three scenarios all drawn from problems in the genealogy field which happens to be both an enduring interest of mine and a minefield for time-insensitive applications:

In the first, a woman is born as "Maria Smith" in 1867 and marries "Richard Johnson" in 1888. I want to write a SPARQL query that gives me her name so I can find her in the 1891 census.
For scenario 2, imagine that I discover an ancestor in the 1861 census who claims to have been born in Widford, Gloucestershire. However, when I check I find three Widfords: one in Essex, one in Hertfordshire and one in Oxfordshire. Has there been an error in the census transcription? The explanation is that prior to 1844 the Oxfordshire Widford was actually in Gloucestershire. I want to write a SPARQL query that finds out which county the parish was in when the 1841 census was being taken.
The final scenario is where I have records of the addresses that a person has lived at. I don't have precise dates for the moves between them because the information has been derived from locating that person in public records. I know, for instance, that in 1870 this person lived in Lyme Regis, Dorset; in 1871 they were in Charmouth, Dorset; and in 1881 they were in Hastings, Sussex. Given that information, where is the most likely place to look for them in 1874? Obviously, in the absence of any other information, I would start looking in Charmouth and, if that proved fruitless, I would move on to Hastings. Can I write a SPARQL query to give me that ordering of possibilities?
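The ordering I have in mind for the final scenario can be written down outside SPARQL as a simple heuristic. The Python sketch below is only that, a heuristic, and it assumes the person never returned to an earlier address: it offers the last place they were seen at or before the target year, followed by the first place they were seen afterwards. The function and variable names are made up for illustration.

```python
# Dated sightings of the person, taken from the scenario above.
sightings = [
    (1870, "Lyme Regis, Dorset"),
    (1871, "Charmouth, Dorset"),
    (1881, "Hastings, Sussex"),
]


def places_to_search(sightings, year):
    """Return candidate places for `year`, most likely first.

    Assumes the person did not move back to an earlier address, so only
    the sightings that bracket the target year are considered.
    """
    ordered = sorted(sightings)
    before = [place for y, place in ordered if y <= year]
    after = [place for y, place in ordered if y > year]

    candidates = []
    if before:
        candidates.append(before[-1])  # last known place at or before year
    if after and after[0] not in candidates:
        candidates.append(after[0])    # next known place after year
    return candidates


search_order = places_to_search(sightings, 1874)
print(search_order)
```

Lyme Regis is left out for 1874 because the 1871 sighting in Charmouth supersedes it under the no-return assumption.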

For completeness, there were approaches that I didn't consider in detail:

Temporal RDF introduces a fourth time component to the triple. I chose not to cover this approach in a lot of detail because it extends the RDF model in a way that no current triple store implements and it requires a numeric time to be associated with each triple, preventing relative times from being expressed.

It is worth noting that the scenarios analysed in these posts are very specialist. Most data modelling is only concerned with "The Now". The data and corresponding queries I show in the following posts are quite convoluted and don't reflect usual usage of RDF. This would likely be true of any data representation format that attempted to model time-varying properties of arbitrary things.

This post is part 1 of a series about representing time in RDF. All posts in this series: part 1, part 2, part 3, part 4, part 5 and part 6

Permalink: http://blog.iandavis.com/2009/08/representing-time-in-rdf-part-1/

Internet Alchemy
