Representing Time in RDF Part 1

10 August 2009 by Ian Davis

Way back in 2006 I wrote a blog post concerning the modelling of time in RDF (see Refactoring Bio With Einstein Part 3: Temporal Invariants). That post also provoked some discussion in the blogosphere. Although I haven’t written anything on the subject for the past three years, I haven’t stopped thinking about it. In fact I’ve been working quite hard on the problem, mainly by modelling real data, especially geographical information. This is the first of a series of blog posts describing my experiments. I’d like to thank Leigh Dodds and Jeni Tennison, who gave me valuable feedback on an earlier version of this write-up.

In a comment on that post, Chris Mungall made an excellent point about the importance of solving the time problem:

However, it’s also seems clear to me that this is a recipe for trouble for the semantic web. Surely all real-world data that concerns non-trivial applications such as science and electronic health records, or any kind of human activity _must_ take time into account? Which ever hack you make to account for time, it has to propagate through all your ontologies. An ontology that treats the world as time-slices can’t interoperate with one that has a standard view of objects and processes. It may be just about workable, but I can’t see it being anything other than tremendously complicated. We’ll essentially end up with layering 3-place relations on top of RDF in an extremely inelegant way.

This is not made clear when people are lured into the semantic web with examples of toy ontologies about pizzas that live floating in some mathematical space untroubled by time. Unless more is done to address these issues (and I commend this article for tackling this) the semantic web will face a huge backlash when people start realising they have to warp their ontologies and refactor their instance data to deal with time in order to represent real entities. Why is there no best practices document on representing instances that vary in time (that is, all real-world instances)? I do find it curious that more people aren’t making noise about this problem – I can only conclude that there’s a dearth of serious applications using RDF or OWL for instance data.

Those comments are still true today, and in fact they are being accentuated by the wide availability of data brought about by the successful Linked Data project. For example, dbpedia, freebase and geonames all have descriptions of London (in England), and their URIs are all declared to be owl:sameAs one another.

These descriptions assert population figures for London of 7,355,400, 7,421,209 and 7,512,400 respectively. Since all these resources are owl:sameAs one another, I have three different populations for exactly the same thing with no temporal context (to be fair, both freebase and dbpedia do attempt to assign a date, but they both say it’s for the year 2006). Perhaps they are all correct but were taken at different times, or perhaps they actually refer to slightly different definitions of “London”. Whatever the cause, the effect is that the data is not particularly useful. It would be helpful for them to indicate when the measurement of population took place. This is not intended as a criticism of the LOD project but to demonstrate that simplistic modelling of data that ignores time can quickly produce unhelpful results.
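To make the problem concrete, here is a minimal Python sketch (not part of the original analysis) of what happens when a consumer takes the owl:sameAs links at face value. The identifiers `dbpedia:London`, `freebase:London`, `geonames:London` and the property `ex:population` are placeholders assumed for illustration, not the real URIs.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"
POPULATION = "ex:population"  # hypothetical property name

# Placeholder identifiers; the three figures are the ones quoted above.
triples = [
    ("dbpedia:London", SAME_AS, "freebase:London"),
    ("freebase:London", SAME_AS, "geonames:London"),
    ("dbpedia:London", POPULATION, 7355400),
    ("freebase:London", POPULATION, 7421209),
    ("geonames:London", POPULATION, 7512400),
]


def merge_same_as(triples):
    """Collapse subjects linked by owl:sameAs into one canonical node."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    merged = defaultdict(set)
    for s, p, o in triples:
        if p != SAME_AS:
            merged[(find(s), p)].add(o)
    return merged


merged = merge_same_as(triples)
populations = sorted(v for values in merged.values() for v in values)
print(len(merged), populations)
```

After merging there is a single resource carrying three population values, and nothing in the data says which one applies when.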

As another example of the kind of data that I should be able to model in RDF, consider the city of Istanbul. During its long and varied history it has been named Byzantium, New Rome, Constantinople and Stamboul (see Wikipedia’s page on the names of Istanbul). At various times it has been the capital city of the Roman Empire, the Byzantine/Eastern Roman Empire (twice), the Latin Empire, the Ottoman Empire and modern Turkey, and of course its extent has varied considerably over that period of time too.

No existing geographic ontology can model that variation in properties accurately enough for me to write a query to return the name of that city during the Sixth Crusade.

My main requirements for modelling time are:

  • to be able to query the properties of and relations between entities at any point in time

  • to be able to sequence data in relative terms such as before, after and during

  • not to extend the RDF triple model beyond possibly allowing named graphs

  • not to require changes to existing RDF schemas

  • avoid duplication of data
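The first two requirements can be illustrated outside RDF with a small Python sketch of interval relations. This is only an assumed, simplified model (closed intervals of whole years); the class `Interval` and its methods are hypothetical names, and the dates are supplied here for illustration rather than taken from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed interval of whole years (an assumed, simplified granularity)."""
    start: int
    end: int

    def contains(self, year: int) -> bool:
        # Point-in-time test: does the interval cover this year?
        return self.start <= year <= self.end

    def before(self, other: "Interval") -> bool:
        return self.end < other.start

    def after(self, other: "Interval") -> bool:
        return other.before(self)

    def during(self, other: "Interval") -> bool:
        return other.start <= self.start and self.end <= other.end


# Illustrative dates, not from the post.
latin_empire = Interval(1204, 1261)
sixth_crusade = Interval(1228, 1229)
fall_of_constantinople = Interval(1453, 1453)

print(sixth_crusade.during(latin_empire))
print(latin_empire.before(fall_of_constantinople))
print(latin_empire.contains(1228))
```

Any RDF modelling that meets the requirements has to make the equivalent of `contains`, `before` and `during` answerable from the data, ideally without every statement carrying absolute dates.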

In this post I’m going to explore some of the possible solutions to this modelling problem. I take four main approaches:

  1. Conditions were my invention to model the state of being of an individual at a point in time (basically time slices like CYC sub abstractions).

  2. Named graphs, with one graph containing time interval information about the other graphs.

  3. Reification of all triples and attaching time interval information to the reified statements.

  4. N-ary relations representing contexts.

I’m going to illustrate the various approaches using three scenarios, all drawn from problems in the genealogy field, which happens to be both an enduring interest of mine and a minefield for time-insensitive applications:

  1. In the first a woman is born as “Maria Smith” in 1867 and marries “Richard Johnson” in 1888. I want to write a SPARQL query that gives me her name so I can find her in the 1891 census.

  2. For scenario 2 imagine that I discover an ancestor in the 1861 census who claims to have been born in Widford, Gloucestershire. However, when I check I find three Widfords: one in Essex, one in Hertfordshire and one in Oxfordshire. Has there been an error in the census transcription? The explanation is that prior to 1844 the Oxfordshire Widford was actually in Gloucestershire. I want to write a SPARQL query that finds out which county the parish was in when the 1841 census was being taken.

  3. The final scenario is where I have records of the addresses that a person has lived at. I don’t have precise dates for the moves between them because the information has been derived from locating that person in public records. I know, for instance, that in 1870 this person lived in Lyme Regis, Dorset; in 1871 they were in Charmouth, Dorset; and in 1881 they were in Hastings, Sussex. Given that information, where is the most likely place to look for them in 1874? Obviously, in the absence of any other information, I would start looking in Charmouth and, if that proved fruitless, I would move on to Hastings. Can I write a SPARQL query to give me that ordering of possibilities?
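The ordering wanted in the third scenario can be pinned down in a few lines of plain Python. This is a sketch of the intended result under an assumed heuristic (most recent sighting at or before the target year first, then the next sighting after it, then everything else); it says nothing about whether SPARQL can express it. The function name `rank_places` is hypothetical.

```python
# (year, place) sightings taken from the scenario above.
sightings = [
    (1870, "Lyme Regis, Dorset"),
    (1871, "Charmouth, Dorset"),
    (1881, "Hastings, Sussex"),
]


def rank_places(sightings, target_year):
    """Order places to search for a person in target_year.

    Assumed heuristic: the latest sighting at or before the target year
    comes first, then the earliest sighting after it, then the remaining
    earlier and later sightings.
    """
    earlier = sorted(
        (s for s in sightings if s[0] <= target_year),
        key=lambda s: s[0],
        reverse=True,
    )
    later = sorted(
        (s for s in sightings if s[0] > target_year),
        key=lambda s: s[0],
    )
    ordered = earlier[:1] + later[:1] + earlier[1:] + later[1:]

    places = []
    for _, place in ordered:
        if place not in places:  # keep the first mention of each place
            places.append(place)
    return places


print(rank_places(sightings, 1874))
```

For 1874 this puts Charmouth first and Hastings second, matching the order described in the scenario, with Lyme Regis as a last resort.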

For completeness, there were approaches that I didn’t consider in detail:

  • Temporal RDF introduces a fourth time component to the triple. I chose not to cover this approach in a lot of detail because it extends the RDF model in a way that no current triple store implements and it requires a numeric time to be associated with each triple, preventing relative times from being expressed.
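As a rough illustration of the idea of a time component on each statement, the sketch below stores triples with a numeric validity range and answers the first scenario. It is a simplification, not the Temporal RDF formalism itself; the names `TemporalTriple`, `values_at`, `ex:maria` and the use of `foaf:name` are assumptions, as is the married name “Maria Johnson”, which the scenario implies but does not state.

```python
from typing import NamedTuple, Optional


class TemporalTriple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    valid_from: int           # first year the statement holds
    valid_to: Optional[int]   # first year it no longer holds; None = open-ended


# Assumed data for scenario 1; the married name is inferred, not stated.
statements = [
    TemporalTriple("ex:maria", "foaf:name", "Maria Smith", 1867, 1888),
    TemporalTriple("ex:maria", "foaf:name", "Maria Johnson", 1888, None),
]


def values_at(statements, subject, predicate, year):
    """Return the objects asserted for subject/predicate in the given year."""
    return [
        t.obj
        for t in statements
        if t.subject == subject
        and t.predicate == predicate
        and t.valid_from <= year
        and (t.valid_to is None or year < t.valid_to)
    ]


print(values_at(statements, "ex:maria", "foaf:name", 1891))
```

Every statement here needs concrete years. A fact known only as “after her marriage”, with no date for the marriage, has nowhere to go, which is the limitation noted above.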

It is worth noting that the scenarios analysed in these posts are very specialist. Most data modelling is only concerned with “The Now”. The data and corresponding queries I show in the following posts are quite convoluted and don’t reflect usual usage of RDF. This would likely be true of any data representation format that attempted to model time-varying properties of arbitrary things.

This post is part 1 of a series about representing time in RDF. All posts in this series: part 1, part 2, part 3, part 4, part 5 and part 6

Thoughts on “Representing Time in RDF Part 1”

  3. Karen Lopez says:

    I’m happy to see that one important aspect of analyzing data has been addressed: that of temporal aspects of data.

    I have also seen the simple RDF and ontology examples that over-simplify models of the real world. Your London population example is perfect. One cannot, with any accuracy or precision, link together data from completely different sources without a common agreement of what those data mean.

    Comparing recorded facts about “London” can vary based on:

      • time

      • definition of “population” (does it include all people who were there on a specific date? Who were residents? Extrapolated via an official census? etc.)

      • definition of “London” (city proper? London area? Region?)

      • data usage (Has the data been adjusted for some specific reason? Has the data been adjusted for quality? Is there an accompanying legend or statistical reference? Have they been rounded? By what method?)

      • politics (are these numbers used in such a way that bias or other political gain might impact them?)

      • source (did these numbers come from a census? a scientific study? a “welcome to London” street sign?)

    I believe that the lessons learned over the last few decades in traditional data modeling/data management can help deal with these questions. I also believe it is time for those showing how nifty it is to link up a bunch of data with none of the above questions answered to realize that they are demoing something quite different than they think they are.

    Demos can and should be simple. But those doing the demo-ing should know what simplifications they have applied.

  4. Ian Davis says:

    Great points, Karen. As more data becomes exposed for machines to process, we need to distinguish carefully between the data that is good enough for quick answers and data that carries precision and context with it. For lots of purposes it’s good enough to know that the population of London is about 7.5 million, but there are times when you need to know what kind of population figure it is (e.g. when calculating unemployment rates).

  5. Bill Roberts says:

    Ian

    Great that you are addressing this topic – it’s something I’ve been thinking a lot about myself the last couple of weeks, particularly the “time-varying property of something” scenario, like your population of London example. This is such a common pattern that I’m really surprised there isn’t already a ‘standard’ way to do it.

    So I look forward to the rest of your posts!

    Bill

  6. Ian Davis says:

    Bill, the other posts are there but the links between them are broken at the moment. You can see them all at http://iandavis.com/blog
