Mon, Aug 10, 2009

Representing Time in RDF Part 3

Approach 2: Named Graphs

In this approach I physically divide my data up into separate graphs. Each graph contains triples that hold true for a specified time interval. One graph is designated as holding the time interval information for all the other graphs.

Scenario 1

The first graph a2s1g1.ttl contains information about the other graphs. Specifically it describes the time periods for which their triples hold true.

@prefix bio: <http://purl.org/vocab/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
thing:maria a foaf:Person .

<a2s1g2.ttl>
    time:start "1867" ;
    time:end "1888" .

<a2s1g3.ttl>
    time:start "1888" ;
    time:end "9999" .

Original file: a2s1g1.ttl

The triples say that the graph a2s1g2.ttl has a start time of 1867 and an end time of 1888, and that a2s1g3.ttl starts in 1888 and ends at an arbitrary maximum date (9999). This should be interpreted to mean that the triples contained in those graphs hold only between those dates. In effect I am saying that a graph is also a time interval, which might be a problem. That could be fixed by introducing a new predicate with the meaning "holds during" that relates a graph to a time interval, and moving the time predicates from the graphs to new interval resources. That would make the examples that follow somewhat more complicated.
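To give a feel for the extra indirection that a "holds during" predicate would add, here is a rough Python sketch that uses plain tuples in place of an RDF store. The predicate name ex:holdsDuring, the resources thing:interval1 and thing:interval2, and the helper functions are all assumptions made for illustration; none of them appear in the original files.

```python
# Default-graph triples after the hypothetical remodelling.
# ex:holdsDuring and the thing:intervalN resources are illustrative
# assumptions, not names taken from the original data.
default_graph = [
    ("a2s1g2.ttl", "ex:holdsDuring", "thing:interval1"),
    ("thing:interval1", "time:start", "1867"),
    ("thing:interval1", "time:end", "1888"),
    ("a2s1g3.ttl", "ex:holdsDuring", "thing:interval2"),
    ("thing:interval2", "time:start", "1888"),
    ("thing:interval2", "time:end", "9999"),
]


def objects(triples, subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def interval_of(graph_name):
    """Follow graph -> interval -> start/end, one hop more than before."""
    (interval,) = objects(default_graph, graph_name, "ex:holdsDuring")
    (start,) = objects(default_graph, interval, "time:start")
    (end,) = objects(default_graph, interval, "time:end")
    return int(start), int(end)


print(interval_of("a2s1g2.ttl"))
```

Under this modelling every query would need the additional hop from the graph to its interval resource before it could test the dates.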

Note that because I'm using named graphs I won't be able to use blank nodes to refer to things across the graphs.

The triples that hold true for Maria before she is married are held in a2s1g2.ttl:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix thing: <http://example.org/thing#> .
thing:maria foaf:name "Maria Smith" .

Original file: a2s1g2.ttl

The triples that hold true after her marriage are in a2s1g3.ttl:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix thing: <http://example.org/thing#> .
thing:maria foaf:name "Maria Johnson" .

Original file: a2s1g3.ttl

I only have one triple in each of these graphs but of course I could assert lots more facts that were true in the specific time periods for each graph.

Overall this approach uses 8 triples compared to 10 used by Approach 1. The SPARQL query has many similarities with Approach 1. The major difference is that the time period filters are applied to the named graphs, not to the conditions. Note that in this query I assume a2s1g1.ttl is the default graph.

prefix foaf: <http://xmlns.com/foaf/0.1/> 
prefix time: <http://www.w3.org/2006/time#> 
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
prefix ex: <http://example.org/ex#> 
prefix thing: <http://example.org/thing#> 
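select ?name where {
  # The default graph gives the interval for which each named graph holds.
  ?g time:start ?start .
  ?g time:end ?end .
  # Keep only the graphs whose interval includes 1891.
  filter (xsd:integer(?start) <= 1891 && xsd:integer(?end) >= 1891) .
  graph ?g {
    thing:maria foaf:name ?name .
  }
}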

Original file: a2s1.sq
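
The effect of the graph filter can be mimicked outside SPARQL. The following is only a rough Python sketch and not part of the original files: the dictionaries intervals and graphs and the function names_at are hypothetical stand-ins for the default graph, the named graphs and the query.

```python
# Hypothetical in-memory stand-in for the Scenario 1 data; not an RDF store.
# The default graph: each named graph mapped to its (start, end) years.
intervals = {
    "a2s1g2.ttl": (1867, 1888),
    "a2s1g3.ttl": (1888, 9999),
}

# The named graphs: each holds (subject, predicate, object) triples.
graphs = {
    "a2s1g2.ttl": [("thing:maria", "foaf:name", "Maria Smith")],
    "a2s1g3.ttl": [("thing:maria", "foaf:name", "Maria Johnson")],
}


def names_at(subject, year):
    """Return foaf:name values from every graph whose interval covers year."""
    found = []
    for graph_name, (start, end) in intervals.items():
        if start <= year <= end:  # the same inclusive test as the filter
            for s, p, o in graphs[graph_name]:
                if s == subject and p == "foaf:name":
                    found.append(o)
    return found


print(names_at("thing:maria", 1891))  # ['Maria Johnson']
```

Because both comparisons are inclusive and both graphs mention 1888, asking for that boundary year matches both graphs and returns both names.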

Scenario 2

In this scenario the data is split into four graphs. a2s2ag1.ttl holds the time interval descriptions for the other graphs.

@prefix bio: <http://purl.org/vocab/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
<a2s2ag2.ttl>
    time:start "1837" ;
    time:end "1844" .

<a2s2ag3.ttl>
    time:start "1844" ;
    time:end "9999" .

<a2s2ag4.ttl>
    time:start "1837" ;
    time:end "9999" .

Original file: a2s2ag1.ttl

a2s2ag2.ttl holds information about Widford being in Gloucestershire:

@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
thing:widford ex:partOf thing:gloucestershire .

Original file: a2s2ag2.ttl

a2s2ag3.ttl holds information about Widford being in Oxfordshire:

@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
thing:widford ex:partOf thing:oxfordshire .

Original file: a2s2ag3.ttl

And finally, a2s2ag4.ttl holds information about the names of the places:

@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
thing:oxfordshire
    a ex:County ;
    foaf:name "Oxfordshire" .

thing:gloucestershire
    a ex:County ;
    foaf:name "Gloucestershire" .

thing:widford
    a ex:Parish ;
    foaf:name "Widford" .

Original file: a2s2ag4.ttl

As can be seen from these examples, the data quickly becomes fragmented across many graphs. The granularity of this fragmentation depends on how frequently information about things changes over time. In the limit you might have one graph per year, per day or even per minute. In practice, however, you are likely to have graphs with long overlapping time intervals.

Naively you might expect the following query to work for us, following the pattern set by Scenario 1:

prefix bio: <http://purl.org/vocab/bio/0.1/> 
prefix foaf: <http://xmlns.com/foaf/0.1/> 
prefix time: <http://www.w3.org/2006/time#> 
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
prefix ex: <http://example.org/ex#> 
prefix thing: <http://example.org/thing#> 
select ?name where {
  ?g time:start ?start .
  ?g time:end ?end .
  filter (xsd:integer(?start) <= 1841 && xsd:integer(?end) >= 1841) .
  graph ?g {
    thing:widford ex:partOf ?x .
    ?x foaf:name ?name .
  }
}

However that only works when both the ex:partOf and the foaf:name triples are in the same named graph. I didn't see this in Scenario 1 because I was only interested in a single triple. Here I am trying to discover a relationship and a name, both of which could change at different times and so may be in different graphs.
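
The failure of the naive query can be reproduced with a small Python sketch. This is an illustration under stated assumptions rather than a real SPARQL evaluation: the dictionaries and the function county_names_same_graph are hypothetical, and the data is copied from the four Scenario 2 graphs.

```python
# Hypothetical in-memory copy of the Scenario 2 data; not an RDF store.
intervals = {
    "a2s2ag2.ttl": (1837, 1844),
    "a2s2ag3.ttl": (1844, 9999),
    "a2s2ag4.ttl": (1837, 9999),
}
graphs = {
    "a2s2ag2.ttl": [("thing:widford", "ex:partOf", "thing:gloucestershire")],
    "a2s2ag3.ttl": [("thing:widford", "ex:partOf", "thing:oxfordshire")],
    "a2s2ag4.ttl": [
        ("thing:oxfordshire", "foaf:name", "Oxfordshire"),
        ("thing:gloucestershire", "foaf:name", "Gloucestershire"),
        ("thing:widford", "foaf:name", "Widford"),
    ],
}


def county_names_same_graph(year):
    """Naive pattern: ex:partOf and foaf:name must match in the SAME graph."""
    names = []
    for g, (start, end) in intervals.items():
        if not (start <= year <= end):
            continue
        for s, p, county in graphs[g]:
            if s == "thing:widford" and p == "ex:partOf":
                # Look for the county's name, but only inside graph g.
                names += [
                    o for s2, p2, o in graphs[g]
                    if s2 == county and p2 == "foaf:name"
                ]
    return names


print(county_names_same_graph(1841))  # []
```

For 1841 both a2s2ag2.ttl and a2s2ag4.ttl pass the time test, but the ex:partOf triple sits in the first and the county's foaf:name in the second, so no single graph satisfies both patterns.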

So the working query is a little more complex:

prefix bio: <http://purl.org/vocab/bio/0.1/> 
prefix foaf: <http://xmlns.com/foaf/0.1/> 
prefix time: <http://www.w3.org/2006/time#> 
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
prefix ex: <http://example.org/ex#> 
prefix thing: <http://example.org/thing#> 
select ?name where {
  # Find a graph that holds in 1841 and contains the ex:partOf triple.
  ?g time:start ?start .
  ?g time:end ?end .
  filter (xsd:integer(?start) <= 1841 && xsd:integer(?end) >= 1841) .
  graph ?g {
    thing:widford ex:partOf ?x .
  }
  # Repeat the time test for the graph that contains the foaf:name triple.
  ?g2 time:start ?start2 .
  ?g2 time:end ?end2 .
  filter (xsd:integer(?start2) <= 1841 && xsd:integer(?end2) >= 1841) .
  graph ?g2 {
    ?x foaf:name ?name .
  }
}

Original file: a2s2a.sq

This shows a major disadvantage of the named graph approach because I would have to repeat the time filtering of graphs for each triple I wanted to find!

Scenario 3

Once again this graph (a2s3g1.ttl) contains time information about the other graphs. For this example I have chosen to include some facts about the places here too. By doing this I am basically saying that they are timeless facts which helps simplify this approach.

@prefix bio: <http://purl.org/vocab/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
thing:lymeRegis
    a ex:Town ;
    foaf:name "Lyme Regis" .

thing:charmouth
    a ex:Town ;
    foaf:name "Charmouth" .

thing:hastings
    a ex:Town ;
    foaf:name "Hastings" .

<a2s3g2.ttl>
    time:intervalBefore <a2s3g3.ttl> ;
    time:intervalContains "1844" .

<a2s3g3.ttl>
    time:intervalAfter <a2s3g2.ttl> ;
    time:intervalBefore <a2s3g4.ttl> ;
    time:intervalContains "1871" .

<a2s3g4.ttl>
    time:intervalAfter <a2s3g3.ttl> ;
    time:intervalContains "1881" .

Original file: a2s3g1.ttl

In this case I am using relative times for the graphs, saying that the triples in a2s3g2.ttl hold before the triples in a2s3g3.ttl which in turn hold true before the triples in a2s3g4.ttl.

The triples that assert my person lived in Lyme Regis are held in a2s3g2.ttl:

@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
thing:anon ex:residence thing:lymeRegis .

Original file: a2s3g2.ttl

The triples that assert my person lived in Charmouth are held in a2s3g3.ttl:

@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
thing:anon ex:residence thing:charmouth .

Original file: a2s3g3.ttl

And the triples for Hastings are in a2s3g4.ttl:

@prefix ex: <http://example.org/ex#> .
@prefix thing: <http://example.org/thing#> .
thing:anon ex:residence thing:hastings .

Original file: a2s3g4.ttl

Now the query looks like this:

prefix bio: <http://purl.org/vocab/bio/0.1/> 
prefix foaf: <http://xmlns.com/foaf/0.1/> 
prefix time: <http://www.w3.org/2006/time#> 
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
prefix ex: <http://example.org/ex#> 
prefix thing: <http://example.org/thing#> 
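select ?nameBefore ?nameAfter where {
  # A graph known to contain a date on or before 1874.
  ?gBefore time:intervalContains ?dateBefore .
  filter (xsd:integer(?dateBefore) <= 1874) .
  # A graph known to contain a date after 1874.
  ?gAfter time:intervalContains ?dateAfter .
  filter (xsd:integer(?dateAfter) > 1874) .
  # The earlier graph must be asserted to come before the later one.
  ?gBefore time:intervalBefore ?gAfter .
  graph ?gBefore {
    thing:anon ex:residence ?placeBefore .
  }
  graph ?gAfter {
    thing:anon ex:residence ?placeAfter .
  }
  # Place names are timeless, so they are read from the default graph.
  ?placeBefore foaf:name ?nameBefore .
  ?placeAfter foaf:name ?nameAfter .
}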

Original file: a2s3.sq

The first two clauses find graphs that might contain relevant triples. The first looks for a graph that is described with a time:intervalContains value less than or equal to 1874. The second does the same for graphs with a value after that date. The query then requires the first graph to come before the second and uses the two graphs to look up residence information for the person at those times.

I have a suspicion that this query will bring back more results than are necessary if I had more data. I may have several addresses after 1874 and this query will bring them all back, not just the first.
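
One way to probe that suspicion is to evaluate the same pattern over a slightly larger dataset. The Python sketch below is an approximation under stated assumptions: a2s3g5.ttl and the place "Elsewhere" are hypothetical additions, place names are attached directly to graphs for brevity, and the function names are invented for the example.

```python
# Year recorded with time:intervalContains for each graph.
contains = {
    "a2s3g2.ttl": 1844,
    "a2s3g3.ttl": 1871,
    "a2s3g4.ttl": 1881,
    "a2s3g5.ttl": 1891,  # hypothetical extra graph, not in the original data
}
# Name of the place of residence asserted in each graph.
residence = {
    "a2s3g2.ttl": "Lyme Regis",
    "a2s3g3.ttl": "Charmouth",
    "a2s3g4.ttl": "Hastings",
    "a2s3g5.ttl": "Elsewhere",  # hypothetical
}
# time:intervalBefore asserted only between neighbouring graphs.
adjacent_only = {
    ("a2s3g2.ttl", "a2s3g3.ttl"),
    ("a2s3g3.ttl", "a2s3g4.ttl"),
    ("a2s3g4.ttl", "a2s3g5.ttl"),
}


def transitive_closure(pairs):
    """Add every (a, c) implied by (a, b) and (b, c) until nothing changes."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


def moves_around(year, before_pairs):
    """Mimic the query: date <= year in one graph, date > year in a later one."""
    rows = []
    for g_before, g_after in sorted(before_pairs):
        if contains[g_before] <= year < contains[g_after]:
            rows.append((residence[g_before], residence[g_after]))
    return rows


print(moves_around(1874, adjacent_only))
print(len(moves_around(1874, transitive_closure(adjacent_only))))
```

In this sketch the adjacent-only links yield a single row (Charmouth, Hastings), while the transitively closed links yield four. That suggests the extra results would appear when time:intervalBefore is also asserted or inferred between graphs that are not neighbours.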

It's worth noting that this query would be even more complex if I had not chosen to treat the names of the places as timeless data. If they were in separate graphs then I would have to add additional clauses to find graphs that hold at the specified time, just like I did in Scenario 2.

Approach 2 Conclusion

Named graphs appear to partition the data very nicely. However it seems that they don't make the querying any simpler. If it were possible to define a merge of all possible graphs that cover the time interval of interest and query that directly then it could be possible to write very natural queries and completely ignore the time component. This could be possible with a two-phase approach to running the queries or perhaps SPARQL sub-queries might help.
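
The two-phase idea can be sketched in a few lines of Python, using the Scenario 2 data. This is a minimal illustration under stated assumptions rather than a SPARQL implementation: the dictionaries and the functions merged_graph_at and county_name are hypothetical.

```python
# Hypothetical in-memory copy of the Scenario 2 data; not an RDF store.
intervals = {
    "a2s2ag2.ttl": (1837, 1844),
    "a2s2ag3.ttl": (1844, 9999),
    "a2s2ag4.ttl": (1837, 9999),
}
graphs = {
    "a2s2ag2.ttl": [("thing:widford", "ex:partOf", "thing:gloucestershire")],
    "a2s2ag3.ttl": [("thing:widford", "ex:partOf", "thing:oxfordshire")],
    "a2s2ag4.ttl": [
        ("thing:oxfordshire", "foaf:name", "Oxfordshire"),
        ("thing:gloucestershire", "foaf:name", "Gloucestershire"),
        ("thing:widford", "foaf:name", "Widford"),
    ],
}


def merged_graph_at(year):
    """Phase 1: union the triples of every graph whose interval covers year."""
    merged = []
    for g, (start, end) in intervals.items():
        if start <= year <= end:
            merged.extend(graphs[g])
    return merged


def county_name(triples):
    """Phase 2: a query over the merged triples that never mentions time."""
    return [
        name
        for s, p, county in triples
        if s == "thing:widford" and p == "ex:partOf"
        for s2, p2, name in triples
        if s2 == county and p2 == "foaf:name"
    ]


print(county_name(merged_graph_at(1841)))  # ['Gloucestershire']
print(county_name(merged_graph_at(1850)))  # ['Oxfordshire']
```

The time test is written once, in the first phase, and the second phase reads like an ordinary query over a single graph.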

The main problem with named graphs is that they lie outside of the standard RDF model. In fact they are only really formalised by the SPARQL specification. There are no standardised serialisations for named graph data, so it is not generally possible to query a SPARQL service and retrieve the named graph information. The TriG and TriX serialisations do support named graphs but they are not widely implemented in comparison to RDF/XML, Turtle or N-Triples, none of which support anything beyond the standard RDF model. I don't know of any reasoners that can work across multiple named graphs like this either.

This post is part 3 of a series about representing time in RDF. All posts in this series: part 1, part 2, part 3, part 4, part 5 and part 6

Permalink: http://blog.iandavis.com/2009/08/representing-time-in-rdf-part-3/

Other posts tagged as data, genealogy, history, modelling, projects, rdf, technology, time, time-in-rdf

Internet Alchemy
