Constraining RDF/XML
These are a few notes about rules to constrain RDF/XML output in order to make the serialisation of any given graph deterministic. One strong motivator is RSS 1.0, which uses a restricted profile of RDF/XML designed for compatibility with RSS 0.90. Because of this restriction, I always end up writing a custom RDF writer for my aggregated RSS - just pumping it out as RDF/XML is a no-no. It got me thinking that perhaps there were some simple rules that I could build into my standard RDF/XML writer that would automatically produce RSS 1.0. If the rules produced a lossless representation of the graph then I could use them for any RDF graph I wanted to serialise.
My rough rules so far are:
- Output blank nodes that are not the object of any triples as a top-level element without a generated nodeID.
- Output any blank nodes that are the object of only one triple as a child element of that triple's property element, without a generated nodeID.
- Output blank nodes that are the object of two or more triples as a top-level element with a generated nodeID.
- Output all other subject nodes as top-level elements.
- Allow the specification of an ordered list of preferred namespace URIs.
- Always output typed nodes unless no rdf:type has been specified. If the node has multiple types, use the following algorithm to select the "preferred type":
  - split each URI into a namespace and local name
  - group URIs by namespace
  - order groups in the same order as the preferred namespace list
  - sort within each group by the local name, ascending
  - pick the first URI from the first group
- Top-level nodes should be ordered by "preferred type". Use the algorithm above to determine the order in which the top-level nodes should appear. The result should be typed nodes from preferred namespaces occurring first in the output document. (This ensures that rss:channel appears before rss:item, and that all other elements appear at the end, in RSS 1.0.)
- RDF container elements: if a contiguous sequence of numeric list properties exists (i.e. rdf:_1, rdf:_2, rdf:_3 etc) then these must be the first properties written out in the container and must be written as rdf:li elements. Only use the rdf:_n form when there is a gap in the sequence or when the parent element is not an RDF container element. (This produces the items/rdf:Seq/rdf:li construct in RSS 1.0.)
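The container rule in the last bullet could be sketched roughly as below. This is only one reading of the rule: it assumes the contiguous run starting at rdf:_1 is written as rdf:li and anything after a gap keeps its explicit rdf:_n name, and it assumes each rdf:_n property occurs at most once on a node. The function names (member_index, plan_container_properties) are made up for illustration and do not come from any RDF library.

```python
import re

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Matches container membership properties such as rdf:_1, rdf:_2, rdf:_10.
_MEMBER = re.compile(re.escape(RDF_NS) + r"_([1-9][0-9]*)")


def member_index(prop_uri):
    """Return n for an rdf:_n property URI, or None for any other property."""
    m = _MEMBER.fullmatch(prop_uri)
    return int(m.group(1)) if m else None


def plan_container_properties(prop_uris, is_container):
    """Split a node's property URIs into (li_props, explicit_props).

    li_props are written first, as rdf:li elements, in numeric order.
    explicit_props keep their own names: leftover rdf:_n properties in
    numeric order, then every other property in its input order.
    Assumption: each rdf:_n property appears at most once on the node.
    """
    by_index = {}
    others = []
    for uri in prop_uris:
        n = member_index(uri)
        if n is None:
            others.append(uri)
        else:
            by_index[n] = uri

    li_props = []
    if is_container:
        # Only the gap-free run starting at rdf:_1 can become rdf:li.
        n = 1
        while n in by_index:
            li_props.append(by_index.pop(n))
            n += 1

    explicit_props = [by_index[n] for n in sorted(by_index)] + others
    return li_props, explicit_props
```

With rdf:_1, rdf:_2 and rdf:_4 on an rdf:Seq, this would write the first two as rdf:li and leave rdf:_4 explicit; on a node that is not a container, all three stay as rdf:_n.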
By top-level I mean child elements of the rdf:RDF element.
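As a rough illustration of the first four rules, the placement of each subject node might be decided like this. It is a sketch under some assumptions of my own: triples are plain (subject, predicate, object) tuples of strings, blank nodes are written with a "_:" prefix as in N-Triples, and literals are quoted strings. The names is_blank and place_subjects are hypothetical. It does not deal with cycles of blank nodes that each reference the next exactly once, which would need breaking somewhere.

```python
from collections import Counter

TOP_LEVEL = "top-level"
TOP_LEVEL_NO_ID = "top-level, no nodeID"
NESTED_NO_ID = "nested, no nodeID"
TOP_LEVEL_WITH_ID = "top-level, with nodeID"


def is_blank(node):
    """Assumption: blank nodes are strings with an N-Triples style '_:' prefix."""
    return node.startswith("_:")


def place_subjects(triples):
    """Map every subject in the graph to one of the placement constants."""
    triples = list(triples)
    # How many triples have each blank node as their object.
    times_as_object = Counter(o for _, _, o in triples if is_blank(o))

    placement = {}
    for subject, _, _ in triples:
        if not is_blank(subject):
            placement[subject] = TOP_LEVEL
        elif times_as_object[subject] == 0:
            placement[subject] = TOP_LEVEL_NO_ID
        elif times_as_object[subject] == 1:
            placement[subject] = NESTED_NO_ID
        else:
            placement[subject] = TOP_LEVEL_WITH_ID
    return placement
```

Counting is done per triple rather than per distinct referrer, so a blank node that the same subject points at through two different properties still gets a nodeID.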
More later as I think of them and try some rules out in my writer. Comments on this much appreciated.
Later... Richard Cyganiak suggested the following in comments, which help make the output more deterministic:
- Allow specification of a default namespace
- Order property elements within a node element using the algorithm described above.
- Don't use rdf:parseType="Resource" or property attributes.
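The ordering algorithm referred to here, which is the same one used earlier to pick a "preferred type", could be expressed as a sort key along these lines. Two things are assumptions on my part rather than part of the rules: URIs are split at the last "#" or "/" (a real writer would need proper XML name rules), and namespaces missing from the preferred list sort after all preferred ones, ordered by namespace URI. The function names are illustrative only.

```python
def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


def preference_key(uri, preferred_namespaces):
    """Sort key: rank of the namespace in the preferred list, then local name."""
    namespace, local = split_uri(uri)
    try:
        rank = preferred_namespaces.index(namespace)
    except ValueError:
        # Assumption: unlisted namespaces come after every preferred one.
        rank = len(preferred_namespaces)
    return (rank, namespace, local)


def order_uris(uris, preferred_namespaces):
    """Order type or property URIs by namespace preference, then local name."""
    return sorted(uris, key=lambda u: preference_key(u, preferred_namespaces))


def preferred_type(type_uris, preferred_namespaces):
    """Pick the first URI from the first group, or None for an untyped node."""
    ordered = order_uris(type_uris, preferred_namespaces)
    return ordered[0] if ordered else None
```

Using one key for types, top-level nodes and property elements keeps the three orderings consistent: with the RSS 1.0 namespace first in the preferred list, channel sorts before item, and both sort before anything from another vocabulary.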
I'd also add:
- Don't use rdf:ID, use rdf:about/rdf:resource/rdf:nodeID instead.