I'm planning to start work on a ping interface for PlaceTime.com. This will enable a sort of event aggregator, analogous to RSS news aggregators, to be built. I'm planning to support the weblogs.com XML-RPC interface, which is essentially a notification that a URL has changed or is otherwise of interest.
The URL that is sent with the ping will be stored in a database and there will be a separate spider that will periodically check these. The spider will look for embedded RDF within the URL content and parse that to look for PlaceTime URIs that may be interesting. There may be multiple embedded RDF fragments each containing multiple URI references. I call these 'occurrences', so a URL may have zero or more occurrences of PlaceTime URIs.
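As a rough illustration of the receiving side, here is a minimal sketch of a ping handler in Python. It assumes the weblogs.com convention of a `weblogUpdates.ping(name, url)` method that returns a struct with `flerror` and `message` members. The in-memory `pending_urls` dictionary, the rule that a ping makes a URL due immediately, and the `make_server` helper are hypothetical stand-ins, not part of any PlaceTime design.

```python
from datetime import datetime, timezone
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for the database table of URLs: url -> time it is due.
pending_urls = {}


def ping(weblog_name, weblog_url):
    """Handle a weblogs.com-style ping by queueing the URL for the spider."""
    if not weblog_url.startswith(("http://", "https://")):
        return {"flerror": True, "message": "Only HTTP URLs are accepted."}
    # Assumption: a ping means the URL should be fetched as soon as possible.
    pending_urls[weblog_url] = datetime.now(timezone.utc)
    return {"flerror": False, "message": "Thanks for the ping."}


def make_server(host="localhost", port=8000):
    """Build (but do not start) an XML-RPC server exposing the ping method."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(ping, "weblogUpdates.ping")
    return server
```

Calling `make_server().serve_forever()` would start accepting pings; the spider would then read from whatever store replaces `pending_urls`.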
The spider algorithm will look something like this:
- For each URL in database
  - Get existing occurrences for URL in database
  - Fetch URL content
  - If URL fetch error:
    - Remove occurrences from database
    - Remove URL from database
    - Add occurrences as new URLs to fetch
    - Process next URL in queue
  - If URL fetch ok:
    - Extract and parse RDF fragments from content
    - Extract triples with objects that are PlaceTime URIs
    - For each triple
      - Determine original subject (?? anonymous nodes)
      - If existing occurrence, update database
      - If new occurrence, add to database
    - Add unreferenced occurrences as new URLs
    - Update next fetch date for URL (1 month)
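The steps above could be sketched in Python roughly as follows. This is only one reading of the algorithm, under several assumptions: the database is a plain dictionary, an occurrence is a `(subject, PlaceTime URI)` pair, "1 month" is approximated as 30 days, and `fetch` and `extract_triples` are hypothetical callables supplied by the caller. Skipping URLs that are not yet due is also an assumption, inferred from the "next fetch date" step.

```python
from datetime import datetime, timedelta, timezone

PLACETIME_PREFIX = "http://placetime.com/"
REFETCH_INTERVAL = timedelta(days=30)  # "1 month", approximated


def _queue_subjects(db, occurrences, now, skip):
    """Add the subject URL of each occurrence as a new URL to fetch."""
    for subject, _uri in occurrences:
        if subject != skip and subject not in db:
            db[subject] = {"next_fetch": now, "occurrences": set()}


def spider_pass(db, fetch, extract_triples, now=None):
    """Run one pass over the URLs in db.

    db maps url -> {"next_fetch": datetime, "occurrences": {(subject, uri)}}.
    fetch(url) returns the content or raises OSError.
    extract_triples(content) returns (subject, predicate, object) string tuples.
    """
    now = now or datetime.now(timezone.utc)
    for url in list(db):  # copy the keys: the loop adds and removes URLs
        if db[url]["next_fetch"] > now:
            continue  # assumption: only URLs that are due get fetched
        existing = db[url]["occurrences"]
        try:
            content = fetch(url)
        except OSError:
            del db[url]  # removes the URL together with its occurrences
            _queue_subjects(db, existing, now, skip=url)
            continue
        found = {
            (s, o)
            for s, _p, o in extract_triples(content)
            if o.startswith(PLACETIME_PREFIX)
        }
        # Replacing the set covers both "update existing" and "add new".
        db[url]["occurrences"] = found
        # Occurrences no longer on the page become URLs in their own right.
        _queue_subjects(db, existing - found, now, skip=url)
        db[url]["next_fetch"] = now + REFETCH_INTERVAL
```

URLs queued during a pass are picked up on the next pass, because the loop iterates over a snapshot of the keys.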
I've designed the algorithm in this way to support the weblog publishing model as well as a more static model. I'm expecting a weblog to embed some RDF for each item on the home page (e.g. extend the trackback RDF). These are fetched by the spider and stored in the database. The URLs of the items that are missing when the spider revisits the page are added to the database as new first-class sources of RDF, i.e. they're now in the archive of the weblog.
Identifying the original subject could be tricky. For RSS 1.0 it should be easy:
<dc:date rdf:resource="http://placetime.com/instant/gregorian/2003-05-19T08:44:00T" />
This has the following triple:
S: http://blog.iandavis.com/2003/05/placeTimeTimeZones.html
P: http://purl.org/dc/elements/1.1/date
O: http://placetime.com/instant/gregorian/2003-05-19T08:44:00T
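For the simple RSS 1.0 case, the triple can be recovered without a full RDF parser. The sketch below uses Python's standard `xml.etree.ElementTree` and treats any element with an `rdf:about` attribute as a subject. The `<rdf:RDF>` and `<item>` wrapper in `SAMPLE` is an assumed context for the `dc:date` element, and `placetime_triples` is a hypothetical helper name. It does not handle anonymous nodes or other RDF/XML constructs.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PLACETIME_PREFIX = "http://placetime.com/"

# Assumed RSS 1.0 context around the dc:date element shown above.
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://blog.iandavis.com/2003/05/placeTimeTimeZones.html">
    <dc:date rdf:resource="http://placetime.com/instant/gregorian/2003-05-19T08:44:00T" />
  </item>
</rdf:RDF>
"""


def placetime_triples(rdf_xml):
    """Return (subject, predicate, object) triples whose object is a PlaceTime URI."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for node in root.iter():
        subject = node.get(f"{{{RDF_NS}}}about")
        if subject is None:
            continue  # no rdf:about, so no subject URI is available here
        for prop in node:
            obj = prop.get(f"{{{RDF_NS}}}resource")
            if obj and obj.startswith(PLACETIME_PREFIX):
                # ElementTree tags look like "{namespace}local"; join the two parts.
                predicate = prop.tag.replace("{", "").replace("}", "")
                triples.append((subject, predicate, obj))
    return triples


for s, p, o in placetime_triples(SAMPLE):
    print("S:", s)
    print("P:", p)
    print("O:", o)
```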
More complicated RDF that uses anonymous nodes may make it too hard to derive a subject URI. For the moment, maybe I'll just ignore any triple that doesn't have an HTTP URI for its subject.
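That fallback rule is easy to express as a filter. In this small sketch, triples are plain tuples of strings and anonymous nodes are assumed to appear as `_:`-prefixed labels; both conventions and the function names are illustrative only.

```python
def has_http_subject(triple):
    """True when the triple's subject is an HTTP(S) URI rather than an anonymous node."""
    subject = triple[0]
    return isinstance(subject, str) and subject.startswith(("http://", "https://"))


def usable_triples(triples):
    """Keep only the triples whose subject can be fetched later by the spider."""
    return [t for t in triples if has_http_subject(t)]
```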
Discussion of this is taking place on the PlaceTime mailing list.