I’ve just come across Grunk, a Java-based toolkit for scraping semi-structured text. The concept is a bit like having regular expressions that work in terms of words rather than characters. The patterns are specified in an XML format, so conceivably a library of common ones could be built up. Output is also in XML, so coupled with some XSLT this could be a significant addition to the scraper’s toolset.
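
To make the "regular expressions over words" idea a bit more concrete, here is a rough sketch in plain Java. It is not Grunk's API and does not use its XML pattern format; the class `WordPatternSketch` and the helpers `literal`, `number`, `anyWord` and `findMatches` are names I've made up purely for illustration. Each pattern element tests one whole word, and the pattern is slid across the whitespace-separated words of the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: hypothetical names, not Grunk's API or pattern format.
class WordPatternSketch {

    // Pattern element that matches one specific word, ignoring case.
    static Predicate<String> literal(String word) {
        return token -> token.equalsIgnoreCase(word);
    }

    // Pattern element that matches a token made up entirely of digits.
    static Predicate<String> number() {
        return token -> token.matches("\\d+");
    }

    // Pattern element that matches any single word.
    static Predicate<String> anyWord() {
        return token -> true;
    }

    // Slides the pattern over the words of the text and returns every
    // window in which each word satisfies the corresponding element.
    static List<List<String>> findMatches(List<Predicate<String>> pattern, String text) {
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        List<List<String>> matches = new ArrayList<>();
        for (int start = 0; start + pattern.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < pattern.size(); i++) {
                if (!pattern.get(i).test(tokens.get(start + i))) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                matches.add(new ArrayList<>(tokens.subList(start, start + pattern.size())));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Word-level pattern: the word "price", then any word, then a number.
        List<Predicate<String>> pattern =
                Arrays.asList(literal("price"), anyWord(), number());
        String text = "The price is 42 dollars and the price was 40";
        // prints [[price, is, 42], [price, was, 40]]
        System.out.println(findMatches(pattern, text));
    }
}
```

The point of the sketch is only that the unit of matching is the word, not the character. A real toolkit would presumably also need repetition, optional elements and some way to name the captured pieces, none of which is attempted here.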
Ian Davis: British; married with kids; technical architect; CTO of Talis; co-author of RSS 1.0; creator of FOAF icons; Semantic Web hacker.

