Grunk

I've just come across Grunk, a Java-based toolkit for scraping semi-structured text. The concept is a bit like having regular expressions that work in terms of words rather than characters. The patterns are specified in an XML format, so conceivably a library of common ones could be built up. Output is also in XML, so coupled with some XSLT this could be a significant addition to the scraper's toolset.
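
To make the "regular expressions over words" idea a bit more concrete, here is a rough sketch in plain Java. It is not Grunk's API or its XML pattern format; the class name WordPatternSketch and the methods tokenize and findAll are invented purely for illustration. It assumes words are separated by whitespace and that each element of a pattern is an ordinary java.util.regex.Pattern that must match one whole word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: this is NOT Grunk's API or pattern format.
class WordPatternSketch {

    // Splits text into word tokens on runs of whitespace (an assumption;
    // a real toolkit may tokenize differently).
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Returns every run of consecutive words in which the i-th word fully
    // matches the i-th element of wordPattern.
    static List<List<String>> findAll(List<Pattern> wordPattern, String text) {
        List<String> words = tokenize(text);
        List<List<String>> matches = new ArrayList<>();
        int n = wordPattern.size();
        if (n == 0) {
            return matches;
        }
        // Slide a window of n words across the token list.
        for (int start = 0; start + n <= words.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < n; i++) {
                if (!wordPattern.get(i).matcher(words.get(start + i)).matches()) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                matches.add(new ArrayList<>(words.subList(start, start + n)));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The word "Price:", then any single word, then a number-like word.
        List<Pattern> pattern = Arrays.asList(
            Pattern.compile("(?i)price:"),
            Pattern.compile(".+"),
            Pattern.compile("\\d+(\\.\\d+)?"));
        System.out.println(
            findAll(pattern, "Item: Widget Price: USD 19.99 Stock: 4"));
    }
}
```

The point of the sketch is that the unit of matching is the word: each slot in the pattern consumes exactly one token, so a pattern reads like "this word, then any word, then a number" rather than a long character-level expression.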

Permalink: http://blog.iandavis.com/2004/01/grunk/
