I've just come across Grunk, a Java-based toolkit for scraping semi-structured text. The concept is a bit like regular expressions that work on words rather than characters. The patterns are specified in an XML format, so conceivably a library of common ones could be built up. Output is also XML, so coupled with some XSLT this could be a significant addition to the scraper's toolset.
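
To give a feel for what "regular expressions over words" might mean, here is a rough sketch in plain Java. It is not Grunk's API or its XML pattern format, neither of which is shown here; the class `WordPatternSketch` and the helpers `literal`, `number`, `anyWord` and `findAll` are names made up for illustration. A pattern is just a fixed sequence of per-word tests, with no quantifiers or alternation, and it is slid along the whitespace-separated tokens of the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: a toy word-level matcher, not Grunk's actual implementation.
public class WordPatternSketch {

    // Each helper builds a test for a single word, playing the role
    // that a literal or character class plays in an ordinary regex.
    static Predicate<String> literal(String word) {
        return token -> token.equalsIgnoreCase(word);
    }

    static Predicate<String> number() {
        return token -> token.matches("\\d+");
    }

    static Predicate<String> anyWord() {
        return token -> true;
    }

    // Split the text on whitespace and return every run of tokens
    // where each token passes the test at the same position in the pattern.
    static List<List<String>> findAll(List<Predicate<String>> pattern, String text) {
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        List<List<String>> matches = new ArrayList<>();
        for (int start = 0; start + pattern.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < pattern.size() && ok; i++) {
                ok = pattern.get(i).test(tokens.get(start + i));
            }
            if (ok) {
                matches.add(new ArrayList<>(tokens.subList(start, start + pattern.size())));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical pattern: the word "costs", then a number, then any word.
        List<Predicate<String>> price = Arrays.asList(literal("costs"), number(), anyWord());
        String text = "The widget costs 12 dollars and the gadget costs 7 euros";
        System.out.println(findAll(price, text));
        // prints [[costs, 12, dollars], [costs, 7, euros]]
    }
}
```

A real toolkit would presumably need repetition, optional words and named captures on top of this, but even the toy version shows why word-level patterns suit semi-structured text: the pattern reads the way you would describe the phrase out loud.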

