Text Mining with the Cafetiere Toolkit

  • Speaker: Mr Bill Black (University of Manchester)
  • Host: AXR
  • 29th November 2006 at 14:15 in 1.5
Cafetiere is a modular toolkit for natural language text analysis, especially information extraction. It has pre-processing components for tokenization, ontology or gazetteer lookup, part-of-speech tagging, stemming, etc. The modules add metadata to a Document Object Model representation in which structural markup is held in-line and conceptual markup is held stand-off. The final phase of analysis is done by a rule-based system that applies a context-sensitive grammar of phrases to the text, so that the text is partially or fully parsed. The grammar formalism incorporates feature-value expressions, so that values deriving from any of the pre-processing stages or from prior rule applications can be tested in rule conditions. It has various expressive devices, e.g. regular expression quantifiers, Prolog-like variables and sub-token pattern-matching, to reduce the number of rules needed.
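
To make the ideas of stand-off markup and feature-tested phrase rules concrete, the following is a minimal Python sketch. It is not Cafetiere's rule formalism or API: the Annotation class, the GAZETTEER entries, the feature names (token, orth, lookup) and the PERSON rule are all hypothetical, chosen only for illustration. Annotations record character offsets into an unmodified text (stand-off), and a rule is a sequence of feature conditions, each with a minimum and maximum repeat count in the manner of a regular expression quantifier.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    """Stand-off annotation: offsets into the text plus a feature map."""
    start: int
    end: int
    features: dict


# Hypothetical gazetteer mapping surface forms to a lookup class.
GAZETTEER = {"Mr": "title", "Dr": "title"}


def tokenize(text):
    """Tokenize and attach orthography and gazetteer features to each token."""
    tokens = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        word = m.group()
        feats = {
            "token": word,
            "orth": "capitalized" if word[0].isupper() else "other",
        }
        if word in GAZETTEER:
            feats["lookup"] = GAZETTEER[word]
        tokens.append(Annotation(m.start(), m.end(), feats))
    return tokens


def satisfies(ann, cond):
    """True if every feature named in cond has the required value."""
    return all(ann.features.get(k) == v for k, v in cond.items())


def match_at(tokens, i, pattern):
    """Return the end token index of a match starting at i, or None.

    pattern is a list of (condition, min_repeat, max_repeat) triples.
    Matching is greedy, with backtracking over the repeat count.
    """
    if not pattern:
        return i
    cond, lo, hi = pattern[0]
    n = 0
    while n < hi and i + n < len(tokens) and satisfies(tokens[i + n], cond):
        n += 1
    for k in range(n, lo - 1, -1):
        end = match_at(tokens, i + k, pattern[1:])
        if end is not None:
            return end
    return None


def apply_rule(tokens, label, pattern):
    """Scan left to right, emitting a phrase annotation for each match."""
    phrases = []
    i = 0
    while i < len(tokens):
        end = match_at(tokens, i, pattern)
        if end is not None and end > i:
            phrases.append(
                Annotation(tokens[i].start, tokens[end - 1].end, {"label": label})
            )
            i = end
        else:
            i += 1
    return phrases


# Hypothetical rule: a title from the gazetteer, then 1 to 3 capitalized tokens.
PERSON = [({"lookup": "title"}, 1, 1), ({"orth": "capitalized"}, 1, 3)]

text = "The speaker is Mr Bill Black of Manchester."
tokens = tokenize(text)
phrases = apply_rule(tokens, "person", PERSON)
for p in phrases:
    print(p.features["label"], text[p.start:p.end])
```

Because the phrase annotations carry only offsets and features, the original text is never rewritten, and a later rule could test the label feature of an earlier match in the same way that it tests token-level features.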

A recent revision to the internal document representation has resulted in a system that can analyze texts of conference-paper length in a few seconds. Current work with the system involves developing resources to analyze scientific texts in various ways, including, in joint work with Lancaster, informative abstracting by information extraction and generation. The talk will present the framework, its relationship to other information extraction systems, analysis resources, applications and work in progress, which includes co-reference resolution, export of the markup to conceptually index documents for data mining purposes, and development of a more sophisticated text-ontology linkage. The user interface will be demonstrated, and pointers will be given to a trial version.
