A Suite of Linguistic Tools for Use with the Penn-II Treebank

Aoife Cahill, Mairead McCarthy, Ruth O'Donovan, Josef van Genabith and Andy Way

Abstract

Proceedings of LFG03; CSLI Publications On-line

Treebanks of parsed, annotated text corpora are becoming increasingly important resources in many areas of descriptive, theoretical and computational linguistic research. In LFG too, there is a substantial body of work on the semi-automatic extraction of large-scale resources, including grammars (e.g. Cahill et al., 2002a; Zinsmeister et al., 2002; Frank et al., 2003), subcategorisation frames (van Genabith et al., 1999; Cahill et al., 2003), and < c,f > pairs of LFG representations (e.g. Cahill et al., 2002b).

The current paper describes a suite of tools for inspection of the Penn-II Treebank. Cahill et al. (2002a) describe an algorithm for automatically annotating the 1 million words in 50,000 sentences in the treebank with f-structure annotations. This annotation method scales up the method of van Genabith et al. (1999) by an order of magnitude. Given the size of the dataset, a number of tools have been built to facilitate the inspection and annotation of the treebank trees. The tools include:

Figure 1 illustrates the display of a < c,f > pair for the simple sentence "A man saw a woman". While some of the tools have been described in Cahill and van Genabith (2002), we shall demonstrate a number of new facilities, including the extraction of subcategorisation frames and quasi-logical forms, an automatic annotation algorithm, and full LFG parsing of unseen input into both c- and f-structures, should the user require it. Parsing is made possible by a PCFG chart parser (based on the CYK algorithm) which operates on CFG grammars extracted by the annotation algorithm presented in Cahill et al. (2002a). (Available at http://www.computing.dcu.ie/~acahill/get_lfg.html)
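To make the parsing step concrete, the following is a minimal sketch of probabilistic CYK chart parsing over a grammar in Chomsky normal form. It is not the parser used in the tool suite: the toy rules, their probabilities, and the names LEXICAL, BINARY and cyk_parse are assumptions made for illustration, whereas the grammars described above are extracted from the treebank and carry f-structure annotations. The sketch recovers only the most probable c-structure tree.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (illustrative only).
LEXICAL = {  # word -> [(category, probability)]
    "a": [("DT", 1.0)],
    "man": [("NN", 0.5)],
    "woman": [("NN", 0.5)],
    "saw": [("VBD", 1.0)],
}
BINARY = {  # (left child, right child) -> [(parent, probability)]
    ("DT", "NN"): [("NP", 1.0)],
    ("VBD", "NP"): [("VP", 1.0)],
    ("NP", "VP"): [("S", 1.0)],
}


def cyk_parse(words, start="S"):
    """Return (probability, tree) for the best parse, or None if there is none."""
    n = len(words)
    # chart[(i, j)][cat] = (best probability, backpointer) for words[i:j]
    chart = defaultdict(dict)

    # Width-1 spans: look each word up in the lexicon.
    for i, word in enumerate(words):
        for cat, prob in LEXICAL.get(word.lower(), []):
            chart[(i, i + 1)][cat] = (prob, word)

    # Wider spans: combine two adjacent sub-spans at every split point k.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lcat, (lprob, _) in chart[(i, k)].items():
                    for rcat, (rprob, _) in chart[(k, j)].items():
                        for parent, prob in BINARY.get((lcat, rcat), []):
                            p = prob * lprob * rprob
                            best = chart[(i, j)].get(parent, (0.0, None))[0]
                            if p > best:
                                chart[(i, j)][parent] = (p, (k, lcat, rcat))

    def build(i, j, cat):
        # Follow backpointers to rebuild the tree as nested tuples.
        _, back = chart[(i, j)][cat]
        if isinstance(back, str):
            return (cat, back)
        k, lcat, rcat = back
        return (cat, build(i, k, lcat), build(k, j, rcat))

    if start not in chart[(0, n)]:
        return None
    return chart[(0, n)][start][0], build(0, n, start)


if __name__ == "__main__":
    print(cyk_parse("A man saw a woman".split()))
```

Calling cyk_parse on the example sentence of Figure 1 returns the probability of the best tree together with the tree as nested tuples; an input that the toy grammar cannot derive returns None. A full LFG parse would additionally resolve the f-structure equations attached to the rules, which this sketch omits.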


Figure 1: A < c,f > pair for a simple sentence

References