Cahill/McCarthy/O'Donovan/van Genabith/Way LFG03 Abstract

A Suite of Linguistic Tools for Use with the Penn-II Treebank

Aoife Cahill, Mairead McCarthy, Ruth O'Donovan, Josef van Genabith and Andy Way

Abstract

Proceedings of LFG03; CSLI Publications On-line

Treebanks (parsed, annotated text corpora) are becoming increasingly important resources in many areas of descriptive, theoretical and computational linguistic research. In LFG too, there is a substantial body of work on the semi-automatic extraction of large-scale resources, including grammars (e.g. Cahill et al., 2002a; Zinsmeister et al., 2002; Frank et al., 2003), subcategorisation frames (van Genabith et al., 1999; Cahill et al., 2003), and < c,f > pairs of LFG representations (e.g. Cahill et al., 2002b).

The current paper describes a suite of tools for inspecting the Penn-II Treebank. Cahill et al. (2002a) describe an algorithm for automatically annotating the treebank's 1 million words in 50,000 sentences with f-structure information. This annotation method scales up by an order of magnitude over the method of van Genabith et al. (1999). Given the size of the dataset, a number of tools have been built to facilitate the inspection and annotation of the treebank trees. The tools include:

treebank inspection and viewing options which enable searching for CFG-rule tokens extracted from the treebank;
graphical display of trees and subtrees according to rule instances;
display of the yield of the subtree (with and without context);
tagging and PCFG-parsing of new input;
an automatic annotation tool;
an f-structure generator.
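
As an illustration of the first three facilities in this list, the following is a minimal Python sketch of how CFG-rule tokens might be extracted from a bracketed tree, searched by rule, and displayed with the yield of the matching subtree. It is not the tool suite's own code: the function names (parse_tree, rule_tokens, tree_yield) and the example tree are hypothetical, and the sketch assumes a simplified Penn-style bracketing without traces, functional tags or the empty outer bracket.

```python
from collections import Counter

# Hypothetical example tree in simplified Penn-style bracketing (an assumption,
# not taken from the treebank itself).
EXAMPLE = "(S (NP (DT A) (NN man)) (VP (VBD saw) (NP (DT a) (NN woman))))"


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested subtree
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()


def rule_tokens(tree):
    """Yield (rule, subtree) for every phrasal (non-lexical) local tree."""
    label, children = tree
    if children and all(isinstance(c, tuple) for c in children):
        rhs = " ".join(child[0] for child in children)
        yield (f"{label} -> {rhs}", tree)
        for child in children:
            yield from rule_tokens(child)


def tree_yield(tree):
    """Return the words dominated by a (sub)tree, left to right."""
    words = []
    for child in tree[1]:
        if isinstance(child, tuple):
            words.extend(tree_yield(child))
        else:
            words.append(child)
    return words


tree = parse_tree(EXAMPLE)

# Rule-type frequencies, counted over rule tokens.
print(Counter(rule for rule, _ in rule_tokens(tree)))

# Search for every token of one rule and show the yield of each matching subtree.
for rule, subtree in rule_tokens(tree):
    if rule == "NP -> DT NN":
        print(rule, "=>", " ".join(tree_yield(subtree)))
```

Showing a yield in context would then amount to printing the yield of the whole tree alongside that of the matching subtree.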

Figure 1 illustrates the display of a < c,f > pair for the simple sentence "A man saw a woman". While some of the tools have been described in Cahill and van Genabith (2002), we shall demonstrate a number of new facilities, including the extraction of subcategorisation frames and quasi-logical forms, an automatic annotation algorithm, and, should the user require it, full LFG parsing of unseen input into both c- and f-structures (available at http://www.computing.dcu.ie/~acahill/get_lfg.html). This parsing is made possible by a PCFG chart parser (based on the CYK algorithm) which operates on CFG grammars extracted by the annotation algorithm presented in Cahill et al. (2002a).

Figure 1: A < c,f > pair for a simple sentence
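
To make the idea of CYK-based PCFG chart parsing concrete, the following is a minimal Python sketch of a Viterbi CYK parser that produces a c-structure for the example sentence. It should be read under several assumptions: the grammar is a hand-written toy in Chomsky normal form with invented probabilities, not a grammar extracted from Penn-II; the names BINARY, LEXICAL, cyk_parse and build are hypothetical; and f-structure annotations are not modelled. A grammar read off treebank trees would generally need binarisation before this form of CYK could be applied.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (invented for illustration).
# Binary rules: (left child, right child) -> [(parent, probability)]
BINARY = {
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 1.0)],
    ("VBD", "NP"): [("VP", 1.0)],
}
# Lexical rules: word -> [(tag, probability)]
LEXICAL = {
    "a": [("DT", 1.0)],
    "man": [("NN", 0.5)],
    "woman": [("NN", 0.5)],
    "saw": [("VBD", 1.0)],
}


def build(chart, i, j, label):
    """Follow backpointers to rebuild the best tree over words[i:j]."""
    _, back = chart[i, j][label]
    if isinstance(back, str):         # lexical entry: backpointer is the word
        return f"({label} {back})"
    k, left, right = back
    return f"({label} {build(chart, i, k, left)} {build(chart, k, j, right)})"


def cyk_parse(words, start="S"):
    """Return (probability, bracketed tree) of the best parse, or None."""
    n = len(words)
    chart = defaultdict(dict)         # (i, j) -> {label: (prob, backpointer)}

    # Width-1 spans: tag each word using the lexical rules.
    for i, word in enumerate(words):
        for tag, p in LEXICAL.get(word.lower(), []):
            chart[i, i + 1][tag] = (p, word)

    # Wider spans: combine two adjacent sub-spans with a binary rule,
    # keeping only the most probable analysis per label (Viterbi).
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, (lp, _) in chart[i, k].items():
                    for right, (rp, _) in chart[k, j].items():
                        for parent, p in BINARY.get((left, right), []):
                            prob = p * lp * rp
                            best = chart[i, j].get(parent, (0.0, None))[0]
                            if prob > best:
                                chart[i, j][parent] = (prob, (k, left, right))

    if start not in chart[0, n]:
        return None                   # no analysis spans the whole input
    return chart[0, n][start][0], build(chart, 0, n, start)


print(cyk_parse("A man saw a woman".split()))
```

Storing only the best analysis per label and span keeps the chart small; recovering all parses, or attaching annotations to the rules, would require a richer chart entry than the one assumed here.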

References

Cahill, A., M. McCarthy, J. van Genabith and A. Way (2002a): `Automatic Annotation of the Penn-Treebank with LFG F-Structure Information', in Proceedings of the LREC Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data, Las Palmas, Spain, pp.8-15.
Cahill, A., M. McCarthy, J. van Genabith and A. Way (2002b): `Parsing Text with a PCFG derived from Penn-II with an Automatic F-Structure Annotation Procedure', in M. Butt and T. Holloway-King (eds.) Proceedings of the Seventh International Conference on LFG, CSLI Publications, Stanford, CA., pp.76-95.
Cahill, A., M. McCarthy, J. van Genabith and A. Way (2003): `Quasi-logical forms from f-structures for the Penn treebank', in Proceedings of the Fifth International Workshop on Computational Semantics, Tilburg, The Netherlands, pp.55-71.
Cahill, A. and J. van Genabith (2002): `TTS: A Treebank Tool Suite', in Proceedings of LREC 2002, Third International Conference on Language Resources and Evaluation, Las Palmas, Spain, pp.1712-1717.
Frank, A., L. Sadler, J. van Genabith and A. Way (2003): `From Treebank Resources to LFG f-Structures', in A. Abeille (ed.) Building and using Parsed Corpora, Kluwer, Dordrecht, The Netherlands (in press).
van Genabith, J., L. Sadler and A. Way (1999): `Data-driven Compilation of LFG Semantic Forms', in EACL-99 Workshop on Linguistically Interpreted Corpora, Bergen, Norway, pp.69-76.
Zinsmeister, H., J. Kuhn and S. Dipper (2002): `Utilizing LFG Parses for Treebank Annotation', in M. Butt and T. Holloway-King (eds.) Proceedings of the Seventh International Conference on LFG, CSLI Publications, Stanford, CA., pp.427-447.