Abstract
We describe a method that automatically induces LFG f-structures from treebank tree representations, given a set of f-structure annotation principles that define partial, modular c- to f-structure correspondences in a linguistically informed, principle-based way.
This work extends the approach of van Genabith, Sadler and Way (1999a,b,c), where f-structure annotation of treebanks is driven by manual annotation of treebank-extracted PS rules. In this paper we present a method for automatic f-structure annotation of treebank trees. A rule-based approach is explored, in parallel, by Sadler, van Genabith and Way (2000).
Our method builds on a correspondence-based view of the LFG architecture, where annotation principles define φ-correspondences directly in terms of φ-projection constraints, relating partial (possibly non-local) c-structure tree fragments to their corresponding partial f-structures. Application of the modular annotation principles to treebank trees directly induces the f-structure. Due to the disambiguated tree input, the resulting f-structures require only minimal manual disambiguation, and can be used to build large f-structure corpora as training data for stochastic NLP applications.
The f-structure annotation principles provide by themselves a principle-based, modular description of the LFG c-structure/f-structure interface. They define characteristic functional correspondences between partial c-structure configurations and their f-structure projections. By abstracting away from irrelevant c-structure context, these principles are highly general and modular, and therefore apply to previously unseen tree configurations.
To define and process the annotation principles we make use of an existing term rewriting system, originally designed for transfer-based Machine Translation. The method is inherently robust. It yields partial, unconnected f-structures in the case of missing annotation rules.
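The robustness property can be illustrated with a minimal sketch. It is not the term rewriting system used in this work: the `Node` class, the `PRINCIPLES` table and the `annotate` function are hypothetical names, and the principles are restricted, for simplicity, to local mother-daughter configurations. A daughter for which no principle applies is not attached to its mother and surfaces as a separate, unconnected f-structure fragment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A c-structure tree node; preterminals carry a word."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None


# Hypothetical annotation principles: (mother, daughter) -> function.
# "=" marks a head daughter that shares the mother's f-structure.
# Note that there is deliberately no entry for ("VP", "ADVP").
PRINCIPLES: Dict[Tuple[str, str], str] = {
    ("S", "NP"): "SUBJ",
    ("S", "VP"): "=",
    ("VP", "V"): "=",
    ("VP", "NP"): "OBJ",
    ("NP", "N"): "=",
    ("ADVP", "ADV"): "=",
}


def annotate(root: Node) -> List[dict]:
    """Return all f-structure fragments; the root's fragment comes first."""
    top: dict = {}
    fragments: List[dict] = [top]

    def visit(node: Node, fs: dict) -> None:
        if node.word is not None:
            fs["PRED"] = node.word
            return
        for child in node.children:
            func = PRINCIPLES.get((node.label, child.label))
            if func == "=":
                visit(child, fs)          # head: same f-structure as mother
            elif func is not None:
                sub: dict = {}
                fs[func] = sub            # embed under a grammatical function
                visit(child, sub)
            else:
                orphan: dict = {}         # no principle: unconnected fragment
                fragments.append(orphan)
                visit(child, orphan)

    visit(root, top)
    return fragments


tree = Node("S", [
    Node("NP", [Node("N", word="Mary")]),
    Node("VP", [
        Node("V", word="sees"),
        Node("NP", [Node("N", word="John")]),
        Node("ADVP", [Node("ADV", word="often")]),
    ]),
])
fragments = annotate(tree)
```

In this toy setting the adverbial is returned as a second fragment rather than causing the whole annotation to fail; adding a principle for the `("VP", "ADVP")` configuration would connect it.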
We present the results of a first experiment where we apply this method to the Susanne treebank. The experiment is designed to measure the extent to which the partial c- to f-structure correspondences encoded in annotation principles scale up, by applying them to previously unseen tree configurations. We then extend the model to selective filtering of ambiguities, using lexical subcategorization information in conjunction with an OT-based constraint ranking mechanism for ambiguity filtering and ranking (cf. Frank et al. 1998, 2000).
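The idea of ranking-based ambiguity filtering can be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism of Frank et al.: each candidate analysis is assumed to carry a list of dispreference marks, the function `ot_filter` and the mark names are hypothetical, and analyses are compared by their violation counts in order of constraint rank.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ot_filter(analyses: Dict[str, Sequence[str]],
              ranking: Sequence[str]) -> List[str]:
    """Keep the analyses with the best violation profile.

    `ranking` lists dispreference marks from highest to lowest rank.
    Profiles are compared lexicographically, so a single violation of a
    higher-ranked constraint outweighs any number of lower-ranked ones.
    """
    def profile(marks: Sequence[str]) -> Tuple[int, ...]:
        counts = Counter(marks)
        return tuple(counts[mark] for mark in ranking)

    profiles = {name: profile(marks) for name, marks in analyses.items()}
    best = min(profiles.values())
    return [name for name, p in profiles.items() if p == best]


# Hypothetical marks: an analysis that leaves a subcategorized argument
# unfilled is penalized more heavily than one with an extra adjunct.
ranking = ["MissingArgument", "ExtraAdjunct"]
candidates = {
    "pp_as_oblique": [],
    "pp_as_adjunct": ["MissingArgument", "ExtraAdjunct"],
}
survivors = ot_filter(candidates, ranking)
```

Analyses with identical profiles all survive the filter, so the mechanism reduces ambiguity selectively rather than forcing a single choice.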
Finally we address some conceptual issues. The principle-based projection of f-structures from disambiguated tree input has interesting implications for the definition of grammatical constraints as compared to the classical LFG parsing architecture. We also discuss issues such as systematic modifications of given treebank encodings, and which types of treebank encodings should be exploited for different applications: the construction of f-structure banks, as opposed to more far-reaching goals, including rapid, corpus-based LFG grammar development, and robust parsing architectures.