The development of rich, unification-based, wide-coverage computational grammatical resources is time-consuming and expensive. Cahill et al. (2002, 2003) present methods to construct robust, wide-coverage, statistical LFG grammars automatically from an f-structure annotated version of the Penn-II treebank. The f-structure annotations for the treebank trees are generated automatically by an f-structure annotation algorithm. The trees in the Penn-II treebank contain empty productions and a rich arsenal of traces that coindex "displaced" linguistic material with the tree positions where this material should be interpreted semantically. The automatic f-structure annotation algorithm is sensitive to these traces and captures long distance dependencies (LDDs) as corresponding reentrancies in the f-structure annotations. However, the wide-coverage statistical grammars automatically extracted from this resource in Cahill et al. (2002) do not capture LDDs, but rather parse new text into "proto-f-structures". Proto-f-structures interpret linguistic material locally, where it occurs in the parse tree. The reason is that current statistical parsers do not produce trees with empty productions and coindexed traces (two exceptions are Collins' (1999) model 3 and Johnson's (2002) tree post-processing approach). Indeed, statistical parsers standardly remove empty productions and traces from the training set (Charniak, 1996).
In this paper we present a method for resolving LDDs in an automatically constructed, wide-coverage, statistical LFG grammar for parse trees that do not contain empty nodes or coindexed traces. We follow the lead of standard LFG and resolve such dependencies at the level of f-structure, using paths through f-structure (functional-uncertainty paths) and lexical information (semantic forms). In contrast to other approaches, however, we compute this information automatically from the (proper) f-structure annotated Penn-II treebank resource. Given such a resource it is possible to automatically extract semantic forms following van Genabith et al. (1999). The precise results obtained are detailed in a companion paper. The semantic forms are associated with conditional probabilities P(s|l) (derived from the corpus), where l is a lemma and s a semantic form. We extract more than 15,500 (non-empty) semantic forms with probabilities. In a similar manner, from the same resource it is possible to automatically extract the shortest paths linking LDD reentrancies in f-structure. These are classified according to LDD type (e.g. TOPIC, FOCUS) and associated with conditional probabilities P(p|d), where p is a path and d is either TOPIC or FOCUS. From the f-structure annotated Penn-II we extract 23 TOPIC and 54 FOCUS path types with associated probabilities. Given a proto-f-structure F, the LDD algorithm recursively traverses F and at each level tries to:
| Focus Paths    | #    | Focus Paths        | #    |
| up-subj        | 7894 | up-obj             | 1167 |
| up-xcomp       | 956  | up-xcomp:obj       | 793  |
| up-xcomp:xcomp | 161  | up-xcomp:xcomp:obj | 135  |
| up-comp:subj   | 119  | up-xcomp:subj      | 92   |
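To make the path probabilities P(p|d) concrete, the following is a minimal sketch of a relative-frequency estimate computed from the counts in the table above. It rests on two assumptions that the text does not confirm: that the probabilities are plain relative frequencies, and that normalising over only the eight listed path types is acceptable for illustration (54 FOCUS path types are extracted in total, so the values here are not the probabilities used in the paper). The names `focus_counts` and `relative_frequencies` are hypothetical.

```python
from collections import Counter

# Counts copied from the table above. Only the eight listed path types are
# included, so the resulting probabilities are illustrative only.
focus_counts = Counter({
    "up-subj": 7894,
    "up-obj": 1167,
    "up-xcomp": 956,
    "up-xcomp:obj": 793,
    "up-xcomp:xcomp": 161,
    "up-xcomp:xcomp:obj": 135,
    "up-comp:subj": 119,
    "up-xcomp:subj": 92,
})


def relative_frequencies(counts):
    """Estimate P(p|d) as count(p, d) / count(d) (assumed estimator)."""
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}


focus_probs = relative_frequencies(focus_counts)

# Print the paths from most to least probable.
for path, prob in sorted(focus_probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{path:20s} {prob:.4f}")
```

Under the same assumption, the semantic-form probabilities P(s|l) could be estimated in the same way, with one counter per lemma.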
The algorithm supports multiple topic/focus LDDs and multiplies the probabilities associated with each resolution to rank the resolved f-structure. It also supports resolution of LDDs where no overt linguistic material introduces a source topic/function (e.g. in wh-less "reduced" relative clause constructions).
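The ranking step can be sketched as follows. This is only an illustration under stated assumptions: each candidate resolution is represented as a hypothetical dictionary holding the list of probabilities involved in it, the probability values are invented for the example, and the names `score` and `rank` are not taken from the implementation described here.

```python
import math


def score(factors):
    """Multiply the probabilities associated with one candidate resolution."""
    return math.prod(factors)


def rank(candidates):
    """Return the candidates sorted from highest to lowest score."""
    return sorted(candidates, key=lambda c: score(c["factors"]), reverse=True)


# Invented example values; not taken from the treebank.
candidates = [
    {"name": "B", "factors": [0.10, 0.90]},
    {"name": "C", "factors": [0.70, 0.20, 0.60]},  # two LDDs resolved
    {"name": "A", "factors": [0.70, 0.50]},
]

for candidate in rank(candidates):
    print(candidate["name"], round(score(candidate["factors"]), 3))
```

With many factors the product can become very small, so a practical variant might sum log probabilities instead. That is a design choice of this sketch, not something the text states.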
We have implemented the algorithm and carried out initial tests on grammars trained on sections 02-21 of the WSJ part of the Penn-II treebank and evaluated on section 23. Evaluation is carried out against a test set of manually constructed gold-standard f-structures for 105 sentences randomly extracted from section 23 and against the (proper) f-structures generated by the automatic annotation algorithm (Cahill et al., 2002) for the full set of 2400 sentences in section 23. The results in Table 2 show that resolving the LDDs raises the f-score by about 3 points, from 59.98 to 63.04: recall improves from 52.57 to 57.92, while precision drops slightly from 69.82 to 69.16.
|        | Before Resolution     | After Resolution      |
|        | P     | R     | F     | P     | R     | F     |
| A-PCFG | 69.82 | 52.57 | 59.98 | 69.16 | 57.92 | 63.04 |
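As a sanity check on the table, the F columns can be recomputed from the precision and recall columns. The sketch below assumes the f-score is the balanced harmonic mean of precision and recall; the text does not state this, but the reported values are consistent with it. The function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall (assumed)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for A-PCFG, copied from the table above.
before = f_score(69.82, 52.57)
after = f_score(69.16, 57.92)

print(f"before resolution: {before:.2f}")
print(f"after resolution:  {after:.2f}")
print(f"difference:        {after - before:.2f}")
```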
In our view, the research reported here has raised a number of interesting issues. First, surprisingly perhaps, the wide-coverage, proto-f-structure grammars and parsers in Cahill et al. (2002) did not use lexical information. In order to extend these resources to proper f-structures (i.e. in order to account for LDDs), we naturally arrived at an architecture that involves lexical information in the form of subcat frames (semantic forms). However, these were not hand-coded but automatically extracted from the (proper) f-structure annotated Penn-II treebank (Cahill et al., 2002, 2003). Perhaps the most important aspect of this work, at least in our view, is that we have developed initial methodologies for the automatic construction of robust, wide-coverage, treebank-based, proper LFG grammars and parsers that can parse the Penn-II treebank at a much reduced development cost compared to manual development of comparable resources. We believe that this constitutes an alternative to manual, wide-coverage, rich (deep-analysis) unification grammar development and opens up the possibility of interesting research on combining manual and automatic grammar development.
References