Extracting Large-scale Lexical Resources for LFG from the Penn-II Treebank

Aoife Cahill, Mairead McCarthy, Ruth O'Donovan, Josef van Genabith and Andy Way

Abstract

Proceedings of LFG03; CSLI Publications On-line

In modern syntactic theories (e.g. LFG, HPSG, CG, LTAG), the lexicon is the central repository for much morphological, syntactic and semantic information. Extensive lexical resources are therefore crucial for constructing wide-coverage computational systems based on such theories. Manual construction of such resources is, however, extremely time-consuming, expensive and requires considerable linguistic expertise. Indeed, the limitations of NLP systems based on lexicalised approaches are often due to bottlenecks in the lexicon component.

Given this, automating lexical acquisition for lexically-based NLP systems is a particularly important research issue. In this paper we present one approach to automating lexical acquisition for LFG-based systems. In LFG, subcategorisation requirements are enforced through semantic forms specifying which grammatical functions are required by a particular predicate. Our approach builds on earlier work on LFG semantic form extraction (van Genabith et al., 1999) and on recent progress in automatically annotating the Penn-II treebank with LFG f-structures (Cahill et al., 2002a). It requires a treebank annotated with f-structure schemata. In the original approach (van Genabith et al., 1999), this was provided by automatically annotating the rules extracted from the publicly available subset of the AP treebank (100 trees) with f-structure annotations. If the quality of these annotations is sufficiently high, LFG semantic forms can be generated quite simply by recursively reading off the subcategorisable grammatical functions for each local pred value at each level of embedding in the resulting f-structures. The work reported in van Genabith et al. (1999) was small-scale and proof-of-concept.
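To make the reading-off step concrete, the following is a minimal sketch in Python, assuming f-structures are represented as nested dictionaries; the function name, the dictionary encoding and the inventory of subcategorisable grammatical functions (SUBCAT_GFS) are illustrative assumptions, not the implementation of the annotation algorithm itself.

```python
# Minimal sketch: recursively read off semantic forms from an f-structure
# represented as a nested dictionary. All names here (SUBCAT_GFS,
# extract_semantic_forms) are illustrative, not from the original system.

# Grammatical functions treated as subcategorisable (an assumption; the
# extraction described above also distinguishes obliques by preposition,
# e.g. obl:as, obl:from).
SUBCAT_GFS = {"subj", "obj", "obj2", "comp", "xcomp", "obl", "poss"}

def extract_semantic_forms(fstr, forms=None):
    """Collect one semantic form per local PRED at every level of embedding."""
    if forms is None:
        forms = []
    if isinstance(fstr, dict):
        if "pred" in fstr:
            # The subcategorised GFs are those locally present at this PRED.
            gfs = sorted(gf for gf in fstr if gf in SUBCAT_GFS)
            forms.append((fstr["pred"], tuple(gfs)))
        # Recurse into embedded f-structures (e.g. SUBJ, COMP, ADJUNCT sets).
        for value in fstr.values():
            if isinstance(value, (dict, list)):
                extract_semantic_forms(value, forms)
    elif isinstance(fstr, list):
        for item in fstr:
            extract_semantic_forms(item, forms)
    return forms

# Toy example: "The committee accepted that prices rose."
fs = {
    "pred": "accept",
    "subj": {"pred": "committee", "spec": "the"},
    "comp": {"pred": "rise", "subj": {"pred": "price", "num": "pl"}},
}
print(extract_semantic_forms(fs))
# [('accept', ('comp', 'subj')), ('committee', ()), ('rise', ('subj',)), ('price', ())]
```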

In this paper we show how semantic forms can be extracted from the complete WSJ section of the Penn-II treebank---about 1 million words in 50,000 sentences---based on the automatic f-structure annotation algorithm described in Cahill et al. (2002a). Our method currently extracts over 15,500 unique non-empty semantic forms (over 50,000 in total), about half of which occur only once. About 5,700 non-empty semantic forms occur more than twice, and over 3,200 occur more than five times. (A non-empty semantic form contains at least one subcategorised grammatical function.)
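Conceptually, these figures are obtained by aggregating the extracted (lemma, frame) pairs and filtering by an occurrence threshold; the short sketch below (with names and thresholds of our own choosing, for exposition only) illustrates that aggregation step.

```python
from collections import Counter

# Illustrative aggregation of extracted (lemma, frame) pairs into counts,
# with a simple occurrence threshold; the pair format is that of the
# extraction sketch above.
def summarise(semantic_forms, min_count=1):
    counts = Counter((lemma, frame)
                     for lemma, frame in semantic_forms
                     if frame)                      # keep non-empty frames only
    return {sf: n for sf, n in counts.items() if n >= min_count}

# Over all pairs collected from the 50,000 Penn-II trees one would then have:
#   summarise(all_forms, min_count=1)   all unique non-empty semantic forms
#   summarise(all_forms, min_count=3)   forms occurring more than twice
#   summarise(all_forms, min_count=6)   forms occurring more than five times
```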

Semantic Form                  No. Occurrences   Probability
accept([obj,subj])                         122         0.814
accept([subj])                              11         0.073
accept([comp,subj])                          5         0.033
accept([obj,subj,obl:as])                    3         0.020
accept([obj,subj,obl:from])                  3         0.020
accept([subj,obl:as])                        3         0.020
Others                                       3         0.020

Table 1: Semantic forms for accept extracted from the Penn-II Treebank

To provide a concrete example, the subcategorisation frames extracted for accept (ignoring frames occurring only once or twice) are given in Table 1. By far the most frequent entry is accept([obj,subj]), i.e. accept<↑SUBJ,↑OBJ>, which accounts for 122 of the 150 occurrences of semantic forms for accept in the Penn-II treebank (81%). Even odd-looking semantic forms are sometimes useful, as they provide feedback on the automatic annotation algorithm: errors in the annotations may be the source of such subcategorisation frames. In this approach, subcategorisation frames are naturally associated with conditional probabilities (given lemma l, what is the probability of subcategorisation frame x?). These probabilities are useful in two respects: (i) in most linguistic theories, the distinction between complements and adjuncts is delicate and often not entirely clear-cut; probabilistic subcategorisation frames give an account of subcategorisation preferences, rather than a simple bipartition into complements and adjuncts; (ii) the probabilities associated with subcategorisation frames are useful for ranking analysis possibilities in computational systems.
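A minimal sketch of the conditional probability computation, P(frame | lemma) = count(lemma, frame) / count(lemma), is given below; the function and variable names are ours, for exposition only, and the counts are those of Table 1 with the collapsed "Others" row omitted, so the resulting probabilities differ slightly from the table.

```python
from collections import defaultdict

# Sketch of the conditional probabilities attached to semantic forms:
# P(frame | lemma) = count(lemma, frame) / count(lemma). Names are
# illustrative, not from the original implementation.
def frame_probabilities(counts):
    """counts: {(lemma, frame): n}  ->  {lemma: {frame: P(frame | lemma)}}"""
    totals = defaultdict(int)
    for (lemma, _), n in counts.items():
        totals[lemma] += n
    probs = defaultdict(dict)
    for (lemma, frame), n in counts.items():
        probs[lemma][frame] = n / totals[lemma]
    return probs

# Counts from Table 1 (the 'Others' row is omitted here, so the denominator
# is 147 rather than 150 and the probabilities differ slightly).
counts = {
    ("accept", ("obj", "subj")): 122,
    ("accept", ("subj",)): 11,
    ("accept", ("comp", "subj")): 5,
    ("accept", ("obj", "subj", "obl:as")): 3,
    ("accept", ("obj", "subj", "obl:from")): 3,
    ("accept", ("subj", "obl:as")): 3,
}
print(round(frame_probabilities(counts)["accept"][("obj", "subj")], 3))  # 0.83
```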

We evaluate the extracted subcategorisation frames against the COMLEX resource (Grishman et al., 1994) and achieve an f-score of 72% (ignoring oblique arguments and subcategorisation frames that account for less than 1% of the frames extracted). When we include oblique arguments we achieve an f-score of 64.3%. When we ignore frames that account for less than 5% of the frames extracted (again ignoring obliques), we obtain precision of 80.2%, recall of 53.6% and an f-score of 70.9%.
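The evaluation metrics can be illustrated with a short sketch that compares the set of extracted (lemma, frame) pairs above a given relative-frequency cut-off against a gold set of frames such as one derived from COMLEX; the helper names, and the assumption that the extracted frames have already been mapped onto the gold lexicon's notation, are ours rather than part of the original evaluation.

```python
# Illustrative evaluation of extracted frames against a gold lexicon such as
# COMLEX: precision, recall and f-score over sets of (lemma, frame) pairs,
# with a relative-frequency cut-off applied to the extracted frames.
def above_threshold(probs, cutoff):
    """Keep (lemma, frame) pairs whose conditional probability meets the cut-off."""
    return {(lemma, frame)
            for lemma, frames in probs.items()
            for frame, p in frames.items()
            if p >= cutoff}

def prf(predicted, gold):
    """Precision, recall and f-score of the predicted set against the gold set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# e.g.  prf(above_threshold(probs, 0.01), comlex_frames)   # 1% cut-off
#       prf(above_threshold(probs, 0.05), comlex_frames)   # 5% cut-off
```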

In a companion paper, we demonstrate how the semantic forms and the associated probabilities induced by our methodology can be interwoven into the wide-coverage, probabilistic LFG grammars and parsers derived automatically from the f-structure-annotated Penn-II treebank in Cahill et al. (2002b, 2003). In so doing, we achieve a lexicalised resolution of a number of long-distance dependencies at the level of f-structure in a statistical parser, without the need for empty productions or coindexation at the level of c-structure.

References