Automatic Acquisition of Spanish LFG Resources from the Cast3LB Treebank

Ruth O'Donovan, Aoife Cahill, Josef van Genabith, and Andy Way

Abstract

Proceedings of LFG05; CSLI Publications On-line

In this paper, we describe the automatic annotation of the Cast3LB Treebank with LFG f-structures for the subsequent extraction of Spanish probabilistic grammar and lexical resources. We adapt the approach and methodology of Cahill et al. (2004) and O'Donovan et al. (2004), originally developed for English, to Spanish and the Cast3LB Treebank encoding. We report on the quality and coverage of the automatic f-structure annotation. Following the pipeline and integrated models of Cahill et al. (2004), we extract wide-coverage probabilistic LFG approximations and parse unseen Spanish text into f-structures. We also extend Bikel's (2002) Multilingual Parse Engine to include a Spanish language module. Using the retrained Bikel parser in the pipeline model gives the best results against a manually constructed gold standard (73.20% preds-only f-score). We also extract Spanish lexical resources: 4090 non-empty semantic form types with 98 frame types. Subcategorised prepositions and particles are included in the frames.