Abstract
The paper presents a dependency-based representation within the LFG framework which is less language-specific than f-structures. It was designed as an intermediate representation in a rule-based MT system because we found f-structures too language-specific and unsuitable for generation due to the tendency of LFG grammars to overgenerate. The representation described here has been tested on a parallel Aymara-English-Spanish corpus. A machine translation toolchain has been implemented that includes an LFG-based parser and a transfer module which uses the method described in this paper.
Although f-structures abstract to some extent from language-specific features (such as differential object marking), there are still many differences even between relatively closely related languages. For example, the East Baltic language Latvian has only agentless passives, i.e. it completely lacks OBLag [Forssman, 2001], whereas its closest relative, Lithuanian, with which Latvian has a degree of mutual intelligibility, frequently uses agents in passives.
We use the information provided by f-structures, i-structures [King, 1997], c-structures and a-structures to create a dependency-based representation of parsed sentences (a tectogrammatical tree in the terminology of Sgall et al. [1986]).
Throughout this article we will use the term dependency tree (DT) to refer to deep syntax trees induced by LFG structures. The skeleton of a DT is provided by the f-structure. According to a generally accepted principle of deep syntax (tectogrammatics), only autosemantic (content) words are represented by nodes in DTs. In LFG, autosemantic words are associated with projections of lexical categories, i.e., f-structures with the PRED attribute. Table 1 provides a concise overview of which information from the different levels of linguistic representation in LFG is used in DTs.
LFG layer    | information in DTs
c-structure  | original word order
f-structure  | dependencies and coreferences
i-structure  | topic-focus articulation
a-structure  | thematic roles
Table 1: Information from the LFG layers used in DTs
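The nodehood criterion can be illustrated with a short sketch. It is a minimal illustration which assumes that f-structures are encoded as nested Python dictionaries mapping attributes to atomic values or embedded f-structures; the function name and the encoding are ours and do not correspond to the actual toolchain.

# Collect DT nodes from an f-structure: only sub-structures carrying a PRED
# attribute (projections of lexical categories, i.e. autosemantic words)
# yield nodes; purely functional material does not.
def collect_dt_nodes(fstructure, gf=None, nodes=None):
    """Return (grammatical function, PRED) pairs for every embedded
    f-structure that has a PRED of its own."""
    if nodes is None:
        nodes = []
    if "PRED" in fstructure:
        nodes.append((gf, fstructure["PRED"]))
    for attr, value in fstructure.items():
        if isinstance(value, dict):          # embedded f-structure
            collect_dt_nodes(value, attr, nodes)
    return nodes

fs = {"PRED": "read<SUBJ,OBJ>",
      "TENSE": "PAST",
      "SUBJ": {"PRED": "girl", "SPEC": {"DEF": "+"}},
      "OBJ":  {"PRED": "book", "SPEC": {"DEF": "-"}}}
print(collect_dt_nodes(fs))
# [(None, 'read<SUBJ,OBJ>'), ('SUBJ', 'girl'), ('OBJ', 'book')]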
The edges are labelled with semantic roles. This is possible due to the bi-uniqueness of the mapping between roles and GFs. However, there is one exception: the initial role is assigned a special label that we call 'actor' (ACT, which is equivalent to what Bresnan [2001] marks θ and calls 'logical subject'). This partially reflects the shifting of actants in tectogrammatics as defined by Sgall et al. [1986].
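The labelling step can be sketched as follows, assuming the a-structure is available as a list of (role, GF) pairs in thematic order; thanks to the bi-unique role-GF mapping, each GF simply inherits its role as the edge label, except that the initial role is overridden with ACT. The data format and names are illustrative assumptions, not the toolchain's actual interface.

def edge_labels(a_structure):
    """Map each grammatical function to a semantic-role edge label;
    a_structure is a list of (role, GF) pairs in thematic order."""
    labels = {}
    for i, (role, gf) in enumerate(a_structure):
        labels[gf] = "ACT" if i == 0 else role    # initial role -> 'actor'
    return labels

print(edge_labels([("AG", "SUBJ"), ("PAT", "OBJ")]))
# {'SUBJ': 'ACT', 'OBJ': 'PAT'}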
So far, we have an unordered tree (f-structures form directed acyclic graphs when structure sharing occurs, but in such a case only one edge represents a linguistic dependency, while the other edges represent coreferences, which belong to a different level and are thus absent from DTs). We define an ordering based on information structure, as proposed for deep syntax by Sgall et al. [1986]: we use i-structures to define a partial ordering on the nodes of the DT. The nodes in each topic-focus domain are then ordered according to their original ordering in the sentence (which is captured by c-structures).
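The ordering can be made concrete with the following sketch, which assumes that the i-structure assigns each node to a topic-focus domain (simplified here to TOPIC preceding FOCUS) and that the c-structure supplies each node's surface position. The attribute names and the domain assignment in the example are illustrative, not the actual data model of our grammars.

DOMAIN_ORDER = {"TOPIC": 0, "FOCUS": 1}   # topic precedes focus

def order_nodes(nodes):
    """Sort DT nodes by information-structure domain, then by their
    original surface position within that domain."""
    return sorted(nodes,
                  key=lambda n: (DOMAIN_ORDER[n["domain"]], n["surface_pos"]))

# A sentence with a topicalised object (cf. example (1) below)
nodes = [{"pred": "write",     "domain": "FOCUS", "surface_pos": 5},
         {"pred": "book",      "domain": "TOPIC", "surface_pos": 2},
         {"pred": "scientist", "domain": "FOCUS", "surface_pos": 7}]
print([n["pred"] for n in order_nodes(nodes)])
# ['book', 'write', 'scientist']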
The corresponding f-structure and DT of a sample Spanish sentence (1) are given in (2) (the attributes associated with nodes can be obtained from corresponding f-structures).
(1)  El    libro  lo      ha   escrito  un  científico.
     the   book   it-ACC  has  written  a   scientist
     "The book has been written by a scientist."
(2)  [ PRED   'write<SUBJ,OBJ>'
       TENSE  PERF
       SUBJ   [ PRED 'scientist'   SPEC [ DEF - ] ]
       OBJ    [ PRED 'book'        SPEC [ DEF + ] ] ]

              write
         PAT /    \ ACT
            /      \
         book    scientist
Note that the appropriate English translation uses the passive voice (hence the GFs in the corresponding f-structure would be SUBJ and OBLag), whereas in Spanish the passive voice is much more marked.
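One way the DT in (2) could be encoded is sketched below with a minimal node type; the class and attribute names are illustrative and are not the data structures of the toolchain described here.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DTNode:
    pred: str                                   # lemma of the autosemantic word
    role: str = "ROOT"                          # label of the edge to the parent
    children: List["DTNode"] = field(default_factory=list)   # kept in DT order

write = DTNode("write", children=[
    DTNode("book", role="PAT"),        # topic, ordered before the head
    DTNode("scientist", role="ACT"),   # focus, ordered after the head
])
print([(c.role, c.pred) for c in write.children])
# [('PAT', 'book'), ('ACT', 'scientist')]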
DTs can be viewed as interlingual representations that serve as input for syntactic and morphological synthesis. Let us briefly point out some properties of DTs, most of which directly correspond to properties of tectogrammatical trees as defined by Sgall et al. [1986].
Table 2 shows the proportion of c-structures, f-structures and DTs that are identical in a parallel Aymara-English corpus of some 200 sentences. Two DTs are identical if they have the same structure (including node order), the same edge labels and the same relevant node labels.
level        | identical representations
c-structure  | 6.5%
f-structure  | 37.2%
DT           | 71.8%
Table 2: Identical c-structures, f-structures and DTs in a parallel corpus
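The identity test behind Table 2 can be sketched as follows. It assumes that a DT node is encoded as a dictionary with a node label and an ordered list of (edge label, child) pairs, and it abstracts away from what exactly counts as a 'relevant' node label and from the matching of lemmas across languages.

def dt_key(node):
    """Reduce a DT (recursively) to a comparable tuple:
    (node label, ordered (edge label, child key) pairs)."""
    return (node["pred"],
            tuple((edge, dt_key(child)) for edge, child in node["children"]))

def identical(dt_a, dt_b):
    return dt_key(dt_a) == dt_key(dt_b)

book = {"pred": "book", "children": []}
scientist = {"pred": "scientist", "children": []}
dt_en = {"pred": "write", "children": [("PAT", book), ("ACT", scientist)]}
dt_es = {"pred": "write", "children": [("PAT", book), ("ACT", scientist)]}
print(identical(dt_en, dt_es))   # True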
In the transfer phase, PRED values are translated, DTs are linearized and inflected word forms are generated using a lexicon for the target language. Since every tree node is associated with an f-structure, the algorithm has access to all attributes which might be relevant for generation. The linearization is defined by hand-written rules which form a grammar that is independent of the source language. Due to lexical ambiguity, the output of the transfer phase generally consists of more than one sentence; in that case, a language model can be used to resolve the ambiguity.
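The transfer phase can be summarised by the following sketch. The bilingual lexicon, the linearisation callback and the inflection callback are placeholders standing in for the hand-written rules and the target-language lexicon; all names are ours, and the sketch omits, among other things, the handling of grammatical attributes.

from itertools import product

def transfer(dt_nodes, bilingual_lexicon, linearise, inflect):
    """dt_nodes          -- DT nodes in DT order
       bilingual_lexicon -- source PRED -> list of candidate target PREDs
       linearise         -- reorders (node, target PRED) pairs for the target language
       inflect           -- maps (target PRED, node) to an inflected word form
       Returns all candidate target sentences (lexical ambiguity preserved)."""
    choices = [bilingual_lexicon[n["pred"]] for n in dt_nodes]
    candidates = []
    for picks in product(*choices):               # one target PRED per node
        ordered = linearise(list(zip(dt_nodes, picks)))
        words = [inflect(pred, node) for node, pred in ordered]
        candidates.append(" ".join(words))
    return candidates

lex = {"book": ["book"], "write": ["write", "compose"], "scientist": ["scientist"]}
nodes = [{"pred": "book"}, {"pred": "write"}, {"pred": "scientist"}]
print(transfer(nodes, lex, lambda pairs: pairs, lambda pred, node: pred))
# ['book write scientist', 'book compose scientist']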
We use a simple trigram-based language model. We decided to use DTs in our MT experiments after encountering overgeneration problems when f-structures were used for syntactic synthesis (the corresponding grammar had been developed primarily for analysis). Although there are possible solutions, such as marking certain rules as analysis-only [Butt et al., 1999] or using OT constraints, our experience shows that from a practical point of view it is easier to use DTs for generation. A significant advantage is that DTs can be obtained automatically from the well-established LFG layers.
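A minimal sketch of the trigram-based disambiguation is given below. It assumes that trigram log-probabilities are available in a plain dictionary with a floor value for unseen trigrams; smoothing, back-off and length normalisation, which a real model would need, are omitted.

import math

def trigram_logprob(sentence, trigram_logprobs, floor=math.log(1e-6)):
    """Score a candidate sentence with a trigram model (padded with <s>, </s>)."""
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    return sum(trigram_logprobs.get(tuple(tokens[i - 2:i + 1]), floor)
               for i in range(2, len(tokens)))

def best_candidate(candidates, trigram_logprobs):
    """Pick the candidate sentence the language model scores highest."""
    return max(candidates, key=lambda s: trigram_logprob(s, trigram_logprobs))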