Machine Translation Using Dependency Representation

Petr Homola and Matt Coler

Abstract

The paper presents a dependency-based representation within the LFG framework which is less language-specific than f-structures. It was designed as an intermediary representation in a rule-based MT system because we found f-structures too language-specific and unsuitable for generation due to the tendency of LFG grammars to overgenerate. The formal representation provided here has been tested on a parallel Aymara-English-Spanish corpus. A machine translation toolchain has been implemented that includes an LFG-based parser and a transfer module which utilizes the method described in this paper.

Although f-structures abstract to some extent from language-specific features (such as differential object marking), many differences remain even between relatively closely related languages. For example, the East Baltic language Latvian has only agentless passives, i.e. it completely lacks OBLag [Forssman, 2001], whereas its closest relative, Lithuanian, with which Latvian has a degree of mutual intelligibility, frequently uses agents in passives.

We use the information provided by f-structures, i-structures [King, 1997], c-structures and a-structures to create a dependency-based representation of parsed sentences (a tectogrammatical tree in the terminology of Sgall et al. [1986]).

Throughout this article we will use the term dependency tree (DT) to refer to deep syntax trees induced by LFG structures. The skeleton of a DT is provided by the f-structure. According to a generally accepted principle of deep syntax (tectogrammatics), only autosemantic (content) words are represented by nodes in DTs. In LFG, autosemantic words are associated with projections of lexical categories, i.e., f-structures with the PRED attribute. Table 1 provides a concise overview of which information at different levels of linguistic representation in LFG is used in DTs.

Table 1: Information provided by LFG layers to DTs
LFG layer      information in DTs
c-structure    original word order
f-structure    dependencies and coreferences
i-structure    topic-focus articulation
a-structure    thematic roles

The edges are labelled with semantic roles. This is possible due to the bi-uniqueness of the mapping between roles and GFs. However, there is one exception: the initial role is assigned a special label that we call 'actor' (ACT, which is equivalent to what Bresnan [2001] marks θ and calls 'logical subject'). This partially reflects the shifting of actants in tectogrammatics as defined by Sgall et al. [1986].
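The construction described so far can be illustrated with a minimal sketch. It assumes that an f-structure is available as a nested dictionary and that a small table maps GFs to role labels; both the data layout and the names `ROLE_OF_GF`, `DTNode` and `fstructure_to_dt` are hypothetical and are not part of the toolchain described in this paper. In particular, the fixed GF-to-role table is a simplification, since the roles are in fact read off the a-structure.

```python
from dataclasses import dataclass, field

# Hypothetical GF-to-role table (assumption); the paper derives roles
# from the a-structure rather than from a fixed lookup.
ROLE_OF_GF = {"SUBJ": "ACT", "OBJ": "PAT"}


@dataclass
class DTNode:
    lemma: str
    attrs: dict = field(default_factory=dict)      # function-word information
    children: list = field(default_factory=list)   # (role, DTNode) pairs


def pred_lemma(pred):
    """Strip the subcategorization frame: 'write<SUBJ,OBJ>' -> 'write'."""
    return pred.split("<", 1)[0]


def fstructure_to_dt(fs):
    """Build an unordered DT from a nested-dict f-structure."""
    node = DTNode(lemma=pred_lemma(fs["PRED"]))
    for attr, value in fs.items():
        if attr == "PRED":
            continue
        if isinstance(value, dict) and "PRED" in value:
            # An embedded f-structure with PRED is a content word: new node.
            role = ROLE_OF_GF.get(attr, attr)
            node.children.append((role, fstructure_to_dt(value)))
        else:
            # Everything else (TENSE, SPEC, ...) stays an attribute of the node.
            node.attrs[attr] = value
    return node


# Hand-coded f-structure used only for illustration.
fs = {
    "PRED": "write<SUBJ,OBJ>",
    "TENSE": "PERF",
    "SUBJ": {"PRED": "scientist", "SPEC": {"DEF": "-"}},
    "OBJ": {"PRED": "book", "SPEC": {"DEF": "+"}},
}
dt = fstructure_to_dt(fs)
```

In this sketch only sub-structures carrying PRED become nodes, while attributes such as TENSE and SPEC remain on the node that owns them.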

So far, we have an unordered tree (f-structures render directed acyclic graphs if structure sharing occurs but in such a case only one edge represents linguistic dependency while the other edges represent coreferences which occur at a different level and are thus absent from DTs). We define an ordering based on information structure, as proposed for deep syntax by Sgall et al. [1986]. Thus we use i-structures to define a partial ordering on the nodes of the DT. The nodes in each topic-focus domain are ordered according to their original ordering in the sentence (which is captured by c-structures).
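The ordering step can be sketched as a two-level sort. The sketch assumes that each node already carries its topic-focus domain (from the i-structure) and its surface position (from the c-structure), and that topic nodes precede focus nodes; the dictionary layout, the domain assignments in the example and the names `DOMAIN_RANK` and `order_nodes` are illustrative assumptions. It also flattens the tree into a list, which simplifies the partial ordering described above.

```python
# Assumption: topic material precedes focus material.
DOMAIN_RANK = {"TOPIC": 0, "FOCUS": 1}


def order_nodes(nodes):
    """Sort by topic-focus domain, then by original word order within a domain."""
    return sorted(nodes, key=lambda n: (DOMAIN_RANK[n["domain"]], n["position"]))


# Hypothetical annotation of three content words; positions are
# zero-based token indices in the source sentence.
nodes = [
    {"lemma": "scientist", "domain": "FOCUS", "position": 6},
    {"lemma": "write", "domain": "FOCUS", "position": 4},
    {"lemma": "book", "domain": "TOPIC", "position": 1},
]
ordered = [n["lemma"] for n in order_nodes(nodes)]
```

Because Python's sort is stable and the key is a pair, nodes in the same domain keep their relative surface order.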

The corresponding f-structure and DT of a sample Spanish sentence (1) are given in (2) (the attributes associated with nodes can be obtained from corresponding f-structures).

(1) El libro lo ha escrito un científico.
the book it-ACC has written a scientist
"The book has been written by a scientist."
(2) [ PRED  'write<SUBJ,OBJ>'
      TENSE  PERF
      SUBJ   [ PRED 'scientist'
               SPEC [ DEF - ] ]
      OBJ    [ PRED 'book'
               SPEC [ DEF + ]]]
    
            o
         /  |   \
       /    |     \
     PAT    |     ACT
     /      |       \
    book  write  scientist

Note that the appropriate English translation uses passive voice (hence the GFs in the corresponding f-structure would be SUBJ and OBLag) whereas in Spanish the passive voice is much more marked.

DTs can be viewed as interlingual representations that serve as input for syntactic and morphological synthesis. Let us briefly point out some properties of DTs, most of which directly correspond to properties of tectogrammatical trees as defined by Sgall et al. [1986].

  1. There is a bi-unique mapping between DT nodes and autosemantic (content) words. Synsemantic (auxiliary/function) words are represented as attributes of nodes. This is naturally achieved by using coheads in LFG.
  2. 'Dropped' words (e.g., subject and/or object pronouns in so-called pro-drop languages) are re-established in DTs as a consequence of the principle of completeness since PRED attributes are instantiated in the lexicon if needed (cf. [Bresnan, 2001]).
  3. Edge labels in DTs reflect semantic relations rather than the GFs, which are more language-specific.
  4. The ordering of DT nodes is partially determined by topic-focus articulation.

Table 2 shows how many c-structures, f-structures and DTs are identical in a parallel Aymara-English corpus of some 200 sentences. Two DTs are identical if they have the same structure (including node order), edge labels and relevant node labels.

Table 2: Identical c-structures, f-structures and DTs in a parallel corpus
level          identical representation
c-structure    6.5%
f-structure    37.2%
DT             71.8%
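The identity criterion can be made concrete with a small sketch. It assumes a DT node is a dictionary with the relevant node labels and an ordered list of (edge label, child) pairs; which labels count as relevant is not specified here, so the labels used below are placeholders, and the helper names are hypothetical.

```python
import copy


def signature(node):
    """Nested tuple capturing node labels, edge labels and child order."""
    labels = tuple(sorted(node["labels"].items()))
    kids = tuple((role, signature(child)) for role, child in node["children"])
    return (labels, kids)


def identical(dt_a, dt_b):
    """Two DTs are identical if their signatures coincide."""
    return signature(dt_a) == signature(dt_b)


def leaf(**labels):
    return {"labels": labels, "children": []}


# Toy trees: b copies a, c has the same children in the opposite order.
a = {"labels": {"TENSE": "PERF"},
     "children": [("PAT", leaf(DEF="+")), ("ACT", leaf(DEF="-"))]}
b = copy.deepcopy(a)
c = {"labels": {"TENSE": "PERF"}, "children": list(reversed(a["children"]))}

pairs = [(a, b), (a, c)]
share = sum(identical(x, y) for x, y in pairs) / len(pairs)
```

Since child order is part of the signature, two trees that differ only in node order count as different, in line with the definition above.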

In the transfer phase, PRED values are translated, DTs are linearized and inflected word forms are generated using a lexicon for the target language. Since every tree node is associated with an f-structure, the algorithm has access to all attributes which might be relevant for generation. The linearization is defined by hand-written rules which form a grammar that is independent of the source language. Due to lexical ambiguity, the output of the transfer phase generally consists of more than one sentence, in which case a language model might be used to resolve the ambiguity.
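How lexical ambiguity multiplies the output, and how a language model can pick among the candidates, is sketched below. The transfer lexicon, the toy training corpus and the add-one smoothing are all assumptions made for illustration; the sketch scores lemma sequences only and leaves out linearization rules and morphological generation.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical transfer lexicon: a source PRED may have several translations.
LEXICON = {
    "libro": ["book"],
    "escribir": ["write", "compose"],
    "científico": ["scientist"],
}


def candidates(linearized_preds):
    """All target lemma sequences licensed by the lexicon."""
    options = [LEXICON[p] for p in linearized_preds]
    return [list(choice) for choice in product(*options)]


def train_trigrams(sentences):
    """Count trigrams and their bigram histories, with sentence padding."""
    tri, bi, vocab = Counter(), Counter(), set()
    for s in sentences:
        toks = ["<s>", "<s>"] + s + ["</s>"]
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed trigram log-probability (smoothing is an assumption)."""
    tri, bi, v = model
    toks = ["<s>", "<s>"] + sentence + ["</s>"]
    return sum(
        math.log((tri[tuple(toks[i - 2:i + 1])] + 1) / (bi[tuple(toks[i - 2:i])] + v))
        for i in range(2, len(toks))
    )


toy_corpus = [["book", "write", "scientist"], ["letter", "write", "student"]]
model = train_trigrams(toy_corpus)
cands = candidates(["libro", "escribir", "científico"])
best = max(cands, key=lambda c: log_prob(c, model))
```

The number of candidates is the product of the number of translations per PRED, which is why a ranking step becomes necessary for longer sentences.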

We use a simple trigram-based language model.

We decided to use DTs in our MT experiments after we ran into problems with overgeneration when using f-structures for syntactic synthesis (the corresponding grammar was developed primarily for analysis). Although there are possible solutions, such as marking certain rules as analysis-only [Butt et al., 1999] or using OT constraints, our experience shows that from a practical point of view it is easier to use DTs for generation. The significant advantage is that DTs can be obtained automatically from the well-established LFG layers.

References