Abstract
This paper presents an approach to annotation projection in a multi-parallel corpus, that is, a collection of translated texts in more than two languages. Existing analysis tools, such as the LFG grammars from the ParGram project, are applied to two of the languages in the corpus, and the resulting annotation is projected to a third language, taking advantage of the largely parallel character of f-structure across languages. Because the third language can be a low-resource language, the technique can be particularly beneficial for corpus-based (cross-)linguistic research.
We discuss several ways to realize automatic corpus annotation based on multi-source projection, including direct projection and approaches that add a generalization step based on machine learning techniques. We present a series of detailed experiments on a sample annotation task, verb argument identification, using the German and English ParGram grammars for projection to Dutch and maximum entropy models for the generalization step.