Multi-word expressions - basic facts
- MWEs are prevalent in natural language (they account for up to 40% of text items),
- MWEs are complex phenomena involving different levels of language:
  - non-compositional meaning (to take a haircut means to suffer a serious financial loss),
  - syntactic variation (take/takes/have taken/has taken/took a serious/70% haircut),
  - lexical idiosyncrasies (cross-roads, *cross-road),
  - syntactic idiosyncrasies (he kicked the bucket, *the bucket was kicked),
- MWEs are still not sufficiently understood,
- MWEs are under-represented in language resources and tools,
- MWEs are hard to detect, understand, translate, etc.
Current state of knowledge
NLP has made considerable progress over the past decades. Language resources, such as annotated corpora, electronic lexicons and grammars, are being developed for an increasing number of languages. New algorithms make it possible to process very large amounts of textual data and produce pertinent results. New small and medium enterprises offer text processing technology transfer, and end users are becoming aware of the added value of NLP applications.
Despite these encouraging results, NLP applications need further improvement. Currently most of them assume an (explicit or implicit) division of language phenomena into clear-cut levels: (i) tokens (indivisible text units, roughly words), (ii) morphology (properties of words, e.g. number and gender), (iii) syntax (structural links between words, e.g. number/gender agreement), (iv) semantics (meaning of words and sentences). However, human languages frequently show a high degree of ambiguity and fuzziness with respect to this layer-oriented model. In particular, MWEs sit on the frontier between these levels because of their idiosyncratic properties on the one hand, and their morphological, syntactic and semantic variation on the other. For instance, their meaning is often non-compositional, as in "to take a haircut" (i.e. "to suffer a serious financial loss"), yet they admit some syntactic variation, much like many other expressions ("take/takes/have taken/has taken/took a serious/70% haircut"). Strictly layer-oriented language models fail to reflect this specificity, and thus yield erroneous text processing results (e.g. word-for-word translations of idioms).
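The contrast between word-by-word processing and MWE-aware processing can be illustrated with a minimal Python sketch. It is not a method proposed by the Action: the lemma table, the glosses, the single lexicon entry and the names `word_by_word`, `find_mwe` and `MAX_GAP` are all hypothetical and chosen only for illustration. The sketch matches an MWE by lemma, so that inflected forms are covered, and tolerates a few intervening modifiers, so that "a serious/70% haircut" is covered too.

```python
# Toy resources; every entry below is an illustrative assumption.
LEMMAS = {"takes": "take", "took": "take", "taken": "take", "taking": "take"}
WORD_GLOSSES = {"take": "receive", "a": "a", "haircut": "hair trim"}
MWE_LEXICON = {("take", "a", "haircut"): "suffer a serious financial loss"}
MAX_GAP = 2  # assumed limit on modifiers between two MWE components


def lemmatize(token):
    """Map a token to its lemma using the toy table (lowercased fallback)."""
    return LEMMAS.get(token.lower(), token.lower())


def word_by_word(tokens):
    """Gloss each token in isolation, as a strictly layered pipeline would."""
    return [WORD_GLOSSES.get(lemmatize(t), t) for t in tokens]


def find_mwe(tokens, pattern, max_gap=MAX_GAP):
    """Return the token indices matching `pattern` in order, or None.

    Components are compared by lemma, and up to `max_gap` other tokens
    may occur between two consecutive components.
    """
    lemmas = [lemmatize(t) for t in tokens]
    for start, lemma in enumerate(lemmas):
        if lemma != pattern[0]:
            continue
        indices, pos = [start], start
        for component in pattern[1:]:
            window = range(pos + 1, min(pos + 2 + max_gap, len(lemmas)))
            nxt = next((i for i in window if lemmas[i] == component), None)
            if nxt is None:
                break
            indices.append(nxt)
            pos = nxt
        else:
            return indices
    return None


if __name__ == "__main__":
    sentence = "The bank took a 70% haircut".split()
    print(word_by_word(sentence))  # glosses each word separately, losing the idiom
    for pattern, meaning in MWE_LEXICON.items():
        hit = find_mwe(sentence, pattern)
        if hit is not None:
            print([sentence[i] for i in hit], "->", meaning)
```

A matcher of this kind still ignores syntactic idiosyncrasies (it would accept a passive form of an idiom that does not passivise), which is one reason a lexicon lookup alone is not a sufficient treatment of MWEs.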
Although the quantitative importance of MWEs is well known (they cover up to 30% of all words in human language utterances, and are much more numerous in lexicons than single words), the achievements in their formal representation and automatic processing are still largely unsatisfactory. Current research on MWEs shows that most proposals still concentrate either on creating MWE lexicons or on the automatic recognition of MWEs in text. Only a few approaches address the links between MWEs and a comprehensive linguistic analysis of text. These approaches confirm that a proper treatment of MWEs increases both linguistic precision and robustness. With respect to this state of the art, the Action will make a cutting-edge contribution in the following ways:
- A highly contrastive methodology. The Action will cross language boundaries by studying MWEs in different European languages. It will also compare points of view on MWEs in different linguistic and methodological frameworks.
- Accounting for the richness of the linguistic heritage in Europe. The Action will consider over 14 languages from all major European language families: Germanic (English, German, Norwegian, Swedish), Romance (French, Italian, Portuguese, Spanish), Slavic (Bulgarian, Czech, Polish, Serbian) and Finno-Ugric (Estonian, Hungarian).
- Increasing the cohesion of different levels of linguistic processing. Methods of explicit inclusion of MWEs into most levels of linguistic processing will be defined. The existing lexicons and grammars will increase their MWE coverage, and will be extended by dealing with morphology/syntax/semantics interface aspects. The strictly word-by-word processing framework will be abandoned, and the treatment of MWEs in syntactic and semantic processing will be brought into focus instead.
- Developing methodologies for cost-saving resource development. In order to cope with the high cost of developing language resources, the Action will put forward abstract MWE representation formalisms that can be mapped onto different linguistic frameworks.
- Simultaneously accounting for both linguistic precision and robustness. The Action will address both knowledge-based and data-driven approaches (cf. the following section), and make the most of their complementarity by enhancing and extending hybrid models. MWEs will play a central role in these considerations.