
Multi-word expressions - basic facts

Current state of knowledge

NLP has made considerable progress over the past decades. Language resources, such as annotated corpora, electronic lexicons and grammars, are being developed for an increasing number of languages. New algorithms make it possible to process very large amounts of textual data and produce pertinent results. New small and medium enterprises offer text processing technology transfer, and end users are becoming aware of the added value of NLP applications.

Despite these encouraging results, NLP applications need further improvement. Currently most of them assume an (explicit or implicit) division of language phenomena into clear-cut levels: (i) tokens (indivisible text units, roughly words), (ii) morphology (properties of words, e.g. number, gender, etc.), (iii) syntax (structural links between words, e.g. number/gender agreement), (iv) semantics (meaning of words and sentences). However, human languages frequently show a high degree of ambiguity and fuzziness with respect to this layer-oriented model. In particular, multi-word expressions (MWEs) lie on the boundary between these levels due to their idiosyncratic properties on the one hand, and their morphological, syntactic and semantic variations on the other hand. For instance, their meaning is often non-compositional, as in "to take a haircut" (i.e. "to suffer a serious financial loss"), yet they admit some syntactic variation, like many other expressions ("take/takes/have taken/has taken/took a serious/70% haircut"). Strictly layer-oriented language models fail to reflect this specificity, and thus yield erroneous text processing results (e.g. word-for-word translations of idioms).
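The "take a haircut" example can be illustrated with a minimal sketch. It assumes a hypothetical toy lemma table (LEMMAS) and a hypothetical helper (find_take_haircut); neither is part of any existing tool, and a real system would rely on a morphological analyser and a parser. The sketch only shows why a fixed-string lookup misses inflected or modified variants, while a lemma-based match with a small gap for modifiers finds them.

```python
import re

# Hypothetical toy lemma table (an assumption for illustration only).
LEMMAS = {"take": "take", "takes": "take", "took": "take",
          "taken": "take", "taking": "take"}


def lemma(token):
    """Map a token to its lemma; unknown tokens are just lower-cased."""
    return LEMMAS.get(token.lower(), token.lower())


def find_take_haircut(sentence, max_gap=3):
    """Return True if a form of 'take' is directly followed by 'a' and then
    'haircut', with at most max_gap modifier tokens in between."""
    tokens = re.findall(r"[\w%]+", sentence)
    lemmas = [lemma(t) for t in tokens]
    for i, lem in enumerate(lemmas):
        if lem != "take":
            continue
        if i + 1 < len(lemmas) and lemmas[i + 1] == "a":
            # Window holds up to max_gap modifiers plus the noun itself.
            window = lemmas[i + 2 : i + 3 + max_gap]
            if "haircut" in window:
                return True
    return False


def naive_match(sentence):
    """Fixed-string lookup, as a strictly token-level model would do."""
    return "take a haircut" in sentence.lower()


examples = [
    "Bondholders will take a haircut.",
    "The bank took a serious haircut.",
    "Investors have taken a 70% haircut.",
    "She took a walk.",
]
for s in examples:
    print(s, naive_match(s), find_take_haircut(s))
```

The sketch deliberately ignores other variations (e.g. passive voice or a different determiner) and cannot tell the idiomatic reading from a literal one; these limits are exactly where analysis above the token level becomes necessary.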

Although the quantitative importance of MWEs is well known (they cover up to 30% of all words in human language utterances, and are much more numerous in lexicons than single words), progress in their formal representation and automatic processing remains largely unsatisfactory. Current research on MWEs shows that most proposals still concentrate either on creating MWE lexicons or on the automatic recognition of MWEs in text. Only a few approaches address the links between MWEs and a comprehensive linguistic analysis of text. These approaches confirm that a proper treatment of MWEs increases both linguistic precision and robustness. With respect to this state of the art, the Action will make a cutting-edge contribution by: