The linguistic diversity of European countries and nations is a core part of the cultural heritage to be maintained and developed within Europe and beyond. At the same time, the competitiveness of European markets within the global economic landscape must rely on efficient information access and processing. Since information is most often available in textual or spoken form, particularly on the fast-evolving Internet, and its volume is constantly growing, support from Information and Communication Technologies (ICT) is crucial. Methods for intelligent text processing have therefore been developed for decades, resulting in an increasing number of applications such as information extraction, machine translation, question answering, automatic text summarization, sentiment and opinion mining, and human-machine dialogue. Such Natural Language Processing (NLP) applications face three essential challenges:
- linguistic precision of methods and results (reflecting, at least partly, the richness and creativity of human language),
- specificities of particular languages and language families,
- computational efficiency in the context of large amounts of (possibly noisy) data to be processed rapidly.
It has been shown that one of the key problems to be overcome in order to meet all of these requirements simultaneously is posed by multi-word expressions (MWEs), i.e. sequences of words with some unpredictable properties, such as to count somebody in or to take a haircut. MWEs are truly a bottleneck of NLP, e.g. in machine translation, which tends to translate MWEs word by word. For instance, Google wrongly translates:
- to count Poland in as: compter la Pologne en (FR), contar con Polonia en (ES), contano in Polonia (IT), nach Polen rechnen (DE), etc., and
- European banks have to take a serious haircut as: les banques européennes ont à prendre une coupe de cheveux grave (FR), los bancos europeos tienen que tener un corte de pelo seria (ES), banche europee devono prendere un taglio di capelli grave (IT), europäische Banken haben eine ernste Haarschnitt nehmen (DE), evropske banke moraju da uzmu ozbiljan šišanje (RS), etc.
These are meaningless, partly ungrammatical, literal, word-by-word translations. The difficulty stems from the highly heterogeneous behaviour of MWEs at the lexical, syntactic and semantic levels. Since modelling language phenomena and providing efficient processing tools for their treatment prove difficult, most efforts have focused on ICT language tools dedicated to English. Taking a variety of languages into account has often been seen as an additional obstacle. This Action takes the opposite point of view. It sees Europe's multilingualism, provided that it is considered within a coordinated framework, as a source of better understanding of the general linguistic phenomena that are crucial to multilingual language technologies. Thus, Europe's multilingual heritage may become an advantage over other major NLP communities, e.g. in the USA, Japan or China.
In this context, some of the main challenges stem from fragmentation. Europe, being multilingual at its heart, must make more effort to bring together NLP research across linguistic and national boundaries. Moreover, the range of disciplines involved (linguistics, computing, statistics, psychology, etc.) calls for a common meeting place where the handling of MWEs can be discussed from various perspectives.