3rd WG2 meeting parallel session "Representation of MWEs within linguistic resources"

(thanks to Stella Markantonatou for agreeing to chair this parallel session, and for this summary)

List of participants: Aggeliki Fotopoulou, Agnieszka Patejuk, Adam Przepiórkowski, Stella Markantonatou, Shuly Wintner, Eric Wehrli, Katerina Zdravkova, Cvetana Krstev, Dusko Vitas, Niki Samaridi, Daniela Majchrakova, Sascha Bargmann, Gert Webelhuth, Agata Savary.

GENERAL TOPICS:

Which properties in MWEs are easy to express in grammars?
Which syntactic properties of MWEs are hard to express in grammars (challenging examples)?

We realised that these questions depend on the special features of the resources, the systems and the theoretical approaches adopted, as well as on the particular features of each language. Each research group therefore briefly addressed these topics and the linguistic facts they considered challenging.

The following general picture emerged:

1. Grammars for various languages and within various grammatical frameworks are being developed: HPSG (Hebrew, German), LFG (Greek, Polish), generative grammar (French, English, German, Italian, Spanish), lexicon-grammar (Italian), LTAG (planned for Serbian).

2. Hebrew and Polish seem to have comprehensive grammars and lexical resources for MWEs (valence dictionaries)

3. There is a large generative grammar with French, Italian and Spanish collocations

4. There are comprehensive lexical resources and local grammars for Italian MWEs

5. There are comprehensive resources for Serbian

6. A grammar is being developed for German MWEs that takes into account cross-sentence phenomena such as ellipsis

7. There are some resources and an LFG grammar is being developed for Modern Greek MWEs

8. There is a large resource of aligned text derived from Europarl and Wikipedia that contains collocations and MWEs for Macedonian and Serbian; however, the resource still has to be cleaned of noise

9. The grammars discussed do not produce semantic representations. The Polish grammar is an exception to some extent.

In general, MWEs are parsed with the help of a lexicon in which the words that form MWEs are enriched with appropriate morphological and subcategorisation constraints. The Greek approach employs a preprocessor that recognizes words with spaces (WWSs), and the grammar reads a lexicon that contains WWSs together with “normal” words. In most cases, a general grammar component is used for parsing; in some cases only local grammars exist.
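The preprocessing step described above can be illustrated with a minimal sketch. It assumes that "WWSs" refers to words-with-spaces entries, i.e. fixed multiword strings stored in the lexicon as single units, and it uses a greedy longest-match strategy. The function name, the underscore joining and the English example entry are hypothetical and are not taken from the Greek system.

```python
from typing import Iterable, List


def merge_words_with_spaces(tokens: List[str], fixed_strings: Iterable[str]) -> List[str]:
    """Merge known fixed multiword strings into single tokens (greedy, longest match first)."""
    # Each lexicon entry is stored as a tuple of its component words.
    entries = {tuple(s.split()) for s in fixed_strings}
    max_len = max((len(e) for e in entries), default=1)

    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in entries:
                merged.append("_".join(candidate))  # hypothetical joining convention
                i += n
                break
        else:
            # No multiword entry starts here: keep the word as it is.
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    print(merge_words_with_spaces("they agreed by and large".split(), ["by and large"]))
```

A preprocessor of this kind only handles contiguous, invariable strings; discontinuous or inflecting MWEs still have to be treated by the grammar itself.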

Idioms seem to be hard to parse. Individual languages present particular phenomena; see for instance Hebrew and Serbian. The Polish and Greek teams reported that the theoretical framework that they employ (LFG) seems to lack expressive means for an adequate representation of a set of phenomena related to MWEs.

In various formalisms/languages difficult cases include:

(non-)agreement between the subject and the modifier of an object (Hebrew, French, English)
encoding subcategorisation frames (generative grammar)
discontinuous MWEs (local grammars)
fixed parts of MWEs which cannot be assigned reasonable POSs or functions (Greek)
agreement discrepancies between morphological and semantic features, e.g. morphological vs. semantic gender (Serbian, Polish), morphological vs. semantic number (English)
discrepancies between syntactic (object only) and semantic (whole expression) modification scope (English)
MWEs with a fixed order when the language itself has a relatively free word order (Polish)
distinguishing idiomatic and non-idiomatic uses (all grammars?)

INDIVIDUAL CONTRIBUTIONS TO THE DISCUSSION

Shuly: a large HPSG grammar exists for Hebrew MWEs together with a rich lexicon and a well-developed external morphology component, which is important given that Hebrew belongs to the Semitic language group. In addition to the lexicon/grammar interface, there is some impact on orthography since words combine “when they shouldn't”. Corpora are used in the grammar construction. An automatically created verb valence dictionary exists. There is no semantics component.

As a very difficult problem, Shuly mentioned a type of Hebrew MWE formed with the noun “brains”, where the noun takes a suffix that agrees with the subject:
(1) I stand on my-mind. (i.e. I insist on sth)
The subject (here: I) and the modifier of the object (here: my) must agree in person, number and gender.
Eric mentioned that many such examples exist in other languages (e.g. FR: vider son sac 'to empty one's bag' = to reveal one's secret thoughts).
Eric also mentioned that challenging examples of the opposite constraint exist:
(2) He was pulling my leg (i.e. He was playing a joke on me)
The subject (here: he) cannot agree in person with the object's modifier (here: my), otherwise the idiomatic sense is lost.
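The two constraints in (1) and (2) can be sketched as a check on feature bundles. This is an illustration only, not the encoding used in any of the grammars discussed: the class `Agr`, the constraint labels `must_agree` and `must_differ`, and the decision to compare the whole bundle for the opposite constraint are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agr:
    """Hypothetical agreement bundle: person, number and (optionally) gender."""
    person: int
    number: str
    gender: Optional[str] = None


def features_agree(a: Agr, b: Agr) -> bool:
    """True if person and number match and gender does not clash (unspecified gender matches anything)."""
    same_gender = a.gender is None or b.gender is None or a.gender == b.gender
    return a.person == b.person and a.number == b.number and same_gender


def idiomatic_reading_allowed(constraint: str, subject: Agr, possessor: Agr) -> bool:
    """Check the idiom-specific constraint between the subject and the object's possessive modifier."""
    if constraint == "must_agree":    # type (1): "I stand on my-mind", "vider son sac"
        return features_agree(subject, possessor)
    if constraint == "must_differ":   # type (2): "He was pulling my leg"
        return not features_agree(subject, possessor)
    raise ValueError(f"unknown constraint: {constraint}")


if __name__ == "__main__":
    i = Agr(person=1, number="sg")
    he = Agr(person=3, number="sg", gender="m")
    print(idiomatic_reading_allowed("must_agree", i, i))    # subject "I", modifier "my"
    print(idiomatic_reading_allowed("must_differ", he, i))  # subject "he", modifier "my"
```

The sketch compares features only. Two distinct third-person referents with identical features would require coreference information, which is outside its scope.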

Eric: A large home-made symbolic, roughly generative grammar and a parser exist that can treat German, English, French, Italian and Spanish collocations retrieved from corpora (16,000 for French). The particularity of the parser is that it integrates MWE identification into parsing (rather than performing it before or after): whenever an idiomatic reading is possible, it is preferred. The grammar takes advantage of ideas from several theoretical frameworks. It does not have a semantic component. The various languages are not aligned.

The special structure of MWEs is encoded on the words that form the expression as sets of appropriate constraints on grammatical and subcategorisation properties.

Particular difficulties are encountered with idioms encoding subcategorisation frames.

Johanna: Works within the lexicon-grammar framework. There are large lexical resources of Italian MWEs that encode semantic constraints (e.g. subject should be human) and there are local grammars that encode morphological and syntactic constraints, notably on MWEs. Difficult cases: discontinuous MWEs, distinguishing idiomatic and non-idiomatic uses.

Sascha: MWEs are processed as part of the MIMO German HPSG-based grammar (which is not a computational grammar yet). The emphasis is not on the variety of MWEs parsed but on cross-sentential phenomena where MWEs are involved such as ellipsis and anaphora resolution, e.g.:
   (3) He pulled a lot of strings to get the job. Those strings...
For each idiom, the components are in an e-lexicon, where we define how they find each other.
The scope of adjectives that modify components of an MWE was mentioned as an important problem that probably requires a semantics component.
Examples:
   (4) Kick the proverbial bucket - 'proverbial' modifies 'bucket' syntactically but not semantically (semantically it modifies the whole expression)
   (5) They spill bean after bean - 'bean after bean' is plural semantically but not morphologically
   (6) spill bean after horrible bean - 'horrible' modifies both beans, not only the last one
   (7) It is difficult to judge whether relative clauses are allowed in such idioms.

Cvetana: Large lexical resources exist for Serbian nominal MWEs as well as local grammars. The team is currently developing a tagger and an LTAG grammar.

Agreement is an important problem because in Serbian both grammatical and natural gender play a role: some elements agree with the grammatical gender and others with the natural gender (e.g. in numerals), e.g.:
(8) (SR) two men
(fem) (masc)
It seems that the grammatical gender agreement is more frequent for words that are closer to each other, while natural gender agreement is increasingly frequent in more distant words. As mentioned by Agnieszka, similar phenomena exist in Polish.

Niki and Stella: A resource of Greek verb MWEs exists that is used for LFG/XLE parsing. A preprocessor is employed that recognizes WWSs. Fixed parts of MWEs are extracted and treated as flat strings, which raises the problem that these strings cannot be given reasonable POSs.
They think that the LFG framework lacks expressive means for an adequate representation of MWE properties.

Katerina: A Macedonian 2-million-word (100 thousand lemmas) lexicon exists. Some collocations and MWEs (toponyms, nominal or verbal phrases) have been extracted from Europarl and Wikipedia. The underlying corpus processor is NooJ. Moses has been used to align the MWEs with Serbian. The material contains noise and has to be cleaned. Difficult cases: MWEs which have no equivalents in other languages.

Agnieszka and Adam: A comprehensive LFG/XLE grammar for Polish (Polfie) exists that draws on a large valence lexicon called ‘Walenty’, on a large pre-existing generative grammar and on a small but deep HPSG grammar. The grammar currently parses 30% of the sentences in the National Corpus of Polish (NKJP). Most unparsed sentences are due to timeouts. The system employs an independent morphological analyser.
'Walenty' has a rather rich formalism:

several layers: (i) syntax, (ii) semantics (deep; in future: two different schemata connected to the same deep semantics; semantics done as post-processing of f-structures - 1st order logic), (iii) selectional preferences (horse, human subject, etc.)
takes coordination explicitly into account (what can/cannot coordinate is explicitly mentioned); coordination is the test to know if we have one frame or two frames
based on authentic data (NKJP corpus)
a lot of attention is paid in the formalism to MWEs
arguments of frames can be lexicalized/instantiated, constrained, etc.
currently: 11,000 lemmata, 15,000 schemata, 8-9% of schemata contain a lexicalized element.
Compiling Walenty to LFG was tested; it would be good to test a compilation to other formalisms

A ‘light’ version of semantics is obtained directly from the f-structures (e.g. semantic roles can be approximated by cases), but more work on semantics is foreseen, and probably a ‘semantic’ extension of XLE. Future application: textual entailment.

Difficult cases:
Polish is a relatively free word order language but some MWEs have a fixed order:
(9) chodzić od Annasza do Kaifasza
'to go from Annas to Caiaphas = go to many places without being able to arrange anything'
Without semantics we cannot distinguish compositional from idiomatic meaning in syntactically regular expressions.
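The fixed-order case in (9) can be sketched as follows. This is a minimal illustration that works on surface forms and is not how Polfie or Walenty encode the constraint: the chunk representation, the `fixed_order` flag and both function names are hypothetical.

```python
from typing import List


def find_chunk(tokens: List[str], chunk: List[str]) -> int:
    """Return the start index of the first contiguous occurrence of chunk, or -1 if absent."""
    n = len(chunk)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == chunk:
            return i
    return -1


def matches_mwe(tokens: List[str], chunks: List[List[str]], fixed_order: bool) -> bool:
    """All chunks must occur; if fixed_order is set, they must occur in the listed order."""
    positions = [find_chunk(tokens, c) for c in chunks]
    if any(p < 0 for p in positions):
        return False
    return not fixed_order or positions == sorted(positions)


if __name__ == "__main__":
    chunks = [["od", "Annasza"], ["do", "Kaifasza"]]
    canonical = "chodzić od Annasza do Kaifasza".split()
    swapped = "chodzić do Kaifasza od Annasza".split()
    print(matches_mwe(canonical, chunks, fixed_order=True))
    print(matches_mwe(swapped, chunks, fixed_order=True))
```

With `fixed_order=False` the swapped variant would also match, which is the behaviour expected for ordinary dependents in a relatively free word order language.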

Aggeliki: stressed the need for a typology of MWE properties in order to discuss these general topics more effectively.