Event title: PARSEME shared task on automatic identification of verbal MWEs
- In our context, a shared task is usually a competition of existing NLP tools performing a similar task (here: automatic identification of verbal MWEs).
- The organizers of the task provide two corpora:
- A training corpus annotated (possibly manually) according to common guidelines. This corpus is sent to the participants in advance in order to allow them to adapt their systems to the guidelines.
- An evaluation corpus annotated according to the same guidelines. This corpus in its raw (un-annotated) version is used as input to the systems. The participants provide the output produced by their systems on this corpus. This output is compared with the 'gold standard' (the previously annotated version of the same corpus). The results are calculated according to previously agreed-upon evaluation measures (e.g. precision, recall, F-measure, accuracy, etc.).
- A common final workshop gathering the organizers and the participants allows for comparisons of the (manual and automatic) annotation methodologies, pointing at the challenging issues, etc.
- See the official shared task website for more details on the competition
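The comparison of system output against the gold standard can be sketched as follows. This is an illustrative implementation of the measures named above, not the official evaluation script; representing each VMWE as a set of (sentence, token) positions is an assumption made for the example.

```python
def precision_recall_f1(gold, predicted):
    """Compute MWE-based precision, recall and F-measure.

    gold and predicted are sets of hashable items, e.g. frozensets of
    (sentence_id, token_position) pairs, each identifying one VMWE.
    A predicted VMWE counts as correct only if it exactly matches a
    gold one (this strict matching is an assumption for illustration).
    """
    tp = len(gold & predicted)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, if a system finds one of two gold VMWEs plus one spurious candidate, both precision and recall are 0.5.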
Objectives of the PARSEME shared task:
- to cover many languages of different language families
- to boost the development of MWE processing tools
- to make results of tools performing similar tasks comparable
- to take discontinuous MWEs into account
- to bring MWE detection closer to parsing
- to provide a multilingual open resource to the NLP community
- NEW! Seven systems participated in the shared task (5 of them multilingual). They submitted 71 results in total. Each of the 18 languages for which annotated corpora were provided was covered by at least 2 systems. The results are available on the official shared task page.
- Who is who
- Organizing team: Marie Candito, Fabienne Cap, Silvio Cordeiro, Antoine Doucet, Voula Giouli, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Agata Savary, Ivelina Stoyanova, Veronika Vincze
- Technical support: Federico Sangati, Behrang QasemiZadeh, Silvio Cordeiro, Carlos Ramisch
- Language group leaders - see below
- Language leaders - see below
- Annotators - see below
- NEW! The training and test corpora for the shared task have been released for 18 languages. In total, the corpus contains about 4.5 million tokens and about 55,000 VMWE annotations. See the official ST page for more details.
- Master spreadsheet for the corpus file management (pilot and final annotation)
- Current state of the annotations in FLAT (some languages are annotated outside FLAT)
- annotating training and evaluation corpora of about 3,500-4,000 MWEs per language (by combining pre-existing data sets with newly annotated ones)
- annotation layers to decide (MWE layer, possibly also the part-of-speech layer)
- common annotation guidelines (but taking language specificities into account)
- selecting one or two text genres (e.g. newspaper texts)
- the corpus should consist of texts originally written in the language rather than translations
- the corpus should possibly be free from copyright issues, so as to be distributed under an open license
- annotation experts will be mostly from PARSEME
- teams participating in the contest may also be from outside the action
- Annotation guidelines v6 for the final annotation:
- Methodology for the final annotation phase:
- Guide for language leaders
- FLAT server
- FLAT user's and administration guide
- PARSEME-FLAT discussion group on Telegram for fast peer-to-peer feedback on FLAT annotation platform (usable on mobile phones and regular PCs via web browser, preferably Chromium or Google Chrome)
- Github space for reporting FLAT bugs (for authorised technical staff only)
- Gitlab space for discussing evaluation tools and annotation guidelines issues (for registered users only)
- Corpus for the final annotation:
- Corpus file directory
- Corpus format description
- parseme-tsv format for annotated files - files in this format are: (i) given on input to VMWE identification tools (in this case the last column contains underscores only), (ii) expected on output from VMWE identification tools (in this case the last column contains MWE tags or underscores)
- parseme-tsv-pos format - useful if (i) part-of-speech tags are available, to highlight verbs in FLAT; (ii) if VMWE pre-annotations are available
- parseme-tsv-split format - increasingly obsolete, used in the annotation of the platinum corpus v6; can still be used to upload (e.g. manually pre-annotated) files to FLAT.
- parseme-tsv-input format - obsolete; files in this format were initially supposed to be provided on input to VMWE identification tools; now the parseme-tsv format is used both on input and on output of such tools
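The parseme-tsv files described above can be read with a few lines of code. The sketch below is a hypothetical minimal reader, assuming only what the description states: token lines are tab-separated, sentences are separated by blank lines, and the last column holds either an MWE tag or an underscore.

```python
def read_parseme_tsv(lines):
    """Yield sentences from parseme-tsv text, one list of rows each.

    Each row is the list of tab-separated columns of one token line;
    blank lines are assumed to separate sentences. The exact number
    and meaning of the columns is left to the format description.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line: sentence boundary
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:  # flush a final sentence with no trailing blank line
        yield sentence
```

A file object can be passed directly as `lines`, since iterating over it yields one line at a time.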
- Corpus segmentation rules
- FLAT server - central on-line annotation platform used for the final annotation
- conllu2parsemetsv - a bash script for converting Universal Dependencies files in the CoNLL-U format into the parseme-tsv format
- conll2mwe - a Python script that produces parseme-tsv files from CoNLL-U files with missing NoSpace information in the 10th column; the script (i) automatically guesses the value of the nsp (nospace) flag, (ii) splits the input file into a number of files (--NoF) with a specific number of sentences per file (--SpF), (iii) removes multi-word tokens (e.g., 3-4)
- folia2tsv - a Python converter from Folia (FLAT XML format) to parseme-tsv
- tsv2folia - a Python converter from parseme-tsv, parseme-tsv-pos, parseme-tsv-input, and parseme-tsv-split to Folia; it is also integrated, as a library, into FLAT, so running it off-line is only needed for specific uses
- extract_annotated_mwes - a Python script extracting and sorting all annotated VMWEs from a file; can be used to check the annotation consistency
- Generic corpus tokenizer
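The consistency check performed by extract_annotated_mwes can be sketched as follows. This is a hypothetical reimplementation of the idea, not the actual script: it assumes the annotation column marks the first token of a VMWE as id:CATEGORY (e.g. 1:LVC), continuation tokens with the bare id, and separates overlapping annotations with ';'.

```python
from collections import defaultdict

def extract_annotated_vmwes(sentences):
    """Extract and sort all annotated VMWEs from parsed sentences.

    Each sentence is a list of rows (column lists); column 2 is assumed
    to hold the surface form and the last column the MWE annotation
    ('_' meaning none). Returns sorted (category, expression) pairs, so
    that repeated expressions line up for manual consistency checking.
    """
    results = []
    for sent in sentences:
        groups = defaultdict(list)  # VMWE id -> its tokens, in order
        cats = {}                   # VMWE id -> category label
        for row in sent:
            ann = row[-1]
            if ann == "_":
                continue
            for part in ann.split(";"):  # several VMWEs may share a token
                if ":" in part:          # first token carries the category
                    vid, cat = part.split(":", 1)
                    cats[vid] = cat
                else:                    # continuation token: bare id
                    vid = part
                groups[vid].append(row[1])
        for vid, tokens in groups.items():
            results.append((cats.get(vid, "?"), " ".join(tokens)))
    return sorted(results)
```

Sorting the output groups identical expressions together, which makes diverging annotations of the same VMWE easy to spot.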
Documents for pilot annotation (obsolete):
- annotation guidelines v4 and v5 used in pilot annotation phases 1 and 2
- annotation guidelines for quasi-universal categories:
- methodology for pilot annotation
- corpus format from the pilot annotation phase 2 and from the platinum v6 annotation
- UDPipe - trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. Trained models are provided for nearly all Universal Dependencies treebanks. They can be used to POS-tag a corpus; the verbal POS tags can then be added to the files (see the 5th column of the parseme-tsv format) to be annotated for the shared task, which greatly speeds up the manual annotation. The output of UDPipe models is in the CoNLL-U format, which can be transformed into the parseme-tsv format by the conllu2parsemetsv.sh script.
- Annotation guidelines of related initiatives
- mid July 2015: open call for participants
- 31 August 2015: deadline for expression of interest by the contributors
- 23-24 September 2015: organization meetings in Iasi
- October-November 2015: structuring the language teams into language families
- mid-January 2016: first version of the annotation guidelines, methodology for the annotation
- 20 January - 10 February 2016: pilot annotation phase 1, with feedback on guidelines
- 11 February - 7 March: enhancing the guidelines
- 8 March - 22 March: pilot annotation phase 2, based on the enhanced guidelines, adding language-specific features; developing annotation tools
- 7-8 April: Annotathon at the general meeting in Struga, discussing challenging issues, finalizing the guidelines
- 9 April - September: producing "platinum standard" corpora from the results of phase 2; drafting language-specific sections of the guidelines
- September-December: annotation, development of evaluation measures and tools
- January 2017 - February 2017: system training, evaluation, paper submissions and reviewing, notifications
- 3 or 4 April 2017: final workshop as part of MWE 2017, colocated with EACL 2017 in Valencia, Spain
We have received 20 expressions of interest so far from potential corpus contributors to the shared task. 21 languages are concerned and we divided them into 4 language groups. Group leaders coordinate the discussions on the annotation guidelines within their respective groups and communicate with the shared task leaders.
- Germanic languages - group leader: Fabienne Cap
- English: Ismail El Maarouf (leader), Teresa Lynn, Michael Oakes, Jamie Findlay, John McCrae; possibly assisted by: Corina Forascu et al., Federico Sangati et al., Veronika Vincze et al.
- German: Fabienne Cap (leader), Glorianna Jagfeld
- Swedish: Fabienne Cap (leader), Joakim Nivre, Eva Pettersson, Sara Stymne
- Yiddish: Yaakov Ha-Cohen Kerner, Chaya Liebeskind
- Romance languages - group leaders: Marie Candito and Carlos Ramisch
- French: Marie Candito (leader), Matthieu Constant, Ismail El Maarouf, Carlos Ramisch, Caroline Pasquer, Yannick Parmentier, Jean-Yves Antoine, Agata Savary
- Italian: Johanna Monti (leader), Valeria Caruso, Manuela Cherchi, Anna De Santis, Maria Pia di Buono, Annalisa Raffone
- Romanian: Verginica Barbu Mititelu (leader), Monica-Mihaela Rizea, Mihaela Ionescu, Mihaela Onofrei
- Spanish: Carla Parra Escartín (leader), Cristina Aceta, Itziar Aduriz, Uxoa Iñurrieta, Carlos Herrero, Héctor Martínez Alonso, Belem Priego Sanchez
- Brazilian Portuguese: Silvio Ricardo Cordeiro (leader), Aline Villavicencio, Carlos Ramisch, Leonardo Zilio, Helena de Medeiros Caseli, Renata Ramisch
- Balto-Slavic languages - group leader: Ivelina Stoyanova
- Bulgarian: Ivelina Stoyanova (leader), Tsvetana Dimitrova, Svetla Koeva, Svetlozara Leseva, Valentina Stefanova, Maria Todorova
- Czech: Eduard Bejček (leader), Zdeňka Urešová
- Croatian: (Marko Tadić et al.)
- Lithuanian: Jolanta Kovalevskaitė (leader), Loic Boizou, Erika Rimkutė, Ieva Bumbulienė
- Polish: Agata Savary (leader), Monika Czerepowicka
- Slovene: Simon Krek (leader), Polona Gantar, Taja Kuzman
- Other languages - group leader: Voula Giouli
- Farsi: Behrang QasemiZadeh (co-leader)
- Greek: Voula Giouli (leader), Vassiliki Foufi, Aggeliki Fotopoulou, Sevi Louisou
- Hebrew: Yaakov Ha-Cohen Kerner (co-leader), Chaya Liebeskind (co-leader), Hevi Elyovich, Ruth Malka
- Hungarian: Veronika Vincze (leader), Katalin Simkó, Viktória Kovács
- Maltese: Lonneke van der Plaas (co-leader), Luke Galea (co-leader), Greta Attard, Kirsty Azzopardi, Janice Bonnici, Jael Busuttil, Ray Fabri, Alison Farrugia, Sara Anne Galea, Albert Gatt, Anabelle Gatt, Amanda Muscat, Michael Spagnol, Nicole Tabone, Marc Tanti.
- Turkish: Kübra Adalı (co-leader), Gülşen Eryiğit (co-leader), Tutkum Dinç, Ayşenur Miral, Mert Boz
So far, at least 19 potential participants are willing to present their systems for the contest.
- 26 September 2016 - Annotathon 2 in Dubrovnik
- 7-8 April 2016 - Annotathon in Struga
- 29 March 2016 - feedback from the pilot annotation phase 2 (organizers, language group leaders)
- 11 February 2016 - feedback from the pilot annotation phase 1 (organizers, technical experts, language group leaders)
- 4 December 2015 - pilot annotation phase 1 kick-off meeting (organizers, technical experts, language group leaders)
- 23 September 2015 - slides from the organizational meeting in Iasi