PARSEME shared task on automatic identification of verbal MWEs (edition 1.1)

New! The shared task edition 1.1 corpus, annotated with verbal MWEs in 19 languages, has been published at LINDAT/CLARIN.

Scope: following the success of edition 1.0 of the PARSEME shared task (see also the summary and the PARSEME-internal webpage of this initiative), the task is being repeated in 2018 with:

  • an extended set of languages (newly added: Basque, Croatian, English and Hindi)
  • enhanced guidelines
  • larger corpora
  • new corpora possibly aligned with manually annotated treebanks (to prepare edition 2.0 of the shared task on joint MWE identification and parsing)

Timeline:

  • [May 2017: CORE GROUP] recruiting core organizers, annotators, language leaders (LLs), technical and guidelines experts
  • [June - October 20] guidelines enhancements
    • [June 2-15: new language teams] pilot annotation (about 200 sentences)
    • [June 2-15: LLs] collecting feedback on the guidelines from the language teams
    • [July - August: language group leaders (LGLs)] collecting feedback from LLs and grouping it into Gitlab issues
    • [September - October 20: CORE GROUP] discussing and solving the Gitlab issues, updating the guidelines accordingly
  • [October 20-November 30: LLs] adding and updating multilingual examples, training new annotators, corpus selection and split
  • [October 20-November 30: language teams] pilot annotation (about 200 sentences), sending feedback on the new guidelines to Carlos and Agata
  • [November 30: Carlos and Agata] finalizing the guidelines
  • [November 17-20: Carlos, Agata, Silvio] automatic update of the previously annotated corpora by renaming categories: LVC→LVC.full, IReflV→IRV, VPC→VPC.full, ID→VID, OTH→TODO (see the renaming sketch after this timeline)
  • [December 1 - February 28] annotation (see also the Q&A section below to estimate the annotation effort):
    • [previously existing language teams] manual update of the previously annotated corpora with respect to the new guidelines:
      • Annotating LVC.cause and VPC.semi
      • Renaming some IDs (containing two verbs) into MVC
      • Changing some IDs to LVC.cause
      • Annotating IAVs (optional)
      • Manually re-classifying TODOs, especially for Farsi (FA), Turkish (TR) and Hebrew (HE)
    • [previously existing language teams] annotating new test data
      • Minimal requirement: at least 500 newly annotated VMWEs
      • Optimal requirement: at least 3,500 annotated VMWEs in total
    • [new language teams] annotating the training and test data: 3,500 annotated VMWEs in total
  • [February 28] annotation freeze, all annotators finish annotating their files
  • [March 1-16: LLs] gather files from all annotators and perform consistency checks
  • [March 16: LLs] send the full corpora to the core organizers
  • [March 21: core organizers] trial data and the evaluation script released
  • [March 17-30: LLs and core organizers] corpora conversion and train/dev/test split
  • [April 4: core organizers] training and development data released (synchronous with syntactic annotations)
  • [April 20: LLs] send Inter-Annotator Agreement (IAA) pair of files to core organizers
  • [April 30: core organizers] blind test data released
  • [May 4: participants] submission of system results
  • [May 11: core organizers] announcement of results
  • [May 25: participants] shared task system description papers: submission and review
  • [June 20] notification of acceptance
  • [June 30] camera ready papers
  • [August 25-26] shared task culmination at the LAW-MWE-CxG workshop (co-located with COLING 2018)
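
The category renaming step above (November 17-20) is purely mechanical and can be scripted. The sketch below is a hypothetical illustration, not the organizers' actual script: it assumes the 4-column parseme-tsv layout (token rank, surface form, nsp flag, MWE tag) and MWE tags of the form 1:LVC for the first token of a VMWE, a bare rank for continuation tokens, and ; separating multiple VMWE memberships.

```python
# rename_categories.py - illustrative sketch of the automatic renaming
# LVC->LVC.full, IReflV->IRV, VPC->VPC.full, ID->VID, OTH->TODO.
# Assumes 4 tab-separated columns per token: rank, form, nsp flag, MWE tag.
import sys

RENAMING = {"LVC": "LVC.full", "IReflV": "IRV", "VPC": "VPC.full",
            "ID": "VID", "OTH": "TODO"}

def rename_tag(tag):
    """Rename categories inside an MWE tag such as '1:LVC;2' or '_'."""
    if tag in ("_", ""):
        return tag
    parts = []
    for mwe in tag.split(";"):       # a token may belong to several VMWEs
        if ":" in mwe:               # 'rank:category' - first token of a VMWE
            rank, cat = mwe.split(":", 1)
            parts.append(rank + ":" + RENAMING.get(cat, cat))
        else:                        # continuation token: bare rank
            parts.append(mwe)
    return ";".join(parts)

for line in sys.stdin:
    cols = line.rstrip("\n").split("\t")
    if len(cols) >= 4:               # token line: rewrite the MWE column
        cols[3] = rename_tag(cols[3])
    print("\t".join(cols))
```

Usage would be a simple filter, e.g. python rename_categories.py < old.parsemetsv > new.parsemetsv, run once per previously annotated corpus.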

Documents:

Tools:

  • FLAT server (central annotation platform)
  • Telegram discussion groups for fast peer-to-peer problem solving (usable on mobile phones and on regular PCs via a web browser, preferably Chromium or Google Chrome)
  • Github space for reporting FLAT bugs (for authorised technical staff only)
  • Gitlab spaces (see also their detailed description):
    • release Gitlab space (public)
      • train and test corpora from edition 1.0 of the shared task
      • scripts for system evaluation and parsemetsv format validation (an illustrative evaluation sketch follows this Tools section)
    • development Gitlab space (for authorised users)
      • development version of the corpora
      • double-aligned corpora for IAA calculation
      • system results from edition 1.0 of the shared task
      • various scripts for ST organizers (automating system evaluation, publishing the results, computing IAA)
    • guidelines Gitlab space (for authorised users)
      • HTML source codes for the annotation guidelines
      • scripts for the guidelines (inserting examples in different languages)
    • utilities Gitlab space (public)
      • scripts for the corpus consistency checks
      • various corpus converters, aligners and splitters
  • Mailing lists
    • parseme-shared-task - annotators, language leaders, language group leaders, technical experts, guidelines experts, core group
    • parseme-st-org - language leaders, language group leaders, technical experts, guidelines experts, core group
    • parseme-st-core - core group
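
For orientation, MWE-based evaluation reduces to comparing, sentence by sentence, the set of predicted VMWEs (each a set of token ranks) against the gold ones. The sketch below is an illustration under that assumption, not the official evaluation script (which also reports token-based scores and reads the parsemetsv format directly); the same F-measure, with one annotator taken as reference, is one way to compute the inter-annotator agreement mentioned above.

```python
# Illustrative MWE-based precision/recall/F1: a predicted VMWE counts as
# correct only if its set of token ranks exactly matches a gold VMWE.
# This is a sketch, not the official PARSEME evaluation script.

def mwe_fscore(gold_sents, pred_sents):
    """Each sentence is a collection of frozensets of token ranks, one per VMWE."""
    tp = sum(len(set(g) & set(p)) for g, p in zip(gold_sents, pred_sents))
    n_gold = sum(len(g) for g in gold_sents)
    n_pred = sum(len(p) for p in pred_sents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One sentence: gold VMWEs over tokens {2,4} and {7,8};
# the system predicted {2,4} and {7,9} -> 1 true positive out of 2.
gold = [[frozenset({2, 4}), frozenset({7, 8})]]
pred = [[frozenset({2, 4}), frozenset({7, 9})]]
print(mwe_fscore(gold, pred))    # (0.5, 0.5, 0.5)
```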

Corpus:

  • Corpus selection rules
    • source: with a view to future editions of the shared task on joint MWE identification and parsing, a corpus that is already (manually) syntactically annotated is preferable
    • size: sufficient to include at least 3,500 VMWE annotations
    • genre: preferably newspaper texts or Wikipedia articles
    • translationese issues: the corpus should be written in the original rather than translated
    • license issues: the corpus should be free from copyright issues so as to allow publication under the Creative Commons license
    • dialects: for languages with major dialectal variation (English, Spanish, etc.), select corpora from the dialects for which there is at least one native annotator in the ST language team
  • Training and test corpus from edition 1.0
  • Corpus format description
    • parseme-tsv format for annotated files - files in this format are: (i) given as input to VMWE identification tools (in this case the last column contains underscores only), (ii) expected as output from VMWE identification tools (in this case the last column contains MWE tags or underscores); see the reader sketch after this Corpus section
    • parseme-tsv-pos format - useful (i) if part-of-speech tags are available, to highlight verbs in FLAT; (ii) if VMWE pre-annotations are available
    • UD-compatible format - work in progress (planned for edition 2.0 of the PARSEME corpus): sample UD file
  • Corpus segmentation - it is recommended to use one of the following solutions:
      • your own tokenization, conforming to the generic segmentation rules
      • your own custom tokenization, provided that you document your definition of a token and your tokenization and/or sentence segmentation rules
      • an off-the-shelf tokenizer adapted to your language (e.g. UDPipe)
  • New! The PARSEME corpus in version 1.0 is now available via the KonText and NoSke query systems. To use the system:
    • select the PARSEME VMWE corpus from the list
    • (in KonText only) click on Query -> Enter new query
    • choose Query type -> CQL
    • see the project page with sample queries, test the queries
    • post questions and query examples to the parseme-cql group
    • see also the project Github space
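
To make the parsemetsv layout concrete, here is a minimal reader sketch. It assumes what the format description above states: four tab-separated columns per token (rank, surface form, nsp flag, MWE tag), '_' in the last column for unannotated tokens, and blank lines separating sentences; the format description page remains the authoritative reference.

```python
# Illustrative parseme-tsv reader, assuming 4 tab-separated columns per
# token (rank, form, nsp flag, MWE tag) and blank lines between sentences.

def read_parsemetsv(path):
    """Yield sentences as lists of (rank, form, nsp, mwe_tag) tuples."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                  # blank line = sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
                continue
            rank, form, nsp, mwe = line.split("\t")[:4]
            sentence.append((rank, form, nsp, mwe))
    if sentence:                          # last sentence lacking a trailing blank
        yield sentence

for sent in read_parsemetsv("example.parsemetsv"):   # placeholder file name
    tagged = [(form, mwe) for _, form, _, mwe in sent if mwe != "_"]
    if tagged:
        print(tagged)                     # tokens belonging to some VMWE
```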

People and roles:

  • Core group: Silvio Cordeiro, Carlos Ramisch, Agata Savary, Veronika Vincze
  • Annotation guidelines experts: Verginica Barbu Mititelu, Marie Candito, Voula Giouli, Carlos Ramisch, Agata Savary, Nathan Schneider, Ivelina Stoyanova, Veronika Vincze
  • Technical support: Federico Sangati, Behrang QasemiZadeh, Silvio Cordeiro, Carlos Ramisch (please contact them mainly via the PARSEME helpdesk Telegram group; this will make their life and yours easier)
    • Gitlab maintenance (Silvio)
    • evaluation/conversion script maintenance (Silvio)
    • FLAT server administration and updates (Behrang, backup: Carlos, Silvio)
    • creating and managing FLAT user accounts (Behrang, backup: Carlos, Agata)
    • addition/update of FLAT configurations (Federico)
    • user support via a Telegram group (Federico)
    • technical infrastructure for the guidelines and examples (Carlos)
  • FLAT consultant: Maarten van Gompel
  • Language Group Leaders (see below)
    • Collecting feedback on the guidelines from LLs
    • Creating and administrating Gitlab issues assigned to the LG
    • Participating in guidelines enhancements
    • Periodically communicating with the language leaders and checking how annotation is going
    • Helping language leaders during the consistency checks
    • Getting in touch with the technical team when language leaders experience problems or need help with technical aspects such as missing CoNLL-U files, data formats, FLAT issues, etc.
    • Co-authoring publications on the ST organisation and corpus
  • Language Leaders (see below)
    • Recruiting and training annotators
    • Maintaining the list of examples in the language
    • Collecting feedback on the guidelines and transferring it to the LGLs
    • Preparing the corpus for annotation
    • Coordinating the annotations
    • Performing the consistency checks after the annotation
    • Preparing the CoNLL-U files
    • Administering the Gitlab issues assigned to the language
    • Co-authoring publications on the ST corpus

Languages:

  • Germanic languages - group leader: Paul Cook
    • English: Abigail Walsh (leader), Claire Bonial, Paul Cook, Jamie Findlay, Teresa Lynn, John McCrae, Nathan Schneider, Clarissa Somers
    • German: Timm Lichte (leader), Rafael Ehren
  • Romance languages - group leaders: Marie Candito and Carlos Ramisch
    • French: Marie Candito (leader), Matthieu Constant, Carlos Ramisch, Caroline Pasquer, Yannick Parmentier, Jean-Yves Antoine, Agata Savary
    • Italian: Johanna Monti (leader), Valeria Caruso, Maria Pia di Buono, Antonio Pascucci, Annalisa Raffone, Anna Riccio
    • Romanian: Verginica Barbu Mititelu (leader), Monica-Mihaela Rizea, Mihaela Ionescu, Mihaela Onofrei
    • Spanish: Carla Parra Escartín (leader), Cristina Aceta, Alfredo Maldonado, Héctor Martínez Alonso, Belem Priego Sanchez
    • Brazilian Portuguese: Renata Ramisch (leader), Silvio Ricardo Cordeiro, Aline Villavicencio, Carlos Ramisch, Leonardo Zilio, Helena de Medeiros Caseli
  • Balto-Slavic languages - group leader: Ivelina Stoyanova
    • Bulgarian: Ivelina Stoyanova (leader), Tsvetana Dimitrova, Svetlozara Leseva, Valentina Stefanova, Maria Todorova
    • Czech: Eduard Bejček (leader), Zdeňka Urešová
    • Croatian: Maja Buljan (leader), Goranka Blagus, Ivo-Pavao Jazbec, Nikola Ljubešić, Ivana Matas, Jan Šnajder
    • Lithuanian: Jolanta Kovalevskaitė (leader), Agne Bielinskiene, Loic Boizou
    • Polish: Agata Savary (leader), Emilia Palka-Binkiewicz 
    • Slovene: Polona Gantar (co-leader), Simon Krek (co-leader), Špela Arhar Holdt, Jaka Čibej, Teja Kavčič, Taja Kuzman
  • Other languages - group leaders: Voula Giouli and Uxoa Iñurrieta
    • Arabic: Abdelati Hawwari (leader), Mona Diab, Mohamed Elbadrashiny, Rehab Ibrahim
    • Basque: Uxoa Iñurrieta (leader), Itziar Aduriz, Ainara Estarrona, Itziar Gonzalez, Antton Gurrutxaga, Larraitz Uria, Ruben Urizar
    • Farsi: Behrang QasemiZadeh (leader), Shiva Taslimipoor
    • Greek: Voula Giouli (leader), Vassiliki Foufi, Aggeliki Fotopoulou, Stella Markantonatou, Stella Papadelli, Natasa Theoxari
    • Hebrew: Chaya Liebeskind (leader), Hevi Elyovich, Yaakov Ha-Cohen Kerner, Ruth Malka
    • Hindi: Archna Bhatia (co-leader), Ashwini Vaidya (co-leader), Kanishka Jain, Vandana Puri, Shraddha Ratori, Vishakha Shukla, Shubham Srivastava
    • Hungarian: Veronika Vincze (leader), Katalin Simkó, Viktória Kovács
    • Maltese (skipping edition 1.1, should join edition 2.0): Lonneke van der Plas, Luke Galea, Greta Attard, Kirsty Azzopardi, Janice Bonnici, Jael Busuttil, Ray Fabri, Alison Farrugia, Sara Anne Galea, Albert Gatt, Anabelle Gatt, Amanda Muscat, Michael Spagnol, Nicole Tabone, Marc Tanti.
    • Turkish: Tunga Güngör (leader), Gozde Berk, Berna Erden

Other useful links:

  • UDPipe - trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. Trained models are provided for nearly all Universal Dependencies treebanks. They can be used to POS-tag a corpus; the verbal POS tags can then be added to the files (see the 5th column of the parseme-tsv format) to be annotated for the shared task, which greatly speeds up the manual annotation. The output of UDPipe models is in the CoNLL-U format, which can be transformed to the parseme-tsv format by the conllu2parsemetsv.sh script; a usage sketch follows this list.
  • Annotation guidelines of related initiatives
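
As a concrete illustration of the UDPipe route above, the sketch below uses UDPipe's official Python bindings (the ufal.udpipe package, installable via pip) to tokenize and POS-tag raw text; the model file name is a placeholder for whichever pre-trained model fits your language. The resulting CoNLL-U output can then be converted with conllu2parsemetsv.sh as described above.

```python
# Illustrative pre-annotation with UDPipe via the ufal.udpipe bindings:
# raw text in, CoNLL-U with lemmas and POS tags out (parsing skipped).
# The model file name below is a placeholder.
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("french-ud-2.0-170801.udpipe")   # placeholder model path
assert model is not None, "could not load the UDPipe model"

# Raw-text input ("tokenize"), default tagger, no parser, CoNLL-U output.
pipeline = Pipeline(model, "tokenize",
                    Pipeline.DEFAULT, Pipeline.NONE, "conllu")
error = ProcessingError()

conllu = pipeline.process("Il a pris une décision importante.", error)
assert not error.occurred(), error.message
print(conllu)   # feed this to conllu2parsemetsv.sh to obtain parseme-tsv
```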

Questions and answers:

  • How much (new) data needs to be annotated for the ST edition 1.1?
  • The short answer is: as much as possible! It is up to each language team to decide, but here are some suggestions:
    • For new languages, we intend to keep the same goals as last year, that is, to create a training corpus with around 3000 annotated VMWEs, and a test set of at least 500 annotated VMWEs.
    • For languages that already have some annotated data, but did not reach the goal of 3000+500 annotated VMWEs last year, this is an opportunity to reach this milestone.
    • Finally, even for languages that already have reached the goal last year, we will need to annotate a new test set with at least 500 annotated VMWEs. This last point is particularly important because the previous test sets were released publicly, so they are not secret anymore.
  • Will the new guidelines require revising the existing annotations?
  • It is a bit early to say. We intend to indicate, for each major change in the guidelines, what action should be taken for the existing annotations. Some changes may require simply running the consistency check scripts again, while others may be more complex and require going through the whole corpus again. Given the amount of existing annotated corpora, the impact of the updates on these annotations is one of the factors that we will consider when updating the guidelines.
