PARSEME shared task on automatic identification of verbal MWEs (edition 1.1)

New! The shared task edition 1.1 corpus, annotated with verbal MWEs in 19 languages, has been published at LINDAT/CLARIN.

Scope: following the success of edition 1.0 of the PARSEME shared task (see also the summary and the PARSEME-internal webpage of this initiative), the task is being repeated in 2018 with:

  • an extended set of languages (newly added: Basque, Croatian, English and Hindi)
  • enhanced guidelines
  • larger corpora
  • new corpora possibly aligned with manually annotated treebanks (to prepare edition 2.0 of the shared task on joint MWE identification and parsing)

Timeline:

  • [May 2017: CORE GROUP] recruiting core organizers, annotators, language leaders (LLs), technical and guidelines experts
  • [June - October 20] guidelines enhancements
    • [June 2-15: new language teams] pilot annotation (about 200 sentences)
    • [June 2-15: LLs] collecting feedback on the guidelines from the language teams
    • [July - August: language group leaders (LGLs)] collecting feedback from LLs and grouping it into Gitlab issues
    • [September - October 20: CORE GROUP] discussing and solving the Gitlab issues, updating the guidelines accordingly
  • [October 20-November 30: LLs] adding and updating multilingual examples, training new annotators, corpus selection and split
  • [October 20-November 30: language teams] pilot annotation (about 200 sentences), sending feedback on the new guidelines to Carlos and Agata
  • [November 30: Carlos and Agata] finalizing the guidelines
  • [November 17-20: Carlos, Agata, Silvio] automatic update of the previously annotated corpora by renaming categories: LVC→LVC.full, IReflV→IRV, VPC→VPC.full, ID→VID, OTH→TODO (see the renaming sketch after this timeline)
  • [December 1 - February 28] annotation (see also the Q&A section below to estimate the annotation effort):
    • [previously existing language teams] manual update of the previously annotated corpora with respect to the new guidelines:
      • Annotating LVC.cause and VPC.semi
      • Renaming some IDs (containing two verbs) into MVC
      • Changing some IDs to LVC.cause
      • Annotating IAVs (optional)
      • Manually re-classifying TODOs, especially for Farsi (FA), Turkish (TR) and Hebrew (HE)
    • [previously existing language teams] annotating new test data
      • Minimal requirement: at least 500 newly annotated VMWEs
      • Optimal requirement: at least 3,500 annotated VMWEs in total
    • [new language teams] annotating the training and test data: 3,500 annotated VMWEs in total
  • [February 28] annotation freeze, all annotators finish annotating their files
  • [March 1-16: LLs] gather files from all annotators and perform consistency checks
  • [March 16: LLs] send the full corpora to the core organizers
  • [March 21: core organizers] trial data and the evaluation script released
  • [March 17-30: LLs and core organizers] corpora conversion and train/dev/test split
  • [April 4: core organizers] training and development data released (synchronous with syntactic annotations)
  • [April 20: LLs] send Inter-Annotator Agreement (IAA) pair of files to core organizers
  • [April 30: core organizers] blind test data released
  • [May 4: participants] submission of system results
  • [May 11: core organizers] announcement of results
  • [May 25: participants] shared task system description papers: submission and review
  • [June 20] notification of acceptance
  • [June 30] camera ready papers
  • [August 25-26] shared task culmination at the LAW-MWE-CxG workshop (co-located with COLING 2018)
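
The category renaming step above (November 17-20) is purely mechanical and can be scripted. The sketch below is a hypothetical illustration, not the organizers' actual script: it assumes the 4-column parseme-tsv layout (token rank, surface form, nsp flag, MWE tag) and MWE tags of the form 1:LVC for the first token of a VMWE, a bare rank for continuation tokens, and ; separating multiple VMWE memberships.

```python
# rename_categories.py - illustrative sketch of the automatic renaming
# LVC->LVC.full, IReflV->IRV, VPC->VPC.full, ID->VID, OTH->TODO.
# Assumes 4 tab-separated columns per token: rank, form, nsp flag, MWE tag.
import sys

RENAMING = {"LVC": "LVC.full", "IReflV": "IRV", "VPC": "VPC.full",
            "ID": "VID", "OTH": "TODO"}

def rename_tag(tag):
    """Rename categories inside an MWE tag such as '1:LVC;2' or '_'."""
    if tag in ("_", ""):
        return tag
    parts = []
    for mwe in tag.split(";"):       # a token may belong to several VMWEs
        if ":" in mwe:               # 'rank:category' - first token of a VMWE
            rank, cat = mwe.split(":", 1)
            parts.append(rank + ":" + RENAMING.get(cat, cat))
        else:                        # continuation token: bare rank
            parts.append(mwe)
    return ";".join(parts)

for line in sys.stdin:
    cols = line.rstrip("\n").split("\t")
    if len(cols) >= 4:               # token line: rewrite the MWE column
        cols[3] = rename_tag(cols[3])
    print("\t".join(cols))
```

Usage would be a simple filter, e.g. python rename_categories.py < old.parsemetsv > new.parsemetsv, run once per previously annotated corpus.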

Documents:

Tools:

  • FLAT server (central annotation platform)
  • Telegram discussion groups for fast peer-to-peer problem solving (usable on mobile phones and on regular PCs via a web browser, preferably Chromium or Google Chrome)
  • Github space for reporting FLAT bugs (for authorised technical staff only)
  • Gitlab spaces (see also their detailed description):
    • release Gitlab space (public)
      • train and test corpora from edition 1.0 of the shared task
      • scripts for system evaluation and parsemetsv format validation (an illustrative evaluation sketch follows this Tools section)
    • development Gitlab space (for authorised users)
      • development version of the corpora
      • double-aligned corpora for IAA calculation
      • system results from edition 1.0 of the shared task
      • various scripts for ST organizers (automating system evaluation, publishing the results, computing IAA)
    • guidelines Gitlab space (for authorised users)
      • HTML source codes for the annotation guidelines
      • scripts for the guidelines (inserting examples in different languages)
    • utilities Gitlab space (public)
      • scripts for the corpus consistency checks
      • various corpus converters, aligners and splitters
  • Mailing lists
    • parseme-shared-task - annotators, language leaders, language group leaders, technical experts, guidelines experts, core group
    • parseme-st-org - language leaders, language group leaders, technical experts, guidelines experts, core group
    • parseme-st-core - core group
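
For orientation, MWE-based evaluation reduces to comparing, sentence by sentence, the set of predicted VMWEs (each a set of token ranks) against the gold ones. The sketch below is an illustration under that assumption, not the official evaluation script (which also reports token-based scores and reads the parsemetsv format directly); the same F-measure, with one annotator taken as reference, is one way to compute the inter-annotator agreement mentioned above.

```python
# Illustrative MWE-based precision/recall/F1: a predicted VMWE counts as
# correct only if its set of token ranks exactly matches a gold VMWE.
# This is a sketch, not the official PARSEME evaluation script.

def mwe_fscore(gold_sents, pred_sents):
    """Each sentence is a collection of frozensets of token ranks, one per VMWE."""
    tp = sum(len(set(g) & set(p)) for g, p in zip(gold_sents, pred_sents))
    n_gold = sum(len(g) for g in gold_sents)
    n_pred = sum(len(p) for p in pred_sents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One sentence: gold VMWEs over tokens {2,4} and {7,8};
# the system predicted {2,4} and {7,9} -> 1 true positive out of 2.
gold = [[frozenset({2, 4}), frozenset({7, 8})]]
pred = [[frozenset({2, 4}), frozenset({7, 9})]]
print(mwe_fscore(gold, pred))    # (0.5, 0.5, 0.5)
```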

Corpus:

  • Corpus selection rules
    • source: with a view to future editions of the shared task on joint MWE identification and parsing, a corpus that is already (manually) syntactically annotated is preferable
    • size: sufficient to include at least 3,500 VMWE annotations
    • genre: preferably newspaper texts or Wikipedia articles
    • translationese issues: the corpus should be written in the original rather than translated
    • license issues: the corpus should be free from copyright issues so as to allow publication under the Creative Commons license
    • dialects: for languages with major dialectal variation (English, Spanish, etc.), select corpora from the dialects for which there is at least one native annotator in the ST language team
  • Training and test corpus from edition 1.0
  • Corpus format description
    • parseme-tsv format for annotated files - files in this format are: (i) given as input to VMWE identification tools (in this case the last column contains underscores only), (ii) expected as output from VMWE identification tools (in this case the last column contains MWE tags or underscores); see the reader sketch after this Corpus section
    • parseme-tsv-pos format - useful (i) if part-of-speech tags are available, to highlight verbs in FLAT; (ii) if VMWE pre-annotations are available
    • UD-compatible format - work in progress (planned for edition 2.0 of the PARSEME corpus): sample UD file
  • Corpus segmentation - it is recommended to use one of the following solutions:
      • your own tokenization, conforming to the generic segmentation rules
      • your own custom tokenization, provided that you document your definition of a token and your tokenization and/or sentence segmentation rules
      • an off-the-shelf tokenizer adapted to your language (e.g. UDPipe)
  • New! The PARSEME corpus in version 1.0 is now available via the KonText and NoSke query systems. To use the system:
    • select the PARSEME VMWE corpus from the list
    • (in KonText only) click on Query -> Enter new query
    • choose Query type -> CQL
    • see the project page with sample queries, test the queries
    • post questions and query examples to the parseme-cql group
    • see also the project Github space
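
To make the parsemetsv layout concrete, here is a minimal reader sketch. It assumes what the format description above states: four tab-separated columns per token (rank, surface form, nsp flag, MWE tag), '_' in the last column for unannotated tokens, and blank lines separating sentences; the format description page remains the authoritative reference.

```python
# Illustrative parseme-tsv reader, assuming 4 tab-separated columns per
# token (rank, form, nsp flag, MWE tag) and blank lines between sentences.

def read_parsemetsv(path):
    """Yield sentences as lists of (rank, form, nsp, mwe_tag) tuples."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                  # blank line = sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
                continue
            rank, form, nsp, mwe = line.split("\t")[:4]
            sentence.append((rank, form, nsp, mwe))
    if sentence:                          # last sentence lacking a trailing blank
        yield sentence

for sent in read_parsemetsv("example.parsemetsv"):   # placeholder file name
    tagged = [(form, mwe) for _, form, _, mwe in sent if mwe != "_"]
    if tagged:
        print(tagged)                     # tokens belonging to some VMWE
```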

People and roles:

  • Core group: Silvio Cordeiro, Carlos Ramisch, Agata Savary, Veronika Vincze
  • Annotation guidelines experts: Verginica Barbu Mititelu, Marie Candito, Voula Giouli, Carlos Ramisch, Agata Savary, Nathan Schneider, Ivelina Stoyanova, Veronika Vincze
  • Technical support: Federico Sangati, Behrang QasemiZadeh, Silvio Cordeiro, Carlos Ramisch (please contact them mainly via the PARSEME helpdesk Telegram group; this will make their life and yours easier)
    • Gitlab maintenance (Silvio)
    • evaluation/conversion script maintenance (Silvio)
    • FLAT server administration and updates (Behrang, backup: Carlos, Silvio)
    • creating and managing FLAT user accounts (Behrang, backup: Carlos, Agata)
    • addition/update of FLAT configurations (Federico)
    • user support via a Telegram group (Federico)
    • technical infrastructure for the guidelines and examples (Carlos)
  • FLAT consultant: Maarten van Gompel
  • Language Group Leaders (see below)
    • Collecting feedback on the guidelines from LLs
    • Creating and administrating Gitlab issues assigned to the LG
    • Participating in guidelines enhancements
    • Periodically communicating with the language leaders and checking how annotation is going
    • Helping language leaders during the consistency checks
    • Getting in touch with the technical team when language leaders experience problems or need help with technical aspects such as missing CoNLL-U files, data formats, FLAT issues, etc.
    • Co-authoring publications on the ST organisation and corpus
  • Language Leaders (see below)
    • Recruiting and training annotators
    • Maintaining the list of examples in the language
    • Collecting feedback on the guidelines and transferring it to the LGLs
    • Preparing the corpus for annotation
    • Coordinating the annotations
    • Performing the consistency checks after the annotation
    • Preparing the CoNLL-U files
    • Administering the Gitlab issues assigned to the language
    • Co-authoring publications on the ST corpus

Languages:

  • Germanic languages - group leader: Paul Cook
    • English: Abigail Walsh (leader), Claire Bonial, Paul Cook, Jamie Findlay, Teresa Lynn, John McCrae, Nathan Schneider, Clarissa Somers
    • German: Timm Lichte (leader), Rafael Ehren
  • Romance languages - group leaders: Marie Candito and Carlos Ramisch
    • French: Marie Candito (leader), Matthieu Constant, Carlos Ramisch, Caroline Pasquer, Yannick Parmentier, Jean-Yves Antoine, Agata Savary
    • Italian: Johanna Monti (leader), Valeria Caruso, Maria Pia di Buono, Antonio Pascucci, Annalisa Raffone, Anna Riccio
    • Romanian: Verginica Barbu Mititelu (leader), Monica-Mihaela Rizea, Mihaela Ionescu, Mihaela Onofrei
    • Spanish: Carla Parra Escartín (leader), Cristina Aceta, Alfredo Maldonado, Héctor Martínez Alonso, Belem Priego Sanchez
    • Brazilian Portuguese: Renata Ramisch (leader), Silvio Ricardo Cordeiro, Aline Villavicencio, Carlos Ramisch, Leonardo Zilio, Helena de Medeiros Caseli
  • Balto-Slavic languages - group leader: Ivelina Stoyanova
    • Bulgarian: Ivelina Stoyanova (leader), Tsvetana Dimitrova, Svetlozara Leseva, Valentina Stefanova, Maria Todorova
    • Czech: Eduard Bejček (leader), Zdeňka Urešová
    • Croatian: Maja Buljan (leader), Goranka Blagus, Ivo-Pavao Jazbec, Nikola Ljubešić, Ivana Matas, Jan Šnajder
    • Lithuanian: Jolanta Kovalevskaitė (leader), Agne Bielinskiene, Loic Boizou
    • Polish: Agata Savary (leader), Emilia Palka-Binkiewicz 
    • Slovene: Polona Gantar (co-leader), Simon Krek (co-leader), Špela Arhar Holdt, Jaka Čibej, Teja Kavčič, Taja Kuzman
  • Other languages - group leaders: Voula Giouli and Uxoa Iñurrieta
    • Arabic: Abdelati Hawwari (leader), Mona Diab, Mohamed Elbadrashiny, Rehab Ibrahim
    • Basque: Uxoa Iñurrieta (leader), Itziar Aduriz, Ainara Estarrona, Itziar Gonzalez, Antton Gurrutxaga, Larraitz Uria, Ruben Urizar
    • Farsi: Behrang QasemiZadeh (leader), Shiva Taslimipoor
    • Greek: Voula Giouli (leader), Vassiliki Foufi, Aggeliki Fotopoulou, Stella Markantonatou, Stella Papadelli, Natasa Theoxari
    • Hebrew: Chaya Liebeskind (leader), Hevi Elyovich, Yaakov Ha-Cohen Kerner, Ruth Malka
    • Hindi: Archna Bhatia (co-leader), Ashwini Vaidya (co-leader), Kanishka Jain, Vandana Puri, Shraddha Ratori, Vishakha Shukla, Shubham Srivastava
    • Hungarian: Veronika Vincze (leader), Katalin Simkó, Viktória Kovács
    • Maltese (skipping edition 1.1, should join edition 2.0): Lonneke van der Plas, Luke Galea, Greta Attard, Kirsty Azzopardi, Janice Bonnici, Jael Busuttil, Ray Fabri, Alison Farrugia, Sara Anne Galea, Albert Gatt, Anabelle Gatt, Amanda Muscat, Michael Spagnol, Nicole Tabone, Marc Tanti.
    • Turkish: Tunga Güngör (leader), Gozde Berk, Berna Erden

Other useful links:

  • UDPipe - trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. Trained models are provided for nearly all Universal Dependencies treebanks. They can be used to POS-tag a corpus; the verbal POS tags can then be added to the files (see the 5th column of the parseme-tsv format) to be annotated for the shared task, which greatly speeds up the manual annotation. The output of UDPipe models is in the CoNLL-U format, which can be transformed to the parseme-tsv format by the conllu2parsemetsv.sh script; a usage sketch follows this list.
  • Annotation guidelines of related initiatives
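
As a concrete illustration of the UDPipe route above, the sketch below uses UDPipe's official Python bindings (the ufal.udpipe package, installable via pip) to tokenize and POS-tag raw text; the model file name is a placeholder for whichever pre-trained model fits your language. The resulting CoNLL-U output can then be converted with conllu2parsemetsv.sh as described above.

```python
# Illustrative pre-annotation with UDPipe via the ufal.udpipe bindings:
# raw text in, CoNLL-U with lemmas and POS tags out (parsing skipped).
# The model file name below is a placeholder.
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("french-ud-2.0-170801.udpipe")   # placeholder model path
assert model is not None, "could not load the UDPipe model"

# Raw-text input ("tokenize"), default tagger, no parser, CoNLL-U output.
pipeline = Pipeline(model, "tokenize",
                    Pipeline.DEFAULT, Pipeline.NONE, "conllu")
error = ProcessingError()

conllu = pipeline.process("Il a pris une décision importante.", error)
assert not error.occurred(), error.message
print(conllu)   # feed this to conllu2parsemetsv.sh to obtain parseme-tsv
```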

Questions and answers:

  • How much (new) data needs to be annotated for the ST edition 1.1?
  • The short answer is: as much as possible! It is up to each language team to decide, but here are some suggestions:
    • For new languages, we intend to keep the same goals as last year, that is, to create a training corpus with around 3000 annotated VMWEs, and a test set of at least 500 annotated VMWEs.
    • For languages that already have some annotated data, but did not reach the goal of 3000+500 annotated VMWEs last year, this is an opportunity to reach this milestone.
    • Finally, even for languages that already have reached the goal last year, we will need to annotate a new test set with at least 500 annotated VMWEs. This last point is particularly important because the previous test sets were released publicly, so they are not secret anymore.
  • Will the new guidelines require revising the existing annotations?
  • It is a bit early to say. We intend to indicate, for each major change in the guidelines, what action should be taken for the existing annotations. Some changes may require simply running the consistency check scripts again, while others may be more complex and require going through the whole corpus again. Given the amount of existing annotated corpora, the impact of the updates on these annotations is one of the factors that we will consider when updating the guidelines.
