General - PARSEME

July 25, 2023: This page is kept for archival, the PARSEME Telegram group is not active anymore.
If you need help with PARSEME, please write to Agata Savary and/or Carlos Ramisch or use the language leaders mailing list if you are a language leader.

In order to provide an effective user support in the PARSEME shared task, a helpdesk chat group was created in the Telegram platform. This page is a user's guide for installing, configuring and participating in this chat group. Use this tool each time you need technical help with FLAT or any other shared task-related tool. Please, avoid sending emails to our technical experts. This will make your and their life easier :)

Option 1: use Telegram on your computer or mobile device via the app

Install Telegram on your mobile device (available from your app library) or on your computer (available for Mac or PC)
Log in using your mobile phone number and the code sent by sms
Open the group's ink 1 in your browser -- this will redirect you to the Telegram app, click on "Launch application" if you're asked.
Join the group and share your questions.

Option 1: use Telegram on your computer via the web interface

Open web.telegram.org in your browser
Log in using your mobile phone number and the code sent by sms
Open the group's link 2 in your browser
Join the group and share your questions

Event title: PARSEME shared task on automatic identification of verbal MWEs (edition 1.1)

New! The shared task edition 1.1 corpus annotated with verbal MWEs in 19 languages, has been published at LINDAT/CLARIN.

Scope: following the success of edition 1.0 of the PARSEME shared task (see also the summary and the PARSEME-internal webpage of this initiative), it will be reiterated in 2018 with:

an extended number of languages (Basque, Croatian, English and Hindi)
enhanced guidelines
larger corpora
new corpora possibly aligned with manually annotated treebanks (to prepare edition 2.0 of the shared task on joint MWE identification and parsing)

Timeline:

~~[May 2017: CORE GROUP]: recruiting core organizers, annotators, LLs, technical and guidelines experts~~
~~[June - October 20]: guidelines enhacements~~
- ~~[June 2-15: new language teams] pilot annotation (about 200 setences)~~
- ~~[June 2-15: LLs] collecting feedback on the guidelines from the language teams~~
- ~~[July - August LGLs] collecting feedback from LLs and grouping it into Gitlab issues~~
- ~~[September - 20 October: CORE GROUP] discussing and solving the Gitlab issues, updating the guidelines accordingly~~
~~[October 20-November 30: LLs] adding and updating multilingual examples, training new annotators, corpus selection and split~~
~~[October 20-November 30: language teams] pilot annotation (200 phrases), sending feedback on the new guidelines to Carlos and Agata~~
~~[November 30: Carlos and Agata] finalizing the guidelines~~
~~[November 17-20: Carlos, Agata, Silvio] automatic update of the previously annotated corpora (by renaming categories: LVC→LVC.full, IReflV→IRV, VPC→VPC.full, ID→VID, OTH→TODO)~~
~~[December 1 - February 28] annotation (see also QA to estimate the annotation effort):~~
- ~~[previously existing language teams] manual update of the previously annotated corpora with respect to the new guidelines:~~
  - ~~Annotating LVC.cause and VPC.semi,~~
  - ~~Renaming some IDs (containing two verbs) into MVC~~
  - ~~Changing some IDs to LVC.cause~~
  - ~~Annotating IAVs (optional)~~
  - ~~Manual re-classification of TODOs needs to be done, especially for FA, TR and HE~~
- [previously existing language teams] annotating new test data
  - ~~Minimal requirement: at least 500 newly annotated VMWEs~~
  - ~~Optimal requirements: at least 3,500 annotated VMWEs in total~~
- ~~[new language teams] annotating the training and test data: 3,500 annotated VMWEs in total~~
[January 20, February 28]annotation freeze, all annotators finish annotating their files
~~[February 5 March 1-16: LLs] gather files from all annotators and perform consistency checks~~
~~[March 16: LLs] send the full corpora to the core organizers~~
~~[February 5 March 21: core organizers] trial data and the evaluation script released~~
~~[March 17-30: LLs and core organizers] corpora conversion and train/dev/test split~~
~~[February 19 April 4: core organizers] training and development data released (synchronous with syntactic annotations)~~
~~[April 20: LLs] send Inter-Annotator Agreement (IAA) pair of files to core organizers~~
~~[Mar 9 April 30: core organizers] blind test data released~~
~~[March 19 May 4: participants] submission of system results~~
~~[March 26 May 11: core organizers] announcement of results~~
~~[May 25: participants] shared task system description papers: submission and review~~
~~[June 20] notification of acceptance~~
~~[June 30] camera ready papers~~
[August 25-26] shared task culmination at the LAW-MWE-CxG workshop (co-located with COLING 2018)

Documents:

Annotation guidelines
What's new in the guidelines v 1.1.
Guide for FLAT users (annotators)
Guide for language leaders
Guide for FLAT administrators (for authorised users only - contact the technical support)
Gitlab issue tracker for discussing evaluation tools and annotation guidelines issues (for registered users only)
How to udpate previously annotated corpora to the guidelines version 1.1
Current state of the annotations in FLAT (some languages are annotated outside FLAT) - note that this page contains counts for all the files present on the FLAT server, also those from edition 1.0.

Tools:

FLAT server (central annotation platform)
Telegram discussion groups for fast peer-to-peer problem solving (usable on mobile phones and regular PCs via web browser, preferably Chromium or Google Chrome):
- PARSEME helpdesk - annotators and language leaders can post technical issues concerning FLAT and other tools - see the user's guide
- PARSEME admin - technical experts can discuss technical choices
Github space for reporting FLAT bugs (for authorised technical staff only)
Gitlab spaces (see also their detailed description):
- release Gitlab space (public)
  - train and test corpora from edition 1.0 of the shared task
  - scripts for system evaluation and parsemetsv format validation
- development Gitlab space (for authorised users)
  - development version of the corpora
  - double-aligned corpora for IAA calculus
  - system results from edition 1.0 of the shared task
  - various scripts for ST organizers (automating system evaluation, publishing the results, running IAA calculus)
- guidelines Gitlab space (for authorised users)
  - HTML source codes for the annotation guidelines
  - scripts for the guidelines (inserting exemples in different languages)
- utilities Gitlab space (public)
  - scripts for the corpus consistency checks
  - various corpus converters, aligners and splitters
Mailing lists
- parseme-shared-task - annotators, language leaders, language group leaders, technical experts, guideines experts, core group
- parseme-st-org - language leaders, language group leaders, technical experts, guideines experts, core group
- parseme-st-core - core group

Corpus:

Corpus selection rules
- source: in the perspective of future editions of the shared task on joint MWE identification and parsing, a corpus that is already (manually) syntactically annotated would be preferable
- size: sufficient to include at least 3,500 VMWE annotations
- genre: preferably newspaper texts or Wikipedia articles
- translationese issues: the corpus should be written in the original rather than translated
- license issues: the corpus should be free from copyright issues so as to allow publication under the Creative Commons license
- dialects: for languages with large dialects (English, Spanish etc.), select corpora from the dialects for which there is at least one native annotator in the ST language team
Training and test corpus from edition 1.0
Corpus format description
- parseme-tsv format for annotated files - files in this format are: (i) given on input to VMWE identification tools (in this case the last column contains underscores only), (ii) expected on output from VMWE identification tools (in this case the last column contains MWE tags or underscores)
- parseme-tsv-pos format - useful if (i) part-of-speech tags are available, to highlight verbs in FLAT; (ii) if VMWE pre-annotations are available
- UD-compatible format - work in progress (planned for edition 2.0 of the PARSEME corpus): sample UD file
Corpus segmentation - it is recommended to use one of the following solutions:
New! The PARSEME corpus in version 1.0 is now available via the KonText and NoSke query systems. To use the system:
- select the PARSEME VMWE corpus from the list
- (in KonText only) click on Query -> Enter new query
- choose Query type -> CQL
- see the project page with sample queries, test the queries
- post questions and query examples to the parseme-cql group
- see also the project Github space

People and roles:

Core group: Silvio Cordeiro, Carlos Ramisch, Agata Savary, Veronika Vincze
Annotation guidelines experts: Verginica Barbu Mititelu, Marie Candito, Voula Giouli, Carlos Ramisch, Agata Savary, Nathan Schneider, Ivelina Stoyanova, Veronika Vincze
Technical support: Federico Sangati, Behrang QasemiZadeh, Silvio Cordeiro, Carlos Ramisch (please, contact them mainly via the PARSEME helpdesk Telegram group, this will make their and your life easier)
- Gitlab maintenance (Silvio)
- evaluation/conversion script maintenance (Silvio)
- FLAT server administration and updates (Behrang, backup: Carlos, Silvio)
- creating and managing FLAT user accounts (Behrang, backup: Carlos, Agata)
- addition/update of FLAT configurations (Federico)
- user support via a Telegram group (Federico)
- technical infrastructure for the guidelines and examples (Carlos)
FLAT consultant: Maarten van Gompel
Language Group Leaders (see below)
- Collecting feedback on the guidelines from LLs,
- Creating and administrating Gitlab issues assigned to the LG,
- Participation in guidelines enhancements,
- Periodically communicating with the language leaders and check how annotation is going
- Helping language leaders during the consistency checks
- Getting in touch with the technical team when language leaders experience problems or need to work on technical aspects such as missing CONLL-U files, data formats, FLAT issues, etc.
- Co-authoring publications on the ST organisation and corpus
Language Leaders (see below)
- Recruiting and training annotators
- Maintaining the list of examples in the language
- Collecting feedback on the guidelines and transferring it to the LGLs
- Preparing the corpus for annotation
- Coordinating the annotations
- Performing the consistency checks after the annotation
- Preparing the CoNLL-U files
- Administering the Gitlab issues assigned to the language
- Co-authoring publications on the ST corpus

Languages:

Germanic languages - group leader: Paul Cook
- English: Abigail Walsh (leader), Claire Bonial, Paul Cook, Jamie Findlay, Teresa Lynn, John McCrae, Nathan Schneider, Clarissa Somers
- German: Timm Lichte (leader), Rafael Ehren
Romance languages - group leaders: Marie Candito and Carlos Ramisch
- French: Marie Candito (leader), Matthieu Constant, Carlos Ramisch, Caroline Pasquer, Yannick Parmentier, Jean-Yves Antoine, Agata Savary
- Italian: Johanna Monti (leader), Valeria Caruso, Maria Pia di Buono, Antonio Pascucci, Annalisa Raffone, Anna Riccio
- Romanian: Verginica Barbu Mititelu (leader), Monica-Mihaela Rizea, Mihaela Ionescu, Mihaela Onofrei
- Spanish: Carla Parra Escartín (leader), Cristina Aceta, Alfredo Maldonado, Héctor Martínez Alonso, Belem Priego Sanchez
- Brazilian Portuguese: Renata Ramisch (leader), Silvio Ricardo Cordeiro, Aline Villavicencio, Carlos Ramisch, Leonardo Zilio, Helena de Medeiros Caseli
Balto-Slavic languages - group leader: Ivelina Stoyanova
- Bulgarian: Ivelina Stoyanova (leader), Tsvetana Dimitrova, Svetlozara Leseva, Valentina Stefanova, Maria Todorova
- Czech: Eduard Bejček (leader), Zdeňka Urešová
- Croatian: Maja Buljan (leader), Goranka Blagus, Ivo-Pavao Jazbec, Nikola Ljubešić, Ivana Matas, Jan Šnajder
- Lithuanian: Jolanta Kovalevskaitė (leader), Agne Bielinskiene, Loic Boizou
- Polish: Agata Savary (leader), Emilia Palka-Binkiewicz
- Slovene: Polona Gantar (co-leader), Simon Krek (co-eader), Špela Arhar Holdt, Jaka Čibej, Teja Kavčič, Taja Kuzman
Other languages - group leader: Voula Giouli, Uxoa Iñurrieta
- Arabic: Abdelati Hawwari (leader), Mona Diab, Mohamed Elbadrashiny, Rehab Ibrahim
- Basque: Uxoa Iñurrieta (leader), Itziar Aduriz, Ainara Estarrona, Itziar Gonzalez, Antton Gurrutxaga, Larraitz Uria, Ruben Urizar
- Farsi: Behrang QasemiZadeh (leader), Shiva Taslimipoor
- Greek: Voula Giouli (leader), Vassiliki Foufi, Aggeliki Fotopoulou, Stella Markantonatou, Stella Papadelli, Natasa Theoxari
- Hebrew: Chaya Liebeskind (leader), Hevi Elyovich, Yaakov Ha-Cohen Kerner, Ruth Malka
- Hindi: Archna Bhatia (co-leader), Ashwini Vaidya (co-leader), Kanishka Jain, Vandana Puri, Shraddha Ratori, Vishakha Shukla, Shubham Srivastava,
- Hungarian: Veronika Vincze (leader), Katalin Simkó, Viktória Kovács
- Maltese (skipping edition 1.1, should join edition 2.0): Lonneke van der Plaas, Luke Galea, Greta Attard, Kirsty Azzopardi, Janice Bonnici, Jael Busuttil, Ray Fabri, Alison Farrugia, Sara Anne Galea, Albert Gatt, Anabelle Gatt, Amanda Muscat, Michael Spagnol, Nicole Tabone, Marc Tanti.
- Turkish: Tunga Güngör (leader), Gozde Berk, Berna Erden

Other useful links:

UDPipe - trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. Trained models are provided for nearly all Universal Dependency treebanks. They can be used to POS-tag a corpus; the verbal POS tags can then be added to the files (see the 5th column of the parseme-tsv format) to be annotated for the shared task, which greatly speeds up the manual annotation. The output of UDPipe models is in the CoNLL format, which can be transformed to the parseme-tsv format by the conllu2parsemetsv.sh script.
Annotation guidelines of related initiatives

Questions and answers:

How much (new) data needs to be annotated for the ST edition 1.1?

For new languages, we intend to keep the same goals as last year, that is, to create a training corpus with around 3000 annotated VMWEs, and a test set of at least 500 annotated VMWEs.
For languages that already have some annotated data, but did not reach the goal of 3000+500 annotated VMWEs last year, this is an opportunity to reach this milestone.
Finally, even for languages that already have reached the goal last year, we will need to annotate a new test set with at least 500 annotated VMWEs. This last point is particularly important because the previous test sets were released publicly, so they are not secret anymore.

Will the new guidelines require revising the existing annotations?

Silvio Cordeiro,

This page contains links to PARSEME outcomes other than those listed in dedicated pages.

STSMs:

PARSEME funded 39 Short Term Scientific Missions for 35 reseachers and a total of 49 months with the following distribution:

early-stage researchers: 30 STSMs (77%); senior researchers: 9 STSMs (23%)
male researchers: 20 STSMs (51%); female researchers: 19 STSMs (49%)
25 countries were concerned in total (either as a sending or as a hosting country)
- STSMs coming from an inclusiveness country: 14 (36%); STSMs coming from a non-inclusiveness country: 25 (64%)
- STSMs going to an inclusiveness country: 8 (21%); STSMs going to a non-inclusiveness country: 31 (79%)
Average STSM duration: 29 days
All reports are available online

Members' lists:

PARSEME gathers members of 2 categories:

Management Committee members and substitutes were nominated by the participating countries as their official representative. The MC list is maintained by COST.
Working Group members are admitted according to PARSEME internal rules. The WG members' list, containing profiles and contacts of the members, is one of our networking instruments.

Spin-off projects:

Five PARSEME spin-off projects received national funding in the Czech Republic, France, Lithuania, Poland and Slovenia.

Success stories:

PARSEME was shortlisted by COST for a presentation at the European Conference for Science Journalists (Copenhagen, 26-30 June), the largest gathering of science journalists in Europe in 2017
Glorianna Jagfeld's bachelor thesis "Towards a Better Semantic Role Dimension of the success Labeling of Complex Predicates", supervised by Lonneke van der Plas, has received the German national GSCL prize for the best ESR support Bachelor thesis in Computational Linguistics as well as the local Infos prize.
Heinrich Heine Universitaet Duesseldorf (Laura Kallmeyer) and Agata Savary received a Seal of Excellence for the Marie Sk≈Çodowska-Curie Action proposal entitled "Object-Oriented Modeling of Multiword Expressions (MWE-Plus-Plus)", submitted on 14 September 2016 (call H2020-MSCA-IF-2016).
Agnieszka Patejuk received a mobility grant from the Polish "Mobilno≈õƒá Plus" program, for a 3-year research visit to the University of Oxford.

Theses:

Bojana Djordjevic (submitted) "Construction of a Formal Grammar of Serbian Using a Metagrammar", PhD thesis, under evaluation, supervised by Cvetana Krstev, University of Belgrade, Serbia
Agnieszka Patejuk (2015) "Unlike coordination in Polish: an LFG account", PhD thesis with honors, supervised by Adam Przepiórkowski, Institute of Polish Language, Polish Academy of Sciences, Kraków, Poland.
Gyri Smørdal Losnegaard (in preparation) "Predicting the unpredictable: Developing a lexicon model for Norwegian multiword expressions", PhD thesis, supervised by Victoria Rosén, University of Bergen, Norway.
Jakub Waszczuk (2017) "Leveraging MWEs in practical TAG parsing: towards the best of the two worlds", PhD thesis, supervised by Agata Savary and Yannick Parmentier, François Rabelais University Tours, France
Agata Savary (2014): "Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools", dissertation in view of an accreditation to supervise research (Habilitation à Diriger des Recherches), Université François Rabelais Tours, France

MWE games - a crowdsourced database of idioms in many languages structured as games for an easy, user-friendly and appealing discovery of mental images, metaphors and stereotypes conveyed by MWEs in various countries.

This page PARSEME deliverables, as defined in the Memorandum of Understanding.

Contrastive analysis of the linguistic properties of MWEs in different European languages.
- MWE templates - a WG1 survey on multilingually applicable classification of MWEs
- Manfred Sailer and Stella Markantonatou (eds.) Mutliword Expressions: Insights from a Multi-lingual Perspective - WG1 book
Proposal of a common design for lexicons including both valence data and MWE data.
- Publication on contrastive analysis of the design of valence MWE dictionaries in Czech and Polish:
  Adam Przepiórkowski, Jan Hajiƒç, El≈ºbieta Hajnicz, and Zde≈àka Urešová. Phraseology in two Slavic valency dictionaries: Limitations and perspectives. International Journal of Lexicography, 30(1):1–38, 2017
- WG1 workshop on lexical encoding of MWEs, based on the DuELME formalism meant to be theory- and grammar-independent, and interoperable with valence-aware grammars
Lexical databases: possibly interoperable parsing-oriented MWE lexicons and valence dictionaries in several European languages.
- WG1 MWE lexicon survey
- Publications about new MWE language resources
Extensions of existing corpora and treebanks in several languages with MWE annotation levels.
- Annotation guidelines for 18 languages in the PARSEME shared task on automatic identification of verbal MWEs
- Course on MWEs and the Praque Dependency Treebank
- PARSEME-FR - a PARSEME French spin-off project with annotating MWEs as one of the main objectives
- Papers on projecting MWE resources on treebanks
Extensions of existing grammars for several European languages with rules dedicated to MWEs.
- Course on Multi-word Expressions in HPSG
- Posters and papers on integrating MWEs in symbolic grammars (in English, Greek, Hebrew and Polish)
Definitions of abstract models (e.g. meta-grammars) of MWEs’ properties that would: (i) capture linguistic richness of MWEs independently of particular grammatical frameworks, (ii) help reduce the cost of resource development, (iii) adapt to different languages studied.
- 2 tutorials on XMG, a meta-grammar framework for efficient development of lexicalized grammars with MWEs
- Papers and posters on MWE encoding in XMG
- A tutorial on integrating MWEs in FRMG, a French Meta-Grammar
- WG1 workshop on lexical encoding of MWEs, based on the DuELME formalism meant to be theory- and grammar-independent, and interoperable with valence-aware grammars
Recommendations of best practices for MWE representation and treatment in parsing within different theoretical frameworks.
- Papers on joint parsing and MWE identification
- Tutorials on MWEs in FRMG and in the Grammatical Framework
- Course on "Dependency grammar, dependency parsing and MWEs"
- WG2 book "Representation and Parsing of Multiword Expressions"
Extension of hybrid (knowledge-based and data-driven) methods for parsing MWEs.
- WG3 survey on hybrid processing of MWEs
- Papers on a novel architecture of joint dependency parsing and MWE identification
- Papers on promoting MWEs in TAG parsing
Annotation guidelines for the representation of MWEs in treebanks.
- WG4 survey on annotating MWEs in treebanks
- 2 papers describing the survey and paving the way towards guidelines
- The guidelines (with examples and references)
A common publishing platform gathering initiatives in the field of MWEs and parsing.
- This website
- Publicly available Google table from the WG1 MWE lexicon survey
- Publicly available Wiki table from the WG4 survey on annotating MWEs in treebanks
Scientific publications in established conferences and journals in various domains - see the pages dedicated to papers and proceedings.

SIGLEX-MWE section and PARSEME are co-organizing the annual Multiword Expressions Workshop on 4 April 2017. It will be co-located with the EACL 2017 conference. It includes a special track dedicated the PARSEME shared task on automatic identification of MWEs.

PARSEME grants

PARSEME will fund travel and stay for 33 workshop participants from the PARSEME member countries. Applicants should fill in the application form by 15 February 2017. The selection of applicants entitled to reimbursement will be done by the PARSEME Steering Committee. Priority is given to:

workshop and shared task organizers, technical experts and language group leaders,
shared task language leaders,
authors of the best systems in the shared task,
presenters of papers/posters,
shared task annotators,
early-stage researchers,
PARSEME membres.

The reimbursement rates:

Hotel: 120 EUR per night (flat rate). The number of the reimbursed nights is equal to the number attended worhshop days plus 1 (in case the participant arrives earlier than her/his first attended day and leaves later than his last attended day). An attendance list must be signed each day of presence at the workshop.
Meals: 20 EUR per meal (flat rate).
Travel: real costs limited to 1200 € (economy class air tickets, train tickets, local transport, etc.).
Workshop admission fees are not eligible for reimbursement.

Detailed reimbursement rules are defined in the COST Vademecum, pp. 19-23, section 4. The applicants selected for funding will receive a formal invitation via the e-COST system (which they should accept before their travel). They should cover their travel and stay in advance and will be reimbursed on return.

Important dates:

16 22 January, 2017: Submission deadline for the main track long & short papers
5 February: Submission deadline for shared task system description papers
11 February: Notification of acceptance for the main track papers
12 February: Notification of acceptance for the shared task papers
15 February: deadline for applications for funding
20 February: Camera-ready papers due (main track and shared task)
1 March: notification to applicants about funding
4 April, 2017: MWE 2017 Workshop

PARSEME helpdesk Telegram group [DEPRECATED]

PARSEME shared task on automatic identification of verbal MWEs - edition 1.1

Other outcomes

MoU deliverables

MWE Workshop at EACL-2017

PARSEME grants

Events

News

Login Form