This page describes the methodology of the pilot corpus annotation in the PARSEME shared task on automatic detection of verbal MWEs.

Objectives of the pilot annotation:

Edition of the guidelines:

Choice of the corpus:

Pilot annotation - phase 1 (20 January - 10 February 2016):

  1. [LANGUAGE TEAMS] Selecting a corpus of 200 sentences from newspapers (see above). Encoding the corpus in a text-only format in UTF-8.
  2. [LANGUAGE TEAMS or TECHNICAL SUPPORT] Converting it to a one-token-per-line format, with the first two columns filled in as specified in the format description. Language teams may either use their own tokenizers or be assisted by the technical support team with a generic tokenizer. See the segmentation rules for details.
  3. [LANGUAGE TEAMS] Annotating the corpus according to the proposed format. Sending comments on the annotation guidelines to the language group leaders. Annotators may use their custom annotation tools or centralized spreadsheet-based tools prepared by the technical support team.
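The conversion in step 2 could look roughly like the sketch below. It is only an illustration, not the tool used by the technical support team. It assumes that the first column holds the token's position in the sentence, that the second holds the token form, that columns are tab-separated, and that a blank line ends each sentence; the format description remains the authority on all of these points. The regular expression is a stand-in for a generic tokenizer and does not implement the segmentation rules, and the name `to_token_per_line` is hypothetical.

```python
import re

# Stand-in generic tokenizer (an assumption, not the official one):
# a run of word characters, or any single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_token_per_line(sentences):
    """Render sentences as one token per line with two tab-separated columns.

    Column 1: position of the token in its sentence, starting at 1 (assumed).
    Column 2: the token form.
    """
    lines = []
    for sentence in sentences:
        for rank, token in enumerate(TOKEN_RE.findall(sentence), start=1):
            lines.append(f"{rank}\t{token}")
        lines.append("")  # blank line as sentence separator (assumed)
    return "\n".join(lines)


if __name__ == "__main__":
    # Input is assumed to be already split into sentences and decoded from UTF-8.
    print(to_token_per_line(["He gave up.", "She took a walk."]))
```

The remaining columns, including the MWE annotation itself, are left for the annotators to fill in during step 3.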

Enhancement of guidelines (11 February - 7 March 2016):

[LANGUAGE GROUP LEADERS AND ORGANIZERS] Centralizing comments and publishing a new version of the guidelines.

Pilot annotation - phase 2 (8 - 22 March 2016):

Annotathon (at the 6th general meeting in Struga, 7-8 April 2016):

[LANGUAGE TEAMS, LANGUAGE GROUP LEADERS, ORGANIZERS] Discussing challenging issues and finalizing the guidelines.