This page describes the methodology of the pilot corpus annotation in the PARSEME shared task on automatic detection of verbal multiword expressions (VMWEs).

Objectives of the pilot annotation:

  • test, enhance and validate the annotation guidelines
  • add language-specific features to the guidelines

Editing the guidelines:

  • under Google Docs
  • modification access allowed for the organizers and the language group leaders
  • annotators make enhancement suggestions to their language groups

Choice of the corpus:

  • Length: Each language team selects a common corpus for each of the 2 pilot annotation phases. Each corpus should contain about 200 sentences and may differ between phase 1 and phase 2.
  • Genre:
    • For pilot annotation phase 1 the corpus should contain newspaper texts or the like, not dedicated to any specific technical domain (general news rather than sports, weather forecasts, economics, etc.).
    • For pilot annotation phase 2 the choice of the corpus is more open: it can be as in phase 1, but it can also be, e.g., a corpus of spoken dialogs, blogs or chats.
    • Conclusions from phases 1 and 2 should allow us to decide on the final choice of corpus type.
  • Translationese issues: The corpus should consist of texts originally written in the language (rather than translated from another language) and should, if possible, be free from copyright issues, so as to be compatible with an open license.
  • Selection strategies: In order to highlight the existing VMWE-related issues, and to avoid bias in annotation and evaluation, the corpus should, where possible, contain both positive and negative examples of MWE occurrences. Therefore, it should be a running text rather than a set of automatically pre-selected sentences that would maximize the number of MWE occurrences.
  • Noisy data: The corpus should be kept in its original state. In particular, any spelling, grammar or punctuation errors should not be corrected.
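
The "running text" requirement above can be illustrated with a minimal sketch: instead of filtering sentences by MWE content, take one contiguous window of about 200 sentences from a larger source. The function name pick_running_sample and its parameters are hypothetical and not part of any shared task tooling; the input is assumed to be a list of already segmented sentences.

```python
import random

def pick_running_sample(sentences, size=200, seed=0):
    """Return `size` consecutive sentences starting at a random offset.

    The sentences are returned unchanged (no correction, no filtering),
    so the sample stays a running text in its original state.
    """
    if len(sentences) <= size:
        return list(sentences)
    # A fixed seed keeps the selection reproducible across the team.
    start = random.Random(seed).randrange(len(sentences) - size + 1)
    return sentences[start:start + size]

source = [f"Sentence number {i}." for i in range(1000)]
sample = pick_running_sample(source)
```

Because the window is chosen without looking at its content, the sample contains sentences with and without VMWEs in their natural proportion.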

Pilot annotation - phase 1 (20 January - 10 February 2016):

  1. [LANGUAGE TEAMS] Selecting a corpus of 200 sentences from newspapers (see above). Encoding the corpus in a text-only format in UTF-8.
  2. [LANGUAGE TEAMS or TECHNICAL SUPPORT] Converting it to a one-token-per-line format, with the first two columns filled in as specified in the format description. Language teams may either use their custom tokenizers or be assisted by the technical support with a generic tokenizer. See the segmentation rules for details.
  3. [LANGUAGE TEAMS] Annotating the corpus according to the proposed format. Sending comments on the annotation guidelines to the language group leaders. Annotators may use their custom annotation tools or centralized spreadsheet-based tools prepared by the technical support team.
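
As an illustration of step 2, the sketch below converts plain sentences to a one-token-per-line layout. It is not the generic tokenizer offered by the technical support, and the column layout is an assumption made for this example only: column 1 holds the 1-based token position, column 2 the token itself, columns are tab-separated, and a blank line separates sentences. The authoritative layout is the one given in the format description, and real tokenization should follow the segmentation rules.

```python
import re

# Naive tokenizer (assumption): a token is a run of word characters
# or a single non-space, non-word symbol such as punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def to_token_per_line(sentences):
    """Render sentences as tab-separated lines, one token per line.

    Assumed (hypothetical) layout: column 1 = 1-based token position,
    column 2 = token surface form; a blank line ends each sentence.
    """
    lines = []
    for sentence in sentences:
        for rank, token in enumerate(TOKEN_RE.findall(sentence), start=1):
            lines.append(f"{rank}\t{token}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)

print(to_token_per_line(["He kicked the bucket."]))
```

The tokens are copied verbatim from the input, so spelling or punctuation errors in the corpus are preserved rather than corrected.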

Enhancement of guidelines (11 February - 7 March 2016):

[LANGUAGE GROUP LEADERS AND ORGANIZERS] Centralizing comments and publishing a new version of the guidelines.

Pilot annotation - phase 2 (8 - 22 March 2016):

  • [LANGUAGE TEAMS] Selecting at least two annotators per language.
  • [LANGUAGE TEAMS] Selecting another sample of 200 sentences and converting them to the same format as previously.
  • [LANGUAGE TEAMS] Having the annotators annotate the new corpus independently of each other. Comparing the annotations and examining discrepancies (possibly with the help of the technical support). Sending new comments on the annotation guidelines to the language group leaders.
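
The comparison of two independent annotations can be sketched as follows. This is only one possible way to list discrepancies, not the procedure used by the technical support. The representation is hypothetical: each annotated VMWE is a pair (sentence id, tuple of token positions), and two annotations count as agreeing only if they match exactly. The F-score computed here is a simple symmetric overlap measure chosen for illustration.

```python
def compare_annotations(ann_a, ann_b):
    """Compare two annotators' VMWE annotations.

    Each annotation is assumed (hypothetically) to be a pair
    (sentence_id, tuple_of_token_positions).
    """
    a, b = set(ann_a), set(ann_b)
    shared = a & b
    # F-score of one annotator against the other; the formula is
    # symmetric, so it does not matter who is taken as the reference.
    f1 = 2 * len(shared) / (len(a) + len(b)) if (a or b) else 1.0
    return {
        "agreed": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "f1": f1,
    }

ann_a = [(1, (2, 4)), (3, (1, 2))]
ann_b = [(1, (2, 4)), (3, (1, 2, 3)), (5, (6, 7))]
report = compare_annotations(ann_a, ann_b)
```

The "only_a" and "only_b" lists are the discrepancies to discuss: here the annotators disagree on the span of the expression in sentence 3, and only the second annotator marked anything in sentence 5.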

Annotathon (at the 6th general meeting in Struga, 7-8 April 2016):

[LANGUAGE TEAMS, LANGUAGE GROUP LEADERS, ORGANIZERS] Discussing challenging issues, finalizing the guidelines.