This page describes the parseme-tsv-pos format, the format of the input corpora to be uploaded to the FLAT annotation platform in the PARSEME shared task on automatic detection of verbal MWEs. See a sample file for illustration.

The parseme-tsv-pos format is a five-column format derived from the parseme-tsv format in the following way:

  • The fourth column may or may not contain VMWE annotations (in the latter case, the whole column contains underscores '_').
  • The fifth column contains the part-of-speech tag for the current token, or an underscore ('_') if no tag is provided. No specific POS tagset is recommended, and the POS tags can take any form.
  • Comment lines are not allowed.
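
As a rough illustration of the rules above, the following Python sketch reads text in this format and rejects lines that break them. It assumes tab-separated columns, a blank line between sentences, and '#' as the comment marker, none of which is stated on this page. The names `Token` and `parse_parseme_tsv_pos` are hypothetical, and the meaning given to the first three columns (rank, token, nsp flag) is assumed from the parseme-tsv format rather than taken from this page.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    rank: str  # assumed: position of the token in the sentence
    form: str  # assumed: the token itself
    nsp: str   # assumed: "nsp" when no space follows the token
    vmwe: str  # fourth column: VMWE annotation, or "_"
    pos: str   # fifth column: POS tag, or "_"


def parse_parseme_tsv_pos(text: str) -> List[List[Token]]:
    """Split the text into sentences, each a list of five-column tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # Assumption: a blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Assumption: comment lines would start with '#'.
            raise ValueError(f"line {lineno}: comment lines are not allowed")
        cols = line.split("\t")
        if len(cols) != 5:
            raise ValueError(f"line {lineno}: expected 5 columns, got {len(cols)}")
        current.append(Token(*cols))
    if current:
        sentences.append(current)
    return sentences


# Illustrative input only; the "_" placeholders in the third column are assumed.
sample = (
    "1\tQuestioning\t_\t_\tVger\n"
    "2\tcolonial\t_\t_\t_\n"
    "\n"
    "1\tDelegates\t_\t_\t_\n"
    "2\tare\t_\t1:LVC\tV\n"
)
sentences = parse_parseme_tsv_pos(sample)
print(len(sentences), sentences[1][1].vmwe, sentences[1][1].pos)
```

Raising an error on the first malformed line is a design choice for this sketch; a real converter might prefer to collect all problems and report them together.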

Examples:

1 Delegates _ _
2 are 1:LVC V
3 in 1 _
4 little _
5 doubt 1 _
6 that _
7 the _
8 shadow 2:ID _
9 cast 2 Vpp
10 over _
11 the _
12 city _
13 by _
14 the _
15 attacks _
16 will V
17 enhance VInf
18 the _
19 chances _
20 of _
21 agreement nsp  _
22 . _
         
1 Questioning Vger
2 colonial _
3 boundaries _
4 would V
5 open _ Vinf
6 a _
7 dangerous _
8 Pandora nsp _ _
9 ' nsp _ _
10 s _ _
11 box nsp _ _
12 . _ _
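
In the first example, the code 1:LVC on "are" appears to open VMWE number 1 with category LVC, and the bare 1 on "in" and "doubt" marks further tokens of the same expression; 2:ID and 2 do the same for "shadow" and "cast". The Python sketch below groups tokens by these codes. The function name `collect_vmwes` is hypothetical, and splitting on ';' for tokens that belong to several VMWEs is an assumption that is not stated on this page.

```python
from typing import Dict, Iterable, Tuple


def collect_vmwes(rows: Iterable[Tuple[int, str, str]]) -> Dict[str, dict]:
    """Group tokens by VMWE identifier.

    rows holds (rank, token, fourth-column code) triples.
    Returns {identifier: {"category": ..., "tokens": [...]}}.
    """
    vmwes: Dict[str, dict] = {}
    for _rank, token, code in rows:
        if code in ("", "_"):
            continue  # token carries no VMWE annotation
        # Assumption: several codes on one token are separated by ';'.
        for part in code.split(";"):
            mwe_id, _, category = part.partition(":")
            entry = vmwes.setdefault(mwe_id, {"category": None, "tokens": []})
            if category:
                # The category is given only where the identifier first appears.
                entry["category"] = category
            entry["tokens"].append(token)
    return vmwes


# Annotated tokens from the first example sentence, plus one unannotated token.
rows = [
    (2, "are", "1:LVC"),
    (3, "in", "1"),
    (4, "little", "_"),
    (5, "doubt", "1"),
    (8, "shadow", "2:ID"),
    (9, "cast", "2"),
]
vmwes = collect_vmwes(rows)
print(vmwes)
```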

 

Files in this format are useful in the following cases:

  • part-of-speech tags are available for the corpora; in this case we recommend keeping only the verbal POS tags (including gerunds and participles), which FLAT then displays above the verbal tokens; this may greatly speed up manual annotation, since head verbs are automatically underlined in the FLAT interface; annotators should, however, be aware of the bias this introduces, especially if the POS tags are not gold-standard tags,
  • automatic VMWE pre-annotations are available and need manual validation in FLAT,
  • some annotators work off-line in Excel-like spreadsheets.
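
For the first case, the recommendation to keep only verbal POS tags could be applied with a small filter like the Python sketch below. Since no tagset is prescribed, the test for "verbal" is passed in as a parameter; the default (tags starting with "V", as in the examples on this page) is only an assumption, as are the tab separator and the function name `keep_verbal_pos`.

```python
from typing import Callable


def keep_verbal_pos(
    line: str,
    is_verbal: Callable[[str], bool] = lambda tag: tag.startswith("V"),
) -> str:
    """Replace a non-verbal POS tag in the fifth column with '_'."""
    cols = line.split("\t")
    # Lines without exactly five columns (e.g. blank lines) pass through unchanged.
    if len(cols) == 5 and not is_verbal(cols[4]):
        cols[4] = "_"
    return "\t".join(cols)


# Illustrative lines; the tag "DET" is a made-up non-verbal tag.
print(keep_verbal_pos("7\tthe\t_\t_\tDET"))
print(keep_verbal_pos("9\tcast\t_\t2\tVpp"))
```

With a different tagset, a caller would supply their own predicate, for example `is_verbal=lambda tag: tag in {"VERB", "AUX"}`.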