This page describes the parseme-tsv-pos format, used for input corpora to be uploaded to the FLAT annotation platform in the PARSEME shared task on automatic detection of verbal MWEs. See a sample file for illustration.
The parseme-tsv-pos format is a five-column format derived from the parseme-tsv format in the following way:
- The fourth column may or may not contain VMWE annotations (in the latter case, the whole column contains underscores '_').
- The fifth column contains the part-of-speech tag for the current token, or an underscore ('_') if no tag is provided. No specific POS tagset is recommended, and the POS tags can take any form.
- No comment lines are admitted.
Examples:
1 | Delegates | _ | _ | _ |
2 | are | _ | 1:LVC | V |
3 | in | _ | 1 | _ |
4 | little | _ | _ | _ |
5 | doubt | _ | 1 | _ |
6 | that | _ | _ | _ |
7 | the | _ | _ | _ |
8 | shadow | _ | 2:ID | _ |
9 | cast | _ | 2 | Vpp |
10 | over | _ | _ | _ |
11 | the | _ | _ | _ |
12 | city | _ | _ | _ |
13 | by | _ | _ | _ |
14 | the | _ | _ | _ |
15 | attacks | _ | _ | _ |
16 | will | _ | _ | V |
17 | enhance | _ | _ | VInf |
18 | the | _ | _ | _ |
19 | chances | _ | _ | _ |
20 | of | _ | _ | _ |
21 | agreement | nsp | _ | _ |
22 | . | _ | _ | _ |

1 | Questioning | _ | _ | Vger |
2 | colonial | _ | _ | _ |
3 | boundaries | _ | _ | _ |
4 | would | _ | _ | V |
5 | open | _ | _ | Vinf |
6 | a | _ | _ | _ |
7 | dangerous | _ | _ | _ |
8 | Pandora | nsp | _ | _ |
9 | ' | nsp | _ | _ |
10 | s | _ | _ | _ |
11 | box | nsp | _ | _ |
12 | . | _ | _ | _ |
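
The five-column layout illustrated above can be read with a short script. The following is a minimal sketch, not an official PARSEME or FLAT tool. It assumes that the columns are tab-separated in the actual file (the vertical bars above are only a display convention), that sentences are separated by blank lines, and that a comment line is one starting with `#`. The names `Token` and `read_sentences`, and the field names, are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One row of a parseme-tsv-pos file (field names are illustrative)."""
    rank: str  # column 1: position of the token in the sentence
    form: str  # column 2: the token itself
    nsp: str   # column 3: 'nsp' or '_', as in the examples above
    mwe: str   # column 4: VMWE annotation such as '1:LVC', or '_'
    pos: str   # column 5: POS tag in any tagset, or '_'


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of Token.

    Assumptions: tab-separated columns, blank lines between sentences,
    and '#' as the marker of a (disallowed) comment line.
    """
    sentence: List[Token] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            raise ValueError(f"line {line_no}: comment lines are not admitted")
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_no}: expected 5 columns, found {len(fields)}"
            )
        sentence.append(Token(*fields))
    # Emit the last sentence when the file does not end with a blank line.
    if sentence:
        yield sentence
```

A file could then be checked with `for sentence in read_sentences(open(path, encoding="utf-8")): ...`, where `path` is whatever corpus file is being prepared for upload.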
Files in this format are useful in the following cases:
- part-of-speech tags are available for the corpora; in this case we recommend keeping only the verbal POS tags (including gerunds and participles), which are then displayed in FLAT above the verbal tokens; this may greatly speed up manual annotation, since head verbs are automatically underlined in the FLAT interface; annotators should, however, be aware of the resulting bias, especially if the POS tags are not gold-standard tags,
- automatic VMWE pre-annotations are available and need manual validation in FLAT,
- some annotators work off-line in Excel-like spreadsheets.
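
The recommendation to keep only the verbal POS tags can be applied automatically before upload. The sketch below is one possible way to do it, under two assumptions: the file is tab-separated, and verbal tags are recognizable by a prefix (here `V`, as in the `V`, `Vpp`, `VInf` and `Vger` tags of the examples). Since no tagset is prescribed, the prefix must be adapted to the tagset in use. The function name `keep_verbal_pos` and its parameter are hypothetical.

```python
from typing import Tuple


def keep_verbal_pos(line: str, verbal_prefixes: Tuple[str, ...] = ("V",)) -> str:
    """Replace a non-verbal POS tag in column 5 with '_'.

    Assumption: a tag is verbal if it starts with one of `verbal_prefixes`.
    Blank lines are returned unchanged.
    """
    if not line.strip():
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, found {len(fields)}")
    # str.startswith accepts a tuple of candidate prefixes.
    if not fields[4].startswith(verbal_prefixes):
        fields[4] = "_"
    return "\t".join(fields) + "\n"
```

For instance, a row tagged `N` in the fifth column would come out with `_` there, while a row tagged `Vpp` would be left as it is.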