This page describes the obsolete parseme-tsv-input format, used for the input corpora provided to VMWE identification tools in the PARSEME shared task on automatic detection of verbal MWEs. See a sample file for illustration.
The parseme-tsv-input format is a three-column format derived from the parseme-tsv format by keeping only its first three columns:
- The first column contains the rank of the corpus token in the sentence (blank characters are ignored in the token rank count) or a range of ranks in the case of a multiword token (MWT).
- The second column contains the corpus token (blank characters are not represented explicitly) or a multiword token. In the latter case, all other columns are empty.
- The third column contains an underscore ('_') if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise, it contains nsp (no space).
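As a rough illustration of the three columns described above, the following Python sketch turns one line of a file into a small record. It assumes the columns are tab-separated, which the name of the format suggests but this page does not state. `Row` and `parse_row` are hypothetical names introduced only for this example. For a multiword token the space flag is not interpreted, since the description says the remaining columns carry no information in that case.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One line of a parseme-tsv-input file (hypothetical helper type)."""
    first: int                 # rank of the token, or first rank of an MWT range
    last: int                  # same as `first` for ordinary tokens
    form: str                  # the token or multiword token itself
    no_space: Optional[bool]   # True for "nsp", False for "_", None for an MWT


def parse_row(line: str) -> Row:
    """Parse one line, assuming tab-separated columns (an assumption)."""
    cols = line.rstrip("\n").split("\t")
    rank, form = cols[0], cols[1]
    flag = cols[2] if len(cols) > 2 else ""
    if "-" in rank:
        # A range such as "1-2" marks a multiword token covering ranks 1 and 2.
        first, last = (int(r) for r in rank.split("-"))
        return Row(first, last, form, None)
    n = int(rank)
    return Row(n, n, form, flag == "nsp")
```

For instance, `parse_row("12\twalk\tnsp")` yields a row with `no_space` set to `True`, while a range such as `1-2` yields a row spanning ranks 1 to 2.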
Examples:
1-2 | Don't | _ |
1 | Do | _ |
2 | not | _ |
3 | talk | _ |
4 | the | _ |
5 | talk | _ |
6 | if | _ |
7 | you | _ |
8-9 | can't | _ |
8 | can | _ |
9 | not | _ |
10 | walk | _ |
11 | the | _ |
12 | walk | nsp |
13 | . | _ |
1 | Questioning | _ |
2 | colonial | _ |
3 | boundaries | _ |
4 | would | _ |
5 | open | _ |
6 | a | _ |
7 | dangerous | _ |
8 | Pandora | nsp |
9 | ' | nsp |
10 | s | _ |
11 | box | nsp |
12 | . | _ |
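
The examples above can be turned back into running text by following the third column. The Python sketch below is one way to do this, not a reference implementation. `detokenize` is a hypothetical name, and the input is assumed to be already split into (rank, token, flag) triples for a single sentence. One further assumption: for a multiword token, the surface form is emitted once, its component tokens are skipped, and the space flag of the last component decides whether a space follows.

```python
def detokenize(rows):
    """Rebuild the surface text of one sentence from (rank, form, flag) triples."""
    pieces = []
    skip_until = 0  # highest rank covered by the most recent multiword token
    for rank, form, flag in rows:
        if "-" in rank:
            # Multiword token: emit its surface form and remember its range.
            _, end = (int(r) for r in rank.split("-"))
            skip_until = end
            pieces.append(form)
            continue
        r = int(rank)
        if r <= skip_until:
            # Component of a multiword token: the form was already emitted.
            # Assumption: the last component's flag governs the following space.
            if r == skip_until and flag != "nsp":
                pieces.append(" ")
            continue
        pieces.append(form)
        if flag != "nsp":
            pieces.append(" ")
    return "".join(pieces).strip()


second_example = [
    ("1", "Questioning", "_"), ("2", "colonial", "_"), ("3", "boundaries", "_"),
    ("4", "would", "_"), ("5", "open", "_"), ("6", "a", "_"),
    ("7", "dangerous", "_"), ("8", "Pandora", "nsp"), ("9", "'", "nsp"),
    ("10", "s", "_"), ("11", "box", "nsp"), ("12", ".", "_"),
]
print(detokenize(second_example))
# Questioning colonial boundaries would open a dangerous Pandora's box.
```

Applied to the first example, the same function gives "Don't talk the talk if you can't walk the walk.", because ranks 1, 2, 8 and 9 are covered by the multiword tokens "Don't" and "can't".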