This page describes the obsolete parseme-tsv-input format, used for the input corpora provided to VMWE (verbal multiword expression) identification tools in the PARSEME shared task on automatic detection of verbal MWEs. See a sample file for illustration.
The parseme-tsv-input format is a three-column format derived from the parseme-tsv format by keeping only the first three columns of the latter:
- The first column contains the rank of the corpus token in the sentence (blank characters are ignored in the token rank count) or, for a multiword token (MWT), a range of ranks.
- The second column contains the corpus token (blank characters are not represented explicitly) or a multiword token. In the latter case all other columns are empty.
- The third column contains an underscore ('_') if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise it contains nsp (no space).
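The three columns described above could be read with a small parser such as the one below. It is only a sketch: it assumes the columns are tab-separated (as the "tsv" in the format name suggests) and that an MWT range is written as `first-last`, as in the examples on this page. The names `Row` and `parse_row` are hypothetical and do not come from any PARSEME tool.

```python
from dataclasses import dataclass


@dataclass
class Row:
    first: int      # rank of the token, or first rank of an MWT range
    last: int       # equal to `first` for an ordinary token
    form: str       # the token (or multiword token) as it appears in the corpus
    no_space: bool  # True when the third column is "nsp"

    @property
    def is_mwt(self) -> bool:
        # A multiword token covers more than one rank, e.g. "8-9".
        return self.last > self.first


def parse_row(line: str) -> Row:
    """Parse one line of a parseme-tsv-input file (assumed tab-separated)."""
    cols = line.rstrip("\n").split("\t")
    rank, form = cols[0], cols[1]
    # Assumption: a missing third column is treated like "_".
    flag = cols[2] if len(cols) > 2 else "_"
    if "-" in rank:
        first, last = (int(r) for r in rank.split("-"))
    else:
        first = last = int(rank)
    return Row(first, last, form, flag == "nsp")
```

For instance, `parse_row("8-9\tcan't\t_")` yields a row with `first == 8`, `last == 9` and `is_mwt == True`.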
Examples:
| 1-2 | Don't | _ |
| 1 | Do | _ |
| 2 | not | _ |
| 3 | talk | _ |
| 4 | the | _ |
| 5 | talk | _ |
| 6 | if | _ |
| 7 | you | _ |
| 8-9 | can't | _ |
| 8 | can | _ |
| 9 | not | _ |
| 10 | walk | _ |
| 11 | the | _ |
| 12 | walk | nsp |
| 13 | . | _ |
| 1 | Questioning | _ |
| 2 | colonial | _ |
| 3 | boundaries | _ |
| 4 | would | _ |
| 5 | open | _ |
| 6 | a | _ |
| 7 | dangerous | _ |
| 8 | Pandora | nsp |
| 9 | ' | nsp |
| 10 | s | _ |
| 11 | box | nsp |
| 12 | . | _ |