This page describes the obsolete parseme-tsv-input format, which was used for the input corpora provided to VMWE identification tools in the PARSEME shared task on automatic detection of verbal MWEs. See a sample file for illustration.

The parseme-tsv-input format is a three-column format derived from the parseme-tsv format by keeping only the first three columns of the latter:

  • The first column contains the rank of the corpus token in the sentence (blank characters are ignored when counting token ranks) or, for a multiword token (MWT), the range of ranks it covers.
  • The second column contains the corpus token (blank characters are not represented explicitly) or a multiword token. In the latter case all other columns are empty.
  • The third column contains an underscore ('_') if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise it contains nsp (no space).
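
The column rules above can be sketched as a small row parser. This is only an illustration, not part of the PARSEME tooling: the names Row and parse_row are hypothetical, columns are assumed to be separated by tabs or other whitespace, and a missing third column on an ordinary token is assumed to mean '_'.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One row of a parseme-tsv-input file (hypothetical helper type)."""
    first: int                 # first (or only) rank covered by the row
    last: int                  # last rank; equals `first` for an ordinary token
    form: str                  # the token or multiword token
    no_space: Optional[bool]   # True for nsp, False for '_', None for MWT rows

    @property
    def is_mwt(self) -> bool:
        # A multiword-token row covers a range of ranks such as 8-9.
        return self.last > self.first


def parse_row(line: str) -> Row:
    # Assumption: columns are separated by tabs or other whitespace.
    cols = line.split()
    if len(cols) < 2:
        raise ValueError(f"expected at least two columns: {line!r}")
    rank, form = cols[0], cols[1]
    if "-" in rank:
        # MWT row: the first column is a range of ranks, other columns are empty.
        start, end = rank.split("-", 1)
        return Row(int(start), int(end), form, None)
    # Assumption: a missing third column is treated like '_'.
    flag = cols[2] if len(cols) > 2 else "_"
    if flag not in ("_", "nsp"):
        raise ValueError(f"unexpected third column {flag!r} in {line!r}")
    return Row(int(rank), int(rank), form, flag == "nsp")
```

For example, parse_row("12\twalk\tnsp") yields an ordinary token with no_space set to True, while parse_row("8-9\tcan't") yields a row whose is_mwt property is True.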


1-2 Don't
1 Do
2 not
3 talk
4 the
5 talk
6 if
7 you
8-9 can't
8 can
9 not
10 walk
11 the
12 walk nsp
13 .
1 Questioning
2 colonial
3 boundaries
4 would
5 open
6 a
7 dangerous
8 Pandora nsp
9 ' nsp
10 s
11 box nsp
12 .
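
To illustrate how the third column relates the rows back to the original text, here is a minimal sketch that rebuilds the surface string of one sentence. The function name detokenize is hypothetical, and the sketch rests on several assumptions: columns are whitespace-separated, a missing third column means '_', every blank is rendered as a single space, and the spacing after a multiword token is taken from its last component.

```python
def detokenize(lines):
    """Rebuild the surface text of one sentence from parseme-tsv-input rows."""
    pieces = []
    mwt_end = 0  # last rank covered by the most recent multiword token
    for line in lines:
        cols = line.split()
        if len(cols) < 2:
            continue  # skip blank or incomplete lines
        rank, form = cols[0], cols[1]
        if "-" in rank:
            # MWT row: emit its surface form and remember the covered range.
            mwt_end = int(rank.split("-", 1)[1])
            pieces.append(form)
            continue
        # Assumption: a missing third column is treated like '_'.
        flag = cols[2] if len(cols) > 2 else "_"
        r = int(rank)
        if r > mwt_end:
            # Ordinary token outside any MWT range: emit it as is.
            pieces.append(form)
        if r >= mwt_end and flag != "nsp":
            # Only tokens at or past the end of an MWT decide the spacing.
            pieces.append(" ")
    return "".join(pieces).rstrip()


first = ["1-2 Don't", "1 Do", "2 not", "3 talk", "4 the", "5 talk", "6 if",
         "7 you", "8-9 can't", "8 can", "9 not", "10 walk", "11 the",
         "12 walk nsp", "13 ."]
print(detokenize(first))
```

Under these assumptions the first sentence of the listing comes back as "Don't talk the talk if you can't walk the walk.", and the nsp marks on Pandora, ' and box in the second sentence yield "Pandora's box." with no intervening spaces.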