parseme-tsv-input format (obsolete)

This page describes the obsolete parseme-tsv-input format, which was used for the input corpora provided to VMWE identification tools in the PARSEME shared task on automatic detection of verbal MWEs. See a sample file for illustration.

The parseme-tsv-input format is a three-column format derived from the parseme-tsv format by keeping only the first three columns of the latter:

The first column contains the rank of the corpus token in the sentence (blank characters are ignored in the token rank count) or a range of ranks in the case of a multiword token (MWT).
The second column contains the corpus token (blank characters are not represented explicitly) or a multiword token. In the latter case all other columns are empty.
The third column contains an underscore ('_') if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise it contains nsp (no space).
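
The column layout above can be sketched in Python as a small line parser. This is only an illustrative sketch, not part of the PARSEME tooling: the names Token and parse_line are hypothetical, and the code assumes that columns are separated by tab characters and that a rank range is written as two integers joined by a hyphen.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One line of a parseme-tsv-input file (hypothetical helper type)."""
    first: int            # rank of the token, or first rank of an MWT range
    last: int             # equals `first` for an ordinary token
    form: str             # the token or multiword token itself
    no_space_after: bool  # True when the third column is "nsp"

    @property
    def is_mwt(self) -> bool:
        # A multiword token spans more than one rank, e.g. "8-9".
        return self.last > self.first


def parse_line(line: str) -> Token:
    """Parse one tab-separated line (assumption: tabs delimit columns)."""
    cols = line.rstrip("\n").split("\t")
    rank, form = cols[0], cols[1]
    # The third column may be missing or empty on MWT lines.
    flag = cols[2] if len(cols) > 2 else ""
    first, _, last = rank.partition("-")
    return Token(int(first), int(last or first), form, flag == "nsp")
```

For instance, parse_line("8-9\tcan't\t_") yields a Token covering ranks 8 to 9 whose is_mwt property is true, while parse_line("12\twalk\tnsp") yields an ordinary token with no_space_after set.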

Examples:

1-2	Don't	_
1	Do	_
2	not	_
3	talk	_
4	the	_
5	talk	_
6	if	_
7	you	_
8-9	can't	_
8	can	_
9	not	_
10	walk	_
11	the	_
12	walk	nsp
13	.	_

1	Questioning	_
2	colonial	_
3	boundaries	_
4	would	_
5	open	_
6	a	_
7	dangerous	_
8	Pandora	nsp
9	'	nsp
10	s	_
11	box	nsp
12	.	_
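
The examples above suggest how the original text can be recovered from the rank and nsp columns. The following Python sketch is one possible way to do so, under stated assumptions: columns are tab-separated, sentences are separated by blank lines (as in the examples), and the spacing after a multiword token is taken from its last component token. The function names read_sentences and detokenize are hypothetical.

```python
def read_sentences(text):
    """Yield sentences as lists of (rank, form, flag) tuples.

    Assumption: blank lines separate sentences, as in the examples.
    """
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        cols += [""] * (3 - len(cols))  # pad missing columns on MWT lines
        sentence.append(tuple(cols[:3]))
    if sentence:
        yield sentence


def detokenize(sentence):
    """Rebuild the surface string of one sentence."""
    pieces = []
    skip_until = 0  # last rank covered by the current multiword token
    for rank, form, flag in sentence:
        if "-" in rank:
            # MWT line: emit the surface form once, then skip its components.
            _, last = map(int, rank.split("-"))
            pieces.append(form)
            skip_until = last
            continue
        if int(rank) <= skip_until:
            # Component of an MWT: only the last one decides the spacing.
            if int(rank) == skip_until and flag != "nsp":
                pieces.append(" ")
            continue
        pieces.append(form)
        if flag != "nsp":
            pieces.append(" ")
    return "".join(pieces).rstrip()


sample = "7\tdangerous\t_\n8\tPandora\tnsp\n9\t'\tnsp\n10\ts\t_\n11\tbox\tnsp\n12\t.\t_\n"
for sent in read_sentences(sample):
    print(detokenize(sent))  # dangerous Pandora's box.
```

Applied to the first example, the same function gives "Don't talk the talk if you can't walk the walk.", because the component tokens Do, not, can and not are skipped in favour of the multiword forms.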