This page describes the parseme-tsv format. The training (annotated) and blind (non-annotated) corpus files provided to participants in the PARSEME shared task on automatic detection of verbal MWEs use this format. See a sample blind corpus file and a sample training corpus file for illustration.

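The parseme-tsv format is a UTF-8 plain-text format with four columns. Columns must be separated by single tab characters, not by spaces. Empty fields are marked by underscores ('_'; see note 1). In the examples on this page, the columns hold the token's rank in the sentence, the token itself, an optional nsp flag (no space after the token), and the VMWE codes.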

Example of an annotated training corpus file:

1 Delegates _
2 are 1:LVC
3 in 1
4 little
5 doubt 1
6 that
7 the
8 shadow 2:ID
9 cast 2
10 over
11 the
12 city
13 by
14 the
15 attacks
16 will
17 enhance
18 the
19 chances
20 of
21 agreement nsp
22 .
    
# sent_id = 2
# text = Don't talk the talk...
1-2 Don't
1 Do  1:ID
2 not  1
3 talk  1
4 the  1
5 talk  1
6 if  1
7 you  1
8-9 can't
8 can  1
9 not  1
10 walk  1
11 the  1
12 walk nsp  1
13 .
    
1 Questioning
2 colonial
3 boundaries
4 would
5 open 1:ID
6 a
7 dangerous
8 Pandora nsp 1
9 ' nsp 1
10 s 1
11 box nsp 1
12 .  _
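
The layout shown above can be read with a few lines of Python. The following is a minimal sketch, not an official PARSEME tool. The field names `rank`, `token`, `nsp` and `mwe` are assumptions inferred from the examples, as is the treatment of blank lines as sentence boundaries and of lines starting with `#` as comments. Missing or empty fields are padded with underscores, in the spirit of note 1.

```python
from typing import Dict, List

# Assumed column names, inferred from the examples (not defined by the format itself).
FIELDS = ("rank", "token", "nsp", "mwe")


def parse_line(line: str) -> Dict[str, str]:
    """Split one tab-separated token line into four fields, padding with '_'."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) > len(FIELDS):
        raise ValueError(f"expected at most {len(FIELDS)} columns, got {len(cols)}")
    # Empty fields and missing trailing fields become the underscore placeholder.
    cols = [c if c != "" else "_" for c in cols]
    cols += ["_"] * (len(FIELDS) - len(cols))
    return dict(zip(FIELDS, cols))


def parse_sentences(text: str) -> List[List[Dict[str, str]]]:
    """Group token lines into sentences; blank lines separate sentences."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():            # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):      # comment line such as '# sent_id = 2'
            continue
        else:
            current.append(parse_line(line))
    if current:
        sentences.append(current)
    return sentences
```

Range lines such as `1-2 Don't` are kept as ordinary rows in this sketch; a caller can detect them by the hyphen in the `rank` field.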
 

 

If one VMWE is embedded in another one, an additional code is added to column 4, separated from the existing code by a semicolon. The identifiers of the two VMWEs must be distinct. The same rule applies when VMWEs overlap, as in coordination.

Example of an annotated training corpus file with embedded and overlapping VMWEs:

1 Once _
2 again
3 it
4 was
5 a
6 senior
7 BBC
8 person
9 who
10 let 1:ID;2:VPC
11 the 1
12 cat 1
13 out 1;2
14 of 1
15 the 1
16 bag nsp  1
17 .
    
1 They
2 were
3 letting 1:VPC;2:VPC
4 us
5 in 1
6 and
7 out  2
8 for
9 quite
10 some
11 time nsp
12 .
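
The semicolon-separated codes in column 4 can be unpacked into one token group per VMWE. The sketch below is illustrative only: the function name `group_vmwes` and the shape of its result are hypothetical, and it assumes, as in the examples above, that the category label follows the identifier after a colon on the first token of each VMWE.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_vmwes(rows: List[Tuple[str, str]]) -> Dict[str, dict]:
    """Group token ranks by VMWE identifier.

    rows: (rank, column-4 value) pairs for one sentence.
    """
    vmwes: Dict[str, dict] = defaultdict(lambda: {"category": None, "ranks": []})
    for rank, codes in rows:
        if codes in ("", "_"):          # token belongs to no VMWE
            continue
        for code in codes.split(";"):   # several codes when VMWEs embed or overlap
            ident, _, category = code.partition(":")
            entry = vmwes[ident]
            entry["ranks"].append(rank)
            if category:                # label is present on one token only
                entry["category"] = category
    return dict(vmwes)
```

Applied to the coordination example ("letting us in and out"), the token `letting` appears in both groups, once with `in` and once with `out`.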
 

 

Example of a blind corpus file:

1 Delegates _
2 are _
3 in _
4 little
5 doubt _
6 that
7 the
8 shadow _
9 cast _
10 over
11 the
12 city
13 by
14 the
15 attacks
16 will
17 enhance
18 the
19 chances
20 of
21 agreement nsp
22 .
    
# sent_id = 2
# text = Don't talk the talk...
1-2 Don't
1 Do _
2 not _
3 talk _
4 the _
5 talk _
6 if _
7 you _
8-9 can't
8 can _
9 not _
10 walk _
11 the _
12 walk nsp _
13 .
    
1 Questioning
2 colonial
3 boundaries
4 would
5 open _
6 a
7 dangerous
8 Pandora nsp _
9 ' nsp _
10 s _
11 box nsp _
12 . _
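
Comparing the two examples, a blind file looks like the annotated file with every VMWE code replaced by an underscore. A minimal sketch of that transformation follows. The helper name `to_blind` is hypothetical, and whether the shared task's blind files were actually produced this way is an assumption.

```python
def to_blind(text: str) -> str:
    """Replace the fourth column of every token line with '_'."""
    out = []
    for line in text.splitlines():
        cols = line.split("\t")
        # Blank lines, comment lines and lines without a fourth column stay as they are.
        if line.strip() and not line.startswith("#") and len(cols) >= 4:
            cols[3] = "_"
        out.append("\t".join(cols))
    return "\n".join(out)
```

The function returns the lines joined by newlines without a trailing newline, so a caller writing the result to a file may want to append one.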
 

 

Deprecated versions of the format

While the annotation guidelines and methodology for the PARSEME shared task were being developed, other variants of the format were used:

The final parseme-tsv format differs slightly from these previous formats in that:


1 While the format requires empty fields to be marked by underscores ('_'), as in the CoNLL-U format, missing underscores are tolerated in files given as input to the FLAT annotation platform.