This page describes the parseme-tsv format. The training (annotated) and blind (non-annotated) corpus files provided to participants in the PARSEME shared task on automatic detection of verbal MWEs use this format. See a sample blind corpus file and a sample training corpus file for illustration.

  • A word is a language-specific notion. Most often it coincides with a token.
  • A token can be defined according to either generic or language-specific rules (see annotation guidelines section 1.2).
  • Each token appears on a separate line. Line breaks are indicated by a linefeed character (LF or '\n').
  • Each sentence, including the last one in the file, must be followed by exactly one empty line (which can contain any number of tabulations, but no other character).
  • Before the first token of each sentence, it is possible to add any number of comment lines, whose first character must be a hash ('#'). Comment lines can contain any free text, and are ignored. Comments cannot appear in the middle of a sentence.
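The sentence and comment rules above can be sketched as a small reader. This is only an illustration, not part of the shared task tooling: the function name `split_sentences` is hypothetical, and tolerating a missing final empty line is an assumption made for convenience.

```python
def split_sentences(text):
    """Split parseme-tsv text into sentences, each a list of raw token lines."""
    sentences, current = [], []
    for line in text.split("\n"):
        if line.strip("\t") == "":
            # A line holding nothing but tabulations closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comments are allowed only before the first token of a sentence.
            if current:
                raise ValueError("comment in the middle of a sentence: " + line)
        else:
            current.append(line)
    if current:
        # Assumption: accept a file whose last sentence lacks the final empty line.
        sentences.append(current)
    return sentences
```

Comment lines are dropped because the format says they are ignored; a tool that needs `sent_id`-style metadata would have to keep them instead.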

The parseme-tsv format is a UTF-8 textual four-column format. Columns must be separated by single tabulations, not by blank spaces. Empty fields are marked by underscores ('_') (see note 1):

  • The first column contains the numerical rank of the token in the sentence (blank characters are ignored in the token rank count) or a range of ranks, in the case of a multiword token (MWT).
  • The second column contains the token's surface form (blank characters are not represented explicitly) or a multiword token. In the latter case, all other columns are empty.
  • The third column contains an underscore ('_') if the token is followed by whitespace in the raw text (see segmentation rules). Otherwise it contains nsp (no space).
  • The fourth column contains an underscore ('_') if the token is not part of a verbal multiword expression (VMWE) or if this information is not provided (in the blind corpus file). Otherwise, it contains a list of semicolon-separated VMWE codes. Codes are only assigned to lexicalized tokens of a VMWE (see annotation guidelines, section 1.5). The code of the first token of a VMWE consists of an integer identifier, optionally followed by a colon and a VMWE category label (e.g., 1:ID). VMWE identifiers start from 1 for each new sentence, and increase by 1 for each new VMWE. For a non-VMWE-initial token, the code contains the identifier only. The following VMWE category labels are distinguished (see annotation guidelines, section 3):
    • universal categories (existing in all languages covered by the shared task)
      • LVC - a light verb construction (e.g. to take a decision)
      • ID - a verbal idiom (e.g. to kick the bucket)
    • quasi-universal categories (existing in some languages or language families, but not all)
      • IReflV - an inherently reflexive verb (e.g. (FR) se suicider 'suicide')
      • VPC - a verb particle construction (e.g. to take off)
    • language-specific categories (if any)
    • OTH - a VMWE of a type different from the above
  • Note that, while the manually annotated training corpora will contain category labels for all VMWEs, system outputs do not need to provide category labels, as they will be ignored by the evaluation metrics and script. VMWE codes in system outputs can contain identifiers only (e.g., 1, 2), and no LVC, ID, IReflV, VPC and OTH category labels.
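As a rough illustration of the four columns, the sketch below parses one token line into its fields and VMWE codes. The names `Token` and `parse_token_line` are hypothetical, and the strict check for exactly four tab-separated fields is an assumption of this sketch (it does not tolerate missing underscores).

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    rank: str          # "3" for a plain token, "1-2" for a multiword token
    form: str          # surface form
    no_space: bool     # True when column 3 is "nsp"
    codes: List[Tuple[int, Optional[str]]]  # (identifier, category or None)


def parse_token_line(line):
    """Parse one tab-separated parseme-tsv token line."""
    fields = line.split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated columns: " + repr(line))
    rank, form, space, column4 = fields
    codes = []
    if column4 != "_":
        for code in column4.split(";"):
            # "1:ID" -> ("1", ":", "ID"); a bare "1" -> ("1", "", "")
            ident, _, category = code.partition(":")
            codes.append((int(ident), category or None))
    return Token(rank, form, space == "nsp", codes)
```

Because the category label is optional, the same function reads both training files (with labels on the first token of each VMWE) and system outputs that contain identifiers only.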

Example of an annotated training corpus file:

1	Delegates	_	_
2	are	_	1:LVC
3	in	_	1
4	little	_	_
5	doubt	_	1
6	that	_	_
7	the	_	_
8	shadow	_	2:ID
9	cast	_	2
10	over	_	_
11	the	_	_
12	city	_	_
13	by	_	_
14	the	_	_
15	attacks	_	_
16	will	_	_
17	enhance	_	_
18	the	_	_
19	chances	_	_
20	of	_	_
21	agreement	nsp	_
22	.	_	_

# sent_id = 2
# text = Don't talk the talk...
1-2	Don't	_	_
1	Do	_	1:ID
2	not	_	1
3	talk	_	1
4	the	_	1
5	talk	_	1
6	if	_	1
7	you	_	1
8-9	can't	_	_
8	can	_	1
9	not	_	1
10	walk	_	1
11	the	_	1
12	walk	nsp	1
13	.	_	_

1	Questioning	_	_
2	colonial	_	_
3	boundaries	_	_
4	would	_	_
5	open	_	1:ID
6	a	_	_
7	dangerous	_	_
8	Pandora	nsp	1
9	'	nsp	1
10	s	_	1
11	box	nsp	1
12	.	_	_

If one VMWE is embedded in another one, a new semicolon-separated code is added to column 4. The identifiers of both VMWEs must be distinct. In case of VMWE overlapping, such as in coordination, the same rule applies.
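Because a token can carry several codes, reading column 4 means splitting on semicolons before grouping tokens by identifier. The following is a minimal sketch of that grouping for one sentence; the function name `collect_vmwes` and the returned dictionary layout are illustrative choices, and the input is assumed to be well-formed four-column token lines.

```python
def collect_vmwes(token_lines):
    """Map each VMWE identifier to its category and the ranks of its tokens."""
    vmwes = {}
    for line in token_lines:
        rank, _form, _space, column4 = line.split("\t")
        if column4 == "_":
            # Not part of any VMWE (this also skips multiword-token range lines).
            continue
        for code in column4.split(";"):
            ident, _, category = code.partition(":")
            entry = vmwes.setdefault(int(ident), {"category": None, "ranks": []})
            if category:
                # Only the first token of a VMWE carries the category label.
                entry["category"] = category
            entry["ranks"].append(int(rank))
    return vmwes
```

For a coordination such as "letting us in and out", the shared verb ends up in the rank list of both VMWEs, which is how overlap is represented.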

Example of an annotated training corpus file:

1	Once	_	_
2	again	_	_
3	it	_	_
4	was	_	_
5	a	_	_
6	senior	_	_
7	BBC	_	_
8	person	_	_
9	who	_	_
10	let	_	1:ID;2:VPC
11	the	_	1
12	cat	_	1
13	out	_	1;2
14	of	_	1
15	the	_	1
16	bag	nsp	1
17	.	_	_

1	They	_	_
2	were	_	_
3	letting	_	1:VPC;2:VPC
4	us	_	_
5	in	_	1
6	and	_	_
7	out	_	2
8	for	_	_
9	quite	_	_
10	some	_	_
11	time	nsp	_
12	.	_	_

  • In the blind corpus files (to be given as input to a participant VMWE identification system), the 4th column always contains an underscore. Thus, an underscore in the 4th column of a training corpus file means that the current token is not part of a VMWE. An underscore in the 4th column of a blind corpus file means that the status of the token (as being part of a VMWE or not) is not provided.
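The relation between the two file types can be sketched as a conversion that blanks column 4. This is an illustrative helper only (the name `to_blind` is hypothetical, and it is not the procedure used to produce the official blind files); it assumes token lines already have exactly four tab-separated columns.

```python
def to_blind(text):
    """Return a copy of parseme-tsv text with column 4 reset to '_' on token lines."""
    out = []
    for line in text.split("\n"):
        if line.startswith("#") or line.strip("\t") == "":
            # Comments and sentence separators are copied unchanged.
            out.append(line)
            continue
        fields = line.split("\t")
        if len(fields) == 4:
            fields[3] = "_"
            out.append("\t".join(fields))
        else:
            # Assumption: lines that do not have four columns are left untouched.
            out.append(line)
    return "\n".join(out)
```

After this conversion an underscore in column 4 no longer says anything about VMWE membership, which is the distinction the paragraph above draws.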

Example of a blind corpus file:

1	Delegates	_	_
2	are	_	_
3	in	_	_
4	little	_	_
5	doubt	_	_
6	that	_	_
7	the	_	_
8	shadow	_	_
9	cast	_	_
10	over	_	_
11	the	_	_
12	city	_	_
13	by	_	_
14	the	_	_
15	attacks	_	_
16	will	_	_
17	enhance	_	_
18	the	_	_
19	chances	_	_
20	of	_	_
21	agreement	nsp	_
22	.	_	_

# sent_id = 2
# text = Don't talk the talk...
1-2	Don't	_	_
1	Do	_	_
2	not	_	_
3	talk	_	_
4	the	_	_
5	talk	_	_
6	if	_	_
7	you	_	_
8-9	can't	_	_
8	can	_	_
9	not	_	_
10	walk	_	_
11	the	_	_
12	walk	nsp	_
13	.	_	_

1	Questioning	_	_
2	colonial	_	_
3	boundaries	_	_
4	would	_	_
5	open	_	_
6	a	_	_
7	dangerous	_	_
8	Pandora	nsp	_
9	'	nsp	_
10	s	_	_
11	box	nsp	_
12	.	_	_

Deprecated versions of the format

While developing the annotation guidelines and methodology for the PARSEME shared task, other variants of the format were used.

The final parseme-tsv format is slightly redefined with respect to these previous formats in that:

  • we merge the VMWE identifier and its type into one label (e.g. 1:ID), and overlapping VMWE components are marked in the same column and separated by a semicolon (e.g. 1:ID;2:VPC),
  • we have no multitoken-word identifiers,
  • we now only have two quasi-universal categories, the IPronV category is renamed to IReflV, and we no longer annotate selected prepositions (see the annotation guidelines),
  • hesitation labels (e.g. LVC/ID or ID/_) no longer apply; the annotation guidelines contain decision trees which should normally allow annotators to discriminate among several candidate categories; if hesitation still persists, it can be expressed by a comment or by a confidence level marker attached to an annotation.

1 While the format requires empty fields to be marked by underscores ('_'), as in the CoNLL-U format, missing underscores are tolerated when a file is used as input to the FLAT annotation platform.