This page describes the format of the pilot corpus annotation phase 2 in the PARSEME shared task on automatic detection of verbal MWEs. See a sample annotated file for illustration.

This format is slightly redefined with respect to the format used in phase 1: we now have only two universal categories and three quasi-universal categories, and we still annotate lexicalized prepositions but not selected ones (see the annotation guidelines).

Language teams can propose language-dependent format specificities provided that they are compatible with the generic format.

In phase 2, annotators again work on small corpora (200 sentences at a time). The annotation environment proposed for this phase is the same as for phase 1, i.e. a customized Google Docs spreadsheet, which allows basic format control and easy future comparison of parallel annotations. Each annotator will have her/his own spreadsheet made available on-line. We recommend working with Google spreadsheets in the Google Chrome browser, which notably allows an off-line work mode.

As in phase 1, the corpus to be annotated (in plain text UTF-8 encoding) is initially automatically converted into a one-token-per-line format.

  • A word is a language-specific notion. Most often it coincides with a token.
  • A token can be defined according to generic or language-specific rules (see the annotation guidelines section 1.1 for a more detailed discussion about tokens vs. words).
  • Each token, except blank characters, appears on a separate line.
  • Each sentence is separated from the following sentence by an empty line.

You can use the script available on-line (to appear soon) or ask the technical support for help with the conversion.

Sentence segmentation errors should be corrected manually by each language team prior to MWE annotation. Possible tokenization errors should not be corrected manually, so as to enable an easy comparison of parallel annotations.

The manual annotation consists in marking the occurrences of verbal MWEs (and of their nominalizations) according to the six-column format (the 7th column of the phase 1 format has been dropped):

  • The first column contains the rank of the corpus token in the sentence (blank characters are ignored in the token rank count) or a range of ranks in case of a multi-word token (MWT). Note that MWTs can only stem from a language-specific tokenizer. The generic tokenizer contains no rules to detect MWTs.
  • The second column contains the corpus token (blank characters are not represented explicitly) or a multi-word token. In the latter case all other columns are empty.
  • The third column is empty if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise it contains nsp (no space).
  • The fourth column is empty if the token does not belong to any verbal MWE or if it coincides with a word. Otherwise it contains a multi-token word (MTW) identifier. MTW identifiers start from A for each new sentence and pass to the next ASCII character for each new MTW. They are to be distinguished from MWE identifiers, which appear in column 5. MTWs are to be annotated at least when they belong to MWEs. It remains to be decided if they should also be annotated outside MWEs.
  • The fifth column is empty if the token is not a part of a MWE. Otherwise, it contains the MWE identifier. Only lexicalized tokens of a MWE are to be assigned identifiers (cf. annotation guidelines). Identifiers start from 1 for each new sentence and increase by 1 for each new MWE.
  • The sixth column is empty except for initial tokens of MWEs, which are marked with MWE categories. The following MWE categories are distinguished (cf. the annotation guidelines):
    • universal categories (existing in all languages concerned by the shared task)
      • LVC - a light verb construction (e.g. to take a decision)
      • ID - a verbal idiom (e.g. to kick the bucket)
    • quasi-universal categories (existing in some languages or language families but not all)
      • IPronV - an inherently pronominal verb (e.g. (FR) se suicider 'suicide')
      • IPrepV - an inherently prepositional verb (e.g. to come across sth)
      • VPC - a verb particle construction (e.g. to take off)
    • language-specific categories (if any)
    • OTH - a verbal MWE of a type different from the above, and from any language-specific categories (note that the previous SENT type is now included in OTH)

The contents of the first three columns should stem from the tokenizer and should normally not be edited during manual annotation.
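
As an illustration of how the first three columns might be produced, here is a minimal Python sketch. It is not the conversion script mentioned above: the function name `to_token_lines` and the tokenization rule (a run of word characters, or a single non-blank symbol) are assumptions made for this example only. Like the generic tokenizer, it has no rules for detecting MWTs.

```python
import re

# Assumption: a token is a run of word characters or one non-blank symbol.
# Real generic or language-specific tokenizers may segment differently.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def to_token_lines(sentence):
    """Return (rank, token, nsp) triples for one sentence of plain text."""
    rows = []
    for rank, match in enumerate(TOKEN_RE.finditer(sentence), start=1):
        end = match.end()
        # Column 3 stays empty when a blank character (or the end of the
        # sentence) follows the token; otherwise it holds "nsp".
        followed_by_blank = end >= len(sentence) or sentence[end].isspace()
        rows.append((rank, match.group(), "" if followed_by_blank else "nsp"))
    return rows

if __name__ == "__main__":
    for rank, token, nsp in to_token_lines("Pandora's box."):
        print(f"{rank}\t{token}\t{nsp}")
```

Under these assumptions, blank characters never receive a rank of their own, and the remaining columns are left for manual annotation.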


Rank  Token       Nsp  MTW  MWE  Category
1     Delegates
2     are                   1    LVC
3     in                    1
4     little
5     doubt                 1
6     that
7     the
8     shadow                2    ID
9     cast                  2
10    over
11    the
12    city
13    by
14    the
15    attacks
16    will
17    enhance
18    the
19    chances
20    of
21    agreement   nsp
22    .

1-2   Don't
1     Do                    1    OTH
2     not                   1
3     talk                  1
4     the                   1
5     talk                  1
6     if                    1
7     you                   1
8-9   can't
8     can                   1
9     not                   1
10    walk                  1
11    the                   1
12    walk        nsp       1
13    .

1     Questioning
2     colonial
3     boundaries
4     would
5     open                  1    ID
6     a
7     dangerous
8     Pandora     nsp  A    1
9     '           nsp  A    1
10    s                A    1
11    box         nsp       1
12    .

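
The column rules above can be checked mechanically. The following Python sketch assumes that a sentence has been exported from the spreadsheet as tab-separated lines (an assumption, since the export format is not specified here). The names `COLUMNS`, `parse_sentence` and `check_sentence` are hypothetical. It only handles the basic six columns, not the repeated columns used for embedded MWEs.

```python
COLUMNS = ["rank", "token", "nsp", "mtw", "mwe", "category"]

def parse_sentence(lines):
    """Parse the tab-separated lines of one sentence into dictionaries."""
    rows = []
    for line in lines:
        cells = [c.strip() for c in line.rstrip("\n").split("\t")]
        cells += [""] * (len(COLUMNS) - len(cells))  # pad missing cells
        rows.append(dict(zip(COLUMNS, cells)))
    return rows

def check_sentence(rows):
    """Return a list of human-readable problems found in one sentence."""
    problems = []
    seen = []  # MWE identifiers in order of first appearance
    for row in rows:
        rank = row["rank"]
        annotated = row["nsp"] or row["mtw"] or row["mwe"] or row["category"]
        if "-" in rank and annotated:
            problems.append(f"{rank}: an MWT line must leave columns 3-6 empty")
        if row["mwe"] and row["mwe"] not in seen:
            seen.append(row["mwe"])
            if not row["category"]:
                problems.append(f"{rank}: initial token of an MWE has no category")
        elif row["category"]:
            problems.append(f"{rank}: category on a non-initial token")
    expected = [str(i) for i in range(1, len(seen) + 1)]
    if seen != expected:
        problems.append(f"MWE identifiers {seen} should be {expected}")
    return problems

if __name__ == "__main__":
    sample = ["1\tDelegates", "2\tare\t\t\t1\tLVC", "3\tin\t\t\t1",
              "4\tlittle", "5\tdoubt\t\t\t1"]
    print(check_sentence(parse_sentence(sample)))
```

An empty result means that none of the checked rules was violated. The sketch does not check anything else, for example the allowed category names.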
  • If one verbal MWE is embedded in another one, columns 5-6 (identifier and type) are repeated for the embedded MWE. The identifiers of the two MWEs must be distinct. In case of MWE coordination or overlapping, similar rules apply.


Rank  Token       Nsp  MTW  MWE  Category  MWE  Category
1     Once
2     again
3     it
4     was
5     a
6     senior
7     BBC
8     person
9     who
10    let                   1    ID        2    VPC
11    the                   1
12    cat                   1
13    out                   1              2
14    of                    1
15    the                   1
16    bag         nsp       1
17    .

1     They
2     were
3     letting               1    VPC       2    VPC
4     us
5     in                    1
6     and
7     out                                  2
8     for
9     quite
10    some
11    time        nsp
12    .

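
To show how the repeated columns can be read back, here is a hedged Python sketch that groups the tokens of one sentence by MWE identifier. It assumes that each row is already a list of cells and that additional (identifier, type) pairs simply follow in columns 7-8, 9-10 and so on, as in the examples above. The function name `collect_mwes` is hypothetical.

```python
def collect_mwes(rows):
    """Group the tokens of one sentence by MWE identifier.

    `rows` is a list of cell lists. Columns 5-6 hold the first
    (identifier, type) pair; further pairs are assumed to follow
    in columns 7-8, 9-10, and so on.
    """
    mwes = {}
    for cells in rows:
        token = cells[1]
        # Walk over the (identifier, type) pairs, starting at column 5.
        for i in range(4, len(cells), 2):
            mwe_id = cells[i].strip()
            if not mwe_id:
                continue
            category = cells[i + 1].strip() if i + 1 < len(cells) else ""
            entry = mwes.setdefault(mwe_id, {"category": "", "tokens": []})
            entry["tokens"].append(token)
            if category:  # only the initial token carries the type
                entry["category"] = category
    return mwes

if __name__ == "__main__":
    sentence = [
        ["3", "letting", "", "", "1", "VPC", "2", "VPC"],
        ["4", "us"],
        ["5", "in", "", "", "1"],
        ["6", "and"],
        ["7", "out", "", "", "", "", "2"],
    ]
    for mwe_id, entry in collect_mwes(sentence).items():
        print(mwe_id, entry["category"], " ".join(entry["tokens"]))
```

Because a token may carry several identifiers, the shared verb "letting" ends up in both MWEs of the coordination example.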
  • In case of hesitation between two MWE types, we mark the initial token with both potential types separated by a slash, e.g. "LVC/ID". We assume that hesitation should not concern more than two types. If it does, one should indicate the two most probable interpretations.
  • In case of hesitation as to whether a sequence is an MWE (of a particular type) or a compositional sequence, we use the '_' character to mark the latter. E.g. "ID/_" means that we hesitate between an idiom and a compositional phrase.


Rank  Token       Nsp  MTW  MWE  Category
1     They
2     had                   1    LVC/_
3     a
4     walk                  1
5     shortly
6     before
7     the
8     meeting     nsp
9     .

1-2   It's
1     It
2     is
3     her
4     heart                 1    LVC/ID
5     not
6     mine
7     that
8     he
9     broke       nsp       1
10    .

1     Take                  1    LVC/ID
2     care        nsp       1
3     ,
4-5   don't
4     do
5     not
6     go                    2    ID/_
7     too                   2
8     far                   2
9     with
10    this
11    claim       nsp
12    .
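
The hesitation notation can be decoded with a few lines of Python. The sketch below uses a hypothetical helper `parse_category` that splits a column-6 value into its candidate types and a flag saying whether a compositional reading ('_') is among the alternatives. Rejecting more than two alternatives follows the rule stated above; raising `ValueError` for that case is a choice made for this example.

```python
def parse_category(cell):
    """Split a column-6 value into (candidate types, may_be_compositional)."""
    parts = [p.strip() for p in cell.split("/") if p.strip()]
    if len(parts) > 2:
        # The guidelines allow at most two alternatives per MWE.
        raise ValueError(f"more than two alternatives in {cell!r}")
    may_be_compositional = "_" in parts
    types = [p for p in parts if p != "_"]
    return types, may_be_compositional

if __name__ == "__main__":
    for value in ["VPC", "LVC/ID", "ID/_"]:
        print(value, parse_category(value))
```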

Language-dependent format specificities

  • Spanish
    • Column 3 of the generic format may contain, as an alternative to nsp, the underscore ('_') character, in order to account for the initial MWE pre-annotation present in the source corpus. For instance:
Rank  Token       Nsp  MTW  MWE  Category
1     Vale        _         1    ID
2     la          _         1
3     pena                  1