
This page describes the format used in phase 2 of the pilot corpus annotation in the PARSEME shared task on automatic detection of verbal MWEs. See a sample annotated file for illustration.

This format differs slightly from the one used in phase 1: we now have only two universal categories and three quasi-universal categories, and we still annotate lexicalized prepositions but not selected ones (see the annotation guidelines).

Language teams can propose language-dependent format specificities provided that they are compatible with the generic format.

In phase 2, annotators again work on small corpora (200 sentences at a time). The annotation environment proposed for this phase is the same as for phase 1, i.e. a customized Google Docs spreadsheet, which allows basic format control and will make it easy to compare parallel annotations later. Each annotator will have her/his own spreadsheet made available on-line. We recommend working with Google spreadsheets under the Google Chrome browser, which notably allows an off-line work mode.

As in phase 1, the corpus to be annotated (plain text in UTF-8 encoding) is first converted automatically into a one-token-per-line format.

You can use the script available on-line (to appear soon) or ask technical support for help with the conversion.
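For illustration only, the conversion could be sketched as follows in Python. This is not the official script: the function name `to_one_token_per_line`, the regular-expression tokenizer, and the reading of `nsp` as "no space follows this token" (inferred from the examples on this page) are all assumptions, and multi-word tokens such as "Don't" are not handled.

```python
import re


def to_one_token_per_line(sentence):
    """Return (rank, token, nsp) rows for one sentence.

    Assumptions of this sketch: a token is either a run of word characters
    or a single punctuation mark, and a token is flagged "nsp" when the
    character right after it in the text is not whitespace.
    """
    rows = []
    matches = re.finditer(r"\w+|[^\w\s]", sentence)
    for rank, match in enumerate(matches, start=1):
        end = match.end()
        no_space_after = end < len(sentence) and not sentence[end].isspace()
        rows.append((rank, match.group(), "nsp" if no_space_after else ""))
    return rows


def format_rows(rows):
    """Render rows as tab-separated lines, one token per line."""
    return "\n".join(f"{rank}\t{token}\t{nsp}" for rank, token, nsp in rows)


print(format_rows(to_one_token_per_line("They had a walk.")))
```

Here "walk" receives the `nsp` flag because the full stop follows it directly, while the final full stop, being the last token, receives none.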

Sentence segmentation errors should be corrected manually by each language team prior to MWE annotation. Possible tokenization errors should not be corrected manually, so as to enable an easy comparison of parallel annotations.

The manual annotation consists in marking the occurrences of verbal MWEs (and of their nominalizations) according to the six-column format (the 7th column of the phase 1 format has been dropped):

The contents of the first three columns should stem from the tokenizer and should normally not be edited during manual annotation.

Examples:

1 Delegates        
2 are     1 LVC
3 in     1  
4 little        
5 doubt     1  
6 that        
7 the        
8 shadow     2 ID
9 cast     2  
10 over        
11 the        
12 city        
13 by        
14 the        
15 attacks        
16 will        
17 enhance        
18 the        
19 chances        
20 of        
21 agreement nsp      
22 .        
           
1-2 Don't        
1 Do           1  OTH
2 not      1  
3 talk      1  
4 the      1  
5 talk      1  
6 if      1  
7 you      1  
8-9 can't        
8 can      1  
9 not      1  
10 walk      1  
11 the      1  
12 walk nsp    1  
13 .        
           
1 Questioning        
2 colonial        
3 boundaries        
4 would        
5 open     1 ID
6 a        
7 dangerous        
8 Pandora nsp   A 1  
9 ' nsp A 1  
10 s   A 1  
11 box nsp   1  
12 .        

Examples:

1 Once              
2 again              
3 it              
4 was              
5 a              
6 senior              
7 BBC              
8 person              
9 who              
10 let     1 ID   2 VPC
11 the     1        
12 cat     1        
13 out     1     2  
14 of     1        
15 the     1        
16 bag nsp    1             
17 .              
                 
1 They              
2 were              
3 letting     1 VPC    2 VPC
4 us              
5 in     1        
6 and              
7 out           2  
8 for              
9 quite              
10 some              
11 time nsp            
12 .              
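As a rough sketch of how such rows might be read back, the Python snippet below groups tokens by MWE number, allowing one token to belong to several MWEs as in the examples above. The row layout `(rank, token, [(mwe_id, category), ...])` and the name `collect_mwes` are hypothetical; the snippet does not parse the spreadsheet itself.

```python
def collect_mwes(rows):
    """Group tokens by MWE number.

    rows: list of (rank, token, memberships), where memberships is a list of
    (mwe_id, category) pairs and category is "" except on the token that
    carries the label. This row layout is an assumption made for the sketch.
    """
    mwes = {}
    for rank, token, memberships in rows:
        for mwe_id, category in memberships:
            entry = mwes.setdefault(mwe_id, {"category": None, "tokens": []})
            entry["tokens"].append(token)
            # Keep the first category label seen for this MWE number.
            if category and entry["category"] is None:
                entry["category"] = category
    return mwes


# Start of the second example sentence above: "They were letting us in and out"
rows = [
    (1, "They", []),
    (2, "were", []),
    (3, "letting", [(1, "VPC"), (2, "VPC")]),
    (4, "us", []),
    (5, "in", [(1, "")]),
    (6, "and", []),
    (7, "out", [(2, "")]),
]
result = collect_mwes(rows)
for mwe_id, entry in result.items():
    print(mwe_id, entry["category"], " ".join(entry["tokens"]))
```

With these rows, the shared verb "letting" ends up in both MWE 1 ("letting in") and MWE 2 ("letting out").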

Examples:

1 They        
2 had      1  LVC/_
3 a        
4 walk      1  
5 shortly        
6 before        
7 the        
8 meeting nsp      
9 .        
           
1-2 It's        
1 It        
2 is        
3 her        
4 heart      1 LVC/ID
5 not        
6 mine        
7 that        
8 he        
9 broke nsp      1  
10 .        
           
1 Take     1  LVC/ID
2 care nsp   1  
3 ,        
4-5 don't        
4 do        
5 not        
6 go     2 ID/_
7 too             2  
8 far     2  
9 with        
10 this        
11 claim nsp       
12 .        
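The slash-separated labels in these examples can be split mechanically. Below is a minimal Python sketch, assuming that `/` separates alternative categories and that `_` stands for "possibly not an MWE" (an interpretation inferred from the examples, not stated on this page); `parse_category` is a hypothetical helper name.

```python
def parse_category(label):
    """Split a category label such as "LVC/ID" into its alternatives.

    Assumption of this sketch: "/" separates alternative categories and
    "_" marks the option that the expression is not an MWE at all.
    """
    alternatives = [part.strip() for part in label.split("/") if part.strip()]
    return {
        "alternatives": alternatives,
        "uncertain": len(alternatives) > 1,
        "maybe_not_mwe": "_" in alternatives,
    }


for label in ("ID", "LVC/ID", "LVC/_"):
    print(label, parse_category(label))
```

Keeping the alternatives as a list, rather than choosing one, preserves the annotator's hesitation for later comparison of parallel annotations.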

Language-dependent format specificities

1 Vale _   1 ID
2 la _   1    
3 pena     1