
This page describes the format of the pilot corpus annotation in the PARSEME shared task on automatic detection of verbal MWEs. See a sample annotated file for illustration.

Language teams can propose language-dependent format specificities provided that they are compatible with the generic format.

To facilitate a swift start of the pilot annotation, annotators will work on small corpora (200 sentences at a time). The annotation environment proposed for this phase is a customized Google Docs spreadsheet, which allows basic format control and will make it easy to compare parallel annotations later. Each annotator will have their own spreadsheet, made available on-line. Working off-line with Google spreadsheets is also possible in Google Chrome.

The corpus to be annotated in the pilot phase (plain text, UTF-8 encoded) is first converted automatically into a one-token-per-line format.

You can use the script available on-line (to appear soon) or ask the technical support for help with the conversion.
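To give a rough idea of what such a conversion produces, here is a minimal sketch in Python. It is not the official script: the token pattern `TOKEN_RE` and the function names `to_token_lines` and `convert` are hypothetical, the input is assumed to hold one sentence per line, and the output is assumed to be tab-separated with the token rank, the token, and an `nsp` flag on tokens that are not followed by a space (as in the examples on this page).

```python
import re

# Hypothetical token pattern: a run of word characters, or any single
# non-space symbol. The real conversion script may tokenize differently.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_token_lines(sentence):
    """Return one tab-separated line per token: rank, token, nsp flag."""
    lines = []
    for rank, match in enumerate(TOKEN_RE.finditer(sentence), start=1):
        end = match.end()
        # "nsp" marks a token that is directly followed by another token.
        glued = end < len(sentence) and not sentence[end].isspace()
        lines.append(f"{rank}\t{match.group()}\t{'nsp' if glued else ''}")
    return lines


def convert(text):
    """Convert text with one sentence per line into token lines,
    ending each sentence with an empty line."""
    out = []
    for sentence in text.splitlines():
        sentence = sentence.strip()
        if sentence:
            out.extend(to_token_lines(sentence))
            out.append("")
    return "\n".join(out)


if __name__ == "__main__":
    print(convert("Questioning colonial boundaries would open a dangerous Pandora's box."))
```

This sketch does not expand contractions such as "Don't" into "Do" and "not", which the examples on this page show as a range line (`1-2`) followed by the individual tokens; that step would need language-specific rules.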

Sentence segmentation errors should be corrected manually by each language team prior to MWE annotation. Possible tokenization errors should not be corrected manually, so as to enable an easy comparison of parallel annotations.

The manual annotation consists of marking the occurrences of verbal MWEs (and of their nominalizations) using a seven-column format.

The contents of the first three columns should stem from the tokenizer and should normally not be edited during manual annotation.

Examples:

1      Delegates                  
2 are     1 LVC  
3 in     1    
4 little          
5 doubt     1    
6 that          
7 the          
8 shadow     2 ID  
9 cast     2    
10 over     2    SEL
11 the          
12 city          
13 by          
14 the          
15 attacks          
16 will          
17 enhance          
18 the          
19 chances          
20 of          
21 agreement nsp        
22 .          
             
1-2 Don't          
1 Do           1  SENT  
2 not      1    
3 talk      1    
4 the      1    
5 talk      1    
6 if      1    
7 you      1    
8-9 can't          
8 can      1    
9 not      1    
10 walk      1    
11 the      1    
12 walk nsp    1    
13 .          
             
1 Questioning          
2 colonial          
3 boundaries          
4 would          
5 open     1 ID  
6 a          
7 dangerous          
8 Pandora nsp   A 1    
9 ' nsp A 1    
10 s   A 1    
11 box nsp   1    
12 .          
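The rows in the examples above can be read back into one record per MWE by collecting the tokens that share an MWE identifier. The sketch below is only an illustration under stated assumptions: each row is assumed to be already split into fields, the MWE identifier and category are assumed to sit in the fifth and sixth fields, and the category is assumed to appear only on the first component, as it does in the examples. The name `group_mwes` and the constants `TOKEN`, `MWE_ID` and `MWE_CAT` are hypothetical.

```python
from collections import defaultdict

# Assumed 0-based field positions (an assumption, not taken from the
# guidelines): adjust these to the actual column layout.
TOKEN, MWE_ID, MWE_CAT = 1, 4, 5


def group_mwes(rows):
    """Map each MWE id to its category and component tokens.

    `rows` holds one list of string fields per token of a single sentence.
    """
    mwes = defaultdict(lambda: {"category": None, "tokens": []})
    for fields in rows:
        # Pad short rows so that missing trailing cells read as empty.
        fields = list(fields) + [""] * (MWE_CAT + 1 - len(fields))
        mwe_id = fields[MWE_ID].strip()
        if not mwe_id:
            continue  # token is not part of any annotated MWE
        entry = mwes[mwe_id]
        entry["tokens"].append(fields[TOKEN])
        # Keep the first non-empty category seen for this MWE.
        category = fields[MWE_CAT].strip()
        if category and entry["category"] is None:
            entry["category"] = category
    return dict(mwes)


rows = [
    ["1", "Delegates", "", "", "", ""],
    ["2", "are", "", "", "1", "LVC"],
    ["3", "in", "", "", "1", ""],
    ["4", "little", "", "", "", ""],
    ["5", "doubt", "", "", "1", ""],
]
print(group_mwes(rows))
# prints {'1': {'category': 'LVC', 'tokens': ['are', 'in', 'doubt']}}
```

Grouping by identifier rather than by adjacency is what lets discontinuous expressions such as "are in ... doubt" come out as a single unit.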

Examples:

1      Once                                     
2 again                
3 it                
4 was                
5 a                
6 senior                
7 BBC                
8 person                
9 who                
10 let     1 ID   2 VPC       
11 the     1          
12 cat     1          
13 out     1     2    
14 of     1     2   SEL
15 the     1          
16 bag nsp    1               
17 .                
                   
1 They                
2 were                
3 letting     1 VPC    2 VPC  
4 us                
5 in     1          
6 and                
7 out           2    
8 for                
9 quite                
10 some                
11 time nsp              
12 .                

Examples:

1     They                          
2 had      1  LVC/_  
3 a          
4 walk      1    
5 shortly          
6 before          
7 the          
8 meeting nsp        
9 .          
             
1-2 It's          
1 It          
2 is          
3 her          
4 heart      1 LVC/ID  
5 not          
6 mine          
7 that          
8 he          
9 broke nsp      1    
10 .          
             
1 Take     1  LVC/ID  
2 care nsp   1    
3 ,          
4-5 don't          
4 do              
5 not          
6 go     2 ID/_  
7 too             2    
8 far     2    
9 with     2    SEL/_
10 this          
11 claim nsp         
12 .          

Language-dependent format specificities

1     Vale     _             1        ID                    
2 la _   1    
3 pena     1