
This page describes an obsolete version of the format of the platinum v6 corpus annotation, also called the parseme-tsv-split format, used in the PARSEME shared task on automatic detection of verbal MWEs (VMWEs). See a sample annotated file for illustration. This format is still partly supported for the sake of the platinum standard (pilot annotation phase 2 corpora, adjudicated by all annotators of the given language), as well as for manual off-line annotations.

Sentence segmentation errors should be corrected manually by each language team prior to MWE annotation. Possible tokenization errors should not be corrected manually, so as to enable an easy comparison of parallel annotations. The multitoken word (MTW) labels (see below) can be used to handle some tokenization errors. From line 2 onwards, every line (except the sentence-separating lines, which are empty) has the following contents:

The parseme-tsv-split format has at least six columns. The precise number of columns depends on the maximum level of VMWE nesting in the given files (see below). Columns must be separated by single tab characters, not by blank spaces. Every row should contain the same number of tabs, even if most fields are empty. Empty fields are truly empty, i.e. they do not contain underscores '_', contrary to the parseme-tsv format (see note 1). The first row has to contain the column headers (unlike in all other parseme-tsv-* formats): rank, token, nsp, mtw, mwe1, mwecat1, mwe2, mwecat2, ..., com. A second row of headers is also admitted, since files with two header lines are generated from the Google spreadsheets used in the pilot annotation.
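The constraints above (a header row, tab-separated columns, a constant number of fields per row, empty separator lines between sentences) can be checked with a short reader. The following is a minimal sketch, not an official PARSEME tool: the function name `parse_split_tsv` and the `header_rows` parameter are hypothetical, and treating an all-tab line as a sentence separator is an assumption.

```python
def parse_split_tsv(text, header_rows=1):
    """Parse parseme-tsv-split text into (headers, sentences).

    Each sentence is a list of dicts keyed by the column headers.
    Pass header_rows=2 for files exported with two header lines.
    """
    lines = text.splitlines()
    if len(lines) < header_rows:
        raise ValueError("missing header row")
    headers = lines[0].split("\t")
    # Only the four fixed leading columns are checked here.
    if headers[:4] != ["rank", "token", "nsp", "mtw"]:
        raise ValueError(f"unexpected column headers: {headers}")

    sentences, current = [], []
    for lineno, line in enumerate(lines[header_rows:], start=header_rows + 1):
        # Assumption: a line holding nothing but whitespace separates sentences.
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(headers):
            raise ValueError(
                f"line {lineno}: expected {len(headers)} fields, got {len(fields)}"
            )
        current.append(dict(zip(headers, fields)))
    if current:
        sentences.append(current)
    return headers, sentences


sample = "\n".join([
    "rank\ttoken\tnsp\tmtw\tmwe1\tmwecat1\tcom",
    "1\tDelegates\t\t\t\t\t",
    "2\tare\t\t\t1\tLVC\tUnsure",
    "",
    "1\tThey\t\t\t\t\t",
])
headers, sentences = parse_split_tsv(sample)
print(len(sentences), sentences[0][1]["mwecat1"])
```

Splitting on the tab character alone, rather than on arbitrary whitespace, is what keeps truly empty fields in place as empty strings.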

The contents of the first three columns should stem from the tokenizer and should normally not be edited during manual annotation.

Examples:

rank  token        nsp  mtw  mwe1  mwecat1  com
1     Delegates
2     are                    1     LVC      Unsure
3     in                     1
4     little
5     doubt                  1
6     that
7     the
8     shadow                 2     ID
9     cast                   2
10    over
11    the
12    city
13    by
14    the
15    attacks
16    will
17    enhance
18    the
19    chances
20    of
21    agreement    nsp
22    .

1-2   Don't
1     Do                     1     ID
2     not                    1
3     talk                   1
4     the                    1
5     talk                   1
6     if                     1
7     you                    1
8-9   can't
8     can                    1
9     not                    1
10    walk                   1
11    the                    1
12    walk         nsp       1
13    .

1     Questioning
2     colonial
3     boundaries
4     would
5     open                   1     ID
6     a
7     dangerous
8     Pandora      nsp  A    1              Tokenizer error?
9     '            nsp  A    1
10    s                 A    1
11    box          nsp       1
12    .
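The first three columns of these examples can be used to rebuild the surface text of a sentence. The sketch below rests on two assumptions that this page does not state explicitly: that `nsp` marks a token with no space before the next one, and that a rank such as `1-2` introduces a multiword token whose components (ranks 1 and 2) follow on the next rows. The function name `surface_text` is hypothetical.

```python
def surface_text(rows):
    """Rebuild the surface string from (rank, token, nsp) triples.

    Assumptions: 'nsp' means "no space after this token"; a rank like
    '8-9' is a multiword token whose component rows are skipped.
    """
    pieces = []
    skip_until = 0  # highest rank covered by the last range row
    for rank, token, nsp in rows:
        if "-" in rank:
            _, end = rank.split("-")
            skip_until = int(end)
        elif int(rank) <= skip_until:
            continue  # component of a multiword token already emitted
        pieces.append(token)
        if nsp != "nsp":
            pieces.append(" ")
    return "".join(pieces).strip()


rows = [
    ("1-2", "Don't", ""), ("1", "Do", ""), ("2", "not", ""),
    ("3", "talk", ""), ("4", "the", ""), ("5", "talk", ""),
    ("6", "if", ""), ("7", "you", ""),
    ("8-9", "can't", ""), ("8", "can", ""), ("9", "not", ""),
    ("10", "walk", ""), ("11", "the", ""), ("12", "walk", "nsp"),
    ("13", ".", ""),
]
print(surface_text(rows))
```

The sketch reads the spacing of a multiword token from its range row only; whether the component rows can also carry `nsp` is not covered by the examples above.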

Examples:

rank  token    nsp  mtw  mwe1  mwecat1  mwe2  mwecat2  com
1     Once
2     again
3     it
4     was
5     a
6     senior
7     BBC
8     person
9     who
10    let                1     ID       2     VPC
11    the                1
12    cat                1
13    out                1              2
14    of                 1
15    the                1
16    bag      nsp       1
17    .

1     They
2     were
3     letting            1     VPC      2     VPC
4     us
5     in                 1
6     and
7     out                               2
8     for
9     quite
10    some
11    time     nsp
12    .
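Nested or overlapping VMWEs such as those above can be collected by scanning every mweN column and grouping tokens by VMWE identifier. This is a minimal sketch under assumptions drawn from the examples only: the category is taken from the first row where the matching mwecatN field is filled, and identifiers are unique within a sentence. The function name `collect_vmwes` and the dict-based row layout are hypothetical.

```python
import re


def collect_vmwes(headers, sentence):
    """Group the tokens of one sentence by VMWE identifier.

    `sentence` is a list of dicts keyed by column header.
    Returns {id: {"category": str or None, "tokens": [str, ...]}}.
    """
    id_columns = [h for h in headers if re.fullmatch(r"mwe\d+", h)]
    vmwes = {}
    for row in sentence:
        for col in id_columns:
            mwe_id = row.get(col, "")
            if not mwe_id:
                continue  # truly empty field: token is not part of a VMWE here
            entry = vmwes.setdefault(mwe_id, {"category": None, "tokens": []})
            entry["tokens"].append(row["token"])
            # "mwe2" -> "mwecat2": the category column paired with this level
            category = row.get("mwecat" + col[3:], "")
            if category and entry["category"] is None:
                entry["category"] = category
    return vmwes


headers = ["rank", "token", "nsp", "mtw",
           "mwe1", "mwecat1", "mwe2", "mwecat2", "com"]


def make_row(rank, token, mwe1="", mwecat1="", mwe2="", mwecat2=""):
    # Helper for the demo data only; nsp, mtw and com are left empty.
    return dict(zip(headers, [rank, token, "", "", mwe1, mwecat1, mwe2, mwecat2, ""]))


sentence = [
    make_row("10", "let", "1", "ID", "2", "VPC"),
    make_row("11", "the", "1"),
    make_row("12", "cat", "1"),
    make_row("13", "out", "1", "", "2"),
    make_row("14", "of", "1"),
    make_row("15", "the", "1"),
    make_row("16", "bag", "1"),
]
for mwe_id, info in collect_vmwes(headers, sentence).items():
    print(mwe_id, info["category"], " ".join(info["tokens"]))
```

Grouping by identifier rather than by column also handles the earlier, non-nested examples, where two distinct VMWEs share the mwe1 column.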

 

Previous versions of the format

The parseme-tsv-split format emerged from the format used in pilot annotation phase 2. The differences are the following:


1 Although empty fields in this format are expected to be truly empty, underscores are tolerated in files used as input to the FLAT annotation platform.