The SIGLEX-MWE section and PARSEME are co-organizing the annual Multiword Expressions Workshop on 4 April 2017. It will be co-located with the EACL 2017 conference. It includes a special track dedicated to the PARSEME shared task on automatic identification of MWEs.


PARSEME grants

PARSEME will fund travel and stay for 33 workshop participants from the PARSEME member countries. Applicants should fill in the application form by 15 February 2017. The selection of applicants entitled to reimbursement will be done by the PARSEME Steering Committee. Priority is given to:

  • workshop and shared task organizers, technical experts and language group leaders,
  • shared task language leaders,
  • authors of the best systems in the shared task,
  • presenters of papers/posters,
  • shared task annotators,
  • early-stage researchers,
  • PARSEME members.

The reimbursement rates:

  • Hotel: 120 EUR per night (flat rate). The number of reimbursed nights equals the number of attended workshop days plus 1 (in case the participant arrives before her/his first attended day and leaves after her/his last attended day). An attendance list must be signed on each day of presence at the workshop.
  • Meals: 20 EUR per meal (flat rate).
  • Travel: real costs limited to 1200 EUR (economy class air tickets, train tickets, local transport, etc.).
  • Workshop admission fees are not eligible for reimbursement.
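
The rates above can be combined into a rough upper-bound estimate. The following is a minimal sketch, not an official COST calculator: the function name `estimate_reimbursement` is hypothetical, the number of eligible meals is taken as an input because it is not stated here, and the "plus 1" night is applied unconditionally even though it only holds when the participant arrives early and leaves late.

```python
HOTEL_RATE_EUR = 120   # flat rate per night
MEAL_RATE_EUR = 20     # flat rate per meal
TRAVEL_CAP_EUR = 1200  # ceiling on real travel costs


def estimate_reimbursement(attended_days: int, meals: int, travel_costs_eur: float) -> float:
    """Estimate the maximum reimbursement under the rates listed above.

    Assumption: the extra night is always granted (upper bound), and the
    caller knows how many meals are eligible.
    """
    if attended_days < 1:
        raise ValueError("at least one attended workshop day is required")
    nights = attended_days + 1                      # attended days plus 1
    hotel = nights * HOTEL_RATE_EUR
    food = meals * MEAL_RATE_EUR
    travel = min(travel_costs_eur, TRAVEL_CAP_EUR)  # real costs, capped
    return hotel + food + travel


# One attended day, four meals, 350 EUR of travel costs.
print(estimate_reimbursement(attended_days=1, meals=4, travel_costs_eur=350.0))  # 670.0
```

Workshop admission fees are deliberately left out of the sum, since they are not eligible.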

Detailed reimbursement rules are defined in the COST Vademecum, pp. 19-23, section 4. The applicants selected for funding will receive a formal invitation via the e-COST system (which they should accept before their travel). They should cover their travel and stay in advance and will be reimbursed on return.

Important dates:

  • 22 January, 2017 (extended from 16 January): Submission deadline for the main track long & short papers
  • 5 February: Submission deadline for shared task system description papers
  • 11 February: Notification of acceptance for the main track papers
  • 12 February: Notification of acceptance for the shared task papers
  • 15 February: Deadline for applications for funding
  • 20 February: Camera-ready papers due (main track and shared task)
  • 1 March: Notification to applicants about funding
  • 4 April, 2017: MWE 2017 Workshop

 

Negative Polarity MWEs (NPMWEs) are a theoretically and practically challenging class since their obligatory licensing environments can be abstract grammatical, semantic, and even pragmatic categories; this makes them difficult to identify and classify. Such special lexical units have already been researched within the PARSEME community for Polish, German, and Romanian, and we intend to share and discuss our methodologies in order to:

  • Document and classify NPMWEs for multiple languages.
  • Verify the effectiveness of the tests that we already developed for individual languages and research whether other tests should be used after comparing NPMWEs from different languages.
  • Develop a set of tests that will prove efficient for classifying and identifying NPMWEs across languages.
  • Research the distributional properties of NPMWEs in different languages.
  • Develop a multilingual resource (such as an electronic dictionary) of negative polarity items.

This page describes the obsolete format, called parseme-tsv-input format, of the input corpora provided to VMWE identification tools in the PARSEME shared task on automatic detection of verbal MWEs. See a sample file for illustration.

The parseme-tsv-input format is a three-column format obtained by keeping only the first three columns of the parseme-tsv format:

  • The first column contains the rank of the corpus token in the sentence (blank characters are neglected in the token rank count) or a scope of ranks in case of a multiword token (MWT).
  • The second column contains the corpus token (blank characters are not represented explicitly) or a multiword token. In the latter case all other columns are empty.
  • The third column contains an underscore ('_') if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise it contains nsp (no space).

Examples:

1-2  Don't
1    Do           _
2    not          _
3    talk         _
4    the          _
5    talk         _
6    if           _
7    you          _
8-9  can't
8    can          _
9    not          _
10   walk         _
11   the          _
12   walk         nsp
13   .            _

1    Questioning  _
2    colonial     _
3    boundaries   _
4    would        _
5    open         _
6    a            _
7    dangerous    _
8    Pandora      nsp
9    '            nsp
10   s            _
11   box          nsp
12   .            _
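
As an illustration of how the three columns fit together, here is a minimal reader sketch. It assumes tab-separated fields and blank lines between sentences; the names `Token`, `parse_input` and `detokenize` are hypothetical and not part of any PARSEME tool. Note that `detokenize` rebuilds the text from the component tokens, so a multiword token such as "Don't" would come out as "Do not".

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    rank: str       # "3" for a plain token, "1-2" for a multiword token (MWT)
    form: str
    no_space: bool  # True when the third column holds "nsp"
    is_mwt: bool    # True when the rank is a range such as "1-2"


def parse_input(text: str) -> List[List[Token]]:
    """Split parseme-tsv-input text into sentences, each a list of tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        rank, form = fields[0], fields[1]
        no_space = len(fields) > 2 and fields[2] == "nsp"
        current.append(Token(rank, form, no_space, "-" in rank))
    if current:
        sentences.append(current)
    return sentences


def detokenize(sentence: List[Token]) -> str:
    """Rebuild the surface text from the component tokens and their nsp flags."""
    parts: List[str] = []
    for token in sentence:
        if token.is_mwt:
            continue  # the component tokens that follow carry the text
        parts.append(token.form)
        if not token.no_space:
            parts.append(" ")
    return "".join(parts).rstrip()


# Second example sentence, written with explicit tabs.
ROWS = [
    ("1", "Questioning", "_"), ("2", "colonial", "_"), ("3", "boundaries", "_"),
    ("4", "would", "_"), ("5", "open", "_"), ("6", "a", "_"),
    ("7", "dangerous", "_"), ("8", "Pandora", "nsp"), ("9", "'", "nsp"),
    ("10", "s", "_"), ("11", "box", "nsp"), ("12", ".", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

sentences = parse_input(SAMPLE)
print(detokenize(sentences[0]))
# Questioning colonial boundaries would open a dangerous Pandora's box.
```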

This page describes the format, called parseme-tsv-pos format, of the input corpora to be uploaded to the FLAT annotation platform in the PARSEME shared task on automatic detection of verbal MWEs. See a sample file for illustration.

The parseme-tsv-pos format is a five-column format derived from the parseme-tsv format in the following way:

  • The fourth column may or may not contain VMWE annotations (in the latter case, the whole column contains underscores '_').
  • The fifth column contains the part-of-speech tag for the current token, or an underscore ('_') if no tag is provided. No specific POS tagset is recommended, and the POS tags can take any form.
  • No comment lines are admitted.

Examples:

1    Delegates    _    _      _
2    are          _    1:LVC  V
3    in           _    1      _
4    little       _    _      _
5    doubt        _    1      _
6    that         _    _      _
7    the          _    _      _
8    shadow       _    2:ID   _
9    cast         _    2      Vpp
10   over         _    _      _
11   the          _    _      _
12   city         _    _      _
13   by           _    _      _
14   the          _    _      _
15   attacks      _    _      _
16   will         _    _      V
17   enhance      _    _      VInf
18   the          _    _      _
19   chances      _    _      _
20   of           _    _      _
21   agreement    nsp  _      _
22   .            _    _      _

1    Questioning  _    _      Vger
2    colonial     _    _      _
3    boundaries   _    _      _
4    would        _    _      V
5    open         _    _      Vinf
6    a            _    _      _
7    dangerous    _    _      _
8    Pandora      nsp  _      _
9    '            nsp  _      _
10   s            _    _      _
11   box          nsp  _      _
12   .            _    _      _
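
The fourth and fifth columns can be read back with a few lines of code. The sketch below is illustrative only. It assumes tab-separated rows of exactly five fields and a single VMWE code per token, written as "1:LVC" on the first token of a VMWE and as a bare "1" on the following ones, as in the examples above. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_pos_sentence(text: str) -> List[List[str]]:
    """Split one parseme-tsv-pos sentence into rows of exactly five fields."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 5:  # no comment lines or short rows are admitted
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        rows.append(fields)
    return rows


def extract_vmwes(rows: List[List[str]]) -> Dict[str, dict]:
    """Group tokens by the VMWE identifier found in the fourth column."""
    vmwes: Dict[str, dict] = {}
    for _rank, form, _nsp, mwe, _pos in rows:
        if mwe == "_":
            continue  # token outside any VMWE
        mwe_id, _, category = mwe.partition(":")  # "1:LVC" -> ("1", ":", "LVC")
        entry = vmwes.setdefault(mwe_id, {"category": None, "tokens": []})
        if category:
            entry["category"] = category
        entry["tokens"].append(form)
    return vmwes


def tagged_tokens(rows: List[List[str]]) -> List[Tuple[str, str]]:
    """Return (token, POS tag) pairs for tokens whose fifth column is not '_'."""
    return [(form, pos) for _rank, form, _nsp, _mwe, pos in rows if pos != "_"]


# First nine tokens of the first example sentence, written with explicit tabs.
ROWS = [
    ("1", "Delegates", "_", "_", "_"),
    ("2", "are", "_", "1:LVC", "V"),
    ("3", "in", "_", "1", "_"),
    ("4", "little", "_", "_", "_"),
    ("5", "doubt", "_", "1", "_"),
    ("6", "that", "_", "_", "_"),
    ("7", "the", "_", "_", "_"),
    ("8", "shadow", "_", "2:ID", "_"),
    ("9", "cast", "_", "2", "Vpp"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

rows = parse_pos_sentence(SAMPLE)
for mwe_id, info in extract_vmwes(rows).items():
    print(mwe_id, info["category"], " ".join(info["tokens"]))
print(tagged_tokens(rows))
```

On this excerpt the sketch reports VMWE 1 (LVC) as "are in doubt" and VMWE 2 (ID) as "shadow cast", and lists "are" and "cast" as the only POS-tagged tokens.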

 

Files in this format are useful in the following cases:

  • part-of-speech tags are available for the corpora; in this case we recommend keeping only the verbal POS tags (including gerunds and participles), which are then displayed in FLAT above the verbal tokens; this may greatly speed up manual annotation since head verbs are automatically underlined in the FLAT interface; annotators should, however, be aware of the bias this introduces, especially if the POS tags are not gold-standard tags,
  • automatic VMWE pre-annotations are available and need manual validation in FLAT,
  • some annotators work off-line in Excel-like spreadsheets.

 

This page describes an obsolete version of the format of the platinum v6 corpus annotation, also called the parseme-tsv-split format, used in the PARSEME shared task on automatic detection of verbal MWEs. See a sample annotated file for illustration. This format is still partly supported for the sake of the platinum standard (pilot annotation phase 2 corpora, adjudicated by all annotators of the given language), as well as for manual off-line annotations.

  • A word is a language-specific notion. Most often it coincides with a token.
  • A token can be defined according to either generic or language-specific rules (see the annotation guidelines section 1.1 for a more detailed discussion about tokens vs. words).
  • Each token, except blank characters, appears in a separate line.
  • Each sentence is separated from the following sentence by an empty line (it can contain any number of tabulations, but no other character).

Sentence segmentation errors should be corrected manually by each language team prior to MWE annotation. Possible tokenization errors should not be corrected manually, so as to enable an easy comparison of parallel annotations. The multi-token word (MTW) labels (see below) can be used to handle some tokenization errors.

The parseme-tsv-split format has at least six columns. The precise number of columns depends on the maximum level of VMWE nesting in the given file (see below). Columns must be separated by single tabulations, not by blank spaces. Every row should have the same number of tabulations, even if most fields are empty. Empty fields are truly empty (i.e. they do not contain underscores '_', contrary to the parseme-tsv format; see footnote 1). The first row has to contain column headers (unlike in all other parseme-tsv-* formats): rank, token, nsp, mtw, mwe1, mwecat1, mwe2, mwecat2, ..., com. A second row of headers is also admitted, since files with two header lines are generated from the Google spreadsheets used in the pilot annotation. The lines from 2 on (except the sentence-separating lines, which are empty) have the following contents:

  • The first column contains the rank of the corpus token in the sentence (blank characters are neglected in the token rank count) or a scope of ranks in case of a multi-word token (MWT). Note that MWTs can only stem from a language-specific tokenizer. The generic tokenizer contains no rules to detect MWTs.
  • The second column contains the corpus token (blank characters are not represented explicitly) or a multi-word token. In the latter case all other columns are empty.
  • The third column is empty if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise it contains nsp (no space).
  • The fourth column is empty if the token does not belong to any verbal MWE or if it coincides with a word. Otherwise it contains a multi-token word (MTW) identifier. MTW identifiers start from A for each new sentence and pass to the next ASCII character for each new MTW. They are to be distinguished from MWE identifiers, which appear in column 5. MTWs are to be annotated at least when they belong to MWEs. It remains to be decided if they should also be annotated outside MWEs.
  • The fifth column is empty if the token is not part of an MWE. Otherwise, it contains the MWE identifier. Only lexicalized tokens of an MWE are to be assigned identifiers (cf. annotation guidelines, section 1.4). Identifiers start from 1 for each new sentence and increase by 1 for each new MWE.
  • The sixth column is empty except for initial tokens of MWEs, which are marked with MWE categories. The following MWE categories are distinguished (cf. the annotation guidelines):
    • universal categories (existing in all languages concerned by the shared task)
      • LVC - a light verb construction (e.g. to take a decision)
      • ID - a verbal idiom (e.g. to kick the bucket)
    • quasi-universal categories (existing in some languages or language families but not all)
      • IReflV - an inherently reflexive verb (e.g. (FR) se suicider 'suicide')
      • VPC - a verb particle construction (e.g. to take off)
    • language-specific categories (if any)
    • OTH - a verbal MWE of a type different from the above
  • The columns 7-8, 9-10 etc. (if provided) contain the same data as the fifth and the sixth ones, in case of embedded or overlapping annotations (see below).
  • The last column (if provided) contains a comment.

The contents of the first three columns should stem from the tokenizer and should normally not be edited during manual annotation.
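
The column layout described above can be read with a header-driven loader. The following is a minimal sketch under stated assumptions: `read_split` is a hypothetical name, an optional second header row is detected by checking that its first field is not a rank, and rows with fewer fields than the header are padded with empty strings (a leniency of this sketch, since the format asks for the same number of tabulations in every row).

```python
import re
from typing import Dict, List

RANK = re.compile(r"^\d+(-\d+)?$")  # "7" or a range such as "1-2"


def read_split(text: str) -> List[List[Dict[str, str]]]:
    """Read parseme-tsv-split text into sentences of header-keyed rows."""
    lines = text.splitlines()
    header = lines[0].split("\t")
    body = lines[1:]
    # A second header row is admitted: drop it when its first field is not a rank.
    if body and body[0].strip() and not RANK.match(body[0].split("\t")[0]):
        body = body[1:]
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in body:
        if not line.strip():  # separator line: empty or holding only tabs
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        fields += [""] * (len(header) - len(fields))  # pad short rows (sketch leniency)
        current.append(dict(zip(header, fields)))
    if current:
        sentences.append(current)
    return sentences


HEADER = ("rank", "token", "nsp", "mtw", "mwe1", "mwecat1", "com")
ROWS = [
    ("1", "Questioning", "", "", "", "", ""),
    ("2", "colonial", "", "", "", "", ""),
    ("3", "boundaries", "", "", "", "", ""),
    ("4", "would", "", "", "", "", ""),
    ("5", "open", "", "", "1", "ID", ""),
    ("6", "a", "", "", "", "", ""),
    ("7", "dangerous", "", "", "", "", ""),
    ("8", "Pandora", "nsp", "A", "1", "", "Tokenizer error?"),
    ("9", "'", "nsp", "A", "1", "", ""),
    ("10", "s", "", "A", "1", "", ""),
    ("11", "box", "nsp", "", "1", "", ""),
    ("12", ".", "", "", "", "", ""),
]
SAMPLE = "\n".join("\t".join(row) for row in (HEADER,) + tuple(ROWS))

sentence = read_split(SAMPLE)[0]
mtw_a = "".join(row["token"] for row in sentence if row["mtw"] == "A")
mwe_1 = [row["token"] for row in sentence if row["mwe1"] == "1"]
print(mtw_a)   # Pandora's
print(mwe_1)   # ['open', 'Pandora', "'", 's', 'box']
```

Keying each row by the header names rather than by position keeps the loader usable when extra mweN/mwecatN pairs or the com column are present.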

Examples:

rank  token        nsp  mtw  mwe1  mwecat1  com
1     Delegates
2     are                    1     LVC      Unsure
3     in                     1
4     little
5     doubt                  1
6     that
7     the
8     shadow                 2     ID
9     cast                   2
10    over
11    the
12    city
13    by
14    the
15    attacks
16    will
17    enhance
18    the
19    chances
20    of
21    agreement    nsp
22    .

1-2   Don't
1     Do                     1     ID
2     not                    1
3     talk                   1
4     the                    1
5     talk                   1
6     if                     1
7     you                    1
8-9   can't
8     can                    1
9     not                    1
10    walk                   1
11    the                    1
12    walk         nsp       1
13    .

1     Questioning
2     colonial
3     boundaries
4     would
5     open                   1     ID
6     a
7     dangerous
8     Pandora      nsp  A    1              Tokenizer error?
9     '            nsp  A    1
10    s                 A    1
11    box          nsp       1
12    .
  • If one verbal MWE is embedded in another one, columns 5-6 (identifier and category) are repeated for the embedded MWE. The identifiers of the two MWEs must be distinct. Similar rules apply in case of MWE coordination or overlap.

Examples:

rank  token        nsp  mtw  mwe1  mwecat1  mwe2  mwecat2
1     Once
2     again
3     it
4     was
5     a
6     senior
7     BBC
8     person
9     who
10    let                    1     ID       2     VPC
11    the                    1
12    cat                    1
13    out                    1              2
14    of                     1
15    the                    1
16    bag          nsp       1
17    .

1     They
2     were
3     letting                1     VPC      2     VPC
4     us
5     in                     1
6     and
7     out                                   2
8     for
9     quite
10    some
11    time         nsp
12    .
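
Embedded and overlapping VMWEs can be collected by scanning every mweN/mwecatN column pair. The sketch below is illustrative: `collect_vmwes` and `make_rows` are hypothetical names, and rows are assumed to be dictionaries keyed by the header names rank, token, nsp, mtw, mwe1, mwecat1, mwe2, mwecat2.

```python
import re
from typing import Dict, List, Sequence

HEADER = ["rank", "token", "nsp", "mtw", "mwe1", "mwecat1", "mwe2", "mwecat2"]


def collect_vmwes(sentence: List[Dict[str, str]], header: Sequence[str]) -> Dict[str, dict]:
    """Gather every VMWE of a sentence across all mweN/mwecatN column pairs."""
    id_columns = [name for name in header if re.fullmatch(r"mwe\d+", name)]
    vmwes: Dict[str, dict] = {}
    for row in sentence:
        for id_column in id_columns:
            mwe_id = row.get(id_column, "")
            if not mwe_id:
                continue  # empty field: no VMWE at this nesting level
            category = row.get("mwecat" + id_column[3:], "")
            entry = vmwes.setdefault(mwe_id, {"category": "", "tokens": []})
            entry["tokens"].append(row["token"])
            if category:  # only the initial token carries the category
                entry["category"] = category
    return vmwes


def make_rows(header: Sequence[str], rows) -> List[Dict[str, str]]:
    """Turn positional tuples into header-keyed rows."""
    return [dict(zip(header, row)) for row in rows]


# Tokens 9-17 of the first example sentence above.
SENTENCE = make_rows(HEADER, [
    ("9", "who", "", "", "", "", "", ""),
    ("10", "let", "", "", "1", "ID", "2", "VPC"),
    ("11", "the", "", "", "1", "", "", ""),
    ("12", "cat", "", "", "1", "", "", ""),
    ("13", "out", "", "", "1", "", "2", ""),
    ("14", "of", "", "", "1", "", "", ""),
    ("15", "the", "", "", "1", "", "", ""),
    ("16", "bag", "nsp", "", "1", "", "", ""),
    ("17", ".", "", "", "", "", "", ""),
])

for mwe_id, info in collect_vmwes(SENTENCE, HEADER).items():
    print(mwe_id, info["category"], " ".join(info["tokens"]))
# 1 ID let the cat out of the bag
# 2 VPC let out
```

Because the result is keyed by identifier, the requirement that embedded VMWEs carry distinct identifiers is what keeps the two expressions apart.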

 

Previous versions of the format

The parseme-tsv-split format emerged from the format used in pilot annotation phase 2. The differences are the following:

  • we only have two quasi-universal categories,
  • the IPronV category is renamed to IReflV,
  • we no longer annotate selected prepositions (see the annotation guidelines),
  • hesitation labels (e.g. LVC/ID or ID/_) no longer apply; the annotation guidelines contain decision trees which should normally make it possible to discriminate among several candidate categories; if hesitation persists, it can be expressed by a comment or by a confidence level marker attached to an annotation.

1 Even if empty fields in this format are expected to be truly empty, underscores are tolerated when used on input of the FLAT annotation platform.