This page describes the format of the pilot corpus annotation in the PARSEME shared task on automatic detection of verbal MWEs. See a sample annotated file for illustration.
Language teams can propose language-dependent format specificities provided that they are compatible with the generic format.
In order to facilitate a swift start of the pilot annotation, annotators will work on small corpora (200 sentences at a time). The annotation environment proposed for this phase is a customized Google Docs spreadsheet, which allows basic format control and makes it easy to compare parallel annotations later. Each annotator will have her/his own spreadsheet made available on-line. Working off-line with Google spreadsheets is also possible with Google Chrome.
The corpus to be annotated in the pilot phase (plain text, UTF-8 encoded) is first converted automatically into a one-token-per-line format:
- A word is a language-specific notion. Most often it coincides with a token.
- A token can be defined according to generic or language-specific rules (see segmentation rules).
- Each token, except blank characters, appears on a separate line.
- Each sentence is separated from the following sentence by an empty line.
You can use the script available on-line (to appear soon) or ask the technical support for help with the conversion.
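As an illustration only, the one-token-per-line layout can be sketched as follows. The regular expression is a hypothetical stand-in for the segmentation rules, and the function name is our own; this is not the official conversion script.

```python
import re

# Hypothetical stand-in for the segmentation rules: a token is either a run
# of word characters or a single character that is neither word nor blank.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_one_token_per_line(sentences):
    """Render already segmented sentences with one token per line.

    Blank characters produce no line of their own, and consecutive
    sentences are separated by an empty line.
    """
    blocks = []
    for sentence in sentences:
        tokens = TOKEN_RE.findall(sentence)
        blocks.append("\n".join(tokens))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_one_token_per_line(["He gave up.", "She took a walk."]), end="")
```

The sketch assumes that sentence segmentation has already been done, since sentence boundaries are corrected manually before annotation.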
Sentence segmentation errors should be corrected manually by each language team prior to MWE annotation. Possible tokenization errors should not be corrected manually, so as to enable an easy comparison of parallel annotations.
The manual annotation consists in marking the occurrences of verbal MWEs (and of their nominalizations) according to the following seven-column format:
- The first column contains the rank of the corpus token in the sentence (blank characters are ignored in the token rank count) or a range of ranks in the case of a multi-word token (MWT). Note that MWTs can only stem from a language-specific tokenizer. The generic tokenizer contains no rules to detect MWTs.
- The second column contains the corpus token (blank characters are not represented explicitly) or a multi-word token. In the latter case all other columns are empty.
- The third column is empty if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise it contains nsp (no space).
- The fourth column is empty if the token does not belong to any verbal MWE or if it coincides with a word. Otherwise it contains a multi-token word (MTW) identifier. MTW identifiers start from A for each new sentence and pass to the next ASCII character for each new MTW. They are to be distinguished from MWE identifiers, which appear in column 5. MTWs are to be annotated at least when they belong to MWEs. It remains to be decided if they should also be annotated outside MWEs.
- The fifth column is empty if the token is not a part of a MWE. Otherwise, it contains the MWE identifier. Only lexicalized tokens of a MWE are to be assigned identifiers (cf. annotation guidelines). Identifiers start from 1 for each new sentence and increase by 1 for each new MWE.
- The sixth column is empty except for initial tokens of MWEs, which are marked with MWE types. The following MWE types are distinguished (cf. the annotation guidelines):
- LVC - a light verb construction
- VPC - a verb particle construction
- ID - a verbal idiom
- SENT - a fully lexicalized sentential MWE
- OTH - a verbal MWE of a type different from the above
- The seventh column is empty except for prepositions that are selected by VMWEs (see the annotation guidelines) but are not inherent parts of them (e.g. because the prepositional group can be replaced by a pronominal complement, as in "the solution rests in the hands of John" vs. "the solution rests in his hands"). Selected prepositions are marked with SEL.
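The identifier schemes of columns 4 and 5 restart with every sentence and can be generated mechanically. A minimal sketch, with function names of our own choosing:

```python
def mtw_identifiers(n):
    """Return the first n MTW identifiers for one sentence (column 4).

    Identifiers start at 'A' and step through consecutive ASCII characters.
    """
    return [chr(ord("A") + offset) for offset in range(n)]


def mwe_identifiers(n):
    """Return the first n MWE identifiers for one sentence (column 5).

    Identifiers start at 1 and increase by 1; they are returned as strings
    because they are written into spreadsheet cells.
    """
    return [str(number) for number in range(1, n + 1)]


if __name__ == "__main__":
    print(mtw_identifiers(3))  # ['A', 'B', 'C']
    print(mwe_identifiers(3))  # ['1', '2', '3']
```

Because both sequences restart for each sentence, an identifier is only meaningful together with the sentence it occurs in.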
The contents of the first three columns should stem from the tokenizer and should normally not be edited during manual annotation.
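To make the first three columns concrete, the sketch below derives them from a raw sentence. It rests on several assumptions: the regular expression is a hypothetical stand-in for the segmentation rules, the end of the sentence string is treated as if a blank character followed it, and multi-word tokens are not handled.

```python
import re

# Hypothetical tokenizer, not the official segmentation rules.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def first_three_columns(sentence):
    """Return (rank, token, nsp_flag) triples for one sentence.

    The rank counts tokens from 1 and ignores blank characters. The flag is
    "nsp" when the token is directly followed by a non-blank character in
    the original text, and empty otherwise.
    """
    rows = []
    for rank, match in enumerate(TOKEN_RE.finditer(sentence), start=1):
        end = match.end()
        # Assumption: the end of the string counts as a following blank.
        followed_by_blank = end == len(sentence) or sentence[end].isspace()
        rows.append((rank, match.group(), "" if followed_by_blank else "nsp"))
    return rows


if __name__ == "__main__":
    for row in first_three_columns("He gave up."):
        print(row)
```

For "He gave up." only the token "up" receives nsp, because the full stop follows it without an intervening space.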
- If one verbal MWE is embedded in another one, columns 5-7 (identifier, type and selected preposition marker) are repeated for the embedded MWE. The identifiers of the two MWEs must be distinct. In case of MWE coordination or overlapping, similar rules apply.
- In case of hesitation between two MWE types, we mark the initial token with all the potential types separated by a slash, e.g. "LVC/ID". We assume that hesitation should not concern more than two types. In case it does, one should indicate the two most probable interpretations.
- In case of hesitation as to whether a sequence is a MWE (of a particular type) or a compositional sequence, we use the '_' character to mark the latter. E.g. "ID/_" means that we hesitate between an idiom and a compositional phrase.
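The following sketch shows one way a script might read the annotation columns, including the hesitation marks. It assumes that each row is already available as a list of seven strings (how the spreadsheet is exported is not specified here), and it covers only the simple case of one MWE per token, not the repeated columns used for embedded MWEs. All function names are hypothetical.

```python
from collections import defaultdict

MWE_TYPES = {"LVC", "VPC", "ID", "SENT", "OTH"}


def parse_type_cell(cell):
    """Split a column-6 value into its candidate readings.

    "LVC" gives ["LVC"], "LVC/ID" gives ["LVC", "ID"], and "ID/_" gives
    ["ID", "_"], where "_" stands for the compositional reading.
    """
    options = cell.split("/")
    if len(options) > 2:
        raise ValueError(f"at most two alternatives expected: {cell!r}")
    for option in options:
        if option != "_" and option not in MWE_TYPES:
            raise ValueError(f"unknown MWE type: {option!r}")
    return options


def collect_mwes(rows):
    """Group the lexicalized tokens of one sentence by MWE identifier."""
    mwes = defaultdict(lambda: {"types": [], "tokens": []})
    for row in rows:
        _rank, token, _nsp, _mtw_id, mwe_id, mwe_type, _sel = row
        if not mwe_id:
            continue  # the token does not belong to any MWE
        mwes[mwe_id]["tokens"].append(token)
        if mwe_type:  # only the initial token of a MWE carries the type
            mwes[mwe_id]["types"] = parse_type_cell(mwe_type)
    return dict(mwes)


if __name__ == "__main__":
    sentence = [
        ["1", "He", "", "", "", "", ""],
        ["2", "gave", "", "", "1", "VPC", ""],
        ["3", "up", "nsp", "", "1", "", ""],
        ["4", ".", "", "", "", "", ""],
    ]
    print(collect_mwes(sentence))
```

Rejecting cells with more than two alternatives mirrors the rule that only the two most probable interpretations should be indicated.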
Language-dependent format specificities
- As an alternative to nsp, column 3 of the generic format may also contain the underscore ('_') character, in order to account for the initial MWE pre-annotation present in the source corpus. For instance: