parseme-tsv-split format (obsolete)

This page describes an obsolete version of the format of the platinum v6 corpus annotation, also called the parseme-tsv-split format, used in the PARSEME shared task on automatic detection of verbal MWEs. See a sample annotated file for illustration. This format is still partly supported for the sake of the platinum standard (pilot annotation phase 2 corpora, adjudicated by all annotators of the given language), as well as for manual off-line annotations.

A word is a language-specific notion. Most often it coincides with a token.
A token can be defined according to either generic or language-specific rules (see the annotation guidelines section 1.1 for a more detailed discussion about tokens vs. words).
Each token, except blank characters, appears on a separate line.
Each sentence is separated from the following sentence by an empty line (it can contain any number of tabulations, but no other character).
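
The sentence-separation rule can be illustrated with a minimal Python sketch. The function name `split_sentences` is hypothetical, and the sketch assumes that any header rows have already been removed from the input.

```python
def split_sentences(text):
    """Group the lines of a parseme-tsv-split body into sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        # A separator line is empty or contains nothing but tabulations.
        if line.strip("\t") == "":
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line)
    if current:  # the last sentence may lack a trailing separator
        sentences.append(current)
    return sentences


sample = "1\tHello\tnsp\n2\t.\n\t\t\n1\tBye\tnsp\n2\t.\n"
print(len(split_sentences(sample)))
```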

Sentence segmentation errors should be corrected manually by each language team prior to MWE annotation. Possible tokenization errors should not be corrected manually, so as to enable an easy comparison of parallel annotations. The multitoken word (MTW) labels (see below) can be used to handle some tokenization errors. The lines from line 2 onward (except the sentence-separating lines, which are empty) have the following contents:

The parseme-tsv-split format has at least six columns. The precise number of columns depends on the maximum level of VMWE nesting in the given files (see below). Columns must be separated by single tabulations, not by blank spaces. Every row should have the same number of tabulations, even if most fields are empty. Empty fields are truly empty (i.e. they don't contain underscores '_', contrary to the parseme-tsv format)¹. The first row has to contain column headers (unlike in all other parseme-tsv-* formats): rank, token, nsp, mtw, mwe1, mwecat1, mwe2, mwecat2, ..., com (a second row of headers is also admitted, since files with 2 header lines are generated from Google spreadsheets used in the pilot annotation).
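
As an illustration of the header layout and the constant-tab-count rule, here is a small Python sketch. The helper names `expected_header` and `rows_with_wrong_tab_count` are hypothetical and not part of any PARSEME tool; the sketch also assumes the comment column is named `com` and is optional.

```python
def expected_header(nesting_depth=1, with_comment=True):
    """Build the header names for a given maximum VMWE nesting depth."""
    cols = ["rank", "token", "nsp", "mtw"]
    for i in range(1, nesting_depth + 1):
        cols += [f"mwe{i}", f"mwecat{i}"]
    if with_comment:
        cols.append("com")
    return cols


def rows_with_wrong_tab_count(lines):
    """Return indices of non-separator lines whose tab count differs
    from that of the first (header) line."""
    n_tabs = lines[0].count("\t")
    return [i for i, line in enumerate(lines)
            if line.strip("\t") and line.count("\t") != n_tabs]


print("\t".join(expected_header(nesting_depth=2)))
```

With `nesting_depth=1` and `with_comment=False` the sketch yields the minimum of six columns mentioned above.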

The first column contains the rank of the corpus token in the sentence (blank characters are ignored in the token rank count) or a range of ranks in case of a multi-word token (MWT). Note that MWTs can only stem from a language-specific tokenizer. The generic tokenizer contains no rules to detect MWTs.
The second column contains the corpus token (blank characters are not represented explicitly) or a multi-word token. In the latter case all other columns are empty.
The third column is empty if the token is followed by a space or another blank character (see segmentation rules) in the original file. Otherwise it contains nsp (no space).
The fourth column is empty if the token does not belong to any verbal MWE or if it coincides with a word. Otherwise it contains a multi-token word (MTW) identifier. MTW identifiers start from A for each new sentence and pass to the next ASCII character for each new MTW. They are to be distinguished from MWE identifiers, which appear in column 5. MTWs are to be annotated at least when they belong to MWEs. It remains to be decided if they should also be annotated outside MWEs.
The fifth column is empty if the token is not part of an MWE. Otherwise, it contains the MWE rank. Only lexicalized tokens of an MWE are to be assigned identifiers (cf. annotation guidelines, section 1.4). Identifiers start from 1 for each new sentence and increase by 1 for each new MWE.
The sixth column is empty except for initial tokens of MWEs, which are marked with MWE categories. The following MWE categories are distinguished (cf. the annotation guidelines):
- universal categories (existing in all languages concerned by the shared task)
  - LVC - a light verb construction (e.g. to take a decision)
  - ID - a verbal idiom (e.g. to kick the bucket)
- quasi-universal categories (existing in some languages or language families but not all)
  - IReflV - an inherently reflexive verb (e.g. (FR) se suicider 'suicide')
  - VPC - a verb particle construction (e.g. to take off)
- language-specific categories (if any)
- OTH - a verbal MWE of a type different from the above
The columns 7-8, 9-10 etc. (if provided) contain the same data as the fifth and the sixth ones, in case of embedded or overlapping annotations (see below).
The last column (if provided) contains a comment.

The contents of the first three columns should stem from the tokenizer and should normally not be edited during manual annotation.
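
The column descriptions above can be summarized by a minimal Python sketch that turns one row into a dictionary. The function `parse_row` and the keys of the returned dictionary are hypothetical names chosen for illustration; the sketch assumes the header names listed earlier and tolerates rows whose trailing empty fields were trimmed.

```python
def parse_row(line, header):
    """Parse one non-empty row, given the list of header names."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (len(header) - len(fields))  # pad trimmed rows
    row = dict(zip(header, fields))

    # Column 1: a single rank ("8") or a range for an MWT ("8-9").
    first, _, last = row["rank"].partition("-")

    # Columns 5-6, 7-8, ...: (MWE identifier, category) pairs.
    mwes = []
    i = 1
    while f"mwe{i}" in row:
        if row[f"mwe{i}"]:
            mwes.append((int(row[f"mwe{i}"]), row.get(f"mwecat{i}") or None))
        i += 1

    return {
        "ranks": (int(first), int(last or first)),
        "is_mwt": bool(last),
        "token": row["token"],
        "no_space_after": row["nsp"] == "nsp",
        "mtw": row["mtw"] or None,
        "mwes": mwes,
        "comment": row.get("com", ""),
    }


header = ["rank", "token", "nsp", "mtw", "mwe1", "mwecat1", "com"]
print(parse_row("2\tare\t\t\t1\tLVC\tUnsure", header))
```

The category is `None` for non-initial tokens of an MWE, since only the initial token carries it.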

Examples:

1	Delegates
2	are			1	LVC	Unsure
3	in			1
4	little
5	doubt			1
6	that
7	the
8	shadow			2	ID
9	cast			2
10	over
11	the
12	city
13	by
14	the
15	attacks
16	will
17	enhance
18	the
19	chances
20	of
21	agreement	nsp
22	.

1-2	Don't
1	Do			1	ID
2	not			1
3	talk			1
4	the			1
5	talk			1
6	if			1
7	you			1
8-9	can't
8	can			1
9	not			1
10	walk			1
11	the			1
12	walk	nsp		1
13	.

1	Questioning
2	colonial
3	boundaries
4	would
5	open			1	ID
6	a
7	dangerous
8	Pandora	nsp	A	1		Tokenizer error?
9	'	nsp	A	1
10	s		A	1
11	box	nsp		1
12	.
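
To show how the rank ranges and the nsp column interact, the following Python sketch rebuilds the surface text of a sentence from its first three columns. The function `detokenize` is a hypothetical helper, and it rests on one assumption not stated in the format description: the spacing after a multi-word token is taken from the nsp value of its last component.

```python
def detokenize(rows):
    """rows: (rank, token, nsp) string triples for one sentence."""
    parts = []
    mwt_end = 0  # rank of the last component of the most recent MWT
    for rank, token, nsp in rows:
        if "-" in rank:
            # MWT row: emit the surface form, remember the covered range.
            parts.append(token)
            mwt_end = int(rank.split("-")[1])
            continue
        r = int(rank)
        if r > mwt_end:
            parts.append(token)  # ordinary token outside any MWT
        if r >= mwt_end:
            # Spacing comes from this token, or from the last MWT component.
            parts.append("" if nsp == "nsp" else " ")
    return "".join(parts).rstrip()


rows = [("1-2", "Don't", ""), ("1", "Do", ""), ("2", "not", ""),
        ("3", "talk", ""), ("4", "the", ""), ("5", "talk", "nsp"),
        ("6", ".", "")]
print(detokenize(rows))  # Don't talk the talk.
```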

If one verbal MWE is embedded in another one, columns 5-6 (identifier and type) are repeated for the embedded MWE. The identifiers of both MWEs must be distinct. In case of MWE coordination or overlapping, similar rules apply.

Examples:

1	Once
2	again
3	it
4	was
5	a
6	senior
7	BBC
8	person
9	who
10	let			1	ID	2	VPC
11	the			1
12	cat			1
13	out			1		2
14	of			1
15	the			1
16	bag	nsp		1
17	.

1	They
2	were
3	letting			1	VPC	2	VPC
4	us
5	in			1
6	and
7	out					2
8	for
9	quite
10	some
11	time	nsp
12	.
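
Embedded and overlapping annotations can be gathered per MWE identifier with a short Python sketch. The function `collect_mwes` is a hypothetical helper; it assumes the full column layout (rank, token, nsp, mtw, then identifier/category pairs) and a header row that names the pairs mwe1/mwecat1, mwe2/mwecat2, and so on.

```python
def collect_mwes(lines, header):
    """Map each MWE identifier to its category and token ranks."""
    # Locate every (identifier, category) column pair from the header.
    pair_cols = [(header.index(f"mwe{i}"), header.index(f"mwecat{i}"))
                 for i in range(1, len(header)) if f"mwe{i}" in header]
    mwes = {}
    for line in lines:
        fields = line.split("\t")
        fields += [""] * (len(header) - len(fields))  # pad trimmed rows
        if "-" in fields[0]:
            continue  # MWT range rows carry no annotation
        for id_col, cat_col in pair_cols:
            if fields[id_col]:
                entry = mwes.setdefault(int(fields[id_col]),
                                        {"category": None, "ranks": []})
                entry["ranks"].append(int(fields[0]))
                if fields[cat_col]:  # only the initial token has a category
                    entry["category"] = fields[cat_col]
    return mwes


header = ["rank", "token", "nsp", "mtw", "mwe1", "mwecat1", "mwe2", "mwecat2"]
lines = ["3\tletting\t\t\t1\tVPC\t2\tVPC", "4\tus", "5\tin\t\t\t1",
         "6\tand", "7\tout\t\t\t\t\t2"]
print(collect_mwes(lines, header))
```

For the coordination example, the sketch groups ranks 3 and 5 under identifier 1 and ranks 3 and 7 under identifier 2, both with category VPC.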

Previous versions of the format

The parseme-tsv-split format emerged from the format used in pilot annotation phase 2. The differences are the following:

- we only have two quasi-universal categories,
- the IPronV category is renamed to IReflV,
- we no longer annotate selected prepositions (see the annotation guidelines),
- hesitation labels (e.g. LVC/ID or ID/_) no longer apply; the annotation guidelines contain decision trees which should normally allow annotators to discriminate among several candidate categories; if hesitation persists, it can be expressed by a comment or by a confidence level marker attached to an annotation.

¹ Even if empty fields in this format are expected to be truly empty, underscores are tolerated in input to the FLAT annotation platform.