This page describes the generic text segmentation rules that should be used when preparing a new corpus to be annotated.
- While preparing a corpus to be used in the shared task, each language team can use its own language-specific sentence and token definitions.
- In this case, two inputs are requested from the corpus providers:
- compulsory: a precise definition of a token, together with the tokenization and/or sentence segmentation rules used in the corpus
- recommended: a tokenizer and/or sentence segmenter conforming to these rules
- This document describes a proposal for a generic, language-independent segmenter. Tokens defined by this proposal are referred to as generic tokens.
- See the annotation guidelines section that explains the distinction between tokens and words.
Generic rules for segmentation into sentences:
- The segmentation is based on simple, language-independent rules such as the occurrence of a dot ('.'), an exclamation mark ('!'), a question mark ('?'), or an ellipsis ('...'); a minimal splitter implementing these rules is sketched after this list.
- Exceptions - sentence splits are never introduced by:
- dots within acronyms (U.S.A.)
- Generic sentence segmentation is heavily error-prone. It should systematically be checked and corrected by the language team before the MWE annotation starts.
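The sketch below illustrates, in Python, one possible way to implement the rules above. It is not part of the guidelines: the terminator set, the acronym heuristic (single letters separated by dots) and the function name `split_sentences` are assumptions of this sketch only.

```python
import re

# Illustrative sketch of the generic sentence splitter described above.
# A boundary is assumed after '...', '.', '!' or '?' when followed by whitespace;
# a dot closing an acronym such as 'U.S.A.' never triggers a split.
ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}$")        # e.g. 'U.S.A.'
BOUNDARY = re.compile(r"(\.{3}|[.!?])(\s+)")        # terminator + whitespace

def split_sentences(text):
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)                          # keep the terminator
        candidate = text[start:end]
        # Exception: no split after a dot that belongs to an acronym.
        if match.group(1) == "." and ACRONYM.search(candidate):
            continue
        sentences.append(candidate.strip())
        start = match.end()                         # skip the trailing whitespace
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

# Example: the acronym dots do not end a sentence, the final dot does.
print(split_sentences("She moved to the U.S.A. in 2010. Did she?"))
# ['She moved to the U.S.A. in 2010.', 'Did she?']
```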
Generic rules for segmentation into tokens:
- blank character = any of the following:
- a space (U+0020)
- a newline character: LF (U+000A), CR (U+000D), NEL (U+0085), LS (U+2028), PS (U+2029)
- a tabulation character (U+0009)
- punctuation mark = any of the following:
- ASCII punctuation marks (from U+0021 to U+002F, from U+003A to U+0040, from U+005B to U+0060, from U+007B to U+007E)
- Unicode general punctuation marks (from U+2000 to U+206F), except the newline characters listed above
- separator = any of the following:
- a blank character
- a punctuation mark
- token = any of the following:
- a continuous sequence of non-separators appearing between two separators
- a single separator
- this definition of segmentation underestimates the number of tokens in languages with compounding or cliticization, such as German ('Schulaufgabe') and Spanish ('hacerlo'); custom tokenizers should be preferred for these languages (see the tokenizer sketch after this list)
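The character classes defined above translate directly into a small tokenizer. The sketch below (Python, illustrative only) marks every blank or punctuation character as a separator and emits maximal runs of non-separators, plus single separators, as tokens. Dropping blank-only tokens by default is a choice of this sketch (pass `keep_blanks=True` to follow the literal definition above), and the function names are assumptions rather than part of the guidelines.

```python
# Sketch of the generic tokenizer defined above (illustrative, not normative).
BLANKS = {"\u0020", "\u000A", "\u000D", "\u0085", "\u2028", "\u2029", "\u0009"}

def is_punctuation(ch):
    """ASCII punctuation ranges plus the Unicode General Punctuation block
    (U+2000 to U+206F), excluding the newline characters listed as blanks."""
    code = ord(ch)
    ascii_punct = (0x21 <= code <= 0x2F or 0x3A <= code <= 0x40
                   or 0x5B <= code <= 0x60 or 0x7B <= code <= 0x7E)
    general_punct = 0x2000 <= code <= 0x206F and ch not in BLANKS
    return ascii_punct or general_punct

def is_separator(ch):
    """A separator is either a blank character or a punctuation mark."""
    return ch in BLANKS or is_punctuation(ch)

def tokenize(text, keep_blanks=False):
    """Emit maximal runs of non-separators and single separators as tokens.
    Blank tokens are dropped by default, since they usually carry no annotation."""
    tokens, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                tokens.append("".join(current))
                current = []
            if keep_blanks or ch not in BLANKS:
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

# Example: punctuation marks become tokens of their own, blanks are dropped.
print(tokenize("Don't panic, it's fine..."))
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '.', '.', '.']
```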