This page describes the generic text segmentation rules that should be used when preparing a new corpus to be annotated.
- While preparing a corpus to be used in the shared task, each language team can use its own language-specific sentence and token definitions.
- In this case, two inputs are requested from the corpus providers:
- compulsory: a precise definition of a token, together with the tokenization and/or sentence segmentation rules used in the corpus
- recommended: a tokenizer and/or sentence segmenter conforming to these rules
- This document describes a proposal for a generic, language-independent segmenter. Tokens defined by this proposal are referred to as generic tokens.
- See the annotation guidelines section that explains the distinction between tokens and words.
Generic rules for segmentation into sentences:
- The segmentation is based on simple, language-independent rules such as the occurrence of a dot ('.'), an exclamation mark ('!'), a question mark ('?'), or an ellipsis ('...'); a minimal splitter implementing these rules is sketched after this list.
- Exceptions - sentence splits are never introduced by:
- dots within acronyms (U.S.A.)
- Generic sentence segmentation is heavily error-prone. It should systematically be checked and corrected by the language team before the MWE annotation starts.
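The sketch below illustrates, in Python, one possible way to implement the rules above. It is not part of the guidelines: the terminator set, the acronym heuristic (single letters separated by dots) and the function name `split_sentences` are assumptions of this sketch only.

```python
import re

# Illustrative sketch of the generic sentence splitter described above.
# A boundary is assumed after '...', '.', '!' or '?' when followed by whitespace;
# a dot closing an acronym such as 'U.S.A.' never triggers a split.
ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}$")        # e.g. 'U.S.A.'
BOUNDARY = re.compile(r"(\.{3}|[.!?])(\s+)")        # terminator + whitespace

def split_sentences(text):
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)                          # keep the terminator
        candidate = text[start:end]
        # Exception: no split after a dot that belongs to an acronym.
        if match.group(1) == "." and ACRONYM.search(candidate):
            continue
        sentences.append(candidate.strip())
        start = match.end()                         # skip the trailing whitespace
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

# Example: the acronym dots do not end a sentence, the final dot does.
print(split_sentences("She moved to the U.S.A. in 2010. Did she?"))
# ['She moved to the U.S.A. in 2010.', 'Did she?']
```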
Generic rules for segmentation into tokens:
- blank character = any of the following:
- a space (U+0020)
- a newline character: LF (U+000A), CR (U+000D), NEL (U+0085), LS (U+2028), PS (U+2029)
- a tabulation character (U+0009)
- punctuation mark = any of the following:
- ASCII punctuation marks (from U+0021 to U+002F, from U+003A to U+0040, from U+005B to U+0060, from U+007B to U+007E)
- Unicode general punctuation marks (from U+2000 to U+206F), except the newline characters listed above
- separator = any of the following:
- a blank character
- a punctuation mark
- token = any of the following:
- a continuous sequence of non-separators appearing between two separators
- a single separator
- this definition of segmentation underestimates the number of tokens in languages with compounding or cliticization, such as German ('Schulaufgabe') and Spanish ('hacerlo'); custom tokenizers should be preferred for these languages (see the tokenizer sketch after this list)
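The character classes defined above translate directly into a small tokenizer. The sketch below (Python, illustrative only) marks every blank or punctuation character as a separator and emits maximal runs of non-separators, plus single separators, as tokens. Dropping blank-only tokens by default is a choice of this sketch (pass `keep_blanks=True` to follow the literal definition above), and the function names are assumptions rather than part of the guidelines.

```python
# Sketch of the generic tokenizer defined above (illustrative, not normative).
BLANKS = {"\u0020", "\u000A", "\u000D", "\u0085", "\u2028", "\u2029", "\u0009"}

def is_punctuation(ch):
    """ASCII punctuation ranges plus the Unicode General Punctuation block
    (U+2000 to U+206F), excluding the newline characters listed as blanks."""
    code = ord(ch)
    ascii_punct = (0x21 <= code <= 0x2F or 0x3A <= code <= 0x40
                   or 0x5B <= code <= 0x60 or 0x7B <= code <= 0x7E)
    general_punct = 0x2000 <= code <= 0x206F and ch not in BLANKS
    return ascii_punct or general_punct

def is_separator(ch):
    """A separator is either a blank character or a punctuation mark."""
    return ch in BLANKS or is_punctuation(ch)

def tokenize(text, keep_blanks=False):
    """Emit maximal runs of non-separators and single separators as tokens.
    Blank tokens are dropped by default, since they usually carry no annotation."""
    tokens, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                tokens.append("".join(current))
                current = []
            if keep_blanks or ch not in BLANKS:
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

# Example: punctuation marks become tokens of their own, blanks are dropped.
print(tokenize("Don't panic, it's fine..."))
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '.', '.', '.']
```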