This page describes the generic text segmentation rules that should be used when preparing a new corpus to be annotated.

General remarks:

  • While preparing a corpus to be used in the shared task, each language team can use its own language-specific sentence and token definition.
  • In this case, two inputs are requested for the shared task from the corpus providers:
    • compulsory: a precise definition of a token and tokenization and/or sentence segmentation rules used in the corpus
    • recommended: a tokenizer and/or sentence segmenter conforming to these rules
  • This document describes a proposal of a generic language-independent segmenter. Tokens defined by this proposal are referred to as generic tokens.
  • See the annotation guidelines section that explains the distinction between tokens and words.

Generic rules for segmentation into sentences:

  • The segmentation is based on simple language-independent rules such as the occurrence of a dot ('.'), an exclamation mark ('!'), a question mark ('?'), or an ellipsis ('...').
  • Exceptions - sentence splits are never introduced by:
    • punctuation included in URLs and email addresses
    • dots within acronyms (U.S.A.)
  • Generic sentence segmentation is highly error-prone. It should systematically be checked and corrected by the language team before the MWE annotation starts.
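The rules above can be sketched as a small regex-based segmenter. This is a minimal illustration, not a reference implementation: the function and pattern names are our own, and the URL, email, and acronym patterns are deliberately simplistic.

```python
import re

# Spans whose punctuation must never trigger a sentence split:
# URLs, email addresses, and acronyms such as "U.S.A.".
# These illustrative patterns are assumptions, not part of the guidelines.
PROTECTED = re.compile(r'''(?:https?://\S+|www\.\S+   # URLs
                          |\S+@\S+\.\S+               # email addresses
                          |(?:[A-Z]\.){2,})           # acronyms (U.S.A.)
                       ''', re.VERBOSE)

# Sentence-final punctuation: '...', '.', '!', '?' followed by whitespace.
SENT_END = re.compile(r'(\.{3}|[.!?])\s+')

def split_sentences(text):
    """Split text on '.', '!', '?' and '...', except inside protected spans."""
    # Mask protected spans with NUL characters of equal length so their
    # punctuation is invisible to SENT_END while offsets stay aligned.
    masked = PROTECTED.sub(lambda m: '\u0000' * len(m.group()), text)
    sentences, start = [], 0
    for m in SENT_END.finditer(masked):
        end = m.end(1)  # include the terminal punctuation mark
        sentences.append(text[start:end].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

For example, `split_sentences("He lives in the U.S.A. now. Really!")` keeps the acronym intact and splits only after "now.". As the guidelines warn, output of any such generic segmenter still needs manual checking.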

Generic rules for segmentation into tokens:

  • blank character = any of the following:
    • a space (U+0020)
    • a newline character: LF (U+000A), CR (U+000D), NEL (U+0085), LS (U+2028), PS (U+2029)
    • a tabulation (U+0009)
  • punctuation mark = any of the following:
    • ASCII punctuation marks (from U+0021 to U+002F, from U+003A to U+0040, from U+005B to U+0060, from U+007B to U+007E)
    • Unicode general punctuation marks (from U+2000 to U+206F), except the newline characters listed above
  • separator = any of the following:
    • a blank character
    • a punctuation mark
  • token = any of the following:
    • a continuous sequence of non-separators appearing between two separators
    • a single separator
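These definitions translate directly into code. The sketch below, under the assumption that a Python implementation is wanted (the function names are our own), emits every maximal run of non-separators as one token and every separator, including blanks, as a single-character token:

```python
# Blank characters as defined above: space, tab, and the newline
# characters LF, CR, NEL, LS, PS.
BLANKS = {'\u0020', '\u0009', '\u000A', '\u000D',
          '\u0085', '\u2028', '\u2029'}

def is_punctuation(ch):
    """ASCII punctuation ranges plus Unicode general punctuation
    (U+2000..U+206F), excluding the newline characters above."""
    cp = ord(ch)
    return (0x21 <= cp <= 0x2F or 0x3A <= cp <= 0x40
            or 0x5B <= cp <= 0x60 or 0x7B <= cp <= 0x7E
            or (0x2000 <= cp <= 0x206F and ch not in BLANKS))

def is_separator(ch):
    return ch in BLANKS or is_punctuation(ch)

def tokenize(text):
    """Generic tokens: maximal runs of non-separators, plus each
    separator as its own single-character token."""
    tokens, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                tokens.append(''.join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append(''.join(current))
    return tokens
```

Note that blanks come out as tokens too, since a single separator is a token by definition; a caller that wants only visible tokens can filter out those in BLANKS afterwards.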

Known problems:

    • this definition of segmentation underestimates the number of tokens in languages with compounding or clitics, such as German ('Schulaufgabe') and Spanish ('hacerlo'); custom tokenizers are preferable for such languages