This page describes the text segmentation issues related to the pilot annotation phase 1 in the PARSEME shared task on automatic detection of verbal MWEs.
General remarks:
- While preparing a corpus to be used in the shared task, each language team can use its own language-specific sentence and token definition.
- In that case, the corpus providers are requested to deliver two inputs for the shared task:
- compulsory: a precise definition of a token, together with the tokenization and/or sentence segmentation rules used in the corpus
- recommended: a tokenizer and/or sentence segmenter conforming to these rules
- This document describes a proposal of a generic language-independent segmenter. Tokens defined by this proposal are referred to as generic tokens.
- A token and a word are distinct notions.
- A (generic) token is a pragmatic, technical notion, defined according to largely non-linguistic clues.
- A word is a linguistically, notably semantically, motivated unit. It is, thus, language-dependent. Each language team needs to define the notion of a word for its own language. Ideally, it should be as close as possible to the (language-specific) definition of a token.
- The relation of a (generic) token to a (language-specific) word is not 1-to-1 (see the illustrative alignments after this list):
- Most often a (generic) token coincides with a word (e.g. with).
- Sometimes several (generic) tokens can build up one word, e.g. abbreviations like M., pp. (EN), possessives like Pandora's (EN), words with "accidental" separators like aujourd'hui (FR), inflected forms of foreign names like Chomsky'ego (PL). Then we speak of a multi-token word (MTW).
- Sometimes one (generic) token can contain several words, e.g. Schul|aufgabe (DE), della = de la (IT). In this case we speak of a multi-word token (MWT).
- Sometimes several (generic) tokens can non-trivially correspond to several words, e.g. do|n't = do not (EN)
- As a consequence (and since a MWE always contains at least two words) we have:
- A MWE can contain several tokens, whether each of them coincides with a word, as in to take a walk (EN), or not, as in to open a Pandora's box (EN)
- One token containing several words can be a MWE, e.g. Befähigungszeugnis (DE), or not, e.g. Schulaufgabe (DE)
- It remains to be seen to what extent divergences between tokens and words concern verbal MWEs.
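To make these token/word relations concrete, the examples above can be written down as explicit alignments. The following snippet is purely illustrative Python: the list-of-pairs representation is an assumption of this page, not a format prescribed by the shared task; it merely restates the examples listed above.

    # Illustrative token/word alignments; each pair is
    # (generic tokens, language-specific words).
    alignments = [
        (["with"], ["with"]),                        # 1-to-1: token = word
        (["Pandora", "'", "s"], ["Pandora's"]),      # MTW (EN)
        (["aujourd", "'", "hui"], ["aujourd'hui"]),  # MTW (FR)
        (["Chomsky", "'", "ego"], ["Chomsky'ego"]),  # MTW (PL)
        (["Schulaufgabe"], ["Schul", "aufgabe"]),    # MWT (DE)
        (["della"], ["de", "la"]),                   # MWT (IT)
        (["do", "n't"], ["do", "not"]),              # several tokens to several words (EN)
    ]

    for tokens, words in alignments:
        print(" + ".join(tokens), "<->", " + ".join(words))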
Generic rules for segmentation into sentences:
- Sentence segmentation is based on simple language-independent rules, such as the occurrence of a dot ('.'), an exclamation mark ('!'), a question mark ('?') or an ellipsis ('...'); a minimal sketch of such a segmenter is given after this list.
- Exceptions - sentence splits are never introduced by:
- punctuation included in URLs (www.parseme.eu) and email addresses
- dots within acronyms (U.S.A.)
- Generic sentence segmentation is heavily prone to errors. It should systematically be checked and modified by the language team before the MWE annotation starts.
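A minimal sketch of such a generic sentence segmenter, written in Python, is given below. The function name, the regular expressions and the treatment of protected spans are illustrative choices of this page, not part of the shared task infrastructure; a real segmenter will need further refinements.

    import re

    # Illustrative sketch of the generic sentence segmenter described above.
    # Splits after '.', '!', '?' or '...', but never inside URLs, email
    # addresses or dotted acronyms such as "U.S.A.".

    # Spans that must never trigger a sentence split.
    PROTECTED = re.compile(
        r"""(
            (?:https?://|www\.)\S+      # URLs, e.g. www.parseme.eu
          | \S+@\S+\.\S+                # email addresses
          | (?:[A-Z]\.){2,}             # dotted acronyms, e.g. U.S.A.
        )""",
        re.VERBOSE,
    )

    # A candidate sentence boundary: '...', '.', '!' or '?' followed by a blank.
    BOUNDARY = re.compile(r"(?:\.\.\.|[.!?])(?=\s)")

    def split_sentences(text):
        """Return a list of sentences according to the generic rules."""
        protected = [m.span() for m in PROTECTED.finditer(text)]
        sentences, start = [], 0
        for m in BOUNDARY.finditer(text):
            # Skip boundaries that fall inside a protected span.
            if any(s <= m.start() < e for s, e in protected):
                continue
            sentences.append(text[start:m.end()].strip())
            start = m.end()
        rest = text[start:].strip()
        if rest:
            sentences.append(rest)
        return sentences

    # Example: two sentences, no split on the dots inside the URL.
    print(split_sentences("See www.parseme.eu for details. The task starts soon!"))
    # ['See www.parseme.eu for details.', 'The task starts soon!']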
Generic rules for segmentation into tokens:
- blank character = any of the following:
- a space (U+0020)
- a newline character: LF (U+000A), CR (U+000D), NEL (U+0085), LS (U+2028), PS (U+2029)
- a tabulation (U+0009)
- punctuation mark = any of the following
- ASCII punctuation marks (from U+0021 to U+002F, from U+003A to U+0040, from U+005B to U+0060, from U+007B to U+007E)
- Unicode general punctuation marks (the General Punctuation block, from U+2000 to U+206F) except newline characters (see above)
- separator = any of the following
- a blank character
- a punctuation mark
- token = any of the following (see the tokenizer sketch after this list):
- a continuous sequence of non-separators appearing between two separators
- a single separator
- multi-token word (MTW) = a sequence of tokens which - according to language-specific criteria - should be considered as one word, e.g. Chomsky'ego (PL)
- multi-word token (MWT) = a sequence of words whose surface representation consists of one token, e.g. della = de la (IT)
- multi-word expression (MWE) = a sequence of words (rather than tokens) which - according to language-specific criteria - should be considered as one expression (e.g. air brake, hierarchia Chomsky'ego); recall that in this shared task we only identify verbal MWEs
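The generic tokenization rules above can be sketched in Python as follows. The code is again illustrative rather than a reference implementation; in particular the keep_blanks switch, which decides whether purely blank tokens are kept or discarded, is an assumption of this sketch and not part of the definition.

    import re

    # Blank characters: space, tab and the newline characters listed above.
    BLANKS = "\u0020\u0009\u000A\u000D\u0085\u2028\u2029"

    # Punctuation marks: the four ASCII punctuation ranges plus the Unicode
    # General Punctuation block (U+2000-U+206F), excluding newline characters.
    PUNCT_RANGES = [
        (0x0021, 0x002F), (0x003A, 0x0040), (0x005B, 0x0060), (0x007B, 0x007E),
        (0x2000, 0x206F),
    ]
    PUNCT = "".join(
        chr(c)
        for lo, hi in PUNCT_RANGES
        for c in range(lo, hi + 1)
        if chr(c) not in BLANKS
    )

    # A separator is a blank character or a punctuation mark.
    SEPARATORS = BLANKS + PUNCT

    # A token is a maximal run of non-separators, or a single separator.
    TOKEN = re.compile("[^{0}]+|[{0}]".format(re.escape(SEPARATORS)))

    def tokenize(sentence, keep_blanks=False):
        """Return the list of generic tokens of a sentence.

        By the literal definition above every single separator, including a
        blank character, is itself a token; in practice purely blank tokens
        are usually discarded, hence the keep_blanks switch (an assumption
        of this sketch).
        """
        tokens = TOKEN.findall(sentence)
        if keep_blanks:
            return tokens
        return [t for t in tokens if t not in BLANKS]

    # Example: the possessive "Pandora's" yields the three generic tokens
    # "Pandora", "'" and "s", i.e. a multi-token word (MTW).
    print(tokenize("to open a Pandora's box"))
    # ['to', 'open', 'a', 'Pandora', "'", 's', 'box']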
Known problems:
- this definition of segmentation underestimates the number of tokens in languages such as German and Spanish, where forms like 'Schulaufgabe' (DE) or 'hacerlo' (ES) remain single tokens although they contain several words; custom tokenizers are to be preferred for these languages