Workshop on Data Provenance and Annotation in Computational Linguistics
January 22nd, 2018

co-located with the 16th International Workshop on Treebanks and Linguistic Theories (TLT16)
Charles University, Prague, Czech Republic
January 23-24, 2018

Invited speakers:

Adriane Boyd, Universität Tübingen
Peter Buneman, University of Edinburgh
Nicoletta Calzolari, Italian National Research Council
Sarah Cohen Boulakia, Université Paris Sud
Jan Hajič, Charles University, Prague

This workshop seeks to bring together researchers from the fields of provenance, data annotation, and data curation with researchers working within computational linguistics and dealing with the annotation of language data. Provenance is concerned with understanding how to model, record, and share metadata about the origin of data and the further sharing or processing that data has undergone. While provenance has been studied in various domains (e.g., for business applications or in the life sciences), many of the central issues are also of vital interest for computational linguistics.

For example, issues of "data cleaning" and data curation both have serious repercussions for the reproducibility of analyses or experiments. In general, computational linguistic work with data tends to involve several pre-processing steps (stop-lists, data normalization, filtering out information that is considered not at-issue, or error correction). However, these steps are seldom documented or described in detail. Data sets may also undergo several rounds of pre-processing, with information about the successive changes again not well documented. Data may also be automatically or semi-automatically generated. In computational linguistics this often takes the form of automatic or semi-automatic data annotation. This, as well as manual annotation, is prone to errors and inter-annotator disagreement, leading to rounds of adjudication or correction. This work with data is also generally not documented in detail, so annotation decisions may be hard to "undo". Finally, once a data set is released, newer versions will inevitably also have to be released to deal with data expansion or correction. In this case, proper versioning and data curation are vital to ensure experimental and analytical reproducibility.

While computational linguists deal with these issues on a daily basis, there is little awareness of established methodology and best practices coming from the field of data provenance. The aim of this workshop is to begin a dialog. On the one hand, we aim to create awareness of the needs and challenges posed by linguistic data in the data provenance community. On the other hand, we aim to import an understanding of the experiences and best practices established with respect to data provenance into the computational linguistics community.

The workshop is sponsored by the SFB-TRR 161 on "Quantitative Methods for Visual Computing".