Call for Posters


Workshop on Data Provenance and Annotation in Computational Linguistics

A special Workshop on Data Provenance and Annotation in Computational Linguistics
is co-located with the Treebanks and Linguistic Theory (TLT) conference 2018 in
Prague.

Invited Speakers:
Adriane Boyd, Universität Tübingen
Peter Buneman, University of Edinburgh
Nicoletta Calzolari, Italian National Research Council
Sarah Cohen Boulakia, Université Paris Sud

This is a call for posters to be presented at the workshop. The deadline for
submissions is December 22nd, 2017. Notification of acceptance will be by
December 31st, 2017.

The workshop seeks to bring together researchers from the fields of provenance,
data annotation, and data curation with researchers working within computational
linguistics and dealing with the annotation of language data. Provenance is
concerned with understanding how to model, record, and share metadata about the
origin of data and the further sharing or processing that data has
undergone. While provenance has been studied in various domains (e.g., for
business applications or in the life sciences), many of the central issues are
also of vital interest for computational linguistics.

For example, issues of "data cleaning" and data curation both have serious
repercussions for the reproducibility of analyses or experiments. In general,
computational linguistic work with data tends to involve several pre-processing
steps (stop-lists, data normalization, error correction, or filtering out
information that is considered not at-issue). However, these steps are
seldom documented or described in detail. Data sets may also undergo several
rounds of pre-processing, with information about the successive changes again
not well documented. Data may also be automatically or semi-automatically
generated. In computational linguistics this often takes the form of automatic
or semi-automatic data annotation. This, as well as manual annotation, is prone
to errors and inter-annotator disagreement, leading to rounds of adjudication or
correction. This work with data is also generally not documented (in detail), so
that annotation decisions may be hard to "undo". Finally, once a data set is
released, newer versions will inevitably also have to be released to deal with
data expansion or correction. In this case, proper versioning and data curation
are vital to ensure experimental and analytical reproducibility.

While computational linguists deal with these issues on a daily basis, there is
little awareness of established methodology and best practices coming from the
field of data provenance. The aim of this workshop is to begin a dialog. On the
one hand, we aim to create awareness of the needs and challenges posed by
linguistic data in the data provenance community. On the other hand, we aim to
import an understanding of the experiences and best practices established with
respect to data provenance into the computational linguistics community.



Authors are invited to submit an abstract no longer than two A4 pages,
including references and data. Abstracts must have 2.5 cm (1 inch)
margins on all sides and be set in Times New Roman with a font size no smaller
than 11pt. The submissions must not reveal the identity of the author(s) in
any way.

Abstracts must be submitted in PDF format through EasyChair by December 22nd,
2017, 11:59pm EST. To submit your abstract, please click on the following link:


Important Dates:

Deadline for submission: December 22nd, 2017
Notification of acceptance: December 31st, 2017

Workshop: January 22nd, 2018


Organising committee:

Miriam Butt, University of Konstanz
Melanie Herschel, University of Stuttgart
Christin Schätzle, University of Konstanz