PARSEME shared task
Minutes from the Skype meeting
Friday, 4 December 2015, 2 p.m.

Participants: Carlos (ceramisch), Fabienne (fritzife03), Marie (mariecandito), Voula (vgiouli1), Veronika (akinorev1981), Behrang (behrang.qasemizadeh), Federico (kercos), Agata (agata.savary)
Unsure: Simon (simon.krek.jsi), Antoine

==================================================

Feedback on the annotation guidelines and format (Vincze Veronika)

1. Distinguishing the categories of verbal MWEs
- (Carlos) the distinction between subclasses may be difficult, with a risk of many borderline cases
- (Carlos) "Other MWEs" can be mixed up with idioms
- (Carlos) verbal collocations are missing (e.g. adverbs that usually modify certain verbs, as in "drastically drop") - is this intended?
- (Carlos) problematic cases:
  + to be afraid (that) - to annotate or not? should "that" be included?
  + to make it
  + to make do
- (Agata) given the scope so far, with so many languages, collocations should probably be skipped
- (Agata) fuzzy borders between categories are probably inevitable; MWE identification is our 1st objective (it will be evaluated in the contest); categorization is the 2nd objective (it will be annotated but not directly evaluated)
- (Carlos) should nominalizations of verbal MWEs be annotated?
- (Agata) the guide mentions that nominalizations should be annotated; maybe we should make this even clearer

2. Form of the annotation guidelines
- (Carlos) the way of applying the linguistic tests should be made clearer - should the candidate always be annotated when the answer to a test is yes/no?
- (Marie) the guidelines might be more detailed, with more examples; the French guidelines currently under development might be useful as an inspiration
- (Marie) the notion of "lexically fixed expressions" (p. 1) is not entirely clear
- (Marie) an additional test for LVCs might be useful: a modifier refers to the noun and the verb at the same time, e.g. "John took a decision to leave." = "It's John's decision to leave." = "It's John who will leave." = *"John took Paul's decision to leave."

3. Status of MWE components
- (Agata) selected modifiers (forming collocations) are not to be annotated (see 1)
- (Marie) the special case of prepositions still needs to be clarified, notably the status of the components in the example "make the decision on", which is not quite clear with respect to the format description
- (Voula) unsure whether the non-fixed nominal and sentential complements should be included or not
- (Agata) non-fixed complements are not to be included in the annotation scope

4. Tokenization issues
- (Carlos) the role of hyphens is unclear: do they always introduce a new token?
- (Marie) the list of punctuation marks should be specified
- (Marie) punctuation marks (e.g. dashes or quotes) preceded/followed by a space can be discriminating for some MWEs, but this information is lost in the one-token-per-line format (see the sketch after this section)
- (Marie) the tokenization of some special items, e.g. URLs, should be made clear
- (Veronika) we could take inspiration from the tokenization definition in Universal Dependencies
- (Marie) should the tokenization be universal or can it be language-specific?
  + a generic tokenizer will be provided, but language teams can use their own custom tokenizers
- (Agata) should the annotators correct tokenization errors?
  + (Marie) in phase 1 the annotators should signal the errors, not correct them; then we should evaluate the extent of the errors and possibly adapt the annotation format
- (Fabienne) it is difficult to agree on common tokenization rules (cf. recent experiments in her team)
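To make Marie's point about spacing more concrete, here is a minimal sketch (not part of any agreed format; the column layout, the SpaceAfter=No convention borrowed from the MISC column of Universal Dependencies, and the naive regex-based tokenization are all assumptions for illustration) of a one-token-per-line output that records whether each token was followed by a space in the raw text:

import re

# Naive tokenization for illustration only: keep hyphenated words together,
# split every other punctuation mark into its own token.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def one_token_per_line(sentence):
    """Return a one-token-per-line rendering of `sentence`, marking tokens
    that are NOT followed by a space (in the spirit of UD's SpaceAfter=No)."""
    lines = []
    for match in TOKEN_RE.finditer(sentence):
        followed_by_space = (match.end() < len(sentence)
                             and sentence[match.end()].isspace())
        lines.append("%s\t%s" % (match.group(),
                                 "_" if followed_by_space else "SpaceAfter=No"))
    return "\n".join(lines)

if __name__ == "__main__":
    # Quotes and dashes keep the information about the surrounding spaces.
    print(one_token_per_line('He made the decision -- "drastically" -- on his own.'))

Following the UD tokenization definition, as Veronika suggests, would cover this point as well, since UD encodes the same information with SpaceAfter=No.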
5. Multilingual issues
- (Voula) more examples are needed (also in other languages); they could be provided by the LGLs
- (Behrang) how are we going to port the guidelines to different languages? see e.g. the issues of combination vs. incorporation in Farsi
- (Agata) including (glossed and translated) examples from other languages would make the guidelines huge and hard to read:
  + the common part should be in English with English examples; the tests and examples should be numbered
  + the language(-family)-specific sections should contain examples in these languages, with references to the tests and examples in the English section
  + language teams should make sure that they include an expert linguist

6. Choice of corpus
- (Voula) the procedure for selecting the corpus to annotate is too vague; newspapers contain texts from different domains, e.g. texts on politics are very different from those on sports
- (Voula) maybe we should recommend a particular domain for the corpus
- (Fabienne, Marie) choosing a domain would be too restrictive, e.g. it would prevent some teams from reusing already available corpus annotations
- let's recommend newspaper texts from no specific technical domain (no sports, weather forecasts, economics, etc.)

7. Annotation format and environment
- (Voula) a simple text format is OK; it is better than a new annotation tool
- (Federico, Veronika, Voula) OK for a spreadsheet
- (Marie) it's important to be able to work off-line
- (Federico) Google Spreadsheets enables working off-line with Chrome
- we should not see the other annotations already available; each annotator should have her/his own version of the spreadsheet
- a meta-spreadsheet is needed to keep track of the language groups and of the files assigned to languages and annotators (a possible layout is sketched after section 8)

8. Miscellaneous
- we should decide whether or not to recommend using a corpus with other annotation layers
- target applications:
  + (Carlos) an application-driven point of view is interesting because it requires reasoning about semantics
  + (Agata) a small sub-part of the final corpus will be parallel, but not in the pilot annotation
  + an MT-oriented annotation would be too ambitious; the corpora will be largely non-aligned, since we want to avoid translationese issues, and they are going to be too scarce anyway
- (Marie) evaluation modalities - two tracks are possible:
  + one in which only the corpora and pre-processing tools made available are used
  + another in which the use of external resources is allowed
- roles:
  + (Behrang) roles should be defined more clearly: annotators do not necessarily choose the corpora; the final guidelines should be addressed only to annotators
  + (Agata) roles like managers and annotators can be distinguished in the actual annotation after April, but the pilot annotation will usually be done by the language team managers anyway
- organization within the language groups:
  + each leader organizes her/his work individually (notably as far as communication is concerned: emails, a ticket system like GitHub, etc.)
  + (Voula) in the "other languages" section an extra hierarchy might be needed (only the leaders of language teams communicate directly with Voula)
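As a concrete illustration of the meta-spreadsheet mentioned in section 7, here is a minimal sketch (the column names, file names, annotator identifiers and status values are purely illustrative assumptions, not an agreed layout) that writes such a tracking table as a CSV file, one row per corpus file with its language group, assigned annotator and annotation status:

import csv

# Illustrative rows only; the real file names, annotator identifiers and
# status values would be decided by the language group leaders.
ROWS = [
    ("French",    "fr_news_001.txt", "annotator_1", "assigned"),
    ("Hungarian", "hu_news_001.txt", "annotator_2", "in progress"),
    ("Greek",     "el_news_001.txt", "annotator_3", "not started"),
]

with open("meta_spreadsheet.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["language_group", "file", "annotator", "status"])
    writer.writerows(ROWS)

Such a CSV can be imported into the Google Spreadsheets set-up that Federico will configure.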
9. Internal deadlines
- now: inform the language teams about the corpus choice and the new schedule
- 15 December: tokenizer definition
- 15 December: enhanced guidelines sent to the LGLs
- 21 December: feedback from the LGLs
- 30 December: guidelines published; tokenizer and spreadsheets ready
- 4 January: start of the pilot annotation

==========

TODO:

Agata:
+ send the French guidelines to Veronika
+ update the schedule on the web page
+ inform the language teams about the corpus choice and the new schedule
+ enhance the definition of the tokenization (see Universal Dependencies)
+ specify that tokenization errors should be signaled but not corrected
+ delete the constraint of two different writing styles; add a specification on avoiding technical sublanguage
+ set up a Doodle poll for the Skype talk after the 1st phase

Veronika + Agata:
+ add more examples to the guidelines, number the tests and the examples, add extra tests, reorganize the tests into a "decision tree"
+ add a specific section on Hungarian, with references to the English section

Federico:
+ configure the Google Spreadsheets according to the specifications:
  + be able to work off-line easily
  + have a control of the format
  + one spreadsheet per annotator
  + a meta-spreadsheet for the management of annotators and files

Behrang:
+ test the Stanford tokenizer - see how specific it can be made; adapt it to the new tokenization specifications (see the example sketch at the end of these minutes)

LGLs:
+ feedback on the enhanced annotation guidelines
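For Behrang's task of testing the Stanford tokenizer, a minimal sketch of how it could be tried out on a sample file is given below. It assumes a locally available Stanford CoreNLP jar (the jar name "stanford-corenlp.jar" and the input file "sample.txt" are placeholders) and calls the documented PTBTokenizer main class; the options available for adapting its behaviour to our tokenization specifications should be checked against the installed version.

import subprocess

def run_stanford_tokenizer(input_file, jar_path="stanford-corenlp.jar"):
    """Tokenize a plain-text file with the Stanford PTBTokenizer and return
    its standard output (one token per line by default)."""
    result = subprocess.run(
        ["java", "-cp", jar_path,
         "edu.stanford.nlp.process.PTBTokenizer", input_file],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # "sample.txt" is a placeholder for a small excerpt of the chosen corpus.
    print(run_stanford_tokenizer("sample.txt"))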