PARSEME shared task
Minutes from the Skype meeting
Friday, 4 December 2015, 2 p.m.

Participants: Carlos (ceramisch), Fabienne (fritzife03), Marie (mariecandito), Voula (vgiouli1), Veronika (akinorev1981), Behrang (behrang.qasemizadeh), Federico (kercos), Agata (agata.savary)
Unsure: Simon (simon.krek.jsi), Antoine

==================================================

Feedback on the annotation guidelines and format (Vincze Veronika)

1. Distinguishing the categories of verbal MWEs
- (Carlos) the distinction between subclasses may be difficult, with a risk of many borderline cases
- (Carlos) "Other MWEs" can be mixed up with idioms
- (Carlos) verbal collocations are missing (e.g. adverbs that usually modify certain verbs, as in "drastically drop") - is this intended?
- (Carlos) problematic cases:
  + to be afraid (that) - to annotate or not? should "that" be included?
  + to make it
  + to make do
- (Agata) given the scope so far, with so many languages, collocations should probably be skipped
- (Agata) fuzzy borders between categories are probably inevitable; MWE identification is our 1st objective (it will be evaluated in the contest); categorization is the 2nd objective (it will be annotated but not directly evaluated)
- (Carlos) should nominalizations of verbal MWEs be annotated?
- (Agata) the guide mentions that nominalizations should be annotated; maybe we should make this even clearer

2. Form of the annotation guidelines
- (Carlos) the way of applying the linguistic tests should be made clearer - should the candidate always be annotated when the answer to a test is yes/no?
- (Marie) the guidelines might be more detailed, with more examples; the French guidelines currently under development might be useful as an inspiration
- (Marie) the notion of "lexically fixed expressions" (p. 1) is not entirely clear
- (Marie) an additional test for LVCs might be useful: a modifier refers to the noun and the verb at the same time, e.g. "John took a decision to leave." = "It's John's decision to leave." = "It's John who will leave." = *"John took Paul's decision to leave."

3. Status of MWE components
- (Agata) selected modifiers (forming collocations) are not to be annotated (see 1)
- (Marie) the special case of prepositions still needs to be clarified, notably the status of the components in the example "make the decision on", which is not quite clear with respect to the format description
- (Voula) unsure whether the non-fixed nominal and sentential complements should be included or not
- (Agata) non-fixed complements are not to be included in the annotation scope

4. Tokenization issues
- (Carlos) the role of hyphens is unclear: do they always introduce a new token?
- (Marie) the list of punctuation marks should be specified
- (Marie) punctuation marks (e.g. dashes or quotes) preceded/followed by a space can be discriminating for some MWEs, but this information is lost in the one-token-per-line format (see the sketch after this section)
- (Marie) the tokenization of some special items, e.g. URLs, should be made clear
- (Veronika) we could take inspiration from the tokenization definition in Universal Dependencies
- (Marie) should the tokenization be universal or can it be language-specific?
  + a generic tokenizer will be provided, but language teams can use their own custom tokenizers
- (Agata) should the annotators correct tokenization errors?
  + (Marie) in phase 1 the annotators should signal the errors, not correct them; then we should evaluate the extent of the errors and possibly adapt the annotation format
- (Fabienne) it is difficult to agree on common tokenization rules (cf. recent experiments in her team)
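To make Marie's point about spacing more concrete, here is a minimal sketch (not part of any agreed format; the column layout, the SpaceAfter=No convention borrowed from the MISC column of Universal Dependencies, and the naive regex-based tokenization are all assumptions for illustration) of a one-token-per-line output that records whether each token was followed by a space in the raw text:

import re

# Naive tokenization for illustration only: keep hyphenated words together,
# split every other punctuation mark into its own token.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def one_token_per_line(sentence):
    """Return a one-token-per-line rendering of `sentence`, marking tokens
    that are NOT followed by a space (in the spirit of UD's SpaceAfter=No)."""
    lines = []
    for match in TOKEN_RE.finditer(sentence):
        followed_by_space = (match.end() < len(sentence)
                             and sentence[match.end()].isspace())
        lines.append("%s\t%s" % (match.group(),
                                 "_" if followed_by_space else "SpaceAfter=No"))
    return "\n".join(lines)

if __name__ == "__main__":
    # Quotes and dashes keep the information about the surrounding spaces.
    print(one_token_per_line('He made the decision -- "drastically" -- on his own.'))

Following the UD tokenization definition, as Veronika suggests, would cover this point as well, since UD encodes the same information with SpaceAfter=No.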
5. Multilingual issues
- (Voula) more examples are needed (also in other languages); they could be provided by the LGLs
- (Behrang) how are we going to port the guidelines to different languages? see e.g. the issues of combination vs. incorporation in Farsi
- (Agata) including (glossed and translated) examples from other languages would make the guidelines huge and hard to read:
  + the common part should be in English with English examples; the tests and examples should be numbered
  + the language(-family)-specific sections should contain examples in these languages, with references to the tests and examples in the English section
  + language teams should make sure that they include an expert linguist

6. Choice of corpus
- (Voula) the procedure for selecting the corpus to annotate is too vague; newspapers contain texts from different domains, e.g. texts on politics are very different from those on sports
- (Voula) maybe we should recommend a particular domain for the corpus
- (Fabienne, Marie) choosing a domain would be too restrictive, e.g. it would prevent some teams from reusing already available corpus annotations
- let's recommend newspaper texts from no specific technical domain (no sports, weather forecasts, economics, etc.)

7. Annotation format and environment
- (Voula) a simple text format is OK; it is better than a new annotation tool
- (Federico, Veronika, Voula) OK for a spreadsheet
- (Marie) it's important to be able to work off-line
- (Federico) Google Spreadsheets enables working off-line with Chrome
- we should not see the other annotations already available; each annotator should have her/his own version of the spreadsheet
- a meta-spreadsheet is needed to keep track of the language groups and of the files assigned to languages and annotators (a possible layout is sketched after section 8)

8. Miscellaneous
- we should decide whether or not to recommend using a corpus with other annotation layers
- target applications:
  + (Carlos) an application-driven point of view is interesting because it requires reasoning about semantics
  + (Agata) a small sub-part of the final corpus will be parallel, but not in the pilot annotation
  + an MT-oriented annotation would be too ambitious; the corpora will be largely non-aligned, since we want to avoid translationese issues, and they are going to be too scarce anyway
- (Marie) evaluation modalities - two tracks are possible:
  + one in which only the corpora and pre-processing tools made available are used
  + another in which the use of external resources is allowed
- roles:
  + (Behrang) roles should be defined more clearly: annotators do not necessarily choose the corpora; the final guidelines should be addressed only to annotators
  + (Agata) roles like managers and annotators can be distinguished in the actual annotation after April, but the pilot annotation will usually be done by the language team managers anyway
- organization within the language groups:
  + each leader organizes her/his work individually (notably as far as communication is concerned: emails, a ticket system like GitHub, etc.)
  + (Voula) in the "other languages" section an extra hierarchy might be needed (only the leaders of language teams communicate directly with Voula)
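As a concrete illustration of the meta-spreadsheet mentioned in section 7, here is a minimal sketch (the column names, file names, annotator identifiers and status values are purely illustrative assumptions, not an agreed layout) that writes such a tracking table as a CSV file, one row per corpus file with its language group, assigned annotator and annotation status:

import csv

# Illustrative rows only; the real file names, annotator identifiers and
# status values would be decided by the language group leaders.
ROWS = [
    ("French",    "fr_news_001.txt", "annotator_1", "assigned"),
    ("Hungarian", "hu_news_001.txt", "annotator_2", "in progress"),
    ("Greek",     "el_news_001.txt", "annotator_3", "not started"),
]

with open("meta_spreadsheet.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["language_group", "file", "annotator", "status"])
    writer.writerows(ROWS)

Such a CSV can be imported into the Google Spreadsheets set-up that Federico will configure.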
9. Internal deadlines
- now: inform the language teams about the corpus choice and the new schedule
- 15 December: tokenizer definition
- 15 December: enhanced guidelines sent to the LGLs
- 21 December: feedback from the LGLs
- 30 December: guidelines published; tokenizer and spreadsheets ready
- 4 January: start of the pilot annotation

==========

TODO:

Agata:
+ send the French guidelines to Veronika
+ update the schedule on the web page
+ inform the language teams about the corpus choice and the new schedule
+ enhance the definition of the tokenization (see Universal Dependencies)
+ specify that tokenization errors should be signaled but not corrected
+ delete the constraint of two different writing styles; add a specification on avoiding technical sublanguage
+ set up a Doodle poll for the Skype talk after the 1st phase

Veronika + Agata:
+ add more examples to the guidelines, number the tests and the examples, add extra tests, reorganize the tests into a "decision tree"
+ add a specific section on Hungarian, with references to the English section

Federico:
+ configure the Google Spreadsheets according to the specifications:
  + be able to work off-line easily
  + have a control of the format
  + one spreadsheet per annotator
  + a meta-spreadsheet for the management of annotators and files

Behrang:
+ test the Stanford tokenizer - see how specific it can be made; adapt it to the new tokenization specifications (see the example sketch at the end of these minutes)

LGLs:
+ feedback on the enhanced annotation guidelines
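For Behrang's task of testing the Stanford tokenizer, a minimal sketch of how it could be tried out on a sample file is given below. It assumes a locally available Stanford CoreNLP jar (the jar name "stanford-corenlp.jar" and the input file "sample.txt" are placeholders) and calls the documented PTBTokenizer main class; the options available for adapting its behaviour to our tokenization specifications should be checked against the installed version.

import subprocess

def run_stanford_tokenizer(input_file, jar_path="stanford-corenlp.jar"):
    """Tokenize a plain-text file with the Stanford PTBTokenizer and return
    its standard output (one token per line by default)."""
    result = subprocess.run(
        ["java", "-cp", jar_path,
         "edu.stanford.nlp.process.PTBTokenizer", input_file],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # "sample.txt" is a placeholder for a small excerpt of the chosen corpus.
    print(run_stanford_tokenizer("sample.txt"))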