PARSEME shared task
Minutes from the Skype meeting
11 February 2015

Antoine, Carlos (ceramisch), Fabienne (fritzife03), Marie (mariecandito), Voula (vgiouli1), Simon (simon.krek.jsi), Veronika (akinorev1981), Behrang (behrang.qasemizadeh), Federico (kercos), Agata (agata.savary)

=============================================
MINUTES
=============================================

=============================================
I. ADVANCES IN THE LANGUAGE GROUPS

1. VOULA:
- 6 languages, all responded, delays in the Hebrew group
- Farsi, Hungarian, Turkish, Greek, Maltese
- no conclusions yet

2. SIMON
- coordination started late
- annotation done on Bulgarian (with extra subcategorization)
- very well documented: http://dcl.bas.bg/en/parseme-shared-task-phase-1/
- Slovene done (in parallel to another project, other MWE types are annotated too), conversion to the common format not yet done
- Polish done - feedback to send soon
- no news from Croatian
- a separate Skype on Slavic languages needed

3. MARIE & CARLOS
- French, Romanian, Portuguese, Spanish - done; Italian - not received (Federico: one annotator completed the task)
- feedback:
  * light from Romanian,
  * detailed from Spanish,
  * detailed from French
- some comments added to the guidelines on Google Docs
- see also detailed comments at: https://docs.google.com/document/d/1ae1l1mQLzB1wBkSaUWtMr0IrfcNHfPEWN_h3a1j6pg4/edit#heading=h.zfldmb8n5308

4. FABIENNE
- hard start
- German done by Fabienne (native) and Agata (non-native); German recruitment in progress (3 expressions of interest from Glorianna, Simone, Carla)
- Swedish to come very soon (3 annotators recruited)
- English - only one active person (Ismaïl - accepted to be the leader), more to recruit, corpus ready to annotate
- Yiddish - to come in several weeks (students hired), asked for the generic parser

=============================================
II. FEEDBACK ON TOOLS

- Fabienne: help needed for some annotators to be introduced to the spreadsheets; using Google spreadsheets locally works fine
- Voula: a tokenizer problem with the Greek character set - solved by Behrang
- Carlos: generic tokenizer used for Portuguese
- Carlos: Spanish people report that Google spreadsheets are slow (Agata: try working with Chrome/Chromium rather than Firefox)
- Carlos: following the VMWE identifiers is a bit hard for longer sentences; in the final annotation tools we should avoid having to do it manually
- Federico: spreadsheet use becomes problematic for more than a few thousand lines; a more robust solution is needed for the true annotation
- Federico: unsure if everyone uses the same spreadsheets; they are not all referenced from the master spreadsheet
- Behrang: Farsi uses a template copied to local Excel files
- Behrang: inter-annotator agreement tools needed soon

=================
III. TOKENIZATION ISSUES

- Marie+Carlos:
  * the token vs. word distinction is confusing, and it is much less crucial than other difficulties; the documentation about these issues is scattered among the guidelines and web pages
  * maybe the decision of using vs. not using this distinction should be left to the language groups
- Voula:
  * the issue of the multi-token words might be crucial for Turkish

================
IV. FEEDBACK ON THE GUIDELINES

1. VPCs
- Fabienne: unsure so far why it is interesting to annotate them (except that the particle often occurs very far from the verb); non-compositionality hard to define for these constructions
- Veronika: they seem relevant for Hungarian and German; they are very frequently semantically non-compositional
- Voula: unclear if they should be annotated
  * Greek: they are very rare (2 occurrences)
  * Farsi: they do not exist
- Simon:
  * annotators very often disagree on this category
  * only 1 borderline case in Bulgarian
  * Slovenian - very bad agreement on this category
  * VPCs seem irrelevant for Slavic languages
- Carlos:
  * problem with the interpretation of guidelines: V+Preps were annotated by the Spanish team
  * VPCs seem irrelevant for Romance languages
- conclusions:
  * VPCs should NOT include verbs+prepositions
  * VPCs should not be a universal category
  * an English native speaker should be recruited to discuss the VPC annotation in Germanic languages

======
2. LVCs
- Simon:
  * category relevant for Slavic languages
  * less frequent than in English or French
  * specificities to discuss in the group
- Marie:
  * category relevant to Romance languages
  * some criteria should be re-phrased (e.g. 12) to be more universal across languages (e.g. replace "possessive" by "argument")
  * criterion 14 - V+N combinations with highly polysemous verbs are hard to judge (one of their regular senses can be concerned); maybe a special treatment is needed for these verbs
  * criterion 15 - valid only if the verb alone allows passivization
- Voula:
  * category relevant to "other" languages
  * Greek - issues of coordinations (Agata: see the "letting us in and out" example at: http://typo.uni-konstanz.de/parseme/index.php/2-general/151-parseme-shared-task-pilot-annotation)
  * Farsi
    + the meaning of light verbs is not very clear; more language-specific guidelines are needed
    + specific problem: elliptical LVCs (verb is omitted, its omission is signaled by a preposition)
      * example: "give lecture" -> "-lecture"
  * gathering similar examples in many languages would be useful
- Fabienne:
  * category relevant to Germanic languages, quite frequent occurrences
  * it is unclear if all the tests should work to classify a candidate as an LVC
- Carlos:
  * imprecise guidelines: section 5, step 2: one type of idiosyncrasy is enough; step 3: it is category-specific if one, many or all tests have to apply
- Agata:
  * issues with subject+verb ("the problem lies in sth") and verb+PP combinations - some tests cannot apply since they are oriented towards direct objects (Marie: subj+verb combinations should probably be moved to OTH; verb+PP are very rare in the LVC class)
- conclusions:
  * LVCs seem a universal category
  * the guidelines should be made more universal and more precise
  * language-specific sections should be developed for some languages

======
3. ID
- Carlos&Marie:
  * relevant to Romance languages
  * the status of idioms containing a verb but not functioning as a verb (e.g. "peut-être" 'maybe') is unclear (Agata: see the open problem in the guidelines)
  * some problems with identifying the fixed part of the expression, especially if the verb may or may not be mandatory (e.g. "être au grand complet" or "au grand complet"?)
- Fabienne:
  * relevant to Germanic languages
  * same problem of including the verb or not
- Simon:
  * relevant to Slavic languages
  * Bulgarian - fine sub-classification of idioms
- Voula:
  * relevant to "other" languages
  * this seems to be the easiest class to annotate
- conclusions:
  * IDs seem a universal category
  * idioms where the head verb is a copula need extra decisions/tests

======
4. SENT
- Voula:
  * not difficult to annotate
- Marie:
  * difficult to distinguish from collocations (due to the compositional meaning)
  * the criteria should be more specific for them
- Simon:
  * maybe SENT should be merged with OTH?
- conclusions: unclear

========================
5. NOMINALIZATIONS
- Voula: it is not quite clear what is meant by a nominalization in this task (Agata: gerunds and participles only)

========================
6. REFLEXIVE VERBS:
- Agata:
  * formally they are idioms since the 'self' particle is, syntactically, the direct object
  * we might annotate those verb+reflexive particle combinations which are semantically non-compositional (e.g. PL: "znajdować się" 'to find oneself' has nothing to do with finding)?
- Simon: only very clear cases should be annotated; in Bulgarian they are placed in the OTH class
- Marie&Carlos: in favor of annotating them with a specific category
- Fabienne: should definitely be annotated if the verb itself is complex ("sich Sorgen machen"), unsure for simple verbs with reflexive particles
- Voula: no reflexive verbs in Greek
- Behrang: it's better not to add a new category, but to add reflexive verbs to OTH (filtering them out, if needed, will be relatively easy)
- Carlos: reflexive verbs have specific tests so it is better to make a separate category for them
- Agata: this is not a universal category but it applies to at least 2 whole language groups and to some isolated languages (e.g. German but not English); what should be its place in the guidelines?
  * option 1: put them into OTH
    + advantages: keeping the number of categories lower; this is not a universal category so putting them here fits the guidelines
    + drawbacks: specific tests will have to appear in each relevant language; the individual descriptions will be redundant and inconsistent with each other
  * option 2: make them a new universal category; the language specificity will then be not to have this category
    + advantages: the description will be more consistent and non-redundant for the languages concerned
    + drawbacks: the examples cannot be in English, contrary to all other categories in the universal guidelines
  * option 3: introduce an additional level between universal and language-specific
- the choice to be made after phase 2

======================
7. CHOICE OF THE CORPUS FOR PHASE 2
- corpus selection technique
  * Voula: the Greek group pre-selected sentences containing MWEs, thus the density of MWEs is very high
  * Carlos: this technique introduces a bias since we only annotate those MWEs that we know in advance
  * Agata: negative examples are also needed (to train tools)
  * Voula: the corpus selection should be more precisely described
- genre
  * Agata: the Polish corpus of local news had very few VMWEs
  * Voula: in Greek, on the contrary, many VMWEs were found
  * Carlos: it would be more interesting not to restrict ourselves to one genre (e.g. VMWEs might be more frequent and of a different nature in speech)
  * Voula: if the genre is not restricted, we don't have enough coverage of the phenomena to study
- proposal for phase 2:
  * open the choice of text genres for phase 2 for those who wish, draw conclusions before the final annotation

========================
8. PAPER:
- long paper for ACL, if rejected, re-submission to the ACL workshop
- editing in Google Docs
- contents: focusing on annotating the data, rather than on the shared task (also for the reasons of anonymity)

========================
9. DEADLINES:
- [language group leaders] 18 February: feedback from language groups (conclusions + comments on Google Docs)
- [Carlos+Marie] 23 February: guidelines on reflexive verbs with examples in Romance languages
- [Agata+Veronika] 25 February: enhanced guidelines
- [all] 25 February - 20 March: pilot annotation - phase 2
- [language leaders] 5 March: language-specific sections in the guidelines

==========
TODO:

Simon:
- organize a Skype for Slavic languages
- feedback on the guidelines by 18 February

Fabienne:
- discuss language-specific issues within the group (annotating compounds within VMWEs, annotating separable VPCs)
- feedback on the guidelines by 18 February
- find a native English speaker for the annotation of the English corpus
  * names of experts who worked with VPCs, proposed by Carlos: Nathan Schneider, Paul Cook, Diana McCarthy, Suzanne Stevenson, Colin Bannard, Tim Baldwin, Yuancheng Tu (native speaker?), Sabine Schulte im Walde (German but has worked quite a lot with VPCs)

Fabienne+Veronika:
- discuss VPCs in German and Hungarian; describe them in the language-specific part of the guidelines

Carlos+Marie:
- write a section on reflexive verbs (new category) with examples in Romance languages

Language group leaders:
- inform the language leaders on phase 2
  * open choice of corpus genre,
  * changes to the guidelines,
  * slightly shifted timeline
  * necessity to develop language-specific sections of the guidelines

All language leaders:
- produce a language-specific section of the guidelines (after 25 February)

Agata+Veronika:
- include an explanation of token vs. word in the guidelines; stress that this distinction can be abandoned by a particular language team
- make clear that verbs + prepositions are not VPCs
- exclude VPCs from the universal categories
- add examples to the guidelines where all tests are applied to one MWE
- add a criterion to delimit collocations (when a verb selects a whole semantic class)
- "have" + "go" - try to formulate specific rules
- add a new category - reflexive verbs with specific tests
- open types of texts for phase 2 (any type) - to prepare the final choice
- make the corpus selection procedure clearer (no pre-selection of sentences with MWEs)
- see detailed comments from Marie+Carlos
- plan the development of an IA-agreement calculation tool
- plan the development of final annotation tools
- plan of the paper on Google Docs
- prepare samples of comparable examples in many languages for the Annotathon
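
Note on the IA-agreement calculation tool (illustrative sketch only, not a decision of the meeting): one common way to measure agreement between two annotators is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The Python sketch below assumes a simplified setting that was not specified in the minutes: each annotator assigns exactly one label per token, the category names (LVC, ID, VPC, OTH) are taken from the guidelines discussion, and "NONE" is a hypothetical label for tokens outside any VMWE. The function name and the sample data are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: one label per item (here: per token) and per annotator.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical per-token labels from two annotators of the same sentence.
annotator_1 = ["LVC", "ID", "NONE", "NONE", "VPC", "NONE"]
annotator_2 = ["LVC", "OTH", "NONE", "NONE", "VPC", "NONE"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Per-token labels are a simplification: VMWEs span several, possibly discontinuous tokens, so a real tool would probably also need a span-level measure. This sketch only shows the chance-corrected calculation itself.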