<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2920">
  <Title>CoNLL-X shared task on Multilingual Dependency Parsing</Title>
  <Section position="5" start_page="149" end_page="149" type="metho">
    <SectionTitle>
2 Previous research
</SectionTitle>
    <Paragraph position="0"> Tesni`ere (1959) introduced the idea of a dependency tree (a &amp;quot;stemma&amp;quot; in his terminology), in which words stand in direct head-dependent relations, for representing the syntactic structure of a sentence.</Paragraph>
    <Paragraph position="1"> Hays (1964) and Gaifman (1965) studied the formal properties of projective dependency grammars, i.e. those where dependency links are not allowed to cross. Mel'Vcuk (1988) describes a multistratal dependency grammar, i.e. one that distinguishes between several types of dependency relations (morphological, syntactic and semantic). Other theories related to dependency grammar are word grammar 3Some though had significantly less time: One participant registered as late as six days before the test data release (registration was a prerequisite to obtain most of the data sets) and still went on to submit parsed test data in time.</Paragraph>
    <Paragraph position="2"> (Hudson, 1984) and link grammar (Sleator and Temperley, 1993).</Paragraph>
    <Paragraph position="3"> Some relatively recent rule-based full dependency parsers are Kurohashi and Nagao (1994) for Japanese, Oflazer (1999) for Turkish, Tapanainen and J&amp;quot;arvinen (1997) for English and Elworthy (2000) for English and Japanese.</Paragraph>
    <Paragraph position="4"> While phrase structure parsers are usually evaluated with the GEIG/PARSEVAL measures of precision and recall over constituents (Black et al., 1991), Lin (1995) and others have argued for an alternative, dependency-based evaluation. That approach is based on a conversion from constituent structure to dependency structure by recursively defining a head for each constituent.</Paragraph>
    <Paragraph position="5"> The same idea was used by Magerman (1995), who developed the first &amp;quot;head table&amp;quot; for the Penn Treebank (Marcus et al., 1994), and Collins (1996), whose constituent parser is internally based on probabilities of bilexical dependencies, i.e. dependencies between two words. Collins (1997)'s parser and its reimplementation and extension by Bikel (2002) have by now been applied to a variety of languages: English (Collins, 1999), Czech (Collins et al., 1999), German (Dubey and Keller, 2003), Spanish (Cowan and Collins, 2005), French (Arun and Keller, 2005), Chinese (Bikel, 2002) and, according to Dan Bikel's web page, Arabic.</Paragraph>
    <Paragraph position="6"> Eisner (1996) introduced a data-driven dependency parser and compared several probability models on (English) Penn Treebank data. Kudo and Matsumoto (2000) describe a dependency parser for Japanese and Yamada and Matsumoto (2003) an extension for English. Nivre's parser has been tested for Swedish (Nivre et al., 2004), English (Nivre and Scholz, 2004), Czech (Nivre and Nilsson, 2005), Bulgarian (Marinov and Nivre, 2005) and Chinese Cheng et al. (2005), while McDonald's parser has been applied to English (McDonald et al., 2005a), Czech (McDonald et al., 2005b) and, very recently, Danish (McDonald and Pereira, 2006).</Paragraph>
  </Section>
  <Section position="6" start_page="149" end_page="151" type="metho">
    <SectionTitle>
3 Data format, task definition
</SectionTitle>
    <Paragraph position="0"> The training data derived from the original treebanks (see Section 4) and given to the shared task participants was in a simple column-based format that is  an extension of Joakim Nivre's Malt-TAB format4 for the shared task and was chosen for its processing simplicity. All the sentences are in one text file and they are separated by a blank line after each sentence. A sentence consists of one or more tokens. Each token is represented on one line, consisting of 10 fields. Fields are separated from each other by a TAB.5 The 10 fields are: 1) ID: Token counter, starting at 1 for each new sentence.</Paragraph>
    <Paragraph position="1"> 2) FORM: Word form or punctuation symbol.</Paragraph>
    <Paragraph position="2"> For the Arabic data only, FORM is a concatenation of the word in Arabic script and its transliteration in Latin script, separated by an underscore. This representation is meant to suit both those that do and those that do not read Arabic.</Paragraph>
    <Paragraph position="3"> 3) LEMMA: Lemma or stem (depending on the particular treebank) of word form, or an underscore if not available. Like for the FORM, the values for Arabic are concatenations of two scripts.</Paragraph>
    <Paragraph position="4">  4) CPOSTAG: Coarse-grained part-of-speech tag, where the tagset depends on the treebank.</Paragraph>
    <Paragraph position="5"> 5) POSTAG: Fine-grained part-of-speech tag,  where the tagset depends on the treebank. It is identical to the CPOSTAG value if no POSTAG is available from the original treebank.</Paragraph>
    <Paragraph position="6"> 6) FEATS: Unordered set of syntactic and/or morphological features (depending on the particular treebank), or an underscore if not available. Set members are separated by a vertical bar (|).</Paragraph>
    <Paragraph position="7"> 7) HEAD: Head of the current token, which is either a value of ID, or zero ('0') if the token links to the virtual root node of the sentence. Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD value of zero. 8) DEPREL: Dependency relation to the HEAD.</Paragraph>
    <Paragraph position="8"> The set of dependency relations depends on the particular treebank. The dependency relation of a token with HEAD=0 may be meaningful or simply  shared task data, field values are also not supposed to contain any other whitespace (although unfortunately some spaces slipped through in the Spanish data).</Paragraph>
    <Paragraph position="9"> resulting from the PHEAD column is guaranteed to be projective (but is not available for all data sets), whereas the structure resulting from the HEAD column will be non-projective for some sentences of some languages (but is always available).</Paragraph>
    <Paragraph position="10"> 10) PDEPREL: Dependency relation to the PHEAD, or an underscore if not available.</Paragraph>
    <Paragraph position="11"> As should be obvious from the description above, our format assumes that each token has exactly one head. Some dependency grammars, and also some treebanks, allow tokens to have more than one head, although often there is a distinction between primary and optional secondary relations, e.g. in the Danish Dependency Treebank (Kromann, 2003), the Dutch Alpino Treebank (van der Beek et al., 2002b) and the German TIGER treebank (Brants et al., 2002).</Paragraph>
    <Paragraph position="12"> For this shared task we decided to ignore any additional relations. However the data format could easily be extended with additional optional columns in the future. Cycles do not occur in the shared task data but are scored as normal if predicted by parsers. The character encoding of all data files is Unicode (specifically UTF-8), which is the only encoding to cover all languages and therefore ideally suited for multilingual parsing.</Paragraph>
    <Paragraph position="13"> While the training data contained all 10 columns (although sometimes only with dummy values, i.e.</Paragraph>
    <Paragraph position="14"> underscores), the test data given to participants contained only the first 6. Participants' parsers then predicted the HEAD and DEPREL columns (any predicted PHEAD and PDEPREL columns were ignored). The predicted values were compared to the gold standard HEAD and DEPREL.6 The official evaluation metric is the labeled attachment score (LAS), i.e. the percentage of &amp;quot;scoring&amp;quot; tokens for which the system has predicted the correct HEAD and DEPREL. The evaluation script defines a non-scoring token as a token where all characters of the FORM value have the Unicode category property  languages and instructions on how to get the rest, the software used for the treebank conversions, much documentation, full results and other related information will be available from the permanent URL http://depparse.uvt.nl (also linked from the CoNLL web page).</Paragraph>
    <Paragraph position="15"> 7See man perlunicode for the technical details and the shared task website for our reasons for this decision. Note that an underscore and a percentage sign also have the Unicode &amp;quot;Punctuation&amp;quot; property.</Paragraph>
    <Paragraph position="16">  We tried to take a test set that was representative of the genres in a treebank and did not cut through text samples. We also tried to document how we selected this set.8 We aimed at having roughly the same size for the test sets of all languages: 5,000 scoring tokens. This is not an exact requirement as we do not want to cut sentences in half. The relatively small size of the test set means that even for the smallest treebanks the majority of tokens is available for training, and the equal size means that for the overall ranking of participants, we can simply compute the score on the concatenation of all test sets.</Paragraph>
  </Section>
  <Section position="7" start_page="151" end_page="153" type="metho">
    <SectionTitle>
4 Treebanks and their conversion
</SectionTitle>
    <Paragraph position="0"> In selecting the treebanks, practical considerations were the major factor. Treebanks had to be actually available, large enough, have a license that allowed free use for research or kind treebank providers who temporarily waived the fee for the shared task, and be suitable for conversion into the common format within the limited time. In addition, we aimed at a broad coverage of different language families.9 As a general rule, we did not manually correct errors in treebanks if we discovered some during the conversion, see also Buchholz and Green (2006), although we did report them to the treebank providers and several got corrected by them.</Paragraph>
    <Section position="1" start_page="151" end_page="153" type="sub_section">
      <SectionTitle>
4.1 Dependency treebanks
</SectionTitle>
      <Paragraph position="0"> We used the following six dependency treebanks:  bank, Otakar SmrVz for valuable help during the conversion and thanks again to Jan HajiVc, Christopher Cieri and Tony Castelletto. null 12Many thanks to the SDT people for granting the special license for CoNLL-X and to TomaVz Erjavec for converting the  son, 1976; Nilsson et al., 2005); Turkish: Metu-Sabanci treebank15 (Oflazer et al., 2003; Atalay et al., 2003).</Paragraph>
      <Paragraph position="1"> The conversion of these treebanks was the easiest task as the linguistic representation was already what we needed, so the information only had to be converted from SGML or XML to the shared task format. Also, the relevant information had to be distributed appropriately over the CPOSTAG, POSTAG and FEATS columns.</Paragraph>
      <Paragraph position="2"> For the Swedish data, no predefined distinction into coarse and fine-grained PoS was available, so the two columns contain identical values in our format. For the Czech data, we sampled both our training and test data from the official &amp;quot;training&amp;quot; partition because only that one contains gold standard PoS tags, which is also what is used in most other data sets. The Czech DEPREL values include the suffixes to mark coordination, apposition and parenthesis, while these have been ignored during the conversion of the much smaller Slovene data. For the Arabic data, sentences with missing annotation were filtered out during the conversion.</Paragraph>
      <Paragraph position="3"> The Turkish treebank posed a special problem because it analyzes each word as a sequence of one or more inflectional groups (IGs). Each IG consists of either a stem or a derivational suffix plus all the inflectional suffixes belonging to that stem/derivational suffix. The head of a whole word is not just another word but a specific IG of another word.16 One can easily map this representation to one in which the head of a word is a word but that treebank for us.</Paragraph>
      <Paragraph position="4"> 13Many thanks to Matthias Trautner Kromann and assistants for creating the DDT and releasing it under the GNU General Public License and to Joakim Nivre, Johan Hall and Jens Nilsson for the conversion of DDT to Malt-XML.</Paragraph>
      <Paragraph position="5"> 14Many thanks to Jens Nilsson, Johan Hall and Joakim Nivre for the conversion of the original Talbanken to Talbanken05 and for making it freely available for research purposes and to Joakim Nivre again for prompt and proper respons to all our questions.</Paragraph>
      <Paragraph position="6"> 15Many thanks to Bilge Say and Kemal Oflazer for granting the license for CoNLL-X and answering questions and to G&amp;quot;uls,en EryiVgit for making many corrections to the treebank and discussing some aspects of the conversion.</Paragraph>
      <Paragraph position="7"> 16This is a bit like saying that in &amp;quot;the usefulness of X for Y&amp;quot;, &amp;quot;for Y&amp;quot; links to &amp;quot;use-&amp;quot; and not to &amp;quot;usefulness&amp;quot;. Only that in Turkish, &amp;quot;use&amp;quot;, &amp;quot;full&amp;quot; and &amp;quot;ness&amp;quot; each could have their own inflectional suffixes attached to them.</Paragraph>
      <Paragraph position="8">  mapping would lose information and it is not clear whether the result is linguistically meaningful, practically useful, or even easier to parse because in the original representation, each IG has its own PoS and morphological features, so it is not clear how that information should be represented if all IGs of a word are conflated. We therefore chose to represent each IG as a separate token in our format. To make the result a connected dependency structure, we defined the HEAD of each non-word-final IG to be the following IG and the DEPREL to be &amp;quot;DERIV&amp;quot;. We assigned the stem of the word to the first IG's LEMMA column, with all non-first IGs having LEMMA ' ', and the actual word form to the last IG, with all non-last IGs having FORM ' '. As already mentioned in Section 3, the underscore has the punctuation character property, therefore non-last IGs (whose HEAD and DEPREL were introduced by us) are not scoring tokens. We also attached or reattached punctuation (see the README available at the shared task web-site for details.) 4.2 Phrase structure with functions for all constituents We used the following five treebanks of this type: German: TIGER treebank17 (Brants et al., 2002); Japanese: Japanese Verbmobil treebank18 (Kawata and Bartels, 2000); Portuguese: The Bosque part of the Floresta sint'a(c)tica19 (Afonso et al., 2002); Dutch: Alpino treebank20 (van der Beek et al., 2002b; van der Beek et al., 2002a); Chinese: Sinica 17Many thanks to the TIGER team for allowing us to use the treebank for the shared task and to Amit Dubey for converting the treebank.</Paragraph>
      <Paragraph position="9"> 18Many thanks to Yasuhiro Kawata, Julia Bartels and colleagues from T&amp;quot;ubingen University for the construction of the original Verbmobil treebank for Japanese and to Sandra K&amp;quot;ubler for providing the data and granting the special license for CoNLL-X.</Paragraph>
      <Paragraph position="10"> 19Many thanks to Diana Santos, Eckhard Bick and other Floresta sint(c)tica project members for creating the treebank and making it publicly available, for answering many questions about the treebank (Diana and Eckhard), for correcting problems and making new releases (Diana), and for sharing scripts and explaining the head rules implemented in them (Eckhard). Thanks also to Jason Baldridge for useful discussions and to Ben Wing for independently reporting problems which Diana then fixed.</Paragraph>
      <Paragraph position="11"> 20Many thanks to Gertjan van Noord and the other people at the University of Groningen for creating the Alpino Treebank and releasing it for free, to Gertjan van Noord for answering all our questions and for providing extra test material and to Antal van den Bosch for help with the memory-based tagger.</Paragraph>
      <Paragraph position="12"> treebank21 (Chen et al., 2003).</Paragraph>
      <Paragraph position="13"> Their conversion to dependency format required the definition of a head table. Fortunately, in contrast to the Penn Treebank for which the head table is based on POS22 we could use the grammatical functions annotated in these treebanks.</Paragraph>
      <Paragraph position="14"> Therefore, head rules are often of the form: the head child of a VP/clause is the child with the HD/predicator/hd/Head function. The DEPREL value for a token is the function of the biggest constituent of which this token is the lexical head. If the constituent comprising the complete sentence did not have a function, we gave its lexical head token the DEPREL &amp;quot;ROOT&amp;quot;.</Paragraph>
      <Paragraph position="15"> For the Chinese treebank, most functions are not grammatical functions (such as &amp;quot;subject&amp;quot;, &amp;quot;object&amp;quot;) but semantic roles (such as &amp;quot;agent&amp;quot;, &amp;quot;theme&amp;quot;). For the Portuguese treebank, the conversion was complicated by the fact that a detailed specification existed which tokens should be the head of which other tokens, e.g. the finite verb must be the head of the subject and the complementzier but the main verb must be the head of the complements and adjuncts.23 Given that the Floresta sint'a(c)tica does not use traditional VP constituents but rather verbal chunks (consisting mainly of verbs), a simple Magerman-Collins-style head table was not sufficient to derive the required dependency structure. Instead we used a head table that defined several types of heads (syntactic, semantic) and a link table that specified what linked to which type of head.24 Another problem existed with the Dutch treebank. Its original PoS tag set is very coarse and the PoS and the word stem information is not very reliable.25 We therefore decided to retag the tree-bank automatically using the Memory-Based Tagger (MBT) (Daelemans et al., 1996) which uses a very fine-grained tag set. However, this created a problem with multiwords. MBT does not have the concept of multiwords and therefore tags all of their 21Many thanks to Academia Sinica for granting the temporary license for CoNLL-X, to Keh-Jiann Chen for answering our questions and to Amit Dubey for converting the treebank. 22containing rules such as: the head child of a VP is the left-most &amp;quot;to&amp;quot;, or else the leftmost past tense verb, or else etc.  components individually. As Alpino does not provide an internal structure for multiwords, we had to treat multiwords as one token. However, we then lack a proper PoS for the multiword. After much discussion, we decided to assign each multi-word the CPOSTAG &amp;quot;MWU&amp;quot; (multiword unit) and a POSTAG which is the concatenation of the PoS of all the components as predicted by MBT (separated by an underscore). Likewise, the FEATS are a concatenation of the morphological features of all components. This approach resulted in many different POSTAG values for the training set and even in unseen values in the test set. It remains to be tested whether our approach resulted in data sets better suited for parsing than the original.</Paragraph>
      <Paragraph position="16"> 4.3 Phrase structure with some functions We used two treebanks of this type: Spanish: Cast3LB26 (Civit Torruella and Mart'i Anton'in, 2002; Navarro et al., 2003; Civit et al., 2003); Bulgarian: BulTreeBank27 (Simov et al., 2002; Simov and Osenova, 2003; Simov et al., 2004; Osenova and Simov, 2004; Simov et al., 2005).</Paragraph>
      <Paragraph position="17"> Converting a phrase structure treebank with only a few functions to a dependency format usually requires linguistic competence in the treebank's language in order to create the head table and missing function labels. We are grateful to Chanev et al. (2006) for converting the BulTreeBank to the shared task format and to Montserrat Civit for providing us with a head table and a function mapping</Paragraph>
    </Section>
    <Section position="2" start_page="153" end_page="153" type="sub_section">
      <SectionTitle>
4.4 Data set characteristics
</SectionTitle>
      <Paragraph position="0"> Table 1 shows details of all data sets. Following Nivre and Nilsson (2005) we use the following definition: &amp;quot;an arc (i, j) is projective iff all nodes occurring between i and j are dominated by i (where dominates is the transitive closure of the arc rela26Many thanks to Montserrat Civit and Toni Mart'i for allowing us to use Cast3LB for CoNLL-X and to Amit Dubey for converting the treebank.</Paragraph>
      <Paragraph position="1"> 27Many thanks to Kiril Simov and Petya Osenova for allowing us to use the BulTreeBank for CoNLL-X.</Paragraph>
      <Paragraph position="2"> 28Although unfortunately, due to a bug, the function list was not used and the Spanish data in the shared task ended up with many DEPREL values being simply ' '. By the time we discovered this, the test data release date was very close and we decided not to release new bug-fixed training material that late. tion)&amp;quot;.29</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="153" end_page="159" type="metho">
    <SectionTitle>
5 Approaches
</SectionTitle>
    <Paragraph position="0"> Table 2 tries to give an overview of the wide variety of parsing approaches used by participants. We refer to the individual papers for details. There are several dimensions along which to classify approaches.</Paragraph>
    <Section position="1" start_page="153" end_page="153" type="sub_section">
      <SectionTitle>
5.1 Top-down, bottom-up
</SectionTitle>
      <Paragraph position="0"> Phrase structure parsers are often classified in terms of the parsing order: top-down, bottom-up or various combinations. For dependency parsing, there seem to be two different interpretations of the term &amp;quot;bottom-up&amp;quot;. Nivre and Scholz (2004) uses this term with reference to Yamada and Matsumoto (2003), whose parser has to find all children of a token before it can attach that token to its head.</Paragraph>
      <Paragraph position="1"> We will refer to this as &amp;quot;bottom-up-trees&amp;quot;. Another use of &amp;quot;bottom-up&amp;quot; is due to Eisner (1996), who introduced the notion of a &amp;quot;span&amp;quot;. A span consists of a potential dependency arc r between two tokens i and j and all those dependency arcs that would be spanned by r, i.e. all arcs between tokens k and l with i [?] k,l [?] j. Parsing in this order means that the parser has to find all children and siblings on one side of a token before it can attach that token to a head on the same side.</Paragraph>
      <Paragraph position="2"> This approach assumes projective dependency structures. Eisner called this approach simply &amp;quot;bottomup&amp;quot;, while Nivre, whose parser implicitly also follows this order, called it &amp;quot;top-down/bottom-up&amp;quot; to distinguish it from the pure &amp;quot;bottom-up(-trees)&amp;quot; order of Yamada and Matsumoto (2003). To avoid confusion, we will refer to this order as &amp;quot;bottom-upspans&amp;quot;. null</Paragraph>
    </Section>
    <Section position="2" start_page="153" end_page="156" type="sub_section">
      <SectionTitle>
5.2 Unlabeled parsing versus labeling
</SectionTitle>
      <Paragraph position="0"> Given that the parser needs to predict the HEAD as well as the DEPREL value, different approaches are possible: predict the (probabilities of the) HEADs of all tokens first, or predict the (probabilities of the) DEPRELs of all tokens first, or predict the HEAD and DEPREL of one token before predicting these values for the next token. Within the first approach, each dependency can be labeled independently (Corston-Oliver and Aue, 2006) or a 29Thanks to Joakim Nivre for explaining this.</Paragraph>
      <Paragraph position="1">  Sino-Tibetan, Slavic, Germanic, Japonic (or language isolate), Romance, Ural-Altaic); number of genres, and genre if only one (news, dialogue, novel); type of annotation (d=dependency, c=constituents, dc=discontinuous constituents, +f=with functions, +t=with types). For the training data: number of tokens (times 1000); percentage of non-scoring tokens; number of parse tree units (usually sentences, times 1000); average number of (scoring and non-scoring) tokens per parse tree unit; whether a lemma or stem is available; how many different CPOSTAG values, POSTAG values, FEATS components and DEPREL values occur for scoring tokens; how many different values for DEPREL scoring tokens with HEAD=0 can have (if that number is 1, there is one designated label (e.g. &amp;quot;ROOT&amp;quot;) for tokens with HEAD=0); percentage of scoring tokens with HEAD=0, a head that precedes or a head that follows the token (this nicely shows which languages are predominantly head-initial or head-final); the average number of scoring tokens with HEAD=0 per parse tree unit; the percentage of (scoring and non-scoring) non-projective relations and of parse tree units with at least one non-projective relation. For the test data: number of scoring tokens; percentage of scoring tokens with a FORM or a LEMMA that does not occur in the training data.</Paragraph>
      <Paragraph position="2"> afinal punctuation was deliberately left out during the conversion (as it is explicitly excluded from the tree structure) bthe non-last IGs of a word are non-scoring, see Section 4.1 cin many cases the parse tree unit in PADT is not a sentence but a paragraph din many cases the unit in Sinica is not a sentence but a comma-separated clause or phrase ethe treebank consists of transcribed dialogues, in which some sentences are very short, e.g. just &amp;quot;Hai.&amp;quot; (&amp;quot;Yes.&amp;quot;) fonly part of the Arabic data has non-underscore values for the LEMMA column gno mapping from fine-grained to coarse-grained tags was available; same for Swedish h9 values are typos; POSTAGs also encode subcategorization information for verbs and some semantic information for conjunctions and nouns; some values also include parts in square brackets which in hindsight should maybe have gone to FEATS idue to treatment of multiwords jprobably due to some sentences consisting only of non-scoring tokens, i.e. punctuation kthese are all disfluencies, which are attached to the virtual root node lfrom co-indexed items in the original treebank; same for Bulgarian  algorithm ver. hor. search lab. non-proj learner pre post opt all pairs McD MST/Eisner b-s irr. opt/approx. 2nd + a MIRA [?] [?] [?] Cor MST/Eisner b-s irr. optimal 2nd [?] BPMb+ME [SVM] + c [?] [?] Shi MST/CLE irr. irr. optimal 1st +, CLE MIRA [?] [?] [?] Can own algorithm irr. irr. approx.(?) int. + d TiMBL [?] [?] + Rie ILP irr. irr. increment. int. + e MIRA [?] [?] + Bic CG-inspired mpf mpf backtrack(?) int. + f MLE(?) + g + h [?] stepwise Dre hagi/Eisner/rerank b-s irr. best 1st exh 2nd [?] MLE [?] [?] + j Liu own algorithm b-t mpf det./local int. [?] MLE [?] [?] [?] Car Eisner b-s irr. approx. int. [?] perceptron [?] [?] [?] 
stepwise: classifier-based  of the first author): algorithm (Y&amp;M: Yamada and Matsumoto (2003), ILP: Integer Linear Programming), vertical direction (irrelevant, mpf: most probable first, bottom-up-spans, bottom-up-trees), horizontal direction (irrelevant, mpf: most probable first, forward, backward), search (optimal, approximate, incremental, best-first exhaustive, deterministic), labeling (interleaved, separate and 1st step, separate and 2nd step), non-projective (ps-pr: through pseudo-projective approach), learner (ME: Maximum Entropy; learners in brackets were explored but not used in the official submission), preprocessing (projectivize, d2c: dependencies to constituents), postprocessing (deprojectivize, c2d: constituents to dependencies), learner parameter optimization per language anon-projectivity through approximate search, used for some languages b20 averaged perceptrons combined into a Bayes Point Machine cintroduced a single POS tag &amp;quot;aux&amp;quot; for all Swedish auxiliary and model verbs dby having no projectivity constraint eselective projectivity constraint for Japanese fseveral approaches to non-projectivity gusing some FEATS components to create some finer-grained POSTAG values hreattachment rules for some types of non-projectivity ihead automaton grammar jdetermined the maximally allowed distance for relations kthrough special parser actions lpseudo-projectivizing training data only mGreedy Prepend Algorithm nbut two separate learners used for unlabeled parsing versus labeling oboth foward and backward, then combined into a single tree with CLE pbut two separate SVMs used for unlabeled parsing versus labeling qforward parsing for Japanese and Turkish, backward for the rest rattaching remaining unattached tokens through exhaustive search (not for submitted runs)  sequence classifier can label all children of a token together (McDonald et al., 2006). Within the third approach, HEAD and DEPREL can be predicted simultaneously, or in two separate steps (potentially using two different learners).</Paragraph>
    </Section>
    <Section position="3" start_page="156" end_page="156" type="sub_section">
      <SectionTitle>
5.3 All pairs
</SectionTitle>
      <Paragraph position="0"> At the highest level of abstraction, there are two fundamental approaches, which we will call &amp;quot;all pairs&amp;quot; and &amp;quot;stepwise&amp;quot;. In an &amp;quot;all pairs&amp;quot; approach, every possible pair of two tokens in a sentence is considered and some score is assigned to the possibility of this pair having a (directed) dependency relation.</Paragraph>
      <Paragraph position="1"> Using that information as building blocks, the parser then searches for the best parse for the sentence.</Paragraph>
      <Paragraph position="2"> This approach is one of those described in Eisner (1996). The definition of &amp;quot;best&amp;quot; parse depends on the precise model used. That model can be one that defines the score of a complete dependency tree as the sum of the scores of all dependency arcs in it.</Paragraph>
      <Paragraph position="3"> The search for the best parse can then be formalized as the search for the maximum spanning tree (MST) (McDonald et al., 2005b). If the parse has to be projective, Eisner's bottom-up-span algorithm (Eisner, 1996) can be used for the search. For non-projective parses, McDonald et al. (2005b) propose using the Chu-Liu-Edmonds (CLE) algorithm (Chu and Liu, 1965; Edmonds, 1967) and McDonald and Pereira (2006) describe an approximate extension of Eisner's algorithm. There are also alternatives to MST which allow imposing additional constraints on the dependency structure, e.g. that at most one dependent of a token can have a certain label, such as &amp;quot;subject&amp;quot;, see Riedel et al. (2006) and Bick (2006). By contrast, Canisius et al. (2006) do not even enforce the tree constraint, i.e. they allow cycles. In a variant of the &amp;quot;all pairs&amp;quot; approach, only those pairs of tokens are considered that are not too distant (Canisius et al., 2006).</Paragraph>
    </Section>
    <Section position="4" start_page="156" end_page="156" type="sub_section">
      <SectionTitle>
5.4 Stepwise
</SectionTitle>
      <Paragraph position="0"> In a stepwise approach, not all pairs are considered.</Paragraph>
      <Paragraph position="1"> Instead, the dependency tree is built stepwise and the decision about what step to take next (e.g. which dependency to insert) can be based on information about, in theory all, previous steps and their results (in the context of generative probabilistic parsing, Black et al. (1993) call this the history). Stepwise approaches can use an explicit probability model over next steps, e.g. a generative one (Eisner, 1996; Dreyer et al., 2006), or train a machine learner to predict those. The approach can be deterministic (at each point, one step is chosen) or employ various types of search. In addition, parsing can be done in a bottom-up-constituent or a bottom-up-spans fashion (or in another way, although this was not done in this shared task). Finally, parsing can start at the first or the last token of a sentence. When talking about languages that are written from left to right, this distinction is normally referred to as left-to-right versus right-to-left. However, for multilingual parsing which includes languages that are written from right to left (Arabic) or sometimes top to bottom (Chinese, Japanese) this terminology is confusing because it is not always clear whether a left-to-right parser for Arabic would really start with the left-most (i.e. last) token of a sentence or, like for other languages, with the first (i.e. rightmost). In general, starting with the first token (&amp;quot;forward&amp;quot;) makes more sense from a psycholinguistic point of view but starting with the last (&amp;quot;backward&amp;quot;) might be beneficial for some languages (possibly related to them being head-initial versus head-final languages). The parsing order directly determines what information will be available from the history when the next decision needs to be made. Stepwise parsers tend to interleave the prediction of HEAD and DEPREL.</Paragraph>
    </Section>
    <Section position="5" start_page="156" end_page="157" type="sub_section">
      <SectionTitle>
5.5 Non-projectivity
</SectionTitle>
      <Paragraph position="0"> All data sets except the Chinese one contain some non-projective dependency arcs, although their proportion varies from 0.1% to 5.4%. Participants took the following approaches to non-projectivity: * Ignore, i.e. predict only projective parses. Depending on the way the parser is trained, it might be necessary to at least projectivize the training data (Chang et al., 2006).</Paragraph>
      <Paragraph position="1"> * Always allow non-projective arcs, by not imposing any projectivity constraint (Shimizu, 2006; Canisius et al., 2006).</Paragraph>
      <Paragraph position="2"> * Allow during parsing under certain conditions, e.g. for tokens with certain properties (Riedel et al., 2006; Bick, 2006) or if no alternative projective arc has a score above the threshold  (Bick, 2006) or if the classifier chooses a special action (Attardi, 2006) or the parser predicts a trace (Schiehlen and Spranger, 2006).</Paragraph>
      <Paragraph position="3"> * Introduce through post-processing, e.g.</Paragraph>
      <Paragraph position="4"> through reattachment rules (Bick, 2006) or if the change increases overall parse tree probability (McDonald et al., 2006).</Paragraph>
      <Paragraph position="5"> * The pseudo-projective approach (Nivre and Nilsson, 2005): Transform non-projective training trees to projective ones but encode the information necessary to make the inverse transformation in the DEPREL, so that this inverse transformation can also be carried out on the test trees (Nivre et al., 2006).</Paragraph>
    </Section>
    <Section position="6" start_page="157" end_page="158" type="sub_section">
      <SectionTitle>
5.6 Data columns used
</SectionTitle>
      <Paragraph position="0"> Table 3 shows which column values have been used by participants. Nobody used the PHEAD/PDEPREL column in any way. It is likely that those who did not use any of the other columns did so mainly for practical reasons, such as the limited time and/or the difficulty to integrate it into an existing parser.</Paragraph>
      <Paragraph position="1"> 5.6.1 FORM versus LEMMA Lemma or stem information has often been ignored in previous dependency parsers. In the shared task data, it was available in just over half the data sets. Both LEMMA and FORM encode lexical information. There is therefore a certain redundancy. Participants have used these two columns in different ways: * Use only one (see Table 3).</Paragraph>
      <Paragraph position="2"> * Use both, in different features. Typically, a feature selection routine and/or the learner itself (through weights) will decide about the importance of the resulting features.</Paragraph>
      <Paragraph position="3"> * Use a variant of the FORM as a substitute for a missing LEMMA. Bick (2006) used the lowercased FORM if the LEMMA is not available, Corston-Oliver and Aue (2006) a prefix and Attardi (2006) a stem derived by a rule-based system for Danish, German and Swedish.</Paragraph>
      <Paragraph position="4"> form lem. cpos pos feats  pating groups. '[?]': a column value was not used at all. '+': used in at least some features. '(+)': Variant of FORM used only if LEMMA is missing, or only parts of FEATS used. '++': used more extensively than another column containing related information (where FORM and LEMMA are related, as are CPOSTAG and POSTAG), e.g. also in combination features or features for context tokens in addition to features for the focus token(s). &amp;quot;rer.&amp;quot;: used in the reranker only. For the last column: atomic, comp. = components, cr.pr. = cross-product.</Paragraph>
      <Paragraph position="5"> aalso prefix and suffix for labeler binstead of form for Arabic and Spanish cinstead of POSTAG for Dutch and Turkish dfor labeler; unlab. parsing: only some for global features ealso prefix falso 1st character of POSTAG gonly as backoff hreranker: also suffix; if no lemma, use prefix of FORM iLEMMA, POSTAG, FEATS only for back-off smoothing  All data sets except German and Swedish had different values for CPOSTAG and POSTAG, although the granularity varied widely. Again, there are different approaches to dealing with the redundancy: * Use only one for all languages.</Paragraph>
      <Paragraph position="6">  * Use both, in different features. Typically, a feature selection routine and/or the learner itself (through weights) will decide about the importance of the resulting features.</Paragraph>
      <Paragraph position="7"> * Use one or the other for each language.</Paragraph>
      <Paragraph position="8">  By design, a FEATS column value has internal structure. Splitting it at the '|'30 results in a set of components. The following approaches have been used: * Ignore the FEATS.</Paragraph>
      <Paragraph position="9"> * Treat the complete FEATS value as atomic, i.e. do not split it into components.</Paragraph>
      <Paragraph position="10"> * Use only some components, e.g. Bick (2006) uses only case, mood and pronoun subclass and Attardi (2006) uses only gender, number, per-son and case.</Paragraph>
      <Paragraph position="11"> * Use one binary feature for each component.</Paragraph>
      <Paragraph position="12"> This is likely to be useful if grammatical function is indicated by case.</Paragraph>
      <Paragraph position="13"> * Use one binary feature for each cross-product of the FEATS components of i and the FEATS components of j. This is likely to be useful for agreement phenomena.</Paragraph>
      <Paragraph position="14"> * Use one binary feature for each FEATS component of i that also exists for j. This is a more explicit way to model agreement.</Paragraph>
    </Section>
    <Section position="7" start_page="158" end_page="159" type="sub_section">
      <SectionTitle>
5.7 Types of features
</SectionTitle>
      <Paragraph position="0"> When deciding whether there should be a dependency relation between tokens i and j, all parsers use at least information about these two tokens. In addition, the following sources of information can be used (see Table 4): token context (tc): a limited number (determined by the window size) of tokens directly preceding or following i or j; children: information about the already found children of i and j; siblings: in a set-up where the decision is not &amp;quot;is there a relation between i and j&amp;quot; but &amp;quot;is i the head of j&amp;quot; or in a separate labeling step, the siblings of i are the already found children of j; structural context 30or for Dutch, also at the ' ' tc ch si sc di in gl co ac la op  groups. See the text for the meaning of the column abbreviations. For separate HEAD and DEPREL assignment: p: only for unlabeled parsing, l: only for labeling, r: only for reranking.</Paragraph>
      <Paragraph position="1"> aFORM versus LEMMA bnumber of tokens governed by child cPOSTAG versus CPOSTAG dfor arity constraint efor arity constraint ffor &amp;quot;full&amp;quot; head constraint gfor uniqueness constraint hfor barrier constraint iof constraints jPOS window size (sc) other than children/siblings: neighboring subtrees/spans, or ancestors of i and j; distance from i to j; information derived from all the tokens in between i and j (e.g. whether there is an intervening verb or how many intervening commas there are); global features (e.g. does the sentence contain a finite verb); explicit feature combinations (depending on the learner, these might not be necessary, e.g. a polynomial kernel routinely combines features); for classifier-based parsers: the previous actions, i.e. classifications; whether information about labels is used as input for other decisions. Finally, the precise set of features can be optimized per language.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="159" end_page="159" type="metho">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> Table 5 shows the official results for submitted parser outputs.31 The two participant groups with the highest total score are McDonald et al. (2006) and Nivre et al. (2006). As both groups had much prior experience in multilingual dependency parsing (see Section 2), it is not too surprising that they both achieved good results. It is surprising, however, how similar their total scores are, given that their approaches are quite different (see Table 2).</Paragraph>
    <Paragraph position="1"> The results show that experiments on just one or two languages certainly give an indication of the usefulness of a parsing approach but should not be taken as proof that one algorithm is better for &amp;quot;parsing&amp;quot; (in general) than another that performs slightly worse.</Paragraph>
    <Paragraph position="2"> The Bulgarian scores suggest that rankings would not have been very different had it been the 13th obligatory languages.</Paragraph>
    <Paragraph position="3"> Table 6 shows that the same holds had we used another evaluation metric. Note that a negative number in both the third and fifth column indicates that errors on HEAD and DEPREL occur together on the same token more often than for other parsers. Finally, we checked that, had we also scored on punctuation tokens, total scores as well as rankings would only have shown very minor differences.</Paragraph>
  </Section>
  <Section position="10" start_page="159" end_page="161" type="metho">
    <SectionTitle>
7 Result analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="159" end_page="161" type="sub_section">
      <SectionTitle>
7.1 Across data sets
</SectionTitle>
      <Paragraph position="0"> The average LAS over all data sets varies between 56.0 for Turkish and 85.9 for Japanese. Top scores vary between 65.7 for Turkish and 91.7 for Japanese.</Paragraph>
      <Paragraph position="1"> In general, there is a high correlation between the best scores and the average scores. This means that data sets are inherently easy or difficult, no matter what the parsing approach. The &amp;quot;easiest&amp;quot; one is clearly the Japanese data set. However, it would be wrong to conclude from this that Japanese in general is easy to parse. It is more likely that the effect stems from the characteristics of the data. The Japanese Verbmobil treebank contains dialogue within a restricted domain (making business appointments). As 31Unfortunately, urgent other obligations prevented two participants (John O'Neil and Kenji Sagae) from submitting a paper about their shared task work. Their results are indicated by a smaller font. Sagae used a best-first probabilistic version of  the ranking for each participant changes (or not: '=') if the unlabeled attachment scores, as shown in the fourth column, are used. The fifth column shows how the ranking changes (in comparison to LAS) if the label accuracies, as shown in the sixth column, are used.</Paragraph>
      <Paragraph position="2"> aIn Bick's method, preference is given to the assignment of dependency labels.</Paragraph>
      <Paragraph position="3"> bSchiehlen derived the constituent labels for his PCFG approach from the DEPREL values.</Paragraph>
      <Paragraph position="4"> cDue to the bug (see footnote with Table 5).</Paragraph>
      <Paragraph position="5"> can be seen in Table 1, there are very few new FORM values in the test data, which is an indication of many dialogues in the treebank being similar. In addition, parsing units are short on average. Finally, the set of DEPREL values is very small and consequently the ratio between (C)POSTAG and DEPREL values is extremely favorable. It would be interesting to apply the shared task parsers to the Kyoto University Corpus (Kurohashi and Nagao, 1997), which is the standard treebank for Japanese and has also been used by Kudo and Matsumoto  tions (SD) from the average per participant are calculated over the 12 obligatory languages (i.e. excluding Bulgarian). Note that due to the equal sizes of the test sets for all languages, the total scores, i.e. the LAS over the concatenation of the 12 obligatory test sets, are identical (up to the first decimal digit) to the average LAS over the 12 test sets. Averages and standard deviations per data set are calculated ignoring zero scores (i.e. results not submitted). The highest score for each column and those not significantly worse (p &lt; 0.05) are shown in bold face. Significance was computed using the official scoring script eval.pl and Dan Bikel's Randomized Parsing Evaluation Comparator, which implements stratified shuffling. aAttardi's submitted results contained an unfortunate bug which caused the DEPREL values of all tokens with HEAD=0 to be an underscore (which is scored as incorrect). Using the simple heuristic of assigning the DEPREL value that most frequently occured with HEAD=0 in training would have resulted in a total LAS of 67.5. (2000), or to the domain-restricted Japanese dialogues of the ATR corpus (Lepage et al., 1998).32 Other relatively &amp;quot;easy&amp;quot; data sets are Portuguese (2nd highest average score but, interestingly, the third-longest parsing units), Bulgarian (3rd), German (4th) and Chinese (5th). Chinese also has the second highest top score33 and Chinese parsing units 32Unfortunately, both these treebanks need to be bought, so they could not be used for the shared task. Note also that Japanese dependency parsers often operate on &amp;quot;bunsetsus&amp;quot; instead of words. Bunsetsus are related to chunks and consist of a content word and following particles (if any).</Paragraph>
      <Paragraph position="6"> 33Although this seems to be somewhat of a mystery compared to the ranking according to the average scores. Riedel et are the shortest. and Chinese parsing units are the shortest. We note that all &amp;quot;easier&amp;quot; data sets offer large to middle-sized training sets.</Paragraph>
      <Paragraph position="7"> The most difficult data set is clearly the Turkish one. It is rather small, and in contrast to Arabic and Slovene, which are equally small or smaller, it covers 8 genres, which results in a high percentage of new FORM and LEMMA values in the test set.</Paragraph>
      <Paragraph position="8"> It is also possible that parsers get confused by the high proportion (one third!) of non-scoring tokens al. (2006)'s top score is more than 3% absolute above the second highest score and they offer no clear explanation for their success.</Paragraph>
      <Paragraph position="9">  and the many tokens with ' ' as either the FORM or LEMMA. There is a clear need for further research to check whether other representations result in better performance.</Paragraph>
      <Paragraph position="10"> The second-most difficult data set is Arabic. It is quite small and has by far the longest parsing units.</Paragraph>
      <Paragraph position="11"> The third-most difficult data set is Slovene. It has the smallest training set. However, its average as well as top score far exceed those for Arabic and Turkish, which are larger. Interestingly, although the treebank text comes from a single source (a translation of Orwell's novel &amp;quot;1984&amp;quot;), there is quite a high proportion of new FORM and LEMMA values in the test set. The fourth-most difficult data set is Czech in terms of the average score and Dutch in terms of the top score. The diffence in ranking for Czech is probably due to the fact that it has by far the largest training set and ironically, several participants could not train on all data within the limited time, or else had to partition the data and train one model for each partition. Likely problems with the Dutch data set are: noisy (C)POSTAG and LEMMA, (C)POSTAG for multiwords, and the highest proportion of nonprojectivity. null Factors that have been discussed so far are: the size of the training data, the proportion of new FORM and LEMMA values in the test set, the ratio of (C)POSTAG to DEPREL values, the average length of the parsing unit the proportion of non-projective arcs/parsing units. It would be interesting to derive a formula based on those factors that fits the shared task data and see how well it predicts results on new data sets. One factor that seems to be irrelevant is the head-final versus head-initial distinction, as both the &amp;quot;easiest&amp;quot; and the most difficult data sets are for head-final languages. There is also no clear proof that some language families are easier (with current parsing methods) than others. It would be interesting to test parsers on the Hebrew treebank (Sima'an et al., 2001), to compare performance to Arabic, the other Semitic language in the shared task, or on the Hungarian Szeged Corpus (Csendes et al., 2004), for another agglutinative language.</Paragraph>
    </Section>
    <Section position="2" start_page="161" end_page="161" type="sub_section">
      <SectionTitle>
7.2 Across participants
</SectionTitle>
      <Paragraph position="0"> For most parsers, their ranking for a specific language differs at most a few places from their over-all ranking. There are some outliers though. For example, Johansson and Nugues (2006) and Yuret (2006) are seven ranks higher for Turkish than overall, while Riedel et al. (2006) are five ranks lower.</Paragraph>
      <Paragraph position="1"> Canisius et al. (2006) are six and Schiehlen and Spranger (2006) even eight ranks higher for Dutch than overall, while Riedel et al. (2006) are six ranks lower for Czech and Johansson and Nugues (2006) also six for Chinese. Some of the higher rankings could be related to native speaker competence and resulting better parameter tuning but other outliers remain a mystery. Even though McDonald et al.</Paragraph>
      <Paragraph position="2"> (2006) and Nivre et al. (2006) obtained very similar overall scores, a more detailed look at their performance shows clear differences. Taken over all 12 obligatory languages, both obtain a recall of more than 89% on root tokens (i.e. those with HEAD=0) but Nivre's precision on them is much lower than McDonald's (80.91 versus 91.07). This is likely to be an effect of the different parsing approaches.</Paragraph>
    </Section>
    <Section position="3" start_page="161" end_page="161" type="sub_section">
      <SectionTitle>
7.3 Across part-of-speech tags
</SectionTitle>
      <Paragraph position="0"> When breaking down by part-of-speech the results of all participants on all data sets, one can observe some patterns of &amp;quot;easy&amp;quot; and &amp;quot;difficult&amp;quot; parts-ofspeech, at least in so far as tag sets are comparable across treebanks. The one PoS that everybody got 100% correct are the German infinitival markers (tag PTKZU; like &amp;quot;to&amp;quot; in English). Accuracy on the Swedish equivalent (IM) is not far off at 98%.</Paragraph>
      <Paragraph position="1"> Other easy PoS are articles, with accuracies in the nineties for German, Dutch, Swedish, Portuguese and Spanish. As several participants have remarked in their papers, prepositions are much more difficult, with typical accuracies in the fifties or sixties. Similarly, conjunctions typically score low, with accuracies even in the forties for Arabic and Dutch.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="161" end_page="162" type="metho">
    <SectionTitle>
8 Future research
</SectionTitle>
    <Paragraph position="0"> There are many directions for interesting research building on the work done in this shared task. One is the question which factors make data sets &amp;quot;easy&amp;quot; or difficult. Another is finding out how much of parsing performance depends on annotations such as the lemma and morphological features, which are not yet routinely part of treebanking efforts. In this respect, it would be interesting to repeat ex- null periments with the recently released new version of the TIGER treebank which now contains this information. One line of research that does not require additional annotation effort is defining or improving the mapping from coarse-grained to fine-grained PoS tags.34 Another is harvesting and using large-scale distributional data from the internet. We also hope that by combining parsers we can achieve even better performance, which in turn would facilitate the semi-automatic enlargement of existing tree-banks and possibly the detection of remaining errors. This would create a positive feedback loop.</Paragraph>
    <Paragraph position="1"> Finally one must not forget that almost all of the LEMMA, (C)POSTAG and FEATS values and even part of the FORM column (the multiword tokens used in many data sets and basically all tokenization for Chinese and Japanese, where words are normally not delimited by spaces) have been manually created or corrected and that the general parsing task has to integrate automatic tokenization, morphological analysis and tagging. We hope that the resources created and lessons learned during this shared task will be valuable for many years to come but also that they will be extended and improved by others in the future, and that the shared task website will grow into an informational hub on multilingual dependency parsing.</Paragraph>
  </Section>
class="xml-element"></Paper>