<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1013">
  <Title>Probabilistic Parsing for German using Sister-Head Dependencies</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Parsing German
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Syntactic Properties
</SectionTitle>
      <Paragraph position="0"> German exhibits a number of syntactic properties that distinguish it from English, the language that has been the focus of most research in parsing.</Paragraph>
      <Paragraph position="1"> Prominent among these properties is the semi-free  various languages (dependency precision for Czech) wordorder, i.e., German wordorder is fixed in some respects, but variable in others. Verb order is largely fixed: in subordinate clauses such as (1a), both the finite verb hat 'has' and the non-finite verb komponiert 'composed' are in sentence final position.  hat.</Paragraph>
      <Paragraph position="2"> has 'Because he has composed music yesterday.' b. Hat er gestern Musik komponiert? c. Er hat gestern Musik komponiert.</Paragraph>
      <Paragraph position="3"> In yes/no questions such as (1b), the finite verb is sentence initial, while the non-finite verb is sentence final. In declarative main clauses (see (1c)), on the other hand, the finite verb is in second position (i.e., preceded by exactly one constituent), while the non-finite verb is final.</Paragraph>
      <Paragraph position="4"> While verb order is fixed in German, the order of complements and adjuncts is variable, and influenced by a variety of syntactic and non-syntactic factors, including pronominalization, information structure, definiteness, and animacy (e.g., Uszkoreit 1987). The first position in a declarative sentence, for example, can be occupied by various constituents, including the subject (er 'he' in (1c)), the object (Musik 'music' in (2a)), an adjunct (gestern  'yesterday' in (2b)), or the non-finite verb (komponiert 'composed' in (2c)).</Paragraph>
      <Paragraph position="5"> (2) a. Musik hat er gestern komponiert.</Paragraph>
      <Paragraph position="6"> b. Gestern hat er Musik komponiert .</Paragraph>
      <Paragraph position="7"> c. Komponiert hat er gestern Musik.</Paragraph>
      <Paragraph position="8">  The semi-free wordorder in German means that a context-free grammar model has to contain more rules than for a fixed wordorder language. For transitive verbs, for instance, we need the rules S ! VNPNP,S! NP V NP, and S ! NP NP V to account for verb initial, verb second, and verb final order (assuming a flat S, see Section 2.2).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Negra Annotation Scheme
</SectionTitle>
      <Paragraph position="0"> The Negra corpus consists of around 350,000 words of German newspaper text (20,602 sentences). The annotation scheme (Skut et al., 1997) is modeled to a certain extent on that of the Penn Treebank (Marcus et al., 1993), with crucial differences. Most importantly, Negra follows the dependency grammar tradition in assuming flat syntactic representations: (a) There is no S ! NP VP rule. Rather, the subject, the verb, and its objects are all sisters of each other, dominated by an S node. This is a way of accounting for the semi-free wordorder of German (see Section 2.1): the first NP within an S need not be the subject.</Paragraph>
      <Paragraph position="1"> (b) There is no SBAR ! Comp S rule. Main clauses, subordinate clauses, and relative clauses all share the category S in Negra; complementizers and relative pronouns are simply sisters of the verb.</Paragraph>
      <Paragraph position="2"> (c) There is no PP ! P NP rule, i.e., the preposition and the noun it selects (and determiners and adjectives, if present) are sisters, dominated by a PP node. An argument for this representation is that prepositions behave like case markers in German; a preposition and a determiner can merge into a single word (e.g., in dem 'in the' becomes im).</Paragraph>
      <Paragraph position="3"> Another idiosyncrasy of Negra is that it assumes special coordinate categories. A coordinated sentence has the category CS, a coordinate NP has the category CNP, etc. While this does not make the annotation more flat, it substantially increases the number of non-terminal labels. Negra also contains grammatical function labels that augment phrasal and lexical categories. Example are MO (modifier), HD (head), SB (subject), and OC (clausal object).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Probabilistic Parsing Models
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Probabilistic Context-Free Grammars
</SectionTitle>
      <Paragraph position="0"> Lexicalization has been shown to improve parsing performance for the Penn Treebank (e.g., Carroll and Rooth 1998; Charniak 1997, 2000; Collins 1997). The aim of the present paper is to test if this finding carries over to German and to the Negra corpus. We therefore use an unlexicalized model as our baseline against which to test the lexicalized models.</Paragraph>
      <Paragraph position="1"> More specifically, we used a standard probabilistic context-free grammar (PCFG; see Charniak 1993). Each context-free rule RHS ! LHS is annotated with an expansion probability P(RHSjLHS).</Paragraph>
      <Paragraph position="2"> The probabilities for all rules with the same lefthand side have to sum to one, and the probability of a parse tree T is defined as the product of the probabilities of all rules applied in generating T .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Carroll and Rooth's Head-Lexicalized Model
</SectionTitle>
      <Paragraph position="0"> The head-lexicalized PCFG model of Carroll and Rooth (1998) is a minimal departure from the standard unlexicalized PCFG model, which makes it ideal for a direct comparison.</Paragraph>
      <Paragraph position="1">  A grammar rule LHS ! RHS can be written as</Paragraph>
      <Paragraph position="3"> are daughters. Let l(C) be the lexical head  Charniak (1997) proposes essentially the same model; we will nevertheless use the label 'Carroll and Rooth model' as we are using their implementation (see Section 4.1). of the constituent C. The rule probability is then defined as (see also Beil et al. 2002):</Paragraph>
      <Paragraph position="5"> (l(C)jC;P;l(P)) is the probability that the (non-head) category C has the lexical head l(C) given that its mother is P with lexical head l(P).</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.3 Collins's Head-Lexicalized Model
</SectionTitle>
      <Paragraph position="0"> In contrast to Carroll and Rooth's (1998) approach, the model proposed by Collins (1997) does not compute rule probabilities directly. Rather, they are generated using a Markov process that makes certain independence assumptions. A grammar rule LHS ! RHS can be written as P ! L</Paragraph>
      <Paragraph position="2"> where P is the mother and H is the head daughter.</Paragraph>
      <Paragraph position="3"> Let l(C) be the head word of C and t(C) the tag of the head word of C. Then the probability of a rule is defined as:</Paragraph>
      <Paragraph position="5"> are the probabilities of generating the nonterminals to the left and right of the head, respectively; d(i) is a distance measure. (L</Paragraph>
      <Paragraph position="7"> are stop categories.) At this point, the model is still unlexicalized. To add lexical sensitivity, the P</Paragraph>
      <Paragraph position="9"> probability functions also take into account head words and their POS tags:</Paragraph>
      <Paragraph position="11"/>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Experiment 1
</SectionTitle>
    <Paragraph position="0"> This experiment was designed to compare the performance of the three models introduced in the last section. Our main hypothesis was that the lexicalized models will outperform the unlexicalized baseline model. Another prediction was that adding Negra-specific information to the models will increase parsing performance. We therefore tested a model variant that included grammatical function labels, i.e., the set of categories was augmented by the function tags specified in Negra (see Section 2.2).</Paragraph>
    <Paragraph position="1"> Adding grammatical functions is a way of dealing with the wordorder facts of German (see Section 2.1) in the face of Negra's very flat annotation scheme. For instance, subject and object NPs have different wordorder preferences (subjects tend to be preverbal, while objects tend to be postverbal), a fact that is captured if subjects have the label NP-SB, while objects are labeled NP-OA (accusative object), NP-DA (dative object), etc. Also the fact that verb order differs between subordinate and main clauses is captured by the function labels: the former are labeled S, while the latter are labeled S-OC (object clause), S-RC (relative clause), etc.</Paragraph>
    <Paragraph position="2"> Another idiosyncrasy of the Negra annotation is that conjoined categories have separate labels (S and CS, NP and CNP, etc.), and that PPs do not contain an NP node. We tested a variant of the Carroll and Rooth (1998) model that takes this into account.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Method
</SectionTitle>
      <Paragraph position="0"> used the treebank format of Negra. This format, which is included in the Negra distribution, was derived from the native format by replacing crossing branches with traces. We split the corpus into three subsets. The first 18,602 sentences constituted the training set. Of the remaining 2,000 sentences, the first 1,000 served as the test set, and the last 1000 as the development set. To increase parsing efficiency, we removed all sentences with more than 40 words.</Paragraph>
      <Paragraph position="1"> This resulted in a test set of 968 sentences and a development set of 975 sentences. Early versions of the models were tested on the development set, and the test set remained unseen until all parameters were fixed. The final results reported this paper were obtained on the test set, unless stated otherwise.</Paragraph>
      <Paragraph position="2"> Grammar Induction For the unlexicalized PCFG model (henceforth baseline model), we used the probabilistic left-corner parser Lopar (Schmid, 2000). When run in unlexicalized mode, Lopar implements the model described in Section 3.1. A grammar and a lexicon for Lopar were read off the Negra training set, after removing all grammatical function labels. As Lopar cannot handle traces, these were also removed from the training data.</Paragraph>
      <Paragraph position="3"> The head-lexicalized model of Carroll and Rooth (1998) (henceforth C&amp;R model) was again realized using Lopar, which in lexicalized mode implements the model in Section 3.2. Lexicalization requires that each rule in a grammar has one of the categories on its righthand side annotated as the head. For the categories S, VP, AP, and AVP, the head is marked in Negra. For the other categories, we used rules to heuristically determine the head, as is standard practice for the Penn Treebank.</Paragraph>
      <Paragraph position="4"> The lexicalized model proposed by Collins (1997) (henceforth Collins model) was re-implemented by one of the authors. For training, empty categories were removed from the training data, as the model cannot handle them. The same head finding strategy was applied as for the C&amp;R model.</Paragraph>
      <Paragraph position="5"> In this experiment, only head-head statistics were used (see (5)). The original Collins model uses sister-head statistics for non-recursive NPs. This will be discussed in detail in Section 5.</Paragraph>
      <Paragraph position="6"> Training and Testing For all three models, the model parameters were estimated using maximum likelihood estimation. Both Lopar and the Collins model use various backoff distributions to smooth the estimates. The reader is referred to Schmid (2000) and Collins (1997) for details. For the C&amp;R model, we used a cutoff of one for rule frequencies</Paragraph>
      <Paragraph position="8"> and lexical choice frequencies P choice (the cutoff value was optimized on the development set). We also tested variants of the baseline model and the C&amp;R model that include grammatical function information, as we hypothesized that this information might help the model to handle wordorder variation more adequately, as explained above. Finally, we tested variant of the C&amp;R model that uses Lopar's parameter pooling feature. This feature makes it possible to collapse the lexical choice distribution P choice for either the daughter or the mother categories of a rule (see Section 3.2). We pooled the estimates for pairs of conjoined and nonconjoined daughter categories (S and CS, NP and CNP, etc.): these categories should be treated as the same daughters; e.g., there should be no difference between S !NP V and S!CNP V. We also pooled the estimates for the mother categories NPs and PPs. This is a way of dealing with the fact that there is no separate NP node within PPs in Negra.</Paragraph>
      <Paragraph position="9"> Lopar and the Collins model differ in their handling of unknown words. In Lopar, a POS tag distribution for unknown words has to be specified, which is then used to tag unknown words in the test data. The Collins model treats any word seen fewer than five times in the training data as unseen and uses an external POS tagger to tag unknown words. In order to make the models comparable, we used a uniform approach to unknown words. All models were run on POS-tagged input; this input was created by tagging the test set with a separate POS tagger, for both known and unknown words. We used TnT (Brants, 2000), trained on the Negra training set. The tagging accuracy was 97.12% on the development set.</Paragraph>
      <Paragraph position="10"> In order to obtain an upper bound for the performance of the parsing models, we also ran the parsers on the test set with the correct tags (as specified in Negra), again for both known and unknown words.</Paragraph>
      <Paragraph position="11"> We will refer to this mode as 'perfect tagging'.</Paragraph>
      <Paragraph position="12"> All models were evaluated using standard PAR-SEVAL measures. We report labeled recall (LR) labeled precision (LP), average crossing brackets (CBs), zero crossing brackets (0CB), and two or less crossing brackets ( 2CB). We also give the coverage (Cov), i.e., the percentage of sentences that the parser was able to parse.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> The results for all three models and their variants are given in Table 2, for both TnT tags and perfect tags. The baseline model achieves 70.56% LR and 66.69% LP with TnT tags. Adding grammatical functions reduces both figures slightly, and coverage drops by about 15%. The C&amp;R model performs worse than the baseline, at 68.04% LR and 60.07% LP (for TnT tags). Adding grammatical function again reduces performance slightly. Parameter pooling increases both LR and LP by about 1%. The Collins models also performs worse than the baseline, at 67.91% LR and 66.07% LP.</Paragraph>
    <Paragraph position="1"> Performance using perfect tags (an upper bound of model performance) is 2-3% higher for the base-line and for the C&amp;R model. The Collins model gains only about 1%. Perfect tagging results in a performance increase of over 10% for the models with grammatical functions. This is not surprising, as the perfect tags (but not the TnT tags) include grammatical function labels. However, we also observe a dramatic reduction in coverage (to about 65%).</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 Discussion
</SectionTitle>
      <Paragraph position="0"> We added grammatical functions to both the base-line model and the C&amp;R model, as we predicted that this would allow the model to better capture the wordorder facts of German. However, this prediction was not borne out: performance with grammatical functions (on TnT tags) was slightly worse than without, and coverage dropped substantially. A possible reason for this is sparse data: a grammar augmented with grammatical functions contains many additional categories, which means that many more parameters have to be estimated using the same training set. On the other hand, a performance increase occurs if the tagger also provides grammatical function labels (simulated in the perfect tags condition). However, this comes at the price of an unacceptable reduction in coverage.</Paragraph>
      <Paragraph position="1"> When training the C&amp;R model, we included a variant that makes use of Lopar's parameter pooling feature. We pooled the estimates for conjoined daughter categories, and for NP and PP mother categories. This is a way of taking the idiosyncrasies of the Negra annotation into account, and resulted in a small improvement in performance.</Paragraph>
      <Paragraph position="2"> The most surprising finding is that the best performance was achieved by the unlexicalized PCFG  functions; pool: parameter pooling for NPs/PPs and conjoined categories)  baseline model. Both lexicalized models (C&amp;R and Collins) performed worse than the baseline. This results is at odds with what has been found for English, where lexicalization is standardly reported to increase performance by about 10%. The poor performance of the lexicalized models could be due to a lack of sufficient training data: our Negra training set contains approximately 18,000 sentences, and is therefore significantly smaller than the Penn Tree-bank training set (about 40,000 sentences). Negra sentences are also shorter: they contain, on average, 15 words compared to 22 in the Penn Treebank.</Paragraph>
      <Paragraph position="3"> We computed learning curves for the unmodified variants (without grammatical functions or parameter pooling) of all three models (on the development set). The result (see Figure 1) shows that there is no evidence for an effect of sparse data. For both the baseline and the C&amp;R model, a fairly high f-score is achieved with only 10% of the training data. A slow increase occurs as more training data is added.</Paragraph>
      <Paragraph position="4"> The performance of the Collins model is even less affected by training set size. This is probably due to the fact that it does not use rule probabilities directly, but generates rules using a Markov chain.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1" end_page="3" type="metho">
    <SectionTitle>
5 Experiment 2
</SectionTitle>
    <Paragraph position="0"> As we saw in the last section, lack of training data is not a plausible explanation for the sub-baseline performance of the lexicalized models. In this experiment, we therefore investigate an alternative hypothesis, viz., that the lexicalized models do not cope  matical categories in the Penn Treebank and Negra well with the fact that Negra rules are so flat (see Section 2.2). We will focus on the Collins model, as it outperformed the C&amp;R model in Experiment 1.</Paragraph>
    <Paragraph position="1"> An error analysis revealed that many of the errors of the Collins model in Experiment 1 are chunking errors. For example, the PP neben den Mitteln des Theaters should be analyzed as (6a). But instead the parser produces two constituents as in (6b)):  'apart from the means of the theater'.</Paragraph>
    <Paragraph position="2"> b. [PP neben den Mitteln] [NP des Theaters] The reason for this problem is that neben is the head of the constituent in (6), and the Collins model uses a crude distance measure together with head-head dependencies to decide if additional constituents should be added to the PP. The distance measure is inadequate for finding PPs with high precision.</Paragraph>
    <Paragraph position="3"> The chunking problem is more widespread than PPs. The error analysis shows that other constituents, including Ss and VPs, also have the wrong boundary. This problem is compounded by the fact that the rules in Negra are substantially flatter than the rules in the Penn Treebank, for which the Collins model was developed. Table 3 compares the average number of daughters in both corpora.</Paragraph>
    <Paragraph position="4"> The flatness of PPs is easy to reduce. As detailed in Section 2.2, PPs lack an intermediate NP projection, which can be inserted straightforwardly using the following rule: (7) [PPP...]! [PPP[NP...]] In the present experiment, we investigated if parsing performance improves if we test and train on a version of Negra on which the transformation in (7) has been applied.</Paragraph>
    <Paragraph position="5"> In a second series of experiments, we investigated a more general way of dealing with the flatness of</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
C&amp;R Collins Charniak Current
</SectionTitle>
      <Paragraph position="0"> Head sister category X X X Head sister head word X X X Head sister head tag X X Prev. sister category X X X Prev. sister head word X Prev. sister head tag X Table 4: Linguistic features in the current model compared to the models of Carroll and Rooth (1998), Collins (1997), and Charniak (2000) Negra, based on Collins's (1997) model for non-recursive NPs in the Penn Treebank (which are also flat). For non-recursive NPs, Collins (1997) does not use the probability function in (5), but instead substitutes P r (and, by analogy, P</Paragraph>
      <Paragraph position="2"> ). In the literature, the version of P r in (5) is said to capture head-head relationships. We will refer to the alternative model in (8) as capturing sister-head relationships.</Paragraph>
      <Paragraph position="3"> Using sister-head relationships is a way of counteracting the flatness of the grammar productions; it implicitly adds binary branching to the grammar. Our proposal is to extend the use of sister-head relationship from non-recursive NPs (as proposed by Collins) to all categories.</Paragraph>
      <Paragraph position="4"> Table 4 shows the linguistic features of the resulting model compared to the models of Carroll and Rooth (1998), Collins (1997), and Charniak (2000). The C&amp;R model effectively includes category information about all previous sisters, as it uses context-free rules. The Collins (1997) model does not use context-free rules, but generates the next category using zeroth order Markov chains (see Section 3.3), hence no information about the previous sisters is included. Charniak's (2000) model extends this to higher order Markov chains (first to third order), and therefore includes category information about previous sisters.The current model differs from all these proposals: it does not use any information about the head sister, but instead includes the category, head word, and head tag of the previous sister, effectively treating it as the head.</Paragraph>
    <Paragraph> Table 4: Linguistic features in the current model compared to the models of Carroll and Rooth (1998), Collins (1997), and Charniak (2000):
                            C&amp;R   Collins   Charniak   Current
Head sister category         X        X         X
Head sister head word        X        X         X
Head sister head tag                  X         X
Prev. sister category        X                  X          X
Prev. sister head word                                      X
Prev. sister head tag                                       X</Paragraph>
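    <Paragraph> The contrast can be summarized in a minimal sketch (our notation, with a hypothetical probability function p_r): under head-head dependencies every right sister is conditioned on the head daughter H, whereas under sister-head dependencies it is conditioned on the previous sister, as in (8):
import math

def right_sisters_log_prob(parent, head, right_sisters, p_r, sister_head=False):
    # right_sisters: list of (category, tag, word) triples in left-to-right order.
    logp, previous = 0.0, head
    for i, sister in enumerate(right_sisters, start=1):
        conditioning = previous if sister_head else head   # R_{i-1} vs. H
        logp += math.log(p_r(sister, parent, conditioning, i))
        previous = sister
    return logp
</Paragraph>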
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.1 Method
</SectionTitle>
      <Paragraph position="0"> We first trained the original Collins model on a modified versions of the training test from Experiment 1 in which the PPs were split by applying rule (7).</Paragraph>
      <Paragraph position="1"> In a second series of experiments, we tested a range of models that use sister-head dependencies instead of head-head dependencies for different categories. We first added sister-head dependencies for NPs (following Collins's (1997) original proposal) and then for PPs, which are flat in Negra, and thus similar in structure to NPs (see Section 2.2). Then we tested a model in which sister-head relationships are applied to all categories.</Paragraph>
      <Paragraph position="2"> In a third series of experiments, we trained models that use sister-head relationships everywhere except for one category. This makes it possible to determine which sister-head dependencies are crucial for improving performance of the model.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> The results of the PP experiment are listed in Table 5. Again, we give results obtained using TnT tags and using perfect tags. The row 'Split PP' contains the performance figures obtained by including split PPs in both the training and in the testing set. This leads to a substantial increase in LR (6-7%) and LP (around 8%) for both tagging schemes. Note, however, that these figures are not directly comparable to the performance of the unmodified Collins model: it is possible that the additional brackets artificially inflate LR and LP. Presumably, the brackets for split PPs are easy to detect, as they are always adjacent to a preposition. An honest evaluation should therefore train on the modified training set (with split PPs), but collapse the split categories for testing, i.e., test on the unmodified test set. The results for this evaluation are listed in rows 'Collapsed PP'. Now there is no increase in performance compared to the unmodified Collins model; rather, a slight drop in LR and LP is observed.</Paragraph>
      <Paragraph position="1"> Table 5 also displays the results of our experiments with the sister-head model. For TnT tags, we observe that using sister-head dependencies for NPs leads to a small decrease in performance compared to the unmodified Collins model, resulting in 67.84% LR and 65.96% LP. Sister-head dependencies for PPs, however, increase performance substantially to 70.27% LR and 68.45% LP. The highest improvement is observed if head-sister dependencies are used for all categories; this results in 71.32% LR and 70.93% LP, which corresponds to an improvement of 3% in LP and 5% in LR compared to the unmodified Collins model. Performance with perfect tags is around 2-4% higher than with TnT tags. For perfect tags, sister-head dependencies lead to an improvement for NPs, PPs, and all categories.</Paragraph>
      <Paragraph position="2"> The third series of experiments was designed to determine which categories are crucial for achieving this performance gain. This was done by training models that use sister-head dependencies for all categories but one. Table 6 shows the change in LR and LP that was found for each individual category (again for TnT tags and perfect tags). The highest drop in performance (around 3%) is observed when the PP category is reverted to head-head dependencies. For S and for the coordinated categories (CS,  CNP, etc.), a drop in performance of around 1% each is observed. A slight drop is observed also for VP (around 0.5%). Only minimal fluctuations in performance are observed when the other categories are removed (AP, AVP, and NP): there is a small effect (around 0.5%) if TnT tags are used, and almost no effect for perfect tags.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="3" type="sub_section">
      <SectionTitle>
5.3 Discussion
</SectionTitle>
      <Paragraph position="0"> We showed that splitting PPs to make Negra less flat does not improve parsing performance if testing is carried out on the collapsed categories. However, we observed that LR and LP are artificially inflated if split PPs are used for testing. This finding goes some way towards explaining why the parsing performance reported for the Penn Treebank is substantially higher than the results for Negra: the Penn Treebank contains split PPs, which means that there are lot of brackets that are easy to get right. The resulting performance figures are not directly comparable to figures obtained on Negra, or other corpora with flat PPs.</Paragraph>
      <Paragraph position="1">  We also obtained a positive result: we demonstrated that a sister-head model outperforms the unlexicalized baseline model (unlike the C&amp;R model and the Collins model in Experiment 1). LR was about 1% higher and LP about 4% higher than the baseline if lexical sister-head dependencies are used for all categories. This holds both for TnT tags and for perfect tags (compare Tables 2 and 5). We also found that using lexical sister-head dependencies for all categories leads to a larger improvement than using them only for NPs or PPs (see Table 5). This result was confirmed by a second series of experiments, where we reverted individual categories back to head-head dependencies, which triggered a decrease in performance for all categories, with the exception of NP, AP, and AVP (see Table 6).</Paragraph>
      <Paragraph position="2"> On the whole, the results of Experiment 2 are at odds with what is known about parsing for English.</Paragraph>
      <Paragraph position="3"> The progression in the probabilistic parsing literature has been to start with lexical head-head dependencies (Collins, 1997) and then add non-lexical sis- null This result generalizes to Ss, which are also flat in Negra (see Section 2.2). We conducted an experiment in which we added an SBAR above the S. No increase in performance was obtained if the evaluation was carried using collapsed Ss.  been found useful in a limited way: in the original Collins model, they are used for non-recursive NPs.</Paragraph>
      <Paragraph position="4"> Our results show, however, that for parsing German, lexical sister-head information is more important than lexical head-head information. Only a model that replaced lexical head-head with lexical sister-head dependencies was able to outperform a baseline model that uses no lexicalization.</Paragraph>
      <Paragraph position="5">  Based on the error analysis for Experiment 1, we claim that the reason for the success of the sister-head model is the fact that the rules in Negra are so flat; using a sister-head model is a way of binarizing the rules.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="3" end_page="3" type="metho">
    <SectionTitle>
6 Comparison with Previous Work
</SectionTitle>
    <Paragraph position="0"> There are currently no probabilistic, treebank-trained parsers available for German (to our knowledge). A number of chunking models have been proposed, however. Skut and Brants (1998) used Negra to train a maximum entropy-based chunker, and report LR and LP of 84.4% for NP and PP chunking. Using cascaded Markov models, Brants (2000) reports an improved performance on the same task (LR 84.4%, LP 88.3%). Becker and Frank (2002) train an unlexicalized PCFG on Negra to perform a different chunking task, viz., the identification of topological fields (sentence-based chunks). They report an LR and LP of 93%.</Paragraph>
    <Paragraph position="1"> The head-lexicalized model of Carroll and Rooth (1998) has been applied to German by Beil et al.</Paragraph>
    <Paragraph position="2">  It is unclear what effect bi-lexical statistics have on the sister-head model; while Gildea (2001) shows bi-lexical statistics are sparse for some grammars, Hockenmaier and Steedman (2002) found they play a greater role in binarized grammars. (1999, 2002). However, this approach differs in the number of ways from the results reported here: (a) a hand-written grammar (instead of a treebank grammar) is used; (b) training is carried out on unannotated data; (c) the grammar and the training set cover only subordinate and relative clauses, not unrestricted text. Beil et al. (2002) report an evaluation using an NP chunking task, achieving 92% LR and LP. They also report the results of a task-based evaluation (extraction of sucategorization frames).</Paragraph>
    <Paragraph position="3"> There is some research on treebank-based parsing of languages other than English. The work by Collins et al. (1999) and Bikel and Chiang (2000) has demonstrated the applicability of the Collins (1997) model for Czech and Chinese. The performance reported by these authors is substantially lower than the one reported for English, which might be due to the fact that less training data is available for Czech and Chinese (see Table 1). This hypothesis cannot be tested, as the authors do not present learning curves for their models. However, the learning curve for Negra (see Figure 1) indicates that the performance of the Collins (1997) model is stable, even for small training sets. Collins et al.</Paragraph>
    <Paragraph position="4"> (1999) and Bikel and Chiang (2000) do not compare their models with an unlexicalized baseline; hence it is unclear if lexicalization really improves parsing performance for these languages. As Experiment 1 showed, this cannot be taken for granted.</Paragraph>
  </Section>
class="xml-element"></Paper>