<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2929">
  <Title>Vine Parsing and Minimum Risk Reranking for Speed and Precision</Title>
  <Section position="5" start_page="0" end_page="202" type="metho">
    <SectionTitle>
3 Unlabeled Parsing
</SectionTitle>
    <Paragraph position="0"> The first component of our system is an unlabeled parser that, given a sentence, finds the U best unlabeled trees under a probabilistic model using a bottom-up dynamic programming algorithm.2 The model is a probabilistic head automaton grammar (Alshawi, 1996) that assumes conditional indepen- null as described in Eisner et al. (2004). All of our dynamic programming algorithms are implemented concisely in the Dyna language.</Paragraph>
    <Paragraph position="1">  in training data. We show oracle, 1-best, and reranked performance on the test set at different stages of the system. Boldface marks oracle performance that, given perfect downstream modules, would supercede the best system. Italics mark the few cases where the reranker increased error rate. Columns 8-10 show labeled accuracy; column 10 gives the final shared task evaluation scores. dence between the left yield and the right yield of a given head, given the head (Eisner, 1997).3 The best known parsing algorithm for such a model is O(n3) (Eisner and Satta, 1999). The U-best list is generated using Algorithm 3 of Huang and Chiang (2005).</Paragraph>
    <Section position="1" start_page="201" end_page="201" type="sub_section">
      <SectionTitle>
3.1 Vine parsing (dependency length bounds)
</SectionTitle>
      <Paragraph position="0"> Following Eisner and N. Smith (2005), we also impose a bound on the string distance between every 3To empirically test this assumption across languages, we measured the mutual information between different features of yleft(j) and yright(j), given xj. (Mutual information is a statistic that equals zero iff conditional independence holds.) A detailed discussion, while interesting, is omitted for space, but we highlight some of our findings. First, unsurprisingly, the splithead assumption appears to be less valid for languages with freer word order (Czech, Slovene, German) and more valid for more fixed-order languages (Chinese, Turkish, Arabic) or corpora (Japanese). The children of verbs and conjunctions are the most frequent violators. The mutual information between the sequence of dependency labels on the left and on the right, given the head's (coarse) tag, only once exceeded 1 bit (Slovene).</Paragraph>
      <Paragraph position="1"> child and its parent, with the exception of nodes attaching to $. Bounds of this kind are intended to improve precision of non-$ attachments, perhaps sacrificing recall. Fixing bound Blscript, no left dependency may exist between child xi and parent xj such that j[?]i &gt; Blscript (similarly for right dependencies and Br).</Paragraph>
      <Paragraph position="2"> As a result, edge-factored parsing runtime is reduced from O(n3) to O(n(B2lscript +B2r)). For each language, we choose Blscript (Br) to be the minimum value that will allow recovery of 90% of the left (right) dependencies in the training corpus (Tab. 1, cols. 1, 2, and 4). In order to match the training data to the parsing model, we re-attach disallowed long dependencies to $ during training.</Paragraph>
    </Section>
    <Section position="2" start_page="201" end_page="202" type="sub_section">
      <SectionTitle>
3.2 Estimation
</SectionTitle>
      <Paragraph position="0"> The probability model predicts, for each parent word xj, {xi}i[?]yleft(j) and {xi}i[?]yright(j). An advantage of head automaton grammars is that, for a given parent node xj, the children on the same side, yleft(j),  for example, can depend on each other (cf. McDonald et al., 2005). Child nodes in our model are generated outward, conditional on the parent and the most recent same-side sibling (MRSSS). This increases our parser's theoretical runtime to O(n(B3lscript + B3r)), which we found was quite manageable.</Paragraph>
      <Paragraph position="1"> Let pary : {1,2,...,n} - {0,1,...,n} map each node to its parent in y. Let predy : {1,2,...,n} {[?],1,2,...,n} map each node to the MRSSS in y if it exists and [?] otherwise. Let [?]i = |i [?] j |if j is i's parent. Our (probability-deficient) model defines</Paragraph>
      <Paragraph position="3"> Due to the familiar sparse data problem, a maximum likelihood estimate for the ps in Eq. 1 performs very badly (2-23% unlabeled accuracy). Good statistical parsers smooth those distributions by making conditional independence assumptions among variables, including backoff and factorization. Arguably the choice of assumptions made (or interpolated among) is central to the success of many existing parsers.</Paragraph>
      <Paragraph position="4"> Noting that (a) there are exponentially many such options, and (b) the best-performing independence assumptions will almost certainly vary by language, we use a mixture among 8 such models. The same mixture is used for all languages. The models were not chosen with particular care,4 and the mixture is not trained--the coefficients are fixed at uniform, with a unigram coarse-tag model for backoff. In principle, this mixture should be trained (e.g., to maximize likelihood or minimize error on a development dataset).</Paragraph>
      <Paragraph position="5"> The performance of our unlabeled model's top choice and the top-20 oracle are shown in Tab. 1, cols. 5-6. In 5 languages (boldface), perfect labeling and reranking at this stage would have resulted in performance superior to the language's best labeled 4Our infrastructure provides a concise, interpreted language for expressing the models to be mixed, so large-scale combination and comparison are possible.</Paragraph>
      <Paragraph position="6"> system, although the oracle is never on par with the best unlabeled performance.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="202" end_page="203" type="metho">
    <SectionTitle>
4 Labeling
</SectionTitle>
    <Paragraph position="0"> The second component of our system is a labeling model that independently selects a label from D for each parent/child pair in a tree. Given the U best unlabeled trees for a sentence, the labeler produces the L best labeled trees for each unlabeled one.</Paragraph>
    <Paragraph position="1"> The computation involves an O(|D|n) dynamic programming algorithm, the output of which is passed to Huang and Chiang's (2005) algorithm to generate the L-best list.</Paragraph>
    <Paragraph position="2"> We separate the labeler from the parser for two reasons: speed and candidate diversity. In principle the vine parser could jointly predict dependency labels along with structures, but parsing run-time would increase by at least a factor of |D|. The two stage process also forces diversity in the candidate list (20 structures with 50 labelings each); the 1,000-best list of jointly-decoded parses often contained many (bad) relabelings of the same tree.</Paragraph>
    <Paragraph position="3"> In retrospect, assuming independence among dependency labels damages performance substantially for some languages (Turkish, Czech, Swedish, Danish, Slovene, and Arabic); note the often large drop in oracle performance between Tab. 1, cols. 5 and 8. This assumption is necessary in our framework, because the O(|D|M+1n) runtime of decoding with an Mth-order Markov model of labels5 is in general prohibitive--in some cases |D |&gt; 80. Pruning and search heuristics might ameliorate runtime.</Paragraph>
    <Paragraph position="4"> If xi is a child of xj in direction D, and xpred is the MRSSS (possibly [?]), where [?]i = |i[?]j|, we estimate p(lscript,xi,xj,xpred,[?]i  |D) by a mixture (untrained, as in the parser) of four backed-off, factored estimates.</Paragraph>
    <Paragraph position="5"> After parsing and labeling, we have for each sentence a list of U x L candidates. Both the oracle performance of the best candidate in the (20 x 50)best list and the performance of the top candidate are shown in Tab. 1, cols. 8-9. It should be clear from the drop in both oracle and 1-best accuracy that our labeling model is a major source of error.</Paragraph>
    <Paragraph position="6"> 5We tested first-order Markov models that conditioned on parent or MRSSS dependency labels.</Paragraph>
  </Section>
  <Section position="7" start_page="203" end_page="203" type="metho">
    <SectionTitle>
5 Reranking
</SectionTitle>
    <Paragraph position="0"> We train a log-linear model combining many feature scores (see below), including the log-probabilities from the parser and labeler. Training minimizes the expected error under the model; we use deterministic annealing to smooth the error surface and avoid local minima (Rose, 1998; D. Smith and Eisner, 2006).</Paragraph>
    <Paragraph position="1"> We reserved 200 sentences in each language for training the reranker, plus 200 for choosing among rerankers trained on different feature sets and different (U xL)-best lists.6 Features Our reranking features predict tags, labels, lemmata, suffixes and other information given all or some of the following non-local conditioning context: bigrams and trigrams of tags or dependency labels; parent and grandparent dependency labels; subcategorization frames (in terms of tags or dependency labels); the occurrence of certain tags between head and child; surface features like the lemma7 and the 3-character suffix. In some cases the children of a node are considered all together, and in other cases left and right are separated.</Paragraph>
    <Paragraph position="2"> The highest-ranked features during training, for all languages, are the parser and labeler probabilities, followed by p([?]i  |tparent), p(direction | tparent), p(label  |labelpred,labelsucc,subcat), and p(coarse(t)  |D,coarse(tparent),Betw), where Betw is TRUE iff an instance of the coarse tag type with the highest mutual information between its left and right children (usually verb) is between the child and its head.</Paragraph>
    <Paragraph position="3"> Feature and Model Selection For training speed and to avoid overfitting, only a subset of the above features are used in reranking. Subsets of different sizes (10, 20, and 40, plus &amp;quot;all&amp;quot;) are identified for each language using two na&amp;quot;ive feature-selection heuristics based on independent performance of features. The feature subset with the highest accuracy on the 200 heldout sentences is selected.</Paragraph>
    <Paragraph position="4"> 6In training our system, we made a serious mistake in training the reranker on only 200 sentences. As a result, our pretesting estimates of performance (on data reserved for model selection) were very bad. The reranker, depending on condition, had only 2-20 times as many examples as it had parameters to estimate, with overfitting as the result.</Paragraph>
    <Paragraph position="5"> 7The first 4 characters of a word are used where the lemma is not available.</Paragraph>
    <Paragraph position="6"> Performance Accuracy of the top parses after reranking is shown in Tab. 1, cols. 10-11. Reranking almost always gave some improvement over 1-best parsing.8 Because of the vine assumption and the preprocessing step that re-attaches all distant children to $, our parser learns to over-attach to $, treating $-attachment as a default/agnostic choice. For many applications a local, incomplete parse may be sufficiently useful, so we also measured non-$ unlabeled precision and recall (Tab. 1, cols. 12-13); our parser has &gt; 80% precision on 8 of the languages.</Paragraph>
    <Paragraph position="7"> We also applied reranking (with unlabeled features) to the 20-best unlabeled parse lists (col. 7).</Paragraph>
  </Section>
  <Section position="8" start_page="203" end_page="204" type="metho">
    <SectionTitle>
6 Error Analysis: German
</SectionTitle>
    <Paragraph position="0"> The plurality of errors (38%) in German were erroneous $ attachments. For ROOT dependency labels, we have a high recall (92.7%), but low precision (72.4%), due most likely to the dependency length bounds. Among the most frequent tags, our system has most trouble finding the correct heads of prepositions (APPR), adverbs (ADV), finite auxiliary verbs (VAFIN), and conjunctions (KON), and finding the correct dependency labels for prepositions, nouns, and finite auxiliary verbs.</Paragraph>
    <Paragraph position="1"> The German conjunction und is the single word with the most frequent head attachment errors. In many of these cases, our system does not learn the subtle difference between enumerations that are headed by A in A und B, with two children und and B on the right, and those headed by B, with und and A as children on its left.</Paragraph>
    <Paragraph position="2"> Unlike in some languages, our labeled oracle accuracy is nearly as good as our unlabeled oracle accuracy (Tab. 1, cols. 8, 5). Among the ten most frequent dependency labels, our system has the most difficulty with accusative objects (OA), genitive attributes (AG), and postnominal modifiers (MNR).</Paragraph>
    <Paragraph position="3"> Accusative objects are often mistagged as subject (SB), noun kernel modifiers (NK), or AG. About 32% of the postnominal modifier relations (ein Platz in der Geschichte, 'a place in history') are labeled as modifiers (in die Stadt fliegen, 'fly into the city'). Genitive attributes are often tagged as NK since both are frequently realized as nouns.</Paragraph>
  </Section>
  <Section position="9" start_page="204" end_page="204" type="metho">
    <SectionTitle>
7 Error Analysis: Arabic
</SectionTitle>
    <Paragraph position="0"> As with German, the greatest portion of Arabic errors (40%) involved attachments to $. Prepositions are consistently attached too low and accounted for 26% of errors. For example, if a form in construct (idafa) governed both a following noun phrase and a prepositional phrase, the preposition usually attaches to the lower noun phrase. Similarly, prepositions usually attach to nearby noun phrases when they should attach to verbs farther to the left.</Paragraph>
    <Paragraph position="1"> We see a more serious casualty of the dependency length bounds with conjunctions. In ground truth test data, 23 conjunctions are attached to $ and 141 to non-$ to using the COORD relation, whereas 100 conjunctions are attached to $ and 67 to non-$ using the AUXY relation. Our system overgeneralizes and attaches 84% of COORD and 71% of AUXY relations to $. Overall, conjunctions account for 15% of our errors. The AUXY relation is defined as &amp;quot;auxiliary (in compound expressions of various kinds)&amp;quot;; in the data, it seems to be often used for waw-consecutive or paratactic chaining of narrative clauses. If the conjunction wa ('and') begins a sentence, then that conjunction is tagged in ground truth as attaching to $; if the conjunction appears in the middle of the sentence, it may or may not be attached to $.</Paragraph>
    <Paragraph position="2"> Noun attachments exhibit a more subtle problem.</Paragraph>
    <Paragraph position="3"> The direction of system attachments is biased more strongly to the left than is the case for the true data. In canonical order, Arabic nouns do generally attach on the right: subjects and objects follow the verb; in construct, the governed noun follows its governor.</Paragraph>
    <Paragraph position="4"> When the data deviate from this canonical order-when, e.g, a subject precedes its verb--the system prefers to find some other attachment point to the left. Similarly, a noun to the left of a conjunction often erroneously attaches to its left. Such ATR relations account for 35% of noun-attachment errors.</Paragraph>
  </Section>
</Paper>