<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1201">
  <Title>Two Statistical Parsing Models Applied to the Chinese Treebank</Title>
  <Section position="3" start_page="0" end_page="2" type="intro">
    <SectionTitle>
2 Models and Modifications
</SectionTitle>
    <Paragraph position="0"> We will briefly describe the two parsing models employed (for a full description of the BBN model, see (Miller et al., 1998) and also (Bikel, 2000); for a full description Of the TAG model, see (Chiang, 2000)).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Model 2 of (Collins, 1997)
</SectionTitle>
      <Paragraph position="0"> Both parsing models discussed in this paper inherit a great deal from this model, so we briefly describe its &amp;quot;progenitive&amp;quot; features here, describing only how each of the two models of this paper differ in the subsequent two sections. null The lexicalized PCFG that sits behind Model 2 of (Collins, 1997) has rules of the form</Paragraph>
      <Paragraph position="2"> where P, Li, R/ and H are all lexicalized nonterminals, and P inherits its lexical head from its distinguished head child, H. In this generative model, first P is generated, then its head-child H, then each of the left- and right-modifying nonterminals are generated from the head outward. The modifying non-terminals Li and R/are generated conditioning on P and H, as well as a distance metric (based on what material intervenes between the currently-generated modifying non-terminal and H) and an incremental subcat frame feature (a multiset containing the complements of H that have yet to be generated on the side of H in which the currently-generated nonterminal falls). Note that if the modifying nonterminals were generated completely independently, the model would be very impoverished, but in actuality, by including the distance and subcat frame features, the model captures a crucial bit of linguistic reality, viz., that words often have well-defined sets of complements and adjuncts, dispersed with some well-defined distribution in the right hand sides of a (context-free) rewriting system.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
2.2 BBN Model
2.2.1 Overview
</SectionTitle>
      <Paragraph position="0"> The BBN model is also of the lexicalized PCFG variety. In the BB.N model, as with Model 2 of (Collins, 1997), modifying non-terminals are generated conditioning both on the parent P and its head child H. Unlike Model 2 of (Collins, 1997), they are also generated conditioning on the previously generated modifying nonterminal, L/-1 or Pq-1, and there is no subcat frame or distance feature. While the BBN model does not perform at the level of Model 2 of (Collins, 1997) on Wall Street Journal text, it is also less language-dependent, eschewing the distance metric (which relied on specific features of the English Treebank) in favor of the &amp;quot;bigrams on nonterminals&amp;quot; model.</Paragraph>
      <Paragraph position="1">  This section briefly describes the top-level parameters used in the BBN parsing model.</Paragraph>
      <Paragraph position="2"> We use p to denote the unlexicalized nonterminal corresponding to P in (1), and similarly for li, ri and h. We now present the top-level generation probabilities, along with examples from Figure 1. For brevity, we omit the smoothing details of BBN's model (see (Miller et al., 1998) for a complete description); we note that all smoothing weights are computed via the technique described in (Bikel et al., 1997).</Paragraph>
      <Paragraph position="3"> The probability of generating p as the root label is predicted conditioning on only +TOP+, which is the hidden root of all parse trees:</Paragraph>
      <Paragraph position="5"> The probability of generating a head node h with a parent p is P(h I P), e.g., P(VP \] S). (3) The probability of generating a left-modifier</Paragraph>
      <Paragraph position="7"> when generating the NP for NP(Apple-NNP), and the probability of generating a right mod-</Paragraph>
      <Paragraph position="9"> when generating the NP for NP(Microsoft-NNP). 1 The probabilities for generating lexical elements (part-of-speech tags and words) are as follows. The part of speech tag of the head of the entire sentence, th, is computed conditioning only on the top-most symbol p:2</Paragraph>
      <Paragraph position="11"> Part of speech tags of modifier constituents, tli and tri, are predicted conditioning on the modifier constituent li or r/, the tag of the head constituent, th, and the word of the head constituent, Wh P(tl, \[li, th, Wh) and P(tr~ \[ ri, th, Wh). (7) The head word of the entire sentence, Wh, is predicted conditioning only on the top-most symbol p and th:</Paragraph>
      <Paragraph position="13"> Head words of modifier constituents, w h and Wry, are predicted conditioning on all the context used for predicting parts of speech in (7), as well as the parts of speech themsleves P(wt, \[ tl,, li, th, Wh) and P(wri \[ try, ri, th, Wh). (9) The original English model also included a word feature to heIp reduce part-of-speech ambiguity for unknown words, but this component of the model was removed for Chinese, as it was language-dependent.</Paragraph>
      <Paragraph position="14"> The probability of an entire parse tree is the product of the probabilities of generating all of the elements of that parse tree, 1The hidden nonterminal +BEGIN+ is used to provide a convenient mechanism for determining the initial probability of the underlying Markov process generating the modifying nonterminals; the hidden nonterminal +END+ is used to provide consistency to the underlying Markov process, i.e., so that the probabilities of all possible nonterminal sequences sum to 1. 2This is the one place where we altered the original model, as the lexical components of the head of the entire sentence were all being estimated incorrectly, causing an inconsistency in the model. We corrected the estimation of th and Wh in our implementation.</Paragraph>
      <Paragraph position="15"> where an element is either a constituent label, a part of speech tag or a word. We obtain maximum-likelihood estimates of the parameters of this model using frequencies gathered from the training data.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.3 TAG Model
</SectionTitle>
      <Paragraph position="0"> The model of (Chiang, 2000) is based on stochastic TAG (Resnik, 1992; Schabes, 1992). In this model a parse tree is built up not out of lexicalized phrase-structure rules but by tree fragments (called elementary trees) which are texicalized in the sense that each fragment contains exactly one lexical item (its anchor).</Paragraph>
      <Paragraph position="1"> In the variant of TAG we use, there are three kinds of elementary tree: initial, (predicative) auxiliary, and modifier, and three composition operations: substitution, adjunction, and sister-adjunction. Figure 2 illustrates all three of these operations, c~i is an initial tree which substitutes at the leftmost node labeled NP$;/~ is an auxiliary tree which adjoins at the node labeled VP. See (Joshi and Schabes, 1997) for a more detailed explanation. null Sister-adjunction is not a standard TAG operation, but borrowed from D-Tree Grammar (Rainbow et al., 1995). In Figure 2 the modifier tree V is sister adjoined between the nodes labeled VB and NP$. Multiple modifier trees can adjoin at the same place, in the spirit of (Schabes and Shieber, 1994).</Paragraph>
      <Paragraph position="2"> In stochastic TAG, the probability of generating an elementary tree depends on the elementary tree itself and the elementary tree it attaches to. The parameters are as follows:</Paragraph>
      <Paragraph position="4"> where c~ ranges over initial trees,/~ over auxiliary trees, 3' over modifier trees, and T/over nodes. Pi(c~) is the probability of beginning a derivation with c~; Ps(o~ I 77) is the probability of substituting o~ at 7; Pa(/~ I r/) is the probability of adjoining ~ at 7/; finally, Pa(NONE I 7) is the probability of nothing adjoining at ~/.</Paragraph>
      <Paragraph position="5"> Our variant adds another set of parameters:</Paragraph>
      <Paragraph position="7"> This is the probability of sister-adjoining 7 between the ith and i + lth children of ~ (allowing for two imaginary children beyond the leftmost and rightmost children). Since multiple modifier trees can adjoin at the same location, Psa(7) is also conditioned on a flag f which indicates whether '7 is the first modifier tree (i.e., the one closest to the head) to adjoin at that location.</Paragraph>
      <Paragraph position="8"> For our model we break down these probabilities further: first the elementary tree is generated without its anchor, and then its anchor is generated. See (Chiang, 2000) for more details.</Paragraph>
      <Paragraph position="9"> During training each example is broken into elementary trees using head rules and argument/adjunct rules similar to those of (Collins, 1997). The rules are interpreted as follows: a head is kept in the same elementary tree in its parent, an argument is broken off into a separate initial tree, leaving a substitution node, and an adjunct is broken off into a separate modifier tree. A different rule is used for extracting auxiliary trees; see (Chiang, 2000) for details. Xia (1999) describes a similar process, and in fact our rules for the Xinhua corpus are based on hers.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.4 Modifications
</SectionTitle>
      <Paragraph position="0"> The primary language-dependent component that had to be changed in both models was the head table, used to determine heads when training. We modified the head rules described in (Xia, 1999) for the Xinhua corpus and substituted these new rules into both models.</Paragraph>
      <Paragraph position="1"> The (Chiang, 2000) model had the following additional modifications.</Paragraph>
      <Paragraph position="2"> * The new corpus had to be prepared for use with the trainer and parser. Aside from technicalities, this involved retraining the part-of-speech tagger described in (Ratnaparkhi, 1997), which was used for tagging unknown words. We also lowered the unknown word threshold from 4 to 2 because the Xinhua corpus was smaller than the WSJ corpus.</Paragraph>
      <Paragraph position="3"> * In addition to the change to the head-finding rules, we also changed the rules for classifying modifiers as arguments or adjuncts. In both cases the new rules were adapted from (Xia, 1999).</Paragraph>
      <Paragraph position="4"> * For the tests done in this paper, a beam width of 10 -4 was used.</Paragraph>
      <Paragraph position="5"> The BBN model had the following additional modifications: * As with the (Chiang, 2000) model, we similarly lowered the unknown word threshold of the BBN model from its default 5 to 2.</Paragraph>
      <Paragraph position="6"> * The language-dependent word-feature was eliminated, causing parts of speech for unknown words to be predicted solely on the head relations in the model.</Paragraph>
      <Paragraph position="7"> * The default beam size in the probabilistic CKY parsing algorithm was widened. The default beam pruned away chart entries whose scores were not within a factor of e -5 of the top-ranked subtree; this  precision, CB = avg. crossing brackets, 0CB = zero crossing brackets, &lt;2CB = &lt;2 crossing brackets. All results are percentages, except for those in the CB column, tUsed larger beam settings and lower unknown word threshold than the defaults. *3 of the 400 sentences were not parsed due to timeouts and/or pruning problems. :~3 of the 348 sentences did not get parsed due to pruning problems, and 2 other sentences had length mismatches (scoring program errors). tight limit was changed to e -9. Also, the default decoder pruned away all but the top 25-ranked chart entries in each cell; this limit was expanded to 50.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>