<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1011">
  <Title>Parsing with the Shortest Derivation</Title>
  <Section position="3" start_page="0" end_page="70" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> A well-known property of stochastic grammars is their propensity to assign higher probabilities to shorter derivations of a sentence (cf. Chitrao &amp; Grishman 1990; Magerman &amp; Marcus 1991; Briscoe &amp; Carroll 1993; Charniak 1996). This propensity is due to the probability of a derivation being computed as the product of the rule probabilities, and thus shorter derivations involving fewer rules tend to have higher probabilities, almost regardless of the training data.</Paragraph>
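To make the bias concrete, here is a minimal numeric sketch (our own illustration, not taken from the paper) of how the product of rule probabilities penalizes every additional rule application; the probability values are hypothetical:

    import math

    def derivation_probability(rule_probs):
        # probability of a derivation = product of the probabilities of its rules
        return math.prod(rule_probs)

    short = [0.2, 0.2]              # a derivation using 2 rules
    long = [0.2, 0.2, 0.2, 0.2]     # a derivation using 4 rules with the same rule probability

    print(derivation_probability(short))   # 0.04
    print(derivation_probability(long))    # 0.0016 -- the shorter derivation scores higher

Because each extra factor is at most 1, adding rules can only lower (or at best preserve) the derivation probability when rule probabilities are comparable.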
    <Paragraph position="1"> While this bias may seem interesting in the light of the principle of cognitive economy, shorter derivations generate smaller parse trees (consisting of fewer nodes) which are not warranted by the correct parses of sentences. Most systems therefore redress this bias, for instance by normalizing the derivation probability (see Caraballo &amp; Charniak 1998).</Paragraph>
    <Paragraph position="2"> However, for stochastic grammars that use elementary trees instead of context-free rules, the propensity to assign higher probabilities to shorter derivations does not necessarily lead to a bias in favor of smaller parse trees, because elementary trees may differ in size and lexicalization. For Stochastic Tree-Substitution Grammars (STSG) used by Data-Oriented Parsing (DOP) models, it has been observed that the shortest derivation of a sentence consists of the largest subtrees seen in a treebank that generate that sentence (cf. Bod 1992, 98). We may therefore wonder whether for STSG the bias in favor of shorter derivations is perhaps beneficial rather than harmful.</Paragraph>
    <Paragraph position="3"> To investigate this question we created a new STSG-DOP model which uses this bias as a feature.</Paragraph>
    <Paragraph position="4"> This non-probabilistic DOP model parses each sentence by returning its shortest derivation (consisting of the fewest subtrees seen in the corpus). Only if there is more than one shortest derivation does the model back off to a frequency ordering of the corpus-subtrees and choose the shortest derivation with the most highest-ranked subtrees. We compared this non-probabilistic DOP model against the probabilistic DOP model (which estimates the most probable parse for each sentence) on three different domains: the Penn ATIS treebank (Marcus et al. 1993), the Dutch OVIS treebank (Bonnema et al. 1997) and the Penn Wall Street Journal (WSJ) treebank (Marcus et al.</Paragraph>
    <Paragraph position="5"> 1993). Surprisingly, the non-probabilistic DOP model outperforms the probabilistic DOP model on both the ATIS and OVIS treebanks, while it obtains competitive results on the WSJ treebank. We conjecture that any stochastic grammar which uses units of flexible size can be turned into an accurate non-probabilistic version.</Paragraph>
    <Paragraph position="6"> The rest of this paper is organized as follows: we first explain both the probabilistic and non-probabilistic DOP model. Next, we go into the computational aspects of these models, and finally we compare the performance of the models on the three treebanks.</Paragraph>
    <Section position="1" start_page="0" end_page="69" type="sub_section">
      <SectionTitle>
2. Probabilistic vs. Non-Probabilistic Data-Oriented Parsing
</SectionTitle>
      <Paragraph position="0"> Both probabilistic and non-probabilistic DOP are based on the DOP model in Bod (1992), which extracts a Stochastic Tree-Substitution Grammar from a treebank (&amp;quot;STSG-DOP&amp;quot;).1 STSG-DOP uses subtrees from parse trees in a corpus as elementary trees, and leftmost-substitution to combine subtrees into new trees. As an example, consider a very simple corpus consisting of only two trees (we leave out some subcategorizations to keep the example simple): [Figure 1: the example corpus of two trees; tree diagrams not reproduced here.] A new sentence such as She saw the dress with the telescope can be parsed by combining subtrees from this corpus by means of leftmost-substitution (indicated as °): [Figure 2: a derivation and parse tree for She saw the dress with the telescope.]</Paragraph>
      <Paragraph position="1"> 1 Note that the DOP approach of extracting grammars from corpora has been applied to a wide variety of other grammatical frameworks, including Tree-Insertion Grammar (Hoogweg 2000), Tree-Adjoining Grammar (Neumann 1998), Lexical-Functional Grammar (Bod &amp; Kaplan 1998; Way 1999; Bod 2000a), Head-driven Phrase Structure Grammar (Neumann &amp; Flickinger 1999), and Montague Grammar (van den Berg et al. 1994; Bod 1998). For the relation between DOP and Memory-Based Learning, see Daelemans (1999).</Paragraph>
      <Paragraph position="3"> Note that other derivations, involving different subtrees, may yield the same parse tree; for instance: [Figure 3: a different derivation yielding the same parse tree for She saw the dress with the telescope.]</Paragraph>
      <Paragraph position="4"> Note also that, given this example corpus, the sentence we considered is ambiguous; by combining other subtrees, a different parse may be derived, which is analogous to the first rather than the second corpus tree: [Figure 4: a different parse tree for She saw the dress with the telescope.]</Paragraph>
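The leftmost-substitution operation used in these derivations can be sketched in a few lines of Python; the nested-list tree encoding, the helper names, and the example subtrees below are our own assumptions for illustration, not part of the paper:

    # A tree is a nested list [label, child1, child2, ...]; a bare string at the
    # frontier is either a word or an open substitution site (a nonterminal label).
    OPEN_LABELS = frozenset({"S", "NP", "VP", "PP", "P", "V"})

    def tree_at(tree, path):
        # return the node reached by following the child indices in `path`
        return tree if not path else tree_at(tree[path[0]], path[1:])

    def leftmost_open_site(tree, open_labels=OPEN_LABELS):
        # path to the leftmost open substitution site, or None if the tree is complete
        if isinstance(tree, str):
            return [] if tree in open_labels else None
        for i, child in enumerate(tree[1:], start=1):
            path = leftmost_open_site(child, open_labels)
            if path is not None:
                return [i] + path
        return None

    def substitute(tree, path, subtree):
        # return a copy of `tree` with the node at `path` replaced by `subtree`
        if not path:
            return subtree
        i = path[0]
        return tree[:i] + [substitute(tree[i], path[1:], subtree)] + tree[i + 1:]

    def derive(subtrees):
        # combine elementary trees by repeated leftmost-substitution
        tree = subtrees[0]
        for sub in subtrees[1:]:
            path = leftmost_open_site(tree)
            assert path is not None and sub[0] == tree_at(tree, path)  # root labels must match
            tree = substitute(tree, path, sub)
        return tree

    # Hypothetical elementary trees, loosely following the paper's example sentence:
    t1 = ["S", ["NP", "she"], ["VP", ["V", "saw"], ["NP", ["NP", "the", "dress"], "PP"]]]
    t2 = ["PP", ["P", "with"], ["NP", "the", "telescope"]]
    print(derive([t1, t2]))

In this assumed t1 the open PP site sits inside the object NP; the particular attachment is chosen only to illustrate how two elementary trees combine into a full parse tree.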
      <Paragraph position="6"> The probabilistic and non-probabilistic DOP models differ in the way they define the best parse tree of a sentence. We now discuss these models separately.</Paragraph>
    </Section>
    <Section position="2" start_page="69" end_page="70" type="sub_section">
      <SectionTitle>
2.1 The probabilistic DOP model
</SectionTitle>
      <Paragraph position="0"> The probabilistic DOP model introduced in Bod (1992, 93) computes the most probable parse tree of a sentence from the normalized subtree frequencies in the corpus. The probability of a subtree t is estimated as the number of occurrences of t seen in the corpus, divided by the total number of occurrences of corpus-subtrees that have the same root label as t. Let |t| return the number of occurrences of t in the corpus and let r(t) return the root label of t; then: P(t) = |t| / \sum_{t' : r(t') = r(t)} |t'|.2 The probability of a derivation is computed as the product of the probabilities of the subtrees involved in it. The probability of a parse tree is computed as the sum of the probabilities of all distinct derivations that produce that tree. The parse tree with the highest probability is defined as the best parse tree of a sentence.</Paragraph>
      <Paragraph position="1"> 2 It should be stressed that there may be several other ways to estimate subtree probabilities in DOP. For example, Bonnema et al. (1999) estimate the probability of a subtree as the probability that it has been involved in the derivation of a corpus tree. It is not yet known whether this alternative probability model outperforms the model in Bod (1993). Johnson (1998) pointed out that the subtree estimator in Bod (1993) yields a statistically inconsistent model. This means that as the training corpus increases, the corresponding sequences of probability distributions do not converge to the true distribution that generated the training data. Experiments with a consistent maximum likelihood estimator (based on the inside-outside algorithm in Lari and Young 1990) lead, however, to a significant decrease in parse accuracy on the ATIS and OVIS corpora. This indicates that statistical consistency does not necessarily lead to better performance.</Paragraph>
      <Paragraph position="3"> The probabilistic DOP model thus considers counts of subtrees of a wide range of sizes in computing the probability of a tree: everything from counts of single-level rules to counts of entire trees.</Paragraph>
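A minimal sketch of this estimator and of the derivation and parse-tree probabilities, assuming that each corpus subtree occurrence is represented as a hashable identifier paired with its root label (our own encoding, not the paper's):

    from collections import Counter
    from math import prod

    def estimate_subtree_probs(corpus_subtree_occurrences):
        # corpus_subtree_occurrences: list of (subtree_id, root_label) pairs,
        # one entry per occurrence of a subtree in the treebank
        occurrences = Counter(corpus_subtree_occurrences)                # |t| for each subtree
        root_totals = Counter(root for _, root in corpus_subtree_occurrences)
        # P(t) = |t| divided by the total occurrences of corpus-subtrees with the same root label
        return {t: count / root_totals[t[1]] for t, count in occurrences.items()}

    def derivation_probability(derivation, probs):
        # probability of a derivation = product of its subtree probabilities
        return prod(probs[t] for t in derivation)

    def parse_tree_probability(derivations_of_tree, probs):
        # probability of a parse tree = sum over all its distinct derivations
        return sum(derivation_probability(d, probs) for d in derivations_of_tree)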
    </Section>
    <Section position="3" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
2.2 The non-probabilistic DOP model
</SectionTitle>
      <Paragraph position="0"> The non-probabilistic DOP model uses a rather different definition of the best parse tree. Instead of computing the most probable parse of a sentence, it computes the parse tree which can be generated by the fewest corpus-subtrees, i.e., by the shortest derivation, independent of the subtree probabilities.</Paragraph>
      <Paragraph position="1"> Since subtrees are allowed to be of arbitrary size, the shortest derivation typically corresponds to the parse tree which consists of the largest possible corpus-subtrees, thus maximizing syntactic context. For example, given the corpus in Figure 1, the best parse tree for She saw the dress with the telescope is given in Figure 3, since that parse tree can be generated by a derivation of only two corpus-subtrees, while the parse tree in Figure 4 needs at least three corpus-subtrees to be generated. (Interestingly, the parse tree with the shortest derivation in Figure 3 is also the most probable parse tree according to probabilistic DOP for this corpus, but this need not always be so.</Paragraph>
      <Paragraph position="2"> As mentioned, the probabilistic DOP model already has a bias to assign higher probabilities to parse trees that can be generated by shorter derivations. The non-probabilistic DOP model makes this bias absolute.) The shortest derivation may not be unique: it may happen that different parses of a sentence are generated by the same minimal number of corpus-subtrees. In that case the model backs off to a frequency ordering of the subtrees. That is, all subtrees of each root label are assigned a rank according to their frequency in the corpus: the most frequent subtree (or subtrees) of each root label get rank 1, the second most frequent subtree gets rank 2, etc. Next, the rank of each (shortest) derivation is computed as the sum of the ranks of the subtrees involved. The derivation with the smallest sum, or highest rank, is taken as the best derivation, producing the best parse tree.</Paragraph>
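This back-off can be sketched as follows (a sketch under our own assumptions: corpus_counts maps each (subtree_id, root_label) pair to its corpus frequency, and the competing shortest derivations are given as lists of such pairs):

    from collections import defaultdict

    def frequency_ranks(corpus_counts):
        # rank subtrees per root label: most frequent gets rank 1, ties share a rank,
        # the next most frequent gets rank 2, and so on
        by_root = defaultdict(list)
        for (sub, root), count in corpus_counts.items():
            by_root[root].append((count, sub))
        ranks = {}
        for root, items in by_root.items():
            rank, prev_count = 0, None
            for count, sub in sorted(items, key=lambda cs: -cs[0]):
                if count != prev_count:
                    rank, prev_count = rank + 1, count
                ranks[(sub, root)] = rank
        return ranks

    def best_shortest_derivation(shortest_derivations, corpus_counts):
        # among equally short derivations, pick the one whose rank sum is smallest
        ranks = frequency_ranks(corpus_counts)
        return min(shortest_derivations,
                   key=lambda derivation: sum(ranks[t] for t in derivation))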
      <Paragraph position="3"> The way we compute the rank of a derivation, by summing up the ranks of its subtrees, may seem rather ad hoc. However, it is possible to provide an information-theoretical motivation for this model.</Paragraph>
      <Paragraph position="4"> According to Zipf's law, rank is roughly proportional to the negative logarithm of frequency (Zipf 1935). In Shannon's Information Theory (Shannon 1948), the negative logarithm (of base 2) of the probability of an event is better known as the information of that event. Thus, the rank of a subtree is roughly proportional to its information. It follows that minimizing the sum of the subtree ranks in a derivation corresponds to minimizing the (self-)information of a derivation.</Paragraph>
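The reasoning can be written out as a short derivation (a sketch of the argument in LaTeX notation; the proportionality constant c and the use of relative frequencies as probabilities are assumptions of the argument, not results of the paper):

    rank(t) \approx -c \log_2 P(t)          % Zipf's law, with c a positive constant

    \sum_{i=1}^{n} rank(t_i) \approx -c \sum_{i=1}^{n} \log_2 P(t_i)
                             = -c \log_2 \prod_{i=1}^{n} P(t_i)
                             = -c \log_2 P(d)
                             = c \cdot I(d)

where d is a derivation consisting of the subtrees t_1, ..., t_n, P(d) is its probability under the DOP model (the product of its subtree probabilities), and I(d) = -\log_2 P(d) is its self-information. Minimizing the rank sum therefore approximately minimizes I(d).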
    </Section>
  </Section>
</Paper>