<?xml version="1.0" standalone="yes"?>
<Paper uid="E93-1040">
  <Title>Parsing the Wall Street Journal with the Inside-Outside Algorithm</Title>
  <Section position="3" start_page="341" end_page="341" type="metho">
    <SectionTitle>
2 Training Corpus
</SectionTitle>
    <Paragraph position="0"> The experiments use texts from the Wall Street Journal Corpus and its partially bracketed version provided by the Penn Treebank (Brill et al., 1990). Out of 38 600 bracketed sentences (914 000 words), we extracted 34500 sentences (817 000 words) as a possible source of training material and 4100 sentences (97 000 words) as a source for testing. We experimented with several subsets (350, 1095, 8000 and 34500 sentences) of the available training material.</Paragraph>
    <Paragraph position="1"> For practical purposes, the part of the tree bank used for training is preprocessed before being used. First, flat portions of parse trees found in the tree bank are turned into a right linear binary branching structure. This enables us to take full advantage of the fact that the extended inside-outside algorithm (as described in Pereira and Schabes, 1992) behaves in linear time when the text is fully bracketed. Then, the syntactic labels are ignored. This allows the reestimation algorithm to distribute its own set of labels based on their actual distribution. We later suggest a method for recovering these labels.</Paragraph>
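The first preprocessing step can be illustrated with a minimal Python sketch. It assumes a tree is represented as nested Python lists whose leaves are part-of-speech tag strings; this representation and the function name `right_binarize` are illustrative assumptions, not part of the original system. Labels are simply absent from the representation, which corresponds to the second step.

```python
def right_binarize(tree):
    """Turn every flat node into a right linear binary branching structure."""
    if isinstance(tree, str):          # a leaf: a part-of-speech tag
        return tree
    children = [right_binarize(child) for child in tree]
    if len(children) == 1:             # unary node: nothing to bracket
        return children[0]
    result = children[-1]              # start from the rightmost child
    for child in reversed(children[:-1]):
        result = [child, result]       # nest each earlier child to the left
    return result


flat = ["DT", "JJ", "NNS"]
print(right_binarize(flat))            # ['DT', ['JJ', 'NNS']]
```

Nodes that are already binary are left unchanged, so the function can be applied to a whole partially bracketed tree.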
    <Paragraph position="2"> The following is an example of a partially parsed sentence found in the Penn Treebank:</Paragraph>
    <Paragraph position="4"> The above parse corresponds to the fully bracketed unlabeled parse</Paragraph>
    <Paragraph position="6"> found in the training corpus. The experiments reported in this paper use only the part-of-speech sequences of this corpus and the resulting fully bracketed parses. For the above example, the following bracketing is used in the training material: (DT (NN (IN (DT (JJ NNS)))) (VBZ (VBN VBN)))</Paragraph>
  </Section>
  <Section position="4" start_page="341" end_page="344" type="metho">
    <SectionTitle>
3 Inferring Bracketings
</SectionTitle>
    <Paragraph position="0"> For the set of experiments described in this section, the initial grammar consists of all 4095 possible Chomsky Normal Form rules over 15 nonterminals ($X_i$, $1 \le i \le 15$) and 48 terminal symbols ($t_m$, $1 \le m \le 48$) for part-of-speech tags (the same set as the one used in the Penn Treebank):</Paragraph>
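The figure of 4095 rules can be checked by enumeration. The sketch below assumes the two Chomsky Normal Form rule shapes mentioned later in the text, binary rules $X_i \to X_j X_k$ and lexical rules $X_i \to t_m$; the symbol names `X1`..`X15` and `t1`..`t48` are placeholders, not the actual Penn Treebank tags.

```python
from itertools import product

nonterminals = [f"X{i}" for i in range(1, 16)]   # X1 .. X15
terminals = [f"t{m}" for m in range(1, 49)]      # t1 .. t48 (placeholder names)

# Binary rules X_i -> X_j X_k: one for every triple of nonterminals.
binary_rules = [(a, (b, c)) for a, b, c in product(nonterminals, repeat=3)]
# Lexical rules X_i -> t_m: one for every nonterminal/terminal pair.
lexical_rules = [(a, (t,)) for a, t in product(nonterminals, terminals)]

print(len(binary_rules), len(lexical_rules), len(binary_rules) + len(lexical_rules))
# 3375 720 4095
```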
    <Paragraph position="2"> The parameters of the initial stochastic context-free grammar are set randomly while maintaining the proper conditions for stochastic context-free grammars (see footnote 1). Using the algorithm described in Pereira and Schabes (1992), the current rule probabilities and the parsed training set C are used to estimate the expected frequencies of each rule. Once these frequencies are computed over each bracketed sentence c in the training set, new rule probabilities are assigned in a way that increases the estimated probability of the bracketed training set. This process is iterated until the increase in the estimated probability of the bracketed training text becomes negligible, or equivalently, until the decrease in cross entropy</Paragraph>
    <Paragraph position="4"> becomes negligible. In the above formula, the probability P(c) of the partially bracketed sentence c is computed as the sum of the probabilities of all derivations compatible with the bracketing of the sentence. This notion of compatible bracketing is defined in detail in Pereira and Schabes (1992). Informally speaking, a derivation is compatible with the bracketing of the input given in the tree bank if no bracket imposed by the derivation crosses a bracket in the input.</Paragraph>
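As a rough illustration of the stopping criterion, the sketch below computes a per-word cross entropy estimate from sentence probabilities. It assumes the normalization commonly attributed to Pereira and Schabes (1992), the negative sum of log P(c) over the training set divided by the total number of words; the base-2 logarithm, the function name and the toy probabilities are assumptions made here for illustration.

```python
import math


def cross_entropy(sentence_probs, sentence_lengths):
    """Estimated per-word cross entropy in bits: -sum(log2 P(c)) / sum(|c|)."""
    total_log_prob = sum(math.log2(p) for p in sentence_probs)
    total_words = sum(sentence_lengths)
    return -total_log_prob / total_words


# Two toy sentences with probabilities 1/4 and 1/8, of lengths 2 and 3.
print(cross_entropy([0.25, 0.125], [2, 3]))   # 1.0
```

Training would stop once the difference between the values obtained at two successive iterations falls under a chosen tolerance.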
    <Paragraph position="5">  As training material, we randomly selected 1042 sentences of length shorter than 15 words out of the available training material. For evaluation purposes, we also randomly selected 84 sentences of length shorter than 15 words among the test sentences.</Paragraph>
    <Paragraph position="6"> 1. The sum of the probabilities of the rules with the same left-hand side must be one.</Paragraph>
    <Paragraph position="7"> Figure 1 shows the cross entropy of the training material after each iteration. It also shows, for each iteration, the cross entropy of 84 sentences randomly selected among the test sentences of length shorter than 15 words. The cross entropy decreases as more iterations are performed, and no overtraining is observed.</Paragraph>
    <Paragraph position="8"> [Figure 1: Cross entropy after each iteration for the training material and for 84 test sentences shorter than 15 words.]</Paragraph>
    <Paragraph position="9"> To evaluate the quality of the analyses yielded by the inferred grammars obtained after each iteration, we used a Viterbi-style parser to find the most likely analyses of sentences in several test samples, and compared them with the Treebank partial bracketings of the sentences of those samples. For each sample, we counted the percentage of brackets of the most likely analysis that are not "crossing" the partial bracketing of the same sentences found in the Treebank. This percentage is called the bracketing accuracy (see Pereira and Schabes, 1992 for the precise definition of this measure). We also computed the percentage of sentences in each sample in which no crossing bracket was found. This percentage is called the sentence accuracy.</Paragraph>
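The two measures can be sketched in Python as follows. The sketch assumes that a bracket is a half-open span `(start, end)` over word positions and that every predicted bracket is counted; the precise definition in Pereira and Schabes (1992) may differ in such details, and the function names are illustrative.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return (k > i and j > k and l > j) or (i > k and l > i and j > l)


def bracketing_accuracy(predicted, gold):
    """Fraction of predicted brackets that cross no gold bracket."""
    ok = [p for p in predicted if not any(crosses(p, g) for g in gold)]
    return len(ok) / len(predicted)


def sentence_accuracy(samples):
    """Fraction of sentences whose analysis has no crossing bracket."""
    clean = sum(1 for pred, gold in samples
                if bracketing_accuracy(pred, gold) == 1.0)
    return clean / len(samples)


gold = [(0, 2), (2, 5)]
predicted = [(0, 2), (1, 3), (3, 5), (0, 5)]      # only (1, 3) crosses (0, 2)
print(bracketing_accuracy(predicted, gold))        # 0.75
print(sentence_accuracy([(predicted, gold), ([(0, 2)], [(0, 2)])]))  # 0.5
```

Because the Treebank bracketing is only partial, a predicted bracket that is absent from the gold analysis is not penalized unless it crosses a gold bracket.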
    <Paragraph position="10"> Figure 2 shows the bracketing and sentence accuracy for the same 84 test sentences.</Paragraph>
    <Paragraph position="11"> Table 1 shows the bracketing and sentence accuracy for test sentences within various length ranges. High bracketing accuracy is obtained even on relatively long sentences. However, as expected, the sentence accuracy decreases rapidly as the sentences get longer.</Paragraph>
    <Paragraph position="12"> [Table 1: Bracketing and sentence accuracy on test sentences of different lengths (using 1042 sentences of lengths shorter than 15 words as training material).]</Paragraph>
    <Paragraph position="13"> Table 2 compares our results with the bracketing accuracy of analyses obtained by a systematic right linear branching structure for all words except for the final punctuation mark (which we attached high; see footnote 2). We also evaluated the stochastic context-free grammar obtained by collecting each level of the trees found in the training tree bank (see Table 2).</Paragraph>
    <Paragraph position="14"> [Table 2: Bracketing accuracy of the inferred grammar, of right linear structures and of the Treebank grammar.]</Paragraph>
    <Paragraph position="15"> Right linear structures perform surprisingly well. Our results improve by 20 percentage points upon this baseline performance. These results suggest that the distribution of sentence structure in naturally occurring text is simpler than one may have thought, especially since only part-of-speech tags were used. This may suggest the existence of clusters of trees in the training material. However, using the number of crossing brackets as a distance between trees, we have been unable to reveal the existence of clusters.</Paragraph>
    <Paragraph position="16"> 2. We thank Eric Brill and David Yarowsky for suggesting these experiments.</Paragraph>
    <Paragraph position="17"> The grammar obtained by collecting rules from the tree bank performs very poorly. One can conclude that the labels used in the tree bank do not have any statistical property. The task of inferring a stochastic grammar from a tree bank is not trivial and therefore requires statistical training.</Paragraph>
    <Paragraph position="18"> In the appendix we give examples of the most likely analyses output by the inferred grammar on several test sentences. In Table 3, different subsets of the available training sentences of lengths up to 15 words were used for training, and the resulting grammars were evaluated on the same set of test sentences of lengths shorter than 15 words. The size of the training set does not seem to affect the performance of the parser. [Table 3: Effect of the size of the training set on the bracketing and sentence accuracy.]</Paragraph>
    <Paragraph position="19"> However, if one includes all available sentences (34700 sentences), for the same test set, the bracketing accuracy drops to 84% and the sentence accuracy to 40%.</Paragraph>
    <Paragraph position="20"> We have also experimented with the following initial grammar, which defines a large number of rules</Paragraph>
    <Paragraph position="22"> In this grammar, each non-terminal symbol is uniquely associated with a terminal symbol. We observed overtraining with this grammar. Better statistical convergence was obtained; however, the performance of the parser did not improve.</Paragraph>
  </Section>
  <Section position="5" start_page="344" end_page="344" type="metho">
    <SectionTitle>
4 Reducing the Grammar Size and
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="344" end_page="344" type="sub_section">
      <SectionTitle>
Smoothing Issues
</SectionTitle>
      <Paragraph position="0"> As grammars are being inferred at each iteration, the training algorithm was designed to guarantee that no parameter was set below some small threshold. This constraint is important for smoothing. It implies that no rule ever disappears at a reestimation step.</Paragraph>
      <Paragraph position="1"> However, once the final grammar is found, for practical purposes, one can reduce the number of parameters being used. For example, the size of the grammar can be reduced by eliminating the rules whose probabilities are below some threshold or by keeping for each non-terminal only the top rules rewriting it.</Paragraph>
      <Paragraph position="2"> However, one runs the risk of not being able to parse sentences given as input. We used the following smoothing heuristics.</Paragraph>
      <Paragraph position="3"> Lexical rule smoothing. When no rule in the grammar introduces a terminal symbol found in the input string, we assigned a lexical rule ($X_i \to t_m$) with very low probability for all non-terminal symbols. This case will not happen if the training is representative of the lexical items.</Paragraph>
      <Paragraph position="4"> Syntactic rule smoothing. When the sentence is not recognized from the starting symbol, we considered all possible non-terminal symbols as starting symbols and chose as starting symbol the one that yields the most likely analysis. Although this procedure may not guarantee that all sentences will be recognized, we found it very useful in practice.</Paragraph>
      <Paragraph position="5"> When none of the above procedures enables parsing of the sentence, we used the entire set of parameters of the inferred grammar (this was never the case on the test sentences we considered).</Paragraph>
      <Paragraph position="6"> For example, the grammar whose performance is depicted in Table 2 defines 4095 parameters. However, the same performance is achieved on these test sets by using only 450 rules (the top 20 binary branching rules $X_i \to X_j X_k$ for each non-terminal symbol and the top 10 lexical rules $X_i \to t_m$ for each non-terminal symbol).</Paragraph>
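The reduction to 450 rules (15 non-terminals, each keeping 20 binary and 10 lexical rules) can be sketched as a top-k selection per left-hand side. The dictionary layout `{(lhs, rhs): probability}`, the function name `prune` and the toy grammar are assumptions made for illustration; the source does not say whether the surviving probabilities are renormalized, so this sketch leaves them unchanged.

```python
from collections import defaultdict


def prune(rule_probs, top_binary=20, top_lexical=10):
    """Keep the most probable binary and lexical rules for each nonterminal."""
    by_lhs = defaultdict(lambda: {"binary": [], "lexical": []})
    for (lhs, rhs), prob in rule_probs.items():
        kind = "binary" if len(rhs) == 2 else "lexical"
        by_lhs[lhs][kind].append((prob, rhs))
    kept = {}
    for lhs, groups in by_lhs.items():
        for kind, limit in (("binary", top_binary), ("lexical", top_lexical)):
            best = sorted(groups[kind], reverse=True)[:limit]
            for prob, rhs in best:
                kept[(lhs, rhs)] = prob
    return kept


toy = {
    ("X1", ("X1", "X2")): 0.5,
    ("X1", ("X2", "X2")): 0.2,
    ("X1", ("t1",)): 0.2,
    ("X1", ("t2",)): 0.1,
}
print(prune(toy, top_binary=1, top_lexical=1))
print(15 * (20 + 10))   # 450
```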
    </Section>
  </Section>
  <Section position="6" start_page="344" end_page="345" type="metho">
    <SectionTitle>
5. Implementation
</SectionTitle>
    <Paragraph position="0"> Pereira and Schabes (1992) note that the training algorithm behaves in linear time (with respect to the sentence length) when the training material consists of fully bracketed sentences. By taking advantage of this fact, the experiments using a small number of initial rules and a small subset of the available training materials do not require a lot of computation time and can be performed on a single workstation. However, the experiments using larger initial grammars or using more material require more computation.</Paragraph>
    <Paragraph position="1"> The training algorithm can be parallelized by dividing the training corpus into fixed size blocks of sentences and by having multiple workstations processing each one of them independently. When all blocks have been computed, the counts are merged and the parameters are reestimated. For this purpose, we used PVM (Beguelin et al., 1991) as a mechanism for message passing across workstations.</Paragraph>
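The merge and reestimation step can be sketched in a single process as follows. The expected counts for each block are taken as given (in the real system they would come from the inside-outside computation on each workstation), message passing is not modeled, and the minimum-probability threshold discussed earlier is omitted; all names and numbers are illustrative assumptions.

```python
from collections import Counter, defaultdict


def merge_counts(block_counts):
    """Sum the expected rule counts computed independently for each block."""
    total = Counter()
    for counts in block_counts:
        total.update(counts)          # Counter.update adds counts together
    return total


def reestimate(total_counts):
    """Normalize counts so that rules sharing a left-hand side sum to one."""
    lhs_totals = defaultdict(float)
    for (lhs, _rhs), count in total_counts.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]]
            for rule, count in total_counts.items()}


block_a = {("X1", ("X1", "X2")): 1.5, ("X1", ("t1",)): 0.5}
block_b = {("X1", ("X1", "X2")): 0.5, ("X1", ("t1",)): 1.5}
print(reestimate(merge_counts([block_a, block_b])))
# {('X1', ('X1', 'X2')): 0.5, ('X1', ('t1',)): 0.5}
```

Since the counts are simply added, the result does not depend on how the corpus is split into blocks.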
    <Paragraph position="2"> 6. Stochastic Model of Labeling for Binary Branching Trees. The stochastic grammars inferred by the training procedures produce unlabeled parse trees. We are currently evaluating the following stochastic model for labeling a binary branching tree. In this approach, we make the simplifying assumption that the label of a node only depends on the labels of its children. Under this assumption, the probability of labeling a tree is the product of the probability of labeling each level in the tree. For example, the probability of the following labeling:  These probabilities can be estimated in a simple manner given a tree bank. For example, the probability of labeling a level as $NP \to DT\ NN$ is estimated as the number of occurrences (in the tree bank) of $NP \to DT\ NN$ divided by the number of occurrences of $X \to DT\ NN$, where X ranges over every label.</Paragraph>
    <Paragraph position="3"> Then the probability of a labeling can be computed bottom-up from leaves to root. Using dynamic programming on increasingly large subtrees, the labeling with the highest probability can be computed.</Paragraph>
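A minimal sketch of this labeling model is given below. It assumes binary trees encoded as nested pairs with part-of-speech tags at the leaves, and level counts stored as `{(parent, (left, right)): count}`; the counts, the label `X` and the function names are hypothetical and serve only to illustrate the relative-frequency estimate and the bottom-up search.

```python
from collections import Counter


def estimate(level_counts):
    """P(parent | children) = count(parent, children) / count(any parent, children)."""
    totals = Counter()
    for (parent, children), n in level_counts.items():
        totals[children] += n
    return {(parent, children): n / totals[children]
            for (parent, children), n in level_counts.items()}


def best_labelings(tree, probs):
    """Map each possible root label to (probability, labeled tree) for the subtree."""
    if isinstance(tree, str):                     # leaf: its tag is its label
        return {tree: (1.0, tree)}
    left, right = (best_labelings(child, probs) for child in tree)
    table = {}
    for (parent, (l, r)), p in probs.items():
        if l in left and r in right:
            score = p * left[l][0] * right[r][0]
            if parent not in table or score > table[parent][0]:
                table[parent] = (score, (parent, left[l][1], right[r][1]))
    return table


counts = {("NP", ("DT", "NN")): 9, ("X", ("DT", "NN")): 1, ("S", ("NP", "VBZ")): 4}
probs = estimate(counts)
table = best_labelings((("DT", "NN"), "VBZ"), probs)
print(max(table.values()))   # (0.9, ('S', ('NP', 'DT', 'NN'), 'VBZ'))
```

Keeping one best entry per candidate label at every node is what allows a locally less likely label to win when it leads to a more probable labeling higher in the tree.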
    <Paragraph position="4">  We are currently evaluating the effectiveness of this method.</Paragraph>
  </Section>
</Paper>