<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1091">
  <Title>An Empirical Evaluation of Probabilistic Lexicalized Tree Insertion Grammars *</Title>
  <Section position="4" start_page="558" end_page="562" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> In the following experiments we show that PLTIGs of varying sizes and configurations can be induced by processing a large training corpus, and that the trained PLTIGs can provide parses on unseen test data of comparable quality to the parses produced by PCFGs. Moreover, we show that PLTIGs have significantly lower entropy values than PCFGs, suggesting that they make better language models. We describe the induction process of the PLTIGs in Section 3.1. Two corpora of very different nature are used for training and testing. The first set of experiments uses the Air Travel Information System (ATIS) corpus. Section 3.2 presents the complete results of this set of experiments. To determine if PLTIGs can scale up well, we have also begun another study that uses a larger and more complex corpus, the Wall Street Journal TreeBank corpus. The initial results are discussed in Section 3.3. To reduce the effect of the data sparsity problem, we back off from lexical words to using the part of speech tags as the anchoring lexical items in all the experiments. Moreover, we use the deleted-interpolation smoothing technique for the N-gram models and PLTIGs. PCFGs do not require smoothing in these experiments.</Paragraph>
    <Section position="1" start_page="558" end_page="560" type="sub_section">
      <SectionTitle>
3.1 Grammar Induction
</SectionTitle>
      <Paragraph position="0"> The technique used to induce a grammar is a subtractive process. Starting from a universal grammar (i.e., one that can generate any string made up of the alphabet set), the parameters Example sentence: The cat chases the mouse Corresponding derivation tree: tinit .~dJ.</Paragraph>
      <Paragraph position="1"> tthe .~dj.</Paragraph>
      <Paragraph position="2"> teat ~dj.</Paragraph>
      <Paragraph position="3"> tchase s ~dj.</Paragraph>
      <Paragraph position="4"> ttht ,,,1~t. adj.</Paragraph>
      <Paragraph position="5">  tree is right adjoined to the tree anchored with the neighboring word in the sentence, the only structure is right branching.</Paragraph>
      <Paragraph position="6"> are iteratively refined until the grammar generates, hopefully, all and only the sentences in the target language, for which the training data provides an adequate sampling. In the case of a PCFG, the initial grammar production rule set contains all possible rules in Chomsky Normal Form constructed by the nonterminal and terminal symbols. The initial parameters associated with each rule are randomly generated subject to an admissibility constraint. As long as all the rules have a non-zero probability, any string has a non-zero chance of being generated. To train the grammar, we follow the Inside-Outside re-estimation algorithm described by Lari and Young (1990). The Inside-Outside re-estimation algorithm can also be extended to train PLTIGs. The equations calculating the inside and outside probabilities for PLTIGs can be found in Hwa (1998).</Paragraph>
      <Paragraph position="7"> As with PCFGs, the initial grammar must be able to generate any string. A simple PLTIG that fits the requirement is one that simulates a bigram model. It is represented by a tree set that contains a right auxiliary tree for each lexical item as depicted in Figure 1. Each tree has one adjunction site into which other right auxiliary trees can adjoin. The tree set has only one initial tree, which is anchored by an empty lexical item. The initial tree represents the start of the sentence. Any string can be constructed by right adjoining the words together in order. Training the parameters of this grammar yields the same result as a bigram model: the parameters reflect close correlations between words</Paragraph>
      <Paragraph position="9"> that are frequently seen together, but the model cannot provide any high-level linguistic structure. (See example in Figure 2.) Example sentence: The cat chases the mouse Corresponding derivation tree: tinit .~dj.</Paragraph>
      <Paragraph position="10">  possible, the sentences can be parsed in a more linguistically plausible way To generate non-linear structures, we need to allow adjunction in both left and right directions. The expanded LTIG tree set includes a left auxiliary tree representation as well as right for each lexical item. Moreover, we must modify the topology of the auxiliary trees so that adjunction in both directions can occur. We insert an intermediary node between the root and the lexical word. At this internal node, at most one adjunction of each direction may take place. The introduction of this node is necessary because the definition of the formalism disallows right adjunction into the root node of a left auxiliary tree and vice versa. For the sake of uniformity, we shall disallow adjunction into the root nodes of the auxiliary trees from now on. Figure 3 shows an LTIG that allows at most one left and one right adjunction for each elementary tree. This enhanced LTIG can produce hierarchical structures that the bigram model could not (See Figure 4.) It is, however, still too limiting to allow only one adjunction from each direction. Many  words often require more than one modifier. For example, a transitive verb such as &amp;quot;give&amp;quot; takes at least two adjunctions: a direct object noun phrase, an indirect object noun phrase, and possibly other adverbial modifiers. To create more adjunct/on sites for each word, we introduce yet more intermediary nodes between the root and the lexical word. Our empirical studies show that each lexicalized auxiliary tree requires at least 3 adjunction sites to parse all the sentences in the corpora. Figure 5(a) and (b) show two examples of auxiliary trees with 3 adjunction sites. The number of parameters in a PLTIG is dependent on the number of adjunction sites just as the size of a PCFG is dependent on the number of nonterminals. For a language with V vocabulary items, the number of parameters for the type of PLTIGs used in this paper is 2(V+I)+2V(K)(V+I), where K is the number of adjunction sites per tree. The first term of the equation is the number of parameters contributed by the initial tree, which always has two adjunction sites in our experiments. The second term is the contribution from the auxiliary trees. There are 2V auxiliary trees, each tree has K adjunction sites; and V + 1 parameters describe the distribution of adjunction at each site. The number of parameters of a PCFG with M nonterminals is M 3 + MV. For the experiments, we try to choose values of K and M for the PLTIGs and PCFGs such that</Paragraph>
      <Paragraph position="12"/>
    </Section>
    <Section position="2" start_page="560" end_page="561" type="sub_section">
      <SectionTitle>
3.2 ATIS
</SectionTitle>
      <Paragraph position="0"> To reproduce the results of PCFGs reported by Pereira and Schabes, we use the ATIS corpus for our first experiment. This corpus contains 577 sentences with 32 part-of-speech tags. To ensure statistical significance, we generate ten random train-test splits on the corpus. Each set randomly partitions the corpus into three sections according to the following distribution: 80% training, 10% held-out, and 10% testing.</Paragraph>
      <Paragraph position="1"> This gives us, on average, 406 training sentences, 83 testing sentences, and 88 sentences for held-out testing. The results reported here are the averages of ten runs.</Paragraph>
      <Paragraph position="2"> We have trained three types of PLTIGs, varying the number of left and right adjunction sites. The L2R1 version has two left adjunction sites and one right adjunction site; L1R2 has one  left adjunction site and two right adjunction sites; L2R2 has two of each. The prototypical auxiliary trees for these three grammars are shown in Figure 5. At the end of every training iteration, the updated grammars are used to parse sentences in the held-out test sets D, and the new language modeling scores (by measuring the cross-entropy estimates f/(D, L2R1), f/(D, L1R2), and//(D, L2R2)) are calculated.</Paragraph>
      <Paragraph position="3"> The rate of improvement of the language modeling scores determines convergence. The PLTIGs are compared with two PCFGs: one with 15-nonterminals, as Pereira and Schabes have done, and one with 20-nonterminals, which has comparable number of parameters to L2R2, the larger PLTIG.</Paragraph>
      <Paragraph position="4"> In Figure 6 we plot the average iterative improvements of the training process for each grammar. All training processes of the PLTIGs converge much faster (both in numbers of iterations and in real time) than those of the PCFGs, even when the PCFG has fewer parameters to estimate, as shown in Table 1. From Figure 6, we see that both PCFGs take many more iterations to converge and that the cross-entropy value they converge on is much higher than the PLTIGs.</Paragraph>
      <Paragraph position="5"> During the testing phase, the trained grammars are used to produce bracketed constituents on unmarked sentences from the testing sets T. We use the crossing bracket metric to evaluate the parsing quality of each grammar. We also measure the cross-entropy estimates \[-I(T, L2R1), f-I(T, L1R2),H(T, L2R2), f-I(T, PCFG:5), and fI(T, PCFG2o) to determine the quality of the language model. For a baseline comparison, we consider bigram and trigram models with simple right branching bracketing heuristics. Our findings are summarized in Table 1.</Paragraph>
      <Paragraph position="6"> The three types of PLTIGs generate roughly the same number of bracketed constituent errors as that of the trained PCFGs, but they achieve a much lower entropy score. While the average entropy value of the trigram model is the lowest, there is no statistical significance between it and any of the three PLTIGs. The relative statistical significance between the various types of models is presented in Table 2. In any case, the slight language modeling advantage of the tri-gram model is offset by its inability to handle parsing.</Paragraph>
      <Paragraph position="7"> Our ATIS results agree with the findings of Pereira and Schabes that concluded that the performances of the PCFGs do not seem to depend heavily on the number of parameters once a certain threshold is crossed. Even though PCFG2o has about as many number of parameters as the larger PLTIG (L2R2), its language modeling score is still significantly worse than that of any of the PLTIGs.</Paragraph>
      <Paragraph position="8">  grammars. If &amp;quot;better&amp;quot; appears at cell (i,j), then the model in row i has an entropy value lower than that of the model in column j in a statistically significant way. The symbol &amp;quot;-&amp;quot; denotes that the difference of scores between the models bears no statistical significance.</Paragraph>
    </Section>
    <Section position="3" start_page="561" end_page="562" type="sub_section">
      <SectionTitle>
3.3 WSJ
</SectionTitle>
      <Paragraph position="0"> Because the sentences in ATIS are short with simple and similar structures, the difference in performance between the formalisms may not be as apparent. For the second experiment, we use the Wall Street Journal (WSJ) corpus, whose sentences are longer and have more varied and complex structures. We use sections 02 to 09 of the WSJ corpus for training, section 00 for held-out data D, and section 23 for test T. We consider sentences of length 40 or less. There are 13242 training sentences, 1780 sentences for the held-out data, and 2245 sentences in the test. The vocabulary set consists of the 48 part-of-speech tags. We compare three variants of PCFGs (15 nonterminals, 20 nonterminals, and 23 nonterminals) with three variants of PLTIGs (L1R2, L2R1, L2R2). A PCFG with 23 nonterminals is included because its size approximates that of the two smaller PLTIGs. We did not generate random train-test splits for the WSJ corpus because it is large enough to provide adequate sampling. Table 3 presents our findings. From Table 3, we see several similarities to the results from the ATIS corpus. All three variants of the PLTIG formalism have converged at a faster rate and have far better language modeling scores than any of the PCFGs. Differing from the previous experiment, the PLTIGs produce slightly better crossing bracket rates than the PCFGs on the more complex WSJ corpus. At least 20 nonterminals are needed for a PCFG to perform in league with the PLTIGs. Although the PCFGs have fewer parameters, the rate seems to be indifferent to the size of the grammars after a threshold has been reached. While upping the number of nonterminal symbols from 15 to 20 led to a 22.4% gain, the improvement from PCFG2o to PCFG23 is only 0.5%. Similarly for PLTIGs, L2R2 performs worse than L2R1 even though it has more parameters. The baseline comparison for this experiment results in more extreme outcomes. The right branching heuristic receives a  crossing bracket rate of 49.44%, worse than even that of PCFG15. However, the N-gram models have better cross-entropy measurements than PCFGs and PLTIGs; bigram has a score of 3.39 bits per word, and trigram has a score of 3.20 bits per word. Because the lexical relationship modeled by the PLTIGs presented in this paper is limited to those between two words, their scores are close to that of the bigram model.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>