<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1015">
  <Title>The Problem of Computing the Most Probable Tree in Data-Oriented Parsing and Stochastic Tree Grammars</Title>
  <Section position="4" start_page="106" end_page="108" type="metho">
    <SectionTitle>
3 Note that Monte Carlo estimation of the most probable
</SectionTitle>
    <Paragraph position="0"> parse is more reliable than the estimation of the most probable parse by generating the n most probable derivations by Viterbi, since it might be that the most probable parse is exclusively generated by many low probable derivations. The Monte Carlo method is guaranteed to converge to the most probable parse.</Paragraph>
    <Paragraph position="1">  Sampling a random 0C/riva~ion from a derivation forest Given a derivation forest, of a sentence of n words, consisting of labeled entries (i,j) that span the words between the i-th and the j-th position of the sentence. Every entry is labeled with linked elementary trees, together with their probabilities, that constitute subderivations of the underlying subsentence.</Paragraph>
    <Paragraph position="2"> Sampling a derivation from the chart consists of choosing at every labeled entry (bottom-up, breadthfu'st) a random subderivation of each root-node:</Paragraph>
    <Paragraph position="4"> select 4 a random subderivation of root X eliminate the other subderivations We now have an algorithm that selects a random derivation from a derivation forest. Converting this derivation into a parse tree gives a first estimation for the most probable parse. Since one random sample is not a reliable estimate, we sample a large number of random derivations and see which parse is generated most frequently. This is exemplified by the following algorithm. (Note that we might also estimate the most probable derivation by random sampling, namely by counting which derivation is sampled most often; however, the most probable derivation can be more effectively generated by Viterbi.) Eslimating the most probable parse (MPP) Given a derivation forest for an input sentence: repeat until the MPP converges sample a random derivation from the forest store the parse generated by the random derivation MPP := the most frequently occurring parse There is an important question as to how long the convergence of the most probable parse may take. Is there a tractable upper bound on the number of derivations that have to be sampled from the forest before stability in the top of the parse distribution occurs? The answer is yes: the worst case time complexity of achieving a maximum estimation error e by means of random sampling is O(e-2), independently of the probability distribution. This is a classical result from sampling theory (cf. Hammersley and Handscomb, 1964), and follows directly from Chebyshev's inequality. In practice, it means that the 4 Let { (e 1, Pl), (e2, P2) ..... (en, Pn) } be a probability distribution of events el, e2, ..., en; an event e i is said to be randomly selected iff its probability of being selected is equal to Pi. In order to allow for &amp;quot;direct sampling&amp;quot;, one must convert the probability distribution into a corresponding sample space for which holds that the frequency of occurrence 3\] of each event e i is a positive integer equal to Npi, where N is the size of the sample space.</Paragraph>
    <Paragraph position="5"> error e is inversely proportional to the square-root of the number of random samples N and therefore, to reduce e by a factor of k, the number of samples N needs to be increased k2-fold. In practical experiments (see SS4), we will limit the number of samples to a pre-determined, sufficiently large bound N.</Paragraph>
    <Paragraph position="6"> What is the theoretical worst case time complexity of parsing and disambiguation together? That is, given an STSG and an input sentence, what is the maximal time cost of finding the most probable parse of a sentence? If we use a CKY-parser, the creation of a derivation forest for a sentence of n words takes O(n 3) time. Taking also into account the size G of an STSG (defined as the sum of the lengths of the yields of all its elementary trees), the time complexity of creating a derivation forest is proportional to Gn 3. The time complexity of disambiguation is both proportional to the cost of sampling a derivation, i.e. Gn 3, and to the cost of the convergence by means of iteration, which is e -2. Tiffs means that the time complexity of disambiguation is given by O(Gn3e-2). The total time complexity of parsing and disambiguation is equal to O(Gn 3) + O(Gn3e -2) = O(Gn3e'2). Thus, there exists a tractable procedure that estimates the most probable parse of an input sentence.</Paragraph>
    <Paragraph position="7"> Notice that although the Monte Carlo disambiguation algorithm estimates the most probable parse of a sentence in polynomial time, it is not in the class of polynomial time decidable algorithms. The Monte Carlo algorithm cannot decide in polynomial time what is the most probable parse; it can only make the error-probability of the estimated most probable parse arbitrarily small. As such, the Monte Carlo algorithm is a probabilistic algorithm belonging to the class of Bounded error Probabilistic Polynomial time (BPP) algorithms.</Paragraph>
    <Paragraph position="8"> We hypothesize that Monte Carlo disambiguation is also relevant for other stochastic grammars. It turns out that all stochastic extensions of CFGs that are stochastically richer than SCFG need exponential time algorithms for finding a most probable parse tree (cf. Briscoe &amp; Carroll, 1992; Black et al., 1993; Magerman &amp; Weir, 1992; Schabes &amp; Waters, 1993). To our knowledge, it has never been studied whether there exist BPP-algorithms for these models. Alhough it is beyond the scope of our research, we conjecture that there exists a Monte Carlo disambiguation algorithm for at least Stochastic Tree-Adjoining Grammar (Schabes, 1992).</Paragraph>
    <Section position="1" start_page="107" end_page="108" type="sub_section">
      <SectionTitle>
3.2.3 Psychological relevance of Monte
Carlo disambiguation
</SectionTitle>
      <Paragraph position="0"> As has been noted, an important difference between the Viterbi algorithm and the Monte Carlo algorithm is, that with the latter we never have 100% confidence. In our opinion, this should not be seen as a disadvantage. In fact, absolute confidence about the most probable parse does not have any significance, as the probability assigned to a p~se is already an estimation of its actual probability. One may ask as to whether Monte Carlo is appropriate for modeling  human sentence perception. The following lists some properties of Monte Carlo disambiguation that may be of psychological interest: 1. As mentioned above, Monte Carlo never provides 100% confidence about the best analysis. This corresponds to the psychological observation that people never have absolute confidence about their interpretation of an ambiguous sentence.</Paragraph>
      <Paragraph position="1"> 2. Although conceptually Monte Carlo uses the total space of possible analyses, it tends to sample only the most likely ones. Very unlikely analyses may only be sampled after considerable time, but it is not guaranteed that all analyses are found in finite time. This matches with experiments on human sentence perception where very implausible analyses are only perceived with great difficulty and after considerable time.</Paragraph>
      <Paragraph position="2"> 3. Monte Carlo does not necessarily give the same results for different sequences of samples, especially if different analyses in the top of the distribution are almost equally likely. In the case there is more than one most probable analysis, Monte Carlo does not converge to one analysis but keeps alternating, however large the number of samples is made. In experiments with human sentence perception, it has often been shown that different analyses can be perceived for one sentence. And in case these analyses are equally plausible, people perceive so-called fluctuation effects. This fluctuation phenomenon is also well-known in the perception of ambiguous visual patterns.</Paragraph>
      <Paragraph position="3"> 4. Monte Carlo can be made parallel in a very straightforward way: N samples can be computed by N processing units, where equal outputs are reinforced. The more processing units are employed, the better the estimation. However, since the number of processing units is finite, there is never absolute confidence. This has some similarity with the Parallel Distributed Processing paradigm for haman (language) processing (Rumelhart &amp; McClelland, 1986).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="108" end_page="109" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> In this section, we report on experiments with an implementation of DOP that parses and disambiguates part-of-speech strings. In (Bod, 1995) it is shown how DOP is extended to parse word strings that possibly contain unknown words.</Paragraph>
    <Section position="1" start_page="108" end_page="108" type="sub_section">
      <SectionTitle>
4.1 The test environment
</SectionTitle>
      <Paragraph position="0"> For our experiments, we used a manually corrected version of the Air Travel Information System (ATIS) spoken language corpus (Hemphill et al., 1990) annotated in the Pennsylvania Treebank (Marcus et al., 1993). We employed the &amp;quot;blind testing&amp;quot; method, dividing the corpus into a 90% training set and a 10% test set by randomly selecting sentences. The 675 trees from the training set were converted into their subtrees together with their relative frequencies, yielding roughly 4&amp;quot;105 different subtrees. The 75 part-of-speech sequences from the test set served as input strings that were parsed and disambiguated using the subtrees from the training set. As motivated in (Bed, 1993b), we use the notion of parse accuracy as our accuracy metric, defined as the percentage of the test strings for which the most probable parse is identical to the parse in the test set.</Paragraph>
    </Section>
    <Section position="2" start_page="108" end_page="108" type="sub_section">
      <SectionTitle>
4.2 Accuracy as a function of subtree-depth
</SectionTitle>
      <Paragraph position="0"> It is one of the most essential features of DOP, that arbitrarily large subtrees are taken into consideration to estimate the probability of a parse. In order to test the usefulness of this feature, we performed different experiments constraining the depth of the subtrees.</Paragraph>
      <Paragraph position="1"> The following table shows the results of seven experiments for different maximum depths of the training set subtrees. The accuracy refers to the parse accuracy at 400 randomly sampled parses, and is rounded off to the nearest integer. The CPU time refers to the average CPU time per string employed by a Spark II.</Paragraph>
    </Section>
    <Section position="3" start_page="108" end_page="108" type="sub_section">
      <SectionTitle>
[Table residue: only the average CPU times per string (1.6 h, 1.9 h, 2.2 h, 3.5 h) survive from the accuracy/CPU-time table for different maximum subtree depths; the remaining columns were lost in extraction.]
</SectionTitle>
      <Paragraph position="0"> The table shows a dramatic increase in parse accuracy when enlarging the maximum depth of the subtrees from 1 to 2. (Remember that for depth one, DOP is equivalent to a stochastic context-free grammar.) The accuracy keeps increasing, at a slower rate, when the depth is enlarged further. The highest accuracy is obtained by using all subtrees from the training set: 72 out of the 75 sentences from the test set are parsed correctly. Thus, the accuracy increases if larger subtrees are used, though the CPU time increases considerably as well.</Paragraph>
    </Section>
    <Section position="4" start_page="108" end_page="109" type="sub_section">
      <SectionTitle>
4.3 Does the most probable derivation generate the most probable parse?
</SectionTitle>
      <Paragraph position="0"> Another important feature of DOP is that the probability of a resulting parse tree is computed as the sum of the probabilities of all its derivations.</Paragraph>
      <Paragraph position="1"> Although the most probable parse of a sentence is not necessarily generated by the most probable derivation of that sentence, there is a question as to how often these two coincide. In order to study this, we also calculated the derivation accuracy, defined as the percentage of the test strings for which the parse generated by the most probable derviation is identical to the parse in the test set. The following table shows the derivation accuracy against the parse accuracy for the 75 test set strings from the ATIS corpus, using different maximum depths for the corpus subtrees.</Paragraph>
      <Paragraph position="2">  The table shows that the derivation accuracy is equal to the parse accuracy if the depth of the subtrees is constrained to 1. This is not surprising, as for depth 1, DOP is equivalent with SCFG where every parse is generated by exactly one derivation. What is remarkable, is, that the derivation accuracy decreases if the depth of the subtrees is enlarged to 2. If the depth is enlarged further, the derivation accuracy increases again. The highest derivation accuracy is obtained by using all subtrees from the corpus (65%), but remains far behind the highest parse accuracy (96%). From this table we conclude that if we.are interested in the most probable analysis of a string we must not look at the probability of the process of achieving that analysis but at the probability of the result of that process.</Paragraph>
    </Section>
    <Section position="5" start_page="109" end_page="109" type="sub_section">
      <SectionTitle>
4.4 The significance of once-occurring subtrees
</SectionTitle>
      <Paragraph position="0"> There is an important question as to whether we can reduce the "grammar constant" of DOP by eliminating very infrequent subtrees without affecting the parse accuracy. In order to study this question, we start with a test result. Consider the test set sentence "Arrange the flight code of the flight from Denver to Dallas Worth in descending order", which has the following parse in the test set:  The corresponding p-o-s sequence of this sentence is the test set string "VB DT NN NN IN DT NN IN NP TO NP NP IN VBG NN". At subtree-depth &lt; 2, the following most probable parse was estimated for this string (where for reasons of readability the words are added to the p-o-s tags):  In this parse, we see that the prepositional phrase "in descending order" is incorrectly attached to the NP "the flight" instead of to the verb "arrange". This wrong attachment may be explained by the high relative frequencies of the following subtrees of depth 2 (which appear in structures of sentences like "Show me the transportation from SFO to downtown San Francisco in August", where the PP "in August" is attached to the NP "the transportation" and not to the verb "show"):</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="109" end_page="110" type="metho">
    <SectionTitle>
[Figure residue: the depth-2 subtrees (NP, PP and IN nodes) in which the PP is attached to the NP; the tree diagrams themselves are not recoverable from the extraction.]
</SectionTitle>
    <Paragraph position="0"> Only if the maximum depth was enlarged to 4, subtrees like the following were available, which led to the estimation of the correct tree.</Paragraph>
    <Paragraph position="1">  It is interesting to note that this subtree occurs only once in the training set. Nevertheless, it induces the correct parsing of the test string. This seems to contradict the fact that probabilities based on sparse data are not reliable. Since many large subtrees are once-occumng events (hapaxes), there seems to be a preference in DOP for an occurrence-based approach if enough context is provided: large subtrees, even if they occur once, tend to contribute to the generation of the correct parse, since they provide much contextual information. Although these subtrees have low probabilities, they tend to induce the correct parse because fewer subtrees are needed to construct a parse. Additional experiments seemed to confirm this hypothesis. Throwing away all hapaxes yielded an accuracy of 92%, which is a decrease of 4%.</Paragraph>
    <Paragraph position="2"> Distinguishing between small and large hapaxes, showed that the accuracy was not affected by eliminating the hapaxes of depth 1 (however, as an advantage, the convergence seemed to get slightly faster). Eliminating hapaxes larger than depth 1, decreased the accuracy. The following table shows the parse accuracy after eliminating once-occurring subtrees of different maximum depths.</Paragraph>
    <Paragraph position="3">  We have shown that in DOP and STSG the Viterbi algorithm cannot be used for computing a most probable tree of a string. We developed a modification of Viterbi which allows by means of an iterative Monte Carlo search to estimate the most probable tree of a string in polynomial time. Experiments on ATIS showed that only in 68% of the cases, the most probable derivation of a string generates the most probable tree of that string, and that the parse accuracy is dramatically higher than the derivation accuracy. We conjectured that the Monte Carlo algorithm can also be applied to other stochastic grammars for computing the most probable tree of a string. The question as to whether the most probable tree of a string can also be deterministically derived in polynomial time is still unsolved.</Paragraph>
  </Section>
  <Section position="7" start_page="110" end_page="110" type="metho">
    <SectionTitle>
Acknowledgments
</SectionTitle>
    <Paragraph position="0"> The author is indebted to Remko Scha for valuable comments on an earlier version of this paper, and to</Paragraph>
  </Section>
class="xml-element"></Paper>