<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1015">
  <Title>VP VB NP NP PP PP IN NP VP VBG NN</Title>
  <Section position="3" start_page="105" end_page="106" type="intro">
    <SectionTitle>
3 Computing a most probable parse tree in STSG
</SectionTitle>
    <Paragraph position="0"> In order to deal with the problem of computing the most probable parse tree of a sentence, we will distinguish between parsing and disambiguation. By parsing we mean the creation of a parse forest for an input sentence. By disambiguation we mean the selection of the most probable parse 2 from that forest. The creation of a parse forest is an intermediate step towards computing the most probable parse.</Paragraph>
    <Section position="1" start_page="105" end_page="105" type="sub_section">
      <SectionTitle>
3.1 Parsing
</SectionTitle>
      <Paragraph position="0"> From the way STSG combines elementary trees by means of substitution, it follows that an input sentence can be parsed by the same algorithms as (S)CFGs. Every elementary tree t is used as a context-free rewrite rule root(t) --~ yield(t). Given a chart parsing algorithm, an input sentence of length n can be parsed in n 3 time.</Paragraph>
      <Paragraph position="1"> In order to obtain a chart-like forest for a sentence parsed in STSG, we need to label the well-formed substrings in the chart not only with the syntactic categories of that substring but with the full elementary trees t that correspond to the use of the derived rules root(t) ---~yield(t). Note that in a chart-like forest generated by an STSG, different derivations that generate a same tree do not collapse. We will therefore talk about a derivation forest generated by an STSG (cf. Sima'an et al., 1994).</Paragraph>
      <Paragraph position="2"> The following formal example illustrates what a derivation forest of a string may look like. In the example, we leave out the probabilities, which are needed only in the disambiguation process. The visual representation comes from (Kay, 1980): every entry (i,j) in the chart is indicated by an edge and spans the words between the i-th and the j-th position of a sentence. Every edge is labeled with the elementary trees that denote the underlying phrase. The example-STSG consists of the following elementary trees:  2 Although theoretically there can be more than one most probable parse for a sentence, in practice a system that employs a non-trivial treebank tends to generate exactly one most probable parse for a given input sentence.</Paragraph>
      <Paragraph position="4"> Note that different derivations in the forest generate the same tree. By exhaustively unpacking the forest, four different derivations generating two different trees are obtained. We may ask whether we can pack the forest by collapsing spurious derivations.</Paragraph>
      <Paragraph position="5"> Unfortunately, no efficient procedure is known that accomplishes this (remember that there can be exponentially many derivations for one tree).</Paragraph>
    </Section>
    <Section position="2" start_page="105" end_page="106" type="sub_section">
      <SectionTitle>
3.2 Disambiguation
</SectionTitle>
      <Paragraph position="0"> Cubic time parsing does not guarantee cubic time disambiguation, as a sentence may have exponentially many parses and any such parse may have exponentially many derivations. Therefore, in order to find the most probable parse of a sentence, it is not efficient to compare the probabilities of the parses by exhaustively unpacking the chart. Even for determining the probability of one parse, it is not efficient to add the probabilities of all derivations of that parse.</Paragraph>
      <Paragraph position="1"> 3.2.1 Viterbi optimization is not feasible for finding the most probable parse There exists a heuristic optimization algorithm, known as Viterbi optimization, which selects on the basis of an SCFG the most probable derivation of a sentence in cubic time (Viterbi, 1967; Fujisaki et al., 1989; Jelinek et al., 1990). In STSG, however, the most probable derivation does not necessarily generate the most probable parse, as the probability of a parse is defined as the sum of the probabilities of all its derivations. Thus, there is an important question as to whether we can adapt the Viterbi algorithm for finding the most probable parse.</Paragraph>
      <Paragraph position="2"> To understand the difficulty of the problem, we look in more detail at the Viterbi algorithm. The basic idea of the Viterbi algorithm is the early pruning of low probability subderivations in a bottom-up fashion. Two different subderivations of the same part of the sentence and whose resulting  subparses have the same root can both be developed (if at all) to derivations of the whole sentence in the same ways. Therefore, if one of these two subderivations has a lower probability, then it can be eliminated. This is illustrated by a formal example in figure 7. Suppose that during bottom-up parsing of the string abcd the following two subderivations dl and d2 have been generated for the substring abc.</Paragraph>
      <Paragraph position="3"> (Actually represented are their resulting subparses.)  If the probability of dl is higher than the probability of d2, we can eliminate d2 if we are only interested in finding the most probable derivation of abcd. But if we are interested in finding the most probable parse of abcd (generated by STSG), we are not allowed to eliminate d2. This can be seen by the following.</Paragraph>
      <Paragraph position="4"> Suppose that we have the additional elementary tree given in figure 8.</Paragraph>
      <Paragraph position="5">  This elementary tree may be developed to the same tree that can be developed by d2, but not to the tree that can be developed by dl. And since the probability of a parse tree is equal to the sum of the probabilities of all its derivations, it is still possible that d 2 contributes to the generation of the most probable parse. Therefore we are not allowed to eliminate d2. This counter-example does not prove that there is no heuristic optimization that allows polynomial time selection of the most probable parse. But it makes clear that a &amp;quot;select-best&amp;quot; search, as accomplished by Viterbi, is not adequate for finding the most probable parse in STSG. So far, it is unknown whether the problem of finding the most probable parse in a deterministic way is inherently exponential or not (cf. Sima'an et al., 1994). One should of course ask how often in practice the most probable derivation produces the most probable parse, but this can only be answered by means of experiments on real life corpora. Experiments on the ATIS corpus (see session 4) show that only in 68% of the cases the most probable derivation of a sentence generates the most probable parse of that sentence.</Paragraph>
      <Paragraph position="6"> Moreover, the parse accuracy obtained by the most probable parse is dramatically higher than the parse accuracy obtained by the parse generated by the most probable derivation.</Paragraph>
      <Paragraph position="7"> 3.2.2 Estimating the most probable parse by Monte Carlo search We will leave it as an open question whether the most probable parse can be deterministically derived in polynomial time. Here we will ask whether there exists a polynomial time approximation procedure that estimates the most probable parse with an estimation error that can be made arbitrarily small. We have seen that a &amp;quot;select-best&amp;quot; search, as accomplished by Viterbi, can be used for finding the most probable derivation but not for finding the most probable parse. If we apply instead of a select-best search, a &amp;quot;select-random&amp;quot; search, we can generate a random derivation. By iteratively generating a large number of random derivations we can estimate the most probable parse as the parse which results most often from these random derivations (since the probability of a parse is the probability that any of its derivations occurs). The most probable parse can be estimated as accurately as desired by making the number of random samples as large as desired.</Paragraph>
      <Paragraph position="8"> According to the Law of Large Numbers, the most often generated parse converges to the most probable parse. Methods that estimate the probability of an event by taking random samples are known as Monte Carlo methods (Hammersley &amp; Handscomb, 1964). 3 The selection of a random derivation is accomplished in a bottom-up fashion analogous to Viterbi. Instead of selecting the most probable subderivation at each node-sharing in the chart, a random subderivation is selected (i.e. sampled) at each node-sharing (that is, a subderivation that has n times as large a probability as another subderivation should also have n times as large a chance to be chosen as this other subderivation). Once sampled at the S-node, the random derivation of the whole sentence can be retrieved by tracing back the choices made at each node-sharing. Of course, we may postpone sampling until the S-node, such that we sample directly from the distribution of all S-derivations. But this would take exponential time, since there may be exponentially many derivations for the whole sentence. By sampling bottom-up at every node where ambiguity appears, the maximum number of different subderivations at each node-sharing is bounded to a constant (the total number of rules of that node), and therefore the time complexity of generating a random derivation of an input sentence is equal to the time complexity of finding the most probable derivation, O(n3). This is exemplified by the following algorithm.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>