<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1304">
  <Title>I</Title>
  <Section position="5" start_page="38" end_page="75" type="metho">
    <SectionTitle>
4 Transforming Traversal Strings into Trees
</SectionTitle>
    <Paragraph position="0"> We use a heuristic algorithm for reconstructing a tree from traversal strings. This includes '~artial&amp;quot; traversal strings, but we simply refer to them as traversal strings since we will always be working with partial traversal strings anyway. This is a brief, informal description of the algorithm. A complete technical description is given in (Hogenhout, 1998).</Paragraph>
    <Paragraph position="1"> The algorithm is based on the heuristic that best matches should go first. The best match is decided by checking neighboring strings (or, later, subtrees) for equal nonterminals starting at the top of either side. The pair with the most matching nonterminals is merged as displayed in figure 4. This process is repeated until one tree remains, or when there are no more matching neighbors.</Paragraph>
    <Paragraph position="2"> For example, the choice</Paragraph>
    <Paragraph position="4"> would initially be decided in favor of the and rain, and.after they are merged completely, the top three nonterminals would be merged to those of i~ There is an easy way of testing this algorithm. One can take trees from a treebank, convert the trees to traversal strings, then use the algorithm to reconstruct the trees. Figure 5 shows the labeled accuracy and recall of these reconstructed trees when compared to the original treebank trees, for various maximum traversal string lengths.</Paragraph>
    <Paragraph position="5"> The accuracy is calculated as number of identical brackets with identical nonterminal accuracy -- number of brackets in system parse (1) and the recall as number of identical brackets with identical nonterminal recall = number of brackets in treebank parse (2) which we will refer tO as &amp;quot;labeled accuracy&amp;quot; and &amp;quot;labeled recall&amp;quot; as opposed to the &amp;quot;unlabeled&amp;quot; versions of these measures that ignore nonterminals.  Even for long traversal strings the original tree is not reconstructed completely. This happens, for example, when two identical nonterminals are siblings, as in the sentence &amp;quot;She ga~e \[NP the man\] \[NP a book\].&amp;quot; It is of course possible to solve such problems with a post-processor that tries to recognize such situations and correct them whenever they arise. But as can be seen it only involves a small percentage (about 2%) of all brackets and for this reason it is not very significant at this stage.</Paragraph>
    <Paragraph position="7"> The graph shows that if we are capable of predicting up to 5 or more vertices, the algorithm will be able to do very well. If we can only predict up to 4 vertices we still have a high upper bound, but it is slightly lower~ Predicting up to 3 or less vertices however will not produce useful results.</Paragraph>
    <Paragraph position="8"> It must however be stressed that this .is only an upper bound and does not reflect the performance of a useful system in any way. The upper bound only helps to pin down the border line of 4-5 vertices, and what really counts in practice is how the algorithm will do when the traversal strings that are predicted contain errors--as they undoubtedly will.</Paragraph>
  </Section>
  <Section position="6" start_page="75" end_page="75" type="metho">
    <SectionTitle>
5 Guessing Traversal Strings
</SectionTitle>
    <Paragraph position="0"> We will now look at the question of how to predict traversal strings. As will become clear when inspecting the equation this bears similarity to part-of-speech tagging. But there is one factor that makes a big difference: we do not test on the correct traversal string, but on the result of the tree that is reconstructed at the end. In many cases the traversal string that is guessed is not correct, but similar to the correct traversal string, and a Similar traversal string will render much better results at tree reconstruction than a completely different one.</Paragraph>
    <Paragraph position="1"> As usual our approach is maximizing the likelihood of the training dal~a. We will use a Hidden Markov Model which has traversal string-tag combinations as states and which produces words as output. We do not re~stimate probabilities using the Bantu-Welch algorithm (Bantu, 1972) but we use smoothed Maximum Likelihood estimates from treebank data.</Paragraph>
    <Paragraph position="2"> Let us say we have a string of words wl...wn, and we are interested in guessing tags 7&amp;quot; --- tl ...tn and traversal strings S = sl...s,~. We also use s0 ----- to -- too = s~+l ~- t~+l = dummy as a short-hand to signal the beginning and end of the sentence.</Paragraph>
    <Paragraph position="3">  We take the probability of a sentence to be</Paragraph>
    <Paragraph position="5"> corresponding to the transition and output probabilities of a hidden markov model.</Paragraph>
    <Paragraph position="6"> In practice the probabilities p(wi\[si, ti) and p(s~+l, ti+l\]s,, t,) can not be estimated directly using Maximum Likelihood because of sparse data. For this reason we smooth the estimates with our version of lower-order models as follows: ~(w~lsl, ti) = A,it~p(wils,, ti) + (1 - ;~,m)p(w, lt0 (5) where the interpolation factor sit, is adjusted for different values of si and t, as suggested in (Bahl, Jelinek, and Mercer, 1983). We also divide the si-ti pa~r values over different buckets so that all pairs in the same bucket have the same ,~ parameter. It should be noted that we have a special word which stands for &amp;quot;nnlcnown word,&amp;quot; to take care of words that were not seen in the training data.</Paragraph>
    <Paragraph position="7"> We do something similar for p(si+l, ti+llsi, tO, namely ~(s~+lt,+lls. tO = 6~,pCs~+~t~+iIs. td + 6~mpCsi+lt~+lltd + 6~mpCs,+lt,+l) (fi) where of course 6~,ti + ~it, + 6~ti = 1. The interpolation factors axe bucketed in the same w~y.</Paragraph>
    <Paragraph position="8"> Using the obtained model we choose T and ,q by maximizing the probability of the sentence that we wish to analyse:</Paragraph>
    <Paragraph position="10"> argmax ITp(w~ls~, tdp(si+2, ti+l Is. to (8) C/r,s) ~o which can be resolved using the Viterbi-algorithm (Viterbi, 1967).</Paragraph>
  </Section>
  <Section position="7" start_page="75" end_page="75" type="metho">
    <SectionTitle>
6 Selection of Part of Speech Tags
</SectionTitle>
    <Paragraph position="0"> The process outlined above still has one problem that will be central in the rest of the discussion. The number of traversal strings is easily a few thousand, and the number of part of speech tag-traversal string pairs is even larger. Clearly, the computational complexity of the algorithm is in calculating (8). But, given a word and the history up to that word, most tags and traversal strings can be ruled out immediately. We will therefore only consider a fraction of the possible part of speech tags and traversal strings. This section will discuss how we select part of speech tags.</Paragraph>
    <Paragraph position="1"> The equation we use for selecting a tag is similar to the standard tagging HMM based model.</Paragraph>
    <Paragraph position="2"> We pretend for the time being that we are dealing with another stochastic process, namely one that only generates tags. We assume that</Paragraph>
    <Paragraph position="4"> but we do not really use this model, we only' use the idea behind it to approximate the probability of a tag. We find the most likely tags after seeing word i using the following</Paragraph>
    <Paragraph position="6"> where s is a traversal string, the symbol Bi-I indicates the set of tag-traversal string pairs that is being considered for word wi-l, and ~ indicates the &amp;quot;forward probability&amp;quot; according to the HMM. As usual 64u, u~-l) -- 1 ifu = u~_~ and 0 otherwise. We will discuss later how the set B~-I is chosen, but this of course depends on the tags selected for the word wi-1.</Paragraph>
    <Paragraph position="7"> We distinguish between ~, (tagging model) and ~(~,~) (traversal string model).</Paragraph>
    <Paragraph position="8"> We take two significant assumptions at this point. First, we do not really use the HMM indicated in 410), but in equation (12) we restrict ourselves to the forward probability. The second assumption we take is (13), i.e., we estimate the probability of the previous tag by the tag-traversal string pairs that were selected for the previous word. Using this method we do not need to implement the markov model for tags, we only need the tables for p(t, lti_l) and p(wdti ). As we already need the last one for the traversal string model, we only need the (small) table p(ti\[t~-l) especially for tagging.</Paragraph>
    <Paragraph position="9"> We must emphasize that the tagging described here is only a first estimate. We consider the most ~lcely one, two or three tags according to this model and discard the rest. Once they are selected, these probabilities are discarded and we return to the regular model. The next section will describe how the tags are selected in the next phase.</Paragraph>
  </Section>
  <Section position="8" start_page="75" end_page="75" type="metho">
    <SectionTitle>
7 Selection of Traversal Strings : First Phase
</SectionTitle>
    <Paragraph position="0"> The next problem is how to select a few traversal strings given a word and a few tags, one of which is likely to be correct. The model we use for this pre~selection is actually more simple; as we ignore the selected traversal strings for previous words. 1 From the corpus we directly estimate in Maximum Likelihood fashion P4w,, si, ti) (14) and select the most likely travexsal strings si from this table. If there are too few samples for a particular word wi, the list is completed with the more general distribution</Paragraph>
    <Paragraph position="2"> again maximizing over si. We will have to consider that we do not have a single tag but several options, but we will first pretend that we do have one single tag.</Paragraph>
    <Paragraph position="3"> Figure 6 shows the results of this first phase, in case the maximum length of traversal strings is set to 5. If the best 50 candidates are selected according to 414), supplemented with selection according to (15) if necessary, we have the correct candidate between them about 80% of the time. That means that for 20% of the words, we can only hope that a similar traver-~al string will be available for them. If we use the best 300 candidates, we will miss the correct candidate for about one word per sentence. We must however emphas!ze two points: 1. The question is not only if we can select the correct candidate. It is crucial that, when a wrong candidate is chosen, this is at least similar to the correct candidate.</Paragraph>
    <Paragraph position="4"> 2. Figure 6 indicates the percentage for traversal strings cut of at length 5. If traversal stings of a different maximum length are used, this will change (the higher the maximum length, the lower the percentage of hits).</Paragraph>
    <Paragraph position="5">  Now we return to the tagging problem; after all we do not have the right tag available to us. We solve this, heuristically, as follows. Let a be the most likely tag, b the second most likely and c the third.</Paragraph>
    <Paragraph position="6">  - If p(a)/p(b) &gt; 50, select 300 candidates for tag a and ignore other tags.</Paragraph>
    <Paragraph position="7"> - If 50 &gt; p(a)/p(b) &gt; 4 we select 300 candidates for tag a and more 100 candidates for tag b.</Paragraph>
    <Paragraph position="8"> - If 4 &gt;_ p(a)/p(b) we select 300 candidates for tag a, 200 candidates for tag b and 100  candidates for tag c.</Paragraph>
    <Paragraph position="9"> This scheme gives more candidates for more ambiguous words, but as about 80% of all words fall in the first category and only 9% in the last category, this is not so bad. This list will contain the correct traversal string about 95% of the time.</Paragraph>
  </Section>
  <Section position="9" start_page="75" end_page="75" type="metho">
    <SectionTitle>
8 Selection of Traversal Strings : Second Phase
</SectionTitle>
    <Paragraph position="0"> The previous section explained how initial candidates can be selected quickly from all possible sets. After these initial candidates were selected, the transition and output probabilities are calculated. Let again B~- 1 be the set of candidates considered for word w~_ 1. Then we need to calculate (regrouping the product as compared to (8)) the quantity</Paragraph>
    <Paragraph position="2"> where we set</Paragraph>
    <Paragraph position="4"> and Bo = {(dummy, dummy)}. The sum in (16) reflects almost all of the time that the calculation process takes up. But equation (16) gives a much more accurate estimate of likelihood than the rather primitive word-based selection (14), so once this sum is calculated we have a much better idea of the likelihood of candidates. For this reason we use two criteria: Note that using a technique similar to that for part of speech tags is not an option as this is exactly what we are trying to avoid doing for all possible traversai strings.</Paragraph>
    <Paragraph position="5">  - In the first phase we use equation (14) and select the best p candidates. (As explained, depending on tagging confidence we vary the number of candidates, so # should be thought of as an average.) - In the second phase we use equation (16) and select the best 7 candidates.</Paragraph>
    <Paragraph position="6">  It will be clear that we can choose 7 ((/z. We have illustrated this in figure 7, which displays the percentage of correct candidates for various values for 7, again using a maximum traversal string length.of 5. Note that the computational complexity of the Viterbi algorithm will be O(p~n) where n is sentence length.</Paragraph>
    <Paragraph position="7">  ff ....</Paragraph>
    <Paragraph position="8"> ~ate available after phase 2 -' correct candidate C/ho6en by Viterbl algodthm r i i i i 5 10 15 20 25 30 number of candidates phase 2  line) and also the percentage of cases in which the correct candidate is chosen by the Viterbi algorithm. A remarkable fact arises from this figure: the percentage of traversal strings that are chosen correctly stabilizes at about 7 = 4. From that point the percentage is about 50% and while increasing 7 increases the chance that the correct one will be available, choosing it becomes more diiBcult and these two effects cancel each other out. Nevertheless the result continues to improve for higher 7, as better alternatives become available. We will put 7 to 15 as a higher number contributes little more to the final scores.</Paragraph>
  </Section>
class="xml-element"></Paper>