<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1009"> <Title>Does Baum-Welch Re-estimation Help Taggers?</Title> <Section position="4" start_page="54" end_page="55" type="metho"> <SectionTitle> 3 The effect of the initial conditions </SectionTitle> <Paragraph position="0"> The first experiment concerned the effect of the initial conditions on the accuracy achieved using Baum-Welch re-estimation. A model was trained from a hand-tagged corpus in the manner described above, and then degraded in various ways to simulate the effect of poorer training, as follows: Lexicon D0 Un-degraded lexical probabilities, calculated from f(i, w)/f(i).</Paragraph> <Paragraph position="1"> D1 Lexical probabilities are correctly ordered, so that the most frequent tag has the highest lexical probability and so on, but the absolute values are otherwise unreliable.</Paragraph> <Paragraph position="2"> D2 Lexical probabilities are proportional to the overall tag frequencies, and are hence independent of the actual occurrence of the word in the training corpus.</Paragraph> <Paragraph position="3"> D3 All lexical probabilities have the same value, so that the lexicon contains no information other than the possible tags for each word.</Paragraph> <Paragraph position="4"> Transitions T0 Un-degraded transition probabilities, calculated from f(i, j)/f(i).</Paragraph> <Paragraph position="5"> T1 All transition probabilities have the same value.</Paragraph> <Paragraph position="6"> We could expect to achieve D1 from, say, a printed dictionary listing parts of speech in order of frequency. Perfect training is represented by case D0+T0. The Xerox experiments (Cutting et al., 1992) correspond to something between D1 and D2, and between T0 and T1, in that there is some initial biasing of the probabilities.</Paragraph> <Paragraph position="7"> For the test, four corpora were constructed from the LOB corpus: LOB-B from part B, LOB-L from part L, LOB-B-G from parts B to G inclusive, and LOB-B-J from parts B to J inclusive. Corpus LOB-B-J was used to train the model, and LOB-B, LOB-L and LOB-B-G were passed through thirty iterations of the BW algorithm as untagged data. In each case, the best accuracy (on ambiguous words, as usual) from the FB algorithm was noted. As an additional test, we tried assigning the most probable tag from the D0 lexicon, completely ignoring tag-tag transitions. The results are summarised in table 1, for various corpora, where F denotes the &quot;most frequent tag&quot; test. As an example of how these figures relate to overall accuracies, LOB-B contains 32.35% ambiguous tokens with respect to the lexicon from LOB-B-J, and the overall accuracy in the D0+T0 case is hence 98.69%. The general pattern of the results is similar across the three test corpora, the only difference of interest being that case D3+T0 does better for LOB-L than for the other two corpora, and in particular does better than cases D0+T1 and D1+T1. A possible explanation is that in this case the test data does not overlap with the training data, and hence the good-quality lexicons (D0 and D1) have less of an influence. It is also interesting that D3+T1 does better than D2+T1. The reasons for this are unclear, and the results are not always the same with other corpora, which suggests that they are not statistically significant.</Paragraph>
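As a concrete illustration of the training and degradation regimes above, the following is a minimal sketch in Python. It assumes the training corpus is a list of sentences, each a list of (word, tag) pairs; the rank-based scores used to simulate D1's &quot;otherwise unreliable&quot; values, and all function names, are illustrative assumptions rather than the authors' code.

```python
from collections import Counter, defaultdict

def estimate_model(tagged_corpus):
    """Relative-frequency estimates: lexical P(w|i) = f(i,w)/f(i) and
    transition P(j|i) = f(i,j)/f(i), as defined in the text."""
    tag_freq, lex_freq, trans_freq = Counter(), Counter(), Counter()
    for sentence in tagged_corpus:
        for k, (word, tag) in enumerate(sentence):
            tag_freq[tag] += 1
            lex_freq[(tag, word)] += 1
            if k + 1 < len(sentence):
                trans_freq[(tag, sentence[k + 1][1])] += 1
    lexicon = {(i, w): c / tag_freq[i] for (i, w), c in lex_freq.items()}
    transitions = {(i, j): c / tag_freq[i] for (i, j), c in trans_freq.items()}
    return lexicon, transitions, tag_freq

def degrade_lexicon(lexicon, tag_freq, level):
    """Degradations D0-D3; D1's arbitrary-but-correctly-ordered values
    are simulated here with rank-based scores, which is an assumption."""
    if level == 0:                      # D0: untouched
        return dict(lexicon)
    by_word = defaultdict(list)
    for (tag, word), p in lexicon.items():
        by_word[word].append((tag, p))
    total = sum(tag_freq.values())
    degraded = {}
    for word, entries in by_word.items():
        entries.sort(key=lambda e: e[1], reverse=True)
        for rank, (tag, _) in enumerate(entries):
            if level == 1:              # D1: correct order, unreliable values
                score = 1.0 / (rank + 1)
            elif level == 2:            # D2: proportional to overall tag frequency
                score = tag_freq[tag] / total
            else:                       # D3: uniform over the word's possible tags
                score = 1.0 / len(entries)
            degraded[(tag, word)] = score
    return degraded

def degrade_transitions(transitions, tags, level):
    """T0 keeps the trained transitions; T1 makes them all equal."""
    if level == 0:
        return dict(transitions)
    return {(i, j): 1.0 / len(tags) for i in tags for j in tags}
```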
<Paragraph position="8"> Several follow-up experiments were used to confirm the results: using corpora from the Penn treebank, using equivalence classes to ensure that all lexical entries have a total relative frequency of at least 0.01, and using larger corpora. The specific accuracies differed across these tests, but the overall patterns remained much the same, suggesting that they are not an artifact of the tagset or of details of the text.</Paragraph> <Paragraph position="9"> The observations we can make about these results are as follows. Firstly, two of the tests, D2+T1 and D3+T1, give very poor performance. Their accuracy is not even as good as that achieved by picking the most frequent tag (although this of course implies a lexicon of D0 or D1 quality). It follows that if Baum-Welch re-estimation is to be an effective technique, the initial data must have biasing either in the transitions (the T0 cases) or in the lexical probabilities (cases D0+T1 and D1+T1), but it is not necessary to have both (D2/D3+T0 and D0/D1+T1).</Paragraph> <Paragraph position="10"> Secondly, training from a hand-tagged corpus (case D0+T0) always does best, even when the test data is from a different source to the training data, as it is for LOB-L. So perhaps it is worth investing effort in hand-tagging training corpora after all, rather than just building a lexicon and letting re-estimation sort out the probabilities. But how can we ensure that re-estimation will produce a good quality model? We look further at this issue in the next section.</Paragraph> <Paragraph position="11"> A second experiment was conducted to decide when it is appropriate to use Baum-Welch re-estimation at all, by examining the accuracy as the iteration progresses. There seem to be three patterns of behaviour: Classical A general trend of rising accuracy on each iteration, with any falls in accuracy being local. It indicates that the model is converging towards an optimum which is better than its starting point.</Paragraph> <Paragraph position="12"> Initial maximum Highest accuracy on the first iteration, and falling thereafter. In this case the initial model is of better quality than BW can achieve. That is, while BW will converge on an optimum, the notion of optimality is with respect to the HMM rather than to the linguistic judgements about correct tagging.</Paragraph> <Paragraph position="13"> Early maximum Rising accuracy for a small number of iterations (2-4), and then falling as in initial maximum.</Paragraph> <Paragraph position="14"> An example of each of the three behaviours is shown in figure 1. The values of the accuracies and the test conditions are unimportant here; all we want to show are the general patterns. The second experiment had the aim of discovering which pattern applies under which circumstances, in order to help decide how to train the model. Clearly, if the expected pattern is initial maximum, we should not use BW at all; if early maximum, we should halt the process after a few iterations; and if classical, we should halt the process in a &quot;standard&quot; way, such as comparing the perplexity of successive models.</Paragraph>
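The three behaviours can be told apart mechanically from the per-iteration accuracy curve. A minimal sketch follows, in which the 2-4 iteration window for early maximum comes from the text, while the treatment of local falls as noise is an assumption of the sketch.

```python
def classify_pattern(accuracies):
    """Classify a per-iteration accuracy curve as one of the three
    patterns described above. accuracies[0] is the accuracy after the
    first iteration; any local falls are treated as noise, which is a
    simplification."""
    best = max(range(len(accuracies)), key=lambda k: accuracies[k])
    if best == 0:
        return "initial maximum"   # BW only degrades the initial model
    if best <= 3:                  # peak within iterations 2-4
        return "early maximum"
    return "classical"             # general rising trend towards an optimum

# Hypothetical accuracy curves illustrating each behaviour:
print(classify_pattern([0.92, 0.90, 0.89, 0.88, 0.88]))        # initial maximum
print(classify_pattern([0.89, 0.91, 0.92, 0.90, 0.88]))        # early maximum
print(classify_pattern([0.75, 0.78, 0.80, 0.82, 0.83, 0.83]))  # classical
```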
<Paragraph position="15"> The tests were conducted in a similar manner to those of the first experiment, by building a lexicon and transitions from a hand-tagged training corpus, and then applying them to a test corpus with varying degrees of degradation. Firstly, four different degrees of degradation were used: no degradation at all, D2 degradation of the lexicon, T1 degradation of the transitions, and the two together. Secondly, we selected test corpora with varying degrees of similarity to the training corpus: the same text, text from a similar domain, and text which is significantly different. Two tests were conducted with each combination of the degradation and similarity, using different corpora (from the Penn treebank) ranging in size from approximately 50000 words to 500000 words. The re-estimation was allowed to run for ten iterations.</Paragraph> <Paragraph position="16"> The results appear in table 2, showing the best accuracy achieved (on ambiguous words), the iteration at which it occurred, and the pattern of re-estimation (I = initial maximum, E = early maximum, C = classical). The patterns are summarised in table 3, each entry in the table showing the patterns for the two tests under the given conditions. Although there is some variation in the readings, for example in the &quot;similar/D0+T0&quot; case, we can draw some general conclusions about the patterns obtained from different sorts of data.</Paragraph> <Paragraph position="18"> When the lexicon is degraded (D2), the pattern is always classical. With a good lexicon but either degraded transitions or a test corpus differing from the training corpus, the pattern tends to be early maximum. When the test corpus is very similar to the model, then the pattern is initial maximum. Furthermore, examining the accuracies in table 2, in the cases of initial maximum and early maximum, the accuracy tends to be significantly higher than with classical behaviour. It seems likely that the model is converging towards something of similar &quot;quality&quot; in each case, but when the pattern is classical, the convergence starts from a lower quality model and improves, whereas in the other cases it starts from a higher quality one and deteriorates. In the case of early maximum, the few iterations where the accuracy is improving correspond to the creation of entries for unknown words and the fine-tuning of entries for known ones, and these changes outweigh those produced by the re-estimation.</Paragraph> </Section> <Section position="5" start_page="55" end_page="57" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> From the observations in the previous section, we propose the following guidelines for how to train an HMM for use in tagging: * If a hand-tagged training corpus is available, use it. If the test and training corpora are near-identical, do not use BW re-estimation; otherwise use it for a small number of iterations. * If no such training corpus is available, but a lexicon with at least relative frequency data is available, use BW re-estimation for a small number of iterations.</Paragraph> <Paragraph position="1"> * If neither training corpus nor lexicon is available, use BW re-estimation with standard convergence tests such as perplexity (a sketch of such a test is given below). Without a lexicon, some initial biasing of the transitions is needed if good results are to be obtained.</Paragraph>
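The convergence test in the last guideline can be made concrete. Below is a minimal sketch, assuming sentences are lists of word strings, a model is a (tags, lexicon, transitions, initial) tuple in the style of the earlier sketch, and the caller supplies one Baum-Welch pass as `reestimate`; the tolerance and iteration cap are assumptions, not values from the paper.

```python
import math

def forward_log_prob(words, tags, lexicon, transitions, initial):
    """Log-likelihood of an untagged sentence under the HMM, computed
    with the scaled forward algorithm to avoid underflow. `initial`
    maps each tag to its sentence-initial probability (an assumption
    about how the model is packaged)."""
    alpha = {t: initial.get(t, 0.0) * lexicon.get((t, words[0]), 0.0)
             for t in tags}
    log_prob = 0.0
    for word in words[1:]:
        scale = sum(alpha.values())
        if scale == 0.0:
            return float("-inf")          # sentence impossible under the model
        log_prob += math.log(scale)
        alpha = {t: a / scale for t, a in alpha.items()}
        alpha = {j: lexicon.get((j, word), 0.0) *
                    sum(alpha[i] * transitions.get((i, j), 0.0) for i in tags)
                 for j in tags}
    final = sum(alpha.values())
    return float("-inf") if final == 0.0 else log_prob + math.log(final)

def perplexity(sentences, tags, lexicon, transitions, initial):
    """Per-word perplexity: exp(-log P(corpus) / N)."""
    total_lp = sum(forward_log_prob(s, tags, lexicon, transitions, initial)
                   for s in sentences)
    n_words = sum(len(s) for s in sentences)
    return math.exp(-total_lp / n_words)

def train_until_converged(model, sentences, reestimate, rel_tol=1e-3, max_iter=30):
    """Run BW passes until the relative drop in perplexity between
    successive models falls below rel_tol. `reestimate` (one Baum-Welch
    re-estimation pass) is assumed to be supplied by the caller."""
    prev = perplexity(sentences, *model)
    for _ in range(max_iter):
        model = reestimate(model, sentences)
        curr = perplexity(sentences, *model)
        if prev - curr < rel_tol * prev:   # converged (or deteriorating)
            break
        prev = curr
    return model
```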
<Paragraph position="2"> Similar results are presented by Merialdo (1994), who describes experiments comparing the effect of training from a hand-tagged corpus with using the Baum-Welch algorithm under various initial conditions. As in the experiments above, BW re-estimation gave a decrease in accuracy when the starting point was derived from a significant amount of hand-tagged text. In addition, although Merialdo does not highlight the point, BW re-estimation starting from less than 5000 words of hand-tagged text shows early maximum behaviour. Merialdo's conclusion is that taggers should be trained using as much hand-tagged text as possible to begin with, and that BW re-estimation with untagged text should be applied only after that. The step forward taken in the work here is to show that there are three patterns of re-estimation behaviour, with differing guidelines for how to use BW effectively, and that to obtain a good starting point when a hand-tagged corpus is not available or is too small, either the lexicon or the transitions must be biased.</Paragraph> <Paragraph position="3"> While these may be useful heuristics from a practical point of view, the next step forward is to look for an automatic way of predicting the accuracy of the tagging process given a corpus and a model.</Paragraph> <Paragraph position="4"> Some preliminary experiments with measures such as perplexity and the average probability of hypotheses show that, while they do give an indication of convergence during re-estimation, neither shows a strong correlation with the accuracy. Perhaps what is needed is a &quot;similarity measure&quot; between two models M and M', where M' is the model obtained by tagging a corpus with model M and then training from the tagger's output as if it were a hand-tagged corpus. However, preliminary experiments using measures such as the Kullback-Leibler distance between the initial and new models have again shown that they do not give good predictions of accuracy. In the end it may turn out that there is simply no way of making the prediction without a source of information extrinsic to both model and corpus.</Paragraph> </Section> </Paper>
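As a concrete illustration of the &quot;similarity measure&quot; idea in the closing paragraph, the sketch below computes a Kullback-Leibler distance between the transition distributions of an initial model M and a re-trained model M'. The epsilon smoothing of zero entries and the per-row averaging are assumptions of this sketch, not details from the paper's experiments.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)); the eps smoothing
    keeps the measure finite when q has zero entries (an assumption)."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) + eps) *
               math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
               for k in keys)

def model_distance(trans_m, trans_m_prime, tags):
    """Average KL distance between corresponding transition rows of the
    initial model M and the re-trained model M'."""
    total = 0.0
    for i in tags:
        row_m = {j: trans_m.get((i, j), 0.0) for j in tags}
        row_m_prime = {j: trans_m_prime.get((i, j), 0.0) for j in tags}
        total += kl_divergence(row_m, row_m_prime)
    return total / len(tags)
```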