<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1032">
  <Title>Comparing a Linguistic and a Stochastic Tagger</Title>
  <Section position="4" start_page="247" end_page="248" type="metho">
    <SectionTitle>
8 The Statistical Tagger
</SectionTitle>
    <Paragraph position="0"> The statistical tagger used in the experiments is a classical trigram-based HMM decoder of the kind described in e.g. (Church 1988), (DeRose 1988) and numerous other articles. Following conventional notation, e.g. (Rabiner 1989, pp. 272-274) and (Krenn and Samuelsson 1996, pp. 42-46), the tagger recursively calculates the ~, 3, 7 and 6 variables for each word string position t = 1 ..... T and each possible state 4 si : i = 1,...,n:</Paragraph>
    <Paragraph position="2"> where St = si is the event of the tth word being emitted from state si and Wt = wk, is the event of the tth word being the particular word w~, that was actually observed in the word string.</Paragraph>
    <Paragraph position="3"> Note that for t = 1 ..... T-1 ; i,j- l ..... n</Paragraph>
    <Paragraph position="5"> where pij = P(St+I = sj I St = si) are the transition probabilities, encoding the tag N-gram probabilities, and</Paragraph>
    <Paragraph position="7"> tagger is encoded as a first-order HMM, where each state corresponds to a sequence of ,V-I tags, i.e., for a trigram tagger, each state corresponds to a tag pair.</Paragraph>
    <Paragraph position="8">  are the lexical probabilities. Here X, is the random variable of assigning a tag to the tth word and xj is the last tag of the tag sequence encoded as state sj. Note that si # sj need not imply zi # zj.</Paragraph>
    <Paragraph position="9"> More precisely, the tagger employs the converse lexical probabilities</Paragraph>
    <Paragraph position="11"> The rationale behind this is to facilitate estimating the model parameters from sparse data. In more detail, it is easy to estimate P(tag I word) for a previously unseen word by backing off to statistics derived from words that end with the same sequence of letters (or based on other surface cues), whereas directly estimating P(word I tag) is more difficult.</Paragraph>
    <Paragraph position="12"> This is particularly useful for languages with a rich inflectional and derivational morphology, but also for English: for example, the suffix &amp;quot;-tion&amp;quot; is a strong indicator that the word in question is a noun; the suffix &amp;quot;-able&amp;quot; that it is an adjective.</Paragraph>
    <Paragraph position="13"> More technically, the lexicon is organised as a reverse-suffix tree, and smoothing the probability estimates is accomplished by blending the distribution at the current node of the tree with that of higher-level nodes, corresponding to (shorter) suffixes of the current word (suffix). The scheme also incorporates probability distributions for the set of capitalized words, the set of all-caps words and the set of infrequent words, all of which are used to improve the estimates for unknown words. Employing a small amount of back-off smoothing also for the known words is useful to reduce lexical tag omissions. Empirically, looking two branching points up the tree for known words, and all the way up to the root for unknown words, proved optimal. The method for blending the distributions applies equally well to smoothing the transition probabilities pij, i.e., the tag N-gram probabilities, and both the scheme and its application to these two tasks are described in detail in (Samuelsson 1996), where it was also shown to compare favourably to (deleted) interpolation, see (Jelinek and Mercer 1980), even when the back-off weights of the latter were optimal.</Paragraph>
    <Paragraph position="14"> The 6 variables enable finding the most probable state sequence under the HMM, from which the most likely assignment of tags to words can be directly established. This is the normal modus operandi of an HMM decoder. Using the 7 variables, we can calculate the probability of being in state si at string position t, and thus having emitted wk, from this state, conditional on the entire word string. By summing over all states that would assign the same tag to this word, the individual probability of each tag being assigned to any particular input word, conditional on the entire word string, can be calculated:</Paragraph>
    <Paragraph position="16"> This allows retaining multiple tags for each word by simply discarding only low-probability tags; those whose probabilities are below some threshold value.</Paragraph>
    <Paragraph position="17"> Of course, the most probable tag is never discarded, even if its probability happens to be less than the threshold value. By varying the threshold, we can perform a recall-precision, or error-rate-ambiguity, tradeoff. A similar strategy is adopted in (de Marcken 1990).</Paragraph>
  </Section>
  <Section position="5" start_page="248" end_page="249" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> The statistical tagger was trained on 357,000 words from the Brown corpus (Francis and Ku~era 1982), reannotated using the EngCG annotation scheme (see above). In a first set of experiments, a 35,000 word subset of this corpus was set aside and used to evaluate the tagger's performance when trained on successively larger portions of the remaining 322,000 words. The learning curve, showing the error rate alter full disambiguation as a function of the amount of training data used, see Figure 1, has levelled off at 322,000 words, indicating that little is to be gained from further training. We also note that the absolute value of the error rate is 3.51% -- a typical state-of-the-art figure. Here, previously unseen words contribute 1.08% to the total error rate, while the contribution from lexical tag omissions is 0.08% 95% confidence intervals for the error rates would range from + 0.30% for 30,000 words to + 0.20~c at 322.000 words.</Paragraph>
    <Paragraph position="1"> The tagger was then trained on the entire set of 357,000 words and confronted with the separate 55,000-word benchmark corpus, and run both in full  Table h Error-rate-ambiguity tradeoff for both taggets on the benchmark corpus. Parenthesized numbers are interpolated.</Paragraph>
    <Paragraph position="2"> and partial disambiguation mode. Table 1 shows the error rate as a function of remaining ambiguity (tags/word) both for the statistical tagger, and for the EngCG-2 tagger. The error rate for full disanabiguation using the 6 variables is 4.72% and using the 7 variables is 4.68%, both -4-0.18% with confidence degree 95%. Note that the optimal tag sequence obtained using the 7 variables need not equal the optimal tag sequence obtained using the 6 variables. In fact, the former sequence may be assigned zero probability by the HMM, namely if one of its state transitions has zero probability.</Paragraph>
    <Paragraph position="3"> Previously unseen words account for 2.01%, and lexical tag omissions for 0.15% of the total error rate. These two error sources are together exactly 1.00% higher on the benchmark corpus than on the Brown corpus, and account for almost the entire difference in error rate. They stem from using less complete lexical information sources, and are most likely the effect of a larger vocabulary overlap between the test and training portions of the Brown corpus than between the Brown and benchmark corpora.</Paragraph>
    <Paragraph position="4"> The ratio between the error rates of the two taggets with the same amount of remaining ambiguity ranges from 8.6 at 1.026 tags/word to 28,0 at 1.070 tags/word. The error rate of the statistical tagger can be further decreased, at the price of increased remaining ambiguity, see Figure 2. In the limit of retaining all possible tags, the residual error rate is entirely due to lexical tag omissions, i.e., it is 0.15%, with in average 14.24 tags per word. The reason that this figure is so high is that the unknown words, which comprise 10% of the corpus, are assigned all possible tags as they are backed off all the way to the root of the reverse-suffix tree.</Paragraph>
  </Section>
class="xml-element"></Paper>