<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-3003">
  <Title>Automatic Rule Induction for Unknown-Word Guessing</Title>
  <Section position="3" start_page="406" end_page="406" type="metho">
    <SectionTitle>
2. Guessing-Rule Schemata
</SectionTitle>
    <Paragraph position="0"> There are two kinds of word-guessing rules employed by our cascading guesser: morphological rules and nonmorphological ending-guessing rules. Morphological word-guessing rules describe how one word can be guessed given that another word is known. Unlike morphological guessing rules, nonmorphological rules do not require the base form of an unknown word to be listed in the lexicon. Such rules guess the pos-class for a word on the basis of its ending or leading segments alone. This is especially important when dealing with uninflected words and domain-specific sub-languages where many highly specialized words can be encountered. In English, as in many other languages, morphological word formation is realized by affixation: prefixation and suffixation. Thus, in general, each kind of guessing rule can be further subcategorized depending on whether it is applied to the beginning or tail of an un-</Paragraph>
  </Section>
  <Section position="4" start_page="406" end_page="416" type="metho">
    <SectionTitle>
Footnote 3
</SectionTitle>
    <Paragraph position="0"> The induction technique can be considered to be semi-unsupervised since it uses the annotation stated in the lexicon. At the same time, it does not require additional annotation, since that annotation already exists regardless of the rule induction task.</Paragraph>
    <Paragraph position="1">  Computational Linguistics Volume 23, Number 3 known word. To mirror this classification, we will introduce a general schemata for guessing rules and a guessing rule will be seen as a particular instantiation of this schemata.</Paragraph>
    <Paragraph position="2"> Definition A guessing-rule schemata is a structure G =x:{b.e} \[-S +M ?/-class --*R-class\] where * x indicates whether the rule is applied to the beginning or end of a word and has two possible values, b-beginning and e-end; * S is the affix to be segmented; it is deleted (-) from the beginning or end of an unknown word according to the value of x; * M is the mutative segment (possibly empty), which should be added (+) to the result string after the segmentation; * /-class is the required Pos-class (set of one or more pos-tags) of the stem; the result string after the -S and +M operations should be checked (?) in the lexicon for having this particular Pos-class; if/-class is set to be &amp;quot;void&amp;quot; no checking is required; * R-class is the POs-class to assign (--,) to the unknown word if all the above operations (-S +M ?I) have been successful.</Paragraph>
    <Paragraph position="3"> For example, the rule e\[-ied +y ?(VB VBP) --*(JJ VBD VBN)\] says that if there is an unknown word which ends with ied, we should strip this ending from it and append the string y to the remaining part. If we then find this word in the lexicon as (VB VBP) (base verb or verb of present tense non-3d form), we conclude that the unknown word is of the category (JJ VBD VBN) (adjective, past verb, or participle). Thus, for instance, if the word specified was unknown to the lexicon, this rule first would try to segment the required ending ied (specified - ied = specif), then add to the result the mutative segment y (specif + y = specify), and, if the word specify was found in the lexicon as (VB VBP), the unknown word specified would be classified as (JJ VBD VBN).</Paragraph>
    <Paragraph position="4"> Since the mutative segment can be an empty string, regular morphological formations can be captured as well. For instance, the rule b\[-un +&amp;quot;&amp;quot; ?(VBD VBN) --*(JJ)\] says that if segmenting the prefix un from an unknown word results in a word that is found in the lexicon as a past verb and participle (VBD VBN), we conclude that the unknown word is an adjective 0J). This rule will, for instance, correctly classify the word unscrewed if the word screwed is listed in the lexicon as (VBD VBN). When setting the S segment to an empty string and the M segment to a non-empty string, the schemata allows for cases when a secondary form is listed in the lexicon and the base form is not. For instance, the rule e\[-&amp;quot;&amp;quot; +ed ?(VBD VBN) --*(VB VBP)\] says that if adding the segment ed to the end of an unknown word results in a word</Paragraph>
    <Section position="1" start_page="408" end_page="410" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> that is found in the lexicon as a past verb and participle (VBD VBN), then the unknown word is a base or non-3d present verb (VB VBP).</Paragraph>
      <Paragraph position="1"> The general schemata can also capture ending-guessing rules if the/-class is set to be &amp;quot;void.&amp;quot; This indicates that no stem lookup is required. Naturally, the mutative segment of such rules is always set to an empty string. For example, an ending-guessing rule e\[-ing +&amp;quot;&amp;quot; ?-- --*(JJ NN VBG)\] says that if a word ends with ing it can be an adjective, a noun, or a gerund. Unlike a morphological rule, this rule does not check whether the substring preceding the ing-ending is listed in the lexicon with a particular POs-class.</Paragraph>
      <Paragraph position="2"> The proposed guessing-rule schemata is in fact quite similar to the set of generic transformations for unknown-word guessing developed by Brill (1995). There are, however, three major differences: * Brill's transformations do not check whether the stem belongs to a particular POS-class while the schemata proposed here does (?/-class) and therefore imposes more rigorous constraints; * Brill's transformations do not account for irregular morphological cases like try-tries whereas our schemata does (+M segment); * Brill's guessing rules produce a single most likely tag for an unknown word, whereas our guesser is intended to imitate the lexicon and produce all possible tags.</Paragraph>
      <Paragraph position="3"> Brill's system has two transformations that our schemata do not capture: when a particular character appears in a word and when a word appears in a particular context. The latter transformation is, in fact, due to the peculiarities of Brill's tagging algorithm and, in other approaches, is captured at the disambiguation phase of the tagger itself. The former feature is indirectly captured in our approach. It has been noticed (as in \[Weischedel et al., 1993\], for example) that capitalized and hyphenated words have a different distribution from other words. Our morphological rules account for this difference by checking the stem of the word. The ending-guessing rules, on the other hand, do not use information about stems. Thus if the ending s predicts that a word can be a plural noun or a 3d form of a verb, the information that this word was capitalized can narrow the considered set of POS-tags to plural proper noun. We therefore decided to collect ending-guessing rules separately for capitalized words, hyphenated words, and all other words. In our experiments, we restricted ourselves to the production of six different guessing-rule sets, which seemed most appropriate for English: * Suffix deg - suffix morphological rules with no mutative endings (0). Such rules account for the regular suffixation as, for instance, book + ed = booked;  * Suffix I - suffix morphological rules with a mutative ending in the last letter. Such rules account for many cases of the irregular suffixation as, for instance, try - y + ied = tried; * Prefix - prefix morphological rules with no mutative segments (0). Such rules account for the regular prefixation as, for instance, Un q- screw ~ unscrew;  Computational Linguistics Volume 23, Number 3 * Ending- - ending-guessing rules for hyphenated words; * Ending c - ending-guessing rules for capitalized words; * Ending* - ending-guessing rules for all other (nonhyphenated and noncapitalized) words.</Paragraph>
      <Paragraph position="4"> 3. Guessing-Rule Induction  As already mentioned, we see features that our guessing-rule schemata is intended to capture as general language regularities rather than properties of rare or corpus-specific words only. This significantly simplifies training data requirements: we can induce guessing rules from a general-purpose lexicon. 4 First, we no longer depend on the size or even existence of an annotated training corpus. Second, we do not require any annotation to be done for the training; instead, we reuse the information stated in the lexicon, which we can automatically map to a particular tag set that a tagger is trained to. We also use the actual frequencies of word usage, collected from a raw corpus. This allows for the discrimination between rules that are no longer productive (but have left their imprint on the basic lexicon) and rules that are productive in real-life texts. For guessing rules to capture general language regularities, the lexicon should be as general as possible (i.e., should list all possible pos-tags for a word) and large. The corresponding corpus should also be large enough to obtain reliable estimates of word-frequency distribution for at least 10,000-15,000 words. Since a word can take on several different POS-tags, in the lexicon it can be represented as a \[string/Pos-class\] record, where the POs-class is a set of one or more POS-tags. For instance, the entry for the word book, which can be a noun (NN) or a verb (VB) would look like \[book (NN VB)\]. Thus the nth entry of the lexicon (Wn) can be represented as \[W C\]n where W is the surface lexical form and C is its pos-class. Different lexicon entries can share the same POs-class but they cannot share the same surface lexical form. In our experiments, we used a lexicon derived from CRLEX (Burnage 1990), a large multilingual database that includes extensive lexicons of English, Dutch, and German. We constructed an English lexicon of 72,136 word forms with morphological features, which we then mapped into the Penn Treebank tag set (Marcus, Marcinkiewicz, and Santorini 1993). The most frequent open-class tags of this tag set are shown in Table 1. Word-frequency distribution was estimated from the Brown Corpus, which reflects multidomain language use.</Paragraph>
      <Paragraph position="5"> As usual, we separated the test sample from the training sample. Here we followed the suggestion that the unknown words actually are quite similar to words that occur only once (hapax words) in the corpus (Dermatas and Kokkinakis 1995; Baayen and Sproat 1995). We put all the hapax words from the Brown Corpus that were found in the CnLEx-derived lexicon into the test collection (test lexicon) and all other words from the CELEx-derived lexicon into the training lexicon. In the test lexicon, we also included the hapax words not found in the CELEx-derived lexicon, assigning them the POS-tags they had in the Brown Corpus. Then we filtered out words shorter than four characters, nonwords such as numbers or alpha-numerals, which usually are handled at the tokenization phase, and all closed-class words, s which we assume will always be present in the lexicon. Thus after all these transformations we obtained a lexicon of 59,268 entries for training and the test lexicon of 17,868 entries.</Paragraph>
      <Paragraph position="6">  Our guessing-rule induction technique uses the training and test data prepared as described above and can be seen as a sampling for the best performing rule set from a collection of automatically produced rule sets. Here is a brief outline of its major phases: Rule Extraction Phase (Section 3.1) - sets of word-guessing rules, (e.g., Prefix, Suffix deg, Suffix 1, Ending, etc.) are extracted from the lexicon and cleaned of redundant and infrequently used rules; Rule Scoring Phase (Section 3.2) - each rule from the extracted rule sets is ranked according to its accuracy, and rules that scored above a certain threshold are included in the working rule sets; Rule Merging Phase (Section 3.3) - rules that have not scored high enough are merged together into more general rules, then rescored, and, depending on their score, added to the working rule sets; Direct Evaluation Phase (Sections 3.4) - working rule sets produced with different thresholds are evaluated to obtain the best-performing ones.</Paragraph>
    </Section>
    <Section position="2" start_page="410" end_page="411" type="sub_section">
      <SectionTitle>
3.1 Rule Extraction Phase
</SectionTitle>
      <Paragraph position="0"> For the extraction of the initial sets of prefix and suffix morphological guessing rules (Prefix, Suffix deg, and Suffix1), we define the operator Vn where the index n specifies the length of the mutative ending of the main word. Thus when the index n is set to 0 the result of the application of the V0 operator will be a morphological rule with no mutative segment. The V1 operator will extract the rules with the alterations in the last letter of the main word. When the ~ operator is applied to a pair of entries from the lexicon (\[W C\]i and \[W C\]j), first, it segments the last (or first) n characters of the shorter word (Wj) and stores this in the M element of the rule. Then it tries to segment an affix by subtracting the shorter word (Wj) without the mutative ending from the longer word (Wi). If the subtraction results in an non-empty string and the mutative segment is not duplicated in the affix, the system creates a morphological rule with the POs-class of the shorter word (Cj) as the/-class, the POS-class of the longer word (Ci) as the R-class and the segmented affix itself in the S field. For example: \[booked (JJ VBD VBN)\] V0 \[book (NN VB)\] --+ e\[-ed +&amp;quot;&amp;quot; ?(NN VB) ---+(JJ VBD VBN)\] \[advisable (JJ)\] V1 \[advise (NN VB)\] ---+ e\[-able +&amp;quot;e&amp;quot; ?(NN VB) ---~(JJ) \] The V operator is applied to all possible pairs of lexical entries sequentially, and, if a rule produced by such an application has already been extracted from another pair, its frequency count (f) is incremented. Thus, prefix and suffix morphological rules together with their frequencies are produced. Next, we cut out the most infrequent rules, which might bias further learning. To do that we eliminate all the rules with frequency f less than a certain threshold 8, which usually is set quite low: 2-4. Such filtering reduces the rule sets more than tenfold.</Paragraph>
      <Paragraph position="1"> To collect the ending-guessing rules, we set the upper limit on the ending length equal to five characters and thus collect from the lexicon all possible word-endings of length 1, 2, 3, 4, and 5, together with the POS-classes of the words in which these endings appeared. We also set the minimum length of the remaining substring to three characters. We define the unary operator A, which produces a set of ending-guessing  Computational Linguistics Volume 23, Number 3 rules from a word in the lexicon (\[W C\]i). For instance, from a lexicon entry Idifferent (JJ)\] the operator A will produce five ending-guessing rules: A \[different 0J)\] = { e\[--t + .... ?-- ~ (J J)\] e\[--nt + .... ?-- --+ (JJ)\] e\[-ent + .... ?- ~ (J J)\] e\[-rent + .... ?-- --* (J3)\] e\[-erent + .... ?- --+ 0J)\] The G operator is applied to each entry in the lexicon, and if a rule it produces has already been extracted from another entry in the lexicon, its frequency count (f) is incremented. Then the infrequent rules with f &lt; 0 are eliminated from the ending-guessing rule set.</Paragraph>
      <Paragraph position="2"> After applying the/k and V operations to the training lexicon, we obtained rule collections of 40,000-50,000 entries. Filtering out the rules with frequency counts of 1 reduced the collections to 5,000-7,000 entries.</Paragraph>
    </Section>
    <Section position="3" start_page="411" end_page="413" type="sub_section">
      <SectionTitle>
3.2 Rule Scoring Phase
</SectionTitle>
      <Paragraph position="0"> Of course, not all acquired rules are equally good at predicting word classes: some rules are more accurate in their guesses and some rules are more frequent in their application. For every rule acquired, we need to estimate whether it is an effective rule worth retaining in the working rule set. To do so, we perform a statistical experiment as follows: we take each rule from the extracted rule sets, one by one, take each word-type from the training lexicon and guess its POs-class using the rule, if the rule is applicable to the word. For example, if a guessing rule strips off a particular suffix and a current word from the lexicon does not have this suffix, we classify that word and the rule as incompatible and the rule as not applicable to that word. If a rule is applicable to a word, we compare the result of the guess with the information listed in the lexicon. If the guessed class is the same as the class stated in the lexicon, we count it as a hit or success, otherwise it is a failure. Then, since we are interested in the application of the rules to word-tokens in the corpus, we multiply the result of the guess by the corpus frequency of the word. If we keep the sample space for each rule separate from the others, we have a binomial experiment. The value of a guessing rule closely correlates with its estimated proportion of success (/5), which is the proportion of all positive outcomes (x) of the rule application to the total number of the trials (n), which are, in fact, the number of all the word tokens that are compatible to the rule in the corpus: x: number of successful guesses = n: number of the compatible to the rule word-tokens The 15 estimate is a good indicator of the rule accuracy but it frequently suffers from large estimation error due to insufficient training data. For example, if a rule was found to apply just once and the total number of observations was also one, its estimate p has the maximal value (1) but clearly this is not a very reliable estimate. We tackle this problem by calculating the lower confidence limit 71&amp;quot; L for the rule estimate, which can be seen as the minimal expected value of/~ for the rule if we were to draw a large number of samples. Thus with a certain confidence c~ we can assume that if we used more training data, the rule estimate/~ would be not worse than the 7rL. The rule estimate then will be taken at its lowest possible value which is the ~L limit itself. First we adjust the rule estimate so that we have no zeros in positive (/~) or negative (1 - \]5) outcome probabilities, by adding some floor values to the numerator and denominator:</Paragraph>
      <Paragraph position="2"> d/ where t(l_c0/2 is a coefficient of the t-distribution. It has two parameters: c~, the level of confidence and dr, the number of degrees of freedom, which is one less than the sample size (dr n 1). e/ = - t(l_~)/2 can be looked up in the tables for the t-distribution listed df df in every textbook on statistics. We adopted 90% confidence for which t(1_o.9o)/2=to.o5 takes values depending on the sample size as in Figure 1.</Paragraph>
      <Paragraph position="3"> Using ~-L instead of \]~ for rule scoring favors higher estimates (/3) obtained over larger samples (n). Even if one rule has a high estimate value but that estimate was obtained over a small sample, another rule with a lower estimate value but obtained over a large sample might be valued higher by ~rL. This rule-scoring function resembles the one used by Tzoukermann, Radev, and Gale (1995) for scoring Pos-disambiguation rules for the French tagger. The main difference between the two functions is that there the t value was implicitly assumed to be 1, which corresponds to a confidence level of 68% on a very large sample.</Paragraph>
      <Paragraph position="4"> Another important consideration for rating a word-guessing rule is that the longer the affix or ending (S) of this rule, the more confident we are that it is not a coincidental one, even on small samples. For example, if the estimate for the word-ending o was obtained over a sample of five words and the estimate for the word-ending fulness was also obtained over a sample of five words, the latter is more representative, even though the sample size is the same. Thus we need to adjust the estimation error in accordance with the length of the affix or ending. A good way to do this is to decrease it proportionally to a value that increases along with the increase of the length. A suitable solution is to use the logarithm of the affix length: ^ .(o,-,I /pt(1 - ^* scorei -= Pt - to.os * V n. Pi )/(1 + log(ISil)) When the length of S (the affix or ending) is 1, the estimation error is not changed since log(l) is 0. For the rules with an affix or ending length of 2 the estimation error is reduced by 1 + log(2) = 1.3, for the length 3 this will be 1 + log(3) = 1.48, etc.</Paragraph>
      <Paragraph position="5"> The longer the length, the smaller the sample that will be considered representative enough for a confident rule estimation.</Paragraph>
      <Paragraph position="6"> Setting the threshold (0s) at a certain level we include in the working rule sets only those rules whose scores are higher than the threshold. The method for finding the optimal threshold is based on empirical evaluations of the rule sets and is described in Section 3.4. Usually, the threshold is set in the range of 65-80 points and the rule sets are reduced down to a few hundred entries. For example, when we set  the threshold (0s) to 75 points, the obtained ending-guessing rule collection (Ending*) comprised 1,876 rules, the suffix rule collection without mutation (Suffix deg) comprised 591 rules, the suffix rule collection with mutation (Suffix 1) comprised 912 entries and the prefix rule collection (Prefix) comprised 235 rules. Table 2 shows the highest-rated rules from the induced Prefix and Suffix deg rule sets. In general, it looks as though the induced morphological guessing rules largely consist of the standard rules of English morphology and also include a small proportion of rules that do not belong to the known morphology of English. For instance, the suffix rule e\[ -et +&amp;quot;&amp;quot; ?(NN) --,(NN)\] does not stand for any well-known morphological rule, but its prediction is as good as those of the standard morphological rules. The same situation can be seen with the prefix rule b\[ -st +&amp;quot;&amp;quot; ?(NNS) --+(NNS)I, which is quite predictive but at the same time is not a standard English morphological rule. The ending-guessing rules, naturally, include some proper English suffixes but mostly they are simply highly predictive ending segments of words.</Paragraph>
    </Section>
    <Section position="4" start_page="413" end_page="414" type="sub_section">
      <SectionTitle>
3.3 Rule Merging Phase
</SectionTitle>
      <Paragraph position="0"> Rules which have scored lower than the threshold are merged together into more general rules. These new rules, if they score above the threshold, can also be included in the working rule sets. We merge together two rules if they scored below the threshold and have the same affix (S), mutative segment (M), and initial class (i).6 We define the rule-merging operator (r): Ai @ Aj = At: \[Si, Mi, Ii, Ri U Rj\] if Si = Sj &amp; Mi = Mj &amp; Ii = Ij This operator merges two rules with the same affix (S), mutative segment (M) and the initial class (I) into one rule, with the resulting class being the union of the two merged resulting classes. For example,  Lexicon entry and guesser's categorization for \[developed (JJ VBD VBN)\]. The score of the resulting rule will be higher than the scores of the individual rules since the number of positive observations increases and the number of the trials remains the same. After a successful application of the * operator, the resulting general rule is substituted for the two merged ones. To perform such rule merging over a rule set the rules that have not been included into the working rule set are first sorted by their score and the rules with the best scores are merged first. After each successful merging, the resulting rule is rescored. This is done recursively until the score of the resulting rule does not exceed the threshold, at which point it is added to the working rule sets. This process is applied until no merges can be done to the rules that scored poorly. In our experiment we noticed that the merging added 30-40% new rules to the working rule sets, and therefore the final number of rules for the induced sets were: Prefix - 348, Suffix deg - 975, Suffix 1- 1,263 and Ending* - 2,196.</Paragraph>
    </Section>
    <Section position="5" start_page="414" end_page="416" type="sub_section">
      <SectionTitle>
3.4 Direct Evaluation Phase
</SectionTitle>
      <Paragraph position="0"> There are two important questions that arise at the rule acquisition stage: how to choose the scoring threshold Os and what the performance of the rule sets produced with different thresholds is. The task of assigning a set of POS-tags to a word is actually quite similar to the task of document categorization where a document is assigned a set of descriptors that represent its contents. There are a number of standard parameters (Lewis 1991) used for measuring performance on this kind of task. For example, suppose that a word can take on one or more POS-tags from the set of open-class POS-tags: qJ NN NNS RB VB VBD VBG VBN VBZ). To see how well the guesser performs, we can compare the results of the guessing with the Pos-tags known to be true for the Word (i.e., listed in the lexicon). Let us take, for instance, a lexicon entry \[developed (JJ VBD VBN)\]. Suppose that the guesser categorized it as \[developed (JJ NN RB VBD VBZ)\]. We can represent this situation as in Figure 2.</Paragraph>
      <Paragraph position="1"> The performance of the guesser can be measured in: * recall - the percentage of POS-tags correctly assigned by the guesser, i.e., two (jJ VBD) out of three (JJ VBD VBN) or 66%. 100% recall would mean that the guesser had assigned all the correct pos-tags but not necessarily only the correct ones. So, for example, if the guesser had assigned all possible POS-tags to the word its recall would have been 100%.</Paragraph>
      <Paragraph position="2"> * precision - the percentage of POS-tags the guesser assigned correctly (JJ VBD) over the total number of POS-tags it assigned to the word (Jl NN RB VBD VBZ), i.e., 2/5 or 40%. 100% precision would mean that the guesser did not assign incorrect POS-tags, although not necessarily all the correct ones were assigned. So, if the guesser had assigned only (JJ) its precision would have been 100%.</Paragraph>
      <Paragraph position="3"> * coverage - the proportion of words guesser was able to classify, but not necessarily correctly. So, for example, if we had evaluated a guesser with  something to 80 of them, its coverage would have been 80%.</Paragraph>
      <Paragraph position="4"> The interpretation of these percentages is by no means straightforward, as there is no straightforward way of combining these different measures into a single one. For example, these measures assume that all combinations of POS-tags will be equally hard to disambiguate for the tagger, which is not necessarily the case. Obviously, the most important measure is recall since we want all possible categories for a word to be guessed. Precision seems to be slightly less important since the disambiguator should be able to handle additional noise but obviously not in large amounts. Coverage is a very important measure for a rule set, since a rule set that can guess very accurately but only for a tiny proportion of words is of questionable value. Thus, we will try to maximize recall first, then coverage, and, finally, precision. We will measure the aggregate by averaging over measures per word (micro-average), i.e., for every single word from the test collection the precision and recall of the guesses are calculated, and then we average over these values.</Paragraph>
      <Paragraph position="5"> To find the optimal threshold (0s) for the production of a guessing rule set, we generated a number of similar rule sets using different thresholds and evaluated them against the training lexicon and the test lexicon of unseen 17,868 hapax words. Every word from the two lexicons was guessed by a rule set and the results were compared with the information the word had in the lexicon. For every application of a rule set to a word, we computed the precision and recall, and then using the total number of guessed words we computed the coverage. We noticed certain regularities in the behavior of the metrics in response to the change of the threshold: recall improves as the threshold increases while coverage drops proportionally. This is not surprising: the higher the threshold, the fewer the inaccurate rules included in the rule set, but at the same time the fewer the words that can be handled. An interesting behavior is shown by precision: first, it grows proportionally along with the increase of the threshold, but then, at high thresholds, it decreases. This means that among very confident rules with very high scores, there are many quite general ones. The best thresholds were obtained in the range of 70-80 points.</Paragraph>
      <Paragraph position="6"> Table 3 displays the metrics for the best-scored (by aggregate of the three metrics on the training and the test samples) rule sets. As the baseline standard, we took the ending-guessing rule set supplied with the Xerox tagger (Cutting et al. 1992). When we compared the Xerox ending guesser with the induced ending-guessing rule set (Ending*), we saw that its precision was about 6% poorer and, most importantly, it</Paragraph>
    </Section>
    <Section position="6" start_page="416" end_page="416" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> could handle 6% fewer unknown words. Finally, we measured the performance of the cascading application of the induced rule sets when the morphological guessing rules were applied before the ending-guessing rules (Prefix+Suffixdeg+Suffix 1 +Ending -c*). We detected that the cascading application of the morphological rule sets together with the ending-guessing rules increases the overall precision of the guessing by about 8%.</Paragraph>
      <Paragraph position="1"> This made the improvement over the baseline Xerox guesser 13% in precision and 7% in coverage on the test sample.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="416" end_page="419" type="metho">
    <SectionTitle>
4. Unknown-Word Tagging
</SectionTitle>
    <Paragraph position="0"> The direct evaluation phase gave us a basis for setting the threshold to produce the best-performing rule sets. The task of unknown-word guessing is, however, a subtask of the overall part-of-speech tagging process. Our main interest is in how the advantage of one rule set over another will affect the tagging performance. Therefore, we performed an evaluation of the impact of the word guessers on tagging accuracy. In this evaluation we used the cascading guesser with two different taggers: a c++ implemented bigram HMM tagger akin to one described in Kupiec (1992) and the rule-based tagger of Brill (1995). Because of the similarities in the algorithms with the LISP implemented Xerox tagger, we could directly use the Xerox guessing rule set with the HMM tagger. Brill's tagger came pretrained on the Brown Corpus and had a corresponding guessing component. This gave us a search-space of four basic combinations: the HMM tagger equipped with the Xerox guesser, the Brill tagger with its original guesser, the HMM tagger with our cascading (Prefix+Suffixdeg+Suffixl+Ending-C*) guesser and the Brill tagger with the cascading guesser. We also tried hybrid tagging using the output of the HMM tagger as the input to Brill's final state tagger, but it gave poorer results than either of the taggers and we decided not to consider this tagging option.</Paragraph>
    <Section position="1" start_page="416" end_page="418" type="sub_section">
      <SectionTitle>
4.1 Setting up the Experiment
</SectionTitle>
      <Paragraph position="0"> We evaluated the taggers with the guessing components on all fifteen subcorpora of the Brown Corpus, one after another. The HMM tagger was trained on the Brown Corpus in such a way that the subcorpus used for the evaluation was not seen at the training phase. All the hapax words and capitalized words with frequency less than 20 were not seen at the training of the cascading guesser. These words were not used in the training of the tagger either. This means that neither the HMM tagger nor the cascading guesser had been trained on the texts and words used for evaluation. We do not know whether the same holds for the Brill tagger and the Brill and Xerox guessers since we took them pretrained. For words that the guessing components failed to guess, we applied the standard method of classifying them as common nouns (NN) if they were not capitalized inside a sentence and proper nouns (NNP) otherwise. When we used the cascading guesser with the Brill tagger we interfaced them on the level of the lexicon: we guessed the unknown words before the tagging and added them to the lexicon listing the most likely tags first as required. 7 Here we want to clarify that we evaluated the overall results of the Brill tagger rather than just its unknown-word tagging component. Another point to mention is that, since we included the guessed words in the lexicon, the Brill tagger could use for the transformations all relevant Pos-tags for unknown words. This is quite different from the output of the original Brill's guesser, which provides only one Pos-tag for an unknown word.</Paragraph>
      <Paragraph position="1"> In our tagging experiments, we measured the error rate of tagging on unknown 7 We estimated the most likely tags from the training data.</Paragraph>
      <Paragraph position="2">  Computational Linguistics Volume 23, Number 3 words using different guessers. Since, arguably, the guessing of proper nouns is easier than is the guessing of other categories, we also measured the error rate for the subcategory of capitalized unknown words separately. The error rate for a category of words was calculated as follows: Error x = Wrongly_Tagged_Words_from_Set_X Total_Words_in_Set_X Thus, for instance, the error rate of tagging the unknown words is the proportion of the mistagged unknown words to all unknown words. To see the distribution of the workload between different guessing rule sets we also measured the coverage of a guessing rule set:</Paragraph>
      <Paragraph position="4"> Total _Unknown _Words We collected the error and coverage measures for each of the fifteen subcorpora 8 of the Brown Corpus separately, and, using the bootstrap replicate technique (Efron and Tibshirani 1993), we calculated the mean and the standard error for each combination of the taggers with the guessing components. For the fifteen accuracy means {al, d2 .... , a15} obtained upon tagging the fifteen subcorpora of the Brown Corpus, we generated a large number of bootstrap replicates of the form {bl, b2,..., b15} where each mean was randomly chosen with replacements such as, for instance, {bl = a11, b2 = a4, b3 = , b4 = an .... , b14 = a~9, b15 = a4}.</Paragraph>
      <Paragraph position="5"> Using these replicates, we calculated the mean and standard error of the whole bootstrap distribution as follows:</Paragraph>
      <Paragraph position="7"> distribution; This way of calculating the estimated standard error for the mean does not assume the normal distribution and hence provides more accurate results. We noticed a certain inconsistency in the markup of proper nouns (NNP) in the Brown Corpus supplied with the Penn Treebank. Quite often obvious proper nouns as, for instance, Summerdale, Russia, or Rochester were marked as common nouns (NN) and sometimes lower-cased common nouns such as business or church were marked as proper nouns. Thus we decided not to count as an error the mismatch of the NN/NNP tags. Using the HMM tagger with the lexicon containing all the words from  the Brown Corpus, we obtained the error rate (mean) 0* (.)=4.003093 with the standard error deB=0.155599. This agrees with the results on the closed dictionary (i.e., without unknown words) obtained by other researchers for this class of the model on the same corpus (Kupiec 1992; DeRose 1988). The Brill tagger showed some better results: error rate (mean) 0* (.)=3.327366 with the standard error deB=O. 123903. Although our primary goal was not to compare the taggers themselves but rather their performance with the guessing components, we attribute the difference in their performance to the fact that Brill's tagger uses the information about the most likely tag for a word whereas the HMM tagger did not have this information and instead used the priors for a set of POS-tags (ambiguity class). When we removed from the lexicon all the hapax words and, following the recommendation of Church (1988), all the capitalized words with frequency less than 20, we obtained some 51,522 unknown word-tokens (25,359 wordtypes) out of more than a million word-tokens in the Brown Corpus. We tagged the fifteen subcorpora of the Brown Corpus by the four combinations of the taggers and the guessers using the lexicon of 22,260 word-types.</Paragraph>
    </Section>
    <Section position="2" start_page="418" end_page="419" type="sub_section">
      <SectionTitle>
4.2 Results of the Experiment
</SectionTitle>
      <Paragraph position="0"> Table 4 displays the tagging results on the unknown words obtained by the four different combinations of taggers and guessers. It shows the overall error rate on unknown words and also displays the distribution of the error rate and the coverage between unknown proper nouns and the other unknown words. Indeed the error rate on the proper nouns was much smaller than on the rest of the unknown words, which means that they are much easier to guess. We can also see a difference in the distribution (coverage) of the unknown words using different taggers. This can be accounted for by the fact that the unguessed capitalized words were taken by default to be proper nouns and that the Brill tagger and the HMM tagger had slightly different strategies to apply to the first word of a sentence. The cascading guesser outperformed the other two guessers in general and most importantly in the non-proper noun category, where it had an advantage of 6.5% over Brill's guesser and about 8.7% over Xerox's guesser.</Paragraph>
      <Paragraph position="1"> In our experiments the category of unknown proper nouns had a larger share (6364%) than we expect in real life because all the capitalized words with frequency less than 20 were taken out of the lexicon. The cascading guesser also helped to improve the accuracy on unknown proper nouns by about 1% in comparison to Brill's guesser and about 3% in comparison to Xerox's guesser. The cascading guesser outperformed the other two guessers on every subcorpus of the Brown Corpus. Table 5 shows the distribution of the workload and the tagging accuracy among the different rule sets of the cascading guesser. The default assignment of the NN tag to unguessed words  performed very poorly, having the error rate of 44%. When we compared this distribution to that of the Xerox guesser we saw that the accuracy of the Xerox guesser itself was only about 6.5% lower than that of the cascading guesser 9 and the fact that it could handle 6% fewer unknown words than the cascading guesser resulted in the increase of incorrect assignments by the default strategy.</Paragraph>
      <Paragraph position="2"> There were three types of mistaggings on unknown words detected in our experiments. Mistagging of the first type occurred when a guesser provided a broader POS-class for an unknown word than a lexicon would, and the tagger had difficulties with its disambiguation. This was especially the case with the words that were guessed as noun/adjective (NN JJ) but, in fact, act only as one of them (as do, for example, many hyphenated words). Another highly ambiguous group is the ing words, which, in general, can act as nouns, adjectives, and gerunds and only direct lexicalization can restrict the search-space, as in the case of the word seeing, which cannot act as an adjective. The second type of mistagging was caused by incorrect assignments by the guesser. Usually this was the case with irregular words such as cattle or data, which were wrongly guessed to be singular nouns (NN) but in fact were plural nouns (NN8). We also did not include the &amp;quot;foreign word&amp;quot; category (FW) in the set of tags to guess, but this did not do too much harm because these words were very infrequent in the texts. And the third type of mistagging occurred when the word-POS guesser assigned the correct Pos-class to a word but the tagger still disambiguated this class incorrectly. This was the most frequent type of error, which accounted for more than 60% of the mistaggings on unknown words.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>