<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0101">
  <Title>Using word class for Part-of-speech disambiguation</Title>
  <Section position="5" start_page="2" end_page="3" type="metho">
    <SectionTitle>
3 Problems with lexical probabilities
</SectionTitle>
    <Paragraph position="0"> There are several ways lexical probabilities could be estimated for a given language, each of them presenting problems: 1. From raw text: a human tagger could manually disambiguate texts. There are problems, however: some words are inevitably overlooked (and therefore improperly tagged), and humans disagree on at least 5% of the words, so cross-checking by another human is required. In our system, we manually tagged about 76,000 words in this way.</Paragraph>
    <Paragraph position="1"> 2. Bootstrapping from already tagged text: this technique generally consists of using a small tagged corpus to train a system and having the system tag another subset of the corpus, which is then disambiguated by hand. (Derouault and Merialdo, 1986) have used these techniques, but the necessary human effort is still considerable.</Paragraph>
    <Paragraph position="2"> 3. From the baseform of the word: one could estimate the frequency of the analyzed stem in the process of morphological analysis.</Paragraph>
    <Paragraph position="3"> 4. From the inflectional morpheme: similarly, one could estimate the probability of the inflectional morpheme given its stem. This approach is often used for smoothing probabilities, but, considering the high ambiguity of some French suffixes, such as "e", "es", etc., it is doubtful that basing the estimates on the suffixes alone would give good results. 5. From unseen pairs of \[word, tag\]: for a given word, such as "marine", which can have 8 possible tags, if only the instances \[marine, adj-fem-sing\] and \[marine, noun-fem-sing\] are found in the training corpus, one could assume that the remaining unseen instances have a much lower probability. This could lead to incorrect assumptions about words. None of the possibilities outlined above seems feasible and robust enough. Therefore, we decided to pay more attention to a different paradigm, one which captures more information about the word at the morphological and syntactic level.</Paragraph>
  </Section>
  <Section position="6" start_page="3" end_page="7" type="metho">
    <SectionTitle>
4 The genotype solution
</SectionTitle>
    <Paragraph position="0"> In an attempt to capture both the multiple ambiguities of words and the recurrence of these observations, we came up with a new concept, called the genotype. In biology, the genotype refers to the content or pattern of genes in the cell. As used in our context, the genotype is the set of part-of-speech tags associated with a word. Each word is assigned a genotype (the set of tags derived from its morphological features) during morphological analysis, and words with the same tag pattern share the same genotype. The genotype depends on the tagset, but not on any particular tagging method. For example, the word "marine", with the eight morphological analyses listed in Table 1, has the genotype \[JFS NFS NMS V1SPI V1SPS V2SPM V3SPI V3SPS\], each tag corresponding to one analysis, i.e. the list of potential tags for "marine" as shown in Table 1. For each genotype, we compute the frequency with which each of its tags occurs and select the most frequent one: a genotype decision is the most frequent tag associated with a genotype in the training corpus. This paradigm has the advantage of capturing the morphological variation of words together with the frequency with which they occur. As explained in Section 4.2, out of a training corpus of 76,000 tokens, we extracted a total of 429 unigram genotypes, 6650 bigram genotypes, and 23,802 trigram genotypes, with their respective decisions.</Paragraph>
    <Paragraph position="1"> Tag legend: V1SPI = verb, 1st person, singular, present, indicative; V1SPS = verb, 1st person, singular, present, subjunctive; V2SPM = verb, 2nd person, singular, present, imperative; V3SPI = verb, 3rd person, singular, present, indicative; V3SPS = verb, 3rd person, singular, present, subjunctive.</Paragraph>
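To make the computation of genotype decisions concrete, here is a minimal Python sketch (the tiny tagged corpus, the word list, and the variable names are hypothetical illustrations, not the data or code used in the paper): for each genotype, the tag that is most often correct in training becomes the decision.

    from collections import Counter, defaultdict

    # Hypothetical training triples: (word, genotype, correct tag).
    # The genotype is the set of tags allowed by morphological analysis.
    training = [
        ("porte",  ("NFS", "V1S", "V2S", "V3S"), "NFS"),
        ("ferme",  ("NFS", "V1S", "V2S", "V3S"), "NFS"),
        ("marine", ("JFS", "NFS"),               "NFS"),
    ]

    # Count, for each genotype, how often each of its tags is correct.
    counts = defaultdict(Counter)
    for word, genotype, correct_tag in training:
        counts[genotype][correct_tag] += 1

    # The genotype decision is the most frequent tag for that genotype.
    decisions = {g: c.most_common(1)[0][0] for g, c in counts.items()}

    print(decisions[("NFS", "V1S", "V2S", "V3S")])   # -> NFS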
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Power of genotypes
</SectionTitle>
      <Paragraph position="0"> The genotype concept allows generalizations to be made across words according to tag patterns, thereby gathering estimates not on words but on tag occurrences. We found that in a training corpus of 76,000 tokens, lexical frequencies are not as reliable as genotype frequencies. Table 2 and Table 3 illustrate this. Table 2 presents the set of words corresponding to the genotype \[NFP V2S\] and their resolution with respect to lexical frequencies and genotype frequencies. The table shows 12 words from the test corpus which, from a morphological point of view, can be either verb-2nd-person-singular (V2S) or noun-feminine-plural (NFP). The first column always contains the same tag, NFP, because of the genotype decision: we learned from the training corpus that whenever a word could be tagged NFP or V2S, it was tagged NFP 100% of the time and V2S 0% of the time, so the noun form is always picked over the verb form. Of the 12 words listed in Table 2, 4 (marked unseen in the table) could not be estimated using lexical frequencies alone since they do not appear in the training corpus. However, since all of them belong to the same genotype, the 4 unseen occurrences are properly tagged.</Paragraph>
      <Paragraph position="1">  In Table 3, we show that the genotype decision for the \[NMS V1S V2S V3S\] genotype always favors the noun-masculine-singular form (NMS) over the verb forms (V1S for verb-1st-person-singular, V2S for verb-2nd-person-singular, V3S for verb-3rd-person-singular). Of the 12 words listed in Table 3, 5 do not occur in the training corpus, and 4 of them can be properly tagged using the genotype estimates. The word "suicide", however, which should be tagged as a verb, was improperly tagged as a noun. Note that we are only considering genotype unigrams here, which tend to overgeneralize. However, as shown in Section 4.3, the additional bigram and trigram estimates use the context to select a more appropriate tag.</Paragraph>
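The way unseen words benefit from this generalization can be sketched as follows (a continuation of the hypothetical example above; the analyzer is a stand-in for a morphological analysis step, and the word used is purely illustrative):

    def tag_word(word, analyze, decisions):
        """Tag a word via the decision attached to its genotype.

        Because the lookup key is the genotype rather than the word,
        a word absent from the training corpus is still handled as
        long as its genotype was observed there.
        """
        genotype = analyze(word)
        if genotype in decisions:
            return decisions[genotype]
        # Unseen genotype: no decision available, keep the first analysis.
        return genotype[0]

    # Suppose the analyzer assigns an unseen word the genotype
    # (NFS, V1S, V2S, V3S); the genotype decision then tags it NFS.
    print(tag_word("girouette", lambda w: ("NFS", "V1S", "V2S", "V3S"), decisions))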
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
4.2 Distribution of genotypes
</SectionTitle>
      <Paragraph position="0"> Among all parts of speech, there is a clear division between closed-class parts of speech, which include prepositions and conjunctions, and open-class ones, which include verbs, nouns, and adjectives. Similarly, we suggest that genotypes be classified in three categories: * Closed-class genotypes contain at least one closed-class part of speech, e.g., "des", which belongs to the \[P R\] (preposition, article) genotype.</Paragraph>
      <Paragraph position="1">  * Semi-closed-class genotypes behave similarly to the closed-class genotypes, given the small number of words - often homographs - in each such genotype. For instance, the word "fils" (son \[singular and plural\], threads), with the low-frequency genotype \[NMS NMP\], or the word "avions" (planes, (we) had), which belongs to the genotype \[NFP V1P\].</Paragraph>
      <Paragraph position="2"> * Open-class genotypes contain all other genotypes, such as \[NFS V1S V2S V3S\]. This class, unlike the other two, is productive.</Paragraph>
      <Paragraph position="3"> Several facts demonstrate the power of genotypes for disambiguation. First, the number of genotypes on which the estimates are made is much smaller than the number of words on which estimates would otherwise have to be computed. Our results show that the training corpus of 76,000 tokens contains 10,696 distinct words but only 429 genotypes. Estimating probabilities on 429 genotypes rather than 10,696 words is an enormous gain. Since the distributions in both cases have a very long tail, there are many more words than genotypes for which we cannot obtain reliable statistics. As an example, we extracted the most frequent open-class genotypes from the training corpus (each of them occurring more than 100 times), shown in Table 4. It is striking that these 22 genotypes represent almost 10% of the corpus. The table shows the genotype in the first column, the number of occurrences in the second, the part-of-speech distribution in the third, and the best genotype decision with the percentage of this selection in the last column. We can see that words belonging to the same genotype are likely to be tagged with the same tag; for example, the genotype \[NFS V1S V2S V3S\] is tagged as NFS. This allows us to make predictions for words missing from the training corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
4.3 Contextual probabilities via bigram and trigram genotypes
</SectionTitle>
      <Paragraph position="0"> Using genotypes at the unigram level tends to result in overgeneralization, due to the fact that the genotype sets are too coarse. In order to increase the accuracy of part-of-speech disambiguation, we need to give priority to trigrams over bigrams, and to bigrams over unigrams.</Paragraph>
      <Paragraph position="1"> In a way similar to decision trees, Table 5 shows how the use of context allows for better disambiguation of a genotype. We consider a typical ambiguous genotype, \[JMP NMP\], which occurs 607 times in the training corpus, almost evenly distributed between the two alternative tags, JMP and NMP. As a result, if only unigram training data is used, the best candidate for that genotype would be JMP, occurring 316 out of 607 times. However, choosing JMP only gives us 52.06% accuracy. Table 5 clearly demonstrates that the contextual information around the genotype brings this percentage up significantly. As an example, let us consider the 5th line of Table 5, where the number 17 is marked with a square. In this case, we know that the \[JMP NMP\] genotype has a right context consisting of the genotype \[p r\] (4th column, 5th line). In this case, it is no longer true that JMP is the best candidate. Instead, NMP occurs 71 out of 91 times and becomes the best candidate. Overall, for all possible left and right contexts of \[JMP NMP\], the guess based on both the genotype and the single left or right context is correct 433 times out of 536 (80.78%). In a similar fashion, the three possible trigram layouts (Left, Middle, and Right) are shown in lines 18-27. They show that the performance based on trigrams is 95.90%. This particular example provides strong evidence of the usefulness of contextual disambiguation with genotypes.</Paragraph>
      <Paragraph position="8"> The fact that this genotype, very ambiguous as a unigram (52.06%), can be disambiguated as a noun or adjective according to context at the trigram stage with 95.90% accuracy demonstrates the strength of our approach.</Paragraph>
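As a rough sketch of this priority scheme (trigram over bigram over unigram), the following Python function backs off from the longest available context. The decision tables map genotype n-grams to tag n-grams and are assumed to have been extracted from the training corpus as described above; the function itself is our illustration, not the paper's implementation:

    def decide(genotypes, i, tri_decisions, bi_decisions, uni_decisions):
        """Pick a tag for the genotype at position i, preferring the
        longest context whose decision is available: trigram, then
        bigram, then unigram."""
        g = genotypes[i]

        # Trigram contexts: the ambiguous genotype can sit in the left,
        # middle, or right position of a trigram.
        for tri in ((i - 2, i - 1, i), (i - 1, i, i + 1), (i, i + 1, i + 2)):
            if all(0 <= j < len(genotypes) for j in tri):
                key = tuple(genotypes[j] for j in tri)
                if key in tri_decisions:
                    return tri_decisions[key][tri.index(i)]

        # Bigram contexts: left neighbour or right neighbour.
        for bi in ((i - 1, i), (i, i + 1)):
            if all(0 <= j < len(genotypes) for j in bi):
                key = tuple(genotypes[j] for j in bi)
                if key in bi_decisions:
                    return bi_decisions[key][bi.index(i)]

        # Fall back to the unigram genotype decision.
        return uni_decisions[g]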
    </Section>
    <Section position="4" start_page="5" end_page="7" type="sub_section">
      <SectionTitle>
4.4 Smoothing probabilities with genotypes
</SectionTitle>
      <Paragraph position="0"> In the context of a small training corpus, the problem of sparse data is more serious than with a larger tagged corpus. Genotypes play an important role in smoothing probabilities. By paying attention to tags only, and thus ignoring the words themselves, this approach handles new words that have not been seen in the training corpus. Table 6 shows how the training corpus provides coverage for the n-gram genotypes that appear in the test corpus. It is interesting to notice that only [...]
      [Table residue from Table 5 appears here in the source: trigram context counts for jmp and nmp with contexts such as p, r, x, z; the remainder of the sentence is not recoverable.]</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="7" end_page="11" type="metho">
    <SectionTitle>
5 Comparison with other approaches
</SectionTitle>
    <Paragraph position="0"> In some sense, this approach is similar to the notion of "ambiguity classes" described in (Kupiec, 1992) and (Cutting et al., 1992), where words that allow the same set of parts of speech are grouped together.</Paragraph>
    <Paragraph position="1"> In that approach, the notion of word equivalence, or ambiguity classes, is used to describe words belonging to the same part-of-speech categories. In our work, the entire algorithm bases its estimations on genotypes only, filtering down the ambiguities and resolving them with statistics. Moreover, the estimation is performed on sequences of n-gram genotypes. Also, the refinement contained in our system reflects the real morphological ambiguities, due to the rich nature of the morphological output and the choice of tags. There are three main differences between their work and ours. First, in their work, the most common words are estimated individually and the less common ones are put together in their respective ambiguity classes; in our work, every word is treated equally through its genotype. Second, in their work, ambiguity classes can be marked with a preferred tag in order to help disambiguation, whereas in our work there is no special annotation, since words are disambiguated through the sequential application of the modules. Third, and perhaps most important, in our system the linguistic and statistical estimations are done entirely on the genotypes, regardless of the words. Words are not estimated individually given their class categories; rather, genotypes are estimated separately from the words or in the context of other genotypes (bigram and trigram probabilities). (Brill, 1995) presents a rule-based part-of-speech tagger trained on an unannotated corpus. Some of the rules of his system, and the fact that he uses a minimal training corpus, suggest some similarities with our system, but the main aim of that work is to investigate methods for combining supervised and unsupervised training in order to come up with a high-performing tagger. (Chanod and Tapanainen, 1995) compare two frameworks for tagging French, one statistical, built upon the Xerox tagger (Cutting et al., 1992), and one based on linguistic constraints only. The constraints can be 100% accurate or describe the tendency of a particular tagging choice. The constraint-based tagger is shown to perform better than the statistical one, since writing rules is more manageable and controllable than adjusting the parameters of the statistical tagger. It is difficult to compare performance directly, since their tagset is very small, i.e. 37 tags, including a number of word-specific tags (which further reduces the number of "real" tags), and does not account for several morphological features, such as gender and number for pronouns. Moreover, very ambiguous categories, such as coordinating conjunctions, subordinating conjunctions, and relative and interrogative pronouns, tend to be collapsed; consequently, the disambiguation is simplified and results cannot be compared.</Paragraph>
    <Paragraph position="2"> 6 Implementation and performance of the part-of-speech tagger
We have developed a part-of-speech tagger using only a finite-state machine framework. The input string is represented as a finite-state generator, and the tagging is obtained through composition with a pipeline of finite-state transducers (FSTs). Besides the modules for pre-processing and tokenization, the tagger includes a morphological FST and a statistical FST, which incorporates linguistic and statistical knowledge. We have used a toolkit developed at AT&T Bell Laboratories (Pereira et al., 1994) which manipulates weighted and unweighted finite-state machines (acceptors or transducers). Using these tools, we have created a set of programs which generate finite-state transducers from descriptions of linguistic rules (in the form of negative constraints) and which encode distribution information obtained through statistical learning. Statistical decisions on genotypes are represented by weights: the lower the cost, the higher the chance of a particular tag being picked. With this representation, we are able to prefer one n-gram decision over another based on the cost.</Paragraph>
    <Paragraph position="3"> The morphological FST is generated automatically from a large dictionary of French of about 90,000 entries and from on-line corpora, such as the Le Monde newspaper (ECI, 1989 and 1990). It takes the text as input and produces an FST that encodes each possible tagging of the input text as one distinct path from the start state to the final state. The statistical FST is created from 1-gram, 2-gram, and 3-gram genotype data obtained empirically from the training corpus. It encodes all 1-, 2-, and 3-grams of genotypes extracted from the training corpus, with a cost determined as a function of the frequency of the genotype decision in the training corpus. Table 7 shows how costs are computed for a specific bigram and how these costs are used to make a tagging decision. The bigram in the example, \[p r\] \[jmp nmp\], occurs 306 times in the training corpus. All possible taggings, i.e. \[p\] \[jmp\], \[p\] \[nmp\], \[r\] \[jmp\], and \[r\] \[nmp\], appear in the training corpus. The sub-FST that corresponds to this genotype bigram has \[p r\] \[jmp nmp\] on its input and all 4 possible taggings on its output. Each tagging sequence has a different cost. Let f be the total count of the genotype bigram, and let f_t be the number of cases where the tagging is t, for each possible tagging t (in this example there are 4 possible taggings). The cost of the transition for tagging t is the negative logarithm of f_t divided by f: -log(f_t/f). The selected transition is the one with the lowest cost; the example in Table 7 illustrates the computation of costs, with the selected tagging, \[p\] \[nmp\], in bold.</Paragraph>
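The cost computation just described is straightforward to reproduce. In the sketch below the total of 306 occurrences comes from the text, but the split over the four taggings is a hypothetical placeholder, since the individual frequencies of Table 7 are not reproduced here:

    import math

    def tagging_costs(frequencies):
        """Convert tagging frequencies for one genotype n-gram into
        transition costs: cost(t) = -log(f_t / f), with f the total."""
        f = sum(frequencies.values())
        return {t: -math.log(f_t / f) for t, f_t in frequencies.items()}

    # Hypothetical split of the 306 occurrences of [p r] [jmp nmp].
    frequencies = {
        ("p", "jmp"): 60,
        ("p", "nmp"): 230,
        ("r", "jmp"): 6,
        ("r", "nmp"): 10,
    }
    costs = tagging_costs(frequencies)
    best = min(costs, key=costs.get)   # lowest cost = most frequent tagging
    print(best, costs[best])           # e.g. ('p', 'nmp') with the lowest cost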
    <Paragraph position="4"> [Table 7, partially recovered: columns genotype bigram, tagging, frequency; the bigram \[p r\] \[jmp nmp\] with its possible taggings (p, jmp; p, nmp; ...) and their frequencies.] In a similar way, the statistical FST contains paths for unigrams and trigrams. In order to prefer trigrams over bigrams, and bigrams over unigrams, we have added a biased cost to some transitions. The empirically determined values of the biased cost are as follows: trigram biased cost < bigram biased cost < unigram biased cost.</Paragraph>
    <Paragraph position="5"> If a certain bigram or trigram does not appear in the training corpus, the FST still has a corresponding path, but at a higher cost. Since negative constraints (such as "article" followed by "verb") reflect n-grams that are linguistically impossible and therefore have an expected frequency of zero, we assign them a very high cost (note that, in order to keep the graph connected, we cannot assign an infinite cost). To make the use of the biased cost clear, Table 8 shows the unigrams \[p r\] and \[jmp nmp\] that compose the bigram described in Table 7, together with the corresponding transition costs.</Paragraph>
    <Paragraph position="6"> [Table 8, partially recovered: columns genotype unigram, tagging, frequency; e.g. the unigram \[p r\] tagged p with frequency 6645/6883.] Figure 2 presents the FST that corresponds to Table 7 and Table 8. The top part shows how the genotype bigram \[p r\] \[jmp nmp\] can be tagged as a sequence of two unigrams; the bottom part uses one bigram to tag it. The notation on all arcs in the FST is the following: input string : output string / cost, e.g., \[p r\] : p / 1.04. The input is a genotype n-gram, and the output represents a possible tag n-gram with the corresponding cost. The FST shown in Figure 2 is part of a much larger FST containing 2.8 million arcs.</Paragraph>
    <Paragraph position="7"> The cheapest path for tagging the sequence of the two genotypes \[p r\] \[jmp nmp\] can go either through the single bigram transition or through a sequence of two unigram transitions. In the latter case (unigrams), the cheapest path, i.e. the lowest cost, includes the two transitions \[p\] and \[jmp\], for a total cost of 1.04 + 1.65 = 2.69. In this case, not only do bigrams take precedence over unigrams, but the choice of the tagging sequence \[p\] \[nmp\] is also better than the sequence \[p\] \[jmp\], as it takes the context information into account. Similarly, if a trigram contained a bigram as a sub-FST, the cost of going through the trigram would typically be smaller than the cost of going through a bigram and a unigram. In the case where two consecutive genotype unigrams do not compose a bigram seen in the training corpus, there is no context information that can be applied, and only the tagging information of the individual unigrams is used.</Paragraph>
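A small sketch of this cheapest-path comparison (the two unigram costs 1.04 and 1.65 are from the text; the bigram cost is a made-up placeholder standing in for whatever value the biased costs produce, chosen only to be lower than the unigram total):

    def path_cost(transition_costs):
        """Total cost of one tagging path; the biased cost for each
        n-gram order is assumed to be already folded into each value."""
        return sum(transition_costs)

    # Two competing paths for the genotype sequence [p r] [jmp nmp]:
    paths = {
        ("p", "jmp"): [1.04, 1.65],   # two unigram transitions, total 2.69
        ("p", "nmp"): [2.10],         # one bigram transition (placeholder cost)
    }

    best = min(paths, key=lambda tagging: path_cost(paths[tagging]))
    print(best, path_cost(paths[best]))   # the bigram path [p] [nmp] wins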
    <Paragraph position="8"> The tagger is based on a tagset of 72 parts of speech. As mentioned earlier, the training corpus was manually tagged and contains 76,000 words. The test corpus, also manually tagged, contains 1,500 words. Taking into account the large number of parts of speech, the tagger correctly disambiguates about 95% of unrestricted text. We are in the process of improving the tagger's performance by refining the rules and biased costs.</Paragraph>
    <Paragraph position="9"> 7 Steps for building an optimal training corpus
This section explains the motivation behind our recommendations for developing taggers for a language. The following steps are based on our experience and, we believe, will extend to a wide range of language types.</Paragraph>
    <Paragraph position="10">  1. Study morpho-syntactic ambiguity and word frequencies: Part-of-speech ambiguities must be observed as a function of the word frequencies as shown in Section 2.</Paragraph>
    <Paragraph position="11"> 2. Analyze morphology and morphological features in order to evaluate the ambiguity  of the language. As shown in Section 2, some suffixes may disambiguate a certain number of words, whereas others may be truly ambiguous and overlap several categories of words. 3. Determine a concise tagset based on the trade-off between tagset size and computational complexity. This requires system tuning and is often dependent on the application. The more tags, the harder the estimation of probabilities, and the sparser the data. Having a concise set of tags is therefore a priority.</Paragraph>
    <Paragraph position="12">  4. Obtain maximum genotype coverage: genotypes must first be separated into closed, semi-closed, and open class. Then, the first two classes must be exhaustively covered, since their number is relatively small. Last, open-class genotypes should be examined in order of frequency; since their number is finite, they can also be exhaustively covered.</Paragraph>
    <Paragraph position="13"> 5. Capture contextual probabilities: genotypes must be considered in context. As described in Section 4.3, bigram and trigram genotypes give accurate estimates of the morpho-syntactic variations of the language.</Paragraph>
    <Paragraph position="14"> We believe that concentrating efforts on these issues will allow part-of-speech tagger developers to optimize time and effort in order to develop adequate basic training material.</Paragraph>
  </Section>
</Paper>