<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1063">
  <Title>Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages</Title>
  <Section position="2" start_page="380" end_page="380" type="metho">
    <SectionTitle>
2 Lemma disambiguation
</SectionTitle>
    <Paragraph position="0"> The lemma disambiguation has been added to the previously developed analyser for two main reasons: * the average number of interpretations in unknown words is significantly higher than in standard words.</Paragraph>
    <Paragraph position="1"> * there could be more than one lemma per tag. Since the disambiguation module won't deal with this kind of ambiguity, it has to be solved to lemmatise the text.</Paragraph>
    <Paragraph position="2"> We use different methods for the disambiguation of linguistic variants and unknown words. In the case of linguistic variants we try to select the lemma that is &amp;quot;nearest&amp;quot; to the standard one according to the number of non-standard morphemes and rules. We choose the interpretation that has less non-standard uses.</Paragraph>
    <Paragraph position="3">  In the case of unknown words, the procedure uses the following criteria: * for each category and subcategory pair, leave at least one interpretation.</Paragraph>
    <Paragraph position="4"> * assign a weight to each lemma according to the final trigram and the category and subcategory pair.</Paragraph>
    <Paragraph position="5"> * select the lemma according to its length and weight -best combination of high weight and short lemma.</Paragraph>
    <Paragraph position="6"> These procedures have been tested with a small corpus and the produced error-rate is 0.2%. This is insignificant considering that the average number of interpretations of unknown words decreases by 7, as shown in table 1.</Paragraph>
  </Section>
  <Section position="3" start_page="380" end_page="380" type="metho">
    <SectionTitle>
3 Designing the tagset
</SectionTitle>
    <Paragraph position="0"> The choice of a tagset is a critical aspect when designing a tagger. Before defining the tagset 2 This module is very useful since Basque is still in normalisation process.</Paragraph>
    <Paragraph position="1"> we have had to take some aspects into account: there was not any exhaustive tagset for automatic use, and the output of the morphological analyser is too rich and does not offer a directly applicable tagset.</Paragraph>
    <Paragraph position="2"> While designing the general tagset, we tried to meet the following requirements: * it had to take into account all the problems concerning ellipsis, derivation and composition (Aduriz et al., 1995).</Paragraph>
    <Paragraph position="3"> * in addition, it had to be general, far from ad hoc tagsets.</Paragraph>
    <Paragraph position="4"> * it had to be coherent with the information provided by the morphological analyser.</Paragraph>
    <Paragraph position="5"> Bearing all these considerations in mind, the tagset has been structured in four levels: * in the first level, general categories are included (noun, verb, etc.). There are 20 tags. * in the second level each category tag is further refined by subcategory tags. There are 48 tags.</Paragraph>
    <Paragraph position="6"> * the third level includes other interesting information, as declension case, verb tense, etc. There are 318 tags in the training corpus, but using a larger corpus we found 185 new tags.</Paragraph>
    <Paragraph position="7"> * the output of the morphological analysis constitutes the last level of tagging. There are 2,943 different interpretations in this training corpus, but we have found more than 9,000 in a larger c  The morphological ambiguity will differ depending on the level of tagging used in each case, as shown in table 2.</Paragraph>
  </Section>
  <Section position="4" start_page="380" end_page="382" type="metho">
    <SectionTitle>
4 Morphological Disambiguation
</SectionTitle>
    <Paragraph position="0"> There are two kinds of methods for morphological disambiguation: on one hand, statistical methods need little effort and obtain very good results (Church, 1988; Cutting etal., 1992), at least when applied to English, but when we try to apply them to Basque we encounter additional problems; on the other hand, some rule-based systems (Brill, 1992; Voutilainen et aL, 1992) are at least as good as statistical systems and are better adapted to free-order languages and agglutinative languages. So, we  have selected one of each group: Constraint Grammar formalism (Karlsson et aL, 1995) and the HMM based TATOO tagger (Armstrong et aL, 1995), which has been designed to be applied it to the output of a morphological analyser and the tagset can be switched easily without changing the input text.</Paragraph>
    <Paragraph position="1">  We have used the second and third levels tagsets for the experiments and a small corpus -28,300 words- divided in a training corpus of 27,000 words and a text of 1,300 words for testing.</Paragraph>
    <Paragraph position="2">  The initial ambiguity of the training corpus is relatively high, as shown infig. 1, and the average number of tags per token is also higher than in other languages -see fig. 2. The number of ambiguity classes is also high -290 and 1138 respectively- and some of the classes in the test corpus aren't in the training corpus, specially in the 3rd level tagset. This means that the training corpus doesn't cover all the phenomena of the language, so we would need a larger corpus to assure that it is general and representative of the language.</Paragraph>
    <Paragraph position="3"> We tried both supervised and unsupervised 4 3 These measures are taken after the process denoted in each column: M-' morphological analysis; M* morphological analysis with enriched lexicon;  training using the 2nd level tagset and only supervised training using the third level tagset. The results are shown infig. 3(S). Accuracy is below 90% and 75% respectively. Using unknown words to enrich the lexicon, the results are improved -seefig. 3(S*)-, but are still far from the accuracy of other systems.</Paragraph>
    <Paragraph position="4"> We have also written some biases -to be exact 11- to correct the most evident errors in the 2nd level. We didn't write more biases for the following reasons: * They can use just the previous tag to change the probabilities, and in some cases we need a wider context to the left and/or to the right. * They can't use the lemma or the word.</Paragraph>
    <Paragraph position="5"> * From the beginning of this research, our intention was to combine this method with Constraint Grammar.</Paragraph>
    <Paragraph position="6"> Using these biases, the error rate decreases by 5% in supervised training and by 7% in unsupervised one-fig. 3(S+B).</Paragraph>
    <Paragraph position="7"> We also used biases 5 with the enriched lexicon and the accuracy increases by less than 2% in both experiments -fig. 3(S+B*). This is not a great improvement when trying to decrease an error rate greater than 10%, but the enrichment of the lexicon may be a good way to improve the system.</Paragraph>
    <Paragraph position="8"> The logical conclusions of these experiments are: * the statistical approach might not be a good approach for agglutinative and free-order languages -as pointed out by Oflazer and KuruOz (1994).</Paragraph>
    <Paragraph position="9"> * writing good disambiguation rules may really improve the accuracy of the disambiguation task.</Paragraph>
    <Paragraph position="10"> As we mentioned above, it is difficult to define accurate rules using stochastic models, so we use the Constraint Grammar for Basque 6 (Aduriz et al., 1997) for this purpose.</Paragraph>
    <Paragraph position="11"> The morphological disambiguator uses around 800 constraint rules that discard illegitimate analyses on the basis of local or global context methods to compare the results, the latter performed better using a larger corpus.</Paragraph>
    <Paragraph position="12"> These biases were written taking into account the errors made in the first experiment.</Paragraph>
    <Paragraph position="13"> The rules were designed having syntactic analysis as the main goal.</Paragraph>
    <Paragraph position="14"> conditions. The application of CG formalism 7 is quite satisfactory, obtaining a recall of 99,8% but there are still 2.16 readings per token. The ambiguity rate after applying CG of Basque drop from 41% to 12% using 2nd level tagset and 64% to 22% using 3rd level tagset</Paragraph>
  </Section>
  <Section position="5" start_page="382" end_page="383" type="metho">
    <SectionTitle>
5 Combining methods
</SectionTitle>
    <Paragraph position="0"> There have been some approaches to the combination of statistical and linguistic methods applied to POS disambiguation (Leech et al., 1994; Tapanainen and Voutilainen, 1994; Oflazer and Tiar, 1997) to improve the accuracy of the systems.</Paragraph>
    <Paragraph position="1"> Oflazer and &amp;quot;FOr (1997) use simple statistical information and constraint rules. They include a constraint application paradigm to make the disambiguation independent of the rule sequence. null The approach of Tapanainen and Voutilainen (1994) disambiguates the text using XT and ENGCG independently; then the ambiguities remaining in ENGCG are solved using the resuits of XT.</Paragraph>
    <Paragraph position="2"> We propose a similar combination, applying both disambiguation methods one after the other, but training the stochastic tagger on the output of the CG disambiguator.</Paragraph>
    <Paragraph position="3"> Since in the output of CG of Basque the avera7 These results were obtained using the CG-2 parser, which allows grouping the rules in different ordered subgrammars depending on their accuracy. This morphological disam-biguator uses only the first two subgrammars.</Paragraph>
    <Paragraph position="4"> s S '--* stochastic; * --* with enriched lexicon; B --, with biases; CG --, Constraint Grammar. ge number of possible tags is still high -1.131.14 for 2nd level tagset and 1.29-1.3 for 3rd level tagset- and the stochastic tagger produces relatively high error rate -around 15% in 2nd level and almost 30% in 3rd level-, we first apply constraint rules and then train the stochastic tagger on the output of the rule-based disambiguator. Fig. I(CG) shows the ambiguity left by Basque CG in terms of the tagsets. Although the ambiguity rate is significantly lower than in previous experiments, the remaining ambiguities are hard to solve even using all the lingu|stic information available.</Paragraph>
    <Paragraph position="5"> We have also experimented with the enriched lexicon and the results are very encouraging, as shown in fig. 3(CG+S*). Considering that the number of ambiguity classes is still high -around 240 in the 2nd level and more than 1000 in the 3rd level-, we think that the results are very good.</Paragraph>
    <Paragraph position="6"> For the 2nd level tagging, the error rate after combining both methods is less than 3.5%, half of it comes from MORFEUS and Basque CG and the rest is made by the stochastic disambiguation. This is due to the fact that generally the types of ambiguity remaining after CG is applied are hard to solve.</Paragraph>
    <Paragraph position="7"> Examining the errors, we find that half of them are made in unknown words trying to distinguish between proper names of persons and places. We use two different tags because it is interesting for some applications and the tagset was defined based on morphological features.</Paragraph>
    <Paragraph position="8"> This kind of ambiguity is very hard to solve and in some applications this distinction is not important. So in this case the accuracy of the tagger would be 98%.</Paragraph>
    <Paragraph position="9"> The accuracy in the third level tagset is around 91% using the combined method, which is not too bad bearing in mind the number of tags -310-, the precision of the input-1.29 tags/token- and that the training corpus does not cover all the phenomena of the language 9.</Paragraph>
    <Paragraph position="10"> We want to point out that the experiments with the 3rd level tagset show even clearer that the combined method performs much better than the stochastic. Moreover, we think that CG disambiguation is even convenient at this level because of the initial ambiguity -63%.</Paragraph>
    <Paragraph position="11"> 9 In a corpus of around 900,000 words we found 185 new tags and more than 1700 new classes.</Paragraph>
  </Section>
  <Section position="6" start_page="383" end_page="383" type="metho">
    <SectionTitle>
Conclusion
</SectionTitle>
    <Paragraph position="0"> We have presented the results of applying different disambiguation methods to an agglutinative and highly inflected language with a relatively free order in sentences.</Paragraph>
    <Paragraph position="1"> On one hand, this latter characteristic of Basque makes it difficult to learn appropriate probabilities, particularly first order stochastic models. We solve this problem in part with CG for Basque, which uses a larger context and can tackle the free word-order problem.</Paragraph>
    <Paragraph position="2"> However, it is a very hard work to write a full grammar and disambiguate texts completely using CG formalism, so we have complemented this method with a stochastic disambiguation process and the results are quite encouraging.</Paragraph>
    <Paragraph position="3"> Comparing the results of Tapanainen and Voutilainen (1994) with ours, we see that they achieve 98.5% recall combining 1.02-1.04 readings from ENGCG and 96% accuracy in XT, while we begin with 1.13-1.14 readings, the quality of our stochastic tagger is less than 90% and our result is better than 96%.</Paragraph>
    <Paragraph position="4"> Unlike Tapanainen and Voutilainen (1994), we think that training on the output of the CG the statistical disambiguation works quite better 10, at least using such a small training corpus. In the future we will compile a larger corpus and to decrease the number of readings left by CG.</Paragraph>
    <Paragraph position="5"> On the other hand, we think that the information given by the second level tag is not sufficient to decide which of the choices is the correct one, but the training corpus is quite small. However, translating the results of the 3rd level to the 2nd one we obtain around 97% of accuracy. So, we think that improving the 3rd level tagging would improve the 2nd level tagging too. We also want to experiment unsupervised learning in the 3rd level tagging with a large training corpus.</Paragraph>
    <Paragraph position="6"> Along with this, the future research will focus on the following processes: * morphosyntactic treatment for the elaboration of morphological information (nominalisation, ellipsis, etc.).</Paragraph>
    <Paragraph position="7"> * treatment of multiword lexical units (MWLU). We are planning to integrate this module to process unambiguous MWLU, to decreases the ambiguity rate and to make the input of the disambiguation more precise.</Paragraph>
  </Section>
  <Section position="7" start_page="383" end_page="383" type="metho">
    <SectionTitle>
Acknowledgement
</SectionTitle>
    <Paragraph position="0"> We are in debt with the research-team of the General Linguistics Department of the University of Helsinki for giving us permission to use CG Parser. We also want to thank Gilbert Robert for tuning TATOO.</Paragraph>
  </Section>
</Paper>