<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0406">
<Title>Unsupervised learning of word sense disambiguation rules by estimating an optimum iteration number in the EM algorithm</Title>
<Section position="3" start_page="0" end_page="3" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> In this paper, we improve an unsupervised learning method using the Expectation-Maximization (EM) algorithm proposed by Nigam et al. (2000) for text classification problems, in order to apply it to word sense disambiguation (WSD) problems. The original method works well for text classification, but it often degrades classification accuracy for WSD. To avoid this, we propose two methods to estimate the optimum iteration number in the EM algorithm. </Paragraph>
<Paragraph position="1"> Many problems in natural language processing can be converted into classification problems and solved by an inductive learning method. This strategy has been very successful, but it has a serious drawback: an inductive learning method requires labeled data, which is expensive because it must be made manually. To overcome this problem, unsupervised learning methods that use large amounts of unlabeled data to boost the performance of rules learned from a small amount of labeled data have been proposed recently (Blum and Mitchell, 1998; Yarowsky, 1995; Park et al., 2000; Li and Li, 2002). Among these methods, the method using the EM algorithm proposed by Nigam et al. (2000), referred to as the EM method in this paper, is the state of the art. However, the target of the EM method is text classification. It is hoped that this method can also be applied to WSD, because WSD is the most important problem in natural language processing. The EM method works well in text classification, but it often degrades classification accuracy in WSD. The EM method is expected to improve the accuracy of the learned rules step by step as the EM iterations proceed. However, this rarely happens in practice; in many cases, the accuracy falls after a certain number of iterations. In the worst case, the accuracy of the rule learned from labeled data alone is degraded by using unlabeled data. To overcome this problem, we estimate an optimum iteration number in the EM algorithm, and in actual learning we stop the EM iterations at the estimated number. If the estimated number is 0, the EM method is not used at all. To estimate the optimum iteration number, we propose two methods: one uses cross validation, and the other uses two heuristics in addition to cross validation. In this paper, we refer to the former method as CV-EM and to the latter as CV-EM2. </Paragraph>
<Paragraph position="2"> In experiments, we solved 50 noun WSD problems in the Japanese Dictionary Task of SENSEVAL-2 (Kurohashi and Shirai, 2001). The original EM method failed to boost the precision (76.78%) of the rule learned from labeled data alone. On the other hand, CV-EM and CV-EM2 boosted the precision to 77.88% and 78.56%, respectively. The score of CV-EM2 matches the best public score for this task. Furthermore, these methods were also confirmed to be effective for verb WSD problems. </Paragraph>
<Paragraph position="3"> We can solve the classification problem by estimating the probability P(c|x). Actually, the class c to be assigned to the input x is given by </Paragraph>
<Paragraph position="4"> c = argmax_c P(c|x) = argmax_c P(c) P(x|c) / P(x). </Paragraph>
<Paragraph position="5"> Because P(x) does not depend on the class c, as a result we get </Paragraph>
<Paragraph position="6"> c = argmax_c P(c) P(x|c). </Paragraph>
<Paragraph position="7"> In the above equation, P(c) is estimated easily; the question is how to estimate P(x|c). </Paragraph>
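To make the decision rule above concrete, here is a minimal sketch (not code from the paper; the function names and the way P(x|c) is supplied are assumptions) of estimating P(c) by relative frequency from labeled data and then choosing the class by c = argmax_c P(c) P(x|c):

    from collections import Counter

    def estimate_prior(labeled_classes):
        # P(c): relative frequency of each class among the labeled examples
        counts = Counter(labeled_classes)
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}

    def classify(x, prior, likelihood):
        # c = argmax_c P(c) * P(x|c); 'likelihood(x, c)' is any estimate of P(x|c),
        # for example the Naive Bayes estimate described next
        return max(prior, key=lambda c: prior[c] * likelihood(x, c))

For example, estimate_prior(['record', 'record', 'memo']) gives P(record) = 2/3 and P(memo) = 1/3.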
<Paragraph position="8"> Naive Bayes models assume the following: P(x|c) = P(e_1|c) P(e_2|c) ... P(e_n|c), (1) where e_1, ..., e_n are the features of x. </Paragraph>
<Paragraph position="9"> Estimating each P(e_j|c) is easy, so we can estimate P(x|c) (Mitchell, 1997). In order to use Naive Bayes effectively, we must select features that satisfy equation (1) as much as possible. In text classification tasks, the appearance of each word corresponds to a feature. </Paragraph>
<Paragraph position="10"> In this paper, we use the following six attributes (e1 to e6) for WSD. Suppose that the target word is w. In the following example sentence, the target word is 'kiroku', which has at least two meanings: 'memo' and 'record'. </Paragraph>
<Paragraph position="11"> kako/saikou/wo/kiroku/suru/ta/. (The sentence has been segmented into words, and each word has been transformed into its original form by morphological analysis.) </Paragraph>
<Paragraph position="12"> Because the word to the left of the word 'kiroku' is 'wo', we get 'e1=wo'. In the same way, we get 'e2=suru'. The content words to the left of the word 'kiroku' are the word 'kako' and the word 'saikou'. We select two words from them in order of proximity to the target word. Thus, we get 'e3=kako' and 'e3=saikou'. In the same way, we get 'e4=suru' and 'e4=.'. Note that the comma and the period are defined as a kind of content word in this paper. Next, we look up the thesaurus ID of the word 'saikou' and find 3.1920_4. </Paragraph>
<Paragraph position="13"> In our thesaurus, as shown in Figure 1, a higher number corresponds to a higher-level meaning. </Paragraph>
<Paragraph position="14"> In this paper, we use the four-digit number and the five-digit number of a thesaurus ID. As a result, for 'e3=saikou' we get 'e5=3192' and 'e5=31920'. In the same way, for 'e3=kako' we get 'e5=1164' and 'e5=11642'. Following this procedure, we should also look up the thesaurus ID of the word 'suru'. However, we do not look up the thesaurus ID for a word that consists only of hiragana characters, because such words are too ambiguous, that is, they have too many thesaurus IDs. When a word has multiple thesaurus IDs, we create a feature for each ID. </Paragraph>
<Paragraph position="15"> As a result, we get the following ten features from the above example sentence: e1=wo, e2=suru, e3=saikou, e3=kako, e4=suru, e4=., e5=3192, e5=31920, e5=1164, e5=11642. </Paragraph>
</Section>
</Paper>
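For concreteness, the feature extraction walked through in the example above can be sketched as follows. This is a minimal illustration, not code from the paper: the helpers is_content and thesaurus_ids are hypothetical stand-ins (a real implementation would use the morphological analyzer and the thesaurus), only the features e1 to e5 that appear in the example are covered, and the reduction of a thesaurus ID such as 3.1920_4 to its four- and five-digit forms is an assumption.

    def extract_features(words, t, is_content, thesaurus_ids):
        # words: the segmented sentence in original forms; words[t] is the target word
        feats = []
        # e1 / e2: the words immediately to the left and to the right of the target word
        if t > 0:
            feats.append(('e1', words[t - 1]))
        if t + 1 < len(words):
            feats.append(('e2', words[t + 1]))
        # e3 / e4: up to two content words to the left / right, in order of proximity
        # (the comma and the period count as content words here)
        left = [w for w in reversed(words[:t]) if is_content(w)][:2]
        right = [w for w in words[t + 1:] if is_content(w)][:2]
        feats += [('e3', w) for w in left]
        feats += [('e4', w) for w in right]
        # e5: four- and five-digit forms of the thesaurus IDs of the e3 words;
        # thesaurus_ids(w) returns [] for hiragana-only words, and one feature is
        # created for each ID when a word has several
        for w in left:
            for tid in thesaurus_ids(w):
                digits = tid.replace('.', '').replace('_', '')  # assumed normalization
                feats.append(('e5', digits[:4]))
                feats.append(('e5', digits[:5]))
        return feats

    # With words = ['kako', 'saikou', 'wo', 'kiroku', 'suru', 'ta', '.'] and t = 3,
    # suitable is_content / thesaurus_ids stand-ins reproduce the ten features listed above.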