<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0113">
<Title>A Re-estimation Method for Stochastic Language Modeling from Ambiguous Observations</Title>
<Section position="3" start_page="0" end_page="155" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Stochastic language models are useful for many language processing applications, such as speech recognition and natural language processing. However, building an accurate stochastic language model requires large amounts of tagged text, and a tagged corpus may not always match a target application because of, for example, differences between tag systems. If the language model could instead be estimated from untagged corpora and the dictionary of a target application, both problems would be resolved, because large amounts of untagged text can be obtained easily and untagged corpora are neutral with respect to any application. Kupiec (1992) proposed a method for estimating an N-gram language model from an untagged corpus using the Baum-Welch re-estimation algorithm (Rabiner et al., 1994), and Cutting et al. (1992) applied this method to an English tagging system. Takeuchi and Matsumoto (1995) developed an extended method for unsegmented languages (e.g., Japanese) and applied it to their Japanese tagger.</Paragraph>
<Paragraph position="1"> However, Merialdo (1994) and Elworthy (1994) have criticized estimation from an untagged corpus based on the maximum likelihood principle. Their experiments revealed the limitations of such methods and showed that optimizing the likelihood does not necessarily improve tagging accuracy. In other words, the training data extracted from an untagged corpus using only a dictionary are, by nature, too noisy to build a reliable model. I would like to know whether this noise problem also arises in other language models such as the HMM. Zhou and Nakagawa (1994) have shown, in experiments on predicting a word from the preceding word sequence, that the HMM is more powerful than the bigram model and nearly equivalent to the trigram model, even though the HMM has fewer parameters than the N-gram model. In general, models with fewer parameters are more robust. Here, I investigate a method that can estimate HMM parameters from an untagged corpus, together with a general technique for suppressing noise in untagged training data. The goals of this paper are as follows.</Paragraph>
<Paragraph position="2"> * Extension of the Baum-Welch algorithm: I formulate an algorithm that can be applied to untagged corpora of unsegmented languages and that estimates not only the N-gram model but also the HMM. A scaling procedure is also defined within the algorithm.</Paragraph>
<Paragraph position="3"> * Credit factor: To overcome the noise of untagged corpora, I introduce credit factors assigned to the training data. The estimation algorithm approximately maximizes a modified likelihood weighted by the credit factors.</Paragraph>
<Paragraph position="4"> The problem of stochastic tagging is formulated in Section 2, and the extended re-estimation method is presented in Section 3. A way of determining the credit factor based on a rule-based tagger is described in Section 4. Experiments evaluating the proposed method are reported in Section 5.</Paragraph>
</Section>
</Paper>