<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0113">
  <Title>A Re-estimation Method for Stochastic Language Modeling from Ambiguous Observations</Title>
  <Section position="4" start_page="155" end_page="156" type="metho">
    <SectionTitle>
2 Stochastic Tagging Formulation
</SectionTitle>
    <Paragraph position="0"> In general, the stochastic tagging problem can be formulated as a search problem in the stochastic space of sequences of tags and words. In this formulation, the tagger searches for the best sequence that maximizes the probability (Nagata, 1994): (1~ r, T) = arg maxp(W, TIS ) = arg maxp(W, T) (1) W,T W,T where W is a word sequence (wl,w2, ...,Wn), T is a tag sequence (tl,t2,...,tn) and S is an input sentence. Since Japanese sentences have no delimiters (e.g., spaces) between words, a morphological analyzer (tagger) must decide word segmentation in addition to part-of-speech assignment. The number of segmentation ambiguities of Japanese sentences is large and these ambiguities complicate the work of a Japanese tagger.</Paragraph>
    <Paragraph position="1"> Although all possible p(W, T)s on combinations of W and T cannot be estimated, there are some particularly useful approximations such as the N-gram model and the HMM. The following formulae are straightforward formulations whose observed variables are pairs of words and tags:</Paragraph>
    <Paragraph position="3"> Formula 2 is the N-gram model and formula 3 is the HMM. When N of formula 2 is two, the model is called the bigram, when N is three, it is the trigrarm Symbol x of formula 3 denotes a possible path of states of the HMM and x(i) denotes a state of the HMM that is visited at the i-th transition in the path x. ax(i),x(i+l ) is the transition probability from x(i) to x(i + i). In particular, ax(0),x(1 ) represents the initial state probability (Trx(1)) of x(1). b~(i)(w , t) is an output probability of a pair of word w and tag t on the state x(i). A state of the HMM represents an abstract class of a part of the input symbol sequence. That is, we can regard the HMM as a mixed model of unigram, bigram, trigram, and so on.</Paragraph>
    <Paragraph position="4">  We can also decrease the number of model parameters by separating the tag model from formulae 2 and 3. In the models, the N-gram and the HMM are used to model tag sequence and p(wlt ) is used for another part of the model.</Paragraph>
    <Paragraph position="6"> The PAIR-HMM, TAG-bigram model, and TAG-HMM based on formulae 3, 4 (where N = 2) and 5, respectively, will be investigated in section 5. In the next section, I describe an extension to the forward-backward algorithm for determining HMM parameters fi'om ambiguous observations.</Paragraph>
  </Section>
  <Section position="5" start_page="156" end_page="160" type="metho">
    <SectionTitle>
3 Re-estimation Method from Ambiguous Observations
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="156" end_page="156" type="sub_section">
      <SectionTitle>
3.1 Ambiguous Observation Structure
</SectionTitle>
      <Paragraph position="0"> Here, we define an ambiguous observation as a lattice structure with a credit factor for each branch. In unsegmented languages that have no delimiter between words, such as Japanese, candidates for alignment of tag and word have different segmentation. That is, they must be represented by a lattice. We can create a lattice structure from untagged Japanese sentences and a Japanese dictionary.</Paragraph>
      <Paragraph position="1"> The following is the definition of the lattice of candidates representing ambiguous word and tag sequences called the morpheme network. All morphemes on the morpheme network are numbered.</Paragraph>
      <Paragraph position="2"> w, or word(s): The spelling of the s-th morpheme.</Paragraph>
      <Paragraph position="3"> ts or tag(s): The tag of the s-th morpheme.</Paragraph>
      <Paragraph position="4"> suc(s): The set of morpheme numbers that the s-th morpheme connects to.</Paragraph>
      <Paragraph position="5"> pre(s): The set of morpheme numbers that connect to the s-th morpheme.</Paragraph>
      <Paragraph position="6"> credit(r, s): The credit factor of the connection between the r-th and the s-th morphemes. For example, a morpheme network can be derieved from the input sentence &amp;quot;~ L ~3 v~&amp;quot; which means &amp;quot;not to sail&amp;quot; (Fig. 1). The real and dotted lines in Figure 1 represent the correct and incorrect paths of morphemes, respectively. Of course, any algorithm for estimation from untagged corpora cannot determine whether the connections are correct or not. The connections of dotted lines constitute noise for the estimation algorithm. The numbers on the lines show the credit factor of each connection that is assigned by the method described in section 4. The numbers at the right of colons are morpheme numbers. In Figure 1, word(3) is ' ~ b ', tag(3) is 'verb', pre(3) is the set {1}, sue(3) is the set {6, 7} and the credit factor credit(l, 3) is 0.8.</Paragraph>
    </Section>
    <Section position="2" start_page="156" end_page="158" type="sub_section">
      <SectionTitle>
3.2 Re-estimation Algorithm
</SectionTitle>
      <Paragraph position="0"> Given a morpheme network, we can formulate the reestimation algorithm for the HMM parameters. The original forward-backward algorithm calculates the probability of the partial observation sequence given the state of the HMM at the time (position of word in the input sentence). The original algorithm does this by a time synchronous procedure operating on unambiguous observation sequence. The extended algorithm calculates the probability of the</Paragraph>
      <Paragraph position="2"> partial ambiguous sequence given the state of the HMM at the node (morpheme candidate) in the morpheme network by a node synchronous procedure. The algorithm formulation is as follows: initial:</Paragraph>
      <Paragraph position="4"> where on(l) is the set of numbers of the left most morphemes in the morpheme network and on(B) is the set of numbers of the right most morphemes. The '#' in credit(#, u) means the beginning-of-text indicator.</Paragraph>
      <Paragraph position="5"> The trellis, that is often used to explain the originM forward-backward algorithm, is extended into a network trellis. Figure 2 is an example of the network trellis that is generated from the morpheme network example given above (Fig. 1). In this example, c~7(1) means a forward probability of the 7th morpheme at the 1st state of the HMM.</Paragraph>
      <Paragraph position="6"> Using the extended forward-backward probabilities we can formulate the reestimation algorithm from ambiguous observations:</Paragraph>
      <Paragraph position="8"> noun: 1 verb:3 adjective:6</Paragraph>
      <Paragraph position="10"/>
      <Paragraph position="12"> where k represents the k-th input sentence and Pk is sum of the probabilities of possible sequences in the k-th morpheme network weighted by the credit factors.</Paragraph>
    </Section>
    <Section position="3" start_page="158" end_page="160" type="sub_section">
      <SectionTitle>
3.3 Scaling
</SectionTitle>
      <Paragraph position="0"> In the calculation of forward-backward probabilities, under-flow sometimes occurs if the dictionaxy for making the morpheme network is large and/or the length of the input sentence is long, because the forward-backward algorithm multiplies many small transition and output probabilities together. This problem is native to speech modeling, but in general, the modeling of text is free from this problem. However, since Japanese sentences tend to be relatively long and the recent Japanese dictionary for research is large, under-flow is sometimes a problem.</Paragraph>
      <Paragraph position="1"> For example, the EDIt Japanese corpus (EDR, 1994) includes sentences that consist of more than fifty words at a frequency of one percent. In fact, we experienced the underflow problem in preliminary experiments with the EDR corpus.</Paragraph>
      <Paragraph position="2"> Application of the scaling technique of the original backward-forward algorithm (Rabiner et al., 1994) to our reestimation method would solve the under-flow problem. The original technique is based on synchronous calculation with positions of words in the input sentence in left-to-right fashion. However, since word boundaries in the morpheme network may or may not cross on the input character sequence, we cannot directly apply this method to the extended algorithm.</Paragraph>
      <Paragraph position="3"> Let us introduce synchronous points on a~ input characters sequence to facilitate synchronization of the calculation of forward-backward probabilities. All possible paths of a morpheme</Paragraph>
      <Paragraph position="5"> network have one morpheme on each synchronous point. The synchronous points are defined as positions of the head character of all morphemes in a morpheme network and are numbered from left to right. The synchronous point number of the left most word is defined as 1. A morpheme is associated with the synchronous points which are located in the flow of characters of the morpheme.</Paragraph>
      <Paragraph position="6"> The symbols and on(q) function are defined as follows: B: The maximum number of synchronous points in a morpheme network.</Paragraph>
      <Paragraph position="7"> on(q): The set of morpheme numbers that are associated with synchronous point q. L,: The left most synchronous point that is associated with the s-th morpheme.</Paragraph>
      <Paragraph position="8"> R,: The right most synchronous point that is associated with the s-th morpheme.</Paragraph>
      <Paragraph position="9"> Figure 3 is an example of the syncronous points for the morpheme network example given above (Fig. 1). The values of the symbols and function defined above are as follows in this example; B = 5, on(2) = {2, 3}, L5 = 3, R5 = 4 and so on.</Paragraph>
      <Paragraph position="10"> The scaled forward probabilities are defined with the above definitions. The notation ~st(i) is used to denote the unscaled forward probabilities of the s-th morpheme on the syncronous point l, &amp;sl(i) to denote the scaled forward probabilities, and &amp;,l(i) to denote the local version of c~ before scaling, cl is the scaling factor of synchronous point I.</Paragraph>
      <Paragraph position="11"> initial:</Paragraph>
      <Paragraph position="13"/>
      <Paragraph position="15"> The scaled forward probabilities can be calculated synchronizing with the synchronous points from left to right. The scaled backward probabilities are defined in the same way using the scaling factors obtained in the calculation of the forward probabilities.</Paragraph>
      <Paragraph position="16"> The scaled forward-backward probabilities have the following property:</Paragraph>
      <Paragraph position="18"> where &amp;8 = &amp;~R~ and fls = fl~ns. Using this property, the reestimation formulae can be replaced with the scaled versions. The replaced formulae are free of the under-flow problem and their use also obviates the need to calculate the weighted sum of path probabilities of the k-th ambiguous observation, Pk.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="160" end_page="161" type="metho">
    <SectionTitle>
4 Credit Factor
</SectionTitle>
    <Paragraph position="0"> In the estimation of a Japanese language model from an untagged corpus, the segmentation ambiguity of Japanese sentences severely degrades the model reliability. I will show that model estimations excluding the credit factors cannot overcome the noise problem in section 5. Credit factors play a very important role by supressing noise in the training data. However, a way of calculating the optimal value of credit is not yet available, so a preliminary method described in this section was used for the experiments.</Paragraph>
    <Paragraph position="1"> The 'costs' of candidates outputted by a rule-based tagger were used as the source of information related to the credit. Juman (Matsumoto et al., 1994) was used in our experiments to generate the morpheme network. Juman is a rule-based Japanese tagging system which uses hand-coding cost values that represent the implausibility of morpheme connections, and word- and tag-occurences. Given a cost-width, Juman outputs the candidates of morpheme sequences pruned by this cost-width. A larger cost-width would result in a larger number of output candidates.</Paragraph>
    <Paragraph position="2"> We evaluated the precision of a set of morpheme candidates that have a certain cost.</Paragraph>
    <Paragraph position="3"> The precision value was used as the credit factor of each branch in the morpheme network to be outputted by Juman (Table 1). In the experiments described in the next section, we approximated the results from this example (see Table 1) by the formula 1/(a* cost + b), where a was 0.5 and b 1.19.</Paragraph>
  </Section>
  <Section position="7" start_page="161" end_page="162" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="161" end_page="161" type="sub_section">
      <SectionTitle>
5.1 Implementation
</SectionTitle>
      <Paragraph position="0"> The experimental system for model estimation was implemented using the extended reestimation method. A morpheme network of each input sentence was generated with Juman (Matsumoto et al., 1994) and the credit factor was attached to each branch as described above. The system can estimate three kinds of models; the PAIRoHMM (formula 3) with output symbols as pairs of words and tags, the TAG-bigram model (formula 4, where N = 2) and TAG-HMM (formula 5) with output symbols as tags and p(w\]t). The scaling technique was used with all estimations.</Paragraph>
      <Paragraph position="1"> The numbers of parameters of the TAG-bigram model, the TAG-HMM and the PAIR-HMM are approximated by the equations NT 2 + ND, NS 2 + NS * NT + ND, and NS 2 + NS * ND, respectively, where NT is the number of tags, NS is the number of states of the HMM, and ND is the number of entries in the dictionary. In all experiments, NT, NS and ND were fixed at 104, 10, and 130,000, respectively. The numbers of parameters of the TAG-bigram model, TAG-HMM, PAIR-HMM were 10816 + ND, 1140 + ND, and 100+ IOND, respectively. Note that the number of parameters of the tag model of the TAG-HMM is one tenth that of the TAG-bigram model.</Paragraph>
      <Paragraph position="2"> For the model evaluation, a stochastic tagger was implemented. Given a morpheme network generated by Juman with a cost-width, the implemented tagger selects the most probable path in the network using each stochastic model. The best path was calculated by the Viterbialgorithm on the paths of the morpheme network.</Paragraph>
    </Section>
    <Section position="2" start_page="161" end_page="161" type="sub_section">
      <SectionTitle>
5.2 Data and Evaluation
</SectionTitle>
      <Paragraph position="0"> I used 26108 Japanese untagged sentences as training data and 100 hand-tagged sentences as test data, both from the Nikkei newspaper 1994 corpus (Nihon Keizai Shimbun, Inc., 1995).</Paragraph>
      <Paragraph position="1"> The test sentences include about 2500 Japanese morphemes. The tags were defined as the combination of part-of-speech, conjugation, and class of conjugation. The number of kinds of tags was 104.</Paragraph>
      <Paragraph position="2"> In the precision evaluation, the correct morpheme was defined as that matching the segmentation, tag, and spelling of the base form of the hand-tagged morpheme. The precision was defined as the proportion of correct morphemes relative to the total number of morphemes in the sequence which the tagger outputted as the best alignment of tags and words.</Paragraph>
    </Section>
    <Section position="3" start_page="161" end_page="162" type="sub_section">
      <SectionTitle>
5.3 Results
</SectionTitle>
      <Paragraph position="0"> Three kinds of models were estimated using the untagged training data with the initial parameters set to the equivalent probabilities. Each model was estimated both with and without use of the credit factor. The reestimation algorithm was iterated for five to twenty times.</Paragraph>
      <Paragraph position="1"> The precision of the most plausible segmentation and tag assignment was outputted by the tagger based on each stochastic model estimated either without (Figs. 4 and 5) or with (Fig. 6) the credit factor assignment function described in the previous section. Two versions of the morpheme network for the estimations were used; one limited by a cost-width of 500 (Fig. 4) and the other by a cost-width of 70 (Figs. 5 and 6). The cost-width of 500 required almost all of the morphemes to be used for the estimation. In other words, a morpheme network of cost-width 500 was equivalent to that extracted from the input sentence with a dictionary only. Although one experiment (Fig. 5) didn't use the credit factor assignment function, it is regarded as using a special function of the credit factor that returns 0 or 1, that  is a step function, with a cost threshold of 70. However, this function doesn't differentiate among morphemes whose costs are 0 and 70.</Paragraph>
      <Paragraph position="2"> The cost-widths (see horizontal axes in Figs. 4, 5 and 6) were provided to Juman to generate the morpheme network used in the stochastic tagger for model evaluation. The tagger chose the best morpheme sequence from the network by each stochastic model. A larger cost-width would result in a larger network, lower precision, and higher recall (Table 2). Note that the precision of any model will never exceed the recall of Juman (see Table 2). If a model is correctly estimated, then a larger cost-width will improve precision. Therefore, we can estimate model accuracy from the precision at cost-width 500 or 1000.</Paragraph>
      <Paragraph position="3"> When estimated without the credit factor (Fig. 4), neither the HMM nor the TAG-bigram model was robust against noisy training data. It was also observed in the experiments that the accuracy of tagging was degraded by excessive iterations of reestimation. I conclude that it is hard to estimate the Japanese model from only an untagged corpus and a dictionary.</Paragraph>
      <Paragraph position="4"> Precision was improved by the step credit factor function whose threshold is 70 (Fig. 5).</Paragraph>
      <Paragraph position="5"> The precision of the HMMs are better than the precision of the TAG-bigram model, despite the number of parameters of the TAG-HMM being smaller than that for the TAG-bigram model. The HMM is very capable of modeling language, if the training data is reliable.</Paragraph>
      <Paragraph position="6"> Including the variable credit factor in these models is an effective way to improve precision (Fig. 6). In particular, the results of the TAG-bigram model were dramatically improved by using the variable credit factor. Although incorporating the credit factor into the HMM improved the results, they remained at a level similar to that of the TAG-bigram model with the credit factor. Although it is not clear exactly why the HMM did not improved more, there are at least three possible explanations: (1) theoretical limitation of estimation using a~ untagged corpus, (2) using an untagged corpus, estimation of the HMM is harder than estimation of the bigram model, therefore more corpora are needed to train the HMM or (3) the credit factor in this experiment matched to the bigram model but not to the HMM.</Paragraph>
      <Paragraph position="7"> Investigation of these possibilities in the future is needed.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>