<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-3002">
  <Title>Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution</Title>
  <Section position="5" start_page="327" end_page="328" type="metho">
    <SectionTitle>
3. Baseline Model
</SectionTitle>
    <Paragraph position="0"> To establish a benchmark for examining the power of the proposed algorithms, we begin with a baseline system, in which the parameters are estimated by using the MLE method. Later, we will show how to improve the baseline model with the proposed enhancement mechanisms.</Paragraph>
    <Section position="1" start_page="327" end_page="328" type="sub_section">
      <SectionTitle>
3.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> First of all, 10,000 parsed sentences generated by BehaviorTran (Chen et al. 1991), a commercialized English-to-Chinese machine translation system designed by Behavior Design Corporation (BDC), were collected. The domain for this corpus is computer manuals and documents. The correct parts of speech and parse trees for the collected sentences were verified by linguistic experts. The corpus was then randomly partitioned into a training set of 9,000 sentences and a test set of the remaining 1,000 sentences to eliminate possible systematic biases. The average number of words per sentence was 13.9 for the training set and 13.8 for the test set. In the training set, there were 1,030 unambiguous sentences, while 122 sentences were unambiguous in the test set. On average, there were 34.2 alternative parse trees per sentence for the training set, and 31.2 for the test set. If we exclude the unambiguous sentences, there were 38.49 and 35.38 alternative syntactic structures per sentence for the training set and the test set, respectively.</Paragraph>
      <Paragraph position="1"> 3.1.1 Lexicon and Phrase Structure Rules. In the current system, there are 10,418 distinct lexicon entries, extracted from the 10,000-sentence corpus. The grammar is composed of 1,088 phrase structure rules that are expressed in terms of 35 terminal symbols (parts of speech) and 95 nonterminal symbols.</Paragraph>
      <Paragraph position="2"> 3.1.2 Language Models. Usually, a more complex model requires more parameters; hence it frequently introduces more estimation error, although it may lead to less modeling error. To investigate the effects of model complexity and estimation error on the disambiguation task, the following models, which account for various lexical and syntactic contextual information, were evaluated:</Paragraph>
      <Paragraph position="4"> Lex(L1)+Syn(L1): this model uses a bigram model in computing lexical scores and the L1 mode of operation in computing syntactic scores. The number of parameters required is (10,418 x 35) + (35 x 35) + (96,699 x 95) = 9,492,260. 3 Lex(L2)+Syn(L1): this model uses a trigram model in computing lexical scores and the L1 mode of operation in computing syntactic scores. The number of parameters required is (10,418 x 35) + (35 x 35 x 35) + (96,699 x 95) = 9,533,910.</Paragraph>
      <Paragraph position="5"> Lex(L1)+Syn(L2): this model uses a bigram model in computing lexical scores and the L2 mode of operation in computing syntactic scores. The number of parameters required is (10,418 x 35) + (35 x 35) + (96,699 x 95 x 95) = 873,014,330.</Paragraph>
      <Paragraph position="6"> Lex(L2)+Syn(L2): this model uses a trigram model in computing lexical scores and the L2 mode of operation in computing syntactic scores. The number of parameters required is (10,418 x 35) + (35 x 35 x 35) + (96,699 x 95 x 95) = 873,055,980.</Paragraph>
      <Paragraph position="7"> 2 L1 means to consult one left-hand side part of speech, and L2 means to consult two left-hand side parts of speech.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="328" end_page="329" type="metho">
    <SectionTitle>
3 The number of parameters for the Lex(L1) and Lex(L2) modes is (Nw x Nt) + Nt^2 and (Nw x Nt) + Nt^3,
</SectionTitle>
    <Paragraph position="0"> respectively, where Nw (= 10,418) stands for the number of words in the lexicon, and Nt (= 35) denotes the number of distinct terminal symbols (parts of speech). The number of parameters for the Syn(L1) and Syn(L2) modes is Np x Nnt and Np x Nnt^2, respectively, where Nnt (= 95) denotes the number of nonterminal symbols, and Np (= 96,699) is the number of patterns corresponding to all possible reduce actions. Each pattern is represented as a pair of [current symbols, reduced symbol]. For instance, [{B,C},{A}] is the pattern corresponding to the reduce action A ← BC in Figure 2.</Paragraph>
    <Paragraph position="1"> The performance of the models is evaluated in terms of two measures: accuracy rate and selection power. The measure of accuracy rate of parse tree selection has been widely used in the literature. However, this measure cannot identify which model is better when the average number of alternative syntactic structures differs across tasks. For example, a language model with a 91% accuracy rate on a task with an average of 1.1 alternative syntactic structures per sentence, which corresponds to the performance of random selection, is by no means better than a language model that attains a 70% accuracy rate when there are an average of 100 alternative syntactic structures per sentence. Therefore, a measure, namely Selection Power (SP), is proposed in this paper to give additional information for evaluation.</Paragraph>
    <Paragraph position="2"> SP is defined as the average selection factor (SF) of the disambiguation mechanism on the task of interest. The selection factor for an input sentence is defined as the least proportion of all possible alternative structures that includes the selected syntactic structure. 4 A smaller SP value would, in principle, imply better disambiguation power.</Paragraph>
    <Paragraph position="3"> Formally, SP is expressed as</Paragraph>
    <Paragraph position="5"> SP = (1/M) * sum_{i=1}^{M} sf(i), where sf(i) = ri/ni is the selection factor for the ith sentence, M is the total number of sentences in the task, ni is the total number of alternative syntactic structures for the ith sentence, and ri is the rank of the most preferred candidate. The selection power of a disambiguation mechanism basically serves as an indicator of its ability to include the most preferred candidate within a particular (N-best) region. A mechanism with a smaller SP value is more likely to include the most preferred candidate within some given N-best hypotheses.</Paragraph>
    <Paragraph position="6"> In general, the measures of accuracy rate and the selection power are highly correlated. But it is more informative to report performance with both accuracy rate and selection power. Selection power supplements accuracy rate when two language models to be compared are tested on different tasks.</Paragraph>
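The SP measure defined above is straightforward to implement; the following is a minimal sketch (function and variable names are ours, not from the paper):

```python
def selection_power(ranks, num_candidates):
    """Selection power: the average selection factor sf(i) = r_i / n_i,
    where r_i is the rank a model assigns to the most preferred parse of
    sentence i (1 = top choice) and n_i is the number of alternative
    parses for that sentence.  Smaller values indicate stronger
    disambiguation power."""
    sfs = [r / n for r, n in zip(ranks, num_candidates)]
    return sum(sfs) / len(sfs)
```

For instance, a model that always ranks the correct parse first on sentences with 10 candidates each attains SP = 0.1, while random selection on the same task averages about 0.5.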
    <Section position="1" start_page="329" end_page="329" type="sub_section">
      <SectionTitle>
3.2 Summary of Baseline Results
</SectionTitle>
      <Paragraph position="0"> The performance of the various models in terms of accuracy rate and selection power is shown in Table 1; the values in parentheses correspond to performance excluding unambiguous sentences. Table 1 shows that better performance (both in terms of accuracy rate and selection power) can be attained when more contextual information is consulted (or when more parameters are used). The improvement in resolution of syntactic ambiguity by using more lexical contextual information, however, is not statistically significant 5 when the consulted contextual information in the syntactic models is fixed. For instance, the test set performance for the Lex(L1)+Syn(L2) model is 52.8%, while the performance for the Lex(L2)+Syn(L2) model is only 53.1%. With this small performance difference, we cannot reject the hypothesis that the performance of the Lex(L1)+Syn(L2) model is the same as that of the Lex(L2)+Syn(L2) model. On the other hand, if the consulted lexical contexts are fixed, the performance of the syntactic disambiguation process is improved significantly by using more syntactic contextual 4 The term "most preferred candidate" means the syntactic structure most preferred by people even when there is more than one arguably correct syntactic structure. However, throughout this paper, both the expressions "most preferred syntactic structure" and "correct syntactic structure" refer to the syntactic structure most preferred by our linguistic experts.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="329" end_page="330" type="metho">
    <SectionTitle>
5 The conclusions drawn throughout this paper are all examined based on the hypothesis testing
</SectionTitle>
    <Paragraph position="0"> procedure for a significance level α = 0.01 (Gillick and Cox 1989).</Paragraph>
    <Paragraph position="1"> information. For example, a 53.1% accuracy rate is attained for the Lex(L2)+Syn(L2) model, while the accuracy rate is 49.7% for the Lex(L2)+Syn(L1) model. This result indicates that the context-free assumption adopted by most stochastic parsers might not hold.</Paragraph>
  </Section>
  <Section position="8" start_page="330" end_page="343" type="metho">
    <SectionTitle>
4. Discrimination- and Robustness-Oriented Learning
</SectionTitle>
    <Paragraph position="0"> Although MLE possesses many nice properties (Kendall and Stuart 1979), the criterion of maximizing the likelihood value is not equivalent to that of minimizing the error rate in a training set. The maximum likelihood approach achieves disambiguation indirectly and implicitly through the estimation procedure. However, correct disambiguation depends only on the ranks, rather than the likelihood values, of the candidates. In other words, correct recognition will still be obtained if the score of the correct candidate is the highest, even though the likelihood values of the various candidates are estimated poorly. Motivated by this concern, a discrimination-oriented learning procedure is proposed in this paper to adjust the parameters iteratively such that the correct ranking orders can be achieved.</Paragraph>
    <Paragraph position="1"> A general adaptive learning algorithm for minimizing the error rate in the training set was proposed by Amari (1967) using the probabilistic descent (PD) method.</Paragraph>
    <Paragraph position="2"> The extension of PD, namely the generalized probabilistic descent method (GPD), was also developed by Katagiri, Lee, and Juang (1991). However, minimizing the error rate in the training set cannot guarantee that the error rate in the test set is also minimized. Discrimination-based learning procedures, in general, tend to overtune the training set performance unless the amount of available data is several times larger than the number of parameters (based on our experience). Overtuning the training set performance usually causes performance on the test set to deteriorate. Hence, the robustness issue, which concerns the possible statistical variations between the training set and the test set, must be taken into consideration when we adopt an adaptive learning procedure. In this section, we start with a learning algorithm derived from the probabilistic descent procedure (Katagiri, Lee, and Juang 1991). The robust learning algorithm explored by Su and Lee (1991, 1994) is then introduced to enhance the robustness of the system.</Paragraph>
    <Section position="1" start_page="331" end_page="334" type="sub_section">
      <SectionTitle>
4.1 Discrimination-Oriented Learning
</SectionTitle>
      <Paragraph position="0"> To link the syntactic disambiguation process with the learning procedure, a discrimination function, namely g_{j,k}(w_1^n), for the syntactic tree Syn_{j,k} corresponding to the lexical sequence Lex_k and the input sentence (or word sequence) w_1^n, is defined as g_{j,k}(w_1^n) = log P(Syn_{j,k}, Lex_k | w_1^n) (14). Since log(.) is a monotonically increasing function, we can rewrite the criterion for syntactic disambiguation in Equation 1 as the following equation: (j*, k*) = argmax_{j,k} {g_{j,k}(w_1^n)} (15). According to Equation 2, Equation 8, and Equation 12, the discrimination function can be further derived as follows:</Paragraph>
      <Paragraph position="2"> where Λ_syn(j,i) = [-log P(L_{j,i} | L_{j,1}^{i-1})]^{1/2} and Λ_lex(k,i) = [-log P(c_{k,i} | c_{k,1}^{i-1}, w_1^i)]^{1/2}; Φ_{j,k} = [Λ_syn(j,1), Λ_lex(k,1), ..., Λ_syn(j,n), Λ_lex(k,n)] is regarded as a parameter vector composed of the lexical and syntactic score components, and ||Φ_{j,k}|| is defined as the Euclidean norm of the vector Φ_{j,k}. However, in such a formulation, the lexical scores as well as the syntactic scores are assumed to contribute equally to the disambiguation process. This assumption is inappropriate because different linguistic information may contribute differently to various disambiguation tasks. Moreover, the preference scores related to various types of linguistic information may have different dynamic ranges. Therefore, different scores should be assigned different weights to account for both the contribution in discrimination and the dynamic ranges. The discrimination function is thus modified into the following form:</Paragraph>
      <Paragraph position="4"> where W_lex and W_syn stand for the lexical and syntactic weights, respectively; they are set to 1.0 initially. Φ'_{j,k} corresponds to a transformation of the original vector Φ_{j,k} and is represented as the following equation:</Paragraph>
      <Paragraph position="6"> The whole parameter set, denoted by Λ, thus includes the lexical weight W_lex, the syntactic weight W_syn, the lexical parameters Λ_lex = {λ_lex(i,j)} for all i,j, and the syntactic parameters Λ_syn = {λ_syn(i,j)} for all i,j; i.e., Λ = {W_lex, W_syn} ∪ Λ_lex ∪ Λ_syn (19). The decision rule for the classifier to select the desired output, according to Equation 17, is represented as follows:</Paragraph>
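Under this formulation, the classifier simply selects the candidate whose weighted score vector has the smallest norm. A sketch of the decision rule (the dictionary layout and names are illustrative assumptions, not from the paper):

```python
import math

def weighted_norm(syn_scores, lex_scores, w_syn=1.0, w_lex=1.0):
    """Weighted norm of the score vector: each component is
    [-log P]^(1/2), so the squared norm is a weighted sum of negative
    log-probabilities, and a smaller norm means a more probable parse."""
    total = (w_syn * sum(s * s for s in syn_scores)
             + w_lex * sum(s * s for s in lex_scores))
    return math.sqrt(total)

def choose_parse(candidates, w_syn=1.0, w_lex=1.0):
    """Decision rule: return the candidate whose weighted score vector
    has the smallest norm (highest weighted log-probability)."""
    return min(candidates,
               key=lambda c: weighted_norm(c["syn"], c["lex"], w_syn, w_lex))
```

With equal weights this reduces to the unweighted norm of Equation 16; raising w_syn relative to w_lex lets the syntactic evidence dominate the choice.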
      <Paragraph position="8"> Let the correct syntactic structure associated with the input sentence be Syn_{α,β}.</Paragraph>
      <Paragraph position="9"> Then the misclassification distance, denoted by d_{j,k}, for selecting the syntactic structure Syn_{j,k} as the final output is defined by the following equation:</Paragraph>
      <Paragraph position="11"> Such a definition makes the distance the difference of the lengths (or norms) of the score vectors in the parameter space. Furthermore, d_{j,k} is differentiable with respect to the parameters. Note that, according to the definition in Equation 21, an error will occur if d_{j,k} > 0, i.e., ||Φ'_{α,β}|| > ||Φ'_{j,k}||.</Paragraph>
      <Paragraph position="12"> Next, similar to the probabilistic descent approach (Amari 1967), a loss function l_{j,k}(Λ) is defined as a nondecreasing and differentiable function of the misclassification distance; i.e., l_{j,k}(Λ) = l(d_{j,k}(w_1^n; Λ)). To approximate the zero-one loss function defined for minimum-error-rate classification, the loss function, as in Amari (1967), is defined as</Paragraph>
      <Paragraph position="14"> where do is a small positive constant. It has been proved by Amari (1967) that the average loss function will decrease if the adjustments in the learning process satisfy the following equation:</Paragraph>
      <Paragraph position="16"> where ε(t) is a positive function, which usually decreases with time, to control the convergence speed of the learning process; U is a positive-definite matrix, which is assumed to be an identity matrix in the current implementation; and ∇ is the gradient operator. Hence, it follows from Equation 23 that the ith syntactic parameter component λ_syn^{(t+1)}(α,i), corresponding to the correct candidate Syn_{α,β}, would be adjusted in the (t+1)-th iteration according to the following equation:</Paragraph>
      <Paragraph position="18"> Meanwhile, the syntactic parameter component corresponding to the top incorrect candidate would be adjusted according to the following formulae:</Paragraph>
      <Paragraph position="20"> The learning rules for adjusting the lexical parameters can be represented in a similar manner. For the lexical parameters corresponding to the correct candidate:</Paragraph>
      <Paragraph position="22"> For the lexical parameters corresponding to the top candidate: λ_lex^{(t+1)}(k,i) = λ_lex^{(t)}(k,i) - Δλ_lex^{(t)}(k,i), if ||Φ'_{α,β}|| > ||Φ'_{j,k}||.</Paragraph>
      <Paragraph position="24"> In addition, the syntactic and lexical weights are adjusted as follows:</Paragraph>
      <Paragraph position="26"> As the parameters are adjusted according to the learning rules described above, the score of the correct candidate will increase and the score of the incorrect candidate will decrease from iteration to iteration until the correct candidate is selected.</Paragraph>
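The adjustment dynamics can be sketched as follows. This is our own illustrative implementation of one sigmoid-loss probabilistic-descent step on the squared norms of the score vectors, not the paper's code; the exact per-component updates are given by Equations 24 to 30:

```python
import math

def descent_step(correct, competitor, eps=0.1, d0=1.0):
    """One probabilistic-descent update on two score-component vectors.
    The misclassification distance is d = ||correct||^2 - ||competitor||^2
    (an error occurs when d > 0, since a smaller norm means a higher
    log-probability score).  The sigmoid loss l(d) = 1/(1 + exp(-d/d0))
    approximates the zero-one loss; stepping against its gradient shrinks
    the correct candidate's components and grows the competitor's."""
    d = sum(x * x for x in correct) - sum(x * x for x in competitor)
    s = 1.0 / (1.0 + math.exp(-d / d0))
    g = s * (1.0 - s) / d0                      # dl/dd, always positive
    new_correct = [x - eps * g * 2.0 * x for x in correct]
    new_competitor = [x + eps * g * 2.0 * x for x in competitor]
    return new_correct, new_competitor
```

Iterated over the training corpus, the distance d shrinks at each step, so the correct candidate's score rises relative to the competitor's until it is ranked first.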
      <Paragraph position="27"> The ratio of the syntactic weight to the lexical weight, i.e., Wsyn/Wlex, finally turns out to be 1.3 for the Lex(L2)+Syn(L2) model after the discriminative learning procedure is applied. This ratio varies with the adopted language models, but is always larger than 1.0. This result matches our expectation, because the syntactic score should provide more discrimination power than the lexical score in the syntactic disambiguation task.</Paragraph>
      <Paragraph position="28"> The experimental results of using the discriminative learning procedure with 20 iterations are shown in Table 2. For comparison, the corresponding results before learning, i.e., the baseline results, are repeated in the upper row of each table entry. For the Lex(L2)+Syn(L2) model, the accuracy rate for parse tree disambiguation in the training set is improved from 79.04% to 92.77%, which corresponds to a 65.5% error reduction rate. However, only a 7.03% error reduction rate is observed in the test set, from 53.10% to 56.40%. Similar tendencies are also observed for the other models.</Paragraph>
      <Paragraph position="29"> Since the discriminative learning procedure only aims at minimizing the error rate in the training set, the training set performance can usually be tuned very closely to 100% when a large number of parameters are available. However, the performance improvement for the test set is far less than that for the training set, since the statistical variations between the training set and the test set are not taken into consideration in the learning procedure. For investigating robustness issues in more detail, a robust learning procedure and the associated analyses are provided in the following section.</Paragraph>
    </Section>
    <Section position="2" start_page="334" end_page="338" type="sub_section">
      <SectionTitle>
4.2 Robust Learning
</SectionTitle>
      <Paragraph position="0"> As discussed in the previous section, the discriminative learning approach aims at minimizing the training set errors. The error rate measured in the training set is, in general, over-optimistic (Efron and Gong 1983), because the training set performance can be tuned to approach 100% by using a large number of parameters. The parameters obtained in such a way frequently fail to attain an optimal performance when used in  a real application. This over-tuning phenomenon happens mainly because of the lack of sufficient sampling data and the possible statistical variations between the training set and the test set.</Paragraph>
      <Paragraph position="1"> To achieve better performance for a real application, one must deal with statistical variation problems. Most adaptive learning procedures stop adjusting the parameters once the input training token has been classified correctly. For such learning procedures, the distance between the correct candidate and other competitive ones may be too small to cover the possible statistical variations between the training corpus and the real application. To remedy this problem, Su and Lee (1991, 1994) suggested that the distance margin between the correct candidate and the top competitor should be enlarged, even though the input token is correctly recognized, until the margin exceeds a given threshold. A large distance margin would provide a tolerance region in the neighborhood of the decision boundary to allow possible data scattering in real applications (Su and Lee 1994). A promising result has been observed by applying this robust learning procedure to recognize the English E-set alphabet (Su and Lee 1991, 1994).</Paragraph>
      <Paragraph position="2"> To enhance robustness, the learning rules from Equation 24 to Equation 30 are modified as follows. Following the notation of the previous section, the correct syntactic structure is denoted by Syn_{α,β}, and the syntactic structure of the strongest competitor is denoted by Syn_{j,k}, whose score may rank either first or second.</Paragraph>
      <Paragraph position="3"> For the syntactic and lexical parameters corresponding to the correct candidate: λ_syn^{(t+1)}(α,i) = λ_syn^{(t)}(α,i) + Δλ_syn^{(t)}(α,i),</Paragraph>
      <Paragraph position="5"> The learning rules of the syntactic and lexical weights are modified as follows:</Paragraph>
      <Paragraph position="7"> The margin δ in the above equations can be assigned either absolutely or relatively, as suggested in Su and Lee (1991, 1994). Currently, the relative mode with a 30% passing rate (i.e., 30% of the training tokens pass through the margin) is used in our implementation.</Paragraph>
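The distinction from plain discriminative learning can be made concrete in the update trigger; a sketch under our own naming (scores here are the squared norms, where smaller is better):

```python
def needs_update(correct_sqnorm, competitor_sqnorm, delta):
    """Robust-learning trigger.  Plain discriminative learning adjusts
    parameters only on errors (when the correct candidate's norm is not
    the smaller one); robust learning keeps adjusting on correct
    decisions too, until the winning margin exceeds the threshold delta,
    leaving a tolerance region around the decision boundary."""
    margin = competitor_sqnorm - correct_sqnorm   # positive on a correct decision
    return not margin > delta
```

With delta = 0 this degenerates to the discriminative rule; a positive delta forces further separation between the correct candidate and its strongest competitor even after the token is classified correctly.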
      <Paragraph position="8"> The simulation results, compared with the results obtained by using the discriminative learning procedure, are shown in Table 3. Table 3(a) shows that performances with robust learning in the training set are a little worse than those with discriminative learning for the L1 syntactic language models. Nevertheless, they are a little better for the L2 syntactic language model. None of these differences, however, is statistically significant. In contrast, the results with robust learning for the test set, as shown in Table 3(b), are much better in all cases. The robust learning procedure achieves more than 8% improvement over the discriminative learning procedure for all language models. It is evident that the robust learning procedure is superior to the discriminative learning procedure on the test set.</Paragraph>
      <Paragraph position="9"> (Table 3(b): test set performance. DL and RL denote "Discriminative Learning" and "Robust Learning," respectively.)
5. Parameter Smoothing for Sparse Data
The above-mentioned robust learning algorithm starts with the initial parameters estimated by using the MLE method. MLE, however, frequently suffers from large estimation errors caused by the lack of sufficient training data in many statistical approaches. For example, MLE gives a zero probability to events that were never observed in the training set. Therefore, MLE fails to provide a reliable result if only a small amount of sampling data is available. To overcome this problem, Good (1953) proposed using Turing's formula as an improved estimate over the well-known MLE. In addition, Katz (1987) proposed a different smoothing technique, called the Back-Off procedure, for smoothing unreliably estimated n-gram parameters with their correlated (n-1)-gram parameters. To investigate the effects of parameter smoothing on robust learning, both these techniques are used to smooth the estimated parameters, and then the robust learning procedure is applied based on those smoothed parameters. These two smoothing techniques are first summarized in the following section. The investigation of the smoothing/robust learning hybrid approach is presented next.</Paragraph>
    </Section>
    <Section position="3" start_page="338" end_page="341" type="sub_section">
      <SectionTitle>
5.1 The Smoothing Procedures
</SectionTitle>
      <Paragraph position="0"> Let N be the total number of observations in the sample, and let nr be the number of events that occur exactly r times. Then the following equation holds: N = sum over r of (r x nr).</Paragraph>
      <Paragraph position="2"> The maximum likelihood estimate P_ML for the probability of an event e occurring r times is defined as P_ML(e) = r/N.</Paragraph>
      <Paragraph position="4"> The total probability estimate, using Turing's formula, for all the events that actually occurred in the sample space is equal to 1 - n1/N,</Paragraph>
      <Paragraph position="6"> where C(e) stands for the frequency count of the event e in the sample. This, in turn, implies that the total probability assigned to the events that never occurred in the sample is n1/N.</Paragraph>
      <Paragraph position="8"> According to Turing's formula, the probability mass n1/N is then equally distributed over the events that never occur in the sample.</Paragraph>
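A minimal sketch of Turing's formula over a table of frequency counts (names are ours; a real implementation would also smooth the n_r sequence, which is sparse for large r):

```python
from collections import Counter

def turing_estimates(counts):
    """Turing's formula: an event observed r times gets adjusted count
    r* = (r + 1) * n_{r+1} / n_r and probability r* / N, while the
    probability mass n_1 / N is reserved for the unseen events."""
    n = Counter(counts.values())                 # n[r]: number of events seen r times
    total = sum(r * nr for r, nr in n.items())   # N = sum_r r * n_r
    probs = {e: (r + 1) * n.get(r + 1, 0) / n[r] / total
             for e, r in counts.items()}
    unseen_mass = n.get(1, 0) / total
    return probs, unseen_mass
```

Note that in this raw form the highest-frequency events receive zero probability (since n_{r+1} = 0 for the largest observed r), which is one reason practical implementations smooth the n_r curve before applying the formula.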
      <Paragraph position="9"> The Back-Off procedure of Katz (1987) smooths the parameters of an m-gram model, i.e., the conditional probability of a word given the (m-1) preceding words. This procedure is summarized as follows:</Paragraph>
      <Paragraph position="11"> is a normalization factor such that</Paragraph>
      <Paragraph position="13"> Compared with Turing's formula, the probability for an m-gram that does not occur in the sample is "backed off" to its corresponding (m-1)-gram probability.</Paragraph>
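A bigram-to-unigram instance of the back-off idea can be sketched as follows. We substitute a simple absolute discount for the Good-Turing discount that Katz's procedure actually uses, and all names are ours:

```python
def make_backoff(bigram_counts, unigram_counts, discount=0.5):
    """Seen bigrams receive a discounted relative frequency; an unseen
    bigram is 'backed off' to the unigram estimate, scaled per history
    by alpha so that each conditional distribution sums to one."""
    total_uni = sum(unigram_counts.values())

    def p_uni(w):
        return unigram_counts.get(w, 0) / total_uni

    def p(w, h):
        h_total = sum(c for (hh, _), c in bigram_counts.items() if hh == h)
        c = bigram_counts.get((h, w), 0)
        if c > 0:
            return (c - discount) / h_total
        # probability mass freed by discounting the seen bigrams of h
        freed = discount * sum(1 for (hh, _) in bigram_counts if hh == h) / h_total
        unseen = sum(p_uni(v) for v in unigram_counts
                     if (h, v) not in bigram_counts)
        return freed / unseen * p_uni(w)      # alpha(h) * P(w)

    return p
```

The per-history scaling factor plays the role of the normalization factor above: it redistributes exactly the discounted mass over the words unseen after that history, so each conditional distribution still sums to one.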
      <Paragraph position="14"> Table 4 gives the experimental results for using the maximum likelihood (ML), Turing (TU) and back-off (BF) estimation procedures. The results show that smoothing the unreliable parameters degrades the training set performance; however, it improves the performance for the test set. Among the estimators, the maximum likelihood estimator provides the best results for the training set, but it is the worst on the test set. Both Turing's and the back-off procedures perform better than the maximum likelihood procedure. This means that smoothing unreliable parameters is absolutely essential if only limited training data are available.</Paragraph>
      <Paragraph position="15"> Compared with Turing's procedure, the Back-Off procedure is 1 to 2% worse in all cases. After examining the parameters estimated by these two smoothing procedures, we found that some syntactic parameters for null events were assigned very large values by the Back-Off procedure, while they were assigned small probabilities by Turing's formula. A typical example follows. The reduce action "n quan → NLM*" given the left contexts [P*, N2] never occurred in the training set. But the probability P(n quan → NLM* | [n quan] reduced; L2=P*, L1=N2) is finally replaced by the probability P(n quan → NLM* | [n quan] reduced) in the Back-Off estimation procedure. Since the probability P(n quan → NLM* | [n quan] reduced) has a large value (= 0.25), the probability P(n quan → NLM* | [n quan] reduced; L2=P*, L1=N2) is accordingly large as well. From the estimation point of view, the parameters for null events may be assigned better estimated values by using the Back-Off method; however, these parameters do not necessarily guarantee that the discrimination power will be improved. Take the sentence "A stack of pinfeed paper three inches high may be placed underneath it" as an example. The decomposed phrase levels and the corresponding syntactic scores for the correct and the top candidate are shown in Table 5 (a) and (b), respectively. We find that the main factor affecting the tree selection is the sixth phrase level, which corresponds to the reduce action "n quan → NLM*" with the left two contextual symbols P* and N2 for the top candidate. As described above, the probability P(n quan → NLM* | [n quan] reduced; L2=P*, L1=N2) is assigned a large value in the Back-Off estimation procedure.
However, to correctly select the right syntactic structure in this example, P(quan → QUAN | [quan] reduced; L2=P*, L1=N2) should be greater than P(n quan → NLM* | [n quan] reduced; L2=P*, L1=N2). This requirement may not be met by any estimation procedure, since the above two probabilities are estimated from two different outcome spaces (one conditioned on [quan], and the other conditioned on [n quan]). Therefore, even though the Back-Off procedure may give better estimates for the parameters, it cannot guarantee that the recognition result will be improved. The comparison between Turing's procedure and the Back-Off procedure thus varies in different cases. In fact, the Back-Off estimation did show better results in our previous research (Lin, Chiang, and Su 1994). Nevertheless, we will show in the next section that the selection of a smoothing method is not crucial after the robust learning procedure has been applied.</Paragraph>
      <Paragraph position="16"> Furthermore, comparing the results in Table 3 and Table 4, we find that the performance with the robust learning procedure is much better than that with the smoothing techniques. Although both the adaptive learning procedures and the smoothing techniques show improvement, the robust learning procedure, which emphasizes discrimination capability rather than merely improving the estimation process, achieves a better result. Since the philosophies of performance improvement for these two algorithms are different (one from the estimation point of view and the other from the discrimination point of view), it is interesting to combine these two algorithms and investigate the effect of the robust learning procedure on the smoothed parameters.</Paragraph>
      <Paragraph position="17"> A detailed discussion of this hybrid approach is presented in the following section.</Paragraph>
      <Paragraph position="18"> Table 5 shows the decomposed phrase levels associated with the sentence "A stack of pinfeed paper three inches high may be placed underneath it," and the corresponding scores with the Back-Off estimation method for (a) the correct candidate and (b) the top candidate. The shaded rows indicate the patterns that differ between the two parse trees.</Paragraph>
      <Paragraph position="19"> (Table 5 columns: word | current symbols → reduced symbol | score)</Paragraph>
    </Section>
    <Section position="4" start_page="341" end_page="343" type="sub_section">
      <SectionTitle>
5.2 Robust Learning on the Smoothed Parameters
</SectionTitle>
      <Paragraph position="0"> The hybrid approach first uses a smoothing technique to estimate the initial parameters. Afterwards, the robust learning procedure is applied based on the smoothed parameters. The advantages of this approach are two-fold. First, the power of the scoring function is enhanced since the smoothing techniques can reduce the estimation errors, especially for unseen events. Second, the parameters estimated from the smoothing techniques give the robust learning procedure a better initial point and are more likely to reach a better solution when many local optima exist in the parameter space. In other words, the smoothing techniques indirectly prevent the learning process from being trapped in a poor local optimum, although reducing the estimation  errors by using these methods does not directly improve the discrimination capability. Experimental results using this hybrid approach are shown in Table 6, where the results using the (ML+RL) mode are also listed for reference.</Paragraph>
      <Paragraph position="1"> Significant improvement, compared with the (ML+RL) mode, has been observed by using the smoothed parameters at the initial step before the robust learning procedure is applied. With this hybrid approach, better results are obtained using a more complex language model, such as Lex(L2)+Syn(L2). However, there is no significant performance difference between the (TU+RL) and the (BF+RL) approaches for any language model, even though Turing's smoothing formula was shown to behave better than the Back-Off procedure before applying the robust learning procedure. This is not surprising, because starting the robust learning procedure from different initial points will still lead to the same local optimum if the region in which the initial points are located contains only one local optimum. By using the Turing's formula/Robust Learning hybrid approach for the Lex(L2)+Syn(L2) model, the accuracy rate for parse tree selection is improved to 69.2%, which corresponds to a 34.3% error reduction compared with the baseline of 53.1% accuracy. The superiority in terms of both discrimination and robustness of the hybrid approach is thus clearly demonstrated.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="343" end_page="345" type="metho">
    <SectionTitle>
6. Parameter Tying
</SectionTitle>
    <Paragraph position="0"> The investigation described in Section 5 has shown that smoothing is essential before the robust learning procedure is applied. Nevertheless, although smoothing yields better initial estimates for the parameters of rare events, those parameters still cannot be trained well in the robust learning procedure, because they are seldom or never touched by the training process. Unfortunately, this problem occurs frequently in statistical language modeling: to reduce modeling errors, a model that accounts for more contextual information is desired, but such a model has a larger number of null-event parameters, which are never touched in the learning procedure.</Paragraph>
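The growth in null-event parameters can be made concrete with a quick count. The tag-set size of 50 below is illustrative only, not taken from the paper:

```python
def ngram_param_count(vocab_size, m):
    """Number of conditional parameters P(x_m | x_1 ... x_{m-1}) in an
    m-gram model over a vocabulary (or tag set) of the given size."""
    return vocab_size ** m

# Each extra token of context multiplies the table by the tag-set size,
# so most entries correspond to events never seen in a realistic corpus.
counts = {m: ngram_param_count(50, m) for m in (1, 2, 3)}
```

With 50 tags, moving from a bigram to a trigram model grows the parameter table from 2,500 to 125,000 entries, which is why so many parameters correspond to null events.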
    <Paragraph position="1"> To overcome this problem, a novel approach is proposed in this paper: the null-event parameters are trained by tying them to their highly correlated parameters and then adjusting them through the robust learning procedure. The reasons for using this approach are two-fold. First, the tying scheme reduces the number of parameters. Second, it gives the parameters of rare events a greater chance of being touched in the learning procedure, so that they can be trained more reliably. The details are addressed below.</Paragraph>
    <Section position="1" start_page="343" end_page="344" type="sub_section">
      <SectionTitle>
6.1 Tying Procedure
</SectionTitle>
      <Paragraph position="0"> The tying procedure includes the following two steps: 1. Initial Estimation: For an m-gram model, the conditional probability</Paragraph>
      <Paragraph position="2"> P(x_m | x_1^{m-1}) = C(x_1, ..., x_m) / sum_{y in V} C(x_1, ..., x_{m-1}, y), where V denotes the vocabulary and C(.) stands for the frequency count of an event in the training set. If sum_{y in V} C(x_1, ..., x_{m-1}, y) &gt; Q_a, where Q_a is a preset threshold, the estimated value of P(x_m | x_1^{m-1}) is assumed to be reliable and no action is required. On the other hand, if sum_{y in V} C(x_1, ..., x_{m-1}, y) &lt; Q_a, the estimated value of P(x_m | x_1^{m-1}) is regarded as unreliable. In this case, P(x_m | x_1^{m-1}) is substituted by the smoothed value of the (m-1)-gram probability P(x_m | x_2^{m-1}). Currently, Q_a is set to ten times the size of the possible outcomes of x_m, i.e., Q_a = 10 x (the number of possible tags) for the part-of-speech transition parameters.</Paragraph>
      <Paragraph position="3"> 2. Tying Procedure: Consider the m-gram events {x_1, ..., x_{m-1}, y_i}, for all y_i in V, which have the same (m-1)-gram history {x_1, ..., x_{m-1}}. Each of the probabilities P(y_i | x_1, ..., x_{m-1}), y_i in V, is first assigned a smoothed value in the above step. To give these parameters more chance to be trained during the robust learning process, we tie together the parameters whose corresponding events appear fewer than Q_n times in the training set. That is, the parameters P(y_k | x_1, x_2, ..., x_{m-1}), y_k in V, are tied if the associated events satisfy the following conditions: sum_{y_i in V} C(x_1, ..., x_{m-1}, y_i) &lt; Q_a, and C(x_1, ..., x_{m-1}, y_k) &lt; Q_n, y_k in V, (47) where Q_n is currently set to 2.</Paragraph>
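The two steps can be sketched in code. This is a minimal illustration under our own data layout (counts keyed by `(history_tuple, outcome)`); the thresholds Q_a and Q_n follow the text, but the function names and the back-off callback are assumptions, not the paper's implementation.

```python
def initial_estimate(counts, history, outcome, smoothed_backoff, q_a):
    """Step 1 (initial estimation): keep the MLE m-gram estimate when the
    history occurs at least q_a times in training; otherwise substitute
    the smoothed (m-1)-gram value supplied by `smoothed_backoff`."""
    hist_total = sum(c for (h, _), c in counts.items() if h == history)
    if hist_total >= q_a:                      # reliable: MLE is kept
        return counts.get((history, outcome), 0) / hist_total
    return smoothed_backoff(history[1:], outcome)  # back off to shorter history

def tie_parameters(counts, histories, vocab, q_a, q_n=2):
    """Step 2 (tying): for each history whose total count is below q_a,
    tie together all outcomes seen fewer than q_n times, so the shared
    parameter is touched often enough to be trained."""
    tied = {}
    for h in histories:
        if sum(counts.get((h, y), 0) for y in vocab) < q_a:
            group = {y for y in vocab if counts.get((h, y), 0) < q_n}
            if group:
                tied[h] = group
    return tied
```

After tying, the robust learning procedure updates one shared value per tied group instead of one value per null or near-null event, which is what shrinks the parameter space in Table 7.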
      <Paragraph position="4"> The numbers of parameters before and after tying for each language model are tabulated in Table 7. This table shows that the number of parameters is greatly reduced after the tying process, especially for the L2 syntactic models.</Paragraph>
    </Section>
    <Section position="2" start_page="344" end_page="345" type="sub_section">
      <SectionTitle>
6.2 Robust Learning on the Tied Parameters
</SectionTitle>
      <Paragraph position="0"> After the parameters are estimated and tied through the tying procedure, the robust learning algorithm is applied to the tied parameters. The experimental results are shown in Table 8. The results with the TU+RL hybrid approach are also listed for reference. As shown in Table 8, the performance of the Tying/Robust Learning hybrid approach deteriorates somewhat on the training set, because the tying procedure decreases the modeling resolution. However, its test set performance is slightly (though not significantly) better than that of the Turing's formula/Robust Learning approach. In addition, it greatly reduces the number of parameters, and thus eases the memory constraints for implementing the system.</Paragraph>
      <Paragraph position="1"> A summary of the performance improvement obtained with the proposed enhancement mechanisms for the Lex(L2)+Syn(L2) model is shown in Table 9. The proposed tying approach, combined with the robust learning procedure, significantly reduces the error rate compared with the baseline: a 36.67% error reduction is achieved, with accuracy improving from 53.1% to 70.3%. Moreover, the number of parameters is reduced to less than 1/2000 of the original parameter space.</Paragraph>
    </Section>
  </Section>
</Paper>