File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/95/j95-3002_abstr.xml

Size: 7,431 bytes

Last Modified: 2025-10-06 13:48:22

<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-3002">
  <Title>Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution</Title>
  <Section position="2" start_page="0" end_page="322" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Resolution of syntactic ambiguity has been a focus in the field of natural language processing for a long time. Both rule-based and statistics-based approaches have been proposed to attack this problem in the past. For rule-based approaches, knowledge is induced by linguistic experts and is encoded in terms of rules. Since a huge amount of fine-grained knowledge is usually required to solve ambiguity problems, it is quite difficult for a rule-based approach to acquire such knowledge. In addition, maintaining consistency among the induced rules is by no means easy. Therefore, a rule-based approach, in general, fails to attain satisfactory performance for large-scale applications.</Paragraph>
    <Paragraph position="1"> In contrast, a statistical approach provides an objective measuring function to evaluate all possible alternative structures in terms of a set of parameters. Generally, the statistics of parameters are estimated from a training corpus by using well-developed statistical theorems. The linguistic uncertainty problems can thus be resolved on a solid mathematical basis. Moreover, the knowledge acquired by a statistical method is always consistent because all the data in the corpus are jointly considered during the acquisition process. Hence, compared with a rule-based method, the time required for knowledge acquisition and the cost needed to maintain consistency among the acquired knowledge sources are significantly reduced by adopting a statistical approach. [* National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan 300, R.O.C. † Email: kysu@bdc.com.tw. © 1995 Association for Computational Linguistics. Computational Linguistics, Volume 21, Number 3.]</Paragraph>
    <Paragraph position="2"> Among the statistical approaches, Su and Chang (1988) and Su et al. (1991) proposed a unified scoring function for resolving syntactic ambiguity. With that scoring function, various knowledge sources can be unified in a uniform formulation. Previous work has demonstrated that this scoring function is able to provide high discrimination power for a variety of applications (Su, Chiang, and Lin 1992; Chen et al. 1991; Su and Chang 1990). In this paper, we start with a baseline system based on this scoring function, and then proceed with different proposed enhancement methods. A test set of 1,000 sentences, extracted from technical manuals, is used for evaluation. The baseline system obtains an accuracy rate of 53.1% for parse tree selection when the parameters are estimated by the maximum likelihood estimation (MLE) method.</Paragraph>
    <Paragraph position="3"> Note that it is the ranking of competitors, instead of the likelihood value, that directly affects the performance of a disambiguation task. Maximizing the likelihood values on the training corpus, therefore, does not necessarily lead to the minimum error rate. In addition, the statistical variations between the training corpus and real tasks are usually not taken into consideration in the estimation procedure. Thus, minimizing the error rate on the training corpus does not imply minimizing the error rate in the task we are really concerned with.</Paragraph>
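The gap between likelihood and ranking can be illustrated with a toy calculation. The following Python sketch is not taken from the paper: the two score tables, `model_a` and `model_b`, are invented numbers, and index 0 of each list is assumed to hold the score of the correct parse. It shows one parameter setting that has the higher training likelihood but the lower selection accuracy.

```python
import math


def log_likelihood(corpus):
    """Sum of the log scores given to the correct parse (index 0) of each sentence."""
    return sum(math.log(scores[0]) for scores in corpus)


def accuracy(corpus):
    """Fraction of sentences whose correct parse (index 0) strictly outranks all competitors."""
    hits = sum(1 for scores in corpus if scores[0] > max(scores[1:]))
    return hits / len(corpus)


# Hypothetical candidate scores for two sentences under two parameter settings.
model_a = [[0.90, 0.10], [0.45, 0.55]]  # very confident on sentence 1, wrong on sentence 2
model_b = [[0.55, 0.45], [0.55, 0.45]]  # barely right on both sentences

for name, model in (("A", model_a), ("B", model_b)):
    print(name, round(log_likelihood(model), 3), accuracy(model))
```

Setting A would be preferred by a likelihood criterion, while setting B selects more correct parse trees, which is the quantity the disambiguation task is judged on.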
    <Paragraph position="4"> To deal with the problems described above, a variety of discrimination-based learning algorithms have been adopted extensively in the field of speech recognition (Bahl et al. 1988; Katagiri et al. 1991; Su and Lee 1991, 1994). Among those approaches, the robustness issue was discussed in detail by Su and Lee (1991, 1994) in particular, and encouraging results were observed. In this paper, a discrimination-oriented adaptive learning algorithm is first derived based on the scoring function mentioned above and probabilistic gradient descent theory (Amari 1967; Katagiri, Lee, and Juang 1991).</Paragraph>
    <Paragraph position="5"> The parameters of the scoring function are then learned from the training corpus using the discriminative learning algorithm. The accuracy rate for parse tree selection is improved to 56.4% when the discriminative learning algorithm is applied.</Paragraph>
    <Paragraph position="6"> In addition to the discriminative learning algorithm described above, a robust learning procedure is further applied in order to account for the possible statistical variations between the training corpus and the real task. The robust learning process continues adjusting the parameters even when the input training token has been correctly recognized, until the score difference between the correct candidate and the top competitor exceeds a preset threshold. This provides a tolerance zone with a large margin, which better preserves the correct ranking order for data in real tasks. An accuracy rate of 64.3% for parse tree selection is attained after this robust learning algorithm is used.</Paragraph>
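The stopping criterion described above can be sketched as follows. This is a simplified, perceptron-style stand-in and not the paper's probabilistic descent update: it assumes that a candidate's score is the sum of per-event parameters, and the names `score`, `robust_update`, `train`, `margin`, and `step` are all hypothetical.

```python
def score(params, events):
    """Score of a candidate parse: sum of the parameters of the events it contains."""
    return sum(params.get(e, 0.0) for e in events)


def robust_update(params, correct, competitors, margin=1.0, step=0.1):
    """Adjust params unless the correct candidate leads the top competitor by more than margin.

    Returns True when an adjustment was made.
    """
    top = max(competitors, key=lambda c: score(params, c))
    if score(params, correct) - score(params, top) > margin:
        return False  # outside the tolerance zone: leave the parameters alone
    for e in correct:  # raise the events of the correct candidate
        params[e] = params.get(e, 0.0) + step
    for e in top:  # lower the events of the top competitor
        params[e] = params.get(e, 0.0) - step
    return True


def train(params, corpus, margin=1.0, step=0.1, max_epochs=100):
    """Sweep the corpus until no sentence triggers an update; return the epochs used."""
    for epoch in range(max_epochs):
        changed = [robust_update(params, c, comps, margin, step) for c, comps in corpus]
        if not any(changed):
            return epoch
    return max_epochs
```

With `margin=0.0` the update fires only while the correct candidate fails to outrank its top competitor, which corresponds to plain error-driven learning. A positive `margin` keeps adjusting correctly ranked tokens until the lead is large enough, which is the tolerance zone the paragraph describes.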
    <Paragraph position="7"> The above-mentioned robust learning procedure starts with the parameters obtained by the maximum likelihood estimation method. However, the MLE is notoriously unreliable when there is insufficient training data. The MLE for the probability of a null event (one that never occurs in the training corpus) is zero, which is generally inappropriate for most applications. To avoid the sparse training data problem, the parameters are first estimated by various parameter smoothing methods (Good 1953; Katz 1987). The accuracy rate for parse tree selection is improved to 69.8% by applying the robust learning procedure to the smoothed parameters. This result demonstrates that a better initial estimate of the parameters gives the robust learning procedure a chance to obtain better results when many local maxima exist. [Running head: Tung-Hui Chiang et al., Robust Learning, Smoothing, and Parameter Tying]</Paragraph>
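The zero-probability problem is easy to reproduce on a toy count table. The sketch below uses additive (add-alpha) smoothing purely as a simple stand-in; it is not the Good-Turing or Katz back-off estimator cited in the text, and the event names and counts are invented for illustration.

```python
def mle(counts, vocab):
    """Maximum likelihood estimate: relative frequency, zero for unseen events."""
    total = sum(counts.values())
    return {e: counts.get(e, 0) / total for e in vocab}


def additive(counts, vocab, alpha=1.0):
    """Add-alpha smoothing: every event in vocab receives alpha pseudo-counts."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {e: (counts.get(e, 0) + alpha) / total for e in vocab}


# Hypothetical training counts; "PP" never occurs in the training data.
counts = {"NP": 3, "VP": 1}
vocab = ["NP", "VP", "PP"]

print(mle(counts, vocab)["PP"])       # unseen event under MLE
print(additive(counts, vocab)["PP"])  # unseen event after smoothing
```

Under MLE any candidate containing the unseen event would receive a zero product score regardless of its other evidence, whereas the smoothed estimate leaves a small positive probability that later learning can adjust.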
    <Paragraph position="8"> Finally, a parameter tying scheme is proposed to reduce the number of parameters.</Paragraph>
    <Paragraph position="9"> In this approach, some less reliably estimated but highly correlated parameters are tied together, and then trained through the robust learning procedure. The probabilities of the events that never appear in the training corpus can thus be trained more reliably.</Paragraph>
    <Paragraph position="10"> This hybrid (tying + robust learning) approach reduces the number of parameters by a factor of about 2,000 (from 8.7 × 10^8 to 4.2 × 10^5) and achieves a 70.3% accuracy rate for parse tree selection.</Paragraph>
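As a rough illustration of tying, the sketch below maps several events onto one shared parameter slot and then checks the reported reduction factor, reading the figures in the text as 8.7 × 10^8 and 4.2 × 10^5. The tying rule used here (dropping the first context symbol) is a hypothetical choice made only for the example; this section does not specify how the paper selects which parameters to tie.

```python
def tie_parameters(events, tie_key):
    """Assign each event a slot index; events with the same tie_key share one slot."""
    slots = {}
    event_to_slot = {}
    for e in events:
        event_to_slot[e] = slots.setdefault(tie_key(e), len(slots))
    return event_to_slot, len(slots)


# Hypothetical events: (left context, current symbol, right context).
events = [
    ("det", "noun", "verb"),
    ("adj", "noun", "verb"),
    ("det", "noun", "prep"),
]

# Assumed tying rule: ignore the left context, so the first two events are tied.
event_to_slot, n_tied = tie_parameters(events, lambda e: e[1:])
print(len(events), "events share", n_tied, "parameters")

# Reduction factor implied by the parameter counts quoted in the text.
reduction = 8.7e8 / 4.2e5
print(round(reduction))
```

Because tied events share one slot, an event that never appears in the training corpus still inherits a parameter that is trained on the occurrences of the events tied to it.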
    <Paragraph position="11"> This paper is organized as follows. A unified scoring function used for integrating knowledge from lexical and syntactic levels is introduced in Section 2. The results of using the unified scoring function are summarized in Section 3. In Section 4, the discrimination- and robustness-oriented learning algorithm is derived. The effects of the parameter smoothing techniques on the robust learning procedure are investigated in Section 5. Next, the parameter tying scheme used to enhance parameter training and reduce the number of parameters is described in Section 6. Finally, we discuss our conclusions and describe the direction of future work.</Paragraph>
  </Section>
</Paper>