<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1023">
  <Title>AUTOMATIC MODEL REFINEMENT with an application to tagging</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
AUTOMATIC MODEL REFINEMENT
</SectionTitle>
    <Paragraph position="0"> with an application to tagging</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="148" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> Statistical NLP models usually only consider coarse information and very restricted context to make the estimation of parameters feasible. To reduce the modeling error introduced by a simplified probabilistic model, the Classitication and Regression Tree (CART) method was adopted in this paper to select more discriminative features for automatic model refinement. Because the features are adopted dependently during splitting the classification tree in CART, the number of training data in each terminal node is small, which makes the labeling process of terminal nodes not robust. This over-tuning phenomenon cannot be completely removed by cross-validation process (i.e., pruning process). A probabilistic classification model based on the selected discriminative features is thtls proposed to use the training data more efficiently. In tagging the Brown Corpus, our probabilistic classification model reduces the error rate of the top 10 error dominant words from 5.71% to 4.35%, which shows 23.82% improvement over the unrefined model.</Paragraph>
    <Paragraph position="1"> l. INTRODUCTION To automatically acquire knowledge from corpora, statistical methods are widely used recently (Church, 1989; Chiang, Lin &amp; Su, 1992; Su, Chang &amp; Lin, 1992). The perfonnance of a probabilistic model is affected by the estimation error due to insufficient training data and the modeling error due to lacking complete knowledge of the problem to be conquered. In the literature, several smoothing methods (Good, 1953; Katz, 1987) have been used to effectively reduce the estimation error. On the contrary, the problem of reducing modeling error is less studied.</Paragraph>
    <Paragraph position="2"> Probabilistic models are usually simplified to make the estimation of parameters feasible.</Paragraph>
    <Paragraph position="3"> However, some important information may be lost while simplifying a model. For example, using the contextual words, instead of contextual parts of speech, enhances the prediction power for tagging parts of speech. But, unfortunately, reducing the m(xteling error by increasing the degree of model granularity is usually accompanied by a large estimation error if there is not enough training data.</Paragraph>
    <Paragraph position="4"> tIowever, if only the discriminative features arc involved (i.e., only those important parameters are used), modeling error could be signiIicantly reduced without using a large co&gt; pus. Those discriminative features usually vary for different words, and it would be very time-consuming to induce such features from the corpus manually. An algorithm for automatically extracting the discriminative features from a corpus is rims highly demanded. In this paper, the Classification and Regression Tree (CARl') method (Breiman, Friedman, Olshen &amp; Stone, 1984) is first used to extract the discriminative features, l lowever, CAP, T basically regards all selected features as jointly dependent. Nodes in different branches are trained with different sets of data, and the available training data of a node becomes less and less while CART asks more and more questions. &amp;quot;FhereR)re, CART can easily split and prune the classification tree to fit the training data and the cross-validation data respectively. The refinement model built by CART tends to be over-tt, ned and its performance is consequently not robust. A probabilistic classification m(,lel is, therefore, proposed to construct a more robust classification model.</Paragraph>
    <Paragraph position="5"> The experimental results show that this proposed model reduces the error rate of the top 10 error dominant words fi'om 5.71% to 4.35% (23.82%  error reduction rate) while CART only reduces the error rate to 4.67% (18.21% error reduction rate).</Paragraph>
  </Section>
  <Section position="3" start_page="148" end_page="148" type="metho">
    <SectionTitle>
2. PROBABILISTIC TAGGER
</SectionTitle>
    <Paragraph position="0"> Since part of speech tagging plays an important role in the field of natural language processing (Ctmrch, 1989), it is used to evaluate the performance of various approaches in tiffs paper.</Paragraph>
    <Paragraph position="1"> Tagging problem can be formulated (Church,</Paragraph>
    <Paragraph position="3"> where ~ is the category sequence selected by the tagging model, wi is the i-th word, ci is the possible corresponding category for the i-th word and c't ~ is tim Stiort-hand notation of tile category sequence Cl~ c2~ &amp;quot; * *, on.</Paragraph>
    <Paragraph position="4"> The Brown Corpus is used as the test bed for tagging in this paper. After prepr(xccssing the Brown Corpus, a corpus of 1,050,(X)4 words in 50,(X)0 sentences is constructed. It contains 54,031 different words and 83 different tags (ignoring the four designator tags &amp;quot;FW,&amp;quot; &amp;quot;Ill,&amp;quot; &amp;quot;NC&amp;quot; and &amp;quot;TIJ' (Francis &amp; KuSera, 1982)). To train and test the model, the whole corpus is divided into the training set and the testing set.</Paragraph>
    <Paragraph position="5"> The v-foM cross-validation method (Breiman et al., 1984), where v is set to 10 in this paper, is adopted to reduce the error ira performance evaluation. The average number of words in the training sets and the testing sets are 945,004 (in 45,000 sentences) and 105,000 (in 5,000 sentences) respectively.</Paragraph>
    <Paragraph position="6"> After applying back-off smoothing (Katz 1987) and robust learning (Lin et al., 1992) on Equation (1) to reduce the estimation error, a tagger with 1.87% error rate in the testing set is then obtained. Although the error rate of overall testing set is small, many words are still with high error rates. For instance, the error rate of the word &amp;quot;that&amp;quot; is 9.08% and the error ,ate of the word &amp;quot;out&amp;quot; is 21.09%. To effectively improve accuracy over these words, it is suggested in this paper that the tagging model should be refined.</Paragraph>
  </Section>
  <Section position="4" start_page="148" end_page="151" type="metho">
    <SectionTitle>
3. MODEL REFINEMENT
</SectionTitle>
    <Paragraph position="0"> For not having enough training data, ttsttally only coarse infornmtion and rather limited context are used in probabilistic models. Some discriminative fe.'m~res, therefore, may be sacririced to make the estimation of parameters leasine. For example, compared to the tag-level contextual information used in a bigram or a trigt'am m(xlel, the word-level contextual inR)rmation provides more prediction power for tagging parts of speech, t lowever, even the simplest word-level contextual information (i.e., word bigram) requires a large number of parameters (about 3 billion in our task). Esthnating such a large ,mmber of parameters requires a vet',\] huge corpus and is far beyond the size of the P, rown Corpus. Thus, the word-level contextual information is usually abandoned.</Paragraph>
    <Paragraph position="1"> To reduce the modeling error introduced by a simplified probabilistic model, one appealing approach is to extract only the discriminative features for those error dominant words. In this way, one can reduce the error rate without enhu'ging the corpus size. I)ifferent error dominant words, however, might be associated with different sets of discriminative features. To induce those discriminative features for each word from a corpus by hand is very tinae-consuming. Automatically acquiring those features directly fl'om a corpus is thus highly desirable. In this section, the Classification and Regression Tree (CAP, T) method (P, reiman et al., 1984) is adopted to aulomatically extract the discriminative fcatures ,'rod resolve the lexical ambigt, ity.</Paragraph>
    <Paragraph position="2"> CART, however, requires a la,'ge amount of training data and validation data, because it regards all those selected features as jointly dependent. The characteristic of being jointly dependent comes from the splitting process, which splits those children nodes only based on the data of their parent nodes. As a result, CART is easily tuned to fit tim training data and validation data. Its performance is thus not robust.</Paragraph>
    <Paragraph position="3"> A probabilistic classification apl)roach is therefore proposed to build robust retinement models with limited training data.</Paragraph>
    <Section position="1" start_page="149" end_page="151" type="sub_section">
      <SectionTitle>
3.1. The error dominant words
</SectionTitle>
      <Paragraph position="0"> To select those words which are worth for model refinement, the top 10 error dominant words ate ordered according to their contribution to overall errors, as listed in Table 1. The second column shows their relative fi'equencies in the Brown Corpus. The third column shows the error rates of those words tagged by the probabilistic tagger described in section 2. The last column shows the contribution of the errors of each word to the overall errors. The last row indicates that the top 10 error dominant words constitute 5.53% of the testing corpus and contribute 16.84% of the errors in the testing corpus. Their averaged error rate is 5.71% (i.e., the ratio of the total errors of these words to their total occurrence times in the testing corpus).</Paragraph>
      <Paragraph position="1"> 3.2. Feature selection &amp;quot;lk~ reduce modeling error, more discriminative infommtion should be incorporated in tagging. In addition to the trigram context information of lexical category, the features in Table 2 are considered to be potentially discriminative for choosing the correct lexical category of a given word.</Paragraph>
      <Paragraph position="2"> Since the size of the parameters will be huge if all the features in Table 2 are jointly considered, it is not suitable to incorporate all of them. Actually only some of the listed features are really discriminative for a particuhtr word. For instance, when we want to tag the word &amp;quot;out,&amp;quot; we do not care whether lhe word behind it (i.e., the right-1 woM) is &amp;quot;book,&amp;quot; &amp;quot;money&amp;quot; or &amp;quot;win- null distance from the left period (gp,.,.iod) distance to the right period (\]~'t,,'riod) distance from the nearest left noun (PSnotm) distance tothe nearest right noun (/~,ot,,,) distance from the nearest left verb (Lwrb) distance to the nearest right verb (J?,.,,rb)  dow;&amp;quot; we only care whether the right-1 word is &amp;quot;of.&amp;quot; Thus, in this section, the CART (Breiman et al., 1984) method is used to extract the really discriminative features fi'om the fcatu,e set. The error rate criterion is adopted to measure the impurity of a node in the classification tree. For every error dominant word, its 4/5 training tokens are used to sp!it the classification tree; the remaining 1/5 training tokens (not the testing tokens) are used to prone that tree. Then, all the questions asked by the pruned tree are considered to be the discriminative features.</Paragraph>
      <Paragraph position="3"> 3.3. CART classilicalion model In our task, it two-stage approach is adopted to tag parts of speech. The first stage is the probabilistic tagger described in section 2, which provides the most likely category sequence of the input sentence. The second stage consists of the refined word models of the error dominant words. In this stage, the p,'uned classification tree is used to re-tag the part of speech. The results in the testing set are shown in ~\[hble 3. In the table, the second column gives the error rates of the error dominant words in the tirst stage. The third cohnnn gives the error rates after using CART to re-tag those words, and the last column gives the corresponding ,eduction rates. In parenthesis it gives the performance in the validation set. The last row in &amp;quot;lable 3 shows that the i'efined models built by CART can reduce the 18.21% o\[' error rate for the 10 ClT/.)I&amp;quot; dominant words. Only the performance of the word &amp;quot;little&amp;quot; deteriorated. This is due to the robusmess problem between the cross-validation data and the testing data, which is induced hy the rare occurrence of the discriminative features.</Paragraph>
      <Paragraph position="4"> 3.4. Prolmbillstic classilication model \]~ecause discriminative features are adopted dependently, CART can easily classify the training data an(l usually introduce the problem of over--tuning. Besides, due to the wu'iation between the validation data and the testing data, the pruning process cannot effectively diminish the problem of over-tuning intr{xiuced while growing the classification tree. Thus, a probabilistic classification model, which uses all the features selected by CART in an independent way, is proposed in this section to robustly re-tag the lexical categories of the error dominant words.</Paragraph>
      <Paragraph position="5">  To use the probabilistic chtssification model, feature vectors are tirst constructed according to the questions asked by the pruned classitication tree. Assume that the 11 questions in Table 4 are asked by the classification tree for the word &amp;quot;than.&amp;quot; Every occurrence of &amp;quot;than&amp;quot; in the corpus is then accompanied by an 8-dimensional feature vector, F = \[fi,..., fs\]. The elements of the feature vector are obtained by the following rule.</Paragraph>
      <Paragraph position="6"> f j, if Qi,j is true; k \ 0, otherwise. (2) Notice that Ql,1 and Q1,2 are merged into the same random variable because both of them ask about what the left-2 category is.</Paragraph>
      <Paragraph position="7"> After constructing the feature vectors, the problem becomes to find a most probable category according to the given feature vector and it can be formulated as = a rgmax_P(cI r,,..., D,), (3)  where c is a possible tag for the word to be retagged. Assume that</Paragraph>
      <Paragraph position="9"> The probabilistic chtssilication model (PCM) is then defined as ~ ;~.,&lt;m~,~ i~ P(f~I~)' _r'(~). (5) ,,5: i:-1 The estimation and learning processes of the PCM approach are generally more robust. As stated before, CAP, T regards all selected features Its jointly dependent. The available training data for a node become less its more questions are asked. On the contrary, due to the conditional independent assumption for P(fl,'&amp;quot;,./;,\[c) in Equation (4), every p,'lrameter of PCM can be trained by the whole training data, and therefore, the estimation and learning pr(xeesses are more robust.</Paragraph>
      <Paragraph position="10"> Furthermore, every feature of PCM should be weighted to retlect its discriminant power because PCM regards all features of different branches in a tree its conditionally independent. Directly using these features without weighting cannot lead to good resuhs. The weighting effect can be implicitly achieved by adaptive learning.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="151" end_page="151" type="metho">
    <SectionTitle>
4. RESULTS AND DISCUSSION
</SectionTitle>
    <Paragraph position="0"> After learning the model parameters (Amari, 1967; Lin et al., I992), the results of using the probabilistic classification model (PCM) are listed in qable 5. As shown in the last rows of qables 3 and 5, the error rate of PCM is smaller than that of CART in the testing set while their error rates in the wflidation set are almost the same. The last row of &amp;quot;lhble 5 shows that the error rate of the 10 error dominant words is reduced from 5.71% to 4.35% (23.82% reduction rate) by refining the woM models with the PCM approach.</Paragraph>
    <Paragraph position="1"> In sumnaary, due to dividing the features into independent groups, PCM can use the whole  training data to train every feature and hence construct a more robust retinement model. It is believed that this proposed probabilistic classitication model (i.e., Equation (5)) can also be applied to other problems attacked by CART, such as voiced/w)iceless stop classilication and end-of-sentence detection, etc. (Riley 1989).</Paragraph>
  </Section>
class="xml-element"></Paper>