AUTOMATIC MODEI~ REFINEMENT 
with an application to tagging 
Yi-Chung Lin, Tung-lIui Chiang and Keh-Yih Su 
Department of Electrical En,@Teering, National 7X'ing ltua University, 
Hsinchu, 7?liwan 300, Republic of China 
ABSTRACT 
Statistical NLP models usually only consider 
coarse information and very restricted context to 
make the estimation of parameters feasible. To 
reduce the modeling error introduced by a sim- 
plified probabilistic model, the Classitication and 
Regression Tree (CART) method was adopted in 
this paper to select more discriminative features 
for automatic model refinement. Because the 
features are adopted dependently during split- 
ting the classification tree in CART, the number 
of training data in each terminal node is small, 
which makes the labeling process of terminal 
nodes not robust. This over-tuning phenome- 
non cannot be completely removed by cross- 
validation process (i.e., pruning process). A 
probabilistic classification model based on the 
selected discriminative features is thtls proposed 
to use the training data more efficiently. In tag- 
ging the Brown Corpus, our probabilistic classi- 
fication model reduces the error rate of the top 
10 error dominant words from 5.71% to 4.35%, 
which shows 23.82% improvement over the un- 
refined model. 
l. INTRODUCTION 
To automatically acquire knowledge from 
corpora, statistical methods are widely used re- 
cently (Church, 1989; Chiang, Lin & Su, 1992; 
Su, Chang & Lin, 1992). The perfonnance 
of a probabilistic model is affected by the es- 
timation error due to insufficient training data 
and the modeling error due to lacking complete 
knowledge of the problem to be conquered. In 
the literature, several smoothing methods (Good, 
1953; Katz, 1987) have been used to effectively 
reduce the estimation error. On the contrary, 
the problem of reducing modeling error is less 
studied. 
Probabilistic models are usually simplified 
to make the estimation of parameters feasible. 
However, some important information may be 
lost while simplifying a model. For example, 
using the contextual words, instead of contextual 
parts of speech, enhances the prediction power 
for tagging parts of speech. But, unfortunately, 
reducing the m(×teling error by increasing the 
degree of model granularity is usually accompa- 
nied by a large estimation error if there is not 
enough training data. 
tIowever, if only the discriminative features 
arc involved (i.e., only those important param- 
eters are used), modeling error could be sig- 
niIicantly reduced without using a large co> 
pus. Those discriminative features usually vary 
for different words, and it would be very time- 
consuming to induce such features from the cor- 
pus manually. An algorithm for automatically 
extracting the discriminative features from a cor- 
pus is rims highly demanded. In this paper, 
the Classification and Regression Tree (CARl') 
method (Breiman, Friedman, Olshen & Stone, 
1984) is first used to extract the discriminative 
features, l lowever, CAP, T basically regards all 
selected features as jointly dependent. Nodes 
in different branches are trained with different 
sets of data, and the available training data of a 
node becomes less and less while CART asks 
more and more questions. "FhereR)re, CART 
can easily split and prune the classification tree 
to fit the training data and the cross-validation 
data respectively. The refinement model built 
by CART tends to be over-tt, ned and its perfor- 
mance is consequently not robust. A probabilis- 
tic classification m(,lel is, therefore, proposed 
to construct a more robust classification model. 
The experimental results show that this proposed 
model reduces the error rate of the top 10 error 
dominant words fi'om 5.71% to 4.35% (23.82% 
148 
error reduction rate) while CART only reduces 
the error rate to 4.67% (18.21% error reduction 
rate). 
2. PROBABILIS'FIC TA(\]GEll 
Since part of speech tagging plays an im- 
portant role in the field of natural language pro- 
cessing (Ctmrch, 1989), it is used to evaluate the 
performance of various approaches in tiffs paper. 
Tagging problem can be formulated (Church, 
1989; IAn, Chiang & Su, 1992) as 
.7~ i=i ('1 
(1) 
where ~ is the category sequence selected by 
the tagging model, wi is the i-th word, ci is 
the possible corresponding category for the i-th 
word and c't ~ is tim Stiort-hand notation of tile 
category sequence Cl~ c2~ " • •, on. 
The Brown Corpus is used as the test bed 
for tagging in this paper. After prepr(xccssing 
the Brown Corpus, a corpus of 1,050,(X)4 words 
in 50,(X)0 sentences is constructed. It contains 
54,031 different words and 83 different tags (ig- 
noring the four designator tags "FW," "Ill," 
"NC" and "TIJ' (Francis & KuSera, 1982)). To 
train and test the model, the whole corpus is di- 
vided into the training set and the testing set. 
The v-foM cross-validation method (Breiman et 
al., 1984), where v is set to 10 in this paper, 
is adopted to reduce the error ira performance 
evaluation. The average number of words in the 
training sets and the testing sets are 945,004 (in 
45,000 sentences) and 105,000 (in 5,000 sen- 
tences) respectively. 
After applying back-off smoothing (Katz 
1987) and robust learning (Lin et al., 1992) on 
Equation (1) to reduce the estimation error, a 
tagger with 1.87% error rate in the testing set is 
then obtained. Although the error rate of overall 
testing set is small, many words are still with 
high error rates. For instance, the error rate of 
the word "that" is 9.08% and the error ,ate of 
the word "out" is 21.09%. To effectively im- 
prove accuracy over these words, it is suggested 
in this paper that the tagging model should be 
refined. 
3. MOI)EI, REIqNEMENT 
For not having enough training data, ttstt- 
ally only coarse infornmtion and rather limited 
context are used in probabilistic models. Some 
discriminative fe.'m~res, therefore, may be sacri- 
riced to make the estimation of parameters lea- 
sine. For example, compared to the tag-level 
contextual information used in a bigram or a tri- 
gt'am m(xlel, the word-level contextual inR)rma- 
tion provides more prediction power for tagging 
parts of speech, t lowever, even the simplest 
word-level contextual information (i.e., word bi- 
gram) requires a large number of parameters 
(about 3 billion in our task). Esthnating such 
a large ,mmber of parameters requires a vet',\] 
huge corpus and is far beyond the size of the 
P, rown Corpus. Thus, the word-level contextual 
information is usually abandoned. 
To reduce the modeling error introduced by 
a simplified probabilistic model, one appealing 
approach is to extract only the discriminative 
features for those error dominant words. In this 
way, one can reduce the error rate without en- 
hu'ging the corpus size. I)ifferent error dominant 
words, however, might be associated with dif- 
ferent sets of discriminative features. To induce 
those discriminative features for each word from 
a corpus by hand is very tinae-consuming. Auto- 
matically acquiring those features directly fl'om 
a corpus is thus highly desirable. In this section, 
the Classification and Regression Tree (CAP, T) 
method (P, reiman et al., 1984) is adopted to aulo- 
matically extract the discriminative fcatures ,'rod 
resolve the lexical ambigt, ity. 
CART, however, requires a la,'ge amount of 
training data and validation data, because it re- 
gards all those selected features as jointly de- 
pendent. The characteristic of being jointly de- 
pendent comes from the splitting process, which 
splits those children nodes only based on the 
data of their parent nodes. As a result, CART 
is easily tuned to fit tim training data and vali- 
dation data. Its performance is thus not robust. 
A probabilistic classification apl)roach is there- 
fore proposed to build robust retinement models 
with limited training data. 
149 
Table 1. Some statistics of the top 10 error dominant words. 
Proportion to 
Word Frequency (%) Error rate (%) overall errors (%) 
that 0.895 9.08 4.33 
out 0.170 21.09 1.91 
to 2.259 1.43 1.72 
as 0.603 4.97 1.60 
than 0.160 17.17 1.46 
more 0.188 12.74 1.28 
about 0.147 15.24 1.20 
for 0.785 2.77 1.17 
one 0.252 8.26 1.11 
little 0.068 29.30 1.06 
TOTAL 5.526 5.71 16.84 
3.1. The error dominant words 
To select those words which are worth for 
model refinement, the top 10 error dominant 
words ate ordered according to their contribution 
to overall errors, as listed in Table 1. The 
second column shows their relative fi'equencies 
in the Brown Corpus. The third column shows 
the error rates of those words tagged by the 
probabilistic tagger described in section 2. The 
last column shows the contribution of the errors 
of each word to the overall errors. The last 
row indicates that the top 10 error dominant 
words constitute 5.53% of the testing corpus and 
contribute 16.84% of the errors in the testing 
corpus. Their averaged error rate is 5.71% (i.e., 
the ratio of the total errors of these words to their 
total occurrence times in the testing corpus). 
3.2. Feature selection 
"lk~ reduce modeling error, more discrimina- 
tive infommtion should be incorporated in tag- 
ging. In addition to the trigram context infor- 
mation of lexical category, the features in Table 
2 are considered to be potentially discriminative 
for choosing the correct lexical category of a 
given word. 
Since the size of the parameters will be huge 
if all the features in Table 2 are jointly consid- 
ered, it is not suitable to incorporate all of them. 
Actually only some of the listed features are re- 
ally discriminative for a particuhtr word. For 
instance, when we want to tag the word "out," 
we do not care whether lhe word behind it (i.e., 
the right-1 woM) is "book," "money" or "win- 
Table 2. The potentially discriminative featu,'es. 
• Tim left-2, left-l, right-1 and right-2 categories (denoted as L~t.g(2), 
Lcatg(1),/2cat.g(1) and \]l'catg(2)) 
• The left-1 and right-1 words (denoted as Lwo,.d(1) and \]{,,,,,.d(1)) 
• The 
• The 
• The 
• The 
• The 
• Tim 
distance from the left period (gp,.,.iod) 
distance to the right period (\]~'t,,'riod) 
distance from the nearest left noun (£notm) 
distance tothe nearest right noun (/~,ot,,,) 
distance from the nearest left verb (Lwrb) 
distance to the nearest right verb (J?,.,,rb) 
150 
Table 3. The improvement in the testing set after using CART as the refined 
word model. Value in parenthesis indicates lhc error rate of the validation set. 
Error rate of F.rror rate of Reduction 
Word the 1st stage (%) using CART (%) rate (%) 
that 9.08 8.69 (7.47) 4.3(/ (17.73) 
out 21.09 8.04 (7.13) 6t.88 (66.19) 
to 1.43 1.36 (1.12) 4.90 (21.68) 
as 4.97 3.33 (2.82) 33.00 (43.26) 
than 17.17 13.83 (11.24) 19.45 (34.54) 
more 12.74 10.96 (9.35) 13.97 (26.61) 
about 15.24 11.33 (9.94) 25.66 (34.78) 
for 2.77 2.55 (2.28) 7.94 (17.69) 
one 8.26 6.48 (5.94) 21.55 (28.09) 
little 29.30 30.00 (25.16) -2.39 (14.13) 
7©7AL 5.71 4.67( 4.00) 18.21 (29.95) 
dow;" we only care whether the right-1 word is 
"of." Thus, in this section, the CART (Breiman 
et al., 1984) method is used to extract the re- 
ally discriminative features fi'om the fcatu,e set. 
The error rate criterion is adopted to measure 
the impurity of a node in the classification tree. 
For every error dominant word, its 4/5 training 
tokens are used to sp!it the classification tree; 
the remaining 1/5 training tokens (not the test- 
ing tokens) are used to prone that tree. Then, 
all the questions asked by the pruned tree are 
considered to be the discriminative features. 
3.3. CART classilicalion model 
In our task, it two-stage approach is adopted 
to tag parts of speech. The first stage is the 
probabilistic tagger described in section 2, which 
provides the most likely category sequence of 
the input sentence. The second stage consists of 
the refined word models of the error dominant 
words. In this stage, the p,'uned classification 
tree is used to re-tag the part of speech. The 
results in the testing set are shown in ~\[hble 3. 
In the table, the second column gives the error 
rates of the error dominant words in the tirst 
stage. The third cohnnn gives the error rates 
after using CART to re-tag those words, and the 
last column gives the corresponding ,eduction 
rates. In parenthesis it gives the performance 
in the validation set. The last row in "lable 3 
shows that the i'efined models built by CART 
can reduce the 18.21% o\[' error rate for the 10 
ClT/.)I" dominant words. Only the performance of 
the word "little" deteriorated. This is due to the 
robusmess problem between the cross-validation 
data and the testing data, which is induced hy the 
rare occurrence of the discriminative features. 
3.4. Prolmbillstic classilication model 
\]~ecause discriminative features are adopted 
dependently, CART can easily classify the train- 
ing data an(l usually introduce the problem of 
over--tuning. Besides, due to the wu'iation be- 
tween the validation data and the testing data, the 
pruning process cannot effectively diminish the 
problem of over-tuning intr{×iuced while grow- 
ing the classification tree. Thus, a probabilistic 
classification model, which uses all the features 
selected by CART in an independent way, is pro- 
posed in this section to robustly re-tag the lexical 
categories of the error dominant words. 
151 
Table 4. The 11 questions asked by the 
classification tree for the word "than." 
• QI,1 : L~atg(2) = "RB" ? 
• Q1,2 : Lc,~tg(2) = "IN" ? 
• Q2,1 : Lcatg(1) = "AP" ? 
• Qa,1 : Rcatg(1) = "CD" ? 
• Qa,2 : Rcatg(1) = "JJ" ? 
• Q4,1 : I?~catg(2) = "JJ" ? 
• Os,1 : Lwo~d(1) = "rather" ? 
• Q6,~ : Rwo,.d(1) = "the" ? 
• Oa,.~ : R,vo,.a(1) = "with" ? 
• @7,1 : Lperiod ~ 2 ? 
• (28,1 : Lperiod ~ 6 ? 
To use the probabilistic chtssification model, 
feature vectors are tirst constructed according to 
the questions asked by the pruned classitication 
tree. Assume that the 11 questions in Table 4 
are asked by the classification tree for the word 
"than." Every occurrence of "than" in the cor- 
pus is then accompanied by an 8-dimensional 
feature vector, F = \[fi,..., fs\]. The elements 
of the feature vector are obtained by the follow- 
ing rule. 
f j, if Qi,j is true; k \ 
0, otherwise. (2) 
Notice that Ql,1 and Q1,2 are merged into the 
same random variable because both of them ask 
about what the left-2 category is. 
After constructing the feature vectors, the 
problem becomes to find a most probable cate- 
gory according to the given feature vector and 
it can be formulated as 
= a rgmax_P(cI r,,..., D,), (3) 
(2 
where c is a possible tag for the word to be re- 
tagged. Assume that 
P((21fl,", f,,) 
= I'(.1,,..., ./;, I(). r((2) s,(f,,..., j;,) 
, P((2) ,=,ie\[ s'(f~lc) s (.t.~777 ./..,,) 
(4) 
The probabilistic chtssilication model (PCM) is 
then defined as 
~ ;~.,<m~,~ i~ P(f~I~)' _r'(~). (5) ,,5: i:-1 
The estimation and learning processes of the 
PCM approach are generally more robust. As 
stated before, CAP, T regards all selected features 
Its jointly dependent. The available training data 
for a node become less its more questions are 
asked. On the contrary, due to the conditional 
independent assumption for P(fl,'",./;,\[c) in 
Equation (4), every p,'lrameter of PCM can be 
trained by the whole training data, and therefore, 
the estimation and learning pr(xeesses are more 
robust. 
Furthermore, every feature of PCM should 
be weighted to retlect its discriminant power 
because PCM regards all features of different 
branches in a tree its conditionally independent. 
Directly using these features without weighting 
cannot lead to good resuhs. The weighting effect 
can be implicitly achieved by adaptive learning. 
4. RESUI;I'S AND I)ISCUSSION 
After learning the model parameters (Amari, 
1967; Lin et al., I992), the results of using 
the probabilistic classification model (PCM) are 
listed in qable 5. As shown in the last rows of 
qables 3 and 5, the error rate of PCM is smaller 
than that of CART in the testing set while their 
error rates in the wflidation set are almost the 
same. The last row of "lhble 5 shows that the 
error rate of the 10 error dominant words is re- 
duced from 5.71% to 4.35% (23.82% reduction 
rate) by refining the woM models with the PCM 
approach. 
In sumnaary, due to dividing the features into 
independent groups, PCM can use the whole 
"/52 
Table 5. Improvement of using pmbabilistic classification model (PCM) as the relined 
word model. Value in parenthesis indicates the error rate of the valklatkm set. 
Error rate of Error rate {}f Reduction 
Word the 1st stage (%) using PCM (%) rate (%) 
that 9.08 7.98 (7.36) 12.11 (18.94) 
out 21.09 7.60 (7.43) 63.96 (64.77) 
to 1.43 1.33 (1.12) 6.99 (21.68) 
as 4.97 2.69 (2.49) 4.5.88 (49.90) 
than 17.17 13.49 (12.25) 21.43 (28.65) 
more 12.74 10.15 (9.16) 20.33 (28.10) 
about 15.24 10.60 (9.57) 30.45 (37.20) 
for 2.77 2.39 (2.34) 13.72 (15.52) 
one 8.26 6.19 (5.88) 25.06 (28.81) 
little 29.30 28.66 (27.63) 2.18 (5.70) 
TOTAL 5.71 4.35 ( 4.01) 23.82 (29.77) 
training data to train every feature and hence 
construct a more robust retinement model. It is 
believed that this proposed probabilistic classi- 
tication model (i.e., Equation (5)) can also be 
applied to other problems attacked by CART, 
such as voiced/w)iceless stop classilication and 
end-of-sentence detection, etc. (Riley 1989). 
5. REFERENCE 
Amari, S. (1967). A theory of adaptive pat- 
tern classitiers. IEEE Transactions on Electronic 
Computers, 16, 299-307. 
Breiman, L., Friedman, J. II., Olshen, R. A. & 
Stone, C. J. (1984). Classification and regres- 
sion trees. Wadsworth Inc., Pacific Grove, Cal- 
ifornia, USA. 
Chiang, T. It., Lin, Y. C. & Su, K..Y. (1992). 
Syntactic ambiguity resolution using a discrim- 
ination and robustness oriented adaptive learn- 
ing algorithm. In Proc. of COLING-92, pp. 
352-358. Aug 23-28 1992, Nantes, France. 
Church, K. W. (1989). A stochastic parts pro- 
gram and noun phrase parser for unrestricted 
text. In Proceedings of the IEEE 1989 Inter- 
national Conference on Acoustics, Speech, and 
Signal Processing, pp. 695-698. May 23-26 
1989, Glasgow, U.K.. 
Francis, W. N. & Ku(:era, H. (1982). Frequency 
analysis of English usage. Itoughton Mifflin 
Corn p q n y. 
Good, I. J. (1953) The population frequencies of 
species and tim estimation of population param- 
eters. Biometrika, 40, 237-264. 
Katz, S. M. (1987) Estimation of probabilities 
trom sparse data for the language m(×tel compo- 
nent of a speech recognizer. IEEE 7)'ansactions 
on Acoustics, Speech, and Signal Processing, 35, 
400--401. 
Lin, Y.-C., Chiang, T.-II. & Su, K.-Y. 1992) 
Discrimination o,'icnted pmbabilistic tagging. In 
Proceedings oJ'ROCLING V, pp. 87-96. Sep 
18-20 I992, Taipei, Taiwan, P,.O.C. 
Riley, M. D. (1989). Some applications of tree- 
based modeling to speech and language. In Pro- 
ceedings of the Speech anti Natural Language 
Workshop, pp. 339-352. Oct 15.-18 1989, Cape 
Cod, Massachusetts, USA. 
Su, K. Y., Chang, J. S. & Lin, Y. C. (1992). 
A discriminative approach for ambiguity reso- 
lution based on a semantic score function. In 
Proc. of 1992 International Conference on Spo- 
ken Language Processing, pp. 149-152. Oct 
I2-16 1992, Banff, Alberta, Canada. 
153 
