Detection of Language (Model) Errors 
K.Y. Hung, R.W.P. Luk, D. Yeung, K.F.L. Chung and W. Shu 
Department of Computing 
Hong Kong Polytechnic University 
Hong Kong 
E-mail: {cskyhung, csrluk, csdaniel, cskchung, cswshu}@comp.polvu.edu.hk 
Abstract 
The bigram  models are popular, in 
much  processing applications, in 
both Indo-European and Asian s. 
However, when the  model for 
Chinese is applied in a novel domain, the 
accuracy is reduced significantly, from 96% to 
78% in our evaluation. We apply pattern 
recognition techniques (i.e. Bayesian, decision 
tree and neural network classifiers) to 
discover  model errors. We have 
examined 2 general types of features: model- 
based and -specific features. In our 
evaluation, Bayesian classifiers produce the 
best recall performance of 80% but the 
precision is low (60%). Neural network 
produced good recall (75%) and precision 
(80%) but both Bayesian and Neural network 
have low skip ratio (65%). The decision tree 
classifier produced the best precision (81%) 
and skip ratio (76%) but its recall is the lowest 
(73%). 
Introduction 
Language models are important post-processing 
modules to improve recognition accuracy of a 
wide variety of input, namely speech recognition 
(Balh et al., 1983), handwritten recognition 
(Elliman and Lancaster, 1990) and printed 
character recognition (Sun, 1991), for many 
human s. They can also be used for text 
correction (Ron et al., 1994) and part-of-speech 
tagging. 
For Indo-European s, the word-bigram 
 model is used in speech recognition 
(Jelinek, 1989) and handwriting recognition 
(Nathan et al., 1995). Various ways to improve 
 models were reported. First, the model 
has been extended with longer dependencies (e.g. 
trigram) (Jelinek, 1991) and using non-contiguous 
dependencies, like trigger pairs (Rosenfeid, 1994) 
or long distance n-gram  models (Huang 
et al., 1993). For better probability estimation, the 
model was extended to work with (hidden) word 
classes (Brown et al., 1992, Ward and Issar, 1996). 
A more error-driven approach is the use of hybrid 
 models, in which some detection 
mechanism (e.g. perplexity measures \[Keene and 
O'Kane, 1996\] or topic detection \[Mahajan et al., 
1999\]) selects or combines with a more 
appropriate  model. 
For Asian s (e.g. Chinese, Japanese and 
Korean) represented by ideographic characters, 
 models are widely used in computer 
entry because these Asian s have a large 
set of characters (in thousands) that the 
conventional keyboard is not designed for. Apart 
from using speech and handwriting recognition 
for computer entry,  models for Asian 
s can be used for sentence-based 
keyboard input (e.g. Lochovsky and Chung, 1997), 
as well as detecting improper writing (e.g. dialect- 
specific words or expressions). 
Unlike Indo-European s, words in these 
Asian s are not delimited by space and 
conventional approximate string matching 
techniques (Wagner and Fisher, 1974; Oommen 
and Zhang, 1974) in handwriting recognition are 
seldom used in Asian  models. Instead, a 
widely used and reported Asian  model is 
the character-bigram  model (Jin et al., 
1995; Xia et al., 1996) because it (1) achieved 
high recognition accuracy (around 90-96%) (2) is 
easy to estimate model parameters (3) can be 
processed quickly and (4) is relatively easy to 
implement. 
Improvement of these  models for Indo- 
European s can be applied for the Asian 
s but words need to be identified. For 
Asian s, the model was integrated with 
syntactic rules (Chien, Chen and Lee, 1993). 
Class based  model (Lee and Tung, 1995) 
was also examined but the classes are based on 
semantically related words. A-new approach 
(Yang et al., 1998) is reported using segments 
87 
expressed by prefix and suffix trees but the 
comparison is based on perplexity measures, 
which may not correlate well with recognition 
improvement (Iyer et al., 1997). 
While attempts to improve the; (bigram)  
models were (quite) successful, the high 
recognition accuracy (about 96%) is not adequate 
for professional data entry services, which 
typically require an error rate lower than 1 in 
1,000. As part of the quality control exercises, 
these services estimate their error rate by 
sampling, and they identify and correct the errors 
manually to achieve the required quality. Faced 
with a large volume of text, the ability to 
automatically identify where the errors are is 
perhaps more important than automatically 
correcting errors, in post-editing because (1) 
manual correction is more reliable than automatic 
correction, (2) manual error sampling can be 
carried out and (3) more manual efforts are 
required in error identification than correction due 
to the large volume of text. For example, if the 
identification of errors is 97% and there are no 
errors in error correction, then the accuracy of the 
 model is improved from 96% to 99.9% 
after error correction. 
In typical applications, the accuracy of the bigram 
 model may not be as high as those 
reported in the literature because the data may be 
in a different genre than that of the training data. 
For evaluation, we tested a bigram  
model with text from a novel domain and its 
accuracy dropped significantly from 96% to 78%, 
which is similar to English (Mahajan et al., 1999). 
Improvement in the robustness of the bigram 
 model across different genre is 
necessary and several approaches are available, 
based on detecting errors of the  model. 
One (adaptive) approach is to automatically 
identify the errors and manually correcting them. 
The information about the correction of errors is 
used to improve the bigram  model. For 
example, the bigram probabilities of the  
model may be estimated and updated with the 
corrected data. In this way, future occurrences of 
these errors are reduced. 
Another (hybrid) approach uses another  
model to correct the identified errors. This 
 model can be computationally more 
expensive than the bigram  model 
because it is applied only to the identified errors. 
Also, topic detection (Mahajan et al., 1999) and 
 model selection (Keene and O'Kane, 
1996) can be applied to those area to find a more 
appropriate  model because usually 
topic-dependent words are those causing errors. 
Another (integrative) approach improves the 
 model accuracy using more 
sophisticated recognizers, instead of a 
complementary  model. The more 
sophisticated recognizer may give a set of 
different results that the bigram  model 
can re-apply on or this recognizer simply gives 
the recognized character. This integrates well with 
the coarse-fine recognition architecture proposed 
by Nagy (1988) back in the 1960s. Coarse 
recognition provides the candidates for the 
 model to select. Fine, expensive 
recognition is carried out only where the  
models failed. Finally, it is possible to combine all 
the different approaches (i.e. adaptive, hybrid and 
integrative). 
Given the significance in detecting errors of 
 models, there is little work in this area. 
Perhaps, it was considered that these errors were 
random and therefore hard to detect. However, 
users can detect errors quickly. We suspect that 
some of these errors may be systematic due to the 
properties of the  model used or due to 
 specific properties. 
We adopt a pattern recognition app~'~z, ch to 
detecting errors of the bigram  rnoaei for 
the Chinese . Each output is assigned to 
either the class of correct output or the class of 
errors. The assignment of a class to an output is 
based on a set of features. We explore a number 
of features to detect errors, which are classified 
into model-based features and -specific 
features. 
The proposed approach can work with Indo- 
European s at the word-bigram level. 
However, -specific features have to be 
discovered for the particular . In addition, 
this approach can be adopted for n-gram  
models. In principal, the model-based features can 
be found or evaluated similar to the bigram 
 model. For example, if the trigram 
probability (instead of bigram probability) is low, 
then the likelihood of a  model error is 
high. 
This paper is organized as follows. Section 1 
discusses various features and some preliminary 
evaluation of their suitability for error 
88 
identification. Section 2 describes 3 types of 
classifiers used. In section 3, our evaluation is 
reported. Finally, we conclude. 
1. Features 
We evaluate individual features for error detection 
because they are important to the success of 
detection. Articles from Yazhou Zhoukan (YZZK) 
magazine (4+ Mbytes)/PH corpus (Guo and Liu, 
1992) (7+ Mbytes) are used for evaluation. We 
use the recall and precision measurements for 
evaluation. The recall is the number of errors 
identified by a particular feature divided by the 
total number of errors. The precision is the 
number of errors identified by a particular feature 
divided by the total number of times the feature 
indicate that there are errors. In the first 
subsection, we describe some model-based 
features. Next, we describe the -based 
features. In the last subsection, we discuss the 
combined use of both types of features. 
1.1 Model-based features 
The bigram  model selects the most 
likely path Pm~ out of a set S. The probability of a 
path s in S is simply the product of the conditional 
probabilities of one character c, after the other c~.l 
where s = Co.C+..c:s:, after making the Markov 
assumption. Formally, 
Pm~ = arg max {p(s)} 
s~S 
= arg~ax{p(Co)I~IP(cilC,_l)\[coC,...c,,== s} 
The set s is generated by the set of candidate 
characters for each recognition output. The 
recognizer may supply the set of candidate 
characters. Alternatively, a coarse recogniser may 
simply identify the best-matched group or class of 
characters. Then, members of this class are the 
candidate characters. Formally, we use a function 
h(.), that maps the recognition position to a set of 
candidate characters, i.e. h(i) = {ciJ. We can also 
define the set of sentences in terms of h(.), i.e. S = 
{s I s = cocz...c,, ~, c, ~ h(i)}. 
1.1.1 Features based on zero probabilities (Fl,t) 
One feature to detect errors is to count the number 
of conditional probabilities p(cilc~.l) that are zero, 
between 2 consecutive positions. Zero conditional 
probabilities may be due to insufficient training 
data or may be because they represent the 
 properties. Figure 1 shows the likelihood 
of an error occurring against the percentage of the 
conditional probabilities that are zero. 
60% 
50% 
40% -- 
30% 
20% 
10% 
0% 
50% 60% 70% 80% 90% 100% 110% 
Figure 1: The  model output errors against 
percentages of zero conditional probabilities. 
1.1.2 Features based on low probability (Fl,z) 
When there are insufficient data, the conditional 
probabilities that are small are not reliable. IfPm~ 
have selected some conditional probabilities that 
are low, then probably there are no other choices 
from the candidate sets. Hence, the insufficient 
data problem may occur in that particular Pm~. 
In Figure 2, we plot the likelihood of errors 
identified against the different logarithmic 
conditional probability values. When the recall 
increases, unfortunately, the precision drops. 
120% \[ 
100% 
\[ : ~ precision .~,~'- ! 
80% l :~ recall ....... / | 
Accuracy 
L 6°% \[ ,, - .... 
t .................... - ! I 
20% .... -\ ................ ......... - 
I \] III \[ 
0% I 
~. .. ... .... . 
Figure 2: The precision, recall and accuracy (i.e. recall 
x precision) of detecting  model errors by 
examining the logarithm conditional probabilities on 
the maximum likelihood path. 
1.2 Language-specific Features 
The -specific features are based on 
applying the word segmentation algorithm (Kit et 
al., 1989) to the maximum likelihood path. The 
ROCLING (Chen and Huang, 1993) word list 
used for segmentation has 78,000+ entries. 
89 
1.2.1 Features based on word length (F2,O 
If the matched word in the maximum likelihood 
path is long, then we expect the likelihood of an 
error is low because long words are specific. 
Figure 3 shows the precision of detecting the 
matched word is correct and the recall of errors in 
multi-character words. In general, the longer the 
matched words, the more likely that they are 
correct and the likelihood of missing undetected 
long words is small. 
120% 
lOO% 
00% 
60% 
40% 
20% 
o; 
• precision .. 
1 2 3 4 5 6 7 a 
Figure 3: The precision of correct matched words 
against word lengths. 
1.2.2 Features based on single-character 
sequences (F2,z) 
In word segmentation, when there are no entries 
in the dictionary that can match, the input is 
segmented into single characters. Thus, Lin et al 
(1993) noted that single-character sequences after 
word segmentation might indicate segmentation 
problems. Here, we apply the same technique for 
the detection of errors. If we count on the per 
character basis, the recall of error is 80% and the 
precision in error identification is 35%. If we 
count multi-character words and a sequence of 
single-characters as blocks, then the recall of 
errors is 79% and the precision in finding one or 
more errors in the block is increased to 51%. 
120% ............ 
20% ........ : 
0% ................................................. 
1 2 3 4 5 6 7 8 
Figure 4: The precision and recall of single-character 
sequences of different lengths. 
Similar to matched words in the maximum 
likelihood path, the error detection performance of 
single-character sequences may depend on their 
length. Therefore, we plotted the recall and 
precision of detecting errors against the length of 
the single-character sequences. According to 
Figure 4, as the length of the single-character 
sequence is large, the likelihood of an error is 
larger. The recall of errors is particularly low for 
single-character sequences that have 2 characters. 
The other single-character sequences (i.e. its 
length is not equals to 2) have almost 100% recall. 
One possible reason why 2 single-character 
sequences achieved low precision is that there are 
many spurious bigrams and therefore false match. 
1.3 Combined use of Features 
We carried out a preliminary study using the 
features mentioned in subsection 1.1 and 1.2. Our 
Bayesian classifier (Section 2.1) achieved 83% 
recall but 35% precision, which can be achieved 
using  specific features only (Fz2). 
Therefore, we try to combine the use of these 
features in a more carefiJl manner. We divided the 
detection into 3 scenarios: (1) single character 
(feature F2,2); (2) single-character sequence of 
length 2 (feature F2,2) and (3) 2 character words 
(feature Fzl). Each case is assigned a classifier to 
detect errors. Single-character sequences longer 
than 2 are considered as having errors (Figure 4). 
Words of length longer than 2 are considered 
correct (Figure 3). 
1.3.1 Single characters 
After word segmentation, single characters are 
those cases when there are no multi-character 
words in the dictionary that can match with it and 
its following substring. The single characters have 
different part-of-speech tags. 
L~ l .......................................................................................................................................................................... i 
| i ...................... --~ 
B--ti 
Pa~cfgm:h 
Figure 5: The single characters and their corresponding 
 model output accuracy for different part-of- 
speech tags. 
Figure 5 shows that the accuracy of the  
model for these single characters with part-of- 
90 
speech tags related to exclamations are low. For 
error detection, a feature is assigned to each part- 
of-speech tag. 
The  model accuracy for single 
characters may depend on the availability of the 
left and right context to form high probability 
bigrams. Therefore, we expect that  
model accuracy of single characters at the 
beginning (70%) and end (70%) of a sentence is 
lower than those in the middle (85%) of the 
sentence. The worst case occurs when the 
sentence has only a single character, where the 
measured accuracy is only 8.75% (i.e. no bigram 
context). 
1.3.2 Two-single-characters sequence 
Figure 6 shows that  model output 
accuracy increases as the bigram probability of 
single-character sequences of length 2 increases. 
Hence, the bigram probabilities can be used as a 
feature for detection. 
1.2 "~ 
................... ./< il 
. ~ /\! 
Figure 6: The bigram (logarithm) probability of the 
single-character sequence of length 2. 
Similar to single characters, the  model 
accuracy for 2-single-characters sequences at the 
start, middle and end of a sentence are 48%, 47% 
and 30%, respectively. The accuracy is 33% if the 
sentence is the 2-single-characters sequence. 
I~- ........... )-- 
Q4 F ...... ~ .......................... ./~"------* ....... ace ) ",~._~,)_____,)._~_.___ 
0) 
0 1 2 3 4 5 6 7 8 
nmi:er ofhiddm,,u:rds 
Figure 7: Language model accuracy against different 
number of hidden words (see text). 
Another feature for 2-single-characters sequences 
is to examine whether the characters in the two 
candidate sets can form words that match with the 
dictionary. These matched words are called 
hidden words. Figure 7 shows that if there are 
hidden words, the  model accuracy 
dropped from 60% to 25%. Since there are not 
many cases with 6-8 hidden words, the accuracy 
for these cases are not reliable. 
1.3.3 Two-character words 
For 2 character words, the bigram probability 
(Figure 8) can be used as a feature similar to the 
single-character sequences. The position of these 
2 character words in the sentence does not relate 
to the  model accuracy. Our measured 
accuracy is 91%, 89% and 91% for the beginning, 
the end and the middle of the sentence, 
respectively. Even sentences with a single 2- 
character word achieved 90% accuracy. Hence, 
there is no need to assign features for the position 
of the 2 character words in a sentence. Similar to 
2-single-characters sequences, the  model 
accuracy (Figure 9) decreases as the humber of 
hidden words increase in the corresponding 2 sets 
of candidate characters. 
1.7 : 
o.4~- ::, !i 
O.2 ~ i ~ !~ 
:::::::::::::::::::::::::::::::::::: ~ ~+*~,-.+~.~.~.~- :~.~.~.~+~..+~.~+,+~*~.~+~ 
Figure 8: The  model accuracy of 2 character 
words against the bigram probability. 
= 
1 ............. 
o.7 ~ ............. 
0.6 ....................................... ~.---2.. ::-'=-*--. 
0,5 ............. 
0,4 
0,3 m 
0.2- 
0,1 ............................................ 
0 ............................................................. 
1 2 3 4 5 6 7 8 9 
number of hidden words 
Figure 9: The  model accuracy against 
different number of hidden words. 
91 
2 Classifiers 
One of the problems witlh using individual 
features is that the recall and precision are not 
very high, except the -specific features. It 
is also difficult to set the threshold for detection 
because of the precision-recall trade-off. In 
addition, there may be some improvement in 
detection performance if features are combined 
for detection. Therefore, we adopt a pattern 
recognition approach to detect errors. 
Several classifiers are used to decide for error 
identification because we do not know whether 
particular features work well with particular 
classifiers, which make different assumptions 
about classification. Three types of classifers will 
be examined: Bayesian, decision tree and neural 
network. 
2.1 Bayesian Classifier 
The Bayesian classifier is simple to implement 
and is compatible with the model-based features. 
Given the feature vector x, the Bayesian detection 
scheme assigns the correct class wc and the error 
class we, using the following rule: 
g~(x) > ge(X) assign wc 
.Otherwise assign we 
where go(.) and ge(.) are: 
gc (x) = -(x -/~c)r y.c-. (x _/~,)- log l  , I +2 log p(w c ) 
g, (x)= -(x-lee) r Z,-~(x-/~,) - log\] Z~ \] +2log p(w,) 
Pc and ,ue are the mean vectors of the class wc and 
we, respectively, ~ and ~ are the covariance 
matrices of the class wc and we, respectively, and 
1-I is the determinant. 
2.2 Decision Tree 
Originally, we tried to use the support vector 
machine (SVM) (Vanpik, 1995) but it could not 
converge. Instead, we used the decision tree 
algorithm C4.5 by Quinlan (1993). Decision trees 
are known to produce good classification if 
clusters can be bounded by some hyper-rectilinear 
regions. We trained C4.5 with a set of feature 
vectors, described in Section 1.3. 
2.3 Neural Network 
We use the multi-layer perceptron (MLP) because 
it can perform non-linear classification. The MLP 
has 3 layers of nodes: input, hidden and output. 
Nodes in the input layer are fully connected with 
those in the hidden layer. Likewise nodes in the 
hidden layer are fully connected to the output 
layer. For our application, one input node 
corresponds to a feature in section 1.3. The value 
of the feature is the input value of the node. Two 
output nodes indicate whether the current 
character is correct or erroneous. The number of 
hidden nodes is 2-4, calculated according to 
(Fujita, 1998). 
The output of each node in the MLP is the 
weighted sum of its input, which is transformed 
by a sigmoidal function. Initially, the weights are 
assigned with small random numbers, which are 
adjusted by the gradient descend method with 
learning rate 0.05 and momentum 0.1. 
3 Evaluation 
In the evaluation, the training data is the PH 
corpus and the test data is the YZZK magazine 
articles (4+ Mbytes), downloaded from the 
Internet. In handwritten character recognition, the 
optimal size of the number of candidates is 6 
(Wong and Chan, 1995). For robustness, each 
recognized character in our evaluation is selected 
from 10 candidates. 
We measured the performance in terms of recall, 
precision and the manual effort reduction in 
scanning the text for errors. The recall is the 
number of identified errors over the total number 
of errors. The precision is the number of identified 
errors over the total number of cases classified as 
errors. The amount of saving in manual scanning 
for errors is called the skip ratio, which is the 
number of blocks classified as correct over the 
total number of blocks. The recall and the skip 
ratio are more important than the precision 
because post error correction (manual or 
automatic) can improve the recognition accuracy. 
It is possible to combine the recall and precision 
into one, using the F measures (Van Rijsbergen, 
1979) but the value for rating the relative 
importance is subjective. 
Table 1 shows the classification performance of 
the Bayesian classifier. The recall of errors by the 
Bayesian classifier has reduced slightly from 83% 
using a single classifier to 79% using 3 classifiers 
but the precision improved from 51% to 60%. 
Also, the skip ratio is 65%, which is much higher 
than the skip ratio of 0.1% if we did not use the 
classifier. Although the MLP has a higher 
precision (80%), its recall is slightly lower than 
92 
the Bayesian classifier. The skip ratio of the both 
Bayesian and MLP classifiers are about the same. 
Cases Measure 
Single Recall 
character 
Precision 
2 single Recall 
characters 
Precision 
2-character Recall 
words 
Precision 
Overall Recall 
Precision 
Skip Ratio 
Bayes 
71% 
40% 
60% 
88% 
60% 
29% 
79% 
60% 
65% 
Table 1: The performances 
C4.5 MLP 
56% 28% 
75% 71% 
84% 83% 
82% 80% 
17% 9% 
60% 62% 
73% 75% 
81% 80% 
76% : 66% 
of the 3 types of 
classifiers in detecting  model errors. 
4 Summary and Future Work 
We have evaluated both model-based and 
-specific features for detecting  
model errors. Individual model-based features did 
not yield good detection accuracy, suffering from 
the precision-recall trade-off. The - 
specific features detect errors better. In particular, 
matched multi-character words are usually correct. 
If the model-based and -specific features 
are aggregated as a single feature vector, the recall 
and precision of errors are 83% and 35%, 
respectively, which are the same if we just use 
-specific features. Hence, instead of a 
single classifier, we separated 3 situations 
identified by the -specific features and 3 
classifiers are used to detect these errors 
individually. The Bayesian classifier (simpliest) 
achieved an overall 79% recall, 60% precision 
and 65% skip ratio and the MLP achieved an 
overall 75% recall, 80% precision and a 66% skip 
ratio. Similar recall and precision performances 
are achieved using decision trees, which are 
preferred since their skip ratio is higher (i.e. 76%). 
Although the precision (so far) is not high (60% - 
80%), it is not the most important result because 
(1) this only represents a minor waste of checking 
effort, compared with scanning the entire text, and 
(2) the identified errors will be checked further or 
corrected either manually or automatically. 
Acknowledgement 
This work is supported by the (Hong Kong) 
University Grants Council under project PolyU 
5109/97E and the PolyU Central Research Grant 
G-$603. We are grateful to Guo and Liu for 
providing the PH corpus and ROCLING for 
providing their word list. 

References 
Bahl, L.R., F. Jelinek and R.L. Mercer (1983) "A 
maximum likelihood approach to continuous 
speech recognition", 1EEE Trans. PAM1, 5:2, pp. 
179-190. 
Brown, P.F., V.J. Della Pietra, P.V. deSouza and 
R.L. Mercer (1992) "Class-based n-gram models 
of natural ", Computational Linguistics, 
4, pp.467-479. 
Chen, K.-C. and C.R. Huang (eds.) (1993) 
"Chinese word class analysis", Technical Report 
93-05, Chinese Knowledge Information 
Processing Group, Institute of Information 
Science, Academia Sinica, Taiwan. 
Chien, L.F., Chen, K.J., Lee, L.S. (1993) "A best- 
first  processing model integrating the 
unification grammar and Markov  model 
for speech recognition applications", 1EEE Trans. 
Speech and Audio Processing, 1:2, Page(s): 221 - 
240. 
Elliman, D.G. and I.T. Lancaster (1990) "A 
review of segmentation and contextual analysis 
techniques for text recognition", Pattern 
Recognition, 23:3/4, pp.337-346. 
Fujita, O. (1998) "Statistical estimation of the 
number of hidden units for feedforward neural 
networks", Neural Networks, 11, 851-859. 
Guo, J. and H.C. Liu, "PH - a Chinese corpus for 
pinyin-hanzi transcription", 1SS Technical Report, 
TR93-112-0, Institute of Systems Science, 
National University of Singapore, 1992. 
Huang, X., F. Alleva, H. Hon, M. Hwang,, K. Lee 
and R. Rosenfeld (1993) "The SPHINX-II speech 
recognition system: an overview", Computer 
Speech and Lanaguage, 2, 137-148. 
lyer, R., M. Ostendorf and M~-Meteer (1997) 
"Analyzing and predicting  model 
performance", Proc. IEEE Workshop Automatic 
Speech Recognition and Understanding, pp. 254- 
261. 
Jelinek, F. (1989) "Self-organized  
modeling for speech recognition", in Readings in 
Speech Recognition, Morgan Kayfmann. 
Jelinek, F. (1991) "Up from trigrams", Proc. 
Eurospeech 91, pp. 181 - 184. 
Jin, Y., Y. Xia and X. Chang (1995) "Using 
contextual information to guide Chinese text 
recognition", Proc. 1CCPOL '95, pp. 134-139. 
Kenne, P.E. and M. O'Kane (1996) "Hybrid 
 models and spontaneous legal 
discourse", Proc. ICSLP, Vol. 2, pp. 717-720. 
Kit, C., Y. Liu and N. Liang (1989) "On methods 
of Chinese automatic word segmentation", 
Journal of Chinese Information Processing, 3:1, 
13-20. 
Law, H.H-C. and C. Chan (1996) "N-th order 
ergodic multigram HMM for modeling of 
s without marked word boundaries", 
Proc. COLING 96, pp. 2043-209. 
Lee, H-J. and C-H Tang (1995) "A  
model based on semantically clustered words in a 
Chinese character recognition system", Proc. 3 rd 
lnt Conf. on Document Analysis and Recognition, 
Vol. 1., pp. 450-453. 
Lin, M-Y., T-H. Chiang and K-Y. Su (1993) "A 
preliminary study on unknown word problem in 
Chinese word segmentation", Proc. ROCLING V1, 
pp.119-141. 
Lochovsky, A.F. and K-H. Chung (1997) 
"Homonym resolution for Chinese phonetic input", 
Communications of COLIPS, 7:1, 5-15. 
Mahajan, M., D. Beeferman and X.D. Huang 
(1999) "Improving topic-dependent modeling 
using information retrieval techniques", Proc. 
1EEE ICASSP 99, Vol. 1, pp.541-544. 
Nagy, G. (1988), "Chinese character recognition: 
twenty-five-year retrospective", in Proc. 9th lnt. 
Conf. on Pattern Recognition, Vol. 1, pp. 163-167. 
Nathan, K.S., H.S.M. Beigi, J. Subrahmonia, G.J. 
Clary and H. Maruyama (1995) "Real-time on- 
line unconstrained handwriting recognition using 
statistical methods", 
Oommen, B.J. and K. Zhang (1996) "The 
normalized string editing problem revisited", 
1EEE Trans. on PAM1, 18:6, pp. 669-672. 
Quinlan, J.R. (1993) "C4.5 programs for machine 
learning", Morgan Kaufmann, CA. 
Ron, D., Y. Singer and N. Tishby (1994) "The 
power of Amnesia: learning probabilistic 
automata with variable memory length", to 
appear in Machine Learning 
Rosenfeld, R. (1994) "Adaptive statistical 
 modeling" a maximum entropy 
approach", Ph.D. Thesis, School of Computer 
Science, Carnegie Mellon University, Pittsburgh. 
Sun, S.W. (1991), "A Contextual Postprocessing 
for Optical Chinese Character Recognition", in 
Proc.lnt. Sym. on Circuits and Systems, pp. 2641- 
2644. 
Vapnik, V. (1995) The Nature of Statistical 
Learning Theory, Springer-Verlag, New York. 
Van Rijsbergen (1979) Information Retrieval, 
Butterworths, London. 
Wagner, R.A. and M.J. Fisher (1974) "The string 
to string correction problem", J. ACM, 21:1, pp. 
168-173. 
Ward, W. and S. lssar (1996) "A class based 
 model for speech recognition", Proc. 
1EEE 1CASSP 96, Vol. 1, pp.416-418. 
Wong, P-K. and C. Chan (1999) "Postprocessing 
statistical  models for handwritten 
Chinese character recognizer", IEEE Trans. SMC, 
Part B, 29:2, 286-291. 
Xia, Y., S. Ma, M. Sun, X. Zhu, Y. Jin and X. 
Chang (1996) "Automatic post-processing of off- 
line handwritten Chinese text recognition", Proc. 
ICCC, pp. 413-416. 
Yang, K-C., T-H. Ho, L-F. Chien and L-S. Lee 
(1998) "Statistics-based segment pattern lexicon - 
a new direction for Chinese  modeling", 
Proc. 1EEE ICASSP 98, Vol. 1., pp. 169-172. 
