In: Proceedings of CoNLL-2000 and LLL-2000, pages 55-60, Lisbon, Portugal, 2000. 
Learning Distributed Linguistic Classes 
Stephan Raaijmakers 
Netherlands Organisation for Applied Scientific Research (TNO) 
Institute for Applied Physics 
Delft 
The Netherlands 
raaijmakers@tpd.tno.nl
Abstract 
Error-correcting output codes (ECOC) have 
emerged in machine learning as a success- 
ful implementation of the idea of distributed 
classes. Monadic class symbols are replaced 
by bit strings, which are learned by an ensem- 
ble of binary-valued classifiers (dichotomizers). 
In this study, the idea of ECOC is applied to
memory-based language learning with local
(k-nearest neighbor) classifiers. Regression
analysis of the experimental results reveals
that, in order for ECOC to be successful for
language learning, the use of the Modified Value
Difference Metric (MVDM) is an important factor,
which is explained in terms of population
density of the class hyperspace.
1 Introduction 
Supervised learning methods applied to natural
language classification tasks commonly operate
on high-level symbolic representations,
with linguistic classes that are usually monadic, 
without internal structure (Daelemans et al., 
1996; Cardie et al., 1999; Roth, 1998). This 
contrasts with the distributed class encoding 
commonly found in neural networks (Schmid, 
1994). Error-correcting output codes (ECOC) 
have been introduced to machine learning as 
a principled and successful approach to dis- 
tributed class encoding (Dietterich and Bakiri, 
1995; Ricci and Aha, 1997; Berger, 1999). With 
ECOC, monadic classes are replaced by code- 
words, i.e. binary-valued vectors. An ensem- 
ble of separate classifiers (dichotomizers) must 
be trained to learn the binary subclassifications 
for every instance in the training set. During 
classification, the bit predictions of the vari- 
ous dichotomizers are combined to produce a 
codeword prediction. The class codeword which 
has minimal Hamming distance to the predicted 
codeword determines the classification of the in- 
stance. Codewords are constructed such that 
their Hamming distance is maximal. Extra bits 
are added to allow for error recovery, allowing 
the correct class to be determinable even if some 
bits are wrong. An error-correcting output code
for a k-class problem constitutes a matrix with
k rows and $2^{k-1}-1$ columns. Rows are the
codewords corresponding to classes, and columns are
binary subclassifications or bit functions $f_i$ such
that, for an instance e and its codeword vector c,

$f_i(e) = \pi_i(c)$   (1)

($\pi_i(v)$ is the i-th coordinate of vector v). If
the minimum Hamming distance between every pair of
codewords is d, then the code has an error-correcting
capability of $\lfloor (d-1)/2 \rfloor$ bits. Figure 1 shows
the 5 x 15 ECOC matrix for a 5-class problem.
In this code, every codeword has a Hamming
distance of at least 8 to the other codewords,
so this code has an error-correcting capability
of 3 bits.

011000100000001
01101001011101
10100011101011
11001110110001
11110100011011
Figure 1: ECOC for a five-class problem.

ECOC have two natural interpretations.
From an information-theoretic perspective,
classification with ECOC is like channel
coding (Shannon, 1948): the class of a pattern 
to be classified is a datum sent over a noisy com- 
munication channel. The communication chan- 
nel consists of the trained classifier. The noise 
consists of the bias (systematic error) and
variance (training set-dependent error) of the
classifier, which together make up the overall error
of the classifier. The received message must be 
decoded before it can be interpreted as a classi- 
fication. Adding redundancy to a signal before 
transmission is a well-known technique in digi- 
tal communication to allow for the recovery of 
errors due to noise in the channel, and this is 
the key to the success of ECOC. From a machine
learning perspective, every bit function of an
error-correcting output code partitions the instances
in the training set into two disjoint subclasses,
0 and 1. This can be interpreted as learning a set
of class boundaries. To illustrate this, consider 
the following binary code for a three-class problem.
(This actually is a one-of-c code with no
error-correcting capability: the minimal Hamming
distance between the codewords is 2, so
$\lfloor (2-1)/2 \rfloor = 0$ bit errors can be corrected. As
such it is an error-correcting code with lowest
error correction, but it serves to illustrate the
point.)

      f1  f2  f3
C1    0   0   1
C2    0   1   0      (2)
C3    1   0   0
For every combination of classes (C1-C2, C1- 
C3, C2-C3), the Hamming distance between the 
codewords is 2. These horizontal relations have 
vertical repercussions as well: for every such 
pair, two bit functions disagree in the classes 
they select. For C1-C2, f2 selects C2 and f3
selects C1. For C1-C3, f1 selects C3 and f3 selects
C1. Finally, for C2-C3, f1 selects C3 and f2
selects C2. So, every class is selected two times,
and this implies that every class boundary asso- 
ciated with that class in the feature hyperspace 
is learned twice. In general (Kong and Diet- 
terich, 1995), if the minimal Hamming distance 
between the codewords of an (error-correcting) 
code is d, then every class boundary is learned d 
times. For the error-correcting code from above 
this implies an error correction of zero: only two 
votes support a class boundary, and no vote can 
be favored in case of a conflict. The decoding 
of the predicted bit string to a class symbol ap- 
pears to be a form of voting over class bound- 
aries (Kong and Dietterich, 1995), and is able to 
reduce both bias and variance of the classifier. 
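
The decoding scheme described in this section can be made concrete with a small sketch. It builds an exhaustive code following the construction of Dietterich and Bakiri (1995), which is an assumption made here for illustration and is not the matrix of figure 1, and it decodes a corrupted bit string by minimal Hamming distance. All function names are hypothetical.

```python
from itertools import combinations

def exhaustive_code(k):
    """Exhaustive code: k rows (codewords), 2**(k-1) - 1 columns (bit functions)."""
    n = 2 ** (k - 1) - 1
    rows = [[1] * n]                      # row 1 is all ones
    for i in range(2, k + 1):
        run = 2 ** (k - i)                # alternating runs of zeros and ones
        rows.append([(j // run) % 2 for j in range(n)])
    return rows

def hamming(u, v):
    """Number of positions in which two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def min_distance(code):
    """Minimal Hamming distance over all pairs of codewords."""
    return min(hamming(u, v) for u, v in combinations(code, 2))

def decode(code, predicted):
    """Index of the class whose codeword is nearest to the predicted bits."""
    return min(range(len(code)), key=lambda c: hamming(code[c], predicted))

code = exhaustive_code(5)
d = min_distance(code)
capability = (d - 1) // 2                 # number of correctable bit errors

# Flip three bits of the codeword of class 2 and decode the result.
noisy = list(code[2])
for j in (0, 5, 11):
    noisy[j] ^= 1
recovered = decode(code, noisy)
```

For k = 5 this construction gives 15 columns and a minimal Hamming distance of 8, so a predicted bit string with up to 3 wrong bits still decodes to the intended class.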
2 Dichotomizer Ensembles 
Dichotomizer ensembles must be diverse as well
as accurate. Diversity is necessary in order
to decorrelate the predictions of the various di- 
chotomizers. This is a consequence of the voting 
mechanism underlying ECOC, where bit func- 
tions can only outvote other bit functions if they 
do not make similar predictions. Selecting dif- 
ferent features per dichotomizer was proposed 
for this purpose (Ricci and Aha, 1997). An- 
other possibility is to add limited non-locality to 
a local classifier, since classifiers that use global 
information such as class probabilities during 
classification are much less vulnerable to
correlated predictions. The following ideas were
tested empirically on a suite of natural language
learning tasks.
• A careful feature selection approach, where 
every dichotomizer is trained to select (pos- 
sibly) different features. 
• A careless feature selection approach, 
where every bit is predicted by a voting 
committee of dichotomizers, each of which 
randomly selects features (akin in spirit to
the Multiple Feature Subsets approach for
non-distributed classifiers (Bay, 1999)).
• A careless feature selection approach, 
where blocks of two adjacent bits are pre- 
dicted by a voting committee of quadro- 
tomizers, each of which randomly selects 
features. Learning blocks of two bits al- 
lows for bit codes that are twice as long 
(larger error-correction), but with half as 
many classifiers. Assuming that errors and bit
values are uniformly distributed in every 2-bit
block, there is a 25% chance that both
bits in a 2-bit block are wrong. In the remaining
75% of cases at most one bit is wrong, which would
produce performance equal to voting per bit.
Formally, this implies a switch from N two- 
class problems to N/2 four-class problems, 
where separate regions of the class land- 
scape are learned jointly. 
• Adding non-locality to 1-3 in the form of 
larger values for k. 
• The use of the Modified Value Difference 
Metric, which alters the distribution of in- 
stances over the hyperspace of features, 
yielding different class boundaries. 
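
The switch from N two-class problems to N/2 four-class problems in the third approach can be sketched as follows. The helper names are hypothetical, and the enumeration simply assumes that the four error patterns of a 2-bit block are equally likely.

```python
from itertools import product

def to_blocks(bits):
    """Merge adjacent bit pairs into four-class block labels 0..3."""
    if len(bits) % 2:
        raise ValueError("codeword length must be even")
    return [2 * bits[i] + bits[i + 1] for i in range(0, len(bits), 2)]

def from_blocks(blocks):
    """Expand block labels back into the original bit sequence."""
    bits = []
    for b in blocks:
        bits.extend(divmod(b, 2))         # (first bit, second bit)
    return bits

codeword = [0, 1, 1, 0, 1, 1]             # toy 6-bit codeword
blocks = to_blocks(codeword)              # three four-class targets
restored = from_blocks(blocks)

# Error patterns of one block: (first bit wrong?, second bit wrong?)
patterns = list(product([False, True], repeat=2))
p_both_wrong = sum(a and b for a, b in patterns) / len(patterns)
p_at_most_one = 1 - p_both_wrong
```

A quadrotomizer trained on the block labels replaces two dichotomizers, which is why a code of twice the length can be learned with the same number of classifiers.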
3 Memory-based learning 
The memory-based learning paradigm views 
cognitive processing as reasoning by analogy. 
Cognitive classification tasks are carried out by 
matching data to be classified with classified 
data stored in a knowledge base. This latter 
data set is called the training data, and its ele- 
ments are called instances. Every instance con- 
sists of a feature-value vector and a class label. 
Learning under the memory-based paradigm is 
lazy, and consists only of storing the training 
instances in a suitable data structure. The
instance from the training set which most
resembles the item to be classified determines
the classification of the latter. This instance is
called the nearest neighbor, and models based 
on this approach to analogy are called nearest 
neighbor models (Duda and Hart, 1973). So- 
called k-nearest neighbor models select a winner 
from the k nearest neighbors, where k is a pa- 
rameter and winner selection is usually based on 
class frequency. Resemblance between instances 
is measured using distance metrics, which come 
in many sorts. The simplest distance metric is 
the overlap metric: 
$\Delta(I_i, I_j) = \sum_k \delta(\pi_k(I_i), \pi_k(I_j))$   (3)

$\delta(v_i, v_j) = 0$ if $v_i = v_j$
$\delta(v_i, v_j) = 1$ if $v_i \neq v_j$

($\pi_k(I)$ is the k-th projection of the feature
vector I.) Another distance metric is the
Modified Value Difference Metric (MVDM) (Cost
and Salzberg, 1993). The MVDM defines sim- 
ilarity between two feature values in terms of 
posterior probabilities: 
$\delta(v_i, v_j) = \sum_{c \in Classes} |P(c \mid v_i) - P(c \mid v_j)|$   (4)
When two values share more classes, they are
more similar, as $\delta$ decreases. Memory-based
learning has fruitfully been applied to natural
language processing, yielding state-of-the-art
performance on all levels of linguistic analysis,
including grapheme-to-phoneme conversion
(van den Bosch and Daelemans, 1993), PoS- 
tagging (Daelemans et al., 1996), and shallow 
parsing (Cardie et al., 1999). In this study, 
the following memory-based models are used, 
all available from the TIMBL package (Daelemans
et al., 1999). IB1-IG is a k-nearest distance
classifier which employs a weighted overlap
metric:

$\Delta(I_i, I_j) = \sum_k w_k \delta(\pi_k(I_i), \pi_k(I_j))$   (5)
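
The overlap metric, the MVDM and the weighted instance distance can be illustrated with a minimal sketch. The toy feature values and class labels below are invented for the example, the function names are hypothetical, and the posteriors are plain relative frequencies; this is not the TIMBL implementation.

```python
from collections import Counter, defaultdict

def overlap(v1, v2):
    """Overlap metric: 0 for identical values, 1 otherwise."""
    return 0.0 if v1 == v2 else 1.0

def posteriors(values, labels):
    """Estimate P(c | v) for one feature by relative frequency."""
    counts = defaultdict(Counter)
    for v, c in zip(values, labels):
        counts[v][c] += 1
    return {v: {c: n / sum(cs.values()) for c, n in cs.items()}
            for v, cs in counts.items()}

def mvdm(v1, v2, post, classes):
    """MVDM: summed absolute difference of the class posteriors."""
    p1, p2 = post.get(v1, {}), post.get(v2, {})
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in classes)

def distance(x, y, weights, delta):
    """Weighted sum over features k of delta(k, x_k, y_k)."""
    return sum(w * delta(k, a, b)
               for k, (w, a, b) in enumerate(zip(weights, x, y)))

# Toy training data: two symbolic features per instance (invented).
train = [("A", "k"), ("A", "x"), ("a", "t"), ("o", "k"), ("e", "t")]
labels = ["je", "je", "tje", "je", "tje"]
classes = sorted(set(labels))
post = [posteriors(col, labels) for col in zip(*train)]

x, y = ("A", "k"), ("A", "x")
weights = [1.0, 1.0]                      # uniform weights for simplicity
d_overlap = distance(x, y, weights, lambda k, a, b: overlap(a, b))
d_mvdm = distance(x, y, weights, lambda k, a, b: mvdm(a, b, post[k], classes))
```

In this toy set the values k and x always occur with the same class, so their MVDM distance is 0 while the overlap metric still counts a full mismatch.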
Instead of drawing winners from the k-nearest
neighbors pool, IB1-IG selects from the pool of
instances found at the k nearest distances. Features are
separately weighted based on Quinlan's information
gain ratio (Quinlan, 1993), which measures
the informativity of features for predicting
class labels. This can be computed by subtracting
the class entropy that remains once the feature
values are known from the general entropy of the
class labels. The first quantity is normalized with the a
priori probabilities of the various feature values
of feature F:
$H(C) - \sum_{v \in Values(F)} P(v) \times H(C_{[F=v]})$   (6)

Here, H(C) is the class entropy, defined as

$H(C) = -\sum_{c \in Class} P(c) \log_2 P(c)$.   (7)

$H(C_{[F=v]})$ is the class entropy computed over
the subset of instances that have v as value for
F. Normalization for features with many values
is obtained by dividing the information gain for
a feature by the entropy of its value set (called
the split info of feature $F_i$):

$w_i = \frac{H(C) - \sum_{v \in Values(F_i)} P(v) \times H(C_{[F_i=v]})}{split\_info(F_i)}$

$split\_info(F_i) = -\sum_{v \in Values(F_i)} P(v) \log_2 P(v)$   (8)
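
As a rough illustration of the gain ratio weighting described above, the weight of a single feature can be computed as sketched below. The function names and the toy data are assumptions made for the example, and probabilities are estimated by relative frequency.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of one feature divided by its split info."""
    n = len(values)
    remainder = 0.0
    for v, count in Counter(values).items():
        subset = [c for x, c in zip(values, labels) if x == v]
        remainder += (count / n) * entropy(subset)    # P(v) * H(C | F = v)
    split_info = entropy(values)
    if split_info == 0:                               # feature with a single value
        return 0.0
    return (entropy(labels) - remainder) / split_info

labels = ["x", "x", "y", "y"]
informative = gain_ratio(["a", "a", "b", "b"], labels)     # value determines class
uninformative = gain_ratio(["a", "b", "a", "b"], labels)   # value says nothing
```

A feature whose values fully determine the class receives weight 1 in this toy case, and a feature that is independent of the class receives weight 0.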
IGTREE is a heuristic approximation of IB1- 
IG which has comparable accuracy, but is op- 
timized for speed. It is insensitive to k-values 
larger than 1, and uses value-class cooccurrence 
information when exact matches fail. 
4 Experiments 
The effects of a distributed class representa- 
tion on generalization accuracy were measured 
using an experimental matrix based on 5 lin- 
guistic datasets, and 8 experimental condi- 
tions, addressing feature selection-based ECOC 
vs. voting-based ECOC, MVDM, values of 
k larger than 1, and dichotomizer weight- 
ing. The following linguistic tasks were used. 
DIMIN is a Dutch diminutive formation task de- 
rived from the Celex lexical database for Dutch 
(Baayen et al., 1993). It predicts Dutch nomi- 
nal diminutive suffixes from phonetic properties 
(phonemes and stress markers) of maximally the 
last three syllables of the noun. The STRESS 
task, also derived from the Dutch Celex lexical
database, assigns primary stress on the basis
of phonemic values. MORPH assigns morphological
boundaries (among others root morpheme,
stress-changing affix, inflectional morpheme), 
based on English CELEX data. The WSJ- 
NPVP task deals with NP-VP chunking of PoS- 
tagged Wall Street Journal material. GRAPHON, 
finally, is a grapheme-to-phoneme conversion 
task for English based on the English Celex lex- 
ical database. Numeric characteristics of the 
different tasks are listed in table 1. All tasks 
with the exception of GRAPHON happened to 
be five-class problems; for GRAPHON, a five- 
class subset was taken from the original training 
set, in order to keep computational demands 
manageable. The tasks were subjected to the
8 different experimental situations of table 2.

Data set   Features  Classes  Instances
DIMIN      12        5        3,000
STRESS     12        5        3,000
MORPH      9         5        300,000
NPVP       8         5        200,000
GRAPHON    7         5        73,525
Table 1: Data sets.

For feature selection-based ECOC, backward se- 
quential feature elimination was used (Raaij- 
makers, 1999), repeatedly eliminating features 
in turn and evaluating each elimination step 
with 10-fold cross-validation. For dichotomizer 
weighting, error information of the dichotomiz- 
ers, determined from separate unweighted 10- 
fold cross-validation experiments on a separate 
training set, produced a weighted Hamming
distance metric. Error-based weights were obtained
by raising a small constant β in the interval
[0, 1) to the power of the number of errors made
by the dichotomizer (Cesa-Bianchi et al., 1996).
Random feature selection drawing features with 
replacement created feature sets of both differ- 
ent size and composition for every dichotomizer. 
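
The dichotomizer weighting described above can be sketched as follows. The constant is called beta here, the error counts are invented, and the way the weights enter the Hamming distance (summing the weights of disagreeing bits) is an assumption made for this illustration.

```python
def error_weights(errors, beta=0.5):
    """Weight per dichotomizer: beta raised to its number of errors."""
    if not 0 <= beta < 1:
        raise ValueError("beta must lie in [0, 1)")
    return [beta ** e for e in errors]

def weighted_hamming(u, v, weights):
    """Sum the weights of the positions where u and v disagree."""
    return sum(w for a, b, w in zip(u, v, weights) if a != b)

def decode(code, predicted, weights):
    """Index of the codeword with minimal weighted Hamming distance."""
    return min(range(len(code)),
               key=lambda c: weighted_hamming(code[c], predicted, weights))

code = [[0, 0, 0, 0], [1, 1, 1, 1]]       # toy two-class code
predicted = [1, 1, 0, 0]                  # bit predictions of four dichotomizers
errors = [0, 0, 3, 3]                     # hypothetical cross-validation error counts

unweighted = decode(code, predicted, [1.0] * 4)              # tie, first class wins
weighted = decode(code, predicted, error_weights(errors))    # reliable bits dominate
```

With uniform weights the two codewords are equally distant from the prediction; once the error-prone dichotomizers are down-weighted, the bits predicted by the reliable ones decide the classification.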
5 Results 
Table 3 lists the generalization accuracies for 
the control groups, and table 4 for the ECOC 
algorithms. All accuracy results are based on 
10-fold cross-validation, with p < 0.05 using 
paired t-tests.

ALGORITHM  DESCRIPTION
E1         ECOC, feature selection per bit (15), k=1, unweighted
E2         ECOC, feature selection per bit (15), k=1, weighted
E3         ECOC, feature selection per bit (15), MVDM, k=1, unweighted
E4         ECOC, feature selection per bit (15), MVDM, k=1, weighted
E5         ECOC, feature selection per bit (15), MVDM, k=3, unweighted
E6         ECOC, feature selection per bit (15), MVDM, k=3, weighted
E7         ECOC, voting (100) per bit (30), MVDM, k=3
E8         ECOC, voting (100) per bit block (15), MVDM, k=3
Table 2: Algorithms

GROUP      I          II         III        IV
           IB1-IG     IB1-IG     IB1-IG     IB1-IG
           k=1        k=3        k=1        k=3
                                 MVDM       MVDM
DIMIN      98.1±0.5   95.8±0.5   97.7±0.7   98.1±0.5
STRESS     83.5±2.6   81.3±2.9   86.2±2.0   86.7±1.8
MORPH      92.5±1.4   92.0±1.4   92.5±1.4   92.5±1.4
NPVP       96.4±0.2   97.1±0.2   97.0±0.1   97.0±0.1
GRAPHON    97.1±2.4   97.2±2.3   97.7±0.7   97.7±0.8
Table 3: Generalization accuracies control groups.

The results show that distributed class
representations can lead to statistically
significant accuracy gains for a variety
of linguistic tasks. The ECOC algorithm based 
on feature selection and weighted Hamming dis- 
tance performs best. Voting-based ECOC per- 
forms poorly on DIMIN and STRESS with vot- 
ing per bit, but significant accuracy gains are 
achieved by voting per block, putting it on a par 
with the best performing algorithm. Regression 
analysis was applied to investigate the effect of 
the Modified Value Difference Metric on ECOC 
accuracy. First, the accuracy gain of MVDM 
as a function of the information gain ratio of 
the features was computed. The results show a 
high correlation (0.82, significant at p < 0.05) 
between these variables, indicating a linear re- 
lation. This is in line with the idea underlying 
MVDM: whenever two feature values are very 
predictive of a shared class, they contribute to 
the similarity between the instances they belong 
to, which will lead to more accurate classifiers. 
Next, regression analysis was applied to deter- 
mine the effect of MVDM on ECOC, by relating 
the accuracy gain of MVDM (k=3) compared to 
control group II to the accuracy gain of ECOC
(algorithm E6, compared to control group IV).
The correlation between these two variables is
very high (0.93, significant at p < 0.05), again
indicative of a linear relation.

TASK     E1 (I)      E2 (I)      E3 (III)    E4 (III)    E5 (IV)     E6 (IV)     E7 (IV)     E8 (E6)
DIMIN    98.6±0.4 √  98.5±0.4 √  98.6±0.6 √  98.7±0.6 √  98.8±0.5 √  98.9±0.4 √  96.6±0.9 ×  98.4±0.4
STRESS   85.3±1.S √  86.3±2.0 √  88.2±1.7 √  88.8±1.7 √  88.2±1.7 √  89.3±1.9 √  86.5±2.3 ×  88.8±1.7
MORPH    93.2±1.6 √  93.2±1.5 √  93.2±1.3 √  93.2±1.3 √  93.2±1.6 √  93.2±1.5 √  93.0±1.6 √  93.4±1.5 √
NPVP †   96.8±0.1 √  96.9±0.2 √  96.8±0.1    96.9±0.1    96.8±0.1    96.9±0.1    96.8±0.2 ×  96.8±0.2
GRAPHON  98.2±0.7    98.3±0.7    98.4±0.6 √  98.3±0.5 √  98.3±0.6 √  98.5±0.5 √  97.6±0.7 ×  97.6±0.8 ×
Table 4: Generalization accuracies for feature selection-based ECOC (√ indicates significant improvement over
control group (in round brackets), and × deterioration at p < 0.05 using paired t-tests). A † indicates 25 voters for
performance reasons.

From the perspective of learning class
boundaries, the strong
effect of MVDM on ECOC accuracy can be un- 
derstood as follows. When the overlap metric is 
used, members of a training set belonging to the 
same class may be situated arbitrarily remote 
from each other in the feature hyperspace. For 
instance, consider the following two instances 
taken from DIMIN: 
-,-,-,-,-,-,-,-,-,d,A,k,je
-,-,-,-,-,-,-,-,-,d,A,x,je
(Hyphens indicate absence of feature values.)
These two instances encode the diminutive
formation of Dutch dakje (little roof) from dak
(roof), and dagje (lit. little day, proverbially
used) from dag (day). Here, the values k and x,
corresponding to the velar stop 'k' and the ve- 
lar fricative 'g', are minimally different from a 
phonetic perspective. Yet, these two instances 
have coordinates on the twelfth dimension of 
the feature hyperspace that have nothing to do 
with each other. The overlap metric treats the k-x
value clash just like any other value clash. This 
phenomenon may lead to a situation where in- 
habitants of the same class are scattered over 
the feature hyperspace. In contrast, a value dif- 
ference metric like MVDM which attempts to 
group feature values on the basis of class cooc- 
currence information, might group k and x to- 
gether if they share enough classes. The effect 
of MVDM on the density of the feature hyper- 
space can be compared with the density ob- 
tained with the overlap metric as follows. First, 
plot a random numerical transform of a feature 
space. For expository reasons, it is adequate 
to restrict attention to a low-dimensional (e.g. 
two-dimensional) subset of the feature space, for 
a specific class C. Then, plot an MVDM transform
of this feature space, where every coordinate
(a, b) is transformed into (P(C|a), P(C|b)).
This idea is applied to a subset of DIMIN,
consisting of all instances classified as je (one
of the five diminutive suffixes for Dutch). The 
features for this subset were limited to the last 
two, consisting of the rhyme and coda of the 
last syllable of the word, clearly the most infor- 
mative features for this task. Figure 2 displays 
the two scatter plots. As can be seen, instances 
are widely scattered over the feature space for 
the numerical transform, whereas the MVDM- 
based transform forms many clusters and
produces much higher density.

[Figure 2: two scatter plots of Feature 11 for DIMIN; left panel: DIMIN (random), right panel: DIMIN (MVDM).]
Figure 2: Random numerical transform of feature values
based on the overlap metric (left) vs. numerical
transform of feature values based on MVDM (right), for
a two-features-one-class subset of DIMIN.

In a condensed feature hyperspace the number of
class boundaries to be learned per bit function
is reduced. For instance, figure 3 displays the class boundaries
for a relatively condensed feature hyperspace, 
where classes form localized populations, and a 
scattered feature hyperspace, with classes dis- 
tributed over non-adjacent regions. The num- 
ber of class boundaries in the scattered feature 
space is much higher, and this will put an addi- 
tional burden on the learning problems consti- 
tuted by the various bit functions. 
[Figure 3: two schematic class-boundary diagrams over features F1 and F2, with classes C1 to C5.]
Figure 3: Condensed feature space (left) vs. scattered
feature space (right).
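
The two transforms compared in figure 2 can be sketched on toy data as follows. The instances, labels and function names are invented for the illustration, and the class posteriors are estimated by relative frequency; the sketch only shows why the MVDM transform tends to pull members of one class together.

```python
import random
from collections import Counter

def class_posterior(values, labels, target):
    """Estimate P(target | v) for every value v of one feature."""
    totals = Counter(values)
    hits = Counter(v for v, c in zip(values, labels) if c == target)
    return {v: hits[v] / totals[v] for v in totals}

def mvdm_transform(instances, labels, target):
    """Map each instance (a, b) of class target to (P(target|a), P(target|b))."""
    post = [class_posterior(col, labels, target) for col in zip(*instances)]
    return [tuple(p[v] for p, v in zip(post, inst))
            for inst, c in zip(instances, labels) if c == target]

def random_transform(instances, labels, target, seed=0):
    """Map each symbolic value to an arbitrary distinct integer per feature."""
    rng = random.Random(seed)
    coders = []
    for col in zip(*instances):
        vals = sorted(set(col))
        nums = list(range(1, len(vals) + 1))
        rng.shuffle(nums)
        coders.append(dict(zip(vals, nums)))
    return [tuple(m[v] for m, v in zip(coders, inst))
            for inst, c in zip(instances, labels) if c == target]

# Toy two-feature instances (invented), with their diminutive suffix classes.
instances = [("A", "k"), ("A", "x"), ("o", "k"), ("a", "t"), ("e", "n")]
labels = ["je", "je", "je", "tje", "tje"]

dense = mvdm_transform(instances, labels, "je")
scattered = random_transform(instances, labels, "je")
```

Under the random numerical coding the three instances of class je occupy three different points, whereas the MVDM transform maps all of them onto the same point.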
6 Conclusions 
The use of error-correcting output codes 
(ECOC) for representing natural language
classes has been empirically validated for a suite 
of linguistic tasks. Results indicate that ECOC 
can be useful for datasets with features with 
high class predictivity. These sets typically tend 
to benefit from the Modified Value Difference 
Metric, which creates a condensed hyperspace 
of features. This in turn leads to a lower num- 
ber of class boundaries to be learned per bit 
function, which simplifies the binary subclas- 
sification tasks. A voting algorithm for learn- 
ing blocks of bits proves as accurate as an ex- 
pensive feature-selecting algorithm. Future re- 
search will address further mechanisms of learn- 
ing complex regions of the class boundary land- 
scape, as well as alternative error-correcting ap- 
proaches to classification. 
Acknowledgements 
Thanks go to Francesco Ricci for assistance in 
generating the error-correcting codes used in 
this paper. David Aha and the members of the 
Induction of Linguistic Knowledge (ILK) Group 
of Tilburg University and Antwerp University 
are thanked for helpful comments and criticism. 

References 
H. Baayen, R. Piepenbrock, and H. van Rijn. 1993. 
The CELEX database on CD-ROM. Linguistic 
Data Consortium. Philadelphia, PA.
S. Bay. 1999. Nearest neighbor classification from 
multiple feature subsets. Intelligent Data Analy- 
sis, 3(3):191-209. 
A. Berger. 1999. Error-correcting output coding 
for text classification. Proceedings of IJCAI'99: 
Workshop on machine learning for information 
filtering. 
C. Cardie, S. Mardis, and D. Pierce. 1999. Com- 
bining error-driven pruning and classification for 
partial parsing. Proceedings of the Sixteenth In- 
ternational Conference on Machine Learning, pp. 
87-96. 
N. Cesa-Bianchi, Y. Freund, D. Helmbold, and 
M. Warmuth. 1996. On-line prediction and con- 
version strategies. Machine Learning 27:71-110. 
S. Cost and S. Salzberg. 1993. A weighted near- 
est neighbor algorithm for learning with symbolic 
features. Machine Learning, 10:57-78.
W. Daelemans, J. Zavrel, P. Berck, and S. Gillis. 
1996. Mbt: A memory-based part of speech tag- 
ger generator. Proceedings of the Fourth Work- 
shop on Very Large Corpora, ACL SIGDAT. 
W. Daelemans, J. Zavrel, K. Van der Sloot, and 
A. Van den Bosch. 1999. TiMBL: Tilburg memory
based learner, version 2.0, reference guide. ILK 
Technical Report - ILK 99-01. Tilburg. 
T. Dietterich and G. Bakiri. 1995. Solving multi- 
class learning problems via error-correcting out- 
put codes. Journal of Artificial Intelligence Re- 
search, 2:263-286. 
R. Duda and P. Hart. 1973. Pattern classification 
and scene analysis. Wiley Press. 
E. Kong and T. Dietterich. 1995. Error-correcting 
output coding corrects bias and variance. Pro- 
ceedings of the 12th International Conference on 
Machine Learning. 
J.R. Quinlan. 1993. C4.5: Programs for Machine 
Learning. Morgan Kaufmann, San Mateo, Ca. 
S. Raaijmakers. 1999. Finding representations for 
memory-based language learning. Proceedings of
CoNLL-1999. 
F. Ricci and D. Aha. 1997. Extending local learners 
with error-correcting output codes. Proceedings of 
the 14th Conference on Machine Learning.
D. Roth. 1998. A learning approach to shallow
parsing. Proceedings EMNLP-WVLC'99.
H. Schmid. 1994. Part-of-speech tagging with
neural networks. Proceedings COLING-94.
C. Shannon. 1948. A mathematical theory of
communication. Bell System Technical Journal, 27:7,
pp. 379-423, 27:10, pp. 623-656.
A. van den Bosch and W. Daelemans. 1993. Data- 
oriented methods for grapheme-to-phoneme con- 
version. Proceedings of the 6th Conference of the 
EACL. 
