CHOOSING A DISTANCE METRIC FOR AUTOMATIC 
WORD CATEGORIZATION 
Emin Erkan Korkmaz    Göktürk Üçoluk
Department of Computer Engineering 
Middle East Technical University 
Ankara-Turkey 
Emails: korkmaz@ceng.metu.edu.tr
ucoluk@ceng.metu.edu.tr
Abstract 
This paper analyzes the functionality of dif- 
ferent distance metrics that can be used in 
a bottom-up unsupervised algorithm for au- 
tomatic word categorization. The proposed 
method uses a modified greedy-type algorithm. 
The formulations of fuzzy theory are also used 
to calculate the degree of membership for the 
elements in the linguistic clusters formed. The 
unigram and the bigram statistics of a corpus 
of about two million words are used. Empirical
comparisons are made to support the discussion
of which type of distance metric is most suitable
for measuring the similarity between linguistic
elements.
1 Introduction 
Statistical natural language processing is a challeng- 
ing area in the field of computational natural lan- 
guage learning. Researchers of this field have an 
approach to language acquisition in which learning 
is visualized as developing a generative, stochastic 
model of language and putting this model into prac- 
tice (Marcken, 1996). 
Automatic word categorization is an important 
field of application in statistical natural language 
processing where the process is unsupervised and is 
carried out by working on n-gram statistics to find 
out the categories of words. Research in this area 
points out that it is possible to determine the structure
of a natural language by examining the regularities
of the statistics of the language (Finch, 1993).
It is possible to construct a bottom-up unsuper- 
vised algorithm for the categorization process. In 
our paper named "A Method for Improving Automatic
Word Categorization" (Korkmaz & Üçoluk,
1997) such a method, using a modified greedy-type
algorithm supported by the notions of fuzzy logic, 
has been proposed. The distance metric used to 
measure the similarities of linguistic elements in this 
research is the Manhattan Metric. This metric is 
based on the absolute difference between the corre- 
sponding values of vector components. The components
of the vectors correspond to bigram statistics
of words for our case. However words from the same 
linguistic category in natural language may have to- 
tally different frequencies. So using a distance met- 
ric based on only the absolute differences may not be 
so suitable for the linguistic categorization process. 
In this paper various distance metrics are analyzed 
with the same algorithm in order to find out the 
most suitable one that could be used for linguistic 
elements. Comparisons are made for the results ob- 
tained using different metrics. 
The organization of this paper is as follows. First 
the related work in the area of word categorization 
is presented in section 2. Then a general description
of the categorization process and our proposed
algorithm is given in section 3, which is followed by
presentation of different distance metrics that can 
be used with the algorithm. In section 5 the results 
of the experiments and the comparisons between the 
metrics are given. We discuss the relevance of the 
results and conclude in the last section. 
2 Related Work 
Usually unigram and bigram statistics are used
for automatic word categorization. There exists re- 
search where bigram statistics are used for the determination
of the weight matrix of a neural network
(Finch, 1992). Also bigrams are used with a greedy
algorithm to form the hierarchical clusters of words
(Brown, 1992). 
Genetic algorithms have also been success- 
fully used for the categorization process (Lankhorst,
1994). Lankhorst uses genetic algorithms to deter- 
mine the members of predetermined classes. The 
drawback of his work is that the number of classes 
is determined previous to run-time and the genetic 
algorithm only searches for the membership of those
classes.

[Emin Erkan Korkmaz and Göktürk Üçoluk (1998) Choosing A Distance Metric for Automatic Word Categorization. In
D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language
Learning, ACL, pp. 111-120.]
McMahon and Smith also use the bigram statistics 
of a corpus to find the hierarchical clusters (McMa- 
hon, 1996). However instead of using a greedy al- 
gorithm they use a top-down approach to form the 
clusters. Firstly the system divides the initial set 
containing all the words to be clustered into two 
parts and then the process continues on these new 
clusters iteratively. 
Statistical NLP methods have also been used together
with other methods of NLP. Wilms (Wilms,
1995) uses corpus based techniques together with 
knowledge-based techniques in order to induce a lex- 
ical sublanguage grammar. Machine Translation is 
another area where knowledge bases and statistics
are integrated. Knight et al. (Knight, 1994) aim to
scale-up grammar-based, knowledge-based MT tech- 
niques by means of statistical methods. 
3 Word Categorization 
The linguist Zipf (Zipf, 1935) was one of the
early researchers in statistical language models. His
work states that 66% of a large English corpus will fall
within the first 2,000 most frequent words. There-
fore, the number of distinct structures needed to find 
an approximation to a large proportion of natural 
language would be small compared to the size of cor- 
pus that could be used. It can be claimed that by 
working on a small set consisting of frequent words, 
it is possible to build a framework for the whole nat- 
ural language. 
N-gram models of language are commonly used 
to build up such a framework. An N-gram model 
can be formed by collecting the probabilities of word
streams (w_i, i = 1..n), where w_i is followed by w_{i+1}.
These probabilities are used to form a model with
which the behavior of the language can be predicted up
to n words. There exists current research that uses 
bigram statistics for word categorization in which 
probabilities of word pairs in the text are collected 
and processed. 
These n-gram models can be used together with 
the concept of mutual information to form the clus- 
ters. Mutual Information is based on the concept
of entropy, which can be defined informally as the
unpredictability of a stochastic experiment. For 
linguistic categorization, mutual information calcu- 
lated would denote the amount of knowledge pre- 
served in the bigram statistics. A detailed explanation
of mutual information and of adapting the
formulations for the automatic word categorization process
can be found in (Lankhorst, 1994).
3.1 Clustering Approach 
When the mutual information is used for cluster- 
ing, the process is carried out somewhat at a macro- 
level. Usually search techniques and tools are used 
together with the mutual information in order to 
form some combinations of different sets, each of 
which is then subject to some validity test. The 
idea used for the validity testing process is as follows. 
Since the mutual information denotes the amount of 
probabilistic knowledge that a word provides on the 
proceeding word, if similar behaving words would 
be collected together into the same cluster, then the 
loss of mutual information would be minimal. So, 
the search is among possible alternatives for sets or 
clusters with the aim to obtain a minimal loss in 
mutual information. 
Though this top-to-bottom method seems the- 
oretically possible, in the presented work (Korkmaz
& Üçoluk, 1997) a different approach, which is
bottom-up, is used. In this incremental approach, 
set prototypes are built and then combined with
other sets or single words to form larger ones. The 
method is based on the similarities or differences be- 
tween single words rather than the mutual informa- 
tion of a whole corpus. In combining words into sets 
a fuzzy set approach is used. 
Using this constructive approach, it is possible to 
visualize the word clustering problem as the problem 
of clustering points in an n-dimensional space if the 
lexicon space to be clustered consists of n words. 
The points which are the words of the corpus are 
positioned on this n-dimensional space according to 
their behavior relative to other words in the lexicon 
space. Each word is placed on the i-th dimension
according to its bigram statistic with the word
representing that dimension, namely w_i. So the degree of
similarity between two words can be defined as hav- 
ing close bigram statistics in the corpus. Words are 
distributed in the n-dimensional space according to 
those bigram statistics. The idea is quite simple: Let 
w_1 and w_2 be two words from the corpus. Let Z be
the stochastic variable ranging over the words to be
clustered. Then if P_X(w_1, Z) is close to P_X(w_2, Z)
and if P_X(Z, w_1) is close to P_X(Z, w_2) for Z ranging
over all the words to be clustered in the corpus,
then we can state a closeness between the words w_1
and w_2. Here P_X is the probability of occurrence
of word pairs: P_X(w_1, Z) is the probability that
w_1 appears as the first element in a word pair and
P_X(Z, w_1) is the reverse probability, where w_1 is the
second element of the word pair. The same holds
for w_2 respectively.
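As an illustration, the bigram statistics and the two profiles P_X(w, Z) and P_X(Z, w) can be collected as follows. This is a minimal sketch; the helper names (`bigram_counts`, `profile`) are ours and not from the paper:

```python
from collections import defaultdict

def bigram_counts(tokens):
    """Count adjacent word pairs (w_i, w_{i+1}) in a token stream."""
    counts = defaultdict(int)
    for first, second in zip(tokens, tokens[1:]):
        counts[(first, second)] += 1
    return counts

def profile(word, vocab, counts):
    """Forward and backward bigram count vectors for `word` over `vocab`.

    Dividing each count by the total number of bigrams would turn these
    into the probabilities P_X(word, Z) and P_X(Z, word) described above.
    """
    forward = [counts[(word, z)] for z in vocab]
    backward = [counts[(z, word)] for z in vocab]
    return forward, backward

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
counts = bigram_counts(tokens)
fwd_sat, bwd_sat = profile("sat", vocab, counts)
```

Two words whose forward and backward profiles are both close would then be judged similar.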
In order to start the clustering process, a distance 
function has to be defined between the elements in 
the space. Assume that the bigram statistics for
word couples are placed in a matrix N, where N_ij
denotes the number of times the word couple (w_i, w_j) is
observed in the corpus. So formulating the similarity
between two linguistic elements would be finding
out the distance between two vectors that can be 
obtained from this matrix. Different distance met- 
rics are proposed for the distance between vectors. 
The usage of a distance metric forms the main dis- 
cussion point of this paper. In the next section the
algorithm used for categorization will be presented
and in section 4 these metrics and their usage for 
linguistic categorization will be discussed. 
3.2 The Algorithm for Categorization 
Having a distance function, it is possible to start 
the clustering process. The first idea that can be 
used is to form a greedy algorithm to start form- 
ing the hierarchy of word clusters. If the lexicon 
space to be clustered consists of {w_1, w_2, ..., w_n},
then the first element w_1 is taken from the lexicon
space and a cluster is formed with this word and its
nearest neighbor or neighbors. The lexicon space then
becomes {(w_1, w_s1, ..., w_sk), w_i, ..., w_n}, where
(w_1, w_s1, ..., w_sk) is the first cluster formed. The process
is repeated with the first element in the list
that does not yet belong to any set (w_i in our
case) and iterates until no such word is
left. The sets formed will be the clusters at the bot- 
tom of the cluster hierarchy. Then, to determine the
behavior of a set, the frequencies of its elements are
added and the previous process is this time carried
out on the sets rather than on single words, until the
cluster hierarchy is formed; the algorithm stops when
a single set is formed that contains all the words in
the lexicon space.
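The plain greedy pass described above can be sketched as follows. This is a simplified illustration of our own (each word is paired with a single nearest unclustered neighbour; the function name and the numeric stand-ins for words are assumptions):

```python
def greedy_pass(elements, dist):
    """One bottom-up pass: scanning left to right, each still-unclustered
    element is grouped with its nearest neighbour among the elements that
    are not yet in any cluster."""
    clusters = []
    used = set()
    for i, e in enumerate(elements):
        if i in used:
            continue
        # candidates: indices not yet captured in a cluster
        candidates = [k for k in range(len(elements)) if k != i and k not in used]
        if not candidates:
            clusters.append([e])
            continue
        j = min(candidates, key=lambda k: dist(e, elements[k]))
        used.update({i, j})
        clusters.append([e, elements[j]])
    return clusters

# Four "words" as points on a line, mirroring the situation of Figure 1.
clusters = greedy_pass([0, 3, 4, 8], lambda a, b: abs(a - b))
```

With these elements the pass yields [[0, 3], [4, 8]] even though 3 and 4 are the closest pair, reproducing the non-optimal behaviour the next paragraphs discuss.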
In the early stages of this research such a greedy 
method was used to form the clusters. However, 
though some clusters at the low levels of the tree 
seemed to be correctly formed, as the number of 
elements in a cluster increased towards the higher 
levels, the clustering results became unsatisfactory. 
Two main factors were observed as the reasons for 
the unsatisfactory results. 
These were: 
• Shortcomings of the greedy-type algorithm.
• Inadequacy of the method used to obtain the set
behavior from the properties of its elements.
The greedy method results in a non-optimal clustering
in the initial level. To make this point clearer
consider the following example: Let us assume that 
four words w_1, w_2, w_3 and w_4 are forming the lexicon
Figure 1: Example of the clustering problem of the greedy
algorithm in a lexicon space with four different words. Note that
d_{w2,w3} is the smallest distance in the distribution. However, since
w_1 is taken into consideration first, it forms set1 with its nearest
neighbor w_2, and w_3 combines with w_4 to form set2, although w_2 is
nearer. The expected third set is not formed.
space. Furthermore, let the distances between these 
words be defined as d_{wi,wj}. Then consider the distribution
in Figure 1. If the greedy method first tries to
cluster w_1, then it will be clustered with w_2, since
the smallest d_{w1,wi} value is d_{w1,w2}. So the second
word will be captured in the set and the algorithm
will continue the clustering process with w_3. At this
point, though w_3 is closest to w_2, w_2 is already captured
in a set, and since w_3 is closer to w_4 than to the center of
this set, a new cluster will be formed with members
w_3 and w_4. However, as can be seen visually
from Figure 1, the first optimal cluster to be formed
among these four words is the set {w_2, w_3}.
The second problem causing unsatisfactory clustering
occurs after the initial sets are formed. According
to the algorithm, the clusters behave exactly
like other single words and participate in the cluster- 
ing just as single words do. However to continue the 
process, the bigram statistics of the clusters should 
be determined. This means that the distance be- 
tween the cluster and all the other elements in the 
search space have to be calculated. One easy way 
to determine this behavior is to find the average of 
the statistics of all the elements in a cluster. This 
method has its drawbacks. If the corpus used for the 
process is not large, the proximity problem becomes 
severe. On the other hand, the linguistic role of a
word may vary with context in different sentences.
Many words are used as a noun, an adjective, or fall
into some other linguistic category depending on the
context. It can be claimed that each word initially
shall be placed in a cluster according to its dominant 
role. However to determine the behavior of a set the 
dominant roles of its elements should also be used. 
Somehow the common properties (bigrams) of the 
elements should be always used and the deviations 
of each element should be eliminated in the process. 
Korkmaz and G6kt~rk O~oluk 113 Choosing,4 Distance Metn'c for Word Categorization 
3.2.1 Improving the Greedy Method 
The clustering process is improved to overcome 
the above mentioned drawbacks. To overcome the 
first problem the idea used is to allow words to be 
members of more than one cluster. So after the first 
pass over the lexicon space, intersecting clusters are 
formed. For the lexicon space presented in Figure 
1 with four words, the expected third set will also
be formed. As the second step these intersecting
sets are combined into a single set. Then the closest 
two words in each combined set (according to the 
distance function) are found and these two closest 
words are taken into consideration as the centroid 
for that set. After finding the centroids of all sets, 
the distances between a member and all the cen- 
troids are calculated for all the words in the lexicon 
space. Following this, each word is moved to the set 
where the distance between this member and the set 
center is minimal. This procedure is necessary since 
the initial sets are formed by combining the inter- 
secting sets. When these intersecting sets are com- 
bined the set center of the resulting set might be far 
away from some elements and there may be other 
closer set centers formed by other combinations, so 
a reorganization of membership is appropriate. 
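The first repair step, combining intersecting clusters into single sets, can be sketched as follows. This is a simple sweep of our own devising; the paper does not prescribe a particular implementation:

```python
def merge_intersecting(sets):
    """Repeatedly merge input sets that share at least one element,
    so each returned set is disjoint from the others."""
    merged = []
    for s in map(set, sets):
        # collect already-merged sets that overlap the incoming one
        overlapping = [m for m in merged if m & s]
        for m in overlapping:
            s |= m
            merged.remove(m)
        merged.append(s)
    return merged

result = merge_intersecting([{"w1", "w2"}, {"w2", "w3"}, {"w4", "w5"}])
```

After this merge, the closest pair inside each combined set would serve as its centroid and members would be reassigned to their nearest centroid, as described above.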
3.2.2 Fuzzy Membership 
As presented in the previous section the cluster- 
ing process builds up a cluster hierarchy. In the first 
step, words are combined to form the initial clusters, 
then those clusters become members of the process 
themselves. To combine clusters into new ones their
statistical behavior should be determined. The sta- 
tistical behavior of a cluster is related to the bigrams 
of its members. In order to find out the dominant 
statistical role of each cluster the notion of fuzzy 
membership is used. 
The problem that each word can belong to more 
than one linguistic category brings up the idea that 
the sets of word clusters cannot have crisp border 
lines and even if a word seems to be in a set due 
to its dominant linguistic role in the corpus, it can 
have a degree of membership to the other clusters 
in the search space. Therefore the concept of fuzzy 
membership can be used for determining the bigram 
statistics of a cluster. 
Researchers working on fuzzy clustering present 
a framework for defining fuzzy membership of ele- 
ments. Gath and Geva (Gath, 1989) describe such 
an unsupervised optimal fuzzy clustering. They 
present the K-means algorithm based on minimiza- 
tion of an objective function. For the purpose of 
this research only the membership function of the 
algorithm presented is used. The membership func- 
tion u_{ij}, that is, the degree of membership of the i-th
element to the j-th cluster, is defined as:

\[ u_{ij} = \frac{\bigl(1/d^2(X_i,V_j)\bigr)^{1/(q-1)}}{\sum_{k=1}^{K}\bigl(1/d^2(X_i,V_k)\bigr)^{1/(q-1)}} \tag{1} \]

Here X_i denotes an element in the search space and
V_j is the centroid of the j-th cluster. K denotes the
number of clusters, and d^2(X_i, V_j) is the squared distance
of the element X_i to the centroid V_j of the j-th cluster.
The parameter q is the weighting exponent for u_{ij}
and controls the fuzziness of the resulting cluster.
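The membership function of Equation 1 translates directly into code. In this sketch `d2` holds the squared distances of one element to all K centroids; the function name and signature are our own:

```python
def membership(d2, j, q=2.0):
    """Degree of membership u_ij of an element in cluster j (Eq. 1).

    d2 : squared distances d^2(X_i, V_k) from element i to each of the
         K cluster centroids (all assumed non-zero).
    q  : fuzziness exponent (q > 1); larger q gives softer memberships.
    """
    p = 1.0 / (q - 1.0)
    num = (1.0 / d2[j]) ** p
    den = sum((1.0 / dk) ** p for dk in d2)
    return num / den

# An element equidistant from two centroids belongs half to each;
# memberships over all clusters always sum to one.
u = [membership([1.0, 3.0], j) for j in range(2)]
```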
After the degrees of membership of all the ele- 
ments of all classes in the search space are calcu- 
lated, the bigram statistics of the classes are de- 
rived. To find those statistics the following method 
is used: For each subject cluster, the bigram statis- 
tics of each element is multiplied with its mem- 
bership value. This forms the amount of statisti- 
cal knowledge passed from the element to that set. 
So the elements chosen as set centroids will be the
ones that affect a set's statistical behavior the most.
Hence an element away from a centroid will have a
lesser statistical contribution.
4 Distance Metrics 
Various distance metrics have been proposed by 
mathematicians that can be used to formulate the 
similarity between vectors. Four of them are examined
and used for this study. The first one is the
Manhattan Metric which just calculates the absolute 
difference between the values of two vector elements. 
It is defined by: 
\[ D(x,y) = \sum_{1 \le i \le n} |x_i - y_i| \tag{2} \]

Here x = \{x_1, x_2, ..., x_n\} and y = \{y_1, y_2, ..., y_n\}
are two vectors defined over \mathbb{R}^n.
Having such a metric it is possible to define the 
distance function between two linguistic elements. 
The distance function D between two words wl and 
w2 could be defined as follows: 
\[ D(w_1, w_2) = D_1(w_1, w_2) + D_2(w_1, w_2) \tag{3} \]
Here the distance function consists of two differ- 
ent parts D1 and D2. This is because we want the 
distance function to be based on both proceeding 
and preceding words. So the first part denotes the
distance on the proceeding words and the second one
denotes the distance on the preceding words.
If we use the Manhattan metric, the distance function
would be:
\[ D(w_1,w_2) = \sum_{1 \le i \le n} |N_{w_1 i} - N_{w_2 i}| + |N_{i w_1} - N_{i w_2}| \tag{4} \]

Here n is the total number of words to be clustered,
N_{w_1 i} is the number of times the word couple
(w_1, w_i) is observed in the corpus and N_{i w_1} is the
number of times the word couple (w_i, w_1) is observed.
The same holds for the word w_2. This distance
metric just calculates the total difference on
two vector couples obtained from the frequency matrix
N, where the first couple denotes the vectors obtained
from the frequencies of the word couples formed
by w_1, w_2 and their proceeding words. The second
couple denotes the vectors formed by the frequencies
with the preceding words correspondingly.
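A sketch of Eq. (4) over a dense frequency matrix; the function name and the list-of-lists representation of N (with integer word ids as indices) are our own choices:

```python
def manhattan_word_distance(n, w1, w2):
    """Distance of Eq. (4): the L1 difference of the two rows of the
    bigram frequency matrix `n` (counts with proceeding words) plus the
    L1 difference of the two columns (counts with preceding words).
    n[i][j] holds the count of the word pair (w_i, w_j)."""
    size = len(n)
    row_part = sum(abs(n[w1][i] - n[w2][i]) for i in range(size))
    col_part = sum(abs(n[i][w1] - n[i][w2]) for i in range(size))
    return row_part + col_part

# Toy 2-word matrix: word 0 and word 1 differ by one count in each
# row position and each column position, giving distance 2 + 2 = 4.
d = manhattan_word_distance([[1, 0], [0, 1]], 0, 1)
```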
The above formulation explains the structure of 
the distance metric used for the study. For the 
research presented in our previous paper (Korkmaz
& Üçoluk, 1997) the Manhattan Metric was the only
metric used for the distance function. However, others
are proposed for the similarity between vectors.
Another metric is the Euclidean Metric: 
\[ D(x,y) = \sqrt{\sum_{1 \le i \le n} (x_i - y_i)^2} \tag{5} \]
Here x and y are again two vectors defined over
\mathbb{R}^n. The formulation of the angle between two
vectors is also used for this study as a distance metric.
If \theta is the angle between the two vectors x and
y, then \cos\theta is calculated by:

\[ \cos\theta = \frac{x \cdot y}{|x||y|} = \frac{\sum_{1 \le i \le n} x_i y_i}{\left(\sum_{1 \le i \le n} x_i^2\right)^{1/2} \left(\sum_{1 \le i \le n} y_i^2\right)^{1/2}} \tag{6} \]
Here, x \cdot y denotes the scalar product of the two
vectors x and y and |x| denotes the magnitude of
the vector x. Since the components of the vectors 
in our case are corresponding to the frequencies of 
words, they will be non-negative. So the angle be- 
tween the two vectors will be between 0° and 90°.
Since cos 0° is unity and cos 90° is zero, a distance
metric between the two vectors can be defined as: 
\[ D(x,y) = 1 - \cos\theta \tag{7} \]
This distance metric will give us a number from
the closed interval [0,1], 0 denoting that the two
vectors are overlapping and 1 denoting that there
is an angle of 90°, which is the highest difference
between the vectors.
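The angle metric of Eqs. (6) and (7) can be sketched as follows; note that it is insensitive to a uniform scaling of either vector, which is exactly the property motivating its use here:

```python
import math

def angle_distance(x, y):
    """1 - cos(theta) between two non-negative frequency vectors:
    0 when the vectors point the same way, 1 when they are orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm
```

For example, `angle_distance([1, 2], [10, 20])` is 0: the second vector is just ten times the first, so a high-frequency word and a low-frequency word with proportional bigram profiles are judged identical.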
The last distance metric used for the similarity 
function is the Spearman Rank Correlation Coeffi- 
cient. This metric is based on the difference between 
the ranks of two vectors rather than the difference 
between their elements. The metric is defined as: 
\[ D(x,y) = \sum_{1 \le i \le n} (R_i^x - R_i^y)^2 \tag{8} \]
Here x and y are again two vectors as defined
above. R_i^x and R_i^y are the ranks of the corresponding
vectors. The rank is calculated in our case by
normalizing the vectors into the interval [0,1]. The
component with the highest value among the components
of the vector takes the value 1 and, if there
are n elements in the vector, the one with the second
highest value will correspond to the number 1 - (1/n)
and so on. The smallest value will correspond to
zero.
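A sketch of this rank transform and the resulting distance of Eq. 8. The description above leaves the exact rank of the smallest component slightly ambiguous; this sketch spreads the ranks evenly over [0, 1], which preserves the scale-invariance the metric is chosen for:

```python
def rank_vector(x):
    """Rank transform into [0, 1]: the largest component maps to 1,
    the smallest to 0, evenly spaced by sorted order (ties broken by
    position). A slight simplification of the scheme in the text."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i], reverse=True)
    ranks = [0.0] * n
    for k, i in enumerate(order):
        ranks[i] = (n - 1 - k) / (n - 1) if n > 1 else 1.0
    return ranks

def spearman_distance(x, y):
    """Sum of squared rank differences, as in Eq. (8)."""
    rx, ry = rank_vector(x), rank_vector(y)
    return sum((a - b) ** 2 for a, b in zip(rx, ry))
```

Two vectors with the same ordering of components, such as [1, 2, 3] and [10, 20, 30], get distance 0 regardless of their absolute magnitudes.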
For the process of formulating the distance be- 
tween linguistic elements, the main problem appears 
due to the difference between the frequencies of 
words from the same linguistic category. For in- 
stance the word go has a very high frequency in 
natural language corpora compared to many other 
verbs, but still we have to cluster go with low fre- 
quency verbs. However if we use a distance met- 
ric based on only the absolute differences of vectors 
like the Euclidean Metric or Manhattan Metric, the 
distance calculated between high frequency and low 
frequency words would be high, which is undesired. 
Therefore when comparing a high frequency word 
with a low frequency one, we should be able to de- 
termine if the difference is caused by some regular 
magnitude difference. A similarity can exist between 
the corresponding values when this magnitude difference
is discarded. Without a distance function
that compensates for this, it is not possible to
overcome the errors introduced by having different 
frequencies for words from the same linguistic cate- 
gory. This acts as a considerable factor disturbing 
the quality of formed clusters. 
Having this in mind the Spearman Rank Corre- 
lation Coefficient Metric and the Angle Metric are 
used as distance functions. These two seem to discard
the magnitude difference between the components
of the vectors. Such a comparison seems to
be more suitable for evaluating the similarity of lin- 
guistic elements. 
In the Spearman Rank Correlation Coefficient the
vectors are normalized into the closed interval [0,1].
So the vectors are similar if the change from one 
component to the next is similar, regardless of the 
difference in the absolute values. We have a similar 
comparison for the Angle Metric. When this metric 
Test Criteria                        Manhattan   Angle      Euclidean   Spearman Rank   Combined
                                     Metric      Metric     Metric      Correlation     Metric
                                                                        Coefficient
# of initial clusters                60          169        132         185             171
# of elem. in the initial clusters   16.6        5.9        7.56        5.4             5.8
Depth of the tree                    8           -          9           11              11
Location of leaves                   5th and     3rd level  7th and     9th and         9th and
                                     6th levels             8th levels  10th levels     10th levels
# of nodes on the second level       18          39         35          41              37

Table 2: Comparison of cluster hierarchies obtained with different metrics.
disappeared. We were able to get an initial success 
rate of about 90% with the Manhattan Metric when 
we discarded this large faulty cluster. However with 
the other metrics this success rate has been obtained 
for the whole lexicon space.
The second problem encountered in the categorization
process appears while combining the initial
clusters into larger ones. Although it is possible to
obtain some local successful combinations with the 
first metric, the overall performance in combining 
these initial clusters is not so satisfactory. So differ- 
ent metrics presented in section 4 have been tested 
on the algorithm. Unfortunately, although the pro- 
posed metrics were able to overcome the first prob- 
lem of having a large faulty cluster, the progress ob- 
tained in combining initial clusters into larger ones 
was not so significant. This has been the factor trig- 
gering the idea that a metric taking into considera- 
tion both of the approaches for linguistic similarity 
would be more suitable for our case. So the fifth 
metric, the Combined Metric, has been constructed. 
The main progress obtained with this fifth metric is 
on the second problem described. 
In table 2 the hierarchies obtained using differ- 
ent metrics are presented. When the properties 
presented in this table are examined, the hierarchy 
formed by the Manhattan Metric has the minimum 
number of initial clusters. This is due to the large 
faulty cluster formed with this metric. The proper- 
ties of the hierarchies presented in table 2 seem to 
be similar to each other. Only the depth of the tree 
formed with the Angle Metric differs from the other 
ones. This is because more initial clusters are com- 
bined on the second level in the hierarchy obtained 
with this metric. This brings in an increase in the 
number of ill-structured clusters on the second level 
over-combining distinct linguistic categories. 
5.1 Empirical Comparison 
The main progress for the clustering hierarchy is ob- 
tained by the Combined Metric. It seems suitable to 
examine this metric in detail and compare the re- 
sults with the initial organization obtained by the 
Manhattan Metric. 
Some linguistic categories inferred by the algo- 
rithm using the Combined Metric are listed below: 
• professor opposite church hall least present once last 
baby prisoner doctor wind gate village sun country 
• earth forest garden truth river 
• picture case glass 
• captain servant book horse meeting situation circumstances 
summer afternoon evening night morning day future early 
• large new small great very strange certain good fine few 
little 
• slight man's sudden thousand hundred different 
• rich fair secret blue soft cold bright quick frightened sur- 
prised plain clear true greater worse better tall dead living 
wrong 
• notice cry hold touch influence act account form effect care 
• meant ought wanted used enough back began tried turned 
came 
• enter pass follow carry call give bring tell do let forgive 
• impossible possible necessary 
• calm pale warm simple sweet quiet busy hot angry ill 
• aunt uncle sister husband's 
• duty attention desire turning coming close 
• listening ready trying going 
• died fallen drawn learned written gone 
• known taken brought given 
• shoulders neck pocket hat chair shoulder arm mouth 
• person girl lady woman gentleman man fellow else thing 
• affairs age speech action marriage questions ideas looks 
silence society love experience 
• between towards upon against after before like about round 
off away up 
• under into through on at over 
• during near toward beside within around behind gave told 
took 
• shall should may will must would might i 
• won't cannot can can't are didn't don't 
                                    Combined   Manhattan
                                    Metric     Metric
Nouns
  Largest # of words collected      94         111
  Success Rate                      91.5%      94.6%
  # of initial clusters connected   15         6
Verbs (present perfect)
  Largest # of words collected      67         45
  Success Rate                      100%       73.3%
  # of initial clusters connected   12         8
Verbs (past perfect)
  Largest # of words collected      16         -
  Success Rate                      100%       100%
  # of initial clusters connected   5          1
Adjectives
  Largest # of words collected      68         17
  Success Rate                      92.6%      100%
  # of initial clusters connected   7          2
Adverbs
  Largest # of words collected      9          4
  Success Rate                      100%       100%
  # of initial clusters connected   1          1
Auxiliaries
  Largest # of words collected      7          9
  Success Rate                      100%       100%
  # of initial clusters connected   1          1
Determiners
  Largest # of words collected      16         10
  Success Rate                      100%       100%
  # of initial clusters connected   1          1

Table 3: Comparison made between the Combined Metric
and the Manhattan Metric based on the largest number
of elements combined in a cluster.
• anybody everyone nobody everybody everything 
• exactly finding hearing watching all leaving seeing giving 
keeping knowing 
• those these our an a this his their the your my her any no 
some not such its 
The ill-placed members in the clusters above are 
shown using bold font. The above initial clusters 
represent the linguistic categories with a success rate 
of 90.2%. Also the plural nouns in singular noun 
clusters are shown in italics. If we consider those 
placements as faulty ones also, the calculated suc- 
cess rate would fall to 88.1%. This success rate seems 
to be similar to the results obtained with other dis- 
tance metrics. However as explained above the main 
progress obtained with this Combined Metric is on 
the process of combining these initial clusters into 
larger ones in the upper levels of the duster hierar- 
chy. 
Two examples from the cluster hierarchy obtained 
with this metric are given in tables 4 and 5. In table
4, 94 nouns coming from different initial clusters
are combined in the same part of the cluster hier- 
archy. Only one cluster seems to be misplaced in 
this region. This is an adjective cluster. In table 5 
67 different verbs are collected. They are all present
tense verbs and no misplaced word exists in this part
of the hierarchy. This is another well-formed part of 
the cluster organization. It is believed that this is an 
important improvement compared to earlier results, 
since there is an increase in the number of success- 
fully connected initial clusters. 
Table 3 exhibits the improvement obtained using 
the Combined Metric. Maximum number of words 
correctly classified for some linguistic categories are 
shown in this table. Obviously there are other clus- 
ters having elements from the same linguistic cate-
gories in different parts of the hierarchy. This table 
makes a comparison of the maximum numbers of 
words successfully collected in order to analyze the 
improvement obtained. Gathering nouns and auxiliaries
seems to be carried out better with the Man-
hattan Metric. However if we consider the number 
of initial clusters forming these largest ones, a sig- 
nificant progress seems to exist for the Combined
Metric. There is a big difference in these numbers
between the two metrics. For instance, 12 present per-
fect verb classes are combined successfully when the 
Combined Metric is used, but only 8 of them were 
combined with the Manhattan Metric. For adjec- 
tives this is 7 to 2, for past perfect verbs 5 to 1 and 
although number of nouns collected by the Manhat- 
tan Metric is larger, number of initial clusters sue- 
cessfully combined by the Combined Metric is still 
larger. 
It can be claimed that there is significant progress in successfully combining the initial clusters when the new metric is used. This was the main problem encountered with the Manhattan Metric and the other metrics. We attribute this progress to the Combined Metric's attempt to represent both of the approaches that can be taken into account for the similarity of linguistic elements.
6 Discussion and Conclusion
This research has focused on the choice of distance function for an unsupervised, bottom-up algorithm for automatic word categorization. The results obtained seem to show that natural language implicitly preserves the information necessary for the acquisition of its linguistic categories. A convergence of linguistic categories could be obtained using the algorithm we have presented. This result is motivating for further studies on the acquisition of
[Table 4 is a tree diagram not reproducible in plain text: clusters labeled bbb, bbbb, bbbc, bbbbb, bbbbc, bbbbcb, and bbbbcc group nouns such as water, book, horse, family, meeting, situation, children, money, night, morning, day, future, subject, matter, question, world, house, garden, river, and village.]
Table 4: Part of the cluster hierarchy holding nouns
[Table 5 is a tree diagram not reproducible in plain text: clusters labeled bcbcb, bcbcbb, bcbcbc, bcbcbcb, and bcbcbcc group verbs such as take, send, carry, make, call, keep, give, help, meet, tell, do, find, come, talk, go, speak, try, sleep, walk, say, run, turn, drive, and seem.]
Table 5: Part of the cluster hierarchy holding present tense verbs
structures preserved in natural language at various 
abstraction levels. 
Different distance metrics were used with the algorithm. The results obtained with the Combined Metric show that special distance metrics combining different properties of linguistic elements could be developed for linguistic categorization.
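One way such a combination can be realized, sketched here as an illustration rather than the paper's actual formulation, is a weighted mix of a magnitude-sensitive distance (normalized Manhattan, which reacts to bigram frequency differences) and a direction-sensitive one (cosine distance, which reacts only to the shape of the context distribution). The weight `alpha` and the normalization choices below are our own assumptions:

```python
import math

def combined_distance(u, v, alpha=0.5):
    """Hypothetical combined metric over sparse bigram context vectors.

    Mixes normalized Manhattan distance (frequency magnitudes) with
    cosine distance (direction of the context distribution);
    `alpha` balances the two views. Returns a value in [0, 1].
    """
    keys = set(u) | set(v)
    # Magnitude view: L1 difference, scaled by total mass.
    l1 = sum(abs(u.get(k, 0) - v.get(k, 0)) for k in keys)
    total = sum(u.values()) + sum(v.values())
    manhattan_part = l1 / total if total else 0.0
    # Direction view: 1 - cosine similarity.
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    cosine_part = 1.0 - dot / (nu * nv) if nu and nv else 1.0
    return alpha * manhattan_part + (1 - alpha) * cosine_part
```

Identical context vectors score 0, disjoint ones score 1, and words with similar context shapes but different frequencies fall in between, which is the kind of behavior a combined metric aims for.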
Considering the results of the experiments carried out, the following remarks can be made on the linguistic clusters formed in this study. In the initial clusters formed, the success rate obtained is satisfactory. Though it was not possible to combine these initial clusters into exact linguistic categories, the cluster hierarchy obtained with the Combined Metric is encouraging. The faulty placements are mainly due to the very complex structure of natural language. The fact that many words can be used with different linguistic roles in natural language sentences produces deviations in the information given by the bigrams. Using fuzzy logic and a suitable distance metric is a way to decrease these deviations; however, it was not possible to remove them totally.
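The fuzzy treatment of such multi-role words can be illustrated with fuzzy-c-means-style membership degrees: instead of forcing a word into one cluster, each cluster receives a degree of membership that is higher for closer cluster centers and sums to 1. The function below is a generic sketch under that assumption (the fuzzifier `m` and the distance function are parameters we introduce, not values from the paper):

```python
def fuzzy_memberships(word_vec, centroids, distance, m=2.0, eps=1e-9):
    """Fuzzy-c-means-style membership degrees of a word in each cluster.

    Closer centroids receive higher degrees; the degrees sum to 1,
    so a word used in several linguistic roles shares its membership
    across clusters instead of being assigned crisply to one.
    """
    # Clamp distances away from zero to avoid division by zero.
    dists = [max(distance(word_vec, c), eps) for c in centroids]
    # Standard fuzzy-c-means weighting: d ** (-2 / (m - 1)).
    weights = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]
```

A word whose bigram context vector sits near a noun centroid but not far from a verb centroid would then carry most, but not all, of its membership in the noun cluster, which is how deviations caused by multiple roles can be softened.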

References 
Brown, P.F., V.J. Della Pietra, P.V. deSouza, J.C. Lai, and R.L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-477, 1992.
de Marcken, Carl G. Unsupervised Language Acquisition. PhD Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1996.
Finch, S. Finding Structure in Language. PhD Thesis, Centre for Cognitive Science, University of Edinburgh, 1993.
Finch, S. and N. Chater. Automatic methods for finding linguistic categories. In Igor Aleksander and John Taylor, editors, Artificial Neural Networks, volume 2. Elsevier Science Publishers, 1992.
Gath, I. and A.B. Geva. Unsupervised Optimal Fuzzy Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), July 1989.
Knight, Kevin, Ishwar Chander, Matthew Haines, Vasileios Hatzivassiloglou, Eduard Hovy, Masayo Iida, Steve Luk, Akitoshi Okumura, Richard Whitney, and Kenji Yamada. Integrating Knowledge Bases and Statistics in MT. Proceedings of the 1st AMTA Conference, Columbia, MD, 1994.
Korkmaz, E.E. and G. Üçoluk. A Method for Improving Automatic Word Categorization. Proceedings of the Workshop on Computational Natural Language Learning (CoNLL97), Madrid, Spain, pp. 43-49, 1997.
Lankhorst, M.M. A Genetic Algorithm for Automatic Word Categorization. In E. Backer (ed.), Proceedings of Computing Science in the Netherlands CSN'94, SION, 1994, pp. 171-182.
McMahon, John G. and Francis J. Smith. Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies. Computational Linguistics, 22(2):217-247, 1996.
Wilms, G.J. Automated Induction of a Lexical Sublanguage Grammar Using a Hybrid System of Corpus and Knowledge-Based Techniques. PhD Thesis, Mississippi State University, 1995.
Zipf, G.K. The Psycho-Biology of Language. Boston: Houghton Mifflin, 1935.
