Extracting semantic clusters from the alignment of definitions 
Gerardo SIERRA 
Instituto de Ingeniería, UNAM
Apdo. Postal 70-472 
México 04510, D.F.
gsm@pumas.iingen.unam.mx
John McNAUGHT 
Department of Language Engineering, UMIST 
PO Box 88 
Manchester M60 1QD, UK
John.McNaught@umist.ac.uk
Abstract 
Through the alignment of definitions from two or more different sources, it is possible to retrieve pairs of words that can be used interchangeably in the same sentence without changing the meaning of the concept. As lexicographic work exploits common defining schemes, such as genus and differentia, a concept is similarly defined by different dictionaries. The difference in words used between two lexicographic sources lets us extend the lexical knowledge base, so that clustering is available through merging two or more dictionaries into a single database and then using an appropriate alignment technique. Since alignment starts from the same entry of two dictionaries, clustering is faster than with other techniques.
The algorithm introduced here is analogy-based. It starts by calculating the Levenshtein distance, a variation of the edit distance, which allows us to align the definitions. As a measure of similarity, the concept of the longest collocation couple is introduced, which is the basis for clustering similar words. The process iterates, replacing similar pairs of words in the definitions until no new clusters are found.
Introduction 
Clustering methods to identify semantically similar words are usually divided into relation-based and distribution-based approaches [Hirakawa, Xu and Haase 1996]. Relation-based clustering methods rely on the relations in a semantic network or ontology to judge the similarity between two concepts, either by measuring the shortest path that connects two concepts in the hierarchical net [Agirre and Rigau 1996], or by comparing the information content shared by the members under the same cluster [Morris and Hirst 1991, Resnik 1997].
However, even though these ontologies describe a huge number of members for a cluster, only a few words of a category may be interchangeable in the same context and thus usable as members of the same cluster. This means that not all words in a category are necessary.
Conversely, distribution-based clustering methods depend on purely statistical analysis of lexical occurrences in running texts. A further drawback is that distribution-based methods require us to process a large amount of data in order to get reliable results. Moreover, the use of large corpora is not always practical, due to economic, time or capability factors. Gao [1997] states that the problem for statistical alignment algorithms, such as those based on the facts described by Gale and Church [1991], is the low frequency of words that occur in parallel corpora. The consequences of lacking large corpora include results based on low-frequency words, which are quite unrepresentative for clustering.
From a methodological point of view, there is, in addition to the above two approaches, a little-known approach called the analogy-based approach. This employs an inferential process and is used in computational linguistics and artificial intelligence as an alternative to current rule-based linguistic models.
1 Analogy-based clustering 
Jones [1996] suggests corpus alignment as a feasible analogy-based approach to aligning two sentences. Waterman [1996] uses a technique for measuring the similarity between lexical strings, named edit distance. This matches the words of two sentences in linear order and determines their correspondence. For example, given the following two definitions for alkalimeter:
• An apparatus for determining the concentration of alkalis in solution [CED]
• An instrument for ascertaining the amount of alkali in a solution [OED2]
Alignment may identify which words in these
definitions are equivalents of each other. A 
quick observation of the sentences lets us 
identify three pairs of words: (apparatus, 
instrument), (determining, ascertaining) and 
(concentration, amount). 
The appeal of using definitions as corpora for alignment is founded on two reasons. Firstly, dictionaries contain all necessary information as a knowledge base for extracting keywords [Boguraev and Pustejovsky 1996]. Secondly, it is much easier to find the sentences for aligning, since definitions are distinguished by entry headword.
Taking into account Waterman's studies, we 
propose an analogy-based method to automatically identify semantic clusters. The difference
in words used between two or more 
lexicographic definitions enables us to infer 
paradigms by merging the dictionary definitions 
into a single database and then using our own 
alignment technique. 
2 Clustering algorithm 
The overall structure of the clustering algorithm is shown in Figure 1, and its description is given below.
2.1 Processing definitions 
Our algorithms are used in an overall system called the "onomasiological search system" (OSS), whose aim is to allow the user to find terms by giving a description of a concept. Lexicographic and terminological definitions constitute the main lexical resources. Our algorithms cluster words that are used in the same context, and thus operate on pairs of definitions for the same entry word, drawn from two different dictionaries. If dictionary I does not have an entry word that exists in dictionary J, then this entry word is omitted from consideration. In order to balance the number of strings when an entry word in dictionary I has two or more senses, the entry word in dictionary J is repeated as many times as necessary to equal the number of senses of dictionary I. We thus derive two files I and J containing an equal number of strings S1 and S2, respectively. Each string consists of an entry term followed by its definition, the definition giving only one sense of the entry term. For each string S1 there is a corresponding string S2.
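To make this pairing step concrete, the following sketch shows one possible way to derive the parallel strings S1 and S2 from two dictionaries. The dictionary data structures, the function name and the balancing of sense counts by repetition are illustrative assumptions, not the system's actual code.

```python
# Sketch: pair definitions for shared headwords from two dictionaries.
# `dict_i` and `dict_j` map a headword to a list of sense definitions;
# these structures and names are illustrative, not the OSS implementation.

def pair_definitions(dict_i, dict_j):
    """Return parallel lists of strings S1, S2 (headword + one sense each)."""
    s1, s2 = [], []
    for headword, senses_i in dict_i.items():
        if headword not in dict_j:
            continue  # entry word missing in dictionary J: omit it
        senses_j = dict_j[headword]
        # Balance the number of strings: repeat a definition of the dictionary
        # with fewer senses as many times as the other dictionary has senses.
        n = max(len(senses_i), len(senses_j))
        for k in range(n):
            def_i = senses_i[min(k, len(senses_i) - 1)]
            def_j = senses_j[min(k, len(senses_j) - 1)]
            s1.append(f"{headword} {def_i}")
            s2.append(f"{headword} {def_j}")
    return s1, s2

# Toy example with one shared entry:
ced = {"dynameter": ["an instrument for determining the magnifying power of telescopes"]}
oed = {"dynameter": ["an instrument for measuring the magnifying power of a telescope"]}
S1, S2 = pair_definitions(ced, oed)
```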
Our experiments focus on 314 terms for 
measuring instruments extracted with their 
definitions from CED [1994] and OED2 [1994],
resulting in 387 strings from each dictionary. 
Figure 1 Clustering algorithm (flowchart: processing of the definition strings S1 and S2 → stemming → calculation of the Levenshtein distance → alignment of S1 and S2 → finding the lcc, with stoplist filtering → identification of bindings → clustering of bindings → replacement of strings in the definitions → clusters)
The strings consist of the entry term and the definition, so that etymology, part of speech, inflected forms of the entry term, examples and other information were deleted. Subject-field labels, such as 'astronomy' and 'meteorology', were preserved, either in full or slightly abbreviated, as they are helpful to resolve which sense of a word to choose, and usually constitute a fundamental property of the concept.
It should be noted that none of the 387 strings suffered any additional transformation, apart from a few cases in order to complete a definition when it had been broken in two parts by the dictionary editor, such as when a core meaning appears just once at the beginning of several subsequent senses. Although some abbreviations ('U.S.A.'), initials of proper names ('C.T.R. Wilson') and possessives ('sun's rays') will come out as two or more words after deleting punctuation marks and can therefore affect the efficiency of the algorithm, they were preserved to observe their effect.
2.2 Aligning definitions 
In order to compare two strings of words, we use the Levenshtein distance [Levenshtein 1966], a method similar to the edit distance. This method measures the edit transformations that change one string into another. The Levenshtein distance arranges the strings in a matrix, with the words of S1 heading the columns and those of S2 heading the rows. A null word is inserted at the beginning of each string S1 and S2, in position i=0, j=0. The matrix is filled with the costs of insertion, deletion and substitution using the following formula:
$$
D(a_i, b_j) = \min \begin{cases}
D(a_i, b_{j-1}) + D_{ins}(b_j) \\
D(a_{i-1}, b_j) + D_{ins}(a_i) \\
D(a_{i-1}, b_{j-1}) + D_{sub}(a_i, b_j)
\end{cases}
$$
where the cost of insertion, D_ins(), is 1, and the cost of substitution, D_sub(), is 0 if a_i and b_j are equal and 1 if they differ.
Our experimental results have shown that applying the Levenshtein distance to stem forms gives better matches than using full forms. Therefore, we shall fill the matrix with the costs for the stem forms, although the strings preserve the full forms both for the following steps and in the output table. We used the stemming algorithm of Porter [1980], which removes endings from words.
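As a rough illustration of how the cost matrix of the formula above can be filled over stemmed tokens, consider the following sketch. The `stem` function is only a crude stand-in for the Porter stemmer, and the whole block is an assumption about one possible implementation, not the authors' own code.

```python
# Sketch: fill the Levenshtein cost matrix over two token strings, using stem
# forms for the substitution test.  `stem` is a crude stand-in for the Porter
# stemmer (an assumption for illustration only).

def stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def levenshtein_matrix(s1_tokens, s2_tokens):
    """Return D, where D[i][j] is the edit cost between prefixes of S1 and S2."""
    a = [None] + [stem(w.lower()) for w in s1_tokens]   # null word at position 0
    b = [None] + [stem(w.lower()) for w in s2_tokens]
    D = [[0] * len(b) for _ in range(len(a))]
    for i in range(1, len(a)):
        D[i][0] = i                                     # cost of deleting a[1..i]
    for j in range(1, len(b)):
        D[0][j] = j                                     # cost of inserting b[1..j]
    for i in range(1, len(a)):
        for j in range(1, len(b)):
            sub = 0 if a[i] == b[j] else 1              # substitution cost D_sub
            D[i][j] = min(D[i][j - 1] + 1,              # insertion of b[j]
                          D[i - 1][j] + 1,              # insertion of a[i]
                          D[i - 1][j - 1] + sub)        # substitution
    return D
```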
Building on the Levenshtein distance, Wagner and Fisher [1974] propose a dynamic programming method to align the elements of two strings. Their procedure for returning the ordered pairs of the alignment starts with the last cell of the matrix, with cost[n][m], and works back until either i or j equals 0, according to which of its neighbours a cell was derived from. If it is derived from the previous horizontal or vertical cell ([i-1][j] or [i][j-1] respectively) then the difference in cost is just 1; otherwise it is derived from the diagonal.
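Continuing the sketch above, a backtrace in the spirit of Wagner and Fisher might recover the ordered pairs of the alignment as follows. Again this is only an assumed implementation, reusing `stem` and `levenshtein_matrix` from the previous sketch.

```python
# Sketch: trace back from the last cell of the cost matrix to recover the
# aligned (word_from_S1, word_from_S2, cost) triplets, last cell first.

def align(s1_tokens, s2_tokens, D):
    i, j = len(s1_tokens), len(s2_tokens)
    triplets = []
    while i > 0 or j > 0:
        same = (i > 0 and j > 0 and
                stem(s1_tokens[i - 1].lower()) == stem(s2_tokens[j - 1].lower()))
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if same else 1):
            triplets.append((s1_tokens[i - 1], s2_tokens[j - 1], D[i][j]))  # equal or matched
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            triplets.append((s1_tokens[i - 1], None, D[i][j]))             # null couple
            i -= 1
        else:
            triplets.append((None, s2_tokens[j - 1], D[i][j]))             # null couple
            j -= 1
    return triplets  # in decreasing order of cost, as in section 2.3

s1 = "dynameter an instrument for determining the magnifying power of telescopes".split()
s2 = "dynameter An instrument for measuring the magnifying power of a telescope".split()
triplets = align(s1, s2, levenshtein_matrix(s1, s2))
```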
2.3 Extracting triplets 
The alignment gives us a list of triplets of the form (ff_i, ff_j, cost[i][j]), in decreasing order according to cost[i][j], where ff_i and ff_j are full forms from the strings S1 and S2, respectively. There are three possible pairings of words:
"Equal couple" is defined as the pair (ff_i, ff_j) of full forms such that the corresponding stem forms are equal (sf_i = sf_j).
"Matched couple" is a pair (ff_i, ff_j) such that sf_i ≠ sf_j. This couple represents a potential pair of similar words.
"Null couple" is a pair (ff_i, ff_j) in which either ff_i or ff_j is missing.
With respect to the Levenshtein distance, the
equal couple means these words do not need any 
change to make both equal, while for the 
matched couple we shall replace one word with 
the other progressively, and for the null couple 
we must either insert one word into the given 
string or delete it from the given string. 
The purpose of clustering is to match different 
pairs of words (matched couples), thus neither 
pairs of equal words (equal couples) nor pairs 
with a null word (null couples) are relevant. 
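A small helper, again only a sketch building on the alignment code above, can classify each aligned pair into one of the three couple types:

```python
# Sketch: classify an aligned pair of full forms as an equal, matched or null
# couple, following the definitions of section 2.3 (uses `stem` from above).

def couple_type(ff_i, ff_j):
    if ff_i is None or ff_j is None:
        return "null"      # one word is missing (insertion or deletion)
    if stem(ff_i.lower()) == stem(ff_j.lower()):
        return "equal"     # same stem form: no change needed
    return "matched"       # different stem forms: potential pair of similar words
```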
2.4 Measuring similarity 
As a measure of the similarity between a 
matched couple, we quantify the surrounding 
equal couples above and below it. This concept 
is similar to the "longest common subsequence" 
of two strings suggested by Wagner and Fisher 
[1974], which is defined as the common subsequence of two strings having maximal length, although in our case both strings differ by the single matched couple. By analogy, we use longest collocation couple, henceforth abbreviated lcc, since we refer to couples instead of a single string. Besides, the word "collocation" is more representative for a pair of
words and their neighbourhood, being the core 
of two longest common subsequences. We 
define longest collocation couple as the maximal 
sequence of pairs of words formed by equal 
couples surrounding a matched couple. 
Given the alignment of the strings S1 and S2, consisting of a list of triplets of the form (ff_i, ff_j, cost[i][j]), in decreasing order according to cost[i][j], where ff_i and ff_j are, respectively, full forms from S1 and S2, the lcc is the longest consecutive sequence of triplets (ff_i, ff_j, cost[i][j]) containing one matched couple, such that it meets three conditions:
• The cost difference between the first triplet and the last triplet is 1.
• There is no null couple. 
• The matched couple is neither the first nor 
the last triplet. 
By these conditions, only the matched couple becomes the core of an lcc: we constrain a matched couple to be between two or more equal couples, and eliminate the possibility that the matched couple appears at the beginning or end of a phrase.
As a result, we get a new triplet (ff_i, ff_j, lcc_ij), where (ff_i, ff_j) is the matched couple and lcc_ij is the length of the longest collocation couple. As an example, for the definitions of "dynameter" in Table 1, there is only one matched couple, "determining-measuring", whose lcc is 9 (the extent of the lcc is indicated by arrows).
ff_i          ff_j          cost[i][j]
telescopes    telescope     2
(null)        a             2
of            of            1    <-
power         power         1     |
magnifying    magnifying    1     |
the           the           1     |
determining   measuring     1     |
for           for           0     |
instrument    instrument    0     |
an            An            0     |
dynameter     dynameter     0    <-

Table 1 Triplets for "dynameter"
Ranking all triplets found by lcc in decreasing 
order, we observe that the greater the value of
lcc, the greater the similarity between the words 
of the matched couple. 
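One possible way to compute the lcc score of each matched couple, applying the three conditions above to the triplet list, is sketched below. It reuses `couple_type` from the earlier sketch and is an assumed reading of the procedure, not the authors' code.

```python
# Sketch: for every matched couple, grow the window of surrounding equal
# couples and keep it only if it satisfies the three lcc conditions.

def lcc_scores(triplets):
    """triplets: list of (ff_i, ff_j, cost) in decreasing cost order."""
    scores = []
    for k, (ff_i, ff_j, _) in enumerate(triplets):
        if couple_type(ff_i, ff_j) != "matched":
            continue
        lo = k
        while lo > 0 and couple_type(*triplets[lo - 1][:2]) == "equal":
            lo -= 1                       # extend upwards over equal couples
        hi = k
        while hi < len(triplets) - 1 and couple_type(*triplets[hi + 1][:2]) == "equal":
            hi += 1                       # extend downwards over equal couples
        # conditions: matched couple not at either end, cost difference of 1
        if lo < k < hi and triplets[lo][2] - triplets[hi][2] == 1:
            scores.append((ff_i, ff_j, hi - lo + 1))
    return sorted(scores, key=lambda t: t[2], reverse=True)

# For the "dynameter" triplets of Table 1 this yields
# [("determining", "measuring", 9)].
```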
2.5 Removing function words
So far, function words and other noise words will also be clustered by our algorithms. In general, such words interfere in the identification of clusters and can give more wrong than good results. We use a stoplist to automatically identify and exclude any pair of words in which a non-relevant word appears, on the grounds that such words are not very useful for clustering. Thus, when the program comes across a matched pair of different words in a context and that matched pair contains a word from the stoplist, the pair is rejected.
Essentially, this is the same thing as using a 
tagger and looking at the tags as well as the 
words, since one would not want to choose a 
noun pairing with a determiner or a relative. 
By inspection, we observe that, after stoplist discrimination, the best potential clusters are found at higher values of lcc. Our experimental results show us that a length of lcc equal to 5 is a reliable threshold. Although there are also good matches for values equal to 4 and 3, the majority of these are duplicates of higher values.
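The stoplist filter and the lcc threshold might then be applied to the scored couples as in the following sketch. The stoplist contents and the aggregate `all_scores` (all (ff_i, ff_j, lcc) triplets collected over the definition pairs) are illustrative assumptions.

```python
# Sketch: discard matched couples containing a stoplist word and keep only
# those whose lcc reaches the threshold of 5.  STOPLIST is a tiny sample.

STOPLIST = {"a", "an", "the", "of", "for", "in", "to", "or", "and", "which", "that"}

def passes_stoplist(ff_i, ff_j, stoplist=STOPLIST):
    return ff_i.lower() not in stoplist and ff_j.lower() not in stoplist

# `all_scores` would hold the (ff_i, ff_j, lcc) triplets of all definition pairs.
all_scores = [("determining", "measuring", 9), ("the", "a", 6)]
bindings = [(w1, w2) for (w1, w2, lcc) in all_scores
            if lcc >= 5 and passes_stoplist(w1, w2)]   # -> [("determining", "measuring")]
```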
2.6 Clustering 
We introduce the term binding to represent a candidate cluster, i.e. two words that may be used in the same context without changing the meaning of a definition. A binding is a matched couple (ff_i, ff_j) formed by the full forms ff_i and ff_j, after stoplist discrimination, drawn from the strings S1 and S2, respectively, in such a way that the stem forms are equivalent, in a determined context, according to a determined threshold. The threshold associated with a binding is the length of the lcc, and we consider only bindings of matched couples where lcc ≥ 5.
Each binding can be considered as an initial cluster. Clusters represent sets of words that are used with the same meaning in particular contexts. In a consecutive sequence of bindings, it may happen that a stem form occurs in two or more different bindings. In this case, one can cluster all bindings with a common stem form according to the transitive property.
In order to cluster bindings, we use an algorithm consisting of three loops. First, it assigns a cluster number to each binding, so that bindings with a common word have the same cluster number. Secondly, it clusters bindings with the same cluster number, but removes duplicate stem forms in the same cluster. Thirdly, it checks whether it is possible to merge new clusters with those of previous cycles. This process will typically result in a set of overlapping clusters, reflecting the natural state where concepts may belong to more than one conceptual class.
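A compact way to realise the transitive grouping is a union-find structure over stem forms, as sketched below. This is an assumed simplification of the three-loop procedure and, unlike the paper's own bookkeeping, it produces disjoint rather than overlapping clusters.

```python
# Sketch: cluster bindings transitively by their shared stem forms using a
# union-find structure (reuses `stem` from the earlier sketch).

def cluster_bindings(bindings):
    """bindings: iterable of (ff_i, ff_j) full-form pairs already filtered."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for ff_i, ff_j in bindings:
        union(stem(ff_i.lower()), stem(ff_j.lower()))

    clusters = {}
    for ff_i, ff_j in bindings:
        for w in (ff_i, ff_j):
            clusters.setdefault(find(stem(w.lower())), set()).add(w.lower())
    # In the full procedure, the resulting clusters would additionally be
    # merged with those retained from previous cycles.
    return list(clusters.values())
```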
2.7 Cycling 
As bindings represent pairs of words whose stem forms can be substituted in a particular context without changing the meaning (sf_i = sf_j), we can replace any of the full forms ff_i with the full forms ff_j according to each binding, so that the corresponding definition preserves the same meaning. After substituting bindings, we observe that several pairs of words will now typically present a high lcc score, even those pairs of words which initially did not yield matches with any word. It is then advantageous to replace the bindings in the definitions in this way and to repeat the entire process until no new clusters are found. The first cycle runs from the reading of definitions up to the merging of clusters. All subsequent cycles start by replacing retained bindings in the definitions, thus each subsequent cycle works with new data.
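The overall cycling could be driven by a loop like the one below. Here `extract_bindings` stands for the whole pipeline of sections 2.2-2.6 (alignment, lcc scoring, stoplist filtering and thresholding) and is a hypothetical name, as is the simple word-for-word substitution used to rewrite the definitions.

```python
# Sketch: repeat the whole process, replacing retained bindings in the
# definitions after each cycle, until no new clusters are found.

def run_cycles(S1, S2, extract_bindings):
    all_bindings, clusters = [], []
    while True:
        new = [b for b in extract_bindings(S1, S2)
               if b not in all_bindings and (b[1], b[0]) not in all_bindings]
        if not new:
            break                                   # no new clusters: stop
        all_bindings.extend(new)
        clusters = cluster_bindings(all_bindings)   # from the previous sketch
        # Substitute ff_i by ff_j throughout the S1 definitions so that words
        # which did not align before may do so in the next cycle.
        for ff_i, ff_j in new:
            S1 = [" ".join(ff_j if w == ff_i else w for w in s.split())
                  for s in S1]
    return clusters
```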
3 Experimental results 
The current clustering algorithm was developed 
by analysing definitions on the following basis: 
• Language dictionaries. The use of dictionaries has been preferred because there are enough of them to extract data from. As they are in machine-readable form, it is possible to copy definitions, avoiding the mistakes likely to arise from retyping.
• Corpus on 314 "measuring instruments". 
This domain has the advantage that it is easy 
to search for the terms that correspond to it, 
as they usually end in "-meter", "-scope" or 
"-graph". As a conscqueuce o1' applying the 
clustering program to the 387 strings, it is 
evident that the maiority of clusters were 
related to "measure" and "instrument". 
• Alignment of two strings. We have shown 
that two sources of data (pairs of definitions)
are sufficient for clustering to yield good 
results. 
• No manipulation of data. After identification of the term and the definitions, these were
truncated to 200 characters and punctuation 
marks were removed. No words in 
definitions were replaced or moved, to "tidy 
up" the data, before being submitted to the 
main process. 
• Stemming algorithm. The stemming algorithm exhibits both overstemming and understemming, but nevertheless the clustering program yields good results.
• Stoplist discrimination. The stoplist has 
been used as a tagger, i.e. as a filter to avoid 
matching words with different parts of
speech. 
• Bindings for lcc ≥ 5. The best clusters have been observed for bindings with lcc ≥ 5, and the results presented are good.
Table 2 presents some cluster results after two 
cycles of the clustering procedure starting from 
the Levenshtein distance. In addition to these
clusters, 14 other clusters of two or three 
elements were obtained. 
1. apparatus instrument telescope
2. analyse ascertaining determining estimating location measuring recording taking testing
3. amount concentration intensity percentage proportion rate salinity strength

Table 2 Cluster results for "measuring instruments"
The procedure then stops, as no more matched 
words with lcc ≥ 5 have been found for our data.
The following sections analyse variations of 
these considerations. 
3.1 Using multiple resources 
General dictionaries present the advantage of using well-established lexicographic criteria to normalise definitions. These criteria, such as the use of analytical definitions by genus and differentia, have nowadays been adopted by terminological or specialised dictionaries, with the addition of a richer vocabulary and the identification of properties that are not always considered relevant in other resources.
Unfortunately, these are more oriented to a 
specific domain, so that it is sometimes 
necessary to search in two or more resources to 
compile the data. 
We used many online lexical resources, some of them available on the Internet. This allowed us to easily use different databases to extract

References

Agirre, E., and Rigau, G. 1996. Word Sense
Disambiguation using Conceptual Density. Proc.
COLING-96. The 16th International Conference on
Computational Linguistics, Copenhagen, 16-22.

Boguraev, V. and Pustejovsky, J. 1996. Issues in
text-based lexicon acquisition. Corpus processing
for lexical acquisition. B. Boguraev and J.
Pustejovsky (eds.). Cambridge: The MIT Press.

[CED2] 1994. Collins English dictionary. Glasgow:
Harper Collins Publishers.

Gale, W.A. and Church, K.W. 1991. A program for
aligning sentences in bilingual corpora. Proc. of
29th Annual Conference of the ACL, 177-184.

Gao, Z.M. 1997. Automatic extraction of translation
equivalents from a parallel Chinese-English
corpus. PhD Thesis, UMIST.

Hirakawa, H., Xu, Z., and Haase, K. 1996. Inherited
Feature-based Similarity Measure Based on Large
Semantic Hierarchy and Large Text Corpus. Proc.
COLING-96. The 16th International Conference on
Computational Linguistics. Copenhagen: Center
for Sprogteknologi.

Jones, D. 1996. Analogical Natural Language
Processing. London: UCL Press.

Levenshtein, V.I. 1966. Binary codes capable of
correcting deletions, insertions, and reversals.
Cybernetics and Control Theory 10 (8), 707-710.

Morris, J., and Hirst, G. 1991. Lexical Cohesion
Computed by Thesaural Relations as an Indicator
of the Structure of Text. Computational
Linguistics 17(1), 21-48.

[OED2] 1994. Oxford English dictionary. Oxford:
Oxford University Press and Rotterdam: Software
B.V.

Porter, M.F. 1980. An algorithm for suffix
stripping. Program 14(3), 130-137.

Resnik, P. 1997. Disambiguating noun groupings
with respect to WordNet senses. Proceedings of
the 3rd Workshop on Very Large Corpora, MIT.

Wagner, R.A., and Fisher, M.J. 1974. The String-to-
String Correction Problem. Journal of the
Association for Computing Machinery 21(1), 168-
173.

Waterman, S.A. 1996. "Distinguished Usage". In
Corpus Processing for Lexical Acquisition. B.
Boguraev and J. Pustejovsky (eds.) Cambridge:
The MIT Press.
