A METHOD FOR IMPROVING AUTOMATIC WORD 
CATEGORIZATION 
Emin Erkan Korkmaz, Göktürk Üçoluk
Department of Computer Engineering 
Middle East Technical University 
Emails: korkmaz@ceng.metu.edu.tr ucoluk@ceng.metu.edu.tr 
Abstract 
This paper presents a new approach to 
automatic word categorization which im- 
proves both the efficiency of the algorithm 
and the quality of the formed clusters. The 
unigram and the bigram statistics of a cor- 
pus of about two million words are used 
with an efficient distance function to mea- 
sure the similarities of words, and a greedy 
algorithm to put the words into clusters. 
The notions of fuzzy clustering, such as cluster
prototypes and degrees of membership, are used
to form the clusters. The algorithm is
unsupervised and the number of clusters
is determined at run-time.
1 Introduction 
Statistical natural language processing is a challeng- 
ing area in the field of computational natural lan- 
guage learning. Researchers of this field have an 
approach to language acquisition in which learning 
is visualised as developing a generative, stochastic 
model of language and putting this model into prac- 
tice (de Marcken, 1996). It has been shown practi- 
cally that the usage of such an approach can yield 
better performance for acquiring and representing
the structure of language. 
Automatic word categorization is an important 
field of application in statistical natural language 
processing where the process is unsupervised and is 
carried out by working on n-gram statistics to find 
out the categories of words. Research in this area 
points out that it is possible to determine the struc- 
ture of a natural language by examining the regular- 
ities of the statistics of the language (Finch, 1993). 
The organization of this paper is as follows. After 
the related work in the area of word categorization 
is presented in section 2, a general background of 
the categorization process is described in section 3,
which is followed by the presentation of the newly proposed
method. In section 4 the results of the experiments 
are given. We discuss the relevance of the results 
and conclude in the last section. 
2 Related Work 
There exists previous work in which the unigram and 
the bigram statistics are used for automatic word 
clustering. Here it is concluded that the frequency 
of single words and the frequencies of occurrence of
word pairs of a large corpus can give the necessary 
information to build up the word clusters. Finch 
(Finch and Chater, 1992) makes use of these bigram
statistics for the determination of the weight matrix 
of a neural network. Brown et al. (Brown et al., 1992)
use the same bigrams and, by means of a greedy
algorithm, form hierarchical clusters of words.
Genetic algorithms have also been successfully
used for the categorization process. Lankhorst
(Lankhorst, 1994) uses genetic algorithms to deter-
mine the members of predetermined classes. The 
drawback of his work is that the number of classes 
is determined prior to run-time and the genetic
algorithm only searches for the membership of those 
classes. 
McMahon and Smith (McMahon and Smith, 
1996) also use the mutual information of a corpus 
to find the hierarchical clusters. However, instead of
using a greedy algorithm, they use a top-down approach
to form the clusters. First, using the mutual
information, the system divides the initial set
containing all the words to be clustered into two
parts, and the process then continues on these new
clusters iteratively.
Statistical NLP methods have also been used to-
gether with other methods of NLP. Wilms (Wilms, 
1995) uses corpus based techniques together with 
knowledge-based techniques in order to induce a lex- 
ical sublanguage grammar. Machine Translation is 
another area where knowledge bases and statistics
are integrated. Knight (Knight et al., 1994) aims to
scale-up grammar-based, knowledge-based MT tech- 
niques by means of statistical methods. 
3 Word Categorization 
The words in a natural language can be visualised
as consisting of two different sets: the closed-class
words and the open-class ones. New open-class words
can be added to the language as the language progresses,
whereas the closed class is fixed and no new words are
added to it. For instance, prepositions are in the
closed class, while nouns are in the open class, since
new nouns can be added to the language as social and
economic life progresses. However, the most frequently
used words in a natural language usually form a closed class.
Zipf (Zipf, 1935), a linguist, was one of
the early researchers on statistical language models. 
His work states that only 2% of the words of a large 
English corpus form 66% of the total corpus. There- 
fore, it can be claimed that by working on a small 
set consisting of frequent words it is possible to build 
a framework for the whole natural language. 
N-gram models of language are commonly used to
build up such a framework. An n-gram model can
be formed by collecting the probabilities of word
streams $(w_i : i = 1 \ldots n)$. These probabilities are
used to form the model with which the behaviour of the
language can be predicted up to $n$ words. There exists
current research that uses bigram statistics for
word categorization, in which the probabilities of word
pairs in the text are collected and processed.
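As a rough illustration of this first step (the identifiers and the toy corpus below are ours, and any filtering of special or useless words is assumed to have been applied beforehand), the unigram and bigram counts can be collected in a single pass over the tokenised corpus:

from collections import Counter

def collect_ngram_counts(tokens):
    # `tokens` is assumed to be a flat list of word strings.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

toy = "the cat sat on the mat the dog sat on the rug".split()
uni, bi = collect_ngram_counts(toy)
print(uni["the"])          # 4
print(bi[("sat", "on")])   # 2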
3.1 Mutual Information 
As stated in the related work part these n-gram mod- 
els can be used together with the concept of mutual 
information to form the clusters. Mutual Informa- 
tion is based on the concept of entropy which can be 
defined informally as the uncertainty of a stochas- 
tic experiment. Let $X$ be a stochastic variable defined
over the set $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$, where the
probabilities $P_X(x_i)$ are defined for $1 \le i \le n$ as
$P_X(x_i) = P(X = x_i)$. Then the entropy of $X$, $H\{X\}$,
is defined as:

$$H\{X\} = -\sum_{1 \le i \le n} P_X(x_i) \log P_X(x_i) \qquad (1)$$
If $Y$ is another stochastic variable, then the
mutual information between these two stochastic
variables is defined as:

$$I\{X : Y\} = H\{X\} + H\{Y\} - H\{X, Y\} \qquad (2)$$
Here $H\{X, Y\}$ is the joint entropy of the stochastic
variables $X$ and $Y$. The joint entropy is defined as:

$$H\{X, Y\} = -\sum_{1 \le i \le n} \sum_{1 \le j \le n} P_{XY}(x_i, y_j) \log P_{XY}(x_i, y_j) \qquad (3)$$

In this formulation $P_{XY}(x_i, y_j)$ is the joint
probability defined as $P_{XY}(x_i, y_j) = P(X = x_i, Y = y_j)$.
Given a lexicon space $W = \{w_1, w_2, \ldots, w_n\}$ consisting
of $n$ words to be clustered, we can use the
formulation of mutual information for the bigram
statistics of a natural language corpus. In this
formulation $X$ and $Y$ are defined over the sets of the
words appearing in the first and second positions
respectively. So the mutual information, that is, the
amount of knowledge that a word in a corpus can
give about the following word, can be reformulated
using the bigram statistics as follows:
$$I\{X : Y\} = \sum_{1 \le i \le n} \sum_{1 \le j \le n} \frac{N_{ij}}{N_{**}} \log \frac{N_{ij}\, N_{**}}{N_{i*}\, N_{*j}} \qquad (4)$$
In this formulation $N_{**}$ is the total number of
word pairs in the corpus, $N_{ij}$ is the number of
occurrences of the word pair $(word_i, word_j)$, and $N_{i*}$
and $N_{*j}$ are the numbers of occurrences of $word_i$ and
$word_j$, respectively. This formulation denotes the
amount of linguistic knowledge preserved in the
bigrams of words in a natural language.
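For illustration, equation (4) can be computed directly from the bigram counts; the sketch below is ours, and it assumes a natural logarithm since the base of the logarithm is not stated here:

import math
from collections import Counter

def mutual_information(bigrams):
    # `bigrams` maps word pairs (w_i, w_j) to their counts N_ij.
    n_total = sum(bigrams.values())            # N_**
    n_first, n_second = Counter(), Counter()   # N_i*, N_*j
    for (w1, w2), n in bigrams.items():
        n_first[w1] += n
        n_second[w2] += n
    mi = 0.0
    for (w1, w2), n in bigrams.items():
        mi += (n / n_total) * math.log(n * n_total / (n_first[w1] * n_second[w2]))
    return mi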
3.2 Clustering Approach 
When the mutual information is used for clustering, 
the process is carried out somewhat at a macro-level. 
Usually search techniques and tools are used to- 
gether with the mutual information in order to form 
some combinations of different sets each of which is 
then subject to some validity test. The idea used 
for the validity testing process is as follows. Since 
the mutual information denotes the amount of probabilistic
knowledge that a word provides about the following word
in a corpus, if similarly behaving words are collected
into the same clusters, then the loss of mutual information
will be minimal. So the search is among possible
alternatives for sets or clusters, with the aim of
obtaining a minimal loss in mutual information.
Although this top-down method seems theoretically
possible, in the presented work a different,
bottom-up approach is used. In this incremental
approach, set prototypes are first built and
then combined with other sets or single words to
form larger ones. The method is based on the sim- 
ilarities or differences between single words rather 
than the mutual information of a whole corpus. In 
combining words into sets a fuzzy set approach is 
used. The authors believe that this serves to deter- 
mine the behavior of the whole set more properly. 
Using this constructive approach, it is possible to 
visualize the word clustering problem as the prob- 
lem of clustering points in an n-dimensional space 
if the lexicon space to be clustered consists of n 
words. The points that are the words in a corpus 
are positioned on this n-dimensional space accord- 
ing to their behaviour related to other words in the 
lexicon space. Each $word_i$ is placed on the $i$th
dimension according to its bigram statistic with the
word representing the dimension. So the degree of
similarity between two words can be defined as hav- 
ing close bigram statistics in the corpus. Words 
are distributed in the plane according to those bi- 
gram statistics. The idea is quite simple: let $w_1$
and $w_2$ be two words from the corpus, and let $Z$ be
the stochastic variable ranging over the words to be
clustered. Then if $P_X(w_1, Z)$ is close to $P_X(w_2, Z)$
and $P_X(Z, w_1)$ is close to $P_X(Z, w_2)$ for $Z$ ranging
over all the words to be clustered in the corpus,
then we can state a closeness between the words $w_1$
and $w_2$. Here $P_X$ is the probability of occurrence of
word pairs as stated in section 3.1: $P_X(w_1, Z)$ is the
probability that $w_1$ appears as the first element in
a word pair and $P_X(Z, w_1)$ is the reverse probability,
where $w_1$ is the second element of the word pair.
The same holds for $w_2$.
In order to start the clustering process, a distance 
function has to be defined between the elements in 
our plane. Using the idea presented above we define 
a simple distance function between words using the 
bigram statistics. The distance function D between 
two words $w_1$ and $w_2$ is defined as follows:

$$D(w_1, w_2) = D_1(w_1, w_2) + D_2(w_1, w_2) \qquad (5)$$

where

$$D_1(w_1, w_2) = \sum_{1 \le i \le n} | P_X(w_1, w_i) - P_X(w_2, w_i) | \qquad (6)$$

and

$$D_2(w_1, w_2) = \sum_{1 \le i \le n} | P_X(w_i, w_1) - P_X(w_i, w_2) | \qquad (7)$$
Here $n$ is the total number of words to be clustered.
Since $P_X(w_i, w_j)$ is defined as $\frac{N_{ij}}{N_{**}}$,
the proportion of the number of occurrences of the word pair
$(w_i, w_j)$ to the total number of word pairs in the
corpus, the distance function for $w_1$ and $w_2$ reduces
down to:

$$D(w_1, w_2) = \sum_{1 \le i \le n} | N_{w_1 i} - N_{w_2 i} | + | N_{i w_1} - N_{i w_2} | \qquad (8)$$
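A minimal sketch of this distance function, assuming the bigram counts are held in a dictionary keyed by word pairs (the identifiers are ours, not the paper's):

def distance(w1, w2, lexicon, bigrams):
    # Sum of the absolute differences of the forward (D1) and backward (D2)
    # bigram counts over the whole lexicon; missing pairs count as zero.
    d = 0
    for wi in lexicon:
        d += abs(bigrams.get((w1, wi), 0) - bigrams.get((w2, wi), 0))  # D1 term
        d += abs(bigrams.get((wi, w1), 0) - bigrams.get((wi, w2), 0))  # D2 term
    return d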
Having such a distance function, it is possible to 
start the clustering process. The first idea that can 
be used is to form a greedy algorithm to start form- 
ing the hierarchy of word clusters. If the lexicon
space to be clustered consists of $\{w_1, w_2, \ldots, w_n\}$,
then the first element $w_1$ is taken from the lexicon
space and a cluster is formed with this word and its
nearest neighbour or neighbours. The lexicon space then
becomes $\{(w_1, w_{s_1}, \ldots, w_{s_k}), w_i, \ldots, w_n\}$, where
$(w_1, w_{s_1}, \ldots, w_{s_k})$ is the first cluster formed. The
process is repeated with the first element in the list
that is outside the formed sets, $w_i$ in our case, and
iterates until no word is left that is not a member of
some set. The formed sets will be the clusters at the
bottom of the cluster hierarchy. Then, to determine the
behaviour of a set, the frequencies of its elements are
added and the previous process is carried out, this time
on the sets rather than on single words, until the
cluster hierarchy is formed; the algorithm stops when a
single set is formed that contains all the words in the
lexicon space.
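A rough sketch of this first greedy pass is given below; how many neighbours are grouped at once and how ties are broken are not specified here, so the choice in the sketch (one nearest neighbour per word) is our assumption:

def greedy_initial_clusters(lexicon, dist):
    # Each yet-unclustered word, taken in order, is grouped with its nearest
    # unclustered neighbour; already clustered words are skipped.
    clusters, used = [], set()
    for w in lexicon:
        if w in used:
            continue
        candidates = [v for v in lexicon if v != w and v not in used]
        if not candidates:
            clusters.append([w])
            used.add(w)
            continue
        nearest = min(candidates, key=lambda v: dist(w, v))
        clusters.append([w, nearest])
        used.update([w, nearest])
    return clusters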
In the early stages of this research such a greedy
method was used to form the clusters. However,
though some clusters at the low levels of the tree 
seemed to be correctly formed, as the number of 
elements in a cluster increased towards the higher 
levels, the clustering results became unsatisfactory. 
Two main factors were observed as the reasons for 
the unsatisfactory results. 
These were: 
• Shortcomings of the greedy-type algorithm.
• Inadequacy of the method used to obtain the
set behaviour from its elements' properties.
The greedy method results in a non-optimal clustering
at the initial level. To make this point clear,
consider the following example: let us assume that
four words $w_1, w_2, w_3$ and $w_4$ form the lexicon
space, and let the distances between these words
be denoted $d_{w_i, w_j}$. Then consider the distribution
in Figure 1. If the greedy method first tries to
cluster $w_1$, it will be clustered with $w_2$, since
Figure 1: Example of the clustering problem of the greedy
algorithm in a lexicon space with four different words. Note that
$d_{w_2, w_3}$ is the smallest distance in the distribution. However, since
$w_1$ is taken into consideration first, it forms set 1 with its nearest
neighbour $w_2$, and $w_3$ combines with $w_4$ to form set 2, although $w_2$ is
nearer. The expected third set is not formed.
the smallest $d_{w_1, w_j}$ for the first word is $d_{w_1, w_2}$. So
the second word will be captured in the set, and the
algorithm will skip $w_2$ and continue the clustering
process with $w_3$. At this point, though $w_3$ is closest
to $w_2$, since $w_2$ is already captured in a set and since $w_3$
is closer to $w_4$ than to the centre of this set,
a new cluster will be formed with members $w_3$ and
$w_4$. However, as can be clearly seen in Figure 1,
the first optimal cluster to be formed
among these four words is the set $\{w_2, w_3\}$.
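As a concrete illustration of this failure, consider the four words placed on a line; the positions below are chosen by us only to reproduce the situation of Figure 1 and are not taken from the paper:

# Illustrative positions for w1..w4 on a line.
pos = {"w1": 0.0, "w2": 3.0, "w3": 5.0, "w4": 8.0}
d = lambda a, b: abs(pos[a] - pos[b])

set1 = ("w1", "w2")                    # w1 grabs its nearest neighbour w2 (d = 3.0)
centre1 = (pos["w1"] + pos["w2"]) / 2  # centre of set1 lies at 1.5
# w3 is nearest to w2 (d = 2.0), but w2 is taken; w3 is closer to w4 (3.0)
# than to the centre of set1 (3.5), so it forms set2 with w4.
assert d("w3", "w4") < abs(pos["w3"] - centre1)
set2 = ("w3", "w4")
# The optimal first cluster {w2, w3}, at distance 2.0, is never formed.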
The second problem causing unsatisfactory clus- 
tering occurs after the initial sets are formed. Ac- 
cording to the algorithm after each cluster is formed, 
the clusters behave exactly like other single words 
and get into clustering with other clusters or single 
words. However to continue the process, the bigram 
statistics of the clusters or in other words the com- 
mon behaviour of the elements in a cluster should be 
determined so that the distance between the cluster 
and other elements in the search space could be cal- 
culated. One easy way to determine this behaviour 
is to find the average of the statistics of all the ele- 
ments in a cluster. This method has its drawbacks. 
The points in the search space for the natural lan- 
guage application are very close to each other. Fur- 
thermore, if the corpus used for the process is not 
large, the proximity problem becomes more severe. 
On the other hand the linguistic role of a word may 
vary in contexts in different sentences. Many words 
are used as nouns, adjectives, or in some other
linguistic category depending on the context. It can
be claimed that each word shall initially be placed in
a cluster according to its dominant role. However, to
determine the behaviour of a set, the dominant roles
of its elements should also be used. Somehow the
common properties (bigrams) of the elements should
always be used, and the deviations of each element
should be eliminated in the process.
3.2.1 Improving the Greedy Method 
The clustering process is improved to overcome 
the above mentioned drawbacks. 
The idea used to find the optimal cluster for each
word at the initial step is to allow words to be
members of more than one class while the initial
clusters are being formed. So after the
first pass over the lexicon space, intersecting clus-
ters are formed. For the lexicon space presented in 
Figure 1 with four words, the expected third set is 
also formed. As the second step these intersecting 
sets are combined into a single set. Then the clos- 
est two words (according to the distance function) 
are found in each combined set and these two closest 
words are taken into consideration as the prototype 
for that set. After finding the centroids of all sets, 
the distances between a member and all the cen- 
troids are calculated for all the words in the lexicon 
space. Following this, each word is moved to the set 
where the distance between this member and the set 
center is minimal. This procedure is necessary since
the initial sets are formed by combining the intersecting
sets. When these intersecting sets are combined,
the set center of the resulting set might be far
away from some elements, and there may be other
closer set centers formed by other combinations,
so a reorganization of membership is appropriate.
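The following sketch summarises these steps; the way the distance between a word and a two-word prototype is measured is not detailed here, so the average distance to the two prototype words used below is our assumption:

def improved_initial_clusters(lexicon, dist):
    # 1. One set per word: the word together with its nearest neighbour,
    #    so a word may appear in several sets.
    sets = []
    for w in lexicon:
        nearest = min((v for v in lexicon if v != w), key=lambda v: dist(w, v))
        sets.append({w, nearest})

    # 2. Merge intersecting sets (transitively).
    merged = []
    for s in sets:
        for m in [m for m in merged if m & s]:
            merged.remove(m)
            s = s | m
        merged.append(s)

    # 3. The closest pair in each merged set serves as its prototype.
    def prototype(s):
        members = sorted(s)
        pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
        return min(pairs, key=lambda p: dist(*p))

    protos = [prototype(s) for s in merged]

    # 4. Reassign every word to the set whose prototype is nearest.
    clusters = [set() for _ in protos]
    for w in lexicon:
        j = min(range(len(protos)),
                key=lambda k: (dist(w, protos[k][0]) + dist(w, protos[k][1])) / 2)
        clusters[j].add(w)
    return [c for c in clusters if c]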
3.2.2 Fuzzy Membership 
As presented in the previous section the clustering 
process builds up a cluster hierarchy. In the first 
step, words are combined to form the initial clusters, 
then those clusters become members of the process 
themselves. To combine clusters into new ones the 
statistical behaviour of them should be determined, 
since bigram statistics are used for the process. The 
statistical behaviour ~of a cluster is related to the 
bigrams of its members. In order to find out the 
dominant statistical role of each cluster the notion 
of fuzzy membership is used. 
The problem that each word can belong to more 
than one linguistic category brings up the idea that 
the sets of word clusters cannot have crisp border- 
lines and even if a word is in a set due to its dominant
linguistic role in the corpus, it can have a degree of 
membership to the other clusters in the search space. 
Therefore fuzzy membership can be used for deter- 
mining the bigram statistics of a cluster. 
Researchers working on fuzzy clustering present 
a framework for defining fuzzy membership of ele- 
ments. Gath and Geva (Gath and Geva, 1989) de- 
scribe such an unsupervised optimal fuzzy cluster- 
ing. They present the K-means algorithm based on 
minimization of an objective function. For the pur- 
the     5.002056%
and     3.281249%
to      2.836796%
of      2.561952%
a       2.107116%
in      1.591189%
he      1.533916%
was     1.419838%
that    1.306431%
his     1.124362%
it      1.061797%

Table 1: Frequencies of the most frequent ten words
pose of this research only the membership function 
of the presented algorithm is used. The membership 
function $u_{ij}$, that is, the degree of membership of the
$i$th element in the $j$th cluster, is defined as:

$$u_{ij} = \frac{\left( 1 / d^2(X_i, V_j) \right)^{1/(q-1)}}{\sum_{k=1}^{K} \left( 1 / d^2(X_i, V_k) \right)^{1/(q-1)}} \qquad (9)$$
Here $X_i$ denotes an element in the search space and
$V_j$ is the centroid of the $j$th cluster. $K$ denotes the
number of clusters, and $d^2(X_i, V_j)$ is the distance
of the element $X_i$ to the centroid $V_j$ of the $j$th cluster.
The parameter $q$ is the weighting exponent for $u_{ij}$
and controls the fuzziness of the resulting clusters.
After the degrees of membership for all the ele- 
ments of all classes in the search space are calculated, 
the bigram statistics of the classes are added up. To 
find those statistics the following method is used: 
The bigram statistics of each element are multiplied
by the degree of membership of the element in
the working set, and this forms the amount of statistical
knowledge passed from the element to that set.
So the elements chosen as set centroids will be the
ones that affect a set's statistical behaviour most,
whereas an element far from a centroid will make a
smaller statistical contribution.
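A sketch of this weighting scheme is given below; the fuzziness exponent q = 2 and the guard against zero distances are our choices, not the paper's:

def memberships(word, centroids, dist, q=2.0):
    # Degree of membership of `word` in each of the K clusters, equation (9).
    eps = 1e-12                                   # guard against a zero distance
    weights = [(1.0 / max(dist(word, c), eps) ** 2) ** (1.0 / (q - 1.0))
               for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]

def cluster_bigrams(cluster_words, u, lexicon, bigrams):
    # Each member's bigram counts are weighted by its membership degree u[w],
    # so words close to the cluster prototype dominate the cluster statistics.
    forward = {wi: 0.0 for wi in lexicon}    # cluster as the first word of a pair
    backward = {wi: 0.0 for wi in lexicon}   # cluster as the second word of a pair
    for w in cluster_words:
        for wi in lexicon:
            forward[wi] += u[w] * bigrams.get((w, wi), 0)
            backward[wi] += u[w] * bigrams.get((wi, w), 0)
    return forward, backward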
4 Results 
The algorithm is tested on a corpus formed from on-line
novels collected from the web page of "Book
Stacks Unlimited, Inc." The corpus consists of twelve
free on-line novels adding up to about 1,700,000
words. The corpus is passed through a filtering process
in which special and useless words and characters are
filtered out and the frequencies of words are
collected. Then the most frequent thousand words 
are chosen and they are sent to the clustering process 
described in the previous sections. These most fre- 
quent thousand words form 70.4% of the whole
corpus. The percentage goes up to about 77% if the 
next most frequent thousand is added to the lexi- 
con space. The first ten most frequent words in the 
corpus and their frequencies are presented in Table
1. 
The clustering process builds up a tree
having words on the leaves and clusters on the in- 
ner nodes. The starting node denotes the largest 
class containing all the lexicon space. The number 
of leaves, that is, the number of clusters formed at
the initial step, is 60. The depth of the tree is 8.
Leaves appear starting from the 5th level and they 
are mainly located at the 5th and 6th level. The 
number of nodes connecting the initial clusters is 18. 
So, on average, about three clusters are combined
together in the second step. Table 2 displays two re- 
sults from the clustering tree. The first one collects 
a set of nouns from the lexicon space. However the 
second one is somewhat ill-structured: two
prepositions, two adjectives and a verb cluster are
combined into one.
Some linguistic categories inferred by the algo- 
rithm are listed below: 
• prepositions(1): by with in to and of
• prepositions(2): from on at for
• prepositions(3): must might will should could would may
• determiners(1): your its our these some this my her all any no
• prepositions(4): between among against through under upon over about
• adjectives(1): large young small good long
• nouns(1): spirit body son head power age character death sense part case state
• verbs(1): exclaimed answered cried says knew felt said or is was saw did asked gave took made thought either told whether replied because though how repeated open remained lived died lay does why
• verbs(2): shouted wrote showed spoke makes dropped struck laid kept held raised led carried sent brought rose drove threw drew shook talked yourself listened wished meant ought seem seems seemed tried wanted began used continued returned appeared comes knows liked loved
• adjectives(2): sad wonderful special fresh serious particular painful terrible pleasant happy easy hard sweet
• nouns(2): boys girls gentlemen ladies
• adverbs(1): scarcely hardly neither probably
• verbs(3): consider remember forget suppose believe say do think know feel understand
• verbs(4): keeping carrying putting turning shut holding getting hearing knowing finding drawing leaving giving taking making having being seeing doing
• nouns(3): streets village window evening morning night middle rest end road sun garden table room ground door church world name people city year day time house country way place fact river next earth
• nouns(4): beauty confidence pleasure interest fortune happiness tears
Table 2: Examples from the cluster hierarchy, including a noun cluster (passion, questions, books, manner, marriage, ideas, story, speech, faces, sound, feelings, course, distance, point, daughter, friend, family, children, men), a second noun cluster (beauty, confidence, pleasure, society, interest, action, fortune, happiness, tears, sun, garden, table, room, ground, door, church, world, name, people, city), a preposition cluster (between, among, against, through, under, upon, over, about, into, before, after, than, like) and a mixed cluster (down, out, up, small, good, long, twenty, three, two, five, given, ten, taken).
The ill-placed members in the clusters are shown 
above using bold font. The clusters represent the lin- 
guistic categories with a high success rate (about 91%).
Some semantic relations in the clusters can also be 
observed. Group nouns(2) is a good example of
such a semantic relation between the words in a clus- 
ter. 
5 Discussion And Conclusion 
It can be claimed that the results obtained in this 
research are encouraging. Although the corpus used 
for the clustering is quite small compared to other 
studies, the clusters formed seem to represent the
linguistic categories. It is believed that the incorrect 
ones are due to the poorness of the knowledge con- 
veyed through the corpus. With a larger training 
data, an increase in the convergence of frequencies, 
thus an increase in the quality of clusters is expected. 
Since the distance function depends only on the difference
of the bigram statistics, the running time of
the algorithm is quite low compared to algorithms
using mutual information. Though the complexity of
the two algorithms is the same, there is an increase
in efficiency due to the absence of time-consuming
mathematical operations, like division and multiplication,
needed to calculate the mutual information
of the whole corpus.
This research has focussed on adding fuzziness to 
the categorization process. Therefore different simi- 
larity metrics have not been tested for the algorithm. 
For further research the algorithm could be tested 
with different distance metrics. The metrics from 
the statistical theory given in (de Marcken, 1996) 
could be used to improve the algorithm. Also the 
algorithm could be used to infer the phrase struc- 
ture of a natural language. Finch (Finch, 1993) 
again uses the mutual information to find out such 
structures. Using fuzzy membership degrees could 
be another way to repeat the same process. To find 
out the phrases, the most frequent sentence segments of
some length could be collected from a corpus. In ad- 
dition to the frequencies and bigrams of words, the 
statistics for these frequent segments could be gath- 
ered and then they could also be passed to the clus- 
tering inference mechanism and the resulting clus- 
ters would then be expected to hold such phrases 
together with the words. 
To conclude, it can be claimed that automatic word
categorization is the initial step for the acquisition
of structure in a natural language. The same
method could be used, with modifications and improvements,
to find out more abstract structures in
the language, and moving this abstraction up to the
sentence level successfully might make it possible for
a computer to acquire the whole grammar of any 
natural language automatically. 

References 
Brown, P.F., V.J. Della Pietra, P.V. de Souza, 
J.C. Lai, and R.L. Mercer. Class-based n-gram 
models of natural language. Computational 
Linguistics, 18(4):467-477, 1992 
Finch, Steven. Finding Structure in Language. PhD
Thesis. Centre for Cognitive Science, Univer- 
sity of Edinburgh, 1993. 
Finch, S. and N. Chater, Automatic methods for 
finding linguistic categories. In Igor Alexan- 
der and John Taylor, editors, Artificial Neural 
Networks, volume 2. Elsevier Science Publish- 
ers, 1992 
Gath, I and A.B. Geva. Unsupervised Optimal 
Fuzzy Clustering. IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 11,
No. 7, July 1989. 
Knight, Kevin, Ishwar Chander, Matthew Haines, 
Vasileios Hatzivassiloglou, Eduard Hovy, Iida
Masayo, Steve Luk, Okumura Akitoshi, 
Richard Whitney, Yamada Kenji. Integrat- 
ing Knowledge Bases and Statistics in MT. 
Proceedings of the 1st AMTA Conference. 
Columbia, MD. 1994. 
Lankhorst, M.M. A Genetic Algorithm for Auto- 
matic Word Categorization. In: E. Backer 
(ed.), Proceedings of Computing Science in the 
Netherlands CSN'94, SION, 1994, pp. 171-182. 
de Marcken, Carl G. Unsupervised Language Acquisition.
PhD Thesis, 1996.
McMahon, John G. and Francis J. Smith. Improv- 
ing Statistical Language Model Performance 
with Automatically Generated Word Hierar- 
chies. Computational Linguistics, 1996. 
Wilms, Geert Jan. Automated Induction of a Lex- 
ical Sublanguage Grammar Using a Hybrid 
System of Corpus and Knowledge-Based Tech- 
niques. Mississippi State University. PhD The- 
sis, 1995. 
Zipf, G. K. The Psycho-Biology of Language. Boston:
Houghton Mifflin. 1935. 
