<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1006">
  <Title>A METHOD FOR IMPROVING AUTOMATIC WORD CATEGORIZATION</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Statistical natural language processing is a challenging area in the field of computational natural language learning. Researchers of this field have an approach to language acquisition in which learning is visualised as developing a generative, stochastic model of language and putting this model into practice (de Marcken, 1996). It has been shown practically that the usage of such an approach can yield better performances for acquiring and representing the structure of language.</Paragraph>
    <Paragraph position="1"> Automatic word categorization is an important field of application in statistical natural language processing where the process is unsupervised and is carried out by working on n-gram statistics to find out the categories of words. Research in this area points out that it is possible to determine the structure of a natural language by examining the regularities of the statistics of the language (Finch, 1993). The organization of this paper is as follows. After the related work in the area of word categorization is presented in section 2, a general background of the categorization process is described in 3 section, which is followed by presentation of newly proposed method. In section 4 the results of the experiments are given. We discuss the relevance of the results and conclude in the last section.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> There exists previous work in which the unigram and the bigram statistics are used for automatic word clustering. Here it is concluded that the frequency of single words and the frequencies of occurance of word pairs of a large corpus can give the necessary information to build up the word clusters. Finch (Finch and Chater, 1992), makes use of these bigram statistics for the determination of the weight matrix of a neural network. Brown, (Brown et al., 1992) uses the same bigrams and by means of a greedy algorithm forms the hierarchical clusters of words.</Paragraph>
    <Paragraph position="1"> Genetic algorithms have also been successfuly used for the categorization process. Lanchorst (Lankhorst, 1994) uses genetic algorithms to determine the members of predetermined classes. The drawback of his work is that the number of classes is determined previous to run-time and the genetic algorithm only searches for the membership of those classes.</Paragraph>
    <Paragraph position="2"> McMahon and Smith (McMahon and Smith, 1996) also use the mutual information of a corpus to find the hierarchical clusters. However instead of using a greedy algorithm they use a top-down approach to form the clusters. Firstly by using the mutual information the system divides the initial set containing all the words to be clustered into two parts and then the process continues on these new clusters iteratively.</Paragraph>
    <Paragraph position="3"> Statistical NLP methods have been used also together with other methods of NLP. Wilms (Wilms, 1995) uses corpus based techniques together with knowledge-based techniques in order to induce a lexical sublanguage grammar. Machine Translation is an other area where knowledge bases and statistics Korkmaz ~ U~oluk 43 Automatic Word Categorization Emin Erkan Korkmaz and GSktiirk U~oluk (1997) A Method for Improving Automatic Word Categorization. In T.M. EUison (ed.) CoNLL97: Computational Natural Language Learning, ACL pp 43-49.  (~) 1997 Association for Computational Linguistics are integrated. Knight, (Knight et al., 1994) aims to scale-up grammar-based, knowledge-based MT techniques by means of statistical methods.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="44" type="metho">
    <SectionTitle>
3 Word Categorization
</SectionTitle>
    <Paragraph position="0"> The words in a natural language can be visualised as consisting of two different sets. The closed class words and the open class ones. New open class words can be added to the language as the language progresses, however the closed class is a fixed one and no new words are added to the set. For instance the prepositions are in the closed class. However nouns are in the open class, since new nouns can be added to the language as the social and economical life progresses. However the most frequently used words in a natural language usually form a closed class.</Paragraph>
    <Paragraph position="1"> Zipf, (Zipf, 1935), who is a linguist, was one of the early researchers on statistical language models.</Paragraph>
    <Paragraph position="2"> His work states that only 2% of the words of a large English corpus form 66% of the total corpus. Therefore, it can be claimed that by working on a small set consisting of frequent words it is possible to build a framework for the whole natural language.</Paragraph>
    <Paragraph position="3"> N-gram models of language are commonly used to build up such framework. An n-gram model can be formed by collecting the probabilities of word streams (will : 1..n). The probabilities will be used to form the model where we can predict the behaviour of the language up to n words. There exists current research that use bigram statistics for word categorization in which probabilities of word pairs in the text are collected and processed.</Paragraph>
    <Section position="1" start_page="0" end_page="44" type="sub_section">
      <SectionTitle>
3.1 Mutual Information
</SectionTitle>
      <Paragraph position="0"> As stated in the related work part these n-gram models can be used together with the concept of mutual information to form the clusters. Mutual Information is based on the concept of entropy which can be defined informally as the uncertainty of a stochastic experiment. Let X be a stochastic variable defined over the set X = {Xl,X2,...,x,~} where the probabilities Px(xi) are defined for 1 _&lt; i _~ n as</Paragraph>
      <Paragraph position="2"> mutual information between these two stochastic variables is defined as:</Paragraph>
      <Paragraph position="4"> Here H{X, Y} is the joint entropy of the stochastic variables X and Y. The joint entropy is defined as:</Paragraph>
      <Paragraph position="6"> And in this formulation Pxu(xi, yj) is the joint probability defined as P~u(xi, yj) : P(X : x~, Y : Given a lexicon space W = {wl, w2, ..., w,} consisting of n words to be clustered, we can use the formulation of mutual information for the bigram statistics of a natural language corpus. In this formulation X and Y are defined over the sets of the words appearing in the first and second positions respectively. So the mutual information that is the amount of knowledge that a word in a corpus can give about the proceeding word can be reformulated using the bigram statistics as follows: N~j , Nij. N** itx:Y}= (4) l _&lt; i _&lt; n l _&lt; j _&lt; n In this formulation N** is the total number of word pairs in the corpus and N~j is the number of occurances of word pair (wordi, wordj), Ni. is the number of occurences of wordi and N.j is the number of occurences of wordj respectively. This formulation denotes the amount of linguistic knowledge preserved in bigram of words in a natural language.</Paragraph>
    </Section>
    <Section position="2" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
3.2 Clustering Approach
</SectionTitle>
      <Paragraph position="0"> When the mutual information is used for clustering, the process is carried out somewhat at a macro-level.</Paragraph>
      <Paragraph position="1"> Usually search techniques and tools are used together with the mutual information in order to form some combinations of different sets each of which is then subject to some validity test. The idea used for the validity testing process is as follows. Since the mutual information denotes the amount of probabilistic knowledge that a word provides on the proceeding word in a corpus, if similar behaving words are collected together to the same clusters than the loss of mutual information would be minimal. So the search is among possible alternatives for sets or clusters with the aim to obtain a minimal loss in mutual information.</Paragraph>
      <Paragraph position="2"> Although this top-to-bottom method seems theoretically possible, in the presented work a different approach, which is bottom-up is used. In this incremental approach, set prototypes are first built and then combined with other sets or single words to  form larger ones. The method is based on the similarities or differences between single words rather than the mutual information of a whole corpus. In combining words into sets a fuzzy set approach is used. The authors believe that this serves to determine the behavior of the whole set more properly. Using this constructive approach, it is possible to visualize the word clustering problem as the problem of clustering points in an n-dimensional space if the lexicon space to be clustered consists of n words. The points that are the words in a corpus are positioned on this n-dimensional space according to their behaviour related to other words in the lexicon space. Each wordi is placed on the i th dimension according to its bigram statistic with the word representing the dimension. So the degree of similarity between two words can be defined as having close bigram statistics in the corpus. Words are distributed in the plane according to those bi-gram statistics. The idea is quite simple: Let wl and w~ be two words from the corpus. Let Z be the stochastic variable ranging over the words to be clustered. Then if Px (Wl, Z) is close to Px (w2, Z) and if Px(Z, wl) is close to Px(Z, w2) for Z ranging over all the words to be clustered in the corpus, than we can state a closeness between the words Wl and w2. Here Px is the probability of occurences of word pairs as stated in section 3.1. Px(wl, Z) is the probability where Wl appears as the first element in a word pair and Px(Z, wl) is the reverse probability where wl is the second element of the word pair. This is the same for w2 respectively.</Paragraph>
      <Paragraph position="3"> In order to start the clustering process, a distance function has to be defined between the elements in our plane. Using the idea presented above we define a simple distance function between words using the bigram statistics. The distance function D between two words wl and w2 is defined as follows:</Paragraph>
      <Paragraph position="5"> Here n is the total number of words to be clus- null tered. Since Px(wi wj) is defined as ~ the pro- ' Nee ' portion of the number of occurences of word pair wi and wj to the total number of word pairs in the corpus, the distance function for wl and w2 reduces down to:</Paragraph>
      <Paragraph position="7"> Having such a distance function, it is possible to start the clustering process. The first idea that can be used is to form a greedy algorithm to start forming the hierarchy of word clusters. If the lexicon space to be clustered consists of {Wl, w2, ..., wn}, then the first element from the lexicon space wl is taken and a cluster with this word and its nearest neighbour or neighbors is formed. Then the lexicon space is {(Wl, W,,, ..., Wsk), Wi, ..., Wn) where (wl, ws,, ..., wsk) is the first cluster formed. The process is repeated with the first element in the list that is outside the formed sets, wi for our case and the process iterates until no word is left not being a member of any set. The formed sets will be the clusters at the bottom of the cluster hierarchy. Then to determine the behaviour of a set, the frequencies of its elements are added and the previous process is carried on the sets this time rather than on single words until the cluster hierarchy is formed, so the algorithm stops when a single set is formed that contains all the words in the lexicon space.</Paragraph>
      <Paragraph position="8"> In the early stages of this research such a greedy method was used to form the clusters, however, though some clusters at the low levels of the tree seemed to be correctly formed, as the number of elements in a cluster increased towards the higher levels, the clustering results became unsatisfactory.</Paragraph>
      <Paragraph position="9"> Two main factors were observed as the reasons for the unsatisfactory results.</Paragraph>
      <Paragraph position="10"> These were: * Shortcomings of the greedy type algorithm.</Paragraph>
      <Paragraph position="11"> * Inadequency of the method used to obtain the set behaviour from its element's properties.</Paragraph>
      <Paragraph position="12"> The greedy method results in a nonoptimal clustering in the initial level. To make this point clear consider the following example: Let us assume that four words wl,w2, w3 and w4 are forming the lexicon space. And let the distances between these words be defined as d~,~j. Then consider the distribution in Figure 1. If the greedy method first tries to cluster wl. Then it will be clustered with w2, since  gorithm in a lexicon space with four different words. Note that d~,~ 3 is the smallest distance in the distribution. However since wl is taken into consideration, it forms setl with its nearest neighbour w~ and ws combines with w4 and form set2, although w~ is nearer. And the expected third set is not formed.</Paragraph>
      <Paragraph position="13"> the smallest d~l,w , for the first word is d~l,w~. So the second word will be captured in the set and the algorithm will skip w2 and continue the clustering process with w3. At this point, though w3 is closest to w2, since it is captured in a set and since w3 is more closer to w4 than the center of this set is, a new cluster will be formed with members w3 and w4. However, as it can be obviously seen visually from Figure 1 the first optimal cluster to be formed between these four words is the set {w2, w3}.</Paragraph>
      <Paragraph position="14"> The second problem causing unsatisfactory clustering occurs after the initial sets are formed. According to the algorithm after each cluster is formed, the clusters behave exactly like other single words and get into clustering with other clusters or single words. However to continue the process, the bigram statistics of the clusters or in other words the common behaviour of the elements in a cluster should be determined so that the distance between the cluster and other elements in the search space could be calculated. One easy way to determine this behaviour is to find the average of the statistics of all the elements in a cluster. This method has its drawbacks.</Paragraph>
      <Paragraph position="15"> The points in the search space for the natural language application are very close to each other. Furthermore, if the corpus used for the process is not large, the proximity problem becomes more severe.</Paragraph>
      <Paragraph position="16"> On the other hand the linguistic role of a word may vary in contexts in different sentences. Many words are used as noun, adjective or falling into some other linguistic category depending on the context. It can be claimed that each word initially shall be placed in a cluster according to its dominant role. However to determine the behaviour of a set the dominant roles of its elements should also be used. Somehow the common properties (bigrams) of the elements should be always used and the deviations of each element should be eliminated in the process.</Paragraph>
      <Paragraph position="17">  The clustering process is improved to overcome the above mentioned drawbacks.</Paragraph>
      <Paragraph position="18"> The idea used to find the optimal cluster for each word at the initial step is to form up such initial clusters in the algorithm used in which words are allowed to be members of more than one class. So after the first pass over the lexicon space, intersecting clusters are formed. For the lexicon space presented in Figure 1 with four words, the expected third set is also formed. As the second step these intersecting sets are combined into a single set. Then the closest two words (according to the distance function) are found in each combined set and these two closest words are taken into consideration as the prototype for that set. After finding the centroids of all sets, the distances between a member and all the centroids are calculated for all the words in the lexicon space. Following this, each word is moved to the set where the distance between this member and the set center is minimal. This procedure is necessary since the initial sets are formed with combining the intersecting sets. When these intersecting sets are combined the set center of the resulting set might be far away from some elements and there may be other closer set centers formed with other combinations, so a reorganization of membership is appropriate.</Paragraph>
      <Paragraph position="19">  As presented in the previous section the clustering process builds up a cluster hierarchy. In the first step, words are combined to form the initial clusters, then those clusters become members of the process themselves. To combine clusters into new ones the statistical behaviour of them should be determined, since bigram statistics are used for the process. The statistical behaviour ~of a cluster is related to the bigrams of its members. In order to find out the dominant statistical role of each cluster the notion of fuzzy membership is used.</Paragraph>
      <Paragraph position="20"> The problem that each word can belong to more than one linguistic category brings up the idea that the sets of word clusters cannot have crisp borderlines and even ifa word is in a set due to its dominant linguistic role in the corpus, it can have a degree of membership to the other clusters in the search space.</Paragraph>
      <Paragraph position="21"> Therefore fuzzy membership can be used for determining the bigram statistics of a cluster.</Paragraph>
      <Paragraph position="22"> Researchers working on fuzzy clustering present a framework for defining fuzzy membership of elements. Gath and Geva (Gath and Geva, 1989) describe such an unsupervised optimal fuzzy clustering. They present the K-means algorithm based on minimization of an objective function. For the pur-Korkmaz 8t Ufoluk 46 Automatic Word Categorization</Paragraph>
      <Paragraph position="24"> pose of this research only the membership function of the presented algorithm is used. The membership function uij that is the degree of membership of the i th element to the jth cluster is defined as: ~x77~y = -K 1 ir - (9) Ek=l I X77 5 Here Xi denotes an element in the search space, l,~ is the centroid of the jth cluster. K denotes the number of clusters. And d2(Xi, ~) is the distance of Xith element to the centroid I,~ of the jth cluster. The parameter q is the weighting exponent for ztij and controls the fuzziness of the resulting cluster. After the degrees of membership for all the elements of all classes in the search space are calculated, the bigram statistics of the classes are added up. To find those statistics the following method is used: The bigram statistics of each element is multiplied with the degree of the membership of the element in the working set and this forms the amount of statistical knowledge passed from the element to that set. So the elements chosen as set centroids will be the ones that affect a set's statistical behaviour mostly. Hence an element away from a centroid will have a lesser statistical contribution.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>