<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0103"> <Title>Hierarchical Clustering of Words and Application to NLP Tasks</Title> <Section position="4" start_page="28" end_page="31" type="metho"> <SectionTitle> 2 Hierarchical Clustering of Words </SectionTitle> <Paragraph position="0"> Several algorithms have been proposed for automatically clustering words based on a large corpus (Jardino and Adda 1991; Brown et al. 1992; Kneser and Ney 1993; Martin et al. 1995; Ueberla 1995). They are classified into two types. One type is based on shuffling words from class to class, starting from some initial set of classes. The other type repeatedly merges classes, starting from a set of singleton classes (each containing only one word). Both types are driven by some objective function, in most cases perplexity or average mutual information. The merit of the second type for the purpose of constructing a hierarchical clustering is that we can easily convert the history of the merging process into a tree-structured representation of the vocabulary.</Paragraph> <Paragraph position="1"> On the other hand, the second type is prone to being trapped in a local minimum. The first type is more robust to the local minimum problem, but the quality of the classes greatly depends on the initial set of classes, and finding an initial set of good quality is itself a very difficult problem. Moreover, the first approach only provides a means of partitioning the vocabulary; it does not provide a way of constructing a hierarchical clustering of words. In this paper we adopt the merging approach and propose an improved method of constructing a hierarchical clustering. An attempt is also made to combine the two types of clustering, and some results will be shown. The combination is realized by constructing clusters with the merging method and then reshuffling words from class to class.</Paragraph> <Paragraph position="2"> Our word bits construction algorithm (Ushioda 1996) is a modification and an extension of the mutual information (MI) clustering algorithm proposed by Brown et al. (1992). The reader is referred to (Ushioda 1996) and (Brown et al. 1992) for details of MI clustering, but we will first briefly summarize MI clustering and then describe our hierarchical clustering algorithm.</Paragraph> <Section position="1" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 2.1 Mutual Information Clustering Algorithm </SectionTitle> <Paragraph position="0"> Suppose we have a text of T words, a vocabulary of V words, and a partition π of the vocabulary, which is a function from the vocabulary V to the set C of classes of words in the vocabulary.</Paragraph> <Paragraph position="1"> Brown et al. showed that the likelihood L(π) of a bigram class model generating the text is given by the following formula.</Paragraph> <Paragraph position="2"> L(π) = −H + I (1) </Paragraph> <Paragraph position="3"> Here H is the entropy of the 1-gram word distribution, and I is the average mutual information (AMI) of adjacent classes in the text and is given by equation 2.</Paragraph> <Paragraph position="4"> I = Σ_{c1,c2} Pr(c1 c2) log ( Pr(c1 c2) / ( Pr(c1) Pr(c2) ) ) (2) </Paragraph> <Paragraph position="5"> Since H is independent of π, the partition that maximizes the AMI also maximizes the likelihood L(π) of the text. Therefore, we can use the AMI as an objective function for the construction of classes of words.</Paragraph>
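As a concrete illustration of the objective in equation 2, the following sketch computes the AMI of adjacent classes from raw bigram counts, using maximum-likelihood estimates of the probabilities. The function name and the dictionary-of-counts interface are our own choices for illustration, not part of the original formulation.

```python
import math
from collections import defaultdict

def average_mutual_information(bigram_counts, word2class):
    """Compute the AMI of adjacent classes (equation 2) for a given partition.

    bigram_counts: dict mapping (w1, w2) word pairs to their frequency in the text.
    word2class:    dict mapping each word to its class label (the partition pi).
    """
    class_bigrams = defaultdict(float)   # counts for Pr(c1 c2)
    left_counts = defaultdict(float)     # counts for Pr(c1)
    right_counts = defaultdict(float)    # counts for Pr(c2)
    total = 0.0
    for (w1, w2), n in bigram_counts.items():
        c1, c2 = word2class[w1], word2class[w2]
        class_bigrams[(c1, c2)] += n
        left_counts[c1] += n
        right_counts[c2] += n
        total += n

    ami = 0.0
    for (c1, c2), n in class_bigrams.items():
        p12 = n / total
        p1 = left_counts[c1] / total
        p2 = right_counts[c2] / total
        ami += p12 * math.log(p12 / (p1 * p2))
    return ami
```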
<Paragraph position="6"> The mutual information clustering method employs a bottom-up merging procedure. In the initial stage, each word is assigned to its own distinct class. We then merge the two classes whose merging induces the minimum AMI reduction among all pairs of classes, and we repeat the merging step until the number of classes is reduced to the predefined number C. The time complexity of this basic algorithm is O(V⁵) when implemented straightforwardly, but it can be reduced to O(V³) by storing the results of all the trial merges at the previous merging step. Even with the O(V³) algorithm, however, the calculation is not practical for a large vocabulary of order 10⁴ or higher. Brown et al. proposed the following method, which we also adopted. We first make V singleton classes out of the V words in the vocabulary and arrange the classes in descending order of frequency, then define the merging region as the first C + 1 positions in the sequence of classes. So initially the C + 1 most frequent words are in the merging region. Then do the following.</Paragraph> <Paragraph position="7"> 1. Merge the pair of classes in the merging region whose merging induces the minimum AMI reduction among all the pairs in the merging region.</Paragraph> <Paragraph position="8"> 2. Put the class in the (C + 2)nd position into the merging region and shift each class after the (C + 2)nd position to its left.</Paragraph> <Paragraph position="9"> 3. Repeat 1. and 2. until C classes remain.</Paragraph> <Paragraph position="10"> With this algorithm, the time complexity becomes O(C²V), which is practical on a workstation with V on the order of 100,000 and C up to 1,000.</Paragraph>
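The control structure of this constrained merging loop can be sketched as follows. This is a minimal illustration rather than the O(C²V) implementation: it assumes a helper ami_after_merge(classes, i, j) that returns the AMI of the partition obtained by merging classes i and j, whereas the practical algorithm caches the trial-merge results incrementally; the function names are our own.

```python
def mi_clustering(words_by_frequency, C, ami_after_merge):
    """Greedy MI clustering with the merging-region constraint.

    words_by_frequency: vocabulary sorted in descending order of frequency.
    C:                  target number of classes.
    ami_after_merge:    callable(classes, i, j) -> AMI of the partition obtained
                        by merging classes[i] and classes[j] (assumed helper).
    """
    # Start from singleton classes, most frequent word first.
    classes = [[w] for w in words_by_frequency]
    merge_history = []

    while len(classes) > C:
        region = min(C + 1, len(classes))   # first C+1 positions form the merging region
        # Pick the pair inside the merging region whose merge loses the least AMI,
        # i.e. whose resulting partition has the highest AMI.
        i, j = max(
            ((a, b) for a in range(region) for b in range(a + 1, region)),
            key=lambda pair: ami_after_merge(classes, pair[0], pair[1]),
        )
        merge_history.append((classes[i][:], classes[j][:]))
        classes[i] = classes[i] + classes[j]
        del classes[j]
        # Deleting position j shifts the later classes left, so the next most
        # frequent class slides into the merging region, as in step 2 above.
    return classes, merge_history
```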
</Section> <Section position="2" start_page="29" end_page="31" type="sub_section"> <SectionTitle> 2.2 Word Bits Construction Algorithm </SectionTitle> <Paragraph position="0"> The simplest way to construct a tree-structured representation of words is to construct a dendrogram as a byproduct of the merging process, that is, to keep track of the order of merging and make a binary tree out of the record. A simple example with a five-word vocabulary is shown in Figure 1. If we apply this method to the above O(C²V) algorithm straightforwardly, however, we obtain for each class an extremely unbalanced, almost left-branching subtree. The reason is that once the classes in the merging region have grown to a certain size, it is much less expensive, in terms of AMI, to merge a lower-frequency singleton class into a higher-frequency class than to merge two higher-frequency classes of substantial size.</Paragraph> <Paragraph position="1"> The new approach we adopted incorporates the following steps.</Paragraph> <Paragraph position="2"> 1. MI-clustering: Make C classes using the mutual information clustering algorithm with the merging region constraint described in Section 2.1.</Paragraph> <Paragraph position="3"> 2. Outer-clustering: Replace all words in the text with their class token¹ and execute binary merging without the merging region constraint until all the classes are merged into a single class. Make a dendrogram out of this process. This dendrogram, Droot, constitutes the upper part of the final tree.</Paragraph> <Paragraph position="4"> 3. Inner-clustering: Let {C(1), C(2), ..., C(C)} be the set of classes obtained at step 1. For each i (1 ≤ i ≤ C) do the following.</Paragraph> <Paragraph position="5"> (a) Replace all words in the text except those in C(i) with their class token. Define a new vocabulary V' = V1 ∪ V2, where V1 = {all the words in C(i)}, V2 = {C1, C2, ..., Ci−1, Ci+1, ..., CC}, and Cj is a token for C(j) for 1 ≤ j ≤ C. Assign each element in V' to its own class and execute binary merging with a merging constraint such that only classes containing nothing but elements of V1 can be merged. This can be done by ordering the elements of V' with the elements of V1 in the first |V1| positions and merging with a merging region whose width is initially |V1| and decreases by one with each merging step.</Paragraph> <Paragraph position="6"> (b) Repeat merging until all the elements of V1 are put into a single class. Make a dendrogram Dsub out of the merging process for each class. This dendrogram constitutes a subtree for the class, with a leaf node representing each word in the class.</Paragraph> <Paragraph position="7"> 4. Combine the dendrograms by substituting each leaf node of Droot with the corresponding Dsub.</Paragraph> <Paragraph position="8"> This algorithm produces a balanced binary tree representation of words in which words that are close in meaning or syntactic feature occupy nearby positions. Figure 2 shows an example of Dsub for one class out of the 500 classes constructed using this algorithm with a vocabulary of the 70,000 most frequently occurring words in the Wall Street Journal Corpus. Finally, by tracing the path from the root node to a leaf node and assigning a bit to each branch, with zero or one representing a left or right branch respectively, we can assign a bit-string (word bits) to each word in the vocabulary.</Paragraph>
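To make this final step concrete, here is a small sketch of the path tracing that turns a combined dendrogram into word bits. The nested-tuple tree representation and the five-word toy tree are our own illustration (the actual structure of Figure 1 is not reproduced here); only the left-is-0 / right-is-1 convention comes from the text.

```python
def word_bits(tree, prefix=""):
    """Assign a bit-string to every word in a binary dendrogram.

    tree: either a word (leaf) or a 2-tuple (left_subtree, right_subtree).
    A '0' is appended for a left branch and a '1' for a right branch,
    so each word's bits encode its path from the root to its leaf.
    """
    if not isinstance(tree, tuple):          # leaf node: a single word
        return {tree: prefix}
    left, right = tree
    bits = {}
    bits.update(word_bits(left, prefix + "0"))
    bits.update(word_bits(right, prefix + "1"))
    return bits

# Example on a hypothetical five-word dendrogram:
toy_tree = ((("a", "b"), "c"), ("d", "e"))
print(word_bits(toy_tree))
# {'a': '000', 'b': '001', 'c': '01', 'd': '10', 'e': '11'}
```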
</Section> </Section> <Section position="5" start_page="31" end_page="35" type="metho"> <SectionTitle> 3 Word Clustering Experiments </SectionTitle> <Paragraph position="0"> We performed experiments using plain texts from six years of the Wall Street Journal Corpus to create clusters and word bits. The sizes of the texts are 5 million words (MW), 10MW, 20MW, and 50MW. The vocabulary is selected as the 70,000 most frequently occurring words in the entire corpus. We set the number C of classes to 500. The obtained hierarchical clusters are evaluated via the error rate of the ATR Decision-Tree Part-Of-Speech Tagger.</Paragraph> <Paragraph position="1"> Then, as an attempt to combine the two types of clustering methods discussed in Section 2, we performed an experiment that incorporates a word-reshuffling process into the word bits construction process.</Paragraph> <Section position="1" start_page="31" end_page="34" type="sub_section"> <SectionTitle> 3.1 Decision-Tree Part-Of-Speech Tagging </SectionTitle> <Paragraph position="0"> The ATR Decision-Tree Part-Of-Speech Tagger is an integrated module of the ATR Decision-Tree Parser, which is based on SPATTER (Magerman 1994). The tagger employs a set of 441 syntactic tags, which is one order of magnitude larger than that of the University of Pennsylvania Treebank Project. Training texts, test texts, and held-out texts are all sequences of word-tag pairs. In the training phase, a set of events is extracted from the training texts.</Paragraph> <Paragraph position="1"> An event is a set of feature-value pairs or question-answer pairs. A feature can be any attribute of the context in which the current word word(0) appears; it is conveniently expressed as a question. Tagging is performed left to right. Figure 3 shows an example of an event with the current word "like". The last pair in the event is a special item which gives the answer, i.e., the correct tag of the current word. The first two lines show questions about the identity of words around the current word and the tags of previous words. These questions are called basic questions. The second type of questions, word bits questions, concern clusters and word bits, such as "Is the current word in Class 295?" or "What is the 29th bit of the previous word's word bits?"</Paragraph> <Paragraph position="2"> The third type of questions are called linguist's questions; these are compiled by an expert grammarian. Such questions can concern membership relations of words or sets of words, or morphological features of words.</Paragraph> <Paragraph position="3"> Out of the set of events, a decision tree is constructed. The root node of the decision tree represents the set of all the events, with each event containing the correct tag for the corresponding word. The probability distribution of tags for the root node can be obtained by calculating the relative frequencies of tags in the set. By asking the value of a specific feature of each event in the set, the set can be split into N subsets, where N is the number of possible values for the feature. We can then calculate the conditional probability distribution of tags for each subset, conditioned on the feature value. After computing, for each feature, the entropy reduction incurred by splitting the set, we choose the feature which yields the maximum entropy reduction. By repeating this step and dividing the sets into their subsets, we can construct a decision tree whose leaf nodes contain conditional probability distributions of tags.</Paragraph> <Paragraph position="6"> The obtained probability distributions are then smoothed using the held-out data. The reader is referred to (Magerman 1994) for the details of smoothing. In the test phase the system looks up the conditional probability distribution of tags for each word in the test text and chooses the most probable tag sequence using beam search.</Paragraph>
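Returning to the tree-growing step above, the selection of the feature with maximum entropy reduction can be sketched as follows. The dictionary-based event representation and the tiny example event set are our own simplification for illustration, not the tagger's actual data structures.

```python
import math
from collections import Counter, defaultdict

def tag_entropy(events):
    """Entropy (in bits) of the tag distribution over a set of events."""
    counts = Counter(e["tag"] for e in events)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def best_feature(events, features):
    """Pick the feature whose split yields the maximum entropy reduction."""
    base = tag_entropy(events)
    best, best_reduction = None, 0.0
    for f in features:
        subsets = defaultdict(list)
        for e in events:
            subsets[e[f]].append(e)      # one subset per observed feature value
        # Weighted average entropy of the subsets, conditioned on the feature value.
        cond = sum(len(s) / len(events) * tag_entropy(s) for s in subsets.values())
        reduction = base - cond
        if reduction > best_reduction:
            best, best_reduction = f, reduction
    return best, best_reduction

# Tiny illustrative event set with hypothetical features and tags:
events = [
    {"word(-1)": "the", "bit29(-1)": "0", "tag": "NN"},
    {"word(-1)": "the", "bit29(-1)": "1", "tag": "NN"},
    {"word(-1)": "to",  "bit29(-1)": "0", "tag": "VB"},
    {"word(-1)": "to",  "bit29(-1)": "1", "tag": "VB"},
]
print(best_feature(events, ["word(-1)", "bit29(-1)"]))   # ('word(-1)', 1.0)
```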
<Paragraph position="7"> We used WSJ texts and the ATR corpus for the tagging experiment. The WSJ texts were re-tagged manually using the ATR syntactic tag set. The ATR corpus is a comprehensive sampling of written American English, displaying language use in a very wide range of styles and settings, and compiled from many different domains (Black et al. 1996). Since the ATR corpus is still in the process of development, the size of the texts we have at hand for this experiment is rather small considering the large size of the tag set. Table 1 shows the sizes of the texts used for the experiment. Figure 4 shows the tagging error rates plotted against various clustering text sizes. Out of the three types of questions, basic questions and word bits questions are used in this experiment. To evaluate the effect of introducing the word bits into the tagger, we also performed a separate experiment in which a randomly generated bit-string is assigned to each word² and basic questions and word bits questions are used. The results of this control experiment are plotted at zero clustering text size. For both the WSJ texts and the ATR corpus, the tagging error rate dropped by more than 30% when using word bits information extracted from the 5MW text, and increasing the clustering text size further decreases the error rate. At 50MW, the error rate drops by 43%. This shows that the quality of the clusters improves with increasing size of the clustering text. The overall high error rates are attributed to the very large tag set and the small training set. One notable point in this result is that introducing word bits constructed from WSJ texts is as effective for tagging ATR texts as it is for tagging WSJ texts, even though these texts are from very different domains. To that extent, the obtained hierarchical clusters are considered to be portable across domains.</Paragraph> <Paragraph position="8"> ²Since a distinctive bit-string is assigned to each word, the tagger also uses the bit-string as an ID number for each word in the process. In this control experiment bit-strings are assigned in a random way, but no two words are assigned the same word bits. Random word bits are expected to give no class information to the tagger except for the identity of words.</Paragraph> <Paragraph position="9"> Figure 5 contrasts the tagging results using only word bits against the results with both word bits and linguistic questions³ for the WSJ text. The zero clustering text size again corresponds to a randomly generated bit-string. Introduction of linguistic questions is shown to significantly reduce the error rates for the WSJ corpus. Note that the dependency of the error rates on the clustering text size is quite similar in the two cases. This indicates the effectiveness of combining automatically created word bits and hand-crafted linguistic questions in the same platform, i.e., as features. In Figure 5 the tagging error rates seem to be approaching saturation beyond a clustering text size of 50MW. However, whether further improvement can be obtained by using texts of greater size remains an open question.</Paragraph> </Section> <Section position="2" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 3.2 Reshuffling </SectionTitle> <Paragraph position="0"> One way to improve the quality of word bits is to introduce a reshuffling process just after step 1 (MI-clustering) of the word bits construction process (cf. Section 2.2). The reshuffling process we adopted is quite simple.</Paragraph> <Paragraph position="1"> 1. Pick a word from the vocabulary. Move the word from its current class to another class if that movement yields the largest increase in AMI among all the possible movements.</Paragraph> <Paragraph position="2"> 2. Repeat step 1 for every word, from the most frequent word through the least frequent word. This constitutes one round of reshuffling. After several rounds of reshuffling, the word bits construction process is resumed from step 2 (Outer-clustering).</Paragraph> <Paragraph position="3"> Figure 6 shows the tagging error rates with word bits obtained by zero, two, and five rounds of reshuffling⁴ with a 23MW text. The tagging results presented in Figure 5 are also shown as a reference. Although the vocabulary used in this experiment is slightly different from that of the other experiments, we can clearly see the effect of reshuffling both for the word-bits-only case and for the case with word bits and linguistic questions. After five rounds of reshuffling, the tagging error rates become much smaller than the error rates obtained using the 50MW clustering text with no reshuffling. It is yet to be determined whether the effect of reshuffling increases with increasing amounts of clustering text.</Paragraph> </Section> </Section> </Paper>