File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-0111_metho.xml
Size: 22,481 bytes
Last Modified: 2025-10-06 14:14:37
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0111"> <Title>Clustering Co-occurrence Graph based on Transitivity</Title> <Section position="4" start_page="0" end_page="91" type="metho"> <SectionTitle> 2 Word Ambiguity and Transitivity </SectionTitle> <Paragraph position="0"> Two words are said to co-occur when they frequently appear close to each other within texts.</Paragraph> <Paragraph position="1"> Regarding words as nodes and co-occurring relations as branches, a graph can be constructed from a given corpus. We call such a graph a co-occurrence graph.</Paragraph> <Paragraph position="2"> When a portion of a corpus specializes in a topic, we can still extract a co-occurrence graph from that portion. A general corpus, such as a newspaper corpus, contains many portions, each specializing in one topic. Therefore, the whole co-occurrence graph obtained from a general corpus contains subgraphs, each specializing in one topic. Our problem is to extract such topic subgraphs from a co-occurrence graph.</Paragraph> <Paragraph position="3"> We denote by V the set of nodes (words), by E the set of branches (co-occurrence relations), by $G = \langle V, E \rangle$ the input graph, and by $|V|$ the number of nodes. English words used as examples are written in a distinct font.</Paragraph> <Section position="1" start_page="91" end_page="91" type="sub_section"> <SectionTitle> 2.1 Transitivity in Co-occurrence Relation </SectionTitle> <Paragraph position="0"> The most basic mathematical laws concerning relations between elements of a set are the reflexive, symmetric and transitive laws. For $a, b, c \in V$ and a relation R, they can be stated as follows: Reflexive $aRa$.</Paragraph> <Paragraph position="1"> Symmetric $aRb \Rightarrow bRa$.</Paragraph> <Paragraph position="2"> Transitive $aRb,\ bRc \Rightarrow aRc$.</Paragraph> <Paragraph position="3"> Let V be the word set and R be the co-occurrence relation.
When each property holds for R, it can be interpreted for words a, b and c from the linguistic viewpoint as follows: Reflexive Word a co-occurs with itself. Symmetric The co-occurrence relation does not depend on the order of occurrence.</Paragraph> <Paragraph position="4"> Transitive Word b does not have two-sided meanings (ambiguity). For instance, doctor, which has both a medical and an academic meaning, co-occurs with nurse within a medical topic and with professor within an academic topic. However, nurse and professor do not co-occur, so transitivity among nurse, doctor and professor does not hold.</Paragraph> <Paragraph position="5"> Our goal is to extract subgraphs, each of which focuses on one topic with no ambiguity. Therefore, we perform clustering by extracting subgraphs whose branches form transitive co-occurrence relations.</Paragraph> </Section> </Section> <Section position="5" start_page="91" end_page="93" type="metho"> <SectionTitle> 3 Transitivity and Clustering </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="91" end_page="92" type="sub_section"> <SectionTitle> 3.1 Decomposition and Duplication </SectionTitle> <Paragraph position="0"> The simplest case is a graph of three nodes.</Paragraph> <Paragraph position="1"> Figure 1-(1) is a graph in which transitivity does not hold. For example, when b is doctor, a is nurse and c is professor, nurse and professor do not co-occur because of node b's two-sided meanings. Therefore, as in Figure 1-(2), we &quot;duplicate&quot; b so that each duplicated node corresponds to a single meaning. Then the ambiguity within b is resolved and the entire graph is divided into two subgraphs, the academic one and the medical one.
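The three-node case can be illustrated with a short sketch. It is only an illustration under stated assumptions: the input is taken to be a list of symmetric branches, and the function names build_adjacency and transitivity_violations are hypothetical, not part of the method described in this paper. The sketch lists every triple a-b-c in which a-b and b-c are branches but a-c is not, so that b is a candidate for duplication.

```python
from itertools import combinations


def build_adjacency(branches):
    """Build a symmetric adjacency map from (word, word) branches."""
    adjacency = {}
    for a, b in branches:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def transitivity_violations(adjacency):
    """Return triples (a, b, c) where a-b and b-c exist but a-c does not."""
    violations = []
    for b in sorted(adjacency):
        # Every pair of neighbours of b should itself be connected.
        for a, c in combinations(sorted(adjacency[b]), 2):
            if c not in adjacency[a]:
                violations.append((a, b, c))
    return violations


# Hypothetical toy graph mirroring the doctor example.
branches = [("nurse", "doctor"), ("doctor", "professor")]
print(transitivity_violations(build_adjacency(branches)))
# [('nurse', 'doctor', 'professor')]
```

A triangle, in which all three branches exist, yields no violation, which matches the case where the graph cannot be decomposed.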
To sum up, when transitivity does not hold, a graph can be decomposed by duplicating the ambiguous node.</Paragraph> <Paragraph position="2"> On the other hand, when transitivity holds among three nodes (Figure 1-(3)), the</Paragraph> <Paragraph position="4"> graph cannot be decomposed by duplication of b (Figure 1-(4)). This can be interpreted as the graph having no ambiguity. We extend the above to the case of four nodes (Figure 1-(5)). Here, transitivity does not hold in a-b-d because there is no branch between a and d. When b-c is duplicated, the graph can be decomposed into two subgraphs (Figure 1-(6)) in which transitivity holds. In contrast, Figure 1-(7) cannot be decomposed by duplicating b-c because of the branch a-d (Figure 1-(8)); this shows that b-c is not ambiguous. Note that Figure 1-(7) is a complete graph of 4 nodes. We define a duplicate branch as a branch to be duplicated for graph decomposition (such as b-c) and an anchor branch as a branch which inhibits graph decomposition by duplication (such as a-d).</Paragraph> <Paragraph position="5"> In general, when a graph cannot be separated by duplicating one of its subgraphs, that subgraph is regarded as having no ambiguity.</Paragraph> <Paragraph position="6"> Therefore, ideal clustering decomposes graphs into subgraphs which cannot be decomposed further by duplication. Unfortunately, this constraint is too strict because such a subgraph is restricted to a complete graph. In addition, extracting complete graphs within a given graph is NP-complete. Therefore we discuss in the following how to loosen the constraint.</Paragraph> </Section> <Section position="2" start_page="92" end_page="93" type="sub_section"> <SectionTitle> 3.2 Transitive Graph </SectionTitle> <Paragraph position="0"> There are two methods to loosen the constraint.
The first is to decrease the number of anchor branches. In a complete graph of more than 5 nodes, several anchor branches exist for each duplicate branch (Figure 2-(1)). However, a single anchor branch is sufficient to inhibit the decomposition. The fewer anchor branches are required, the looser the constraint is.</Paragraph> <Paragraph position="1"> Intuitively, this corresponds to loosening the sharpness of the topic focus in the resulting cluster. For instance, the two words pneumonia and cancer do not always co-occur, but they do co-occur with words such as doctor, nurse and hospital, which form the core of medical topics. Pneumonia will be included in a cluster if it is connected with these three words even if it is not connected with cancer. If cancer is also connected with these three words, both cancer and pneumonia, which are words of different subtopics within a medical topic, are included in one cluster. The second is to loosen the transitivity itself. It was defined in Section 2.1 for three nodes. We may introduce a loose transitivity as follows: $v_1 R v_2, \ldots, v_{n-1} R v_n \Rightarrow v_1 R v_n$. We define the anchor distance as the maximum of the minimum distances of a-b-d and a-c-d. For example, when the minimum distance of a-b-d is 4 and that of a-c-d is 6, the anchor distance is 6. The tightest constraint is an anchor distance of 2, as in Figure 2-(3). This also blurs the topic focus of a cluster. In the example of pneumonia, the word will be included if it is connected directly with at least one of the words doctor, hospital, nurse and cancer, and connected indirectly with the others.</Paragraph> <Paragraph position="2"> For $m, n &lt; |V| - 1$, G is called an (m, n)-transitive graph when, for all $e \in E$, there are m anchor branches $e' \in E$ of anchor distance $\le n$.</Paragraph> <Paragraph position="3"> (m, n)-transitive graphs can be extracted as subgraphs of the input graph. Figure 3 shows a map of (m, n)-transitive graphs.
The axis of ordinates describes the number of anchor branches (m). The axis of abscissas describes the anchor distance (n). For the same input, $(m_1, n)$-transitive graphs are included in $(m_2, n)$-transitive graphs when $m_1 &lt; m_2$, and $(m, n_1)$-transitive graphs are included in $(m, n_2)$-transitive graphs when $n_2 &lt; n_1$.</Paragraph> <Paragraph position="4"> GS2 in Figure 3 denotes the clusters obtained under the loosest constraint: m is the maximum and n is the minimum. In GS2, all ambiguity of a branch and of the nodes at its ends is resolved. GS0 denotes the transitive graphs of the tightest constraint. All transitive graphs on the horizontal line including GS0 are complete.</Paragraph> </Section> </Section> <Section position="6" start_page="93" end_page="93" type="metho"> <SectionTitle> 4 Extraction of Transitive Graphs </SectionTitle> <Paragraph position="0"> So far, we have not explained how to detect the duplicate and anchor branches in a given graph.</Paragraph> <Paragraph position="1"> An algorithm for clustering can be top-down or bottom-up. The former gives clusters by detecting duplicate branches and decomposing the input graph.</Paragraph> <Paragraph position="2"> Although we have explained our clustering method in a top-down manner up to now, we propose it as a bottom-up method. We obtain clusters by accumulating adjacent nodes so that every branch has anchor branches and the resulting clusters include no duplicate branch. Thus, in the bottom-up method, we need not detect duplicate branches. This is convenient, because the condition for anchor and duplicate branches is expressed by local relationships among nodes.</Paragraph> <Paragraph position="3"> The branches in the input graph are all assumed to be symmetric.
In this section, we use the term clusters for our output and subgraphs for their candidates.</Paragraph> <Section position="1" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 4.1 Definition of Clusters </SectionTitle> <Paragraph position="0"> We extract GS1, the (1, 2)-transitive graph.</Paragraph> <Paragraph position="1"> A subgraph A including a branch e in the input graph can be extracted as follows: Step 1. Put a triangle graph including e into A.</Paragraph> <Paragraph position="2"> Step 2. Take a branch $e'$ in A and a node v which makes a triangle with $e'$ (Figure 4). If the following condition is satisfied, put v into A.</Paragraph> <Paragraph position="3"> There exists a node $v' \in G$ (input graph) whose distance from $e'$ is 1, and it is connected to v with a branch. Here, the branch $v'$-v is the anchor branch, which prevents $e'$ from being a duplicate branch in the resulting cluster.</Paragraph> <Paragraph position="4"> Additionally, put every branch between $v'' \in G$ and v into A.</Paragraph> <Paragraph position="5"> Step 3. Repeat Step 2 until A cannot be extended. Performing the above procedure starting from every branch in the input graph, we obtain many subgraphs. Under the inclusion relation, these subgraphs constitute a partial order (Figure 5). We define clusters as the maximal subgraphs in this partial order, that is, subgraphs not included in any other subgraph. The uniqueness of the clusters for an input is self-evident.</Paragraph> </Section> <Section position="2" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 4.2 Algorithm for Clustering </SectionTitle> <Paragraph position="0"> In the previous section, the procedure to obtain subgraphs begins from every branch in the input. However, it is sufficient to compute as follows.</Paragraph> <Paragraph position="1"> Step 0. $i \leftarrow 0$</Paragraph> <Paragraph position="3"> Step 1. Choose a branch $e \in G$ not included in $G_0, \ldots, G_{i-1}$.
If no e is found, go to Step 5. $G_i \leftarrow \langle \emptyset, \emptyset \rangle$. Put a triangle graph including e into $G_i$.</Paragraph> <Paragraph position="4"> Steps 2 and 3. Extend $G_i$ using Steps 2 and 3 of the previous section. Step 4. Set $i \leftarrow i + 1$ and go to Step 1.</Paragraph> <Paragraph position="5"> Step 5. Examine every pair of subgraphs (A, B), and if A includes B, drop B. The remaining subgraphs are defined as clusters. A maximal subgraph cannot be missed. Its starting branch is encountered without fail in the above algorithm. If it is encountered as the starting branch in Step 1, the maximal subgraph is obtained. If it is captured into a subgraph and becomes $e'$ in Step 2, the subgraph extends to the size of the maximal subgraph; if it grew larger, this would contradict maximality as established in the previous section.</Paragraph> <Paragraph position="6"> The algorithm halts since the input graph is finite, and the output is unique for an input.</Paragraph> </Section> </Section> <Section position="7" start_page="93" end_page="95" type="metho"> <SectionTitle> 5 Related Work </SectionTitle> <Paragraph position="0"> [Li and Abe, 1996] compared the clustering methods proposed so far [Hindle, 1990] [Brown et al., 1992] [Pereira et al., 1993] [Tokunaga et al., 1995] [Li and Abe, 199?]. Most of them are so-called hard clustering: each word is included in only one cluster. We do not follow this trend, because our objective is the extraction of clusters of topics. It is natural that an ambiguous word should be included in different clusters.</Paragraph> <Paragraph position="1"> [Pereira et al., 1993] adopts soft clustering. They measured co-occurrence between nouns and verbs, and clustered nouns with the same distribution of verbs.</Paragraph> <Paragraph position="2"> [Fukumoto and Tsujii, 1994]'s work shares our motivation: ambiguity should be resolved for clustering.
They clustered verbs using the gravity of multivariate analysis.</Paragraph> <Paragraph position="3"> [Sugihara, 1995]'s approach is similar to ours in that it focuses on graph structure for clustering and tries to structure the input graph, a bipartite graph of words and concepts (such as food, fruit, etc.). His clustering method is the so-called Dulmage-Mendelsohn decomposition in graph theory. The output naturally gives a partial order of clusters which can be compared with conventional thesauri.</Paragraph> <Paragraph position="4"> Our input is not bipartite. In the beginning, we tried to decompose the input graph into maximal strongly connected components to obtain graphs of topics, based on the observation that nodes in a cycle are strongly related (footnote 1).</Paragraph> <Paragraph position="5"> However, subgraphs about different topics are merged into the same cluster by two ambiguous words which bridge these two subgraphs (Figure 6). Next, we observed that articulation nodes are ambiguous, so we performed decomposition into biconnected components. In this case, when several biconnected components are connected in a ring, articulation nodes cannot be detected (Figure 7). The observation that there is no co-occurrence relationship between</Paragraph> <Paragraph position="6"> two biconnected graphs across the articulation node was the starting point of this paper. Footnote 1: [Tokunaga and Tanaka, 1990] discusses the extraction of cycles formed by translation relations from a bilingual dictionary.</Paragraph> </Section> <Section position="8" start_page="95" end_page="96" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="95" end_page="95" type="sub_section"> <SectionTitle> 6.1 Procedure of Clustering </SectionTitle> <Paragraph position="0"> First, we make the input graph from 30 Mbytes of Wall Street Journal text. Co-occurrences of nouns and verbs are extracted by a morphological analyzer.
We define a word as co-occurring with the 5 words ahead of it within a sentence. The degree of co-occurrence is measured by mutual information [Church and Hanks, 1990].</Paragraph> <Paragraph position="1"> We apply a threshold to these values to extract the input graph.</Paragraph> <Paragraph position="2"> The number of resulting clusters depends on the input graph as follows: When the threshold value is too high, the output is 3. Conversely, when it is too low, the output becomes 1. Neither 1 nor 3 is interesting, because 1 is a graph including all topics, and 3 generates graphs of topics too small to show the global trend of topics in the input. Therefore, we varied the threshold from 1.0 to 7.0 in steps of 0.5 to make the input graph, and applied our algorithm to each input in order to find the best threshold.</Paragraph> <Paragraph position="3"> The result is shown in Figure 8. The number of clusters whose sizes are more than 7 is plotted against the threshold value. When the threshold is 3.5, such clusters were most numerous. The 39 clusters contained 727 different words out of the 15347 in the input graph.</Paragraph> </Section> <Section position="2" start_page="95" end_page="96" type="sub_section"> <SectionTitle> 6.2 Evaluation of Clusters </SectionTitle> <Paragraph position="0"> The Appendix shows the contents of the 39 clusters, sorted by size. Words judged inappropriate in each cluster are marked, and words whose suitability for their cluster could not be decided are marked with a separate symbol.</Paragraph> <Paragraph position="1"> All 39 clusters are annotated with four items as follows: ER = 10.3%, UR = 14.7%; hence, ambiguity was removed from the clusters up to 85% on average. There was only 1 cluster whose topic was inestimable. The estimation of the topic becomes difficult with two factors, CS and UR. When CS is too small, even when UR is 0.0, the cluster itself lacks information. When UR is high, it is natural that the topic becomes inestimable.
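The construction of the input graph described in Section 6.1 can be sketched roughly as follows. This is a minimal sketch under several assumptions that are not taken from the paper: sentences are already tokenized and filtered to nouns and verbs, pairs are treated as unordered, probabilities are estimated as raw counts over the total number of tokens, and the name cooccurrence_graph and its parameters are hypothetical. The score is the pointwise mutual information, log2 of P(x, y) / (P(x) P(y)).

```python
import math
from collections import Counter


def cooccurrence_graph(sentences, window=5, threshold=3.5):
    """Return {(word, word): score} for pairs whose score reaches the threshold."""
    word_count = Counter()
    pair_count = Counter()
    total = 0
    for sentence in sentences:
        total += len(sentence)
        word_count.update(sentence)
        for i, word in enumerate(sentence):
            # Pair each word with the next `window` words of the sentence.
            for other in sentence[i + 1 : i + 1 + window]:
                if other != word:
                    pair_count[frozenset((word, other))] += 1
    branches = {}
    for pair, joint in pair_count.items():
        x, y = sorted(pair)
        # log2( (joint / N) / ((f(x) / N) * (f(y) / N)) ), with N = total.
        score = math.log2(joint * total / (word_count[x] * word_count[y]))
        if score >= threshold:
            branches[(x, y)] = score
    return branches


# Hypothetical toy corpus; real input would be the analyzed newspaper text.
toy = [["doctor", "nurse", "hospital"], ["doctor", "nurse"], ["stock", "market"]]
print(sorted(cooccurrence_graph(toy, threshold=2.0)))
```

Raising the threshold keeps fewer branches and lowering it keeps more, which is the knob varied in the experiment above.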
</Paragraph> </Section> <Section position="3" start_page="96" end_page="96" type="sub_section"> <SectionTitle> 6.3 Evaluation of Words Contained in More than Two Clusters </SectionTitle> <Paragraph position="0"> The number of words belonging to more than two clusters amounts to 57. They are classified as follows (numbers in parentheses are cluster numbers in the Appendix): 1. A word with different meanings (10 words).</Paragraph> <Paragraph position="1"> cell (1, 15), ice (3, 8), panel (1, 7), treat (1, 3) 2. A word with the same meaning but used in different contexts (32 words). star (9, 12, 14), brand (3, 22) 3. A word with the same meaning in the same context (7 words).</Paragraph> <Paragraph position="2"> 4. Others (one of the words is uncertain, or its cluster's context is not estimated; 6 words). Words of class 1 are the ambiguous words. Cell in Cluster 1 means a cell of tissue, whereas that in Cluster 15 means a battery. Ice in Cluster 3 means ice for cooling beverages, whereas that in Cluster 8 means ice to skate on.</Paragraph> <Paragraph position="3"> Given our objective of obtaining subgraphs of topics, it is quite important that words in class 2 be duplicated. For instance, star in Cluster 9 is a sports star, that in Cluster 12 is a singing star and that in Cluster 14 is a movie star. If star were not duplicated, the three different topics would be merged into a single subgraph. The same situation is observed for children: it would merge the topics of childbirth and education into one graph if it were not duplicated.
We are apt to pay attention only to the words of class 1, but those of class 2 play an important role in clustering.</Paragraph> <Paragraph position="4"> Words in class 3 are not ambiguous: they should connect two subgraphs into one (see Section 7).</Paragraph> </Section> <Section position="4" start_page="96" end_page="96" type="sub_section"> <SectionTitle> 6.4 Cluster Hierarchy </SectionTitle> <Paragraph position="0"> An output subgraph of a higher threshold is included in one of a lower threshold. With this inclusion relation, the clusters form a hierarchy (Figure 9). A part of the hierarchy is shown below. Threshold 3.75: A: admission college scholarship; B: admission applicant college; C: campus children classroom college education enroll faculty grade math parent scholarship school student sugar teach teacher tuition tutor university voucher; D: birth child children couple marriage marry parent wedlock woman. Threshold 3.25: G: admission applicant baby birth boy campus century child children classroom college couple daughter education endowment enroll enrollment establishment faculty father gift girl god grade homework husband infant ivy kid live love man marriage marry math mother oxford parent professor psychologist scholar scholarship school son student study sugar taught teach teacher teaching toy tuition tutor university voucher wed wedlock woman. At threshold 3.75, the origins of the education (Cluster 6) and childbirth (Cluster 24) clusters are already formed. Within education, there are subtopics on scholarship, school and school entrance. They are merged into Cluster 6 when the threshold is lowered to 3.5. Cluster 24 is likewise formed from Clusters D, E and F of threshold 3.75. Then Clusters 6 and 24 are merged into Cluster G when the threshold is lowered again to 3.25. The clusters' hierarchical relationships are shown in Figure 9. We can see that the topic is more specialized when the threshold is high.
Clusters merged between thresholds 3.75 and 3.5 were those within one topic (A, B, C or D, E, F), but clusters of different topics are merged at 3.25 (Clusters 6 and 24). Thus, the lower the threshold is, the more ambiguity a cluster contains. The reason is that two words in different topics do not co-occur.</Paragraph> </Section> </Section> <Section position="9" start_page="96" end_page="98" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> The best threshold differs between topics. Some examples are: Economic topic: Although Wall Street Journal articles have an economic tendency, clusters on economic topics cannot be found among the clusters with more than 7 words at threshold 3.5. They appear in clusters at threshold 3.0 as follows: * accountant audit bracket deduction filer income offset tax taxpayer * convert conversion debenture debt holder outstand prefer redeem redemption repay tidewater The threshold should be lower for this topic.</Paragraph> <Paragraph position="1"> Medical topic: Cluster 1 has too many words. Although it is a medical cluster, potato appears in it. At threshold 3.0, Cluster 3 is completely merged with Cluster 1. The appearance of potato is a sign that the merging of two different topics has already begun. Therefore, a higher threshold is preferred.</Paragraph> <Paragraph position="2"> Trial topic: Several clusters on trials exist in the Appendix. They should form one cluster at a relatively lower threshold.</Paragraph> <Paragraph position="3"> Consequently, one of the most important items of future work is to integrate the two stages, the first stage of making the input graph with a static threshold and the second stage of clustering, into a single stage with a dynamic threshold.</Paragraph> </Section> </Paper>