<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1305">
  <Title>Topic Analysis Using a Finite Mixture Model</Title>
  <Section position="4" start_page="35" end_page="36" type="metho">
    <SectionTitle>
3 Word Clustering
</SectionTitle>
    <Paragraph position="0"> Before conducting topic analysis, we create word clusters using a large data corpus. More precisely, we treat all words in a vocabulary as seed words, and for each seed word we collect from the data those words which frequently co-occur with it and group them into a cluster.</Paragraph>
    <Paragraph position="1"> As one example, the word-cluster in Figure 1 has been constructed with the word 'trade' as the seed word.</Paragraph>
    <Paragraph position="2"> We have developed a new method for reliably collecting frequently co-occurring words on the basis of stochastic complexity, or the MDL principle. For a given data sequence $x^m = x_1 \cdots x_m$ and for a fixed probability model $M$,1 the stochastic complexity of $x^m$ relative to $M$, which we denote as $SC(x^m : M)$, is defined as the least code length required to encode $x^m$ with $M$ (Rissanen, 1996). $SC(x^m : M)$ can be interpreted as the amount of information included in $x^m$ relative to $M$. 1 Here, we use 'model' to refer to a probability distribution which has specified parameters but unspecified parameter values.</Paragraph>
    <Paragraph position="3"> The MDL (Minimum Description Length) principle is a model selection criterion which asserts that, for a given data sequence, the lower a model's SC value, the greater its likelihood of being a model which would have actually generated the data. MDL has many good properties as a criterion for model selection.2 For a fixed seed word $s$, we take a word $w$ as a frequently co-occurring word if the presence of $s$ is a statistically significant indicator of the presence of $w$.</Paragraph>
    <Paragraph position="4"> Let a data sequence $(s_1, w_1), (s_2, w_2), \ldots, (s_m, w_m)$ be given, where $(s_i, w_i)$ denotes the state of co-occurrence of words $s$ and $w$ in the $i$-th text in the corpus data. Here, $s_i \in \{0, 1\}$ and $w_i \in \{0, 1\}$, where 1 denotes the presence of a word and 0 its absence. We further denote $s^m = s_1 \cdots s_m$ and $w^m = w_1 \cdots w_m$.</Paragraph>
    <Paragraph position="7"> Then, as in (Rissanen, 1996), the SC value of $w^m$ relative to a model $I$ in which the presence or absence of $w$ is independent of that of $s$ (i.e., a Bernoulli model) is calculated as $SC(w^m : I) = m\,H\!\left(\frac{m^+}{m}\right) + \frac{1}{2}\log\frac{m}{2\pi} + \log\pi$, where $m^+$ denotes the number of 1's in $w^m$.</Paragraph>
    <Paragraph position="8"> Here, $\log$ denotes the logarithm to base 2, $\pi$ the circular constant, and $H(z) \stackrel{\mathrm{def}}{=} -z\log z - (1-z)\log(1-z)$.</Paragraph>
    <Paragraph position="10"> Let $w^{m_s}$ be the sequence of all $w_i$'s ($w_i \in w^m$) whose corresponding $s_i$ is 1, where $m_s$ denotes the number of 1's in $s^m$. Let $w^{m_{\neg s}}$ be the sequence of all $w_i$'s ($w_i \in w^m$) whose corresponding $s_i$ is 0, where $m_{\neg s}$ denotes the number of 0's in $s^m$. The SC value of $w^m$ relative to a model $D$ in which the presence or absence of $w$ is dependent on that of $s$ is then calculated as $SC(w^m : D) = \left(m_s H\!\left(\frac{m_s^+}{m_s}\right) + \frac{1}{2}\log\frac{m_s}{2\pi} + \log\pi\right) + \left(m_{\neg s} H\!\left(\frac{m_{\neg s}^+}{m_{\neg s}}\right) + \frac{1}{2}\log\frac{m_{\neg s}}{2\pi} + \log\pi\right)$, where $m_s^+$ denotes the number of 1's in $w^{m_s}$, and $m_{\neg s}^+$ the number of 1's in $w^{m_{\neg s}}$.</Paragraph>
    <Paragraph position="11"> 2 For an introduction to MDL, see (Li, 1998).</Paragraph>
    <Paragraph position="12"> We can then calculate $\delta SC = SC(w^m : I) - SC(w^m : D) = m\left[H\!\left(\frac{m^+}{m}\right) - \frac{m_s}{m}H\!\left(\frac{m_s^+}{m_s}\right) - \frac{m_{\neg s}}{m}H\!\left(\frac{m_{\neg s}^+}{m_{\neg s}}\right)\right] + \left\{\frac{1}{2}\log\frac{m}{2\pi} - \frac{1}{2}\log\frac{m_s}{2\pi} - \frac{1}{2}\log\frac{m_{\neg s}}{2\pi} - \log\pi\right\}. \quad (1)</Paragraph>
    <Paragraph position="14"> According to the MDL principle, the larger the $\delta SC$ value, the more likely it is that the presence or absence of $w$ is dependent on that of $s$.3 In practice, we regard a word $w$ as one which occurs significantly frequently with the seed word $s$ if its $\delta SC$ value is larger than a predetermined threshold $\gamma$ and $P(w|s) > P(w)$ is satisfied.</Paragraph>
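To make the criterion concrete, the following is a minimal Python sketch of the delta-SC test as reconstructed above: it computes the Bernoulli stochastic complexities SC(w^m : I) and SC(w^m : D) and accepts w as a frequent co-occurrer of s when delta SC exceeds gamma and P(w|s) > P(w). The function names are ours, and the constant terms follow our reconstruction of the formulas rather than the authors' implementation.

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def bernoulli_sc(ones: int, total: int) -> float:
    """Stochastic complexity of a binary sequence under a Bernoulli model:
    total * H(ones/total) + (1/2) log2(total / (2 pi)) + log2(pi)."""
    if total == 0:
        return 0.0
    return (total * binary_entropy(ones / total)
            + 0.5 * math.log2(total / (2.0 * math.pi))
            + math.log2(math.pi))

def delta_sc(s_bits, w_bits) -> float:
    """delta SC = SC(w^m : I) - SC(w^m : D), where D conditions w on s."""
    m = len(w_bits)
    w_plus = sum(w_bits)                                   # number of 1's in w^m
    sc_indep = bernoulli_sc(w_plus, m)
    m_s = sum(s_bits)                                      # texts containing the seed s
    m_not_s = m - m_s
    w_plus_s = sum(w for s, w in zip(s_bits, w_bits) if s == 1)
    w_plus_not_s = w_plus - w_plus_s
    sc_dep = bernoulli_sc(w_plus_s, m_s) + bernoulli_sc(w_plus_not_s, m_not_s)
    return sc_indep - sc_dep

def cooccurs_significantly(s_bits, w_bits, gamma: float) -> bool:
    """w co-occurs significantly with seed s if delta SC > gamma and P(w|s) > P(w)."""
    m, m_s = len(s_bits), sum(s_bits)
    if m_s == 0:
        return False
    p_w = sum(w_bits) / m
    p_w_given_s = sum(w for s, w in zip(s_bits, w_bits) if s == 1) / m_s
    return delta_sc(s_bits, w_bits) > gamma and p_w_given_s > p_w
```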
    <Paragraph position="15"> Note that the word clustering process is independent of topic analysis. While one could employ other methods for word clustering here (e.g., (Hofmann, 1999)), our clustering algorithm is more efficient than conventional ones. For example, Hofmann's is of order $O(|D||W|^2)$, while ours is only of $O(|D| + |W|^2)$, where $|D|$ denotes the number of texts and $|W|$ the number of words. This means that our method is more practical when a large amount of text data is available.</Paragraph>
  </Section>
  <Section position="5" start_page="36" end_page="36" type="metho">
    <SectionTitle>
4 Topic Analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
4.1 Input and Output
</SectionTitle>
      <Paragraph position="0"> In topic analysis, we use STM to parse a given text and output a topic structure which consists of segmented blocks and their topics. Figure 3 shows an example topic structure output by our method. The text has been segmented into five blocks, and to each block a number of topics having high probability values have been assigned (topics are represented by their seed words). The topic structure clearly represents what topics are included in the text and how the topics change within the text.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="36" end_page="37" type="metho">
    <SectionTitle>
4.2 Outline
</SectionTitle>
    <Paragraph position="0"> Our topic analysis consists of three processes: a pre-process called 'topic spotting,' text segmentation, and topic identification. In topic spotting, we select the topics discussed in a given text. We can then construct STMs on the basis of these topics. In text segmentation, we segment the text on the basis of the STMs, assuming that each block is generated by an individual STM. In topic identification, we estimate the parameters of the STM for each segmented block and select topics with high probabilities for the block. In this way, we obtain a topic structure for the text. 3 Note that the quantity within [...] in (1) is the (empirical) mutual information, which is an effective measure for word co-occurrence calculation (cf. (Brown et al., 1992)). When the sample size is small, mutual information values tend to be undesirably large. The quantity within {...} in (1) can help avoid this undesirable tendency because its value becomes large when the data size is small.</Paragraph>
  </Section>
  <Section position="7" start_page="37" end_page="39" type="metho">
    <SectionTitle>
Figure 3: An example topic structure (Reuters article "ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT," 25-MAR-1987)
</SectionTitle>
    <Paragraph position="0"> [Figure 3: the article's sentences 0-26 are segmented into five blocks (0-4), and each block is annotated with its highest-probability topics, e.g., block 0: trade-export-tariff-import (0.12), Japan-Japanese (0.07), US (0.06); block 2: Hong-Kong (0.16), trade-export-tariff-import (0.10), US (0.06). Legend: 0-26 are sentence ids; the values in parentheses are probability values.]</Paragraph>
    <Section position="1" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
4.3 Topic Spotting
</SectionTitle>
      <Paragraph position="0"> In topic spotting, we first select key words from a given text. We calculate what we call the Shannon information of each word in the text. The Shannon information of word $w$ in text $t$ is defined as $I(w) = N(w) \log \frac{1}{P(w)}$, where $N(w)$ denotes the frequency of $w$ in $t$, and $P(w)$ the probability of the occurrence of $w$ as estimated from corpus data. $I(w)$ may be interpreted as the amount of information represented by $w$. We select as key words the top $l$ words sorted in descending order of $I(w)$.</Paragraph>
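As an illustration, here is a small Python sketch of key word selection by Shannon information, assuming the definition I(w) = N(w) log(1/P(w)) reconstructed above; the probability floor for unseen words and the function names are our own choices.

```python
import math
from collections import Counter

def select_key_words(text_words, corpus_word_prob, top_l=10):
    """Rank the words of a text by Shannon information
    I(w) = N(w) * log2(1 / P(w)), where N(w) is the in-text frequency and
    P(w) a corpus-estimated probability, and return the top-l words."""
    counts = Counter(text_words)

    def info(w):
        p = corpus_word_prob.get(w, 1e-6)   # small floor for unseen words (our choice)
        return counts[w] * math.log2(1.0 / p)

    ranked = sorted(counts, key=info, reverse=True)
    return ranked[:top_l]
```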
      <Paragraph position="3"> While Shannon information is similar to the tf-idf measure widely used in information retrieval (e.g., (Salton and Yang, 1973)), the use of Shannon information can be justified on the basis of information theory, whereas that of tf-idf cannot. Our preliminary experimental results indicate that Shannon information performs better than, or at least as well as, tf-idf in key word extraction.4 From the results of word clustering, we next select any cluster (topic) whose seed word is included among the selected key words. 4 We will discuss this in the full version of the paper.</Paragraph>
      <Paragraph position="4"> We next merge any two clusters if one of their seed words is included in the other's cluster. For example, when a cluster with seed word 'trade' contains the word 'import,' and a cluster with seed word 'import' contains the word 'trade,' we merge the two. After two such merges, we may obtain a relatively large cluster with, for example, 'trade-import-tariff-export' as its seed words, as is shown in Figure 3. Figure 4 shows the merging algorithm.</Paragraph>
      <Paragraph position="5"> In this way, we obtain the most conspicuous and mutually independent topics discussed in a given text.</Paragraph>
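The following is a minimal Python sketch of the merging procedure of Figure 4, using a union-find structure to propagate merges; the representation of clusters as a mapping from seed word to member-word set is our assumption.

```python
def merge_clusters(clusters):
    """clusters: dict mapping seed word -> set of member words.
    Merge any two clusters whose seed words appear in each other's member sets,
    propagating merges through a simple union-find."""
    seeds = list(clusters)
    parent = {s: s for s in seeds}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # A pair (ki, kj) qualifies if each seed word is contained in the other's cluster.
    for i, ki in enumerate(seeds):
        for kj in seeds[i + 1:]:
            if ki in clusters[kj] and kj in clusters[ki]:
                union(ki, kj)

    merged = {}
    for s in seeds:
        root = find(s)
        group = merged.setdefault(root, {"seeds": set(), "words": set()})
        group["seeds"].add(s)
        group["words"] |= clusters[s]
    return merged
```

For instance, clusters seeded by 'trade' and 'import' that contain each other's seed words end up in a single merged cluster whose seed set is {'trade', 'import'}.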
    </Section>
    <Section position="2" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
4.4 Text Segmentation
</SectionTitle>
      <Paragraph position="0"> In segmentation, we first identify candidates for points of segmentation within the given text. When we assume a relatively short text for the purposes of our explanation here, all sentence-ending periods will be candidates.</Paragraph>
      <Paragraph position="2"> [Figure 4 (the merging algorithm): For each cluster pair (k_i, k_j), if the seed word of k_i is included in k_j and the seed word of k_j is included in k_i, then push (k_i, k_j) into queue Q. While Q is not empty, remove the first element (k_i, k_j) from Q; if k_i and k_j belong to different sets, merge the two sets.]</Paragraph>
      <Paragraph position="5"> For each candidate, we create two pseudo-texts, one consisting of the h sentences preceding it, and the other of the h sentences following it (when fewer than h exist in either direction, we simply use those which do exist). We use the EM algorithm ((Dempster et al., 1977), cf. Figure 5) to separately estimate the parameters of an STM from each of the two pseudo-texts. It is theoretically guaranteed that the EM algorithm converges to a local maximum of the likelihood. We next calculate the similarity (i.e., essentially the converse notion of distance5) between the STM based on the preceding pseudo-text and the STM based on the following pseudo-text; these STMs are denoted, respectively, as $P_L(w)$ and $P_R(w)$.</Paragraph>
      <Paragraph position="7"> The similarity between $P_L(w)$ and $P_R(w)$ is computed from $\sum_w |P_L(w) - P_R(w)|$, which appears as the numerator of the measure; this quantity is referred to in statistics as the variational distance and has good properties as a distance between two probability distributions (cf. (Cover and Thomas, 1991), p. 299).</Paragraph>
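The exact normalization of the similarity measure is not recoverable from this version of the text; the sketch below assumes the similarity is one minus half the variational distance, which places it in [0, 1]. Treat the normalization, like the function names, as our assumption.

```python
def variational_distance(p_left, p_right):
    """Sum over the shared vocabulary of |P_L(w) - P_R(w)| (ranges over [0, 2])."""
    vocab = set(p_left) | set(p_right)
    return sum(abs(p_left.get(w, 0.0) - p_right.get(w, 0.0)) for w in vocab)

def similarity(p_left, p_right):
    """A [0, 1] similarity, assumed here to be 1 minus half the variational distance."""
    return 1.0 - 0.5 * variational_distance(p_left, p_right)
```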
      <Paragraph position="8"> Figure 7 shows a graph of the calculated similarity values for each of the candidates in the text shown in Figure 3. 5 We use similarity rather than distance here in order to simplify comparison between our method and TextTiling (Hearst, 1997).</Paragraph>
      <Paragraph position="9"> [Figure 5 (the EM algorithm): s denotes a predetermined number of iterations (l = 1, ..., s); $N(w)$ denotes the frequency of word $w$ in the data, and $N = \sum_{w} N(w)$.]</Paragraph>
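Figure 5 itself is not reproduced here. As a hedged illustration of the estimation step, the sketch below runs EM for the mixture weights of a finite mixture over topic clusters, holding the per-cluster word distributions fixed; whether the authors' STM estimation updates only these weights is our assumption, and all names are ours.

```python
def estimate_mixture_weights(word_counts, cluster_word_probs, iters=20):
    """EM for the mixture weights pi_k of a finite mixture
    P(w) = sum_k pi_k * P(w | k), with the per-cluster distributions
    P(w | k) held fixed (given by cluster_word_probs[k]).

    word_counts: dict word -> frequency N(w) in the (pseudo-)text.
    cluster_word_probs: dict k -> dict word -> P(w | k).
    Returns a dict k -> pi_k."""
    ks = list(cluster_word_probs)
    pi = {k: 1.0 / len(ks) for k in ks}            # uniform initialization
    n_total = sum(word_counts.values())
    for _ in range(iters):
        # E-step: expected number of word occurrences attributed to each cluster
        expected = {k: 0.0 for k in ks}
        for w, n_w in word_counts.items():
            weighted = {k: pi[k] * cluster_word_probs[k].get(w, 1e-9) for k in ks}
            z = sum(weighted.values())
            if z == 0.0:
                continue
            for k in ks:
                expected[k] += n_w * weighted[k] / z
        # M-step: re-estimate the mixture weights
        pi = {k: expected[k] / n_total for k in ks}
    return pi
```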
      <Paragraph position="14"> 'Valleys' (i.e., low similarity values) in the graph suggest points for reasonable segmentations. In actual practice, segmentation is performed at each valley whose similarity value is lower, by at least a predetermined degree θ, than each of the values of its left 'peak' and its right 'peak' (cf. Figure 6). For example, for the text in Figure 3, segmentation was performed at candidates (i.e., ends of sentences) 6, 14, 18, and 22, with θ = 0.05.</Paragraph>
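A small Python sketch of the valley-selection rule just described: a candidate is chosen as a segmentation point when its similarity value is lower than both the nearest left peak and the nearest right peak by at least θ. The peak-finding details are our interpretation of Figure 6.

```python
def find_segmentation_points(sim, theta=0.05):
    """sim: list of similarity values, one per segmentation candidate, in text order.
    Return the indices of 'valleys' whose value is lower than both the nearest
    left peak and the nearest right peak by at least theta."""
    cut_points = []
    for i in range(1, len(sim) - 1):
        if not (sim[i] <= sim[i - 1] and sim[i] <= sim[i + 1]):
            continue                       # not a local minimum
        # Climb left and right while values keep rising: those are the peaks.
        left = i
        while left > 0 and sim[left - 1] >= sim[left]:
            left -= 1
        right = i
        while right < len(sim) - 1 and sim[right + 1] >= sim[right]:
            right += 1
        if sim[left] - sim[i] >= theta and sim[right] - sim[i] >= theta:
            cut_points.append(i)
    return cut_points
```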
    </Section>
    <Section position="3" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
4.5 Topic Identification
</SectionTitle>
      <Paragraph position="0"> After segmentation, we separately estimate the parameters of the STM for each block, again using the EM algorithm, and obtain a topic (cluster) probability distribution for each block. We then choose those topics (clusters) in each block having high probability values. In this way, we construct a topic structure as in Figure 3 for the given text (topics are here represented by their seed words).</Paragraph>
      <Paragraph position="1"> We can view topics appearing in all the blocks as main topics, and topics appearing only in individual blocks as subtopics. In the text in Figure 3, the topic represented by seed-words 'trade-export-tariff-import' is the main topic, and 'Japan-Japanese,' 'Hong Kong,' etc., are subtopics.</Paragraph>
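As a sketch of this step, the function below takes per-block topic distributions (e.g., as estimated by EM) and selects the top topics of each block, labeling as main topics those selected in every block; the fixed top-k cutoff is our simplification of "topics having high probability values."

```python
def identify_topics(block_distributions, top_k=3):
    """block_distributions: list (one per block) of dicts topic -> probability.
    Pick the top-k topics for each block, then call a topic 'main' if it is
    selected in every block and a 'subtopic' otherwise."""
    per_block = [
        [t for t, _ in sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:top_k]]
        for d in block_distributions
    ]
    selected = [set(topics) for topics in per_block]
    main_topics = set.intersection(*selected) if selected else set()
    sub_topics = (set.union(*selected) - main_topics) if selected else set()
    return per_block, main_topics, sub_topics
```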
    </Section>
  </Section>
  <Section position="8" start_page="39" end_page="39" type="metho">
    <SectionTitle>
5 Applications
</SectionTitle>
    <Paragraph position="0"> Our method can be used in a variety of text processing applications.</Paragraph>
    <Paragraph position="1"> For example, given a collection of texts (e.g., home pages), we can automatically construct an index of the texts on the basis of the extracted topics. We can indicate which topic comes from which text, or even from which block of a text. Furthermore, we can indicate which topics are main topics of texts and which are subtopics (e.g., by displaying main topics in boldface). In this way, users can get a fair sense of the contents of the texts simply by looking through the index. For a specific text, users can get a rough sense of the content by looking at its topic structure, as shown, for example, in Figure 3.</Paragraph>
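A hedged sketch of one way to build such an index from previously computed topic structures; the input layout (per-text block topics plus a main-topic set) is hypothetical.

```python
def build_topic_index(topic_structures):
    """topic_structures: dict text_id -> (per_block_topics, main_topics), where
    per_block_topics is a list of topic lists (one per block) and main_topics a set,
    e.g., the output of identify_topics above. Returns an index mapping each topic
    to its (text_id, block_id) occurrences, with the texts where it is a main topic."""
    index = {}
    for text_id, (per_block, main_topics) in topic_structures.items():
        for block_id, topics in enumerate(per_block):
            for topic in topics:
                entry = index.setdefault(topic, {"occurrences": [], "main_in": set()})
                entry["occurrences"].append((text_id, block_id))
                if topic in main_topics:
                    entry["main_in"].add(text_id)
    return index
```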
    <Paragraph position="2"> Our method can also be useful for text mining, text summarization, information extraction, and other text processing tasks that require one to first analyze the structure of a text.</Paragraph>
  </Section>
  <Section position="9" start_page="39" end_page="40" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> To the best of our knowledge, no previous study has dealt with topic identification and text segmentation within a single framework.</Paragraph>
    <Paragraph position="1"> No keyword extraction method by itself, however, is able to conduct segmentation.</Paragraph>
    <Paragraph position="2"> With respect to text segmentation, existing methods can be classified into two groups. One divides a text into blocks (e.g., TextTiling (Hearst, 1997)); the other divides a stream of texts into its original texts (e.g., (Allan et al., 1998; Yamron et al., 1998; Beeferman et al., 1999; Reynar, 1999)). The former group generally employs unsupervised learning, while the latter employs supervised learning. No existing segmentation method, however, has attempted topic identification.</Paragraph>
    <Paragraph position="3"> TextTiling creates for each segmentation candidate two pseudo-texts, one preceding it and the other following it, and calculates as similarity the cosine value between the word frequency vectors of the two pseudo-texts. It then conducts segmentation at valley points in a way similar to that of our method. Since the problem setting of TextTiling (and of the former group in general) is the closest to that of our study, we use TextTiling for comparison in our experiments.</Paragraph>
    <Paragraph position="4"> Our method by its nature performs topic identification and segmentation within a single framework. It is possible, with a combination of existing methods, to extract key words from a given text using tf-idf, view the extracted key words as topics, segment the text into blocks by employing TextTiling, estimate the distribution of topics in each block, and identify topics having high probabilities in each block. Our method, however, outperforms such a combination (referred to hereafter as 'Com') for topic identification, because it utilizes word cluster information. It also performs better than Com in text segmentation, because it is based on a well-defined probabilistic framework. Most importantly, our method is able to output an easily understandable topic structure, which has not been proposed so far.</Paragraph>
    <Paragraph position="5"> Note that topic analysis is different from text classification (e.g., (Lewis et al., 1996; Li and Yamanishi, 1999; Joachims, 1998; Weiss et al., 1999; Nigam et al., 2000)). While text classification uses a number of pre-determined categories, topic analysis includes no notion of category. The output of topic analysis is a topic structure, while the output of text classification is a label representing a category.</Paragraph>
    <Paragraph position="6"> Furthermore, text classification is generally based on supervised learning, which uses labeled text data 6. By way of contrast, topic analysis is based on unsupervised learning, which uses only unlabeled text data.</Paragraph>
    <Paragraph position="7"> Finite mixture models have been used in a variety of applications in text processing (e.g., (Li and Yamanishi, 1997; Nigam et al., 2000; Hofmann, 1999)), indicating that they are essential to text processing. We should note, however, that their definitions and the ways they are used differ from those of the STM in this paper. For example, Li and Yamanishi propose to employ in text classification a mixture model (Li and Yamanishi, 1997) defined over categories:</Paragraph>
    <Paragraph position="9"> $P(w) = \sum_{c \in C} P(c)\,P(w \mid c), \quad w \in W,$ where $W$ denotes a set of words and $C$ a set of categories. In their framework, a new text $d$ is assigned to the category $c^* = \arg\max_{c \in C} P(c \mid d)$.</Paragraph>
    <Paragraph position="11"> Hofmann (1999) proposes using in information retrieval a joint distribution which he calls an 'aspect model,' $P(d, w) = \sum_{z} P(z)\,P(d \mid z)\,P(w \mid z)$, where $D$ denotes a set of texts ($d \in D$) and $z$ ranges over latent classes (aspects). Furthermore, he proposes extracting in retrieval those texts whose estimated word distributions $P(w \mid d)$ are similar to the word distribution of a query.</Paragraph>
  </Section>
</Paper>