File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-1305_intro.xml
Size: 2,227 bytes
Last Modified: 2025-10-06 14:01:00
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1305"> <Title>Topic Analysis Using a Finite Mixture Model</Title> <Section position="3" start_page="35" end_page="35" type="intro"> <SectionTitle> 2 Stochastic Topic Model 2.1 Topic </SectionTitle> <Paragraph position="0"> While the term 'topic' is used in different ways in different linguistic theories, we simply view it here as a subject within a text. We represent a topic by means of a cluster of words that are closely related to the topic, assuming that a cluster has a seed word (or several seed words) which indicates a topic. Figure 1 shows an example topic with the word 'trade'</Paragraph> <Section position="1" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 2.2 Definition of STM </SectionTitle> <Paragraph position="0"> Let W denote a set of words, and K a set of topics. We first define a distribution of topics (clusters) $P(k)$ : $\sum_{k \in K} P(k) = 1$. Then, for each topic $k \in K$, we define a probability distribution of words $P(w|k)$ : $\sum_{w \in W} P(w|k) = 1$. Here the value of $P(w|k)$ will be zero if w is not included in k. We next define a Stochastic Topic Model (STM) as a finite mixture model, which is a linear combination of the word probability distributions $P(w|k)$, with the topic distribution $P(k)$ being used as the coefficient vector. The probability of word w in W is, then, $P(w) = \sum_{k \in K} P(k) P(w|k)$.</Paragraph> <Paragraph position="2"> For the purposes of statistical modeling, it is advantageous to conceive of a text (i.e., a word sequence) as having been generated by some 'true' STMs, which we then seek to estimate as closely as possible. A text may have a number of blocks, and each block is assumed to be generated by an individual STM. The STMs within a text are assumed to have the same set of topics, but have different parameter values.</Paragraph> <Paragraph position="3"> From the linguistic viewpoint, a text generally focuses on a single main topic, but it may discuss different subtopics in different blocks. 
While a text is discussing any one topic, it will more frequently use words strongly related to that topic. Hence, STM is a natural representation of statistical word occurrence based on topics.</Paragraph> </Section> </Section> </Paper>
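
The finite mixture described in Section 2.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the topic names, vocabularies, and probability values are hypothetical toy numbers, and the helper names `word_probability` and `sample_words` are introduced here for illustration only. The sketch assumes the mixture form in which the probability of a word is the sum over topics of P(k) times P(w|k), with P(w|k) equal to zero for words outside a topic's cluster.

```python
import random

# Hypothetical toy parameters (assumptions, not values from the paper).
# P(k): distribution over topics; the values sum to 1.
topic_probs = {"trade": 0.6, "sports": 0.4}

# P(w|k): one word distribution per topic; each sums to 1.
# A word that is not in a topic's cluster implicitly has P(w|k) = 0.
word_probs = {
    "trade": {"trade": 0.5, "export": 0.3, "tariff": 0.2},
    "sports": {"game": 0.5, "team": 0.25, "score": 0.25},
}


def word_probability(word, topic_probs, word_probs):
    """Mixture probability: sum over topics k of P(k) * P(word|k)."""
    return sum(
        p_k * word_probs[k].get(word, 0.0) for k, p_k in topic_probs.items()
    )


def sample_words(n, topic_probs, word_probs, seed=0):
    """Generate n words: draw a topic from P(k), then a word from P(w|k)."""
    rng = random.Random(seed)
    topics = list(topic_probs)
    topic_weights = [topic_probs[t] for t in topics]
    words = []
    for _ in range(n):
        k = rng.choices(topics, weights=topic_weights)[0]
        vocab = list(word_probs[k])
        vocab_weights = [word_probs[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=vocab_weights)[0])
    return words


if __name__ == "__main__":
    print(word_probability("export", topic_probs, word_probs))
    print(sample_words(10, topic_probs, word_probs))
```

Under these toy values, a block dominated by the 'trade' topic produces words from the 'trade' cluster more often, which matches the intuition that a text uses words strongly related to the topic it is currently discussing.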