<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1305">
  <Title>Topic Analysis Using a Finite Mixture Model</Title>
  <Section position="3" start_page="35" end_page="35" type="intro">
    <SectionTitle>
2 Stochastic Topic Model
2.1 Topic
</SectionTitle>
    <Paragraph position="0"> While the term 'topic' is used in different ways in different linguistic theories, we simply view it here as a subject within a text. We represent a topic by means of a cluster of words that are closely related to the topic, assuming that a cluster has a seed word (or several seed words) which indicates a topic. Figure 1 shows an example topic with the word 'trade'</Paragraph>
    <Section position="1" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
2.2 Definition of STM
</SectionTitle>
      <Paragraph position="0"> Let W denote a set of words, and K a set of topics. We first define a distribution of topics (clusters) P(k) : ~kEIK P(k) = 1. Then, for each topic k E K, we define a probability distribution of words P(wik) : ~,ew P(wlk) = 1. Here the value of P(wik) will be zeroif w is not included in k. We next define a Stochastic Topic Model (STM) as a finite mixture model, which is a linear combination of the word probability distributions P(w\[k), with the topic distribution P(k) being used as the coefficient vector. The probability of word w in W is, then,</Paragraph>
      <Paragraph position="2"> For the purposes of statistical modeling, it is advantageous to conceive of a text (i.e., a word sequence) as having been generated by some 'true' STMs, which we then seek to estimate as closely as possible. A text may have a number of blocks, and each block is assumed to be generated by an individual STM. The STMs within a text are assumed to have the same set of topics, but have different parameter values.</Paragraph>
      <Paragraph position="3"> From the linguistic viewpoint, a text generally focuses on a single main topic, but it may discuss different subtopics in different blocks. While a text is discussing any one topic, it will more frequently use words strongly related to that topic. Hence, STM is a natural representation of statistical word occurrence based on topics.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>