<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2243">
  <Title>How to thematically segment texts by using lexical cohesion?</Title>
  <Section position="3" start_page="0" end_page="1481" type="intro">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"> The segmentation algorithm we propose includes two steps. First, a computation of the cohesion of the different parts of a text is done by using a collocation network. Second, we locate the major breaks in this cohesion to detect the thematic shifts and build segments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The collocation network
</SectionTitle>
      <Paragraph position="0"> Our collocation network was built from 24 months of the French newspaper Le Monde, a corpus of around 39 million words. The cohesion between words was evaluated with the mutual information measure, as in (Church and Hanks, 1990). A large window, 20 words wide, was used in order to capture thematic links. The texts were pre-processed with the probabilistic POS tagger TreeTagger (Schmid, 1994) in order to keep only the lemmatized forms of their content words, i.e. nouns, adjectives and verbs.</Paragraph>
      <Paragraph position="1"> The resulting network is composed of approximately 31 thousand words and 14 million relations.</Paragraph>
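      <Paragraph> The construction described above can be sketched as follows. This is a minimal illustration and not the implementation used for the Le Monde network: the function name build_network, the min_count parameter and the choice of counting each pair once per co-occurrence inside a forward-looking window are assumptions, and the mutual information is the unnormalized measure of (Church and Hanks, 1990).

```python
import math
from collections import Counter

def build_network(lemmas, window=20, min_count=1):
    """Map each unordered lemma pair to its mutual information.

    `lemmas` is the corpus as a flat list of lemmatized content words.
    Two lemmas co-occur when they fall inside the same span of `window` tokens.
    """
    word_counts = Counter(lemmas)
    pair_counts = Counter()
    for i, a in enumerate(lemmas):
        # Look at the tokens that follow `a` inside the window.
        for b in lemmas[i + 1 : i + window]:
            if a != b:
                pair_counts[tuple(sorted((a, b)))] += 1
    n = len(lemmas)
    network = {}
    for (a, b), c in pair_counts.items():
        if c >= min_count:  # hypothetical frequency filter
            # log2( P(a, b) / (P(a) * P(b)) ) with probabilities estimated by counts / n
            network[(a, b)] = math.log2(c * n / (word_counts[a] * word_counts[b]))
    return network
```

On a real corpus, a higher min_count would typically be used to discard pairs that are too rare for the estimate to be reliable.</Paragraph>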
    </Section>
    <Section position="2" start_page="0" end_page="1481" type="sub_section">
      <SectionTitle>
2.2 Computation of text cohesion
</SectionTitle>
      <Paragraph position="0"> As in Kozima's work, a cohesion value is computed at each position of a window in a text (after pre-processing) from the words in this window. The collocation network is used for determining how close together these words are.</Paragraph>
      <Paragraph position="1"> We assume that if the words of the window are strongly connected in the network, they belong to the same domain, and so the cohesion of this part of the text is high. Conversely, if they are only weakly linked together, we assume that the words of the window belong to two different domains, which means that the window straddles the transition from one topic to another.</Paragraph>
      <Paragraph position="3"> [Figure legend: one symbol marks a word from the collocation network (with its computed weight) and another marks a word from the text (with its computed weight; e.g. for the first word, p_w1 + p_w1 x 0.14 = 1.14). A number on an edge, such as 0.14, is the cohesion value of a link in the collocation network. p_wi is the initial weight of the window word wi (equal to 1.0 here).] In practice, the cohesion inside the window is evaluated as the sum of the weights of the words in this window and of the words selected from the collocation network, namely those linked to at least two words of the window. Selecting network words that are linked to the words of the text makes explicit other words related to the topic that the window refers to, and yields a more stable description of this topic as the window moves.</Paragraph>
      <Paragraph position="4"> As shown in Figure 1, each word w (from the window or from the network) is weighted by the sum of the contributions of all the words of the window it is linked to. The contribution of such a word is equal to its number of occurrences in the window, modulated by the cohesion measure associated with its link to w. Thus, the more the words belong to the same topic, the more they are linked together and the higher their weights are.</Paragraph>
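      <Paragraph> A small sketch may make this weighting concrete. It rests on several assumptions that are not stated in the text: the network is a dictionary keyed by word pairs, "modulated" is read as a multiplication, and a window word starts from an initial weight of 1.0 per occurrence before the contributions are added. The names link and word_weights are hypothetical.

```python
from collections import Counter

def link(network, a, b):
    """Cohesion value of the link between a and b, or 0.0 if they are not linked."""
    return network.get((a, b), network.get((b, a), 0.0))

def word_weights(window, network, initial=1.0):
    """Weight every window word and every selected network word."""
    occurrences = Counter(window)
    # Candidates: the window words plus every network word linked to one of them.
    candidates = set(occurrences)
    for a, b in network:
        if a in occurrences or b in occurrences:
            candidates.update((a, b))
    weights = {}
    for w in candidates:
        linked = [v for v in occurrences if v != w and link(network, v, w) > 0.0]
        # A network word is selected only if it is linked to at least two window words.
        if w not in occurrences and 2 > len(linked):
            continue
        own = initial * occurrences[w]  # assumed initial weight; 0 for network words
        # Contribution of v: its occurrences in the window times the link value.
        weights[w] = own + sum(occurrences[v] * link(network, v, w) for v in linked)
    return weights
```

With a two-word window whose words are linked with a value of 0.14, each word receives 1.0 + 1.0 x 0.14 = 1.14, which matches the example given in the figure legend.</Paragraph>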
      <Paragraph position="5"> Finally, the value of the cohesion for one position of the window is the result of the following formula:</Paragraph>
      <Paragraph position="7"> where wght(wi) is the resulting weight of the word wi and sign(wi) is the significance of wi, i.e. the normalized information of wi in the Le Monde corpus.</Paragraph>
      <Paragraph position="8"> Figure 2 shows the smoothed cohesion graph for ten texts of the experiment. Dotted lines are text boundaries (see 3.1).</Paragraph>
    </Section>
    <Section position="3" start_page="1481" end_page="1481" type="sub_section">
      <SectionTitle>
2.3 Segmenting the cohesion graph
</SectionTitle>
      <Paragraph position="0"> First, the graph is smoothed to make the main minima and maxima easier to detect. This operation is again done by moving a window over the text.</Paragraph>
      <Paragraph position="2"> At each position, the cohesion associated with the window center is re-evaluated as the mean of all the cohesion values in the window.</Paragraph>
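      <Paragraph> This smoothing amounts to a centered moving average, which can be sketched as below. The window width of 5 and the truncation of the window at both ends of the text are assumptions made for the example; the text does not specify either.

```python
def smooth(values, width=5):
    """Replace each cohesion value by the mean of the values in a window centered on it."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        # Near the ends of the text the window is truncated (an assumption).
        chunk = values[max(0, i - half) : i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A wider window removes more of the small oscillations but also blurs short segments, so the width has to be chosen with the expected segment length in mind.</Paragraph>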
      <Paragraph position="3"> After this smoothing, the derivative of the graph is calculated to locate the maxima and the minima. We consider that a minimum marks a thematic shift, so a segment is characterized by the following sequence: minimum - maximum - minimum. To delimit the segments more precisely, a segment is stopped before the next (or the previous) minimum if the graph shows a sharp drop followed by a very slow descent. Such a case is detected when the cohesion values fall below a given percentage of the maximum value.</Paragraph>
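      <Paragraph> The segmentation step can be illustrated with a short sketch on a list of smoothed cohesion values. It is only one possible reading of the description: the derivative is approximated by first differences, the ratio of 0.5 is a hypothetical value for the "given percentage", cohesion values are assumed to be non-negative, and the names local_minima, segments and trim are introduced for the example.

```python
def local_minima(values):
    """Indices where the discrete derivative goes from negative to non-negative."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return [i + 1 for i, (d0, d1) in enumerate(zip(diffs, diffs[1:]))
            if 0 > d0 and d1 >= 0]

def segments(values):
    """Cut the graph at its minima: each segment runs minimum - maximum - minimum."""
    bounds = [0] + local_minima(values) + [len(values) - 1]
    return list(zip(bounds, bounds[1:]))

def trim(values, start, end, ratio=0.5):
    """Shrink [start, end] to the span between the first and last positions
    whose value is at least `ratio` times the maximum of the segment."""
    threshold = ratio * max(values[start : end + 1])
    keep = [i for i in range(start, end + 1) if values[i] >= threshold]
    return keep[0], keep[-1]
```

For the values [1.0, 4.0, 1.0, 5.0, 2.0], the only interior minimum is at index 2, so the two segments are (0, 2) and (2, 4); trimming the first one with a ratio of 0.5 keeps only index 1.</Paragraph>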
    </Section>
  </Section>
</Paper>