<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1077">
  <Title>TextTiling VecTile</Title>
  <Section position="4" start_page="0" end_page="591" type="metho">
    <SectionTitle>
3 The Method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="591" type="sub_section">
      <SectionTitle>
3.1 Context Vectors
</SectionTitle>
      <Paragraph position="0"> The VecTile system is based on the WordSpace model of (Schütze, 1997; Schütze, 1998). The idea is to represent words by encoding the environments in which they typically occur in texts. Such a representation can be obtained automatically and often provides sufficient information to make deep linguistic analysis unnecessary. This has led to promising results in information retrieval and related areas (Flournoy et al., 1998a; Flournoy et al., 1998b).</Paragraph>
      <Paragraph position="1"> Given a dictionary W and a relatively small set C of meaningful &amp;quot;content&amp;quot; words, for each pair in W x C, the number of times the two co-occur within some measure of distance in a training corpus is recorded. This yields a |C|-dimensional vector for each w in W. The direction of that vector in the resulting |C|-dimensional space then represents the collocational behavior of w in the training corpus. In the present implementation, |W| = 20,500 and |C| = 1,000. For computational efficiency and to avoid the high number of zero values in the resulting matrix, the matrix is reduced to 100 dimensions using Singular-Value Decomposition (Golub and van Loan, 1989).</Paragraph>
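      <Paragraph> The construction described above can be sketched roughly as follows. The tokenized corpus, dictionary, and content-word list are placeholders, and the co-occurrence window width here is illustrative (the paper says only &amp;quot;some measure of distance&amp;quot;); the dimensionalities in the paper are |W| = 20,500, |C| = 1,000, and 100 SVD dimensions.

```python
import numpy as np

def cooccurrence_vectors(tokens, dictionary, content_words, window=5):
    """For each dictionary word w, count how often each content word c
    occurs within `window` positions of w in the training corpus."""
    w_index = {w: i for i, w in enumerate(dictionary)}
    c_index = {c: j for j, c in enumerate(content_words)}
    counts = np.zeros((len(dictionary), len(content_words)))
    for i, tok in enumerate(tokens):
        if tok in w_index:
            start = max(0, i - window)
            for other in tokens[start:i + window + 1]:
                if other in c_index:
                    counts[w_index[tok], c_index[other]] += 1
    return counts

def reduce_dimensions(counts, k=100):
    """Reduce the |W| x |C| count matrix to k dimensions via SVD,
    keeping the top-k singular directions."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    top = min(k, len(s))
    return u[:, :top] * s[:top]
```

      Each row of the reduced matrix is then the context vector of one dictionary word.</Paragraph>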
      <Paragraph position="2">  As a measure of similarity in collocational behavior between two words, the cosine between their vectors is computed: given two n-dimensional vectors u and v,</Paragraph>
      <Paragraph position="3"> cos(u, v) = (u_1 v_1 + ... + u_n v_n) / (sqrt(u_1^2 + ... + u_n^2) sqrt(v_1^2 + ... + v_n^2))</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="2" start_page="591" end_page="591" type="sub_section">
      <SectionTitle>
3.2 Comparing Window Vectors
</SectionTitle>
      <Paragraph position="0"> In order to represent pieces of text larger than single words, the vectors of the constituent words are added up. This yields new vectors in the same space, which can again be compared against each other and word vectors. If the word vectors in two adjacent portions of text are added up, then the cosine between the two resulting vectors is a measure of the lexical similarity between the two portions of text.</Paragraph>
      <Paragraph position="1"> The VecTile system uses word vectors based on co-occurrence counts on a corpus of New York Times articles. Two adjacent windows (200 words each in this experiment) move over the input text, and at pre-determined intervals (every 10 words), the vectors associated with the words in each window are added up, and the cosine between the resulting window vectors is assigned to the gap between the windows in the text. High values indicate lexical closeness. Troughs in the resulting similarity curve mark spots with low cohesion.</Paragraph>
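      <Paragraph> A minimal sketch of this windowing scheme, assuming the per-word context vectors are already available in a dictionary; the function names and the silent skipping of unknown words are illustrative choices, not taken from the paper.

```python
import numpy as np

def similarity_curve(tokens, vectors, window=200, step=10):
    """Slide two adjacent windows over the text; every `step` words,
    sum the word vectors in each window and record the cosine between
    the two window vectors at the gap between them."""
    def window_vector(words):
        dim = len(next(iter(vectors.values())))
        total = np.zeros(dim)
        for w in words:
            if w in vectors:      # unknown words are skipped (assumption)
                total += vectors[w]
        return total

    curve = []
    for gap in range(window, len(tokens) - window + 1, step):
        left = window_vector(tokens[gap - window:gap])
        right = window_vector(tokens[gap:gap + window])
        norm = np.linalg.norm(left) * np.linalg.norm(right)
        curve.append((gap, float(left.dot(right) / norm) if norm else 0.0))
    return curve
```

      Each entry pairs a gap position with its similarity score; troughs in the score sequence are the low-cohesion spots.</Paragraph>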
    </Section>
    <Section position="3" start_page="591" end_page="591" type="sub_section">
      <SectionTitle>
3.3 Text Segmentation
</SectionTitle>
      <Paragraph position="0"> To evaluate the performance of the system and facilitate comparison with other approaches, it was used in text segmentation. The motivating assumption behind this test is that cohesion reinforces the topical unity of subparts of a text, and that a lack of it correlates with their boundaries; hence, if a system correctly predicts segment boundaries, it is indeed measuring cohesion. For want of a way of observing cohesion directly, this indirect relationship is commonly used for purposes of evaluation.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="591" end_page="592" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"> The implementation of the text segmenter resembles that of the TextTiling system (Hearst, 1997). The words from the input are stemmed and associated with their context vectors. The similarity curve over the text, obtained as described above, is smoothed by a simple low-pass filter, and low points are assigned depth scores according to the difference between their values and those of the surrounding peaks. The mean and standard deviation of those depth scores are used to calculate a cutoff below which a trough is judged to be near a section break. The nearest paragraph boundary is then marked as a section break in the output.</Paragraph>
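    <Paragraph> The boundary decision can be sketched as follows. The moving-average smoother and the exact cutoff formula are illustrative stand-ins: the paper specifies only a &amp;quot;simple low-pass filter&amp;quot; and a cutoff derived from the mean and standard deviation of the depth scores (the mean-minus-half-a-deviation form below follows Hearst's TextTiling).

```python
import numpy as np

def smooth(curve, width=3):
    """Simple low-pass filter: moving average over the similarity curve."""
    kernel = np.ones(width) / width
    return np.convolve(curve, kernel, mode="same")

def depth_scores(curve):
    """Score each local minimum by the sum of the rises from the trough
    to the nearest peak on each side."""
    scores = []
    for i in range(1, len(curve) - 1):
        if curve[i - 1] > curve[i] and curve[i + 1] > curve[i]:
            left = i
            while left > 0 and curve[left - 1] >= curve[left]:
                left -= 1
            right = i
            while right + 1 != len(curve) and curve[right + 1] >= curve[right]:
                right += 1
            scores.append((i, (curve[left] - curve[i]) + (curve[right] - curve[i])))
    return scores

def boundaries(curve, width=3):
    """Mark troughs whose depth exceeds a cutoff computed from the mean
    and standard deviation of all depth scores (illustrative formula)."""
    scores = depth_scores(list(smooth(np.asarray(curve), width)))
    if not scores:
        return []
    depths = np.array([d for _, d in scores])
    cutoff = depths.mean() - depths.std() / 2   # assumption, cf. Hearst (1997)
    return [i for i, d in scores if d > cutoff]
```

    In the full system, each surviving trough index is then snapped to the nearest paragraph boundary.</Paragraph>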
    <Paragraph position="1"> An example of a text similarity curve is given in the figure, with segment boundaries inserted in five rows in the upper half.</Paragraph>
    <Paragraph position="2">  The crucial difference between this and the TextTiling system is that the latter builds window vectors solely by counting the occurrences of strings in the windows. Repetition is rewarded by the present approach, too, as identical words contribute most to the similarity between the block vectors. However, similarity scores can be high even in the absence of pure string repetition, as long as the adjacent windows contain words that co-occur frequently in the training corpus. Thus what a direct comparison between the systems will show is whether the addition of collocational information gleaned from the training corpus sharpens or blunts the judgment.</Paragraph>
    <Paragraph position="3"> For comparison, the TextTiling algorithm was implemented and run with the same window size (200) and gap interval (10).</Paragraph>
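    <Paragraph> For contrast, a TextTiling-style window comparison builds its vectors by counting term occurrences directly, with no collocational information; the following is a sketch of that idea (stemming and tokenization omitted), not Hearst's exact implementation.

```python
from collections import Counter
import math

def count_cosine(left_words, right_words):
    """TextTiling-style similarity: cosine between raw term-count
    vectors of two adjacent windows."""
    a, b = Counter(left_words), Counter(right_words)
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

    Under this measure, two windows with no shared strings score zero even if their words co-occur heavily in some corpus, which is exactly the gap the VecTile vectors are meant to fill.</Paragraph>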
  </Section>
</Paper>