<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0412">
  <Title>Non-Contiguous Word Sequences for Information Retrieval</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Vector Space Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Preprocessing
</SectionTitle>
      <Paragraph position="0"> The first step of the process is to clean the data.</Paragraph>
      <Paragraph position="1"> One way to do this is to skip a set of words that are considered least informative, the stopwords. We also discarded all short words (fewer than three characters).</Paragraph>
      <Paragraph position="2"> We then reduced each word to its stem using the Porter algorithm (Porter, 1980). For example, the words &quot;models&quot;, &quot;modelling&quot; and &quot;modeled&quot; are all stemmed to &quot;model&quot;. Reducing words to their root in this way further reduces the number of distinct word terms.</Paragraph>
      <Paragraph position="3"> This feature selection phase makes the next steps computationally cheaper, since it greatly reduces the size of the document collection representation in the vector space model (the dimension of the vector space).</Paragraph>
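The three preprocessing steps above (stopword removal, length filtering, stemming) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stopword list is an assumption (the paper does not give its list), and `toy_stem` is a hypothetical crude suffix stripper standing in for the Porter algorithm. A real Porter stemmer can be passed in through the `stem` parameter.

```python
import re

# Illustrative stopword list (an assumption; the paper does not give its list).
STOPWORDS = frozenset({"the", "and", "are", "for", "that", "this", "with", "all"})
MIN_LENGTH = 3  # words with fewer than three characters are discarded


def toy_stem(word):
    """Crude suffix stripper; a stand-in for the Porter algorithm, NOT Porter itself."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_LENGTH:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final letter, e.g. "modell" becomes "model".
    if len(word) >= 2 and word[-1] == word[-2]:
        word = word[:-1]
    return word


def preprocess(text, stem=toy_stem):
    """Lowercase, tokenize, drop stopwords and short words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS and len(t) >= MIN_LENGTH]
    return [stem(t) for t in kept]


if __name__ == "__main__":
    print(preprocess("The models, modelling and modeled are all of it"))
```

Passing the stemmer as a parameter keeps the filtering logic independent of the stemming algorithm, so the toy function can be swapped for a full Porter implementation without touching the rest of the pipeline.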
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Vector Space Model
</SectionTitle>
      <Paragraph position="0"> The set W of distinct remaining word stems is used to represent the document collection within the vector space model. Each document is represented by a ‖W‖-dimensional vector, each component of which is a weight standing for the importance of the corresponding word stem with respect to that document. To calculate this weight, we use a tf-normalized version of the &quot;tfc&quot; term-weighting components described by Salton and Buckley (1988), i.e.:</Paragraph>
      <Paragraph position="2"> where tf_w is the term frequency of the word w, N is the total number of documents in the collection, and n_w is the number of documents in which w occurs.</Paragraph>
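To make the roles of the term frequency, N and the document frequency concrete, here is a hedged sketch of the standard "tfc" weight of Salton and Buckley (1988): term frequency times log(N / document frequency), with each document vector scaled to unit Euclidean length. The exact tf-normalized variant used in the paper is not recoverable from this text, so the code below is an assumption about the general scheme, and the function name `tfc_weights` is hypothetical.

```python
import math
from collections import Counter


def tfc_weights(documents):
    """Compute cosine-normalized tf * log(N / n_w) weights.

    documents: list of token lists (already preprocessed).
    Returns one {word: weight} dict per document.
    """
    n_docs = len(documents)  # N: total number of documents
    doc_freq = Counter()     # n_w: number of documents containing w
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # tf_w: occurrences of w in this document
        raw = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # A document whose words all occur everywhere has a zero vector.
        vectors.append({w: (v / norm if norm > 0 else 0.0) for w, v in raw.items()})
    return vectors


if __name__ == "__main__":
    docs = [["model", "model", "vector"], ["vector", "space"], ["space", "retriev"]]
    for vec in tfc_weights(docs):
        print(vec)
```

A word that occurs in every document gets weight zero, since log(N / N) = 0, which matches the intuition that such a word does not help distinguish documents.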
    </Section>
  </Section>
</Paper>