<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0102">
  <Title>Using Long Runs as Predictors of Semantic Coherence in a Partial Document Retrieval System</Title>
  <Section position="5" start_page="7" end_page="7" type="metho">
    <SectionTitle>
[Table, continued - SEMCAT categories 28-39: QUAN, REAF, RELN, REOR, REPR, ROVO, SIG, SIVO, SYAF, TIME, VOAC, VOIG]
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
2.3 Indexing Space and Stop Lists
</SectionTitle>
      <Paragraph position="0"> Many of the most frequently occurring words in English, such as &amp;quot;the,&amp;quot; &amp;quot;of, .... and,&amp;quot; &amp;quot;to,&amp;quot; etc. are non-discriminators with respect to information filtering. Since many of these function words make up a large fraction of the text of most documents, their early elimination in the indexing process speeds processing, saves significant amounts of index space and does not compromise the filtering process. In the Brown Corpus, the frequency of stop words is 551,057 out of 1,013,644 total words. Function words therefore account for about 54.5% of the tokens in a document.</Paragraph>
      <Paragraph position="1"> The Brown Corpus is useful in text retrieval because it is small and efficiently exposes content word runs. Furthermore, minimizing the document token size is very important in NLP-based methods, because NLP-based methods usually need much larger indexing spaces than statistical-based methods due to processes for tagging and parsing.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="7" end_page="9" type="metho">
    <SectionTitle>
3 Experimental Basis
</SectionTitle>
    <Paragraph position="0"> In order to verify that long runs contribute to resolve semantic complexities and can be used as predictors of semantic intent, we employed a probabilistic, vector processing methodology.</Paragraph>
    <Section position="1" start_page="7" end_page="8" type="sub_section">
      <SectionTitle>
3.1 Revised Probability and Vector Processing
</SectionTitle>
      <Paragraph position="0"> In order to understand the calculation of SEMCATs, it is helpful to look at the structure  of a preprocessed document. One document &amp;quot;Barbie&amp;quot; in the Jang (1997) collection has a total of 1,468 words comprised of 755 content words and 713 function words. The document has 17 paragraphs. Filtering out function words using the Brown Corpus exposed the runs of content words as shown in Figure 1.</Paragraph>
      <Paragraph position="1">  The traditional vector processing model requires the following set of terms: * (dO the number of documents in the collection that each word occurs in * (idf) the inverse document frequency of each word determined by logl0(N/df) where N is the total number of documents. If a word appears in a query but not in a document, its idf is undefined.</Paragraph>
      <Paragraph position="2"> * The category probability of each query word.</Paragraph>
      <Paragraph position="3"> Wendlandt (1991) points out that it is useful to retrieve a set of documents based upon key words only, and then considers only those documents for semantic category and attribute analysis. Wendlandt (1991) appends the s category weights to the t term weights of each document vector Di and the Query vector Q. Since our basic query unit is a paragraph, document frequencY (df) and inverse document frequency (idf) have to be redefined. As we pointed out in Section 1, all terms are not known in partial text retrieval. Further, our approach is based on semantic weight rather than word frequency. Therefore any frequency based measures defined by Boyd et al. (1994) and Wendlandt (1991) need to be built from the probabilities of individual semantic categories. Those modifications are described below. As a simplifying assumption, we assume SEMCATs have a uniform probability distribution with regard to a word.</Paragraph>
    </Section>
    <Section position="2" start_page="8" end_page="9" type="sub_section">
      <SectionTitle>
3.2 Calculating SEMCATs
</SectionTitle>
      <Paragraph position="0"> Our first task in computing SEMCAT values was to create a SEMCAT dictionary for our method. We extracted SEMCATs for every word from the World Wide Web version of Roget's thesaurus. SEMCATs give probabilities of a word corresponding to a semantic category. The content word run 'favorite companion detractors love' is of length 4. Each word of the run maps to at least one SEMCAT. The word 'favorite' maps to categories 'PEAF and SYAF'.</Paragraph>
      <Paragraph position="1"> 'companion' maps to categories 'ANT, MECO, NUM, ORD, ORGM, PEAF, PRVO, QUAN, and SYAF'. 'detractor' maps to 'MOAF'. 'love' maps to 'AFIG, ANT, MECO, MOAF, MOCO, ORGM, PEAF, PORE, PRVO, SYAF, and VOIG'. We treat the long runs as a semantic core from which to calculate SEMCAT values.</Paragraph>
      <Paragraph position="2"> SEMCAT weights are calculated based on the following equations.</Paragraph>
      <Paragraph position="3">  is the sum of each SEMCAT(j) weight of long runs based on their probabilities.</Paragraph>
      <Paragraph position="4"> In the above example, the long run  'favorite companion detractors love,' ihe</Paragraph>
      <Paragraph position="6"> paragraph) - Given a set of N content words (data) in a paragraph, the expected weight of the SEMCATs of long runs in a paragraph is:</Paragraph>
      <Paragraph position="8"> paragraph) - The inverse data weight of SEMCATs of long runs for a set of N content words in a paragraph is</Paragraph>
      <Paragraph position="10"> Our method performs the following steps:  1. calculate the SEMCAT weight of each long content word run in every paragraph (Sw) 2. calculate the expected data weight of each paragraph (edw) 3. calculate the inverse expected data weight of each paragraph (idw) 4. calculate the actual weight of each paragraph (Swxidw) 5. calculate coherence weights (total relevance)  by summing the weights of (Swxidw). In every paragraph, extraction of SEMCATs from long runs is done first. The next step is finding the same SEMCATs of long runs through every word in a paragraph (expected data weight), then calculate idw, and finally Swxidw. The final, total relevance weights are an accumulation of all weights of SEMCATs of content words in a paragraph. Total relevance tells how many SEMCATs of the Query's long runs appear in a paragraph. Higher values imply that the paragraph is relevant to the long runs of the Query.</Paragraph>
      <Paragraph position="11"> The following is a program output for</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>