<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1076">
  <Title>A Comparison of Document, Sentence, and Term Event Spaces</Title>
  <Section position="5" start_page="601" end_page="603" type="metho">
    <SectionTitle>
3 Experimental Design
</SectionTitle>
    <Paragraph position="0"> Our goal in this paper is to compare and contrast language models based on a document with those based on a sentence and term event spaces. We considered several of the corpora from the Text Retrieval Conferences (TREC, trec.nist.gov); however, those collections were primarily news  articles. One exception was the recently added genomics track, which considered full-text scientific articles, but did not provide relevance judgments at a sentence or term level. We also considered the sentence level judgments from the novelty track and the phrase level judgments from the question-answering track, but those were news and web documents respectively and we had wanted to explore the event spaces in the context of scientific literature.</Paragraph>
    <Paragraph position="1"> Table 1 shows the corpus that we developed for these experiments. The American Chemistry Society provided 103,262 full-text documents, which were published in 27 journals from 2000- null . We processed the headings, text, and tables using Java BreakIterator class to identify sentences and a Java implementation of the Porter Stemming algorithm (Porter, 1980) to identify terms. The inverted index was stored in an Oracle 10i database.</Paragraph>
    <Paragraph position="2">  Formatting inconsistencies precluded two journals and reduced the number of documents by 2,432.</Paragraph>
    <Paragraph position="3"> We made the following comparisons between the document, sentence, and term event spaces. (1) Raw term comparison A set of well-correlated spaces would enable an accurate prediction from one space to the next. We will plot pair-wise correlations between each space to reveal similarities and differences. This comparison reflects a previous analysis comprising a random sample of 193 words from a 50 million word corpus of 85,432 news articles (Church and Gale 1999). Church and Gale's analysis of term and document spaces resulted in a p value of -0.994. Our work complements their approach by considering full-text scientific articles rather than news documents, and we consider the entire stemmed term vocabulary in a  526 million-term corpus.</Paragraph>
    <Paragraph position="4"> (2) Zipf Law comparison Information theory tells us that the frequency of terms in a corpus conforms to the power law distribution K/j th  1999). Zipf's Law is a special case of the power law, where th is close to 1 (Zipf, 1949). To provide another perspective of the alternative spaces, we calculated the parameters of Zipf's Law, K and th for each event space and journal using the binning method proposed in (Adamic 2000). By accounting for K, the slope as defined by th will provide another way to characterize differences between the document, sentence and term spaces. We expect that all event spaces will conform to Zipf's Law.</Paragraph>
    <Paragraph position="6"> direct comparison between IDF, ISF and ITF.</Paragraph>
    <Paragraph position="7"> Our third experiment was to provide pair-wise comparisons among these the event spaces.</Paragraph>
    <Paragraph position="8"> (4) Abstract versus full-text comparison Language models of scientific articles often consider only abstracts because they are easier to obtain than full-text documents. Although historically difficult to obtain, the increased availability of full-text articles motivates us to understand the nature of language within the body of a document. For example, one study found that full-text articles require weighting schemes that consider document length (Kamps, et al, 2005).</Paragraph>
    <Paragraph position="9"> However, controlling the weights for document lengths may hide a systematic difference between the language used in abstracts and the language used in the body of a document. For example, authors may use general language in an  abstract and technical language within a document. null Transitioning from abstracts to full-text documents presents several challenges including how to weigh terms within the headings, figures, captions, and tables. Our forth experiment was to compare IDF between the abstract and full text of the document. We did not consider text from headings, figures, captions, or tables.</Paragraph>
  </Section>
  <Section position="6" start_page="603" end_page="603" type="metho">
    <SectionTitle>
(5) IDF Sensitivity
</SectionTitle>
    <Paragraph position="0"> In a dynamic environment such as the Web, it would be desirable to have a corpus-based weight that did not change dramatically with the addition of new documents. An increased understanding of IDF stability may enable us to make specific system recommendations such as if the collection increases by more than n% then update the IDF values.</Paragraph>
    <Paragraph position="1"> To explore the sensitivity we compared the amount of change in IDF values for various sub-sets of the corpus. IDF values were calculated using samples of 10%, 20%, ..., 90% and compared with the global IDF. We stratified sampling such that the 10% sample used term frequencies in 10% of the ACHRE4 articles, 10% of the BICHAW articles, etc. To control for variations in the corpus, we repeated each sample 10 times and took the average from the 10 runs.</Paragraph>
    <Paragraph position="2"> To explore the sensitivity we compared the global IDF in Equation 1 with the local sample, where N was the average number of documents in the sample and n i was the average term frequency for each stemmed term in the sample. In addition to exploring sensitivity with respect to a random subset, we were interested in learning more about the relationship between the global IDF and the IDF calculated on a journal sub-set. To explore these differences, we compared the global IDF with local IDF where N was the number of documents in each journal and n i was the number of times the stemmed term appears in the text of that journal.</Paragraph>
  </Section>
</Paper>