<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1076"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. A Comparison of Document, Sentence, and Term Event Spaces</Title> <Section position="7" start_page="603" end_page="606" type="evalu"> <SectionTitle> 4 Results and Discussion </SectionTitle> <Paragraph position="0"> The 100,830 full-text documents comprised 2,001,730 distinct unstemmed terms and 1,391,763 distinct stemmed terms. All experiments reported in this paper consider stemmed terms.</Paragraph> <Section position="1" start_page="603" end_page="604" type="sub_section"> <SectionTitle> 4.1 Raw frequency comparison </SectionTitle> <Paragraph position="0"> The dimensionality of the document, sentence, and term spaces varied greatly: 100,830 documents, 16.5 million sentences, 2.0 million distinct unstemmed terms (526.0 million in total), and 1.39 million distinct stemmed terms.</Paragraph> <Paragraph position="1"> Figure 2A shows the correlation between the frequency of a term in the document space (x) and the average frequency of the same set of terms in the sentence space (y). For example, the average number of sentences for the set of terms that appear in 30 documents is 74.6. Figure 2B compares the document (x) and average term frequency (y).</Paragraph> <Paragraph position="3"> [Figure 3 caption: frequency rank distributions for jacsat in the document (A), sentence (B), and term (C) event spaces; note the predicted slope coefficients of 1.6362, 1.7138, and 1.7061, respectively. D shows the document, sentence, and term slope coefficients for each of the 25 journals when fit to the power law K/j^θ, where j is the rank.]</Paragraph> <Paragraph position="4"> These figures suggest that the document space differs substantially from the sentence and term spaces. 
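As an aside on mechanics, one term's raw frequency in each of the three event spaces can be tallied as in the following sketch (the tokenisation, sentence splitting, and example documents are simplistic assumptions for illustration, not the preprocessing used in this study):

```python
import re

def event_space_counts(documents, term):
    """Tally one term's frequency in the three event spaces:
    documents containing it, sentences containing it, and raw occurrences."""
    doc_freq = sent_freq = term_freq = 0
    for doc in documents:
        sentences = re.split(r"(?<=[.!?])\s+", doc)
        tokenised = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
        if any(term in toks for toks in tokenised):
            doc_freq += 1
        sent_freq += sum(1 for toks in tokenised if term in toks)
        term_freq += sum(toks.count(term) for toks in tokenised)
    return doc_freq, sent_freq, term_freq

docs = ["The anode oxidises. The anode is the anode.",
        "No relevant terms here."]
print(event_space_counts(docs, "anode"))  # → (1, 2, 3)
```

The three counts diverge exactly when a term repeats within a sentence or clusters within a few documents, which is why the document space can behave differently from the other two.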
Figure 2C shows the sentence frequency (x) and average term frequency (y), demonstrating that the sentence and term spaces are highly correlated.</Paragraph> <Paragraph position="5"> Luhn proposed that if terms were ranked by the number of times they occurred in a corpus, the terms of interest would lie within the center of the ranked list (Luhn, 1958). Figures 2D, E, and F show the standard deviation between the document and sentence spaces, the document and term spaces, and the sentence and term spaces, respectively. These figures suggest that the greatest variation occurs for important terms.</Paragraph> </Section> <Section position="2" start_page="604" end_page="605" type="sub_section"> <SectionTitle> 4.2 Zipf's Law comparison </SectionTitle> <Paragraph position="0"> Zipf's Law states that the frequency of terms in a corpus conforms to a power law distribution K/j^θ, where θ is close to 1 (Zipf, 1949). We calculated the K and θ coefficients for each journal and language model combination using the binning method proposed in (Adamic, 2000). Figures 3A-C show the actual frequencies and the power law fit for each language model in just one of the 25 journals (jacsat). These and the remaining 72 figures (not shown) suggest that Zipf's Law holds in all event spaces.</Paragraph> <Paragraph position="1"> Zipf's Law predicts that the slope θ should be close to -1. In our corpus, the average θ in the document space was -1.65, while the average θ in both the sentence and term spaces was -1.73.</Paragraph> <Paragraph position="2"> Figure 3D compares the document slope coefficient (x) for each of the 25 journals with the sentence and term space coefficients (y). These findings are consistent with a recent study that suggested θ should be closer to 2 (Cancho, 2005). Another study found that the term frequency rank distribution was a better fit to Zipf's Law when the term space comprised both words and phrases (Ha et al., 2002). We considered only stemmed terms. 
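The binning approach to fitting the power law can be sketched as follows (a minimal illustration of log-binned least squares in the spirit of Adamic (2000); `fit_power_law`, its bin scheme, and its defaults are assumptions, not the authors' code):

```python
import math

def fit_power_law(frequencies, n_bins=20):
    """Fit f(rank) ≈ K * rank**theta by log-spaced rank binning:
    average the frequencies within each logarithmic bin, then run an
    ordinary least-squares line through the (log rank, log mean freq) points."""
    freqs = sorted(frequencies, reverse=True)
    n = len(freqs)
    pts, prev = [], 1
    for b in range(1, n_bins + 1):
        hi = int(round(n ** (b / n_bins)))   # log-spaced bin edge
        if hi < prev:                        # rounding collapsed this bin
            continue
        chunk = freqs[prev - 1:hi]
        mid_rank = math.sqrt(prev * hi)      # geometric bin centre
        pts.append((math.log(mid_rank), math.log(sum(chunk) / len(chunk))))
        prev = hi + 1
    m = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    theta = (m * sxy - sx * sy) / (m * sxx - sx * sx)   # log-log slope
    K = math.exp((sy - theta * sx) / m)                 # intercept
    return K, theta

# A rank-frequency profile that is exactly K/rank should recover theta ≈ -1.
K, theta = fit_power_law([1000.0 / j for j in range(1, 5001)])
```

Binning before fitting damps the noise in the long tail of rare terms, which would otherwise dominate a naive point-by-point regression.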
Other studies suggest that a Poisson mixture model would capture the frequency rank distribution better than the power model (Church and Gale, 1995). A comprehensive overview of using Zipf's Law to model language can be found in (Guiter and Arapov, 1982).</Paragraph> </Section> <Section position="3" start_page="605" end_page="605" type="sub_section"> <SectionTitle> 4.3 Direct IDF, ISF, and ITF comparison </SectionTitle> <Paragraph position="0"> Our third experiment was to compare the three language models directly. Figure 4A shows the average, minimum, and maximum ISF value for each rounded IDF value. After fitting a regression line, we found that ISF correlates well with IDF, but that the average ISF values are 5.57 greater than the corresponding IDF values. Similarly, ITF correlates well with IDF, but the ITF values are 10.45 greater than the corresponding IDF values.</Paragraph> <Paragraph position="1"> [Figure 4 caption: direct IDF, ISF, and ITF comparisons.]</Paragraph> <Paragraph position="2"> It is little surprise that Figure 4C reveals a strong correlation between ITF and ISF, given the correlation between raw frequencies reported in section 4.1. Again, we see a high correlation between the ISF and ITF spaces, but the ITF values are on average 4.69 greater than the equivalent ISF values. These findings suggest that simply substituting ISF or ITF for IDF would result in a weighting scheme where the corpus weights dominate the weights assigned to the query in the vector-based retrieval model. The variation appears to increase at higher IDF values.</Paragraph> <Paragraph position="3"> Table 2 (see over) provides example stemmed terms with varying frequencies and their corresponding IDF, ISF, and ITF weights. The most frequent term, &quot;the&quot;, appears in 100,717 documents, 12,771,805 sentences, and 31,920,853 times in total. In contrast, the stemmed term &quot;electrochem&quot; appeared only six times in the corpus, in six different documents and six different sentences. 
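The systematic gaps between the three weights follow from each space applying the same inverse-frequency form over a much larger event count. A sketch (the base-2 logarithm and the illustrative per-term counts are assumptions, not the paper's exact definitions):

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, log(N / n_i); log base assumed to be 2."""
    return math.log2(n_docs / doc_freq)

def isf(n_sents, sent_freq):
    """Inverse sentence frequency: the same form over the sentence space."""
    return math.log2(n_sents / sent_freq)

def itf(n_terms, term_freq):
    """Inverse term frequency: the same form over term occurrences."""
    return math.log2(n_terms / term_freq)

# Corpus-scale denominators from section 4.1; the per-term counts
# (30 documents, 75 sentences, 90 occurrences) are hypothetical.
N_DOCS, N_SENTS, N_TERMS = 100_830, 16_500_000, 526_000_000
print(idf(N_DOCS, 30), isf(N_SENTS, 75), itf(N_TERMS, 90))
```

Because the sentence and term totals exceed the document total by orders of magnitude while a term's own counts grow far more slowly, ISF and ITF sit well above IDF for the same term, matching the constant offsets reported above.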
Note also the differences between the abstract and full-text IDF values (see section 4.4).</Paragraph> </Section> <Section position="4" start_page="605" end_page="606" type="sub_section"> <SectionTitle> 4.4 Abstract vs full text comparison </SectionTitle> <Paragraph position="0"> Although abstracts are often easier to obtain, the availability of full-text documents continues to increase. In our fourth experiment, we compared the language used in abstracts with the language used in the full text of a document. We compared the abstract and non-abstract terms in each of the three language models.</Paragraph> <Paragraph position="1"> Not all of the documents distinguished the abstract from the body. Of the 100,830 documents, 92,723 had abstracts and 97,455 had sections other than an abstract. We considered only those documents that differentiated between sections.</Paragraph> <Paragraph position="2"> Although the number of documents did not differ greatly, the vocabulary size did. There were 214,994 terms in the abstract vocabulary and 1,337,897 terms in the document body, suggesting a possible difference in the distribution of terms, the log(n_i) component of IDF.</Paragraph> <Paragraph position="3"> Figure 5 suggests that the language used in an abstract differs from the language used in the body of a document. On average, the weights assigned to stemmed terms in the abstract were higher than the weights assigned to terms in the body of a document (space limitations preclude the inclusion of the ISF and ITF figures).</Paragraph> </Section> <Section position="5" start_page="606" end_page="606" type="sub_section"> <SectionTitle> 4.5 IDF sensitivity </SectionTitle> <Paragraph position="0"> The stability of the corpus weighting scheme is particularly important in a dynamic environment such as the web. 
Without an understanding of how IDF behaves, we are unable to make a principled decision regarding how often a system should update the corpus weights.</Paragraph> <Paragraph position="1"> To measure the sensitivity of IDF, we sampled at 10% intervals from the global corpus as outlined in section 3. Figure 6 compares the global IDF with the IDF from each of the 10% samples.</Paragraph> <Paragraph position="2"> The 10% samples are almost indiscernible from the global IDF, which suggests that IDF values are very stable with respect to a random subset of articles. Only the smallest (10%) sample shows any visible difference from the global IDF values, and even then, the difference is only noticeable at higher global IDF values (greater than 17 in our corpus). In addition to a random sample, we compared the global IDF with IDF values generated from each journal (in an on-line environment, it may be pertinent to partition pages into academic or corporate URLs, or to calculate term frequencies for web pages separately from blogs and wikis). In this case, N in equation (1) was the number of documents in the journal and n_i the number of documents in the journal containing term i, thus reflecting the distribution of terms within a journal.</Paragraph> <Paragraph position="5"> If the journal vocabularies were independent, the vocabulary size would be 4.1 million for unstemmed terms and 2.6 million for stemmed terms. Thus, the journals shared 48% and 52% of their vocabulary for unstemmed and stemmed terms, respectively. The average IDF within a journal also differed greatly from the global IDF value, particularly when the global IDF value exceeded five. This contrasts sharply with the random samples shown in Figure 6.</Paragraph> <Paragraph position="6"> At first glance, the journals with more articles appear to correlate more with the global IDF than journals with fewer articles. For example, JACSAT has 14,400 documents and is most correlated, while MPOHBP with 58 documents is least correlated. 
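The random-sample comparison can be sketched as follows (a synthetic stand-in corpus with hypothetical sizes; `idf_table` and the df >= 100 cutoff are illustrative choices, not the study's procedure):

```python
import math
import random

def idf_table(docs):
    """Return (document frequencies, term -> log2(N / n_i)) for term-set docs."""
    N = len(docs)
    df = {}
    for d in docs:
        for t in d:
            df[t] = df.get(t, 0) + 1
    return df, {t: math.log2(N / n) for t, n in df.items()}

# Synthetic corpus: term i occurs in a document with probability 1/i,
# giving a Zipf-like document-frequency profile.
random.seed(0)
docs = [{f"t{i}" for i in range(1, 501) if random.random() < 1.0 / i}
        for _ in range(5000)]

global_df, global_idf = idf_table(docs)
_, sample_idf = idf_table(random.sample(docs, 500))   # one random 10% sample

# Compare on terms that are reasonably common globally (df >= 100);
# their sampled IDF values should sit very close to the global values.
common = [t for t, n in global_df.items() if n >= 100 and t in sample_idf]
mean_abs_diff = sum(abs(global_idf[t] - sample_idf[t]) for t in common) / len(common)
```

For frequent terms the sampled IDF tracks the global IDF closely, while rare (high-IDF) terms are the ones whose sampled values drift, mirroring the pattern in Figure 6.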
We plotted the number of articles in each journal against the mean squared error (figure not shown) and found that journals with fewer than 2,000 articles behave differently from journals with more than 2,000 articles; however, the relationship between the number of articles in a journal and the degree to which the language in that journal reflects the language used in the entire collection was not clear.</Paragraph> </Section> </Section> </Paper>