<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0102"> <Title>Using Long Runs as Predictors of Semantic Coherence in a Partial Document Retrieval System</Title> <Section position="7" start_page="9" end_page="11" type="evalu"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> The goal of employing probability and vector processing is to confirm the linguistic premise that long runs of content words can be used as predictors of semantic intent. But we also want to exploit the computational advantage of removing the function words from the document, which reduces the number of tokens processed by about 50% and thus reduces vector space and probability computations. If it is true that long runs of content words are predictors of semantic coherence, we can further reduce the complexity of vector computations: (1) by eliminating from consideration those paragraphs without long runs; (2) within the remaining paragraphs, computing and summing the semantic coherence of the longest runs only; (3) ranking the eligible paragraphs for retrieval by their semantic weights relative to the query.</Paragraph> <Paragraph position="1"> Jang (1997) established that the distributions of long runs and short runs of content words in a collection of paragraphs are drawn from different populations. This implies that either long runs or short runs are predictors; but since all paragraphs contain short runs, i.e., single content words separated by function words, only long runs can be useful predictors. Furthermore, only long runs as we define them can be used as predictors, because short runs are insufficient to form the language constructs for prepositional phrase and subject complement positions. If short runs were discriminators, the linguistic assumption of this research would be violated. 
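The three filtering steps above can be sketched as follows. The function-word list and the long-run threshold (three or more consecutive content words) are illustrative assumptions for this sketch, not values taken from the paper.

```python
# Illustrative function-word list; the paper's actual stoplist is not given here.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
                  "is", "are", "was", "were", "by", "with", "for", "as"}

def content_runs(paragraph):
    """Return the maximal runs of consecutive content words in a paragraph."""
    runs, current = [], []
    for token in paragraph.lower().split():
        word = token.strip(".,;:!?")
        if word in FUNCTION_WORDS:
            if current:
                runs.append(current)
            current = []
        elif word:
            current.append(word)
    if current:
        runs.append(current)
    return runs

def has_long_run(paragraph, min_len=3):
    """Step (1): keep a paragraph only if it contains at least one long run."""
    return any(len(r) >= min_len for r in content_runs(paragraph))
```

Step (2) would then score only the longest run(s) of each surviving paragraph, and step (3) would sort the survivors by that score relative to the query.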
The statistical analysis of Jang (1997) does not indicate this to be the case.</Paragraph> <Paragraph position="2"> To establish the viability of our approach, we proposed the following experimental hypotheses: (H1) The SEMCAT weights for long runs of content words are statistically greater than the weights for short runs of content words. Since each content word can map to multiple SEMCATs, we cannot assume that the semantic weight of a long run is a function of its length. The semantic coherence of long runs should be a more granular discriminator.</Paragraph> <Paragraph position="3"> (H2) For paragraphs containing long runs and short runs, the distribution of long run SEMCAT weights is statistically different from the distribution of short run SEMCAT weights.</Paragraph> <Paragraph position="4"> (H3) There is a positive correlation between the sum of long run SEMCAT weights and the semantic coherence of a paragraph, i.e., the total paragraph SEMCAT weight.</Paragraph> <Paragraph position="5"> A detailed description of these experiments and their outcomes is given in Shin (1997, 1999). The results of the experiments and their implications for the proposed method are discussed below. Table 3 gives the SEMCAT weights for seventeen paragraphs randomly chosen from one document in the collection of Jang (1997).</Paragraph> <Paragraph position="6"> The data were evaluated using a standard two-way F test and analysis of variance table with α = .05. The analysis of variance table for the paragraphs in Table 3 is shown in Table 4.</Paragraph> <Paragraph position="7"> At the .05 significance level, Fα=.05 = 4.49 for 1,16 degrees of freedom. Since 68.56 > 4.49, we reject the assertion that the column means (run weights) in Table 3 are equal. Long run and short run weights come from different populations. 
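The test just described can be sketched as a two-way ANOVA without replication, with paragraphs as rows (blocks) and run type as columns (treatments); a 17 × 2 table gives exactly the 1,16 column degrees of freedom quoted above. This is a minimal sketch assuming that design; the data in the doctest are hypothetical, and only the degrees of freedom and the critical value mirror the text.

```python
import numpy as np
from scipy import stats

def two_way_anova_no_rep(table):
    """Two-way ANOVA without replication.
    Rows are blocks (paragraphs); columns are treatments
    (e.g. long-run vs. short-run SEMCAT weight)."""
    table = np.asarray(table, dtype=float)
    r, c = table.shape
    grand = table.mean()
    ss_rows = c * ((table.mean(axis=1) - grand) ** 2).sum()
    ss_cols = r * ((table.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((table - grand) ** 2).sum() - ss_rows - ss_cols
    df_rows, df_cols, df_err = r - 1, c - 1, (r - 1) * (c - 1)
    f_rows = (ss_rows / df_rows) / (ss_err / df_err)
    f_cols = (ss_cols / df_cols) / (ss_err / df_err)
    return f_rows, f_cols, (df_rows, df_cols, df_err)

# A 17-row x 2-column table has column df = (1, 16), matching the
# critical value quoted in the text:
print(round(stats.f.ppf(0.95, dfn=1, dfd=16), 2))  # 4.49
```

With this critical value in hand, the decision rule is the one used in the text: reject equality of column means whenever the computed column F exceeds 4.49.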
We accept H1.</Paragraph> <Paragraph position="8"> For the between-paragraph treatment, the row means (paragraph weights) have an F value of 2.21. At the .05 significance level, Fα=.05 = 2.28 for 16,16 degrees of freedom. Since 2.21 < 2.28, we cannot reject the assertion that there is no significant difference in SEMCAT weights between paragraphs. That is, paragraph weights do not appear to be drawn from different populations, unlike the long run and short run weight distributions. Thus, the semantic weight of the content words in a paragraph cannot be used to predict the semantic weight of the paragraph. We therefore proceed to examine H2. Notice that two paragraphs in Table 3 are without long runs. We repeat the analysis of variance for only those paragraphs with long runs to see whether long runs are discriminators. Table 5 summarizes those paragraphs.</Paragraph> <Paragraph position="9"> These data were evaluated using a standard two-way F test and analysis of variance with α = .05. For the paragraphs in Table 5, at the .05 significance level, Fα=.05 = 4.10 for 2,10 degrees of freedom, and the computed F of 291.44 exceeds it; likewise Fα=.05 = 2.98 for 10,10 degrees of freedom, and the computed F of 19.22 exceeds it. For paragraphs in a collection containing both long and short runs, the SEMCAT weights of the long runs and short runs are therefore drawn from different distributions. We accept H2.</Paragraph> <Paragraph position="10"> For paragraphs containing long runs and short runs, the distribution of long run SEMCAT weights is different from the distribution of short run SEMCAT weights. We know from the linguistic basis for long runs that short runs cannot be used as predictors. We therefore proceed to examine the Pearson correlation between the long run SEMCAT weights and the paragraph SEMCAT weights for those paragraphs with both long and short runs. These weights have a positive Pearson product-moment correlation coefficient of .952. We therefore accept H3. 
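The final check can be sketched directly from the definition of the Pearson product-moment coefficient. The weight vectors below are hypothetical stand-ins for the long run and total paragraph SEMCAT weights (the paper's actual values are in its tables), and the two critical values reproduce the ones quoted above.

```python
from scipy import stats

# Critical values quoted in the text for the second ANOVA:
print(round(stats.f.ppf(0.95, dfn=2, dfd=10), 2))    # 4.1
print(round(stats.f.ppf(0.95, dfn=10, dfd=10), 2))   # 2.98

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical long-run weights vs. total paragraph weights:
long_run_weights = [2.1, 3.4, 1.8, 4.0, 2.9]
paragraph_weights = [3.0, 4.1, 2.5, 5.2, 3.6]
print(round(pearson(long_run_weights, paragraph_weights), 3))
```

A coefficient near 1, such as the .952 reported in the text, indicates that summed long-run weights rise and fall with total paragraph weight, which is what H3 asserts.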
There is a positive correlation between the sum of long run SEMCAT weights and the semantic coherence of a paragraph, i.e., the total paragraph SEMCAT weight.</Paragraph> </Section> </Paper>