<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1043">
  <Title>Test Sets</Title>
  <Section position="9" start_page="226" end_page="227" type="evalu">
    <SectionTitle>
EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"> To judge the effectiveness of the Relevancy Signatures Algorithm, we performed a variety of experiments. Since our algorithm derives relevancy signatures from a training set of texts, it is important that the training set be large enough to produce significant statistics. It is harder for a given word/concept node pair to occur than it is for only the word to occur, so many potenually useful pairings may not occur very often. At the same time, it is also important to have a large test set so we can feel confident that our results accurately represent the effectiveness of the algorithm. Because we were constrained by the relatively small size of the MUC-3 collection (1500 texts), balancing these two requirements was something of a problem.</Paragraph>
    <Paragraph position="1"> Dividing the MUC-3 corpus into 15 blocks of 100 texts each, we ran 15 preliminary experiments with each block using 1400 texts for training and the remaining 100 for testing. The results showed that we could achieve high levels of precision with non-trivial levels of recall. Of the 15 experiments, 7 test sets reached 80% precision with 70% recall, 10 sets hit 80% precision with _&gt; 40% recall, and 12 sets achieved 80% precision with ~ 25% recall. In addition, 7 of the test runs produced precision scores of 100% for recall levels &gt; 10% and 5 test sets produced recall levels &gt; 50% with precision over 85%.</Paragraph>
    <Paragraph position="2"> Based on these experiments, we identified two blocks of I(X) texts that gave us our best and our worst results. With these 200 texts in hand, we then trained once again on the remaining 1300 in order to obtain a uniform training base under which the remaining two test sets could be compared.</Paragraph>
    <Paragraph position="3"> Figure 1 shows the performance of these two test sets based on the training set of 1300 texts. Each data point represents the results of the Relevancy Signatures Algorithm for a different combination of parameter values.</Paragraph>
    <Paragraph position="4"> We tested the reliability threshold at 70%, 75%, 80%, 85%, 90%, and 95% and varied the minimum number of occurrences from 0 to 19. As the data demonstrates, the results of the two test sets are clearly separated. Our best test results are associated with uniformly high levels of precision throughout (&gt; 78%), while our worst test results ranged from 47% to 67% precision. These results indicate the full range of our performance: average performance would fall somewhere in between these two extremes.</Paragraph>
    <Section position="1" start_page="226" end_page="227" type="sub_section">
      <SectionTitle>
Test Sets
</SectionTitle>
      <Paragraph position="0"> Low reliability and low M thresholds produce strong recall (but weaker precision) for relevant texts while high refiabifity and high M thresholds produce strong precision (but weaker recall) for the relevant texts being retrieved. A high reliability threshold ensures that the algorithm uses only relevancy signatures that are very strongly correlated with relevant texts and a high minimum number of occurrences threshold ensures that it uses only relevancy signatures that have appeared with greater frequency. By adjusting these two parameter values, we can manipulate a recall/precision tradeoff.</Paragraph>
      <Paragraph position="1"> However, a clear recall/precision tradeoff is evident only when the algorithm is retrieving statistically significant numbers of texts. We can see from the graph in Figure 1 that precision fluctuates dramatically for our worst test set when recall values are under 50%. At these lower recall values, the algorithm is retreiving such small numbers of texts (less than 20 for our worst test se0 that gaining or losing a single text can have a significant impact on precision. Since our test sets contain only 100 texts each, statistical significance may not be reached until we approach fairly high recall values. With larger test sets we could expect to see somewhat more stable precision scores at lower recall levels because the number of texts being retrieved would be greater.</Paragraph>
      <Paragraph position="2"> The percentage of relevant texts in a test set also plays a role in determining statistical significance. Each of the test sets contains a different number of relevant texts. For example, the best test set (represented by the data points near the top of the Y-axis) contains 66 relevant texts, whereas the worst test set (represented by the data points near the middle of the Y-axis) contains only 39 relevant texts. The total percentage of relevant texts in the test  corpus provides a baseline against which precision must be assessed. A constant algorithm that classifies all texts as relevant will always yield 100% recall with a precision level determined by this baseline percentage. If only 10% of the test corpus is relevant, the constant algorithm will show a 10% rate Of pre~Sision. If 90% of the test corpus is relevant, the constant algorithm will achieve 90% precision. If we look at the graph in Figure 1 with this in mind, we find that a constant algorithm would yield 66% precision for the first test set but only 39% for the second test set. From this vantage point, we can see that the Relevancy Signatures Algorithm performs substantially better than the constant algorithm on both test sets.</Paragraph>
      <Paragraph position="3"> It was interesting to see how much variance we got across the different test sets. Several other factors may have also contributed to this. For one, the corpus is not a randomly ordered collection of texts. The MUC-3 articles were often ordered by date so it is not uncommon to find sequences of articles that describe the same event. One block of texts may contain several articles about a specific kidnapping event while a different block will not contain any articles about kidnappings. Second, the quality of the answer keys is not consistent across the corpus. During the course of MUC-3, each participating site was responsible for encoding the answer keys for different parts of the corpus.</Paragraph>
      <Paragraph position="4"> Although some cross-checking was done, the quality of the encoding is not consistent across the corpus. 5 The quality of the answer keys can affect both training and testing.</Paragraph>
      <Paragraph position="5"> The relatively small size of our training set was undoubtedly a limiting factor since many linguistic expressions appeared only a few times throughout the entire corpus. This has two ramifications for our algorithm: (1) many infrequent expressions are never considered as relevancy signatures because the minimum number of occurrences parameter prohibits them, and (2) expressions that occur with low frequencies will yield less reliable statistics. Having run experiments with smaller training sets, we have seen our results show marked improvement as the training set grows. We expect that this trend would continue for training sets greater than 1400, but corpus limitations have restricted us in that regard.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>