<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1050">
  <Title>SMOOTHING OF AUTOMATICALLY GENERATED SELECTIONAL CONSTRAINTS</Title>
  <Section position="7" start_page="255" end_page="256" type="evalu">
    <SectionTitle>
5. EVALUATION
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="255" end_page="256" type="sub_section">
      <SectionTitle>
5.1. Evaluation Metric
</SectionTitle>
      <Paragraph position="0"> We have previously (at COLING-92) described two methods for the evaluation of semantic constraints. For the current experiments, we have used one of these methods, where the constraints are evaluated against a set of manually classified semantic triples.</Paragraph>
      <Paragraph position="1"> For this evaluation, we select a small test corpus separate from the training corpus. We parse the corpus, regularize the parses, and extract triples just as we did for the semantic acquisition phase (with the exception that we use the non-stochastic grammar in order to generate all grammatically valid parses of each sentence). We then manually classify each triple as semantically valid or invalid (a triple is counted as valid if we believe that this pair of words could meaningfully occur in this relationship, even if this was not the intended relationship in this particular text).</Paragraph>
      <Paragraph position="2"> We then establish a threshold T for the weighted triples counts in our training set, and define</Paragraph>
      <Paragraph position="4"> number of triples in test set which were classified as valid and which appeared in training set with count &gt; T number of triples in test set which were classified as valid and which appeared in training set with count _&lt; T number of triples in test set which were classified as invalid and which appeared in training set with count &gt; T number of triples in test set which were classified as invalid and which appeared in training set with count &lt; T and then define</Paragraph>
      <Paragraph position="6"> appears at the top of the list for &amp;quot;attack&amp;quot; because both appear with the object &amp;quot;position&amp;quot;.) At a given threshold T, our smoothing process should increase recall but in practice will also increase the error rate. How can we tell if our smoothing is doing any good? We can view the smoothing process as moving some triples from v_ to v+ and from i_ to i+.4 Is it doing so better than some random process? I.e., is it preferentially raising valid items above the threshold? To assess this, we compute (for a fixed threshold) the quality measure V~--V+ i_ where the values with superscript S represent the values with smoothing, and those without superscripts represent the values without smoothing. If Q &gt; 1, then smoothing is doing better than a random process in identifying valid triples.</Paragraph>
    </Section>
    <Section position="2" start_page="256" end_page="256" type="sub_section">
      <SectionTitle>
5.2. Test Data
</SectionTitle>
      <Paragraph position="0"> The training corpus was the set of 1300 messages (with a total of 18,838 sentences) which constituted the development corpus for Message Understanding Conferences - 3 and 4 \[1,2\]. These messages are news reports from the Foreign Broadcast Information Service concerning terrorist activity in Central and South America. The average sentence length is about 24 words. In order to get higher-quality parses of these sentences, we disabled several of the recovery mechanisms 4In fact, some triples will move above the threshold and other will move below the threshold, but in the regions we are considering, the net movement will be above the threshold.</Paragraph>
      <Paragraph position="1"> normally used in parsing, such as longest-substring parsing; with these mechanisms disabled, we obtained parses for 9,903 of the 18,838 sentences. These parses were then regularized and reduced to triples. We generated a total of 46,659 distinct triples from this test corpus.</Paragraph>
      <Paragraph position="2"> The test corpus--used to generate the triples which were manually classified---consisted of 10 messages of similar style, taken from one of the test corpora for Message Understanding Conference - 3. These messages produced a test set containing a total of 636 distinct triples, of which 456 were valid and 180 were invalid.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="256" end_page="257" type="evalu">
    <SectionTitle>
5.3. Results
</SectionTitle>
    <Paragraph position="0"> In testing our smoothing procedure, we first generated the confusion matrix Pc and examined some of the entries. Figure 1 shows the largest entries in Pc for the verb &amp;quot;attack&amp;quot; and the noun &amp;quot;terrorist&amp;quot;, two very common words in the terrorist domain. It is clear that (with some odd exceptions) most of the words with high Pc values are semantically related to the original word.</Paragraph>
    <Paragraph position="1"> To evaluate the effectiveness of smoothing, we have compared three sets of triples frequency data:  1. the original (unsmoothed) dam 2. the data as smoothed using Pc 3. the data as generalized using a manually-prepared classi null fication hierarchy for a subset of the words of the domain  generalization strategy T v+ v_ i+ i_ recall error rate Q 1. no smoothing 0 139 317 13 167 30% 7% 2. confusion matrix 0 237 219 50 130 52% 28% 1.39 3. classification hierarchy 0 154 302 18 162 34% 10% 1.58 4. confusion matrix 0.29 154 302 17 163 34% 9% 1.90  For the: third method, we employed a classification hierarchy which had previously been prepared as part of the information extraction system used for Message Understanding Conference-4. This hierarchy included only the subset of the vocabulary thought relevant to the information extraction task (not counting proper names, roughly 10% of the words in the vocabulary). From this hierarchy we identified the 13 classes which were most frequently referred to in the lexico-semantic models used by the extraction system. If the head (first element) of a semantic triple was a member of one of these classes, the generalization process replaced that word by the most specific class to which it belongs (since we have a hierarchy with nested classes, a word will typically belong to several classes); to make the results comparable to those with confusion-matrix smoothing, we did not generalize the argument (last element) of the triple.</Paragraph>
    <Paragraph position="2"> The basic results are shown in rows 1, 2, and 3 of Table 1. For all of these we used a threshold (T) of 0, so a triple with any frequency &gt; 0 would go into the v+ or i+ category. In each case the quality measure Q is relative to the run without smoothing, entry 1 in the table. Both the confusion matrix and the classification hierarchy yield Qs substantially above 1, indicating that both methods are performing substantially better than random. The Q is higher with the classification hierarchy, as might be expected since it has been manually checked; on the other hand, the improvement in recall is substantially smaller, since the hierarchy covers only a small portion of the total vocabulary. As the table shows, the confusion matrix method produces a large increase in recall (about 73% over the base run).</Paragraph>
    <Paragraph position="3"> These comparisons all use a T (frequency threshold) of 0, which yields the highest recall and error rate. Different recall/error-rate trade-offs can be obtained by varying T. For example, entry 4 of the table shows the result for T=0.29, the point at which the recall using the confusion matrix and the classification hierarchy is the same (the values without smoothing and the values using the classification hierarchy are essentially unchanged at T=0.29), We observe that, for the same recall, the automatic smoothing does as well as the manually generated hierarchy with regard to error rate. (In fact, the Q value with smoothing (line 4) is much higher than with the classification hierarchy (line 3), but this reflects a difference of only 1 in i+ and should not be seen as significant.)</Paragraph>
  </Section>
class="xml-element"></Paper>