<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0318">
  <Title>Learning Methods for Combining Linguistic Indicators to Classify Verbs</Title>
  <Section position="5" start_page="158" end_page="159" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> Since we are evaluating our approach over verbs other than be and have, the test set is only 16.2% states, as shown in Table 4. Therefore, simply classifying every verb as an event achieves an accuracy of 83.8% over the 739 test cases, since 619 are events.</Paragraph>
    <Paragraph position="1"> However, this approach classifies all stative clauses incorrectly, achieving a stative recall of 0.0%. This method serves as a baseline for comparison, since we are attempting to improve over an uninformed approach. 3</Paragraph>
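The baseline arithmetic above can be reproduced directly from the reported counts. A minimal Python sketch (the counts 739 and 619 come from the paper; the variable names are illustrative):

```python
# Majority-class baseline: label every clause an "event".
# Counts from the paper's test set: 739 clauses, of which 619 are events.
n_total = 739
n_events = 619
n_states = n_total - n_events  # 120 stative clauses

accuracy = n_events / n_total  # fraction correct when everything is "event"
stative_recall = 0 / n_states  # no stative clause is ever retrieved

print(f"baseline accuracy: {accuracy:.1%}")        # 83.8%
print(f"stative recall:    {stative_recall:.1%}")  # 0.0%
```

This makes explicit why a high baseline accuracy can coexist with a useless stative recall: the metric is dominated by the majority class.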
    <Section position="1" start_page="158" end_page="159" type="sub_section">
      <SectionTitle>
4.1 Individual Indicators
</SectionTitle>
      <Paragraph position="0"> The second and third columns of Table 2 show the average value for each indicator over stative and event clauses, as measured over the 739 training examples. As described above, these examples exclude be and have. For example, 4.44% of stative clauses are modified by either not or never, but only 1.56% of event clauses are modified by these adverbs. The fourth column shows the results of T-tests that compare the indicator values over stative verbs to those over event verbs. For example, there is less than a 0.05% chance that the difference between stative and event means for the first four indicators listed is due to chance. Overall, this shows that the differences in stative and event averages are statistically significant for the first seven indicators listed (p &lt; .01). (Footnote 3: ...many classification problems (Duda and Hart, 1973), e.g., part-of-speech tagging (Church, 1988; Allen, 1995).)</Paragraph>
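A two-sample test of this kind can be sketched with the standard library. The snippet below computes Welch's t statistic (for unequal variances) over hypothetical per-clause indicator values, not the paper's data; the sample means are chosen to echo the 4.44% vs. 1.56% contrast reported above:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variance."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

# Hypothetical per-verb rates of modification by "not"/"never" (percent).
stative_vals = [4.1, 4.8, 4.5, 4.3, 4.6]
event_vals = [1.4, 1.7, 1.5, 1.6, 1.6]

t = welch_t(stative_vals, event_vals)
print(f"t = {t:.2f}")
```

A large t statistic corresponds to a small probability that the difference in means arose by chance, which is the quantity the fourth column of Table 2 reports.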
      <Paragraph position="1"> This analysis has revealed correlations between verb class and five indicators that have not been linked to stativity in the linguistics literature. Of the top seven indicators shown to have positive correlations with stativity, three have been linguistically motivated, as shown in Table 1. The other four were not previously hypothesized to correlate with aspectual class: (1) verb frequency, (2) occurrences modified by &amp;quot;not&amp;quot; or &amp;quot;never&amp;quot;, (3) occurrences with no deep subject, and (4) occurrences in the past or present participle. Furthermore, the last of these seven, occurrences in the perfect tense, was not previously hypothesized to correlate with stativity in particular.</Paragraph>
      <Paragraph position="2"> However, a positive correlation between indicator value and verb class does not necessarily mean an indicator can be used to increase classification accuracy. Each indicator was tested individually for its ability to improve classification accuracy over the baseline by selecting the best classification threshold over the training data. Only two indicators, verb frequency and occurrences with not and never, were able to improve classification accuracy over that obtained by classifying all clauses as events.</Paragraph>
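Per-indicator threshold selection can be sketched as follows. This is a toy Python example with invented frequency values and labels; the paper does not specify its exact search procedure, so midpoints between sorted observed values are assumed as candidate cuts:

```python
def best_threshold(values, labels):
    """Choose the cut maximizing training accuracy for one indicator,
    labeling a clause "state" when its indicator value exceeds the cut."""
    pts = sorted(values)
    cuts = [(a + b) / 2 for a, b in zip(pts, pts[1:])]  # midpoint candidates

    def acc(cut):
        hits = sum((v > cut) == (y == "state") for v, y in zip(values, labels))
        return hits / len(values)

    return max(cuts, key=acc)

# Hypothetical verb-frequency values and their training labels.
freq = [0.5, 0.9, 1.1, 2.4, 2.9, 3.3]
label = ["event", "event", "event", "state", "state", "state"]
cut = best_threshold(freq, label)
print(cut)  # 1.75, the midpoint separating the two groups perfectly
```

The same search run on a weakly correlated indicator may fail to beat the all-events baseline, which is the situation described for most of the indicators above.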
      <Paragraph position="3"> To validate this improvement, the thresholds established over the training set were applied to the test set, yielding accuracies of 88.0% and 84.0%, respectively. Binomial tests showed the first of these to be a significant improvement over the baseline of 83.8%, but not the second.</Paragraph>
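The binomial test against the baseline can be sketched with the standard library. Note the correct-answer count of 650 below is an assumed figure consistent with the reported 88.0% accuracy over 739 cases, not a number taken from the paper:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more successes
    if each trial independently succeeds with probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Could a classifier that guesses right with the baseline rate of 83.8%
# get 650 or more of 739 cases correct (about 88.0%) by luck alone?
p_value = binom_sf(650, 739, 0.838)
print(f"p = {p_value:.4f}")
```

A p-value below the significance level rejects the hypothesis that the apparent improvement is baseline-level guessing, mirroring the test reported above.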
    </Section>
    <Section position="2" start_page="159" end_page="159" type="sub_section">
      <SectionTitle>
4.2 Combining Indicators
</SectionTitle>
      <Paragraph position="0"> All three machine learning methods successfully combined indicator values, improving classification accuracy over the baseline measure. As shown in Table 5, the decision tree's accuracy was 93.9%, genetic programming's function trees had an average accuracy of 91.2% over seven runs, and the log-linear regression achieved an 86.7% accuracy. Binomial tests showed that both the decision tree and genetic programming achieved a significant improvement over the 88.0% accuracy achieved by the frequency indicator alone. Therefore, we have shown that machine learning methods can successfully combine multiple numerical indicators to improve the accuracy with which verbs are classified.</Paragraph>
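Why combining indicators helps can be illustrated with a toy decision rule. This is not the paper's decision tree, genetic programming, or log-linear model, just a minimal sketch on invented data showing how two individually imperfect thresholds can compound:

```python
# Each clause: (verb frequency, % modification by "not"/"never", true class).
# Invented values: one state is low-frequency, another is rarely negated,
# so each single-indicator rule misses one state, but their union misses none.
clauses = [
    (3.0, 0.5, "state"), (1.0, 5.0, "state"), (2.8, 4.0, "state"),
    (1.2, 1.0, "event"), (0.8, 0.9, "event"), (1.1, 1.2, "event"),
]

def accuracy(rule):
    return sum(rule(f, n) == y for f, n, y in clauses) / len(clauses)

def freq_only(f, n):
    return "state" if f > 2.0 else "event"

def neg_only(f, n):
    return "state" if n > 2.6 else "event"

def combined(f, n):
    # A two-level rule: fall through to the second indicator when the
    # first one does not fire, as a depth-2 decision tree would.
    return "state" if (f > 2.0 or n > 2.6) else "event"

for name, rule in [("freq", freq_only), ("neg", neg_only), ("both", combined)]:
    print(name, accuracy(rule))
```

On this data each single-indicator rule scores 5/6 while the combined rule scores 6/6; the learned models above exploit the same kind of complementarity across fourteen noisy indicators.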
      <Paragraph position="1"> The differences in accuracy between the three methods are each significant (p &lt; .01). Therefore, these results highlight the importance of how linear and non-linear interactions between numerical linguistic indicators are modeled.</Paragraph>
    </Section>
    <Section position="3" start_page="159" end_page="159" type="sub_section">
      <SectionTitle>
4.3 Improved Recall Tradeoff
</SectionTitle>
      <Paragraph position="0"> The increase in the number of stative clauses correctly classified, i.e., stative recall, illustrates a more dramatic improvement over the baseline. As shown in Table 5, the three learning methods achieved stative recalls of 74.2%, 47.4%, and 34.2%, compared to the 0.0% stative recall achieved by the baseline, at the cost of only a small loss in recall over event clauses. The baseline classifies no stative clauses correctly because it classifies all clauses as events. This difference in recall is more dramatic than the accuracy improvement because event clauses dominate the test set.</Paragraph>
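Per-class recall, the metric contrasted here, can be computed as follows; the labels are invented for illustration, not drawn from the paper's test set:

```python
def recall(true, pred, cls):
    """Fraction of clauses truly in class cls that were predicted as cls."""
    relevant = [p for t, p in zip(true, pred) if t == cls]
    return sum(p == cls for p in relevant) / len(relevant)

true = ["state", "state", "event", "event", "event", "event"]
baseline = ["event"] * 6                       # classify everything as event
learned = ["state", "event", "event", "event", "event", "event"]

print(recall(true, baseline, "state"))  # 0.0
print(recall(true, learned, "state"))   # 0.5
print(recall(true, learned, "event"))   # 1.0
```

As in the paper's test set, the event class dominates, so overall accuracy barely registers a change in stative recall that this per-class view makes obvious.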
      <Paragraph position="1"> This favorable tradeoff between recall values presents an advantage for applications that weigh the identification of stative clauses more heavily than that of event clauses. For example, a prepositional phrase denoting a duration with for, e.g., &amp;quot;for a minute,&amp;quot; describes either the duration of a state, e.g., &amp;quot;She felt sick for two weeks,&amp;quot; or the duration of the state that results from a telic event, e.g., &amp;quot;She left the room for a minute.&amp;quot; That is, correctly interpreting a use of for depends on identifying the stativity of the clause it modifies. A language understanding system that incorrectly classifies &amp;quot;She felt sick for two weeks&amp;quot; as a non-telic event will not detect that &amp;quot;for two weeks&amp;quot; describes the duration of the feel-state. If such a system, for example, summarizes durations, it is important that it correctly identify states.</Paragraph>
      <Paragraph position="2"> In this case, our approach is advantageous.</Paragraph>
    </Section>
  </Section>
</Paper>