<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1015"> <Title>Corpus-Based Linguistic Indicators for Aspectual Classification</Title> <Section position="3" start_page="112" end_page="113" type="metho"> <SectionTitle> 3 Linguistic Indicators </SectionTitle> <Paragraph position="0"> The best way to exploit aspectual markers is not obvious: while the presence of a marker in a particular clause constrains the aspectual class of that clause, its absence places no constraint. Therefore, as with most statistical methods for natural language, the linguistic constraints associated with markers are best exploited by a system that measures co-occurrence frequencies. For example, a verb that appears more frequently in the progressive is more likely to describe an event.</Paragraph> <Paragraph position="1"> Klavans and Chodorow (1992) pioneered the application of statistical corpus analysis to aspectual classification by ranking verbs according to the frequencies with which they occur with certain aspectual markers.</Paragraph> <Paragraph position="2"> Table 3 lists the linguistic indicators evaluated for aspectual classification. [Table 3: the linguistic indicators evaluated for aspectual classification, each illustrated with an example clause: "She can not explain why.", "I saw to it then.", "He was admitted.", "... blood pressure going up.", "She built it in an hour.", "They have landed.", "I am happy.", "I am behaving myself.", "She studied diligently.", "They performed horribly.", "I was happy.", "I sang for ten minutes.", "She will live indefinitely."] Each indicator has a unique value for each verb. The first indicator, frequency, is simply the frequency with which each verb occurs over the entire corpus. The remaining 13 indicators measure how frequently each verb occurs in a clause with the named linguistic marker. For example, the next three indicators listed measure the frequency with which verbs 1) are modified by not or never, 2) are modified by a temporal adverb such as then or frequently, and 3) have no deep subject (e.g., passive phrases such as "She was admitted to the hospital"). Further details regarding these indicators and their linguistic motivation are given by Siegel (1998b).</Paragraph> <Paragraph position="3"> There are several reasons to expect superior classification performance when employing multiple linguistic indicators in combination rather than individually. While individual indicators have predictive value, they are predictively incomplete. This incompleteness has been illustrated empirically by showing that some indicators help for only a subset of verbs (Siegel, 1998b). Such incompleteness is due in part to sparsity and noise of data when computing indicator values over a corpus of limited size with some parsing errors. However, it is also a consequence of the linguistic characteristics of various indicators. For example: * Aspectual coercion such as iteration compromises indicator measurements (Moens and Steedman, 1988).
For example, a punctual event appears with the progressive in "She was sneezing for a week" (point -> process -> culminated process). In this example, for a week can only modify an extended event, requiring the first coercion. In addition, this for-PP also makes the event culminated, causing the second transformation.</Paragraph> <Paragraph position="18"> * Some aspectual markers such as the pseudo-cleft and manner adverbs test for intentional events, and are therefore not compatible with all events, e.g., "*I died diligently." * The progressive indicator's predictiveness for stativity is compromised by the fact that many location verbs can appear with the progressive, even in their stative sense, e.g., "The book was lying on the shelf." (Dowty, 1979) * Several indicators measure phenomena that are not linguistically constrained by any aspectual category, e.g., the present tense, frequency and not/never indicators.</Paragraph> </Section> <Section position="4" start_page="113" end_page="116" type="metho"> <SectionTitle> 4 Method and Results </SectionTitle> <Paragraph position="0"> In this section, we evaluate the set of fourteen linguistic indicators for two aspectual distinctions: stativity and completedness. Evaluation is over corpora of medical reports and novels, respectively. This data is summarized in Table 4 (available at www.cs.columbia.edu/~evs/VerbData).</Paragraph> <Paragraph position="1"> First, the linguistic indicators are each evaluated individually; a training set is used to select indicator value thresholds for classification. Then, we report the classification performance achieved by combining multiple indicators; in this case, the training set is used to optimize a model for combining indicators. In both cases, evaluation is performed over a separate test set of clauses.</Paragraph> <Paragraph position="2"> The combination of indicators is performed by four standard supervised learning algorithms: decision tree induction (Quinlan, 1986), CART (Friedman, 1977), log-linear regression (Santner and Duffy, 1989) and genetic programming (GP) (Cramer, 1985; Koza, 1992).</Paragraph> <Paragraph position="3"> A pilot study showed no further improvement in accuracy or recall tradeoff from additional learning algorithms: Naive Bayes (Duda and Hart, 1973), Ripper (Cohen, 1995), ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and meta-learning to combine learning methods (Chan and Stolfo, 1993).</Paragraph> <Paragraph position="4">
                     stativity            completedness
corpus:              3,224 med reports    10 novels
size:                1,159,891            846,913
parsed clauses:      97,973
training:            739 (634 events)
testing:             739 (619 events)
verbs in test set:   222                  204
clauses excluded:    be and have          stative
Table 4: data sets.</Paragraph> <Section position="1" start_page="114" end_page="115" type="sub_section"> <SectionTitle> 4.1 Stativity </SectionTitle> <Paragraph position="0"> Our experiments are performed across a corpus of 3,224 medical discharge summaries. A medical discharge summary describes the symptoms, history, diagnosis, treatment and outcome of a patient's visit to the hospital. These reports were parsed with the English Slot Grammar (ESG) (McCord, 1990), resulting in 97,973 clauses that were parsed fully with no self-diagnostic errors (ESG produced error messages on 12,877 of this corpus' 51,079 complex sentences).</Paragraph>
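<Paragraph> As a concrete illustration of how such per-verb indicator values can be computed over the parsed clauses, the following minimal Python sketch tallies, for each verb, its corpus frequency and the proportion of its clauses carrying each marker. It is not the system used in this work; the Clause record, the marker labels, and the helper function are illustrative assumptions (the full indicator set is given in Table 3).

from collections import Counter, defaultdict
from typing import Iterable, NamedTuple, Set

class Clause(NamedTuple):
    # One parsed clause, reduced to its main verb and the aspectual
    # markers observed in it (marker labels are illustrative).
    verb: str
    markers: Set[str]

MARKERS = ["not/never", "temporal adverb", "no deep subject", "progressive",
           "perfect", "manner adverb", "present tense", "duration for-PP"]

def indicator_values(clauses: Iterable[Clause]):
    """For each verb, return its frequency and, for each marker, the
    proportion of the verb's clauses that carry that marker."""
    freq = Counter()                    # clauses per verb
    with_marker = defaultdict(Counter)  # verb -> marker -> count
    for clause in clauses:
        freq[clause.verb] += 1
        for m in clause.markers:
            with_marker[clause.verb][m] += 1
    values = {}
    for verb, n in freq.items():
        row = {"frequency": n}
        for m in MARKERS:
            row[m] = with_marker[verb][m] / n
        values[verb] = row
    return values

# Tiny usage example with hand-made clauses (not real corpus data):
sample = [Clause("show", {"not/never"}),
          Clause("show", set()),
          Clause("admit", {"no deep subject"})]
for verb, row in indicator_values(sample).items():
    print(verb, row)
</Paragraph>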
<Paragraph position="1"> Be and have, the two most popular verbs, covering 31.9% of the clauses in this corpus, are handled separately from all other verbs. Clauses with be as their main verb, comprising 23.9% of the corpus, always denote a state. Clauses with have as their main verb, comprising 8.0% of the corpus, are highly ambiguous, and have been addressed separately by considering the direct object of such clauses (Siegel, 1998a).</Paragraph> <Paragraph position="2"> 1,851 clauses from the parsed corpus were manually marked according to stativity. As a linguistic test for marking, each clause was tested for readability with "What happened was..." A comparison between human markers for this test, performed over a different corpus, is reported elsewhere. 373 clauses were rejected because of parsing problems, leaving 1,478 clauses, divided equally into training and testing sets.</Paragraph> <Paragraph position="3"> 83.8% of clauses with main verbs other than be and have are events, so classifying every clause as an event provides a baseline accuracy of 83.8% for comparison. Since our approach examines only the main verb of a clause, classification accuracy over the test cases has a maximum of 97.4% due to the presence of verbs with multiple classes.</Paragraph> <Paragraph position="4"> The values of the indicators listed in Table 5 were computed, for each verb, across the 97,973 parsed clauses from our corpus of medical discharge summaries.</Paragraph> <Paragraph position="5"> The second and third columns of Table 5 show the average value for each indicator over stative and event clauses, as measured over the training examples. For example, 4.44% of stative clauses are modified by either not or never, but only 1.56% of event clauses are so modified.</Paragraph> <Paragraph position="6"> The fourth column shows the results of T-tests that compare indicator values over stative training cases to those over event cases for each indicator. As shown, the differences in stative and event means are statistically significant (p < .01) for the first seven indicators.</Paragraph> <Paragraph position="7"> Each indicator was tested individually for classification accuracy by establishing a classification threshold over the training data, and validating performance over the testing data using the same threshold. Only the frequency indicator succeeded in significantly improving classification accuracy by itself, achieving an accuracy of 88.0%. This improvement was achieved simply by classifying the popular verb show as a state and all other verbs as events. Although many domains may primarily use show as an event, its appearances in medical discharge summaries, such as "His lumbar puncture showed evidence of white cells," primarily use show to denote a state.</Paragraph> <Paragraph position="8"> [Table 6 (caption fragment): ... and two performance baselines, distinguishing states from events.]</Paragraph> <Paragraph position="9"> Three machine learning methods successfully combined indicator values, improving classification accuracy over the baseline measure. As shown in Table 6, the decision tree attained the highest accuracy, 93.9%. Binomial tests showed this to be a significant improvement over the 88.0% accuracy achieved by the frequency indicator alone, as well as over the other two learning methods. No further improvement in classification performance was achieved by CART.</Paragraph>
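<Paragraph> The per-indicator evaluation just described — select a threshold over the training clauses, then apply the same threshold unchanged to the test clauses — can be sketched as follows. This is an illustrative reconstruction rather than the code used in this work; the data layout (a list of (indicator value, class label) pairs) and the example values are assumptions.

def best_threshold(train):
    """Pick the threshold and orientation maximizing training accuracy.
    train: list of (value, label) pairs, label is "state" or "event"."""
    candidates = sorted({v for v, _ in train})
    best = (0.0, 0.0, "state")          # (accuracy, threshold, high_label)
    for t in candidates:
        for high_label in ("state", "event"):
            low_label = "event" if high_label == "state" else "state"
            correct = sum(1 for v, y in train
                          if (high_label if v >= t else low_label) == y)
            acc = correct / len(train)
            if acc > best[0]:
                best = (acc, t, high_label)
    return best[1], best[2]

def evaluate(test, threshold, high_label):
    """Apply the training-set threshold, unchanged, to the test set."""
    low_label = "event" if high_label == "state" else "state"
    correct = sum(1 for v, y in test
                  if (high_label if v >= threshold else low_label) == y)
    return correct / len(test)

# Hypothetical frequency-indicator values paired with manual classes:
train = [(3215, "state"), (12, "event"), (4, "event"), (2981, "state")]
test = [(3302, "state"), (7, "event"), (15, "event")]
t, hi = best_threshold(train)
print("threshold:", t, "test accuracy:", evaluate(test, t, hi))
</Paragraph>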
<Paragraph position="10"> The increase in the number of stative clauses correctly classified, i.e. stative recall, illustrates an even greater improvement over the baseline. As shown in Table 6, the three learning methods achieved stative recalls of 74.2%, 47.4% and 34.2%, compared to the 0.0% stative recall achieved by the baseline, with only a small loss in recall over event clauses. The baseline does not classify any stative clauses correctly because it classifies all clauses as events.</Paragraph> <Paragraph position="11"> Classification performance is equally competitive without the frequency indicator, although this indicator appears to dominate the others. When decision tree induction was employed to combine only the 13 indicators other than frequency, the resulting decision tree achieved 92.4% accuracy and 77.5% stative recall.</Paragraph> </Section> <Section position="2" start_page="115" end_page="116" type="sub_section"> <SectionTitle> 4.2 Completedness </SectionTitle> <Paragraph position="0"> In medical discharge summaries, non-culminated event clauses are rare. Therefore, our experiments for classification according to completedness are performed across a corpus of ten novels comprising 846,913 words. These novels were parsed with ESG, resulting in 75,289 fully-parsed clauses (22,505 of 59,816 sentences produced errors).</Paragraph> <Paragraph position="1"> 884 clauses from the parsed corpus were manually marked according to completedness. Of these, 109 were rejected because of parsing problems, and 160 were rejected because they described states. The remaining 615 clauses were divided into training and test sets such that the distribution of classes was equal. The baseline method in this case achieves 63.3% accuracy.</Paragraph> <Paragraph position="2"> The linguistic test used for this task was adopted from Passonneau (1988): if a clause in the past progressive necessarily entails the past tense reading, the clause describes a non-culminated event. For example, "We were talking just like men" (non-culm.) entails that "We talked just like men," but "The woman was building a house" (culm.) does not necessarily entail that "The woman built a house." Cross-checking between linguists shows high agreement. In particular, in a pilot study manually annotating 89 clauses from this corpus according to stativity, two linguists agreed 81 times. Of 57 clauses agreed to be events, 46 had agreement with respect to completedness.</Paragraph> <Paragraph position="3"> The verb say (point), which occurs nine times in the test set, was initially marked incorrectly as culminated, since points are non-extended and therefore cannot be placed in the progressive. After some initial experimentation, we corrected the class of each occurrence of say in the data.</Paragraph> <Paragraph position="4"> Indicator values were again computed for each verb, this time over the parsed clauses from the novels. The differences in culminated and non-culminated means are statistically significant (p < .05) for the first six indicators. However, for completedness, no indicator was shown to significantly improve classification accuracy over the baseline.</Paragraph> <Paragraph position="5"> [Table 8 (caption fragment): ... and two performance baselines, distinguishing culminated from non-culminated events.]</Paragraph> <Paragraph position="6"> As shown in Table 8, the highest accuracy, 74.0%, was attained by CART. A binomial test shows this to be a significant improvement over the 63.3% baseline.</Paragraph> <Paragraph position="7"> The increase in non-culminated recall illustrates a greater improvement over the baseline. As shown in Table 8, non-culminated recalls of up to 53.6% were achieved by the learning methods, compared to the 0.0% achieved by the baseline.</Paragraph>
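<Paragraph> To make the combination step and the reported metrics concrete, the sketch below trains an off-the-shelf decision tree on fourteen per-clause indicator values and reports accuracy and per-class recall against a majority-class baseline, in the spirit of Tables 6 and 8. It uses scikit-learn purely as a stand-in for the learners named above (the decision tree, CART, log-linear regression and GP implementations used in this work are not reproduced here), and the data is randomly generated for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy data: one row of fourteen indicator values per clause (taken from
# the clause's main verb); 1 = culminated, 0 = non-culminated.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 14)), rng.integers(0, 2, 200)
X_test, y_test = rng.random((80, 14)), rng.integers(0, 2, 80)

# An entropy-based decision tree stands in for decision tree induction;
# any of the other learners could be substituted here.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Majority-class baseline: label every clause with the most frequent
# training class (the uninformed baseline discussed in the text).
majority = int(np.bincount(y_train).argmax())
baseline = np.full_like(y_test, majority)

print("accuracy:         ", accuracy_score(y_test, pred))
print("baseline accuracy:", accuracy_score(y_test, baseline))
print("recall per class: ", recall_score(y_test, pred, average=None))
</Paragraph>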
<Paragraph position="8"> Additionally, a non-culminated F-measure of 61.9 was achieved by GP when optimizing for F-measure, improving over the 53.7 attained by the optimal uninformed baseline. F-measure computes a tradeoff between recall and precision (Van Rijsbergen, 1979). In this work, we weigh recall and precision equally, in which case F-measure = (recall * precision) / ((recall + precision) / 2).</Paragraph> <Paragraph position="9"> Automatic methods highly prioritized the perfect indicator: the induced decision tree uses it as its first discriminator; log-linear regression ranked it fourth out of fourteen; function trees created by GP include it as one of five indicators used together to increase classification performance; and it tied as most highly correlated with completedness (cf. Table 7).</Paragraph> </Section> </Section> <Section position="5" start_page="116" end_page="117" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> Since certain verbs are aspectually ambiguous, and, in this work, clauses are classified by their main verb only, a second baseline approach would be to simply memorize the majority aspect of each verb in the training set, and classify verbs in the test set accordingly. In this case, test verbs that did not appear in the training set would be classified according to the majority class.</Paragraph> <Paragraph position="1"> However, classifying verbs and clauses according to numerical indicators has several important advantages over this baseline: * Handles rare or unlabeled verbs. The results we have shown serve to estimate classification performance over "unseen" verbs that were not included in the supervised training sample. Once the system has been trained to distinguish by indicator values, it can automatically classify any verb that appears in unlabeled corpora, since measuring linguistic indicators for a verb is fully automatic. This also applies to verbs that are underrepresented in the training set. For example, one node of the resulting decision tree trained to distinguish according to stativity identifies 19 stative test cases, without misclassifying any of 27 event test cases, whose verbs occur only once each in the training set.</Paragraph> <Paragraph position="2"> * Success when training doesn't include test verbs. To test this, all test verbs were eliminated from the training set, and log-linear regression was trained over this smaller set to distinguish according to completedness. The result is shown in Table 8 ("llr2"). Accuracy remained higher than the baseline "bl1" (bl2 is not applicable), and the recall tradeoff is felicitous.</Paragraph> <Paragraph position="3"> * Improved performance. Memorizing majority aspect does not achieve as high an accuracy as the linguistic indicators for completedness, nor does it achieve as wide a recall tradeoff for both stativity and completedness. These results are indicated as the second baselines ("bl2") in Tables 6 and 8, respectively.</Paragraph> <Paragraph position="4"> * Scalar values assigned to each verb allow the tradeoff between recall and precision to be selected for particular applications by selecting the classification threshold. For example, in a separate study, optimizing for F-measure resulted in a more dramatic tradeoff in recall values as compared to those attained when optimizing for accuracy (Siegel, 1998b).
Moreover, such scalar values can provide input to systems that perform reasoning on fuzzy or uncertain knowledge.</Paragraph> <Paragraph position="5"> * This framework is expandable, since additional indicators can be introduced by measuring the frequencies of additional aspectual markers. Furthermore, indicators measured over multiple clausal constituents, e.g., main verb-object pairs, alleviate verb ambiguity and sparsity and improve classification performance (Siegel, 1998b).</Paragraph> </Section> </Paper>