File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-1050_metho.xml
Size: 23,614 bytes
Last Modified: 2025-10-06 14:10:16
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1050"> <Title>Learning Event Durations from Event Descriptions</Title> <Section position="4" start_page="393" end_page="395" type="metho"> <SectionTitle> 2 Inter-Annotator Agreement </SectionTitle> <Paragraph position="0"> Although the graphical output of the annotations enables us to visualize quickly the level of agreement among different annotators for each event, a quantitative measurement of the agreement is needed.</Paragraph> <Paragraph position="1"> The kappa statistic (Krippendorff, 1980; Carletta, 1996) has become the de facto standard to assess inter-annotator agreement. It is computed as:</Paragraph> <Paragraph position="3"> P(A) is the observed agreement among the annotators, and P(E) is the expected agreement, which is the probability that the annotators agree by chance.</Paragraph> <Paragraph position="4"> In order to compute the kappa statistic for our task, we have to compute P(A) and P(E), but those computations are not straightforward.</Paragraph> <Paragraph position="5"> P(A): What should count as agreement among annotators for our task? P(E): What is the probability that the annotators agree by chance for our task?</Paragraph> <Section position="1" start_page="393" end_page="394" type="sub_section"> <SectionTitle> 2.1 What Should Count as Agreement? </SectionTitle> <Paragraph position="0"> Determining what should count as agreement is not only important for assessing inter-annotator agreement, but is also crucial for later evaluation of machine learning experiments. For example, for a given event with a known gold standard duration range from 1 hour to 4 hours, if a machine learning program outputs a duration of 3 hours to 5 hours, how should we evaluate this result? In the literature on the kappa statistic, most authors address only category data; some can handle more general data, such as data in interval scales or ratio scales. However, none of the techniques directly apply to our data, which are ranges of durations from a lower bound to an upper bound.</Paragraph> <Paragraph position="1"> 30 minutes] and [10 minutes, 2 hours].</Paragraph> <Paragraph position="2"> In fact, what coders were instructed to annotate for a given event is not just a range, but a duration distribution for the event, where the area between the lower bound and the upper bound covers about 80% of the entire distribution area. Since it's natural to assume the most likely duration for such distribution is its mean (average) duration, and the distribution flattens out toward the upper and lower bounds, we use the normal or Gaussian distribution to model our duration distributions. If the area between lower and upper bounds covers 80% of the entire distribution area, the bounds are each 1.28 standard deviations from the mean.</Paragraph> <Paragraph position="3"> Figure 1 shows the overlap in distributions for judgments of [10 minutes, 30 minutes] and [10 minutes, 2 hours], and the overlap or agreement is 0.508706.</Paragraph> </Section> <Section position="2" start_page="394" end_page="395" type="sub_section"> <SectionTitle> 2.2 Expected Agreement </SectionTitle> <Paragraph position="0"> What is the probability that the annotators agree by chance for our task? The first quick response to this question may be 0, if we consider all the possible durations from 1 second to 1000 years or even positive infinity.</Paragraph> <Paragraph position="1"> However, not all the durations are equally possible. 
</Section> <Section position="2" start_page="394" end_page="395" type="sub_section"> <SectionTitle> 2.2 Expected Agreement </SectionTitle> <Paragraph position="0"> What is the probability that the annotators agree by chance for our task? The first quick response to this question may be 0, if we consider all the possible durations from 1 second to 1000 years or even positive infinity.</Paragraph> <Paragraph position="1"> However, not all the durations are equally possible. As in (Krippendorff, 1980), we assume there exists one global distribution for our task (i.e., the duration ranges for all the events), and &quot;chance&quot; annotations would be consistent with this distribution. Thus, the baseline will be an annotator who knows the global distribution and annotates in accordance with it, but does not read the specific article being annotated. Therefore, we must compute the global distribution of the durations, in particular, of their means and their widths. This will be of interest not only in determining expected agreement, but also in terms of what it says about the genre of news articles and about fuzzy judgments in general.</Paragraph> <Paragraph position="3"> We first compute the distribution of the means of all the annotated durations. Its histogram is shown in Figure 2, where the horizontal axis represents the mean values in the natural logarithmic scale and the vertical axis represents the number of annotated durations with that mean.</Paragraph> <Paragraph position="4"> There are two peaks in this distribution. One is from 5 to 7 in the natural logarithmic scale, which corresponds to about 1.5 minutes to 30 minutes. The other is from 14 to 17 in the natural logarithmic scale, which corresponds to about 8 days to 6 months. One could speculate that this bimodal distribution arises because daily newspapers report short events that happened the day before and place them in the context of larger trends.</Paragraph> <Paragraph position="5"> We also compute the distribution of the widths of all the annotated durations (i.e., the difference between the upper and lower bounds, on the natural logarithmic scale), and its histogram is shown in Figure 3, where the horizontal axis represents the width in the natural logarithmic scale and the vertical axis represents the number of annotated durations with that width. Note that it peaks at about a half order of magnitude (Hobbs and Kreinovich, 2001).</Paragraph> <Paragraph position="8"> Since the global distribution is determined by the above mean and width distributions, we can then compute the expected agreement, i.e., the probability that the annotators agree by chance, where the chance is actually based on this global distribution.</Paragraph> <Paragraph position="9"> Two different methods were used to compute the expected agreement (baseline), both yielding nearly equal results. These are described in detail in (Pan et al., 2006). For both, P(E) is about 0.15.</Paragraph>
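<Paragraph position="10"> The two methods actually used to compute P(E) are described in (Pan et al., 2006). Purely as an illustration of the idea, the sketch below (not from the paper) estimates P(E) by Monte Carlo: it fits the same log-normal model as above to every annotated range, which also yields the log-scale means and widths histogrammed in Figures 2 and 3, and then averages the overlap agreement of annotation pairs drawn at random from this empirical global distribution. The function names, the use of numpy and scipy, and the sampling scheme are assumptions of the sketch, not details from the paper.</Paragraph> <Paragraph position="11"> <![CDATA[
import numpy as np
from scipy.stats import norm

def log_mean_and_width(lower_sec, upper_sec):
    """Mean and width of an annotated range on the natural-log scale."""
    lo, hi = np.log(lower_sec), np.log(upper_sec)
    return (lo + hi) / 2.0, hi - lo

def overlap(m1, s1, m2, s2, grid_size=4001):
    """Area under the pointwise minimum of two normal densities."""
    x = np.linspace(min(m1 - 6 * s1, m2 - 6 * s2),
                    max(m1 + 6 * s1, m2 + 6 * s2), grid_size)
    return np.trapz(np.minimum(norm.pdf(x, m1, s1), norm.pdf(x, m2, s2)), x)

def expected_agreement(ranges, n_pairs=2000, seed=0):
    """Monte Carlo estimate of P(E): average agreement of two annotations
    drawn at random from the empirical global distribution."""
    rng = np.random.default_rng(seed)
    means, widths = zip(*(log_mean_and_width(lo, hi) for lo, hi in ranges))
    means = np.array(means)                 # data behind Figure 2
    stds = np.array(widths) / (2 * 1.28)    # widths (Figure 3) -> std deviations
    i = rng.integers(0, len(ranges), size=n_pairs)
    j = rng.integers(0, len(ranges), size=n_pairs)
    return float(np.mean([overlap(means[a], stds[a], means[b], stds[b])
                          for a, b in zip(i, j)]))

# ranges = [(600, 1800), (600, 7200), ...]  # all annotated (lower, upper) pairs, in seconds
# print(expected_agreement(ranges))         # the paper reports P(E) of about 0.15
]]> </Paragraph>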
</Section> </Section> <Section position="5" start_page="395" end_page="396" type="metho"> <SectionTitle> 3 Features </SectionTitle> <Paragraph position="0"> In this section, we describe the lexical, syntactic, and semantic features that we considered in learning event durations.</Paragraph> <Section position="1" start_page="395" end_page="395" type="sub_section"> <SectionTitle> 3.1 Local Context </SectionTitle> <Paragraph position="0"> For a given event, the local context features include a window of n tokens to its left and n tokens to its right, as well as the event itself, for n = {0, 1, 2, 3}. The best n determined via cross validation turned out to be 0, i.e., the event itself with no local context. But we also present results for n = 2 in Section 4.3 to evaluate the utility of local context.</Paragraph> <Paragraph position="1"> A token can be a word or a punctuation mark.</Paragraph> <Paragraph position="2"> Punctuation marks are not removed, because they can be indicative features for learning event durations. For example, the quotation mark is a good indication of quoted reporting events, and such events most likely last for seconds or minutes, depending on the length of the quoted content. However, there are also cases where quotation marks are used for other purposes, such as emphasis of quoted words and titles of artistic works.</Paragraph> <Paragraph position="3"> For each token in the local context, including the event itself, three features are included: the original form of the token, its lemma (or root form), and its part-of-speech (POS) tag. The lemma of the token is extracted from parse trees generated by the CONTEX parser (Hermjakob and Mooney, 1997), which includes rich context information in its parse trees, and the Brill tagger (Brill, 1992) is used for POS tagging.</Paragraph> <Paragraph position="4"> The context window does not cross sentence boundaries. When there are not enough tokens on either side of the event within the window, &quot;NULL&quot; is used for the feature values.</Paragraph> <Paragraph position="5"> The local context features extracted for the &quot;signed&quot; event in sentence (1) with n = 2 are shown in Table 1.</Paragraph> </Section> <Section position="2" start_page="395" end_page="395" type="sub_section"> <SectionTitle> 3.2 Syntactic Relations </SectionTitle> <Paragraph position="0"> The information in the event's syntactic environment is very important in deciding the durations of events. For example, there is a difference in the durations of the &quot;watch&quot; events in the phrases &quot;watch a movie&quot; and &quot;watch a bird fly&quot;. For a given event, both the head of its subject and the head of its object are extracted from the parse trees generated by the CONTEX parser.</Paragraph> <Paragraph position="1"> Similarly to the local context features, for both the subject head and the object head, their original form, lemma, and POS tags are extracted as features. When there is no subject or object for an event, &quot;NULL&quot; is used for the feature values. For the &quot;signed&quot; event in sentence (1), the head of its subject is &quot;presidents&quot; and the head of its object is &quot;plan&quot;. The extracted syntactic relation features are shown in Table 2, and the feature vector is [presidents, president, NNS, plan, plan, NN].</Paragraph>
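<Paragraph position="2"> As a concrete illustration of Sections 3.1 and 3.2, the sketch below builds the local context and syntactic relation feature values from pre-annotated (word, lemma, POS) triples. It assumes the parser and tagger output (in the paper, from the CONTEX parser and the Brill tagger) is already available; the example sentence is a made-up stand-in for sentence (1), which is not reproduced in this section, and all function names are illustrative.</Paragraph> <Paragraph position="3"> <![CDATA[
from typing import List, Optional, Tuple

Token = Tuple[str, str, str]        # (original form, lemma, POS tag)
NULL = ("NULL", "NULL", "NULL")     # padding when the window runs out of tokens

def local_context_features(sentence: List[Token], event_idx: int, n: int) -> List[str]:
    """Original form, lemma, and POS for the event and n tokens on each side.

    The window does not cross sentence boundaries; missing positions are
    filled with NULL values.
    """
    feats: List[str] = []
    for i in range(event_idx - n, event_idx + n + 1):
        tok = sentence[i] if 0 <= i < len(sentence) else NULL
        feats.extend(tok)
    return feats

def syntactic_features(subject_head: Optional[Token],
                       object_head: Optional[Token]) -> List[str]:
    """Original form, lemma, and POS of the subject and object heads,
    as produced by a parser (the paper uses the CONTEX parser)."""
    feats: List[str] = []
    for head in (subject_head, object_head):
        feats.extend(head if head is not None else NULL)
    return feats

# Made-up stand-in for sentence (1); only the relevant structure is sketched.
sent = [("The", "the", "DT"), ("two", "two", "CD"), ("presidents", "president", "NNS"),
        ("signed", "sign", "VBD"), ("the", "the", "DT"), ("plan", "plan", "NN")]
print(local_context_features(sent, 3, 2))
print(syntactic_features(("presidents", "president", "NNS"), ("plan", "plan", "NN")))
# second print -> [presidents, president, NNS, plan, plan, NN], as in Table 2
]]> </Paragraph>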
</Section> <Section position="3" start_page="395" end_page="396" type="sub_section"> <SectionTitle> 3.3 WordNet Hypernyms </SectionTitle> <Paragraph position="0"> Events with the same hypernyms may have similar durations. For example, events &quot;ask&quot; and &quot;talk&quot; both have a direct WordNet (Miller, 1990) hypernym of &quot;communicate&quot;, and most of the time they do have very similar durations in the corpus.</Paragraph> <Paragraph position="1"> However, closely related events don't always have the same direct hypernyms. For example, &quot;see&quot; has a direct hypernym of &quot;perceive&quot;, whereas &quot;observe&quot; needs two steps up through the hypernym hierarchy before reaching &quot;perceive&quot;. Such correlation between events may be lost if only the direct hypernyms of the words are extracted.</Paragraph> <Paragraph position="3"> It is useful to extract the hypernyms not only for the event itself, but also for the subject and object of the event. For example, events related to a group of people or an organization usually last longer than those involving individuals, and the hypernyms can help distinguish such concepts. For instance, &quot;society&quot; has a &quot;group&quot; hypernym (2 steps up in the hierarchy), and &quot;school&quot; has an &quot;organization&quot; hypernym (3 steps up). The direct hypernyms of nouns are not always general enough for this purpose, but a hypernym at too high a level can be too general to be useful. For our learning experiments, we extract the first 3 levels of hypernyms from WordNet.</Paragraph> <Paragraph position="4"> Hypernyms are only extracted for the events and their subjects and objects, not for the local context words. For each level of hypernyms in the hierarchy, it is possible to have more than one hypernym; for example, &quot;see&quot; has two direct hypernyms, &quot;perceive&quot; and &quot;comprehend&quot;. For a given word, it may also have more than one sense in WordNet. In such cases, as in (Gildea and Jurafsky, 2002), we only take the first sense of the word and the first hypernym listed for each level of the hierarchy. A word sense disambiguation module might improve the learning performance.</Paragraph> <Paragraph position="5"> But since the features we need are the hypernyms, not the word sense itself, even if the first word sense is not the correct one, its hypernyms can still be good enough in many cases. For example, in one news article, the word &quot;controller&quot; refers to an air traffic controller, which corresponds to the second sense in WordNet, but its first sense (business controller) has the same hypernym of &quot;person&quot; (3 levels up) as the second sense (direct hypernym). Since we take the first 3 levels of hypernyms, the correct hypernym is still extracted.</Paragraph> <Paragraph position="6"> When there are fewer than 3 levels of hypernyms for a given word, its hypernym on the previous level is used. When there is no hypernym for a given word (e.g., &quot;go&quot;), the word itself is used as its hypernym. Since WordNet only provides hypernyms for nouns and verbs, &quot;NULL&quot; is used for the feature values for a word that is not a noun or a verb.</Paragraph> <Paragraph position="7"> For the &quot;signed&quot; event in sentence (1), the extracted WordNet hypernym features for the event (&quot;signed&quot;), its subject (&quot;presidents&quot;), and its object (&quot;plan&quot;) are shown in Table 3, and the feature vector is [write, communicate, interact, corporate_executive, executive, administrator, idea, content, cognition].</Paragraph>
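<Paragraph position="8"> A sketch of the hypernym feature extraction described above, using the NLTK interface to WordNet; the paper accesses WordNet (Miller, 1990) directly, so the library choice and function names are assumptions of the sketch. Depending on the WordNet version and its first-sense listing, the output may not match Table 3 exactly.</Paragraph> <Paragraph position="9"> <![CDATA[
from typing import List, Optional
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def wordnet_pos(penn_tag: str) -> Optional[str]:
    """WordNet only provides hypernyms for nouns and verbs."""
    if penn_tag.startswith("NN"):
        return wn.NOUN
    if penn_tag.startswith("VB"):
        return wn.VERB
    return None

def hypernym_features(lemma: str, penn_tag: str, levels: int = 3) -> List[str]:
    """First `levels` hypernym levels of the word's first WordNet sense.

    Following the text: take the first sense and the first hypernym listed at
    each level; if a word has fewer than `levels` hypernym levels, repeat the
    previous level; if it has no hypernym at all, use the word itself; if it
    is not a noun or verb, use NULL values.
    """
    pos = wordnet_pos(penn_tag)
    if pos is None:
        return ["NULL"] * levels

    senses = wn.synsets(lemma, pos=pos)
    current = senses[0] if senses else None
    last = lemma                      # fallback: the word itself
    feats: List[str] = []
    for _ in range(levels):
        hypers = current.hypernyms() if current is not None else []
        if hypers:
            current = hypers[0]       # first hypernym listed at this level
            last = current.lemma_names()[0]
        else:
            current = None            # repeat the previous level (or the word)
        feats.append(last)
    return feats

# Event, subject, and object of sentence (1): 3 levels x 3 words = 9 features.
print(hypernym_features("sign", "VBD")
      + hypernym_features("president", "NNS")
      + hypernym_features("plan", "NN"))
]]> </Paragraph>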
</Section> </Section> <Section position="6" start_page="396" end_page="398" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The distribution of the means of the annotated durations in Figure 2 is bimodal, dividing the events into those that take less than a day and those that take more than a day. Thus, in our first machine learning experiment, we tried to learn this coarse-grained event duration information as a binary classification task.</Paragraph> <Section position="1" start_page="396" end_page="397" type="sub_section"> <SectionTitle> 4.1 Inter-Annotator Agreement, Baseline, and Upper Bound </SectionTitle> <Paragraph position="0"> Before evaluating the performance of different learning algorithms, the inter-annotator agreement, the baseline, and the upper bound for the learning task are assessed first.</Paragraph> <Paragraph position="1"> Table 4 shows the inter-annotator agreement results among 3 annotators for binary event durations. The experiments were conducted on the same data sets as in (Pan et al., 2006). Two kappa values are reported with different ways of measuring expected agreement (P(E)), i.e., whether or not the annotators have prior knowledge of the global distribution of the task.</Paragraph> <Paragraph position="2"> The human agreement before reading the guidelines (0.877) is a good estimate of the upper bound performance for this binary classification task. The baseline for the learning task is always taking the most probable class. Since 59.0% of the total data is &quot;long&quot; events, the baseline performance is 59.0%.</Paragraph> </Section> <Section position="2" start_page="397" end_page="397" type="sub_section"> <SectionTitle> 4.2 Data </SectionTitle> <Paragraph position="0"> The original annotated data can be straightforwardly transformed for this binary classification task. For each event annotation, the most likely (mean) duration is calculated first by averaging (the logs of) its lower and upper bound durations.</Paragraph> <Paragraph position="1"> If its most likely (mean) duration is less than a day (about 11.4 in the natural logarithmic scale), it is assigned to the &quot;short&quot; event class, otherwise it is assigned to the &quot;long&quot; event class. (Note that these labels are strictly a convenience and not an analysis of the meanings of &quot;short&quot; and &quot;long&quot;.) We divide the total annotated non-WSJ data (2132 event instances) into two data sets: a training data set with 1705 event instances (about 80% of the total non-WSJ data) and a held-out test data set with 427 event instances (about 20% of the total non-WSJ data). The WSJ data (156 event instances) is kept for further test purposes (see Section 4.4).</Paragraph> </Section> <Section position="3" start_page="397" end_page="397" type="sub_section"> <SectionTitle> 4.3 Experimental Results (non-WSJ) </SectionTitle> <Paragraph position="0"> Learning Algorithms. Three supervised learning algorithms were evaluated for our binary classification task, namely, Support Vector Machines (SVM) (Vapnik, 1995), Naive Bayes (NB) (Duda and Hart, 1973), and Decision Trees C4.5 (Quinlan, 1993). The Weka (Witten and Frank, 2005) machine learning package was used for the implementation of these learning algorithms. A linear kernel is used for SVM in our experiments. Each event instance has a total of 18 feature values, as described in Section 3, for the event-only condition, and 30 feature values for the local context condition when n = 2. For SVM and C4.5, all features are converted into binary features (6665 and 12502 features).</Paragraph>
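<Paragraph position="10"> For illustration, the sketch below reproduces the experimental pipeline with scikit-learn rather than Weka, which is what the paper actually used: events are labeled short or long by comparing the mean of the log bounds to the log of one day (about 11.4), the symbolic features of Section 3 are converted into binary (one-hot) features, and a linear SVM is trained and evaluated. Feature dictionaries and the train/test split are assumed to be prepared as described in Section 4.2; all names are illustrative.</Paragraph> <Paragraph position="11"> <![CDATA[
import math
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

LOG_DAY = math.log(24 * 60 * 60)   # ~11.37, the "about 11.4" threshold in the text

def binary_label(lower_sec, upper_sec):
    """Label an annotation 'short' or 'long' from the mean of its log bounds."""
    mean_log = (math.log(lower_sec) + math.log(upper_sec)) / 2.0
    return "short" if mean_log < LOG_DAY else "long"

def train_and_eval(train_feats, train_ranges, test_feats, test_ranges):
    """train_feats/test_feats: lists of dicts of symbolic feature values
    (event/subject/object forms, lemmas, POS tags, hypernyms), e.g. built
    with the sketches in Section 3; *_ranges: (lower, upper) in seconds."""
    vec = DictVectorizer()                     # one-hot (binary) feature encoding
    X_train = vec.fit_transform(train_feats)
    X_test = vec.transform(test_feats)
    y_train = [binary_label(lo, hi) for lo, hi in train_ranges]
    y_test = [binary_label(lo, hi) for lo, hi in test_ranges]
    clf = LinearSVC()                          # linear-kernel SVM
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))

print(binary_label(600, 7200))                 # -> "short" (about 35 minutes)
print(binary_label(86400 * 8, 86400 * 180))    # -> "long"  (weeks to months)
]]> </Paragraph>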
<Paragraph position="1"> Results. 10-fold cross validation was used to train the learning models, which were then tested on the unseen held-out test set, and the performance (including the precision, recall, and F-score for each class) of the three learning algorithms is shown in Table 5. The significant measure is the overall precision, and this is shown for the three algorithms in Table 6, together with human agreement (the upper bound of the learning task) and the baseline.</Paragraph> <Paragraph position="3"> We can see that among all three learning algorithms, SVM achieves the best F-score for each class and also the best overall precision (76.6%). Compared with the baseline (59.0%) and human agreement (87.7%), this level of performance is very encouraging, especially as the learning is from such limited training data.</Paragraph> <Paragraph position="4"> Feature Evaluation. The best performing learning algorithm, SVM, was then used to examine the utility of combinations of four different feature sets (i.e., event, local context, syntactic, and WordNet hypernym features). The detailed comparison is shown in Table 7.</Paragraph> <Paragraph position="5"> We can see that most of the performance comes from the event word or phrase itself. A significant improvement above that is due to the addition of information about the subject and object. Local context does not help and in fact may hurt, and hypernym information also does not seem to help. (In the &quot;Syn+Hyper&quot; cases, the learning algorithm with and without local context gives identical results, probably because the other features dominate.) It is of interest that the most important information is that from the predicate and arguments describing the event, as our linguistic intuitions would lead us to expect.</Paragraph> </Section> <Section position="4" start_page="397" end_page="398" type="sub_section"> <SectionTitle> 4.4 Test on WSJ Data </SectionTitle> <Paragraph position="0"> Section 4.3 shows the experimental results with the learned model trained and tested on data from the same genre, i.e., non-WSJ articles.</Paragraph> <Paragraph position="1"> In order to evaluate whether the learned model can perform well on data from different news genres, we tested it on the unseen WSJ data (156 event instances). The performance (including the precision, recall, and F-score for each class) is shown in Table 8. The precision (75.0%) is very close to the test performance on the non-WSJ data, and indicates the significant generalization capacity of the learned model.</Paragraph> </Section> </Section> <Section position="7" start_page="398" end_page="399" type="metho"> <SectionTitle> 5 Learning the Most Likely Temporal Unit </SectionTitle> <Paragraph position="0"> These encouraging results have prompted us to try to learn more fine-grained event duration information, viz., the most likely temporal units of event durations (cf. (Rieger 1974)'s ORDERHOURS, ORDERDAYS).</Paragraph> <Paragraph position="1"> For each original event annotation, we can obtain the most likely (mean) duration by averaging its lower and upper bound durations, and assign it to one of seven classes (i.e., second, minute, hour, day, week, month, and year) based on the temporal unit of its most likely duration.</Paragraph> <Paragraph position="2"> However, human agreement on this more fine-grained task is low (44.4%). Based on this observation, instead of evaluating the exact agreement between annotators, an &quot;approximate agreement&quot; is computed for the most likely temporal unit of events. In &quot;approximate agreement&quot;, temporal units are considered to match if they are the same temporal unit or an adjacent one. For example, &quot;second&quot; and &quot;minute&quot; match, but &quot;minute&quot; and &quot;day&quot; do not.</Paragraph>
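<Paragraph position="10"> A sketch of the fine-grained labels and of approximate agreement as defined above. The unit boundaries (a 30-day month, a 365-day year, and so on) are assumptions of the sketch; the paper does not spell out the exact cutoffs. The mean duration is taken on the log scale, as in Section 4.2.</Paragraph> <Paragraph position="11"> <![CDATA[
import math

# Ordered temporal units with assumed upper boundaries in seconds; the exact
# cutoffs (30-day month, 365-day year) are assumptions, not from the paper.
UNITS = [("second", 60), ("minute", 3600), ("hour", 86400), ("day", 604800),
         ("week", 2592000), ("month", 31536000), ("year", float("inf"))]
ORDER = [unit for unit, _ in UNITS]

def most_likely_unit(lower_sec, upper_sec):
    """Temporal unit of the most likely (mean) duration of an annotation."""
    mean_sec = math.exp((math.log(lower_sec) + math.log(upper_sec)) / 2.0)
    for unit, upper_limit in UNITS:
        if mean_sec < upper_limit:
            return unit
    return "year"

def approximately_agree(unit_a, unit_b):
    """Approximate agreement: same temporal unit or an adjacent one."""
    return abs(ORDER.index(unit_a) - ORDER.index(unit_b)) <= 1

print(most_likely_unit(600, 7200))              # -> "minute" (about 35 minutes)
print(approximately_agree("second", "minute"))  # True
print(approximately_agree("minute", "day"))     # False
]]> </Paragraph>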
<Paragraph position="3"> Some preliminary experiments have been conducted for learning this multi-classification task. The same data sets as in the binary classification task were used. The only difference is that the class for each instance is now labeled with one of the seven temporal unit classes.</Paragraph> <Paragraph position="4"> The baseline for this multi-classification task is always taking the temporal unit that, together with its two neighbors, spans the greatest amount of data.</Paragraph> <Paragraph position="5"> Since the &quot;week&quot;, &quot;month&quot;, and &quot;year&quot; classes together take up the largest portion (51.5%) of the data, the baseline is always taking the &quot;month&quot; class, where both &quot;week&quot; and &quot;year&quot; are also considered a match. Table 9 shows the inter-annotator agreement results for the most likely temporal unit when using &quot;approximate agreement&quot;. Human agreement (the upper bound) for this learning task increases from 44.4% to 79.8%.</Paragraph> <Paragraph position="6"> 10-fold cross validation was also used to train the learning models, which were then tested on the unseen held-out test set. The performance of the three algorithms is shown in Table 10. The best performing learning algorithm is again SVM with 67.9% test precision. Compared with the baseline (51.5%) and human agreement (79.8%), this again is a very promising result, especially for a multi-classification task with such limited training data. It is reasonable to expect that when more annotated data becomes available, the learning algorithm will achieve higher performance when learning this and more fine-grained event duration information.</Paragraph> <Paragraph position="7"> Although the coarse-grained duration information may look too coarse to be useful, computers have no idea at all whether a meeting event takes seconds or centuries, so even coarse-grained estimates would give them a useful rough sense of how long each event may take. More fine-grained duration information is definitely more desirable for temporal reasoning tasks. But coarse-grained durations at the level of temporal units can already be very useful.</Paragraph> </Section> </Paper>