<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0906">
  <Title>Extending TimeML with Typical Durations of Events</Title>
  <Section position="4" start_page="38" end_page="42" type="intro">
    <SectionTitle>
2 Annotating and Learning Typical Durations of Events
</SectionTitle>
    <Paragraph position="0"> In the corpus of typical durations of events, every event to be annotated was already identified in the TimeBank corpus. Annotators are asked to provide lower and upper bounds on the duration of the event, and a judgment of their level of confidence in those estimates on a scale from one to ten. An interface was built to facilitate the annotation. Graphical output is displayed so that we can quickly visualize the level of agreement among the different annotators for each event. For example, here is the output of the annotations (3 annotators) for the "finished" event (in bold) in the sentence: After the victim, Linda Sanders, 35, had finished her cleaning and was waiting for her clothes to dry,...</Paragraph>
    <Paragraph position="1"> This graph shows that the first annotator believes that the event lasts for minutes whereas the second annotator believes it could only last for several seconds. The third annotates the event to range from a few seconds to a few minutes. A logarithmic scale is used for the output.</Paragraph>
    <Section position="1" start_page="38" end_page="38" type="sub_section">
      <SectionTitle>
2.1 Annotation Instructions
</SectionTitle>
      <Paragraph position="0"> Annotators are asked to identify upper and lower bounds that would include 80% of the possible cases, excluding anomalous cases.</Paragraph>
      <Paragraph position="1"> The judgments are to be made in context.</Paragraph>
      <Paragraph position="2"> First of all, information in the syntactic environment needs to be considered before annotating, and the events need to be annotated in light of the information provided by the entire article.</Paragraph>
      <Paragraph position="3"> Annotation is made easier and more consistent if coreferential and near-coreferential descriptions of events are identified initially.</Paragraph>
      <Paragraph position="4"> When the articles were completely annotated by the three annotators, the results were analyzed and the differences were reconciled. Differences in annotation could be due to the differences in interpretations of the event; however, we found that the vast majority of radically different judgments can be categorized into a relatively small number of classes. Some of these correspond to aspectual features of events, which have been intensively investigated (e.g., Vendler, 1967; Dowty, 1979; Moens and Steedman, 1988; Passonneau, 1988). We then developed guidelines to cover those cases (see the next section).</Paragraph>
    </Section>
    <Section position="2" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
2.2 Event Classes
</SectionTitle>
      <Paragraph position="0"> Action vs. State: Actions involve change, such as those described by words like &amp;quot;speaking&amp;quot;, &amp;quot;gave&amp;quot;, and &amp;quot;skyrocketed&amp;quot;. States involve things staying the same, such as being dead, being dry, and being at peace. When we have an event in the passive tense, sometimes there is an ambiguity about whether the event is a state or an action. For example, Three people were injured in the attack.</Paragraph>
      <Paragraph position="1"> Is the &amp;quot;injured&amp;quot; event an action or a state? This matters because they will have different durations. The state begins with the action and lasts until the victim is healed. Besides the general diagnostic tests to distinguish them (Vendler, 1967; Dowty, 1979), another test can be applied to this specific case: Imagine someone says the sentence after the action had ended but the state was still persisting. Would they use the past or present tense? In the &amp;quot;injured&amp;quot; example, it is clear we would say &amp;quot;Three people were injured in the attack&amp;quot;, whereas we would say &amp;quot;Three people are injured from the attack.&amp;quot; Our annotation interface handles events of this type by allowing the annotator to specify which interpretation he is giving. If the annotator feels it's too ambiguous to distinguish, annotations can be given for both interpretations.</Paragraph>
      <Paragraph position="2"> Aspectual Events: Some events are aspects of larger events, such as their start or finish. Although they may seem instantaneous, we believe they should be considered to happen across some interval, i.e., the first or last sub-event of the larger event. For example, After the victim, Linda Sanders, 35, had finished her cleaning and was waiting for her clothes to dry,...</Paragraph>
      <Paragraph position="3"> The &amp;quot;finished&amp;quot; event should be considered as the last sub-event of the larger event (the &amp;quot;cleaning&amp;quot; event), since it actually involves opening the door of the washer, taking out the clothes, closing the door, and so on. All this takes time. This  interpretation will also give us more information on typical durations than simply assuming such events are instantaneous.</Paragraph>
      <Paragraph position="4"> Reporting Events: These are everywhere in the news. They can be direct quotes, taking exactly as long as the sentence takes to read, or they can be summarizations of long press conferences. We need to distinguish different cases: Quoted Report: This is when the reported content is quoted. The duration of the event should be the actual duration of the utterance of the quoted content. The time duration can be easily verified by saying the sentence out loud and timing it. For example, &amp;quot;It looks as though they panicked,&amp;quot; a detective said of the robbers.</Paragraph>
      <Paragraph position="5"> This probably took between 1 and 3 seconds; it's very unlikely it took more than 10 seconds.</Paragraph>
      <Paragraph position="6"> Unquoted Report: This is when the reporting description occurs without quotes that could be as short as just the duration of the actual utterance of the reported content (lower bound), and as long as the duration of a briefing or press conference (upper bound).</Paragraph>
      <Paragraph position="7"> If the sentence is very short, then it's likely that it is one complete sentence from the speaker's remarks, and a short duration should be given; if it is a long, complex sentence, then it's more likely to be a summary of a long discussion or press conference, and a longer duration should be given. For example, The police said it did not appear that anyone else was injured.</Paragraph>
      <Paragraph position="8"> A Brooklyn woman who was watching her clothes dry in a laundromat was killed Thursday evening when two would-be robbers emptied their pistols into the store, the police said.</Paragraph>
      <Paragraph position="9"> If the first sentence were quoted text, it would be very much the same. Hence the duration of the &amp;quot;said&amp;quot; event should be short. In the second sentence everything that the spokesperson (here the police) has said is compiled into a single sentence by the reporter, and it is unlikely that the spokesperson said only a single sentence with all this information. Thus, it is reasonable to give longer duration to this &amp;quot;said&amp;quot; event.</Paragraph>
      <Paragraph position="10"> Multiple Events: Many occurrences of verbs and other event descriptors refer to multiple events, especially, but not exclusively, if the sub-ject or object of the verb is plural. For example, Iraq has destroyed its long-range missiles.</Paragraph>
      <Paragraph position="11"> Both single (i.e., destroyed one missile) and aggregate (i.e., destroyed all missiles) events happened. This was a significant source in disagreements in our first round of annotation.</Paragraph>
      <Paragraph position="12"> Since both judgments provide useful information, our current annotation interface allows the annotator to specify the event as multiple, and give durations for both the single and aggregate events.</Paragraph>
      <Paragraph position="13"> Events Involving Negation: Negated events didn't happen, so it may seem strange to specify their duration. But whenever negation is used, there is a certain class of events whose occurrence is being denied. Annotators should consider this class, and make a judgment about the likely duration of the events in it. In addition, there is the interval during which the nonoccurrence of the events holds. For example, He was willing to withdraw troops in exchange for guarantees that Israel would not be attacked.</Paragraph>
      <Paragraph position="14"> There is the typical amount of time of &amp;quot;being attacked&amp;quot;, i.e., the duration of a single attack, and a longer period of time of &amp;quot;not being attacked&amp;quot;. Similarly to multiple events, annotators are asked to give durations for both the event negated and the negation of that event.</Paragraph>
      <Paragraph position="15"> Positive Infinite Durations: These are states which continue essentially forever once they begin. For example, He is dead.</Paragraph>
      <Paragraph position="16"> Here the time continues for an infinite amount of time, and we allow this as an annotation.</Paragraph>
    </Section>
    <Section position="3" start_page="39" end_page="41" type="sub_section">
      <SectionTitle>
2.3 Inter-Annotator Agreement
</SectionTitle>
      <Paragraph position="0"> Although the graphical output of the annotations enables us to visualize quickly the level of agreement among different annotators for each event, a quantitative measurement of the agreement is needed. The kappa statistic (Krippendorff, 1980; Carletta, 1996) has become the de facto standard to assess inter-annotator agreement. It is computed as:  30 minutes] and [10 minutes, 2 hours].</Paragraph>
      <Paragraph position="1"> which is the probability that the annotators agree by chance.</Paragraph>
      <Paragraph position="2">  Determining what should count as agreement is not only important for assessing inter-annotator agreement, but is also crucial for later evaluation of machine learning experiments.</Paragraph>
      <Paragraph position="3"> We first need to decide what scale is most appropriate. One possibility is just to convert all the temporal units to seconds. However, this would not correctly capture our intuitions about the relative relations between duration ranges. For example, the difference between 1 second and 20 seconds is significant; while the difference between 1 year 1 second and 1 year 20 seconds is negligible. In order to handle this problem, we use a logarithmic scale for our data. After first converting from temporal units to seconds, we then take the natural logarithms of these values. This logarithmic scale also conforms to the half orders of magnitude (HOM) (Hobbs and Kreinovich, 2001) which was shown to have utility in several very different linguistic contexts.</Paragraph>
      <Paragraph position="4"> In the literature on the kappa statistic, most authors address only category data; some can handle more general data, such as data in interval scales or ratio scales (Krippendorff, 1980; Carletta, 1996). However, none of the techniques directly apply to our data, which are ranges of durations from a lower bound to an upper bound.</Paragraph>
      <Paragraph position="5"> In fact, what coders were instructed to annotate for a given event is not just a range, but a duration distribution for the event, where the area between the lower bound and the upper bound covers about 80% of the entire distribution area. Since it's natural to assume the most likely duration for such distribution is its mean (average) duration, and the distribution flattens out toward the upper and lower bounds, we use the  normal or Gaussian distribution to model our duration distributions.</Paragraph>
      <Paragraph position="6"> In order to determine a normal distribution, we need to know two parameters: the mean and the standard deviation. For our duration distributions with given lower and upper bounds, the mean is the average of the bounds. Under the assumption that the area between lower and upper bounds covers 80% of the entire distribution area, the lower and upper bounds are each 1.28 standard deviations from the mean.</Paragraph>
      <Paragraph position="7"> With this data model, the agreement between two annotations can be defined as the overlapping area between two normal distributions. The agreement among many annotations is the average overlap of all the pairwise overlapping areas. For example, the overlap of judgments of [10 minutes, 30 minutes] and [10 minutes, 2 hours] are as in Figure 1. The overlap or agreement is 0.508706.</Paragraph>
      <Paragraph position="8">  As in (Krippendorff, 1980), we assume there exists one global distribution for our task (i.e., the duration ranges for all the events), and &amp;quot;chance&amp;quot; annotations would be consistent with this distribution. Thus, the baseline will be an annotator who knows the global distribution and annotates in accordance with it, but does not read the specific article being annotated. Therefore, we must compute the global distribution of the durations, in particular, of their means and their widths.</Paragraph>
      <Paragraph position="9"> This will be of interest not only in determining expected agreement, but also in terms of what it says about the genre of news articles and about fuzzy judgments in general.</Paragraph>
      <Paragraph position="10"> We first compute the distribution of the means of all the annotated durations. Its histogram is shown in Figure 2, where the horizontal axis  represents the mean values in the natural logarithmic scale and the vertical axis represents the number of annotated durations with that mean.</Paragraph>
      <Paragraph position="11"> We also compute the distribution of the widths (i.e., upper bound - lower bound) of all the annotated durations, and its histogram is shown in Figure 3, where the horizontal axis represents the width in the natural logarithmic scale and the vertical axis represents the number of annotated durations with that width.</Paragraph>
      <Paragraph position="12"> Two different methods were used to compute the expected agreement (baseline), both yielding nearly equal results. These are described in detail in (Pan et al., 2006a). For both, P(E) is about 0.15.</Paragraph>
      <Paragraph position="13"> Experimental results show that the use of the annotation guidelines resulted in about 10% improvement in inter-annotator agreement, measured as described in this section, see (Pan et al., 2006a) for details.</Paragraph>
    </Section>
    <Section position="4" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
2.4 Machine Learning Experiments
</SectionTitle>
      <Paragraph position="0"> Local Context. For a given event, the local context features include a window of n tokens to its left and n tokens to its right, as well as the event itself. The best n was determined via cross validation. A token can be a word or a punctuation mark. For each token in the local context, including the event itself, three features are included: the original form of the token, its lemma (or root form), and its part-of-speech (POS) tag.</Paragraph>
      <Paragraph position="1"> Syntactic Relations. The information in the event's syntactic environment is very important in deciding the durations of events. For a given event, both the head of its subject and the head of its object are extracted from the parse trees generated by the CONTEX parser (Hermjakob and Mooney, 1997). Similarly to the local context features, for both the subject head and the object head, their original form, lemma, and POS tags are extracted as features.</Paragraph>
      <Paragraph position="2"> WordNet Hypernyms. Events with the same hypernyms may have similar durations. But closely related events don't always have the same direct hypernyms. We extract the hypernyms not only for the event itself, but also for the subject and object of the event, since events related to a group of people or an organization usually last longer than those involving individuals, and the hypernyms can help distinguish such concepts. For our learning experiments, we extract the first 3 levels of hypernyms from Word-Net (Miller, 1990).</Paragraph>
    </Section>
    <Section position="5" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
2.5 Learning Coarse-Grained Event Durations
</SectionTitle>
      <Paragraph position="0"> The distribution of the means of the annotated durations in Figure 2 is bimodal, dividing the events into those that take less than a day and those that take more than a day. Thus, in our first machine learning experiment, we have tried to learn this coarse-grained event duration information as a binary classification task.</Paragraph>
      <Paragraph position="1"> Data. The original annotated data can be straightforwardly transformed for this binary classification task. For each event annotation, the most likely (mean) duration is calculated first by averaging (the logs of) its lower and upper bound durations. If its most likely (mean) duration is less than a day (about 11.4 in the natural logarithmic scale), it is assigned to the &amp;quot;short&amp;quot; event class, otherwise it is assigned to the &amp;quot;long&amp;quot; event class. (Note that these labels are strictly a convenience and not an analysis of the meanings of &amp;quot;short&amp;quot; and &amp;quot;long&amp;quot;.) We divide the total annotated non-WSJ data (2132 event instances) into two data sets: a training data set with 1705 event instances (about 80% of the total non-WSJ data) and a held-out test data set with 427 event instances (about 20% of the total non-WSJ data). The WSJ data (156 event instances) is kept for further test purposes.</Paragraph>
      <Paragraph position="2"> Results. The learning results in Figure 4 show that among all three learning algorithms explored (Naive Bayes (NB), Decision Trees C4.5, and Support Vector Machines (SVM)), SVM with linear kernel achieves the best overall precision (76.6%). Compared with the baseline (59.0%) and human agreement (87.7%), this level of performance is very encouraging, especially as the learning is from such limited training data.</Paragraph>
      <Paragraph position="3">  Data.</Paragraph>
      <Paragraph position="4"> Feature evaluation in (Pan et al., 2006b) shows that most of the performance comes from event word or phrase itself. A significant improvement above that is due to the addition of information about the subject and object. Local context does not help and in fact may hurt, and hypernym information also does not seem to help. It is gratifying to see that the most important information is that from the predicate and arguments describing the event, as our linguistic intuitions would lead us to expect.</Paragraph>
      <Paragraph position="5"> In order to evaluate whether the learned model can perform well on data from different news genres, we tested it on the unseen WSJ data (156 event instances). A precision of 75.0%, which is very close to the test performance on the non-WSJ data, proves the great generalization capacity of the learned model.</Paragraph>
      <Paragraph position="6"> Some preliminary experimental results of learning the more fine-grained event duration information, i.e., the most likely temporal unit (cf. (Rieger 1974)'s ORDERHOURS, ORDERDAYS), are shown in (Pan et al., 2006b). SVM again achieves the best performance with 67.9% test precision (baseline 51.5% and human agreement 79.8%) in &amp;quot;approximate agreement&amp;quot; where temporal units are considered to match if they are the same temporal unit or an adjacent one.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML