<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1095">
  <Title>Machine Learning of Temporal Relations</Title>
  <Section position="4" start_page="37" end_page="753" type="metho">
    <SectionTitle>
AQUAINT Program, Phase II.
</SectionTitle>
    <Paragraph position="0"> able effort to developing insightful baselines.</Paragraph>
    <Paragraph position="1"> Our work is, accordingly, evaluated in comparison against four baselines: (i) the usual majority class statistical baseline, shown along with each result, (ii) a more sophisticated baseline that uses hand-coded rules (Section 4.1), (iii) a hybrid baseline based on hand-coded rules expanded with Google-induced rules (Section 4.2), and (iv) a machine learning version that learns from imperfect annotation produced by (ii) (Section 4.3).</Paragraph>
  </Section>
  <Section position="5" start_page="753" end_page="754" type="metho">
    <SectionTitle>
2 Annotation Scheme and Corpora
2.1 TimeML
</SectionTitle>
    <Paragraph position="0"> TimeML (Pustejovsky et al. 2005) (www.timeml.org) is an annotation scheme for markup of events, times, and their temporal relations in news articles. The TimeML scheme flags tensed verbs, adjectives, and nominals with EVENT tags with various attributes, including the class of event, tense, grammatical aspect, polarity (negative or positive), any modal operators which govern the event being tagged, and cardinality of the event if it's mentioned more than once. Likewise, time expressions are flagged and their values normalized, based on TIMEX3, an extension of the ACE (2004) (tern.mitre.org) TIMEX2 annotation scheme.</Paragraph>
    <Paragraph position="1"> For temporal relations, TimeML defines a TLINK tag that links tagged events to other events and/or times. For example, given (3a), a TLINK tag orders an instance of the event of entering to an instance of the drinking with the relation type AFTER. Likewise, given the sentence (3c), a TLINK tag will anchor the event instance of announcing to the time expression Tuesday (whose normalized value will be inferred from context), with the relation IS_INCLUDED. These inferences are shown (in slightly abbreviated form) in the annotations in  (4) and (5).</Paragraph>
    <Paragraph position="2"> (4) Max &lt;EVENT eventID=&amp;quot;e1&amp;quot; class=&amp;quot;occurrence&amp;quot; tense=&amp;quot;past&amp;quot; aspect=&amp;quot;none&amp;quot;&gt;entered&lt;/EVENT&gt; the room. He &lt;EVENT eventID=&amp;quot;e2&amp;quot; class=&amp;quot;occurrence&amp;quot; tense=&amp;quot;past&amp;quot; aspect=&amp;quot;perfect&amp;quot;&gt;had drunk&lt;/EVENT&gt;a lot of wine.</Paragraph>
    <Paragraph position="3"> &lt;TLINK eventID=&amp;quot;e1&amp;quot; relatedToEventID=&amp;quot;e2&amp;quot; relType=&amp;quot;AFTER&amp;quot;/&gt; (5) The company &lt;EVENT even-</Paragraph>
    <Paragraph position="5"> The anchor relation is an Event-Time TLINK, and the order relation is an Event-Event TLINK.</Paragraph>
    <Paragraph position="6"> TimeML uses 14 temporal relations in the TLINK RelTypes, which reduce to a disjunctive classification of 6 temporal relations RelTypes = {SIMULTANEOUS, IBEFORE, BEFORE, BE-GINS, ENDS, INCLUDES}. An event or time is SIMULTANEOUS with another event or time if they occupy the same time interval. An event or time INCLUDES another event or time if the latter occupies a proper subinterval of the former. These 6 relations and their inverses map one-to-one to 12 of Allen's 13 basic relations (Allen 1984)  . There has been a considerable amount of activity related to this scheme; we focus here on some of the challenges posed by the TLINK annotation, the part that is directly relevant to the temporal ordering and anchoring problems.</Paragraph>
    <Section position="1" start_page="753" end_page="754" type="sub_section">
      <SectionTitle>
2.2 Challenges
</SectionTitle>
      <Paragraph position="0"> The annotation of TimeML information is on a par with other challenging semantic annotation schemes, like PropBank, RST annotation, etc., where high inter-annotator reliability is crucial but not always achievable without massive pre-processing to reduce the user's workload. In TimeML, inter-annotator agreement for time expressions and events is 0.83 and 0.78 (average of Precision and Recall) respectively, but on TLINKs it is 0.55 (P&amp;R average), due to the large number of event pairs that can be selected for comparison. The time complexity of the human TLINK annotation task is quadratic in the number of events and times in the document.</Paragraph>
      <Paragraph position="1"> Two corpora have been released based on TimeML: the TimeBank (Pustejovsky et al. 2003) (we use version 1.2.a) with 186 documents and  Of the 14 TLINK relations, the 6 inverse relations are redundant. In order to have a disjunctive classification, SIMULTANEOUS and IDENTITY are collapsed, since IDENTITY is a subtype of SIMULTANEOUS. (Specifically, X and Y are identical if they are simultaneous and coreferential.) DURING and IS_INCLUDED are collapsed since DURING is a subtype of IS_INCLUDED that anchors events to times that are durations. IBEFORE (immediately before) corresponds to Allen's MEETS. Allen's OVER-LAPS relation is not represented in TimeML. More details can be found at timeml.org.</Paragraph>
      <Paragraph position="2">  64,077 words of text, and the Opinion Corpus (www.timeml.org), with 73 documents and 38,709 words. The TimeBank was developed in the early stages of TimeML development, and was partitioned across five annotators with different levels of expertise. The Opinion Corpus was developed very recently, and was partitioned across just two highly trained annotators, and could therefore be expected to be less noisy. In our experiments, we merged the two datasets to produce a single corpus, called OTC.</Paragraph>
      <Paragraph position="3"> Table 1 shows the distribution of EVENTs and TIMES, and TLINK RelTypes  in the OTC. The majority class percentages are shown in parentheses. It can be seen that BEFORE and SIMULTANEOUS together form a majority of event-ordering (Event-Event) links, whereas most of the event anchoring (Event-Time) links are INCLUDES.</Paragraph>
    </Section>
    <Section position="2" start_page="754" end_page="754" type="sub_section">
      <SectionTitle>
Corpus
</SectionTitle>
      <Paragraph position="0"> The lack of TLINK coverage in human annotation could be helped by preprocessing, provided it meets some threshold of accuracy. Given the availability of a corpus like OTC, it is natural to try a machine learning approach to see if it can be used to provide that preprocessing. However, the noise in the corpus and the sparseness of links present challenges to a learning approach.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="754" end_page="756" type="metho">
    <SectionTitle>
3 Machine Learning Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="754" end_page="755" type="sub_section">
      <SectionTitle>
3.1 Initial Learner
</SectionTitle>
      <Paragraph position="0"> There are several sub-problems related to inferring event anchoring and event ordering. Once a tagger has tagged the events and times, the first task (A) is to link events and/or times, and the second task (B) is to label the links. Task A is hard to evaluate since, in the absence of massive preprocessing, many links are ignored by the human in creating the annotated corpora. In addi- null The number of TLINKs shown is based on the number of TLINK vectors extracted from the OTC.</Paragraph>
      <Paragraph position="1"> tion, a program, as a baseline, can trivially link all tagged events and times, getting 100% recall on Task A. We focus here on Task B, the labeling task. In the case of humans, in fact, when a TLINK is posited by both annotators between the same pairs of events or times, the inter-annotator agreement on the labels is a .77 average of P&amp;R.</Paragraph>
      <Paragraph position="2"> To ensure replicability of results, we assume perfect (i.e., OTC-supplied) events, times, and links. Thus, we can consider TLINK inference as the following classification problem: given an ordered pair of elements X and Y, where X and Y are events or times which the human has related temporally via a TLINK, the classifier has to assign a label in RelTypes. Using RelTypes instead of RelTypes [?] {NONE} also avoids the problem of heavily skewing the data towards the NONE class.</Paragraph>
      <Paragraph position="3"> To construct feature vectors for machine learning, we took each TLINK in the corpus and used the given TimeML features, with the TLINK class being the vector's class feature.</Paragraph>
      <Paragraph position="4"> For replicability by other users of these corpora, and to be able to isolate the effect of components, we used 'perfect' features; no feature engineering was attempted. The features were, for each event in an event-ordering pair, the event-class, aspect, modality, tense and negation (all nominal features); event string, and signal (a preposition/adverb, e.g., reported on Tuesday), which are string features, and contextual features indicating whether the same tense and same aspect are true of both elements in the event pair. For event-time links, we used the above event and signal features along with TIMEX3 time features.</Paragraph>
      <Paragraph position="5"> For learning, we used an off-the-shelf Maximum Entropy (ME) classifier (from Carafe, available at sourceforge.net/projects/carafe). As shown in the UNCLOSED (ME) column in Table 2  , accuracy of the unclosed ME classifier does not go above 77%, though it's always better than the majority class (in parentheses). We also tried a variety of other classifiers, including the SMO support-vector machine and the naive Bayes tools in WEKA (www.weka.net.nz). SMO performance (but not naive Bayes) was comparable with ME, with SMO trailing it in a few cases (to save space, we report just ME performance). It's possible that feature engineering could improve performance, but since this is &amp;quot;perfect&amp;quot; data, the result is not encouraging.</Paragraph>
      <Paragraph position="6">  All machine learning results, except for ME-C in Table 4, use 10-fold cross-validation. 'Accuracy' in tables is Predictive Accuracy.</Paragraph>
    </Section>
    <Section position="2" start_page="755" end_page="756" type="sub_section">
      <SectionTitle>
3.2 Expanding Training Data using Temporal Reasoning
</SectionTitle>
      <Paragraph position="0"> To expand our training set, we use a temporal closure component SputLink (Verhagen 2004), that takes known temporal relations in a text and derives new implied relations from them, in effect making explicit what was implicit. SputLink was inspired by (Setzer and Gaizauskas 2000) and is based on Allen's interval algebra, taking into account the limitations on that algebra that were pointed out by (Vilain et al. 1990). It is basically a constraint propagation algorithm that uses a transitivity table to model the compositional behavior of all pairs of relations in a document. SputLink's transitivity table is represented by 745 axioms. An example axiom:</Paragraph>
      <Paragraph position="2"> Once the TLINKs in each document in the corpus are closed using SputLink, the same vector generation procedure and feature representation described in Section 3.1 are used. The effect of closing the TLINKs on the corpus has a dramatic impact on learning. Table 2, in the CLOSED (ME-C) column shows that accuracies for this method (called ME-C, for Maximum Entropy learning with closure) are now in the high 80's and low 90's, and still outperform the closed majority class (shown in parentheses).</Paragraph>
      <Paragraph position="3"> What is the reason for the improvement?  One reason is the dramatic increase in the amount of training data. The more connected the initial un- null Interestingly, performance does not improve for SIMUL-TANEOUS. The reason for this might be due to the relatively modest increase in SIMULTANEOUS relations from applying closure (roughly factor of 2).</Paragraph>
      <Paragraph position="4"> closed graph for a document is in TLINKs, the greater the impact in terms of closure. When the OTC is closed, the number of TLINKs goes up by more than 11 times, from 6147 Event-Event  making BEFORE the majority class in the closed data for both Event-Event and Event-Time TLINKs. There are only an average of 0.84 TLINKs per event before closure, but after closure it shoots up to 9.49 TLINKs per event. (Note that as a result, the majority class percentages for the closed data have changed from the unclosed data.) Being able to bootstrap more training data is of course very useful. However, we need to dig deeper to investigate how the increase in data affected the machine learning. The improvement provided by temporal closure can be explained by three factors: (1) closure effectively creates a new classification problem with many more instances, providing more data to train on; (2) the class distribution is further skewed which results in a higher majority class baseline (3) closure produces additional data in such a way as to increase the frequencies and statistical power of existing features in the unclosed data, as opposed to adding new features. For example, with unclosed data, given A BEFORE B and B BEFORE C, closure generates A BEFORE C which provides more significance for the features related to A and C appearing as first and second arguments, respectively, in a BEFORE relation.</Paragraph>
      <Paragraph position="5"> In order to help determine the effects of the above factors, we carried out two experiments in which we sampled 6145 vectors from the closed  data - i.e. approximately the number of Event-Event vectors in the unclosed data. This effectively removed the contribution of factor (1) above. The first experiment (Closed Class Distribution) simply sampled 6145 instances uniformly from the closed instances, while the second experiment (Unclosed Class Distribution) sampled instances according to the same distribution as the unclosed data. Table 3 shows these results. The greater class distribution skew in the closed data clearly contributes to improved accuracy. However, when using the same class distribution as the unclosed data (removing factor (2) from above), the accuracy, 76%, is higher than using the full unclosed data. This indicates that closure does indeed help according to factor (3).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="756" end_page="758" type="metho">
    <SectionTitle>
4 Comparison against Baselines
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="756" end_page="756" type="sub_section">
      <SectionTitle>
4.1 Hand-Coded Rules
</SectionTitle>
      <Paragraph position="0"> Humans have strong intuitions about rules for temporal ordering, as we indicated in discussing sentences (1) to (3). Such intuitions led to the development of pattern matching rules incorporated in a TLINK tagger called GTag. GTag takes a document with TimeML tags, along with syntactic information from part-of-speech tagging and chunking from Carafe, and then uses 187 syntactic and lexical rules to infer and label TLINKs between tagged events and other tagged events or times. The tagger takes pairs of TLINKable items (event and/or time) and searches for the single most-confident rule to apply to it, if any, to produce a labeled TLINK between those items. Each (if-then) rule has a left-hand side which consists of a conjunction of tests based on TimeML-related feature combinations (TimeML features along with part-of-speech and chunk-related features), and a right-hand side which is an assignment to one of the TimeML TLINK classes.</Paragraph>
      <Paragraph position="1"> The rule patterns are grouped into several different classes: (i) the event is anchored with or without a signal to a time expression within the same clause, e.g., (3c), (ii) the event is anchored without a signal to the document date (as is often the case for reporting verbs in news), (iii) an event is linked to another event in the same sentence, e.g., (3c), and (iv) the event in a main clause of one sentence is anchored with a signal or tense/aspect cue to an event in the main clause of the previous sentence, e.g., (1-2), (3a-b).</Paragraph>
      <Paragraph position="2"> The performance of this baseline is shown in Table 4 (line GTag). The top most accurate rule (87% accuracy) was GTag Rule 6.6, which links a past-tense event verb joined by a conjunction to another past-tense event verb as being BEFORE the latter (e.g., they traveled and slept the</Paragraph>
      <Paragraph position="4"> The vast majority of the intuition-bred rules have very low accuracy compared to ME-C, with intuitions failing for various feature combinations and relations (for relations, for example, GTag lacks rules for IBEFORE, STARTS, and ENDS). The bottom-line here is that even when heuristic preferences are intuited, those preferences need to be guided by empirical data, whereas hand-coded rules are relatively ignorant of the distributions that are found in data.</Paragraph>
    </Section>
    <Section position="2" start_page="756" end_page="757" type="sub_section">
      <SectionTitle>
4.2 Adding Google-Induced Lexical Rules
</SectionTitle>
      <Paragraph position="0"> One might argue that the above baseline is too weak, since it doesn't allow for a rich set of lexical relations. For example, pushing can result in falling, killing always results in death, and so forth. These kinds of defeasible rules have been investigated in the semantics literature, including the work of Lascarides and Asher cited in Section 1.</Paragraph>
      <Paragraph position="1"> However, rather than hand-creating lexical rules and running into the same limitations as with GTag's rules, we used an empirically-derived resource called VerbOcean (Chklovski and Pantel 2004), available at http://semantics.isi.edu/ocean. This resource consists of lexical relations mined from Google searches. The mining uses a set of lexical and syntactic patterns to test for pairs of verb strongly associated on the Web in an asymmetric 'happens-before' relation. For example, the system discovers that marriage happens-before divorce, and that tie happens-before untie.</Paragraph>
      <Paragraph position="2"> We automatically extracted all the 'happens-before' relations from the VerbOcean resource at the above web site, and then automatically converted those relations to GTag format, producing 4,199 rules. Here is one such converted rule:  Adding these lexical rules to GTag (with morphological normalization being added for rule matching on word features) amounts to a considerable augmentation of the rule-set, by a factor of 22. GTag with this augmented rule-set might be a useful baseline to consider, since one would expect the gigantic size of the Google 'corpus' to yield fairly robust, broad-coverage rules.</Paragraph>
      <Paragraph position="3"> What if both a core GTag rule and a VerbOcean-derived rule could both apply? We assume the one with the higher confidence is chosen.</Paragraph>
      <Paragraph position="4"> However, we don't have enough data to reliably estimate rule confidences for the original GTag rules; so, for the purposes of VerbOcean rule integration, we assigned either the original VerbOcean rules as having greater confidence than the original GTag rules in case of a conflict (i.e., a preference for the more specific rule), or viceversa. null The results are shown in Table 4 (lines GTag+VerbOcean). The combined rule set, under both voting schemes, had no statistically significant difference in accuracy from the original GTag rule set. So, ME-C beat this baseline as well.</Paragraph>
      <Paragraph position="5"> The reason VerbOcean didn't help is again one of data sparseness, due to most verbs occurring rarely in the OTC. There were only 19 occasions when a happens-before pair from VerbOcean correctly matched a human BEFORE TLINK, of which 6 involved the same rule being right twice (including learn happens-before forget, a rule which students are especially familiar with!), with the rest being right just once. There were only 5 occasions when a VerbOcean rule incorrectly matched a human BEFORE TLINK, involving just three rules.</Paragraph>
    </Section>
    <Section position="3" start_page="757" end_page="758" type="sub_section">
      <SectionTitle>
4.3 Learning from Hand-Coded Rules
Baseline
</SectionTitle>
      <Paragraph position="0"> The previous baseline was a hybrid confidence-based combination of corpus-induced lexical relations with hand-created rules for temporal ordering. One could consider another obvious hybrid, namely learning from annotations created by GTag-annotated corpora. Since the intuitive baseline fares badly, this may not be that attractive. However, the dramatic impact of closure could help offset the limited coverage provided by human intuitions.</Paragraph>
      <Paragraph position="1"> Table 4 (line GTag+closure+ME-C) shows the results of closing the TLINKs produced by GTag's annotation and then training ME from the resulting data. The results here are evaluated against a held-out test set. We can see that even after closure, the baseline of learning from unclosed human annotations is much poorer than ME-C, and is in fact substantially worse than the majority class on event ordering.</Paragraph>
      <Paragraph position="2"> This means that for preprocessing new data sets to produce noisily annotated data for this classification task, it is far better to use machine-learning from closed human annotations rather  than machine-learning from closed annotations produced by an intuitive baseline.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>