<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1020"> <Title>Inferring Sentence-internal Temporal Relations</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Parameter Estimation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Data Extraction </SectionTitle> <Paragraph position="0"> Subordinate clauses (and their main clause counterparts) were extracted from the BLLIP corpus (30M words), a Treebank-style, machine-parsed version of the Wall Street Journal (WSJ, years 1987-89), which was produced using Charniak's (2000) parser. From the extracted clauses we estimate the features described in Section 3.2.</Paragraph> <Paragraph position="1"> We first traverse the tree top-down until we identify the tree node bearing the subordinate clause label we are interested in and extract the subtree it dominates.</Paragraph> <Paragraph position="2"> Assuming we want to extract after subordinate clauses, this would be the subtree dominated by SBAR-TMP in Figure 1, indicated by the arrow pointing down. Having found the subordinate clause, we proceed to extract the main clause by traversing the tree upwards and identifying the S node immediately dominating the subordinate clause node (see the arrow pointing up in Figure 1). In cases where the subordinate clause is sentence initial, we first identify the SBAR-TMP node and extract the subtree dominated by it, and then traverse the tree upwards in order to extract the S-tree immediately dominating it.</Paragraph> <Paragraph position="3"> For the experiments described here we focus solely on subordinate clauses immediately dominated by S, thus ignoring cases where nouns are related to clauses via a temporal marker. Note also that there can be more than one main clause that qualifies as an attachment site for a subordinate clause. In Figure 1 the subordinate clause after the sale is completed can be attached either to said or will lose. We rely on the parser to provide relatively accurate information about attachment sites, but unavoidably there is some noise in the data.</Paragraph> </Section>
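To make the extraction procedure concrete, here is a minimal sketch using NLTK's Tree class. The bracketed tree string, the SBAR-TMP label, and the helper name are illustrative assumptions; the actual implementation operated over full BLLIP parses and dealt with functional tags and attachment ambiguity, which this sketch does not.

```python
# Sketch: extracting a temporal subordinate clause and its main clause
# from a Treebank-style parse. The bracketed string below is a simplified,
# hypothetical parse, not an actual BLLIP tree.
from nltk.tree import Tree

def extract_clause_pair(tree, sub_label="SBAR-TMP"):
    """Return (main_clause, subordinate_clause) subtrees, or None."""
    for pos in tree.treepositions():
        node = tree[pos]
        if isinstance(node, Tree) and node.label() == sub_label:
            # Walk upwards from the subordinate clause to the nearest
            # dominating S node, which we take to be the main clause.
            for i in range(len(pos) - 1, -1, -1):
                ancestor = tree[pos[:i]]
                if ancestor.label() == "S":
                    return ancestor, node
    return None

parse = Tree.fromstring(
    "(S (NP (NNS employees)) "
    "(VP (MD will) (VP (VB lose) (NP (PRP$ their) (NNS jobs)) "
    "(SBAR-TMP (IN after) (S (NP (DT the) (NN sale)) "
    "(VP (VBZ is) (VP (VBN completed))))))))"
)

pair = extract_clause_pair(parse)
if pair is not None:
    main, sub = pair
    print("MAIN:", " ".join(main.leaves()))
    print("SUB: ", " ".join(sub.leaves()))
```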
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Model Features </SectionTitle> <Paragraph position="0"> A number of knowledge sources are involved in inferring temporal ordering, including tense, aspect, temporal adverbials, lexical semantic information, and world knowledge (Asher and Lascarides, 2003). By selecting features that represent, albeit indirectly and imperfectly, these knowledge sources, we aim to empirically assess their contribution to the temporal inference task. Below we introduce our features and provide the motivation behind their selection.</Paragraph> <Paragraph position="1"> Temporal Signature (T) It is well known that verbal tense and aspect impose constraints not only on the temporal order of events but also on the choice of temporal markers. These constraints are perhaps best illustrated in the system of Dorr and Gaasterland (1995), who examine how inherent (i.e., states and events) and non-inherent (i.e., progressive, perfective) aspectual features interact with the time stamps of the eventualities in order to generate clauses and the markers that relate them.</Paragraph> <Paragraph position="2"> Although we cannot infer inherent aspectual features from verb surface form (for this we would need a dictionary of verbs and their aspectual classes, together with a process that infers the aspectual class in a given context), we can extract non-inherent features from our parse trees. We first identify verb complexes, including modals and auxiliaries, and then classify tensed and non-tensed expressions along the following dimensions: finiteness, non-finiteness, modality, aspect, voice, and polarity. The values of these features are shown in Table 1. The features finiteness and non-finiteness are mutually exclusive.</Paragraph> <Paragraph position="3"> Verbal complexes were identified from the parse trees heuristically by devising a set of 30 patterns that search for sequences of auxiliaries and verbs. From the parser output, verbs were classified as passive or active by building a set of 10 passive-identifying patterns requiring both a passive auxiliary (some form of be or get) and a past participle.</Paragraph> <Paragraph position="4"> To illustrate with an example, consider again the parse tree in Figure 1. We identify the verbal groups will lose and is completed from the main and subordinate clause respectively. The former is mapped to the features ⟨present, future, imperfective, active, affirmative⟩, whereas the latter is mapped to ⟨present, ∅, imperfective, passive, affirmative⟩, where ∅ indicates the absence of a modal. In Table 2 we show the relative frequencies in our corpus for finiteness (FIN), past tense (PAST), active voice (ACT), and negation (NEG) for main and subordinate clauses conjoined with the markers once and since.</Paragraph> <Paragraph position="6"> As can be seen, there are differences in the distribution of counts between main and subordinate clauses for the same and different markers. For instance, the past tense is more frequent in since than in once subordinate clauses, and modal verbs are more often attested in since main clauses than in once main clauses. Also, once main clauses are more likely to be active, whereas once subordinate clauses can be either active or passive.</Paragraph>
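The sketch below illustrates the kind of pattern-based classification described above for a single verbal group. The be/get-plus-past-participle passive test follows the description in the text, but the tag patterns and feature values are simplified illustrations, not the actual 30 verb-group patterns or 10 passive patterns.

```python
# Sketch: mapping a verbal group, given as Penn Treebank (word, tag) pairs,
# to a coarse temporal signature. The heuristics are illustrative stand-ins
# for the hand-written patterns described in the text.
BE_GET = {"be", "am", "is", "are", "was", "were", "been", "being",
          "get", "gets", "got", "getting"}

def temporal_signature(verb_group):
    words = [w.lower() for w, _ in verb_group]
    tags = [t for _, t in verb_group]

    finite = any(t in {"VBD", "VBZ", "VBP", "MD"} for t in tags)
    modal = next((w for w, t in verb_group if t == "MD"), None)

    if "VBD" in tags:
        tense = "past"
    elif finite:
        tense = "present"
    else:
        tense = "nonfinite"

    # Progressive if an -ing form is present; perfective if a past
    # participle follows a form of "have"; otherwise imperfective.
    if "VBG" in tags:
        aspect = "progressive"
    elif "VBN" in tags and any(w in {"have", "has", "had"} for w in words):
        aspect = "perfective"
    else:
        aspect = "imperfective"

    # Passive: a form of "be"/"get" immediately preceding a past participle.
    voice = "active"
    for (w1, _), (_, t2) in zip(verb_group, verb_group[1:]):
        if w1.lower() in BE_GET and t2 == "VBN":
            voice = "passive"

    polarity = "negative" if any(w in {"not", "n't"} for w in words) else "affirmative"
    return (tense, modal, aspect, voice, polarity)

print(temporal_signature([("will", "MD"), ("lose", "VB")]))
print(temporal_signature([("is", "VBZ"), ("completed", "VBN")]))
```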
<Paragraph position="7"> Verb Identity (V) Investigations into the interpretation of narrative discourse have shown that specific lexical information plays an important role in determining temporal interpretation (e.g., Asher and Lascarides 2003). For example, the fact that verbs like push can cause movement of the patient and verbs like fall describe the movement of their subject can be used to predict that the discourse (8) is interpreted as the pushing causing the falling, making the linear order of the events mismatch their temporal order.</Paragraph> <Paragraph position="8"> (8) Max fell. John pushed him.</Paragraph> <Paragraph position="9"> We operationalise lexical relationships among verbs in our data by counting their occurrence in main and subordinate clauses from a lemmatised version of the BLLIP corpus. Verbs were extracted from the parse trees containing main and subordinate clauses. Consider again the tree in Figure 1. Here, we identify lose and complete, without preserving information about tense or passivisation, which is explicitly represented in our temporal signatures. Table 3 lists the most frequent verbs attested in main (VerbM) and subordinate (VerbS) clauses conjoined with the temporal markers after, as, before, once, since, until, when, and while (TMark in Table 3).</Paragraph>

Table 3: Most frequent verbs, nouns, and adjectives in main and subordinate clauses for each temporal marker (TMark)

TMark   VerbM    VerbS     NounM      NounS    AdjM    AdjS
after   sell     leave     year       company  last    new
as      come     acquire   market     dollar   recent  previous
before  say      announce  time       year     long    new
once    become   complete  stock      place    more    new
since   rise     expect    company    month    first   last
until   protect  pay       president  year     new     next
when    make     sell      year       year     last    last
while   wait     complete  chairman   plan     first   other

<Paragraph position="10"> Verb Class (VW, VL) The verb identity feature does not capture meaning regularities concerning the types of verbs entering into temporal relations. For example, in Table 3 sell and pay are possession verbs, say and announce are communication verbs, and come and rise are motion verbs. We use a semantic classification for obtaining some degree of generalisation over the extracted verb occurrences. We experimented with WordNet (Fellbaum, 1998) and the verb classification proposed by Levin (1993).</Paragraph> <Paragraph position="11"> Verbs in WordNet are classified into 15 general semantic domains (e.g., verbs of change, verbs of cognition, etc.). We mapped the verbs occurring in main and subordinate clauses to these very general semantic categories (feature VW). Ambiguous verbs in WordNet will correspond to more than one semantic class. We resolve ambiguity heuristically by always defaulting to the verb's prime sense and selecting the semantic domain for this sense. In cases where a verb is not listed in WordNet we default to its lemmatised form.</Paragraph> <Paragraph position="12"> Levin (1993) focuses on the relation between verbs and their arguments and hypothesizes that verbs which behave similarly with respect to the expression and interpretation of their arguments share certain meaning components and can therefore be organised into semantically coherent classes (200 in total). Asher and Lascarides (2003) argue that these classes provide important information for identifying semantic relationships between clauses. Verbs in our data were mapped into their corresponding Levin classes (feature VL); polysemous verbs were disambiguated by the method proposed in Lapata and Brew (1999). Again, for verbs not included in Levin, the lemmatised verb form is used.</Paragraph>
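As a concrete illustration of the VW mapping (and of the analogous NW mapping for nouns described below), the sketch below uses NLTK's WordNet interface, taking the first listed sense as the prime sense and falling back to the lemma when a word is not covered. The Levin lookup is shown only as a hypothetical table; the actual VL feature relied on Levin's (1993) classification and the disambiguation method of Lapata and Brew (1999).

```python
# Sketch: backing off from lexical identity to coarse semantic classes.
# Requires the NLTK WordNet data (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

# Hypothetical fragment of a Levin-class table; the real resource maps
# each verb to one or more of Levin's roughly 200 classes.
LEVIN = {"sell": "13.1 Give", "say": "37.7 Say",
         "rise": "45.6 Calibratable Change of State"}

def wordnet_class(lemma, pos):
    """Return the WordNet lexicographer file of the first sense,
    e.g. 'verb.possession' or 'noun.time'; fall back to the lemma."""
    synsets = wn.synsets(lemma, pos=pos)
    return synsets[0].lexname() if synsets else lemma

def levin_class(lemma):
    """Return a Levin class if the verb is covered, else the lemma."""
    return LEVIN.get(lemma, lemma)

print(wordnet_class("sell", wn.VERB))      # e.g. verb.possession
print(wordnet_class("announce", wn.VERB))  # e.g. verb.communication
print(wordnet_class("year", wn.NOUN))      # e.g. noun.time
print(levin_class("sell"))
```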
<Paragraph position="13"> Noun Identity (N) It is not only verbs but also nouns that can provide important information about the semantic relation between two clauses (see Asher and Lascarides 2003 for detailed motivation). In our domain, for example, the noun share is found in main clauses, typically preceding the noun market, which is often found in subordinate clauses. Table 3 shows the most frequently attested nouns (excluding proper names) in main (NounM) and subordinate (NounS) clauses for each temporal marker. Notice that time-denoting nouns (e.g., year, month) are quite frequent in this data set.</Paragraph> <Paragraph position="14"> Nouns were extracted from a lemmatised version of the parser's output. In Figure 1 the nouns employees, jobs and sales are relevant for the Noun feature. In cases of noun compounds, only the compound head (i.e., the rightmost noun) was taken into account. A small set of rules was used to identify organisations (e.g., United Laboratories Inc.), person names (e.g., Jose Y. Campos), and locations (e.g., New England), which were subsequently substituted by the general categories person, organisation, and location.</Paragraph> <Paragraph position="15"> Noun Class (NW) As in the case of verbs, nouns were also represented by broad semantic classes from the WordNet taxonomy. Nouns in WordNet do not form a single hierarchy; instead they are partitioned according to a set of semantic primitives into 25 semantic classes (e.g., nouns of cognition, events, plants, substances, etc.), which are treated as the unique beginners of separate hierarchies. The nouns extracted from the parser were mapped to WordNet classes. Ambiguity was handled in the same way as for verbs.</Paragraph> <Paragraph position="16"> Adjective (A) Our motivation for including adjectives in our feature set is twofold. First, we hypothesise that temporal adjectives will be frequent in subordinate clauses introduced by strictly temporal markers such as before, after, and until, and may therefore provide clues for the marker interpretation task. Secondly, similarly to verbs and nouns, adjectives carry important lexical information that can be used for inferring the semantic relation that holds between two clauses. For example, antonyms can often provide clues about the temporal sequence of two events (see incoming and outgoing in (9)).</Paragraph> <Paragraph position="17"> (9) The incoming president delivered his inaugural speech. The outgoing president resigned last week.</Paragraph> <Paragraph position="18"> As with verbs and nouns, adjectives were extracted from the parser's output. The most frequent adjectives in main (AdjM) and subordinate (AdjS) clauses are given in Table 3. Syntactic Signature (S) The syntactic differences between main and subordinate clauses are captured by the syntactic signature feature. The feature can be viewed as a measure of tree complexity, as it encodes for each main and subordinate clause the number of NPs, VPs, PPs, ADJPs, and ADVPs it contains. The feature can be easily read off from the parse tree. The syntactic signature for the main clause in Figure 1 is [NP:2 VP:2 ADJP:0 ADVP:0 PP:0] and for the subordinate clause [NP:1 VP:1 ADJP:0 ADVP:0 PP:0]. The most frequent syntactic signature for main clauses is [NP:2 VP:1 PP:0 ADJP:0 ADVP:0]; subordinate clauses typically contain an adverbial phrase [NP:2 VP:1 ADJP:0 ADVP:1 PP:0].</Paragraph> <Paragraph position="19"> Argument Signature (R) This feature captures the argument structure profile of main and subordinate clauses. It applies only to verbs and encodes whether a verb has a direct or indirect object, and whether it is modified by a preposition or an adverbial. As with the syntactic signature, this feature was read off the main and subordinate clause parse trees. The parsed version of the BLLIP corpus contains information about subjects. NPs whose nearest ancestor was a VP were identified as objects. Modification relations were recovered from the parse trees by finding all PPs and ADVPs immediately dominated by a VP. In Figure 1 the argument signature of the main clause is [SUBJ,OBJ] and for the subordinate it is [OBJ].</Paragraph>
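A minimal sketch of how both signatures can be read off a clause subtree is given below, again using NLTK trees. The assumption that subjects carry an NP-SBJ label and that objects are NPs immediately under a VP mirrors the description above, but the helpers are illustrative rather than the actual extraction code.

```python
# Sketch: syntactic signature (phrase counts) and a simplified argument
# signature for a clause subtree, assuming Treebank-style labels and that
# subjects are marked NP-SBJ in the parsed corpus.
from collections import Counter
from nltk.tree import Tree

PHRASES = ("NP", "VP", "PP", "ADJP", "ADVP")

def syntactic_signature(clause):
    counts = Counter()
    for node in clause.subtrees():
        base = node.label().split("-")[0]  # strip functional tags, e.g. NP-SBJ -> NP
        if base in PHRASES:
            counts[base] += 1
    return {p: counts[p] for p in PHRASES}

def argument_signature(clause):
    sig = []
    for node in clause.subtrees():
        if node.label().startswith("NP-SBJ"):
            sig.append("SUBJ")
        elif node.label() == "VP":
            for child in node:
                if isinstance(child, Tree):
                    base = child.label().split("-")[0]
                    if base == "NP":
                        sig.append("OBJ")   # NP immediately dominated by a VP
                    elif base in ("PP", "ADVP"):
                        sig.append("MOD")   # prepositional/adverbial modification
    return sig

clause = Tree.fromstring(
    "(S (NP-SBJ (DT the) (NNS employees)) "
    "(VP (MD will) (VP (VB lose) (NP (PRP$ their) (NNS jobs)))))"
)
print(syntactic_signature(clause))  # {'NP': 2, 'VP': 2, 'PP': 0, 'ADJP': 0, 'ADVP': 0}
print(argument_signature(clause))   # ['SUBJ', 'OBJ']
```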
<Paragraph position="20"> Position (P) This feature simply records the position of the two clauses in the parse tree, i.e., whether the subordinate clause precedes or follows the main clause. The majority of the main clauses in our data are sentence initial (80.8%). However, there are differences among individual markers. For example, once clauses are equally frequent in both positions; 30% of the when clauses are sentence initial, whereas 90% of the after clauses are found in the second position.</Paragraph> <Paragraph position="21"> In the following sections we describe our experiments with the model introduced in Section 2. We first investigate the model's accuracy on the temporal interpretation and fusion tasks (Experiment 1) and then describe a study with humans (Experiment 2). The latter enables us to examine in more depth the model's classification accuracy when compared to human judges.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiment 1: Interpretation and Fusion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Method </SectionTitle> <Paragraph position="0"> The model was trained on main and subordinate clauses extracted from the BLLIP corpus as detailed in Section 3.1. We obtained 83,810 main-subordinate pairs.</Paragraph> <Paragraph position="1"> These were randomly partitioned into training (80%), development (10%) and test data (10%). Eighty randomly selected pairs from the test data were reserved for the human study reported in Experiment 2. We performed parameter tuning on the development set; all our results are reported on the unseen test set, unless otherwise stated.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4.2 Results </SectionTitle> <Paragraph position="0"> In order to assess the impact of our features on the interpretation task, the feature space was exhaustively evaluated on the development set. We have nine features, which results in 9!/((9-k)! k!) feature combinations, where k is the arity of the combination (unary, binary, ternary, etc.). We measured the accuracy of all feature combinations (1023 in total) on the development set. From these, we selected the most informative combinations for evaluating the model on the test set. The best accuracy (61.4%) on the development set was observed with the combination of verbs (V) with syntactic signatures (S). We also observed that some feature combinations performed reasonably well on individual markers, even though their overall accuracy was not better than V and S combined. Some accuracies for these combinations are shown in Table 4. For example, NPRSTV was one of the best combinations for generating after, whereas SV was better for before (feature abbreviations are as introduced in Section 3.2).</Paragraph> <Paragraph position="1"> Given the complementarity of different model parametrisations, an obvious question is whether these can be combined. An important finding in Machine Learning is that a set of classifiers whose individual decisions are combined in some way (an ensemble) can be more accurate than any of its component classifiers if the errors of the individual classifiers are sufficiently uncorrelated (Dietterich, 1997). In this paper an ensemble was constructed by combining classifiers resulting from training different parametrisations of our model on the same data. A decision tree (Quinlan, 1993) was used for selecting the models with the least overlap and for combining their output.</Paragraph>
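To make the ensemble construction concrete, here is a small stacking-style sketch: each component model's predicted marker on the development data becomes an input feature for a decision tree, which is evaluated with 10-fold cross-validation. The toy data, the scikit-learn usage, and the integer encoding of the predictions are illustrative assumptions; the exact implementation is not specified in the text.

```python
# Sketch: combining the outputs of several model parametrisations with a
# decision tree (stacking). The predictions below are fabricated toy data
# standing in for the real development-set output.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

MARKERS = ["after", "as", "before", "once", "since", "until", "when", "while"]
to_id = {m: i for i, m in enumerate(MARKERS)}

# One row per development instance, one column per component model
# (e.g. SV, PSV, NPRSTV, ...); each cell is that model's predicted marker.
component_predictions = [
    ["when", "when", "after"],
    ["before", "when", "before"],
    ["after", "after", "after"],
    ["since", "while", "since"],
] * 25  # repeat to obtain a minimally sized toy set
gold = ["when", "before", "after", "since"] * 25

X = np.array([[to_id[m] for m in row] for row in component_predictions])
y = np.array([to_id[m] for m in gold])

stacker = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(stacker, X, y, cv=10)  # 10-fold cross-validation
print("mean accuracy: %.3f" % scores.mean())
```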
<Paragraph position="2"> The decision tree was trained and tested on the development set using 10-fold cross-validation. We experimented with 65 different models; out of these, the best results on the development set were obtained with the combination of 12 models: ANWNPSV, APSV, ASV, VWPRS, VNPS, VLS, NPRSTV, PRS, PRST, PRSV, PSV, and SV.</Paragraph> <Paragraph position="3"> These models formed the ensemble whose accuracy was next measured on the test set. Note that the features with the most impact on the interpretation task are verbs, either as lexical forms (V) or classes (VW, VL), the syntactic structure of the main and subordinate clauses (S), and their position (P). The argument structure feature (R) seems to have some influence (it is present in five of the 12 combinations); however, we suspect that there is some overlap with S. Nouns, adjectives and temporal signatures seem to have less impact on the interpretation task, for the WSJ domain at least. Our results so far point to the importance of the lexicon (represented by V, N, and A) for the marker interpretation task but also indicate that the syntactic complexity of the two clauses is crucial for inferring their semantic relation.</Paragraph> <Paragraph position="4"> The accuracy of the ensemble (12 feature combinations) was next measured on the unseen test set using 10-fold cross-validation. Table 5 shows precision (Prec) and recall (Rec). For comparison we also report precision and recall for the best individual feature combination on the test set (SV) and for the baseline of always selecting when, the most frequent marker in our data set (42.6%). The ensemble (E) classified correctly 70.7% of the instances in the test set, whereas SV obtained an accuracy of 62.6%. The ensemble performs significantly better than SV (χ² = 102.57, df = 1, p < .005), and both SV and E perform significantly better than the baseline (χ² = 671.73, df = 1, p < .005 and χ² = 1278.61, df = 1, p < .005, respectively). The ensemble has difficulty inferring the markers since, once and while (see the recall figures in Table 5). Since is often confused with the semantically similar while. Until is not ambiguous; however, it is relatively infrequent in our corpus (6.3% of our data set). We suspect that there is simply not enough data for the model to accurately infer these markers.</Paragraph> <Paragraph position="5"> For the fusion task we also explored the feature space exhaustively on the development set, after removing the position feature (P). Knowing the linear precedence of the two clauses is highly predictive of their type: 80.8% of the main clauses are sentence initial. However, this type of positional information is typically not known when fragments are synthesised into a meaningful sentence. The best performing feature combinations on the development set were ARSTV and ANWRSV, with an accuracy of 80.4%. Feature combinations with the highest accuracy (on the development set) for individual markers are shown in Table 4. Similarly to the interpretation task, an ensemble of classifiers was built in order to take advantage of the complementarity of different model parametrisations. The decision tree learner was again trained and tested on the development set using 10-fold cross-validation. We experimented with 44 different model instantiations; the best results were obtained when the following 20 models were combined: AVWNRSTV, ANWNSTV, ANWNV, ANWRS, ANV, ARS, ARSTV, ARSV, ARV, AV, VWHS, VWRT, VWTV, NWRST, NWS, NWST, VWT, VWTV, RT, and STV. Not surprisingly, V and S are also important for the fusion task.
Adjectives (A), nouns (N and NW) and temporal signatures (T) all seem to play more of a role in the fusion than in the interpretation task. This is perhaps to be expected given that the differences between main and subordinate clauses are rather subtle (semantically and structurally) and more information is needed to perform the inference.</Paragraph> <Paragraph position="6"> The ensemble (consisting of the 20 selected models) attained an accuracy of 97.4% on the test set. The accuracy of the best performing model on the test set (ARSTV) was 80.1% (see Table 5). Precision for each individual marker is shown in Table 5 (we omit recall as it is always one). Both the ensemble and ARSTV significantly outperform the simple baseline of 50%, amounting to always guessing main (or subordinate) for both clauses (χ² = 4848.46, df = 1, p < .005 and χ² = 1670.81, df = 1, p < .005, respectively). The ensemble performed significantly better than ARSTV (χ² = 1233.63, df = 1, p < .005).</Paragraph> <Paragraph position="7"> Although for both tasks the ensemble outperformed the single best model, it is worth noting that the best individual models (ARSTV for fusion and PSTV for interpretation) rely on features that can be simply extracted from the parse trees without recourse to taxonomic information. Removing from the ensembles the feature combinations that rely on corpus-external resources (i.e., Levin, WordNet) yields an overall accuracy of 65.0% for the interpretation task and 95.6% for the fusion task.</Paragraph> </Section>
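The significance comparisons reported in this section can be reproduced along the following lines; a 2x2 contingency table of correct versus incorrect decisions for two systems is one plausible construction, since the text reports the χ² statistics but not the exact test setup. The counts below are invented toy numbers, only roughly in line with the reported accuracies.

```python
# Sketch: chi-square comparison of two classifiers' accuracies on the same
# test set, assuming SciPy. The counts are hypothetical, not the actual
# correct/incorrect tallies behind the reported statistics.
from scipy.stats import chi2_contingency

# Rows: system (ensemble vs. single best model); columns: correct, incorrect.
table = [
    [5900, 2400],   # hypothetical ensemble counts
    [5200, 3100],   # hypothetical single-model counts
]

chi2, p, dof, _ = chi2_contingency(table)
print("chi2 = %.2f, df = %d, p = %.4g" % (chi2, dof, p))
```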
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiment 2: Human Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Method </SectionTitle> <Paragraph position="0"> We further compared our model's performance against human judges by conducting two separate studies, one for the interpretation and one for the fusion task. In the first study, participants were asked to perform a multiple-choice task. They were given a set of 40 main-subordinate pairs (five for each marker) randomly chosen from our test data. The marker linking the two clauses was removed and participants were asked to select the missing word from a set of eight temporal markers.</Paragraph> <Paragraph position="1"> In the second study, participants were presented with a series of sentence fragments and were asked to arrange them so that a coherent sentence could be formed. The fragments were a main clause, a subordinate clause and a marker. Participants saw 40 such triples randomly selected from our test set. The set of items was different from those used in the interpretation task; again, five items were selected for each marker.</Paragraph> <Paragraph position="2"> Both studies were conducted remotely over the Internet. Subjects first saw a set of instructions that explained the task and had to fill in a short questionnaire including basic demographic information. For the interpretation task, a random order of main-subordinate pairs and a random order of markers per pair was generated for each subject. For the fusion task, a random order of items and a random order of fragments per item was generated for each subject. The interpretation study was completed by 198 volunteers, all native speakers of English. 100 volunteers participated in the fusion study, again all native speakers of English. Subjects were recruited via postings to local email lists.</Paragraph> </Section> </Section> </Paper>