<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1087"> <Title>Acquiring the Meaning of Discourse Markers</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> POS-POL NEG-POL </SectionTitle> <Paragraph position="0"> POS-POL: after, and, as, as soon as, because, before, considering that, ever since, for, given that, if, in case, in order that, in that, insofar as, now, now that, on the grounds that, once, seeing as, since, so, so that, the instant, the moment, then, to the extent that, when, whenever. NEG-POL: although, but, even if, even though, even when, only if, only when, or, or else, though, unless, until, whereas, yet.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Veridicality </SectionTitle> <Paragraph position="0"> A discourse relation is veridical if it implies the truth of both its arguments (Asher and Lascarides, 2003), otherwise it is not. For example, in (3) it is not necessarily true either that David can stay up or that he promises, or will promise, to be quiet. For this reason we will say if has the feature veridicality=NON-VERIDICAL. (3) David can stay up if he promises to be quiet.</Paragraph> <Paragraph position="1"> The disjunctive discourse marker or is also NON-VERIDICAL, because it does not imply that both of its arguments are true. On the other hand, and does imply this, and so has the feature veridicality=VERIDICAL. The VERIDICAL and NON-VERIDICAL discourse markers used in the learning experiments are shown in Table 2. 
Note that the polarity and veridicality are independent; for example, even if is both NEG-POL and NON-VERIDICAL.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Type </SectionTitle> <Paragraph position="0"> Discourse markers like because signal a CAUSAL relation, for example in (4).</Paragraph> <Paragraph position="1"> account, discourse markers have positive polarity only if they can never be paraphrased using a discourse marker with negative polarity. Interpreted in these terms, our experiment aims to distinguish negative polarity discourse markers from all others. 2 An effort was made to exclude discourse markers whose classification could be contentious, as well as ones which showed ambiguity across classes. Some level of judgement was therefore exercised by the author.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> VERIDICAL NON-VERIDICAL </SectionTitle> <Paragraph position="0"> VERIDICAL: after, although, and, as, as soon as, because, but, considering that, even though, even when, ever since, for, given that, in order that, in that, insofar as, now, now that, on the grounds that, once, only when, seeing as, since, so, so that, the instant, the moment, then, though, to the extent that, until, when, whenever, whereas, while, yet. NON-VERIDICAL: assuming that, even if, if, if ever, if only, in case, on condition that, on the assumption that, only if, or, or else, supposing that, unless. (4) The tension in the boardroom rose sharply because the chairman arrived. As a result, because has the feature type=CAUSAL. Other discourse markers that express a temporal relation, such as after, have the feature type=TEMPORAL. 
Just as a POS-POL discourse marker can occur with a negative polarity discourse relation, the context can also supply a causal relation even when a TEMPORAL discourse marker is used, as in (5).</Paragraph> <Paragraph position="1"> (5) The tension in the boardroom rose sharply after the chairman arrived.</Paragraph> <Paragraph position="2"> If the relation a discourse marker signals is neither CAUSAL nor TEMPORAL, it has the feature type=ADDITIVE.</Paragraph> <Paragraph position="3"> The need for a distinct class of TEMPORAL discourse relations is disputed in the literature. On the one hand, it has been suggested that TEMPORAL relations are a subclass of ADDITIVE ones on the grounds that the temporal reference inherent in the marking of tense and aspect &quot;more or less&quot; fixes the temporal ordering of events (Sanders et al., 1992). This contrasts with arguments that resolving discourse relations and temporal order occur as distinct but inter-related processes (Lascarides and Asher, 1993). On the other hand, several of the discourse markers we count as TEMPORAL, such as as soon as, might be described as CAUSAL (Oberlander and Knott, 1995). One of the results of the experiments described below is that corpus evidence suggests ADDITIVE, TEMPORAL and CAUSAL discourse markers have distinct distributions. The ADDITIVE, TEMPORAL and CAUSAL discourse markers used in the learning experiments are shown in Table 3. These features are independent of the previous ones; for example, even though is CAUSAL, VERIDICAL and NEG-POL.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> ADDITIVE TEMPORAL CAUSAL </SectionTitle> <Paragraph position="0"> ADDITIVE: and, but, whereas. TEMPORAL: after, as soon as, before, ever since, now, now that, once, until, when, whenever. CAUSAL: although, because, even though, for, given that, if, if ever, in case, on condition that, on the assumption that, on the grounds that, provided that, providing that, so, so that, supposing that, though, unless.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Corpus </SectionTitle> <Paragraph position="0"> The data for the experiments comes from a database of sentences collected automatically from the British National Corpus and the world wide web (Hutchinson, 2004). The database contains example sentences for each of 140 discourse structural connectives.</Paragraph> <Paragraph position="1"> Many discourse markers have surface forms with other usages, e.g. before in the phrase before noon. The following procedure was therefore used to select sentences for inclusion in the database. First, sentences containing a string matching the surface form of a structural connective were extracted. These sentences were then parsed using a statistical parser (Charniak, 2000). Potential structural connectives were then classified on the basis of their syntactic context, in particular their proximity to S nodes. Figure 1 shows example syntactic contexts which were used to identify discourse markers.</Paragraph> <Paragraph position="3"> It is because structural connectives are easy to identify in this manner that the experiments use only this subclass of discourse markers. Due both to parser errors and to the fact that the syntactic heuristics are not foolproof, the database contains noise. 
Manual analysis of a sample of 500 sentences revealed that about 12% of sentences do not contain the discourse marker they are supposed to.</Paragraph> <Paragraph position="4"> The frequencies in the database of the discourse markers used in the experiments ranged from 270 for the instant to 331,701 for and. The mean number of instances was 32,770, while the median was 4,948.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> This section presents three machine learning experiments into automatically classifying discourse markers according to their polarity, veridicality and type. We begin in Section 4.1 by describing the features we extract for each discourse marker token. Then in Section 4.2 we describe the different classifiers we use. The results are presented in Section 4.3.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Features used </SectionTitle> <Paragraph position="0"> We only used structural connectives in the experiments. This meant that the clauses linked syntactically were also related at the discourse level (Webber et al., 2003). Two types of features were extracted from the conjoined clauses. Firstly, we used lexical co-occurrences with words of various parts of speech. Secondly, we used a range of linguistically motivated syntactic, semantic, and discourse features.</Paragraph> <Paragraph position="1"> Lexical co-occurrences have previously been shown to be useful for discourse level learning tasks (Lapata and Lascarides, 2004; Marcu and Echihabi, 2002). For each discourse marker, the words occurring in its superordinate (main) and subordinate clauses were recorded,3 along with their parts of speech. 
We manually clustered the Penn Treebank parts of speech together to obtain coarser grained syntactic categories, as shown in Table 4.</Paragraph> <Paragraph position="2"> We then lemmatised each word and excluded all lemmas with a frequency of less than 1000 per million in the BNC. Finally, each word was given a prefix of either SUB or SUPER according to whether it occurred in the sub- or superordinate clause linked by the marker. This distinguished, for example, between occurrences of then in the antecedent (subordinate) and consequent (main) clauses linked by if.</Paragraph> <Paragraph position="3"> We also recorded the presence of other discourse markers in the two clauses, as these had previously been found to be useful on a related classification task (Hutchinson, 2003). The discourse markers used for this are based on the list of 350 markers given by Knott (1996), and include multiword expressions. Due to the sparser nature of discourse markers, compared to verbs for example, no frequency cutoffs were used. 3 For coordinating conjunctions, the left clause was taken to be the superordinate/main clause, and the right the subordinate clause. Table 4 (new label: Penn Treebank labels): vb: vb, vbd, vbg, vbn, vbp, vbz; nn: nn, nns, nnp; jj: jj, jjr, jjs; rb: rb, rbr, rbs; aux: aux, auxg, md; prp: prp, prp$; in: in.</Paragraph> <Paragraph position="4"> The linguistically motivated features included a range of one and two dimensional features representing more abstract linguistic information, and were extracted through automatic analysis of the parse trees.</Paragraph> <Paragraph position="5"> One dimensional features: Two one dimensional features recorded the location of discourse markers. POSITION indicated whether a discourse marker occurred between the clauses it linked, or before both of them. It thus relates to information structuring. EMBEDDING indicated the level of embedding, in number of clauses, of the discourse marker beneath the sentence's highest level clause. 
We were interested to see if some types of discourse relations are more often deeply embedded. The remaining features recorded the presence of linguistic features that are localised to a particular clause. Like the lexical co-occurrence features, these were indexed by the clause they occurred in: either SUPER or SUB.</Paragraph> <Paragraph position="6"> We expected negation to correlate with negative polarity discourse markers, and approximated negation using four features. NEG-SUBJ and NEG-VERB indicated the presence of subject negation (e.g. nothing) or verbal negation (e.g. n't). We also recorded the occurrence of a set of negative polarity items (NPI), such as any and ever. The features NPI-AND-NEG and NPI-WO-NEG indicated whether an NPI occurred in a clause with or without verbal or subject negation.</Paragraph> <Paragraph position="7"> Eventualities can be placed or ordered in time using not just discourse markers but also temporal expressions. The feature TEMPEX recorded the number of temporal expressions in each clause, as returned by a temporal expression tagger (Mani and Wilson, 2000).</Paragraph> <Paragraph position="8"> If the main verb was an inflection of to be or to do we recorded this using the features BE and DO. Our motivation was to capture any correlation of these verbs with states and events respectively.</Paragraph> <Paragraph position="9"> If the final verb was a modal auxiliary, this ellipsis was evidence of strong cohesion in the text (Halliday and Hasan, 1976). We recorded this with the feature VP-ELLIPSIS. Pronouns also indicate cohesion, and have been shown to correlate with subjectivity (Bestgen et al., 2003). A class of features</Paragraph> <Paragraph position="11"> ing either 1st person, 2nd person, or 3rd person animate, inanimate or plural.</Paragraph> <Paragraph position="12"> The syntactic structure of each clause was captured using two features, one finer grained and one coarser grained. 
STRUCTURAL-SKELETON identified the major constituents under the S or VP nodes, e.g. a simple double object construction gives &quot;NP VB NP NP&quot;. ARGS identified whether the clause contained an (overt) object, an (overt) subject, or both, or neither.</Paragraph> <Paragraph position="13"> The overall size of a clause was represented using four features. WORDS, NPS and PPS recorded the numbers of words, NPs and PPs in a clause (not counting embedded clauses). The feature CLAUSES counted the number of clauses embedded beneath a clause.</Paragraph> <Paragraph position="14"> Two dimensional features: These features all recorded combinations of linguistic features across the two clauses linked by the discourse marker. For example, the MOOD feature would take the value ⟨DECL,IMP⟩ for the sentence John is coming, but don't tell anyone! These features were all determined automatically by analysing the auxiliary verbs and the main verbs' POS tags. The features and the possible values for each clause were as follows: MODALITY: one of FUTURE, ABILITY or NULL; MOOD: one of DECL, IMP or INTERR; PERFECT: either YES or NO; PROGRESSIVE: either YES or NO; TENSE: either PAST or PRESENT.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Classifier architectures </SectionTitle> <Paragraph position="0"> Two different classifiers, based on local and global methods of comparison, were used in the experiments. The first, 1 Nearest Neighbour (1NN), is an instance based classifier which assigns each marker to the same class as that of the marker nearest to it. For this, three different distance metrics were explored. The first metric was the Euclidean distance function, shown in (6), applied to probability distributions.</Paragraph> <Paragraph position="2"> The second is a smoothed variant of the information theoretic Kullback-Leibler divergence (Lee, 2001, with a fixed smoothing parameter). 
Its definition</Paragraph> <Paragraph position="4"> The third metric is a t-test weighted adaptation of the Jaccard coefficient (Curran and Moens, 2002). In its basic form, the Jaccard coefficient is essentially a measure of how much two distributions overlap. The t-test variant weights co-occurrences by the strength of their collocation, using the following function: be the best metrics for other tasks involving lexical similarity. The Euclidean distance is included to indicate what can be achieved using a somewhat naive metric.</Paragraph> <Paragraph position="5"> The second classifier used, Naive Bayes, takes the overall distribution of each class into account. It essentially defines a decision boundary in the form of a curved hyperplane. The Weka implementation (Witten and Frank, 2000) was used for the experiments, with 10-fold cross-validation.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <Paragraph position="0"> We began by comparing the performance of the 1NN classifier using the various lexical co-occurrence features against the gold standards. The results using all lexical co-occurrences are shown in Table 5. The baseline was obtained by assigning discourse markers to the largest class, i.e. with the most types. The best results obtained using just a single POS class are also shown. The results across the different metrics suggest that adverbs and verbs are the best single predictors of polarity and veridicality, respectively.</Paragraph> <Paragraph position="1"> We next applied the 1NN classifier to co-occurrences with discourse markers. The results are shown in Table 7. The results show that for each task 1NN with the weighted Jaccard coefficient performs at least as well as the other three classifiers. 
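As an illustration of the 1NN classifier described in Section 4.2, the following minimal sketch assigns an unseen marker to the class of its nearest labelled neighbour, using Euclidean distance between co-occurrence distributions. It is not the implementation used in the experiments: the function names, the SUB_ and SUPER_ feature spelling and the counts are all invented for illustration, and the smoothed Kullback-Leibler and weighted Jaccard metrics are omitted.

```python
import math
from collections import Counter


def to_distribution(counts):
    """Normalise raw co-occurrence counts into a probability distribution."""
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse distributions stored as dicts."""
    features = set(p).union(q)
    return math.sqrt(sum((p.get(f, 0.0) - q.get(f, 0.0)) ** 2 for f in features))


def classify_1nn(target, labelled, distance=euclidean):
    """Return the class of the labelled marker nearest to the target.

    labelled maps each marker to a (distribution, class label) pair.
    """
    nearest = min(labelled, key=lambda m: distance(target, labelled[m][0]))
    return labelled[nearest][1]


# Toy counts invented for this sketch; they are not data from the paper.
train = {
    "but": (to_distribution(Counter({"SUB_not": 6, "SUB_still": 3, "SUPER_be": 1})), "NEG-POL"),
    "and": (to_distribution(Counter({"SUB_also": 5, "SUPER_be": 4, "SUB_not": 1})), "POS-POL"),
}
unseen = to_distribution(Counter({"SUB_not": 5, "SUB_still": 2, "SUPER_be": 3}))
print(classify_1nn(unseen, train))
```

Passing a different function as the distance argument would swap the metric without changing the classifier itself.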
We also compared the following combinations of different parts of speech: vb + aux, vb + in, vb + rb, nn + prp, vb + nn + prp, vb + aux + rb, vb + aux + in, vb + aux + nn + prp, nn + prp + in, DMs + rb, DMs + vb and DMs + rb + vb. The best results obtained using all combinations tried are shown in the last column of Table 5. For DMs + rb, DMs + vb and DMs + rb + vb we also tried weighting the co-occurrences so that the sums of the co-occurrences with each of verbs, adverbs and discourse markers were equal. However, this did not lead to any better results.</Paragraph> <Paragraph position="2"> One property that distinguishes the weighted Jaccard coefficient from the other metrics is that it weights features by the strength of their collocation. We were therefore interested to see which co-occurrences were most informative. Using Weka's feature selection utility, we ranked discourse marker co-occurrences by their information gain when predicting polarity, veridicality and type. The most informative co-occurrences are listed in Table 6. For example, if also occurs in the subordinate clause then the discourse marker is more likely to be ADDITIVE.</Paragraph> <Paragraph position="3"> The 1NN and Naive Bayes classifiers were then applied to co-occurrences with just the DMs that were most informative for each task. The results, shown in Table 8, indicate that the performance of indicate that a one dimensional feature belongs to the superordinate or subordinate clause, respectively. Weka's feature selection utility was also applied to all the linguistically motivated features described in Section 4.1.2. The most informative features are shown in Table 9. Naive Bayes was then applied using both all the linguistically motivated features, and just the most informative ones. 
The results are shown in Table 10.</Paragraph> </Section> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The results demonstrate that discourse markers can be classified along three different dimensions with an accuracy of over 90%. The best classifiers used a global algorithm (Naive Bayes), with co-occurrences with a subset of discourse markers as features. The success of Naive Bayes shows that with the right choice of features the classification task is highly separable. The high degree of accuracy attained on the type task suggests that there is empirical evidence for a distinct class of TEMPORAL markers.</Paragraph> <Paragraph position="1"> The results also provide empirical evidence for the correlation between certain linguistic features and types of discourse relation. Here we restrict ourselves to making just five observations. Firstly, verbs and adverbs are the most informative parts of speech when classifying discourse markers. This is presumably because of their close relation to the main predicate of the clause. Secondly, Table 6 shows that the discourse marker DM in the structure X, but/though/although Y DM Z is more likely to be signalling a positive polarity discourse relation between Y and Z than a negative polarity one. This suggests that a negative polarity discourse relation is less likely to be embedded directly beneath another negative polarity discourse relation. Thirdly, negation correlates with the main clause of NEG-POL discourse markers, and it also correlates with the subordinate clause of CAUSAL ones. Fourthly, NON-VERIDICAL correlates with second person pronouns, suggesting that a writer/speaker is less likely to make assertions about the reader/listener than about other entities. 
Lastly, the best results with knowledge poor features, i.e.</Paragraph> <Paragraph position="2"> lexical co-occurrences, were better than those with linguistically sophisticated ones. It may be that the sophisticated features are predictive of only certain subclasses of the classes we used, e.g. hypotheticals, or signallers of contrast.</Paragraph> </Section> </Paper>