File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0620_metho.xml
Size: 18,502 bytes
Last Modified: 2025-10-06 14:15:32
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0620"> <Title>Learning Discourse Relations with Active Data Selection</Title> <Section position="5" start_page="158" end_page="158" type="metho"> <SectionTitle> LOGICAL (CONSEQUENTIAL, ANTITHESIS) -- SEQUENCE (ADDITIVE, CONTRAST, INITIATION) -- ELABORATION (APPOSITIVE, COMPLEMENTARY, EXPANDING) </SectionTitle>
<Paragraph position="0"> dakara 'therefore', shitagatte 'thus', shikashi 'but', daga 'but', soshite 'and', tsugi-ni 'next', ippou 'in contrast', soretomo 'or', tokorode 'to change the subject', sonouchi 'in the meantime', tatoeba 'for example', yousuruni 'in other words', nazenara 'because', chinamini 'incidentally'</Paragraph>
<Paragraph position="2"> For a casual coder, RST turned out to be quite a difficult guideline to follow.</Paragraph>
<Paragraph position="3"> In Ichikawa (1990), discourse relations are organized into three major classes: the first class includes logical (or strongly semantic) relationships, where one sentence is a logical consequence or contradiction of another; the second class consists of sequential relationships, where two semantically independent sentences are juxtaposed; the third class includes elaboration-type relationships, where one of the sentences is semantically subordinate to the other.</Paragraph>
<Paragraph position="4"> In constructing a tagged corpus, we asked coders not to identify abstract discourse relations such as LOGICAL, SEQUENCE and ELABORATION, but to choose from a list of pre-determined connective expressions. We expected that the coder would be able to identify a discourse relation with far less effort when working with explicit cues than when working with abstract concepts of discourse relations.</Paragraph>
<Paragraph position="5"> Moreover, since 93% of the sentences considered for labeling in the corpus did not contain pre-determined relation cues, the annotation task was in effect one of guessing a possible connective cue that may go with a sentence. The advantage of using explicit cues to identify discourse relations is that even if one has little or no background in linguistics, he or she may be able to assign a discourse relation to a sentence by just asking whether the associated cue fits well with the sentence. In addition, in order to make the usage of cues clear and unambiguous, the annotation instructions carried a set of examples for each of the cues. Further, we developed an emacs-based software aid which guides the coder through a corpus and is also capable of prohibiting the coder from making moves inconsistent with the tagging instructions.</Paragraph>
<Paragraph position="6"> As it turned out, however, Ichikawa's scheme, using subclass relation types, did not improve agreement (κ = 0.33, three coders). So we modified the relation taxonomy so that it contains just two major classes, SEQUENCE and ELABORATION (LOGICAL relationships being subsumed under the SEQUENCE class), and assumed that a lexical cue marks the major class to which it belongs. The modification successfully raised the κ score to 0.70.
Collapsing the LOGICAL and SEQUENCE classes may be justified by noting that both types of relationships have to do with relating two semantically independent sentences, a property not shared by relations of the elaboration type.</Paragraph> </Section>
<Section position="6" start_page="158" end_page="162" type="metho"> <SectionTitle> 3 Learning with Active Data Selection </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="158" end_page="160" type="sub_section"> <SectionTitle> 3.1 Committee-based Sampling </SectionTitle>
<Paragraph position="0"> In the committee-based sampling method (CBS, henceforth) (Dagan and Engelson, 1995; Engelson and Dagan, 1996), a training example is selected from a corpus according to its usefulness; a preferred example is one whose addition to the training corpus improves the current estimate of a model parameter which is relevant to classification and also affects a large proportion of examples. CBS tries to identify such an example by randomly generating multiple models (committee members) based on posterior distributions of model parameters and measuring how much the member models disagree in classifying the example. The rationale for this is that disagreement among models over the class of an example would suggest that the example affects some parameters sensitive to classification, and furthermore that the estimates of the affected parameters are far from their true values. Since models are generated randomly from posterior distributions of model parameters, their disagreement on an example's class implies a large variance in the estimates of those parameters, which in turn indicates that the statistics of the parameters involved are insufficient and hence that the example should be added to the training corpus (so as to improve the statistics of the relevant parameters).</Paragraph>
<Paragraph position="1"> For each example it encounters, CBS goes through the following steps to decide whether to select the example for labeling.</Paragraph>
<Paragraph position="2"> 1. Draw k models (committee members) randomly from the probability distribution P(M | S) of models M given the statistics S of a training corpus.</Paragraph>
<Paragraph position="3"> 2. Classify an input example by each of the committee members and measure how much they disagree on classification.</Paragraph>
<Paragraph position="4"> 3. Make a biased random decision as to whether or not to select the example for labeling, such that a highly disagreed-upon example is more likely to be selected.</Paragraph>
<Paragraph position="5"> As an illustration of how this might work, consider the problem of tagging words with parts of speech, using a Hidden Markov Model (HMM). A (bigram) HMM tagger is typically given as:</Paragraph>
<Paragraph position="6"> T(w_1 \ldots w_n) = \arg\max_{t_1 \ldots t_n} \prod_{i} P(w_i \mid t_i)\, P(t_{i+1} \mid t_i) </Paragraph>
<Paragraph position="7"> where w1 ... wn is a sequence of input words, and t1 ... tn is a sequence of tags. For a sequence of input words w1 ... wn, the sequence of corresponding tags T(w1 ... wn) is one that maximizes the probability of reaching tn from t1 via ti (1 < i < n) and generating w1 ... wn along the way. The probabilities P(wi | ti) and P(ti+1 | ti) are called the model parameters of an HMM tagger. In Dagan and Engelson (1995), P(M | S) is given as the posterior multinomial distribution P(α1 = a1, ..., αn = an | S), where αi is a model parameter and ai represents one of its possible values. P(α1 = a1, ..., αn = an | S) represents the proportion of the times that each parameter αi takes the value ai, given the statistics S derived from a corpus. (Note that Σ_{ai} P(αi = ai | S) = 1.)
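Before turning to the word-drawing illustration below, the three selection steps can be summarized in a short sketch. This is a minimal, illustrative Python rendering rather than an actual implementation; draw_model, classify, disagreement and selection_probability are hypothetical placeholders for the components just described.

```python
import random

def cbs_select(example, draw_model, classify, disagreement, selection_probability, k=5):
    """Decide whether `example` should be sent to a human annotator.

    draw_model()             -- step 1: draw one committee member from P(M | S)
    classify(model, e)       -- step 2: the class a member assigns to e
    disagreement(votes)      -- step 2: how much the members disagree (e.g. vote entropy)
    selection_probability(d) -- step 3: map a disagreement value to a probability
    """
    committee = [draw_model() for _ in range(k)]              # step 1
    votes = [classify(m, example) for m in committee]         # step 2
    d = disagreement(votes)
    return random.random() < selection_probability(d)         # step 3: biased random decision
```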
For instance, consider the task of randomly drawing a word with replacement from a corpus consisting of 100 different words (w1, ..., w100). After 10 trials, you might have outcomes like w1 = 3, w2 = 1, ..., w55 = 2, ..., meaning that w1 was drawn three times, w2 was drawn once, w55 was drawn twice, etc. If you try another 10 times, you might get different results. A multinomial distribution tells you how likely you are to get a particular sequence of word occurrences. Dagan and Engelson's (1995) idea is to treat the distribution P(α1 = a1, ..., αn = an | S) as a set of binomial distributions, each corresponding to one of its parameters. An arbitrary HMM model is then constructed by randomly drawing a value ai for each parameter αi from its binomial distribution, which is approximated by a normal distribution. Given k such models (committee members) drawn from the multinomial distribution, we ask each of them to classify an input example. We decide whether to select the example for labeling based on how much the committee members disagree in classifying that example. Dagan and Engelson (1995) introduce the notion of vote entropy to quantify disagreement among members. Though one could use the kappa statistic (Siegel and Castellan, 1988) or other disagreement measures such as the α statistic (Krippendorff, 1980) instead of the vote entropy, in our implementation of CBS we decided to use the vote entropy, for lack of a reason to choose one statistic over another.</Paragraph>
<Paragraph position="10"> A precise formulation of the vote entropy is as follows:</Paragraph>
<Paragraph position="11"> V(e) = -\sum_{c} \frac{V(c, e)}{k} \log \frac{V(c, e)}{k} </Paragraph>
<Paragraph position="12"> Here e is an input example and c denotes a class. V(c, e) is the number of votes for c, and k is the number of committee members. A selection function is given in probabilistic terms, based on V(e):</Paragraph>
<Paragraph position="13"> P_{select}(e) = \frac{g}{\log k} \, V(e) </Paragraph>
<Paragraph position="14"> g here is called the entropy gain and is used to determine the number of times an example is selected; a greater g would increase the number of examples selected for tagging. Engelson and Dagan (1996) investigated several plausible approaches to the selection function but were unable to find significant differences among them.</Paragraph>
<Paragraph position="15"> At the beginning of the section, we mentioned some properties of 'useful' examples. A useful example is one which contributes to reducing the variance in parameter values and also affects classification. By randomly generating multiple models and measuring the disagreement among them, one would be able to tell whether an example is useful in the sense above; if there were a large disagreement, then one would know that the example is relevant to classification and also is associated with parameters with a large variance, and thus with insufficient statistics.</Paragraph>
<Paragraph position="16"> In the following section, we investigate how we might extend CBS for use in decision tree classifiers.</Paragraph> </Section>
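The two formulas above translate almost directly into code. The following is a minimal sketch under the assumption that a committee's output is simply the list of class labels its k members assign to an example; the function names are illustrative, and the selection rule follows the entropy-gain form reconstructed above.

```python
import math
import random
from collections import Counter

def vote_entropy(votes):
    """V(e) = -sum_c (V(c, e)/k) * log(V(c, e)/k), where `votes` lists the
    class label each of the k committee members assigned to example e."""
    k = len(votes)
    return -sum((n / k) * math.log(n / k) for n in Counter(votes).values())

def select_for_labeling(votes, gain=0.5):
    """Biased random decision: select e with probability (g / log k) * V(e),
    capped at 1.0 as a guard."""
    k = len(votes)
    p = min(1.0, (gain / math.log(k)) * vote_entropy(votes))
    return random.random() < p

# Example: a committee of 5 members voting on the relation of one sentence.
votes = ["SEQUENCE", "ELABORATION", "SEQUENCE", "ELABORATION", "ELABORATION"]
print(vote_entropy(votes), select_for_labeling(votes))
```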
<Section position="2" start_page="160" end_page="162" type="sub_section"> <SectionTitle> 3.2 Decision Tree Classifiers </SectionTitle>
<Paragraph position="0"> Since it is difficult, if not impossible, to express the model distribution of decision tree classifiers in terms of the multinomial distribution, we turn to the bootstrap sampling method to obtain P(M | S). The bootstrap sampling method provides a way of artificially establishing a sampling distribution for a statistic when the distribution is not known (Cohen, 1995).</Paragraph>
<Paragraph position="1"> For us, a relevant statistic would be the posterior probability that a given decision tree may occur, given the training corpus.</Paragraph>
<Paragraph position="2"> Repeat i = 1, ..., K times: 1. Draw a bootstrap pseudosample S*_i of size N from S by sampling with replacement as follows: Repeat N times: select a member of S at random and add it to S*_i. 2. Build a decision tree model M from S*_i. Add M to Ss.</Paragraph>
<Paragraph position="5"> S is a small set of samples drawn from the tagged corpus. Repeating the procedure 100 times would give 100 decision tree models, each corresponding to some pseudosample S*_i derived from the sample set S. Note that the bootstrap procedure allows a datum in the original sample to be selected more than once.</Paragraph>
<Paragraph position="6"> Given a sampling distribution of decision tree models, a committee can be formed by randomly selecting k models from Ss. Of course, there are some other approaches to constructing a committee for decision tree classifiers (Dietterich, 1998). One such approach, known as randomization, is to use a single decision tree and randomly choose a path at each attribute test. Repeating the process k times for each input example produces k models.</Paragraph>
<Paragraph position="7"> In the following, we describe a set of features used to characterize a sentence. As a convention, we refer to the current sentence as 'B' and the preceding sentence as 'A'.</Paragraph>
<Paragraph position="8"> <LocSen> defines the location of a sentence X in a text: LocSen(X) = #S(X) / #S(Last_Sentence). '#S(X)' denotes an ordinal number indicating the position of a sentence X in the text, i.e., #S(kth_sentence) = k (k >= 0), and 'Last_Sentence' refers to the last sentence in the text. LocSen takes a continuous value between 0 and 1: a text-initial sentence takes 0, and a text-final sentence 1.</Paragraph>
<Paragraph position="12"> <LocPar> is defined similarly to LocSen. It records information on the location of the paragraph in which a sentence X occurs: LocPar(X) = #Par(X) / #Last_Paragraph. '#Par(X)' denotes an ordinal number indicating the position of the paragraph containing X, and '#Last_Paragraph' is the position of the last paragraph in the text, represented by the ordinal number.</Paragraph>
<Paragraph position="16"> <LocWithinPar> records information on the location of a sentence X within the paragraph in which it appears: LocWithinPar(X) = (#S(X) - #S(Par_Init_Sen)) / (Length(Par(X)) - 1). 'Par_Init_Sen' refers to the initial sentence of the paragraph in which X occurs, and 'Length(Par(X))' denotes the number of sentences that occur in that paragraph. LocWithinPar takes continuous values ranging from 0 to 1: a paragraph-initial sentence takes 0 and a paragraph-final sentence 1.</Paragraph>
<Paragraph position="19"> <LenText> the length of the text, measured in Japanese characters.</Paragraph>
<Paragraph position="20"> <LenSenA> the length of A in Japanese characters.</Paragraph>
<Paragraph position="21"> <LenSenB> the length of B in Japanese characters.</Paragraph>
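The positional and length features defined so far can be computed directly from a segmented text. Below is a minimal sketch assuming a text is represented as a list of paragraphs, each a list of sentence strings; the representation and the names are illustrative assumptions, not an actual implementation.

```python
def positional_features(paragraphs, p, s):
    """Positional/length features for sentence B = paragraphs[p][s],
    with A taken to be the sentence immediately preceding it.
    `paragraphs` is a list of paragraphs, each a list of sentence strings."""
    sentences = [sent for par in paragraphs for sent in par]
    # ordinal position of B in the whole text (#S(X), counting from 0)
    abs_index = sum(len(par) for par in paragraphs[:p]) + s

    feats = {}
    # LocSen: 0 for the text-initial sentence, 1 for the text-final one.
    feats["LocSen"] = abs_index / (len(sentences) - 1) if len(sentences) > 1 else 0.0
    # LocPar: relative position of the containing paragraph in the text.
    feats["LocPar"] = p / (len(paragraphs) - 1) if len(paragraphs) > 1 else 0.0
    # LocWithinPar: 0 for a paragraph-initial, 1 for a paragraph-final sentence.
    par_len = len(paragraphs[p])
    feats["LocWithinPar"] = s / (par_len - 1) if par_len > 1 else 0.0
    # Lengths in characters (the text is assumed to be Japanese).
    feats["LenText"] = sum(len(sent) for sent in sentences)
    feats["LenSenB"] = len(paragraphs[p][s])
    feats["LenSenA"] = len(sentences[abs_index - 1]) if abs_index > 0 else 0
    return feats
```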
<Paragraph position="22"> <Sim> encodes the lexical similarity between A and B, based on an information-retrieval measure known as tf.idf (Salton and McGill, 1983).2 One important feature here is that we defined similarity based on (Japanese) characters rather than on words: in practice, we broke up nominals from the relevant sentences into single alphabetical characters (including graphemes) and used them to measure the similarity between the sentences. (Thus, in our setup, xi in footnote 2 corresponds to one character, and not to one whole word.) We did this to deal with abbreviations and rewordings, which we found to be quite frequent in the corpus.</Paragraph>
<Paragraph position="23"> 2 For a word j in a sentence Si (j in Si), its weight wij is defined by: w_{ij} = tf_{ij} \cdot \log(N / df_j), where tf_{ij} is the frequency of j in Si, df_j is the number of sentences in the text which have an occurrence of the word j, and N is the total number of sentences in the text. The tf.idf metric has the property of favoring high-frequency words with a local distribution. For a pair of sentences X = (x1, ...) and Y = (y1, ...), where x and y are words, we define the lexical similarity between X and Y as: Sim(X, Y) = \frac{2 \sum_i w(x_i) w(y_i)}{\sum_i w(x_i)^2 + \sum_i w(y_i)^2}, where w(xi) represents the tf.idf weight assigned to the term xi. The measure is known as the Dice coefficient (Salton and McGill, 1983).</Paragraph>
<Paragraph position="24"> <Cue> takes a discrete value, 'y' or 'n'. The cue feature is intended to exploit the surface cues most relevant for distinguishing between the SEQUENCE and ELABORATION relations. The feature takes 'y' if a sentence contains one or more cues relevant to distinguishing between the two relation types. We considered word n-grams of length up to 5 found in the training corpus. Out of these, those whose INFO_X values are below a particular threshold are included in the set of cues.3 If a sentence contains one of the cues in the set, it is marked 'y', and 'n' otherwise. The cutoff is determined in such a way as to minimize INFO_Cue(T), where T is the set of sentences (represented with features) in the training corpus. We had a total of 90 cue expressions. Note that using a single binary feature for cues alleviates the data-sparseness problem: though some of the cues may have low frequencies, they are aggregated to form a single cue category with a sufficient number of instances. In the training corpus, which contained 5221 sentences, 1914 sentences are marked 'y' and 3307 are marked 'n' with the cutoff at 0.85, which was found to minimize the entropy of the distribution of relation types. It is interesting to note that the entropy strategy was able to pick up cues which could be linguistically motivated (Table 2). In contrast to Samuel et al. (1998), we did not consider relation cues reported in the linguistics literature, since they would be useless unless they contributed to reducing the cue entropy. They may be linguistically 'right' cues, but their utility in the machine-learning context is not known.</Paragraph>
<Paragraph position="25"> <PrevRel> makes available information about the relation type of the preceding sentence. It has two values: ELA for the elaboration relation and SEQ for the sequence relation.</Paragraph>
<Paragraph position="26"> 3 INFO_X(T) measures the entropy of the distribution of classes in a set T with respect to a feature X. We define INFO_X just as given in Quinlan (1993): INFO_X(T) = \sum_i \frac{|T_i|}{|T|} \cdot INFO(T_i), where T_i represents the partition of T corresponding to one of the values for X. INFO(T) is defined as follows: INFO(T) = -\sum_j \frac{freq(C_j, T)}{|T|} \cdot \log_2 \frac{freq(C_j, T)}{|T|}, where freq(C, T) is the number of cases from class C in a set T of cases.</Paragraph>
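To make the entropy-based cue selection concrete, here is a minimal sketch that computes INFO(T) and INFO_X(T) as defined in footnote 3 and keeps candidate n-grams whose entropy falls below a threshold (0.85 above). The data layout and names are illustrative assumptions; the search for the cutoff that minimizes INFO_Cue(T) is omitted.

```python
import math
from collections import Counter

def info(labels):
    """INFO(T): class entropy of a set of labeled cases (Quinlan, 1993)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def info_x(split):
    """INFO_X(T): entropy after partitioning T by the values of feature X.
    `split` maps each feature value to the list of class labels in that partition."""
    total = sum(len(labels) for labels in split.values())
    return sum((len(labels) / total) * info(labels) for labels in split.values())

def select_cues(sentences, labels, candidates, threshold=0.85):
    """Keep candidate n-grams whose INFO_X value falls below the threshold.
    `sentences` are strings, `labels` their relation types (SEQ/ELA),
    `candidates` a list of n-gram strings."""
    cues = []
    for ngram in candidates:
        split = {"y": [], "n": []}
        for sent, lab in zip(sentences, labels):
            split["y" if ngram in sent else "n"].append(lab)
        if info_x(split) < threshold:
            cues.append(ngram)
    return cues
```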
<Paragraph position="27"> Table 2. Cues selected by the entropy strategy: mata 'on the other hand', doujini 'at the same time', ippou 'in contrast', sarani 'in addition', mo (topic marker), ni-tsuite-wa 'regarding', tameda 'the reason is that', kekka 'as a result', ga-nerai 'the goal is that'.</Paragraph>
<Paragraph position="28"> In the Japanese linguistics literature, there is a popular theory that sentence endings are relevant for identifying semantic relations among sentences. Some of the sentence endings are inflectional categories of verbs, such as PAST/NON-PAST and INTERROGATIVE, and there are also morphological categories like nouns and particles (e.g., question markers). Based on Ichikawa (1990), we defined six types of sentence-ending cues and marked a sentence according to whether it contains a particular type of cue. Included in the set are inflectional forms of the verb and the verbal adjective (PAST/NON-PAST), morphological categories such as COPULA and NOUN, parentheses (quotation markers), and sentence-final particles such as -ka. We use the following two attributes to encode information about sentence-ending cues.</Paragraph>
<Paragraph position="29"> <EndCueA> records information about the sentence-ending form of the preceding sentence. It takes a discrete value from 0 to 6, with 0 indicating the absence of relevant cues in the sentence.</Paragraph>
<Paragraph position="30"> <EndCueB> Same as above, except that this feature is concerned with the sentence-ending form of the current sentence, i.e. the 'B' sentence.</Paragraph>
<Paragraph position="31"> Finally, we have two classes, ELABORATION and SEQUENCE.</Paragraph> </Section> </Section> </Paper>