<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1125">
  <Title>Discourse Parsing: A Decision Tree Approach</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Attempts at the automatic identification of discourse structure have so far met with limited success in the computational linguistics literature.</Paragraph>
    <Paragraph position="1"> Part of the reason is that, compared to the sizable data resources available to parsing research, such as the Penn Treebank (Marcus et al., 1993), large corpora annotated for discourse information are hard to come by. Researchers in discourse usually work with a corpus of a few hundred sentences (Kurohashi and Nagao, 1994; Litman and Passonneau, 1995; Hearst, 1994). The lack of a large-scale corpus has made it impossible to report results of discourse studies with a sufficient degree of reliability.</Paragraph>
    <Paragraph position="2"> In the work described here, we created a corpus with discourse information, containing 645 articles from a Japanese economic paper, an order of magnitude larger than any previous work on discourse processing. It had a total of 12,770 sentences and 5,352 paragraphs. Each article in the corpus was manually annotated for a &amp;quot;discourse dependency&amp;quot; relation. We built a statistical discourse parser based on the C4.5 decision tree method (Quinlan, 1993), which we then evaluated. The design of the parser was inspired by Haruno's (1997) work on statistical sentence parsing.</Paragraph>
    <Paragraph position="3"> The paper is organized as follows. Section 2 presents general ideas about statistical parsing as applied to discourse. After a brief introduction to some key points of the decision tree model, we discuss how to incorporate a decision tree into a statistical parsing model. In Section 3, we explain how we built the annotated corpus. There we also describe the procedure of the experiments we conducted, and conclude the section with their results.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="220" type="metho">
    <SectionTitle>
2 Statistical Discourse Parsing
</SectionTitle>
    <Paragraph position="0"> First, let us be clear about what we mean by parsing a discourse. The job of parsing is to find whatever dependencies there are among the elements that make up a particular linguistic unit. In discourse parsing, the elements among which one looks for dependencies are sentences, and the unit under investigation is a discourse.</Paragraph>
    <Paragraph position="1"> We take a naive approach to the notion of a dependency here. We think of it as a relationship between a pair of sentences such that the interpretation of one sentence in some way depends on that of the other.</Paragraph>
    <Paragraph position="2"> Thus a dependency relationship is not a structural one, but rather a semantic or rhetorical one.</Paragraph>
    <Paragraph position="3"> The job of a discourse parser is to take as input a discourse, i.e., a set of sentences that make up a discourse, and to produce as output a parse, i.e., a set of dependency relations (which may give rise to a tree-like structure, as in Figure 1). In statistical parsing, this can be formulated as the problem of finding a best parse with a model P(T | D), where T is a set of dependencies and D a discourse.</Paragraph>
    <Paragraph position="4"> Tbest = argmax_T P(T | D). Tbest is the set of dependencies that maximizes the probability P(T | D). Further, we assume that a discourse D is a set of sentences marked for some pre-defined set of features F = {f1,...,fn}. Let CF(S1) be a characterization of sentence S1 in terms of a feature set F. Then for D = {S1,...,Sm},</Paragraph>
    <Paragraph position="6"> 'A &lt;- B' reads as &amp;quot;sentence B is dependent on sentence A&amp;quot;, where A, B ∈ {S1,...,Sm}. The probability of T being an actual parse of discourse D is estimated as the product of the probabilities of its element dependencies when the discourse has the representation CF(D). We make the usual assumption that element dependencies are probabilistically independent.</Paragraph>
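As an illustration of the product model above, the following is a minimal sketch: the three-sentence discourse and the probability table are hypothetical, and dep_prob stands in for the decision-tree estimate introduced later. Since dependencies are assumed independent, choosing the best parent per sentence maximizes the product.

```python
def best_parse(sentences, dep_prob):
    """For each sentence, pick the preceding sentence it most likely
    depends on; P(T | D) is the product of the (independent)
    per-dependency probabilities, normalized over each sentence's
    candidate parents."""
    deps = {}
    p_total = 1.0
    for j in range(1, len(sentences)):   # the first sentence has no parent
        cands = list(range(j))           # backward dependencies only
        z = sum(dep_prob(i, j) for i in cands)
        i_best = max(cands, key=lambda i: dep_prob(i, j))
        deps[j] = i_best
        p_total *= dep_prob(i_best, j) / z
    return deps, p_total

# Hypothetical probability table for a three-sentence discourse:
table = {(0, 1): 0.9, (0, 2): 0.3, (1, 2): 0.7}
deps, p = best_parse(["S0", "S1", "S2"], lambda i, j: table[(i, j)])
```

Because of the independence assumption, the per-sentence argmax is also the global argmax over parses.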
    <Section position="1" start_page="216" end_page="217" type="sub_section">
      <SectionTitle>
2.1 Decision Tree Model
</SectionTitle>
      <Paragraph position="0"> The general framework for discourse parsing described above is thus not much different from that for statistical sentence parsing. The differences, however, lie in the makeup of the feature set F. Rather than using information on word forms, word counts, and part-of-speech tags, as in much research on statistical sentence parsing, we exploit as much information as can be gleaned from a discourse, such as lexical cohesion, distance, location, and clue words, to characterize a sentence. It is therefore important not to end up with a mountain of irrelevant features.</Paragraph>
      <Paragraph position="1"> The decision tree method represents one approach to classification problems, in which features are ranked according to how much they contribute to a classification, and models are then built with the features most relevant to that classification. Suppose, for example, that you work for a travel agency and want to find out which features of a hotel matter most to tourists, based on data from your customers like that in Table 1. With decision tree techniques, you would be able to tell which features are most closely associated with customers' preferences.</Paragraph>
      <Paragraph position="2"> The aim of the decision tree approach is to induce rules from data that best characterize classes.</Paragraph>
      <Paragraph position="3"> A particular approach called C4.5 (Quinlan, 1993), which we adopt here, builds rules by recursively dividing the training data into subsets until all divisions contain only single-class cases. (In Table 1, 'Bath/shower' means a room has a bath, a shower, or none; 'Time' means the travel time in minutes from an airport; 'Class' indicates whether a particular hotel is a customer's choice.)</Paragraph>
      <Paragraph position="4"> The subset in which a particular case is placed is determined by the outcome of a 'test' on that case. Let us explain how this works by way of the hotel example above. Suppose that the first test is 'bath/shower', which has three outcomes: bath, shower, and none. Then the data set breaks up into three groups, {1,4,5} (bath), {2,3,7} (shower), and {6} (none). Since the last group {6} consists of only a single case, there is no further division of that group. The bath group, being a multi-class set, is further divided by the test 'room rate', which produces two subdivisions, one with {1} (expensive) and the other with {4,5} (moderate).</Paragraph>
      <Paragraph position="5"> Each set now consists of only single-class cases.</Paragraph>
      <Paragraph position="6"> For the shower group, applying the time test (&lt;= 15) produces two subsets, one with {3} and the other with {2,7}. 1 Each now contains cases from a single class. A decision tree for the divisions we made is shown in Figure 2.</Paragraph>
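The hand-made divisions just described can be replayed programmatically. The table below is a partial reconstruction of Table 1: the bath/shower values, the room rates of the bath group, and the shower group's times are fixed by the text, while the remaining values are invented to stay consistent with the splits described.

```python
# Partial reconstruction of Table 1; values not fixed by the text
# are filled in only for illustration.
data = {
    1: {"bath": "bath",   "time": 20, "rate": "expensive"},
    2: {"bath": "shower", "time": 20, "rate": "inexpensive"},
    3: {"bath": "shower", "time": 10, "rate": "inexpensive"},
    4: {"bath": "bath",   "time": 20, "rate": "moderate"},
    5: {"bath": "bath",   "time": 10, "rate": "moderate"},
    6: {"bath": "none",   "time": 10, "rate": "inexpensive"},
    7: {"bath": "shower", "time": 20, "rate": "inexpensive"},
}

def split(cases, test):
    """Group case ids by the outcome of a test on each case."""
    groups = {}
    for cid in cases:
        groups.setdefault(test(data[cid]), []).append(cid)
    return groups

g1 = split(data, lambda c: c["bath"])                # first test: bath/shower
g2 = split(g1["bath"], lambda c: c["rate"])          # bath group: room rate
g3 = split(g1["shower"], lambda c: c["time"] <= 15)  # shower group: time test
```

Each recursive call on a multi-class subset mirrors one division step in the walkthrough above.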
      <Paragraph position="7"> Now compare the hand-created decision tree in Figure 2 with the one in Figure 3, which is generated by C4.5 for the same data. 1 Here we choose the midpoint between 10 and 20, as C4.5 does.</Paragraph>
      <Paragraph position="9"> Surprisingly, the latter tree consists of only one test node. This happens because C4.5 ranks possible tests, which we did not, and applies the one that gives the most effective partitioning of the data, based on information-theoretic criteria known as the gain criterion and the gain ratio criterion. 2 The intuitive idea behind these criteria is to prefer a test with the least entropy, i.e., a test that partitions the data in such a way that a particular class becomes dominant in each subset it creates. Thus the feature that best accounts for the class distribution in the data is always chosen in preference to others. For the data in Table 1, C4.5 determined that the test room rate is the best class identifier and that everything else is irrelevant to identifying the classes. All that is needed to account for the class distribution in Table 1 turns out to be just one feature. So we might as well conclude that the customers are interested only in the room charge when they pick a hotel.</Paragraph>
      <Paragraph position="10"> 2 The gain criterion measures the effectiveness of partitioning a data set T with respect to a test X, and is defined as follows: gain(X) = info(T) - info_X(T). Define info(T) to be the entropy of T, that is, the average amount of information generated by T. Then we have:</Paragraph>
      <Paragraph position="12"> freq(C,T) is the number of cases from a class C, divided by the total number of cases in T. Now info_X(T) is the average amount of information generated by partitioning T with respect to a test X. That is,</Paragraph>
      <Paragraph position="14"> Thus a good classifier gives a small value for info_X(T) and hence a large value for gain(X).</Paragraph>
      <Paragraph position="15"> The gain ratio criterion is a modification of the gain criterion. It has the effect of penalizing tests that split the data into many small subsets.</Paragraph>
      <Paragraph position="16"> gain ratio(X) = gain(X) / split info(X), where split info(X) = - Σ_i (|Ti| / |T|) log2(|Ti| / |T|), summing over the subsets T1,...,Tn produced by test X. The ratio decreases as the number of splits increases.</Paragraph>
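The gain and gain ratio criteria can be computed directly from class counts. The following sketch uses a toy label set rather than the hotel data, simply to show the definitions at work.

```python
from math import log2
from collections import Counter

def info(labels):
    """info(T): entropy of the class distribution in T, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(parent, subsets):
    """gain(X) = info(T) - info_X(T), where info_X(T) is the
    size-weighted average entropy of the subsets a test X produces."""
    n = len(parent)
    return info(parent) - sum(len(s) / n * info(s) for s in subsets)

def gain_ratio(parent, subsets):
    """gain ratio(X) = gain(X) / split info(X)."""
    n = len(parent)
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)
    return gain(parent, subsets) / split_info

labels = ["yes", "yes", "no", "no"]
perfect = [["yes", "yes"], ["no", "no"]]   # a test separating the classes
print(gain(labels, perfect))               # 1.0: all uncertainty removed
```

A test producing many tiny subsets would inflate split info(X) and so lower the gain ratio, which is exactly the bias correction the footnote describes.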
      <Paragraph position="17"> A benefit of using the decision tree method is that it enables us to identify relevant features for classification and disregard those that are not relevant, which is particularly useful for a task such as ours, where a large number of features are potentially involved and their relevance to classification is not always known.</Paragraph>
    </Section>
    <Section position="2" start_page="217" end_page="220" type="sub_section">
      <SectionTitle>
2.2 Parsing with Decision Tree
</SectionTitle>
      <Paragraph position="0"> As we mentioned in Section 2, we define discourse parsing as the task of finding a best tree T, or a set of dependencies among sentences that maximizes P(T | D).</Paragraph>
      <Paragraph position="2"> What we do now is to equip the model with a feature selection functionality. This can be done by assuming:</Paragraph>
      <Paragraph position="4"> DTF is a decision tree constructed with a feature set F by C4.5. 'X &lt; B' means that X is a sentence that precedes B. P(X &lt;- Y | CF(D), DTF) is the probability that sentence Y depends on sentence X when both CF(D) and DTF are used. We estimate P using class distributions from the decision tree DTF. For example, the numbers in parentheses after the leaves of the decision tree in Figure 3 indicate the number of cases that reach a particular leaf, together with the number of misclassified cases. Thus the leaf labeled inexpensive has a total of 4 cases, one of which is misclassified: 3 cases are correctly classified as &amp;quot;NO&amp;quot; and one is wrongly classified. The class distribution is thus 3/4 for &amp;quot;NO&amp;quot; and 1/4 for &amp;quot;YES&amp;quot;. In practice, however, we slightly correct class frequencies using Laplace's rule of succession, i.e., x/n becomes (x + 1)/(n + 2).</Paragraph>
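The Laplace correction is a one-line computation; the numbers below replay the inexpensive-leaf example from the text (3 NO and 1 YES out of 4 cases).

```python
def laplace(x, n):
    """Laplace's rule of succession: correct a raw class
    frequency x/n to (x + 1) / (n + 2)."""
    return (x + 1) / (n + 2)

# The inexpensive leaf: 4 cases, 3 classified NO and 1 YES.
p_no, p_yes = laplace(3, 4), laplace(1, 4)   # 2/3 and 1/3 instead of 3/4, 1/4
```

For a two-class leaf the corrected values still sum to 1, and the correction keeps small leaves from assigning probability 0 to a class.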
      <Paragraph position="5"> Now suppose that we have a discourse D = {...,Si,...,Sj,...,Sk,...} and want to know what Si depends on, assuming that Si depends on either Sj or Sk. Finding that out involves constructing CF(D) and DTF. 3 Note that here we are in effect making a claim about the structure of a discourse, namely that a sentence modifies one that precedes it. Changing the condition to something like 'X ∈ D, X ≠ B' allows one to have forward as well as backward dependencies.</Paragraph>
      <Paragraph position="7"> Let us represent sentences Sj and Sk in terms of how far they are separated from Si, measured in sentences. Suppose that dist(Sj) = 2 and dist(Sk) = 4; that is, sentence Sj appears 2 sentences ahead of Si and Sk 4 sentences ahead. Assume further that we have a decision tree, constructed from data elsewhere, that looks like Figure 4.</Paragraph>
      <Paragraph position="8"> With CF(D) and DTF at hand, we are now in a position to find P(A &lt;- B | CF(D)) for each possible dependency, Sj &lt;- Si and Sk &lt;- Si.</Paragraph>
      <Paragraph position="10"> Since Si links with either Sj or Sk, by Equation 1, we normalize the probability estimates so that they sum to 1.</Paragraph>
      <Paragraph position="12"> Recall that class frequencies are corrected by Laplace's rule. Let Tj = {Sj &lt;- Si} and Tk = {Sk &lt;- Si}. Then P(Tj | D) &gt; P(Tk | D). Thus Tbest = Tj.</Paragraph>
      <Paragraph position="13"> We conclude that Si is more likely to depend on Sj than on Sk.</Paragraph>
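The normalize-and-compare step of the Sj/Sk example can be sketched as follows; the leaf probabilities are hypothetical stand-ins for values read off a tree like Figure 4.

```python
def choose_parent(candidates, yes_prob):
    """Normalize the tree's 'yes' probabilities over the candidate
    parents (so they sum to 1) and pick the most likely parent."""
    total = sum(yes_prob[c] for c in candidates)
    scores = {c: yes_prob[c] / total for c in candidates}
    return max(scores, key=scores.get), scores

# Hypothetical leaf probabilities, e.g. closer sentences link more often:
yes_prob = {"Sj": 0.8, "Sk": 0.2}
parent, scores = choose_parent(["Sj", "Sk"], yes_prob)
```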
      <Paragraph position="14"> The following lists the set of features we used to encode a discourse. As a convention, we refer to the sentence for which we would like to find a dependency as 'B', and a sentence preceding 'B' as 'A'.</Paragraph>
      <Paragraph position="15"> &lt;DistSen&gt; records information on how far ahead A appears from B, measured in sentences.</Paragraph>
      <Paragraph position="17"> '#S(X)' denotes the ordinal number indicating the position of a sentence X in a text, i.e., #S(kth_sentence) = k. 'Max_Sen_Distance' denotes the distance, measured in sentences, from B to A when B occurs farthest from A, i.e., #S(last_sentence_in_text) - 1. DistSen thus takes continuous values between 0 and 1. We discard texts which contain no more than one sentence.</Paragraph>
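A sketch of the DistSen computation as defined above, with 1-based sentence positions; the example positions are invented.

```python
def dist_sen(pos_a, pos_b, n_sentences):
    """DistSen = (#S(B) - #S(A)) / Max_Sen_Distance, where
    Max_Sen_Distance = #S(last sentence) - 1 (1-based positions).
    Single-sentence texts are discarded, so the divisor is >= 1."""
    return (pos_b - pos_a) / (n_sentences - 1)

# A is the 3rd and B the 5th sentence of a 6-sentence text:
print(dist_sen(3, 5, 6))   # 0.4
```

The normalization keeps the feature comparable across texts of different lengths, which is why the raw sentence distance is divided by the maximum possible distance.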
      <Paragraph position="18"> &lt;DistPax&gt; is defined similarly to DistSen, except that the distance is measured in paragraphs.</Paragraph>
      <Paragraph position="19"> 'Par_Init_Sen' refers to the initial sentence of the paragraph in which X occurs; 'Length(Par(X))' denotes the number of sentences in that paragraph. LocWithinPar takes continuous values ranging from 0 to 1: a paragraph-initial sentence has the value 0, and a paragraph-final sentence 1.</Paragraph>
      <Paragraph position="20"> &lt;LenText&gt; the length of a text, measured in Japanese characters.</Paragraph>
      <Paragraph position="21"> &lt;LenSenA&gt; the length of A in Japanese characters. &lt;LenSenB&gt; the length of B in Japanese characters. &lt;Sire&gt; gives information on the lexical similarity between A and B, based on an information-retrieval measure known as tf * idf. 4 One important point  hereis that we did not use words per se in measuring the similarity. What we did was to break up nominals from sentences into simple characters (grapheme) and use only them to measure the similarity. We did this to deal with abbreviations and rewordings, which we found quite frequent in the corpus we used.</Paragraph>
      <Paragraph position="22"> &lt;Sire2&gt; same as Sire feature, except that the similarity is measured between A and Par(B), a paragraph in which B occurs. We define Siva2 as 'SIM(A,Concat(Par(B)))' (see footnote 4 for the definition of SIM), where 'Concat(Par(B))' is a concatenation of sentences in Par(B).</Paragraph>
      <Paragraph position="23"> &lt;IsATit\].e&gt; indicates whether A is a title. We regarded a title as a special sentence that initiates a discourse.</Paragraph>
      <Paragraph position="24"> &lt;Clues&gt; differs from features above in that it does not refer to any single feature but is a collective term for a set of clue-related features, each of which is used to indicate the presence or absence of a relevant clue in A and B. We examined N most frequent words found in a corpus and associated each with a different clue feature. We experimented with cases where N is 0, 100, 500 and 1000. A sentence can be marked for a multiple number of clue expressions at the same time. For a clue c, an associated Clues feature d takes one of the four values, depending on the way c appears in A and B. c' = 0 if c appears in neither A or B; d = 1 if c appears in both A and B; d = 2 if c appears in A and not in B; and d = 3 if c appears not in A but in B. We consider clue expressions from the following grammatical classes: nominals, adjectives, demonstratives, adverbs, sentence connectives, verbs, sentence-final particles, topic-marking particles, and punctuation marks. 5 While we did not consider a complex clue expression, which can be made up of multiple elements from various grammatical classes 6 , it is posd/j is the number of sentences in the text which have an occurrence of a word j. N is the total number of sentences in the text. The tf.idf metric has the property of favoring high frequency words with local distribution. For a pair of senfences X = {zl .... } and \]&amp;quot; = {YI,--.}, where z and y are words, we define the lexical similarity between X and Y by:</Paragraph>
      <Paragraph position="26"> They- are extracted from a corpus by a Japanese tokenizer program (Sakurai and Hisamitsu, 1997).</Paragraph>
      <Paragraph position="27"> 6 English examples would be for example, as a result, etc., which are thought of as an indicator of a discourse relationship.</Paragraph>
      <Paragraph position="28">  grammar term of a class of numerals. Since there are infinitely many of them, we decided not to treat them individually, but to represent them collectively with a single feature .~uushi.</Paragraph>
      <Paragraph position="29">  sible to think of a complex clue in terms of its component clues for which a sentence is marked. Classes For a sentence pair A and B, the class is either yes or no, corresponding to the presence or absence of a dependency link from B to A.</Paragraph>
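The four-valued Clues encoding can be written as a small function. Sentences are represented here simply as sets of surface expressions, and the Japanese clue words are hypothetical examples.

```python
def clue_value(clue, sent_a, sent_b):
    """Four-valued Clues encoding: 0 = clue in neither sentence,
    1 = in both, 2 = in A only, 3 = in B only."""
    in_a, in_b = clue in sent_a, clue in sent_b
    if in_a and in_b:
        return 1
    if in_a:
        return 2
    if in_b:
        return 3
    return 0

# Sentences as sets of expressions (illustration only):
a = {"shikashi", "kare"}   # hypothetical clue expressions
b = {"kare"}
```

One such value is produced per clue, so a pair (A, B) is encoded by a vector of N clue features alongside the distance and similarity features.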
      <Paragraph position="30"> The features above are more or less plucked from the air. Some are motivated, and some are less so. Our strategy here, however, is to rely on the decision tree mechanism to select 'good' features and filter out features that are not relevant to the class identification.</Paragraph>
      <Paragraph position="31">  Let us make further notes on how to encode a discourse with the set of features we have described. We characterize a sentence in relation to its potential &amp;quot;modifyee&amp;quot; sentence, a sentence in a discourse which it is likely to depend on. Thus encoding is based on a pair of sentences, rather than on a single sentence. For example, a discourse</Paragraph>
      <Paragraph position="33"> may want to constrain P by restricting the attention to pairs of a particular type. If we are interested only in backward dependencies, then we will have</Paragraph>
      <Paragraph position="35"> In the experiments, we assumed that a discourse consists of backward dependency pairs and encoded each pair with the set of features above. The assumptions we made about the structure of a discourse are the following: 1. Every sentence in a discourse has exactly one preceding &amp;quot;modifyee&amp;quot; to link to.</Paragraph>
      <Paragraph position="36"> 2. A discourse may have crossing dependencies.</Paragraph>
    </Section>
  </Section>
</Paper>