<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0402">
  <Title>Mining Discourse Markers for Chinese Textual Summarization</Title>
  <Section position="4" start_page="12" end_page="13" type="metho">
    <SectionTitle>
3 SIFAS System Architecture
</SectionTitle>
    <Paragraph position="0"> From the perspective of discourse analysis, the study of discourse markers basically involves four distinct but fundamental issues: 1) the occurrence and the frequency of occurrence of discourse markers (Moser and Moore 1995), 2) determining whether a candidate linguistic item is a discourse marker (identification / disambiguation) (Hirschberg and Litman 1993; Siegel and McKeown 1994), 3) determination or selection of the discourse function of an identified discourse marker (Moser and Moore 1995), and 4) the coverage capabilities (in terms of levels of embedding) among rhetorical relations, as well as among individual discourse markers. Discussion of these problems for Chinese compound sentences can be found in Wang et al. (1994).</Paragraph>
    <Paragraph position="1"> Previous attempts to address the above problems in Chinese text have usually been based on the investigators' intuition and knowledge, or on a small number of constructed examples. In our current research, we adopt heuristics-based  corpus-based learning to discover the correlation between various linguistic features and different aspects of approaches, and use machine discourse marker usage. Our research framework  Data in the segmented corpus are divided into two sets of texts, namely, the training set and :the test set, each of which includes 40 editorials in :our present research. Texts in the training set are . manually and semi-automatically tagged to reflect where, the properties of every Candidate Discourse DMi: Marker (CDM). Texts in the test set are automatically tagged and proofread. Different algorithms, depending on the features being RRi: investigated, are derived to automatically extract the interesting features to form a feature database. RPi: Machine learning algorithms are then applied to the feature database to generate linguistic rules (decision trees) reflecting the characteristics of various discourse markers and the relevant CT~: rhetorical relations. For every induced rule (or a combination of them), its performance is evaluated by tagging the discourse markers appearing in thtest set of the corpus.</Paragraph>
  </Section>
  <Section position="5" start_page="13" end_page="15" type="metho">
    <SectionTitle>
4 A Framework for Tagging
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="13" end_page="15" type="sub_section">
      <SectionTitle>
Discourse Markers
</SectionTitle>
      <Paragraph position="0"> The following coding scheme is designed to encode all and only Real Discourse Markers RN~: (RDM) appearing in the SIFAS corpus. We describe the i th discourse marker with a 7-tuple RDMi, RDMi=&lt; DMI, RR/, RPI, CTi, MNi, RNI, &gt; the lexical item of the Discourse Marker, or the value 'NULL'.</Paragraph>
      <Paragraph position="1"> the Rhetorical Relation in which DMi is one of the constituting markers.</Paragraph>
      <Paragraph position="2"> the Relative Position of DM;. The value of RPi can be either 'Front' or 'Back' denoting the relative position of the marker in the rhetorical relation RRi.</Paragraph>
      <Paragraph position="3"> the Connection Type of RRi. The value of CT~ can be either 'Inter&amp;quot; or 'Intra', which indicates that the DM~ functions as a discourse marker in an inter-sentence relation or an Intra-sentence relation.</Paragraph>
      <Paragraph position="4"> the Discourse Marker Sequence Number.</Paragraph>
      <Paragraph position="5"> The value of MNi is assigned sequentially from the beginning of the processed text to the end.</Paragraph>
      <Paragraph position="6"> the Rhetorical Relation Sequence Number. The value of RNi is assigned</Paragraph>
      <Paragraph position="8"> sequentially to the corresponding rhetorical relation RR; in the text.</Paragraph>
      <Paragraph position="9"> OTi: the Order Type of RR;. The value of OTi can be 1, -1 or 0, denoting respectively the normal order, reverse order or irrelevance of the premise-consequence ordering of RRI.</Paragraph>
      <Paragraph position="10"> For Apparent Discourse Markers (ADM) that do not function as real discourse markers in a text, a different 3-tuple coding scheme is used to encode them: ADM~ = &lt; LIi, *, SNi &gt; where, LIi: the Lexical Item of the ADM.</Paragraph>
      <Paragraph position="11"> SNi: the Sequence Number of the ADM.</Paragraph>
      <Paragraph position="12"> To illustrate the above coding scheme consider the following examples of encoded sentences where every CDM has been tagged to be either a 7-tuple or a 3-tuple.</Paragraph>
      <Paragraph position="13"> Example 1  &lt;vouvu ('because').Causalitv. Front. lntra. 2. 2. /&gt; Zhu Pei ('Jospin') zhengfu ('government') taidu ('attitude') qiangying ('adamant'), chaoye ('government-public') duikang ('confrontation') yue-yan-yue ('more-develop-more') -lie ('strong'), &lt;NULL. Causality. Back. Intra, O. 2. /&gt; gongchao ('labour unrest') &lt;vi ('with'). * :1&gt; liaoyuan ('bum-plain') zhi ('gen') shi 'tendency' xunshu 'quick' poji 'spread to' ge ('every') hang ('profession') ge ('every') ye , ('trade').</Paragraph>
      <Paragraph position="14">  'As a result of the adamant attitude of the Jospin administration, confrontation between the government and the public is becoming w.orse and worse. Labour unrest has spread quickly to all industrial sectors.' From the above tagging, we can immediately obtain the discourse structure that the two clauses encapsulated by the two discourse markers youyu (with sequence number 2) and NULL (with sequence number 0). They have formed a causality relation (with sequence number 2). We denote this as a binary relation Causality(FrontClause(2), BaekClause(2)) where FrontClause(n) denotes the discourse segment that is encapsulated by the Front discourse marker of the corresponding rhetorical relation whose sequence number is n.</Paragraph>
      <Paragraph position="15">  BackClause(n) can be defined similarly. Note that although yi is a CDM, it does not function as a discourse indicator in this sentence. Therefore, it is &amp;quot; encoded as an apparent discourse marker.</Paragraph>
      <Paragraph position="16"> Example 2 &lt;dan ('however'). Adversativitv. Back. Inter. 17. 14. 1&gt; &lt;ruguo 'if'. Su_~ciencv. Front. Inter, 18. 15. 1&gt; Zhu Pei ('Jospin') zhengfu ('government') cici ('this time') zai ('at') gongchao ('labour unrest') mianqian ('in the face of') tuique ('back down'), &lt;NULL.</Paragraph>
      <Paragraph position="17"> Su.~ciencv. Back. Inter. O. 15. 1&gt; houguo ('result') &lt;geng.('more'). *. 3&gt; shi bukan ('is unbearable') shexian ('imagine').</Paragraph>
      <Paragraph position="18"> 'However, if the Jospin administration backs down in the face of the labour unrest, the result will be terrible.' From the above tagging, we can obtain the following discourse structure with embedding relations:</Paragraph>
      <Paragraph position="20"> where &amp;F(n) denotes the Front discourse segment of an inter-sentence rhetorical relation whose sequence number is n. We can define &amp;B(n) similarly.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="15" end_page="16" type="metho">
    <SectionTitle>
5 Heuristic-based Tagging of Discourse Markers
</SectionTitle>
    <Paragraph position="0"> In the previous section, we have introduced a coding, scheme for CDMs, and have explained how to automatically derive the discourse structure from sentences with tagged discourse markers. Now, the problem we have to resolve is: Is there an algorithm that will tag the markers according to the above encoding scheme? To derive such an algorithm,-even an imperfect one, it is necessary that we have knowledge of the usage patterns and statistics of discourse markers in unrestricted texts. This is exactly what project SIFAS intends to achieve as explained in Section 3. Instead of completely relying on a human encoder to encode all the training texts in the SIFAS corpus, we have experimented with a simple algorithm using a small number of heuristic rules to automatically encode the CDMs. The algorithm is a straightforward matching algorithm for rhetorical relations based recognition of their constituent discourse markers as specified in the Rhetorical Relation Dictionary (T'sou et al. 1999). The following principles are adopted by the heuristic-based algorithm to resolve ambiguous situations encountered in the process of matching discourse markers:  (1) Principle of Greediness: When matching a pair of CDMs for a rhetorical relation, priority is given to the first matched relation from the left.</Paragraph>
    <Paragraph position="1"> (2) Principle of Locality: When matching a pair of CDMs for a rhetorical relation, priority is given to the relation where the distance between its constituent CDMs is shortest.</Paragraph>
    <Paragraph position="2"> (3) Principle of Explicitness: When matching a pair of CDMs for a rhetorical relation, priority is given to the relation that has both CDMs explicitly present.</Paragraph>
    <Paragraph position="3"> (4) Principle of Superiority: When matching a pair of CDMs for a rhetorical relation, priority is given to the inter-sentence relation whose back discourse marker matches the first CDM of a sentence.</Paragraph>
    <Paragraph position="4"> (5) Principle of Back-Marker Preference: this principle is applicable only to rhetorical  relations where either the front or the back marker is absent. In such cases, priority is given to the relation with the back marker present.</Paragraph>
    <Paragraph position="5"> ' Application of the above principles to process a text is in the order shown, with the * exception that the principle of greediness is applied whenever none of the other principles can be, used to resolve an ambiguous situation. The following pseudo code realizes principles 1, 2 and  if ((not CDMs\[J\].Tagged) and (not</Paragraph>
    <Paragraph position="7"> The following code realizes principles 1,4 and 5:  for I:=l to NumberOfCDMslnTheSentence do begin if (not CDMs\[I\].Tagged) then</Paragraph>
    <Paragraph position="9"> In the above pseudo codes, CDMs\[\] denotes the array holding the candidate discourse markers, and the Boolean variable Tagged is used to indicate whether a CDM has been tagged.</Paragraph>
    <Paragraph position="10"> Furthermore, the procedure Matching0 is to examine whether the first word or phrase appearing in a sentence is an inter-sentence CDMs\[I\].</Paragraph>
  </Section>
  <Section position="7" start_page="16" end_page="16" type="metho">
    <SectionTitle>
6 Mining Discourse Markers Using
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
Machine Learning
</SectionTitle>
      <Paragraph position="0"> Data mining techniques constitute a field dedicated to the development of computational methods underlying learning processes and they have been applied in various disciplines in text processing, such as finding associations in a collection of texts (Feldman and Hirsh 1997) and mining online text (Knight 1999). In this section, we focus on the problem of discourse marker disambiguation using decision trees obtained by machine learning techniques. Our novel approach in mining Chinese discourse markers attempts to apply the C4.5 learning algorithm, as introduced by Quinlan (1993), in the context of non-tabular, unstructured data. A decision tree consists of nodes and branches connecting the nodes. The nodes located at the bottom of the tree are called leaves, and indicate classes. The top node in the tree is called the root, and contains all the training examples that are to be divided into classes. In order to minimize the branches in the tree, the best attribute is selected and used in the test at the root node of the tree. A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node. The entire process is then repeated using the training examples associated with each descendant node to select the best attribute for testing at that point in the tree. A statistical property, called information gain, is used to measure how well a given attribute differentiates the training examples according to their target classificatory scheme and to select the</Paragraph>
      <Paragraph position="2"> most suitable candidate attribute at each step while expanding the tree.</Paragraph>
      <Paragraph position="3"> The attributes we use in this research include the candidate discourse marker itself, two words immediately to the left of the CDM, and two words immediately to the right of the CDM. The attribute names are F2, F1, CDM, B1, B2, respectively. All these five attributes are discrete. The following are two examples: * &amp;quot;,&amp;quot;, dan 'but', youyu 'since', Xianggang 'Hong Kong', de 'of', T.</Paragraph>
      <Paragraph position="4"> * zhe 'this', yi 'also', zhishi 'is only', Xianggang 'Hong Kong', de 'of', F.</Paragraph>
      <Paragraph position="5"> where &amp;quot;T&amp;quot; denotes the CDM youyu as a discourse marker in the given context, and &amp;quot;F&amp;quot; denotes that zhishi is not a discourse marker.</Paragraph>
      <Paragraph position="6"> In building up a decision-tree in our application of C4.5 to the mining of discourse markers, entropy, first of all, is used to measure the homogeneity of the examples. For any possible candidate A chosen as an attribute in classifying the training data S, Gain(S, A) information gain, relative to a data set S is defined. This information gain measures the expected reduction in entropy and defines one branch for the possible subset Si of the training examples. For each subset Si, a new test is then chosen for any further split. If Si satisfies a stopping criterion, such as all the element in S~ belong to one class, the decision tree is formed with all the leaf nodes associated with the most frequent class in S. C4.5 uses arg max(Gain(S, A)) or arg max(Gain Ratio(S, A)) as defined in the following to construct the minimal decision tree.</Paragraph>
      <Paragraph position="8"> where Splitlnformation=-2./--,Jog 2 ~, Si is !S! iS! subset of S for which A has value vt In our text mining, according to the number of times a CDM occurs in the 80 tagged editorials, we select 75 CDMs with more than 10 occurrences.</Paragraph>
      <Paragraph position="9"> To avoid decision trees being over-fitted or trivial, for F2, F1, B1 and B2, only values of attributes with frequency more than 15 in the corpus are used in building the decision trees. We denote all values of attributes with frequency less than 15 as 'Other'. If a CDM is the first, the second or the last word of a sentence, values of F2, F1, or B2 will be null, we denote a null-value as &amp;quot;*&amp;quot;. The following are two other examples:</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>