
<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0402">
  <Title>Mining Discourse Markers for Chinese Textual Summarization</Title>
  <Section position="8" start_page="16" end_page="17" type="evalu">
    <SectionTitle>
7 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="16" end_page="17" type="sub_section">
      <SectionTitle>
7.1 Evaluation of Heuristic-based
Algorithm
</SectionTitle>
      <Paragraph position="0"> To evaluate the effectiveness of the heuristic-based algorithm, we randomly selected 40 editorials from Ming Pao, a Hong Kong Chinese newspaper, to form our test data. Only editorials were chosen because they are mainly argumentative texts and their lengths are relatively uniform.</Paragraph>
      <Paragraph position="1"> The evaluation consists of two steps: 1) tagging all of the test data using the heuristic-based algorithm, and 2) having a human encoder proofread, correct and record all the tagging errors. The resulting statistics include, for each editorial in the test data, the number of lexical items (#Lltms), the number of sentences (#Sens), the number of discourse markers (#Mrkrs), and the number of sentences containing at least one discourse marker (#CSens). Table 2 shows the minimum, maximum and average values of these characteristics. The ratio of the average number of discourse markers to the average number of lexical items is 4.37%, and the ratio of the average number of sentences containing at least one discourse marker to the average number of sentences is 62.66%.</Paragraph>
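      <Paragraph> A minimal Python sketch of how these corpus statistics and the two ratios can be computed, assuming a tagged corpus represented as a list of editorials, each a list of sentences, each a list of (lexical item, is-marker) pairs; the representation and names are illustrative, not from our implementation:
def corpus_statistics(editorials):
    """Per-editorial counts: #Lltms, #Sens, #Mrkrs, #CSens."""
    stats = []
    for editorial in editorials:
        n_items = sum(len(sen) for sen in editorial)              # #Lltms
        n_sens = len(editorial)                                   # #Sens
        n_mrkrs = sum(is_marker for sen in editorial
                      for _, is_marker in sen)                    # #Mrkrs
        n_csens = sum(any(is_marker for _, is_marker in sen)
                      for sen in editorial)                       # #CSens
        stats.append((n_items, n_sens, n_mrkrs, n_csens))
    return stats

def average_ratios(stats):
    """Ratios of column averages, as reported above."""
    avgs = [sum(col) / len(stats) for col in zip(*stats)]
    return avgs[2] / avgs[0], avgs[3] / avgs[1]  # 4.37% and 62.66% here
</Paragraph>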
      <Paragraph position="2"> Our evaluation is based on counting the number of discourse markers that are correctly tagged. Incorrectly tagged discourse markers are classified according to the types of errors introduced in T'sou et al. (1999). We define two evaluation metrics as follows: Gross Accuracy (GA) is the percentage of correctly tagged discourse markers out of the total number of discourse markers, while Relation-Matching Accuracy (RMA) is the percentage of correctly tagged discourse markers out of the total number of discourse markers minus those errors caused by non-markers and unrecorded markers. On our test data, GA = 68.89% and RMA = 95.07%.</Paragraph>
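      <Paragraph> The two metrics can be stated directly as code. The following Python sketch assumes raw counts recorded by the human encoder; the variable names are illustrative:
def gross_accuracy(n_correct, n_total):
    # GA: correctly tagged markers over all discourse markers
    return n_correct / n_total

def relation_matching_accuracy(n_correct, n_total,
                               n_non_marker_errors, n_unrecorded_errors):
    # RMA: the denominator excludes errors caused by non-markers
    # and unrecorded markers
    return n_correct / (n_total - n_non_marker_errors - n_unrecorded_errors)
The gap between a GA of 68.89% and an RMA of 95.07% reflects that most residual errors fall into the two excluded categories.</Paragraph>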
      <Paragraph position="3"> Since the heuristic-based algorithm does not assume any knowledge of the statistics and behavioral patterns of discourse markers, our GA demonstrates the usefulness of the algorithm in alleviating the burden on human encoders of developing a sufficiently large corpus for studying the usage of discourse markers.</Paragraph>
      <Paragraph position="4"> In our experiment, most errors come from tagging non-discourse markers as discourse markers (T'sou et al. 1999). This is due to the fact that, similar to the question of cue phrase polysemy (Hirschberg and Litman 1993), many Chinese discourse markers have both discourse senses and alternate sentential senses in different utterances. For example:</Paragraph>
    </Section>
    <Section position="2" start_page="17" end_page="17" type="sub_section">
      <SectionTitle>
7.2 Evaluation of Decision Tree Algorithm (with C4.5)
</SectionTitle>
      <Paragraph position="0"> In Section 6, we discussed how machine learning techniques can be applied to the problem of discourse marker disambiguation in Chinese. Our experiment comprises a total of 2627 cases. For decision tree construction, we use 75 percent of the cases as a training set and the remaining 25 percent as a test set. Many different decision trees can be generated by adjusting the parameters of the learning algorithm; many of the trees generated in our experiment have an accuracy of around 80% on both the training set and the test set. Figure 2 shows one of the possible decision trees in our experiment.</Paragraph>
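      <Paragraph> A minimal sketch of this setup in Python, using scikit-learn's CART learner as a stand-in for C4.5 (which scikit-learn does not provide); the feature matrix X and labels y stand for a hypothetical encoding of the 2627 cases:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def run_experiment(X, y):
    # 75 percent of the cases for training, 25 percent held out for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, random_state=0)
    # Parameters such as max_depth play the role of the C4.5 options
    # that yield the different trees mentioned above.
    tree = DecisionTreeClassifier(max_depth=8)
    tree.fit(X_train, y_train)
    return tree.score(X_train, y_train), tree.score(X_test, y_test)
</Paragraph>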
      <Paragraph position="2"> The last branch of this decision tree can be explained as:
if (F1 = danshi 'but') then
    if (CDM in {ru 'if', reng 'still', geng 'even more', que 'however'}) then
        classify as F
    else if (CDM in {chule 'except', youyu 'since', ruo 'if'}) then
        classify as T
[Figure 2 (excerpt). Decision Tree: (Size = 38, Items = 1971, Errors = 282); first branch: F1 in {di, ye, yi} : F (25/5)]</Paragraph>
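      <Paragraph> The same branch, transcribed directly as Python (labels F and T as in the figure; markers romanized as above):
def classify_last_branch(f1, cdm):
    if f1 == 'danshi':                            # 'but'
        if cdm in {'ru', 'reng', 'geng', 'que'}:  # if / still / even more / however
            return 'F'
        if cdm in {'chule', 'youyu', 'ruo'}:      # except / since / if
            return 'T'
    return None  # remaining cases are handled by other branches of the tree
</Paragraph>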
      <Paragraph position="4"> The two numbers in the brackets denote, respectively, the number of cases covered by the branch and the number of cases misclassified. The results of our experiment will be elaborated on in future work, when we shall also explore the application of machine learning techniques to recognizing rhetorical relations on the basis of discourse markers, and to extracting important sentences from Chinese text.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>