<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0620">
<Title>Learning Discourse Relations with Active Data Selection</Title>
<Section position="4" start_page="0" end_page="158" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> The success of corpus-based approaches to discourse ultimately depends on whether one can acquire a large volume of data annotated for discourse-level information. However, acquiring even a few hundred texts annotated for discourse information is often impossible because of the enormous human labor required.</Paragraph>
<Paragraph position="1"> This paper presents a novel method for reducing the amount of data needed to train a decision tree classifier without compromising accuracy. While there has been some work exploring the use of machine learning techniques for discourse and dialogue (Marcu, 1997; Samuel et al., 1998), to our knowledge no computational research on discourse or dialogue has so far addressed the problem of reducing or minimizing the amount of data needed to train a learning algorithm.*

* The work reported here was conducted while the first author was with Advanced Research Lab., Hitachi Ltd, 2520 Hatoyama Saitama 350-0395 Japan.</Paragraph>
<Paragraph position="2"> The particular method proposed here builds on committee-based sampling, initially proposed for probabilistic classifiers by Dagan and Engelson (1995), in which an example is selected from the corpus according to its utility in improving statistics. We extend the method to decision tree classifiers using a statistical technique called bootstrapping (Cohen, 1995). With an additional extension, which we call error feedback, the method achieves increased accuracy as well as a significant reduction in training data. The method proposed here should be of use in domains other than discourse where a decision tree strategy is applicable.</Paragraph>
<Paragraph position="3">
2 Tagging a corpus with discourse relations
In tagging a corpus, we adopted Ichikawa's (1990) scheme for organizing discourse relations (Table 1). The advantage of Ichikawa's scheme is that it directly associates discourse relations with explicit surface cues (e.g., sentential connectives), so the coder can determine a discourse relation by identifying the most natural cue that goes with the sentence he or she is working on. Another feature is that, unlike Rhetorical Structure Theory (Mann and Thompson, 1987), the scheme assumes a discourse relation to be a local one, defined strictly over two consecutive sentences.1 We expected that these features would make the tagging task less laborious for a human coder than it would be with RST. Further, our earlier study indicated a very low agreement rate with

1 This does not mean that all discourse relations are local. There could be some relations that involve sentences far apart. However, we did not consider non-local relations, as our preliminary study found that they are rarely agreed upon by coders.</Paragraph>
<Paragraph position="4"> [Table 1 caption, beginning truncated] ... and the second subclasses. The third column lists some examples associated with each subclass. Note that the EXPANDING subclass has no examples, because no explicit cue is used to mark the relationship.</Paragraph>
</Section>
</Paper>
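To make the sampling procedure concrete, below is a minimal sketch of committee-based sampling with bootstrapped decision trees. It is an illustration under assumed details: the committee size, the vote-entropy disagreement measure, and all function names are our own choices rather than the paper's exact formulation, the error-feedback extension is not modeled, and scikit-learn with integer-encoded class labels is assumed.

```python
# Minimal sketch (not the authors' code) of committee-based sampling with
# bootstrapped decision trees; assumes scikit-learn and integer class labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def build_committee(X, y, n_members=5):
    """Train one tree per bootstrap replicate (rows sampled with replacement)."""
    committee = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        committee.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return committee

def vote_entropy(committee, x):
    """Disagreement of the committee on one example; higher = more informative."""
    votes = np.array([t.predict(x[None, :])[0] for t in committee], dtype=int)
    p = np.bincount(votes) / len(votes)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def select_for_labeling(committee, X_pool, batch_size=10):
    """Pick the unlabeled examples the committee disagrees on most."""
    scores = np.array([vote_entropy(committee, x) for x in X_pool])
    return np.argsort(scores)[-batch_size:]
```

In an active-learning loop, one would repeatedly retrain the committee on the labeled set, score the unlabeled pool, and ask the human coder to label only the selected batch. Note that Dagan and Engelson's original formulation selects examples stochastically in proportion to disagreement rather than greedily, so the top-k step above is a simplification.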
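The tagging scheme also lends itself to a simple pairwise representation, since every relation holds between exactly two consecutive sentences and is justified by a connective cue. The sketch below is a hypothetical data structure of our own devising; the field names and the sample relation label are illustrative, not drawn from Ichikawa's actual taxonomy.

```python
# Toy representation (assumed, not the authors' format) of a locally tagged
# corpus: each relation links two consecutive sentences via a surface cue.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaggedPair:
    first: str                 # sentence i
    second: str                # sentence i + 1
    relation: str              # subclass label from the taxonomy (cf. Table 1)
    cue: Optional[str] = None  # connective that motivated the label, if any

corpus = [
    TaggedPair(
        first="It rained heavily overnight.",
        second="The match was called off.",
        relation="CONSEQUENTIAL",  # hypothetical label for illustration
        cue="therefore",
    ),
]
```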