<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1116"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. A Bootstrapping Approach to Unsupervised Detection of Cue Phrase Variants</Title> <Section position="3" start_page="0" end_page="922" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Cue phrases such as "This paper proposes a novel approach to . . .", "no method for . . . exists" or even "you will hear from my lawyer" are semi-fixed in that they constitute a formulaic pattern with a clear semantics, but with syntactic and lexical variations which are hard to predict and thus hard to detect in unseen text (e.g. "a new algorithm for . . . is suggested in the current paper" or "I envisage legal action"). In scientific discourse, such meta-discourse (Myers, 1992; Hyland, 1998) abounds and plays an important role in marking the discourse structure of the texts.</Paragraph> <Paragraph position="1"> Finding these variants can be useful for many text understanding tasks because semi-fixed cue phrases act as linguistic markers indicating the importance and/or the rhetorical role of some adjacent text. Footnote 1: In contrast to standard work in discourse linguistics, which mostly considers sentence connectives and adverbials as cue phrases, our definition includes longer phrases, sometimes even entire sentences.</Paragraph> <Paragraph position="2"> For the summarisation of scientific papers, cue phrases such as "Our paper deals with . . ." are commonly used as indicators of extraction-worthiness of sentences (Kupiec et al., 1995). Re-generative (rather than extractive) summarisation methods may want to go further than that and directly use the knowledge that a certain sentence contains the particular research aim of a paper, or a claimed gap in the literature.
Similarly, in the task of automatic routing of customer emails and automatic answering of some of these, the detection of threats of legal action could be useful.</Paragraph> <Paragraph position="3"> However, systems that use cue phrases usually rely on manually compiled lists, the acquisition of which is time-consuming and error-prone and results in cue phrases which are genre-specific.</Paragraph> <Paragraph position="4"> Methods for finding cue phrases automatically include Hovy and Lin (1998) (using the ratio of word frequency counts in summaries and their corresponding texts), Teufel (1998) (using the most frequent n-grams), and Paice (1981) (using a pattern matching grammar and a lexicon of manually collected equivalence classes). The main issue with string-based pattern matching techniques is that they cannot capture syntactic generalisations such as active/passive constructions, different tenses and modification by adverbial, adjectival or prepositional phrases, appositions and other parenthetical material.</Paragraph> <Paragraph position="5"> For instance, we may be looking for sentences expressing the goal or main contribution of a paper; Fig. 1 shows candidates of such sentences.</Paragraph> <Paragraph position="6"> Cases a) to e), which do indeed describe the authors' goal, display a wide range of syntactic variation.</Paragraph> <Paragraph position="7"> a) In this paper, we introduce a method for similarity-based estimation of . . .</Paragraph> <Paragraph position="8"> b) We introduce and justify a method . . .</Paragraph> <Paragraph position="9"> c) A method (described in section 1) is introduced d) The method introduced here is a variation . . .</Paragraph> <Paragraph position="10"> e) We wanted to introduce a method . . .</Paragraph> <Paragraph position="11"> f) We do not introduce a method . . .</Paragraph> <Paragraph position="12"> g) We introduce and adopt the method given in [1] . . . h) Previously we introduced a similar method . .
.</Paragraph> <Paragraph position="13"> i) They introduce a similar method . . .</Paragraph> <Paragraph position="14"> Cases f) to i), in contrast, are false matches: they do not express the authors' goals, although they are superficially similar to the correct contexts. While string-based approaches (Paice, 1981; Teufel, 1998) are too restrictive to cover the wide variation within the correct contexts, bag-of-words approaches such as Agichtein and Gravano's (2000) are too permissive and would miss many of the distinctions between correct and incorrect contexts. Lisacek et al. (2005) address the task of identifying "paradigm shift" sentences in the biomedical literature, i.e. statements of thwarted expectation. This task is somewhat similar to ours in its definition by rhetorical context. Their method goes beyond string-based matching: in order for a sentence to qualify, the right set of concepts must be present in the sentence, with any syntactic relationship holding between them. Each concept set is encoded as a fixed, manually compiled list of strings. Their method covers only one particular context (the paradigm shift one), whereas we are looking for a method with which many types of cue phrases can be acquired. Whereas their method relies on manually assembled lists, we advocate data-driven acquisition of new contexts. This is generally preferable to manual definition, as language use is changing, inventive and hard to predict, and as many of the relevant concepts in a domain may be infrequent (cf. the formulation "be cursed", which was used in our corpus as a way of describing a method's problems). It also allows the acquisition of cue phrases in new domains, where the exact prevalent meta-discourse might not be known.</Paragraph> <Paragraph position="15"> Riloff's (1993) method for learning information extraction (IE) patterns uses a syntactic parse and correspondences between the text and filled MUC-style templates to learn context in terms of lexico-semantic patterns.
However, it too requires substantial hand-crafted knowledge: 1500 filled templates as training material, and a lexicon of semantic features for roughly 3000 nouns for constraint checking. Unsupervised methods for similar tasks include Agichtein and Gravano's (2000) work, which shows that clusters of vector-space-based patterns can be successfully employed to detect specific IE relationships (companies and their headquarters), and Ravichandran and Hovy's (2002) algorithm for finding patterns for a Question Answering (QA) task. Based on training material in the shape of pairs of question and answer terms (e.g. {Mozart, 1756}), they learn the semantics holding between these terms ("birth year") via frequent string patterns occurring in the context, such as "A was born in B", by considering n-grams of all repeated substrings. What is common to these three works is that bootstrapping relies on constraints between the context external to the extracted material and the extracted material itself, and that the target extraction material is defined by real-world relations.</Paragraph> <Paragraph position="16"> a) In this paper, we introduce a method for similarity-based estimation of . . . b) Here, we present a similarity-based approach for estimation of . . .</Paragraph> <Paragraph position="17"> c) In this paper, we propose an algorithm which is . . . d) We will here define a technique for similarity-based . . . Figure 2: Context around cue phrases (lexical variants)</Paragraph> <Paragraph position="18"> Our task differs in that the cue phrases we extract are based on general rhetorical relations holding in all scientific discourse. Our approach for finding semantically similar variants in an unsupervised fashion relies on bootstrapping of seeds from within the cue phrase. The assumption is that every semi-fixed cue phrase contains at least two main concepts whose syntax and semantics mutually constrain each other (e.g.
verb and direct object in phrases such as "(we) present an approach for"). The expanded cue phrases are recognised in various syntactic contexts using a parser (see footnote 2). General semantic constraints valid for groups of semantically similar cue phrases are then applied to model, e.g., the fact that it must be the authors who present the method, not somebody else.</Paragraph> <Paragraph position="19"> We demonstrate that such an approach is more appropriate for our task than IE/QA bootstrapping mechanisms based on cue phrase-external context. Part of the reason why normal bootstrapping does not work for our phrases is the difficulty of finding negative contexts, which are essential in bootstrapping for evaluating the quality of the patterns automatically. IE and QA approaches, due to uniqueness assumptions of the real-world relations that these methods search for, have an automatic definition of negative contexts by hard constraints (i.e., all contexts involving Mozart and any other year are by definition of the wrong semantics; so are all contexts involving Microsoft and a city other than Redmond). Footnote 2: Thus, our task shows some parallels to work in paraphrasing (Barzilay and Lee, 2002) and syntactic variant generation (Jacquemin et al., 1997), but the methods are very different.</Paragraph> <Paragraph position="20"> As our task is not grounded in real-world relations but in rhetorical ones, constraints found in the context tend to be soft rather than hard (cf. Fig. 2): while it is possible that strings such as "we" and "in this paper" occur more often in the context of a given cue phrase, they also occur in many other places in the paper where the cue phrase is not present.
Thus, it is hard to define clear negative contexts for our task.</Paragraph> <Paragraph position="21"> The novelty of our work is thus twofold: the new pattern extraction task (finding variants of semi-fixed cue phrases), a task for which it is hard to directly use the context the patterns appear in, and an iterative unsupervised bootstrapping algorithm for lexical variants, using phrase-internal seeds and ranking similar candidates based on relation strength between the seeds.</Paragraph> <Paragraph position="22"> While our method is applicable to general cue phrases, we demonstrate it here with transitive verb–direct object pairs, namely a) cue phrases introducing a new methodology (and thus the main research goal of the scientific article; e.g. "In this paper, we propose a novel algorithm . . ."), which we call goal-type cue phrases; and b) cue phrases indicating continuation of other, previous research (e.g. "Therefore, we adopt the approach presented in [1] . . ."), which we call continuation-type cue phrases.</Paragraph> </Section> </Paper>