<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1116"> <Title>A Bootstrapping Approach to Unsupervised Detection of Cue Phrase Variants</Title> <Section position="4" start_page="922" end_page="924" type="metho"> <SectionTitle> 2 Lexical Bootstrapping Algorithm </SectionTitle> <Paragraph position="0"> The task of this module is to find lexical variants of the components of the seed cue phrases.</Paragraph> <Paragraph position="1"> Given the seed phrases "we introduce a method" and "we propose a model", the algorithm starts by finding all direct objects of "introduce" in a given corpus and, using an appropriate similarity measure, ranks them according to their distributional similarity to the nouns "method" and "model". Subsequently, the noun "method" is used to find transitive verbs, which are ranked according to their similarity to "introduce" and "propose". In both cases, the ranking step retains variants that preserve the semantics of the cue phrase (e.g. "develop" and "approach") and filters out irrelevant terms that change the phrase semantics (e.g. "need" and "example").</Paragraph> <Paragraph position="2"> Stopping at this point would limit us to those terms that co-occur with the seed words in the training corpus. Therefore, additional iterations using automatically generated verbs and nouns are applied in order to recover more and more variants. The full algorithm is given in Fig. 3.</Paragraph> <Paragraph position="3"> The algorithm requires corpus data for the steps Hypothesize (producing a list of potential candidates) and Rank (testing them for similarity). Figure 3: Input: Tuples {A1, A2, . . . , Am} and {B1, B2, . . . , Bn}. Initialisation: Set the concept-A reference set to {A1, A2, . . . , Am} and the concept-B reference set to {B1, B2, . . . , Bn}. Set the concept-A active element to A1 and the concept-B active element to B1.</Paragraph> <Paragraph position="4"> Recursion: 1. Concept B retrieval: (i) Hypothesize: Find terms in the corpus which are in the desired relationship with the concept-A active element (e.g. direct objects of a verb active element). This results in the concept-B candidate set.</Paragraph> <Paragraph position="5"> (ii) Rank: Rank the concept-B candidate set using a suitable ranking methodology that may make use of the concept-B reference set. In this process, each member of the candidate set is assigned a score.</Paragraph> <Paragraph position="6"> (iii) Accumulate: Add the top s items of the concept-B candidate set to the concept-B accumulator list (based on empirical results, s is the rank of the candidate set during the initial iteration and 50 for the remaining iterations). If an item is already on the accumulator list, add its ranking score to the existing item's score.</Paragraph> <Paragraph position="7"> 2. Concept A retrieval: as above, with concepts A and B swapped.</Paragraph> <Paragraph position="8"> 3. Updating active elements: (i) Set the concept-B active element to the highest-ranked instance in the concept-B accumulator list which has not been used as an active element before. (ii) Set the concept-A active element to the highest-ranked instance in the concept-A accumulator list which has not been used as an active element before.
Repeat steps 1-3 for k iterations. Output: top M words of the concept-A (verb) accumulator list and top N words of the concept-B (noun) accumulator list. Reference set: a set of seed words which define the collective semantics of the concept we are looking for in this iteration. Active element: the instance of the concept used in the current iteration for retrieving instances of the other concept. If we are finding lexical variants of Concept A by exploiting relationships between Concepts A and B, then the active element is from Concept B.</Paragraph> <Paragraph position="9"> Candidate set: the set of candidate terms for one concept (e.g. Concept A) obtained using an active element from the other concept (e.g. Concept B). The more semantically similar a term in the candidate set is to the members of the reference set, the higher its ranking should be. This set contains verbs if the active element is a noun, and vice versa. Accumulator list: a sorted list that accumulates the ranked members of the candidate set.</Paragraph> <Paragraph position="10"> We estimate frequencies for the Rank step from the written portion of the British National Corpus (BNC, Burnard (1995)), 90 million words. For the Hypothesize step, we experiment with two data sets. First, the scientific subsection of the BNC (24 million words), which we parse using RASP (Briscoe and Carroll, 2002); we then examine the grammatical relations (GRs) for transitive verb constructions, both in active and passive voice. This method guarantees that we find almost all transitive verb constructions cleanly; Carroll et al. (1999) report an accuracy of .85 for newspaper articles for this relation. Figure 4 (extraction patterns): DOs, Active: "AGENT STRING AUX active-verb-element DETERMINER * POSTMOD"; DOs, Passive: "DETERMINER * AUX active-verb-element"; TVs, Active: "AGENT STRING AUX * DETERMINER active-noun-element POSTMOD"; TVs, Passive: "DET active-noun-element AUX * POSTMOD". (Footnote to Fig. 4, fragment: ". . . actual words (e.g. AGENT STRING: We/I, DETERMINER: a/an/our), and the extracted words (indicated by *) are lemmatised.") Second, in order to obtain larger coverage and more current data, we also experiment with Google Scholar, an automatic web-based indexer of scientific literature (mainly peer-reviewed papers, technical reports, books, pre-prints and abstracts). Google Scholar snippets are often incomplete fragments which cannot be parsed. For practical reasons, we decided against processing the entire documents, and instead obtain an approximation to direct objects and transitive verbs with regular expressions over the result snippets in both active and passive voice (cf. Fig. 4), designed to be high-precision. The amount of data available from the BNC and Google Scholar is not directly comparable: harvesting Google Scholar snippets for both active and passive constructions gives around 2000 sentences per seed (Google Scholar returns up to 1000 results per query), while the number of BNC sentences containing seed words in active and passive form varies from 11 ("formalism") to 5467 ("develop"), with an average of 1361 sentences for the experimental seed pairs.</Paragraph> <Section position="1" start_page="923" end_page="924" type="sub_section"> <SectionTitle> Ranking </SectionTitle> <Paragraph position="0"> Having obtained our candidate sets (either from the scientific subsection of the BNC or from Google Scholar), the members are ranked using BNC frequencies. We investigate two ranking methodologies: frequency-based and context-based. Frequency-based ranking simply ranks each member of the candidate set by how many times it is retrieved together with the current active element.</Paragraph>
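To make the control flow of Fig. 3 concrete, the following is a minimal sketch of the bootstrapping recursion in Python. It is an illustration, not the authors' implementation: the `hypothesize` and `rank` callables stand in for the corpus retrieval (BNC grammatical relations or Google Scholar regexes) and the ranking methodologies described in this section, and the reading of the first-iteration cut-off s is an assumption.

```python
# Sketch of the bootstrapping recursion in Fig. 3 (illustrative, not the authors' code).
# Assumptions: `hypothesize(active, target)` returns candidate terms for the target
# concept ("A" = verbs, "B" = nouns) retrieved via the active element of the other
# concept; `rank(candidates, reference)` returns (term, score) pairs, best first.

def bootstrap(seed_verbs, seed_nouns, hypothesize, rank, k=4, s_later=50):
    """Run k iterations and return the two accumulator lists (term -> summed score)."""
    ref = {"A": list(seed_verbs), "B": list(seed_nouns)}      # reference sets
    active = {"A": seed_verbs[0], "B": seed_nouns[0]}         # active elements
    acc = {"A": {}, "B": {}}                                  # accumulator lists
    used = {"A": set(), "B": set()}                           # consumed active elements

    for it in range(k):
        for target, source in (("B", "A"), ("A", "B")):       # steps 1 and 2
            candidates = hypothesize(active[source], target)  # Hypothesize
            scored = rank(candidates, ref[target])            # Rank
            # Accumulate: whole candidate set in the first iteration (one reading
            # of the text), top 50 afterwards; repeated items have scores summed.
            s = len(scored) if it == 0 else s_later
            for term, score in scored[:s]:
                acc[target][term] = acc[target].get(term, 0.0) + score
        for concept in ("A", "B"):                            # step 3: update actives
            used[concept].add(active[concept])
            unused = [(t, sc) for t, sc in acc[concept].items() if t not in used[concept]]
            if unused:
                active[concept] = max(unused, key=lambda x: x[1])[0]
    return acc["A"], acc["B"]   # selection of the top M / top N is left to the caller
```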
<Paragraph position="1"> Context-based ranking uses a similarity measure for computing the scores, giving a higher score to those words that share sufficiently similar contexts with the members of the reference set.</Paragraph> <Paragraph position="2"> We consider similarity measures in a vector space defined either by a fixed window, by the sentence window, or by syntactic relationships. The score assigned to each word in the candidate set is the sum of its semantic similarity values computed with respect to each member of the reference set.</Paragraph> <Paragraph position="3"> Syntactic contexts, as opposed to window-based contexts, constrain the context of a word to only those words that are grammatically related to it. We use verb-object relations in both active and passive voice constructions, as did Pereira et al. (1993) and Lee (1999), among others. We use the cosine similarity measure for window-based contexts and the following commonly used similarity measures for the syntactic vector space: Hindle's (1990) measure, the weighted Lin measure (Wu and Zhou, 2003), the a-Skew divergence measure (Lee, 1999), the Jensen-Shannon (JS) divergence measure (Lin, 1991), Jaccard's coefficient (van Rijsbergen, 1979) and the Confusion probability (Essen and Steinbiss, 1992). The Jensen-Shannon measure, JS(x1, x2) = 1/2 [ D(p(.|x1) || avg) + D(p(.|x2) || avg) ] with avg = (p(.|x1) + p(.|x2))/2 and D the Kullback-Leibler divergence, subsequently performed best for our task.</Paragraph> <Paragraph position="4"> We compare the different ranking methodologies and data sets with respect to a manually defined gold standard list of 20 goal-type verbs and 20 nouns. This list was manually assembled from Teufel (1999); WordNet synonyms and other plausible verbs and nouns found via Web searches on scientific articles were added. We ensured by searches on the ACL Anthology that there is good evidence that the gold-standard words indeed occur in the right contexts, i.e. in goal statement sentences. As we want to find similarity metrics and data sources which result in accumulator lists with many of these gold members at high ranks, we need a measure that rewards exactly those lists. We use non-interpolated Mean Average Precision (MAP), a standard measure for evaluating ranked information retrieval runs, which combines precision and recall and ranges from 0 to 1.</Paragraph> <Paragraph position="5"> We use 8 pairs of 2-tuples as input (e.g. [introduce, study] & [approach, method]), randomly selected from the gold standard list. MAP was calculated over the verbs and nouns retrieved using our algorithm and averaged:</Paragraph> <Paragraph position="6"> MAP = (1/N) sum_{j=1..N} (1/M) sum_{i=1..M} P(gi), where P(gi) = nij/rij if gi is retrieved and 0 otherwise, N is the number of seed combinations, M is the size of the gold standard list, gi is the ith member of the gold standard list, rij is its rank in the retrieved list of combination j, and nij is the number of gold standard members found up to and including rank rij.</Paragraph> <Paragraph position="7"> Fig. 5 summarises the MAP scores for the first iteration, where Google Scholar significantly outperformed the BNC. The best result for this iteration (MAP=.550) was achieved by combining Google Scholar and the Jensen-Shannon measure.</Paragraph>
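As an illustration of the context-based ranking just described, the sketch below scores candidates by summed similarity to the reference set using the Jensen-Shannon divergence over co-occurrence distributions. The `cooc` interface (a word mapped to counts of its verb-object partners) and the use of negated divergence as the similarity score are assumptions, not the authors' implementation; such a function could serve as the `rank` callable in the sketch above.

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (dicts)."""
    def kl(a, b):
        return sum(a[y] * math.log(a[y] / b[y]) for y in a if a[y] > 0)
    support = set(p) | set(q)
    avg = {y: 0.5 * (p.get(y, 0.0) + q.get(y, 0.0)) for y in support}
    return 0.5 * kl(p, avg) + 0.5 * kl(q, avg)

def distribution(counts):
    """Normalise co-occurrence counts (a Counter) into a probability distribution."""
    total = sum(counts.values())
    return {y: c / total for y, c in counts.items()} if total else {}

def rank_candidates(candidates, reference, cooc):
    """Score each candidate by its summed similarity (negative JS divergence)
    to the members of the reference set; highest score first."""
    scored = []
    for cand in candidates:
        p = distribution(cooc.get(cand, Counter()))
        # lower divergence means more similar, so negate to get a score
        score = sum(-js_divergence(p, distribution(cooc.get(r, Counter())))
                    for r in reference)
        scored.append((cand, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```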
<Paragraph position="8"> The algorithm stops iterating when no more improvement can be obtained, in this case after 4 iterations, resulting in a final MAP of .619.</Paragraph> <Paragraph position="9"> Although a-Skew outperforms the simpler measures in ranking nouns, its performance on verbs is worse than that of Weighted Lin.</Paragraph> <Paragraph position="10"> While Lee (1999) argues that a-Skew's asymmetry can be advantageous for nouns, this probably does not hold for verbs: verb hierarchies have a much shallower structure than noun hierarchies, with most verbs concentrated on one level (Miller et al., 1990). This would explain why JS, which is symmetric in contrast to the a-Skew metric, performed better in our experiments.</Paragraph> <Paragraph position="11"> In the evaluation presented here we therefore use Google Scholar data and the JS measure. An additional improvement (MAP=.630) is achieved when we incorporate a filter based on the following hypothesis: goal-type verbs should be more likely to have their direct objects preceded by indefinite articles rather than definite articles or possessive determiners (because a new method is introduced), whereas continuation-type verbs should prefer definite articles with their direct objects (as an existing method is involved).</Paragraph> </Section> </Section> <Section position="5" start_page="924" end_page="925" type="metho"> <SectionTitle> 3 Syntactic variants and semantic filters </SectionTitle> <Paragraph position="0"> The syntactic variant extractor takes as its input the raw text and the lists of verbs and nouns generated by the lexical bootstrapper. After RASP-parsing the input text, all instances of the input verbs are located and, based on the grammatical relations output by RASP (footnote 6: aux, argmod, detmod, ncmod and mod), a set of relevant entities and modifiers for each verb is constructed, grouped into five categories (cf. Fig. 6).</Paragraph> <Paragraph position="1"> Figure 6 (grammatical relations considered): (1) The agent of the verb (e.g., "We adopt . . .", ". . . adopted by the author"), the agent's determiner and related adjectives. (2) The direct object of the verb, the object's determiner and adjectives, in addition to any post-modifiers (e.g., ". . . apply a method proposed by [1] . . .", ". . . follow an approach of [1] . . ."). (3) Auxiliaries of the verb (e.g., "In a similar manner, we may propose a . . ."). (4) Adverbial modification of the verb (e.g., "We have previously presented a . . ."). (5) Prepositional phrases related to the verb (e.g., "In this paper we present . . .", ". . . adopted from their work").</Paragraph> <Paragraph position="2"> Next, semantic filters are applied to each of the potential candidates (represented by the extracted entities and modifiers), and a fitness score is calculated. These constraints encode semantic principles that apply to all cue phrases of that rhetorical category. Examples of constraints are: if work is referred to as being done in previous own work, it is probably not a goal statement; the work in a goal statement must be presented here or in the current paper (the concept of 'here-ness'); and the agents of a goal statement have to be the authors, not other people. While these filters are manually defined, they are modular, encode general principles, and can be combined to express a wide range of rhetorical contexts.</Paragraph>
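To illustrate how such modular filters can be combined into a fitness score, here is a small sketch. The `Candidate` fields and the three predicates are hypothetical stand-ins for the extracted entities of Fig. 6 and the example constraints above (authorial agent, no previous own work, 'here-ness'); they are not the authors' actual constraint set.

```python
# Minimal sketch of modular semantic filters over an extracted candidate
# (agent, adverbs, prepositional phrases as in Fig. 6). All field names and
# predicates are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    agent: str = ""                               # e.g. "we", "the author"
    adverbs: list = field(default_factory=list)   # e.g. ["previously"]
    pps: list = field(default_factory=list)       # e.g. ["in this paper"]

def agent_is_authors(c):
    # goal statements must be attributed to the authors themselves
    return c.agent.lower() in {"we", "i"}

def not_previous_own_work(c):
    # work described as previously done is probably not a goal statement
    return "previously" not in [a.lower() for a in c.adverbs]

def here_ness(c):
    # the work must be located in the current paper ("here-ness")
    return any("this paper" in pp.lower() or "here" in pp.lower() for pp in c.pps)

GOAL_FILTERS = [agent_is_authors, not_previous_own_work, here_ness]

def fitness(candidate, filters=GOAL_FILTERS):
    """Fraction of constraints satisfied; one possible form of fitness score."""
    return sum(f(candidate) for f in filters) / len(filters)
```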
<Paragraph position="3"> We verified that around 20 semantic constraints are enough to cover a large set of different cue phrases (the 1700 cue phrases from Teufel (1999)), though not all of these are implemented yet.</Paragraph> <Paragraph position="4"> A nice side-effect of our approach is the simple characterisation of a cue phrase (by a syntactic relationship, some seed words for each concept, and some general, reusable semantic constraints). This characterisation is more informative and specific than string-based approaches, yet it has the potential for generalisation (useful if the cue phrases are ever manually assessed and put into a lexicon).</Paragraph> <Paragraph position="5"> Fig. 7 shows successful extraction examples from our corpus, illustrating the difficulty of the task: the system correctly identified sentences with syntactically complex goal-type and continuation-type cue phrases, and correctly rejected deceptive variants. (Footnote 7, fragment: ". . . chitecture, method] (for goal) and [improve, adopt] & [model, method] (for continuation)".)</Paragraph> <Paragraph position="6"> Figure 7 (extraction examples): Correctly found. Goal-type: "What we aim in this paper is to propose a paradigm that enables partial/local generation through decompositions and reorganizations of tentative local structures." (9411021, S-5) Continuation-type: "In this paper we have discussed how the lexicographical concept of lexical functions, introduced by Melcuk to describe collocations, can be used as an interlingual device in the machine translation of such structures." (9410009, S-126) Correctly rejected. Goal-type: "Perhaps the method proposed by Pereira et al. (1993) is the most relevant in our context." (9605014, S-76) Continuation-type: "Neither Kamp nor Kehler extend their copying/substitution mechanism to anything besides pronouns, as we have done." (9502014, S-174)</Paragraph> </Section> <Section position="6" start_page="925" end_page="926" type="metho"> <SectionTitle> 4 Gold standard evaluation </SectionTitle> <Paragraph position="0"> We evaluated the quality of the extracted phrases in two ways: by comparing our system output to gold standard annotation, and by human judgement of the quality of the returned sentences. In both cases bootstrapping was done using the seed tuples [analyse, present] & [architecture, method].</Paragraph> <Paragraph position="1"> For the gold standard evaluation, we ran our system on a test set of 121 scientific articles drawn from the CmpLg corpus (Teufel, 1999), entirely different texts from the ones the system was trained on. Documents were manually annotated by the second author for (possibly more than one) goal-type sentence; annotation of that type has previously been shown to be reliable at K=.71 (Teufel, 1999). Our evaluation recorded how often the system's highest-ranked candidate was indeed a goal-type sentence; as this is a precision-critical task, we do not measure recall here.</Paragraph> <Paragraph position="2"> We compared our system against our reimplementation of Ravichandran and Hovy's (2002) paraphrase learning. The seed words were of the form {goal-verb, goal-noun}, and we submitted each of the 4 combinations of the seed pair to Google Scholar. From the top 1000 documents for each query, we harvested 3965 sentences containing both the goal-verb and the goal-noun. By considering all possible substrings, an extensive list of candidate patterns was assembled.</Paragraph>
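The pattern-harvesting step of this reimplementation can be sketched as follows. The tokenisation, the placeholder substitution and the bounded window length are simplifying assumptions standing in for "all possible substrings"; the singleton filter corresponds to the discarding step described below.

```python
# Illustrative sketch of harvesting surface patterns from sentences containing a
# goal-verb and a goal-noun, roughly in the spirit of the Ravichandran and Hovy
# (2002) reimplementation described above. Not the authors' code.
from collections import Counter

def candidate_patterns(sentence, goal_verb, goal_noun, max_len=6):
    """Yield token substrings containing both anchors, with the anchors replaced
    by <verb> and <noun> placeholders (a bounded approximation of all substrings)."""
    tokens = ["<verb>" if t == goal_verb else "<noun>" if t == goal_noun else t
              for t in sentence.lower().split()]
    for i in range(len(tokens)):
        for j in range(i + 2, min(len(tokens), i + max_len) + 1):
            window = tokens[i:j]
            if "<verb>" in window and "<noun>" in window:
                yield " ".join(window)

def harvest(sentences, goal_verb, goal_noun):
    """Count candidate patterns over all harvested sentences; drop single occurrences."""
    counts = Counter(p for s in sentences
                     for p in candidate_patterns(s, goal_verb, goal_noun))
    return {p: c for p, c in counts.items() if c > 1}

# e.g. list(candidate_patterns("we present a new method for parsing", "present", "method"))
# includes the pattern "<verb> a new <noun>"
```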
<Paragraph position="3"> Patterns with single occurrences were discarded, leaving a list of 5580 patterns (examples in Fig. 8). In order to rank the patterns by precision, the goal-verbs were submitted as queries and the top 1000 documents were downloaded for each. Figure 8 (example patterns learnt by Ravichandran and Hovy's (2002) method): "we <verb> a <noun> for"; "of a new <noun> to <verb> the"; "In this section , we <verb> the <noun> of".</Paragraph> <Paragraph position="4"> From these, the precision of each pattern was calculated by dividing the number of strings matching the pattern instantiated with both the goal-verb and all WordNet synonyms of the goal-noun, by the number of strings matching the pattern instantiated with the goal-verb only. An important point here is that while the tight semantic coupling between the question and answer terms in the original method accurately identifies all the positive and negative examples, we can only approximate this by using a sensible synonym set for the seed goal-nouns. For each document in the test set, the sentence containing the pattern with the highest precision (if any) was extracted as the goal sentence.</Paragraph> <Paragraph position="5"> We also compared our system to two baselines.</Paragraph> <Paragraph position="6"> We replaced the lists obtained from the lexical bootstrapping module with a) just the seed pair and b) the seed pair and all the WordNet synonyms of the components of the seed pair. (Footnote 9: Bootstrapping should in principle do better than a thesaurus, as some of our correctly identified variants are not true synonyms (e.g., theory vs. method), and as noise through overgeneration of unrelated senses might occur unless automatic word sense disambiguation is performed.)</Paragraph> <Paragraph position="7"> The results of these experiments are given in Fig. 9. Figure 9 (Method / Correct sentences): Our system with bootstrapping: 88 (73%); Ravichandran and Hovy (2002): 58 (48%); Our system, no bootstrapping, WordNet: 50 (41%); Our system, no bootstrapping, seeds only: 37 (30%). All differences are statistically significant with the χ2 test at p=.01 (except those between Ravichandran/Hovy and our non-bootstrapping/WordNet system). Our bootstrapping system outperforms the Ravichandran and Hovy algorithm by 34%. This is not surprising, because that algorithm was not designed to perform well in tasks where there is no clear negative context. The results also show that bootstrapping outperforms a general thesaurus such as WordNet.</Paragraph> <Paragraph position="8"> Out of the 33 articles where our system's favourite was not an annotated goal-type sentence, only 15 are due to bootstrapping errors (i.e., to an incorrect ranking of the lexical variants), corresponding to an 88% accuracy of the bootstrapping module. Examples from those 15 error cases are given in Fig. 10. Figure 10 (System chose / but should have chosen): derive set / compare model; illustrate algorithm / present formalisation; discuss measures / present variations; describe modifications / propose measures; accommodate material / describe approach; examine material / present study.</Paragraph> <Paragraph position="9"> The other errors were due to the cue phrase not being a transitive verb direct object pattern (e.g.
"we show that", "our goal is" and "we focus on"), so the system could not have found anything (11 cases, or an 80% accuracy); ungrammatical English or a syntactic construction too complex, resulting in RASP failing to detect the crucial grammatical relation (2 cases); and failure of the semantic filter to catch non-goal contexts (5 cases).</Paragraph> </Section> <Section position="7" start_page="926" end_page="926" type="metho"> <SectionTitle> 5 Human evaluation </SectionTitle> <Paragraph position="0"> We next perform two human experiments to indirectly evaluate the quality of the automatically generated cue phrase variants. Given an abstract of an article and a sentence extracted from the article, judges are asked to assign a score ranging from 1 (low) to 5 (high) depending on how well the sentence expresses the goal of that article (Exp. A) or the continuation of previous work (Exp. B).</Paragraph> <Paragraph position="1"> Each experiment involves 24 articles drawn randomly from a subset of 80 articles in the CmpLg corpus that contain manual annotation for goal-type and continuation-type sentences. The experiments use three external judges (graduate students in computational linguistics) and a Latin Square experimental design with three conditions: Baseline (see below), System-generated and Ceiling (extracted from the gold standard annotation used in Teufel (1999)). Judges were not told how the sentences were generated, and no judge saw an item in more than one condition.</Paragraph> <Paragraph position="2"> The baseline for Experiment A was a random selection of sentences with the highest TF*IDF scores, because goal-type sentences typically contain many content words. The baseline for Experiment B (continuation-type) consisted of randomly selected sentences containing citations, because these often co-occur with statements of continuation. In both cases, the length of the baseline sentence was controlled for by the average lengths of the gold standard and the system-extracted sentences in the document.</Paragraph> <Paragraph position="3"> Fig. 11 shows that judges gave an average score of 3.08 to system-extracted sentences in Exp. A, compared with a baseline of 1.58 and a ceiling of 3.91; in Exp. B, the system scored 3.67, with a higher baseline of 2.50 and a ceiling of 4.33.</Paragraph> <Paragraph position="4"> According to the Wilcoxon signed-ranks test at α = .01, the system is indistinguishable from the gold standard but significantly different from the baseline, in both experiments. Although this study is on a small scale, it indicates that humans judged sentences obtained with our method to be almost as characteristic of their rhetorical function as human-chosen sentences, and much better than non-trivial baselines.</Paragraph> </Section> </Paper>