<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1608"> <Title>Extracting Structural Paraphrases from Aligned Monolingual Corpora</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Approach </SectionTitle> <Paragraph position="0"> Our approach, like Barzilay and McKeown's, is built on the application of sentence-alignment techniques used in machine translation to generate paraphrases.</Paragraph> <Paragraph position="1"> The insight is simple: if we have pairs of sentences with the same semantic content, then the difference in lexical content can be attributed to variations in the surface form. By generalizing these differences, we can automatically derive paraphrases. [Footnote: For example, &quot;dog&quot; and &quot;cat&quot; are recognized to be similar, but they are obviously not paraphrases of one another.]</Paragraph> <Paragraph position="2"> Barzilay and McKeown perform this learning process by considering only the local context of words and their frequencies; as a result, paraphrases must be contiguous and, in the majority of cases, are only one word long. We believe that disregarding the rich syntactic structure of language is an oversimplification, and that structural paraphrases offer several distinct advantages over lexical paraphrases. Long-distance relations can be captured by syntactic trees, so that words in the paraphrases do not need to be contiguous. Use of syntactic trees also buffers against morphological variants (e.g., different inflections) and some syntactic variants (e.g., active vs.
passive).</Paragraph> <Paragraph position="3"> Finally, because paraphrases are context-dependent, we believe that syntactic structures can encapsulate a richer context than lexical phrases.</Paragraph> <Paragraph position="4"> Based on aligned monolingual corpora, our technique for extracting paraphrases builds on Lin and Pantel's insight of using dependency paths (derived from parsing) as the fundamental unit of learning and using parts of those paths as features. Based on the hypothesis that paths between identical words in aligned sentences are semantically equivalent, we can extract paraphrases by scoring the path frequency and context. Our approach addresses the limitations of both Barzilay and McKeown's and Lin and Pantel's work: using syntactic structures allows us to generate structural paraphrases, and using aligned corpora renders the process more computationally tractable. The following sections describe our approach in greater detail.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Corpus Alignment </SectionTitle> <Paragraph position="0"> Multiple English translations of foreign novels, e.g., Twenty Thousand Leagues Under the Sea by Jules Verne, were used for extraction of paraphrases.</Paragraph> <Paragraph position="1"> Although translations by different authors differ slightly in their literary interpretation of the original text, it was usually possible to find corresponding sentences that have the same semantic content. Sentence alignment was performed using the Gale and Church algorithm (1991) with the following cost function: $\text{cost of substitution} = 1 - \frac{ncw}{anw}$, where $ncw$ is the number of common words and $anw$ is the average number of words in the two strings.
Here is a sample from two different translations of Twenty Thousand Leagues Under the Sea: Ned Land tried the soil with his feet, as if to take possession of it.</Paragraph> <Paragraph position="2"> Ned Land tested the soil with his foot, as if he were laying claim to it.</Paragraph> <Paragraph position="3"> To test the accuracy of our alignment, we manually aligned 454 sentences from two different versions of Chapter 21 from Twenty Thousand Leagues Under the Sea and compared the results of our automatic alignment algorithm against the manually generated &quot;gold standard.&quot; We obtained a precision of 0.93 and a recall of 0.88, which is comparable to the numbers (precision 0.94, recall 0.85) reported by Barzilay and McKeown, who used a different cost function for the alignment process.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Parsing and Postprocessing </SectionTitle> <Paragraph position="0"> The sentence pairs produced by the alignment algorithm are then parsed by the Link Parser (Sleator and Temperley, 1993), a dependency-based parser developed at CMU. The resulting parse structures are postprocessed to render the links more consistent. Because the Link Parser does not directly identify the subject of a passive sentence, our postprocessor takes the object of the by-phrase as the subject by default. For our purposes, auxiliary verbs are ignored; the postprocessor connects verbs directly to their subjects, discarding links through any auxiliary verbs. In addition, subjects and objects within relative clauses are modified so that their linkages remain consistent with subject and object linkages in the matrix clause. For sentences involving verbs that have particles, the Link Parser connects the object of the verb directly to the verb itself, attaching the particle separately.
Our postprocessor modifies the link structure so that the object is connected to the particle in order to form a continuous path. Predicate adjectives are converted into an adjective-noun modification link instead of a complete verb-argument structure. Also, common nouns denoting places and people are marked by consulting WordNet.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.3 Paraphrase Extraction </SectionTitle> <Paragraph position="0"> The paraphrase extraction process starts by finding anchors within the aligned sentence pairs. In our approach, only nouns and pronouns serve as possible anchors. The anchor words from the sentence pairs are brought into alignment and scored by a simple set of ordered heuristics: * Exact string matches denote correspondence. * Noun and matching pronoun (same gender and number) denote correspondence. Such a match penalizes the score by 50%.</Paragraph> <Paragraph position="1"> * Unique semantic class (e.g., places and people) denotes correspondence. Such a match penalizes the score by 50%.</Paragraph> <Paragraph position="2"> * Unique part of speech (i.e., the only noun pair in the sentences) denotes correspondence. Such a match penalizes the score by 50%.</Paragraph> <Paragraph position="3"> * Otherwise, attempt to find correspondence by finding longest common substrings. Such a match penalizes the score by 50%.</Paragraph> <Paragraph position="4"> * If a word occurs more than once in the aligned sentence pairs, all possible combinations are considered, but the score for such a corresponding anchor pair is further penalized by 50%. For each pair of anchors, a breadth-first search is used to find the shortest path between the anchor words. The search algorithm explicitly rejects paths that contain conjunctions and punctuation. 
If valid paths are found between anchor pairs in both of the aligned sentences, the resulting paths are considered candidate paraphrases, with a default score of one (subject to penalties imposed by imperfect anchor matching).</Paragraph> <Paragraph position="5"> Scores of candidate paraphrases take into account two factors: the frequency of anchors with respect to a particular candidate paraphrase and the variety of different anchors from which the paraphrase was produced. The initial default score of any paraphrase is one (assuming perfect anchor matches), but for each additional occurrence the score is incremented</Paragraph> <Paragraph position="7"> , where n is the number of times the current set of anchors has been seen. Seeing a new set of anchors therefore has a large initial impact on the score, but the additional increase is subject to diminishing returns as more occurrences of the same anchors are encountered.</Paragraph> <Paragraph position="8"> Item: Count
aligned sentences: 27479
parsed aligned sentences: 25292
anchor pairs: 43974
paraphrases: 5925
unique paraphrases: 5502
gathered paraphrases (score &gt;= 1.0): 2886</Paragraph> </Section> </Section> <Section position="5" start_page="1" end_page="2" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> Using the approach described in previous sections, we were able to extract nearly six thousand different paraphrases (see Table 1) from our corpus, which consisted of two translations of 20,000 Leagues Under the Sea, two translations of The Kreutzer Sonata, and three translations of Madame Bovary.</Paragraph> <Paragraph position="1"> Our corpus was essentially the same as the one used by Barzilay and McKeown, with the exception of some short fairy tale translations that we found to be unsuitable.
Due to the length of sentences (some translations were noted for their paragraph-length sentences), the Link Parser was unable to produce a parse for approximately eight percent of the sentences. Although the Link Parser is capable of producing partial linkages, accuracy deteriorated significantly as the length of the input string increased. The distribution of paraphrase length is shown in [...] unique paraphrases were randomly chosen to be assessed by human judges. The human assessors were specifically asked whether they thought the paraphrases were roughly interchangeable with each other, given the context of the genre. We believe that the genre constraint was important because some paraphrases captured literary or archaic uses of particular words that were not generally useful. This should not be viewed as a shortcoming of our approach, but rather as an artifact of our corpus. In addition, sample sentences containing the structural paraphrases were presented as context to the judges; structural paraphrases are difficult to comprehend without this information.</Paragraph> <Paragraph position="2"> A summary of the judgments provided by human evaluators is shown in Table 2. The average precision of our approach stands at just over forty percent; the average length of the paraphrases learned was 3.26 words. Our results also show that judging structural paraphrases is a difficult task and that inter-assessor agreement is rather low. All of the evaluators agreed on the judgments (either positive or negative) only 75.4% of the time. The average correlation constant of the judgments is only 0.66.</Paragraph> <Paragraph position="3"> The highest-scoring paraphrase was the equivalence of the possessive morpheme 's with the preposition of. We found it encouraging that our algorithm was able to induce this structural paraphrase, complete with co-indexed anchors on the ends of the paths, i.e., A's B == B of A.
Some other interesting examples include: Brief description of link labels: S: subject to verb; O: object to verb; OF: certain verbs to of; K: verbs to particles; MV: verbs to certain modifying phrases. See Link Parser documentation for full descriptions.</Paragraph> <Paragraph position="4"> Example: He thought fit, after the first few mouthfuls, to give some details as to the catastrophe. == After the first few mouthfuls he considered it appropriate to supply a few details concerning the catastrophe. A more detailed breakdown of the evaluation results can be seen in Table 3. Increasing the threshold for generating paraphrases tends to increase their precision, up to a certain point. In general, the highest-ranking structural paraphrases consisted of single-word paraphrases of prepositions, e.g., at == in. Our algorithm noticed that different prepositions were often interchangeable, a judgment on which our human assessors disagreed widely. Beyond a certain threshold, the accuracy of our approach actually decreases.</Paragraph> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> An obvious first observation about our algorithm is the dependence on parse quality; bad parses lead to many bogus paraphrases. Although the parse results from the Link Parser are far from perfect, it is unclear whether other purely statistical parsers would fare any better, since they are generally trained on corpora containing a totally different genre of text.</Paragraph> <Paragraph position="1"> However, future work will most likely include a comparison of different parsers.</Paragraph> <Paragraph position="2"> Examination of our results shows that a better notion of constituency would increase the accuracy of our results.
Our algorithm occasionally generates nonsensical paraphrases that cross constituent boundaries, for example, including the verb of a subordinate clause with elements from the matrix clause. Other problems arise because our current algorithm has no notion of verb phrases; it often generates near misses such as fail == succeed, neglecting to include not as part of the paraphrase.</Paragraph> <Paragraph position="3"> However, there are problems inherent in paraphrase generation that simple knowledge of constituency alone cannot solve. Consider the following two sentences: John made out gold at the bottom of the well.</Paragraph> <Paragraph position="4"> John discovered gold near the bottom of the well.</Paragraph> <Paragraph position="5"> Which structural paraphrases should we be able to extract? made out X at Y == discovered X near Y; made out X == discovered X; at X == near X. Arguably, all three paraphrases are valid, although opinions vary more regarding the last paraphrase.</Paragraph> <Paragraph position="6"> What is the optimal level of structure for paraphrases? Obviously, this represents a tradeoff between specificity and accuracy, but the ability of structural paraphrases to capture long-distance relationships across large numbers of lexical items complicates the problem. Due to the sparseness of our data, our algorithm cannot make a good decision about which constituents to generalize as variables; naturally, greater amounts of data would alleviate this problem. This current inability to decide on a good &quot;scope&quot; for paraphrasing was a primary reason why we were unable to perform a strict evaluation of recall. Our initial attempts at generating a gold standard for estimating recall failed because human judges could not agree on the boundaries of paraphrases.</Paragraph> <Paragraph position="7"> The accuracy of our structural paraphrases is highly dependent on the corpus size.
As can be seen from the numbers in Table 1, paraphrases are rather sparse: nearly 93% of them are unique. Without adequate statistical evidence, validating candidate paraphrases can be very difficult. Although our data sparseness problem can be alleviated simply by gathering a larger corpus, the type of parallel text our algorithm requires is rather hard to obtain, i.e., there are only so many translations of so many foreign novels. Furthermore, since our paraphrases are arguably genre-specific, different applications may require different training corpora. Similar to the work of Barzilay and Lee (2003), who have applied paraphrase generation techniques to comparable corpora consisting of different newspaper articles about the same event, we are currently attempting to solve the data sparseness problem by extending our approach to non-parallel corpora.</Paragraph> <Paragraph position="8"> We believe that generating paraphrases at the structural level holds several key advantages over lexical paraphrases, from the capturing of long-distance relationships to the more accurate modeling of context. The paraphrases generated by our approach could prove to be useful in any natural language application where understanding of linguistic variations is important. In particular, we are attempting to apply our results to improve the performance of question answering systems, which we will describe in the following section.</Paragraph> </Section> </Paper>