<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1603"> <Title>Paraphrase Recognition via Dissimilarity Significance Classification</Title> <Section position="5" start_page="19" end_page="19" type="metho"> <SectionTitle> 3 Methodology </SectionTitle> <Paragraph position="0"> As noted earlier, for a pair of sentences to be a paraphrase, they must possess two attributes: 1. similarity: they share a substantial amount of information nuggets; 2. dissimilarities are extraneous: if extra information in the sentences exists, the effect of its removal is not significant.</Paragraph> <Paragraph position="1"> A key decision for our two-phase PR framework is to choose the representation of an information nugget. A simple approach is to use representative words as information nuggets, as is done in the SimFinder system (Hatzivassiloglou et al., 2001). Instead of using words, we choose to equate information nuggets with predicate argument tuples. A predicate argument tuple is a structured representation of a verb predicate together with its arguments. Given a sentence from the example in Figure 1, its predicate argument tuple form in PropBank (Kingsbury et al., 2002) format is: target (predicate): hurt; arg0: a young man; arg1: Richard Miller. We feel that this is a better choice for the representation of a nugget, as it accounts for the action, the concepts and their relationships as a single unit. In comparison, using fine-grained units such as words, including nouns and verbs, may result in inaccuracy (sentences that share vocabulary may not be paraphrases), while using coarser-grained units may cause key differences to be missed. In the rest of this paper, we use the term tuple for conciseness when no ambiguity is introduced.</Paragraph> <Paragraph position="2"> An overview of our paraphrase recognition system is shown in Figure 2. A pair of sentences is first fed to a syntactic parser (Charniak, 2000) and then passed to a semantic role labeler (ASSERT; Pradhan et al., 2004) to label predicate argument tuples. We then calculate normalized tuple similarity scores over the tuple pairs using a metric that accounts for similarities in both the syntactic structure and the content of each tuple. A thesaurus constructed from corpus statistics (Lin, 1998) is utilized for the content similarity.</Paragraph> <Paragraph position="3"> We utilize this metric to greedily pair together the most similar predicate argument tuples across sentences. Any remaining unpaired tuples represent extra information and are passed to a dissimilarity classifier to decide whether such information is significant. The dissimilarity classifier uses supervised machine learning to make this decision.</Paragraph> </Section>
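To make the nugget representation concrete, here is a minimal sketch of how a predicate argument tuple could be stored, using the Figure 1 example above; the class and field names are illustrative assumptions, not part of the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PredArgTuple:
    """A predicate argument tuple: a verb predicate plus its labeled arguments.

    target is the verb predicate; args maps PropBank-style role labels
    (arg0, arg1, ...) to their text spans. Names are illustrative only.
    """
    target: str
    args: Dict[str, str] = field(default_factory=dict)

# The Figure 1 example in PropBank format: target hurt, arg0 a young man, arg1 Richard Miller
example = PredArgTuple(target="hurt",
                       args={"arg0": "a young man",      # agent
                             "arg1": "Richard Miller"})  # patient
```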
<Section position="6" start_page="19" end_page="20" type="metho"> <SectionTitle> 4 Similarity Detection and Pairing </SectionTitle> <Paragraph position="0"> We illustrate the advantage of using predicate argument tuples with our running example. In Table 1, one of the model sentences is shown in the middle column, and two edited versions are shown in the left and right columns. While it is clear that the left modification is an example of a paraphrase and the right is not, the version on the left involves more changes in its syntactic structure and vocabulary. Standard word or syntactic similarity measures would assign the right modification a higher similarity score, likely mislabeling one or both modifications.</Paragraph> <Paragraph position="1"> In contrast, semantic role labeling identifies the dependencies between predicates and their arguments, allowing a more precise measurement of sentence similarity. Assuming that the arguments in predicate argument tuples are assigned the same role when their roles are comparable (footnote 1: ASSERT, which is trained on PropBank, only guarantees consistency of the arg0 and arg1 slots, but we have found in practice that aligning arg2 and higher arguments does not cause problems), we define the similarity score of two tuples $T_a$ and $T_b$ as the weighted sum of the pairwise similarities of all their shared constituents $C = \{(c_a, c_b)\}$, $c$ being either the target or one of the arguments shared by both tuples: $Sim(T_a, T_b) = \frac{1}{\alpha}\Big(w_{target} \cdot sim(target_a, target_b) + \sum_{(c_a, c_b) \in C_{arg}} sim(c_a, c_b)\Big)$ (1), where $C_{arg}$ denotes the shared argument pairs in $C$ and the normalization factor $\alpha$ is the sum of the weights of the constituents in $C$, i.e.: $\alpha = \|\{arg_{shared}\}\| + w_{target}$ (2), the number of shared arguments plus the target weight.</Paragraph> <Paragraph position="2"> Table 1. Modification 1 (paraphrase): Richard Miller was hurt by a young man. Model sentence: Authorities said a young man injured Richard Miller. Modification 2 (non-paraphrase): Authorities said Richard Miller injured a young man.</Paragraph> <Paragraph position="6"> In our current implementation we reduce targets and their arguments to their syntactic headwords. These headwords are then directly compared using a corpus-based similarity thesaurus. As we hypothesized that targets are more important for predicate argument tuple similarity, we multiply the target's similarity by a weighting factor $w_{target}$, whose value we have empirically determined as 1.7, based on a 300-pair development set from the MSR training set.</Paragraph> <Paragraph position="7"> We then proceed to pair tuples in the two sentences using a greedy iterative algorithm. The algorithm locates the two most similar tuples across the sentences, pairs them together and removes them from further consideration. The process stops when subsequent best pairings fall below the similarity threshold or when all possible tuples are exhausted. If unpaired tuples still exist in a given sentence pair, we further examine the copular constructions and noun phrases in the opposing sentence for possible pairings (footnote 2: Copular constructions are not handled by ASSERT, yet such constructions account for a large proportion of the semantic meaning in sentences. Consider the pair "Microsoft rose 50 cents" and "Microsoft was up 50 cents", in which the second is in copular form. Similarly, NPs can often be equivalent to predicate argument tuples when actions are nominalized. Consider an NP that reads "(be blamed for) frequent attacks on soldiers" and a predicate argument tuple "(be blamed for) attacking soldiers". Again, identical information is conveyed but not captured by semantic role labeling. In such cases, the tuples can be paired if we allow a candidate tuple to pair with the predicative argument (e.g., "50 cents") of a copula, or with (the head of) an NP in the opposing sentence. As these heuristic matches may introduce errors, we resort to these methods of tuple matching only in the contingency when there are unpaired tuples). This results in a one-to-one mapping, with possibly some tuples left unpaired.</Paragraph>
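A minimal sketch of the tuple similarity score (Equations 1 and 2) and the greedy pairing loop described above, assuming a word-level similarity function sim(a, b) in [0, 1] (e.g., backed by a distributional thesaurus); the target weight of 1.7 comes from the text, while the threshold value and all function and variable names are illustrative assumptions.

```python
from typing import Dict, List

W_TARGET = 1.7  # empirically determined target weight (Section 4)

def tuple_similarity(ta: Dict[str, str], tb: Dict[str, str], sim) -> float:
    """Weighted, normalized similarity of two predicate argument tuples.

    ta and tb map constituent labels ("target", "arg0", "arg1", ...) to
    headwords; sim(a, b) is a word-level similarity in [0, 1].
    """
    shared_args = [k for k in ta if k != "target" and k in tb]
    score = W_TARGET * sim(ta["target"], tb["target"])
    score += sum(sim(ta[k], tb[k]) for k in shared_args)
    alpha = len(shared_args) + W_TARGET        # Equation (2)
    return score / alpha                        # Equation (1)

def greedy_pairing(sent_a: List[dict], sent_b: List[dict], sim,
                   threshold: float = 0.5):
    """Repeatedly pair the most similar remaining tuples across sentences.

    Stops when the best remaining pairing falls below the threshold (a
    placeholder value here) or when tuples run out; returns the one-to-one
    pairing plus the unpaired tuples left in each sentence.
    """
    remaining_a, remaining_b = list(sent_a), list(sent_b)
    pairs = []
    while remaining_a and remaining_b:
        best = max(((tuple_similarity(ta, tb, sim), ta, tb)
                    for ta in remaining_a for tb in remaining_b),
                   key=lambda x: x[0])
        if best[0] < threshold:
            break
        _, ta, tb = best
        pairs.append((ta, tb))
        remaining_a.remove(ta)
        remaining_b.remove(tb)
    return pairs, remaining_a, remaining_b
```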
The curved arrows in Table 1 denote the correct results of similarity pairing: two tuples are paired up if their targets and shared arguments are identical or similar, respectively; otherwise they remain unpaired, even if the bags of words they contain are the same.</Paragraph> </Section> <Section position="7" start_page="20" end_page="21" type="metho"> <SectionTitle> 5 Dissimilarity Significance Classification </SectionTitle> <Paragraph position="0"> If some tuples remain unpaired, they are dissimilar parts of the sentences that need to be labeled by the dissimilarity classifier. Such unpaired information could be extraneous, or it could be semantically important, creating a barrier to paraphrase. We frame this as a supervised machine learning problem in which a set of features is used to inform the classifier. A support vector machine, SVMlight, was chosen as the learning model, as it has been shown to yield good performance over a wide range of applications. We experimented with a wide set of features of unpaired tuples, including internal counts of numeric expressions, named entities, words and semantic roles, whether they are similar to other tuples in the same sentence, and contextual features such as source/target sentence length and paired tuple count. Currently, only two features correlate with improved classification, which we detail now.</Paragraph> <Paragraph position="1"> Syntactic Parse Tree Path: This is a series of features that reflect how the unpaired tuple connects with its context, the rest of the sentence. It models the syntactic connection between the constituents on both ends of the path (Gildea and Palmer, 2002; Pradhan et al., 2004). Here, we model the ends of the path as the unpaired tuple and the paired tuple with the closest shared ancestor, and model the path itself as a sequence of constituent category tags and directions to reach the destination (the paired target) from the source (the unpaired target) via the shared ancestor. When no tuples have been paired in the sentence pair, the destination defaults to the root of the syntactic parse tree. For example, the tuples with target "injured" are unpaired when the model sentence and the non-paraphrasing modification in Table 1 are compared; the path extracted in this case reads ↑VBD, ↑VP, ↑S, ↑SBAR, ... The syntactic path can act as partial evidence in significance classification. In the above example, the category tag VBD assigned to "injured" indicates that the verb is in its past tense. Such a predicate argument tuple bears the main content of the sentence and generally cannot be ignored if its meaning is missing in the opposing sentence. Another example is shown in Figure 4. The second sentence has one unpaired target, "proposed", while all the rest find their counterparts. The path we get from the syntactic parse tree reads ↑VBN, ↑NP, ↑S, ..., showing that the unpaired tuple (consisting of a single predicate) is a modifier contained in an NP. It can be ignored if there is no contradiction in the opposing sentence.</Paragraph> <Paragraph position="2"> We represent a syntactic path by a set of n-gram (n ≤ 4) features of subsequences of category tags found in the path, along with the respective directions. We require these n-gram features to be no more than four category tags away from the unpaired target, as our primary concern is to model what role the target plays in its sentence.</Paragraph>
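As an illustration of the feature encoding, the sketch below turns a direction-tagged path such as ↑VBD, ↑VP, ↑S, ↑SBAR into n-gram (n ≤ 4) subsequence features anchored near the unpaired target; the string encoding and the interpretation of the distance constraint are assumptions, not the paper's exact implementation.

```python
from typing import List

def path_ngram_features(path: List[str], max_n: int = 4,
                        max_dist: int = 4) -> List[str]:
    """Extract n-gram features from a syntactic parse tree path.

    path is the sequence of direction-tagged category tags from the unpaired
    target toward the shared ancestor and on to the paired target, e.g.
    ["^VBD", "^VP", "^S", "^SBAR"] (here "^" marks an upward step).
    Only n-grams (n <= max_n) starting within max_dist tags of the unpaired
    target are kept, since the aim is to model the role the unpaired target
    plays in its own sentence.
    """
    features = []
    limit = min(max_dist, len(path))
    for start in range(limit):
        for n in range(1, max_n + 1):
            if start + n <= len(path):
                features.append("_".join(path[start:start + n]))
    return features

# The unpaired "injured" tuple from the non-paraphrasing case in Table 1:
print(path_ngram_features(["^VBD", "^VP", "^S", "^SBAR"]))
```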
<Paragraph position="3"> Figure 4: an unpaired tuple acting as a modifier in a paraphrase. Sentence 1: Sheena Young of Child, the national infertility support network, hoped the guidelines would lead to a more fair and equitable service for infertility sufferers. Sentence 2: Sheena Young, a spokesman for Child, the national infertility support network, said the proposed guidelines should lead to a more fair and equitable service for infertility sufferers.</Paragraph> <Paragraph position="4"> Predicate: This is the lexical token of the predicate argument tuple's target, as a text feature. As this feature is liable to run into sparse data problems, the semantic category of the target would be a more suitable feature. However, verb similarity is generally regarded as difficult to measure, both in terms of semantic relatedness and in finding a consistent granularity for verb categories. We investigated using WordNet as well as Levin's classification (Levin, 1993) as additional features on our validation data, but currently find that using the lexical form of the target works best.</Paragraph> <Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 5.1 Classifier Training Set Acquisition </SectionTitle> <Paragraph position="0"> Currently, no training corpus for predicate argument tuple significance exists. Such a corpus is indispensable for training the classifier. Rather than manually annotating training instances, we use an automatic method to construct instances from paraphrase corpora. This is possible because the paraphrase judgments in the corpora imply which portions of the sentence(s) are or are not significant barriers to paraphrasing. Here, we exploit the similarity detector implemented for the first phase for this purpose. If unpaired tuples exist after greedy pairing, we classify them along two dimensions: whether the sentence pair is a (non-)paraphrase, and the source of the unpaired tuples: 1. [PS] paraphrasing pairs whose unpaired predicate argument tuples come from a single sentence only; 2. [NS] non-paraphrasing pairs in which only one single unpaired predicate argument tuple exists; 3. [PM] paraphrasing pairs whose unpaired predicate argument tuples come from multiple (both) sentences; 4. [NM] non-paraphrasing pairs in which multiple unpaired predicate argument tuples (from either one or both sentences) exist.</Paragraph> <Paragraph position="1"> Assuming that the similarity detector pairs tuples correctly, for the first two categories the paraphrasing judgment is directly linked to the unpaired tuples. PS tuple instances are therefore used as instances of the insignificant class, and NS instances as significant ones. The last two categories cannot be used as training data, as it is unclear which of the unpaired tuples is responsible for the (non-)paraphrasing, since the similarity measure may mistakenly leave some similar predicate argument tuples unpaired.</Paragraph> </Section> </Section>
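A sketch of the automatic training set acquisition rule of Section 5.1, under the assumption that a pairing routine returns the unpaired tuples of each sentence and that the corpus supplies a gold paraphrase judgment; PS tuples are labeled insignificant, NS tuples significant, and PM/NM cases are discarded. All names here are illustrative.

```python
def collect_training_instances(corpus, pair_tuples):
    """Automatically harvest classifier training instances from a
    paraphrase corpus (e.g. the MSR training split).

    corpus yields (sentence_a, sentence_b, is_paraphrase) triples;
    pair_tuples(sentence_a, sentence_b) is assumed to run SRL plus greedy
    pairing and return the unpaired tuples of each sentence.
    """
    instances = []  # (unpaired_tuple, label) pairs
    for sent_a, sent_b, is_paraphrase in corpus:
        unpaired_a, unpaired_b = pair_tuples(sent_a, sent_b)
        unpaired = unpaired_a + unpaired_b
        if not unpaired:
            continue
        single_side = not unpaired_a or not unpaired_b
        if is_paraphrase and single_side:
            # [PS]: paraphrase, unpaired tuples from one sentence only
            instances += [(t, "insignificant") for t in unpaired]
        elif not is_paraphrase and len(unpaired) == 1:
            # [NS]: non-paraphrase with exactly one unpaired tuple
            instances.append((unpaired[0], "significant"))
        # [PM] and [NM] cases are ambiguous and are skipped
    return instances
```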
<Section position="8" start_page="21" end_page="23" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> The goal of our evaluation is to show that our system can reliably determine the cause(s) of non-paraphrase examples, while maintaining the performance level of state-of-the-art PR systems. For evaluation, we conduct both component evaluations and a holistic one, resulting in three separate experiments. In evaluating the first component, tuple pairing, we aim for high precision, so that sentences that have all tuples paired can be safely assumed to be paraphrases. In evaluating the dissimilarity classifier, we simply aim for high accuracy. In our overall system evaluation, we compare our system against other PR systems on standard corpora.</Paragraph> <Paragraph position="1"> Experimental Data Set. For these experiments, we utilized two widely used corpora for paraphrasing evaluation: the MSR and PASCAL RTE corpora. The Microsoft Research Paraphrase corpus (Dolan et al., 2004) consists of 5801 newswire sentence pairs, 3900 of which are annotated as semantically equivalent by human annotators. It reflects ordinary paraphrases that people often encounter in news articles, and may be viewed as a typical domain-general paraphrase recognition task that downstream NLP systems will need to deal with. The corpus comes divided into standard training (70%) and testing (30%) divisions, a partition we follow in our experiments. ASSERT (the semantic role labeler) shows that a sentence in this corpus contains 2.24 predicate argument tuples on average. The second corpus is the paraphrase acquisition subset of the PASCAL Recognizing Textual Entailment (RTE) Challenge corpus (Dagan et al., 2005). This is much smaller, consisting of 50 pairs, which we employ for testing only, to show portability.</Paragraph> <Paragraph position="2"> To assess the component performance, we need additional ground truth beyond the {+pp, -pp} labels provided by the corpora. For the first evaluation, we need to ascertain whether a sentence pair's tuples are correctly paired, misidentified or mispaired; for the second, which tuple(s) (if any) are responsible for a -pp instance. However, creating ground truth by manual annotation is expensive, and thus we only sampled the data set to get an indicative assessment of performance. We sampled 200 random instances from the MSR testing set and first processed them through our framework. Then, five human annotators (two authors and three volunteers) annotated the ground truth for tuple pairing and the semantic significance of the unpaired tuples, while checking system output. They also independently came up with their own {+pp, -pp} judgments so that we could assess the reliability of the provided annotations.</Paragraph> <Paragraph position="3"> The results of this annotation are shown in Table 2. Examining this data, we can see that the similarity detector performs well, despite its simplicity and its assumption of a one-to-one mapping. Out of the 157 predicate argument tuple pairs identified through similarity detection, annotators agreed that 144 (92%) are semantically similar or equivalent. However, 31 similar pairs were missed by the system, resulting in 82% recall. We defer discussion of the other details of this table to Section 7.</Paragraph>
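As a quick arithmetic check of the pairing figures above (157 system pairings, 144 judged correct, 31 similar pairs missed), the reported precision and recall follow directly; the snippet below is only a worked recomputation, not part of the evaluation code.

```python
proposed, correct, missed = 157, 144, 31
precision = correct / proposed             # 144 / 157 ~= 0.92
recall = correct / (correct + missed)      # 144 / 175 ~= 0.82
print(f"pairing precision = {precision:.2f}, recall = {recall:.2f}")
```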
<Paragraph position="4"> To assess the dissimilarity classifier, we focus on the -pp subset of 55 instances recognized by the system. For 43 unpaired tuples from 40 sentence pairs (73% of 55), the annotators' judgments agree with the classifier's claim that they are significant. For these cases, the system is able both to recognize that the sentence pair is not a paraphrase and to correctly establish a cause of the non-paraphrase.</Paragraph> <Paragraph position="5"> In addition to this sampled ground truth evaluation, we also show the effectiveness of the classifier by examining its performance on PS and NS tuples in the MSR corpus, as described in Section 5. The test set consists of 413 randomly selected PS and NS instances, among which 145 are significant (leading to non-paraphrases). The classifier predicts predicate argument tuple significance at an accuracy of 71%, outperforming a majority classifier (65%), a gain which is marginally statistically significant (p < .09).</Paragraph> <Paragraph position="6"> The classifier's contingency table on this test set is as follows. Classified insignificant: 112 significant, 263 insignificant. Classified significant: 33 significant, 5 insignificant. We can see that the classifier classifies the majority of insignificant tuples correctly (by outputting a score greater than zero), which is effectively a 98% recall of insignificant tuples. However, the precision is less satisfactory. We suspect this is partially due to tuples that fail to be paired up with their counterparts; such noise is found among the automatically collected PS instances used in training.</Paragraph> <Paragraph position="7"> For the final system-wide evaluation, we implemented two baseline systems: a majority classifier and SimFinder (Hatzivassiloglou et al., 2001), a bag-of-words sentence similarity module incorporating lexical, syntactic and semantic features. In Table 3, precision and recall are measured with respect to the paraphrasing class. The table shows that sentence pairs falling under the "pairs without unpaired tuples" column are more likely to be paraphrasing than an arbitrary pair (79.5% versus 66.5%), providing further validation for using predicate argument tuples as information nuggets.</Paragraph> <Paragraph position="8"> The results of the experiment benchmarking the overall system performance are shown under the Overall column: our approach performs comparably to the baselines in both accuracy and paraphrase recall. The system performance reported in CM05 (Corley and Mihalcea, 2005), which is among the best we are aware of, is also included for comparison. We also ran our system (trained on the MSR corpus) on the 50 instances in the PASCAL paraphrase acquisition subset. Again, the system performance (as shown in Table 4) is comparable to the baseline systems.</Paragraph> </Section> </Paper>