<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-5001">
  <Title>Support Vector Machines for Paraphrase Identification and Corpus Construction</Title>
  <Section position="4" start_page="2" end_page="6" type="metho">
    <SectionTitle>
3.4 Features
</SectionTitle>
    <Paragraph position="0"> Some 264,543 features, including overt lexical pairings, were in theory available to the classifier. In practice, however, the number of dimensions used typically fell to less than 1000 after the lowest frequency features are eliminated (see Table 4.) The main feature classes were: String Similarity Features: All sentence pairs were assigned string-based features, including absolute and relative length in words, number of shared words, word-based edit distance, and lexical distance, as measured by converting the sentences into alphabetized strings of unique words and applying word based edit distance.</Paragraph>
    <Paragraph position="1"> Morphological Variants: Another class of features was co-ocurrence of morphological variants in sentence pairs. Approximately 490,000 sentences in our primary datasets were stemmed using a rule-based stemmer, to yield a lexicon of 95,422 morphologically variant word pairs. Each word pair was treated as a feature. Examples are: orbit|orbital orbiter|orbiting WordNet Lexical Mappings: Synonyms and hypernyms were extracted from WordNet,  (http://www.cogsci.princeton.edu/~wn/; Fellbaum, 1998), using the morphological variant lexicon from the 490,000 sentences as keywords. The theory here is that as additional paraphrase pairs are identified by the classifier, new information will &amp;quot;come along for the ride,&amp;quot; thereby augmenting the range of paraphrases available to be learned. A lexicon of 314,924 word pairs of the following form created. Only those pairs identified as occurring in either training data or the corpus to be classified were included in the final classifier.</Paragraph>
    <Paragraph position="2"> operation|procedure operation|work Word Association Pairs: To augment the above resources, we dynamically extracted from the L12 corpus a lexicon of 13001 possibly-synonymous word pairs using a log-likelihood algorithm described in Moore (2001) for machine translation. To minimize the damping effect of the overwhelming number of identical words, these were deleted from each sentence pair prior to processing; the algorithm was then run on the non-identical residue as if it were a bilingual parallel corpus.</Paragraph>
    <Paragraph position="3"> To deploy this data in the SVM feature set, a cutoff was arbitrarily selected that yielded 13001 word pairs. Some exemplars (not found in WordNet) include: straight|consecutive vendors|suppliers Fig. 1 shows the distribution of word pairings obtained by this method on the L12 corpus in comparison with WordNet. Examination of the top-ranked 1500 word pairs reveals that 46.53% are found in WordNet and of the remaining 53.47%, human judges rated 56% as good, yielding an overall &amp;quot;goodness score&amp;quot; of 76.47%. Judgments were by two independent raters.</Paragraph>
    <Paragraph position="4"> For the purposes of comparison, we automatically eliminated pairs containing trivial substring differences, e.g., spelling errors, British vs. American spellings, singular/plural alternations, and miscellaneous short abbreviations. All pairs on which the raters disagreed were discarded. Also discarded were a large number of partial phrasal matches of the &amp;quot;reported|according&amp;quot; and &amp;quot;where|which&amp;quot; type, where part of a phrase (&amp;quot;according to&amp;quot;, &amp;quot;in which&amp;quot;) was missing. Although viewed in isolation these do not constitute valid synonym or hyperrnym pairs, the ability to identify these partial matchings is of central importance within an SMT-framework of paraphrase alignment and generation. These results suggest, among other things, that dynamically-generated lexical data of this kind might be useful in increasing the coverage of hand-built synonymy resources.</Paragraph>
    <Paragraph position="5"> Composite Features: From each of the lexical feature classes, we derived a set of more abstract features that summarized the frequency with which each feature or class of features occurred in the training data, both independently, and in correlation with others. These had the effect of performing normalization for sentence length and other factors. Some examples are:</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Methodology
</SectionTitle>
      <Paragraph position="0"> Evaluation of paraphrase recognition within an SMT framework is highly problematic, since no technique or data set is standardly recognized.</Paragraph>
      <Paragraph position="1"> Barzilay &amp; Lee (2003) and Quirk et al. (2004) use human evaluations of end-to-end generation, but these are not very useful here, since they add an additional layer of uncertainty into the evaluation, and depend to a significant extent on the quality and functionality of the decoder.</Paragraph>
      <Paragraph position="2"> Dolan &amp; Brockett (2005) report extraction precision of 67% using a similar classifier, but with the explicit intention of creating a corpus that contained a significant number of naturallyoccuring paraphrase-like negative examples.</Paragraph>
      <Paragraph position="3"> Since our purpose in the present work is nonapplication specific corpus construction, we apply an automated technique that is widely used for reporting intermediate results in the SMT community, and is being extended in other fields such as summarization (Daume and Marcu, forthcoming), namely word-level alignment using an off-the-shelf implementation of the SMT system GIZA++ (Och &amp; Ney, 2003). Below, we use Alignment Error Rate (AER), which is indicative of how far the corpus is from providing a solution under a standard SMT tool. This allows the effective coverage of an extracted corpus to be evaluated efficiently, repeatedly against a single standard, and at little cost after the initial tagging. Further, if used as an objective function, the AER technique offers the prospect of using hillclimbing or other optimization techniques for non-application-specific corpus extraction.</Paragraph>
      <Paragraph position="4"> To create the test set, two human annotators created a gold standard word alignment on held out data consisting of 1007 sentences pairs. Following the practice of Och &amp; Ney (2000, 2003), the annotators each created an initial annotation, categorizing alignments as either SURE (necessary) or POSSIBLE (allowed, but not required). In the event of differences, annotators were asked to review their choices. First pass inter-rater agreement was 90.28%, climbing to 94.43% on the second pass. Finally we combined the annotations into a single gold standard as follows: if both annotators agreed that an alignment was SURE, it was tagged as SURE in the goldstandard; otherwise it was tagged as POSSIBLE.</Paragraph>
      <Paragraph position="5"> To compute Precision, Recall, and Alignment Error Rate (AER), we adhere to the formulae listed in Och &amp; Ney (2003). Let A be the set of alignments in the comparison, S be the set of SURE alignments in the gold standard, and P be the union of the SURE and POSSIBLE alignments in the gold standard:</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="5" type="sub_section">
      <SectionTitle>
4.2 Baselines
</SectionTitle>
      <Paragraph position="0"> Evaluations were performed on the heuristically-derived L12, F2, and F3 datasets using the above formulation. Results are shown in Table 3.</Paragraph>
      <Paragraph position="1"> L12 represents the best case, followed respectively by F3 and F2. AERs were also computed separately for identical (Id) and non-identical (Non-Id) word mappings in order to be able to  drill down on the extent to which new non-identical mappings are being learned from the data. A high Id error rate can be considered indicative of noise in the data. The score that we are most interested in, however, is the Non-Id alignment error rate, which can be considered indicative of coverage as represented by the Giza++ alignment algorithm's ability to learn new mappings from the training data. It will be observed that the F3 dataset non-Id AER is smaller than that of the F2 dataset: it appears that more data is having the desired effect.</Paragraph>
      <Paragraph position="2"> Following accepted SMT practice, we added a lexicon of identical word mappings to the training data, since Giza++ does not directly model word identity, and cannot easily capture the fact that many words in paraphrase sentence may translate as themselves. We did not add in word pairs derived from word association data or other supplementary resources that might help resolve matches between unlike but semantically similar words.</Paragraph>
    </Section>
    <Section position="3" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
4.3 Training on the 10K Data
</SectionTitle>
      <Paragraph position="0"> We trained an SVM on the 10 K training set employing 3-fold cross-validation on the training set itself. Validation errors were typically in the region of 16-17%. Linear kernels with default parameters (tolerance=1e-3; margin size computed automatically; error probability=0.5) were employed throughout. Applying the SVM to the F3 data, using 946 features encountered in the training data with frequency &gt; 4, this classifier yielded a set of 24588 sentence pairs, which were then aligned using Giza++.</Paragraph>
      <Paragraph position="1"> The alignment result is shown in Table 3. The &amp;quot;10K Trained&amp;quot; row represents the results of applying Giza++ to the data extracted by the SVM.</Paragraph>
      <Paragraph position="2"> Non-identical word AER, at 24.70%, shows a 36.9% reduction in the non-identical word AER over the F2 dataset (which is approximately double the size), and approximately 28% over the original F3 dataset. This represents a huge improvement in the quality of the data collected by using the SVM and is within striking distance of the score associated with the L12 best case.</Paragraph>
      <Paragraph position="3"> The difference is especially significant when it is considered that the newly constructed corpus is less than one-tenth the size of the best-case corpus. Table 5 shows sample extracted sentences. null To develop insights into the relative contributions of the different feature classes, we omitted some feature classes from several runs. The results were generally indistinguishable, except for non-Id AER, shown in Table 4, a fact that may be taken to indicate that string-based features such as edit distance still play a major role. Eliminating information about morphological alternations has the largest overall impact, producing a degradation of a 0.94 in on Non-Id AER. Of the three feature classes, removal of WordNet appears to have the least impact, showing the smallest change in Non-Id AER.</Paragraph>
      <Paragraph position="4"> When the word association algorithm is applied to the extracted ~24K-sentence-pair set, degradation in word pair quality occurs significantly earlier than observed for the L12 data; after removing &amp;quot;trivial&amp;quot; matches, 22.63% of word pairs in the top ranked 800 were found in Wordnet, while 25.3% of the remainder were judged to be &amp;quot;good&amp;quot; matches. This is equivalent to an overall &amp;quot;goodness score&amp;quot; of 38.25%. The rapid degradation of goodness may be in part attributable to the smaller corpus size yielded by the classifier. Nevertheless, the model learns many valid new word pairs. Given enough data with which to bootstrap, it may be possible to do away with static resources such as Wordnet, and rely entirely on dynamically derived data.</Paragraph>
    </Section>
    <Section position="4" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
4.4 Training on the MSR Training Set
</SectionTitle>
      <Paragraph position="0"> By way of comparison, we also explored application of the SVM to the training data in the MSR Paraphrase corpus. For this purpose we used the 4076-sentence-pair &amp;quot;training&amp;quot; section of the MSR corpus, comprising 2753 positive and 1323 negative examples. The results at default parameter settings are given in Table 3, with respect to all features that were observed to occur with frequency greater than 4. Although the 49914 sentence pairs yielded by using the  MSR Paraphrase Corpus is nearly twice that of the 10K training set, AER performance is measurably degraded. Nevertheless, the MSR-trained corpus outperforms the similar-sized F12, yielding a reduction in Non-Id AER of a not insignificant 16%.</Paragraph>
      <Paragraph position="1"> The fact that the MSR training data does not perform as well as the 10 K training set probably reflects its derivative nature, since it was originally constructed with data collected using the 10K training set, as described in Dolan &amp; Brockett (2005). The performance of the MSR corpus is therefore skewed to reflect the biases inherent in its original training, and therefore exhibits the performance degradation commonly associated with bootstrapping. It is also a significantly smaller training set, with a higher proportion of negative examples than in typical in real world data. It will probably be necessary to augment the MSR training corpus with further negative examples before it can be utilized effectively for training classifiers.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="6" end_page="6" type="metho">
    <SectionTitle>
5 Discussion and Future Work
</SectionTitle>
    <Paragraph position="0"> These results show that it is possible to use machine learning techniques to induce a corpus of likely sentential paraphrase pairs whose alignment properties measured in terms of AER approach those of a much larger, more homogeneous dataset collected using a string-edit distance heuristic. This result supports the idea that an abstract notion of paraphrase can be captured in a high dimensional model.</Paragraph>
    <Paragraph position="1"> Future work will revolve around optimizing classifiers for different domains, corpus types and training sets. It seems probable that the effect of the 10K training corpus can be greatly augmented by adding sentence pairs that have been aligned from multiple translations using the techniques described in, e.g., Barzilay &amp; McKeown (2001) and Pang et al. (2003).</Paragraph>
  </Section>
class="xml-element"></Paper>