<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1051">
  <Title>Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources</Title>
  <Section position="7" start_page="5" end_page="5" type="concl">
    <SectionTitle>
6 Conclusions and Future Work
</SectionTitle>
    <Paragraph position="0"> Edit distance identifies sentence pairs that exhibit lexical and short phrasal alternations that can be aligned with considerable success. Given a large dataset and a well-motivated clustering of documents, useful datasets can be gleaned even without resorting to more sophisticated techniques  paraphrase phenomena per sentence (such as Multiple Sequence Alignment, as employed by Barzilay &amp; Lee 2003).</Paragraph>
    <Paragraph position="1"> However, there is a disparity between the kinds of paraphrase alternations that we need to be able to align and those that we can already align well using current SMT techniques. Based solely on the criterion of word AER, the L12 data would seem to be superior to the F2 data as a source of paraphrase knowledge. Hand evaluation, though, indicates that many of the phenomena that we are interested in learning may be absent from this L12 data.</Paragraph>
    <Paragraph position="2"> String edit distance extraction techniques involve assumptions about the data that are inadequate, but achieve high precision. Techniques like our F2 extraction strategies appear to extract a more diverse variety of data, but yield more noise. We believe that an approach with the strengths of both methods would lead to significant improvement in paraphrase identification and generation.</Paragraph>
    <Paragraph position="3"> In the near term, however, the relatively similar performances of F2 and L12-trained models on the F2 test data suggest that with further refinements, this more complex type of data can achieve good results. More data will surely help.</Paragraph>
    <Paragraph position="4"> One focus of future work is to build a classifier to predict whether two sentences are related through paraphrase. Features might include edit distance, temporal/topical clustering information, information about cross-document discourse structure, relative sentence length, and synonymy information. We believe that this work has potential impact on the fields of summarization, information retrieval, and question answering.</Paragraph>
    <Paragraph position="5"> Our ultimate goal is to apply current SMT techniques to the problems of paraphrase recognition and generation. We feel that this is a natural extension of the body of recent developments in SMT; perhaps explorations in monolingual data may have a reciprocal impact.</Paragraph>
    <Paragraph position="6"> The field of SMT, long focused on closely aligned data, is only now beginning to address the kinds of problems immediately encountered in monolingual paraphrase (including phrasal translations and large scale reorderings).</Paragraph>
    <Paragraph position="7"> Algorithms to address these phenomena will be equally applicable to both fields. Of course a broad-domain SMT-influenced paraphrase solution will require very large corpora of sentential paraphrases. In this paper we have described just one example of a class of data extraction techniques that we hope will scale to this task.</Paragraph>
  </Section>
class="xml-element"></Paper>