<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1021">
  <Title>Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora</Title>
  <Section position="3" start_page="161" end_page="162" type="intro">
    <SectionTitle>
2 Related work
</SectionTitle>
    <Paragraph position="0"> Several authors have tackled the monolingual sentence correspondence problem. SimFinder (Hatzivassiloglou et al., 1999; Hatzivassiloglou et al., 2001) examined 43 different features that could potentially help determine the similarity of two short text units (sentences or paragraphs). Of these, they automatically selected 11 features, including word overlap, synonymy as determined by WordNet (Fellbaum, 1998), matching proper nouns and noun phrases, and sharing semantic classes of verbs (Levin, 1993).</Paragraph>
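Word overlap, the first of the features listed above, can be sketched as follows; the tokenisation and the Jaccard formulation are illustrative assumptions, not SimFinder's exact definition.

```python
# Hedged sketch of a word-overlap similarity feature of the kind
# SimFinder combines with ten other features. Lower-casing and
# whitespace tokenisation are simplifying assumptions.

def word_overlap(sent_a: str, sent_b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    a = set(sent_a.lower().split())
    b = set(sent_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))
```

In a SimFinder-style system such a score would be one input among several (synonymy, proper-noun matches, verb classes) to a learned combination.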
    <Paragraph position="1"> The Decomposition method (Jing, 2002) relies on the observation that document summaries are often constructed by extracting sentence fragments from the document. It attempts to identify such extracts using a Hidden Markov Model of the word-extraction process. The HMM uses features of word identity and document position, with transition probabilities based on locality assumptions. For instance, after a word is extracted, an adjacent word, or one belonging to a nearby sentence, is more likely to be extracted next than one that is further away.</Paragraph>
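The locality bias can be illustrated by a transition score that decays with the distance between document positions; the exponential form and decay rate below are my assumptions, not Jing's actual parameterisation.

```python
import math

# Illustrative sketch of a locality-biased HMM transition score:
# a word adjacent to the previously extracted word scores highest,
# and the score falls off with positional distance. The exponential
# decay and its rate are assumed for illustration only.

def transition_score(prev_pos: int, next_pos: int, decay: float = 0.5) -> float:
    """Unnormalised transition score decaying with positional distance."""
    return math.exp(-decay * abs(next_pos - prev_pos))
```

Normalising these scores over all candidate next positions would yield proper transition probabilities.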
    <Paragraph position="2"> Barzilay and Elhadad (2003) apply a 4-step algorithm:  1. Cluster the paragraphs of the training documents into topic-specific clusters, based on word overlap. For instance, paragraphs in the Britannica city entries describing climate might cluster together.</Paragraph>
    <Paragraph position="3"> 2. Learn mapping rules between paragraphs of the full and elementary versions, taking the word-overlap and the clusters as features.</Paragraph>
    <Paragraph position="4"> 3. Given a new pair of texts, identify sentence  pairs with high overlap, and take these to be aligned. Then, classify paragraphs according to the clusters learned in Step 1, and use the mapping rules of Step 2 to match pairs of paragraphs between the documents.</Paragraph>
    <Paragraph position="5"> 4. Finally, take advantage of the paragraph clustering and mapping, by locally aligning only sentences belonging to mapped paragraph pairs.</Paragraph>
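The local alignment of Steps 3 and 4 can be sketched as below: given paragraph pairs already mapped between the two documents, only sentences inside mapped paragraphs are compared. The overlap measure and the threshold value are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch of sentence alignment restricted to mapped paragraph
# pairs, in the spirit of Steps 3-4 above. The Jaccard overlap and the
# 0.3 threshold are assumed values for illustration.

def overlap(s1: str, s2: str) -> float:
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a.intersection(b)) / max(len(a.union(b)), 1)

def align_sentences(mapped_paragraphs, threshold: float = 0.3):
    """mapped_paragraphs: list of (paragraph_a, paragraph_b) pairs,
    each paragraph given as a list of sentence strings."""
    alignments = []
    for para_a, para_b in mapped_paragraphs:
        for sent_a in para_a:
            best = max(para_b, key=lambda s: overlap(sent_a, s), default=None)
            if best is not None and overlap(sent_a, best) >= threshold:
                alignments.append((sent_a, best))
    return alignments
```

Restricting comparisons to mapped paragraphs is what makes the alignment context-sensitive: sentences with modest overlap can still be paired when their paragraphs are known to correspond.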
    <Paragraph position="6"> Dolan et al. (2004) used Web-aggregated news stories to learn both sentence-level and word-level alignments. Having collected a large corpus of clusters of related news stories from Google and MSN news aggregator services, they first seek related sentences, using two methods. First, using a high Levenshtein distance score they identify 139K sentence pairs, of which about 16.7% are estimated to be unrelated (using human evaluation of a sample). Second, assuming that the first two sentences of related news stories should be matched, provided they have a high enough word overlap, yields 214K sentence pairs, of which about 40% are estimated to be unrelated. No recall estimates are provided; however, with the release of the annotated Microsoft Research Paraphrase Corpus,1 it is apparent that Dolan et al. are seeking much more tightly related pairs of sentences than Barzilay and Elhadad, ones that are virtually semantically equivalent. In subsequent work, the same authors (Quirk et al., 2004) used such matched sentence pairs to train Giza++ (Och and Ney, 2003) on word-level alignment.</Paragraph>
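The first filtering heuristic can be sketched as a normalised edit-distance test: keep sentence pairs whose Levenshtein similarity exceeds a cut-off. The dynamic program below is standard edit distance; the 0.7 cut-off is an assumed value, not the threshold Dolan et al. actually used.

```python
# Sketch of Levenshtein-based sentence-pair filtering. The DP is
# standard character-level edit distance; the similarity cut-off of
# 0.7 is an illustrative assumption.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[len(b)]

def is_related(a: str, b: str, cutoff: float = 0.7) -> bool:
    """True when normalised Levenshtein similarity meets the cut-off."""
    longest = max(len(a), len(b), 1)
    return 1.0 - levenshtein(a, b) / longest >= cutoff
```

A string-edit filter of this kind naturally favours near-verbatim rewrites, which is consistent with the observation that the resulting pairs are much more tightly related than Barzilay and Elhadad's alignments.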
    <Paragraph position="7"> The recent PASCAL &amp;quot;Recognizing Textual Entailment&amp;quot; (RTE) challenge (Dagan et al., 2005) focused on the problem of determining whether one sentence entails another. Beyond the difference in the definition of the required relation between sentences, the RTE challenge focuses on isolated sentence pairs, as opposed to sentences within a document context. The task was judged to be quite difficult, with many of the systems achieving relatively low accuracy.</Paragraph>
  </Section>
</Paper>