<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1004"> <Title>Sentence Alignment for Monolingual Comparable Corpora</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Text-to-text generation is an emerging area of research in NLP (Chandrasekar and Bangalore, 1997; Carroll et al., 1999; Knight and Marcu, 2000; Jing and McKeown, 2000). Unlike traditional concept-to-text generation, text-to-text generation applications take a text as input and transform it into a new text satisfying specific constraints, such as length in summarization or style in text simplification. One exciting new research direction is the automatic induction of such transformation rules. This is a particularly promising direction given that there are naturally occurring examples of comparable texts that convey the same information yet are written in different styles. Presented with two such texts, one can pair sentences that convey the same information, thereby building a training set of rewriting examples for the domain to which the texts belong. We believe that automating this process will provide researchers in text-to-text generation with valuable training and testing resources, just as techniques to align multilingual parallel corpora boosted research in Machine Translation (MT).</Paragraph> <Paragraph position="1"> In this paper, we address the task of aligning sentences in text pairs. We focus on monolingual comparable corpora, that is, texts in the same language (e.g., English) that overlap in the information they convey. 
Stories about the same events from different press agencies and texts presenting the same information to experts and lay people are two examples.</Paragraph> <Paragraph position="2"> In MT, the task of sentence alignment was extensively studied for parallel corpora (see footnote 1). A typical sentence alignment algorithm can be roughly described as a two-step process: (1) for each sentence pair compute a local similarity value, independently of the other sentences; (2) find an overall sequence of mapped sentences, using both the local similarity values and additional features.</Paragraph> <Paragraph position="3"> In the case of monolingual corpora, step (2) might seem unnecessary. Since the texts share the same language, it would be enough to choose for local similarity a function based on lexical cues only and select sentence pairs with high lexical similarity.</Paragraph> <Paragraph position="4"> Even a simple lexical function (e.g., one that counts word overlap) could produce an accurate alignment.</Paragraph> <Paragraph position="5"> Footnote 1: Sentence alignment for comparable multilingual corpora was not addressed in previous research. Comparable corpora have primarily been used to build bilingual lexical resources (Fung and Yee, 1998).</Paragraph> <Paragraph position="6"> After all, two sentences which share most of their words are likely to paraphrase each other. The problem is that there are many sentences that convey the same information but have little surface resemblance. As a result, simple word counts cannot distinguish the matching pair (A) in Figure 1 from the unrelated pair (B). An accurate local similarity measure would have to account for many complex paraphrasing phenomena. 
A simple, weak lexical similarity function alone is not sufficient.</Paragraph> <Paragraph position="7"> (A) Petersburg served as the capital of Russia for 200 years.</Paragraph> <Paragraph position="8"> For two centuries Petersburg was the capital of the Russian Empire.</Paragraph> <Paragraph position="9"> (B) The city is also the country's leading port and center of commerce. And yet, as with so much of the city, the port facilities are old and inefficient.</Paragraph> <Paragraph position="10"> [Figure 1: ... two content words. (A) is a matching pair, (B) is not.] In MT, a weak similarity function is compensated for by searching for a globally optimal alignment, using dynamic programming or taking advantage of the geometric/positional or contextual properties of the text pair (Gale and Church, 1991; Shemtov, 1993; Melamed, 1999). But these techniques operate on the assumptions that there are limited insertions and deletions between the texts and that the order of the information is roughly preserved from one text to another.</Paragraph> <Paragraph position="11"> Texts from comparable corpora, as opposed to parallel corpora, contain a great deal of &quot;noise.&quot; In Figure 2, which plots the manually identified alignment for a text pair in our corpus, only a small fraction of the sentence pairs are aligned (35 out of 31 270 sentence pairs), which illustrates that the information overlap between the texts is far from complete. Consider two texts written by different press agencies: while both report on the same events, one may contain additional interviews and the other, background information.</Paragraph> <Paragraph position="12"> Another distinguishing characteristic of comparable corpora is that the order in which the information is presented can differ greatly from one text to another. Analysis of comparable texts in different domains (Paris, 1993; Barzilay et al., 2002) showed that there is wide variability in the order in which the same information can be presented. 
This is also illustrated in Figure 2.</Paragraph> <Paragraph position="13"> [Figure 2: ... corpus. A point at (x,y) indicates that sentences x and y match.]</Paragraph> <Paragraph position="14"> We investigate a novel approach informed by text structure for sentence alignment. Our method emphasizes the search for an overall alignment, while relying on a simple local similarity function. We incorporate context into the search process in two complementary ways: (1) we map large text fragments using hypotheses learned in a supervised fashion and (2) we further refine the match through local alignment within the mapped fragments to find sentence pairs. When the documents in the collection belong to the same domain and genre, the fragment mapping takes advantage of the topical structure of the texts. This structure is derived in an unsupervised fashion by analyzing commonalities among texts on each side of the comparable corpora separately. Our experiments show that our overall approach identifies even pairs with low lexical similarity. We also found that a fully unsupervised method using a minimalist representation of contextual information, viz., paragraph-level lexical similarity, outperforms existing methods based on complex local similarity functions.</Paragraph> <Paragraph position="15"> In the next section, we provide an overview of existing work on monolingual sentence alignment.</Paragraph> <Paragraph position="16"> Section 3 describes our algorithm. In Sections 4 and 5, we report on data collection and evaluation.</Paragraph> </Section> </Paper>