<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1021">
  <Title>Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora</Title>
  <Section position="7" start_page="166" end_page="167" type="concl">
    <SectionTitle>
6 Conclusions and future work
</SectionTitle>
    <Paragraph position="0"> For monolingual alignment to achieve its full potential for text rewriting, huge amounts of text would need to be accurately aligned. Since monolingual corpora are so noisy, simple but effective methods such as those described in this paper will be required to ensure scalability.</Paragraph>
    <Paragraph position="1"> We have presented a novel algorithm for aligning the sentences of monolingual corpora of comparable documents. Our algorithm not only yields substantially improved accuracy, but is also simpler and more robust than previous approaches.</Paragraph>
    <Paragraph position="2"> The efficacy of TF*IDF ranking is remarkable in the face of previous results. In particular, TF*IDF was not chosen by the feature selection algorithm of Hatzivassiloglou et al. (2001), who experimented directly with TF*IDF measures and rejected them as less effective in determining similarity. We believe this striking difference can be attributed to the source of the weights. Recall that our TF*IDF weights treat each sentence as a separate document for the purpose of weighting. TF*IDF scores used in previous work are likely to have been obtained either by aggregation over the full document corpus, or by comparison with an external general collection, which is bound to yield lower discriminative power. To illustrate this, consider two words, such as the name of a city and the name of a building in that city. Viewed globally, both words are likely to belong to the long tail of the Zipf distribution, having almost indistinguishable logarithmic IDF. However, in the encyclopedia entry describing the city, the city's name is likely to appear in many sentences, while the building name may appear only in the single sentence that refers to it, and thus the latter should be scored higher. Conversely, a word that is relatively frequent in general usage, e.g., &quot;river&quot;, might be highly discriminative between sentences.</Paragraph>
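    <Paragraph> The city and building example above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the weighting used in the experiments: the function name sentence_level_idf, the plain log(N / df) formula, the whitespace tokenization, and the toy entry are all hypothetical choices made here.

```python
import math
from collections import Counter


def sentence_level_idf(sentences):
    """Return IDF weights where each sentence counts as one 'document'.

    Assumption: plain log(N / df) with whitespace tokenization; the exact
    weighting formula is not restated in this section.
    """
    tokenized = [set(s.lower().split()) for s in sentences]
    n = len(tokenized)
    # df[word] = number of sentences in this entry that contain the word
    df = Counter(word for tokens in tokenized for word in tokens)
    return {word: math.log(n / count) for word, count in df.items()}


# Hypothetical toy "encyclopedia entry" about a city.
entry = [
    "paris is the capital of france",
    "paris lies on the river seine",
    "the louvre is a museum in paris",
    "tourism matters to paris",
]

idf = sentence_level_idf(entry)
```

In this toy entry the city name occurs in every sentence and so receives weight zero, while the building name and the word "river" each occur in a single sentence and receive the maximum weight, log 4.</Paragraph>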
    <Paragraph position="3"> We further improve on the TF*IDF results by using a global alignment algorithm. We expect that more sophisticated sequence alignment techniques, as studied for biological sequence analysis, might yield improved results, in particular for comparing loosely matched document pairs involving non-linear text transformations such as inversions and translocations. Such methods could still modularly rely on the TF*IDF scoring.</Paragraph>
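    <Paragraph> To illustrate how a global alignment step can sit modularly on top of pairwise sentence scores, the following is a generic dynamic-programming sketch in the style of Needleman-Wunsch alignment from biological sequence analysis. It is not a description of the algorithm evaluated in this paper: the function name global_align, the linear gap penalty, and the similarity values are assumptions introduced for illustration, and any scoring function (such as TF*IDF similarity) could supply the matrix.

```python
def global_align(sim, gap=0.0):
    """Needleman-Wunsch style alignment over a sentence similarity matrix.

    sim[i][j] is the similarity of sentence i in one text and sentence j in
    the other; gap is a hypothetical penalty for skipping a sentence.
    Returns (total score, list of matched (i, j) pairs).
    """
    rows, cols = len(sim), len(sim[0])
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        score[i][0] = score[i - 1][0] - gap
    for j in range(1, cols + 1):
        score[0][j] = score[0][j - 1] - gap
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # match the pair
                score[i - 1][j] - gap,  # skip a sentence of the first text
                score[i][j - 1] - gap,  # skip a sentence of the second text
            )
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs, i, j = [], rows, cols
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return score[rows][cols], pairs[::-1]


# Hypothetical similarity scores for a 2-sentence and a 3-sentence text.
sim = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8],
]
total, pairs = global_align(sim, gap=0.1)
```

Because this formulation only permits monotone matches, it cannot recover inversions or translocations, which is the limitation that motivates looking at richer sequence alignment techniques.</Paragraph>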
    <Paragraph position="4"> We reiterate Barzilay and Elhadad's conclusion about the effectiveness of using the document context for the alignment of text. In fact, we are able to take better advantage of the intra-document context, while not relying on any assumptions about inter-document context that might be specific to one particular corpus. Identifying scalable principles for the use of inter-document context poses a challenging topic for future research.</Paragraph>
    <Paragraph position="5"> We have restricted our attention here to pre-annotated corpora, allowing better comparison with previous work, and sidestepping the labor-intensive task of human annotation. Having established a simple and robust document alignment method, we leave its application to much larger-scale document sets for future work.</Paragraph>
  </Section>
</Paper>