<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2134"> <Title>Bitext Correspondences through Rich Mark-up</Title> <Section position="2" start_page="0" end_page="812" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Adequate encoding schemes applied to large bodies of text in electronic form have been a main achievement in the field of humanities computing. Research in computational linguistics, which since the late 1980s has relied on statistical and probabilistic methods over large corpora, has nevertheless largely neglected the extra information that such encoding schemes provide. In this paper we present an approach to sentence alignment that relies crucially on annotations previously introduced into a parallel corpus. Following (Harris 88), corpora containing bilingual texts have been called &quot;bitexts&quot; (Melamed 97), (Martínez et al. 97).</Paragraph> <Paragraph position="1"> The utility of annotated bitexts is demonstrated by a methodology that takes advantage of rich mark-up to resolve bitext correspondences, that is, to correctly identify and align text segments that are translation equivalents of each other (Chang &amp; Chen 97). Bitext correspondences are a rich source of information for applications such as example-based and memory-based approaches to machine translation (Sumita &amp; Iida 91), (Brown et al. 93), (Collins et al. 96); bilingual terminology extraction (Kupiec 93), (Eijk 93), (Dagan et al. 94), (Smadja et al. 96); bilingual lexicography (Catizone et al. 93), (Daille et al. 94), (Gale &amp; Church 91b); multilingual information retrieval (SIGIR 96); and word-sense disambiguation (Gale et al. 92), (Chang &amp; Chen 97). 
Moreover, the increasing availability of running parallel text in annotated form (e.g.</Paragraph> <Paragraph position="2"> WWW pages), together with evidence that poor mark-up (such as HTML) will progressively be replaced by richer mark-up (e.g. SGML/XML), is reason enough to investigate methods that benefit from such encoding schemes.</Paragraph> <Paragraph position="3"> We first describe how a bitext sample has been marked up, with particular emphasis on the recognition and annotation of proper nouns. We then show how sentence alignment can rely on mark-up, applying a methodology that uses annotations to determine the similarity between sentence pairs.</Paragraph> <Paragraph position="4"> This is the 'tags as cognates' algorithm, TasC.</Paragraph> </Section> </Paper>