<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1042">
<Title>An Experiment in Hybrid Dictionary and Statistical Sentence Alignment</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> There have been many approaches proposed to solve the problem of aligning corresponding sentences in parallel corpora. With a few notable exceptions, however, much of this work has focussed either on corpora containing European language pairs or on clean parallel corpora where there is little reformatting. In our work we have focussed on developing a method for robust matching of English-Japanese sentences, based primarily on lexical matching and combined with statistical information from byte-length ratios. We show in this paper that this hybrid model is more effective than its constituent parts used separately.</Paragraph>
<Paragraph position="1"> The task of sentence alignment is a critical first step in many automatic applications involving the analysis of bilingual texts, such as extraction of bilingual vocabulary, extraction of translation templates, word sense disambiguation, word and phrase alignment, and extraction of parameters for statistical translation models. Many software products which aid human translators now contain sentence alignment tools as an aid to speeding up editing and terminology searching.</Paragraph>
<Paragraph position="2"> Various methods have been developed for sentence alignment, which we can categorise as lexical, such as (Chen, 1993), based on a large-scale bilingual lexicon; statistical, such as (Brown et al., 1991), (Church, 1993), (Gale and Church, 1993) and (Kay and Röscheisen, 1993), based on distributional regularities of words or byte-length ratios and possibly inducing a bilingual lexicon as a by-product; or hybrid, such as (Utsuro et al., 1994) and (Wu, 1994), based on some combination of the other two. Neither of the pure approaches is entirely satisfactory, for the following reasons: * Text volume limits the usefulness of statistical approaches. We would often like to be able to align small amounts of text, or texts from various domains which do not share the same statistical properties.</Paragraph>
<Paragraph position="3"> * Bilingual dictionary coverage limitations mean that we will often encounter problems establishing a correspondence in non-general domains.</Paragraph>
<Paragraph position="4"> * Dictionary-based approaches are founded on an assumption of lexical correspondence between language pairs. We cannot always rely on this for non-cognate language pairs, such as English and Japanese.</Paragraph>
<Paragraph position="5"> * Texts are often heavily reformatted in translation, so we cannot assume that the corpus will be clean, i.e. contain many one-to-one sentence mappings. In this case, statistical methods which rely on structural correspondence such as byte-length ratios may not perform well.</Paragraph>
<Paragraph position="6"> These factors suggest that some hybrid method may give us the best combination of coverage and accuracy when we have a variety of text domains, text sizes and language pairs. In this paper we seek to fill a gap in our understanding and to show how the various components of the hybrid method influence the quality of sentence alignment for Japanese and English newspaper articles.</Paragraph>
</Section>
</Paper>