<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1032"> <Title>A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We present a pattern matching method for compiling a bilingual lexicon of nouns and proper nouns from unaligned, noisy parallel texts of Asian/Indo-European language pairs. Tagging information of one language is used. Word frequency and position information for high- and low-frequency words are represented in two different vector forms for pattern matching. New anchor point finding and noise elimination techniques are introduced. We obtained a precision of 73.1%. We also show how the results can be used in the compilation of domain-specific noun phrases.</Paragraph> <Paragraph position="1"> 1 Bilingual lexicon compilation without sentence alignment Automatically compiling a bilingual lexicon of nouns and proper nouns can contribute significantly to breaking the bottleneck in machine translation and machine-aided translation systems. Domain-specific terms are hard to translate because they often do not appear in dictionaries. Since most of these terms are nouns, proper nouns or noun phrases, compiling a bilingual lexicon of these word groups is an important first step.</Paragraph> <Paragraph position="2"> We have been studying robust lexicon compilation methods which do not rely on sentence alignment. Existing lexicon compilation methods (Kupiec 1993; Smadja & McKeown 1994; Kumano & Hirakawa 1994; Dagan et al. 1993; Wu & Xia 1994) all attempt to extract pairs of words or compounds that are translations of each other from previously sentence-aligned parallel texts. However, sentence alignment (Brown et al. 
1991; Kay & Röscheisen 1993; Gale & Church 1993; Church 1993; Chen 1993; Wu 1994) is not always practical when corpora have unclear sentence boundaries or contain noisy text segments present in only one language.</Paragraph> <Paragraph position="3"> Our proposed algorithm for bilingual lexicon acquisition bootstraps off corpus alignment procedures we developed earlier (Fung & Church 1994; Fung & McKeown 1994). Those procedures attempted to align texts by finding matching word pairs and have demonstrated their effectiveness for Chinese/English and Japanese/English. The main focus then was accurate alignment, but the procedure produced a small number of word translations as a by-product. In contrast, our new algorithm performs a minimal alignment to facilitate compiling a much larger bilingual lexicon.</Paragraph> <Paragraph position="4"> The paradigm of Fung & Church (1994) and Fung & McKeown (1994) is based on two main steps: find a small bilingual primary lexicon, use the text segments which contain some of the word pairs in the lexicon as anchor points for alignment, align the text, and compute a better secondary lexicon from these partially aligned texts. This paradigm can be seen as analogous to the Estimation-Maximization step in Brown et al. (1991), Dagan et al. (1993), and Wu & Xia (1994).</Paragraph> <Paragraph position="5"> For a noisy corpus without sentence boundaries, the primary lexicon accuracy depends on the robustness of the algorithm for finding word translations given no a priori information. The reliability of the anchor points will determine the accuracy of the secondary lexicon. We also want an algorithm that bypasses a long, tedious sentence or text alignment step.</Paragraph> </Section> </Paper>