<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0902"> <Title>Learning a Translation Lexicon from Monolingual Corpora</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> This section provides more detail on the experiments we have carried out to test the methods just outlined.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Evaluation measurements </SectionTitle> <Paragraph position="0"> We are trying to build a one-to-one German-English translation lexicon for use in a machine translation system.</Paragraph> <Paragraph position="1"> To evaluate this performance we use two different measurements. First, we record how many correct word pairs we have constructed.</Paragraph> <Paragraph position="2"> This is done by checking the generated word pairs against an existing bilingual lexicon.4 In essence, we try to recreate this lexicon, which contains 9,206 distinct German nouns, 10,645 distinct English nouns, and 19,782 lexicon entries. For a machine translation system, it is often more important to get more frequently used words right than obscure ones. Thus, our second evaluation measurement tests the word translations proposed by the acquired lexicon against the actual word-level translations in a 5,000-sentence aligned parallel corpus.5 The starting point for extending the lexicon is the seed lexicon of identically spelled words, as described in Section 2.1. It consists of 1,339 entries, of which 88.9% are correct according to the existing bilingual lexicon. Due to computational constraints,6 we focus on the additional mapping of only 1,000 German and English words.</Paragraph> <Paragraph position="3"> These 1,000 words are chosen from the 1,000 most frequent lexicon entries in the dictionary, without duplicate words. This frequency is defined as the sum of the frequencies of the two words in the entry, as found in the monolingual corpora. We did not collect statistics on the actual use of lexical entries in, say, a parallel corpus.</Paragraph> <Paragraph position="4"> In a different experimental set-up we also simply tried to match the 1,000 most frequent German words with the 1,000 most frequent English words. The results do not differ significantly.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Greedy extension </SectionTitle> <Paragraph position="0"> Each of the four clues described in Sections 2.2 to 2.5 provides a matching score between a German and an English word. The likelihood of these two words being actual translations of each other should correlate with these scores.</Paragraph> <Paragraph position="1"> There are many ways one could search for the best set of lexicon entries based on these scores. We could perform an exhaustive search: construct all possible mappings and find the one with the highest combined score over all entries. Since there are O(n!) possible mappings, such a brute-force approach is practically impossible.</Paragraph> <Paragraph position="2"> We therefore employed a greedy search: first, we search for the highest score over all word pairs. We add this word pair to the lexicon and drop all word pairs that include either its German or its English word from further search. Again, we search for the highest remaining score, add the corresponding word pair, drop these words from further search, and so on. This is done iteratively until all words are used up.</Paragraph>
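The following is a minimal sketch of this greedy matching in Python, not the implementation used in the paper. It assumes the four clue scores have already been combined into a single score per word pair (here a plain dictionary from (German word, English word) pairs to numbers; how the clues are combined is left open).

# Sketch of the greedy lexicon extension described above (illustrative only).
# `scores` maps (german_word, english_word) pairs to a combined matching score.

def greedy_extend(scores):
    """Greedily build a one-to-one lexicon from pair scores."""
    lexicon = []
    used_de, used_en = set(), set()
    # Visiting pairs from highest to lowest score is equivalent to repeatedly
    # searching for the highest remaining score, since the scores do not change.
    for (de, en), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if de in used_de or en in used_en:
            continue  # one of the two words is already matched; drop this pair
        lexicon.append((de, en))
        used_de.add(de)
        used_en.add(en)
    return lexicon

# Toy example (hypothetical scores):
# greedy_extend({("haus", "house"): 0.9, ("haus", "home"): 0.6, ("hund", "dog"): 0.8})
# returns [("haus", "house"), ("hund", "dog")].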
<Paragraph position="3"> Tables 2 and 3 illustrate this process for the spelling and context similarity clues, when applied separately.</Paragraph> <Paragraph position="4"> 6For matching 1,000 words, the described algorithms run for up to 3 days. Since the complexity of these algorithms is O(n^2) in the number of words, a full run on 10,000 words would take almost a year. Of course, this may be alleviated by a more efficient implementation and parallelization.</Paragraph> </Section> </Section> </Paper>