<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1052"> <Title>Using Similarity Scoring To Improve the Bilingual Dictionary for Word Alignment</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> We trained three basic dictionaries using part of the Hansard data, around five megabytes (around 20k sentence pairs and 850k words). The basic dictionaries were built using the algorithm described in section 3, with three different thresholds: 0.005, 0.01, and 0.02. In the following, we refer to these dictionaries as Dict0.005, Dict0.01, and Dict0.02.</Paragraph> <Paragraph position="1"> 50 sentences were held back for testing. These sentences were hand-aligned by a fluent speaker of French. No one-to-one assumption was enforced: a word could align to zero or more words, with no upper limit imposed (although there is a natural upper limit).</Paragraph> <Paragraph position="2"> The Competitive Linking algorithm was then run with multiple parameter settings. In one setting, we varied the maximum number of links allowed per word, maxlinks. If this number is 2, then a word can align to 0, 1, or 2 words in the parallel sentence. In other settings, we enforced a minimum score in the bilingual dictionary for a link to be accepted, minscore. This means that two words cannot be aligned if their score is below minscore; this restriction was applied in the same way across all runs.</Paragraph> <Paragraph position="7"> The dictionary was also rebuilt using a number of different parameter settings. The two parameters that can be varied when rebuilding the dictionary are the similarity threshold minsim and the cooccurrence threshold coocsratio. The first parameter, minsim, requires that all words within one cluster have an average similarity score of at least minsim.
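The Competitive Linking procedure with the maxlinks and minscore parameters described above can be sketched as a greedy loop over candidate pairs in descending score order. This is an illustrative reconstruction, not the authors' code; the function name, the (source word, target word) score dictionary, and the tie-breaking by word position are assumptions.

```python
def competitive_linking(src, tgt, score, maxlinks=1, minscore=0.0):
    """Greedily link word positions in a sentence pair.

    src, tgt  : lists of words in the source / target sentence
    score     : dict mapping (src_word, tgt_word) -> dictionary score
    maxlinks  : maximum number of links any single word may take part in
    minscore  : pairs scoring below this threshold are never linked
    """
    # Enumerate all scored candidate pairs, best score first.
    cands = sorted(
        ((score[(s, t)], i, j)
         for i, s in enumerate(src)
         for j, t in enumerate(tgt)
         if (s, t) in score),
        reverse=True)
    links, n_src, n_tgt = [], {}, {}
    for sc, i, j in cands:
        if sc < minscore:
            break  # all remaining candidates score even lower
        # Accept the link only if neither word has used up its links.
        if n_src.get(i, 0) < maxlinks and n_tgt.get(j, 0) < maxlinks:
            links.append((i, j))
            n_src[i] = n_src.get(i, 0) + 1
            n_tgt[j] = n_tgt.get(j, 0) + 1
    return links
```

With maxlinks=1 this reduces to the classic one-to-one Competitive Linking behaviour; raising maxlinks lets a word participate in up to that many links, matching the 0/1/2-link behaviour described above.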
The second parameter, coocsratio, determines which words are considered for clustering.</Paragraph> <Paragraph position="11"> Those words that are considered for clustering should account for more than coocsratio of all cooccurrences of the source language word with any target language word. If a word falls below the threshold coocsratio, its entry in the dictionary remains unchanged, and it is not clustered with any other word.</Paragraph> <Paragraph position="16"> Below we summarize the values each parameter was set to.</Paragraph> <Paragraph position="17"> maxlinks Used in the Competitive Linking algorithm: maximum number of words any word can be aligned with. Set to: 1, 2, 3.</Paragraph> <Paragraph position="18"> minscore Used in the Competitive Linking algorithm: minimum score of a word pair in the dictionary for the pair to be considered as a possible link. Set to: 1, 2, 4, 6, 8, 10, 20, 30, 40, 50.</Paragraph> <Paragraph position="19"> minsim Used in rebuilding the dictionary: minimum average similarity score of the words in a cluster. Set to: 0.6, 0.7, 0.8.</Paragraph> <Paragraph position="20"> coocsratio Used in rebuilding the dictionary: minimum percentage of all cooccurrences of a source language word with any target language word that must be accounted for by one target language word. Set to: 0.003.</Paragraph> <Paragraph position="23"> By varying these parameters, we constructed various dictionaries by rebuilding the three baseline dictionaries. Here, we report results on three dictionaries where minsim was set to 0.7 and coocsratio was set to 0.003. For these parameter settings, we observed robust results, although other parameter settings also yielded positive results.</Paragraph> <Paragraph position="24"> Precision and recall were measured using the hand-aligned 50 sentences.
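The coocsratio filter described above can be sketched as follows. This is a minimal illustration of the stated criterion only (a target word qualifies for clustering if it accounts for more than coocsratio of the source word's total cooccurrences); the function name and the nested-dictionary count layout are assumptions, not the authors' implementation.

```python
def clustering_candidates(cooc, coocsratio=0.003):
    """For each source word, keep the target words whose cooccurrence
    count exceeds coocsratio of that source word's total cooccurrences.

    cooc : dict mapping source word -> {target word: cooccurrence count}
    """
    out = {}
    for s, tgt_counts in cooc.items():
        total = sum(tgt_counts.values())
        # Target words below the ratio are left unclustered.
        out[s] = [t for t, c in tgt_counts.items() if c / total > coocsratio]
    return out
```

For example, a target word seen twice against a source word with 503 total cooccurrences passes (2/503 is about 0.004 > 0.003), while one seen once does not.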
Precision was defined as the percentage of links proposed by our algorithm that were correct, out of all links proposed. Recall was defined as the percentage of links found by our algorithm out of all links that should have been found. In both cases, the hand-aligned data was used as a gold standard.</Paragraph> <Paragraph position="25"> The F-measure combines precision and recall: F = (2 * precision * recall) / (precision + recall). The following figures and tables illustrate that the Competitive Linking algorithm performs favorably when a rebuilt dictionary is used. Table 1 lists the improvement in precision and recall for each of the dictionaries, comparing baseline and rebuilt dictionaries at minscore 50 and maxlinks 1. The table shows the values when minscore is set to 50 and up to 1 link was allowed per word. Furthermore, the p-values of a 1-tailed t-test are listed, indicating that these performance boosts are mostly highly statistically significant for these parameter settings, where some of the best results were observed.</Paragraph> <Paragraph position="27"> The following figures (figures 1-9) illustrate the impact of the algorithm in greater detail. All figures plot precision, recall, and f-measure against different minscore settings, comparing rebuilt dictionaries to their baselines. For each dictionary, three plots are given, one for each maxlinks setting, i.e. the maximum number of links allowed per word. The curve names indicate the type of the curve (Precision, Recall, or F-measure), the maximum number of links allowed per word (1, 2, or 3), the dictionary used (Dict0.005, Dict0.01, or Dict0.02), and whether the run used the baseline dictionary or the rebuilt dictionary (Baseline or Cog7.3).</Paragraph> <Paragraph position="28"> It can be seen that our algorithm leads to stable improvement across parameter settings. In a few cases, it drops below the baseline when minscore is low.
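The precision, recall, and F-measure defined above can be computed directly from the proposed and gold link sets. A minimal sketch, treating links as (source position, target position) pairs; the function name is illustrative.

```python
def alignment_prf(proposed, gold):
    """Precision, recall, and F-measure of proposed links against
    a hand-aligned gold standard. Links are (src_pos, tgt_pos) pairs."""
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```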
Overall, however, our algorithm is robust: it improves alignment regardless of how many links are allowed per word and of which baseline dictionary is used, and it boosts both precision and recall, and thus also the f-measure.</Paragraph> <Paragraph position="29"> To return briefly to the example cited earlier, we can now show how the dictionary rebuild has affected these entries. The fact that pétrole, pétrolière, and pétrolières now receive higher scores than et and dans is what causes the alignment performance to increase.</Paragraph> <Paragraph position="30"> (Figure caption fragment: up to two links per word.)</Paragraph> </Section> </Paper>