<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2122">
  <Title>Inducing Word Alignments with Bilexical Synchronous Trees</Title>
  <Section position="6" start_page="957" end_page="958" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> First of all, we are interested in finding out how much speedup can be achieved by using the hook trick for EM. We implemented both versions in C++ and turned off pruning for both. We ran the two inside-outside parsing algorithms on a small test set of 46 sentence pairs that are no longer than 25 words in both languages. We then put the results into buckets of (1-4), (5-9), (10-14), (15-19), and (20-24) according to the maximum length of the two sentences in each pair, and averaged the timing results within each bucket. Figure 3 (a) shows clearly that the hook trick helps more and more as the sentences get longer. We also turned on pruning for both, which is the normal operating condition for the parsers. Both are much faster due to the effectiveness of pruning, but the speedup ratio is lower because hooks are reused less often once many cells are pruned away. Figure 3 (b) shows the speedup curve in this situation.</Paragraph>
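The bucketing-and-averaging step above can be sketched as follows (a minimal sketch; the function name and the (length, seconds) record format are assumptions, not from the paper):

```python
from collections import defaultdict

def bucket_timings(records, width=5, max_len=24):
    """Average parse times in buckets of maximum sentence-pair length.

    `records` is a list of (max_sentence_length, seconds) pairs, one per
    sentence pair. Buckets are (1-4), (5-9), ..., (20-24), mirroring the
    grouping used for Figure 3.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for length, seconds in records:
        if 1 <= length <= max_len:
            lo = (length // width) * width  # bucket lower edge: 0, 5, 10, ...
            sums[lo] += seconds
            counts[lo] += 1
    # Report each bucket as an inclusive (low, high) range of lengths.
    return {(max(lo, 1), lo + width - 1): sums[lo] / counts[lo] for lo in sums}
```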
    <Paragraph position="1"> We trained both the unlexicalized and the lexicalized ITGs on a parallel corpus of Chinese-English newswire text. The Chinese data were automatically segmented into tokens, and English capitalization was retained. We replaced words occurring only once with an unknown-word token, resulting in a Chinese vocabulary of 23,783 words and an English vocabulary of 27,075 words.</Paragraph>
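The singleton replacement described above can be sketched like this (a minimal sketch; the function name and the `<UNK>` spelling of the unknown-word token are assumptions):

```python
from collections import Counter

def replace_singletons(corpus, unk="<UNK>"):
    """Replace words occurring only once with an unknown-word token.

    `corpus` is a list of token lists (one per sentence), as produced by
    the Chinese segmenter or English tokenizer. Returns the rewritten
    corpus and the resulting vocabulary (which includes the UNK token).
    """
    counts = Counter(w for sent in corpus for w in sent)
    rewritten = [[w if counts[w] > 1 else unk for w in sent] for sent in corpus]
    vocab = {w for sent in rewritten for w in sent}
    return rewritten, vocab
```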
    <Paragraph position="2"> We did two types of comparisons. In the first comparison, we measured the performance of five word aligners, including IBM models, ITG, the lexical ITG (LITG) of Zhang and Gildea (2005), and our bilexical ITG (BLITG), on a hand-aligned bilingual corpus. All the models were trained using the same amount of data. We ran the experiments on sentences up to 25 words long in both languages. The resulting training corpus had 18,773 sentence pairs with a total of 276,113 Chinese words and 315,415 English words.</Paragraph>
    <Paragraph position="3"> For scoring the Viterbi alignments of each system against gold-standard annotated alignments, we use the alignment error rate (AER) of Och and Ney (2000), which measures agreement at the level of pairs of words:</Paragraph>
    <Paragraph position="4"> AER(A; GS, GP) = 1 - (|A &#8745; GS| + |A &#8745; GP|) / (|A| + |GS|)</Paragraph>
    <Paragraph position="5"> where A is the set of word pairs aligned by the automatic system, GS is the set marked in the gold standard as "sure", and GP is the set marked as "possible" (including the sure pairs). In our Chinese-English data, only one type of alignment was marked, meaning that GP = GS.</Paragraph>
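Using the definitions above, the Och and Ney (2000) AER can be computed directly over sets of (i, j) link pairs (a minimal sketch; the function name is an assumption):

```python
def aer(A, GS, GP):
    """Alignment error rate of Och and Ney (2000).

    A:  set of (i, j) word pairs proposed by the automatic aligner
    GS: 'sure' gold-standard links
    GP: 'possible' gold-standard links (a superset of GS)
    """
    assert GS <= GP, "sure links must be included in the possible links"
    return 1.0 - (len(A & GS) + len(A & GP)) / (len(A) + len(GS))
```

When only one annotation type is marked, as in the paper's Chinese-English data, GP is simply passed the same set as GS.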
    <Paragraph position="6"> In our hand-aligned data, 47 sentence pairs are no longer than 25 words in either language and were used to evaluate the aligners.</Paragraph>
    <Paragraph position="7"> A separate development set of hand-aligned sentence pairs was used to control overfitting; the subset with up to 25 words in both languages was used. We chose the number of iterations for EM training as the turning point of AER on the development data set. The unlexicalized ITG was trained for 3 iterations. LITG was trained for only 1 iteration, partly because it was initialized with fully trained ITG parameters. BLITG was trained for 3 iterations.</Paragraph>
    [Table 1 caption fragment: "... both sides). LITG stands for the cross-language Lexicalized ITG. BLITG is the within-English Bilexical ITG. ITG-lh is ITG with left-head assumption on English. ITG-rh is with right-head assumption."]
    <Paragraph position="8"> For comparison, we also included the results from IBM Model 1 and Model 4. The numbers of iterations for training the IBM models were likewise chosen as the turning points of AER on the development data set.</Paragraph>
    <Paragraph position="9"> We also want to know whether BLITG can model dependencies better than LITG. For this purpose, we again used the AER measurement, since the goal is still higher precision/recall for a set of recovered word links, although the dependency word links are within one language. For this reason, we rename AER to Dependency Error Rate. Table 1 (right) shows the dependency results on the English side of the test data set. The dependency results on Chinese are similar.</Paragraph>
    <Paragraph position="10"> The gold standard dependencies were extracted from Collins' parser output on the sentences. The LITG and BLITG dependencies were extracted from the Viterbi synchronous trees by following the head words.</Paragraph>
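Reading dependency links off a tree by following head words can be sketched as follows (a simplified monolingual view of the Viterbi synchronous trees; the `Node` structure and function names are assumptions for illustration):

```python
class Node:
    """A binary tree node. For internal nodes, `head` says which child
    supplies the head word (0 = left, 1 = right); leaves carry a word."""
    def __init__(self, word=None, left=None, right=None, head=0):
        self.word, self.left, self.right, self.head = word, left, right, head

def head_of(node):
    # The head word of a node is the head word of its head child.
    if node.word is not None:
        return node.word
    return head_of(node.left if node.head == 0 else node.right)

def dependencies(node, links=None):
    """Collect (dependent, head) links: at each internal node, the head
    word of the non-head child depends on the head word of the head child."""
    if links is None:
        links = []
    if node.word is None:
        h = head_of(node.left if node.head == 0 else node.right)
        d = head_of(node.right if node.head == 0 else node.left)
        links.append((d, h))
        dependencies(node.left, links)
        dependencies(node.right, links)
    return links
```

Under this view, the ITG-lh baseline corresponds to fixing `head=0` at every internal node, and ITG-rh to fixing `head=1`.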
    <Paragraph position="11"> For comparison, we also included two baseline results. ITG-lh is the unlexicalized ITG with the left-head assumption, meaning the head words always come from the left branches. ITG-rh is ITG with the right-head assumption.</Paragraph>
    <Paragraph position="12"> To draw more confident conclusions, we also ran tests on the larger hand-aligned data set used in Liu et al. (2005). We used 165 sentence pairs that are up to 25 words in length on both sides.</Paragraph>
  </Section>
</Paper>