<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1412">
  <Title>A Comparative Study on Translation Units for Bilingual Lexicon Extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Models of Translation Units
</SectionTitle>
    <Paragraph position="0"> The main objective of this paper is to determine suitable translation units for the automatic acquisition of translation pairs. A word-to-word correspondence is often assumed in the pioneering works, and recently Melamed argues that one-to-one assumption is not restrictive as it may appear in (Melamed, 2000). However, we question his claim, since the tokenization of words for non-segmented languages such as Japanese is, by nature, ambiguous, and thus his one-to-one assumption is difficult to hold. We address this ambiguity problem by allowing 'overlaps' in generation of translation units and obtain single- and multi-word correspondences simultaneously.</Paragraph>
    <Paragraph position="1"> Previous works that focus on multi-word Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 28 .  NP recognizers are used to extract translation units and (Smadja et al., 1996) which uses the XTRACT system to extract collocations. Moreover, (Kitamura and Matsumoto, 1996) extracts an arbitrary length of word correspondences and (Haruno et al., 1996) identifies collocations through word-level sorting.</Paragraph>
    <Paragraph position="2"> In this paper, we compare three N-gram models of translation units, namely Bound-length Ngram, Chunk-bound N-gram, and Dependency-linked N-gram. Our approach of extracting bilingual lexicon is two-staged. We first prepare N-grams independently for each language in the parallel corpora and then find corresponding translation pairs from both sets of translation units in a greedy manner. The essence of our algorithm is that we allow some overlapping translation units to accommodate ambiguity in the first stage. Once translation pairs are detected during the process, they are decisively selected, and the translation units that overlaps with the found translation pairs are gradually ruled out.</Paragraph>
    <Paragraph position="3"> In all three models, translation units of N-gram are built using only content (open-class) words.</Paragraph>
    <Paragraph position="4"> This is because functional (closed-class) words such as prepositions alone will usually act as noise and so they are filtered out in advance.</Paragraph>
    <Paragraph position="5"> A word is classified as a functional word if it matches one of the following conditions. (The Penn Treebank part-of-speech tag set (Santorini, 1991) is used for English, whereas the ChaSen part-of-speech tag set (Matsumoto and Asahara, 2001) is used for Japanese.)</Paragraph>
    <Paragraph position="7"> part-of-speech(E) &amp;quot;CC&amp;quot;, &amp;quot;CD&amp;quot;, &amp;quot;DT&amp;quot;, &amp;quot;EX&amp;quot;, &amp;quot;FW&amp;quot;, &amp;quot;IN&amp;quot;, &amp;quot;LS&amp;quot;, &amp;quot;MD&amp;quot;, &amp;quot;PDT&amp;quot;, &amp;quot;PR&amp;quot;, &amp;quot;PRS&amp;quot;, &amp;quot;TO&amp;quot;, &amp;quot;WDT&amp;quot;, &amp;quot;WD&amp;quot;, &amp;quot;WP&amp;quot; stemmed-form(E) &amp;quot;be&amp;quot; symbols punctuations and brackets We now illustrate the three models of translation units by referring to the sentence in Figure  Bound-length N-gram Bound-length N-gram is first proposed in (Kitamura and Matsumoto, 1996). The translation units generated in this model are word sequences from uni-gram to a given length N. The upper bound for N is fixed to 5 in our experiment. Figure 2 lists a set of N-grams generated by Bound-length N-gram for the sentence in Figure 1.</Paragraph>
    <Paragraph position="8"> Chunk-bound N-gram Chunk-bound N-gram assumes prior knowledge of chunk boundaries. The definition of &amp;quot;chunk&amp;quot; varies from person to person. In our experiment, the definition for English chunk task complies with the CoNLL-2000 text chunking tasks and the definition for Japanese chunk is based on &amp;quot;bunsetsu&amp;quot; in the Kyoto University Corpus.</Paragraph>
    <Paragraph position="9"> Unlike Bound-length N-gram, Chunk-bound N-gram will not extend beyond the chunk boundaries. N varies depending on the size of the chunks1. Figure 3 lists a set of N-grams gener-</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Dependency-linked N-gram
</SectionTitle>
      <Paragraph position="0"> Dependency-linked N-gram assumes prior knowledge of dependency links among chunks.</Paragraph>
      <Paragraph position="1"> In fact, Dependency-linked N-gram is an enhanced model of the Chunk-bound model in that, Dependency-linked N-gram extends chunk boundaries via dependency links. Although dependency links could be extended recursively in a sentence, we limit the use to direct dependency links (i.e. links of immediate mother-daughter relations) only. Two chunks of dependency linked are concatenated and treated as an extended chunks. Dependency-linked  extended boundaries. Therefore, translation units generated by Dependency-linked N-gram (Figure 4) become the superset of the units generated by  Chunk-bound N-gram (Figure 3).</Paragraph>
      <Paragraph position="2"> The distinct characteristics of Dependency-linked N-gram from previous works are two-fold. First, (Yamamoto and Matsumoto, 2000) also uses dependency relations in the generation of translation units. However, it suffers from data sparseness (and thus low coverage), since the entire chunk is treated as a translation unit, which is too coarse. Dependency-linked N-gram, on the other hand, uses more fine-grained N-grams as translation units in order to avoid sparseness. Second, Dependency-linked N-gram includes &amp;quot;flexible&amp;quot; or non-contiguous collocations if dependency links are distant in a sentence. These collocations cannot be obtained by Bound-length N-gram with any N.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Translation Pair Extraction
</SectionTitle>
    <Paragraph position="0"> We use the same algorithm as (Yamamoto and Matsumoto, 2000) for acquiring translation pairs.</Paragraph>
    <Paragraph position="1"> The algorithm proceeds in a greedy manner. This means that the translation pairs found earlier (i.e.</Paragraph>
    <Paragraph position="2"> at a higher threshold) in the algorithm are regarded as decisive entries. The threshold acts as the level of confidence. Moreover, translation units that partially overlap with the already found translation pairs are filtered out during the algorithm. null The correlation score between translation units a41a43a42 and a41a45a44 is calculated by the weighted Dice Co-efficient defined as:  where a68a55a44 and a68a70a42 are the numbers of occurrences of a41 a42 anda41 a44 in Japanese and English corpora respectively, and a68a70a42a71a44 is the number of co-occurrences of a41a43a42 and a41a45a44 .</Paragraph>
    <Paragraph position="3"> We repeat the following until the current threshold a68a70a78a71a79a55a80a81a80 reaches the predefined minimum threshold a68a83a82a85a84a53a86 .</Paragraph>
    <Paragraph position="4"> 1. For each pair of English unit a87a89a88 and Japanese unit a87a48a90 appearing at least a91a69a92a94a93a96a95a62a95 times, identify the most likely correspondences according to the correlation scores.</Paragraph>
    <Paragraph position="5">  2. Filter out the co-occurrence positions for a87 a88 , a87 a90 , and their overlapped translation units.</Paragraph>
    <Paragraph position="6"> 3. Lower a91a69a92a71a93a96a95a62a95 if no more pairs are found. 4 Experiment and Result</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Experimental Setting
</SectionTitle>
      <Paragraph position="0"> Data for our experiment is 10000 sentence-aligned corpus from English-Japanese business expressions (Takubo and Hashimoto, 1995). 8000 sentences pairs are used for training and the remaining 2000 sentences are used for evaluation.</Paragraph>
      <Paragraph position="1"> Since the data are unannotated, we use NLP tools (part-of-speech taggers, chunkers, and dependency parsers) to estimate linguistic information such as word segmentation, chunk boundaries, and dependency links. Most tools employ a statistical model (Hidden Markov Model) or ma- null pair extraction algorithm described in the previous section. This implies that translation pairs that co-occur only once will never be found in our algorithm. We believe this is a reasonable sacrifice to bear considering the statistical nature of our algorithm. Table 1 shows the number of translation units found in each model. Note that translation units are counted not by token but by type.</Paragraph>
      <Paragraph position="2"> We adjust the threshold of the translation pair extraction algorithm according to the following equation. The threshold a68a2a78a71a79a55a80a81a80 is initially set to 100 and is gradually lowered down until it reaches the minimum threshold a68a83a82a85a84a53a86 2, described in Section 3. Furthermore, we experimentally decrement the threshold a68 a78a71a79a55a80a81a80 from 2 to 1 with the remaining uncorrelated sets of translation units, all of which appear at least twice in the corpus.</Paragraph>
      <Paragraph position="3"> This means that translation pairs whose correlation score is 1 a137 sim(a41a138a42 ,a41a139a44 ) a137 0 are attempted to find correspondences2.</Paragraph>
      <Paragraph position="4"> 2Note that a91 a92a50a140a2a141a111a141 plays two roles: (1) threshold for the co-occurrence frequency, and (2) threshold for the correlation score. During the decrement of a91a69a92 a140a70a141a96a141 form 2 to 1, the effect is solely on the latter threshold (for the correlation score), and the former threshold (for the co-occurrence frequency) does not alter and remains 2.</Paragraph>
      <Paragraph position="6"> The result is evaluated in terms of accuracy and coverage. Accuracy is the number of correct translation pairs over the extracted translation pairs in the algorithm. This is calculated by type.</Paragraph>
      <Paragraph position="7"> Coverage measures &amp;quot;applicability&amp;quot; of the correct translation pairs for unseen test data. It is the number of tokens matched by the correct translation pairs over the number of tokens in the unseen test data. Acuracy and coverage roughly correspond to Melamed's precision and percent correct respectively (Melamed, 1995). Accuracy is calculated on the training data (8000 sentences) manually, whereas coverage is calculated on the test data (2000 sentences) automatically.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Accuracy
</SectionTitle>
      <Paragraph position="0"> Stepwise accuracy for each model is listed in Table 2, Table 3, and Table 4. &amp;quot;a68a70a78a71a79a55a80a158a80 &amp;quot; indicates the threshold, i.e. stages in the algorithm. &amp;quot;e&amp;quot; is the number of translation pairs found at stage &amp;quot;a68a2a78a71a79a55a80a81a80 &amp;quot;, and &amp;quot;c&amp;quot; is the number of correct ones found at stage &amp;quot;a68a2a78a71a79a55a80a81a80 &amp;quot;. The correctness is judged by an English-Japanese bilingual speaker. &amp;quot;acc&amp;quot;  lists accuracy, the fraction of correct ones over extracted ones by type. The accumulated results for &amp;quot;e&amp;quot;, &amp;quot;c&amp;quot; and &amp;quot;acc&amp;quot; are indicated by '.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Coverage
</SectionTitle>
      <Paragraph position="0"> Stepwise coverage for each model is listed in Table 5, Table 6, and Table 7. As before, &amp;quot;a68a2a78a71a79a55a80a81a80 &amp;quot; indicates the threshold. The brackets indicate language: &amp;quot;E&amp;quot; for English and &amp;quot;J&amp;quot; for Japanese. &amp;quot;found&amp;quot; is the number of content tokens matched with correct translation pairs. &amp;quot;ideal&amp;quot; is the upper bound of content tokens that should be found by the algorithm; it is the total number of content tokens in the translation units whose co-occurrence frequency is at least &amp;quot;a68 a78a71a79a55a80a81a80 &amp;quot; times in the original parallel corpora. &amp;quot;cover&amp;quot; lists coverage. The prefix &amp;quot;i &amp;quot; is the fraction of found tokens over ideal tokens and the prefix &amp;quot;t &amp;quot; is the fraction of found tokens over the total number of both content and functional tokens in the data. For 2000 test parallel sentences, there are 30255 tokens in the English half and 38827 tokens in the Japanese half.</Paragraph>
      <Paragraph position="1"> The gap between the number of &amp;quot;ideal&amp;quot; tokens and that of total tokens is due to filtering of functional words in the generation of translation units.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Of the three models, Chunk-bound N-gram yields the best performance both in accuracy (83%) and in coverage (60%)3. Compared with the Bound-length N-gram, it achieves approximately 13% improvement in accuracy and 5-9% improvement in coverage at threshold 1.1.</Paragraph>
    <Paragraph position="1"> Although Bound-length N-gram generates more translation units than Chunk-bound Ngram, it extracts fewer correct translation pairs (and results in low coverage). A possible explanation for this phenomenon is that Bound-length N-gram tends to generate too many unnecessary translation units which increase the noise for the  extraction algorithm.</Paragraph>
    <Paragraph position="2"> Dependency-linked N-gram follows a similar transition of accuracy and coverage as Chunk-bound N-gram. Figure 5 illustrates the Venn diagram of the number of correct translation pairs extracted in each model. As many as 3439 translation pairs from Dependency-linked N-gram and Chunk-bound N-gram are found in common.</Paragraph>
    <Paragraph position="3"> Based on these observation, we could say that dependency links do not contribute significantly. However, as dependency parsers are still prone to some errors, we will need further investigation with improved dependency parsers.</Paragraph>
    <Paragraph position="4"> Table 8 lists the sample correct translation pairs that are unique to each model. Most translation pairs unique to Chunk-bound N-gram are named entities (NP compounds) and one-to-one correspondence. This matches our expectation, as translation units in Chunk-bound N-gram are limited within chunk boundaries. The reason why the other two failed to obtain these translation pairs is probably due to a large number of overlapped translation units generated. Our extraction algorithm filters out the overlapped entries once the correct pairs are identified, and thus a large number of overlapped translation units sometimes become noise.</Paragraph>
    <Paragraph position="5"> Bound-length N-gram and Dependency-linked N-gram include longer pairs, some of which are idiomatic expressions. Theoretically speaking, translation pairs like &amp;quot;look forward&amp;quot; should be extracted by Dependency-linked N-gram. A close examination of the data reveals that in some sentences, &amp;quot;look&amp;quot; and &amp;quot;forward&amp;quot; are not recognized as dependency-linked. These preprocessing failures can be overcome by further improvement of the tools used.</Paragraph>
    <Paragraph position="6"> Based on the above analysis, we conclude that chunking boundaries are useful clues in building bilingual seed dictionary as Chunk-bound N-gram has demonstrated high precision and wide coverage. However, for parallel corpora that include a great deal of domain-specific or idiomatic expressions, partial use of dependency links is desirable. null There is still a remaining problem with our method. That is how to determine translation pairs which co-occur only once. One simple approach is to use a machine-readable bilingual dictionary. However, a more fundamental solution may lie in the partial structural matching of parallel sentences (Watanabe et al., 2000). We intend to incorporate these techniques to improve the overall coverage.</Paragraph>
  </Section>
class="xml-element"></Paper>