<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1089">
  <Title>Mining New Word Translations from Comparable Corpora</Title>
  <Section position="3" start_page="0" end_page="21" type="metho">
    <SectionTitle>
2. Our approach
</SectionTitle>
    <Paragraph position="0"> The work of (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999) noted that if an English word e is the translation of a Chinese word c , then the contexts of the two words are similar. We could view this as a document retrieval problem. The context (i.e., the surrounding words) of c is viewed as a query. The context of each candidate translation 'e is viewed as a document. Since the context of the correct translation e is similar to the context of c , we are likely to retrieve the context of e when we use the context of c as the query and try to retrieve the most similar document. We employ the language modeling approach (Ng, 2000; Ponte and Croft, 1998) for this retrieval problem. More details are given in Section 3.</Paragraph>
    <Paragraph position="1"> On the other hand, when we only look at the word w itself, we can rely on the pronunciation of w to locate its translation. We use a variant of the machine transliteration method proposed by (Knight and Graehl, 1998). More details are given in Section 4.</Paragraph>
    <Paragraph position="2"> Each of the two individual methods provides a ranked list of candidate words, associating with each candidate a score estimated by the particular method. If a word e in English is indeed the translation of a word c in Chinese, then we would expect e to be ranked very high in both lists in general. Specifically, our combination method is as follows: we examine the top M words in both lists and find</Paragraph>
    <Paragraph position="4"> that appear in top M positions in both lists. We then rank these words</Paragraph>
    <Paragraph position="6"> according to the average of their rank positions in the two lists. The candidate e i that is ranked the highest according to the average rank is taken to be the correct translation and is output. If no words appear within the top M positions in both lists, then no translation is output.</Paragraph>
    <Paragraph position="7"> Since we are using comparable corpora, it is possible that the translation of a new word does not exist in the target corpus. In particular, our experiment was conducted on comparable corpora that are not very closely related and as such, most of the Chinese words have no translations in the English target corpus.</Paragraph>
    <Paragraph position="8"> 3. Translation by context In a typical information retrieval (IR) problem, a query is given and a ranked list of documents most relevant to the query is returned from a document collection.</Paragraph>
    <Paragraph position="9"> For our task, the query is )(cC , the context (i.e., the surrounding words) of a Chinese word c . Each )(eC , the context of an English word e , is considered as a document in IR. If an English word e is the translation of a Chinese word c , they will have similar contexts. So we use the query )(cC to retrieve a document )(</Paragraph>
    <Paragraph position="11"> best matches the query. The English word</Paragraph>
    <Paragraph position="13"> Within IR, there is a new approach to document retrieval called the language modeling approach (Ponte &amp; Croft, 98). In this approach, a language model is derived from each document D . Then the probability of generating the query Q according to that language model, )|( DQP , is estimated. The document with the highest )|( DQP is the one that best matches the query.</Paragraph>
    <Paragraph position="14"> The language modeling approach to IR has been shown to give superior retrieval performance (Ponte &amp; Croft, 98; Ng, 2000), compared with traditional vector space model, and we adopt this approach in our current work.</Paragraph>
    <Paragraph position="15"> To estimate )|( DQP , we use the approach of (Ng, 2000). We view the document D as a multinomial distribution of terms and assume that query Q is generated by this model:</Paragraph>
    <Paragraph position="17"> c is the number of times term t occurs in the query Q ,</Paragraph>
    <Paragraph position="19"> cn is the total number of terms in query Q .</Paragraph>
    <Paragraph position="20"> For ranking purpose, the first fraction</Paragraph>
    <Paragraph position="22"> can be omitted as this part depends on the query only and thus is the same for all the documents.</Paragraph>
    <Paragraph position="23"> In our translation problem, )(cC is viewed as the query and )(eC is viewed as a document. So our task is to compute ))(|)(( eCcCP for each English word e and find the e that gives the highest ))(|)(( eCcCP , estimated as:</Paragraph>
    <Paragraph position="25"> bag of Chinese words obtained by translating the English words in )(eC , as determined by a bi-lingual dictionary. If an English word is ambiguous and has K translated Chinese words listed in the bilingual dictionary, then each of the K translated Chinese words is counted as occurring 1/K times in ))(( eCT c for the purpose of probability estimation.</Paragraph>
    <Paragraph position="26"> We use backoff and linear interpolation for</Paragraph>
    <Paragraph position="28"> is the number of occurrences of the term</Paragraph>
    <Paragraph position="30"> mated similarly by counting the occurrences of  c t in the Chinese translation of the whole English corpus. a is set to 0.6 in our experiments. 4. Translation by transliteration  For the transliteration model, we use a modified model of (Knight and Graehl, 1998) and (Al-Onaizan and Knight, 2002b).</Paragraph>
    <Paragraph position="31"> Knight and Graehl (1998) proposed a probabilistic model for machine transliteration. In this model, a word in the target language (i.e., English in our task) is written and pronounced. This pronunciation is converted to source language pronunciation and then to source language word (i.e., Chinese in our task). Al-Onaizan and Knight (2002b) suggested that pronunciation can be skipped and the target language letters can be mapped directly to source language letters. Pinyin is the standard Romanization system of Chinese characters. It is phonetic-based. For transliteration, we estimate )|( ceP as follows:</Paragraph>
    <Paragraph position="33"> First, each Chinese character in a Chinese word c is converted to pinyin form. Then we sum over all the alignments that this pinyin form of c can map to an English word e. For each possible alignment, we calculate the probability by taking the product of each mapping.</Paragraph>
    <Paragraph position="35"> p is the ith syllable of pinyin,</Paragraph>
    <Paragraph position="37"> l is the English letter sequence that the ith pinyin syllable maps to in the particular alignment a.</Paragraph>
    <Paragraph position="38"> Since most Chinese characters have only one pronunciation and hence one pinyin form, we assume that Chinese character-to-pinyin mapping is one-to-one to simplify the problem. We use the expectation maximization (EM) algorithm to generate mapping probabilities from pinyin syllables to English letter sequences. To reduce the search space, we limit the number of English letters that each pinyin syllable can map to as 0, 1, or 2. Also we do not allow cross mappings.</Paragraph>
    <Paragraph position="39"> That is, if an English letter sequence  Our method differs from (Knight and Graehl, 1998) and (Al-Onaizan and Knight, 2002b) in that our method does not generate candidates but only estimates )|( ceP for candidates e appearing in the English corpus. Another difference is that our method estimates )|( ceP directly, instead of )|( ecP and )(eP .</Paragraph>
  </Section>
  <Section position="4" start_page="21" end_page="21" type="metho">
    <SectionTitle>
5. Experiment
5.1 Resources
</SectionTitle>
    <Paragraph position="0"> For the Chinese corpus, we used the Linguistic Data Consortium (LDC) Chinese Gigaword Corpus from Jan 1995 to Dec 1995. The corpus of the period Jul to Dec 1995 was used to come up with new Chinese words c for translation into English. The corpus of the period Jan to Jun 1995 was just used to determine if a Chinese word c from Jul to Dec 1995 was new, i.e., not occurring from Jan to Jun 1995. Chinese Gigaword corpus consists of news from two agencies: Xinhua News Agency and Central News Agency.</Paragraph>
    <Paragraph position="1"> As for English corpus, we used the LDC English Gigaword Corpus from Jul to Dec 1995. The English Gigaword corpus consists of news from four newswire services: Agence France Press English Service, Associated Press Worldstream English Service, New York Times Newswire Service, and Xinhua News Agency English Service. To avoid accidentally using parallel texts, we did not use the texts of Xinhua News Agency English Service.</Paragraph>
    <Paragraph position="2"> The size of the English corpus from Jul to Dec 1995 was about 730M bytes, and the size of the Chinese corpus from Jul to Dec 1995 was about 120M bytes.</Paragraph>
    <Paragraph position="3"> We used a Chinese-English dictionary which contained about 10,000 entries for translating the words in the context. For the training of transliteration probability, we required a Chinese-English name list. We used a list of 1,580 Chinese-English name pairs as training data for the EM algorithm.</Paragraph>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
5.2 Preprocessing
</SectionTitle>
      <Paragraph position="0"> Unlike English, Chinese text is composed of Chinese characters with no demarcation for words. So we first segmented Chinese text with a Chinese word segmenter that was based on maximum entropy modeling (Ng and Low, 2004).</Paragraph>
      <Paragraph position="1"> We then divided the Chinese corpus from Jul to Dec 1995 into 12 periods, each containing text from a half-month period. Then we determined the new Chinese words in each half-month period p. By new Chinese words, we refer to those words that appeared in this period p but not from Jan to Jun 1995 or any other periods that preceded p. Among all these new words, we selected those occurring at least 5 times. These words made up our test set. We call these words Chinese source words. They were the words that we were supposed to find translations from the English corpus.</Paragraph>
      <Paragraph position="2"> For the English corpus, we performed sentence segmentation and converted each word to its morphological root form and to lower case. We also divided the English corpus into 12 periods, each containing text from a half-month period. For each period, we selected those English words occurring at least 10 times and were not present in the 10,000-word Chinese-English dictionary we used and were not stop words. We considered these English words as potential translations of the Chinese source words. We call them English translation candidate words. For a Chinese source word occurring within a half-month period p, we looked for its English translation candidate words occurring in news documents in the same period p.</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
5.3 Translation candidates
</SectionTitle>
      <Paragraph position="0"> The context )(cC of a Chinese word c was collected as follows: For each occurrence of c, we set a window of size 50 characters centered at c.</Paragraph>
      <Paragraph position="1"> We discarded all the Chinese words in the context that were not in the dictionary we used. The contexts of all occurrences of a word c were then concatenated together to form )(cC . The context of an English translation candidate word e, )(eC , was similarly collected. The window size of English context was 100 words.</Paragraph>
      <Paragraph position="2"> After all the counts were collected, we estimated ))(|)(( eCcCP as described in Section 3, for each pair of Chinese source word and English translation candidate word. For each Chinese source word, we ranked all its English translation candidate words according to the estimated ))(|)(( eCcCP .</Paragraph>
      <Paragraph position="3"> For each Chinese source word c and an English translation candidate word e , we also calculated the probability )|( ceP (as described in Section 4), which was used to rank the English candidate words based on transliteration.</Paragraph>
      <Paragraph position="4"> Finally, the English candidate word with the smallest average rank position and that appears within the top M positions of both ranked lists is the chosen English translation (as described in Section 2). If no words appear within the top M positions in both ranked lists, then no translation is output.</Paragraph>
      <Paragraph position="5"> Note that for many Chinese words, only one English word e appeared within the top M positions for both lists. And among those cases where more than one English words appeared within the top M positions for both lists, many were multiple translations of a Chinese word. This happened for example when a Chinese word was a non-English person name. The name could have multiple translations in English. For example, Mi Luo Xi Nuo was a Russian name. Mirochina and Miroshina both appeared in top 10 positions of both lists. Both were correct.</Paragraph>
    </Section>
    <Section position="3" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
5.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> We evaluated our method on each of the 12 half-month periods. The results when we set M = 10 are shown in Table 1.</Paragraph>
      <Paragraph position="1">  In Table 1, period 1 is Jul 01 - Jul 15, period 2 is Jul 16 - Jul 31, ..., period 12 is Dec 16 - Dec 31. #c is the total number of new Chinese source words in the period. #e is the total number of English translation candidates in the period. #o is the total number of output English translations. #Cor is the number of correct English translations output. Prec. is the precision. The correctness of the English translations was manually checked.</Paragraph>
      <Paragraph position="2"> Recall is somewhat difficult to estimate because we do not know whether the English translation of a Chinese word appears in the English part of the corpus. We attempted to estimate recall by manually finding the English translations for all the Chinese source words for the two periods Dec 01 - Dec 15 and Dec 16 - Dec 31 in the English part of the corpus. During the whole December period, we only managed to find English translations which were present in the English side of the comparable corpora for 43 Chinese words. So we estimate that English translations are present in the English part of the corpus for 3624499)205329(43 =x+ words in all 12 periods. And our program finds correct translations for 115 words. So we estimate that recall (for M = 10) is approximately %8.31362/115 = .</Paragraph>
      <Paragraph position="3"> We also investigated the effect of varying M .</Paragraph>
      <Paragraph position="4"> The results are shown in Table 2.</Paragraph>
      <Paragraph position="6"> The past research of (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999) utilized context information alone and was evaluated on different corpora from ours, so it is difficult to directly compare our current results with theirs. Similarly, Al-Onaizan and Knight (2002a; 2002b) only made use of transliteration information alone and so was not directly comparable.</Paragraph>
      <Paragraph position="7"> To investigate the effect of the two individual sources of information (context and transliteration), we checked how many translations could be found using only one source of information (i.e., context alone or transliteration alone), on those Chinese words that have translations in the English part of the comparable corpus. As mentioned earlier, for the month of Dec 1995, there are altogether 43 Chinese words that have their translations in the English part of the corpus.</Paragraph>
      <Paragraph position="8"> This list of 43 words is shown in Table 3. 8 of the 43 words are translated to English multi-word phrases (denoted as &amp;quot;phrase&amp;quot; in Table 3). Since our method currently only considers unigram English words, we are not able to find translations for these words. But it is not difficult to extend our method to handle this problem. We can first use a named entity recognizer and noun phrase chunker to extract English names and noun phrases.</Paragraph>
      <Paragraph position="9"> The translations of 6 of the 43 words are words in the dictionary (denoted as &amp;quot;comm.&amp;quot; in Table 3) and 4 of the 43 words appear less than 10 times in the English part of the corpus (denoted as &amp;quot;insuff&amp;quot;). Our method is not able to find these translations. But this is due to search space pruning. If we are willing to spend more time on searching, then in principle we can find these translations.</Paragraph>
      <Paragraph position="10">  - Dec 15 and Dec 16 - Dec 31. 'Cont. rank' is the context rank, 'Trans. Rank' is the transliteration rank. 'NA' means the word cannot be transliterated. 'insuff' means the correct translation appears less than 10 times in the English part of the comparable corpus. 'comm' means the correct translation is a word appearing in the dictionary we used or is a stop word. 'phrase' means the correct translation contains multiple English words.</Paragraph>
      <Paragraph position="11"> As shown in Table 3, using just context information alone, 10 Chinese words (the first 10) have their correct English translations at rank one position. And using just transliteration information alone, 9 Chinese words have their correct English translations at rank one position.</Paragraph>
      <Paragraph position="12"> On the other hand, using our method of combining both sources of information and setting M = [?], 19 Chinese words (i.e., the first 22 Chinese words in Table 3 except Ba Zuo Ya ,Gan Guo ,Pu Li Fa ) have their correct English translations at rank one position. If M = 10, 15 Chinese words (i.e., the first 19 Chinese words in Table 3 except Xie Ma Si ,Ba Zuo Ya ,Gan Guo ,Pu Li Fa ) have their correct English translations at rank one position. Hence, our method of using both sources of information outperforms using either information source alone.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="21" end_page="21" type="metho">
    <SectionTitle>
6. Related work
</SectionTitle>
    <Paragraph position="0"> As pointed out earlier, most previous research only considers either transliteration or context information in determining the translation of a source language word w, but not both sources of information. For example, the work of (Al-Onaizan and Knight, 2002a; Al-Onaizan and Knight, 2002b; Knight and Graehl, 1998) used only the pronunciation or spelling of w in translation. On the other hand, the work of (Cao and Li, 2002; Fung and Yee, 1998; Rapp, 1995; Rapp, 1999) used only the context of w to locate its translation in a second language. In contrast, our current work attempts to combine both complementary sources of information, yielding higher accuracy than using either source of information alone.</Paragraph>
    <Paragraph position="1"> Koehn and Knight (2002) attempted to combine multiple clues, including similar context and spelling. But their similar spelling clue uses the longest common subsequence ratio and works only for cognates (words with a very similar spelling).</Paragraph>
    <Paragraph position="2"> The work that is most similar to ours is the recent research of (Huang et al., 2004). They attempted to improve named entity translation by combining phonetic and semantic information.</Paragraph>
    <Paragraph position="3"> Their contextual semantic similarity model is different from our language modeling approach to measuring context similarity. It also made use of part-of-speech tag information, whereas our method is simpler and does not require part-of-speech tagging. They combined the two sources of information by weighting the two individual scores, whereas we made use of the average rank for combination.</Paragraph>
  </Section>
class="xml-element"></Paper>