<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1312">
  <Title>Cross-lingual Information Retrieval using Hidden Markov Models</Title>
  <Section position="4" start_page="95" end_page="95" type="metho">
    <SectionTitle>
3 HMM for Cross-lingual IR
</SectionTitle>
    <Paragraph position="0"> For CLIR we extend the query generation process so that a document Dy written in language y can generate a query Qx in language x. We use Wx to denote a word in x and Wy to denote a word in y. As before, to model general query words from language x, we estimate P(Wx \]Gx) by using a large corpus Cx in language x.</Paragraph>
    <Paragraph position="1"> Also as before, we estimate P(WyIDy) to be the sample distribution of Wy in Dy.</Paragraph>
    <Paragraph position="2"> We use P(Wx\[Wy) to denote the probability that Wy is translated as Wx. Though terms often should not be translated independent of their context, we make that simplifying assumption here. We assume that the possible translations are specified by a bilingual lexicon BL. Since the event spaces for Wy's in P(WyIDy) are mutually exclusive, we can compute the output</Paragraph>
    <Paragraph position="4"> We compute P(Q~IDy is R) as below: P(Qx IDr /sR) = I~I(aetwx IG,)+O-a)P(W~ IDy)) w.~,o. The above model generates queries from documents, that is, it attempts to determine how likely a particular query is given a relevant document. The retrieval system, however, can use either query translation or document translation. We chose query translation over document translation for its flexibility, since it allowed us to experiment with a new method of estimating the translation probabilities without changing the index structure.</Paragraph>
  </Section>
  <Section position="5" start_page="95" end_page="97" type="metho">
    <SectionTitle>
4 Experimental Set-up
</SectionTitle>
    <Paragraph position="0"> For retrieval using English queries to search Chinese documents, we used the TREC5 and TREC6 Chinese data which consists of 164,789 documents from the Xinhua News Agency and People's Daily, averaging 450 Chinese characters/document. Each of the TREC topics has three Chinese fields: title, description and  narrative, plus manually translated, English versions of each. We corrected some of the English queries that contained errors, such as &amp;quot;Dali Lama&amp;quot; instead of the correct &amp;quot;Dalai Lama&amp;quot; and &amp;quot;Medina&amp;quot; instead of &amp;quot;Medellin.&amp;quot; Stop words and stop phrases were removed. We created three versions of Chinese queries and three versions of English queries: short (title only), medium (title and description), and long (all three fields).</Paragraph>
    <Paragraph position="1"> For retrieval using English queries to search Spanish documents, we used the TREC4 Spanish data, which has 57,868 documents. It has 25 queries in Spanish with manual translations to English. We will denote the Chinese data sets as Trec5C and Trec6C and the Spanish data set as Trec4S.</Paragraph>
    <Paragraph position="2"> We used a Chinese-English lexicon from the  Linguistic Data Consortium (LDC). We pre-processed the dictionary as follows: 1. Stem Chinese words via a simple algorithm to remove common suffixes and prefixes.</Paragraph>
    <Paragraph position="3"> 2. Use the Porter stemmer on English words.</Paragraph>
    <Paragraph position="4"> 3. Split English phrases into words. If an  English phrase is a translation for a Chinese word, each word in the phrase is taken as a separate translation for the Chinese word. ~ 4. Estimate the translation probabilities. (We first report results assuming a uniform distribution on a word's translations. If a Chinese word c has n translations el, e2, ...en. each of them will be assigned equal probability, i.e., P(eilc)=l/n. Section 10 supplements this with a corpus-based distribution.) 5. Invert the lexicon to make it an English-Chinese lexicon. That is, for each English word e, we associate it with a list of Chinese words cl, c2, ... Cm together with non-zero translation probabilities P( elc~).</Paragraph>
    <Paragraph position="5"> The resulting English-Chinese lexicon has 80,000 English words. On average, each English word has 2.3 Chinese translations.</Paragraph>
    <Paragraph position="6"> For Spanish, we downloaded a bilingual English-Spanish lexicon from the Internet (http://www.activa.arrakis.es) containing around 22,000 English words (16,000 English stems) and processed it similarly. Each English word has around 1.5 translations on average. A co-occurrence based stemmer (Xu and Croft, 1998) was used to stem Spanish words. One difference from the treatment of Chinese is to include the English word as one of its own translations in addition to its Spanish translations in the lexicon. This is useful for translating proper nouns, which often have identical spellings in English and Spanish but are routinely excluded from a lexicon.</Paragraph>
    <Paragraph position="7"> One problem is the segmentation of Chinese text, since Chinese has no spaces between words. In these initial experiments, we relied on a simple sub-string matching algorithm to extract words from Chinese text. To extract words from a string of Chinese characters, the algorithm examines any sub-string of length 2 or greater and recognizes it as a Chinese word if it is in a predefined dictionary (the LDC lexicon in our case). In addition, any single character which is not part of any recognized Chinese words in the first step is taken as a Chinese word. Note that this algorithm can extract a compound Chinese word as well as its components. For example, the Chinese word for &amp;quot;particle physics&amp;quot; as well as the Chinese words for &amp;quot;particle&amp;quot; and &amp;quot;physics&amp;quot; will be extracted. This seems desirable because it ensures the retrieval algorithm will match both the compound words as well as their components.</Paragraph>
    <Paragraph position="8"> The above algorithm was used in processing Chinese documents and Chinese queries.</Paragraph>
    <Paragraph position="9"> English data from the 2 GB of TREC disks l&amp;2 was used to estimate P(WlG,..ngti~h), the general language probabilities for English words. The evaluation metric used in this study is the average precision using the trec_eval program (Voorhees and Harman, 1997). Mono-lingual retrieval results (using the Chinese and Spanish queries) provided our baseline, with the HMM retrieval system (Miller et al, 1999).</Paragraph>
  </Section>
  <Section position="6" start_page="97" end_page="97" type="metho">
    <SectionTitle>
5 Retrieval Results
</SectionTitle>
    <Paragraph position="0"> Table 2 reports average precision for mono-lingual retrieval, average precision for crosslingual, and the relative performance ratio of cross-lingual retrieval to mono-lingual.</Paragraph>
    <Paragraph position="1"> Relative performance of cross-lingual IR varies between 67% and 84% of mono-lingual IR.</Paragraph>
    <Paragraph position="2"> Trec6 Chinese queries have a somewhat higher relative performance than Trec5 Chinese queries. Longer queries have higher relative performance than short queries in general.</Paragraph>
    <Paragraph position="3"> Overall, cross-lingual performance using our HMM retrieval model is around 76% of mono-lingual retrieval. A comparison of our mono-lingual results with Trec5 Chinese and Trec6 Chinese results published in the TREC proceedings (Voorhees and Harman, 1997, 1998) shows that our mono-lingual results are close to the top performers in the TREC conferences. Our Spanish mono-lingual performance is also comparable to the top automatic runs of the TREC4 Spanish task (Harrnan, 1996). Since these mono-lingual results were obtained without using sophisticated query processing techniques such as query expansion, we believe the mono-lingual results form a valid baseline.</Paragraph>
    <Paragraph position="4">  lingual retrieval performance. The scores on the monolingual and cross-lingual columns are average precision.</Paragraph>
  </Section>
  <Section position="7" start_page="97" end_page="98" type="metho">
    <SectionTitle>
6 Comparison with other Methods
</SectionTitle>
    <Paragraph position="0"> In this section we compare our approach with two other approaches. One approach is &amp;quot;simple substitution&amp;quot;, i.e., replacing a query term with all its translations and treating the translated query as a bag of words in mono-lingual retrieval. Suppose we have a simple query Q=(a, b), the translations for a are al, a2, a3, and the translations for b are bl, b2. The translated query would be (at, a2, a3, b~, b2). Since all terms are treated as equal in the translated query, this gives terms with more translations (potentially the more common terms) more credit in retrieval, even though such terms should potentially be given less credit if they are more common. Also, a document matching different translations of one term in the original query may be ranked higher than a document that matches translations of different terms in the original query. That is, a document that contains terms at, a2 and a3 may be ranked higher than a document which contains terms at and bl. However, the second document is more likely to be relevant since correct translations of the query terms are more likely to co-occur (Ballesteros and Croft, 1998).</Paragraph>
    <Paragraph position="1"> A second method is to structure the translated query, separating the translations for one term from translations for other terms. This approach limits how much credit the retrieval algorithm can give to a single term in the original query and prevents the translations of one or a few terms from swamping the whole query. There are several variations of such a method (Ballesteros and Croft, 1998; Pirkola, 1998; Hull 1997). One such method is to treat different translations of the same term as synonyms.</Paragraph>
    <Paragraph position="2"> Ballesteros, for example, used the INQUERY (Callan et al, 1995) synonym operator to group translations of different query terms. However, if a term has two translations in the target language, it will treat them as equal even though one of them is more likely to be the correct translation than the other. By contrast, our HMM approach supports translation probabilities. The synonym approach is equivalent to changing all non-zero translation probabilities P(W~\[ Wy)'s to 1 in our retrieyal function. Even estimating uniform translation probabilities gives higher weights to unambiguous translations and lower weights to highly ambiguous translations.</Paragraph>
    <Paragraph position="3">  These intuitions are supported empirically by the results in Table 3. We can see that the HMM performs best for every query set. Simple substitution performs worst. The synonym approach is significantly better than substitution, but is consistently worse than the HMM translations were kept in disambiguation, the improvement would be 4% for Trec6C-medium.</Paragraph>
    <Paragraph position="4"> The results of this manual disambiguation suggest that there are limits to automatic disambiguation.</Paragraph>
  </Section>
  <Section position="8" start_page="98" end_page="98" type="metho">
    <SectionTitle>
7 Impact of Translation Ambiguity
</SectionTitle>
    <Paragraph position="0"> To get an upper bound on performance of any disambiguation technique, we manually disambiguated the Trec5C-medium, Trec6C-medium and Trec4S queries. That is, for each English query term, a native Chinese or Spanish speaker scanned the list of translations in the bilingual lexicon and kept one translation deemed to be the best for the English term and discarded the rest. If none of the translations was correct, the first one was chosen.</Paragraph>
    <Paragraph position="1"> The results in Table 4 show that manual disambiguation improves performance by 17% on Trec5C, 4% on Trec4S, but not at all on Trec6C. Furthermore, the improvement on Trec5C appears to be caused by big improvements for a small number of queries.</Paragraph>
    <Paragraph position="2"> The one-sided t-test (Hull, 1993) at significance level 0.05 indicated that the improvement on Trec5C is not statistically significant.</Paragraph>
    <Paragraph position="3"> It seems surprising that disambiguation does not help at all for Trec6C. We found that many terms have more than one valid translation. For example, the word &amp;quot;flood&amp;quot; (as in &amp;quot;flood control&amp;quot;) has 4 valid Chinese translations. Using all of them achieves the desirable effect of query expansion. It appears that for Trec6C, the benefit of disambiguation is cancelled by choosing only one of several alternatives, discarding those other good translations. If multiple correct  are average precision.</Paragraph>
  </Section>
  <Section position="9" start_page="98" end_page="99" type="metho">
    <SectionTitle>
8 Impact of Missing Translations
</SectionTitle>
    <Paragraph position="0"> Results in the previous section showed that manual disambiguation can bring performance of cross-lingual IR to around 82% of mono-lingual IR. The remaining performance gap between mono-lingual and cross-lingual IR is likely to be caused by the incompleteness of the bilingual lexicon used for query translation, i.e., missing translations for some query terms. This may be a more serious problem for cross-lingual IR than ambiguity. To test the conjecture, for each English query term, a native speaker in Chinese or Spanish manually checked whether the bilingual lexicon contains a correct translation for the term in the context of the query. If it does not, a correct translation for the term was added to the lexicon. For the query sets Trec5C-medium and Trec6C-medium, there are 100 query terms for which the lexicon does not have a correct translation. This represents 19% of the 520 query terms (a term is counted only once in one query). For the query set Trec4S, the percentage is 12%.</Paragraph>
    <Paragraph position="1"> The results in Table 5 show that with augmented lexicons, performance of cross-lingual IR is 91%, 99% and 95% of mono-lingual IR on Trec5C-mediurn, Trec6C-medium and Trec4S.</Paragraph>
    <Paragraph position="2">  The improvement over using the original lexicon is 28%, 18% and 23% respectively. The results demonstrate the importance cff a complete lexicon. Compared with the results in section 7, the results here suggest that missing translations have a much larger impact on cross-lingual IR</Paragraph>
  </Section>
  <Section position="10" start_page="99" end_page="100" type="metho">
    <SectionTitle>
9 Impact of Lexicon Size
</SectionTitle>
    <Paragraph position="0"> In this section we measure CLIR performance as a function of lexicon size. We sorted the English words from TREC disks l&amp;2 in order of decreasing frequency. For a lexicon of size n, we keep only the n most frequent English words.</Paragraph>
    <Paragraph position="1"> The upper graph in Figure 1 shows the curve of cross-lingual IR performance as a function of the size of the lexicon based on the Chinese short and medium-length queries. Retrieval performance was averaged over Trec5C and Trec6C. Initially retrieval performance increases sharply with lexicon size. After the dictionary exceeds 20,000, performance levels off. An examination of the translated queries shows that words not appearing in the 20,000-word lexicon usually do not appear in the larger lexicons either. Thus, increases in the general lexicon beyond 20,000 words did not result in a substantial increase in the coverage of the query terms.</Paragraph>
    <Paragraph position="2"> The lower graph in Figure 1 plots the retrieval performance as a function of the percent of the full lexicon. The figure shows that short queries are more susceptible to incompleteness of the lexicon than longer queries. Using a 7,000-word lexicon, the short queries only achieve 75% of their performance with the full lexicon. In comparison, the medium-length queries achieve  87% of their performance.</Paragraph>
    <Paragraph position="3"> \[--*- Short Query 4-- Medium Query J  We categorized the missing terms and found that most of them are proper nouns (especially locations and person names), highly technical terms, or numbers. Such words understandably do not normally appear in traditional lexicons. Translation of numbers can be solved using simple rules. Transliteration, a technique that guesses the likely translations of a word based on pronunciation, can be readily used in translating proper nouns.</Paragraph>
    <Paragraph position="4"> Another technique is automatic discovery of translations from parallel or non-parallel corpora (Fung and Mckeown, 1997). Since traditional lexicons are more or less static repositories of knowledge, techniques that discover translation from newly published materials can supplement them with corpus-specific vocabularies.</Paragraph>
    <Section position="1" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
10 Using a Parallel Corpus
</SectionTitle>
      <Paragraph position="0"> In this section we estimate translation probabilities from a parallel corpus rather than assuming uniform likelihood as in section 4. A Hong Kong News corpus obtained from the Linguistic Data Consortium has 9,769 news stories in Chinese with English translations. It has 3.4 million English words. Since the documents are not exact translations of each other, occasionally having extra or missing sentences, we used document-level co-occurrence to estimate translation probabilities. The Chinese documents were &amp;quot;segmented&amp;quot; using the technique discussed in section 4. Let co(e,c) be the number of parallel documents where an English word e and a Chinese word c co-occur, and df(c) be the document frequency of c. If a Chinese word c has n possible translations el to en in the bilingual lexicon, we estimate the corpus translation probability as:</Paragraph>
      <Paragraph position="2"> Since several translations for c may co-occur in a document, ~co(e~ c) can be greater than df(c).</Paragraph>
      <Paragraph position="3"> Using the maximum of the two ensures that E P_corpus(eilc)_&lt;l.</Paragraph>
      <Paragraph position="4"> Instead of relying solely on corpus-based estimates from a small parallel corpus, we employ a mixture model as follows: P( e I c) = ~ P _ corpus( e I c) + (1- #)P_ lexicon( e \[ c) The retrieval results in Table 6 show that combining the probability estimates from the lexicon and the parallel corpus does improve retrieval performance. The best results are obtained when 13=0.7; this is better than using uniform probabilities by 9% on Trec5C-medium and 4% on Trec6C-medium. Using the corpus probability estimates alone results in a significant drop in performance, the parallel corpus is not large enough nor diverse enough for reliable estimation of the translation probabilities. In fact, many words do not appear in the corpus at all. With a larger and better parallel corpus, more weight should be given to the probability estimates from the corpus.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>