<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1061">
  <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 483-490, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. Mining Key Phrase Translations from Web Corpora</Title>
  <Section position="3" start_page="483" end_page="484" type="intro">
    <SectionTitle>
2 Retrieving Web Page Snippets through
Cross-lingual Query Expansion
</SectionTitle>
    <Paragraph position="0"> For a Chinese key phrase f, we want to find its translation e on the web, more specifically in mixed-language web pages or web page snippets that contain both f and e. Because we do not know e, we cannot directly retrieve such mixed-language pages using (f, e) as the query.</Paragraph>
    <Paragraph position="1"> Figure 2. Returned mixed-language web page snippets using cross-lingual query expansion.
 However, we observed that when the author of a web page lists both f and e in a page, it is very likely that f' and e' are listed in the same page, where f' is a Chinese hint word topically relevant to f, and e' is the translation of f'. Therefore, if we know a Chinese hint word f' and its reliable translation e', we can send (f, e') as a query to retrieve mixed-language web pages containing (f, e).</Paragraph>
    <Paragraph position="2"> For example, to find web pages that contain translations of &quot;Fu Shi De&quot; (Faust), we expand the query to &quot;Fu Shi De +goethe&quot;, since &quot;Ge De&quot; (Goethe) is the author of &quot;Fu Shi De&quot; (Faust). Figure 2 illustrates web page snippets retrieved with expanded queries. We find that the newly returned snippets contain more correct translations at higher ranks.</Paragraph>
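The expansion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `expand_query` is hypothetical, and the lowercase "+hint" joining convention is simply copied from the Faust example.

```python
def expand_query(source_phrase: str, english_hints: list[str]) -> list[str]:
    """Pair the source phrase f with each English hint e' to form expanded queries."""
    queries = []
    for hint in english_hints:
        hint = hint.strip().lower()  # assumption: hints are sent in lowercase
        if hint:  # skip empty hints
            queries.append(f"{source_phrase} +{hint}")
    return queries


# The Faust example, with the transliterated phrase standing in for the Chinese text.
print(expand_query("Fu Shi De", ["Goethe"]))
```

Each returned string would then be sent to the search engine in place of the bare phrase f.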
    <Paragraph position="3"> To propose a &quot;good&quot; English hint e' for f, we first need to find a Chinese hint word f' that is relevant to f. Because f is often an out-of-vocabulary (OOV) word, it is unlikely that such information can be obtained from existing Chinese monolingual corpora. Instead, we query Google for web pages containing f. From the returned snippets we select Chinese words f' based on the following criteria: 1. f' should be relevant to f based on co-occurrence frequency. On average, 300 Chinese words are returned for each query f. We consider only those words that occur at least twice to be relevant.</Paragraph>
    <Paragraph position="4"> 2. f' can be reliably translated given the current bilingual resources (e.g., the LDC Chinese-English lexicon with 81,945 translation entries).</Paragraph>
    <Paragraph position="5"> 3. The meaning of f' should not be too ambiguous. Words with many translations are not used.</Paragraph>
    <Paragraph position="6"> 4. f' should be translated into a noun or noun phrase. Because most OOV words are nouns or noun phrases, we ignore source words that are translated into other parts of speech. The  is used to generate the English noun lists.</Paragraph>
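The four criteria above can be read as a filter over the words found in the returned snippets. The sketch below is illustrative only. The function name, the toy lexicon, the noun list, and the snippet words are all hypothetical. The `max_translations` threshold is also an assumption, since the text gives no exact number for "many translations". Only the minimum count of two comes from the text.

```python
from collections import Counter


def select_hint_candidates(snippet_words, lexicon, english_nouns,
                           min_count=2, max_translations=3):
    """Return {f': (count, translation)} for words that pass criteria 1-4."""
    counts = Counter(snippet_words)
    selected = {}
    for word, count in counts.items():
        translations = lexicon.get(word, [])
        relevant = count >= min_count                        # criterion 1
        translatable = len(translations) >= 1                # criterion 2
        unambiguous = max_translations >= len(translations)  # criterion 3 (assumed threshold)
        nouns = [t for t in translations if t in english_nouns]  # criterion 4
        if relevant and translatable and unambiguous and nouns:
            selected[word] = (count, nouns[0])
    return selected


# Hypothetical toy data; real input would be words segmented from search snippets.
snippet_words = ["Ge De", "Ge De", "Wen Xue", "Wen Xue", "Wen Xue",
                 "De", "De", "Bei Ju"]
lexicon = {
    "Ge De": ["goethe"],
    "Wen Xue": ["literature"],
    "De": ["of", "target", "really", "get"],  # too many translations
    "Bei Ju": ["tragedy"],                    # occurs only once
}
english_nouns = {"goethe", "literature", "tragedy", "target"}

selected = select_hint_candidates(snippet_words, lexicon, english_nouns)
print(selected)
```

In this sketch a word is kept if at least one of its translations is a noun. A stricter reading of criterion 4 would require every translation to be a noun.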
    <Paragraph position="7"> For each f, the Chinese words f' with the highest frequency are selected. Their corresponding translations are then used as the cross-lingual hint words for f. For example, for the OOV word f = Fu Shi De (Faust), the top candidates for f' are &quot;Ge De (Goethe)&quot;, &quot;Jie Jian (introduction)&quot;, &quot;Wen Xue (literature)&quot; and &quot;Bei Ju (tragedy)&quot;. We expand the original query &quot;Fu Shi De&quot; to &quot;Fu Shi De + goethe&quot;, &quot;Fu Shi De + introduction&quot;, &quot;Fu Shi De + literature&quot; and &quot;Fu Shi De + tragic&quot;, and then query Google again for web page snippets containing the correct translation &quot;Faust&quot;.</Paragraph>
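The final ranking step can be sketched as a sort by frequency followed by a cut-off. This is a hedged illustration: the function name, the `top_n` value of 4, the co-occurrence counts, and the extra candidate "Zuo Pin" are all made up for the example. Only the four hint words and the resulting queries follow the Faust example above.

```python
def top_hint_translations(candidates: dict[str, tuple[int, str]],
                          top_n: int = 4) -> list[str]:
    """Rank {f': (count, translation)} by count and return the top_n translations."""
    ranked = sorted(candidates.items(), key=lambda item: item[1][0], reverse=True)
    return [translation for _, (_, translation) in ranked[:top_n]]


# Hypothetical counts; the real values would come from the returned snippets.
candidates = {
    "Wen Xue": (7, "literature"),
    "Ge De": (12, "goethe"),
    "Zuo Pin": (2, "work"),  # invented low-frequency candidate
    "Bei Ju": (5, "tragic"),
    "Jie Jian": (9, "introduction"),
}

hints = top_hint_translations(candidates)
queries = [f"Fu Shi De + {hint}" for hint in hints]
print(queries)
```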
  </Section>
</Paper>