<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1008">
  <Title>A Transitive Model for Extracting Translation Equivalents of Web Queries through Anchor Text Mining</Title>
  <Section position="3" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 The Previous Approach
</SectionTitle>
    <Paragraph position="0"> For query translation, the anchor-text-based approach is a newer technique than the bilingual-dictionary-based and parallel-corpus-based approaches. This section introduces its basic concept.</Paragraph>
    <Paragraph position="1"> For more details, please refer to our initial work (Lu et al., 2001).</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.1 Anchor-Text Set
</SectionTitle>
      <Paragraph position="0"> An anchor text is the descriptive part of an out-link of a Web page. It represents a brief description of the linked Web page. For a Web page (or URL) $u_i$, its anchor-text set is defined as all of the anchor texts of the links, i.e., $u_i$'s in-links, pointing to $u_i$. The anchor-text set of $u_i$ may contain $u_i$'s alternative concepts and textual expressions, such as titles and headings, which are cited by other Web pages. Because authors differ in preferences, conventions and language competence, the anchor-text set can be composed of multilingual phrases, short texts, acronyms, or even $u_i$'s URL. For a query term appearing in the anchor-text set, it is likely that its translations appear there as well. From the viewpoint of translation extraction, the anchor-text sets can therefore be considered a comparable corpus of translated texts.</Paragraph>
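      <Paragraph> The anchor-text set of a page can be pictured as a mapping from the page's URL to the list of anchor texts of its in-links. The following is a minimal sketch, not the authors' crawler: the (source_url, target_url, anchor_text) triple layout, the function name and the example URLs are all assumptions made for illustration.

```python
from collections import defaultdict


def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to.

    `links` is an iterable of (source_url, target_url, anchor_text) triples;
    this triple layout is an assumption made for illustration.
    """
    anchor_sets = defaultdict(list)
    for _source_url, target_url, anchor_text in links:
        text = anchor_text.strip()
        if text:  # skip empty anchors such as image-only links
            anchor_sets[target_url].append(text)
    return dict(anchor_sets)


# Hypothetical in-links: three pages describe the same target in three ways.
links = [
    ("http://a.example/", "http://www.sony.example/", "Sony"),
    ("http://b.example/", "http://www.sony.example/", "新力"),
    ("http://c.example/", "http://www.sony.example/", "索尼"),
    ("http://c.example/", "http://other.example/", "  "),
]
anchor_sets = build_anchor_text_sets(links)
```

Each in-link contributes one entry, so the length of a page's list also gives its in-link count, which is convenient for an authority estimate.</Paragraph>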
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.2 The Probabilistic Inference Model
</SectionTitle>
      <Paragraph position="0"> To determine the most probable target translation t for source query term s, we developed a probabilistic inference model (Wong et al., 1995). This model estimates a probability value between the source query and each translation candidate that co-occurs with it in the same anchor-text sets. The estimation assumes that the anchor texts linking to the same pages may contain similar terms with analogous concepts. Therefore, a candidate translation has a higher chance of being an effective translation if it is written in the target language and frequently co-occurs with the source query term in the same anchor-text sets. In the field of Web research, the use of link structures has been shown to be effective for estimating the authority of Web pages (Kleinberg, 1998; Chakrabarti et al., 1998). The model further assumes that translation candidates in the anchor-text sets of pages with higher authority are more reliable. The similarity estimation function based on the probabilistic inference model is defined below:</Paragraph>
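      <Paragraph position="1"> $$P(s \leftrightarrow t) = \frac{P(s \cap t)}{P(s \cup t)} = \frac{\sum_{i=1}^{n} P(s \cap t \mid u_i)\, P(u_i)}{\sum_{i=1}^{n} P(s \cup t \mid u_i)\, P(u_i)} \qquad (1)$$</Paragraph>
      <Paragraph position="2"> The above measure estimates the degree of similarity between source term s and target translation t. It is based on their co-occurrence in the anchor-text sets of the concerned Web pages $U = \{u_1, u_2, \ldots, u_n\}$, in which $u_i$ is a page of concern and $P(u_i)$ is the probability value used to measure the authority of page $u_i$. By considering the link structures and concept space of Web pages, $P(u_i)$ is estimated as the probability of $u_i$ being linked: $$P(u_i) = \frac{L(u_i)}{\sum_{j=1}^{n} L(u_j)},$$ where $L(u_j)$ is the number of in-links of page $u_j$. This estimation is simplified from the HITS algorithm (Kleinberg, 1998).</Paragraph>
      <Paragraph position="5"> In addition, we assume that s and t are independent given $u_i$. The joint probability $P(s \cap t \mid u_i)$ is then equal to the product of $P(s \mid u_i)$ and $P(t \mid u_i)$, and the similarity measure becomes:</Paragraph>
      <Paragraph position="7"> $$P(s \leftrightarrow t) = \frac{\sum_{i=1}^{n} P(s \mid u_i)\, P(t \mid u_i)\, P(u_i)}{\sum_{i=1}^{n} \left[ P(s \mid u_i) + P(t \mid u_i) - P(s \mid u_i)\, P(t \mid u_i) \right] P(u_i)} \qquad (2)$$</Paragraph>
      <Paragraph position="11"> The values of $P(s \mid u_i)$ and $P(t \mid u_i)$ are estimated as the fractions of $u_i$'s in-links whose anchor texts contain s and t, respectively, over $L(u_i)$.</Paragraph>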
      <Paragraph position="14"> Therefore, a candidate translation is assigned a higher confidence value if it frequently co-occurs with the source term in the anchor-text sets of pages with higher authority.</Paragraph>
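      <Paragraph> The similarity estimation of this section can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: page authority is taken as a page's share of all in-links, the probability of a term given a page is taken as the fraction of that page's anchor texts containing the term (by plain substring match), the two terms are treated as independent given the page, and every anchor-text set is assumed non-empty. All function names and the toy data are hypothetical.

```python
def page_authority(anchor_sets):
    """P(u_i): share of all in-links that point to page u_i (assumed estimate)."""
    total = sum(len(texts) for texts in anchor_sets.values())
    return {url: len(texts) / total for url, texts in anchor_sets.items()}


def term_given_page(term, texts):
    """P(term | u_i): fraction of u_i's anchor texts that contain the term."""
    return sum(term in text for text in texts) / len(texts)


def similarity(s, t, anchor_sets):
    """Authority-weighted co-occurrence divided by authority-weighted
    occurrence of either term, assuming s and t are independent given a page."""
    authority = page_authority(anchor_sets)
    joint = 0.0
    either = 0.0
    for url, texts in anchor_sets.items():
        p_s = term_given_page(s, texts)
        p_t = term_given_page(t, texts)
        joint += p_s * p_t * authority[url]
        either += (p_s + p_t - p_s * p_t) * authority[url]
    return joint / either if either > 0 else 0.0


# Toy corpus: page u1 has four in-links, page u2 has two.
anchor_sets = {
    "u1": ["新力 Sony", "Sony", "新力", "Sony Corp"],
    "u2": ["Sony", "news"],
}
score = similarity("新力", "Sony", anchor_sets)
```

In this toy corpus the co-occurrence on the better-linked page u1 dominates the score, which is the intended effect of weighting by authority.</Paragraph>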
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.3 The Estimation Process
</SectionTitle>
      <Paragraph position="0"> For each source term, the probabilistic inference model extracts the most probable translation, i.e., the one that maximizes the estimation. The estimation process based on the model was developed to extract term translations by mining real-world anchor-text sets. The process contains three major computational modules: anchor-text extraction, term extraction and term translation extraction. The anchor-text extraction module collects pages from the Web and builds a corpus of anchor-text sets. For each given source term s, the term extraction module extracts key terms, as the translation candidate set, from the anchor-text sets of the pages containing s. Finally, the term translation extraction module extracts the translation that maximizes the similarity estimation. For more details about the estimation process, please refer to our previous work (Lu et al., 2001).</Paragraph>
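      <Paragraph> The last two modules can be sketched as a candidate-collection step followed by a ranking step. The sketch below is a simplification under stated assumptions: whitespace tokens stand in for the extracted key terms, the scoring function is passed in as a parameter, and the co-occurrence counter used in the demonstration is only a stand-in for the similarity estimation. All names and data are hypothetical.

```python
def candidate_terms(s, anchor_sets):
    """Collect candidate terms from the anchor-text sets of pages that mention s."""
    candidates = set()
    for texts in anchor_sets.values():
        if any(s in text for text in texts):
            for text in texts:
                # Whitespace tokens stand in for extracted key terms.
                candidates.update(text.split())
    candidates.discard(s)
    return candidates


def extract_translations(s, anchor_sets, score, n=5):
    """Return the n candidates with the highest score(s, t, anchor_sets)."""
    ranked = sorted(
        candidate_terms(s, anchor_sets),
        key=lambda t: (-score(s, t, anchor_sets), t),  # ties broken alphabetically
    )
    return ranked[:n]


def cooccurrence_count(s, t, anchor_sets):
    """Stand-in score: number of pages whose anchor-text set contains both terms."""
    return sum(
        any(s in x for x in texts) and any(t in x for x in texts)
        for texts in anchor_sets.values()
    )


anchor_sets = {
    "u1": ["新力", "Sony"],
    "u2": ["新力 Sony", "electronics"],
    "u3": ["news"],
}
top = extract_translations("新力", anchor_sets, cooccurrence_count, n=2)
```

Passing the score as a parameter keeps the ranking step independent of how the similarity is estimated.</Paragraph>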
      <Paragraph position="1"> To distinguish it from the translation process via an intermediate language, the above process is hereafter called direct translation, and the adopted model the direct translation model. We use the function $P_{direct}$ in Equation (3) for the estimation of direct translation.</Paragraph>
      <Paragraph position="2"> $$P_{direct}(s, t) = \log P(s \leftrightarrow t) \qquad (3)$$</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3 The Improved Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 The Indirect Translation Model
</SectionTitle>
      <Paragraph position="0"> As mentioned above, for query terms whose translations either appear infrequently in the same anchor-text sets or do not appear together at all, the estimation with Equation (2) is basically unreliable. To increase the possibility of translation extraction, especially for source terms whose translations do not co-occur with them, we add a phase of indirect translation through an intermediate language. For example, as shown in Fig. 1, our idea is to obtain the target translation "索尼" in simplified Chinese by translating the source term "新力" in traditional Chinese into an intermediate term "Sony" in English, and then translating "Sony" into the target term "索尼" in simplified Chinese. For both the source query and the target translation, we assume that their translations in the intermediate language are the same and can be found.</Paragraph>
      <Paragraph position="1"> The above assumption is not unrealistic.</Paragraph>
      <Paragraph position="2"> For example, it is possible to find the Chinese translation of a Japanese movie star's name by submitting his or her English name to a search engine and browsing the retrieved Chinese pages that contain the English name. The Web contains large amounts of multilingual pages, and English is the most likely intermediate language between other languages. Based on this assumption, we extend the probabilistic inference model and propose an indirect translation model, Equation (4), in which m is the transitive translation of s and t in the intermediate language, and $P(s \leftrightarrow m)$ and $P(m \leftrightarrow t)$ are the probability values obtained with the direct translation model, which can be calculated by Equation (2).</Paragraph>
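      <Paragraph> One way to picture indirect translation is to chain the two direct similarity values through the intermediate term. The sketch below assumes, purely for illustration, that the two values are combined in the log domain as a sum of log-similarities; this combination rule, the function names and the similarity values in the table are assumptions, not figures or formulas taken from the experiments.

```python
import math


def log_score(p):
    """Log of a similarity value, with a zero value mapped to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def indirect_score(s, t, m, sim):
    """Assumed indirect score: log-similarity of (s, m) plus log-similarity of (m, t)."""
    return log_score(sim.get((s, m), 0.0)) + log_score(sim.get((m, t), 0.0))


# Hypothetical direct similarity values between term pairs.
sim = {
    ("新力", "Sony"): 0.5,
    ("Sony", "索尼"): 0.4,
    ("Sony", "松下"): 0.01,
}
candidates = ["索尼", "松下"]
best = max(candidates, key=lambda t: indirect_score("新力", t, "Sony", sim))
```

A candidate that never co-occurs with the intermediate term receives a score of negative infinity and so can never be selected over one that does.</Paragraph>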
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 The Transitive Translation Model
</SectionTitle>
      <Paragraph position="0"> The transitive translation model combines the direct and indirect translation models to improve translation accuracy. It is defined by combining Equations (3) and (4).</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Analysis of Anchor-Text Sets and Query Logs
</SectionTitle>
      <Paragraph position="0"> In the initial experiments, we took traditional Chinese and simplified Chinese as the source and target languages, respectively, and used English as the intermediate language. We collected 1,980,816 traditional Chinese Web pages in Taiwan. Among these, the 109,416 pages whose anchor-text sets contained both traditional Chinese and English terms were taken as the anchor-text-set corpus. We also collected 2,179,171 simplified Chinese Web pages in China and extracted 157,786 pages whose anchor-text sets contained both simplified Chinese and English terms. In addition, by merging the two Web page collections into a larger one, we extracted 4,516 Web pages containing both traditional and simplified Chinese terms. The three comparable corpora provide a potential resource of translation pairs for some Web queries. To assess the feasibility of translating query terms via transitive translation, we aim to find the simplified Chinese translations of traditional Chinese query terms using English as the intermediate language.</Paragraph>
      <Paragraph position="1"> We also collected popular query terms from the logs of two real-world Chinese search engines in Taiwan, Dreamer and GAIS.</Paragraph>
      <Paragraph position="2"> The Dreamer log contained 228,566 unique query terms from a period of over 3 months in 1998, and the GAIS log contained 114,182 unique query terms from a period of two weeks in 1999. There were 9,709 popular query terms whose frequencies were above 10 in both logs, and 1,230 of them were English terms. After filtering out the terms that were used only locally, we obtained 258 terms. These query terms were taken as the major test set in the term translation extraction analysis. The traditional Chinese translations of the test query terms were determined manually and taken as the source query set in the following experiments.</Paragraph>
      <Paragraph position="3"> According to our previous work (Lu et al., 2001), there are three methods for term extraction, which is a necessary step in extracting translations from an anchor-text corpus. Since we have not yet collected a query log in simplified Chinese, in the following experiments we adopted the PAT-tree-based keyword extraction method, an efficient statistics-based approach that can extract longer terms without using a dictionary (Chien, 1997). To evaluate query translation performance, we used the average top-n inclusion rate as a metric. For a set of test query terms, the top-n inclusion rate is defined as the percentage of the query terms whose effective translation(s) can be found in the top n extracted translations.</Paragraph>
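      <Paragraph> The top-n inclusion rate can be computed directly from its definition. The following is a small sketch in which the dictionary layout, the function name and the toy data are hypothetical: `extracted` maps each test query term to its ranked list of extracted translations, and `gold` maps it to the set of effective translations.

```python
def top_n_inclusion_rate(extracted, gold, n):
    """Percentage of query terms with at least one effective translation
    among their top n extracted translations."""
    hits = sum(
        1
        for term, ranked in extracted.items()
        if set(ranked[:n]).intersection(gold.get(term, ()))
    )
    return 100.0 * hits / len(extracted)


# Toy data: four query terms with ranked candidates and reference translations.
extracted = {"q1": ["a", "b", "c"], "q2": ["x", "y"], "q3": [], "q4": ["m", "k"]}
gold = {"q1": {"b"}, "q2": {"z"}, "q3": {"w"}, "q4": {"m", "n"}}

top1 = top_n_inclusion_rate(extracted, gold, 1)
top2 = top_n_inclusion_rate(extracted, gold, 2)
```

A query term counts once no matter how many of its effective translations appear in the top n, so the rate can only stay the same or grow as n increases.</Paragraph>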
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Performance with the Direct Translation Model
</SectionTitle>
      <Paragraph position="0"> To assess the feasibility of the transitive translation model, we first carried out experiments based on the direct translation model and the three different anchor-text-set corpora. Table 1 shows the obtained top-5 inclusion rates, where the terms "TC", "SC" and "ENG" represent traditional Chinese, simplified Chinese and English terms, respectively. The performance of translating TC into SC is worse than that of the other two, since the anchor-text-set corpus containing both TC and SC is relatively small in comparison with the others. This is why this paper integrates direct translation with indirect translation via a third language. The performance of direct translation from TC to SC is nevertheless used as a reference for comparison with our proposed models in the following experiments.</Paragraph>
      <Paragraph position="1"> Dreamer and GAIS, the two search engines that supplied the query logs, are second-tier portals in Taiwan whose logs are to some extent representative of the Chinese communities. Their URLs are http://www.dreamer.com.tw/ and http://gais.cs.cu.edu.tw/.</Paragraph>
    </Section>
    <Section position="5" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.3 Performance with the Indirect and Transitive Translation Models
</SectionTitle>
      <Paragraph position="0"> To assess the improvement from the transitive translation model, further experiments were conducted. As shown in Table 2, the indirect and transitive translation models outperform the direct translation model. As mentioned above, the anchor-text corpus that contains both TC and SC is small.</Paragraph>
      <Paragraph position="1"> The indirect translation model is therefore helpful for finding translations of some terms with low frequencies in the corpora. For example, the traditional Chinese term "西門子" was found to obtain its translation equivalent "西门子" in simplified Chinese via the intermediate translation "Siemens", which cannot be found using the direct translation alone.</Paragraph>
      <Paragraph position="2"> By examining the top-1 translations obtained with the three different models, it was found that the inclusion rate was 44.2% using the indirect translation model and 49.2% using the transitive translation model. Table 3 illustrates some of the translations extracted using the transitive translation model.</Paragraph>
      <Paragraph position="3"> An additional experiment was also made to compare the approach with the use of a translation lexicon for query translation. The lexicon contained more than 23,948 word/phrase entries in both traditional Chinese and simplified Chinese. The top-1 inclusion rate using lexicon lookup was 12.4%, clearly lower than the 49.2% obtained using the proposed transitive translation model. In addition, the top-1 inclusion rate reaches 55.8% (see the last row of Table 2) if both approaches are combined. With the combined approach, the translation(s) of a query term is taken from the lexicon if such a translation is already in the lexicon; otherwise it is obtained with the transitive translation model.</Paragraph>
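      <Paragraph> The combined approach amounts to a lexicon lookup with a model fallback. The sketch below illustrates only that control flow: the lexicon entry, the stub standing in for the transitive translation model, and all names are hypothetical.

```python
def combined_translate(term, lexicon, model_translate):
    """Return the lexicon translations if the term is listed,
    otherwise fall back to the model-based extraction."""
    if term in lexicon:
        return lexicon[term]
    return model_translate(term)


# Hypothetical lexicon entry and a stub standing in for the model.
lexicon = {"電腦": ["电脑"]}


def model_translate(term):
    """Stub: pretend the model has extracted translations for one term."""
    return {"西門子": ["西门子"]}.get(term, [])


from_lexicon = combined_translate("電腦", lexicon, model_translate)
from_model = combined_translate("西門子", lexicon, model_translate)
```

Because the lexicon is consulted first, the model is only queried for terms the lexicon does not cover.</Paragraph>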
    </Section>
  </Section>
</Paper>