<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1120"> <Title>Cross-Language Information Retrieval Based on Category Matching Between Language Versions of a Web Directory</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Proposed System </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Outline of the System </SectionTitle> <Paragraph position="0"> Our system uses two or more language versions of a Web directory. One version is in the query language (language A in Figure 1); the others are in the target languages to be retrieved (language B in Figure 1).</Paragraph> <Paragraph position="1"> From these language versions, category correspondences between languages are estimated in advance. This preprocessing consists of the following four steps: 1) term extraction from the Web documents in each category, 2) feature term extraction, 3) translation of feature terms, and 4) estimation of category correspondences between languages. Figure 1 illustrates the flow of the preprocessing for the case in which category a in language A corresponds to a category in language B. First, the system extracts terms from the Web documents that belong to category a (1). Second, the system calculates the weights of the extracted terms and takes the higher-weighted terms as the feature term set $f_a$ of category a (2). Third, the system translates the feature term set $f_a$ into language B (3). Lastly, the system compares the translated feature term set of category a with the feature term sets of all categories in language B and estimates which category in language B corresponds to category a (4).</Paragraph> <Paragraph position="2"> These category pairs are used at retrieval time. First, the system estimates the appropriate category for the query in the query language.
Next, the system selects the corresponding category in the target language using the pre-estimated category pairs. Finally, the system retrieves Web documents in the selected corresponding category.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Preprocessing </SectionTitle> <Paragraph position="0"> Each category is characterized by its feature term set, a set of terms that appear to distinguish the category. The feature term set of each category is extracted in the following steps. First, the system extracts terms from the Web documents that belong to a given category. At this point, the system also collects the frequency of each term in each category and normalizes these frequencies per category. Second, the system calculates the weights of the extracted terms using TF·ICF (term frequency · inverse category frequency). Lastly, the top n ranked terms are extracted as the feature term set of the category.</Paragraph> <Paragraph position="1"> The weights of feature terms are calculated by TF·ICF, a variation of TF·IDF (term frequency · inverse document frequency). Instead of using a document as the unit, TF·ICF calculates weights by category. TF·ICF is calculated as follows:</Paragraph> <Paragraph position="3"> where ti is a term appearing in category c, f(ti) is the term frequency of ti, Nc is the total number of terms in category c, ni is the number of categories that contain ti, and N is the number of all categories in the directory.</Paragraph> <Paragraph position="4"> To estimate category correspondences between languages, we compare the feature term sets of the categories, extracted as described in section 3.2.1, and calculate similarities between categories across languages. In order to compare two categories across languages, the feature term set must be translated into the target language.
First, for each feature term, the system looks up the term in a bilingual dictionary and extracts all of its translation candidates. Next, the system checks whether each translation candidate exists in the feature term set of the target category and, if it does, looks up the candidate's weight in the target category. Lastly, the highest-weighted translation candidate in the feature term set of the target category is selected as the translation of the feature term. Translations are thus determined separately for each target category, which resolves translation ambiguity. If no translation candidate for a feature term exists in the feature term set of the target category, that term is ignored in the comparison. In some cases, however, the source language term itself is useful as a feature term in the target language. For example, some English terms (mostly abbreviations) are commonly used in documents written in other languages (e.g. &quot;WWW&quot;, &quot;HTM&quot;, etc.). Therefore, when no translation candidate for a feature term exists in the feature term set of the target category, the system checks whether the feature term itself exists in that feature term set. If it does, the feature term itself is treated as its translation in the target category. As an example, consider translating the English term &quot;system&quot; into Japanese for the Japanese category glossed as &quot;Computers and Internet > Software > Security&quot; (hereafter the Japanese Security category). The English term &quot;system&quot; has several translation candidates in a dictionary.</Paragraph> <Paragraph position="6">
We check each of these translation candidates against the feature term set of the Japanese Security category. The highest-weighted of these candidates in that category is selected as the translation of the English term &quot;system&quot; for this category. If no translation candidate exists in the feature term set of the Japanese Security category, the English term &quot;system&quot; itself is treated as the translation, provided that it appears in that feature term set.</Paragraph> <Paragraph position="7"> Once all the feature terms are translated, the system calculates the similarities between categories across languages. The similarity between the source category a and the target category b is the sum, over the feature terms of category a, of the weight of each feature term in category a multiplied by the weight of its translation in category b. The similarity of category a to category b is calculated as follows: $\mathrm{sim}(a, b) = \sum_{f \in f_a} w(f, a) \cdot w(t, b)$</Paragraph> <Paragraph position="9"> where f is a feature term, $f_a$ is the feature term set of category a, t is the translation of f in category b, and w(f, a) is the weight of f in a. The system calculates the similarity of category a to each category in the target language in this way. Then, the category with the highest similarity in the target language is selected as the correspondent of category a.</Paragraph> <Paragraph position="10"> As an example, we calculate the similarity of the English category &quot;Computers and Internet > Security and Encryption&quot; (hereafter called &quot;Encryption&quot; for short) to the Japanese Security category mentioned above. Suppose that the feature term set of the category &quot;Encryption&quot; contains the feature terms &quot;privacy&quot;, &quot;system&quot;, etc., and that the weights of these terms are 0.007110, 0.006327, ···.
Also suppose that the Japanese translations of these terms (glossed as &quot;privacy&quot;, &quot;system&quot;, etc.) have the weights 0.023999, 0.047117, ···. In this case, the similarity of the category &quot;Encryption&quot; (s1) to the Japanese Security category (s2) is calculated as follows: $\mathrm{sim}(s_1, s_2) = 0.007110 \times 0.023999 + 0.006327 \times 0.047117 + \cdots$</Paragraph> <Paragraph position="12"> Figure 3 illustrates the processing flow of retrieval. When the user submits a query, the following steps are carried out.</Paragraph> <Paragraph position="13"> In our system, a query consists of keywords rather than a sentence. We define the query vector $\vec{q}$ as follows:</Paragraph> <Paragraph position="15"> where qk is the weight of the k-th keyword in the query. We set all qk to 1.</Paragraph> <Paragraph position="16"> First, the system calculates the relevance between the query and each category in the source language and determines the relevant category of the query in the source language (1). The relevance between the query and a category is calculated by multiplying the inner product of the query vector and the feature term set of the category by the angle between these two vectors. The relevance between query q and category c is calculated as follows:</Paragraph> <Paragraph position="18"> where wk is the weight of the k-th keyword in the feature term set of c.</Paragraph> <Paragraph position="19"> If more than one category has a relevance to the query that exceeds a certain threshold, all of them are selected as relevant categories of the query. This is because, for example, documents in the same domain may belong to different categories, or a query concept may belong to multiple domains.</Paragraph> <Paragraph position="20"> Second, the corresponding category in the target language is selected using the category correspondences between languages described in section 3.2.2 (2).
Third, the query is translated into the target language using a dictionary and the feature term set of the corresponding category (3). Finally, the system retrieves documents in the corresponding category (4).</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Category Merging </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Previous Experiments </SectionTitle> <Paragraph position="0"> In our previous paper (Kimura et al., 2003), we conducted category matching experiments using subsets of the English and Japanese versions of Yahoo!.</Paragraph> <Paragraph position="1"> The English subset consists of 559 categories under the category &quot;Computers and Internet&quot;, and the Japanese subset consists of 654 categories under the corresponding Japanese category (&quot;Computers and Internet&quot;). The total size of the English Web pages in each category, after eliminating HTML tags, is 45,905 bytes on average, ranging from 476 to 1,084,676 bytes. The total size of the Japanese Web pages is 22,770 bytes on average, ranging from 467 to 409,576 bytes.</Paragraph> <Paragraph position="2"> In our previous experiments, we could not match categories across languages with adequate accuracy.
There are two possible reasons for this: the amount of Web document text in some categories was too small to yield reliable statistics, and some categories are divided too finely to form distinct domains.</Paragraph> <Paragraph position="3"> To address the former, we eliminated the categories whose Web documents total less than 30KB, but the results did not improve.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Method of Category Merging </SectionTitle> <Paragraph position="0"> Given the results of the above experiments, we need to solve the problem of excessive division of categories in order to match categories accurately between languages.</Paragraph> <Paragraph position="1"> Two factors may contribute to the problem. First, some categories are too close in topic, which may lower accuracy. Second, some categories contain too little text to yield statistically significant values for feature term extraction. We therefore expect that accuracy will improve if child categories are merged at some level of the category hierarchy, which both combines categories that are similar in topic and increases the amount of text in each category.</Paragraph> <Paragraph position="2"> Accordingly, we address the problem by merging child categories into their parent category at some level of the directory hierarchy. Because child categories are specializations of their parent category, we can assume that they have similar topics.
Moreover, even if two categories have no direct link to each other, we can assume that categories sharing the same parent category are likely to have similar topics.</Paragraph> <Paragraph position="3"> However, further investigation is needed to determine the level at which categories should be merged.</Paragraph> </Section> </Section> </Paper>
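<!--
The preprocessing of Section 3.2 (term weighting, feature term extraction, translation of feature terms, and category matching) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that TF·ICF takes the form (f(ti) / Nc) * log(N / ni), which is one reading consistent with the symbol definitions in Section 3.2. The function names, the dict-based data layout, and the bilingual dictionary format (a mapping from a source term to a list of candidate translations) are all hypothetical.

```python
import math
from collections import Counter


def tf_icf_weights(category_terms):
    """Weight every term in every category.

    category_terms maps a category name to the list of terms extracted
    from its documents. ASSUMPTION: weight = (f(ti) / Nc) * log(N / ni).
    """
    counts = {cat: Counter(terms) for cat, terms in category_terms.items()}
    n_categories = len(counts)  # N: number of all categories
    # ni: number of categories that contain each term
    category_freq = Counter(term for c in counts.values() for term in c)
    weights = {}
    for cat, c in counts.items():
        total = sum(c.values())  # Nc: total number of terms in the category
        weights[cat] = {
            term: (freq / total) * math.log(n_categories / category_freq[term])
            for term, freq in c.items()
        }
    return weights


def feature_term_set(term_weights, n):
    """Keep the n highest-weighted terms of one category."""
    top = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)


def translate_term(term, dictionary, target_features):
    """Choose the highest-weighted dictionary candidate that occurs in the
    target feature term set. If none occurs, fall back to the source term
    itself when it occurs there; otherwise return None (term is ignored)."""
    candidates = [t for t in dictionary.get(term, []) if t in target_features]
    if candidates:
        return max(candidates, key=lambda t: target_features[t])
    return term if term in target_features else None


def similarity(source_features, target_features, dictionary):
    """Sum of w(f, a) * w(t, b) over the source feature terms f that have
    a translation t in the target feature term set."""
    total = 0.0
    for term, weight in source_features.items():
        translated = translate_term(term, dictionary, target_features)
        if translated is not None:
            total += weight * target_features[translated]
    return total


def best_match(source_features, target_feature_sets, dictionary):
    """Return the name of the target category with the highest similarity."""
    return max(
        target_feature_sets,
        key=lambda cat: similarity(
            source_features, target_feature_sets[cat], dictionary
        ),
    )
```

Under these assumptions, calling similarity with the source weights 0.007110 and 0.006327 and the target weights 0.023999 and 0.047117 from the worked example reproduces the first two products of that sum. Resolving each translation against the target category's own feature term set mirrors the way the text resolves translation ambiguity per category.
-->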