Cross-Language Information Retrieval Based on
Category Matching Between Language Versions of a Web Directory
Fuminori Kimura
Graduate School of
Information Science,
Nara Institute of
Science and Technology
8916-5 Takayama,
Ikoma, Nara, Japan
Akira Maeda
Department of
Computer Science,
Ritsumeikan University
1-1-1 Noji-Higashi,
Kusatsu, Shiga, Japan
Masatoshi Yoshikawa
Information Technology
Center, Nagoya University
Furo-cho, Chigusa-ku,
Nagoya, Aichi, Japan
Shunsuke Uemura
Graduate School of
Information Science,
Nara Institute of
Science and Technology
8916-5 Takayama,
Ikoma, Nara, Japan
Abstract
Since the Web consists of documents in
various domains or genres, the method
for Cross-Language Information Retrieval
(CLIR) of Web documents should be in-
dependent of a particular domain. In this
paper, we propose a CLIR method which
employs a Web directory provided in mul-
tiple language versions (such as Yahoo!).
In the proposed method, feature terms are
first extracted from Web documents for
each category in the source and the tar-
get languages. Then, one or more corre-
sponding categories in another language
are determined beforehand by comparing
similarities between categories across lan-
guages. Using these category pairs, we in-
tend to resolve ambiguities of simple dic-
tionary translation by narrowing the cat-
egories to be retrieved in the target lan-
guage.
1 Introduction
With the popularity of the Internet, more and more
languages are becoming to be used for Web docu-
ments, and it is now much easier to access docu-
ments written in foreign languages. However, exist-
ing Web search engines only support the retrieval of
documents which are written in the same language
as the query, so the monolingual users are not able to
retrieve documents written in non-native languages
efficiently. Also, there might be cases, depending
on the user’s demand, where information written in
a language other than the user’s native language is
rich. Needs for retrieving such information must
be large. In order to satisfy such needs on a usual
monolingual retrieval system, the user him-/herself
has to manually translate the query by using a dic-
tionary, etc. This process not only imposes a burden
to the user but also might choose incorrect transla-
tions for the query, especially for languages that are
unfamiliar to the user.
To fulfill such needs, researches on Cross-
Language Information Retrieval (CLIR), a tech-
nique to retrieve documents written in a certain lan-
guage using a query written in another language,
have been active in recent years. A variety of meth-
ods, including employing corpus statistics for the
translation of terms and the disambiguation of trans-
lated terms, are studied and a certain results has
been obtained. However, corpus-based disambigua-
tion methods are heavily affected by the domain of
the training corpus, so the retrieval effectiveness for
other domains might drop significantly. Besides,
since the Web consists of documents in various do-
mains or genres, the method for CLIR of Web docu-
ments should be independent of a particular domain.
In this paper, we propose a CLIR method which
employs Web directories provided in multiple lan-
guage versions (such as Yahoo!). Our system uses
two or more language versions of a Web directory.
One version is the query language, and others are
the target languages. From these language versions,
category correspondences between languages are es-
timated in advance. First, feature terms are extracted
from Web documents for each category in the source
and the target languages. Then, one or more cor-
responding categories in another language are de-
termined beforehand by comparing similarities be-
tween categories across languages. Using these cat-
egory pairs, we intend to resolve ambiguities of
simple dictionary translation by narrowing the cat-
egories to be retrieved in the target language.
2 Related Work
Approaches to CLIR can be classified into three cat-
egories; document translation, query translation, and
the use of inter-lingual representation. The approach
based on translation of target documents has the ad-
vantage of utilizing existing machine translation sys-
tems, in which more content information can be used
for disambiguation. Thus, in general, it achieves
a better retrieval effectiveness than those based on
query translation(Sakai, 2000). However, since it is
impractical to translate a huge document collection
beforehand and it is difficult to extend this method to
new languages, this approach is not suitable for mul-
tilingual, large-scale, and frequently-updated collec-
tion of the Web. The second approach transfers both
documents and queries into an inter-lingual repre-
sentation, such as bilingual thesaurus classes or a
language-independent vector space. The latter ap-
proach requires a training phase using a bilingual
(parallel or comparable) corpus as a training data.
The major problem in the approach based on the
translation and disambiguation of queries is that the
queries submitted from ordinary users of Web search
engines tend to be very short (approximately two
words on average (Jansen et al., 2000)) and usually
consist of just an enumeration of keywords (i.e. no
context). However, this approach has an advantage
that the translated queries can simply be fed into ex-
isting monolingual search engines. In this approach,
a source language query is first translated into target
language using a bilingual dictionary, and translated
query is disambiguated. Our method falls into this
category.
It is pointed out that corpus-based disambiguation
methods are heavily affected by the difference in do-
main between query and corpus. Hull suggests that
the difference between query and corpus may cause
bad influence on retrieval effectiveness in the meth-
ods that use parallel or comparable corpora (Hull,
1997). Lin et al. conducted comparative experi-
ments among three monolingual corpora that have
different domains and sizes, and has concluded that
large-scale and domain-consistent corpus is needed
for obtaining useful co-occurrence data (Lin et al.,
1999).
On the Web retrieval, which is the target of our re-
search, the system has to cope with queries in many
different kinds of topics. However, it is impracti-
cal to prepare corpora that cover any possible do-
mains. In our previous paper(Kimura et al., 2003),
we proposed a CLIR method which uses documents
in a Web directory that has several language versions
(such as Yahoo!), instead of using existing corpora,
in order to improve the retrieval effectiveness. In
this paper, we propose an extension of our method
which takes account of the hierarchical structure of
Web directories. Dumais et al.(Dumais and Chen,
2000) suggests that the precision of Web document
classification could be improved to a certain extent
by limiting the target categories to compare by us-
ing the hierarchical structure of a Web directory. In
this paper, we try to improve our proposed method
by incorporating the hierarchical structure of a Web
directory for merging categories.
3 Proposed System
3.1 Outline of the System
Our system uses two or more language versions of
a Web directory. One version is the query language
(language A in Figure 1), others are the target lan-
guages to be retrieved (language B in Figure 1).
From these language versions, category correspon-
dences between languages are estimated in advance.
The preprocessing consists of the following four
steps: 1) term extraction from Web documents in
each category, 2) feature term extraction, 3) transla-
tion of feature terms, and 4) estimation of category
correspondences between different languages. Fig-
ure 1 illustrates the flow of the preprocessing. This
example shows a case that category a in language A
corresponds to a category in language B. First, the
system extracts terms from Web documents which
belong to category a (1). Secondly, the system cal-
culates the weights of the extracted terms. Then
higher-weighted terms are extracted as the feature
term set fa of category a (2). Thirdly, the system
translates the feature term set fa into language B
(3). Lastly, the system compares the translated fea-
ture term set of category a with feature term sets of
all categories in language B, and estimates the cor-
responding category of category a from language B
(4).
These category pairs are used on retrieval. First,
the system estimates appropriate category for the
query in the query language. Next, the system se-
lects the corresponding category in the target lan-
guage using the pre-estimated category pairs. Fi-
nally, the system retrieves Web documents in the se-
lected corresponding category.
language A
category a
feature term
    set f
language B
category a
feature term
    set f ...
compare among languages
(1)
(2)
(3)
(4)
feature
term DB
a b
Figure 1: Preprocessing.
3.2 Preprocessing
3.2.1 Feature Term Extraction
The feature of each category is represented by
its feature term set. Feature term set is a set of
terms that seem to distinguish the category. The
feature term set of each category is extracted in the
following steps: First, the system extracts terms
from Web documents that belong to a given cate-
gory. In this time, system also collect term fre-
quency of each word in each category and normal-
ize these frequency for each category. Second, the
system calculates the weights of the extracted terms
using TF¢ICF (term frequency ¢ inverse category fre-
quency). Lastly, top n ranked terms are extracted as
the feature term set of the category.
Weights of feature terms are calculated by
TF¢ICF. TF¢ICF is a variation of TF¢IDF (term fre-
quency ¢ inverse document frequency). Instead of
using a document as the unit, TF¢ICF calculates
weights by category. TF¢ICF is calculated as fol-
lows:
tf ¢icf(ti;c) = f(ti)N
c
¢log Nn
i
+1
where ti is the term appearing in the category c,
f(ti) is the term frequency of term ti, Nc is the total
number of terms in the category c, ni is number of
the categories that contain the term ti|and N is the
number of all categories in the directory.
3.2.2 Category Matching Between Languages
For estimating category correspondences between
languages, we compare each feature term set of a
category which is extracted in section 3.2.1, and cal-
culates similarities between categories across lan-
guages.
In order to compare two categories between lan-
guages, feature term set must be translated into the
target language. First, for each feature term, the
system looks up the term in a bilingual dictionary
and extracts all translation candidates for the feature
term. Next, the system checks whether each trans-
lation candidate exists in the feature term set of the
target category. If the translation candidate exists,
the system checks the candidate’s weight in the tar-
get category. Lastly, the highest-weighted transla-
tion candidate in the feature term set of the target
category is selected as the translation of the feature
term. Thus, translation candidates are determined
for each category, and translation ambiguity is re-
solved.
If no translation candidate for a feature term ex-
ists in the feature term set of the target category, that
term is ignored in the comparison. However, there
are some cases that the source language term itself
is useful as a feature term in the target language. For
example, some English terms (mostly abbreviations)
are commonly used in documents written in other
languages (e.g. “WWW”, “HTM”, etc.). Therefore,
in case that no translation candidate for a feature
term exists in the feature term set of the target cat-
egory, the feature term itself is checked whether it
exists in the feature term set of the target category.
If it exists, the feature term itself is treated as the
translation of the feature term in the target category.
As an example, we consider that an English term
“system” is translated into Japanese for the cate-
gory “������q�������>�����
�>������(Computers and Internet >Soft-
ware >Security)” (hereafter called “������”
for short). The English term “system” has the fol-
lowing translation candidates in a dictionary; “��
(universe/space)”|“MO(method)”|“
�	�(orga-
nization)”|“+�(organ)”|“����(system)”|
etc. We check each of these translation candidates in
the feature term set of the category “������.”
Then the highest-weighted term of these translation
candidates in the category “������” is deter-
mined as the translation of the English term “sys-
tem” in this category. If no translation candidate ex-
ists in the feature term set of the category “����
��,” the English term “system” itself is treated as
the translation.
Once all the feature terms are translated, the sys-
tem calculates the similarities between categories
across languages. The similarity between the source
category a and the target category b is calculated as
the total of multiplying the weights of each feature
term in the category a by the weight of its transla-
tion in the category b. The similarity of the category
a for the category b is calculated as follows:
sim(a;b) = X
f2fa
w(f;a)¢w(t;b)
where f is a feature term, fa is the feature term set
of category a, t is the translation of f in the category
feature term set
 of category a
feature term set
 of category bfeature term f
translation
candidates
t1
t2
t3
..
.
dictionary
compare
translation term
in category b
Figure 2: Feature term translation.
b, and w(f;a) is the weight of f in a.
The system calculates the similarities of category
a for each category in the target language using the
above-mentioned method. Then, a category with the
highest similarity in the target language is selected
as the correspondent of category a.
As an example, we consider an example of
calculating the similarity of an English category
“Computers and Internet >Security and Encryption”
(hereafter called “Encryption” for short) for the cat-
egory “������” which is mentioned above.
Suppose that the feature term set of the category
“Encryption” has the following feature terms; “pri-
vacy”, “system”, etc., and the weights of these terms
are 0.007110, 0.006327, ¢¢¢. Also suppose that the
Japanese translations of these terms are “����
��(privacy)”, “����(system)”, etc., and the
weights of these terms are 0.023999, 0.047117, ¢¢¢.
In this case, the similarity of the category “Encryp-
tion” (s1) for the category “������” (s2) is
calculated as follows:
sim(s1;s2) = 0:007110£0:023999
+0:006327£0:047117
+¢¢¢
3.2.3 Retrieval
Figure 3 illustrates the processing flow of a re-
trieval. When the user submits a query, the follow-
ing steps are processed.
In our system, a query consists of some keywords,
not of a sentence. We define the query vector ~q as
follows:
~q = (q1;q2;:::;qn)
where qk is the weight of the k-th keyword in the
query. We define the values of all qk are 1.
First, the system calculates the relevance between
the query and each category in the source language,
and determines the relevant category of the query in
the source language (1). The relevance between the
query and each category is calculated by multiply-
ing the inner product between query terms and the
feature term set of the target category by the angle
of these two vectors. The relevance between query
q and category c is calculated as follows:
rel(q;c) = ~q ¢~c¢ ~q ¢~cj~qj¢j~cj
where ~c is a vector of category c defined as follows:
~c = (w1;w2;:::;wn)
where wk is the weight of the k-th keyword in the
feature term set of c.
If there is more than one category whose rele-
vance for the query exceeds a certain threshold, all
of them are selected as the relevant categories of the
query. It is because there might be some cases that,
for example, documents in the same domain belong
to different categories, or a query concept belongs to
multiple domains.
Second, the corresponding category in the tar-
get language is selected by using category corre-
spondences between languages mentioned in section
3.2.2 (2). Third, the query is translated into the tar-
get language by using a dictionary and the feature
term set of the corresponding category (3). Finally,
the system retrieves documents in the corresponding
category (4).
4 Category Merging
4.1 Previous Experiments
In our previous paper(Kimura et al., 2003), we con-
ducted experiments of category matching using the
subsets of English and Japanese versions of Yahoo!.
language A language B
(4)
feature
term DB
query
(language A)
query
(language B)
(1)
(2)
(3)
(4)
Figure 3: Processing in retrieval.
The English subset is 559 categories under the cat-
egory “Computers and Internet” and the Japanese
subset is 654 categories under the corresponding
category “������q�������(Com-
puters and Internet).” Total size of English web
pages in each category after eliminating HTML tags
are 45,905 bytes on average, ranging from 476 to
1,084,676 bytes. Total size of Japanese web pages
are 22,770 bytes on average, ranging from 467 to
409,576 bytes.
In our previous experiments, we could not match
categories across languages with adequate accuracy.
It may have been caused by the following reasons;
one possible reason is that the size of Web docu-
ments was not enough for statistics in some cate-
gories, and another is that some categories are ex-
cessively divided as a distinct domain.
For the former observation, we eliminated the cat-
egories whose total bytes of Web documents are less
than 30KB, but the results were not improved.
4.2 Method of Category Merging
Considering the result of the above experiments, we
need to solve the problem of excessive division of
categories in order to accurately match categories
between languages.
The problem might be caused by the following
reasons; one possible reason is that there are some
categories which are too close in topic, and it might
cause poor accuracy. Another possible reason is that
some categories have insufficient amount of text in
order to obtain statistically significant values for fea-
ture term extraction. Considering the above observa-
tions, we might expect that the accuracy will be im-
proved by merging child categories at some level in
the category hierarchy in order to merge some cate-
gories similar in topic and to increase the amount of
text in a category.
Accordingly, we solve the problem by merging
child categories into the parent category at some
level using the directory hierarchy. As child cate-
gories are specialized ones of the parent category,
we can assume that these categories have similar
topic. Besides, even if two categories have no direct
link from each other, we can assume that categories
that have same parent category might also have sim-
ilar topic.
However, we still need further investigation on at
which level categories should be merged.
Figure 4: Category merging.
5 Experiments
We are conducting experiments of the proposed
method to detect relevance category of a query. In
this experiment, we used the same subsets men-
tioned in section 4.1. We merged the categories three
levels below the category “Computers and Internet”
into the parent. The number of categories after cate-
gory merging is 342 in English and 265 in Japanese.
At first, we have done the experiment using the
following formula that uses only inner product,
before using the calculation mentioned in section
3.2.3.
relinner(q;c) = ~q ¢~c
In this experiment, the query has three
terms: “encryption”(=q1), “security”(=q2), and
“system”(=q3).
Table 1 is the list of top 10 relevant categories in
first experiment. Almost all the categories in the Ta-
ble 1 are relevant to the query. Thus, the relevance
calculation method by only inner product is regarded
as an effective method. However, this method has
the following problem. The category that has few
query terms might be given high relevance when the
category has the only one query term whose weight
in the category is extremely high.
In order to reduce this effect, we propose the im-
proved method mentioned in section 3.2.3. The
method is revised to take account of the angle be-
tween ~q and ~c. Ultimately, the most relevant cate-
gory has the vector whose length is long and whose
factors are flat. The length is considered by inner
product, on the other hand, flatness is considered by
the angle between ~q and ~c.
Table 2 is the list of top 10 relevant categories in
the second experiment using revised method. Al-
though noticeable improvement does not appear, the
relevance of the categories which matches few query
terms are ranked lower than the first experiment.
6 Conclusions
In this paper, we proposed a method using a Web
directory for CLIR. The proposed method is inde-
pendent of a particular domain because it uses docu-
ments in a Web directory as the corpus. Our method
is particularly effective for the case that the docu-
ment collection covers wide range of domains such
as the Web. Besides, our method does not require
expensive linguistic resources except for a dictio-
nary. Therefore, our method can easily be extended
to other languages as long as the language versions
of a Web directory exist and the dictionary can be
obtained.
Future work includes improving the category
matching method and the evaluation of retrieval ef-
fectiveness.

References
Susan Dumais and Hao Chen. 2000. Hierarchical clas-
sification of Web content. Proceedings of the 23rd
ACM International Conference on Research and De-
velopment in Information Retrieval(SIGIR2000).
David A. Hull. 1997. Using structured queries for dis-
ambiguation in cross-language information retrieval.
Electronic Working Notes of the AAAI Symposium on
Cross-Language Text and Speech Retrieval.
Bernard J. Jansen, Amanda Spink, and Tefko Saracevic.
2000. Real life, real user queries on the Web. Infor-
mation Processing & Management, 36(2).
Fuminori Kimura, Akira Maeda, Masatoshi Yoshikawa,
and Shunsuke Uemura. 2003. Cross-Language Infor-
mation Retrieval using Web Directory Structure. The
14th Data Engineering Workshop, (in Japanese).
Chuan-Jie Lin, Wen-Cheng Lin, Guo-Wei Bian, and
Hsin-Hsi Chen. 1999. Description of the NTU
Japanese-English cross-lingual information retrieval
system used for NTCIR workshop. First NTCIR Work-
shop on Research in Japanese Text Retrieval and Term
Recognition.
Tetsuya Sakai. 2000. MT-based Japanese-English cross-
language IR experiments using the TREC test collec-
tions. Proceedings of The Fifth International Work-
shop on Information Retrieval with Asian Languages
(IRAL2000).
