<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1036"> <Title>Proper Name Translation in Cross-Language Information Retrieval</Title> <Section position="1" start_page="0" end_page="232" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Recently, the language barrier has become the major problem for people who search, retrieve, and understand WWW documents in different languages. This paper deals with the query translation issue in cross-language information retrieval, proper names in particular. Models for name identification, name translation and name searching are presented. The recall rates and the precision rates for the identification of Chinese organization names, person names and location names under MET data are (76.67%, 79.33%), (87.33%, 82.33%) and (77.00%, 82.00%), respectively. In name translation, only 0.79% and 1.11% of candidates for English person names and location names, respectively, have to be proposed. The name searching facility is implemented on an MT server for information retrieval on the WWW. Under this system, users can issue queries and read documents in the language they are familiar with.</Paragraph> <Paragraph position="1"> Introduction The World Wide Web (WWW) is the most useful and powerful information dissemination system on the Internet. Because the WWW is multilingual, the language barrier becomes the major problem for people who search, retrieve, and understand WWW documents in different languages. This decreases the dissemination power of the WWW to some extent. Research on cross-language information retrieval, abbreviated as CLIR (Oard and Dorr, 1996; Oard, 1997), aims to tackle the language barrier. There are several important issues in CLIR: (1) Queries and documents are in different languages, so translation is required.
(2) Words in a query may be ambiguous, thus disambiguation is required.</Paragraph> <Paragraph position="2"> (3) Queries are usually short, thus expansion is required.</Paragraph> <Paragraph position="3"> (4) Word boundaries in queries of some languages (Chen and Lee, 1996) are not clear, thus segmentation is required.</Paragraph> <Paragraph position="4"> (5) A document may be in more than one language, thus language identification is required.</Paragraph> <Paragraph position="5"> This paper focuses on the query translation issue, proper names in particular.</Paragraph> <Paragraph position="6"> The percentage of user queries containing proper names is very high. Thompson and Dozier (1997) reported an experiment conducted over periods of several days in 1995. It showed that 67.8%, 83.4%, and 38.8% of queries to the Wall Street Journal, the Los Angeles Times, and the Washington Post, respectively, involved name searching. In CLIR, three tasks are needed: name identification, name translation, and name searching. Because proper names are usually unknown words, they are hard to find in monolingual dictionaries, not to mention bilingual dictionaries. Coverage is one of the major problems in dictionary-based approaches (Ballesteros and Croft, 1996; Davis, 1997; Hull and Grefenstette, 1996). Corpus-based approaches (Brown, 1996; Oard, 1996; Sheridan and Ballerini, 1996) build thesauri from large-scale corpora. They provide narrow but specific coverage of the language, and are complementary to the broad and shallow coverage of dictionaries. However, domain shifts and term alignment accuracy are major limitations of corpus-based approaches. Besides, proper names are infrequent words relative to other content words in corpora. In information retrieval, both the most frequent and the less frequent words are regarded as unimportant and may be neglected.</Paragraph> <Paragraph position="7"> This paper proposes methods to extract and classify proper names from Chinese queries (Section 1).
Then, Chinese proper names are translated into English proper names (Section 2). Finally, the translated queries are sent to an MT server for information retrieval on the WWW (Bian and Chen, 1997). The retrieved English home pages are presented in Chinese and/or English.</Paragraph> </Section> </Paper>