<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1206">
<Title>Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval</Title>
<Section position="2" start_page="0" end_page="0" type="ackno">
<SectionTitle> 5 The Role of Lexical Databases </SectionTitle>
<Paragraph position="0"> Because of the irregular orthography of CJK languages, lexeme-based procedures such as orthographic disambiguation cannot be based on probabilistic methods (e.g. bigramming) alone.</Paragraph>
<Paragraph position="1"> Many attempts have been made to rely on such methods, for example Brill (2001) and Goto et al. (2001), with some claiming performance equivalent to lexicon-based methods, while Kwok (1997) reports good results with only a small lexicon and a simple segmentor.</Paragraph>
<Paragraph position="2"> These methods may be satisfactory for pure IR (relevant document retrieval), but for orthographic disambiguation and C2C conversion, Emerson (2000) and others have shown that a robust morphological analyzer capable of processing lexemes, rather than bigrams or n-grams, must be supported by a large-scale computational lexicon (even 100,000 entries is much too small).</Paragraph>
<Paragraph position="3"> The CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography, is engaged in an ongoing research and development effort to compile comprehensive CJK lexical databases (currently about 5.5 million entries), with special emphasis on orthographic disambiguation and proper nouns. Listed below are the principal components useful for intelligent IR tools and orthographic disambiguation.</Paragraph>
<Paragraph position="4">
1. Chinese to Chinese conversion. In 1996, CJKI launched a project to investigate C2C conversion issues in depth, and to build comprehensive mapping tables (now at 1.3 million SC and 1.2 million TC items) whose goal is to achieve near 100% conversion accuracy. These include:
   a. SC-to/from-TC code-level mapping tables
   b. SC-to/from-TC orthographic and lexemic mapping tables for general vocabulary
   c. SC-to/from-TC orthographic mapping tables for proper nouns
   d. Comprehensive SC-to/from-TC orthographic/lexemic mapping tables for technical terminology, especially IT terms
2. TC orthographic normalization tables
   a. TC normalization mapping tables
   b. STC-to/from-TTC character mapping tables
3. Japanese orthographic variant databases
   a. A comprehensive database of Japanese orthographic variants
   b. A database of semantically classified homophone groups
   c. Semantically classified synonym groups for synonym expansion (Japanese thesaurus)
   d. An English-Japanese lexicon for CLIR
   e. Rules for identifying unlisted variants
CJK IR tools have become increasingly important to information retrieval in particular and to information technology in general. As we have seen, because of the irregular orthography of the CJK writing systems, intelligent information retrieval requires not only sophisticated tools such as morphological analyzers, but also lexical databases fine-tuned to the needs of orthographic disambiguation.</Paragraph>
<Paragraph position="5"> Few if any CJK IR tools perform orthographic disambiguation. For truly &quot;intelligent&quot; IR to become a reality, not only must lexicon-based disambiguation be supported, but such emerging technologies as CLIR, synonym expansion and cross-homophone searching should also be implemented.</Paragraph>
<Paragraph position="6"> We are currently engaged in further developing the lexical resources required for building intelligent CJK information retrieval tools and for supporting accurate segmentation technology.</Paragraph>
</Section>
</Paper>
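<!--
The layering of SC-to-TC mapping tables described in Section 5 (code-level, orthographic, lexemic) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three tiny tables, the function names, and the greedy longest-match lookup are hypothetical and are not CJKI's data or method. The sketch only shows why a character-by-character (code-level) table alone can pick the wrong traditional form, and how word-level tables take precedence.

```python
# Hypothetical miniature tables; real mapping tables hold over a million items.
# Code-level: one simplified character maps to one default traditional character.
CODE_LEVEL = {"头": "頭", "发": "發", "软": "軟"}
# Orthographic: same word, but the correct traditional spelling depends on the word.
ORTHOGRAPHIC = {"头发": "頭髮"}  # "hair": 发 must become 髮 here, not 發
# Lexemic: the traditional-side usage is a different word altogether.
LEXEMIC = {"软件": "軟體"}  # "software"


def code_level_only(text: str) -> str:
    """Convert character by character, ignoring word context."""
    return "".join(CODE_LEVEL.get(ch, ch) for ch in text)


def sc_to_tc(text: str) -> str:
    """Prefer word-level entries (greedy longest match, an assumption of this
    sketch), falling back to the code-level table for unmatched characters."""
    lexicon = {**ORTHOGRAPHIC, **LEXEMIC}
    max_len = max(len(key) for key in lexicon)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in lexicon:
                out.append(lexicon[chunk])
                i += n
                break
        else:
            # No word-level entry starts here: use the character fallback.
            out.append(CODE_LEVEL.get(text[i], text[i]))
            i += 1
    return "".join(out)
```

Calling code_level_only on 头发 yields 頭發, which is the wrong character for "hair", while sc_to_tc returns 頭髮; for 发展 ("development") both functions agree on 發展. A production system would need real segmentation instead of greedy matching, which is where the large-scale lexicon discussed above comes in.
-->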