File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/w02-1206_abstr.xml
Size: 17,917 bytes
Last Modified: 2025-10-06 13:42:34
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1206"> <Title>Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> The orthographical complexity of Chinese, Japanese and Korean (CJK) poses a special challenge to the developers of computational linguistic tools, especially in the area of intelligent information retrieval.These difficulties are exacerbated by the lack of a standardized orthography in these languages, especially the highly irregular Japanese orthography. This paper focuses on the typology of CJK orthographic variation, provides a brief analysis of the linguistic issues, and discusses why lexical databases should play a central role in the CJK information retrieval. To achieve truly &quot;intelligent&quot; retrieval many challenges must be overcome. Some of the major issues include: 1. The lack of a standard orthography. To process the extremely large number of orthographic variants (especially in Japanese) and character forms requires support for advanced IR technologies such as crossorthographic searching (Halpern 2000).</Paragraph> <Paragraph position="1"> 2. The accurate conversion between Simplified Chinese (SC) and Traditional Chinese (TC), a deceptively simple but in fact extremely difficult computational task (Halpern and Kerman 1999).</Paragraph> <Paragraph position="2"> 3. The morphological complexity of Japanese and Korean poses a formidable challenge to the development of an accurate morphological analyzer. This performs such operations as canonicalization, stemming (removing inflectional endings) and conflation (reducing morphological variants to a single form) on the morphemic level.</Paragraph> <Paragraph position="3"> 4. The difficulty of performing accurate word segmentation, especially in Chinese and Japanese which are written without interword spacing. This involves identifying word boundaries by breaking a text stream into meaningful semantic units for dictionary lookup and indexing purposes. Good progress in this area is reported in Emerson (2000) and Yu et al. (2000).</Paragraph> <Paragraph position="4"> 5. Miscellaneous retrieval technologies such as lexeme-based retrieval (e.g. 'take off' + 'jacket' from 'took off his jacket'), identifying syntactic phrases (such asZbfromZ `h), synonym expansion, and cross-language information retrieval (CLIR) (Goto et al. 2001).</Paragraph> <Paragraph position="5"> 6. Miscellaneous technical requirements such as transcoding between multiple character sets and encodings, support for Unicode, and input method editors (IME). Most of these issues have been satisfactorily resolved, as reported in Lunde (1999).</Paragraph> <Paragraph position="6"> 7. Proper nouns pose special difficulties for IR tools, as they are extremely numerous, difficult to detect without a lexicon, and have an unstable orthography.</Paragraph> <Paragraph position="7"> 8. Automatic recognition of terms and their variants, a complex topic beyond the scope of this paper. It is described in detail for European languages in Jacquemin (2001), and we are currently investigating it for Chinese and Japanese.</Paragraph> <Paragraph position="8"> Each of the above is a major issue that deserves a paper in its own right. Here, the focus is on orthographic disambiguation, which refers to the detection, normalization and conversion of CJK orthographic variants.Thispaper summarizes the typology of CJK orthographic variation, briefly analyzes the linguistic issues, and discusses why lexical databases should play a central role in the disambiguation process.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2Orthographic Variation in Chinese 2.1 One Language, Two Scripts </SectionTitle> <Paragraph position="0"> As a result of the postwar language reforms in the PRC, thousands of character forms underwent drastic simplifications (Zongbiao 1986). Chinese written in these simplified forms is called Simplified Chinese (SC). Taiwan, Hong Kong, and most overseas Chinese continue to use the old, complex forms, referred toasTraditional Chinese (TC).</Paragraph> <Paragraph position="1"> The complexity of the Chinesewritingsystem is well known. Some factors contributing to this are the large number of characters in common use, their complex forms, the major differences between TC and SC along various dimensions, the presence of numerous orthographic variants in TC, and others. The numerous variants and the difficulty of converting between SC and TC are of special importance to Chinese IR applications.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Chinese-to-Chinese Conversion </SectionTitle> <Paragraph position="0"> The process of automatically converting SC to/from TC, referred to as C2C conversion,isfull of complexities and pitfalls. A detailed description of the linguistic issues can be found in Halpern and Kerman (1999), while technical issues related to encoding and character sets are described in Lunde (1999). The conversion can be implemented on three levels in increasing order of sophistication, briefly described below.</Paragraph> <Paragraph position="1"> unreliable, way to perform C2C conversion is on a codepoint-to-codepoint basis by looking the source up in a mapping table, such as the one shown below. This isreferred toascode conversion or transcoding. Because of the numerous one-to-many ambiguities (which occur in both the SC-to-TC and the TC-to-SC directions), the rate of conversion failure is unacceptably high.</Paragraph> <Paragraph position="2"> level of sophistication in C2C conversion is referred to as orthographic conversion, because the items being converted are orthographic units, rather than codepoints in a character set. That is, they are meaningful linguistic units, especially multi-character lexemes. While code conversion is ambiguous, orthographic conversion gives better results because the orthographic mapping tables enable conversion on the word level.</Paragraph> <Paragraph position="3"> As can be seen, the ambiguities inherent in code conversion are resolved by using an orthographic mapping table, which avoids false conversions such as shown in the Incorrect column. Because of segmentation ambiguities, such conversion must be done with the aid of a morphological analyzer that can break the text stream into meaningful units (Emerson 2000).</Paragraph> <Paragraph position="4"> sophisticated, and far more challenging, approach to C2C conversion is called lexemic conversion, which maps SC and TC lexemes that are semantically, not orthographically, equivalent. For example, SC G1667G26f5 (xinxi )'information' is converted to the semantically equivalent TC G534dG5090 (zi xun). This is similar to the difference between lorry in British English and truck in American English.</Paragraph> <Paragraph position="5"> There are numerous lexemic differences between SC and TC, especially in technical terms and proper nouns, as demonstrated by Tsou (2000). For example, there are more than 10 variants for 'Osama bin Laden.' To complicate matters, the correct TC is sometimes locale-dependent.</Paragraph> <Paragraph position="6"> Lexemic conversion is the most difficult aspect of C2C conversion and can only be done with the help of mapping tables. Table 3 illustrates various patterns of cross-locale lexemic variation.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Traditional Chinese Variants </SectionTitle> <Paragraph position="0"> Traditional Chinese does not have a stable orthography. There are numerous TC variant forms, and much confusion prevails. To process TC (and to some extent SC) it is necessary to disambiguate these variants using mapping tables (Halpern 2001).</Paragraph> <Paragraph position="1"> 2.3.1 TC Variants in Taiwan and Hong Kong Traditional Chinese dictionaries often disagree on the choice of the standard TC form. TC variants can be classified into various types, as illustrated in Table 4.</Paragraph> <Paragraph position="2"> Var. 1 Var. 2 English Comment There are variousreasonsfortheexistence of TC variants, such as some TCformsarenot being available in the Big Five character set, the occasional use of SC forms, and others. 2.3.2 Mainland vs. Taiwanese Variants To a limited extent, the TC forms are used in the PRC for some classical literature, newspapers for overseas Chinese, etc., based on a standard that maps the SC forms (GB 2312-80) to their corresponding TC forms (GB/T 12345-90).</Paragraph> <Paragraph position="3"> However, these mappings do not necessarily agree with those widely used inTaiwan.We will refer to the former as &quot;Simplified Traditional Chinese&quot; (STC), and to the latter as &quot;Traditional</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 One Language, Four Scripts </SectionTitle> <Paragraph position="0"> The Japanese orthography is highly irregular.</Paragraph> <Paragraph position="1"> Because of the large number of orthographic variants and easily confused homophones, the Japanese writing system is significantly more complex than any other major language, including Chinese. A major factor is the complex interaction of the four scripts used to write Japanese, resulting in countless words that can be written in a variety of often unpredictable ways (Halpern 1990, 2000).</Paragraph> <Paragraph position="2"> Table 6 shows the orthographic variants of{ Mtoriatsukai 'handling', illustrating a variety of variation patterns.</Paragraph> <Paragraph position="3"> An example of how difficult Japanese IR can be is the proverbial &quot;A hen that lays golden eggs.&quot; The &quot;standard&quot; orthography would bew[2 (Kin no tamago wo umu niwatori). In reality, tamago 'egg' has four variants ([,,h],), niwatori 'chicken' three (2,tq,) and umu 'to lay' two (, \), which expands to 24 permutations likew[ \, w2etc. As can be easily verified by searching the web, these variants frequently occur in webpages. Clearly, the user has no hope of finding them unless the application supports orthographic disambiguation.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Okurigana Variants </SectionTitle> <Paragraph position="0"> One of the most common types of orthographic variation in Japanese occurs in kana endings, called >okurigana, that are attached to a kanji base or stem. Although it is possible to generate some okurigana variants algorithmically, such as nouns (Z`)derivedfromverbs(Z b), on the whole hard-coded tables are required. Because usage is often unpredictable and the variants are numerous, okurigana must play a major role in Japanese orthographic disambiguation.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Cross-Script Orthographic Variants </SectionTitle> <Paragraph position="0"> Japanese is written in a mixture of four scripts (Halpern 1990): kanji (Chinese characters), two syllabic scripts called hiragana and katakana, and romaji (the Latin alphabet). Orthographic variation across scripts, which should play a major role in Japanese IR, is extremely common and mostly unpredictable, so that the same word can be written in hiragana, katakana or kanji, or even in a mixture of two scripts. Table 8 shows the major cross-script variation patterns in Japanese.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Kana Variants Recent years have seen a sharp increase in the use </SectionTitle> <Paragraph position="0"> of katakana, a syllabary used mostly to write loanwords. A major annoyance in Japanese IR is that katakana orthography is often irregular; it is quite common for the same word to be written in multiple, unpredictable ways which cannot be generated algorithmically. Hiragana is used mostly to write grammatical elements and some native Japanese words. Although hiragana orthography is generally regular, a small number of irregularities persist. Some of the major types of kana variation are shown in Table 9.</Paragraph> <Paragraph position="1"> The above is only a brief introduction to the most important types of kana variation. There are various others, including an optional middle dot (nakaguro) and small katakana variants (vs.</Paragraph> <Paragraph position="2"> ), and the use of traditional (avs.k)and historical (Mvs.)kana.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Miscellaneous Variants </SectionTitle> <Paragraph position="0"> There are various other types of orthographic variants in Japanese, which are beyond the scope of this paper. Only a couple of the important ones are mentioned below. A detailed treatment can be found in Halpern (2000).</Paragraph> <Paragraph position="1"> writing system underwent major reforms in the postwar period and the character forms have by now been standardized, there is still a significant number of variants in common use, such as abbreviated forms in contemporary Japanese (= for@andfor)andtraditional forms in proper nouns and classical works (such asbfor aandforC).</Paragraph> <Paragraph position="2"> that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones (words pronounced the same but written differently) and their variable orthography (Halpern 2000). Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. The majority of kun homophones are often close or even identical in meaning and thus easily confused, i.e., noboru means 'go up' when written but 'climb' when writtenJ,while yawarakai 'soft' is writtenJTMorTM with identical meanings.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4Orthographic Variation in Korean 4.1 Irregular Orthography </SectionTitle> <Paragraph position="0"> The Korean orthography is not as regular as most people tend to believe. Though hangul is often described as &quot;logical,&quot; the fact is that in modern Korean there is a significant amount of orthographic variation. This, combined with the morphological complexity of the language, poses achallenge to developers of IR tools. The major types of orthographic variation in Korean are described below.</Paragraph> </Section> <Section position="10" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Hangul Variants </SectionTitle> <Paragraph position="0"> The most important type of orthographic variation in Korean is the use of variant hangul spellings in the writing of loanwords. Another significant kind of variation is in the writing of non-Korean personal names, as shown in Table 10.</Paragraph> </Section> <Section position="11" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Cross-Script Orthographic Variants </SectionTitle> <Paragraph position="0"> Afactor that contributes to the complexity of the Korean writing system istheuse of multiple scripts. Korean is written in a mixture of three scripts: an alphabetic syllabary called hangul, Chinese characters called hanja (their use is declining) and the Latin alphabet called romaja.</Paragraph> <Paragraph position="1"> Orthographic variation across scripts is not uncommon. The major patterns of cross-script variation are shown Table 11.</Paragraph> </Section> <Section position="12" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Miscellaneous Variants </SectionTitle> <Paragraph position="0"> 4.4.1 North vs. South Korea Another factor contributing to the irregularity of hangul orthography is the differences in spelling between South Korea (S.K.) and North Korea (N.K.). The major differences are in the writing of loanwords, astrong preference for native Korean words, and in the writing of non-Korean proper nouns. The major types are shown below.</Paragraph> <Paragraph position="1"> 1. Place names: N.K. GafdfGaa67G9807 (osakka)vs.S.K. GafdfGaa67Gb82f(osaka)for'Osaka' 2. Personal names: N.K. Ga73bGac43 (busyu)vs.S.K. Ga73bGac97(busi)for'Bush' 3. Loanwords: N.K.Ga5b3Gacb3Gb137(missail)vs.S.K.Ga5b3 Gaa67Gb137(misail)for'missile' 4. Russian vs. English: N.K. G97b3Ga2a3Ga81b (guruppa) vs. S.K.G97b3Ga2b4(geurup) 5. Morphophonemic: N.K. Ga147Gb064 (ramyong)vs. S.K.G9a63Gb064(namyong) 4.4.2 New vs. Old Orthography The hangul script went through several reforms during its history, the latest one taking place as recently as 1988. Though the new orthography is now well established, the old orthography is still important because the affected words are of high frequency and their number is not insignificant. For example, the modern Gb137G972b 'worker' (ilgun)waswritten Gb137G9977 (ilkkun)before 1988, whileGa816G95c3'color' (bitgal)was writtenGa816G980f(bitkkal).</Paragraph> <Paragraph position="2"> reforms in Korea did not include the simplification of the character forms, the Japanese occupation of Korea resulted in many simplified Japanese character forms coming into use, such as the Japanese formCto replace(bal).</Paragraph> <Paragraph position="3"> various other types of orthographic variation, which are beyond the scope of this paper. This includes the use of abbreviations and acronyms and variation in interword spacing in multiword compounds. For example, 'Caribbean Sea' (karibeuhae)maybewritten solid (Gb82fGa367Ga7c7Gbf2f)or open (Gb82fGa367Ga7c7G3Gbf2f).</Paragraph> </Section> </Section> class="xml-element"></Paper>