File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/w02-1209_abstr.xml
Size: 5,921 bytes
Last Modified: 2025-10-06 13:42:35
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1209"> <Title>Decomposition for ISO/IEC 10646 Ideographic Characters</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Ideograph characters are often formed from smaller functional units, which we call character components. These character components can be ideograph radicals, ideographs proper, or pure components that must be used with others to form characters. Decomposition of ideographs can be used in many applications. It is particularly important in the study of Chinese character formation, phonetics and semantics.</Paragraph> <Paragraph position="1"> However, the way a character is decomposed depends on the definition of components as well as the decomposition rules. The 12 Ideographic Description Characters (IDCs) introduced in ISO 10646 are designed to describe characters using components. The Hong Kong SAR Government recently published two sets of glyph standards for ISO 10646 characters. The standards, the first of their kind, make use of character decomposition to specify a character glyph using its components. In this paper, we first introduce the IDCs and how they can be used with components to describe two-dimensional ideograph characters in a linear fashion. Next, we briefly discuss the basic references and character decomposition rules. We then describe the data structure and algorithms used to decompose Chinese characters into components and vice versa. We have also implemented our database and algorithms as an internet application, called the Chinese Character Search System, available at http://www.iso10646hk.net/. 
With this tool, people can easily search characters and components in ISO 10646.</Paragraph> <Paragraph position="2"> Introduction ISO/IEC 10646 (ISO 10646), in its current version, contains more than 27,000 Han characters, or ideograph characters as they are called, in its basic multilingual plane and another 40,000 in the second plane [1-2]. The complete ideograph repertoire includes Han characters in all national/regional standards as well as all characters from the Kang Xi Dictionary (康熙字典) and other major references. In almost all current encoding systems, including ISO 10646 and Unicode, each Han character is treated as a separate unique symbol and given a separate code point. This single-character encoding method has some serious drawbacks. In most alphabet-based languages, such as English, new words are created quite frequently, yet the alphabet itself is quite stable. Thus newly adopted words do not have any impact on coding standards. When new Han characters are created, they must be assigned new code points, so all codesets supporting Han characters must leave space for extension. As there is no formal rule to limit the formation of new Han characters, the standardization process for code point assignment is potentially endless. On the other hand, new Han characters are almost always created using existing character components, which can be existing radicals, characters proper, or pure components that are not used alone as characters. If we can use coded components to describe a new character, we can potentially eliminate the standardization process. Han characters can be considered a two-dimensional encoding of components. The same set of components, when used in different relative positions, can form different characters. For example the two components and can form two different characters: depending on the relative positions of the two components. 
However, the current internal code point assignments in no way reveal the relationship of these characters to their component characters. Because of this limitation of the encoding system, people have to put a lot of effort into developing different input methods. Searching for characters with similar shapes is also quite difficult. The 12 Ideographic Description Characters (IDCs) were introduced in ISO 10646 in the code range 2FF0 - 2FFB to describe the relative positions of components, as shown in Table 1. Each IDC symbol shows a typical ideograph character composition structure. For example, ⿰ (U+2FF0) indicates that a character is formed by two components, one on the left-hand side and one on the right-hand side. All IDCs except U+2FF2 and U+2FF3 have a cardinality of two because the decomposition requires only two components. Details of these symbols can be found in Annex F of ISO 10646 2nd Edition.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Characters </SectionTitle> <Paragraph position="0"> The IDCs can be used to describe not only unencoded characters but also coded characters, revealing their internal structures and the relationships among their components. Applications that use these structural symbols can therefore be quite useful. In fact, the most common applications are in electronic dictionaries and on-line education [4].</Paragraph> <Paragraph position="1"> In this paper, however, we introduce a new application in which the IDCs and components are used in the standardization of Han character glyphs. ISO 10646 is a character standard, which allows different glyph styles for the same character, so different regions can develop different glyph styles to suit their own needs. The ideographic repertoire in ISO 10646 has a so-called Horizontal Extension, where each coded ideograph character is listed under the respective CJKV columns. 
The glyph of each character can differ between columns because ISO 10646 is a character standard, not a glyph standard. We normally call these different glyphs variants. For example, the character bone can take three different forms (variants):</Paragraph> </Section> </Section> </Paper>
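The Introduction describes the IDCs (U+2FF0 to U+2FFB) as operators that take two components each, except U+2FF2 and U+2FF3. A minimal sketch of how such a linear description might be read back into a structure is given below, assuming prefix notation (the IDC first, followed by its components) and three components for the two exceptions. The names `IDC_ARITY`, `parse_ids` and `count_components` are hypothetical, and the sample characters are illustrative choices, not the paper's own examples or data structure.

```python
# Arity of each Ideographic Description Character (IDC), U+2FF0..U+2FFB.
# Assumption: U+2FF2 and U+2FF3 take three components; all others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def parse_ids(sequence):
    """Parse a prefix-notation description into a nested tuple tree.

    A leaf is a one-character string (a component); an inner node is a
    tuple (idc, child, child[, child]).
    """
    def parse_at(index):
        if index >= len(sequence):
            raise ValueError("description ended before all components were read")
        symbol = sequence[index]
        index += 1
        if symbol not in IDC_ARITY:
            return symbol, index  # plain component (leaf)
        children = []
        for _ in range(IDC_ARITY[symbol]):
            child, index = parse_at(index)
            children.append(child)
        return (symbol, *children), index

    tree, end = parse_at(0)
    if end != len(sequence):
        raise ValueError("unexpected trailing components")
    return tree


def count_components(tree):
    """Count the leaf components in a parsed tree."""
    if isinstance(tree, str):
        return 1
    return sum(count_components(child) for child in tree[1:])


# Illustrative use: the same two components in a different order give
# different trees, i.e. different characters.
print(parse_ids("\u2FF1\u53E3\u6728"))  # above-below: U+53E3 over U+6728
print(parse_ids("\u2FF1\u6728\u53E3"))  # above-below: U+6728 over U+53E3
```

Because each IDC fixes how many components follow it, the linear sequence needs no brackets, and nested descriptions (a component that is itself described by an IDC) fall out of the same recursive rule.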