<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1209">
  <Title>Decomposition for ISO/IEC 10646 Ideographic Characters</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
HK Mainland Taiwan
</SectionTitle>
    <Paragraph position="0"> Even with the ISO 10646 horizontal extensions, people in Hong Kong are still unsure which glyph style to use, because only some characters in the Hong Kong style deviate from both the G column (mainland China) and the T column (Taiwan).</Paragraph>
    <Paragraph position="1"> Consequently, the Hong Kong SAR Government has decided to develop Hong Kong glyph standards for ISO 10646 that can serve as a reference guide for font vendors developing products for Hong Kong. The standard, the first of its kind, uses character decomposition to specify a character glyph in terms of its components.</Paragraph>
    <Paragraph position="2"> The rest of the paper is organized as follows.</Paragraph>
    <Paragraph position="3"> Section 1 gives the rationale for the use of character components, the references and the decomposition rules. Section 2 describes the data structure and the algorithms that decompose Chinese characters into components and, conversely, form characters from components. Section 3 discusses performance considerations and Section 4 is the conclusion.
1. Character Decomposition Rules
At the beginning of the glyph standardization, the working group agreed on one important requirement, namely extensibility. That is, the specifications should be easy to extend when more characters, which we refer to as the new characters, are added in later versions of ISO/IEC 10646. The specifications should also contain no internal inconsistency and no inconsistency with the source standards of ISO/IEC 10646. To satisfy both the extensibility and the consistency requirements, we concluded that listing every character in ISO/IEC 10646 is not desirable. Instead, we decided to produce the specifications by giving the correct glyphs of character components, based on the common assumption that if a component or a character is written in a certain way, all other characters using it as a component should write it in the same way. For example, if the character &quot;bone&quot; (U+9AA8) is written in a certain way, all characters using &quot;bone&quot; as a component, such as 滑 (U+6ED1) and 骼 (U+9ABC), should have the &quot;bone&quot; component follow the same style. In this way, the specification can be extended very easily to all new characters that use &quot;bone&quot; as a component. In other words, we can assume that component glyphs are standardized for general usage. Describing a character by its components also avoids inconsistency: because we do not list every character that has &quot;bone&quot; as a component, we need not be concerned about producing inconsistent glyphs in the specifications. This is important because the working group has no font vendor as a member, following an implicit rule set by the Government of the HKSAR to avoid any potential conflict of interest. The glyph style is mostly based on the book published by the Hong Kong Institute of Education in 2000 [5]. In principle, to produce glyph specifications we need a concrete, minimal and unique list of basic components. To achieve this, we need a set of rules for decomposing the characters systematically. In our work, we have used GF 3001-1997 [6] as our major component reference. The following is a brief description of the rules. (For a detailed description, please refer to the paper &quot;The Hong  basis to construct a set of primary components. Components for simplified Chinese are removed. The shapes are modified to match the glyph style for Hong Kong.</Paragraph>
    <Paragraph position="4"> * Characters are decomposed into components according to their structure and etymological origin.</Paragraph>
    <Paragraph position="5"> * In some cases, an &quot;ad-hoc&quot; decomposition is used if the etymological origin and the glyph shape are not consistent, if the etymological origin is not clear, or to avoid defining additional components.</Paragraph>
    <Paragraph position="6">  some components to prevent the components from getting too small.</Paragraph>
    <Paragraph position="7"> * In some cases, a single component will be distinguished as two different components. This is the concept of variant or related component.</Paragraph>
    <Paragraph position="8"> This set of rules, together with the 644 basic components and the set of intermediate components defined, enables us to decompose the Chinese characters that appear in the first version of ISO 10646 (20,902 characters), in Ext. A of the second version of ISO 10646 [1], and in the Hong Kong Supplementary Character Set [8-9].</Paragraph>
    <Paragraph position="9"> The 644 basic components play a very important role because they form all the Chinese characters in our scope.</Paragraph>
    <Paragraph position="10"> To describe the positional relationship among the components of a character, we have used the 12 Ideographic Description Characters (IDC) in ISO/IEC 10646 Part 1:2000, in the range 2FF0 to 2FFB, and defined an extra IDC &quot;M&quot; (which indicates that a particular component is a basic component and will not be further decomposed), as shown in Table 1. Every character can be decomposed into up to three components, depending on the cardinality of the IDC used.</Paragraph>
    <Paragraph position="11"> Each character is decomposed according to the following definition:</Paragraph>
    <Paragraph position="13"> CC(i) is a set of character components, where i indicates its position in the sequence. M is a special symbol indicating that the character will not be further decomposed. By our definition, a CC can be drawn from three subsets: (1) coded radicals, (2) coded components and ideographs proper, and (3) intermediate components that are not coded in ISO 10646. The intermediate components are maintained by our system only. The decomposition result is stored in the database. Conceptually, every entry in the database can be treated as a Chinese component with the data structure described above.</Paragraph>
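    <Paragraph> The entry structure described above (an IDC plus up to three components, or the marker M for a basic component) could be modelled roughly as follows. This is only a minimal sketch: the class name Entry, its field names and the use of pinyin strings in place of real code points are illustrative assumptions, not the layout of the actual database.

```python
from dataclasses import dataclass
from typing import Tuple

# "M" marks a basic component, as in the text. Everything else in this
# sketch (class name, field names, pinyin keys) is a hypothetical choice.
BASIC = "M"


@dataclass(frozen=True)
class Entry:
    """One decomposition record: an IDC label plus its components."""
    name: str
    idc: str
    components: Tuple[str, ...] = ()

    def __post_init__(self):
        # A basic component is never decomposed further.
        if self.idc == BASIC and self.components:
            raise ValueError("a basic component has no sub-components")
        # Every other IDC has a cardinality of two or three.
        if self.idc != BASIC and len(self.components) not in (2, 3):
            raise ValueError("an IDC takes two or three components")


# Illustrative entries; the IDC label "B" is assumed here, not taken from Table 1.
zhun = Entry("Zhun", "B", ("Huai", "Shi"))
shi = Entry("Shi", BASIC)
```

Validating the cardinality when an entry is created keeps malformed records out of the database before any recursive algorithm runs over it.</Paragraph>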
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Decomposition/Formation Algorithms
</SectionTitle>
    <Paragraph position="0"> As mentioned above, the decomposition database only gives information on how a character is decomposed in a minimal way.</Paragraph>
    <Paragraph position="1"> However, some characters have nested components. For instance, the character &quot;Zhun&quot; can be decomposed into two components, &quot;Huai&quot; and &quot;Shi&quot;, but &quot;Huai&quot;, being a character itself, can be further decomposed into two components. To handle nesting and to find components in their most elementary form (no further decomposition), we have defined the decomposition and formation algorithms.</Paragraph>
    <Paragraph position="2"> There are two main algorithms: one decomposes a character into a set of components (the algorithm is called Char-to-Compnt), and the other forms a set of characters from a component (the algorithm is called Compnt-to-Char).</Paragraph>
    <Paragraph position="3"> Let x be the seed (x = starting character for  Both algorithms are very similar. They recursively retrieve all characters/components appearing in the decomposition database by using the characters/components themselves as seeds, but their directions of retrieval are opposite to each other. In &quot;Char-to-Compnt&quot;, the decomposition goes down from the current level, one level at a time, until no more decomposition can be done. Figure 1 shows the pseudo code of the algorithm for one level only; it can be applied recursively to find all components of a character. Table 2 shows the entries related to the character &quot;Meng&quot;. Notice that the number of components for &quot;Meng&quot; is not two but four, because one of its components, &quot;Ming&quot;, can be further decomposed into two more components.</Paragraph>
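    <Paragraph> The downward search can be sketched as a one-level lookup applied recursively. The following is an illustration under stated assumptions, not the pseudo code of Figure 1: the dictionary DB, the function names and the placeholder components c1 to c3 are hypothetical, and only the names Meng and Ming come from the text.

```python
from typing import Dict, List, Tuple

# Toy database: name -> (IDC label, components). "M" marks a basic component.
# "Meng" and "Ming" are named in the text; "c1".."c3" are placeholders and
# the IDC labels "A" and "B" are assumed for illustration.
DB: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "Meng": ("B", ("Ming", "c1")),
    "Ming": ("A", ("c2", "c3")),
    "c1": ("M", ()),
    "c2": ("M", ()),
    "c3": ("M", ()),
}


def one_level(x: str) -> Tuple[str, ...]:
    """Return the direct components of x (empty for a basic component)."""
    idc, parts = DB[x]
    return () if idc == "M" else parts


def char_to_compnt(x: str) -> List[str]:
    """Collect every component of x by repeating the one-level step."""
    found: List[str] = []
    for part in one_level(x):
        if part not in found:
            found.append(part)
        for sub in char_to_compnt(part):
            if sub not in found:
                found.append(sub)
    return found


print(char_to_compnt("Meng"))
```

With this toy data the call returns four components for Meng (Ming, its two sub-components, and c1), which matches the count given in the text.</Paragraph>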
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
&quot;Meng&quot;
</SectionTitle>
      <Paragraph position="0"> On the other hand, the &quot;Compnt-to-Char&quot; algorithm searches upward from the current level until no more characters can be found that use the current component. Figure 2 shows the pseudo code of the upward search algorithm, where x is the seed that starts the search and the variable contains all characters formed using the current component x.</Paragraph>
      <Paragraph position="1"> Let x be the seed (x = starting component for  involving the component &quot;Kou&quot;. Note that the result finds not only the character &quot;Wu&quot; but also the characters that use &quot;Wu&quot; as a component. Furthermore, because there are two IDCs with a cardinality of three, the decomposition is not unique. Based on Han character formation rules, some characters should be decomposed into two components first before any further decomposition is considered. For instance, &quot;Zha&quot; should be decomposed into &quot;Jin&quot; and &quot;Ze&quot;, whereas &quot;Jie&quot; should be decomposed into &quot;Xing&quot; and &quot;Gui&quot;. However, for upward search we certainly want the character &quot;Zha&quot; to be found if the search component is &quot;Bei&quot;. Therefore, in addition to using the most reasonable decomposition at the first level, we also maintain different decompositions for applications where character formation rules are less important. In other words, we also provide compositions and decompositions that are independent of particular character formation rules. Again taking the character &quot;Zha&quot; as an example, its components should be not only &quot;Jin&quot; and &quot;Ze&quot;, but also &quot;Bei&quot;, &quot;Ze&quot; and &quot;Zhao&quot;, as well as &quot;Bei&quot; and &quot; &quot;. In fact, in our system, &quot;Zha&quot; is decomposed into &quot;Jin&quot;, &quot;Bei&quot; and &quot; &quot;, as shown in Table 4. The &quot;Char-to-Compnt&quot; algorithm takes the relative positions of the components into consideration, based on the IDC defined in each entry, to find the other three possible components &quot;Bei&quot;, &quot;Ze&quot; and &quot;Zhao&quot;. This can be done because the combination of &quot;Jin&quot; and &quot;Bei&quot; forms &quot;Bei&quot;; similarly, &quot;Bei&quot; and &quot; &quot; form &quot;Ze&quot;, and &quot;Jin&quot; and &quot; &quot; form &quot;Zhao&quot;. Note that in the first two cases of the OR clause, &quot;Bei&quot; and &quot;Ze&quot; will be identified. In the third case of the OR clause, the character &quot;Zhao&quot; will be identified. One may question the validity of the third case of the OR clause, but for the character &quot;Chong&quot;, finding the component &quot;Xing&quot; is very important.</Paragraph>
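      <Paragraph> The upward search can be sketched as a repeated reverse lookup that starts from the seed component. This is an illustration only, not the pseudo code of Figure 2: the dictionary DB, the function name and the placeholder entries (u1, u2, p1 to p3, other) are hypothetical, while Kou and Wu follow the paper's own example.

```python
from typing import Dict, Set, Tuple

# Toy database: name -> (IDC label, components).
# "Kou" and "Wu" follow the paper's example; all other names are placeholders.
DB: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "Wu": ("B", ("Kou", "p1")),
    "u1": ("A", ("p2", "Wu")),
    "u2": ("A", ("Wu", "p3")),
    "other": ("A", ("p2", "p3")),
}


def compnt_to_char(x: str) -> Set[str]:
    """Return every character that contains x directly or through nesting."""
    result: Set[str] = set()
    frontier = [x]
    while frontier:
        current = frontier.pop()
        for name, (_idc, parts) in DB.items():
            # A character found at this level becomes the next seed.
            if current in parts and name not in result:
                result.add(name)
                frontier.append(name)
    return result


print(sorted(compnt_to_char("Kou")))
```

Starting from Kou, the sketch finds Wu first and then the two placeholder characters built on Wu, while the unrelated entry is left out.</Paragraph>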
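      <Paragraph> The three pairings described above (first with second, second with third, first with third) can be sketched as follows. This is a sketch under assumptions rather than the authors' procedure: the names DB, lookup and pair_components are hypothetical, the key third stands in for the component whose name is not given in the text, and the pairing of IDC K with A and of L with B is inferred from the notes attached to the pseudo code. Because the text gives the reading Bei both to a component and to the character it forms with Jin, the compound is stored under the made-up key Bei2.

```python
from typing import Dict, List, Optional, Tuple

# Toy database: name -> (IDC label, components). Keys are illustrative.
DB: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "Zha": ("K", ("Jin", "Bei", "third")),
    "Bei2": ("A", ("Jin", "Bei")),
    "Ze": ("A", ("Bei", "third")),
    "Zhao": ("A", ("Jin", "third")),
}

# Assumed pairing of three-part layouts with two-part layouts.
TWO_PART = {"K": "A", "L": "B"}


def lookup(idc: str, parts: Tuple[str, str]) -> Optional[str]:
    """Find an entry with exactly this layout and these components."""
    for name, (entry_idc, entry_parts) in DB.items():
        if entry_idc == idc and entry_parts == parts:
            return name
    return None


def pair_components(x: str) -> List[str]:
    """Try the three pairings of a three-part character's components."""
    idc, parts = DB[x]
    if idc not in TWO_PART:
        return []
    first, second, third = parts
    pairs = [(first, second), (second, third), (first, third)]
    hits = [lookup(TWO_PART[idc], pair) for pair in pairs]
    return [name for name in hits if name is not None]


print(pair_components("Zha"))
```

Keeping the order of the components in each pair means the relative positions recorded by the IDC are respected, which is the point the text makes about position-aware matching.</Paragraph>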
      <Paragraph position="2">  with three components The basic principle of the algorithm, as shown in Figure 3, is that if we see a character with an IDC {K} or {L}, or an IDC of a character that can be transformed to IDC {K} or {L}, we will try to use its components to form characters.</Paragraph>
      <Paragraph position="3"> Let x be a Chinese component (x = cc); Let LCsub be the list of sub-components c;</Paragraph>
      <Paragraph position="5"> **the same algorithm works when x[structure] = IDC{L}; the result c[structure] then becomes IDC{B}. Figure 3 Pseudo-code for handling a character with three components. Let s be the seed (s = cc); Let r be the result component; if s[structure] = IDC{A} if s[component(1)][structure] = IDC{A} then</Paragraph>
      <Paragraph position="7"> **this algorithm also works when s[structure] = IDC{B}; the result structure then becomes IDC{L}. Figure 4 Pseudo-code for the Split Step. In many cases, we still want to keep the characters in the right decomposition, e.g., to decompose them into two components first and then decompose further if needed. Take another character, &quot;Shu&quot;, as an example. Suppose it is decomposed into only two components (&quot;Mu&quot; and &quot;Shu&quot;). This makes the search more complex. To simplify the search, we go through an internal step, which we call the Split Step, that decomposes the character into three components before we allow component-to-character search. The pseudo code for the Split Step is shown in Figure 4. The generated result  For some characters like &quot;Chong&quot;, the Split Step must consider the component &quot;Zhong&quot; in the middle as an insertion into the character &quot;Xing&quot;. We use similar handling to decompose &quot;Chong&quot; into &quot;Chi&quot;, &quot;Zhong&quot; and &quot;Chu&quot;, with an IDC {K}. To find a character that contains the component &quot;Xing&quot;, such as &quot;Chong&quot;, we need an additional algorithm to locate components that may have been split to the two sides of an inserted component.</Paragraph>
      <Paragraph position="8"> We try to decompose a component into two sub-components if its IDC is &quot;A&quot; or &quot;B&quot;. Once we have the two sub-components, we try different combinations to see whether there are any characters with an IDC {K} or {L} that contain the two sub-components, as shown in Figure 5.</Paragraph>
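      <Paragraph> The Split Step for the Shu example can be sketched as flattening a two-part character whose nested component has the same two-part layout. This is a rough illustration, not the pseudo code of Figure 4: the names DB, THREE_PART and split_step are hypothetical, Shu2 is a made-up key for the component that shares the reading Shu, q1 and q2 are placeholders, and checking both component positions is an assumption.

```python
from typing import Dict, Optional, Tuple

# Toy database: name -> (IDC label, components). Components that are
# missing from the table are treated as basic.
DB: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "Shu": ("A", ("Mu", "Shu2")),
    "Shu2": ("A", ("q1", "q2")),
}

# Assumed pairing of two-part layouts with three-part layouts.
THREE_PART = {"A": "K", "B": "L"}


def split_step(s: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Flatten s into three components when a nested part allows it."""
    idc, parts = DB[s]
    if idc not in THREE_PART or len(parts) != 2:
        return None
    left, right = parts
    for position, part in enumerate(parts):
        sub_idc, sub_parts = DB.get(part, ("M", ()))
        # Only a nested part with the same layout can be flattened.
        if sub_idc == idc and len(sub_parts) == 2:
            if position == 0:
                return (THREE_PART[idc], sub_parts + (right,))
            return (THREE_PART[idc], (left,) + sub_parts)
    return None


print(split_step("Shu"))
```

The flattened three-component form is only an internal view used for searching; the stored two-component decomposition is left unchanged.</Paragraph>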
      <Paragraph position="9"> Let x be a Chinese character (x = cc); Let Clst be the list of results c; if x[structure] = IDC{A} then</Paragraph>
      <Paragraph position="11"> **this algorithm also works when x[structure] = IDC{B}, then the result structure will become</Paragraph>
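      <Paragraph> The search for a component that has been split around an inserted component can be sketched as follows, using the Xing and Chong example from the text. This is only one possible reading of the step and not the pseudo code of Figure 5: the names DB, THREE_PART and find_split_hosts are hypothetical, the entry named other is a placeholder, and matching only the two outer positions is an assumption.

```python
from typing import Dict, List, Tuple

# Toy database: name -> (IDC label, components).
# "Xing", "Chong", "Chi", "Zhong" and "Chu" follow the text.
DB: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "Xing": ("A", ("Chi", "Chu")),
    "Chong": ("K", ("Chi", "Zhong", "Chu")),
    "other": ("K", ("Chi", "Zhong", "p1")),
}

# Assumed pairing of two-part layouts with three-part layouts.
THREE_PART = {"A": "K", "B": "L"}


def find_split_hosts(x: str) -> List[str]:
    """Find three-part characters whose outer parts are the two halves of x."""
    idc, parts = DB[x]
    if idc not in THREE_PART or len(parts) != 2:
        return []
    first, last = parts
    target = THREE_PART[idc]
    return [
        name
        for name, (entry_idc, entry_parts) in DB.items()
        if entry_idc == target
        and entry_parts[0] == first
        and entry_parts[-1] == last
    ]


print(find_split_hosts("Xing"))
```

With this toy data, searching from Xing reports Chong, because the two halves of Xing sit on either side of the inserted Zhong.</Paragraph>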
    </Section>
  </Section>
</Paper>