A CHINESE CHARACTERS CODING SCHEME FOR COMPUTER INPUT AND INTERNAL STORAGE 
Chorkin Chan, Computer Centre, University of Hong Kong, Hong Kong 
Abstract 
A coding scheme for inputting Chinese
characters by means of a conventional
keyboard has been developed. The code
for each Chinese character is composed
of two strings of keys, one
corresponding to the spelling and the
other to the ideographic property of the
character. Each code requires no more
than seven keys (five and a half on
average), and 99.5% of the ten thousand
characters in the dictionary 'XianDai
HanYu CiDian' have unique codes. Each
input code can be packed into 32 bits
for internal representation.
Introduction
Over the last few years, encoding
Chinese characters has become a very
active subject of research. Numerous
papers have appeared, mainly written in
Chinese (and hence difficult to cite in
English), proposing various kinds of
input schemes. Unfortunately, most of
these papers offered only ideas, without
accompanying implementation and
experimentation. This paper presents a
coding scheme for Chinese characters
based on their ideographic properties as
well as their spellings, so that a
conventional typewriter keyboard can be
used for input.
This scheme has been implemented at the
University of Hong Kong using an IBM
3031 under VM/CMS. For lack of a proper
output device to display Chinese
characters, when the code of a Chinese
character is entered, the address of
that character (where it can be found)
in the dictionary 'XianDai HanYu CiDian'
is displayed. This is awkward but still
sufficient to prove the correctness of
the code recognition procedure.
The Coding Scheme for Inputting
In this scheme, a code for a Chinese
character consists of two strings of
symbols concatenated together. One
string, of three symbols, corresponds to
the ideographic radicals of which the
character is composed. The other, of no
more than four symbols, is the spelling
of the character. Corresponding to each
of the ten thousand characters in the
dictionary 'XianDai HanYu CiDian', with
the exception of twenty six pairs, there
exists a unique code in this scheme. In
other words, this coding scheme is 99.5%
unique. Furthermore, among these pairs
of characters sharing the same codes in
a pair-wise manner, the non-uniqueness
of eighteen pairs can be easily removed
by deleting one member of each pair from
the vocabulary, because they are either
'dead' characters appearing in ancient
classics only or they can be replaced by
other characters of equivalent meaning.
The non-uniqueness of the remaining
eight pairs can also be removed by
changing either the ideographic pattern
or the spelling of one of the members in
each pair. Thus, by means of these
remedial measures, this coding scheme
offers a unique code to each of the ten
thousand characters found in this
dictionary. The list of characters
sharing the same codes is in Table 1,
together with the suggested remedies to
overcome the problem of non-uniqueness.
The Spelling of Chinese Characters 
There are two standard systems for
spelling Chinese characters, one in
terms of the Latin alphabet and the
other in terms of Mandarin Pin Yin
symbols. With the former, a maximum of
five letters is normally required to
spell a Chinese character. However,
since the letter 'G' (except when it is
the leading letter) always appears with
'N' as 'NG', one can replace 'NG' with
'G' and reduce the maximum number of
letters required from five to four. With
the latter, no more than three symbols
are required to spell a Chinese
character. This can be an important
saving, but in this paper spellings are
in terms of Latin letters simply because
a conventional terminal keyboard does
not have Mandarin Pin Yin keys.
It is not always obvious whether one
should read certain Chinese characters
with or without a curling tongue, i.e.,
whether one should spell with 'C' or
'CH', 'S' or 'SH', and 'Z' or 'ZH'. This
is particularly difficult for those
whose mother tongue is not Mandarin. In
order to be more forgiving, this coding
scheme allows one not to differentiate
'C' from 'CH', 'S' from 'SH' and 'Z'
from 'ZH', so that, for example, 'SHAO'
can be spelled as 'SAO'. As a
consequence, there will be three
additional pairs of characters sharing
the same codes in a pair-wise manner, as
listed in Table 2. Fortunately, the
non-uniqueness so engendered can be
easily eliminated by deleting one
member of each pair because of its rare
occurrence. For the same reason, this
coding scheme also allows one to confuse
a leading 'N' with a leading 'L'. For
example, 'LUAN' can be spelled as 'NUAN'
and vice versa. No non-uniqueness is
introduced as a result of this
relaxation because the complete code
consists of the radical string as well
as the spelling string. Over the ten
thousand characters in 'XianDai HanYu
CiDian', this coding scheme requires an
average of 2.5 letters to spell a
Chinese character.

Table 1: Pairs of Chinese Characters Sharing the Same Codes
(The Char. 1 and Char. 2 columns are not recoverable; a lost glyph is shown as [?].)

Spelling  Radical      Suggested Remedy        Justification
          Composition
AN        5            delete char. 2          same meaning
BI        9TE          write char. 2 as [?]    it means a defect
BO        M            delete char. 1          same meaning
DIAO      V            delete char. 2          replaced by [?]
DUN       KB           delete char. 1          uncommon
E         KGX          delete char. 1          uncommon
E         T2-          delete char. 2          uncommon
FU        VM           spell char. 2 as FO     so is [?]
GU        K2           delete char. 1          uncommon
JIA       FDK          write char. 2 as [?]    metallic shackle
JIAN      JY           delete char. 1          uncommon
JING      D6=          write char. 1 as [?]    human activity
JUAN      PKL          delete char. 2          uncommon
LIAN      -y           delete char. 1          uncommon
LING      KYR          delete char. 1          uncommon
MAO       HOP          delete char. 2          uncommon
PANG      87           write char. 1 as [?]    that's original
QI        I            write char. 2 as [?]    that's original
SHAO      2 DK         delete char. 2          replaced by [?]
SI        I            write char. 1 as [?]    that's original
XIAO      8EL          write char. 2 as [?]    being celestial
YI        F?           delete char. 1          uncommon
YI        ?            delete char. 2          uncommon
YU        0            spell char. 2 as OU     so is [?]
YUN       2K;          delete char. 2          uncommon
ZHANG     27           write char. 1 as [?]    made of fabric
ZHEN      JPX          delete char. 1          uncommon
ZI        X.X          delete char. 1          replaced by [?]
ZI        ;.X          delete char. 1          replaced by [?]
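
The spelling rules described in this section can be sketched in code. This is only an illustration, not the implementation run under VM/CMS: the function names are hypothetical, and folding every accepted variant onto a single lookup key is an assumed way of realising the tolerance for 'C'/'CH', 'S'/'SH', 'Z'/'ZH' and a leading 'N'/'L'. The paper states only that the variants are accepted.

```python
def shorten(spelling):
    """Replace a trailing 'NG' with 'G', as the scheme does to save one symbol."""
    s = spelling.upper()
    if len(s) > 2 and s.endswith("NG"):
        s = s[:-2] + "G"
    return s


def lookup_key(spelling):
    """Fold tolerated variants onto one key (assumed approach, see lead-in)."""
    s = shorten(spelling)
    # Treat leading CH, SH, ZH the same as C, S, Z.
    for digraph in ("CH", "SH", "ZH"):
        if s.startswith(digraph):
            s = s[0] + s[2:]
            break
    # Treat a leading N the same as a leading L.
    if s.startswith("N"):
        s = "L" + s[1:]
    return s


if __name__ == "__main__":
    for typed in ("SHAO", "SAO", "LUAN", "NUAN", "ZHANG"):
        print(typed, "->", lookup_key(typed))
```

With this folding, 'SHAO' and 'SAO' share one key, as do 'LUAN' and 'NUAN'; the radical string is then what separates characters whose spellings have been merged.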
The Radical Composition of
Chinese Characters 
One traditional method of looking up a
Chinese character in a dictionary is
first to identify a radical in the
graphic representation of the character.
There are hundreds of different standard
radicals used in a dictionary, and there
are rigid rules to apply in order to
identify one. Many Chinese characters
are filed under a single radical, so
even a combination of the spelling and
the identifying radical is not
sufficient to yield a unique code for a
Chinese character.
An experiment was conducted in which
each of the ten thousand characters
mentioned above was decomposed into a
string of as many as eight radicals. In
order to do so, a total of four hundred
and fifty six radicals were employed.
These radicals were grouped into fifty
sets according to their common graphical
properties. Each set was then associated
with a key of a conventional keyboard.
Table 3 lists all these radicals, their
groupings and their associations with
the keys of a keyboard. Human
engineering aspects were considered when
the set-key association was determined.
The radical string for a Chinese
character consists of the keys
corresponding to the first three
radicals composing the character. If the
character is decomposed into fewer than
three radicals, blanks are used as
fillers to make up a string of three
keys. For instance, the character [?] is
decomposed into 91T, and the radical
string for [?] is 'I' followed by two
blanks. In this coding scheme, the
grouping of radicals into sets is of
paramount importance. On the one hand,
the radicals are grouped according to
their common graphic properties into as
few sets as possible. On the other hand,
care is exercised to assure the
uniqueness (or near uniqueness) of the
code-character correspondence.
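
The rule for forming the radical string (the keys of the first three radicals, padded with blanks) can be sketched as follows. The function names and the sample keys are hypothetical. Placing the radical string before the spelling in `input_code` is an assumption, since the paper says the two strings are concatenated without stating the order.

```python
def radical_string(radical_keys):
    """Keys of the first three radicals, padded with blanks to three symbols."""
    return "".join(list(radical_keys)[:3]).ljust(3)


def input_code(radical_keys, spelling):
    """Concatenate radical string and spelling (order is an assumption)."""
    spelling = spelling.upper()
    if not 1 <= len(spelling) <= 4:
        raise ValueError("spelling must have one to four symbols")
    return radical_string(radical_keys) + spelling


if __name__ == "__main__":
    print(repr(radical_string("91T")))           # three radicals, no padding
    print(repr(radical_string(["K", "B"])))      # two radicals, one blank
    print(repr(input_code(["K", "B"], "dun")))   # at most seven symbols in all
```

A character decomposed into eight radicals still contributes only its first three keys, which is why the grouping of radicals into sets matters so much for uniqueness.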
The Coding Scheme for
Internal Representation 
For data processing purposes, it is 
necessary to arrange the Chinese charac- 
ters into a collating sequence which is 
a direct result of their internal repre- 
sentation in computer memory. Hence, 
when one is designing the internal 
codes, besides minimizing the length of 
the codes, one should also observe that 
the collating sequence that follows is 
logical and practical. This paper 
attempts to derive the internal codes 
logically from the input codes which, in 
turn, are logically related to the spel- 
lings and graphical properties of the 
Chinese characters. When a new charac- 
ter is created in the future with a 
unique input code, this scheme guaran- 
tees that the internal code will also be 
unique and a logical place in the 
collating sequence for it is assured. 
The maximum number of keys used for an
input code is seven. Storing seven
symbols, in general, requires seven
bytes. Recall that three symbols out of
the seven indicate which sets of
radicals the Chinese character is
composed of. Since there are fifty sets
of radicals altogether, there are a
total of 50 x 50 x 50 = 125,000 possible
combinations. Seventeen bits
(2^17 = 131,072) are sufficient to
represent these combinations.
The remaining four alphabetic symbols
used to represent the spelling have the
following properties. The first symbol
can be any letter from A to Z (except
V), a total of twenty five
possibilities. Five bits suffice to
represent it. The second symbol can be a
blank, A, E, H, I, M, N, O, R, U or V, a
total of eleven possibilities. Four
bits suffice. The third symbol can be a
blank, A, E, G, I, N, O or U, a total of
eight possibilities. Three bits suffice.
The fourth symbol can be a blank, A, G,
I, N, O or U, a total of seven
possibilities. Three bits suffice.
Thus the spelling can be packed into
fifteen bits. Combined with the
seventeen bits required for the
radicals, a code in this scheme requires
only thirty two bits of memory space.
As a consequence of this internal repre- 
sentation, the collating sequence would 
be such that where a character should 
appear in the sequence first depends on 
the spelling of the character. The 
order of two characters of the same 
spelling depends on the keys used in the 
radical strings for the two characters. 
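
The bit budget and the resulting collating sequence can be illustrated with a short sketch. It is not the author's implementation. The symbol tables follow the possibilities listed above, but the ordering within each table, the placement of the spelling in the high-order fifteen bits, and the numbering of the fifty radical sets as 0 to 49 (including how the blank filler is numbered) are all assumptions made here for illustration.

```python
# Allowed symbols per spelling position, blank first (assumed ordering).
FIRST = "ABCDEFGHIJKLMNOPQRSTUWXYZ"   # 25 letters, no V -> 5 bits
SECOND = " AEHIMNORUV"                # 11 possibilities -> 4 bits
THIRD = " AEGINOU"                    # 8 possibilities  -> 3 bits
FOURTH = " AGINOU"                    # 7 possibilities  -> 3 bits
TABLES = (FIRST, SECOND, THIRD, FOURTH)
BITS = (5, 4, 3, 3)
SETS = 50                             # fifty radical sets -> 17 bits for three


def pack(spelling, radical_sets):
    """Pack a spelling and three radical set numbers into one 32-bit integer."""
    s = spelling.upper().ljust(4)
    if len(s) != 4 or len(radical_sets) != 3:
        raise ValueError("need at most 4 spelling symbols and 3 radical sets")
    spell = 0
    for symbol, table, bits in zip(s, TABLES, BITS):
        spell = (spell << bits) | table.index(symbol)  # ValueError if disallowed
    rad = 0
    for r in radical_sets:
        if not 0 <= r < SETS:
            raise ValueError("radical set number out of range")
        rad = rad * SETS + r
    return (spell << 17) | rad        # spelling in the high bits (assumption)


def unpack(code):
    """Inverse of pack: recover the spelling and the three set numbers."""
    rad, spell = code & 0x1FFFF, code >> 17
    sets = []
    for _ in range(3):
        rad, r = divmod(rad, SETS)
        sets.append(r)
    symbols = []
    for table, bits in zip(reversed(TABLES), reversed(BITS)):
        symbols.append(table[spell & ((1 << bits) - 1)])
        spell >>= bits
    return "".join(reversed(symbols)).rstrip(), tuple(reversed(sets))


if __name__ == "__main__":
    codes = [pack("BI", (0, 0, 0)), pack("AN", (0, 1, 3)), pack("AN", (0, 1, 2))]
    for code in sorted(codes):
        print(f"{code:032b}", unpack(code))
```

Because the spelling occupies the more significant bits in this layout, sorting the packed integers orders characters by spelling first and by radical string second, which matches the collating sequence described above.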
Table 2: Conflicts Introduced by not Differentiating 'C' from 'CH', 'S' from 'SH'
and 'Z' from 'ZH' as Leading Symbols in Spelling Chinese Characters

Spelling  Radical      Suggested Remedy   Justification
          Composition
CU/CHU    72Y          delete char. 1     uncommon
SA/SHA    7M           delete char. 2     uncommon
SI/SHI    G2           delete char. 1     replaced by [?]
Table 3: Grouping of Radicals
Key   Radicals in Sets
[radical glyphs not recoverable]
The Next Step 
In order to evaluate the effectiveness 
of this coding scheme, the author plans 
to experiment with different users and 
measure their coding efficiencies as a 
function of training and experience as 
well as their reaction towards using 
this scheme. The acceptance of the 
users is the ultimate measure of success 
of any invention. The design of the
set-key association in Table 3 is some- 
what arbitrary. Since it has a subtle 
impact on the collating sequence, more 
research in this area is necessary. 
Acknowledgement 
The author is indebted to Professor 
T.C. Chen for his constructive 
suggestions and criticisms. The author 
is also grateful to Mr. T.H. Tse for his 
assistance and discussions. 
