Multilingual Text Processing in a Two-Byte Code 
Lloyd B. Anderson 
Ecological Linguistics 
316 "A" St. S.E.
Washington, D.C. 20003
ABSTRACT
National and international standards commit- 
tees are now discussing a two-byte code for multi- 
lingual information processing. This provides for 
65,536 separate character and control codes, enough 
to make permanent code assignments for all the
characters of all national alphabets of the world, and
also to include Chinese/Japanese characters. 
This paper discusses the kinds of flexibility 
required to handle both Roman and non-Roman alphabets.
It is crucial to separate information units
(codes) from graphic forms, to maximize processing
power.
Comparing alphabets around the world, we find
that the graphic devices (letters, digraphs, accent
marks, punctuation, spacing, etc.) represent a very
limited number of information units. It is possible
to arrange alphabet codes to provide transliteration
equivalence, the best of three solutions
compared as a framework for code assignments.
Information vs. Form. In developing proposals 
for codes in information processing, the most impor- 
tant decisions are the choices of what to code. In 
a proposal for a multilingual two-byte code, Xerox 
Corporation has made explicit a principle which we
can state precisely as follows:
    Basic codes stand for independently functioning
    information units (not for visual forms).
The choice of type font, presence or absence of serifs,
and variations like boldface, italics or
underlining, are matters of form. Such choices are
normally made once for spans at least as long as
one word. We do not use ComPLeX miXturEs, but
consistent strings like this or THIS.
By assigning the same basic code to variations of a
single letter (as a and A in their various typefaces),
all variants will
automatically be alphabetized the same way, which
is as it should be. The choice of variant forms is
specified by supplementary "looks" information.
(The capitalization of first letters of sentences,
proper names, or nouns is a kind of punctuation.)
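The separation of basic codes from "looks" information can be sketched as follows. This is only an illustration under our own assumptions, not the encoding of the Xerox proposal: the `CodedChar` type, its `looks` field, and the helpers `encode`, `sort_key` and `render` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedChar:
    base: str                       # basic code: the information unit
    looks: frozenset = frozenset()  # supplementary form information (hypothetical)


def encode(word, looks=()):
    """Encode a word as base codes; capitalization becomes a 'caps' look."""
    coded = []
    for ch in word:
        extra = set(looks)
        if ch.isupper():
            extra.add("caps")
        coded.append(CodedChar(ch.lower(), frozenset(extra)))
    return coded


def sort_key(coded):
    """Alphabetize on basic codes only, ignoring all looks."""
    return [c.base for c in coded]


def render(coded):
    """Recover the visible capitalization from the looks information."""
    return "".join(c.base.upper() if "caps" in c.looks else c.base for c in coded)


words = [encode("this"), encode("Apple", ["italic"]), encode("THIS", ["bold"])]
ordered = [render(w) for w in sorted(words, key=sort_key)]
print(ordered)
```

Because the sort key never consults `looks`, the plain and boldface capital forms of the same word receive identical keys, which is the behavior the principle calls for.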
Identical graphic forms may also be assigned 
more than one code because they are distinct units 
in information processing. Thus the letter form
"C" is used in the Russian alphabet to represent
the sound /s/, but it is not the same information 
unit as English "C", so it has a distinct code. So 
far this seems relatively obvious. 
The same principle is now being applied in
much more subtle cases. Thus the minus sign and 
the hyphen are assigned distinct codes in recent 
proposals because they are completely distinct in- 
formation units. There are even two kinds of hy- 
phens distinguished, a "hard" hyphen as in the 
word father-in-law, which remains always present, 
and a "soft" hyphen which is used only to di- 
vide a word at the end of a line, and which should 
automatically vanish when, in word-processing, the 
same word comes to stand undivided within the line.
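The behavior of the two hyphens can be sketched in a few lines. As an assumption for illustration we use the present-day Unicode SOFT HYPHEN (U+00AD) for the "soft" hyphen and the ordinary "-" for the "hard" one; the function name `visible` is hypothetical.

```python
SOFT = "\u00ad"  # Unicode SOFT HYPHEN, assumed here as the "soft" hyphen code


def visible(line):
    """Return the printed form of one output line.

    A soft hyphen is printed only when the line is broken at it;
    anywhere else it vanishes. Hard hyphens always remain.
    """
    broken_here = line.endswith(SOFT)
    body = line.replace(SOFT, "")
    return (body + "-") if broken_here else body


stored = "a multi" + SOFT + "lingual father-in-law"
print(visible("a multi" + SOFT))  # line broken at the soft hyphen
print(visible(stored))            # word undivided within the line
```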
We can now frame the question "what to code?"
as a matter of empirical discovery: what are the
independently functioning information units in 
text? Relevant facts emerge from comparing a 
range of different alphabets. 
What is a "letter of the alphabet"? -- the 
problem of diacritics and digraphs. The most 
obvious question turns out to be the most difficult 
of all. Western European alphabets are in many 
ways not typical of alphabets of the world. They 
have an unusually small number of basic letters, 
and to represent a larger number of sounds they use 
digraphs like English sh, ch, th, or diacritics as 
in Czech š, č. It seems at first entirely obvious
that digraphs like sh should be coded simply as a 
sequence of two codes, one for s plus one for h. 
Indeed English, French, German and Scandinavian 
alphabets do alphabetize their digraphs just like 
a sequence, s plus h, etc. But these national
alphabets are not typical. Spanish, Hungarian,
Polish, Croatian and Albanian treat their native
digraphs as single letters for purposes of alphabetical
order. Spanish ll is not a sequence of
two l's, but a new letter which follows all lo, lu
sequences; similarly ch follows all c sequences, and ñ
follows all n sequences as a separate letter.
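The Spanish ordering just described can be sketched by splitting each word into information units before comparing. The alphabet list follows the traditional order described in the text; the names `units` and `collation_key` are hypothetical, and accented vowels are left out of this sketch.

```python
# Traditional Spanish alphabet, with ch, ll and ñ as separate letters.
SPANISH = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l",
           "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w",
           "x", "y", "z"]
RANK = {unit: i for i, unit in enumerate(SPANISH)}
DIGRAPHS = {"ch", "ll"}


def units(word):
    """Split a word into information units, treating ch and ll as single units."""
    w = word.lower()
    out, i = [], 0
    while i < len(w):
        if w[i:i + 2] in DIGRAPHS:
            out.append(w[i:i + 2])
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out


def collation_key(word):
    """Sort key: the rank of each unit in the Spanish alphabet."""
    return [RANK[u] for u in units(word)]


words = ["luz", "llama", "cuna", "chico", "loco", "dama"]
print(sorted(words, key=collation_key))
```

A plain character-by-character sort would instead place chico before cuna and llama before loco.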
There is just as much variation in handling
letters with diacritics. The umlauted letter ö is
alphabetized as a separate letter following o in
Hungarian, and at the end of the alphabet in
Swedish, but in German it is mixed in with o. In
Spanish, ñ is treated as a separate letter, but the
Slovak ň representing the same sound is mixed in
with ordinary n.
In Table 1, the digraphs and letters with
diacritics which are not in parentheses or brackets
are alphabetized separately as distinct single
units. Those in parentheses are alphabetized as a
sequence of two or more letters, or (Slovak and
Czech ľ, ň, ť, ď) are treated as equivalent to
the simpler letter, completely disregarding the
diacritic. Combinations in brackets are used to
represent sounds in words borrowed from other
languages. Double dashes mark sounds for which a
particular alphabet has no distinctive written symbol.
(In Russian, palatal consonants are marked
by choice of special vowel letters, while Turkish
has a different kind of contrast, hence the blanks.)
Even when a digraph or trigraph is treated as
a sequence of letters for alphabetization, there
may be other evidence that it functions as a single
information unit. In syllable division (hyphenation),
English never divides the digraphs sh, ch,
or th when they function as single units,
but does when they represent two units
(as in ...t-house). The same is true of other letter
combinations in all national standard alphabets where
a single sound is represented by a combination of
letters.
Within certain mechanical constraints, typewriter
keyboards also put each distinct information
unit on a separate key. Thus Spanish ñ or Czech
š, č, ž are produced by single keys, not by adding
a diacritic to a base letter. Mechanical limits
have forced a sequence of two letters (like the
Spanish ch, ll) to be typed with two separate keystrokes
whether or not they represent a single
functional unit, but occasionally we see exceptions,
as in Dutch where the ij digraph appears as
a ligature on a single key and is printed in one
space, not two.
Unitary, unanalyzable letters exist in Serbian
and Macedonian for most of the sound types (the
columns) of Table 1. Icelandic has single letters
"thorn" and "edh" for the two rightmost columns.
Even where the other languages use digraphs or
letters with diacritics, there is evidence from
syllabification and usually also from alphabetical
order that these are functionally independent
information units. For transliteration from one
national alphabet into another, these symbol
equivalences are needed. The principle stated
above thus implies that unique codes be
available for English sh, ch, th and unitary
digraphs in other languages so these can be used
when needed in information processing. (Information
processing is not the shuffling of bits of
scribal ink!) The principle does not compel use
of those codes -- English th can be recorded first
as a sequence of two codes, then converted into a
single code only when needed, by a program which
has a dictionary listing all words containing
unitary th.
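Such a dictionary-driven conversion might look like the sketch below. The word list `UNITARY_TH_WORDS` is a tiny hypothetical sample, and we assume for simplicity that every th in a listed word is unitary; words not listed keep t and h as two separate codes.

```python
# Hypothetical sample dictionary of words containing unitary th.
UNITARY_TH_WORDS = {"father", "the", "thin", "other"}


def to_units(word):
    """Convert a letter sequence to information units.

    th becomes a single unit only in words listed in the dictionary;
    elsewhere (e.g. across a compound boundary) t and h stay separate.
    """
    w = word.lower()
    if w not in UNITARY_TH_WORDS:
        return list(w)
    out, i = [], 0
    while i < len(w):
        if w.startswith("th", i):
            out.append("th")
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out


print(to_units("father"))
print(to_units("hothouse"))
```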
Table 1. Some Consonant Characters in Europe
[Table body not recoverable from the source. Columns are consonant sound types; legible row labels are Russian, Macedonian, Serbian, Hungarian, Croatian, Slovak, Czech, Latvian, Polish, German, Albanian, Turkish, Romanian, French, and Spanish.]
Spatial arrangement of printed characters.
In alphabets of Europe, letters (and information
units) almost always follow each other in a line,
from left to right. This is not true of many
important alphabets elsewhere in the world. Arabic
and Hebrew, when they write short vowels, place
them above or below the consonant letters. What
we transcribe as kitabu appears (in a left-to-right
transform of the Arabic arrangement) as follows:

      a u
    k t b
    i

These vowel symbols are independent information units,
not "diacritics" in the sense of the European
alphabets. They keep a constant form, combining
freely with any consonant letter. Alphabets of
India and Southeast Asia place vowels above, below,
to right or to left of a consonant letter or cluster,
or in two or three of these positions simultaneously.
There can be further combinations with
marks for tones or consonant-doubling.
The Korean alphabet arranges its letters in
syllabic groups, so that mascot would appear as
follows if written in the Korean manner:

    m a   c o
     s     t

The independently functioning
information units are still consonants and vowels,
for which we need codes, and we need one additional
code to mark the division between syllables. This
is just as much an alphabet as our familiar English
and is not a syllabary. (Since there are only
about [?]00 syllables, a printing device might store
all of them, but these would not normally be useful
in information processing.)
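The Korean-style arrangement can be sketched with ordinary letter codes plus one extra code for the syllable division. The separator value `"|"` and the function names are hypothetical stand-ins; the point is only that processing works on the letters, while the grouping is needed for display.

```python
SYLLABLE_SEP = "|"  # stand-in for the additional syllable-division code


def letters(coded):
    """The consonant and vowel units, as used in information processing."""
    return [u for u in coded if u != SYLLABLE_SEP]


def syllable_groups(coded):
    """Group the units into syllabic blocks, as needed only for display."""
    groups, current = [], []
    for unit in coded:
        if unit == SYLLABLE_SEP:
            if current:
                groups.append(current)
                current = []
        else:
            current.append(unit)
    if current:
        groups.append(current)
    return groups


coded = list("mas") + [SYLLABLE_SEP] + list("cot")
print(letters(coded))
print(syllable_groups(coded))
```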
A flexible multilingual code for information
processing must be able to handle the different
spatial arrangements described here, but it need
not (except in input and output for human use) be
concerned with what that spatial arrangement is,
only with what significant information units it
contains. Even in Europe, Spanish accented vowels
á, é, í, ó, ú show a vertical superimposition of
the basic vowels with a functionally independent
symbol of accentuation. These are not new letters
in the sense that Croatian č, ć, đ, š, ž are, but
are alphabetized just like simple a, e, i, o, u.
Criteria for a two-byte code standard. We can
now consider alternative methods of coding for
multilingual information processing. Three basic
criteria are given first, followed by discussion 
of alternative solutions and further criteria. 
A) Each independent character or information
unit shall have available a representation in a
two-byte code (whether it is graphically manifest
as a base letter, digraph, independent diacritic,
letter-plus-diacritic unit, syllable separation,
punctuation mark, or other unit of normal text,
and independent of position in printing).
B) It shall be possible to identify the source
alphabet from the codes themselves. (Since "C" in
Czech represents the sound /ts/, it is not the same
unit as English "c"; in library processing it is
important to know that German den and die are
articles like English the, to be disregarded in
filing, but English den and die are headwords.)
C) The assignment of information units to 
codes shall maximize the possibilities for use of 
one-byte code reductions through long monolingual 
texts, minimizing shifts between different blocks 
of 256 codes. (This is especially important in reducing transmission costs.)
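Criterion C can be illustrated with a simple sketch of a one-byte reduction. The stream format here (tagged `SHIFT` and `CHAR` items) is our own assumption for illustration, not a format defined by any of the proposals: a block shift is emitted only when the first byte changes, so a long monolingual text costs about one byte per character.

```python
def reduce_to_one_byte(codes):
    """Turn two-byte codes into a stream of block shifts and one-byte codes."""
    stream, current_block = [], None
    for code in codes:
        block, low = divmod(code, 256)
        if block != current_block:
            stream.append(("SHIFT", block))  # announce the new block of 256 codes
            current_block = block
        stream.append(("CHAR", low))
    return stream


def expand(stream):
    """Recover the original two-byte codes from the reduced stream."""
    codes, block = [], None
    for kind, value in stream:
        if kind == "SHIFT":
            block = value
        else:
            codes.append(block * 256 + value)
    return codes


text = [0x2141, 0x2142, 0x2143, 0x3041]  # three codes in one block, one in another
stream = reduce_to_one_byte(text)
print(len(stream), "items for", len(text), "characters")
```

The fewer shifts between blocks a text requires, the closer the transmitted length comes to one byte per character.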
Each of the following three solutions has certain
advantages. The third is far superior in the
long run.
Solution 1. Incorporate existing 7-bit or
8-bit national code standards, one in each block
of 256 codes. Use the extra space as codes for
information units which are not single spacing
characters. This satisfies all of the basic criteria
(A, B, C) and uses existing codes, adding only
a first byte as an alphabet name to make a two-byte
code. There is no transliteration-equivalence,
and elaborate transliteration programs would be
necessary for each conversion: N x N programs for
N alphabets.
Solution 2. Systematically code all basic
letter forms and all their diacritic modifications,
thus allowing for expansion and use of new
letter-diacritic combinations. Despite their differences,
Latin-based alphabets share a common core of alphabetical
order, which can be reflected in a coding
to minimize shuffling. This is attempted in Table
2, which includes all characters from ISO/TC97/SC2
N 1255 (1982-11-01), pp. 60-61, plus additions from
African and Vietnamese alphabets. Code ordering
is downwards within columns, starting from the left.
Table 2. Alphabetical order of letters and diacritics as a basis for coding
[Table body not recoverable from the source.]
This solution satisfies none of the criteria
(A, B, C), and does not provide codes for many kinds
of information units. It appears to be economical
in Europe, where 20 national alphabets can fit in
48 x 13 = 624 code cells if only letter forms are
considered. But for non-Latin alphabets there can
be no similar savings. Here there are (considering
only living alphabets) about 5[?] alphabets based on
38 distinct sets of letters.
Solution 3. Transliteration-equivalent units are
assigned identical second bytes in their two-byte
code. Transliteration between any two alphabets
simply changes the first byte of the code naming
the alphabet, requiring minor programming only when
an alphabet has non-recoverable spellings or cannot
represent certain sounds. This solution depends on
the fact that there is a small number of types of
information units which have ever been represented
in a national standard alphabet. In the tentative
arrangement of Table 3, most of the sound types
noted are represented by single unanalyzable characters
in some national alphabet (as Georgian,
Armenian, Hindi, ...), and most of the rest by
clearly unitary digraphs. Despite the strange
symbols, this is not a list of fine phonetic
distinctions; it is a list of distinct categories
of written symbols.
The idea for this solution came from the one-byte
code adopted in India, structured identically
with transliteration-equivalence for each of the
alphabets of India. A printer with only Tamil
letters can simply print a Tamil transliteration
of an incoming Hindi message.
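Under Solution 3, the core of transliteration reduces to replacing the first byte. The sketch below assumes hypothetical block numbers for Hindi and Tamil and arbitrary second-byte values; none of these numbers come from Table 3. It also ignores the cases mentioned above where an alphabet cannot represent a sound, which would need extra programming.

```python
# Hypothetical first-byte (alphabet) numbers, chosen only for illustration.
HINDI, TAMIL = 0x40, 0x41


def alphabet_of(code):
    """First byte: names the source alphabet (criterion B)."""
    return code >> 8


def unit_of(code):
    """Second byte: the transliteration-equivalent information unit."""
    return code & 0xFF


def transliterate(codes, target_alphabet):
    """Keep each second byte; replace the first byte with the target alphabet."""
    return [(target_alphabet << 8) | unit_of(c) for c in codes]


message = [(HINDI << 8) | u for u in (0x15, 0x2E, 0x32)]  # arbitrary sample units
tamil = transliterate(message, TAMIL)
print([hex(c) for c in tamil])
```

Because the second bytes are untouched, converting back to the source alphabet restores the original codes exactly.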
In the two-byte version presented here, there
is provision for any alphabet to add characters
representing sounds of some other alphabet, and a
small amount of space to add unique information
units which are not matched in other alphabets.
This is the right amount of space for expansion.
Applications to transliteration and library
processing. With newer capabilities of printers
and screens, a speaker of any language can soon
request a data base in its original alphabet or
in any transliteration of his choice, either one
using many diacritic characters like Croatian and
special symbols to avoid ambiguity, or one more
adapted to his native alphabet, for example French
or Hungarian. Records can be kept in the codes of
the original alphabet, always ensuring complete
recoverability. There would be a gentle encouragement
for each national alphabet to use a consistent
transliteration for each sound independent of the
source alphabet, because this would be automatic.
Summary. The third solution described above
is designed to handle all the structures and functions
found in national standard alphabets and to
fit them like a well-made glove, allowing the maximum
capabilities of information processing, but
never compelling their use. This type of solution
could be a primary international standard, with
code translations to reach existing 7-bit and 8-bit
standards, and an ESCAPE sequence to allow processing
directly in the older standards (Solution 1
above incorporated as an alternate). Since mathematical
and scientific symbols are international,
they would require only single blocks of 256 codes.
The first column of 16 blocks of 256 each could
provide 4096 two-byte control codes, and the second
column could eventually be added to the 96 alphabet
blocks, allowing transliteration of numerals.
The right 128 blocks of 256 codes each remain for
Chinese/Japanese characters or other purposes, but
even these can be coded alphabetically in terms of
character components and arrangements (partly
achieved in a keyboard now installed at Stanford
and the Library of Congress).
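The division of the code space described in the summary can be checked with a short sketch. We assume, as an illustration only, that the 256 blocks are laid out in 16 columns of 16 blocks and that the high four bits of the first byte give the column; the region names are hypothetical labels.

```python
from collections import Counter


def region(code):
    """Classify a two-byte code by the column of its block (assumed layout)."""
    block = code >> 8    # first byte: which block of 256 codes
    column = block >> 4  # assumed: 16 blocks per column
    if column == 0:
        return "control"
    if column == 1:
        return "second column (numerals, eventually)"
    if column < 8:
        return "alphabet"
    return "Chinese/Japanese or other"


# Count how many blocks of 256 codes fall in each region.
counts = Counter(region(block << 8) for block in range(256))
print(counts)
print(counts["control"] * 256, "control codes;", 256 * 256, "codes in all")
```

Under this assumed layout the counts agree with the figures in the text: 16 control blocks (4096 codes), 16 blocks in the second column, 96 alphabet blocks, and 128 blocks on the right.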
ACKNOWLEDGMENTS
I would like to thank Mr. Thomas N. Hastings,
chairman of the ANSI X3L2 committee, and Mr. James
Agenbroad, APO, Library of Congress, for indispensable
information and discussions. They of course
bear no responsibility for claims or analyses
presented here.
Table 3. Transliteration-equivalent information units found in national standard alphabets
[Table body not recoverable from the source. Columns and rows are indexed by the hexadecimal digits 0-F. Legible entries include SPace, INitial-CAPS, SUPerscript, ALTern.-CHAR, DIACritic, SYLLable-SEPAR., INSULator, REPeat MARKER, DIGraph-LINE, SILent LETter, DOUble CONSon., and NO VOWEL.]
