Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 117–120,
New York, June 2006. c©2006 Association for Computational Linguistics
Quantitative Methods for Classifying Writing Systems
Gerald Penn
University of Toronto
10 King’s College Rd.
Toronto M5S 3G4, Canada
gpenn@cs.toronto.edu
Travis Choma
Cognitive Science Center Amsterdam
Sarphatistraat 104
1018 GV Amsterdam, Netherlands
travischoma@gmail.com
Abstract
We describe work in progress on using
quantitative methods to classify writing
systems according to Sproat’s (2000) clas-
sification grid using unannotated data. We
specifically propose two quantitative tests
for determining the type of phonography
in a writing system, and its degree of lo-
gography, respectively.
1 Background
If you understood all of the world’s languages, you
would still not be able to read many of the texts
that you find on the world wide web, because they
are written in non-Roman scripts that have been ar-
bitrarily encoded for electronic transmission in the
absence of an accepted standard. This very mod-
ern nuisance reflects a dilemma as ancient as writ-
ing itself: the association between a language as
it is spoken and the language as it is written has a
sort of internal logic to it that we can comprehend,
but the conventions are different in every individ-
ual case — even among languages that use the same
script, or between scripts used by the same language.
This conventional association between language and
script, called a writing system, is indeed reminis-
cent of the Saussurean conception of language itself,
a conventional association of meaning and sound,
upon which modern linguistic theory is based.
Despite linguists’ necessary reliance upon writ-
ing to present and preserve linguistic data, how-
ever, writing systems were a largely neglected cor-
ner of linguistics until the 1960s, when Gelb (1963)
presented the first classification of writing systems.
Now known as the Gelb teleology, this classification
viewed the variation we see among writing systems,
particularly in the size of linguistic “chunks” rep-
resented by an individual character or unit of writ-
ing (for simplicity, referred to here as a grapheme),
along a linear, evolutionary progression, beginning
with the pictographic forerunners of writing, pro-
ceeding through “primitive” writing systems such as
Chinese and Egyptian hieroglyphics, and culminat-
ing in alphabetic Greek and Latin.
While the linear and evolutionary aspects of
Gelb’s teleology have been rejected by more recent
work on the classification of writing systems, the ad-
mission that more than one dimension may be nec-
essary to characterize the world’s writing systems
has not come easily. The ongoing polemic between
Sampson (1985) and DeFrancis (1989), for exam-
ple, while addressing some very important issues in
the study of writing systems,1 has been confined ex-
clusively to a debate over which of several arboreal
classifications of writing is more adequate.
Sproat (2000)’s classification was the first multi-
dimensional one. While acknowledging that other
dimensions may exist, Sproat (2000) arranges writ-
ing systems along the two principal dimensions of
Type of Phonography and Amount of Logography,
both of which will be elaborated upon below. This
is the departure point for our present study.
Our goal is to identify quantitative methods that
1These include what, if anything, separates true writing sys-
tems from other more limited written forms of communication,
and the psychological reality of our classifications in the minds
of native readers.
117
Type of Phonography
Consonantal Polyconsonantal Alphabetic Core Syllabic Syllabic
W. Semitic English, PahawhHmong Linear B Modern Yi
Greek,
Korean,
Devanagari
←−
Am
ou
nt
of
Lo
go
gra
ph
y
Perso-Aramaic
Chinese
Egyptian Sumerian,
Mayan,
Japanese
Figure 1: Sproat’s writing system classification grid (Sproat, 2000, p. 142).
can assist in the classification of writing systems. On
the one hand, these methods would serve to verify
or refute proposals such as Sproat’s (2000, p. 142)
placement of several specific writing systems within
his grid (Figure 1) and to properly place additional
writing systems, but they could also be used, at least
corroboratively, to argue for the existence of more
appropriate or additional dimensions in such grids,
through the demonstration of a pattern being con-
sistently observed or violated by observed writing
systems. The holy grail in this area would be a tool
that could classify entirely unknown writing systems
to assist in attempts at archaeological decipherment,
but more realistic applications do exist, particularly
in the realm of managing on-line document collec-
tions in heterogeneous scripts or writing systems.
No previous work exactly addresses this topic.
None of the numerous descriptive accounts that cat-
alogue the world’s writing systems, culminating in
Daniels and Bright’s (1996) outstanding reference
on the subject, count as quantitative. The one com-
putational approach that at least claims to consider
archaeological decipherment (Knight and Yamada,
1999), curiously enough, assumes an alphabetic and
purely phonographic mapping of graphemes at the
outset, and applies an EM-style algorithm to what
is probably better described as an interesting varia-
tion on learning the “letter-to-sound” mappings that
one normally finds in text analysis for text-to-speech
synthesizers. The cryptographic work in the great
wars of the early 20th century applied statistical rea-
soning to military communications, although this
too is very different in character from deciphering
a naturally developed writing system.
2 Type of Phonography
Type of phonography, as it is expressed in Sproat’s
grid, is not a continuous dimension but a dis-
crete choice by graphemes among several differ-
ent phonographic encodings. These characterize
not only the size of the phonological “chunks” en-
coded by a single grapheme (progressing left-to-
right in Figure 1 roughly from small to large),
but also whether vowels are explicitly encoded
(poly/consonantal vs. the rest), and, in the case of
vocalic syllabaries, whether codas as well as onsets
are encoded (core syllabic vs. syllabic). While we
cannot yet discriminate between all of these phono-
graphic aspects (arguably, they are different dimen-
sions in that a writing system may select a value
from each one independently), size itself can be reli-
ably estimated from the number of graphemes in the
underlying script, or from this number in combina-
tion with the tails of grapheme distributions in repre-
sentative documents. Figure 2, for example, graphs
the frequencies of the grapheme types witnessed
among the first 500 grapheme tokens of one docu-
ment sampled from an on-line newspaper website in
each of 8 different writing systems plus an Egyp-
tian hieroglyphic document from an on-line reposi-
tory. From left to right, we see the alphabetic and
consonantal (small chunks) scripts, followed by the
polyconsonantal Egyptian hieroglyphics, followed
by core syllabic Japanese, and then syllabic Chinese.
Korean was classified near Japanese because its Uni-
code representation atomically encodes the multi-
segment syllabic complexes that characterize most
Hangul writing. A segmental encoding would ap-
pear closer to English.
3 Amount of Logography
Amount of logography is rather more difficult.
Roughly, logography is the capacity of a writing
system to associate the symbols of a script directly
118
with the meanings of specific words rather than in-
directly through their pronunciations. No one to
our knowledge has proposed any justification for
whether logography should be viewed continuously
or discretely. Sproat (2000) believes that it is contin-
uous, but acknowledges that this belief is more im-
pressionistic than factual. In addition, it appears, ac-
cording to Sproat’s (2000) discussion that amount or
degree of logography, whatever it is, says something
about the relative frequency with which graphemic
tokens are used semantically, rather than about the
properties of individual graphemes in isolation. En-
glish, for example, has a very low degree of lo-
gography, but it does have logographic graphemes
and graphemes that can be used in a logographic
aspect. These include numerals (with or without
phonographic complements as in “3rd,” which dis-
tinguishes “3” as “three” from “3” as “third”), dol-
lar signs, and arguably some common abbreviations
as “etc.” By contrast, type of phonography predicts
a property that holds of every individual grapheme
— with few exceptions (such as symbols for word-
initial vowels in CV syllabaries), graphemes in the
same writing system are marching to the same drum
in their phonographic dimension.
Another reason that amount of logography is dif-
ficult to measure is that it is not entirely indepen-
dent of the type of phonography. As the size of the
phonological units encoded by graphemes increases,
at some point a threshold is crossed wherein the
unit is about the size of a word or another meaning-
bearing unit, such as a bound morpheme. When
this happens, the distinction between phonographic
and logographic uses of such graphemes becomes
a far more intensional one than in alphabetic writ-
ing systems such as English, where the boundary is
quite clear. Egyptian hieroglyphics are well known
for their use of rebus signs, for example, in which
highly pictographic graphemes are used not for the
concepts denoted by the pictures, but for concepts
with words pronounced like the word for the de-
picted concept. There are very few writing systems
indeed where the size of the phonological unit is
word-sized and yet the writing system is still mostly
phonographic;2 it could be argued that the distinc-
2Modern Yi (Figure 1) is one such example, although the
history of Modern Yi is more akin to that of a planned language
than a naturally evolved semiotic system.
tion simply does not exist (see Section 4).
0
10
20
30
40
50
60
0 50 100 150 200 250
frequency
symbol
"Egyptian"
"English"
"Greek"
"Hebrew"
"Japanese"
"Korean"
"Mandarin"
"Spanish"
"Russian"
Figure 2: Grapheme distributions in 9 writing sys-
tems. The symbols are ordered by inverse frequency
to separate the heads of the distributions better. The
left-to-right order of the heads is as shown in the key.
Nevertheless, one can distinguish pervasive se-
mantical use from pervasive phonographic use. We
do not have access to electronically encoded Mod-
ern Yi text, so to demonstrate the principle, we will
use English text re-encoded so that each “grapheme”
in the new encoding represents three consecutive
graphemes (breaking at word boundaries) in the un-
derlying natural text. We call this trigraph English,
and it has no (intensional) logography. The princi-
ple is that, if graphemes are pervasively used in their
semantical respect, then they will “clump” seman-
tically just like words do. To measure this clump-
ing, we use sample correlation coefficients. Given
two random variables, X and Y , their correlation is
given by their covariance, normalized by their sam-
ple standard deviations:
corr(X,Y ) = cov(X,Y )s(X)·s(Y )
cov(X,Y ) = 1n−1Σ0≤i,j≤n(xi −µi)(yj −µj)
s(X) =
radicalBig
1
n−1Σ0≤i≤n(xi −µ)2
For our purposes, each grapheme type is treated as
a variable, and each document represents an obser-
vation. Each cell of the matrix of correlation co-
efficients then tells us the strength of the correla-
tion between two grapheme types. For trigraph En-
glish, part of the correlation matrix is shown in Fig-
ure 3. Part of the correlation matrix for Mandarin
119
Figure 3: Part of the trigraph-English correlation
matrix.
Chinese, which has a very high degree of logogra-
phy, is shown in Figure 4. For both of the plots in
Figure 4: Part of the Mandarin Chinese correlation
matrix.
our example, counts for 2500 grapheme types were
obtained from 1.63 million tokens of text (for En-
glish, trigraphed Brown corpus text, for Chinese,
GB5-encoded text from an on-line newspaper).
By adding the absolute values of the correla-
tions over these matrices (normalized for number of
graphemes), we obtain a measure of the extent of
the correlation. Pervasive semantic clumping, which
would be indicative of a high degree of logography,
corresponds to a small extent of correlation — in
other words the correlation is pinpointed at semanti-
cally related logograms, rather than smeared over se-
mantically orthogonal phonograms. In our example,
these sums were repeated for several 2500-type sam-
ples from among the approximately 35,000 types
in the trigraph English data, and the approximately
4,500 types in the Mandarin data. The average sum
for trigraph English was 302,750 whereas for Man-
darin Chinese it was 98,700. Visually, this differ-
ence is apparent in that the trigraph English matrix
is “brighter” than the Mandarin one. From this we
should conclude that Mandarin Chinese has a higher
degree of logography than trigraph English.
4 Conclusion
We have proposed methods for independently mea-
suring the type of phonography and degree of logog-
raphy from unannotated data as a means of classify-
ing writing systems. There is more to understand-
ing how a writing system works than these two di-
mensions. Crucially, the direction in which texts
should be read, the so-called macroscopic organi-
zation of typical documents, is just as important as
determining the functional characteristics of individ-
ual graphemes.
Our experiments with quantitative methods for
classification, furthermore, have led us to a new un-
derstanding of the differences between Sproat’s clas-
sification grid and earlier linear attempts. While we
do not accept Gelb’s teleological interpretation, we
conjecture that there is a linear variation in how in-
dividual writing systems behave, even if they can be
classified according to multiple dimensions. Mod-
ern Yi stands as a single, but questionable, coun-
terexample to this observation, and for it to be vis-
ible in Sproat’s grid (with writing systems arranged
along only the diagonal), one would need an objec-
tive and verifiable means of discriminating between
consonantal and vocalic scripts. This remains a topic
for future consideration.
References
P. Daniels and W. Bright. 1996. The World’s Writing
Systems. Oxford.
J. DeFrancis. 1989. Visible Speech: The Diverse One-
ness of Writing Systems. University of Hawaii.
I. Gelb. 1963. A Study of Writing. Chicago, 2nd ed.
K. Knight and K. Yamada. 1999. A computational ap-
proach to deciphering unknown scripts. In Proc. of
ACL Workshop on Unsupervised Learning in NLP.
G. Sampson. 1985. Writing Systems. Stanford.
R. Sproat. 2000. A Computational Theory of Writing
Systems. Cambridge University Press.
120
