Project DOC: Its Methodological Basis 1
William S-Y. Wang 
I. 
Dictionary on Computer, hereafter DOC, is part of an
overall effort to harness an on-line computer for phono- 
logical research. For certain problems the linguist finds 
it necessary to organize large amounts of data, or to per- 
form rather involved logical tasks -- such as checking out 
a body of rules with intricate ordering relations. In 
these situations a computer can be invaluable in that it 
forces the linguist to think through his problems with 
great precision and in that it can do certain jobs with a 
speed and accuracy not otherwise possible. 
The overt aim of DOC is to reconstruct the phono- 
logical histories of the major Chinese dialects. At a 
deeper level our interest is to find out more about how 
phonological structures change in general and the relation
between these changes and the synchronic systems they 
lead to. To achieve these objectives we must attempt to 
account for oceans of data (the regular and irregular 
developments of thousands of morphemes in dozens of dia- 
lects). The hypotheses we posit, i.e., the reconstructed 
forms and the associated rules, are likewise numerous and
complex. The project is further complicated by the tens 
of thousands of logographs which must be contended with, 
especially when we involve the rime dictionaries and the 
rime tables. These considerations lead naturally to the 
use of a computer in our work. 2
To construct significant hypotheses and to generalize
them beyond the confines of our data pool, to grasp the 
theoretical import of each discovery, clearly these are 
creative tasks that cannot be mechanized. Nonetheless a 
well-hewn tool, such as we hope to develop DOC into, can
do much to facilitate these creative
tasks.
Of the many language families in the world, Chinese 
offers an ideal laboratory within which to study phonolo- 
gical change for many reasons. Chief among these are two: 
(1) its unrivaled wealth of materials, and (2) its dis- 
tinctive phonology and orthography. 
(1) The earliest extant materials date back to ca. 
1500 B.C. in the form of oracle inscriptions. We have a 
virtual time depth of some three and one-half millennia
of literature. This literature includes not only such 
works as rime tables and rime dictionaries, but also 
extensive contributions from a tradition of philological 
scholarship that arose in the early Song and reached considerable
sophistication in the Qing period. Indeed the
view has been expressed that in China the methods of
scientific reasoning were primarily developed in the hands 
of the Qing philologists (as opposed to Europe where they 
originate in the physical sciences). Few language groups 
compare with Chinese with respect to this immense treasure 
of literature to work with. 
(2) By far the greatest bulk of the present knowledge 
of phonological change comes from investigations of Indo- 
European languages. It is not unlikely, then, that the 
present theories and methods are skewed in the direction 
of characteristics found in these languages. Studying a 
language family with a very different structure will help 
us balance this skewed perspective. Indeed Meillet must 
have had something like this in mind back in 1913 when, in 
discussing the comparative method, he wrote:
"Les rapprochements reçoivent des confirmations utiles
quand on peut constater que des concordances grammaticales
s'ajoutent à la concordance du son et du sens .... Les
langues qui, comme les langues indo-européennes...ont des
particularités grammaticales attachées à certains mots se
prêtent donc mieux à la démonstration de l'étymologie que
les langues où tous les mots se conforment aux mêmes règles
grammaticales. La difficulté qu'on éprouve à poser la
grammaire comparée de certaines langues, notamment en
Extrême-Orient, vient en partie de là." (p. 32).
A language of the Chinese type is distinct in that 
(a) it has no inflectional paradigms and no morphophonemic 
alternations to speak of, (b) it has a very simple syllabic 
structure, (c) it has tones, and (d) its orthography is 
logographic. These characteristics all have implications 
for research on phonological change. 
(a) Current views of diachronic phonology invariably 
emphasize the importance of paradigmatic analogy as one of 
the two major forces of phonological change (the other
major force being phonetic). The formalization of analogy 
may be in terms of proportionality in a structuralist 
framework, or in terms of rule simplification within the 
context of generative phonology. It would be of consi- 
derable theoretical interest to examine these views with 
respect to Chinese, which has virtually no paradigms. In 
particular we would want to investigate what are the 
mechanisms whereby a change diffuses lexically 3 in Chinese, 
where word classes are not related by morphophonemic 
alternations. An understanding of these mechanisms is 
crucial toward answering the question of whether phonolo- 
gical change is or is not phonetically actuated. 
(b) The simple syllabic structure of the morphemes 
and the even accentual structure of the sentences are also 
of special interest. Whereas many recurrent types of change
outside of Chinese involve the reduction of consonant 
clusters into geminates, or the breaking up of clusters by 
vowel epenthesis, or the reduction of syllabic elements due 
to stress shifts, these changes hardly occur at all in 
Chinese. The pervasive themes found in phonological 
structures like Chinese are palatalization, dento-labialization,
reduction of post-vocalic obstruents, and various
complex interplays between the segmental syllable and tones. 
Research on phonological change now suffers from a 
severe lack of a systematic catalog of carefully documented 
changes. Given that X has the reflex Y, we need to know whether
this change was induced within the system or was actuated by
another linguistic system; whether X went directly into Y or
passed through intermediate phonemic stages; and whether each
direct change was abrupt or gradual, phonetically and lexically.
Only when a sufficient fund of such information is
available can one successfully meet the challenge of pho- 
netic and other types of explanations, and only then can 
phonology make the exciting transition from a descriptive 
effort into an explanatory science. DOC is designed to
facilitate the gathering of this fund of information. 
The tones of Chinese have intrigued students of lan- 
guage for many years. They are of interest to phonological 
theory because they form a relatively self-contained subsystem
in the sound structure that can serve as a relatively
independent testing ground for the theory. The Chinese 
have had a categorical (though not physical) understanding 
of the tones of their language for well over 1500 years. 
During this period although the morpheme membership of the 
tonal categories has been relatively stable, the physical 
manifestations of the tones have undergone considerable 
changes. Some of these changes, it appears, are intricately
connected to segmental features. The investigation of 
these changes can contribute much toward our understanding 
of the inter-relationships between phonation and articu- 
lation. 
Lastly, the logographic system of writing has certain
unique implications. Since the logographs are much more 
distantly related to the sounds of the language than are 
the alphabets of the European languages, one can assume 
that they have exerted very little influence on the develop- 
ments of the various sound systems. In other words, we 
have fewer cases of historical confusion due to spelling 
pronunciation to contend with. By the same token the logo- 
graphs themselves have an amazing longevity, so that we can 
make many inferences about their phonetics for as far back 
as three thousand years ago. 
In sum, then, DOC is being developed as a powerful 
tool that will give phonological research a speed and precision
not otherwise attainable. Its creative use can
lead us to a deeper understanding of phonological structure
and change on a quantitative basis. At present this tool
is being developed within the context of Chinese, for the
reasons outlined in the foregoing paragraphs. We expect
that the methods we will have worked out will be largely 
applicable to the study of the phonology of any language 
group. Indeed, it is to be hoped that a field like 
Indo-European may one day be subjected to the rigors of 
this tool, and its results validated on a quantitative 
and objective basis. 
II. 
At present the primary source of data is the Hanyu 
Fangyin Zihui. 4 The 17 dialects reported in the Zihui are
now available on Linc tape. Outside of the Zihui, we have 
the complete Kan-on, Go-on, Sino-Korean and portions of 
the Zhongyuan Yinyun. 
For a variety of reasons, such as ease of tape- 
punching, ease of proof-reading and error-correction, and 
ease of writing of utility programs, the data are stored 
in several formats. These formats are related to each 
other by a set of supporting programs, as shown in Figure 1. 
The rectangles indicate data formats and the circles indi- 
cate supporting programs. 
The first stage in the data collection is the punching 
of paper tape on the teletype. A standard entry requires 
24 punches: 
1. space
2-5. telegraphic code (G) 
7. dialect identification (D) 
8-9. tone (T) 
10-13. initial (I) 
14-15. medial (M) 
16-20. nucleus (N) 
21. ending (E) 
22. literary (L) 
23. carriage return (~) 
24. line feed 
Punches 23 and 24 are discarded by the supporting program
RDPT (Read Paper Tape). After using RDPT the resulting
Dialect tape should have the structure illustrated in 
Figure 2. 
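
The 24-punch layout listed above can be illustrated with a minimal Python sketch. The field names, the sample entry, and the handling of position 6 (which the list does not describe) are assumptions made for illustration only; the sketch is not the RDPT program itself.

```python
from typing import Dict, List, Tuple

# Field positions (1-indexed, inclusive) as given in the punch list.
# Position 1 is a space. Position 6 is not described in the list, so it
# is kept under a placeholder name (an assumption of this sketch).
FIELDS: List[Tuple[str, int, int]] = [
    ("telegraphic_code", 2, 5),
    ("unspecified_6", 6, 6),
    ("dialect", 7, 7),
    ("tone", 8, 9),
    ("initial", 10, 13),
    ("medial", 14, 15),
    ("nucleus", 16, 20),
    ("ending", 21, 21),
    ("literary", 22, 22),
]


def parse_entry(punches: str) -> Dict[str, str]:
    """Split one 24-punch entry into named fields.

    Punches 23 and 24 (carriage return, line feed) are discarded,
    leaving the 22 half-words of a Dialect-tape entry.
    """
    if len(punches) != 24:
        raise ValueError("a standard entry requires 24 punches")
    body = punches[:22]
    return {name: body[start - 1:end].strip() for name, start, end in FIELDS}


# Hypothetical entry, invented for illustration (not taken from the Zihui).
SAMPLE = " " + "1234" + " " + "P" + "1 " + "K   " + "U " + "AI   " + " " + "L" + "\r\n"
```

Calling parse_entry(SAMPLE) returns a dictionary keyed by field name, which makes the fixed-column layout easy to proof-read.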
For proof-reading and error correcting, the Dialect 
tapes may be converted into File tapes in the format of 
LAP 6 D, as shown in Figure 3. In this format each entry
has 20 characters (or half-words), which is the maximum
number that can be displayed per line on the scope; each
entry is further followed by a CASE (Linc code 23) that is
disregarded by FILEDOC and a carriage return (Linc code 12) that
shifts the display to the next line. So each entry on the Linc
tape is still 22 characters or 11 words long, even though
only 20 characters are displayed. The spaces (Linc code 14)
are converted into periods (Linc code 20) for ease of reading.
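
The conversion from a Dialect-tape entry to a File entry can be sketched as follows. This is only an illustration: which 20 of the 22 half-words are displayed is an assumption here (positions 2-5 and 7-22), and the CASE and line-ending characters are represented by placeholder strings rather than by actual Linc codes.

```python
from typing import List

CASE = "<CASE>"  # placeholder for the CASE character (Linc code 23)
EOL = "<EOL>"    # placeholder for the line-ending character (Linc code 12)


def to_file_entry(dialect_entry: str) -> List[str]:
    """Turn a 22-half-word Dialect entry into a 22-character File entry."""
    if len(dialect_entry) != 22:
        raise ValueError("a Dialect-tape entry has 22 half-words")
    # Assumption: the displayed characters are positions 2-5 and 7-22.
    shown = dialect_entry[1:5] + dialect_entry[6:22]
    # Spaces become periods for ease of reading on the scope.
    shown = shown.replace(" ", ".")
    return list(shown) + [CASE, EOL]


def displayed_line(file_entry: List[str]) -> str:
    """Return the 20 characters that appear on one scope line."""
    return "".join(file_entry[:20])


def words_used(file_entry: List[str]) -> int:
    """Two characters (half-words) are packed into each Linc word."""
    return len(file_entry) // 2
```

Under these assumptions each File entry still occupies 22 characters, or 11 words, of which only the first 20 are displayed.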
LAP 6 D is modified from LAP 6 in two ways. The
left margin is moved 4 positions to the left so that each 
entry will exactly fit one line. Note space is allotted 
for files on both the systems tape and the file tape. 
According to my present understanding, each dialect tape 
is just about the size that a single file can accommodate. 
The uses of LAP 6 D files are obvious. We can use
the full set of meta commands for such files as well as 
the editing conveniences. 
The AC tape contains the Qie-Yun information for the
logographs as these are recorded in the Zihui, as shown in
Figure 4. The use of this tape makes it possible to add 
this information to any dialect tape by matching the tele- 
graphic codes via the ACCODJ~ program. The resultant AC- 
Dialect tape has the structure also shown in Figure 4. 
Notice that positions 17 through 32 correspond to 7 through
22 in the entry structure of the Dialect tape illustrated in
Figure 2. As shown in (D) in Figure 4, the number of dialect
forms for each entry can be easily increased.
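
The matching of AC information to dialect entries by telegraphic code amounts to a keyed join, which a short Python sketch can illustrate. The record shapes, the function name and the sample values are hypothetical; the sketch does not reproduce the matching program used in the project.

```python
from typing import Dict, Iterable, Tuple


def attach_ac_info(
    ac_entries: Iterable[Tuple[str, str]],
    dialect_entries: Iterable[Tuple[str, str, str]],
) -> Dict[str, dict]:
    """Join dialect forms to AC information on the telegraphic code.

    ac_entries:      (telegraphic_code, ac_info) pairs
    dialect_entries: (telegraphic_code, dialect, form) triples
    """
    ac_by_code = dict(ac_entries)
    merged: Dict[str, dict] = {}
    for code, dialect, form in dialect_entries:
        if code not in ac_by_code:
            continue  # no AC information recorded for this logograph
        record = merged.setdefault(code, {"ac": ac_by_code[code], "forms": {}})
        # Further dialects simply add further forms to the same entry.
        record["forms"][dialect] = form
    return merged


# Hypothetical data, invented for illustration.
AC = [("1234", "AC-info-a"), ("5678", "AC-info-b")]
DIALECT = [("1234", "P", "form1"), ("1234", "J", "form2"), ("9999", "P", "form3")]
```

Because the dialect forms hang off a single record per telegraphic code, adding another dialect only adds another key, which mirrors the way the number of dialect forms per entry can be increased.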
Finally, it will be useful for certain problems to 
have the result in the form of a set of logographs. At
present our computer can only give us the telegraphic 
code of the characters. With LOGOTAB we hope to be able 
to display the logographs on the scope and/or print them 
out by means of a special purpose computer. The 16 x 16
matrix representations of several thousand logographs have 
already been designed by Susumu Kuno's group at the 
Harvard Computation Laboratory, cf. Hayashi, et al., 
1968. LOGOTAB will be essentially a table look-up pro- 
gram that will translate telegraphic codes into those 
matrix representations. We are also giving thought to 
a logograph input device similar to that used by the
Harvard group. 
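
A table look-up of the kind planned for LOGOTAB might be sketched as below. The storage scheme (one 16-bit integer per matrix row), the code "0001" and the cross-shaped pattern are all assumptions for illustration; the pattern is not a real logograph and is not taken from the Harvard matrices.

```python
from typing import Dict, List, Optional

# Hypothetical table: telegraphic code -> 16 rows of 16 bits each.
GLYPHS: Dict[str, List[int]] = {
    "0001": [0x0180] * 7 + [0xFFFF] * 2 + [0x0180] * 7,
}


def lookup(code: str) -> Optional[List[int]]:
    """Return the 16 x 16 matrix for a telegraphic code, if present."""
    return GLYPHS.get(code)


def render(rows: List[int]) -> str:
    """Draw a matrix as text, '#' for a set bit and '.' for a clear bit."""
    lines = []
    for row in rows:
        # Bit 15 is the leftmost column, bit 0 the rightmost.
        lines.append("".join("#" if (row >> (15 - col)) & 1 else "." for col in range(16)))
    return "\n".join(lines)
```

Printing render(lookup("0001")) draws the placeholder pattern as 16 lines of 16 characters, standing in for a scope display.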
III. 
Although the ideas for DOC were first conceived in 
1966, it was only in late spring 1969 that the project began
to be operational. Several linguistic programs have been 
written for it, especially with respect to the seven 
Mandarin dialects. Figure 5 shows the output of a correlation
program that quantifies the development of the Ancient Chinese
tones into each of the Mandarin dialects. The points of
greatest interest are of course the cells which show
the small number of exceptional developments. Are they due
to borrowing from other dialects, to residue from changes
that have not yet completed their course, or to
the inception of new changes yet to be systematized?
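
The kind of tabulation such a correlation program performs can be sketched in Python. The tone labels, the counts and the threshold parameter max_share are hypothetical; the sketch only shows how sparsely filled cells, the candidates for exceptional developments, might be singled out.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def tone_table(records: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Cross-tabulate Ancient Chinese tone against modern dialect tone."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for ac_tone, dialect_tone in records:
        table[ac_tone][dialect_tone] += 1
    return table


def exceptional_cells(
    table: Dict[str, Counter], max_share: float = 0.1
) -> List[Tuple[str, str, int]]:
    """List cells holding at most max_share of their Ancient Chinese row."""
    cells = []
    for ac_tone, row in table.items():
        total = sum(row.values())
        for dialect_tone, count in row.items():
            if count / total <= max_share:
                cells.append((ac_tone, dialect_tone, count))
    return sorted(cells)


# Hypothetical morpheme records for one dialect, invented for illustration.
RECORDS = [("I", "1")] * 18 + [("I", "2")] * 1 + [("II", "3")] * 10 + [("II", "1")] * 10
```

With these invented records the single morpheme of category I that surfaces with tone 2 is flagged, while the even split of category II is not, since neither of its cells is small relative to the row.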
As the data pool became richer and richer with the
addition of each new dialect or rime dictionary, it became
increasingly obvious that our little laboratory Linc
would not be able to cope with all the problems efficiently.
Since the beginning of the summer, Tom McGuire
of the Phonology Laboratory has helped us establish a 
remote terminal that connects to the CDC 6400 in the 
University Computer Center. Some of our materials have
already been converted into magnetic tape that is compatible
with that computer. An example of the new format
of DOC is shown in Figure 6. 

[Figure 1: Data formats (rectangles) and supporting programs (circles). Diagram not legible in the source.]

(A) Entry Structure 
Each entry has 22 half-words, as follows: 
1    6    16-20    22
(B) Tape Structure 
Each block on the Linc tape contains 23 entries, 
with the last three words filled by 5555. 
BN     Address    Content
000    000-012    Entry 1
       013-025    Entry 2
       026-040    Entry 3
       ...        ...
       362-374    Entry 23
       375-377    5555
Figure 2: Dialect Tape. 
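
The addresses in Figure 2 follow from simple octal arithmetic: 23 entries of 11 words each take 253 words, leaving three filler words at the end of the block. A minimal Python check, assuming a block of 256 words (addresses 000 through 377 octal), is given here; the function names are illustrative.

```python
WORDS_PER_ENTRY = 11     # 22 half-words, two per word
ENTRIES_PER_BLOCK = 23
WORDS_PER_BLOCK = 0o400  # assumption: 256 words, addresses 000-377 octal


def entry_address_range(n: int) -> str:
    """Octal word addresses occupied by the n-th entry (1-indexed) of a block."""
    if not 1 <= n <= ENTRIES_PER_BLOCK:
        raise ValueError("a block holds entries 1 through 23")
    start = (n - 1) * WORDS_PER_ENTRY
    end = start + WORDS_PER_ENTRY - 1
    return f"{start:03o}-{end:03o}"


def filler_range() -> str:
    """Octal addresses of the words left over after the last entry."""
    start = ENTRIES_PER_BLOCK * WORDS_PER_ENTRY
    return f"{start:03o}-{WORDS_PER_BLOCK - 1:03o}"
```

Here entry_address_range(3) gives 026-040 and filler_range() gives 375-377, in agreement with the table in Figure 2.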
(A) CANTON: structure of a single file (23 entries per
block)
BN     Address    Content
000    000        2065
       001        5712
       002-014    Entry 1
       015-027    Entry 2
       ...        ...
       364-376    Entry 23
       377        5555
001    ...        ...
...
n      000        5555
       001        5555
       002-014    Entry 1
       015-027    Entry 2
       ...        ...
       364-376    last entry
       377        7777
(B) Entry Structure
G | G | G | G | D | T | T | I | I | I | I | M | M | N | N | N | N | N | E | L | Case | R
Figure 3: DOC File.
(A) AC Tape: entry structure
Position:  1-5        6-7    8      9    10   11-12   13-16
Content:   G G G G    she    h/k    D    T    Rime    Initial
(B) AC Tape:
each entry takes 20 (octal) words
BN     Address    Content
000    000-017    Entry 1
       020-037    Entry 2
       ...        ...
(C) AC-Dialect Tape: entry structure
Position:  1-16               17-32
Content:   Telecode and AC    D T T I I I I M M N N N N N E L
(D) AC-Dialect Tape:
BN     Address    Content
000    000-007    Telecode and AC
       010-017    D-1
       020-027    D-2
       ...        ...
                  D-n
(the rows from Telecode and AC through D-n together make up Entry 1)
Figure 4: AC Tape and AC-Dialect Tape

[Figure 5: Development of the Ancient Chinese tones into each of the seven Mandarin dialects, as tabulated by the correlation program. Table not legible in the source.]

[Figures 6A and 6B: Sample of the new format of DOC for the CDC 6400. The listing gives forms for the dialects PEKING, JI-NAN, XI-AN, TAI-YUAN, HAN-KOU, CHENG-DU and YANG-ZHOU. Listing not legible in the source.]

1. The work reported here is supported in part by grants 
from the National Science Foundation and the American 
Council of Learned Societies. 
2. See Lyovin (1968) for a more detailed description of 
the beginnings of Project DOC. 
3. The hypothesis of lexical diffusion, i.e., that phonological
change operates gradually across the lexicon, is admittedly
controversial. The hypothesis would not be
acceptable to theorists in the Neogrammarian tradition, 
e.g., L. Bloomfield. However, as I argue in Wang (1969), 
there are good reasons for thinking that this is indeed 
how changes are implemented within narrow time spans, 
i.e., morpheme by morpheme rather than phoneme by pho- 
neme. The proof of the hypothesis requires large scale 
studies of the sort exemplified by DOC. 
4. The Zihui has many drawbacks, as pointed out in
Lyovin's review (1969). However, it is obviously the
best set of core materials to start the project on.

References 

Dong, Tong-he. 1953. Zhongguo Yuyinshi. (History of 
Chinese Phonetics). Taiwan. 

Dougherty, Ching-yi, Sidney Lamb and Samuel Martin. 1963.
Chinese character indexes. 5 vols. Berkeley: University
of California Press.

Hayashi, Hideyuki, Sheila Duncan and Susumu Kuno. 1968.
Graphical input/output of nonstandard characters.
Communications of the Association for Computing
Machinery 11.9.613-8.

Lyovin, Anatole. 1968. A Chinese dialect dictionary on
computer: progress report. POLA 7. Berkeley.

Lyovin, Anatole. 1969. Review of Hanyu Fangyin Zihui.
Language 45.3.

Meillet, Antoine. 1913. Sur la méthode de la grammaire
comparée. Reprinted in his Linguistique Historique
et Linguistique Générale. (Paris, 1965).

Peking University. 1962. Hanyu Fangyin Zihui. (Phonetic 
Dictionary of Chinese Dialects). Peking. 

Wang, William S-Y. 1967. Phonological features of tone.
International Journal of American Linguistics
33.93-105.

Wang, William S-Y. 1968. The many uses of F0. POLA 8.
Berkeley.

Wang, William S-Y. 1969. Competing changes as a cause of
residue. Language 45.1.9-25.

Wang, William S-Y. and Anatole Lyovin. 1969. Chinese
Linguistics Bibliography on Computer. In press with
Cambridge University Press.
