JIM MATHIAS
COOPERATIVE FILE IMPROVEMENT AND USE
OF A COMPUTER-BASED CHINESE/ENGLISH DICTIONARY
The CETA (Chinese-English Translation Assistance) Group is an in- 
dependent organization formed to coordinate development of Chinese 
to English translation aids and data analysis techniques. It began as an 
ad hoc body of individuals from State, Commerce, Labor, Office of 
Education, Defense, Intelligence, Voice of America, Foreign Service 
Institute, Defense Language Institute, National Science Foundation and 
Library of Congress. Extension of interest into the scholarly commu- 
nity has broadened academic dimensions to include 43 US and inter- 
national universities. CETA is developing a computer-based Chinese- 
English dictionary of current standard terms. It is also exploring tan- 
gential topics such as computer processing of Chinese research data, 
machine translation, and use of the CETA Dictionary file in an on-line 
computer aid system. 
Academic research and development of computer operations in
United States universities has led to the capability of generating
Chinese characters by computer. Using this capability, CETA printed a
90,000 term dictionary file of Chinese-English entries and has developed
a cooperative international process for refining and enriching the file.
This process is called the CETA File Improvement System. It is founded
on government/academic/private cooperation and is designed to edit
existing material and add new material. The improvement system is based
on collective improvement of the file through a wide sharing of linguistic
tasks and the use of computers to store the data and process changes.
Thus far, thirty-seven government and forty-three academic linguists
and language specialists have committed themselves to review and
improvement of the file, in return for which they receive the printed
copy of the dictionary plus change pages as they are generated. Over
51,000 suggested improvements have been submitted and evaluated 
and are awaiting update. The File Improvement System proceeds by 
cycles in which progressively more rigid standards of review are applied. 
The file will be reprinted in three to five year cycles with change pages
issued during interim periods so that participants can share maximum 
benefits at all times. 
When CETA examined the problem of producing a dictionary, it 
was concluded that significant results could be achieved only by sharing 
the many tasks involved. It was a forbidding problem; however, the
potential for improving dictionaries without waiting 20 years for new
editions was a meaningful incentive. The CETA Group issued a hard
copy of the 90,000 term Chinese-English listing called The CETA
Computer-Based Chinese-English Dictionary. It was produced as a
"living" file that could be changed constantly. It was printed by
computer, the principal advantages of which were the ability to print
Chinese characters without typesetting and economy of effort in
manipulating the data. The computer could sort in different sequences,
make corrections or additions at will, extract particular subsets, and produce
a hard copy image of file materials. In short, it was possible to take
the present computer-produced manuscript and give parts of it to
volunteers to review and correct or add information. It was also possible
to develop methods for the reviewer to prepare changes easily and for
CETA to evaluate them and then update the manuscript file.
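The file operations mentioned here (sorting in different sequences and extracting particular subsets) can be illustrated with a minimal Python sketch. It is not CETA's software: the `Entry` record layout, the field names, and the helper functions are hypothetical, and the three sample entries simply reuse the telecodes, romanizations, and glosses that appear in Fig. 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical record layout; the real CETA file format is not described."""
    telecodes: tuple  # one four-digit telecode per character
    pinyin: str       # Pin Yin romanization (tone marks omitted here)
    gloss: str        # English gloss


# Sample entries taken from the values shown in Fig. 2.
FILE = [
    Entry(("2693", "3111", "1714", "2348"), "CHAI YOU YIN QING", "Diesel Engine"),
    Entry(("1714", "2348"), "YIN QING", "Engine"),
    Entry(("2693", "3111"), "CHAI YOU", "Diesel Oil, Fuel in General"),
]


def sort_by_telecode(entries):
    """One sort sequence: order entries by their telecode strings."""
    return sorted(entries, key=lambda e: e.telecodes)


def sort_by_romanization(entries):
    """Another sort sequence: order entries alphabetically by romanization."""
    return sorted(entries, key=lambda e: e.pinyin)


def extract_subset(entries, first_telecode):
    """Extract the entries whose first character has the given telecode."""
    return [e for e in entries if e.telecodes[0] == first_telecode]


if __name__ == "__main__":
    for entry in sort_by_romanization(FILE):
        print(entry.pinyin, "|", entry.gloss)
```

Because the same records can be re-sorted or filtered on demand, a reviewer could be handed only the part of the manuscript relevant to them, which is the property the text relies on.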
The first cycle of file review for gross error and duplication has 
been completed. The reviewers were given a set of instructions to guide 
them in review of the dictionary material and the preparation of changes 
or additions. The steps required to process improvements to the CETA 
Dictionary are, briefly stated, receipt of suggestions for change or ad- 
dition, preparation for keypunch, computer generation of a prooflist 
showing original as well as changed entries, manual review of the proof- 
list, computer selection of approved changes, and update of the com- 
puter dictionary file. The application of these steps assures that all 
changes to the master file will be examined at least once and question- 
able changes can be held for later review to avoid delaying update actions. 
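The processing steps just listed can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Change` record, the status values, the string keys, and both functions are hypothetical names, the keypunch step is omitted, and manual review is represented simply by a status that a reviewer is assumed to have set. The sample glosses marked as invented are not from the CETA file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    """A suggested change (hypothetical layout)."""
    key: str                   # identifies the entry, here a telecode string
    new_gloss: Optional[str]   # None means the entry should be deleted
    status: str = "pending"    # set during manual review: "approved" or "held"


def make_prooflist(master, changes):
    """Pair each suggested change with the original entry it would affect."""
    return [(c.key, master.get(c.key), c.new_gloss) for c in changes]


def apply_approved(master, changes):
    """Apply approved changes to a copy of the file; return it with the held changes."""
    updated = dict(master)
    held = []
    for c in changes:
        if c.status != "approved":
            held.append(c)               # questionable: keep for a later cycle
        elif c.new_gloss is None:
            updated.pop(c.key, None)     # approved deletion
        else:
            updated[c.key] = c.new_gloss  # approved correction or addition
    return updated, held


if __name__ == "__main__":
    # "Motor" is an invented original gloss used only to show a correction.
    master = {"2693 3111": "Diesel Oil", "1714 2348": "Motor"}
    changes = [
        Change("1714 2348", "Engine", status="approved"),
        Change("2693 3111 1714 2348", "Diesel Engine", status="approved"),
        Change("2693 3111", None, status="held"),
    ]
    for key, original, proposed in make_prooflist(master, changes):
        print(key, "|", original, "->", proposed)
    updated, held = apply_approved(master, changes)
    print(len(updated), "entries after update;", len(held), "change held")
```

Keeping held changes in a separate list mirrors the point made in the text: a questionable change does not block the update of everything else.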
As a mechanism, the improvement system is quite smooth, and under
ideal conditions it is possible to change the computer file in a matter
of minutes. Under the less than ideal conditions that usually prevail,
it is still possible to update and provide current information within a
few months rather than the usual 10-year dictionary building and
20-year reissue cycles.
Currently CETA has received and prepared for update a total of
51,000 changes to the 90,000 term file. Since there are more additions
than deletions, the new file will be larger by a few percent. More
important, the greatest errors will have been removed and the file will be
prepared for the next cycle, which will emphasize further enrichment
of the lexical content, addition of grammatical information, incorporation
of restrictive and stylistic labels, and identification of agglutinated
phrases. The second printing of the dictionary manuscript will include Pin
Yin romanization with tone and telecode numbers as well as the
customary English gloss and source information. The character vector file
has been significantly updated so that it can now draw approximately
10,500 characters. It will be continually updated through the dictionary
review cycles. See Figure 1, Computer Printed Chinese Characters.
Fig. 1. Computer Printed Chinese Characters.
The file will also be available as the core of an on-line computer
aid. Prototype computer aid functions have been developed which
illustrate the ways in which a computer file can be used in an interactive
mode to help a translator. They accept input by telecode and romanization;
graphic input is simulated. A cathode ray tube is used to display
Chinese characters, romanizations (Pin Yin, Wade-Giles, Yale), the
radical number plus residual stroke count, the English meaning for the
string, and the meaning for segments of the string. Also developed is an
automatic segmenting function, which breaks a string of characters into
single characters and into segments of continuous characters (for
synthesis of meaning from component parts).
See Figure 2, Graphic Display.
        CHARACTER SEQUENCE
        (1)      (2)      (3)      (4)
STC     2693     3111     1714     2348
P       1CHAI    2YOU     3YIN     2QING
W       CH'AI    YU       YIN      CH'ING
Y       CHAI     YOU      YIN      CHING
R       75.5     85.5     57.1     64.13
T       9        8        4        17

SEGMENT
IDENTIFIER    ENGLISH MEANING
1-4           Diesel Engine
2-4           --
4             To Raise (W)
              Yin... Engine
1-3
2-3           -
3             Lead, Draw, Attract,
1-2           Diesel Oil, Fuel in General
              Oil, Grease
              Fire Wood
Fig. 2. Graphic display.
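The segmenting operation can be sketched as follows: enumerate every run of adjacent characters in the input string and look each one up. This is a minimal illustration, not the prototype's actual algorithm. The function names, the tuple-keyed dictionary, and the ordering of the segments are assumptions. The glosses come from Fig. 2, but several segment identifiers are illegible there, so the assignment of "Engine" to segment 3-4, "Oil, Grease" to segment 2, and "Fire Wood" to segment 1 is inferred rather than read from the figure.

```python
def segments(n):
    """Yield 1-based (start, end) pairs for every contiguous segment of n characters."""
    for end in range(n, 0, -1):          # assumed ordering: latest end position first
        for start in range(1, end + 1):
            yield start, end


def segment_meanings(telecodes, dictionary):
    """Return (segment identifier, meaning) rows; '--' marks segments with no entry."""
    rows = []
    for start, end in segments(len(telecodes)):
        key = tuple(telecodes[start - 1:end])
        ident = str(start) if start == end else f"{start}-{end}"
        rows.append((ident, dictionary.get(key, "--")))
    return rows


# Glosses from Fig. 2; keys for "Engine", "Oil, Grease" and "Fire Wood" are inferred.
DICTIONARY = {
    ("2693", "3111", "1714", "2348"): "Diesel Engine",
    ("1714", "2348"): "Engine",
    ("2693", "3111"): "Diesel Oil, Fuel in General",
    ("2693",): "Fire Wood",
    ("3111",): "Oil, Grease",
    ("1714",): "Lead, Draw, Attract",
    ("2348",): "To Raise",
}

SAMPLE = ["2693", "3111", "1714", "2348"]

if __name__ == "__main__":
    for ident, meaning in segment_meanings(SAMPLE, DICTIONARY):
        print(f"{ident:<5}{meaning}")
```

A string of n characters has n(n+1)/2 such segments, ten for the four-character example, so exhaustive lookup is cheap for strings of this length.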
CETA hopes to test this system further using a refined data base 
for evaluation of its potential for shared access by a wide government 
and academic community. 
CETA started with a poor dictionary, but it was machine readable. There
are a lot of good dictionaries that are not machine readable and are,
therefore, difficult to change or consolidate. CETA is putting these things
together by use of a wholly unique method: the voluntary cooperation
of interested government and academic scholars and language specialists.
The rewards to participants are: 1) awareness of contribution to a
worthwhile effort, 2) an up-to-date hard copy of the CETA computer file
containing all the latest contributions by all participants, and 3) use of
the CETA Secretariat Office to search out and exchange information of
common concern. The only cost is willingness to share in the work
of CETA.

