A Chinese Corpus for Linguistic Research 
Chu-Ren Huang 
hschuren @twims886.bitnet 
Institute of History and Philology 
Academia Sinica 
Kch-jiann Chen 
kchcn@iis,sinica.edu.tw 
Institute of lnlormation Science 
Academia Sinica 
This is a project note on the first stage of the con- 
struction of a comprehensive corpus of both Modern 
and Classical Chinese. The corpus is built with the dual 
aim of serving as the central database for Chinese lan- 
guage processing and for supporting iJl~lepth linguistic 
research in Mandarin Chinese. 
I Background 
The project being reported on is a sub-project of the 
on-going research of the CKIP (Chinese Knowledge 
Information Processing) Group. This group was 
founded by Hsieh Ching-chun in 1986 and is currently 
directed by Keh- jiann Chen and Chu-Ren Huang 
(Chang et al. 1989, Hsieh et al. 1989, Chen et al. 1991). 
The CKIP research is divided into three sub-projects 
according to their goals: 1) An On-line Lexicon for 
NLP, 2) A Corpus, and 3) A Parser. The suit-projects 
are designed to create a self-sufficient and mutual 
supporting environment for Chinese NLI: The corpus 
will be the database supporting the electronic lexicon, 
while the lexicon will be the basic reterence for auto- 
matically tagging the corpus. Moreover, both the cor- 
pus and the lexicon will support the parser. Our parser 
adopts the unification-based formalism of ICG (infor- 
mation-based Case Grammar, Chen and Huang 199{)), 
which encodes all grammatical information on each 
lexical entry. At this point in time, the lexicon consists 
of a fully automated earlier version with limited gram- 
matical information and an updated version with com- 
plete grammatical version for parsing, qtaere are more 
than 40 thousand entries in the completed electronic 
dictionary, which is available on-line in "lttiwan and al- 
lows basic pattern-matching searches. There is also a 
PC version with reduced search capacity available from 
the Industrial 'l~echnology Research Institute, the pri- 
mary funding agency of this pilot dictionary project. 
The updated version now contains roughly 30 thousand 
entries with complete grammatical information and 
another60 thousand with basic grammatical categories. 
Manipulation of lexical information such as addition of 
entries and specification of detailed grammatical infor- 
mation with respect to each attribute is maintained on- 
line (Jian and Chen 1991). The completed 90 thousand 
word lexicon will be our core lexicon fl)r pars- ing. The 
hierarchical arrangement will enable us to efficiently 
add new entries and create special lexicons for sub~lo- 
mains. 
Many modules of the parser arc now under con- 
struction and some of them have been completed, such 
as a analyzer-generator for quantifier-measure com- 
pouods (Mo et al. 1901) and a look-up segmentation 
program (Chen and IAu 1992). Both the compound 
analyzer-generator and the segmentation program 
perform well. The recognition rate fl)r the segmenta- 
tion program, excluding proper names and derived 
words, is 99.77%. Since neither proper names nor word 
breaks are marked in the writing system, they will have 
to be dealt with separate modules using both morpho- 
logical and heuristic information. The analyzer-gener- 
ator has the perfect recognition rate of 100% while al- 
lowing over-generation and ambiguity, which also of- 
ten involve proper names and derived words. 
The corpus portion of the project is mainly funded 
through a grant from the Chiang Ching-kuo Founda- 
tion lot International Scholarly Exchanges to Acade- 
mia Sinica and the University of London, and is sup- 
ported by matching funds from Academia Sinica. I~aul 
Thompson is the Co-Principle hwestigator at tim Uni- 
versity of London. The COBUIt,D project of the Uni- 
versity of Birmingham also offers technical consulta- 
tion. Of the 12 linguists working full time in the CKIP 
group, five are assigned to the corpus project in addi- 
tion to three programmers. 
II. Sources and Size of the Corpus 
A strategical decision was made early in our project 
to build separate corpora for both classical and modern 
Chinese. This is not only because the same lexical com- 
puting techniques can be used lor both modern and 
classical Chinese (data are represented as Chinese 
characters), but also because we think it will be interest- 
ing to compare the result of an open aud artificially bal- 
anced corpus (the modern one) with a completed but 
randomly balanced corpus (the classical one). In addi- 
tion, the impact on linguistic research will probably be 
more immediate and obvious in diachronical studies 
AcrEs DE COLING-92. NAN~IdS, 23-28 AOt~,r 1992 1 2 I 4 PROC. OF COLING-92. NANTES, AUG. 23-28, 1992 
titan in synchronical studies. Hence, tile classical cot'- 
pus is defined in terms of time (roughly all existing text 
written before 0 B.C.) and the modmn corpus is defined 
by size (7 -l(I million words tagged text in two years). 
This also sct the direction of future development: the 
classical corpora will develop clironologically through 
the time while the modcrn corpora will cxpand both in 
size and in domain--specific corl~)ra. 
The following texts are acquired ill the first stage: 
l) Modern Chinese 
A. 10 million characters of texts (word breaks are m)t 
used in Chinese writing systems, but the average word 
length is a little nlore than two characters) from three 
months of the l.iberal Thnes, a daily newspaper. B. 10 
million characters fi'om the China Tiates group. 
Agreement was reached ha October, 1991 with thc Chi- 
na Times group, one of the two ucwspaper giants in'lhi- 
wall, to provide daily on-line text to our project. Thus 
we will have a dependable and unlimited source of data 
(up to one million characters a day). The group pu-. 
blishes three newspapers, tleveral nlagazines, and has a 
separate book-publishing subsidiary. C. Al~mt one 
million words of data previously input by the CKIP 
project. This includes 10 articles front a magazine and 
explanatkm text ft~la a dictionary. D. 30 thoummd 
words of tran.~libed spoken text. This section will be 
input in ltF)2. 
Newspapers are tile nlainstay of our data source fur 
the otwious reason that the newspaper texts are on 
line. But it is also a (xmvenient text containing many 
vmieties of writing, including spoken couver~tion (in~ 
terviews), commentaries, translations (foreign dis- 
patches), and all genres of literary styles (Chinese 
newspapers arc different from the Westeln ones in hav- 
ing daily literature suppleinentaries, and are the most 
important venue for literary publications), llowevev, 
because of difficul-- tics ill converting different charac- 
tcr-ca~ling and mark-up systems, we can only incorpo- 
rate news from China Tintes to otn corpus so far. tither 
types of texts should be available by fall 1092. 
2) The Classical Corpns 
Roughly three million characters of Chinese written 
prose have smvived frmn the years before Christ was 
born (i.e. from all tile periods up to the Western Hart 
Dynasty). A corpus of roughly 1.5 million characters of 
the text is now available in machine readable forms 
from a previous prolect in Academia Sinica, the re- 
maining 1.5 million characters arc now being keyed in 
and will be on-line by mid-1992. 
111. Functions and Applications 
It is out" hmg term goals for the Corpus to have the 
automatic dictionary compiling abilities, fashioned af- 
tcr COBUILI) (Sinclair 1987, iN)l), adapted for both 
parsing and hard- copy publishing. In the flirt stage, 
however, we concentrate on developing tools for lin- 
guistic research. We will describe the search functions 
that we have developed so far. Most of the functions 
are character-based now and call be upgraded to word- 
based once we incorporate our tagger and word seg- 
mentation naKlule. 
The search utility in our corpus is basically a KWIC 
(key wtlrd-in~m-- text) search based on Chinese char- 
acters. Our search progranl allows the linguists to spec- 
ify both the size of right-and left-hallO side cxmtext 
sh(lwn as the search resnlt. There is also a randomizer 
to chm)se a more manageable size of data if the search 
result is too big (more than a thou- sand citations). The 
ordering of data is done in the traditional way in terms 
of nu tubers of strokes in tile character in the immediate 
context. This is not the ideal methtxl but may be the 
best available before we can deveh)p a system which is 
ln)ttt linguistically and hem-istically more ~phisticated. 
A romanizatiml based articling s~ystem would be even 
less desirable fl)r the lack of unifl~rm (and familiar) ro- 
manization system ill 'lliiwan and h~r the failure to dis- 
amhiguiate homonyms. 
In addition to the basic s(~irell procedures, the fol- 
lowing customized search commands are added to 
serve tile need of linguistic research. 
1) # < kw > # : a context where a key word is both pre- 
ceded and followed by the mime word. 
2) AA : leduplicatiml (one character) 
3) AllAll : reduplicatiml (two characters) 
4) < kwl > * < kw2 > : crax~;urreuces of two key words 
in a (possibly) di~ontinuous context. 
5) < kwl >/I < kw2 > : context where a key word (kw2) 
is 'left- disassociatcd' fronl another key word (kwl), i.e. 
where the sccnnd key word does not (recur in the left- 
context of the first key word. 
6) < kwl >/r < kw2 > : context where a key word (kw2) 
is 'right-disassociated' from another key word (kwl), 
i.e. where tile secured key word does not occur in the 
right-context of the first key word. 
t ;ommands 1)-3) are helpful tools in studying mor- 
phological rules and identifying morphological con- 
stmctions for Chinese. Since Chinese writing systems 
do not include word-breaks, and since no lexicon can 
ever offer a complete list of words, word segmentatkm 
is non-trivial in Chinese Language Processing (Chert 
and Liu 1992). Identifying and util~ing morphological 
information is therefore essential Ix)th in lexical com- 
puting and in natural language processing. 
ACnT.S Dr. COLING-92, NANTES, 23-28 Ao(rr 1992 1 2 1 5 PROC. OF COLING-92, NAN-rES, AUG. 23-28, 1992 
Command 4) is a handy tool to discover cooccur- 
rence restrictions and their semantic consequences. 
Commands 5) and 6) are used to eliminate ambiguity 
and to cut down the size of search results. 
It can be noted that since our tagger is not running 
yet and since the Chinese running text seldom defines a 
sentence by a period (a whole paragraph often contains 
only one period and many commas), the above com- 
mands use number of characters rather than sentence 
markers to delimit search domains. 
Concordance programs are also being developed for 
our project. The current version runs with our classical 
corpus. It is able to show both the text source of each 
concordance item as well as page numbers from the 
printed version for easy reference. This is originally de- 
veloped on the HP workstation, the machine we now 
use, but a version that runs on IBM PC 486 with either 
SUN Unix or CCL Unix is also available now. 
Another research tool that we developed deal with 
frequency counts of characters and words and statistical 
packages to compare linguistics and other textual fea- 
tures of the corpora. Like many other modules of our 
project, this module has been developed and com- 
pleted mdependently. It has been tested on the un- 
tagged on-line classical Chinese database of the "l~,en- 
ty-five Dynastic Histories (Hsieh 1991). This module is 
ready to be incorporated into the system. 
IV. Accomplishments and Future Developments 
In this first stage of the development of a Chinese 
corpus for NLP and for linguistic research, we have 
achieved the primary goals of acquiring the core text 
data, and establishing a mutual-supporting environ- 
ment between the lexical computing research on the 
corpus and the computational linguistics research on 
NLE Our systems are developed on HP workstations 
under Unix. The system should be portable to any Unix 
machine with compatible Chinese solution. For in- 
stance, we are porting each of our modules to a IBM PC 
486 running Unix for the use of our collaborator in Lon- 
don. 
In this preliminary stage, the most encouraging sign 
is that our human linguists have established productive 
interaction with the corpus. Human expertise helped 
to design search utilities pertinent to linguistic research 
and corpora provided both a convenient source of lin- 
guistic facts and a solid basis for deducing useful gener- 
alizations. With the basic KWIC search utilities, the 
CKIP project has finished several linguistic studies with 
the help of the corpus. Take Mo et al. (1991) for exam- 
pie, the quantifier-measure rule that our analyzer- 
generator uses is based on generalizations extracted 
from the corpus, and the program itsetf is tested on 
texts randomly selected from the corpus. Other lin- 
guistic works based on the corpus include Hong, Huang 
and Chen (1991) on morphological rules fl)r Chinese, 
and Mo, Huang and Chen on serial verb construction in 
Chinese (1991). This corpus is currently used by the 
CKIP project in their ac- counts of A-not-A questions, 
of resultative compounding, of nominalization, and of 
various reduplications. 
Incorporation of an automatic tagger and extraction 
of grammatical information from the tagged corpora is 
the most important immediate future goals of this proj- 
ect. This, of course, depends crucially on a tagging sys- 
tem with theoretically well-structured attributes. A da- 
tabase for attributes is being developed with the 1N- 
FORMIX software. And we will folh)w the TEl guide- 
lines (Sperber-McQueen and Burnard 1990) whenever 
possible. Our word segmentation program now doubles 
as a category-tagger. But this can only be viewed as a 
research aid for the linguists to detect categorical ambi- 
guities and unlisted words. We also expect the on-goo 
ing linguistic research to identify more search functions 
and refine the existing utilities. Direct extraction of 
dictionaries and grammars should I)c feasible in five 
years. 
Aeknowlegements: 
Research of this project was partially funded by the 
Chiang Ching-kuo Foundation for International Schol- 
arly Exchanges. The first author wants to express his 
gratitude to the Center for the Study of Language and 
Information, Stanford University for thc surport he re- 
ceived when he visited there, during which a draft of 
this note was completed. Responsibility of any errors 
remain ours alone. 

Bibliography 
Chang, LL. and CKIE 1989. The Categorical Analysis 
of Mandarin Chinese (Revised edition, in Chinese). 
Taipei: Academia Sinica. 
Chen, K.-J. and CKIE 1991. 'llle Chinese Knowledge 
and Information Project and Chinese Electronic Dic- 
tionary (in Chinese). Paper presented at the Joint Chi- 
nese-Japan Symposium on Information Processing. 
Taipei. 
Chen, K.-J. and C.-R. Huang. 1990. Information- 
based Case Grammar. COLING-90. Vol. ll. 54-59. 
Garside, R., G. Leech, and G. Sampson. Eds. 1987. 
The Computational Analysis of English. London: 
Longman. 
Hsieh, C.-C. 1991. Statistics of the "li~xt of the qwenty- 
Five Dynastic Histories. Pal~er presented at ROCL- 
ING IV. 
Jian, L.-E and K.-\]. Chert. 1990. The Hierarchical Re- 
presentation and Management of Lexical Information. 
Proceedings of the Third R.O.C. Computational Lin- 
guistics Conference (ROCLING III). pp. 295-310. 
Chert, K.-J. and S.-H. Liu. 1992. Word Identification 
for Mandarin Chinese Sentences. "lt~ be presented at 
COLING-92. 
Hong, W.M., C.-R. Huang, and K.q. (;hen. ~991. The 
Morphological Rules of Chinese Derived Words. ~1~ be 
presented at the 1991 International Conference on 
"lizaching Chinese as a Second Language. Dec. 1~)l. 
'l~tipei. 
Mo, R. J., Y.-R. Yang, K.--J. Chen, and C.-R. Huang. 
1991. A Analyzer-Generator for Mandarin Chinese 
Quantifier-Measure Coml~unds. Proceedings of 
ROCLING IV. 
Mo, R. J., C.-R. Huang, and K.-J. Chert. 1991. Serial 
Verb Constructions in Mandarin Chinese -- Their Def- 
inition and Control Relations. 'lh he presented at the 
1991 International Conference on 'l~aching Chinese as 
a Second Language. Dec. 1091. "laipei. 
Sinclair, J. M. t987. Ed. L(~king Up -All account of 
the COBUILI) Project in Lexical Computing, London: 
Collins. 
Sinclah', J. M. 199l. Corpus, Concordance, Coll~v~a- 
tion. Oxford: Oxford Unwersity Press. 
Sperberg-McQuecn, C. M. and 1,. Burnard. 1990. 
Giudclines for" the Enemy, ling and Interchange of Ma- 
chine-Readable 'l~:xts. (I'EI 1'1). ACI I-ACL-ALI ,('. 
