AN IBM-PC ENVIRONMENT FOR CHINESE CORPUS ANALYSIS 
Robert Wing Pong Luk 
Department of Chinese, Translation and Linguistics, City Polytechnic of Hong Kong 
Email: CTRWPL92@CPHKVX.BITNET
ABSTRACT 
This paper describes a set of computer 
programs for Chinese corpus analysis. These programs 
include (1) extraction of different characters, bigrams 
and words; (2) word segmentation based on bigram, 
maximal-matching and the combined technique; (3) 
identification of special terms; (4) Chinese 
concordancing; (5) compiling collocation statistics and 
(6) evaluation utilities. The programs run on the IBM-PC,
and batch programs co-ordinate their use.
I. INTRODUCTION
Corpus analysis utilities are well developed and
widely available for English. For example, the Oxford
Concordance Program is available for over 10 kinds of
mainframe computer (Hockey and Martin, 1987) and
the Longman Mini-Concordancer (Tribble and Jones,
1990) is available for sale. Further enhancements of
these utilities include compiling collocation statistics
(Smadja, 1993) and semi-automatic glossary
construction (Tong, 1993). Current research has focused
on bilingual corpora (Gale and Church, 1993), with the
alignment of parallel texts becoming an important
technical problem. However, there has been little
development of corpus analysis tools for Chinese. Since
computing in Chinese has only become generally
available in the last ten years, analysis utilities
for Chinese are not widely available. Although no integrated
environment is available for Chinese corpus analysis,
many specific analysis programs have been reported in
the literature (Kit et al., 1989; Tong et al., 1993; Chang
and Chen, 1993; Zhou et al., 1993). A Chinese
concordance program and a character-list extraction
program are freely available from a Singapore (FTP)
network site (Guo and Lui, 1992). However, these
programs run on SUN workstations, while many
users, particularly non-computing experts, work in
Chinese on an IBM-PC rather than a SUN workstation.
The rapid advance of microcomputers has
mitigated many storage and processing speed problems.
As for storage, hard disk capacity can reach as high
as 340M bytes, which is adequate in comparison with the
demand for a corpus (8M bytes for the PH corpus) and
a dictionary (10M bytes). Using a 486 processor, the
processing speed is acceptable if the user expects data to
be analyzed overnight, similar to submitting a batch job
to a mainframe computer. For example, the utilities we
are developing ranked around 42,000 words in a few
minutes and produced about one hundred lines of
keyword-in-context in a few seconds for a 4 million
character Chinese corpus.
This paper describes our effort to develop 
corpus analysis programs for Chinese. The programs are
written in Turbo C++, implemented on an IBM-PC 
(486) with a 120M byte hard disk. The programs are 
divided into several types: 
a. format conversion programs
(norm.exe, phseg.exe, wform.exe)
b. extraction of characters, bigrams and words
(exsega.exe, exsegmi.exe, bigram.exe,
miana.exe, worddh.exe, wranka.exe,
wlranka.exe)
c. word segmentation programs
(bisegl.exe, whash.exe, bimaxn.exe)
d. concordancing programs
(kwic.exe, kwicw.exe)
e. collocation statistics programs
(cxtract.exe, cxtractw.exe, cxstat.exe,
cxstatw.exe)
f. general (evaluation) programs
(wcomp.exe, segperf.exe)
To run these analysis utilities, a Chinese
computing environment called Eten must be set up;
otherwise Chinese characters cannot be displayed or
entered. Since Chinese has many more different
characters (about 13,000) than Western
languages, each Chinese character is specified by two
bytes instead of one. However, many documents include
both single-byte characters and two-byte Chinese
characters. Thus, the conversion program, norm.exe, is
used to convert all the single-byte characters (i.e.
A..Z, a..z, ., :, ;, !, @, #, $, %, ^, &, *, (, ), -, +, =, |, \, /, <, >, ', {, }, [, ],
0..9, ~, <space>, _ and ") into their corresponding two-byte
equivalents, for simplicity. For example, the single-byte
character "a" is converted to "ａ" (2-byte). This
program also changes the document into a clause or
phrase format, using the -e option, where a new line is
inserted after a punctuation mark (e.g. comma or full
stop):
[Chinese text of the Basic Law, one clause per line; characters not recoverable]
Figure 1: Extract of the Hong Kong Basic Law in clause format.
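
The conversion performed by norm.exe can be sketched as follows. This is only an illustration in Python, not the Turbo C++ program itself: it assumes Unicode text rather than the two-byte codes of the Eten environment, and the function names and the set of clause-ending punctuation marks are hypothetical choices made for this example.

```python
# Sketch only: Unicode full-width forms stand in for the two-byte codes.
CLAUSE_PUNCT = set("，。．；：？！")  # assumed set of clause-ending marks

def to_fullwidth(text):
    """Map printable ASCII (0x21-0x7E) and the space to full-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            out.append("\u3000")            # ideographic space
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))  # e.g. "a" becomes full-width "a"
        else:
            out.append(ch)                  # already a wide character
    return "".join(out)

def to_clause_format(text):
    """Start a new line after every clause-ending punctuation mark."""
    clauses, current = [], []
    for ch in text:
        current.append(ch)
        if ch in CLAUSE_PUNCT:
            clauses.append("".join(current))
            current = []
    if current:
        clauses.append("".join(current))
    return "\n".join(clauses)
```

Applying to_fullwidth before to_clause_format means that an ASCII comma is first widened and then treated as a clause boundary, which mirrors the order of operations described above.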
If the text is segmented into words by space or
"/" markers, it is possible to change or delete these
markers using the -s option. Once the document is
converted into two-byte format using norm.exe, the
other utilities can be used. Batch programs can be
written to use these utilities. For example, the following
batch program extracts different characters, performs
bigram segmentation, extracts different words and
obtains only the top 10% of the extracted words for
compiling keyword-in-context listings and collocation
statistics.
norm -t %1 -o corp.tmp -e 2 -s 5
/* 1-byte to 2-byte; phrase format; delete space */
exsega -t corp.tmp -w 2 -b 0
/* extract different characters and bigrams */
bigram -m 10
/* sort bigrams and extract top 10% */
bisegl
/* segment using the top 10% bigrams */
worddh
/* extract different words */
wranka -m 10
/* sort and extract top 10% words */
kwic -t corp.tmp -k words.cut > kwic.lst
/* concordancing on the top 10% words */
cxtract
/* extract different characters from contexts */
cxstat
/* compile collocation statistics */
II. EXTRACTION PROGRAMS
The extraction programs assume that the text is
not segmented. Thus, norm.exe should be used to
remove markers from the segmented text.
The programs exsega.exe and exsegmi.exe
extract different characters and their co-occurring
characters, stored in cfreq.tmp (Fig 2) and
bifile.tmp or mifile.tmp, respectively. The first program obtains
the co-occurrence frequencies while the second obtains
the mutual information. By default, the programs do not
count punctuation but this can be overridden using the -a
option. The different characters can be supplemented
with information about their frequencies, percentages
and cumulative percentages if the -w option is set to 2.
Character  Frequency  Percentage  Cumulative %  Rank
[?]/          909       3.529         3.5         1
[?]/          905       3.513         7.0         2
[?]/          789       3.063        10.1         3
[?]/          647       2.512        12.6         4
[?]/          645       2.504        15.1         5
[?]/          630       2.446        17.6         6
Figure 2: Part of the extracted single characters from the Hong
Kong Basic Law. The characters are ranked by their
frequencies. The first number is the frequency, followed by
the percentage, cumulative percentage and rank number.
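
A table with the columns of Figure 2 can be computed along the following lines. This is a minimal Python sketch, not the exsega.exe code; the function name, the default punctuation set and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

def char_table(text, skip=frozenset("，。、；：？！\n")):
    """Return rows of (char, freq, percent, cumulative percent, rank)."""
    # Punctuation is not counted by default (assumed punctuation set).
    counts = Counter(ch for ch in text if ch not in skip)
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    # Sort by descending frequency; ties are broken by character code.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for rank, (ch, freq) in enumerate(ranked, start=1):
        percent = 100.0 * freq / total
        cumulative += percent
        rows.append((ch, freq, round(percent, 3), round(cumulative, 1), rank))
    return rows
```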
By default, all the different characters are
stored. However, sometimes only the most frequently or
infrequently occurring characters are interesting
candidates for further investigation (e.g.
concordancing). The user can select characters by their
frequencies (i.e. -f and -g options), the top or bottom
N% (i.e. -m and -n options), their ranks (i.e. -r and -s
options) and by their frequencies above two standard
deviations plus the mean (Smadja, 1993) (i.e. -z option).
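
Two of these selection criteria, the top N% and the threshold of the mean plus two standard deviations, can be sketched in Python as below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator the programs use.

```python
from statistics import mean, pstdev

def select_top_percent(ranked, percent):
    """Keep the top `percent`% of a list already sorted by frequency."""
    if not ranked:
        return []
    keep = max(1, len(ranked) * percent // 100)  # always keep at least one item
    return ranked[:keep]

def select_above_two_sd(ranked):
    """Keep items whose frequency exceeds mean + 2 * standard deviation."""
    freqs = [freq for _, freq in ranked]
    if not freqs:
        return []
    threshold = mean(freqs) + 2 * pstdev(freqs)  # population s.d. (assumed)
    return [(item, freq) for item, freq in ranked if freq > threshold]
```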
By default, the extracted bigrams have
frequencies above unity but this can be overridden using
the -b option. The bigrams stored can be sorted
according to their frequencies or their mutual
information in descending order using bigram.exe and
miana.exe, respectively. The sorted bigrams are stored
in bifile.rnk or mifile.rnk. The user can select different
bigrams using the options available for the exseg programs (i.e.
-f, -g, -m, -n, -r, -s and -z options). Both programs give
the frequency distribution of the bigram frequencies and
the log of their frequencies. The selected bigrams are
useful for detecting compound nouns or for word
segmentation (Zhang et al., 1992).
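
The two bigram measures can be illustrated with a short Python sketch. The exact formula used by exsegmi.exe is not given in the text, so the pointwise mutual information log2(P(xy) / (P(x) P(y))) used here is an assumption, as are the function name and the min_freq parameter (which plays the role described for the -b option).

```python
import math
from collections import Counter

def bigram_stats(clauses, min_freq=2):
    """Return {bigram: (frequency, mutual information)} for adjacent pairs."""
    chars, pairs = Counter(), Counter()
    for clause in clauses:
        chars.update(clause)
        # Bigrams never cross a clause boundary.
        pairs.update(clause[i:i + 2] for i in range(len(clause) - 1))
    n_chars, n_pairs = sum(chars.values()), sum(pairs.values())
    stats = {}
    for pair, freq in pairs.items():
        if freq < min_freq:      # default keeps frequencies above unity
            continue
        p_xy = freq / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        stats[pair] = (freq, math.log2(p_xy / (p_x * p_y)))
    return stats
```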
Given that the text is segmented by "/" markers
(space markers can be converted using norm.exe),
worddh.exe can extract all the different words from the
text and compute word frequencies. The program
extracted 42,613 words from the PH corpus. There is no
limit to the number of different words that it can extract
but it needs some disk space to hold temporary files. The
extracted words are stored in words.lst and they are
sorted in descending order of frequency using wranka.exe. In
addition, wlranka.exe sorts the extracted words firstly by
word length and secondly by their frequencies. This is
particularly useful for examining compound nouns,
technical terms and translated words as they tend to be
long. Furthermore, the segmentation program,
whash.exe, needs the words to be ordered by their length.
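
Word extraction and the two rankings can be sketched in Python as follows. The function names are hypothetical, the counts are held in memory rather than in temporary files, and ties are broken alphabetically, which is an assumption of this example.

```python
from collections import Counter

def extract_words(segmented_lines, marker="/"):
    """Count the different words in text segmented by `marker`."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(w for w in line.strip().split(marker) if w)
    return counts

def rank_by_frequency(counts):
    """Order words by descending frequency."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def rank_by_length(counts):
    """Order words by descending length, then by descending frequency."""
    return sorted(counts.items(), key=lambda kv: (-len(kv[0]), -kv[1], kv[0]))
```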
III. WORD SEGMENTATION PROGRAMS
Unlike English, Chinese words are not
delimited by any markers such as spaces, although
Chinese clauses are easily identified (Fig 1). Many
segmentation programs have been proposed (Chiang et al.,
1993; Fan and Tsai, 1988). We have re-implemented the
maximal-matching technique (Kit et al., 1989) using a
word list, L, because it is simple to program and
achieved one of the best segmentation performances
(1-2% error rate). However, the segmentation accuracy is
degraded significantly (to 15% error rate in (Luk,
1994)) when the text has many compound nouns and
technical terms since the accuracy depends on the
coverage of L. A word segmentation program using
bigrams, and another combining bigrams with
maximal-matching, were subsequently developed.
The basic idea of maximal-matching is to
match the input clause from left to right with entries in
the given word list, L. If there is more than one match,
the longest entry is selected. The process then repeats on
the remainder of the clause, after the portion matched by
the longest entry, until the end of the clause is reached.
Apart from maximal-matching,
whash.exe divides and outputs the text in the clause
format (Fig 1). The file that holds the word list can be
specified using the -b option and the text using the -t
option. The word list should rank the words, firstly, by
their length in descending order (use wlranka) and,
secondly, by their frequencies. Usually, the segmented
clauses are displayed on the screen for visual inspection,
after which the output can be redirected using the >
operator (MS DOS 5.0). The current whash.exe
program can hold around 20,000 Chinese words in
main memory for segmentation but this is not large
enough for a general Chinese dictionary (Fu, 1987)
which has about 54,000 entries.
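
The maximal-matching step described above can be sketched in Python as follows. This is an illustration rather than the whash.exe implementation: the function name is hypothetical, the word list is held in a set instead of a length-ordered file, and an unmatched character is emitted as a single-character word, which is an assumption of this example.

```python
def maximal_match(clause, word_list, max_len=None):
    """Segment `clause` greedily, left to right, using the longest match."""
    words = set(word_list)
    if max_len is None:
        max_len = max((len(w) for w in words), default=1)
    max_len = max(1, max_len)
    result, pos = [], 0
    while pos < len(clause):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(clause) - pos), 0, -1):
            candidate = clause[pos:pos + length]
            if length == 1 or candidate in words:
                result.append(candidate)
                pos += length
                break
    return result
```

For instance, with the word list ["香港", "基本", "基本法", "法"], the clause "香港基本法的" is divided into "香港", "基本法" and "的": the longer entry "基本法" is preferred over "基本".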
The bigram technique does not need any
dictionary for segmentation. This technique needs a set
of bigrams extracted from the text or from a general
corpus. Typically, the top 10% of the bigrams are
captured and ranked according to their co-occurrence
frequencies (CF) or mutual information (MI). This is
due to the fact that if the distributions of CF and MI are
normal, then the top 10% corresponds to the 10%
significance level. The distribution of MI typically does
appear normal but that of CF does not. The top N% bigrams are
stored in either bifile.cut or mifile.cut. The bigram
segmentation program, bisegl.exe, loads the bigrams
using the -b option. A segmentation marker is placed
between two characters in the text if the bigram of these
two adjacent characters does not appear in bifile.cut or
mifile.cut. This segmentation is the same as performing
nearest-neighbour clustering of substrings (Luk, 1994).
The program detected many non-words, depending on N.
However, the number of non-words is significantly
reduced if we restrict ourselves to examining only the top
N% (say 10) of the frequently occurring words.
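
The rule for placing segmentation markers can be sketched in a few lines of Python. The function name is hypothetical and the selected bigrams are passed in as a set rather than loaded from bifile.cut or mifile.cut.

```python
def bigram_segment(clause, bigrams):
    """Place a break between adjacent characters whose bigram is not listed."""
    if not clause:
        return []
    words, current = [], clause[0]
    for prev, ch in zip(clause, clause[1:]):
        if prev + ch in bigrams:
            current += ch          # the pair stays inside the same word
        else:
            words.append(current)  # a segmentation marker goes here
            current = ch
    words.append(current)
    return words
```

Because each character is linked to its neighbour only when their bigram is in the selected set, chains of listed bigrams grow into longer strings, which is why the choice of N affects the number of non-words produced.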
Both maximal-matching and bigram techniques
were combined in order to detect words not in the word
list and reduce the amount of non-words detected (Luk,
1994). Maximal-matching is carried out first and the
bigram technique is used to combine consecutive single-character
words in the segmented text since words not in
L are usually segmented into smaller ones by
maximal-matching. The test data shows that the combined
technique reduced the error rate by 33% and detected
33% of the desired words not in L. The combined
technique is written as a batch program as follows:
whash -b wordlst.txt -t text > text.tmp
/* maximal-match with existing word list */
bimaxn -t text.tmp
/* combine single-character words from segmented text */
worddh
/* extract words from segmented text */
wlranka -t words.lst
/* rank words by their lengths */
whash -b wordl.rnk -t text > text.res
/* maximal-match with identified words */
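
The step that combines consecutive single-character words can be sketched as below. The text does not give the details of bimaxn.exe, so this Python function is an assumed reading of the description: it joins a single-character word to the preceding run of single characters when the bigram formed at the join is in the selected set. The function name is hypothetical.

```python
def merge_single_chars(words, bigrams):
    """Join runs of adjacent one-character words linked by listed bigrams."""
    merged = []
    prev_from_singles = False  # True if merged[-1] was built from single characters
    for word in words:
        if (len(word) == 1 and prev_from_singles
                and merged[-1][-1] + word in bigrams):
            merged[-1] += word
        else:
            merged.append(word)
            prev_from_singles = len(word) == 1
    return merged
```

Words already found by maximal-matching are left untouched, so only material that the word list failed to cover is regrouped.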
IV. CONCORDANCE PROGRAMS 
We modified the concordance program by Guo
and Lui (1992) since that program assumed that the
main memory can hold the entire corpus or text. Instead,
the modified program loads a portion called a page into
the main memory and performs matching to find the
appropriate contexts. The page size can be changed
using the -p option but we found that the program
operates well at -p 10000 (which is the default size).
The modified programs, kwic.exe and kwicw.exe, can
process files of size just over 2G bytes, which is much
bigger than the hard disk.
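
Page-by-page matching can be illustrated with the Python sketch below. How kwic.exe treats keywords that straddle a page boundary is not stated, so the carry-over of the last few characters of each page is an assumption of this example, as are the function name and the use of character offsets rather than bytes.

```python
import io

def find_in_pages(stream, keyword, page_size=10000):
    """Yield character offsets of `keyword`, reading one page at a time."""
    if not keyword:
        return
    overlap = len(keyword) - 1   # carry-over so boundary matches are found
    offset, tail = 0, ""         # offset = stream position of tail[0]
    while True:
        page = stream.read(page_size)
        if not page:
            break
        buffer = tail + page
        start = buffer.find(keyword)
        while start != -1:
            yield offset + start
            start = buffer.find(keyword, start + 1)
        keep = min(overlap, len(buffer))
        tail = buffer[len(buffer) - keep:]
        offset += len(buffer) - keep

demo = io.StringIO("abcabcabc")
print(list(find_in_pages(demo, "ca", page_size=3)))  # prints [2, 5]
```

Only one page plus a short tail is held in memory at any time, so the size of the file that can be searched is not limited by main memory.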
38 [left context]<[keyword]>[right context]
(Chinese characters not recoverable)
Figure 3: The keyword-in-context (kwic) format produced by kwic.exe.
Note that the line numbers are on the left-most positions and the keyword
is delimited by "<" and ">".
A keyword file must be specified using the -k
option and each keyword should be terminated by "/".
The number of characters in the left and right contexts
can be specified in bytes, using the -l and -r options
respectively. If -n 0 is specified then line numbers will
appear on the left. There are additional options for
indexing in the original concordance programs but these
options are not important in the current implementation.
The kwicw.exe program deals with segmented text. Here, the -l
and -r options specify the number of words in the left
and right contexts. The length of each context (approx.
1000 characters allocated) can hold 20 words assuming
that each word has 24 characters.
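
The keyword-in-context format of Figure 3 can be produced by a sketch such as the following. It is written in Python for illustration and is not the kwic.exe code: the function and parameter names are hypothetical, contexts are measured in characters rather than bytes, and contexts are assumed not to extend beyond the clause in which the keyword occurs.

```python
def kwic_lines(clauses, keyword, left=10, right=10, numbered=True):
    """Return keyword-in-context lines with the keyword inside < and >."""
    if not keyword:
        return []
    lines = []
    for number, clause in enumerate(clauses, start=1):
        start = clause.find(keyword)
        while start != -1:
            end = start + len(keyword)
            context = (clause[max(0, start - left):start]
                       + "<" + keyword + ">"
                       + clause[end:end + right])
            # The prefix is the number of the source line (clause).
            lines.append(f"{number} {context}" if numbered else context)
            start = clause.find(keyword, start + 1)
    return lines
```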
V. COMPILING COLLOCATION STATISTICS 
Collocation statistics (Fig 4) refer to the
frequencies of each different word or character at
different positions in the contexts of a keyword. These
frequencies are useful for detecting significant
collocations in English but they are tedious
and error-prone to compile by hand. We have also
written programs to compile these statistics for Chinese
but factorial analysis (Biber, 1993) still remains to be
implemented.
Chinese concordancing is carried out first to
extract the relevant contexts. The output of
concordancing should be stored in kwic.lst. Then,
cxtract.exe will extract all the different words in the
context, using a finite-state machine (FSM) to decode the kwic format. The
program sorts these words according to their frequency
of occurrence in the context. The different words are
stored in cxtract.crk and the user can select candidates
using options as in exsega.exe. Next, cxstat.exe compiles
the frequencies of these different words at different
positions in the contexts. The statistics are stored in
cxtract.sta. For segmented text, kwicw.exe, cxtractw.exe
and cxstatw.exe are used instead.
key = [keyword]/
<[keyword]>/  [ 71]   0   0   0   0   0  < 71>   0   0   0   0   0
[word]/       [ 50]   5   6   9   2   0  <  0>   1   4  10   8   5
[word]/       [ 24]   ?   7   1   2   1  <  0>   0   2   1   1   2
[word]/       [ 18]   0   0   2   1   1  <  0>   1   0   1   1   2
(Chinese characters not recoverable; "?" marks an illegible digit)
Figure 4: Collocation statistics. The different words in the contexts are
displayed on the left and the square brackets show the frequency of
occurrence in the context of the keyword. The angle brackets indicate the
position of the keyword.
Unlike Smadja (1993), the keyword may be
part of a Chinese word. Thus, the program can compile
statistics about different prefixes, suffixes or stems of a
Chinese word. This is particularly interesting for
investigating translated terms and compound nouns.
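
Positional statistics of the kind shown in Figure 4 can be compiled from keyword-in-context lines as sketched below. This Python example works on characters, splits each line at the "<" and ">" delimiters instead of using a finite-state machine, and expects contexts without leading line numbers; these simplifications and the function name are assumptions of the example.

```python
from collections import defaultdict

def collocation_stats(kwic_contexts, span=5):
    """Count each context character at offsets -span..-1 and +1..+span."""
    table = defaultdict(lambda: [0] * (2 * span))
    for line in kwic_contexts:
        left, _, rest = line.partition("<")
        _, _, right = rest.partition(">")
        for dist, ch in enumerate(reversed(left[-span:]), start=1):
            table[ch][span - dist] += 1       # column for offset -dist
        for dist, ch in enumerate(right[:span], start=1):
            table[ch][span + dist - 1] += 1   # column for offset +dist
    return dict(table)
```

Each row lists the counts from the furthest left position to the furthest right position, with the keyword itself omitted from the columns.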
VI. EVALUATION PROGRAMS
Two programs were written to measure the
performance of word segmentation and word
identification. For segmentation, segperf.exe examines
two identical texts that were segmented by different
methods. The program shows the amount of
segmentation error, the number of clauses, the number
of clauses that are segmented correctly and the amount of
over- or under-segmentation. Files of the segmented
texts are specified by the -a and -m options. The user
can inspect parallel clauses to examine individual
differences in segmentation by setting the -d
(diagnostic) option to 1.
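
A comparison of this kind can be sketched in Python as follows. The measures computed by segperf.exe are not defined precisely in the text, so this example makes its own assumptions: a clause is correct when both segmentations place their breaks at the same positions, an extra break counts as over-segmentation and a missing break as under-segmentation. The function names are hypothetical.

```python
def boundaries(clause, marker="/"):
    """Return the set of character offsets at which a word ends."""
    cuts, pos = set(), 0
    for word in clause.strip().strip(marker).split(marker):
        pos += len(word)
        cuts.add(pos)
    return cuts

def compare_segmentations(reference, candidate, marker="/"):
    """Summarise differences between two segmentations of the same clauses."""
    summary = {"clauses": 0, "correct": 0, "over": 0, "under": 0}
    for ref, cand in zip(reference, candidate):
        ref_cuts = boundaries(ref, marker)
        cand_cuts = boundaries(cand, marker)
        summary["clauses"] += 1
        summary["correct"] += int(ref_cuts == cand_cuts)
        summary["over"] += len(cand_cuts - ref_cuts)    # extra breaks
        summary["under"] += len(ref_cuts - cand_cuts)   # missing breaks
    return summary
```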
For word identification, wcomp.exe compares
two different word lists and determines the
amount of word overlap. The program shows the
distribution of word overlap for different lengths of
words. This is important since long words tend to be
compound nouns that are not in a general dictionary.
Using the -i and -j options, the program saves words
that overlap and words that do not overlap, respectively.
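
The word-list comparison can be sketched in Python as below. The function name and the form of the returned distribution are assumptions made for this example; the text does not specify the output layout of wcomp.exe.

```python
def compare_word_lists(list_a, list_b):
    """Return overlapping words, non-overlapping words and counts by length."""
    set_a, set_b = set(list_a), set(list_b)
    overlap = set_a & set_b
    non_overlap = set_a ^ set_b   # words found in only one of the lists
    by_length = {}
    for length in sorted({len(w) for w in set_a | set_b}):
        by_length[length] = (sum(len(w) == length for w in overlap),
                             sum(len(w) == length for w in non_overlap))
    return overlap, non_overlap, by_length
```

In the returned dictionary, each word length maps to a pair giving the number of overlapping and non-overlapping words of that length.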
REFERENCES 
BIBER, D. (1993) "Co-occurrence patterns among
collocations: a tool for corpus-based lexical knowledge
acquisition", Computational Linguistics, 19, n3, pp. 531-538.
CHANG, C-H. AND C-D. CHEN (1993) "Chinese part-of-speech
tagging using an HMM", Proceedings of
Computational Linguistics: Research and Applications,
Xiamen, PRC, pp. 114-119.
CHIANG, T.H., J.S. CHANG, M.Y. LIN AND K.Y. SU
(1993) "Statistical models for word segmentation and
unknown word resolution", Proceedings of ROCLING V
'93, pp. 123-146.
FAN, C.K. AND W.H. TSAI (1988) "Automatic word
identification in Chinese sentences by the relaxation
technique", Computer Processing of Chinese and
Oriental Languages, 4, n1, pp. 33-56.
FU, X-L. (1987) Xiandai Hanyu Tongyong Cidian, Waiyu
Jiaoxue Yu Yanjiu Publishing House: Beijing, PRC.
GALE, W.A. AND K.W. CHURCH (1993) "A program for
aligning sentences in bilingual corpora", Computational
Linguistics, 19, n1, pp. 75-102.
GUO, J. AND H.C. LUI (1992) "PH - a Chinese corpus for
pinyin-hanzi transcription", ISS Technical Report
TR93-112-0, Institute of Systems Science, National University
of Singapore.
HOCKEY, S. AND J. MARTIN (1987) "The Oxford
concordance program version 2", Literary and
Linguistic Computing, 19, n1, pp. 75-102.
KIT, C., Y. LIU AND N. LIANG (1989) "On methods of
Chinese automatic word segmentation", Journal of
Chinese Information Processing, 3, n1, pp. 13-20.
LUK, R.W.P. (1994) "Chinese word segmentation using
maximal-matching and bigram techniques", submitted
to ROCLING '94, Taiwan.
TONG, K. S-T. (1993) "From single parent to bound
pairs: the secret life of computerese", in PEMBERTON, R.
AND TSANG, E. S-C. (1993) Studies in Lexis, Language
Centre, Hong Kong University of Science and
Technology, pp. 196-214.
TONG, X., C. HUANG AND C. GUO (1993) "Example-based
sense tagging of running Chinese text",
Proceedings of the Workshop on Very Large Corpora,
ACL-93, Ohio State University, Columbus, 22 June.
SMADJA, F. (1993) "Retrieving collocations from text:
Xtract", Computational Linguistics, 19, n1, pp. 141-177.
TRIBBLE, C. AND G. JONES (1990) Concordances in the
classroom, Suffolk: Longman.
ZHANG, J-S., S. CHEN, Y. ZHENG, X-Z. LIU AND S-J. KE
(1992) "Automatic recognition of Chinese full name
depending on multiple corpus", Journal of Chinese
Information Processing, 6, n3, pp. 7-15.
ZHOU, M., C. HUANG AND J. YANG (1993) "CSTT:
Chinese syntactic tagging tool with self-learning
ability", Proceedings of Computational Linguistics:
Research and Applications, Xiamen, PRC, pp. 155-160.
ACKNOWLEDGEMENT
Thanks to Dr. Webster for correcting the grammatical
mistakes in this paper.
