Character-based Collocation for Mandarin Chinese 
Chu-Ren Huang 
Institute of History and Philology 
Academia Sinica 
Nankang, Taipei, Taiwan 
hschuren@ccvax.sinica.edu.tw 
Keh-jiann Chen 
Institute of Information Science 
Academia Sinica 
Nankang, Taipei, Taiwan 
kchen@iis.sinica.edu.tw 
Yun-yan Yang 
Computing Center 
National Taiwan University 
Taipei, Taiwan 
yang@iis.sinica.edu.tw 
This paper describes a character-based Chinese 
collocation system and discusses its advantages over a 
traditional word-based system. Since wordbreaks are not 
conventionally marked in Chinese text corpora, a 
character-based collocation system has the dual 
advantages of avoiding pre-processing distortion and 
directly accessing sub-lexical information. Furthermore, 
word-based collocational properties can be obtained 
through an auxiliary module of automatic segmentation. 
PROJECT NOTE: LARGE TEXT CORPORA 
I. Introduction 
Collocation has been established as an essential tool in 
computational linguistics (Church and Mercer 1993). In 
addition, various collocational programs have proven to 
be indispensable in the automatic acquisition of lexical 
information (e.g. Sinclair 1991, and Biber 1993). 
Since words are the natural and undisputed units in 
available text corpora, virtually all current 
collocational programs are word-based. However, there 
are languages, such as Chinese, where texts do not 
conventionally mark words. Unless a large tagged corpus 
is available, a word-based collocation system in these 
languages faces the following inevitable difficulties. 
First, hand-segmentation of a large corpus is tedious and 
financially nearly impossible. Second, automatic 
segmentation programs can neither identify words not 
listed in the lexicon nor correctly segment all which are 
listed. Third, estimation of lexical probability relies on 
word-frequency counts based on the inaccurate results 
of automatic segmentation; thus the deviation tends to be 
greater than standard tolerance. 
Text corpora without wordbreaks, nevertheless, also 
have their advantages. Take Chinese for example: the 
basic units of text corpora are zi4 'character', a fairly 
faithful representation of the morphemic level of the 
language. In other words, if we take Chinese text 
corpora as they are, we will be able to access sub-lexical 
information without additional cost. To take full 
advantage of the nature of the texts, reliable tools can 
also be devised to obtain lexical collocation. In this 
paper, we will describe the design and implementation of 
a Chinese collocational system that does not require the 
pre-processing of automatic segmentation but is able to 
allow both lexical and sub-lexical information to be 
automatically extracted. 
II. Background: Corpus and Computational Platform 
This collocation system is developed on the 20 
million character modern Chinese corpus at Academia 
Sinica (Huang and Chen 1992, Huang In Press). This 
corpus is composed mostly of newspaper texts. It is 
estimated to have 14 million words. Following the 
industry standard in Taiwan, our collocation system 
can deal with any corpus encoded in BIG-5 code. The 
program is developed under a UNIX environment on an HP 
workstation. It should, however, be portable to any 
UNIX machine with a compatible Chinese solution. The 
collocation system is currently used in research by more 
than 10 linguists affiliated with the Chinese Knowledge 
Information Processing (CKIP) group at Academia 
Sinica. It is also open to any visiting scholar for on-site 
use. 
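As an illustration of what treating a BIG-5 corpus as a sequence of characters involves, the following is a minimal sketch and not the system's actual code. It assumes Python's standard `big5` codec; the function name `read_big5_characters` and the sample string (jin4xing2 diao4cha2 written in characters) are our own illustrative choices.

```python
def read_big5_characters(raw: bytes) -> list[str]:
    """Decode BIG-5 bytes and return the text as a list of characters (zi4)."""
    # errors="replace" keeps the pass going on bytes outside standard BIG-5.
    text = raw.decode("big5", errors="replace")
    # Drop whitespace so that line breaks do not count as characters.
    return [ch for ch in text if not ch.isspace()]


# Stand-in for bytes read from a corpus file (hypothetical sample).
sample = "進行調查\n".encode("big5")
chars = read_big5_characters(sample)
print(chars)
```

Each Chinese character occupies two bytes in BIG-5, so counts of characters rather than bytes are the relevant unit for the collocation routines.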
III. Overall Design of the System 
There are two major modules in the collocation 
system: one deals directly with unsegmented texts and 
the other incorporates automatic segmentation 
before collocation. The two modules share the 
pre-process of the KWIC search module, which allows 
user-specified linguistic patterns (Huang and Chen 1992). 
They also share three common routines to detect 
character collocation, to identify possible collocation 
words through N-grams, and to contextually filter texts 
with user-specified strings. 
The overall design of the system is schematically 
represented in diagram 1. 
Diagram 1. System Design 
[Flowchart not recoverable from the source. Legible menu 
labels: (2) Word Collocation, (3) Categorical Distribution, 
(4) N-gram, (5) Filtering with Character, (6) Filtering 
with Categories, (7) End.] 
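The KWIC pre-process shared by both modules can be sketched as follows. This is a minimal illustration under our own assumptions, not the system's implementation: the function name `kwic`, the `max_gap` limit on discontinuous keys, and the use of regular expressions are all hypothetical. Latin letters stand in for Chinese characters.

```python
import re


def kwic(text, key_parts, max_gap=10, window=5):
    """Return (left context, matched span, right context) for each hit.

    key_parts holds one string for a contiguous key, or several strings
    for a discontinuous key whose parts may be up to max_gap characters
    apart. Matches are non-overlapping and found left to right.
    """
    gap = ".{0,%d}?" % max_gap  # shortest possible gap between key parts
    pattern = re.compile(gap.join(re.escape(p) for p in key_parts), re.DOTALL)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        hits.append((left, m.group(0), right))
    return hits


text = "abXcdYefXgh"
single_key_hits = kwic(text, ["X"])                 # contiguous key
split_key_hits = kwic(text, ["X", "Y"], max_gap=3)  # discontinuous key
print(single_key_hits)
print(split_key_hits)
```

Because the key is matched as a raw character string, nothing here requires the key or its context to be words, which is the property the unsegmented module relies on.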
IV. Collocation Without Segmentation 
There are three collocational tools available in this 
system without segmenting the texts into words. First, 
character collocation allows automatic acquisition of 
sub-lexical information, such as the conditions on 
morpho-lexical rules. This is attested by the studies on 
the notion of word in the mental lexicon reported in Huang 
et al. (1993), and the generalizations of productive 
derivational rules in Mandarin offered in Hong et al. 
(1992). Note that when applying KWIC search to 
the corpus, a user has the freedom to specify a key that 
is a single character, a multi-character string, or even a 
discontinuous string of characters. These character 
strings may or may not be words. Thus the extracted 
collocational relation is not simply between characters. It 
can also be between characters and either a simplex 
word, a compound, or a phrase. The collocational 
relation in our system is measured and represented by 
both Mutual Information (Church and Hanks 1990) and 
frequency. The user can choose to sort and rank the 
collocates by either criterion. S/he can also specify a 
threshold value for either criterion. Usually, the most 
effective method is to use a frequency threshold and 
Mutual Information ranking (Huang In Press). In addition 
to the measures of correlation, the distribution of the 
collocates is also indicated in terms of positions relative 
to the key and frequency of occurrences at each position 
for each collocating character. 
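The combination of a frequency threshold with Mutual Information ranking can be sketched as below. Mutual Information follows the definition of Church and Hanks (1990), I(x,y) = log2( P(x,y) / (P(x) P(y)) ). How the probabilities are estimated is our assumption: window co-occurrence counts and character counts are both divided by the corpus size in characters. The function name `character_collocates` and its parameters are hypothetical.

```python
import math
from collections import Counter


def character_collocates(text, key, window=5, min_freq=2):
    """Rank characters co-occurring with key by Mutual Information.

    Returns (rows, position_freq): rows is a list of
    (character, co-occurrence frequency, MI) sorted by MI, and
    position_freq maps (character, offset from key) to a count.
    """
    n = len(text)
    char_freq = Counter(text)
    key_freq = 0
    pair_freq = Counter()
    position_freq = Counter()
    start = text.find(key)
    while start != -1:
        key_freq += 1
        end = start + len(key)
        for offset in range(1, window + 1):
            if start - offset >= 0:           # position to the left of key
                ch = text[start - offset]
                pair_freq[ch] += 1
                position_freq[(ch, -offset)] += 1
            if end + offset - 1 < n:          # position to the right of key
                ch = text[end + offset - 1]
                pair_freq[ch] += 1
                position_freq[(ch, offset)] += 1
        start = text.find(key, start + 1)
    rows = []
    for ch, f_xy in pair_freq.items():
        if f_xy < min_freq:                   # frequency threshold
            continue
        mi = math.log2(f_xy * n / (key_freq * char_freq[ch]))
        rows.append((ch, f_xy, mi))
    rows.sort(key=lambda row: row[2], reverse=True)  # MI ranking
    return rows, position_freq


rows, positions = character_collocates("abxabyabx", "ab", window=1)
print(rows)
print(positions[("x", 1)])
```

Applying the frequency threshold before ranking keeps rare characters, whose MI estimates are unstable, from dominating the top of the list.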
Second, lexical information can also be derived from 
this collocational system regardless of its lack of 
demarcation of lexical items. This is achieved through a 
simple Markov model. Once the KWIC search has 
extracted the relevant contexts, a simple N-gram routine 
can be performed on the context(s) specified by the user. 
Depending on the purpose of the study and the size of 
relevant texts, the length of the target sequence as well 
as the threshold number can be specified. For instance, 
a linguist may want to look for all two or three character 
sequences that occur over 5 times after a key verb. This 
would likely turn out a list of possible arguments (i.e. 
syntactic words) for that verb. Hence lexical information 
such as semantic restriction of the predicates on its 
post-arguments can be indirectly extracted. In our 
system, the user is allowed to iterate the N-gram search 
by designating different contexts and string length (N). 
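The N-gram routine over the context to the right of a key can be sketched as follows. This is a minimal illustration, not the system's code; the function name `ngrams_after_key`, its parameters, and the three-sentence sample text are hypothetical.

```python
from collections import Counter


def ngrams_after_key(text, key, n=2, window=5, min_count=2):
    """Count n-character sequences inside the window right of each key hit.

    Only sequences reaching min_count are returned, most frequent first.
    """
    counts = Counter()
    start = text.find(key)
    while start != -1:
        end = start + len(key)
        context = text[end:end + window]
        for i in range(len(context) - n + 1):
            counts[context[i:i + n]] += 1
        start = text.find(key, start + 1)
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]


sample_text = "進行調查。進行討論。進行調查。"
frequent = ngrams_after_key(sample_text, "進行", n=2, window=2, min_count=2)
print(frequent)
```

Iterating the search, as the system allows, amounts to calling the routine again with a different `n`, `window`, or key.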
The following is an example of collocation without 
segmentation. Huang et al. (1994) argue that Mandarin 
light verbs select the verbs they nominalize. This is 
supported by the N-gram collocation results in diagram 
2. The collocation is extracted from a 20 million 
character corpus and the collocation window is 5 
characters to the right of the key word. It shows that the 
verb jin4xing2 typically nominalizes a process verb. 
Diagram 2. N-gram Collocation (By Frequency) 
Bi-syllabic Collocation with the verb jin4xing2 
gong1zuo4     'to work'             437 
diao4cha2     'to investigate'      354 
gong1cheng2   'engineering work'    233 
tao3lun4      'to discuss'          223 
gou1tong1     'to communicate'      198 
xie2tiao2     'to coordinate'       185 
yan2jiu4      'to study'            185 
liao3jie3     'to understand'       166 
gui1hua4      'to plan'             156 
xie2shang1    'to negotiate'        154 
Last, the user can specify a character string in the 
context as a filter. The most useful application is to 
specify a string that forms a syntactic word. This is a 
technique commonly used to resolve categorical or 
sense ambiguities. Combining both N-gram search and 
string filtering, frequency-based word collocation is 
achieved without segmentation. 
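String filtering of KWIC contexts can be sketched as below. This is a minimal illustration under our own assumptions: hits are represented as (left context, key, right context) triples, and the function name `filter_contexts` and the two sample contexts are hypothetical.

```python
def filter_contexts(hits, required, side="right"):
    """Keep KWIC hits whose chosen context contains the string required."""
    index = {"left": 0, "right": 2}[side]
    return [hit for hit in hits if required in hit[index]]


hits = [
    ("他們", "進行", "調查工作"),
    ("已經", "進行", "討論之後"),
]
kept = filter_contexts(hits, "調查")
print(len(kept), kept)
```

Counting the hits that survive a filter for each candidate string found by the N-gram search gives a frequency-based word collocation without any segmentation step.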
V. Collocation After Segmentation 
When lexical or phrasal relation is the focus of the 
study, the above collocation module may sometimes be 
inadequate. In this case, we will need to apply the 
automatic segmentation/tagging program such that we 
can acquire information involving word pairs as well as 
grammatical categories. The automatic segmentation 
procedure is a revised version of the program reported 
in Chen and Liu (1992). The on-line lexicon is the 
CKIP lexicon of more than 80 thousand entries (Chen 
1994). 
We did not automatically segment and tag the whole 
corpus, for three reasons. First, without a correctly 
tagged corpus, no statistically-based tagger can perform 
satisfactorily yet. 
Second, there is no practical way to recover 
incorrectly identified words. That is, when the automatic 
tagger takes a character from a target word to form an 
inappropriate word with a neighboring character, that 
target word is lost and cannot be identified in this 
context. Thus, it will be linguistically more felicitous to 
allow KWIC to identify all matching strings and allow 
filtering of incorrectly matched words in later steps. 
Last, segmented texts restrict the available collocation 
information exclusively to the word level. For instance, 
not only will morpheme-morpheme collocation be 
unavailable, but correlations between a morpheme 
and a word cannot be extracted either. 
In contrast, when optional segmentation is performed 
on-line on the result of KWIC search, the collocational 
system can be applied to any electronic text corpora 
with minimal pre-processing. This current approach also 
allows us to mix sub-lexical, lexical, and extra-lexical 
conditions according to our research need. 
Even though the post-segmentation module shares 
three routines with the module without segmentation, 
they do differ non-trivially in their applications. First, 
the character collocation module is basically the same. 
The additional step of segmentation excludes accidental 
string matches. For instance, with qu4shi4 'to pass 
away' as the keyword, KWIC may extract the incorrect 
context 'ta1 qu4 shi4jie4 ge4 di4 lu3 xing2'. This error 
in identifying word boundaries can be easily avoided 
when the text is correctly segmented. In this case, the 
correct segmentation is 'ta1 qu4 shi4jie4 ge4di4 
lu3xing2 (s/he go world everywhere travel)'. Second, 
N-gram in this module now can include both sequences 
of characters and sequences of words. 
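How segmentation excludes an accidental match such as qu4shi4 above can be sketched as follows. The sketch assumes a segmentation is already given and does not reproduce the segmentation program itself. Romanized syllables stand in for characters, since the example is given only in romanization; the function name `key_hits_with_alignment` and the second context (with le5) are hypothetical.

```python
def key_hits_with_alignment(words, key):
    """Find key in the character stream and test word-boundary alignment.

    words is a segmented context, each word a tuple of characters.
    Returns (start offset, aligned) pairs, where aligned is True only if
    the match both starts and ends on a word boundary.
    """
    chars = [ch for word in words for ch in word]
    boundaries = {0}
    position = 0
    for word in words:
        position += len(word)
        boundaries.add(position)
    key = list(key)
    hits = []
    for start in range(len(chars) - len(key) + 1):
        end = start + len(key)
        if chars[start:end] == key:
            hits.append((start, start in boundaries and end in boundaries))
    return hits


key = ("qu4", "shi4")
travel = [("ta1",), ("qu4",), ("shi4", "jie4"), ("ge4", "di4"), ("lu3", "xing2")]
passed_away = [("ta1",), ("qu4", "shi4"), ("le5",)]  # hypothetical context
print(key_hits_with_alignment(travel, key))
print(key_hits_with_alignment(passed_away, key))
```

In the first context the match ends inside the word shi4jie4, so it is flagged as unaligned and can be discarded; in the second it coincides with a word and is kept.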
Two additional tools directly utilize grammatical tags. 
The first one computes the distribution of 
grammatical categories in the context. The second is 
contextual filtering in terms of grammatical categories. 
One caution needs to be mentioned here. As mentioned 
earlier, we do not have a highly reliable automatic 
tagger yet because there is no dependably tagged large 
Chinese corpus. Hence our automatic segmentation 
program looks up the categories of the words but does not 
attempt to resolve ambiguity. Since categorically 
ambiguous words make up only around 20% of the texts 
(Chen and Liu 1992, Chen et al. In Preparation), keeping 
all possible tags seems to be an acceptable compromise 
for the moment. But this also means that a user must be 
on the lookout for possible errors caused by multiple 
tags. Our system allows the user to view the categorical 
distribution of the whole context, as well as to focus on 
a smaller context and specific categories. Diagram 3 
shows the categorical collocation of the head of the 
post-verbal argument of huo4de2 'to get/receive'. We 
obtained this information by first performing the 
discontinuous KWIC on huo4de2 and the relative clause 
head marker de. After segmentation and collocation, we 
restrict the display to the first position to the right of de, 
and to the two major categories of N and V. The result 
shows that this verb typically takes subclasses of 
common nouns and (nominalized) transitive verbs as 
arguments. 
Diagram 3. Categorical Collocation 
Heads of Relative Clause Arguments of huo4de2 
r1     frequency     r1     frequency 
Nab    63            Vc2    62 
Nac    40            Vh11   28 
Nad    27            Vk1    20 
Nca    26            Vc1    19 
Ncb    13            Vc2    16 
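The kind of count shown in Diagram 3 can be sketched as follows, keeping every tag of an ambiguous word as described above. This is a minimal illustration, not the system's code: the function name `category_distribution`, the placeholder words w1 to w3, and the tags DE and Dd are hypothetical; only Nab and Vc2 are taken from the diagram.

```python
from collections import Counter


def category_distribution(tagged_words, marker, offset=1, prefixes=("N", "V")):
    """Count tags at a fixed offset from each occurrence of marker.

    tagged_words is a list of (word, [tags]) pairs. Ambiguous words keep
    all their tags, so each of their tags is counted once.
    """
    counts = Counter()
    for i, (word, _tags) in enumerate(tagged_words):
        if word != marker:
            continue
        j = i + offset
        if 0 <= j < len(tagged_words):
            for tag in tagged_words[j][1]:
                if tag.startswith(prefixes):  # restrict to major categories
                    counts[tag] += 1
    return counts


context = [
    ("huo4de2", ["Vc2"]),
    ("w1", ["Nab"]),
    ("de", ["DE"]),
    ("w2", ["Nab", "Vc2"]),  # ambiguous word: both tags are kept
    ("de", ["DE"]),
    ("w3", ["Dd"]),
]
distribution = category_distribution(context, "de")
print(distribution)
```

Since an ambiguous word contributes to more than one category, totals across categories can exceed the number of contexts, which is one form of the multiple-tag error the user is cautioned about.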
Last, the word-based collocation system is the part of 
our system that will take the most processing capacity. 
This is also the only part of our system that is still being 
tested at this moment. Word frequencies of our corpus 
have already been calculated and stored. The 
automatically segmented word-based collocation 
module should be available for linguistic research within 
weeks. 
VI. Conclusion 
In this paper, we described a collocation system that 
works on text corpora without word marks. This system 
has the advantage of extracting sub-lexical information. 
This is also particularly useful in studying Chinese 
language corpora since sociological words are distinct 
from syntactic words in Chinese (Chao 1968). Thus in 
linguistic and literary computing, it is often necessary to 
formulate generalizations based on zi4, the sociological 
word. The techniques reported in this paper should also 
find applications in two aspects of future computational 
linguistic research. First, they can be applied to other 
language text corpora for extraction of sub-lexical 
collocation. Second, they can be applied to text corpora 
which do not come with clear word demarcation, 
including corpora in languages in which sociological 
words and syntactic words do not coincide and spoken 
corpora. 
Acknowledgements 
Research for this project was partially funded by the 
Chiang Ching-kuo Foundation for International 
Scholarly Exchanges, the National Science Council of 
R.O.C., and Academia Sinica. We would like to thank 
Li-ping Chang and other colleagues at CKIP for their 
help and comments. Responsibility for any remaining 
errors is ours alone. 

Bibliography 
Biber, D. 1993. Co-occurrence Patterns among 
Collocations: A Tool for Corpus-Based Lexical 
Knowledge Acquisition. Computational Linguistics. 
19.3:531-538. 
Chao, Y. R. 1968. A Grammar of Spoken Chinese. 
Berkeley: University of California Press. 
Chen, K.-j. 1994. Linguistic Information and Lexical 
Data Management in Dictionary Research. Invited 
paper presented at the International Conference on 
Computer Processing of Oriental Languages. Daejon, 
Korea. May 10-14. 
___, and S.-H. Liu. 1992. Word Identification for 
Mandarin Chinese Sentences. COLING-92. 101-105. 
Nantes, France. 
___, S.-h. Liu, L.-p. Chang, and Y.-h. Chin. In 
Preparation. A Practical Tagger for Chinese 
Corpora. Nankang: Academia Sinica. 
Chinese Knowledge Information Processing Group. 
1993. The CKIP Categorical Classification of 
Mandarin Chinese (In Chinese). CKIP Technical 
Report no. 93-05. Taipei: Academia Sinica. 
___. 1994. A Frequency Dictionary of Written 
Chinese. CKIP Technical Report no. 94-01. Taipei: 
Academia Sinica. 
Church, K., and P. Hanks. 1990. Word Association 
Norms, Mutual Information, and Lexicography. 
Computational Linguistics. 16.1:22-29. 
___, and R. L. Mercer. 1993. Introduction to 
the Special Issue on Computational Linguistics 
Using Large Corpora. Computational Linguistics. 
19.1:1-24. 
Hong, W.M., C.-R. Huang, and K.-J. Chen. 1991. 
The Morphological Rules of Chinese Derived 
Words. Presented at the 1991 International 
Conference on Teaching Chinese as a Second Language. 
Taipei. 
Huang, C.-R. In Press. Corpus-based Studies of 
Mandarin Chinese: Foundational Issues and 
Preliminary Results. In M. Y. Chen and O. J.-L. Tzeng 
Eds. 1993. Linguistic Essays in Honor of William S.-Y. 
Wang. Taipei: Pyramid. 
___, Kathleen Ahrens, and Keh-jiann Chen. 1993. A 
Data-driven Approach to Psychological Reality of the 
Mental Lexicon: Two Studies in Chinese Corpus 
Linguistics. Proceedings of the International Conference 
on the Biological Basis of Language. 53-68. Chiayi: 
Center of Cognitive Science, National Chung Cheng 
University. 
___, L.-p. Chang, and M.-L. Yeh. A Corpus-based 
Study of Nominalization and Verbal Semantics: 
Two Light Verbs in Mandarin Chinese. Paper 
presented at the Sixth North American Conference on 
Chinese Linguistics. May 13-15, USC. 
___, and K.-j. Chen. 1992. A Chinese Corpus for 
Linguistic Research. COLING-92. 1214-1217. 
Nantes, France. 
Sinclair, J. M. 1991. Corpus, Concordance, 
Collocation. Oxford: Oxford University Press. 
Sproat, R., and C. Shih. 1990. A Statistical 
Method for Finding Word Boundaries in Chinese Text. 
Computer Processing of Chinese and Oriental 
Languages. 4.4:336-351. 
Svartvik, J. 1992. Ed. Directions in Corpus 
Linguistics. Proceedings of Nobel Symposium 82, 4-8 
August 1991. Trends in Linguistics Studies and 
Monographs 65. Berlin: Mouton. 
Wang, M.-C., C.-R. Huang, and K.-j. Chen. 1994. 
The Identification and Classification of Unknown 
Words in Chinese: a N-gram-Based Approach. 
Manuscript. Academia Sinica. 
