How Should a Large Corpus Be Built?-A Comparative Study of 
Closure in AAnnotated Newspaper Corpora from Two Chinese 
Sources, Towards Building A Larger Representative Corpus 
Merged from Representative Sub Collections 
John Kovarik (kovariks@worldnet.att.net) 
U.S. Department of Defense 
Abstract 
This study measures comparative lexical and 
syntactic closure rates in annotated Chinese 
newspaper corpora from the Academica Sinica 
Balanced Corpus and the University of Penn- 
sylvania's Chinese Treebank. It then draws in- 
ferences as to how large such corpora need be 
to be representative models of subject-matter- 
constrained  domains within the same 
genre. Future large corpora should be built in- 
crementally only by combining smaller repre- 
sentative sub collections. 
1 Prior Work 
Practically speaking, earlier attempts at build- 
ing corpora, such as the IBM/Lancaster ap- 
proach, have taken an all-inclusive perspective 
toward text selection proposing that (Garside 
and McEnery, 1993) raw texts for parsed cor- 
pora should come from a variety of sources. 
The IBM/Lancaster group used the Canadian 
Hansards collection of parallel parsed English 
and French sentences as a base of English parsed 
sentences and then focused on the Computer 
Manuals domain, in which they attempted to 
randomly select texts with some additional non- 
Computer Manual material selected as "a mea- 
sure of 'light relief' " supposedly for the benefit 
of the annotators. A broad approach was also 
used in the Hong Kong element of the Interna- 
tional Corpus of English (ICE) project (Green- 
baum, 1992) which sought to assemble a range 
of both spoken and written texts along with 
a range of both formal and informal situations 
to provide a reasonably large, well-documented 
and detailed snapshot of the use of educated 
English. (Bolt, 1994) Both the IBM/Lancaster 
approach and the ICE project build millions of 
tokens worth of corpora. 
But from a more principled perspective, Dou- 
glas Biber, in speaking on representativeness 
in corpus design, pointed out that the linguis- 
tic characterization of a corpus should include 
both its central tendency and its range of varia- 
tion. (Biber, 1993)Similarly Geoffrey Leech has 
stated that a corpus, in order to be representa- 
tive, must somehow capture the magnitude of 
s not only in their lexis but also in their 
syntax. (Leech, 1991) This suggests we should 
build corpora focusing on how well they can ap- 
proach lexical and syntactic closure, rather than 
by merely fixating on ever larger amounts of 
text. To build representative corpora why not 
first select representative texts constrained by 
genre of writing? 
In general one possible technique for methodi- 
cal corpus selection would be to build large cor- 
pora out of representative sub-collections con- 
strained by genre and subject matter. Zelig 
Harris said, "Certain proper subsets of the sen- 
tences of a  may be closed under some 
or all of the operations defined in the , 
and thus constitute a sub of it." (Har- 
ris, 1968) Calling this an inductive definition 
of sub Satoshi Sekine has embarked on 
studies involving new trends in the analysis of 
subs (Sekine, 1994). Both Harris and 
Sekine recognized that subs are an effi- 
cient way to observe and measure the properties 
of natural  in smaller, representative 
blocks. 
Getting down to specifics, McEnery and Wil- 
son (McEnery and Wilson, 1996) have hypoth- 
esized that genres of writing, such as the style 
used in newspapers and similar printed publi- 
cations to report news stories, represent a con- 
strained subset of a natural . Thus 
newspaper texts constitute a sub - a 
version of a natural  which does not 
display all of the creativity of that natural lan- 
116 
guage. The newspaper sub can be fur- 
ther constrained by subject matter to divide it 
into smaller, more manageable subsets. 
A key mathematical feature of a sub 
is that it will show a high degree of closure 
at various levels of description, setting it apart 
from unconstrained natural . This clo- 
sure property of a sub is analogous to 
the mathematical property of transitive closure. 
McEnery and Wilson used the closure property 
to measure and compare rates of lexical and syn- 
tactic closure in three corpora: the IBM com- 
puter manual corpus, the Canadian Hansards, 
and the American Printing House for the Blind 
corpus. To date, however, there has been little 
work in a similar vein in other s. 
2 Overview 
This work applies the methodology of McEnery 
and Wilson to examine closure rates in a com- 
parative study of all available tagged Chinese 
newspaper corpora. First I define lexical and 
syntactic closure for this study in section 3. 
Then, section 4 begins this study with an exam- 
ination of ~ the newspaper texts of the Aca- 
demica Sinica Balanced Corpus (ASBC). Sec- 
tion 5 extends this study to an examination of 
the newspaper texts of the UPenn Chinese Tree- 
bank (CTB). Section 6 presents my findings and 
section 7 discusses some implications for future 
corpus building. 
3 Lexical and Syntactic Closure 
3.1 Tokenization in Chinese 
It should be pointed out that Chinese is an ag- 
glutinative, not an inflected . More- 
over, while Chinese tokens can concatenate, 
Chinese has no extensive morphology like many 
Indo-European s. Chinese, of course, 
has no white space separating lexemes, as a re- 
sult, all Chinese text must first be segmented 
into word lengths. However, once a text has 
been segmented, no stemming is needed so each 
segmented Chinese word can be counted as it 
occurs without the need of finding its lemma. 
3.2 " Lexical Closure 
Lexical closure is that property of a collection of 
text whereby given a representative sample, the 
number of new lexical forms seen among every 
additional 1,000 new tokens begins to level off 
at a rate below 10 per cent. 
3.3 Syntactic Closure 
Syntactic closure is that property of a collection 
of text whereby given a representative sample of 
a type of text, then the number of new syntac- 
tic forms seen among every additional 1,000 new 
tokens begins to level off. A syntactic form is 
the combination of token plus type. Thus syn- 
tactic closure approaches as the number of new 
grammatical uses for a previously observed to- 
ken plus the number of new tokens, regardless 
of syntactic use, level off to a growth rate below 
10 per cent. 
4 Academica Sinica Balanced 
Corpus (ASBC) 
While it is common practice to attempt to build 
huge annotated corpora, it is of course very te- 
dious, very expensive, and especially challeng- 
ing for annotators to maintain consistency over 
such a huge task. Consequently one must hope 
that once an annotated corpus of newspaper 
texts is created, it can be statistically measured 
and confirmed to be a representative sample. 
I first measured lexical and syntactic closure 
rates in all ASBC newspaper texts but found 
that when viewed as a whole this newspaper 
sub-collection of the ASBC does not approach 
closure (see graphs below). 
i i.ii:::; :: • 
 i iiii  ii 1 
... ~: .. ?: .i::::"~:!~i : .:.: :i ~ : ~i. i!: ""!':!~'~;. :: i~-.'=:i~:.~'~.~.::.'~i? 
o- "iI|! ll!!|iiii!Iil 
T~lt~ 
117 
ROG1Jn9 News Synlactic Closure 
I 
0 
This raises the question how cart we hope 
for NLP applications to learn on large cor- 
pora if they themselves never approach statis- 
tical closure, never approach being statistically 
confirmed as a representative model of the lan- 
guage ? 
I then focused downward on subsections of 
the newspaper corpus-grouping them by similar 
filename. I searched the ASBC corpus looking 
for files of annotated newspaper text and found 
a total of 57 files (18.7 Mb); my findings are 
summarized in the following table. 
Academica Sinica Balanced Corpus News 
Fries Size 
05 01.9Mb 
16 01.5Mb 
01 00.2Mb 
36 15.1Mb 
Filenames Subject Matter 
A .... '94 Academics 
C .... '93 Various News 
SSSLA '91 Politics etc. 
T .... '91-'95 Sports etc. 
The large single file, n~.med "SSLA', dealt 
with a wide assortment of subject matter and 
thus was significantly different from the other 3 
newspaper collections. Not only was its individ- 
ual file size rather large; it was not even close 
to the size and homogeneity of the other three 
newspaper multi-file collections. I rejected it 
from further study. 
The other sub-collections were more similar. 
Topically speaking, the ASBC "A" newspaper 
collection was focused primarily on news (77 per 
cent) while at the same time focusing narrowly 
on academic events in 1994. The ASBC "C" 
newspaper collection was less narrowly focused 
on news (73.5 per cent) but expanded its fo- 
cus to other than academia while limiting it- 
self to events of 1993. The ASBC "T" news- 
paper collection, however, spanned the period 
1991 through 1995 and dealt with many differ- 
ent subjects, the most frequent of which were 
sports, news, and domestic politics, but even 
each of these most frequent subjects only repre- 
sented 9 per cent of the whole. 
Let us consider the three ASBC newspaper 
sub-collections ("A', "C', and "T" filenames) 
to be potentially representative subs. 
If we can observe relatively high degrees of clo- 
sure at various levels of description, we can pro- 
pose that such sub-collections are representa- 
tive subs within the newspaper genre 
of Chinese natural . Conversely, those 
which do not have a high degrees of closure are 
definitely not sub corpora and not of 
further interest for this study. The following 
graphs depict the observed lexical and syntac- 
tic closure rates of the three ASBC newspaper 
sub-collections under study. 
ROCUng A ¢,~iliecllon Lmxical 
eOO0 ~ .... r ':'.. -: .. .'_.."-: :.!. • 
t '. , !:, '. ,-4. , ,:",, .... '",:, ,, :;,',j 
I . .i'..: • ; . .... i - . .. '-.'t:"i.~: .... " , 
T~lilledi 
ASSC ~ Lexl¢ll ~muro 
: .. .!.: • : . \] . .: .- ..... • : • : ... ( 
118 
ASIK: ~ LArxica.! CIo~JI'~ 
15000 
10000 
o 
TakerL.~ 
lq~OCLIng T CoUec~on Syma ctlc QomJnre 
45000 . . . • .... 
TokI~, 
It appears that the ASBC "A" newspa- 
per sub-collections does approach lexical clo- 
sure; while the "C" and "T" newspaper sub- 
collections definitely do not. 
It appears that the ASBC "A" newspaper 
sub-collection also approaches syntactic clo- 
sure; while the "C" and "T" newspaper sub- 
collections do not. 
5 UPenn Chinese Treebank 
ROCUng A Co41ecth)n SyntacUc Closure 
1 " " " ' - 
20~ 
Tokon$ 
ROCUng C Collection SyntacUc Cl~ure 
I next applied the same measures on the UPenn 
Chinese Treebank corpus. I wanted to com- 
pare the rate at which the UPenn collection ap- 
proaches lexical and syntactic closure with that 
of the ASBC "A" and "T" sub-collections. The 
329 Xinhua newswire documents in the UPenn 
Chinese Treebank annotated corpus came from 
two sub-collections and total 3,289 sentences 
averaging 27 words or 47 characters per sen- 
tence excluding newspaper headlines which are 
characteristica,Uy highly abbreviated clauses. 
The Perl script written to do this analysis 
is freely available at http:h'home.att.net/-ko- variks/closure.htm. 
,2000 , - • • : • " " " " i ? 
0 ~ 40000 60000 80000 
Tokens 
UPenn Chlnese 'rreebank Newspaper Texts 
Files 
121 .322Mb 
040 .114Mb 
076 .406Mb 
054 .181Mb 
038 .060Mb 
Size Sub- Collection Subject 
One 
Two 
Two 
Two 
One 
'94 Economics 
'96 Economics 
'97 Economics 
'98 Economics 
General 
119 
UPe~ ~ IJx|¢l ~o't~re 
,o=<, /ii:--..-~:;. ~7;!~7'.. ~"!?~;.i~:7i;;7 :-.17.!7;~\]:!. 7.'i7 7: L.! :~ i ~ ' ........ " ..... 
• .-. :".. :.: .... i:."..-::;.. ,. ': .. +-/ i 
0 "~!I1!!!11!II!1 
'lrc, keml 
~ cia, Syreacet ~ 
' =o ~i;;~" .... ~ :2:;2~ ~ ~"~ ,~"~"-i ";:~ .~ ~ ~.: .~...~ ;:.~, :.~ :~;~: i ~ ~_ -i i! "It!!II I!!|!t!!I!t 
"l'¢lae=t= 
6 Findings 
The UPenn data initially approaches lexical and 
syntactic closure at a rate which can be fa- 
vorably compared with the ASBC "A" Chinese 
newspaper sub-corpus. By the time 59,000 to- 
kens of the UPenn corpus were ta~ed, only 56 
new token+tag combinations were observed in 
the preceding 1,000 tokens and 27 of those were 
new proper nouns. In comparison, at this point 
the closure rate in the ASBC "A" corpus was 
not quite as good (see table below). But contin- 
uing to the 69,000 token mark, the ASBC =A" 
closure rate had overtaken that of the UPenn 
CTB. 
Tokens Corpus 
59,000 UPenn CTB 
; ASBC A 
"69~000 ;'"UPenn CTB 
I ASBC A 
New token+tags 
56 ~27 ~R) 
77 (34 N...) 
6z (36 Nit) 
6:(~9 N...) 
InterestingJy the graphs for both the ASBC 
"A" and the UPenn CTB data reveal two- 
humped curves. At approximately the 60,000 
token mark on both UPenn CTB graphs the 
curve was starting to flatten only to suddenly 
shift into a sharper climb. 
I~OGlJng A ~onecll~ LIxl¢II 
i- ='. ""~ ~-/i~-;":t":-..~~i~.~;i-.:~ ~- ..~.7~.-! 
"!i!!I111!I11!II11 
To,liel~i 
IJIPe~ c'r'B IJxical 
120~(I, • • . i 
~XX~ • " : ,'. • . : ~ • • ~. "..: '" ~ ~{~ _..~ .~: i 
°!!!!!!!1!!!III 
Tokens 
An investigation of the UPenn CTB data 
revealed that the vast majority of the docu- 
ments at this point dealt with international as- 
pects of the Chinese economy z, whereas pre- 
viously the vast majority of documents had fo- 
cused on Chinese domestic economic growth 2. 
And similarly on the ASBC "A" collection clo- 
sure graphs around the 30,000 token mark, both 
1 Headlines of articles in UPenn 
CTB ff.zom 66,800-68,300 tokens- 66809: 
~i~-~- ;~:~:~l~\[i,/~.~ ~l~l!~ I~l~ =:.~ I~!~i ~:1~ ~- ; 66976: 
; 6~: ill, t!ll~~~¢li ; 
~Topics of articles in UPenn CTB from 
~s,ooo-~2,ooo ~ok~ns- ~so7~: ~tt~~l 
120 
curves start to flatten only to suddenly around 
the 40,000 token mark shift back into a climb 
until about 70,000 tokens. An investigation of 
the ASBC data also showed a subtle shift in 
subject matter, most of the documents from 
29,000 to 39,000 tokens were short notes and 
simple bulletins o~ shifts in academic positions, 
whereas the later data had more long stories on 
subjects of greater complexity 3, whereas the 
later data had more long stories on subjects of 
greate r complexity 4 
Thus the ASBC "A" collection, more so than 
the UPenn CTB corpus, did eventuaUy seem to 
approach closure. By the time we reach 80,000 
tokens, the ASBC "A" collection only saw 14 
new token+tag combinations in the last thou- 
sand tokens of new text(see table below). Six of 
those 14 new combinations were nouns, seven 
were verbs, and one was a preposition ~, hav- 
ing nearly reached lexical closure on newspa- 
~E~ ~~~2~1~ ; 29444: 
~P~~ ; 29948: ~\]~~ 
; 31003: ~~~ ; 31499: 
~-~~ ; 31890: ~1~1"~ ; 
32246: tt~.~. 
~Topics of articles in ASBC =A ~ from 29,300-32,000 " 
tokens-29306: 2933~: 29382: ~,~,: ~~,g 
; 29537: 29581{ ~,1~,: '/J%~ ; 29636: ~: 
~~ ; 30087: ~'~: ~m~"~" ; 30153: ~'~': 
\[~.~J ~E~~ ; 30342: 30402: 
30449: 30478: 30539: ~,~: A~; 30590: 30677: 
30772: 30822: 30944: ~,~: /J'~ ; 30976: 31093: • ,~: ~~. 
4Topics of articles in ASBC ~A" from 56,000-57,000 
tokens- 56183: A~=~: ,~.~.J;~ ~ ~'~.1~:~ ; 
~6232: ~,g: ~~; 562~4: ~,~: ~~ 
; 56323: ~:lJ~,.: r~~:~j ~,~; 
56475: ~,~: r~--~.K~-~~:J ~q~ 
; 56635: ~: ~~ ; 56798: ~: 
~~~~~,~ ; 56852: 
'~:~,,~..~,: ~~~t~~ ; 56930: ~1~: 
• 1~.~ r~.~j ~ ; 5~019: ~,~,: 
r ~ --I:;~ {~rJ~:~,~ j. 
~ASBC Subcollection A* at 80,000 tokens, 14 New 
token+types observed: Na '~, Na ~fi\]~/~':, 
Nc ~, P ~, VA J.~J~, VC ~, VC ~, 
vc t~, VF ~, VF ~, V~ ~t; Tot~ tags: 
7, Total new items: 14. 
per articles regarding academics. In contrast 
the CTB collection instead logged 146 new to- 
ken+tag combinations at the 80,000 token mark 
and its new vocabulary ranged widely across 12 
different parts of speech 6. While the majority 
of this new vocabulary were nouns, this contin- 
uing influx of new words was due primarily to 
the late inclusion of international news in the 
CTB's collection of newspaper articles regard- 
ing Chinese economics. 
~UPenn CTB at 80,000 tokens, 146 New token+types 
observed: AD~, hD~ig, AD~, ADrift, AD~, ADI\]~, AD~, CD 
1 " 2 ~, 
CD 1 7 1 6~, CD2 0 0, CD2 0 0~, 
CD2 1 0 0~, CD2 8 0 0~', CD4 00~j', 
CD 5 5, jj~..~d~:, JJ~.~E, JJ::~,~, 
JJ~--, JJ~J~, LC~, LC~, LC~, LC~, M~, M~, NN~, NN~, 
NN~, NN~, NN "~, NN~, NN~, NN~-~, NN~, NN~, 
NN~ NN~J~, NN~, NN~\]~, NN~J NN~, NN~, NN~, NNJ~, 
NN~ NN~ NN~;, NN~, NN~J~, NN~J~ NN~, NN~\]~ NN~, 
NN~ NN~, NN~, NN~, NN~ NN~, NN~, NN~, NN~ 
NN~ NN/J~, NN\]~I~, NN~, NN~, 
NN~, NN ~, NN~ NN~, NN~t, NN~, NN~, NN~, NN~, 
NR~, NR~, NRP~~, NR~, NR~~, NR~, NR~, NR~, 
NR~-~, NR~X~, NR~'~-~ " ~, NR~~, NR~, NR~ " ~~, 
NR~J~, NR~, NRJ~, NR-~, NR~, NR~, NR~, NR~, 
NR~, NR~, NTI 1 ~, NTI 8 ~ 3~E, NTI 8~, NT1 9 ~ 9~ E, NTI ~ 8 2~E, 
NTI ~ 8 6~E, NTI ~ 8 9~E, NTIS, 
PUt'i, P~, e~, VA~-~, VV~,., vvI~K vv~, vv3~, vv~lK vvt/~lii, 
vv:~& vv-~l~, vv-~, vv'~'~-, vv~-.k 
vv:~', vv~, vv~, vv~t, vv~a, 
vv~J~, vv~~l, vv~~, vv~~J, vv~l~A~, wE, vv~, vv~, 
vv~#L.t~, vv~ffff~.,, vv--~'~J=, 
vv~~!, vv~l; Tota tags: 12, rota 
new items: 146. 
121 
Tokens 
80,000 
Corpus New token%tags 
UPenn CTB 146 (24 NR) 
ASBC A 14 (6 N...) 
7 Corpus Building Implications 
The fact that the UPenn Chinese Treebank 
data approaches lexical and syntactic closure at 
rates comparable to the ASBC "A" file newspa- 
per collection suggests that if the UPenn data 
had been selected more narrowly, it might have 
reached closure for the economics domain in the 
newspaper genre even sooner. Some day corpus 
linguistics may only need much smaller collec- 
tions of annotated corpora than is the practice 
today, relying on new directions in sub 
research. In the case of the ASBC "A" col- 
lection, for example, a robust learning struc- 
ture should be able to build a useable model 
on 100,000 words worth of data like this Which 
exhibits strong tendencies to lexical and syn- 
tactic closure in the Chinese newspaper genre 
constrained to a given domain. 
If the UPenn CTB were enlarged by the in- 
fusion of more news stories on international 
aspects of Chinese economic development, the 
CTB might better reach lexical and syntactic 
closure. The following graph shows that the 
blind addition of 20K additional Chinese eco- 
nomic news stories does not aid closure much. 
This additional data spanned many topics not 
seen in the original 100K collection. If it had 
been selected precisely to aid closure by mea- 
suring its potential contribution before exten- 
sive hand annotation, the result could have been 
better. 
d.OKAugaer~d ~ L,gxical Clom 
i ." - • > , . . .- . -... . " ' .i ' . - 
0 "!!!I!!!II!I!1!!! 
~Olilllvt 
2gK Aug~mnt~d CTB ~ln~ct~ Closum 
-I-. i. 
i .. . . . . . • ... .. 
• 0 "- !III!iI!l!I!1t!!I 
Tolim~ 
Nevertheless, this expanded CTB collection 
is sufficiently improved that the rate of closure 
toward of the expanded collection is better (see 
table below). 
121,000 
122,000 
123,000 
New Tokens 
55 
61 
61 
New Token+Tags 
04 (12 NIt) 
68 '17 NIt) 
76 (23 NIt) 
124,000 78 82 (18 Nit) 
125,000 48 63 (11 NIL) 
126,000 52 .61 (16 NIt) 
Consequently, this expanded CTB collection 
is sufficiently improved that merging it with the 
ASBC "A" collection results in a far more mea- 
surably representative larger corpus as shown in 
the final two graphs below. The creation of such 
measurably representative large corpora out of 
such smaller, better focused sub build- 
ing blocks would be cheaper and faster with- 
out the resulting tools developed against such 
corpora suffering much degradation in speed or 
accuracy. 
Combined ASBC "A" and UPqmn CTB Lexlcal Clmm 
.... ....... .... i iiii!  .... i' 
u~ : . " :.../~.~-~: .:-L i ~ . ..-:~i:~. 
" 'i". ""'~ .~'' " "." "' " • " 
0 i iii I '1 J J Jl "!llJlll!lll!!!!ll 
Tel~ 
122 
Combined ~ "A.. = and UPenn ~ Syntactic CIo~xe 
20ooo 
I00~ ~- 
• i ):+ ~, 1 
TQk~ I'1$ 
8 Discussion 
Since only two Chinese tagged corpora are avail- 
able at the present time, only about 200,000 
words of Chinese corpora have been so stud- 
ied. But to see more work in this vein, one 
need only consult McEnery and Wilson's study 
of three English corpora: the IBM computer 
manual corpus, the Canadian Hansards, and the 
American Printing House for the Blind corpus 
(McEnery and Wilson, 1996). Their study (ex- 
haustively detailed in Chapter 6 of their book) 
spans more than 2.2 million words of tagged 
English text in three different domains. Con- 
sequently the total of 2.3 million words from 
McEnery and Wilson's results in tagged En- 
glish texts when combined with these results in 
tagged Chinese newspaper texts should satisfy 
any who might argue that there is insufficient 
data upon which to draw some general conclu- 
sions. 
This paper does not argue that the two clo- 
sure measures used are the only measures pos- 
sible. The argument here is simply that these 
two closure measures are used to spot when a 
sub corpus approaches closure--that is, 
when the curve of new types• and new combina- 
tions of type with token begins to flatten at a 
rate below ten percent. One can readily point 
out that no natural  corpus can ever 
guarantee closure. The best anyone can aspire 
to do today, given the current state of our art, 
is to only approach closure. 
9 References 

References 
D. Biber. 1993. Representativeness in corpus 
design, volume 8(4), pages 243-257. 
P. Bolt. 1994. The international corpus of 
english project-the hong kong experience. 
pages 15-24. 
It. Garside and A. McEnery. 1993. Treebank- 
ing: the compilation of a corpus of skeleton 
parsed sentences, pages 17-35. 
S. Greenbaum. 1992. A new corpus of english: 
Ice. pages 171-179. 
Z. Harris. 1968. Mathematical Structures of 
Language. New York: John Wiley and Sons. 
G. Leech. 1991. The state of the art in corpus 
linguistics, pages 8-29. 
T. McEnery and A. Wilson. 1996. Corpus Lin- 
guistics. Edinburgh: Edinburgh University 
Press. 
S. Sekine. 1994. A new direction for sublan- 
guage nlp. In Proceedings o/ the Interna- 
tional Conference on New Methods in Lan- 
guage Processing, CCL, UMIST, pages 123- 
129. 
