Proceedings of 
The Workshop on Comparing Corpora 
Held in conjunction with 
The 38th Annual Meeting of the 
Association for Computational Linguistics 
Edited by 
Adam Kilgarriff 
and 
Tony Berber Sardinha 
7 October 2000 
Hong Kong University of Science and Technology (HKUST) 
Hong Kong 
©2000 The Association for Computational Linguistics 
Order copies of this and other ACL workshop proceedings from: 
Association for Computational Linguistics (ACL) 
75 Paterson Street, Suite 9 
New Brunswick, NJ 08901 
USA 
Tel: +1-732-342-9100 
Fax: +1-732-342-9339 
acl@aclweb.org 
Preface 
Anyone who has worked with corpora will be all too aware of differences between them. De- 
pending on the differences, it may, or may not, be reasonable to expect results based on one corpus 
to also be valid for another. It may, or may not, be appropriate for a grammar, or parser, based on 
one to perform well on another. It may, or may not, be straightforward to port an application from 
a domain of the first text type to a domain of the second. Currently, characterisations of corpora are 
mostly textual and informal. A corpus is described as "Wall Street Journal" or "transcripts of business 
meetings" or "foreign learners' essays (intermediate grade)". It would be desirable to be able to place 
a new corpus in relation to existing ones, and to be able to quantify similarities and differences. 
Allied to corpus-similarity is corpus-homogeneity. An understanding of homogeneity is a pre- 
requisite to a measure of similarity - it makes little sense to compare a corpus sampled across many 
genres, like the Brown, with a corpus of weather forecasts, without first accounting for the one being 
broad, the other narrow. 
Given the centrality of corpora to contemporary language engineering, it is remarkable how little 
research there has been on corpus similarity. The only well-understood measure is cross-entropy, from 
Information Theory, which is widely used in language modelling, particularly for speech recognition 
(see, eg, Roukos 1996). However, it is not clear whether, or where, it is a good measure, and there is 
some evidence that it does not match our intuitions (Kilgarriff and Rose 1998, Kilgarriff in press). 
Biber's work (1988, 1995) on corpus characterisation, coming from sociolinguistics, has made a 
considerable impact, with various researchers applying the model in language engineering (eg Folch 
et al 2000) or subjecting it to critical scrutiny (Lee 2000). Studies in text classification, genre and 
sublanguage are also salient, but it is rarely evident how well the technologies developed in these 
fields are suited to measuring corpus similarity or homogeneity. 
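
As an illustration of the cross-entropy measure mentioned above, the following is a minimal sketch, not the procedure used in any of the cited work. It assumes a unigram model with add-one smoothing over a shared vocabulary; the function names, the smoothing choice and the two toy corpora are all hypothetical and chosen only for the example.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary.

    The smoothing scheme is an assumption made for this sketch.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def cross_entropy(test_tokens, model):
    """Average negative log2 probability (bits per token) of the test
    tokens under the given model."""
    return -sum(math.log2(model[w]) for w in test_tokens) / len(test_tokens)


# Two toy "corpora" (hypothetical data): weather text and financial text.
corpus_a = "rain is expected in the north and rain in the south".split()
corpus_b = "shares fell sharply as the market reacted to the report".split()
vocab = set(corpus_a) | set(corpus_b)

# Train on corpus A, then score A itself and corpus B against that model.
model_a = unigram_model(corpus_a, vocab)
h_aa = cross_entropy(corpus_a, model_a)
h_ab = cross_entropy(corpus_b, model_a)

print(f"H(A, model_A) = {h_aa:.3f} bits/token")
print(f"H(B, model_A) = {h_ab:.3f} bits/token")
```

In this sketch the text from the other domain scores a higher cross-entropy against the model than the text the model was built from. Note that the measure is asymmetric: scoring A against a model of B will in general give a different figure from scoring B against a model of A.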
There are of course many ways in which two corpora will differ, and different kinds of difference 
will be relevant for different kinds of purposes. Thus the kind of similarity that lets a part-of-speech 
tagger developed for one corpus work well on the other may differ from the kind that matters for 
Machine Translation. 
We currently lack a sophisticated vocabulary for talking about the various ways in which corpora 
differ, and hope that the workshop will contribute to the development of one. 
We welcomed contributions concerned with measuring and comparing corpora from any field. 

References 
Biber, Douglas. 1988. Variation across speech and writing. Cambridge University Press. 
Biber, Douglas. 1995. Dimensions in Register Variation. Cambridge University Press. 
Folch, Helka, Serge Heiden, Benoît Habert, Serge Fleury, Gabriel Illouz, Pierre Lafon, Julien Nioche and 
Sophie Prévost. 2000. TyPTex: Inductive typological text classification by multivariate statistical analysis 
for NLP systems tuning/evaluation. In Proc. 2nd LREC, Athens, Greece. Pp 141-148. 
Kilgarriff, Adam. In press. Comparing corpora. Int. Jnl. Corpus Linguistics. 
Kilgarriff, Adam and Tony Rose. 1998. Measures for corpus similarity and homogeneity. In Proc. EMNLP-3, 
pages 46-52, Granada, Spain, June. ACL-SIGDAT. 
Lee, David. 2000. Modelling Variation in spoken and written language: the multidimensional approach 
revisited. Ph.D. thesis, University of Lancaster. 
Roukos, Salim. 1996. Language Representation, chapter 1.6. NSF and EU Survey of the State of the Art in 
Human Language Technologies, www.cse.ogi/CSLU/HLTsurvey.html. 
