© 2003 Association for Computational Linguistics
Introduction to the Special Issue on the
Web as Corpus
Adam Kilgarriff∗
Lexicography MasterClass Ltd. and ITRI, University of Brighton
Gregory Grefenstette†
Clairvoyance Corporation
The Web, teeming as it is with language data, of all manner of varieties and languages, in
vast quantity and freely available, is a fabulous linguists’ playground. This special issue of
Computational Linguistics explores ways in which this dream is being realized.
1. Introduction
The Web is immense, free, and available by mouse click. It contains hundreds of
billions of words of text and can be used for all manner of language research.
The simplest language use is spell checking. Is it speculater or speculator? Google
gives 67 for the former (usefully suggesting the latter might have been intended) and
82,000 for the latter. Question answered.
Language scientists and technologists are increasingly turning to the Web as a
source of language data, because it is so big, because it is the only available source
for the type of language in which they are interested, or simply because it is free
and instantly available. This mode of work has grown dramatically from a standing
start seven years ago, and the Web is now used as a data source in a wide range of
research activities. The papers in this special issue form a sample of the best of it. This
introduction to the issue aims to survey the activities and explore recurring themes.
We first consider whether the Web is indeed a corpus, then present a history of
the theme in which we view the Web as a development of the empiricist turn that has
brought corpora center stage in the course of the 1990s. We briefly survey the range
of Web-based NLP research, then present estimates of the size of the Web, for English
and for other languages, and a simple method for translating phrases. Next we open
the Pandora’s box of representativeness (concluding that the Web is not representative
of anything other than itself, but then neither are other corpora, and that more work
needs to be done on text types). We then introduce the articles in the special issue and
conclude with some thoughts on how the Web could be put at the linguist’s disposal
rather more usefully than current search engines allow.
1.1 Is the Web a Corpus?
To establish whether the Web is a corpus we need to find out, discover, or decide what
a corpus is. McEnery and Wilson (1996, page 21) say
In principle, any collection of more than one text can be called a
corpus....But the term “corpus” when used in the context of modern
linguistics tends most frequently to have more specific connotations
than this simple definition provides for. These may be considered
under four main headings: sampling and representativeness, finite size,
machine-readable form, a standard reference.
∗ Lewes Rd, Brighton, BN2 4JG, UK. E-mail: Adam.Kilgarriff@itri.brighton.ac.uk
† Suite 700, 5001 Baum Blvd, Pittsburgh, PA 15213-1854. E-mail: grefen@clairvoyancecorp.com
Computational Linguistics Volume 29, Number 3
We would like to reclaim the term from the connotations. Many of the collections
of texts that people use and refer to as their corpus, in a given linguistic, literary, or
language-technology study, do not fit. A corpus comprising the complete published
works of Jane Austen is not a sample, nor is it representative of anything else. Closer
to home, Manning and Schütze (1999, page 120) observe:
In Statistical NLP, one commonly receives as a corpus a certain amount
of data from a certain domain of interest, without having any say in
how it is constructed. In such cases, having more training data is
normally more useful than any concerns of balance, and one should
simply use all the text that is available.
We wish to avoid a smuggling of values into the criterion for corpus-hood. McEnery
and Wilson (following others before them) mix the question “What is a corpus?” with
“What is a good corpus (for certain kinds of linguistic study)?” muddying the simple
question “Is corpus x good for task y?” with the semantic question “Is x a corpus at
all?” The semantic question then becomes a distraction, all too likely to absorb energies
that would otherwise be addressed to the practical one. So that the semantic question
may be set aside, the definition of corpus should be broad. We define a corpus simply
as “a collection of texts.” If that seems too broad, the one qualification we allow relates
to the domains and contexts in which the word is used rather than its denotation: A
corpus is a collection of texts when considered as an object of language or literary study.
The answer to the question “Is the Web a corpus?” is yes.
2. History
For chemistry or biology, the computer is merely a place to store and process infor-
mation gleaned about the object of study. For linguistics, the object of study itself (in
one of its two primary forms, the other being acoustic) is found on computers. Text
is an information object, and a computer’s hard disk is as valid a place to go for its
realization as the printed page or anywhere else.
The one-million-word Brown corpus opened the chapter on computer-based lan-
guage study in the early 1960s. Noting the singular needs of lexicography for big data,
in the 1970s Sinclair and Atkins inaugurated the COBUILD project, which raised the
threshold of viable corpus size from one million to, by the early 1980s, eight million
words (Sinclair 1987). Ten years on, Atkins again took the lead with the development
(from 1988) of the British National Corpus (BNC) (Burnard 1995), which raised
horizons tenfold once again with its 100 million words, and which was in addition
widely available at low cost and covered a wide spectrum of varieties of contemporary
British English.¹ As in all matters Zipfian, logarithmic graph paper is required. Where corpus
size is concerned, the steps of interest are 1, 10, 100, ..., not 1, 2, 3, ...
Corpora crashed into computational linguistics at the 1989 ACL meeting in Van-
couver, but they were large, messy, ugly objects clearly lacking in theoretical integrity
in all sorts of ways, and many people were skeptical regarding their role in the disci-
pline. Arguments raged, and it was not clear whether corpus work was an acceptable
part of the field. It was only with the highly successful 1993 special issue of this
journal, “Using Large Corpora” (Church and Mercer 1993), that the relation between
computational linguistics and corpora was consummated.
1 Across the Atlantic, a resurgence in empiricism was led by the success of the noisy-channel model in
speech recognition (see Church and Mercer [1993] for references).
Kilgarriff and Grefenstette Web as Corpus: Introduction
There are parallels with Web corpus work. The Web is anarchic, and its use is
not in the familiar territory of computational linguistics. However, as students with
no budget or contacts realize that it is the obvious place to obtain a corpus meeting
their specifications, as companies want the research they sanction to be directly related
to the language types they need to handle (almost always available on the Web), as
copyright continues to constrain “traditional” corpus development,² and as people want
to explore using more data and different text types, Web-based work will grow.
The Web walked in on ACL meetings starting in 1999. Rada Mihalcea and Dan
Moldovan (1999) used hit counts for carefully constructed search engine queries to
identify rank orders for word sense frequencies, as an input to a word sense dis-
ambiguation engine. Philip Resnik (1999) showed that parallel corpora—until then a
promising research avenue but largely constrained to the English-French Canadian
Hansard—could be found on the Web: We can grow our own parallel corpus using
the many Web pages that exist in parallel in local and in major languages. We are
glad to have the further development of this work (co-authored by Noah Smith) pre-
sented in this special issue. In the student session of ACL 2000, Rosie Jones and Rayid
Ghani (2001) showed how, using the Web, one can build a language-specific corpus
from a single document in that language. In the main session Atsushi Fujii and Tet-
suya Ishikawa (2000) demonstrated that descriptive, definition-like collections can be
acquired from the Web.
2.1 Some Current Themes
Since then there have been many papers, at ACL and elsewhere, and we can mention
only a few. The EU MEANING project (Rigau et al. 2002) takes forward the exploration
of the Web as a data source for word sense disambiguation, working from the premise
that within a domain, words often have just one meaning, and that domains can be
identified on the Web. Mihalcea and Tchklovski complement this use of Web as corpus
with Web technology to gather manual word sense annotations on the Word Expert
Web site.³ Santamaría et al., in this issue, discuss how to link word senses to Web
directory nodes, and thence to Web pages.
The Web is being used to address data sparseness for language modeling. In
addition to Keller and Lapata (this issue) and references therein, Volk (2001) gathers
lexical statistics for resolving prepositional phrase attachments, and Villasenor-Pineda
et al. (2003) “balance” their corpus using Web documents.
The information retrieval community now has a Web track as a component of its
TREC evaluation initiative. The corpus for this exercise is a substantial (around 100GB)
sample of the Web, largely using documents in the .gov top level domain, as frozen
at a given date (Hawking et al. 1999).
The Web has recently been used by groups at Sheffield and Microsoft, among
others, as a source of answers for question-answering applications, in a merge of search
engine and language-processing technologies (Greenwood, Roberts, and Gaizauskas
2002; Dumais et al. 2002). AnswerBus (Zheng 2002) will answer questions posed in
English, German, French, Spanish, Italian, and Portuguese.
2 Lawyers may argue that the legal issues for Web corpora are no different from those around non-Web
corpora. However, first, language researchers can develop Web corpora just by saving Web pages on
their own computer without any copying, thereby avoiding copyright issues, and second, a Web
corpus is a very minor subspecies of the caches and indexes held by search engines and assorted other
components of the infrastructure of the Web: If a Web corpus is infringing copyright, then it is merely
doing on a small scale what search engines such as Google are doing on a colossal scale.
3 〈http://teach-computers.org/word-expert.html〉.
Naturally, the Web is also coming into play in other areas of linguistics. Agirre
et al. (2000) are exploring the automatic population of existing ontologies using the
Web as a source for new instances. Varantola (2000) shows how translators can use
“just-in-time” sublanguage corpora to choose correct target language terms for areas
in which they are not expert. Fletcher (2002) demonstrates methods for gathering and
using Web corpora in a language-teaching context.
2.2 The 100M Words of the BNC
One hundred million words is a large enough corpus for many empirical strategies
for learning about language, either for linguists and lexicographers (Baker, Fillmore,
and Lowe 1998; Kilgarriff and Rundell 2002) or for technologies that need quantitative
information about the behavior of words as input (most notably parsers [Briscoe and
Carroll 1997; Korhonen 2000]). However, for some purposes, it is not large enough.
This is an outcome of the Zipfian nature of word frequencies. Although 100 million is
a huge number, and the BNC contains ample information on the dominant meanings
and usage patterns for the 10,000 words that make up the core of English, the bulk
of the lexical stock occurs less than 50 times in the BNC, which is not enough to
draw statistically stable conclusions about the word. For rarer words, rare meanings
of common words, and combinations of words, we frequently find no evidence at all.
Researchers are obliged to look to larger data sources (Keller and Lapata, this issue;
also Section 3.3). They find that probabilistic models of language based on very large
quantities of data, even if those data are noisy, are better than ones based on estimates
(using sophisticated smoothing techniques) from smaller, cleaner data sets.
Another argument is made vividly by Banko and Brill (2001). They explore the
performance of a number of machine learning algorithms (on a representative dis-
ambiguation task) as the size of the training corpus grows from a million to a bil-
lion words. All the algorithms steadily improve in performance, though the question
“Which is best?” gets different answers for different data sizes. The moral: Perfor-
mance improves with data size, and getting more data will make more difference than
fine-tuning algorithms.
2.3 Giving and Taking
Dragomir Radev has made a useful distinction between NLP “giving” and “taking.”⁴
NLP can give to the Web technologies such as summarization (for Web pages or
Web search results); machine translation; multilingual document retrieval; question-
answering and other strategies for finding not only the right document, but the right
part of a document; and tagging, parsing, and other core technologies (to improve
indexing for search engines, the viability of this being a central information retrieval
research question for the last 20 years). “Taking” is, simply, using the Web as a source
of data for any CL or NLP goal and is the theme of this special issue. If we focus too
closely on the giving side of the equation, we look only at short to medium-term goals.
For the longer term, for “giving” as well as for other purposes, a deeper understanding
of the linguistic nature of the Web and its potential for CL/NLP is required. For that,
we must take the Web itself, in whatever limited way, as an object of study.
Much Web search engine technology has been developed with reference to language
technology. The prototype for AltaVista was developed in a joint project between
Oxford University Press (exploring methods for corpus lexicography [Atkins
1993]) and DEC (interested in fast access to very large databases). Language
identification algorithms (Beesley 1988; Grefenstette 1995), now widely used in Web search
engines, were developed as NLP technology. The special issue explores a “homecoming”
of Web technologies, with the Web now feeding one of the hands that fostered it.
4 In remarks made in a panel discussion at the Empirical NLP Conference, Hong Kong, October 2002.
3. Web Size and the Multilingual Web
There were 56 million registered network addresses in July 1999, 125 million in January
2001, and 172 million in January 2003. A plot of this growth of the Web in terms of
computer hosts can easily be generated. Linguistic aspects take a little more work
and can be estimated only by sampling and extrapolation. Lawrence and Giles (1999)
compared the overlap between page lists returned by different Web search engines over the
same set of queries and estimated that, in 1999, there were 800 million indexable Web
pages available. By sampling pages, and estimating an average page length of seven
to eight kilobytes of nonmarkup text, they concluded that there might be six terabytes
of text available then. In 2003, Google claims to search four times this number of Web
pages, which raises the number of bytes of text available just through this one search
engine to over 20 terabytes from directly accessible Web pages. At an average of 10
bytes per word, a generous estimate for Latin-alphabet languages, that suggests two
thousand billion words.
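The arithmetic behind these figures can be sketched in a few lines. This is only a back-of-envelope illustration using the numbers quoted above; the function names and the choice of 7.5 kilobytes as the midpoint of "seven to eight kilobytes" are assumptions made for the example, not part of the cited studies.

```python
# Back-of-envelope Web size arithmetic, using only figures quoted in the text.
KB = 1_000
TB = 1_000_000_000_000

def text_bytes(pages: int, avg_page_bytes: float) -> float:
    """Total nonmarkup text, given a page count and an average page length."""
    return pages * avg_page_bytes

def words_from_bytes(n_bytes: float, bytes_per_word: float = 10.0) -> float:
    """Convert a byte count into a word count at a fixed bytes-per-word rate."""
    return n_bytes / bytes_per_word

# 1999: 800 million indexable pages at roughly 7.5 KB each (assumed midpoint).
bytes_1999 = text_bytes(800_000_000, 7.5 * KB)

# 2003: "over 20 terabytes", converted at 10 bytes per word.
words_2003 = words_from_bytes(20 * TB)

print(f"1999 text: {bytes_1999 / TB:.1f} TB")
print(f"2003 words: {words_2003:.2e}")
```

Changing the bytes-per-word rate shows how sensitive the word estimate is to that one assumption.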
The Web is clearly a multilingual corpus. How much of it is English? Xu (2000) es-
timated that 71% of the pages (453 million out of 634 million Web pages indexed by the
Excite engine at that time) were written in English, followed by Japanese (6.8%), Ger-
man (5.1%), French (1.8%), Chinese (1.5%), Spanish (1.1%), Italian (0.9%), and Swedish
(0.7%).
We have measured the counts of some English phrases according to various search
engines over time and compared them with counts in the BNC, which we know has
100 million words. Table 1 shows these counts in the BNC, on AltaVista in 1998 and
in 2001, and then on Alltheweb in 2003. For example, the phrase deep breath appears
732 times in the BNC. It was indexed 54,550 times by AltaVista in 1998. This rose
to 170,921 in 2001. And in 2003, we could find 868,631 Web pages containing the
contiguous words deep breath according to AlltheWeb. The numbers found through the
search engines are typically three or more orders of magnitude higher than the BNC counts,
giving a first indication of the size of the English corpus available on the Web.
Table 1
Frequencies of English phrases in the BNC and on AltaVista in 1998 and 2001, and on
AlltheWeb in 2003. The counts for the BNC and AltaVista are for individual occurrences of the
phrase. The counts for AlltheWeb are page counts (the phrase may appear more than once on
any page).
Sample Phrase               BNC (100 M)   WWW Fall 1998   WWW Fall 2001   WWW Spring 2003
medical treatment                   414          46,064         627,522         1,539,367
prostate cancer                      39          40,772         518,393         1,478,366
deep breath                         732          54,550         170,921           868,631
acrylic paint                        30           7,208          43,181           151,525
perfect balance                      38           9,735          35,494           355,538
electromagnetic radiation            39          17,297          69,286           258,186
powerful force                       71          17,391          52,710           249,940
concrete pipe                        10           3,360          21,477            43,267
upholstery fabric                     6           3,157           8,019            82,633
vital organ                          46           7,371          28,829            35,819
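As a rough illustration of the comparison, the following sketch computes the ratio of the Spring 2003 counts to the BNC counts for the phrases in Table 1. It is only indicative: the 2003 figures are page counts while the BNC figures are occurrence counts, so the ratios are not strictly like for like. The variable names are chosen for the example.

```python
import math

# (BNC count, Spring 2003 AlltheWeb page count), copied from Table 1.
TABLE_1 = {
    "medical treatment": (414, 1_539_367),
    "prostate cancer": (39, 1_478_366),
    "deep breath": (732, 868_631),
    "acrylic paint": (30, 151_525),
    "perfect balance": (38, 355_538),
    "electromagnetic radiation": (39, 258_186),
    "powerful force": (71, 249_940),
    "concrete pipe": (10, 43_267),
    "upholstery fabric": (6, 82_633),
    "vital organ": (46, 35_819),
}

# Ratio of Web count to BNC count for each phrase.
ratios = {phrase: web / bnc for phrase, (bnc, web) in TABLE_1.items()}

for phrase, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    # log10 of the ratio is the number of orders of magnitude.
    print(f"{phrase:27s} x{ratio:>9,.0f}  (10^{math.log10(ratio):.1f})")
```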
We can derive a more precise estimate of the number of words available through
a search engine by using the counts of function words as predictors of corpus size.
Function words, such as the, with, and in, occur with a frequency that is relatively
stable over many different types of texts. From a corpus of known size, we can cal-
culate the frequency of the function words and extrapolate. In the 90-million-word
written-English component of the BNC, the appears 5,776,487 times, around seven
times for every 100 words. In the U.S. Declaration of Independence, the occurs 84
times. We predict that the Declaration is about 84 × 100/7 = 1,200 words long. In fact,
the text contains about 1,500 words. Using the frequency of one word gives a first
approximation. A better result can be obtained by using more data points.
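The single-word extrapolation can be written out as a short sketch. The function name is hypothetical; the figures are the ones given above. The sketch shows both the rounded rate used in the text (seven per 100 words) and the rate implied by the raw BNC count, which gives a slightly different prediction.

```python
def estimate_corpus_size(target_count: int, rel_freq: float) -> float:
    """Predict corpus size in words from one word's count and its
    relative frequency in a reference corpus of known size."""
    return target_count / rel_freq

# Rounded rate from the text: "the" occurs about 7 times per 100 words.
rounded_estimate = estimate_corpus_size(84, 7 / 100)

# Rate implied by the raw figures: 5,776,487 occurrences in 90 million words.
bnc_rel_freq = 5_776_487 / 90_000_000
exact_estimate = estimate_corpus_size(84, bnc_rel_freq)

print(round(rounded_estimate))  # prediction with the rounded rate
print(round(exact_estimate))    # prediction with the unrounded BNC rate
```

Either way the prediction falls short of the roughly 1,500 words the text actually contains, which is why more data points are needed.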
From the first megabyte of the German text found in the European Corpus Initiative
Multilingual Corpus,⁵ we extracted frequencies for function words and other
short, common words. We removed from the list words that were also common words
in other languages.⁶ AltaVista provided, on its results pages, along with a page count
for a query, the number of times that each query word was found on the Web.⁷
Table 2 shows the relative frequency of the words from our known corpus, the index
frequencies that AltaVista gave (February 2000), and the consequent estimates of the
size of the German-language Web indexed by AltaVista.
We set aside words which give discrepant predictions (too high or too low) as (1)
AltaVista does not record in its index the language a word comes from, so the count
for the string die includes both the German and English occurrences, and (2) a word
might be under- or overrepresented in the training corpus or on the Web (consider
here, which occurs very often in “click here”). Averaging the remaining predictions
gives an estimate of three billion words of German that could be accessed through
AltaVista on the day in February 2000 that we conducted our test.
Table 2
Short German words in the ECI corpus and via AltaVista, giving German Web estimates.
Word      Known-Size-Corpus      AltaVista       Prediction for
          Relative Frequency     Frequency       German-Language Web
oder      0.00561180              13,566,463     2,417,488,684
sind      0.00477555              11,944,284     2,501,132,644
auch      0.00581108              15,504,327     2,668,062,907
wird      0.00400690              11,286,438     2,816,750,605
nicht     0.00646585              18,294,174     2,829,353,294
eine      0.00691066              19,739,540     2,856,389,983
sich      0.00604594              17,547,518     2,902,363,900
ist       0.00886430              26,429,327     2,981,546,991
auf       0.00744444              24,852,802     3,338,438,082
und       0.02892370             101,250,806     3,500,617,348
Average                                          3,068,760,356
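The per-word predictions in Table 2 follow from dividing each AltaVista frequency by the word's relative frequency in the known-size corpus. The sketch below reproduces that step for the ten rows shown. The outlier filter is purely an assumption for illustration (keep predictions within a factor of 1.5 of the median); the text does not say how discrepant words were actually identified, and the published average may rest on a different set of words than the rows listed here.

```python
from statistics import mean, median

# word: (relative frequency in known-size corpus, AltaVista frequency), from Table 2.
TABLE_2 = {
    "oder": (0.00561180, 13_566_463),
    "sind": (0.00477555, 11_944_284),
    "auch": (0.00581108, 15_504_327),
    "wird": (0.00400690, 11_286_438),
    "nicht": (0.00646585, 18_294_174),
    "eine": (0.00691066, 19_739_540),
    "sich": (0.00604594, 17_547_518),
    "ist": (0.00886430, 26_429_327),
    "auf": (0.00744444, 24_852_802),
    "und": (0.02892370, 101_250_806),
}

def predict_size(rel_freq: float, index_count: int) -> float:
    """Predicted number of words on the indexed Web for one probe word."""
    return index_count / rel_freq

def keep_consistent(predictions: dict, factor: float = 1.5) -> dict:
    """Hypothetical filter: drop predictions far from the median."""
    mid = median(predictions.values())
    return {w: p for w, p in predictions.items() if mid / factor <= p <= mid * factor}

predictions = {w: predict_size(rf, n) for w, (rf, n) in TABLE_2.items()}
kept = keep_consistent(predictions)
estimate = mean(kept.values())

print(f"{len(kept)} words kept, estimate of about {estimate / 1e9:.1f} billion words")
```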
5 〈http://www.elsnet.org/resources/eciCorpus.html〉.
6 These lists of short words and frequencies were initially used to create a language identifier.
7 AltaVista has recently stopped providing information about how often individual words in a query
have been indexed and now returns only a page count for the entire query.
Table 3
Estimates of Web size in words, as indexed by AltaVista, for various languages.
Language       Web Size
Albanian           10,332,000
Breton             12,705,000
Welsh              14,993,000
Lithuanian         35,426,000
Latvian            39,679,000
Icelandic          53,941,000
Basque             55,340,000
Latin              55,943,000
Esperanto          57,154,000
Roumanian          86,392,000
Irish              88,283,000
Estonian           98,066,000
Slovenian         119,153,000
Croatian          136,073,000
Malay             157,241,000
Turkish           187,356,000
Catalan           203,592,000
Slovakian         216,595,000
Polish            322,283,000
Finnish           326,379,000
Danish            346,945,000
Hungarian         457,522,000
Czech             520,181,000
Norwegian         609,934,000
Swedish         1,003,075,000
Dutch           1,063,012,000
Portuguese      1,333,664,000
Italian         1,845,026,000
Spanish         2,658,631,000
French          3,836,874,000
German          7,035,850,000
English        76,598,718,000
This technique has been tested on controlled data (Grefenstette and Nioche 2000)
in which corpora of different languages were mixed in various proportions and found
to give reliable results. Table 3 provides estimates for the number of words that
were available in 32 different Latin-script languages through AltaVista in March 2001.
English led the pack with 76 billion words, and seven additional languages already
had over a billion.
From the table, we see that even “smaller” languages such as Slovenian, Croatian,
Malay, and Turkish have more than one hundred million words on the Web. Much of
the research that has been undertaken on the BNC simply exploits its scale and could
be transferred directly to these languages.
The numbers presented in Table 3 are lower bounds, for a number of reasons:
• AltaVista covers only a fraction of the indexable Web pages available
(the fraction was estimated at just 15% by Lawrence and Giles [1999]).
• AltaVista may be biased toward North American (mainly
English-language) pages by the strategy it uses to crawl the Web.
• AltaVista indexes only pages that can be directly called by a URL and
does not index text found in databases that are accessible through dialog
windows on Web pages (the “hidden Web”). This hidden Web is vast
(consider MedLine,⁸ just one such database, with more than five billion
words; see also Ipeirotis, Gravano, and Sahami [2001]), and it is not
considered at all in the AltaVista estimates.
Repeating the procedure after an interval, the second author and Nioche showed
that the proportion of non-English text to English is growing. In October 1996 there
were 38 German words for every 1,000 words of English indexed by AltaVista. In
August 1999, there were 71, and in March 2001, 92.
8 〈http://www4.ncbi.nlm.nih.gov/PubMed/〉.
Table 4
AltaVista frequencies for candidate translations of groupe de travail.
labor cluster              21
labor grouping             28
labour concern             45
labor concern              77
work grouping             124
work cluster              279
labor collective          423
labour collective         428
work collective           759
work concern              772
labor group             3,977
labour group           10,389
work group            148,331
3.1 Finding the Right Translation
How can these large numbers be used for other language-processing tasks? Consider
the compositional French noun phrase groupe de travail. In the MEMODATA bilingual
dictionary,⁹ the French word groupe is translated by the English words cluster, group,
grouping, concern, and collective. The French word travail translates as work, labor, or
labour. Many Web search engines allow the user to search for adjacent phrases. Com-
bining the possible translations of groupe de travail and submitting them to AltaVista
in early 2003 yielded the counts presented in Table 4. The phrase work group is about 14
times more frequent than the next candidate and is also the best translation among the tested
possibilities. A set of controlled experiments of this form is described in Grefenstette
(1999). In Grefenstette’s study, a good translation was found in 87% of ambiguous
cases from German to English and 86% of ambiguous cases from Spanish to English.
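The selection step can be sketched as follows: build every combination of the dictionary translations, look up a frequency for each, and keep the most frequent. Here the lookup is simply a dictionary holding the Table 4 counts; in practice it would be a query to a search engine, which is not modeled. The function names are hypothetical, and treating combinations that Table 4 does not list as zero is an assumption made for the example.

```python
from itertools import product

# Dictionary translations given in the text.
GROUPE = ["cluster", "group", "grouping", "concern", "collective"]
TRAVAIL = ["work", "labor", "labour"]

# AltaVista counts from Table 4.
HIT_COUNTS = {
    "labor cluster": 21, "labor grouping": 28, "labour concern": 45,
    "labor concern": 77, "work grouping": 124, "work cluster": 279,
    "labor collective": 423, "labour collective": 428, "work collective": 759,
    "work concern": 772, "labor group": 3_977, "labour group": 10_389,
    "work group": 148_331,
}

def candidate_phrases(head_words, modifier_words):
    """English noun-noun candidates: modifier first, then head."""
    return [f"{m} {h}" for m, h in product(modifier_words, head_words)]

def best_translation(candidates, counts):
    """Pick the candidate with the highest count; unlisted phrases count as 0."""
    return max(candidates, key=lambda phrase: counts.get(phrase, 0))

candidates = candidate_phrases(GROUPE, TRAVAIL)
print(best_translation(candidates, HIT_COUNTS))
```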
4. Representativeness
We know the Web is big, but a common response to a plan to use the Web as a
corpus is “but it’s not representative.” There are a great many things to be said about
this. It opens up a pressing yet almost untouched practical and theoretical issue for
computational linguistics and language technology.
4.1 Theory
First, “representativeness” begs the question “representative of what?” Outside very
narrow, specialized domains, we do not know with any precision what existing corpora
might be representative of. If we wish to develop a corpus of general English, we
may think it should be representative of general English, so we then need to define
the population of “general English-language events” of which the corpus will be a
sample. Consider the following issues:
• Production and reception: Is a language event an event of speaking or
writing, or one of reading or hearing? Standard conversations have, for
each utterance, one speaker and one hearer. A Times newspaper article
has (roughly) one writer and several hundred thousand readers.
• Speech and text: Do speech events and written events have the same
status? It seems likely that there are orders of magnitude more speech
events than writing events, yet most corpus research to date has tended
to focus on the more tractable task of gathering and working with text.
• Background language: Does muttering under one’s breath or talking in
one’s sleep constitute a speech event, and does doodling with words
constitute a writing event? Or, on the reception side, does passing (and
possibly subliminally reading) a roadside advertisement constitute a
reading event? And what of having the radio on but not attending to it,
or the conversational murmur in a restaurant?
• Copying: If I’d like to teach the world to sing, and, like Michael Jackson or
the Spice Girls, am fairly successful in this goal and everyone sings my
song, then does each individual singing constitute a distinct language
production event? In the text domain, organizations such as Reuters
produce news feeds that are typically adapted to the style of a particular
newspaper and then republished: Is each republication a new writing
event? (These issues, and related themes of cut-and-paste authorship,
ownership, and plagiarism, are explored in Wilks [2003].)
9 See 〈http://www.elda.fr/cata/text/M0001.html〉. The basic multilingual lexicon produced by
MEMODATA contains 30,000 entries for five languages: French, English, Italian, German, Spanish.
4.2 Technology
Application developers urgently need to know what to do about sublanguages. It
has often been argued that, within a sublanguage, few words are ambiguous, and a
limited repertoire of grammatical structures is used (Kittredge and Lehrberger 1982).
This suggests that sublanguage-specific application development is substantially simpler
than general-language application development. However, many of the resources
that developers may wish to use are general-language resources, such as, for English,
WordNet, ANLT, XTag, COMLEX, and the BNC. Are they relevant for building ap-
plications for sublanguages? Can they be used? Is it better to use a language model
based on a large general-language corpus or a relatively tiny corpus of the right kind
of text? Nobody knows. There is currently no theory, no mathematical model, and
almost no discussion.
A related issue is that of porting an application from the sublanguage for which
it was developed to another. It should be possible to use corpora for the two sublan-
guages to estimate how large a task this will be, but again, our understanding is in
its infancy.
4.3 Language Modeling
Much work in recent years has gone into developing language models. Clearly, the
statistics for different types of text will be different (Biber 1993). This imposes a lim-
itation on the applicability of any language model: We can be confident only that
it predicts the behavior of language samples of the same text type as the training-
data text type (and we can be entirely confident only if training and test samples are
random samples from the same source).
When a language technology application is put to use, it will be applied to new
text for which we cannot guarantee the text type characteristics. There is little work
on assessing how well one language model fares when applied to a text type that is
different from that of the training corpus. Two studies in this area are Sekine (1997)
and Gildea (2001), both of which show substantial variation in model performance
when the training corpus changes. The lack of a theory of text types leaves us without
a way of assessing the usefulness of language-modeling work.
Table 5
Hits for Spanish pensar que with and without possible “dequeísmo” errors (spurious de
between the verb and the relative), from Alltheweb.com (March 2003). Not all items are errors
(e.g., “...pienso de que manera...”, “...think how...”). The correct form is always at least 400
times more common than any potentially incorrect form.
Form                   Hits
pienso de que           388
pienso que          356,874
piensas de que          173
piensas que          84,896
piense de que            92
piense que           67,243
pensar de que         1,640
pensar que          661,883
4.4 Language Errors
Web texts are produced by a wide variety of authors. In contrast to paper-based, copy-
edited published texts, Web-based texts may be produced cheaply and rapidly with
little concern for correctness. On Google a search for “I beleave” has 3,910 hits, and
“I beleive,” 70,900. The correct “I believe” appears on over four million pages. Table 5
presents what is regarded as a common grammatical error in Spanish, comparing the
frequency of such forms to the accepted forms on the Web. All the “erroneous” forms
exist, but much less often than the “correct” forms. The Web is a dirty corpus, but
expected usage is much more frequent than what might be considered noise.
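The comparison of accepted and suspect forms amounts to a simple ratio per pair. The sketch below computes it for the Spanish forms in Table 5; the variable names are chosen for the example, and the counts are the March 2003 hit counts reported there.

```python
# (hits with spurious "de", hits for the accepted form), from Table 5.
TABLE_5 = {
    "pienso": (388, 356_874),
    "piensas": (173, 84_896),
    "piense": (92, 67_243),
    "pensar": (1_640, 661_883),
}

# How many times more frequent the accepted form is than the suspect one.
ratios = {verb: accepted / suspect for verb, (suspect, accepted) in TABLE_5.items()}

for verb, ratio in ratios.items():
    print(f"{verb} que is about {ratio:,.0f} times as frequent as {verb} de que")
```

On these counts the accepted form is between roughly 400 and 900 times more frequent, so a simple frequency threshold would separate expected usage from noise in each pair.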
4.5 Sublanguages and General-Language-Corpus Composition
A language can be seen as a modest core of lexis, grammar, and constructions, plus
a wide array of different sublanguages, as used in each of a myriad of human ac-
tivities. This presents a challenge to general-language resource developers: Should
sublanguages be included? The three possible positions are
• No, none should.
• Some, but not all, should.
• Yes, all should.
The problem with the first position is that, with all sublanguages removed, the
residual core gives an impoverished view of language (quite apart from demarcation
issues and the problem of determining what is left). The problem with the second is
that it is arbitrary. The BNC happens to include cake recipes and research papers on
gastro-uterine diseases, but not car manuals or astronomy texts. The third has not,
until recently, been a viable option.
4.6 Literature
To date, corpus developers have been obliged to make pragmatic decisions about the
sorts of text to go into a corpus. Atkins, Clear, and Ostler (1992) describe the desiderata
and criteria used for the BNC, and this stands as a good model for a general-purpose,
general-language corpus. The word representative has tended to fall out of discussions,
to be replaced by the meeker balanced.
The recent history of mathematically sophisticated modeling of language variation
begins with Biber (1988), who identifies and quantifies the linguistic features associated
with different spoken and written text types. Habert and colleagues (Folch et al. 2000;
Beaudouin et al. 2001) have been developing a workstation for specifying subcorpora
according to text type, using Biber-style analyses, among others. In Kilgarriff (2001)
we present a first pass at quantifying similarity between corpora, and Cavaglia (2002)
continues this line of work. As mentioned above, Sekine (1997) and Gildea (2001)
directly address the relation between NLP systems and text type; one further such item
is Roland et al. (2000). Buitelaar and Sacaleanu (2001) explore the relation between
domain and sense disambiguation. A practical discussion of a central technical concern
is Vossen (2001), which tailors a general-language resource for a domain.
Baayen (2001) presents sophisticated mathematical models for word frequency
distributions, and it is likely that his mixture models have potential for modeling
sublanguage mixtures. His models have been developed with a specific, descriptive
goal in mind and using a small number of short texts: It is unclear whether they can
be usefully applied in NLP.
Although the extensive literature on text classification (Manning and Schütze 1999,
pages 575–608) is certainly relevant, it most often starts from a given set of categories
and cannot readily be applied to the situation in which the categories are not known in
advance. Also, the focus is usually on content words and topics or domains, with other
differences of genre or sublanguage remaining unexamined. Exceptions focusing on
genre include Kessler, Nunberg, and Schütze (1997) and Karlgren and Cutting (1994).
4.7 Representativeness: Conclusion
The Web is not representative of anything else. But neither are other corpora, in any
well-understood sense. Picking away at the question merely exposes how primitive
our understanding of the topic is and leads inexorably to larger and altogether more
interesting questions about the nature of language, and how it might be modeled.
“Text type” is an area in which our understanding is, as yet, very limited. Although
further work is required irrespective of the Web, the use of the Web forces the issue.
Where researchers use established corpora, such as Brown, the BNC, or the Penn
Treebank, researchers and readers are willing to accept the corpus name as a label for
the type of text occurring in it without asking critical questions. Once we move to the
Web as a source of data, and our corpora have names like “April03-sample77,” the
issue of how the text type(s) can be characterized demands attention.
5. Introduction to Articles in This Special Issue
One use of a corpus is to extract a language model: a list of weighted words, or
combinations of words, that describe (1) how words are related, (2) how they are
used with each other, and (3) how common they are in a given domain. Language
models are used in speech processing to predict which word combinations are likely
interpretations of a sound stream, in information retrieval to decide which words are
useful indicators of a topic, and in machine translation to identify good translation
candidates.
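To make the notion of a language model as “a list of weighted words, or combinations of words” concrete, here is a minimal sketch that estimates bigram probabilities by relative frequency. The toy corpus is invented for illustration, the function name bigram_model is hypothetical, and no smoothing is applied, so unseen pairs simply have no entry.

```python
from collections import Counter


def bigram_model(tokens):
    """Estimate P(next | previous) by relative frequency, without smoothing."""
    # Count each word only where it has a successor, so the probabilities
    # for any given previous word sum to 1.
    previous_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {pair: n / previous_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy corpus, invented for illustration only.
corpus = "the web is a corpus and the web is free".split()
model = bigram_model(corpus)

print(model[("the", "web")])  # probability that "web" follows "the"
print(model[("is", "a")])     # probability that "a" follows "is"
```

A model built this way assigns nothing to word pairs absent from the corpus, which is the data-sparseness problem that motivates turning to larger sources of text.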
In this volume, Celina Santamaría, Julio Gonzalo, and Felisa Verdejo describe how
to build sense-tagged corpora from the Web by associating word meanings with Web
page directory nodes. The Open Directory Project (at 〈dmoz.org〉) is a collaborative,
volunteer project for classifying Web pages into a taxonomic hierarchy. Santamaría et
al. present an algorithm for attaching WordNet word senses to nodes in this same
taxonomy, thus providing automatically created links between word senses and Web
pages. They also show how this method can be used for automatic acquisition of
sense-tagged corpora, from which one could, among other things, produce language
models tied to certain senses of words, or for a certain domain.
Unseen words, or word sequences—that is, words or sequences not occurring in
training data—are a problem for language models. If the corpus from which a particu-
lar model is extracted is too small, there are many such sequences. Taking the second
author’s work, as described above, as a starting point, Frank Keller and Mirella Lapata
examine how useful the Web is as a source of frequency information for rare items:
specifically, for dependency relations involving two English words such as
<fulfill OBJECT obligation>. They generate pairs of common words, constructing combinations
that are and are not attested in the BNC. They then compare the frequency of these
combinations in a larger 325-million-word corpus and on the Web. They find that Web
frequency counts are consistent with those for other large corpora. They also report
on a series of human-subject experiments in which they establish that Web statistics
are good at predicting the intuitive plausibility of predicate-argument pairs. Other
experiments discussed in their article show that Web counts correlate reliably with
counts re-created using class-based smoothing and overcome some problems of data
sparseness in the BNC.
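One way to check whether two sets of frequency counts are consistent is to correlate them after a log transformation. The sketch below is illustrative only: the summary above does not name the statistic used, so the choice of Pearson's r is an assumption, and all counts are invented rather than taken from the article.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_counts(counts):
    """Log-transform counts; adding 1 avoids log(0) for unseen pairs."""
    return [math.log10(c + 1) for c in counts]


# Invented counts for five word pairs, for illustration only.
corpus_counts = [0, 3, 12, 40, 150]
web_counts = [25, 900, 4_100, 9_800, 61_000]

r = pearson(log_counts(corpus_counts), log_counts(web_counts))
print(f"r = {r:.3f}")
```

Note that the first pair is unseen in the smaller corpus but still has a nonzero Web count, which is the situation in which Web frequencies are of most interest.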
Other very large corpora are available for English (English is an exception), and
the other three papers in the special issue all exploit the multilinguality of the Web.
Andy Way and Nano Gough show how the Web can provide data for an example-
based machine translation (Nagao 1984) system. First, they extract 200,000 phrases
from a parsed corpus. These phrases are sent to three online translation systems. Both
original phrases and translations are chunked. From these pairings a set of chunk
translations is extracted to be applied in a piecewise fashion to new input text. The
authors use the Web again at a final stage to rerank possible translations by verifying
which subsequences among the possible translations are most attested.
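The final reranking step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the “subsequences” are taken to be word bigrams, the score is a sum of log counts, and the dictionary ATTESTED stands in for Web hit counts with invented figures. All of these choices are assumptions.

```python
import math

# Stand-in for Web hit counts; a real system would query a search engine.
# All figures are invented for illustration.
ATTESTED = {
    "the red": 900,
    "red house": 1_200,
    "house red": 15,
    "the house": 5_000,
}


def bigrams(sentence):
    """Return the contiguous word pairs of a sentence as strings."""
    words = sentence.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def attestation_score(candidate, counts=ATTESTED):
    """Sum log(1 + count) over the candidate's bigrams; unseen bigrams add 0."""
    return sum(math.log1p(counts.get(b, 0)) for b in bigrams(candidate))


def rerank(candidates, counts=ATTESTED):
    """Order candidate translations from most to least attested."""
    return sorted(candidates, key=lambda c: attestation_score(c, counts), reverse=True)


print(rerank(["the house red", "the red house"]))
```

Summing logarithms rather than raw counts keeps a single very frequent bigram from outweighing a poorly attested one elsewhere in the candidate.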
The two remaining articles present methods for building aligned bilingual corpora
from the Web. It seems plausible that such automatic construction of translation dic-
tionaries can palliate the lack of translation resources for many language pairs. Philip
Resnik was the first to recognize that it is possible to build large parallel bilingual
corpora from the Web. He found that one can exploit the appearance of language
flags and other clues that often lead to a version of the same page in a different
language.¹⁰ In this issue, Resnik and Noah Smith present their STRAND system for
building bilingual corpora from the Web.
An alternative method is presented by Wessel Kraaij, Jian-Yun Nie, and Michel
Simard. They use the resulting parallel corpora to induce a probabilistic translation
dictionary that is then embedded into a cross-language information retrieval system.
Various alternative embeddings are evaluated using the CLEF (Peters 2001) multilin-
gual information retrieval test beds.
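As an illustration of how a probabilistic translation dictionary can be induced from a parallel corpus, the sketch below runs expectation-maximization for IBM Model 1 on three invented sentence pairs. The summary above does not say which model the authors use, so Model 1 is a stand-in chosen for brevity; the NULL word is omitted, and the function names are hypothetical.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) from tokenized sentence pairs by EM (no NULL word)."""
    src_vocab = {e for src, _ in pairs for e in src}
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start from a uniform distribution over target words.
    t = {(f, e): 1.0 / len(tgt_vocab) for e in src_vocab for f in tgt_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its pair.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalize the expected counts per source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


def best_translation(t, e):
    """Return the target word with the highest probability for source word e."""
    options = [f for (f, source) in t if source == e]
    return max(options, key=lambda f: t[(f, e)])


# Toy parallel corpus, invented for illustration only.
pairs = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
t = ibm_model1(pairs)
for word in ("the", "house", "book", "a"):
    print(word, "->", best_translation(t, word))
```

The resulting table t maps (target word, source word) pairs to probabilities, which is the form of resource that can then be plugged into a retrieval system.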
¹⁰ For example, one can find Azerbaijan news feeds online at 〈http://www.525ci.com〉 in Azeri (written
with a Turkish code set), and on the same page are pointers to versions of the same stories in English
and in Russian.
6. Prospects
The default means of access to the Web is through a search engine such as Google.
Although the Web search engines are dazzlingly efficient pieces of technology and
excellent at the task they set for themselves, for the linguist they are frustrating:
• The search engine results do not present enough instances (1,000 or 5,000
maximum).
• They do not present enough context for each instance (Google provides a
fragment of around ten words).
• They are selected according to criteria that are, from a linguistic
perspective, distorting (with uses of the search term in titles and
headings going to the top of the list and often occupying all the top
slots).
• They do not allow searches to be specified according to linguistic criteria
such as the citation form for a word, or word class.
• The statistics are unreliable, with frequencies given for “pages containing
x” varying according to search engine load and many other factors.
If only these constraints were removed, a search engine would be a wonderful
tool for language researchers. Each of the constraints could straightforwardly be re-
solved by search engine designers, but linguists are not a powerful lobby, and search
engine company priorities will never perfectly match our community’s. This suggests
a better solution: Do it ourselves. Then the kinds of processing and querying would
be designed explicitly to meet linguists’ desiderata, without any conflict of interest or
“poor relation” role. Large numbers of possibilities open up. All those processes of
linguistic enrichment that have been applied with impressive effect to smaller corpora
could be applied to the Web. We could parse the Web. Web searches could be specified
in terms of lemmas, constituents (e.g., noun phrase), and grammatical relations rather
than strings. The way would be open for further anatomizing of Web text types and
domains. Thesauruses and lexicons could be developed directly from the Web. And
all for a multiplicity of languages.
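As a small illustration of a search specified in terms of lemmas rather than strings, the sketch below indexes sentences by lemma and returns keyword-in-context lines with a configurable window. The lemma table is a hypothetical stand-in for a real morphological analyzer, and the sentences are invented.

```python
from collections import defaultdict

# Hypothetical lemma table; a real system would use a morphological analyzer.
LEMMAS = {"ran": "run", "runs": "run", "running": "run", "mice": "mouse"}


def lemma(token):
    """Map a token to its lemma, falling back to the lowercased token."""
    return LEMMAS.get(token.lower(), token.lower())


def build_index(sentences):
    """Map each lemma to the (sentence number, token position) pairs where it occurs."""
    index = defaultdict(list)
    for s_no, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.split()):
            index[lemma(token)].append((s_no, pos))
    return index


def concordance(sentences, index, query_lemma, width=3):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for s_no, pos in index.get(query_lemma, []):
        tokens = sentences[s_no].split()
        left = " ".join(tokens[max(0, pos - width):pos])
        right = " ".join(tokens[pos + 1:pos + 1 + width])
        lines.append(f"{left} [{tokens[pos]}] {right}".strip())
    return lines


# Invented example sentences.
sentences = ["She runs every morning", "The mice ran under the table"]
index = build_index(sentences)
for line in concordance(sentences, index, "run"):
    print(line)
```

A query for the lemma “run” retrieves both “runs” and “ran,” and the amount of context returned is under the user's control rather than the search engine's.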
The Web contains enormous quantities of text, in numerous languages and lan-
guage types, on a vast array of topics. Our take on the Web is that it is a fabulous
linguists’ playground. We hope the special issue will encourage you to come on out
and play!

References
Agirre, Eneko, Olatz Ansa, Eduard Hovy,
and David Martinez. 2000. Enriching very
large ontologies using the WWW. In
Proceedings of the Ontology Learning
Workshop of the European Conference of AI
(ECAI), Berlin.
Atkins, Sue. 1993. Tools for computer-aided
corpus lexicography: The Hector project.
Acta Linguistica Hungarica, 41:5–72.
Atkins, Sue, Jeremy Clear, and Nicholas
Ostler. 1992. Corpus design criteria.
Literary and Linguistic Computing, 7(1):1–16.
Baayen, Harald. 2001. Word Frequency
Distributions. Kluwer, Dordrecht.
Baker, Collin F., Charles J. Fillmore, and
John B. Lowe. 1998. The Berkeley
FrameNet Project. In Proceedings of
COLING-ACL, pages 86–90, Montreal,
August.
Banko, Michele and Eric Brill. 2001. Scaling
to very very large corpora for natural
language disambiguation. In Proceedings of
the 39th Annual Meeting of the Association for
Computational Linguistics and the
10th Conference of the European Chapter of the
Association for Computational Linguistics,
Toulouse.
Beaudouin, Valérie, Serge Fleury, Benoît
Habert, Gabriel Illouz, Christian Licoppe,
and Marie Pasquier. 2001. Typweb:
décrire la toile pour mieux comprendre
les parcours. In Colloque International sur
les Usages et les Services des
Télécommunications (CIUST’01), Paris,
June. Available at 〈http://www.cavi.univ-paris3.fr/ilpga/ilpga/sfleury/typweb.htm〉.
Beesley, Kenneth R. 1988. Language
identifier: A computer program for
automatic natural-language identification
of on-line text. In Language at Crossroads:
Proceedings of the 29th Annual Conference of
the American Translators Association, pages
47–54, October 12–16.
Biber, Douglas. 1988. Variation across speech
and writing. Cambridge University Press,
Cambridge.
Biber, Douglas. 1993. Using
register-diversified corpora for general
language studies. Computational Linguistics,
19(2):219–242.
Briscoe, Ted and John Carroll. 1997.
Automatic extraction of subcategorization
from corpora. In Proceedings of the Fifth
Conference on Applied Natural Language
Processing, pages 356–363, Washington,
DC, April.
Buitelaar, Paul and Bogdan Sacaleanu. 2001.
Ranking and selecting synsets by domain
relevance. In Proceedings of the Workshop on
WordNet and Other Lexical Resources:
Applications, Extensions and Customizations,
NAACL, Pittsburgh, June.
Burnard, Lou. 1995. The BNC Reference
Manual. Oxford University Computing
Service, Oxford.
Cavaglia, Gabriela. 2002. Measuring corpus
homogeneity using a range of measures
for inter-document distance. In Proceedings
of the Third International Conference on
Language Resources and Evaluation, pages
426–431, Las Palmas de Gran Canaria,
Spain, May.
Church, Kenneth W. and Robert L. Mercer.
1993. Introduction to the special issue on
computational linguistics using large
corpora. Computational Linguistics,
19(1):1–24.
Dumais, Susan, Michele Banko, Eric Brill,
Jimmy Lin, and Andrew Ng. 2002. Web
question answering: Is more always
better? In Proceedings of the 25th ACM
SIGIR, pages 291–298, Tampere, Finland.
Fletcher, William. 2002. Facilitating
compilation and dissemination of ad-hoc
web corpora. In Teaching and Language
Corpora 2002. Available at 〈http://
miniappolis.com/KWiCFinder/
KWiCFinder.html〉.
Folch, Helka, Serge Heiden, Benoît Habert,
Serge Fleury, Gabriel Illouz, Pierre Lafon,
Julien Nioche, and Sophie Prévost. 2000.
Typtex: Inductive typological text
classification by multivariate statistical
analysis for NLP systems
tuning/evaluation. In Proceedings of the
Second Language Resources and Evaluation
Conference, pages 141–148, Athens,
May–June.
Fujii, Atsushi and Tetsuya Ishikawa. 2000.
Utilizing the World Wide Web as an
encyclopedia: Extracting term
descriptions from semi-structured text. In
Proceedings of the 38th Meeting of the ACL,
pages 488–495, Hong Kong, October.
Gildea, Daniel. 2001. Corpus variation and
parser performance. In Proceedings of the
Conference on Empirical Methods in NLP,
Pittsburgh, PA.
Greenwood, Mark, Ian Roberts, and Robert
Gaizauskas. 2002. University of Sheffield
TREC 2002 Q & A system. In E. M.
Voorhees and Lori P. Buckland, editors,
The Eleventh Text Retrieval Conference
(TREC-11), Washington. U.S. Government
Printing Office.
Grefenstette, Gregory. 1995. Comparing two
language identification schemes. In
Proceedings of the Third International
Conference on the Statistical Analysis of
Textual Data (JADT’95), pages 263–268,
Rome, December 11–13. Available at
〈www.xrce.xerox.com/competencies/content-
analysis/publications/Documents/P49030/
content/gg aslib.pdf〉.
Grefenstette, Gregory. 1999. The WWW as a
resource for example-based MT tasks.
Paper presented at ASLIB “Translating
and the Computer” conference, London,
October.
Grefenstette, Gregory. 2001. Very large
lexicons. In Walter Daelemans, Khalil
Simaan, Jakub Zavrel, and Jorn Veenstra,
editors, Computational Linguistics in the
Netherlands 2000: Selected Papers from the
Eleventh CLIN Meeting, Language and
Computers 37. Rodopi, Amsterdam.
Grefenstette, Gregory and Julien Nioche.
2000. Estimation of English and
non-English language use on the WWW.
In Proceedings of the RIAO (Recherche
d’Informations Assistée par Ordinateur),
pages 237–246, Paris.
Hawking, D., E. Voorhees, N. Craswell, and
P. Bailey. 1999. Overview of the TREC8
Web track. In Proceedings of the Eighth Text
Retrieval Conference, Gaithersburg,
Maryland, November.
Ipeirotis, Panagiotis G., Luis Gravano, and
Mehran Sahami. 2001. Probe, count, and
classify: Categorizing hidden Web
databases. In Proceedings of the SIGMOD
Conference, Santa Barbara, CA.
Jones, Rosie and Rayid Ghani. 2000.
Automatically building a corpus for a
minority language from the Web. In
Proceedings of the Student Workshop of the
38th Annual Meeting of the Association for
Computational Linguistics, Hong Kong,
pages 29–36.
Karlgren, Jussi and Douglass Cutting. 1994.
Recognizing text genres with simple
metrics using discriminant analysis. In
Proceedings of COLING-94, pages
1071–1075, Kyoto, Japan.
Kessler, Brett, Geoffrey Nunberg, and
Hinrich Schütze. 1997. Automatic
detection of text genre. In Proceedings of
ACL and EACL, pages 39–47, Madrid.
Kilgarriff, Adam. 2001. Comparing corpora.
International Journal of Corpus Linguistics,
6(1):1–37.
Kilgarriff, Adam. 2003. Linguistic search
engine. In Kiril Simov, editor, Shallow
Processing of Large Corpora: Workshop Held in
Association with Corpus Linguistics 2003,
Lancaster, England, March.
Kilgarriff, Adam and Michael Rundell. 2002.
Lexical profiling software and its
lexicographical applications—A case
study. In Proceedings of EURALEX ’02,
Copenhagen, August.
Kittredge, Richard and John Lehrberger.
1982. Sublanguage: Studies of Language in
Restricted Semantic Domains. De Gruyter,
Berlin.
Korhonen, Anna. 2000. Using semantically
motivated estimates to help
subcategorization acquisition. In
Proceedings of the Joint Conference on
Empirical Methods in NLP and Very Large
Corpora, pages 216–223, Hong Kong,
October.
Lawrence, Steve and C. Lee Giles. 1999.
Accessibility of information on the Web.
Nature, 400:107–109.
Manning, Christopher and Hinrich Schütze.
1999. Foundations of Statistical Natural
Language Processing. MIT Press, Cambridge.
McEnery, Tony and Andrew Wilson. 1996.
Corpus Linguistics. Edinburgh University
Press, Edinburgh.
Mihalcea, Rada and Dan Moldovan. 1999. A
method for word sense disambiguation of
unrestricted text. In Proceedings of the 37th
Meeting of ACL, pages 152–158, College
Park, MD, June.
Nagao, Makoto. 1984. A framework of a
mechanical translation between Japanese
and English by analogy principle. In Alick
Elithorn and Ranan Banerji, editors,
Artificial and Human Intelligence.
North-Holland, Edinburgh, pages 173–180.
Peters, Carol, editor. 2001. Cross-Language
Information Retrieval and Evaluation,
Workshop of Cross-Language Evaluation Forum
(CLEF 2000) Lisbon, Portugal, September
21–22, 2000, Revised Papers. Lecture Notes
in Computer Science. Springer-Verlag.
Resnik, Philip. 1999. Mining the Web for
bilingual text. In Proceedings of the 37th
Meeting of ACL, pages 527–534, College
Park, MD, June.
Rigau, German, Bernardo Magnini, Eneko
Agirre, and John Carroll. 2002. Meaning:
A roadmap to knowledge technologies. In
Proceedings of COLING Workshop on A
Roadmap for Computational Linguistics,
Taipei, Taiwan.
Roland, Douglas, Daniel Jurafsky, Lise
Menn, Susanne Gahl, Elizabeth Elder, and
Chris Riddoch. 2000. Verb
subcategorization frequency differences
between business-news and balanced
corpora: The role of verb sense. In
Proceedings of the Workshop on Comparing
Corpora, 38th ACL, Hong Kong, October.
Sekine, Satoshi. 1997. The domain
dependence of parsing. In Proceedings of
the Fifth Conference on Applied Natural
Language Processing, pages 96–102,
Washington, DC, April.
Sinclair, John M., editor. 1987. Looking Up:
An Account of the COBUILD Project in
Lexical Computing. Collins, London.
Varantola, Krista. 2000. Translators and
disposable corpora. In Proceedings of CULT
(Corpus Use and Learning to Translate),
Bertinoro, Italy, November.
Villasenor-Pineda, L., M. Montes y Gómez,
M. Pérez-Coutino, and D. Vaufreydaz.
2003. A corpus balancing method for
language model construction. In Fourth
International Conference on Intelligent Text
Processing and Computational Linguistics
(CICLing-2003), pages 393–401, Mexico
City, February.
Volk, Martin. 2001. Exploiting the WWW as
a corpus to resolve PP attachment
ambiguities. In Proceedings of Corpus
Linguistics 2001, Lancaster, England.
Vossen, Piek. 2001. Extending, trimming
and fusing WordNet for technical
documents. In Proceedings of the NAACL
2001 Workshop on WordNet and Other Lexical
Resources, Pittsburgh, June. Available at
〈http://engr.smu.edu/rada/mwnw/
papers/WNW-NACL-205.pdf.gz〉.
Wilks, Yorick. 2003. On the ownership of
text. Computers and the Humanities.
Forthcoming.
Xu, J. L. 2000. Multilingual search on the
World Wide Web. In Proceedings of the
Hawaii International Conference on System
Science (HICSS-33), Maui, Hawaii, January.
Zheng, Zhiping. 2002. AnswerBus question
answering system. In E. M. Voorhees and
Lori P. Buckland, editors, Proceedings of
HLT Human Language Technology Conference
(HLT 2002), San Diego, CA, March 24–27.
