Comparing corpora with WordSmith Tools: 
How large must the reference corpus be? 
Tony BERBER-SARDINHA 
LAEL, Catholic University of Sao Paulo 
Rua Monte Alegre 984 
05014-001 Sao Paulo SP, Brazil 
tony4@uol.com.br 
Abstract 
WordSmith Tools (Scott, 1998) offers a 
program for comparing corpora, known as 
KeyWords. KeyWords compares a word list 
extracted from what has been called 'the 
study corpus' (the corpus which the 
researcher is interested in describing) with a 
word list made from a reference corpus. The 
only requirement for a word list to be 
accepted as a reference corpus by the software 
is that it must be larger than the study corpus. 
One of the most pressing questions with 
respect to using KeyWords seems to be what 
would be the ideal size of a reference 
corpus. The aim of this paper is thus to 
propose answers to this question. Five 
English corpora were compared to reference 
corpora of various sizes (varying from two 
to 100 times larger than the study corpus). 
The results indicate that a reference corpus 
that is five times as large as the study corpus 
yielded a larger number of keywords than a 
smaller reference corpus. Corpora larger 
than five times the size of the study corpus 
yielded similar numbers of keywords. The 
implication is that a larger reference corpus 
is not always better than a smaller one for 
WordSmith Tools KeyWords analysis, while 
a reference corpus that is less than five times 
the size of the study corpus may not be 
reliable. There seems to be no need for using 
extremely large reference corpora, given that 
the number of keywords yielded does not 
seem to change by using corpora larger than 
five times the size of the study corpus. 
Introduction 
WordSmith Tools (Scott, 1998) offers a 
program for comparing corpora, known as 
KeyWords. This tool has been used in several 
studies as a means for describing various lexico- 
grammatical characteristics of different genres 
(Barbara and Scott, 1999; Batista, 1998; Berber 
Sardinha, 1995, 1999a, b; Berber Sardinha and 
Shimazumi, 1998; Bonamin, 1999; Collins and 
Scott, 1996; Conde, 1999; Dutra, 1999; Freitas, 
1997; Fuzetti, 1999; Granger and Tribble, 1998; 
Lima-Lopes, 1999; Lopes, 2000; Ramos, 1997; 
Santos, 1999; Scott, 1997; Silva, 1999; Tribble, 
1998). The keywords identified by the program 
are not necessarily the 'most important words' in 
the corpus (Scott, 1997), or those that 
correspond to readers' intuitions as to what the 
topics of the texts are. It is generally thought that 
a set of WordSmith Tools keywords indicates 
'aboutness' (Phillips, 1989). 
KeyWords compares a word list extracted 
from what has been called 'the study corpus' 
(the corpus which the researcher is interested in 
describing) with a word list made from a 
reference corpus. The result is a list of 
keywords, or words whose frequencies are 
statistically higher in the study corpus than in 
the reference corpus. The software also 
identifies words whose frequencies are 
statistically lower in the study corpus, which are 
called 'negative keywords', in contrast to 
positive keywords, which have higher 
frequencies in the study corpus. Negative 
keywords, though, will not be discussed in the 
present paper. Hence, whenever keyword is 
mentioned in this paper, it will mean 'positive 
keyword'. 
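The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WordSmith Tools implementation: the function names, the toy word lists, and the use of the standard 2x2 log-likelihood formula with a critical value of 6.63 (p < 0.01, one degree of freedom) are all assumptions made for the example.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 for a 2x2 table. a, b: frequency of the word in the study and
    reference corpus; c, d: frequency of all remaining word forms."""
    total = a + b + c + d
    cells = ((a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d))
    g2 = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / total
        if observed > 0:  # an empty cell adds nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def keywords(study, reference, critical=6.63):
    """Split the types of the study word list into positive and negative keywords."""
    n_study, n_ref = sum(study.values()), sum(reference.values())
    positive, negative = [], []
    for word, a in study.items():
        b = reference.get(word, 0)
        if log_likelihood(a, b, n_study - a, n_ref - b) < critical:
            continue  # frequencies are not statistically different
        if a / n_study > b / n_ref:
            positive.append(word)
        else:
            negative.append(word)
    return positive, negative

# Hypothetical toy word lists (not data from this study).
study_list = {"job": 30, "the": 68, "of": 2}
reference_list = {"job": 10, "the": 680, "of": 200, "other": 110}
positive, negative = keywords(study_list, reference_list)
print(positive, negative)
```

Only types that occur in the study word list are tested in this sketch, so a word found exclusively in the reference list is never reported.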
The only requirement for a word list to be 
accepted as a reference corpus by the software is 
that it must be larger than the study corpus. Thus, 
the composition and length of KeyWord lists can 
vary according to at least six parameters: 
• The composition of the study corpus. 
• The composition of the reference 
corpus. 
• The size of the study corpus. 
• The size of the reference corpus. 
• The statistical test used in the 
comparison of frequencies (log- 
likelihood and chi-square are available). 
• The level of significance (p) used as the 
'keyness' benchmark (the cut-off point). 
Since WordSmith Tools is Windows software, 
it has appealed to a large audience of applied 
linguists willing to do corpus-based research, 
for many of whom Windows is the only platform 
they know how to use. To them, one of the 
most pressing questions with respect to using 
KeyWords seems to be what would be the ideal 
size of a reference corpus. The aim of this paper 
is thus to propose answers to this question. 
1 Using KeyWords 
A KeyWord list is a portion of the study 
corpus word list. KeyWords compares the 
frequencies for each type in the study and 
reference corpora. The program calculates the 
log-likelihood (G²)¹ or chi-square (χ²) of each 
word form based on its distribution in both 
corpora, an example of which is given in the 
table below. 
                       Study corpus   Reference corpus
Word form x            10 (10%)       10 (1%)
Remaining word forms   90 (90%)       1000 (99%)
Total                  100 (100%)     1010 (100%)
For a distribution such as the above, both the 
log-likelihood and chi-square statistics would 
probably flag the word form in question as a 
keyword, since its frequencies in the two 
corpora are so different (10% versus 1%). The 
way KeyWords processes word lists is not 
unique, and has been applied by researchers 
using other software (De Cock, Granger, Leech, 
and McEnery, 1998; Granger and Rayson, 1998; 
Milton, 1998). 
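The claim that both statistics would flag word form x can be checked with a short calculation. The sketch below applies the textbook formulas for a 2x2 contingency table to the counts in the table above; the function names are illustrative, and the exact formula used internally by WordSmith Tools is not assumed here. The critical value of 6.63 corresponds to p = 0.01 with one degree of freedom.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence of row and column."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def g2(table):
    """Log-likelihood statistic: 2 * sum(O * ln(O / E)) over non-empty cells."""
    expected = expected_counts(table)
    return 2 * sum(o * math.log(o / e)
                   for row_o, row_e in zip(table, expected)
                   for o, e in zip(row_o, row_e) if o > 0)

def chi_square(table):
    """Pearson chi-square statistic: sum((O - E)^2 / E)."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(table, expected)
               for o, e in zip(row_o, row_e))

# Rows: word form x, remaining word forms. Columns: study, reference corpus.
table = [[10, 10], [90, 1000]]
CRITICAL_P01 = 6.63  # chi-square critical value, 1 degree of freedom, p = 0.01

g2_value = g2(table)
chi2_value = chi_square(table)
print(g2_value > CRITICAL_P01, chi2_value > CRITICAL_P01)
```

Both values come out well above the cut-off for this distribution, which is consistent with the expectation stated above.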
After processing the word lists, the keyword 
lists appear in WordSmith Tools as illustrated 
below. 
1 See Dunning (1992) for the formulae. 

Word          Freq.   %      Freq.     %      Keyness   p
I             192     0.95   97,787    0.11   495.7     0.000000
JOB           91      0.45   19,454    0.02   380.4     0.000000
LOVE          93      0.46   21,296    0.02   376.6     0.000000
RITA          33      0.16   305              338.9     0.000000
NOWADAYS      38      0.19   1,365            290.4     0.000000
IS            455     2.25   889,648   0.98   244.8     0.000000
[illegible]   14      0.07   0                235.4     0.000000
UNIVERSITY    51      0.25   15,333    0.02   180.3     0.000000
PERSON        54      0.27   21,747    0.02   161.9     0.000000
MONEY         60      0.30   31,442    0.03   151.6     0.000000
LIVE          45      0.22   15,551    0.02   147.5     0.000000

From left to right, the columns in the window 
refer to: 
• 'Word': the keywords; 
• 'Freq': frequency in the study corpus; 
• <file name> %: percent frequency in the 
study corpus; 
• 'Freq': frequency in the reference 
corpus; 
• <file name> %: percent frequency in the 
reference corpus; 
• Keyness: the value of the log-likelihood 
or chi-square statistics; 
• p: the significance value associated with 
the statistic. 
2 Methodology 
In order to answer this question, the following 
English corpora were used: 
• Corpus of job application letters, taken 
from the DIRECT Corpus 2. 
• Corpus of newspaper editorials, from 
the Brown Corpus ('B' subcorpus). 
• Corpus of newspaper reviews, from the 
Brown Corpus ('C' subcorpus). 
• Corpus of mystery fiction, from the 
Brown Corpus ('L' subcorpus). 
• Corpus of science fiction, from the 
Brown Corpus ('M' subcorpus). 
These five corpora added up to about 162 
thousand words: 
Corpus       Tokens    Types
Letters      11,761    2,415
Editorials   54,626    8,582
Reviews      35,741    7,746
Mystery      48,298    6,281
Sci-Fi       12,081    2,982
Total        162,507

2 For more information on the DIRECT project, log 
on to www.direct.f2s.com 
The reference corpora were compiled out of 
texts published in 'The Guardian'. The reason 
for choosing it is that newspaper text is the most 
typical kind of reference corpus used by applied 
linguists, mainly because it is easy to get. 
Therefore, the results obtained here would be 
relevant to the typical user of KeyWords. The 
reason for specifically choosing The Guardian is 
that Mike Scott, the author of WordSmith Tools, 
makes available on his website a word list of 
95 million tokens of The Guardian text. 
This has become a popular choice for 
several WordSmith Tools users investigating 
English keywords. Once again, it was hoped that 
by using The Guardian, the investigation would 
mirror a typical choice of WordSmith users. For 
the present study, a portion of the Guardian 
word list was used, namely from texts published 
in 1994, taken randomly. 
The size of the reference corpora varied 
according to the size of the study corpora. For 
each study corpus, 18 reference corpora were 
created. Each one was n times larger than the 
study corpus, with n being 2, 3, 4, 5, 6, 7, 8, 9, 
10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. For 
instance, the letters corpus had 11,761 tokens, 
and so for n=2 the size of the reference corpus 
was 23,522 tokens (11,761 x 2); for n=3, the 
reference corpus size was 35,283 (11,761 x 3), 
for n=4 47,044, and so on, up to n=100, whose 
size was 1,176,100 tokens. 
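The sizing rule just described amounts to multiplying the study corpus token count by each value of n. A minimal sketch follows; the names MULTIPLIERS and reference_sizes are illustrative and not part of WordSmith Tools.

```python
# The 18 multipliers used in the study.
MULTIPLIERS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

def reference_sizes(study_tokens):
    """Map each multiplier n to the target reference corpus size in tokens."""
    return {n: n * study_tokens for n in MULTIPLIERS}

# Letters corpus: 11,761 tokens.
letters_sizes = reference_sizes(11_761)
for n in (2, 3, 4, 100):
    print(n, letters_sizes[n])
```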
The KeyWords settings used for the 
comparisons were as follows: 
Setting          Value
Procedure        log-likelihood
Max p. value     0.01
Max wanted       16000*
Min frequency
* most allowed
Table 1: KeyWords settings 
The table below shows the size of all of the 
reference corpora used in the study: 
Size of reference corpora
                     n=2       n=3       n=4       n=5       n=6       n=7
Letters     Tokens   23,522    35,283    47,044    58,805    70,566    82,327
            Types    5,543     7,409     8,863     10,161    11,163    12,249
Editorials  Tokens   109,252   163,878   218,504   273,130   327,756   382,382
            Types    14,973    18,378    21,746    24,118    26,537    28,382
Reviews     Tokens   71,482    107,223   142,964   178,705   214,446   250,187
            Types    11,000    14,331    17,758    19,490    21,559    23,402
Mystery     Tokens   96,596    144,894   193,192   241,490   289,788   338,086
            Types    13,880    17,636    20,285    22,861    24,925    26,928
Sci-Fi      Tokens   24,162    36,243    48,324    60,405    72,486    84,567
            Types    5,644     7,550     9,032     10,325    11,318    12,422

Size of reference corpora
                     n=8       n=9       n=10      n=20      n=30        n=40
Letters     Tokens   94,088    105,849   117,610   235,220   352,830     470,440
            Types    13,095    13,896    14,879    22,650    27,763      31,471
Editorials  Tokens   437,008   491,634   546,260   1,092,520 1,638,780   2,185,040
            Types    30,292    31,825    33,672    47,305    57,325      65,237
Reviews     Tokens   285,928   321,669   357,410   714,820   1,072,230   1,429,640
            Types    24,940    26,524    27,812    38,610    47,081      53,695
Mystery     Tokens   386,384   434,682   482,980   965,960   1,448,940   1,931,920
            Types    28,563    30,084    31,669    44,755    53,867      61,531
Sci-Fi      Tokens   96,648    108,729   120,810   241,620   362,430     483,240
            Types    13,305    14,209    15,156    22,918    28,144      32,010

Size of reference corpora
                     n=50        n=60        n=70        n=80        n=90        n=100
Letters     Tokens   588,050     705,660     823,270     940,880     1,058,490   1,176,100
            Types    35,083      38,560      42,421      44,607      47,061      48,902
Editorials  Tokens   2,731,300   3,277,560   3,823,820   4,370,080   4,916,340   5,462,600
            Types    71,680      77,397      82,743      87,902      92,884      97,121
Reviews     Tokens   1,787,050   2,144,460   2,501,870   2,859,280   3,216,690   3,574,100
            Types    59,690      64,753      69,242      73,167      76,945      80,574
Mystery     Tokens   2,414,900   2,897,880   3,380,860   3,863,840   4,346,820   4,829,800
            Types    68,117      73,623      78,508      83,076      87,578      92,157
Sci-Fi      Tokens   604,050     724,860     845,670     966,480     1,087,290   1,208,100
            Types    35,460      38,959      42,822      45,101      47,474      49,617
Table 2: Size of reference corpora 
3 Results 
The results for the total number of keywords 
obtained are shown in the following table. Since 
the study corpora were of different sizes, the 
number of keywords is also shown as a 
percentage of the total types of the study corpus. 
For instance, the letters corpus had 2,415 types; 
the number of keywords obtained comparing 
this corpus to the n=2 reference corpus was 279; 
therefore, this corresponds to 11.6% of the total 
types. 
n=    Letters         Editorials      Reviews         Mystery         Sci-Fi
      Keywds.  %      Keywds.  %      Keywds.  %      Keywds.  %      Keywds.  %
2     279      11.6   433      5.0    401      5.2    583      9.3    137      4.6
3     347      14.4   686      8.0    582      7.5    748      11.9   202      6.8
4     354      14.7   637      7.4    496      6.4    728      11.6   196      6.6
5     481      19.9   963      11.2   889      11.5   1027     16.4   363      12.2
6     480      19.9   910      10.6   872      11.3   1035     16.5   361      12.1
7     450      18.6   892      10.4   829      10.7   1018     16.2   355      11.9
8     457      18.9   887      10.3   846      10.9   1037     16.5   350      11.7
9     457      18.9   880      10.3   822      10.6   1031     16.4   332      11.1
10    462      19.1   896      10.4   837      10.8   1050     16.7   330      11.1
30    497      20.6   960      11.2   919      11.9   1116     17.8   364      12.2
40    507      21.0   953      11.1   926      12.0   1135     18.1   367      12.3
50    490      20.3   936      10.9   914      11.8   1123     17.9   373      12.5
60    492      20.4   942      11.0   933      12.0   1141     18.2   378      12.7
70    492      20.4   928      10.8   914      11.8   1140     18.1   368      12.3
80    485      20.1   948      11.0   929      12.0   1145     18.2   374      12.5
90    485      20.1   943      11.0   922      11.9   1130     18.0   383      12.8
100   475      19.7   952      11.1   939      12.1   1143     18.2   382      12.8
Table 3: Keyword totals (% = pct. of the total number of types in the study corpus). 
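The percentage columns in Table 3 follow from dividing each keyword total by the number of types in the study corpus. The short sketch below reproduces two of the letters corpus figures; the function name is illustrative.

```python
def keyword_percentage(keyword_total, study_types):
    """Keywords as a percentage of the types in the study corpus."""
    return 100 * keyword_total / study_types

LETTERS_TYPES = 2415  # types in the letters corpus

pct_n2 = round(keyword_percentage(279, LETTERS_TYPES), 1)  # n=2
pct_n5 = round(keyword_percentage(481, LETTERS_TYPES), 1)  # n=5
print(pct_n2, pct_n5)
```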
The results indicate that the number of keywords 
increases as the size of the reference corpus 
increases, but this increase is not linear. For 
instance, the number of keywords for n=2 in the 
letters corpus was 279, for n=3 it was 347, and 
for n=100 it was 475. Had the growth 
been linear, for n=3 there would be 418 
keywords, and for n=100 13,950. Obviously, a 
total of 13,950 keywords could never have been 
obtained since the maximum possible number of 
keywords in the letters corpus is 2,415, which is 
the total number of types. The same is true of all 
the other corpora. 
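The linear projection used in this comparison scales the n=2 keyword total in proportion to n. A minimal sketch, with an illustrative function name:

```python
def linear_projection(base_keywords, base_n, n):
    """Keyword total expected at size n if growth were proportional to n."""
    return base_keywords * n / base_n

LETTERS_TYPES = 2415  # upper bound on keywords in the letters corpus

projected_n3 = linear_projection(279, 2, 3)      # 418.5, reported as 418
projected_n100 = linear_projection(279, 2, 100)  # 13,950
print(int(projected_n3), int(projected_n100), projected_n100 > LETTERS_TYPES)
```

The projected total for n=100 exceeds the number of types in the letters corpus, which is why linear growth is impossible beyond a certain point.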
This suggests that there must be a point at 
which the growth in number of keywords 
diminishes. This can be shown by plotting the 
number of keywords for each size of n across all 
the study corpora, as in the graph below. 
[Plot: keyword totals (y-axis, 0 to 25) by size of reference corpus (n) (x-axis); series: Letters, Mystery, Sci-Fi, Reviews, Editorials] 
Plot 1: Distribution of keywords 
The plot shows that for all study corpora the 
keyword totals rose from n=2 to n=3, then fell or 
stabilized at n=4, rose again at n=5 and from 
then on basically reached a plateau. For instance, 
for the letters corpus, the keyword totals for n=2, 
n=3, n=4, n=5, and n=6 were respectively 11.6, 
14.4, 14.7, 19.9, and 19.9. Hence, there was 
indeed a considerable rise from n=2 to n=3 (11.6 
to 14.4), followed by a slight rise at n=4 (14.7), 
then a major increase at n=5 (19.9), and there 
was no change from n=5 to n=6 (19.9 to 19.9). 
In order to check where the major changes 
occurred, an ANOVA was run on the keyword 
totals across the various n sizes. The results are 
shown in the table below. 
Source      df    SS           F        p
Size of n   21    1540.8087    267.98   < 0.0001
Error       68    18.6184
Total       89    1559.4271
Table 4: Results of ANOVA for keyword totals 
across reference corpora 
The value of F(21,68)=267.98 is significant at 
p<0.0001, which indicates that size of the 
reference corpora had a significant effect on the 
keyword totals. It does not, however, show 
between which n sizes the keyword totals differ. 
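The F value follows from the sums of squares and degrees of freedom reported in the ANOVA table, since F is the ratio of the two mean squares. The sketch below simply redoes that arithmetic; it does not rerun the ANOVA, and the function name is illustrative.

```python
def f_statistic(ss_effect, df_effect, ss_error, df_error):
    """F ratio: mean square of the effect over mean square of the error."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return ms_effect / ms_error

# Values as reported in the ANOVA table.
f_value = f_statistic(1540.8087, 21, 18.6184, 68)
ss_total = 1540.8087 + 18.6184
print(round(f_value, 2), round(ss_total, 4))
```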
In order to know at which n sizes the keyword 
totals are statistically different, the REGWF 
(Ryan-Einot-Gabriel-Welsch) Multiple F Test 
was run in SAS. The results appear in the table 
below, in decreasing order of the average 
percentage of keyword totals across the five 
study corpora. 
Groupings   Avg. %      Size
            keywords    of n
A           14.8840     40
A           14.8480     60
A           14.7900     20
A           14.7780     100
A           14.7780     80
A           14.7600     90
A           14.7220     30
A B         14.6940     70
A B C       14.6780     50
A B C D     14.2280     5
A B C D     14.0660     6
  B C D     13.6860     8
    C D     13.6340     10
      D     13.5660     7
      D     13.4640     9
E           9.7100      3
E           9.3280      4
F           7.1300      2
Table 5: Results of REGWF test 
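The averages in Table 5 can be approximated from the keyword totals in Table 3 by converting each total to a percentage of the study corpus types and averaging across the five corpora. The sketch below does this for n=2 and n=5. It assumes a type count of 7,746 for the reviews corpus, the reading that is consistent with the percentages in Table 3; the names used are illustrative.

```python
# Types per study corpus (the reviews figure is an assumed reading, see above).
STUDY_TYPES = {"Letters": 2415, "Editorials": 8582, "Reviews": 7746,
               "Mystery": 6281, "Sci-Fi": 2982}

# Keyword totals from Table 3 for two reference corpus sizes.
KEYWORD_TOTALS = {
    2: {"Letters": 279, "Editorials": 433, "Reviews": 401,
        "Mystery": 583, "Sci-Fi": 137},
    5: {"Letters": 481, "Editorials": 963, "Reviews": 889,
        "Mystery": 1027, "Sci-Fi": 363},
}

def average_percentage(n):
    """Mean, across study corpora, of keywords as a percentage of types."""
    percentages = [100 * KEYWORD_TOTALS[n][corpus] / STUDY_TYPES[corpus]
                   for corpus in STUDY_TYPES]
    return sum(percentages) / len(percentages)

avg_n2 = average_percentage(2)
avg_n5 = average_percentage(5)
print(round(avg_n2, 2), round(avg_n5, 2))
```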
The REGWF test presents the results in terms 
of groupings, identified by letters. Keyword 
totals in the same grouping are not statistically 
different. Hence, sizes of n equal to 40, 60, 20, 
100, 80, 90, 30, 70, 50, 5, and 6 formed 
grouping A, which has on average 14.066% to 
14.884% keyword totals. Likewise, n sizes equal 
to 70, 50, 5, 6, and 8 were in grouping B, with 
averages ranging from 13.686% to 14.694%. 
Note that there is overlap among groupings, and 
so groupings A, B, C and D are in fact joined. 
This grouping comprises n sizes ranging from 5 
to 100. The remaining groupings are non- 
overlapping: grouping E was formed by n sizes 
3 and 4, and grouping F by n=2. 
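The joining of overlapping groupings can be made explicit by merging any groupings that share at least one n size. The sketch below uses the grouping memberships listed in Table 5; the merging routine is an illustration written for this purpose and is not part of the REGWF procedure itself.

```python
# n sizes belonging to each REGWF grouping, as listed in Table 5.
GROUPINGS = {
    "A": {40, 60, 20, 100, 80, 90, 30, 70, 50, 5, 6},
    "B": {70, 50, 5, 6, 8},
    "C": {50, 5, 6, 8, 10},
    "D": {5, 6, 8, 10, 7, 9},
    "E": {3, 4},
    "F": {2},
}

def merge_overlapping(groupings):
    """Join groupings that share at least one n size into single blocks."""
    blocks = []
    for members in groupings.values():
        members = set(members)
        untouched = []
        for block in blocks:
            if block & members:
                members |= block  # absorb every block that overlaps
            else:
                untouched.append(block)
        untouched.append(members)
        blocks = untouched
    return sorted(blocks, key=min)

blocks = merge_overlapping(GROUPINGS)
for block in blocks:
    print(sorted(block))
```

Three blocks result: n=2 on its own, n=3 and n=4 together, and all sizes from n=5 to n=100 together.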
Therefore, there are two basic divisions in the 
previous table, which separate three blocks 
beginning at n sizes equal to 2, 3, and 5. 
These correspond to the major peaks and 
plateaus visible in the plot. 
The results suggest, then, that the critical size 
for a reference corpus seems to be five times the 
size of the study corpus. In other words, the 
answer to the question 'what is the ideal size of 
a reference corpus' is five times the size of the 
study corpus. A reference corpus that is five 
times as large as the study corpus yields a larger 
number of keywords than a smaller reference 
corpus. This means that the results of a keyword 
analysis based on a reference corpus that is at 
least five times the size of the study corpus 
could be very different from those of a study 
based on a reference corpus, say, just three 
times larger than the study corpus, in so far as 
the number of keywords goes. Several potentially 
revealing keywords could be left out of the 
analysis if the reference corpus is not at least 
five times as large as the study corpus. 
Conclusion 
The aim of this study was to estimate the ideal 
size of a reference corpus to be used in the 
WordSmith Tools KeyWords procedure. 
KeyWords provides facilities for comparing a 
study corpus to a reference corpus, which, by 
default, must be larger than the study corpus. 
The results indicated that a reference corpus 
that is five times larger than the study corpus 
yields a similar number of keywords to 
reference corpora that are up to 100 times larger 
than the study corpus. This was taken to mean 
that a reference corpus does not need to be more 
than five times larger than the study corpus. 
In sum, a larger reference corpus is not always 
better than a smaller one for WordSmith Tools 
KeyWords analysis. There seems to be no need 
for using extremely large reference corpora, 
given that the number of keywords yielded does 
not seem to change by using corpora larger than 
five times the size of the study corpus. This may 
be important for WordSmith Tools users, who 
may be short of disk space and memory on their 
PCs to process large reference corpora. A 
suggestion that might come out of this finding is 
that researchers should not spend time and 
resources building, collecting or searching for 
larger and larger reference corpora. Resources 
would be better spent in the compilation of 
reference corpora that are more suitable in terms 
of their contents vis-à-vis the study corpus. 
This study did not tackle several important 
questions. One of them is whether the keywords 
that were identified represent the main concepts 
or topics found in the texts. A qualitative study 
would be needed to answer this, as an 
independent test of validity of the status of the 
keywords. Another question is the effect of the 
size of the study corpus. It is not known how 
study corpora of the same size behave in terms 
of the total keywords that they yield when 
compared to reference corpora of the same size. 
Another question is the composition of the 
keyword lists obtained. This study restricted 
itself to quantitative aspects of keyword list 
variation, but it would be important that changes 
be assessed qualitatively as well. In particular, it 
would be pertinent to know which keywords 
were added or dropped as the levels of n 
changed. Finally, the fact that Brown corpus 
texts are short fragments and not whole texts 
may have affected the results, since the number of 
keywords seems to vary considerably as a 
function of the size of the texts (Mike Scott, 
personal communication). Shorter texts provide 
less room for repetition, which in turn influences 
word frequencies. 
Acknowledgements 
My thanks go to Mike Scott and the three 
anonymous reviewers for their comments. 

References 
Leila Barbara and Mike Scott (1999) Homing in on a 
genre: invitations for bids. In "Writing Business: 
Genres, media and discourse", F. Bargiela- 
Chiappini & C. Nickerson, ed., Longman, New 
York, USA, pp. 227-254. 
Maria Eugênia Batista (1998) E-Mails na troca de 
informação numa multinacional: o gênero e as 
escolhas léxico-gramaticais. Unpublished MA 
Thesis, LAEL, Catholic University of Sao Paulo, 
Brazil. 
Tony Berber-Sardinha (1995). The OJ Simpson trial: 
Connectivity and consistency. Paper presented at 
the BAAL Annual Meeting, Southampton, 
England, 14 September 1995. 
Tony Berber-Sardinha (1999a) Using KeyWords in 
text analysis: Practical aspects. DIRECT Papers, 
42. LAEL, Catholic University of Sao Paulo, 
Brazil / AELSU, University of Liverpool, England. 
(Available online at www.direct.f2s.com) 
Tony Berber-Sardinha (1999b) Wordsets, keywords, 
and text contents: an investigation of text topic on 
the computer. Delta, 15, pp. 141-149. (Available 
online at www.scielo.br) 
Tony Berber-Sardinha and Marilisa Shimazumi 
(1998) Using corpus linguistics to describe the 
APU (Assessment of Performance Unit) archive of 
schoolchildren's writing. Unpublished manuscript. 
(Available online at www.tonyberber.f2s.com) 
Márcia Bonamin (1999) Análise organizacional e 
léxico-gramatical de duas seções de revistas de 
informática, em inglês. Unpublished MA Thesis, 
LAEL, Catholic University of São Paulo, Brazil. 
(Available online at www.lael.f2s.com/online.htm) 
Heloisa Collins and Mike Scott (1996) Lexical 
landscaping. DIRECT Papers, 32. CEPRIL, 
Catholic University of São Paulo, Brazil, and 
AELSU, Liverpool University, England. 
Helena Conde (1999) Aspectos culturais da escrita de 
alunos de uma escola americana em São Paulo - 
Uma perspectiva baseada em corpus. MA Project. 
LAEL, Catholic University of São Paulo, Brazil. 
Sylvie De Cock, Sylvianne Granger, Geoffrey Leech 
and Tony McEnery (1998) An automated approach 
to the phrasicon of EFL learners. In "Learner 
English on Computer", S. Granger, ed., Longman, 
New York, USA, pp. 67-79. 
Ted Dunning (1992) Accurate methods for the 
statistics of surprise and coincidence. 
Computational Linguistics, 19, pp. 61-74. 
Patrícia Dutra (1999) Análise léxico-gramatical 
baseada em corpus da música pop contemporânea. 
MA Project, LAEL, Catholic University of São 
Paulo, Brazil. 
Alice de Freitas (1997) América mágica, Grã-Bretanha 
real e Brasil tropical: um estudo lexical 
de panfletos de hotéis. Unpublished Doctoral 
Thesis, LAEL, Catholic University of São Paulo, 
Brazil. (Available online at 
www.lael.f2s.com/online.htm) 
Helena Fuzetti (1999) A interação oral entre crianças 
numa escola americana - Uma abordagem baseada 
em corpus. MA Project, LAEL, Catholic 
University of Sao Paulo, Brazil. 
Sylvianne Granger and Paul Rayson (1998) 
Automatic profiling of learner texts. In "Learner 
English on Computer", S. Granger, ed., Longman, 
New York, USA, pp. 119-131. 
Sylvianne Granger and Chris Tribble (1998) Learner 
corpus data in the foreign language classroom: 
Form-focused instruction and data-driven 
learning. In "Learner English on Computer", S. 
Granger, ed., Longman, New York, USA, pp. 199- 
209. 
Rodrigo Lima-Lopes (1999) Padrões colocacionais 
dos participantes em cartas de negócios em língua 
inglesa. Manuscript. LAEL, Catholic University of 
Sao Paulo, Brazil. 
Maria Cecilia Lopes (2000) Homepages 
institucionais em português e suas versões para o 
inglês: Uma análise baseada em corpus de aspectos 
lexicais e discursivos. Unpublished MA Thesis, 
LAEL, Catholic University of 
Sao Paulo, Brazil. 
John Milton (1998) Exploiting L1 and interlanguage 
corpora in the design of an electronic language 
learning and production environment. In "Learner 
English on Computer", S. Granger, ed., Longman, 
New York, USA, pp. 186-199. 
Martin Phillips (1989) Lexical Structure of Text. 
Birmingham: ELR, University of Birmingham, 80 
p. 
Rosinda Guerra Ramos (1997) Projeção de imagem 
através de escolhas lingüísticas: Um estudo no 
contexto empresarial. Unpublished Doctoral 
Thesis, LAEL, Catholic University of Sao Paulo, 
Brazil. 
Valéria Branco Pinto dos Santos (1999) Padrões 
interpessoais no gênero de cartas de negociação. 
Unpublished MA Thesis, LAEL, Catholic 
University of São Paulo, Brazil. (Available online 
at www.lael.f2s.com/online.htm) 
Mike Scott (1997) PC Analysis of key words - and 
key key words. System, 25, pp. 233-245. 
Mike Scott (1998) WordSmith Tools Version 3. 
Oxford University Press, Oxford, England. 
Maria Fernanda da Silva (1999) Análise lexical de 
folhetos de propagandas de escolas de línguas e as 
representações de ensino. Unpublished MA Thesis, 
LAEL, Catholic University of Sao Paulo, Brazil. 
(Available online at 
http://www.lael.f2s.com/online.htm) 
Chris Tribble (1998) Genres, keywords, teaching: 
towards a pedagogic account of the language of 
Project Proposals. Paper presented at TALC98, 
Oxford, England. 
