Design and evaluation of grammar checkers in multiple languages
Antje HELFRICH 
NLG, Microsoft Corp. 
1 Microsoft Way
Redmond, WA 98052 
antjeh@microsoft.com
Bradley MUSIC 
NLG, Microsoft Corp. 
1 Microsoft Way 
Redmond, WA 98052 
brmusic@microsoft.com
Abstract 
This paper describes issues involved in the development of a grammar checker in multiple languages at Microsoft
Corporation. Focus is on design (selecting and prioritizing error identification rules) and evaluation (determining
product quality).
Introduction 
The goal of the project discussed here is to develop 
French, German and Spanish grammar checkers 
for a broad user base consisting of millions of 
Microsoft Word customers - users who create 
documents of all types, styles and content, using 
various terminology and dialects, and who want 
proofing tools that help eliminate mistakes in an 
efficient and non-intrusive fashion.
The fact that the user base is so broad 
poses many challenges, among them the questions 
of which errors are most common among such a 
diverse set of users, and what types of input the 
grammar checkers need to be tested and evaluated 
on.
This paper will describe the common
methods and processes that we use across the
language teams for design and evaluation, while
focusing on language-specific characteristics for
the actual product design. The central role of large
text corpora in the three languages (including
regional variations) for both design and evaluation
will be discussed.
Design: How do we know what the right 
features are? 
In the design phase of the software development 
process, we ask the question: What should the 
product do for the user? For a grammar checker, 
the main features are the error detection and 
correction rules, or "critiques". The goal of the 
design phase is to determine which level of 
proofing and which error types typical Microsoft 
Word users care about most. It is important to 
remember that our grammar checker is not a 
standalone product, but a component within 
Microsoft Word, and that the main goal of the user 
is to create documents as efficiently as possible. 
We don't want to distract, delay or bother people 
with a picky proofing component that points out 
linguistic issues most users don't care about (even 
if we could critique those with high precision¹) or
eagerly highlights any "suspicious" sentence with a 
potential problem. Instead, we focus on critiques 
that are actually helpful to the majority of users 
from their point of view and support them in their 
ultimate goal of creating grammatically clean 
documents efficiently. 
Researching the customer 
The first step towards determining the feature set is 
to describe the target user of the grammar checker. 
One early decision was to focus on native users, 
since we are developing a grammar checker and 
not a language-learning tool. However, many of
the grammar mistakes native users make are also - 
or even more - common among non-native users, 
so we know that the grammar checker will be 
helpful to this population as well. 
The target user base for our grammar 
checkers are current and future Microsoft Word 
users, and we benefit from information that has
already been gathered about the Microsoft Word 
user profile. We know that Microsoft Word is used 
mostly at the workplace, and we know what types 
of documents various professionals create in the 
respective countries. 
In addition to relying on general Microsoft 
Word user information, we learn about people's 
proofing behavior in interviews, focus groups and 
¹ Even actual errors can belong in this category: the French
capitalization rules for language names vs. people (e.g.
français vs. Français), for instance, are clear, but customer
research shows that many users don't want such errors to be
pointed out.
surveys, conducted in the target markets Germany
(where we include Swiss and Austrian speakers),
France, Canada, Spain, and Latin America. We
develop discussion guides and questionnaires to
gather detailed information about how people
ensure that their documents are "grammar-clean".
We start with questions about the types of
documents they write and whether they care
equally about the correctness of all their writing
(we have found, not surprisingly, that the level of
desired proofing depends on the intended formality
of the document, which in turn depends on the
target audience for the text), and proceed with
questions about how they proof their texts and
what types of issues they feel they need help with.
Focus groups and survey participants
provide a lot of input on the question of which
errors people care about most. We know from
these studies to focus on actual grammar errors
instead of on stylistic issues, since there is no
common agreement about the latter and people are
generally less interested in seeing them pointed
out. We also receive detailed feedback on
language-specific priorities for error detection: we
have learned, for instance, that French speakers care
about agreement and getting tense and mood right,
German speakers care about selection of case,
capitalization and spelling together vs. apart rules,
and Spanish speakers care about agreement,
correct use of clitics, and confusable words, among
other error types.
Selecting and prioritizing the features 
After determining the target user for the grammar
checker, we systematically compile the set of 
critiques that will be helpful to this user base. For 
features like the user interface we use data gained 
from user feedback concerning the existing English 
grammar checker and confirm the findings in the 
target countries; the actual error recognition rules, 
however, are selected solely on a language-specific
basis. 
The methods we apply in order to 
determine the critique sets are systematic and are 
shared among the teams. First, error types and
potential critiques are compiled based on the
sources listed below; in a second step we prioritize
and trim down the list of potential critiques
according to criteria of frequency, helpfulness, and
reliability. 
Language/linguistic knowledge: Each
language team consists of linguists and
computational linguists who grew up and were
educated in the native language community. We
painfully remember grammar rules that were
drilled into us back in elementary school, and have
theoretical and practical experience ranging
from language teaching to translation/localization
backgrounds to PhDs in linguistics. While we
know that disagreement errors are common in all
our target languages due to forced agreement
(between subject and verb or between a noun and
its articles/modifiers), we pay special attention to
language-specific phenomena and error types. For
instance, analysis of French errors reveals a high
degree of confusion between infinitive and past
participial verb forms, presumably due to their
phonetic equality; we therefore developed special
confusable word detection algorithms for the
French grammar checker.
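To illustrate the kind of pattern such an algorithm targets, here is a deliberately naive sketch (not the actual Microsoft implementation; the auxiliary-based heuristic and all names are our illustrative assumptions): after a form of the auxiliary avoir, an -er infinitive is usually a misspelled -é past participle (e.g. *j'ai manger for j'ai mangé).

    import re

    # Finite present-tense forms of the auxiliary "avoir" (kept short for brevity)
    AVOIR_FORMS = {"ai", "as", "a", "avons", "avez", "ont"}

    def flag_infinitive_after_avoir(sentence):
        """Flag -er infinitives right after a form of avoir, where a past
        participle in -é is usually intended. Naive: no parsing, so it
        misses intervening adverbs and false-flags some -er nouns."""
        tokens = re.findall(r"[a-zàâçéèêëîïôùûü]+", sentence.lower())
        flags = []
        for prev, word in zip(tokens, tokens[1:]):
            if prev in AVOIR_FORMS and word.endswith("er"):
                flags.append((word, word[:-2] + "é"))  # (error, suggestion)
        return flags

    print(flag_infinitive_after_avoir("J'ai manger une pomme."))  # [('manger', 'mangé')]

A production system would of course use the full parse rather than token adjacency, precisely to keep false flags down.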
Another aspect of language knowledge is
observing trends and changes in language use,
whether the changes are speaker-induced (e.g.
gradually changing case requirements after specific
prepositions in German) or externally motivated,
like the spelling reform in Germany, which has
huge consequences for the grammar checker².
Reference books: Books about typical (and
frequent) grammar errors can be hard to come by,
depending on the language being analyzed, though
we did find sources for typical "grammatical
stumbling blocks" for all languages. Excellent
information came from books about writing good
business letters, since their target readers overlap
with our target users, and they contain good lists of
grammar issues that people often grapple with (e.g.
capitalization in multiple-word expressions,
including standard business letter phrases, in
German). Unfortunately most of these give no (or
very little) indication of the frequency of the error.
Customer research: As described above,
we spend considerable time and effort
investigating what errors native language users
struggle with and would like help with.
² The spelling reform affects the grammar checker since many
changes in capitalization rules and spelling together vs. apart
rules require syntactic parsing in order to identify and correct
mistakes. An example is "zur Zeit", which is still spelled apart
when governing a genitive object, but is, according to the new
spelling rules, spelled together and with lower case ("zurzeit")
when used adverbially.
Market analysis: We study the market for
grammar checkers and proofing tools in general in
the French-, German- and Spanish-speaking countries, to
review what products and features users are 
familiar with and might expect in a grammar 
checker. 
Text corpus: We process and review 
millions of sentences for each language to find out
which errors actually occur and at what frequency. 
All of the sources listed above contribute 
to the design process. The most decisive factors 
stem from our customer research, which informs us
about what users view as their biggest grammar
challenges, and the corpus analysis, which informs
us about what errors users actually make. Corpus
analysis plays such a central role in our feature
design that it is discussed separately in the next 
section. 
Analyzing text and error data 
Our text corpora are central for product design and 
evaluation, and we are investing heavily in 
creating, acquiring, categorizing, storing, tracking,
and maintaining data for the grammar checker and
future product development projects. While we
have to compile three separate corpora for French, 
German and Spanish, the methods and principles 
we apply to building and maintaining the corpora 
are shared. 
The corpus used in the grammar checker
project is representative of the documents that
target users create, and therefore of the input that the
grammar checker will have to deal with. It includes
a mix of documents from various media (e.g.
newspaper vs. web site), styles (e.g. formal vs. 
casual) and content (e.g. finance vs. science). The 
proportion of each category is predetermined 
according to the Microsoft Word user profile 
described above. 
The research community benefits from 
access to published corpora not available for 
commercial use. In contrast, a corporation that 
needs data for development and testing purposes is 
much more restricted. The following list gives an 
overview of some of the challenges we are faced 
with: 
Copyright issues: While we are surrounded 
by a lot of text, especially on the internet, many of 
these documents are copyrighted and cannot be 
used without permission; we need to follow 
detailed legal guidelines and procedures, which 
can cause substantial lag time between identifying 
useful corpus sources and actually acquiring and 
using them.
Size: We need huge amounts of corpus
data, in all languages, in order to represent the various
media, styles, and contents. To render test results
meaningful, we need to ensure that all of the error 
types we develop critiques for have sufficient 
representation in the corpus. 
Edited vs. unedited data: For our purposes, 
we are especially interested in text that has not 
undergone proofing and revision, in order to find 
errors people actually make while entering text, as 
well as to later test the quality of the grammar 
checker. Such documents are extremely hard to 
come by, so we found ways to have such unedited 
text data specifically created for our project. 
Edited data is used to verify that the grammar 
checker does not falsely identify errors in correct 
input. 
Blind vs. non-blind data: We divide our
corpus into two parts of equal size and
corresponding content as far as document types,
subject matter and writing styles are concerned.
Half of the corpus is available to the whole team
and is used for design and development as well as
testing: the program manager uses this corpus to
identify and analyze error types and frequency, and
to support developers by providing corpus samples
for specific grammatical constructions or error
occurrences; the test team uses it to provide open
feedback to developers about the precision of the
parser and the grammar checker. The other half of
the corpus is "blind" and only available to the test
team; it is used to measure the accuracy of both
parser and grammar checker. When the test team
finds "bugs" (e.g. missed error identification, or
faulty analysis of a correct sentence as
grammatically wrong) in the blind corpus, the
underlying pattern of the problem is reported, but
the specific sentence is not revealed, in order to
prevent tuning the product to individual sentences
and biasing the accuracy numbers. Doubling the
corpus in this way means that we need more data
in terms of sheer quantity; it also poses additional
challenges for categorizing, tracking, and securing
the data.
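As an illustration of this division, here is a minimal sketch, assuming each document carries labels for medium, style, and content (the function and document attributes are hypothetical, not our internal tooling):

    import random
    from collections import defaultdict

    def split_corpus(documents, seed=0):
        """Divide a corpus into non-blind and blind halves of corresponding
        content by halving each (medium, style, content) bucket."""
        buckets = defaultdict(list)
        for doc in documents:
            buckets[(doc.medium, doc.style, doc.content)].append(doc)
        rng = random.Random(seed)  # fixed seed keeps the split reproducible
        non_blind, blind = [], []
        for docs in buckets.values():
            rng.shuffle(docs)
            half = len(docs) // 2
            non_blind.extend(docs[:half])  # design, development, open testing
            blind.extend(docs[half:])      # test team only, for accuracy metrics
        return non_blind, blind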
Cleaning: While we don't want the corpus
clean in terms of grammar errors, we do need to
process it to standardize the format and remove
elements like HTML formatting codes, hard
returns, etc., so we can use it in automated tools.
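A minimal sketch of this kind of normalization, assuming simple regular-expression stripping is sufficient (the actual pipeline is not described here):

    import re

    def clean_text(raw):
        """Standardize format for automated tools without touching the
        grammar errors the text may contain."""
        text = re.sub(r"<[^>]+>", "", raw)            # strip HTML/formatting codes
        text = text.replace("\r\n", "\n")             # normalize line endings
        text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # join hard returns within paragraphs
        text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces/tabs
        return text.strip()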
The extensive effort put into design helps 
to ensure that the product focuses on errors people 
actually make and care about. The next section 
describes the testing done to determine if we've
achieved acceptable quality for identification and 
correction of these error types. 
Evaluation: How do we know when we're
done? 
During the development process, testers give
feedback and quality assessment, based on both the
blind and non-blind corpora, and using a variety of
tools to provide quick turn-around after a change to
the system. Development feedback shows the
effects of each change to the lexicon, morphology,
grammar or critiquing system, where the testers
systematically apply language-independent
methods of analysis and reporting. Developers
need to know the impact of any changes they make
as soon as possible, so that further development
can proceed with confidence or, in the case of an
unexpected negative impact, problems can be
corrected before further development. Quality
assessment is partially reflected in terms of agreed-
upon metrics, such as recall, precision, and false
flags per page.
As we approach the end of the
development process, we continue to monitor the
metrics against pre-defined goals, but also shift
focus to other kinds of testing oriented
towards the user's experience with the grammar
checker.
This section will briefly outline key 
metrics used for quality assessment as well as 
some of the user-focused testing we do before 
shipping the final version. 
Precision 
Precision = good flags / total flags. For example, if the
grammar checker correctly identifies 160 errors on
a given corpus, and incorrectly flags 15
words/phrases as errors in that corpus, the
precision will be 160/175 = 91%. Determining
precision has less meaning the more the test corpus
has undergone editing. In the extreme case of a
highly edited text (e.g. a published book), where in
principle there should be no grammar errors
present at all, the only flags a grammar checker
could possibly produce would be false flags; thus
precision would be 0%, which would give an
inaccurate impression of product quality.³
Precision is reported on a variety of corpora within
the language teams, these having the same
representativity across the teams.
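The arithmetic above as a small function, with the worked example from the text (a sketch for illustration, not our evaluation tooling):

    def precision(good_flags, false_flags):
        """Precision = good flags / total flags."""
        total = good_flags + false_flags
        return good_flags / total if total else 0.0

    # 160 correct flags and 15 false flags give 160/175 = 91% precision.
    assert round(precision(160, 15) * 100) == 91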
Recall 
Recall = good flags / expected flags, i.e. what
percentage of the errors is actually spotted.
Research on users' impressions of grammar
checker quality consistently shows that users are
less concerned about recall than about the number
of false flags. This has entailed a cross-linguistic
prioritization of improving quality by reducing
false flags. In terms of metrics, this means that
increasing precision and decreasing the false flags
per page rate have had a higher priority than recall
for these grammar checkers. One challenge here is
the fact that methods for reducing false flags can
risk the loss of good flags that would be helpful to our
users, so a light hand is required to balance
reducing the absolute number of false flags vs.
still spotting and correcting the errors people really
make.
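The corresponding sketch for recall (the counts in the comment are invented for illustration):

    def recall(good_flags, expected_flags):
        """Recall = good flags / expected flags, i.e. the share of the
        errors actually present that the checker spots."""
        return good_flags / expected_flags if expected_flags else 0.0

    # E.g. spotting 160 of 200 errors present gives 80% recall; suppressing
    # a noisy rule may raise precision while lowering this number.
    assert recall(160, 200) == 0.8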
False flags per page 
Although highly edited texts are less interesting for
determining precision, they are important as a basis
for measuring how 'noisy' the grammar checker is
on a finished document.⁴ This can be measured in
terms of false flags per page, with the ideal being
zero - however, language being as complex as it is,
it is in fact extremely difficult to achieve no false
flags in a system that attempts to parse and correct
the frequent errors in agreement, mood, etc. More
realistically, a trade-off has to be accepted that
gives the critiques room to work, while still staying
under what's considered an annoying level of false
flags per page. In the French, German and Spanish
grammar checker development effort, we set a goal
³ This was a flaw in a recent evaluation of a French grammar
checker done by the French Academy, where a grammar-
checking product was run against French literature from the
last four centuries, with the none too surprising result that it
suggested changes to the great authors' prose. [AFP99]
⁴ Note that noisiness is affected by factors other than grammar
checker quality; for instance, the UI can help reduce
annoying flags by remembering the editing of each sentence so as
not to bother users with the same errors once they've been
explicitly ignored, as is done in Microsoft Word.
of having fewer than one false flag per page. Once
we were well below that for each language (while
still achieving precision and coverage goals), we
subjected the grammar checkers to beta testing (see
below) to confirm whether users' impressions
of the helpfulness of the grammar checker conform
to the metrics.
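A sketch of this noise metric and the shipping goal described above (the sample counts are hypothetical):

    def false_flags_per_page(false_flags, pages):
        """Noise rate on highly edited text, where every flag is false."""
        return false_flags / pages

    # Goal from the text: stay well below one false flag per page.
    assert false_flags_per_page(12, 40) < 1.0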
Market analysis 
Although the Natural Language Group doesn't sell 
the grammar checkers as standalone products, it is
still interesting for us to determine how we fare 
against grammar checkers already on the market. 
Since we can't be sure that other grammar 
checkers have been evaluated in exactly the same 
way, we can't rely on the competitors' reported 
metrics, such as false flags per page rates. We
therefore do our own objective quality comparison
based on the same blind corpus as we evaluate 
ourselves against. Here is where the strict division 
between blind and non-blind is absolutely essential 
to avoid skewing the results - if the non-blind 
corpus were used, we would show ourselves to an 
advantage, since the developers also have access to 
that corpus and naturally train against it. 
Final testing: 'real world', bug bashes, beta 
testing 
Even given a low false flag per page rate and 
acceptable precision and recall measures, when all 
is said and done the user's impression of quality 
and usefulness can still come down to highly
specific contexts. Regardless of how we score on
our own metrics, users will often turn a grammar
checker off due to a 'spectacular' false flag and/or
annoyance. An example of what is meant by a
spectacular false flag is Nous sommes → *Nous
somme, where sommes is the correct first person
plural present form of the French verb être 'to be',
while somme without the -s has both masculine
and feminine noun readings, entailing the
possibility of a misparse of the verb as a noun, and
therefore a potential false flag. A Canadian user
who encountered this false flag when using a
product that is now off the market immediately
turned off that grammar checker for good. An
error on a common word like this can give users a
very low opinion of the grammar checker's quality
and cause them to turn it off for this flag alone.
Regardless of what the metrics tell us as to overall 
quality, it still comes down to a subjective user 
experience. 
Therefore, when the product is getting 
close to its final shippable state, we break away 
from the metrics to gain insight into how users
experience the grammar checker quality and 
usefulness. 'Real world testing' refers to test 
passes where the test teams use the grammar 
checker to edit documents like actual users will. 
Rather than focusing on detailed analysis of 
specific errors, they gain a general impression of 
the product quality. 
Bug bashes are another type of testing, 
where native speaker users from outside our group
are asked to set aside dedicated time to find
bugs in the grammar checker. These normally take 
place over several hours. Users may be asked to 
explore the limits of the grammar checker, for
instance by executing certain tasks, such as 
proofing an existing document, changing their 
settings, etc. The purpose is to find functional and 
linguistic bugs that may have been missed by our 
own extensive testing. We also ask the 
participants to answer a few questions on their 
overall experience with the grammar checker. 
Finally, beta testing is simply where the 
grammar checker is used in native speakers' daily 
document production environment - they are asked 
to use it in their daily work, submitting bugs via 
email. Eventually they are also asked to record 
their impressions on the usefulness of the grammar 
checker, with as many specifics as possible. 
Conclusion 
Developing a grammar checker for a broad user 
base presents many challenges, and this paper 
focused on two areas: design and evaluation. The
multilingual project environment allows for 
substantial leveraging of knowledge, methods and 
processes; in the end, though, a grammar checker's 
value is determined by the support it provides to a 
specific language community. To this end, native
language data guides the development, including
the analysis of large corpora and intense study of
the target markets' customers and their proofing
needs. 

References 

[AFP99] Agence France-Presse press release, May 21, 1999, "L'Académie française met en garde contre des logiciels de correction" ("The French Academy warns against correction software").
