Techniques for Accelerating a Grammar-Checker 
Karel Oliva 
Computa,tiona,1 Linguistics 
University of Saa,rla,nd 
Postfach 15 11 50 
D - 66 041 Sa.arbr(icken 
Germany 
ema,il: olivaOcoli .uni-sb. de 
Abstract 
The paper describes several possibilities of 
using finite-state automata a~s means for 
speeding up the performance of a grammar- 
and-parsing-based (as opposed to pattern- 
matching-based) grammar-checker able to 
detect errors from a predefined set. The 
ideas contained have been successflfily im- 
plemented in a grammar-checker for Czech, 
a free-word-order language from the Slavic 
group. 
1 Introduction 
This paper describes an efficiency-supporting tool 
for one of the two grarnmar-checker technologies de- 
veloped in the fi'amework of the PECO2824 Joint 
Research Project sponsored by the European Union. 
The project, covering Bulgarian and Czech, two ti'ee- 
word-order languages from the Slavic t~rnily, was 
performed between January 1993 and mid 1996 by 
a consortium consisting of both academic and indu- 
strial partners. 
The basic philosophy of the technology discussed 
in this paper 1 is that of linguistic-theoretically smmd 
grammar-and-parsing-based machinery able to de- 
tect, by constraint relaxation, errors from a predefi- 
ned set (as opposed to pattern-matching approaches, 
which do not seem promising for a free word-order 
language). The core of the system (broad-coverage 
HPSG-based grammars of Bulgarian and Czech, and 
a single language-independent parser) was developed 
m the first three years of the project and was then 
passed to the industrial partners Bulgarian Business 
System IMC Sofia. and Macron Prague, Ltd. While 
the Bulgarian system remained in more or less a de- 
monstrator stage only, the Czech one satisfied Ma- 
cron's requirements as to syntactic coverage. Ho- 
wever, Macron expressed serious worries about the 
speed of the system, should this be really introdu- 
ced to the market. Following this, severa.1 possibili- 
IAs for the alterna.tive technology, cf. (Hola.n, Kubol't, 
a.ztd Pl/Ltek, 1997) 
ties of using finite-state automata (FSA) as means 
for speeding up the performance of the system were 
designed, developed and implemented, in particular: 
• for detecting sentences where none of the prede- 
fined errors can occur (tiros ruling out such sent- 
ences from the procedure of error-search proper) 
• for detecting which one(s) of tile predefined er- 
ror types might possibly occur in a particular 
sentence (hence, cutting clown the search space 
of the error-search proper) 
• for detecting errors which are of such a nature 
that their occurrence might be discovered by 
a machinery simpler than full-fledged parsing 
with constraint relaxation 
• for splitting (certain cases of) complex sent- 
ences into independent clauses, a,llowing thus 
for the error-detection to be performed on shor- 
t, er strings. 
2 Lexicalization of Error Search 
Very many of the errors to be discovered by the sy- 
stem can be traced down to mismatches of (vMues 
of) features projected into the synta.ctic structure 
from the lexicon. Even though the error searching 
capabilities of the systern are not lirnited m princi- 
ple to these lexicMly induced errors, ibr a practicM 
implementation it turned out to be useful to narrow 
down the error search of the system to ahnost only 
these kinds of errors, fbr the following reasons: 
1. the loss of generality of the system is in fact only 
minimM, since the majority of errors which the 
system is able to detect are of this nature any- 
way (the only exception being agreement errors 
revolving NPs with complicated internM struc- 
ture, e.g., ellipses or coordina,tion) 
2. this loss of error coverage (ahnost negligible tbr 
a real application) is outweighed by substa.n- 
tim gain in overall (statistical) speed of the sy- 
stem, which is achieved by adding a preproces- 
sing phase consisting of a finite state automaton 
pa,ssing through the input string and looking for 
a lexical trigger of a contingent error: 
155 
• if this automaton does not find any 
such trigger, tile time-consurrfing grammar- 
checking process proper (i.e. parsing, pos- 
sihly also reparsing with relaxed cons- 
traints) is not started at all and the sent- 
ence is iminediately marked as one contai- 
ning no detectable error 
• if this automaton finds such a lexical trig- 
ger of an error, it. 'remelnbers' its nature 
so that in tile tbllowmg phases, only the 
respect,ive constrailltS are relaxed (which 
helps to cut down the search space, a.s 
compa.red to reparsing with relaxing of all 
predefined-errors-related constraints) 
As an example of this idea, let us consider a sy- 
stem deMing with errors in subject-verb agreement 
in Czech (and taking - for the very purpose of this 
example - detection of no other errors into account). 
Since the realistic pa.rt of such errors in Czech is the 
'-I/-Y' dichotomy on homophonic past tense verb 
endings occurring on plural verbs ('-I' ending sta.n- 
drag with l:)lural masculine animate subjects, '-Y en- 
ding with plural masculine inanimate and feminine 
subjects), the preprocessing finite-state automaton 
marks all sentences not conta.ining any of these forms 
(i.e. all sentences containing only singular verbs, or 
plural verbs but in present tense or in neuter gender, 
or infinite verb tbrms) as 'containing no detectable 
error', without any actual grammar-checking taking 
place (it is, however, obvious that this does not ne- 
cessarily mea.n that the sentences are truly correct - 
they just do not contain the kind of error the system 
is able to detect). 
3 Alternative Error-Classification 
and Error Search by Finite Automata 
Another important step towards tile apl)lication of 
FSA to error-detection was developing a new dimen- 
sion of cl~ussitication of errors to be detected: apart 
fi'om the lnore standard criteria of frequency a.nd 
lmrforlnance/conqmtence, we developed a scale ba- 
sed on the comI~lexit, y of the forlnal appara.tus nee- 
ded for the detection of the particular error type (as 
for error typology developed for the purpose of the 
error detection techniques used in the l)roject, of. 
(l%odrignez Selles, GMvez, and Oliva, 1996)). On 
t, he one end of this sca.le were errors recognizable wi- 
thin a strictly local context, such as COmlnas missing 
in front of a certain kind of complementizers (sub- 
ordinating conjunctions) or incorrect vocalization of 
a preposition (in both Bulgarian and C, zech, certain 
prepositions ending norlnally with a. consonant get, ~t 
supporting w)cal m case the word that follows them 
also starts with a. consonant - the parMlel in English 
would be the opt)osition between the two forms a 
and an of the indefinite article). On the other elm 
of the scale we put, e.g., the general case of subject- 
verb ~greement, errors. Pra.ctically lllOl'e imtn.~rtallt 
was tile question whether there exists a, (:lass of er- 
rors with complexity of detection lying between the 
"trivial errors" and the errors tor tile detection of 
which a fldl-fledged analysis is necessa.ry - in other 
words, the question whether there exist some errors 
for the recognition of which 
• on the one hand, a limited local context is msuf 
ficient (i.e. it is necessa.ry tot this end to process 
a substring of length which ca.nnot be set in a.d- 
vance, in generM the whole input string), 
• on the other hand, it is not necessary t.o use the 
power of the fldl-fledged parser, al,d, in parti- 
cular, it is sufficient to use the power of a fi- 
nite state autolnaton or only slight augmenta.- 
lion thereof. 
Following some linguistic research, two such error 
types have been selected for implenlentation, and 
while one of thenr is just a lnarginal subtype of an 
error in subject-verb agreernent, the other is an error 
type of its own, and in addition one of really crucial 
importance tot practical gramma.r-checking due to 
its high frequency of occurrence. 
Tile forlner error to be detected by the l~init,e state 
machinery is a particular instance of an error where 
a plural masculine animate subject is conjoined with 
a verb in a phlrM feminine tbrm (el. also the example 
above). The idea of detecting some particular cases 
of this error by a finite state automaton results from 
the combination of the following observa.tions: 
• the nominative plural forln of masculine ani- 
mate nouns of the declension types pd'.. and 
pi'edse.da is not ambiguous (homonynmus) with 
any other case forms (apa.rt form vocative case, 
which we shall deal with innnediately below); 
this means that if such a. forn~ occurs in a sent- 
ence, then this forln ca.n be only 
- either a snbject, 
- or a nominal predicate (with copula) 
-or a colnparisoli to these, adjoined by 
means of the conjunctions jako, .'jako~to or 
cosy 
- or an excla.lnative expression (in nomina- 
tive or vocative case) 
• due to rules of Czech interl)unction, any excla- 
nlative expression ha.s to be selm.ra.t, ed from the 
rest, of the sentence by COllllll~lb 
• also, due to rules of Czech mt,erpunction, two 
finite verbs in Czech must be sepa.rated fi'oln 
each other by either a. comma, or by one of the 
following coordinating conjunctions: a., i and 
uebo 
Hence, if we buihl up a finite state am, oma.ton able 
to recognize the following substrings: 
1. <unambiguous masculine a.nimate noun in no- 
minative plural> followed by any string contai- 
ning neither a finite verb \[~,rm nor a. COlm-ne 
156 
nor one of the conjunctions o, i and ~t, b,) fc>llo- 
wed by <unaml)igm,us 1,asl particii)le m plural 
feminine > 
or (due to free word order) 
2. < unaml)iguoub past par ti<'iph-, in I)lu - 
ra.l feminine> followed I)y a,ny string contai- 
ning neither a tinite verb form nor a comma 
nor one of the conjunctions a, i, '..(bo followed 
by <unambiguous masculine animate noun in 
nominative plural> a,nd coml)ine it with a sin> 
ple automaton able to detect the absence of the 
words jako, jako2to and toby as well as the a.b- 
sence of any finite torm of the copula b:¢l ('to 
be') in the sentence, then we may conclude that 
we have built a device able to detect whether 
a sentence contains a particular instance of a 
subject-verb agreernent violation. 
The det, e<'ti<m of the latter error is also based on 
the Czech interl)unct.ion rule prescribing that there 
alwa.ys lIlllSt, occllr a. (:onllna (it a coordina.ting con- 
junction between two finite verb forms. Hence, a 
simi)le finite state a./itolua.toll checking whether bet- 
ween any two finite verb f(~rllls a COllllna or a c()of 
dinating conjunction occurs is a.Me to detect ma.ny 
cases of the omission of a. co,tuna, a.t the end of an 
embedded subordinated clause, which is one of the 
most Dequent errors at all. (Of course, tile word- 
forms of the verb must be mmmbiguously identifia- 
Me as such - i.e. such tbrms as ~eu,., j~.d'u, lral.lm, 
holl etc., do not qualify due to their part of sl)eech 
ambiguity, which means that ill senten<'es containing 
them this stra.tegy cannot be used). 
4 Using FSA 
for Splitting a Sentence into Clauses 
The last idea how to gain efficiency is that of split- 
ting the sentence (if possible) into clauses before the 
processing, which has a two-fold positive effect on 
the overall process of grammar-checking: 
I. it. is less time consuming to parse two 'shorter' 
strings than one longer (a.ssmning that l)arsing 
is a.t least cubic in t, ime, this fl)llows trivially 
fronl the inequality 
A a+B a <A a+:C4 ~B+aAB=+B a =(A+B) a 
for A,B positive - length of strings) 
2. it is possible to detect an error in one of the 
substrings (clauses) irrespective to the results 
of analysis of (any of) the other one(s); ill I>ar - 
ticular, also m ca.ses where a.t least one of them 
was not analyzed and, hence, also tile pa.rsing 
(including the pa.rsing with rela.xed constraints) 
of the whole input could not have I)een perfor- 
lned on the original st.ring, which would have 
hindered the error messa.ge pertinent to the sub- 
string successfully parsed during the parsing 
with constraint relaxation to be issued. 
Ill particular, this means that measures are to be 
tbund which would allow for sl)litting the input sent- 
ence into clauses by purely superficial criteria. Ob- 
viously, this is not possible in a.ll cases (Ibr all sent- 
en<-es), but on tile ()tiler hand it is also clear that ill 
any language there exists a (statistically) huge sub- 
set of sentences of this language where such techni- 
ques are applicable. For Czech, such a.n al)proach 
might be iml)lemented using pattern matching tech- 
niques which wouht recognize for example the follo- 
wing patterns (and use them in an ol)vious way tbr 
splitting the sentence into clauses): 
1. <any string> <finite verb> <ally string> 
<conjunctive coordinating conjunction> <any 
string> <finite verb> <a.ny string> <end of 
sentence> 
2. <any string> <finite verb> <any string> 
<comma> <non-conjunctive coordinating con- 
junction or complementizer> <any string> 
<finite verb> <any string> <end of sentence> 
3. <complementizer> <any string> <finite verb> 
<any string> <comma> <any string> <finite 
verb> <any string> <end of sentence> 
where the expressions have the following meaning(s): 
• <a.ny string> is a variable for any string not 
containing elements of the following nature: fi- 
nite verb or word form honlonynlous with a 
finite verb, coordinating conjunction (of any 
kind), complementizer, any interlmnction sign 
• <finite verb> is a variaMe 
- tbr a main verb (not tor an auxiliary) spe- 
cified for person, 
- or for a past participle of a n,ain verb; 
neither of these might be homonyntous in part 
of speech (but they n-light he ambiguous within 
the defined class - such verbs as podr'obl, proudl 
do qualify) 
• <end of sentence> is simI)ly either a full-stop, 
a question-mark, an exclamation-mark, a colon 
or a semi-colon. 
All the renlaining expressions have clear mnemonics, 
and also the classes which they stand for do not 
contain elements which are ambiguous as to part of 
speech. 
5 Significance and Caveats 
The techniques to be used for gaining overall per- 
formance, speed and nlemory efficiency etc., of a 
gramma.r-checking system presented result solely 
frorn research concerning relevant l)rOl)erties of the 
syntax of a particular language (( :ze(-ll, in t)art also 
Bulgarian), and, herlce, they are strongly language- 
dependent. However, it seems t.o be self evident that 
the core idea is transferable t<) <:)tiler languages. The 
157 
introduction of these techniques contributes to tile 
process of ripening of the system into a real indu- 
strial application at least in the following points: 
• it speeds up the overall performance of the sy- 
stem considerably (in the order ranging from 
one to two magnitudes, depending on the text 
to be processed) by avoiding full-fledged par- 
sing to be performed in unnecessary cases or by 
making this parsing simpler 
• it extends its coverage, in particular the caim- 
bilities of the system to recognize a.s error-free 
also a large nmnber of sentences which in the 
original version of the system would be unana- 
lyzable by the non-relaxed grammar (ms well as 
by the grammar with relaxed constraints, tot 
that matter) due to either incompleteness of the 
grammar proper or the exhaustion of hardware 
resources. 
There is a serious caveat to be issued, however: 
since they do not employ fllll anMysis of the input 
sentence, these techniques are - albeit probably only 
rarely on the practical level - more likely to issue in- 
correct error messages, in the sense that their capa- 
bilities of detecting an erroneous sentence are exacffy 
the same as on the full-fledged approach, but their 
capabilities of detecting 'what kind of error occurred 
in the sentence are slightly reduced. For example, in 
(the Czech equivalent of) tile sentence 
*Your wife drives very drives tast 
a grammar-checker based solely on the flfll-fledged 
philosophy would correctly recognize that the same 
verb is repeated twice, while a checker using only fi- 
nite state automaton detecting the presence/absence 
of a comma or a coordinating conjunction between 
two finite verbs issues a message concerning exactly 
the 'missing connna' - and similar exarrq:~les can be 
constructed also for all the other cases. In other 
words, there is a price to be paid for the speed-up 
of the error-checking process by means of tile tech- 
niques proposed. 

References 

Holan, T., V. Kubon, and M. Platek. 1997. A prototype of a grammar checker for Czech. ANLP. 

Rodriguez Selles, Y., M.R. Galvez, and K. Oliva. 
19.96. Error detection techniques in grammar-checking. Technical report of the PECO2824 project, Autonomous University of Barcelona. and 
University of Saarland. 
