AUTOMATIC PROOFREADING OF' FROZEN PIIRASES IN GERMAN 
RAI ,F KI';S\[';, FRIEDR1CH I)UDI)A, MARIANNE KUGI,F,R 
TA Triumph-Adler AG 
TA Research EF 
Ffwther Str. 212 
11-851111 Nuremberg 8(1 
Gerlnany 
e-mail: ralf@triumph-adler.de 
Abstract 
\["rozen phrases are introduced as a new level 
of automatic proofieadiltg in between the 
stm,dard level of spelling verification tfl 
ist)lated words and tile levcl of grammar 
checking. The design and the 
iulplenlentatioll of a corresponding 
proofleading system are described in detail. 
1 Introduction 
European languages contain thousands (if 
what Mauriee Gross calls "frozen" or 
"compound words" (G ross, 1986). 111 contrast 
to "free forms", frozen words - though being 
separable into several words and suffixes - 
lack syntactic and/or semantic comtlosition- 
ality. This "lack of conlpositionality is 
apparent fiom lexical restrictions" (at night, 
but: *at day, *at evening, etc.) as well as "by 
the impossibility of inserting material that is 
a pliori plausil~lc?' (*at {coming, tlresent, 
cokt, dark} night) (Gross, 1986). 
Now, these kinds of co-occurrence 
restrictions (ltaxris, 19701 determine not 
only the concrete lexical composition of an 
individual conlptltnld word but also its 
spelling. 
Consider, as an example, the Gernlan noun 
"Bezug" which-like any other nolni ill 
German- starts with a capital letter and 
occurs as a free form--e.g, in contexts like 
"mit Bezug anl"-such that its co-(iccnrrents 
can w~ry freely. There is a single exceptk/n to 
this rule, nalnely the phrase "in bczug auff, 
which is entirely flozen, in tile sense that it 
excludes any variation of its tlarls tit' 
structure, and which, by the sanle token, 
restricts the spelling of the noun to lower 
case. 
t:or examples like this we introduce the term 
"frozen phrases" as referring to (the sub-class 
of) those frozen words that are compounds 
of several words. As Zimmermanll 
(Zimmcrmann, 1987) points out for multL 
words in general, frozen phrases are clearly 
(~ttt of scope of standard spelling correction 
systems due to tile fact that these systems 
ctlcck for isolated words Olfly and disregard 
thc respective contexts. Yet, as Gross (Gross, 
1986) indicates, at least tile entirely and 
nearly entirely tl-OZell Rirms can be 
recognized and compared with the hel l) of 
simple string operations. Thus, these kinds of 
frozell phi%lSCS are accessible t(1 tile lllet\[lOdS 
of classical autolnatic proofleadillg. 
Folk)wing Gross (Gross, 1986) and 
Zimlnermann (Zimmermanll, 1987), we 
prt/posc to further extend standard spelling 
correction systems onto the level of frozen 
phrases by simply lnaking thenl capable of 
treating more than a single word at a time. 
This implies that we arrive at a context- 
sensitive system. 
Focusing (m individttal co-occt, rrences 
(Harris, 19701, the proposed extenskln will 
be a conservatiw: one, ill tile sense that it 
requires just widening the scope of the string 
matching/comparing operations that 
classically are used in spelling correction 
systenls. No deep and time-cousuming 
analysis, like palsiilg, is involved. 
Restricting the systenl that way makes our 
approach different fr(nn tile one considered 
in (bLimon and Herz, 1991), where a context 
sensitive spelling velificati(lli is proposed to 
be done with the help of "local constraints 
auttnnata (I.CAs)" which process contextual 
COilstraillls on the level of lexical or syntactic 
categories rather than on the basic level of 
strings, hi fact, proof-reading with LCAs 
rather anlounts to genuine granlnlar 
checking aim as such Imltmgs to a different 
and higher level of lmlguage checking. 
ACRES DE COLING-92, NANrl.;S, 23-28 A(:@I 1992 1 0 2 9 I'ROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
2 Phenomenology 
The essential feature of the new level of 
automatic proofreading is that lexicalized co- 
occurrence restrictions determine the 
orthography of whole phrases beyond the 
scope of what can be accepted or rejected on 
the basis of isolated words alone. 
For German, these restrictions determine, 
among other things, whether or not certain 
forms (1) begin with an upper or lower case 
letter; (2) have to be separated by (2.1) 
blank, (2.2) hyphen, (2.3) or not at all; (3) 
combine with certain other forms; or even 
(4) influence punctuation. Examples are: 
(t) lch laufe eis. versus 
Ich laufe aufdem Eis. 
Er diirfle Bankrott machen, versus 
Er dfirfle bankrott se&. 
(2.1) Sic kann nicht Fahrrad 
fahren, versus 
(2.3) Sic kann nicht radfahren. 
(2.1) Es war bitter kalt. versus 
(2.3) Es war ein bitterkalter Tag. 
(2.2) Er liebt lch-Romane, versus 
(2.3) Er liebt Romane in lchforn,. 
(3) BetonblOcke vs. *Betonblocks versus 
Hiluserblocks vs. *HduserblOcke 
(4) Er rauchte~ ohne daft sic dawm wugte. 
vcrsus 
*Er rauchlc ohne~ daft sic daVOll wuBle. 
3 System Design 
Like conventional spelling correction of 
isolated words, proofreading of frozen 
phrases is a lexicon-based process. 
However, while in conventional spelling 
correction "referencing a dictionary of 
correctly spelled words" (Frisch and Zamora, 
1988) is standard, proofreading of the 
higher-level orthographic featnres 
mentioned in 2 above can be done on the 
basis of a lexicon that encodes the 
corresponding error patterns directly. 
Thus, each entry in the system lexicon is 
modelled as a quintuple <W,L,R,C,E> 
specifying an error pattern of a (multi-) word 
W for which a correction C will be proposed 
accompanied by an explanation E just in 
case a given match of W against some 
passage in the text nnder scrutiny differs 
significantly from C and the - possibly empty 
- left and right contexts L and R of W also 
match the environment of W's counterpart in 
the text. 
Disregarding E for a moment, this is 
tantalnotmt to saying that each such record is 
interpreted as a string rewriting nile 
W-->C / L R 
rephtcing W (e.g.: Bezug) by C (e.g.: bezug) 
in the environment 1_, R (e.g.: in auf). 
The form of these productions can best be 
characterized with an eye to the Chomsky 
hierarchy as unrestricted, since we can have 
any non-mill number of symbols on the LHS 
replaced by any number of symbols on the 
RHS, possibly by null (Partee et al., 1990). 
With an eye to semi-Thue o1 extended 
axiomatic systems one could say that a 
linearly ordered sequence of strings W, C1, 
C2, ..., Cm is a derivatkm of Cm iff (1) W is a 
(faulty) string (in the text to be corrected) 
and (2) each Ci follows from the immediately 
preceding string by one of the productions 
listed in the lexicon (Partee et al., 1990). 
Thus, theoretically, a single mistake can be 
corrected by applying a whole sequence of 
productions, though in practice the default is 
clearly that a correction be done in a single 
derivational step, at least as long as the 
system is just operating on strings and not on 
additional non-terlnimll sylnbols. 
Occurrences of W, L, and R in a text are 
recognized by pattern matching techniques. 
An error pattern W ignores the particnlarly 
error-prone aspects npper/k)wer case and 
word separator (sec the examples in 2 
above). It thus matches both the correct and 
incorrect spellings with respect to these 
features. 
Beside wildcards ff)r characters, like "*", a 
pattern for W, l J, or R may contain also 
wildcards for words allowing, for example, 
the specification of a maxinml distance of L 
or R with respect to W. Since the types of 
errors discussed here only occur within 
ACRES DE COLING-92, NANTES. 23-28 AObq' 1992 1 0 3 0 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
seuteuces, such a distant match has to be 
restricted by the sentence boundaries. Thus, 
by having the system operate seutencewisc, 
any left or right context is naturally restricted 
to be some string within the same sentence 
as W mr to be at bonndary of that sentence 
(e.g.: a punctuation mark). 
Any left or right context is either a positive 
or at negative one, i.e., its COUlpOneuts are 
homogeneously either required or forbidden 
in order for the corresponding rule to fire. So 
far it has not been necessary to allow tku • 
mixed modes within it left or right context. 
In case a correction C is proposed to the 
user, additionally at message will be displayed 
to him identifying the reason why C is correct 
rather than W. l)epending on the user's 
knowledge of tile language under 
investigation, he can take this either as an 
opporttulity to learn or rather as a help tier 
deciding whether to finally accept or reject 
the proposal. 
There are two kinds of explanations, 
absolute and conditional ones. Whereas 
absolute rules indicate that the system has 
necessary and sufficient evidence fi~r W's 
deviance, there clearly are cases where 
either W or C could he correct and this 
question cannot be decided on the basis of 
the system's lexical infl)rmation alone, hi 
these cases, a conditiol,al o1 if-then 
explanatkm is given to the user offering a 
higher-level deciskm criterion which the 
system itself is unable to apply. 
Take, as an example, the sentence 
Dieser Fihn hetrifft exit und Jung. 
which clearly allows for two readings, oue 
which renders "Alt und Jmlg" as the false 
spelling of the frozen phrase "all und juug" 
meaning "everybody", and another one which 
takes "AIt und Jung" as the correct free form 
that literally designates the old aim the 
young while excluding the mktdle-aged. 
Thus, suhstitutahility by "jedennann" (i.e.: 
"everybody") wouM be an adequate decision 
criterion to convey to the user. 
In its present design, tile system is based on 
the simplest possible working hyl~othesis, i.e. 
the assumption that the higherqevel or 
cognitive errors in a sentence cau be 
corrected independently. The intuition 
behind this assumption is that, normally, 
cognitive (or orthographical) errors are by 
far less frequent than ordinary motoric (or 
typographical) errors (for this distinction see 
Berkel and Smcdt, 1990), and that, as a 
consequence of this, they occur within 
distinct contexts such that it is excluded that 
the correction of one error be conditional 
upon the correction of another. 
According to this assumption, each sentence 
of the lext is processed only once in the 
fffllowing nmnner: The system reads from a 
plain text copy, T2, of the original formatted 
text, T I, processes the errors on T2 one after 
the other, and finally writes the 
corresponding corrections to TI, without 
also writing them to T2. 
Now, an abstract counterexample to the 
working hypothesis can be construed quite 
easily: Given a sentence containing tile 
sequence of errors 
... Wl W2 ... 
and given lcxical rules 
(RI) Wl-o> el 
(R2) W2--> C2 / C1 
rewriting (i,e. correcting) WI as C1 in any 
context, and W2 as C2 if preceeded by CI, 
then, clearly, tile system will correct Wl but 
will fail to correct W2. 
For the system to also correct W2, it must be 
able to take its previous output into accot, nt 
again. That is, it should not only read from 
T2 but also write to it. 
t Iowever, to allow corrections to be written 
also to q'2 would mean to stepwise introduce 
tleW contexts on W2. As a consequence, tile 
system woukl halve to redo the checking of 
tile whtfle sentence each time a correction 
has been made, tbr this correction might well 
bca relevant context of some previously 
cncouutered expression. 
Thus, giving up the working hypothesis 
would result in having the system take 
multiple runs through one and the same 
sentence instead of a single rtul, and this, of 
course, would drastically reduce system 
performance. 
Fortunately, we have not yet come across 
any (signi\[icaut atllount el) data that would 
justify such a redesign of the system. 
AC'I'ES l)E COLING-92, NANTES, 23-28 ^o(rr 1992 1 0 3 1 l'Roc, ov COLING-92, NANTES, AUG. 23-28, 1992 
However, since the data captured in the 
system's lexicon covers at present some 50 % 
of the relevant phenomena compared to the 
Duden (Berger 1985), the ultimate 
complexity of tile system has to be regarded 
as an open and empirical question. 
4 Status of Implementation 
A first prototype of the system described 
above has been developed in C under UNIX 
within the ESPRIT I1 project 2315 
"Translator's Workbench" (TWB) as one of 
several orthogonal modules checking basic 
as well as higher levels (like grammar and 
style; see (Thurmair, 1990) and 
(Winkelmann, 1990)) of various languages. 
A derived and extended version - covering 
3.000 rewriting rules and some 80 
explanations - has been integrated into both 
a proprietary text processing software under 
DOS and Microsoft's WOP, D FOR 
WINDOWS, version 1.1. 
This extended version has been combined 
with a conventional spelling verifier to form 
a single proofreader for the user. Internally, 
however, its hidden sub-modules are still 
totally independent from one another and 
process a sentence one alter the other. 
Thus, it may happen that the checkers 
disturb each others results by proposing 
antagonistic corrections with respect to one 
and the same expression: Within the correct 
passage "in bezug auf', tbr example, "bezug" 
will first be regarded as an error by the 
standard checker which then will propose to 
rewrite it as "Bezug". If the user accepts this 
proposal he will receive the exactly opposite 
advice by the frozen phrases checker. 
On the other hand, checking on different 
levels could nicely go hand in hand and 
produce synergetic effects: For, clearly, any 
context sensitive checking requires that the 
contexts themselves be correct antl thus 
possibly have been corrected in a previous, 
possibly context free, step. The checking of a 
single word could in turn profit from 
contextual knowledge in narrowing down the 
number of correction alternatives to be 
proposed for a given error: While there may 
be some eight or nine plausible candidates as 
corrections of "Bezjg" when regarded in 
isolation, only one candidate, i.e. "bezug", is 
left when the context "in auf" is taken into 
account. 
The same interdependence seems to exist 
with respect to higher levels of language 
checking. At least it can be argued that a 
grammar checker will profit from integration 
with a frozen phrases checker: For nothing 
but an expert for frozen phrases can verit3, 
the correctness of (idiomatic) expressions 
like "ruhig Blut bewahren" or "auf gut 
Gliick", and thus prevent a grammar checker 
from flagging the missing inflection of the 
adjective ("ruhig" or "gut") in attributive 
position. 
Thus, there is a strong demand for arriving at 
a holistic solution ff)r multi-level language 
checking rather than for just having various 
level experts particularistically hooked 
together in series. This will be the task for 
the near future. 
5 Conclusion 
As concerns the German language, we have 
shown that there is a well-defined level of 
automatic proofreading in between the 
classical level of checking isolated words and 
the more advanced level of grammar 
checking. On the basis of work done by 
Zellig Harris (Harris, 1970) and Maurice 
Gross (Gross, 1986) we have identified this 
level as that of so-called "frozen phrases", 
and we have proposed and implemented an 
automatic pmofi'eading system that covers a 
significant amount of the frozen phrases of 
German. 
We take it that a similar approach is feasible 
for languages other than German as well. 
Although in comparison with English, 
French, Italian, and Spanish, German seems 
to be unique as concerns co-occurrence 
restrictions ff)r upper/lower case st)ellings in 
a large number of cases, there are at least, as 
indicated in (Gross, 1986), the thousands of 
compounds or frozen words in each of these 
languages which are clearly within reach fflr 
tile methods discussed. In addition, the 
generalizability of our approach has received 
some confirmation fiom case studies carried 
out ff)r English and Italian. 
References 
Dieter Berger, editor. Richtiges und gutes Deutsch. WOrterbuch der sprachlic'hen 
Zweifel,~/?ille. Der Duden ill 10 Bfinden. Das 
Standardwerk zur deutschen Sprache, 
ACRES DE COLING-92, NAmxs, 23-28 AO~r 1992 1 0 3 2 PROC. OF COL1NG-92, NANTES. AUG. 23-28. 1992 
vohunc 9. Bibliogral)hisches lnstitut, 
Mannhcim, Germany, 3rd ed. \[985. 
Brigitte van Berkel and Koeuraad l)e Smedt 
19911. Triph(nle Analysis: A combined 
method for the correction of orthographical 
and typographical errors, hi l'roceedings of 
the Second Conferetice olt Applied Natural 
Language Processhtg, pages 77-83, Austin, 
Texas, February 1988. Association for 
Computational Linguistics. 
Rudolf l,'risch and Antonio Zamora. Spelling 
assistance lbr conlpoulld words. IBM Journal 
q\[ Researvh and Development, 32(2): 195-2110, 
March 1988. 
Maurice Gross. Lexicon Grammar. The 
Representation of CompouiId Words. In 
Ptoceedbrl~s ~)\[ the Eleventh biternational 
Conference on Computational Lhtguistics, 
pages l~6, Boml, Germany, August 1986. 
International Cominittee on Computatioflal 
Linguistics. 
Zellig S, ltarris, t'apet:s" in Stnwtural a#td 
Tran.sJbrmational 1,htguistics. Fornlal 
Linguistics Series, volume 1. D. Reidol 
Publishing Conlpaliy, Dordrccht, Ttic 
Netherlands, 1970. 
John E. Hopcrofl and Jeffrey D. Ulhnan. 
Introduction to Automata Theoo,, l.anguages, 
and Computation. Addison-Wesley 
Publishing Company, Reading, 
Massachusetts, 1979. 
Barbara tl. Partee, Alice ter Meulzn, and 
Robert 12. Wall. Mathematical MeUtods in 
Linguistics. Studies in l~inguistics and 
Philosophy, volume 311. Kluwer Academic 
Publishers, Dordrecht, The Netherlands 
19911. 
Joseph J. Pollock and Antonio Zatllora. 
Autonlatic spelling correction in scic, ntific 
and scholarly text. Communicatiom" of the 
ACM, 27(4): 358-368, April 1984. 
Mort Pdmon and Jacky tterz. The 
Recognition Capacity of Local Syntactic 
Constraints. In Proceedings ()\[" the Fifth 
Conference of the European Chapter of the 
Association Jor Comlmtational Lhtguistics, 
pages 155-1611, Berlin, Germany, April 1991. 
Asst)ciation for Computatiolml l.inguistics. 
Gerard Salton. lltttomatie Tert l'rocessb~g. 
7he Tran,~\[btmation, Amdysis, aml Retrieval 
of Information by Computer. Addison-Wesley 
Publishing Company, Reading, 
Massachusetts, 1989. 
Gregor Thurmair. Parsing for Grammar and 
Slylc Checking. In l'apers presented to the 
7hirteenth International Conference on 
Computatiomd Lhtguistics, volume 2, pages 
365-3711, Helsinki, Finland, August 1990. 
Hans Karlgren, editor. 
Gtinter Winkelmann. Semiautomatic 
Interactive Multilingual Style Analysis 
(SIMSA). In Paper~ presented to the 
Thirteenth International Conference on 
Computational Lhtguistics, volume 1, pages 
7%81, l telsinki, Finland, August 1990. Hans 
Karlgren, editor. 
l larald Zimmermann. Textverarbeitung im 
Rahulen moderner 
Bfirokonmmnikationstechniken. PIK. t'raxis 
der lnformationsverarbeitung und 
Kommunikation 10: 38-45, 1987. 
Acres DE COLING-92, NANTES, 23-28 AOtn 1992 1 0 3 3 I'ROC. OF COLING-92, NAN-fF.S, AUG. 23-28, 1992 
