Compiling and Using Finite-State Syntactic Rules 
Kimmo Koskenniemi, Pasi Tapanainen and Atro Voutilainen 
University of Helsinki 
Research Unit for ComputaHonal Linguistics 
HaUituskatu 11 
SF-O0100 I..lelsinkJ 
F~and 
Abstract 
A language-independent framework for syntac- 
tic finlte-state parsing is discussed. The article 
presents a framework, a formalism, a compiler 
and a parser for grammars written in this for- 
realism. As a substantial example, fragments 
from a nontrivial finite-state grammar of Eng- 
lish are discussed. 
The linguistic framework of the present 
approach is based on a surface syntactic tag- 
ging scheme by F. Karlsson. This representa- 
tion is slightly less powerful than phrase 
structure tree notation, letUng some ambigu- 
ous constructions be described more concisely. 
The finite-state rule compiler implements what 
was briefly sketched by Koskenniemi (1990). It 
is based on the calculus of finite-state 
machines. The compiler transforms rules into 
rule-automata. The run-time parser exploits 
one of certain alternative strategies in perform- 
ing the effective intersection of the rule autom- 
ata and the sentence automaton. 
Fragments of a fairly comprehensive finite-state 
granmmr of English axe presented here, includ- 
ing samples from non-finite constructions as a 
demonstration of the capacity of the present 
formalism, which goes far beyond plain disam- 
blguation or part of speech tagging. The 
grammar itself is directly related to a parser 
and tagging system for English created as a 
part of project SIMPR I using Karlsson's CG 
(Constraint Grammar) formalism. 
1. Introduction 
The present finite-state approach to syntax 
should not be confused with eg. attempts to 
characterize syntactic structures with regular 
1. Esprit 11 project No. 2083, Structured 
information Management: Proceaalng and 
Retrieval. 
Afn'Es DE COLING-9Z NANTES, 23-28 Aou'r 1992 
phrase structure grammars. Instead of using 
trees as a means of representlng structures, we 
use syntactic tags associated with words, and 
the finite-state rules constrain the choice of 
tags. This style of representaUon was adopted 
from Karlsson's CG approach and an earlier 
Finnish parser called FPARSE (Karlsson 1985, 
1990). 
The current approach employs a shallow sur- 
face oriented syntax. We expect it to be useful 
in syntactic tagging of large text corpora. Infer- 
mat/on retrieval, and as a starting point for 
more elaborate syntactic or semantic analysis. 
1.1 Representation of sentences 
We represent the sentences as regu/ar expres- 
sions, or equivalently, asfinite-state networks, 
which list all combinatory possibilities to inter- 
pret them. Consider the sentence: 
the program runs. 
A (simplified) representation for the morpholog- 
ically processed but syntactically unanalyzed 
sentence as a regular expression could be 
roughly as follows: 
0@ 
the DEF ART 
\[8 I @/ I 8< I 8>\] 
\[ \[program N NOM SG \[eSUBJ \[ 8OBJ I 
8PREDC} \] 1 
\[Drogram V PRES NON-SG3 8FINV 8MAINV\] i 
\[program V INF\]\] 
\[8 I 8/ I 8< I @>\] 
\[ \[run V PRES SG3 @FINV 8MAINV\] I 
\[run N NOM PL \[eSUBJ I 8OBJ I 8PREDC\]\]\] 
88 
Here 8S represents a sentence boundary, @ a 
word boundary, 8/ an ordinary clause hound- 
amy, @< a begi,Lrflng of a center embedded 
clause, and @> the end of such an embedding. 
Square brackets '\[...r are used for grouping, 
and vertical bars' I' separate alternaUves. Each 
word has been assigned all possible syntactlc 
roles It could assume in sentences (eg. 0SUBJ 
1 5 6 PROC. oF COLING-92, NANTES. AUG. 23-28, 1992 
or @OBJ or ~PREDC). Note that between each two 
words there might be a clause boundary or a 
plain word boundary. The regular expression 
represents a number of strings (some 320) 
which we call the readings of the (unanalyTed) 
sentence. The following is one of them: 
@8 the DEF ART 8/ 
program V PRES NON-SG3 8FINV 8MAINV 0 
run N NOM PL 8PREDC 8@ 
This one is very ungranmmtlcal, though. It will 
be the task of the rule component to exclude 
such, and leave only the grammatical one(s) 
intact: 
88 the DEF ART 8 
program N NOM SG 8SUBJ @ 
run V PRES SG3 8FINV 8MAINV 88 
Note that in this framework, the parsing does 
not build any neW structures. The granu-natieal 
reading Is already present in the input repre- 
sentation. 
1.2 The role of rules 
The task for the rules here is (as is the ease with 
the CG approach by Karlsson) to: 
• exclude those interpretations of ambiguous 
words which are not possible in the current 
sentence, 
• choose the correct type of boundaries 
between each two words, and 
• detern~Ine which syntactic tags are the 
appropriate ones. 
Rules should preferably express meaningful 
constraints which result in the exclusion of all 
ungramnmtical alternatives. Each rule should 
thus be a grammatical statement which effec- 
tively forbids certain tag combinations. 
Rules in the CG formalism are typically dedi- 
cated for one of the above tasks, and they are 
executed as successive groups. 
In finite-state syntax, rules are logically unor- 
dered. Furthermore, In order to achieve word 
level disambiguation, one typically uses rules 
which describe the occurrences of boundaries 
and syntactic tags in grammat/ca//y correct 
structures rather than indicating how the 
incorrect interpretations can be identified. 
Thus, the three effects are achieved, eve** ff 
individual finite-state rules cannot be classified 
into corresponding three groups. 
1.3 Rule automata 
Finite-state rules are represented using regular 
expressions and they are transformed into 
finite-state automata by a rule compiler. 
The whole finite-state grammar consists of a set 
of rules which constrain the possible choices of 
word Interpretations, tags and boundaries to 
ACRES DE COLING-92, NAN'D!S, 23-28 AO6T 1992 
only those which are considered grammatical. 
The entire grammar Is effectively equivalent to 
the (theoretical) intersection of all individual 
rule automata. However, such an intersection 
would be impractical to compute due to Its 
huge size. 
The logical task for any finite-state parser in the 
current approach is to compute the intersec- 
tion of the unanalyzed sentence automaton and 
each rule automaton. Actual parsing can be 
done in several alternative ways which are 
guaranteed to yield the same result, but which 
vary in terms of efficiency. 
2. The finite-state rule formalism 
Tapanainen (1991 ) has implemented a compiler 
and a parser for finite-state gramnmrs. The 
compilation and the parsing is based on a Com- 
mon Lisp finite-state program package written 
by him. Tapanainen also reports in his Master's 
thesis (1991) new methods for optimizing the 
result of the compilation and improving the 
speed of parsing. 
The current rule compiler has only few built-in 
rules or definitions. Instead, It has a formalism 
for defining relevant expressions and new rule 
types. There are two types of definitions for 
this purpose. The first one defines a constant 
regular expression which can later on be 
referred to by its name: 
name = expressionI 
Some basic notations are defined in this way 
such as the dot which stands for a sequence of 
tokens within a single word: 
• = \tOo I o I o/ I O< I o>\]1 
The backslash '\' denotes any sequence of 
tokens not containing occurrences of its argu- 
ment (which here lists all types of word and 
clause boundaries). A variation of the dot is a 
dot-dot'.." which represents a sequence of 
tokens within the same clause: 
• <> - o< \[- I • I Q/I* ~>; 
• . = \[- I • I ~<>\]*s 
The second type of definitions has parameters, 
and it can be used for expressions which vary 
according to their values: 
name(paranb, .., param,) - expressionl 
The expression is a regular expression formu- 
lated using constant terms and the parameter 
symbols param i. An example of this type of def- 
initions is the loll@wing which requires every 
clause to be of a given form X: 
clause(X) - \ \[ \[~ I ~/ I 0<\] 
-iX I -..\] 
\[~> I O/ I O0\]\] 
1 5 7 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
The formula forbids subsequences which are 
clauses but not of form x (the middle term is 
easier to understand as \[ -x & .. \]). 
Experience with writing actual large scale 
grammars within the finlte-state framework 
has indicated that we need more flexibility in 
defining rules than what was first expected. 
This flexibility is achieved by having one very 
general rule format: 
expressionl 
The expression simply defines a constraint for 
all sentences, ie. it is already as such equiva- 
lent to a rule automaton, Forbidding unwanted 
combinations or sequences, such as two finite 
verbs within the same clause, can be excluded 
cg. by a rule: 
UNIQUE (FINV) 
Here, UNIQUE Is a definition which has been 
made using the formalisms above, and is avail- 
able for the grammar writer. Using the UNIQUE 
definition, one can express general principles, 
such as that there is at most one main verb, at 
most one subject etc. in each clause. 
Most of the actual rules still use the right 
arrow format: 
expression -> 
left-context _ right-context; 
All three parts of the rules are regular expres- 
sions. The rule requires that any occurrence of 
expression must be surrounded by the given 
context. 
3. English finite-state grammar 
The English finite-state grammar discussed 
here was written by Voutilainen. The grammar 
itself is much more comprehensive than what 
can be described in this paper. Although the 
grammar already covers most of the areas of 
English grammar that it is intended to cover, it 
is still far from complete in details. The gram- 
mar, when complete, will be part of Voutilai- 
nen's PhD dissertation (forthcoming). This sec- 
tion presents some general principles from 
that grammar, and a few examples from more 
complex phenomena. 
3.1 Goals of the grammar 
The present grammar has many goals and 
characteristics similar to those of the SIMPR 
Constraint Granmmn 
• the ability to parse unrestricted running 
texts with a large dictionary, 
* concrete, surface-orlented description in 
terms of dependency syntax. 
The current finite-state syntax uses, indeed, 
the same ENGTWOL lexicon as the SIMPR CG 
syntax (Karlsson et al. 1991). The set of syn- 
tactic features are adopted from the CG 
description almost as such with a few addl- 
tions. 
In the present finite-state approach, however, 
we aim at: 
• more general and linguistically motivated 
rules (fewer, more powerful and general 
rules in the grammar), 
• more accurate treatment of Intrasentential 
structure (three types of clause boundaries 
instead of one), and 
• a satisfactory description of certain complex 
constructions and sentence structures. 
The present formalism can achieve somewhat 
more general and powerful rules than the cur- 
rent CG formalism through tile use of full reg- 
ular expression notation. 
3.2 Clause boundaries 
Some power and accuracy is gained through a 
commitment to use a notation for clause 
boundaries which is exact in defining when 
words belong to the same or a different clause. 
The two formalisms are equivalent in many 
cases: 
@@ The dog chased a cat 
@/which ate the mouse @@ 
The more elaborate clause boundary marking 
makes a difference in case of center-embed- 
ding: 
@@ The man @< who came first @> got the job @@ 
This convention indicates that there are two 
clauses: 
The man .. got the job 
.. who came first .. 
3.3 Constituent structure 
Head-modifier relations are expressed (here 
and in the CG) with tags, eg.: 
a DET @DN> 
big A @AN> 
cat N @SUBJ 
The head of a NP is tagged as a major constitu- 
ent, here as a subject. In case the constituent 
is a coordinated one, each of the coordinated 
head gets the same tag: 
John's N GEN @GN> 
brother N NOM SG @SUBJ 
and COORD @CC 
aunts N NOM PL @SUBJ 
The genitival attribute O>GN modifies at least 
the next noun (brother) but possibly also 
some further ones at the same level of coordi- 
nation (aunts). 
Aclms DE COLING-92, NANTES, 23-28 AOUT 1992 1 5 8 Foot. OF COLING-92, NANTES, AUG. 23-28, 1992 
3.4 An example 
Let us consider the following (classical) sen- 
tence 
Time flies like an arrow. 
The input to the finite-state syntax comes from 
the ENGTWOL morphological analyzer wlth 
some modifications and extensions in the sets 
of features associated with words: 
8O 
\[\[time N NOM ~G 
\[@<P I 8NN> I @APP 18PCOMPL-O/N/-F I 
8PCOMPL-O/N I @PCOMPL-S/N/-F I 
8PCOMPL-S/N I 8I-OBJ/-F I 8I-OBJ I 
@OBJ/-F { 8OBJ I 8SUBJ/-F { @SUBJ \]\] I 
\[time <Eva> 
\[\[V IMP VFIN 8+FMAINV\] I 
\[V INF \[8<NOM-FMAINV I 
O-~%~AINV/-F { 8-FMAINV\] \] \] \] \] 
\[8 I 8/ I ~< I 8>\] 
lilly <Eva> <SV> v PRES SG3 VFIN 8+FMAINV\] 
\[fly N NOM PL 
\[8<P { 8APP I 8PCOMPL-O/N/-F { 
@PCOMPL-O/N I OPCOMPL-S/N/-F l 
8PCOMPL-S/N I @I-OBJ/-F { 81-OBJ I 
8OBJ/-F I 8OBJ I 8SUBJ/-F I 8SUBJ\] \]\] 
\[8 I 8/ I @< I @>\] 
\[\[like PREP \[O<NOM I 8ADVL I 8ADVL/INV\]\] 
\[like N NOM SG 
\[8<P i @NN> I @APP { 8PCOMPL-O/N/-F 
@PCOMPL-O/N { 8PCOMPL-E/N/-F I 
8PCOMPL-S/N { 8I-OBJ/-F { @I-OBJ I 
8OBJ/-F I 8OBJ { 8SUBJ/-F I 8SUBJ\] 
\[like <SVOC/A> <Eva> <SV> V 
\[ \[SUBJUNCTIVE VFIN 8+FMAINV\] I 
\[IMP VFIN @+FMAINV\] I 
\[INF \[@<NOM-FMAINV { 8-FMAINV/-F I 
@-FMAINV\] { 
\[PRES NON-SG3 VFIN @+FMAINV\] \] \] \] 
\[8 ) 8/ } 8< I 8>\] 
\[an <Indef> DET CENTRAL ART SG 8DN>\] 
\[@ I 8/ I @< I 8>\] 
\[\[arrow V \[lIMP VFIN @+FMAINV\] { 
(INF \[8<NOM-FMAINV l 
8-FMAINV/-F I 8-FMAINV\] \] 
\[arrow N NOM SG 
\[8<P I 8NN> i 8APP I 8PCOMPL-O/N/-F 
8PCOMPL-O/N \[ 8PCOMPL-S/N/-F I 
@PCOMPL-S/N I 8I-OBJ/-F I @I-OBJ I 
8OBJ/-F I 8OBJ I 8SUBJ/-F I 8SUBJ\]\]\] 
88 
This small sample sentence representation 
contains some 21 million readings. 
Each syntactic-function label starts with S. 
Many of the common labels like ~SUBJ have 
been replaced by the combination of ~SUBJ/-F 
and 8SUBJ to reflect the distinction of subjects 
of non-finite constructions from those of the 
main verb. A similar distinction is made in the 
verbal entries. 
The grammar is committed to exclude only 
those readings which are ungrammatical. 
ACRES DE COLING-92, NANTF~, 23-28 AO6T 1992 
Thus, several readings may pass the rules, in 
thls case, the following six: 
i. 88 time N NOM SG 8t~> 
fly N NOM PL @SUBJ 8 
like <SVOC/A> <EVa> <SV> 
V PRES NC~-SG3 VFIN 8+FMAINV 8 
an <Indef> DET ~ ART SG 8DN> 8 
arrc~ N NOM SG 8OBJ 88 
2. 88 time N NOM SO 8SUBJ 8 
fly <Eva> <SV> V PRES SG3 VFIN @+FMAINV @ 
like PREP 8ADVL 8 
an <Indef> DET CENTRAL ART SG @DN> 8 
arrow N NOM SG 8<P 88 
3. 88 time N NOM SG 8SUBJ 8 
fly <s~/o> <SV> V PRES SG3 VFIN @+FMAINV @ 
like N NCt4 SG 8OBJ 8 
an <Indef> DET CENTRAL ART SG 8DN> @ 
arrow N NOM SG @APP 88 
4. 88 time <Eva> V IMP VFIN @+FMAINV 8 
fly N NOM PL 8St~J @ 
like <SVOC/A> <SVO> <S~V> 
V PRES NON-SG3 VFIN 8+FMAINV 8 
an <Indef> DI~.P CENTRAL ART SG @DN> @ 
arrow N NOM SG @OBJ 8@ 
5. 88 time <55\]0> V IMP VFIN 8+FMAINV 8 
fly N NOM PL 80BJ @ 
like PREP 8<NOM 8 
an <Indef> DET CENTRAL ART SG 8DN> 8 
arrow N NOM SG 8<P 88 
6. 88 time <55/0> V IMP VFIN @+FMAINV @ 
fly N NOM PL @OBJ 8 
like PREP @ADVL 8 
an <Indef> DET CENTRAL ART SG 8DN> @ 
arrow N NOM SG 8<P @8 
3.5 Overview of rules 
The finite-state grammar for English consists 
of some 200 rules dedicated for several areas of 
the grammar: 
• Internal structure of nominal and non-finite 
verbal phrases. The structure is described 
as head-modifier relations, including deter- 
miners, premodiflers and postmodiflers. 
• CoordinaUon at various levels of the gram- 
mar. 
• Surface-syntactlc functions of nominal 
phrases. 
The structure of noun phrases is described 
using two approaches together. A coarse struc- 
ture is fixed with the mechanism of deflnIUons. 
It would not be feasible to use that mechanism 
alone (because it would lead to a context-free 
descripUon). The deflniUons are supplemented 
with ordinary finite-state rules which enforce 
further restrictions. 
1 5 9 PROC. OF COLING-92, NANTES, AUO. 23-28, 1992 
3.6 Non-finite Constructions 
Between the level of the nominal phrase and the 
finite clause, there is an Intermediary level, that 
of non-finite t~nsmtct/ons (see Quirk & el. 
1985). These constructions resemble noun 
phrases when seen as parts of the surrounding 
clause because they act eg. as subjects, objects, 
preposition complements, etc., postmodifiers, 
or adverbials, eg.: 
(Wa~ng home} was wearisome. 
• She wants (to come} 1. 
She was fond of (singing in the dark}. 
The dog (barking in the corridor} was irritable. 
('fired by her journey}, she fell asleep. 
Internally, non-finite constructions are like 
finite clauses because the main verb of a non- 
finite construction can have subjects, objects, 
adverbials etc. of Its own. 
Both finite and non-finite constructions have a 
verbal skeleton, which in a finite construction 
starts with aJO~e verb and ends with the first 
main verb. The finite verbal skeletons In the 
following examples are underlined: 
Shs sinas. 
Will she ~? 
She would no t have been singinq unless .. 
A non-finite verbal skeleton starts with certain 
kinds of non-finite verb (to+infinitive. present 
participle, past participle, non-finite auxiliary) 
and ends with the first main verb to the right: 
It is easy lode it. 
~red by her journey, she went into her room. 
They knew it all, ~ there before. 
Non-finite verb chains do not contain center- 
embedded verbs, whereas a non-finite con- 
struction can be center-embedded within a 
finite verb chain only ff it is (a part off a nomi- 
nal phrase: 
Can \[shooting hunters} be dangerous? 
Can men (shooting hunters} be dangerous? 
The use of syntactic tags instead of a hierar- 
chical tree-structure forces us to a very fiat 
description of sentences. This might result in 
problems when describing clauses with non- 
finite constructions with a small set of tags, 
eg.: 
The boy \[kicking @MAINV\] the \[ball @OBJ\] 
\[saw @MAINV\] the \[cow @OBJ\]. 
A useful concept in clause-level syntax is the 
uniqueness principle. We wlsh to say, for 
Instance, that In a clause, there is at most one 
l. The~ is another way to interpret this 
sentence without any non-finite construc- 
tions by including 'to come' in the finite verb 
chain. We have adopted the current inter- 
prctation in order to achieve certaing lin- 
guistle generallzaUona. 
(possibly co-ordinated) subject, object, or pred- 
icate complement. Uniqueness holds for the 
finite clause, and each non-finite construction 
separately, and this will be very difficult to for- 
mulate, ff we use same tags for both domains 
(as in the above example). 
The syntactic tags as given In the finite-state 
version of ENGTWOL capitalize heavily on non- 
finite constructions in order to overcome this 
problem: 
The boy \[kicking @MAINV/-F\] the (ball @OBJ/-F\] 
\[saw @MAINV\] the \[cow @OBJ\]. 
Here, the object in the non-finite construction 
is furnished with a label different from the cor- 
responding label used in the finite construc- 
tion, so there is no risk of confusion between 
the two levels. 
The duplication of certain labels for certain 
categories Increases the amount of ambiguity, 
but, on the other hand, the new ambiguity 
seems to be of a fairly controllable type. The 
description of non-finite constructions boils 
down to two subtasks. One is to express con- 
straints on the Internal structure of non-finite 
constructions; the other, the control on their 
distribution. 
In terms of verb chain and constituent struc- 
ture, non-finite constructions resemble finite 
constructions. Their main difference is that 
word order in non-finite constructions is much 
more rigid. 
We proceed with some examples of rules 
describing non-finite constructions. An infini- 
tive acting as main verb in a non-finite con- 
struction is preceded by to acting as an 
Infinitive marker or by a subject of a non-finite 
phrase or by a co-ordinated infinitive. 
So we wish. for instance, the following utter- 
ances to be accepted: 
He wants \[to @INFMARK>\] \[go INF @-FMAINV/-F\]. 
She saw \[her @SUBJ/-F\] \[go INF @-FMAINV/-F\]. 
She saw \[her @SUBJ/-F\] 
\[come INF @-FMAINV/-F\] and \[go INF @-FMAINV/-F\]. 
The constraint is expressed as a rule: 
I J.n~-ma:l.n/-f .~, 
\[\[~INFIKiERX> \[@ laOm'l\]*\] J 
\[leub::l/-~ l<*\] I 
\[ltnt'-ma:l.n/-f I1-~* I\]phr-cc\]l ~_ 
Items preceded by an exclamation mark are 
constant definitions, t /-f signals any constit- 
uent that can occur In a postverbal position in 
a non-finite construction. 
A past participle as a main verb in a non-finite 
construction must always be preceded by an 
appropriate klnd of auxiliary or clause bound- 
my. 
Acr~ DE COLING-92, NANTES, 23-28 ^OI~T 1992 1 6 0 Pave. OF COLING-92, NANTES, AUG. 23-28, 1992 
For example: 
\[Having @-FAUXW-FJ \[gone PCP2 @-FMAINV/-F\] 
home, they rested. 
This constraint corresponds to a rule: 
I pcp2-nla;l.n/- ~ => 
\[lI~r:l.m-aux/-f I lclb\] taffy1* -_ I 
There are further rules for the distribution of 
non-finite constructions with present partici- 
ples, etc. Further rules have been written for 
the description of the Internal structure of non- 
finite constructions which, in turn, is fairly 
straight-forward. The overall experience Is that 
a fairly adequate description of these types of 
phenome~m can be achieved by the set of syn- 
tactic tags proposed above accompanied by a 
manageable set of finite-state rules. 
4. Implementation 
We need a compiler for transforming the rules 
written in the finite-state formalism Into finite- 
state automata, and a parser which first trans- 
forms sentences into finite-state networks, and 
then computes the logical intersection of the 
rule-automata and the sentence automaton. 
4.1 Compilation of the rules 
The grammar consisting of rules is first parsed 
and checked for formal errors using a GNU 
flex and bison parser generator programs. 
The rest of the compilation Is done In Common 
Lisp by transforming the rules written in the 
regular expression formalism Into finlte-state 
automata. 
FuU-seale grammars tend to be large contain- 
Ing maybe a few hundred finite-state rules. In 
order to facilitate the parsing of sentences, the 
compiler tries to reduce the number of rule 
automata after each rule has been compiled. 
Methods were developed for determining which 
of the automata should be merged together by 
intersecting them (Tapanainen 1991). The key 
idea behind this is the concept of an activation 
alphabet. Some rule-automata turn out to be 
irrelevant for certain sentences, simply 
because the sentences do not contain any 
symbols (or combinations of symbols) neces- 
sary to cause the automaton to fail. Such rule- 
automata can be ignored when parsing those 
sentences. Furthermore, It turned out to be a 
good strategy to merge automata with similar 
activation alphabets (rather than arbitrary 
ones, or those resulting in smallest intersec- 
tions). 
4.2 Parsing sentences 
The implementation of the parsIng process Is 
open to many choices which do not change tile 
results of tile parsing, but which may have a 
stgnifiemlt effect on the time mid space require- 
ments of the parsing. As a theoretical staxlJi~ 
point one could take tile following setup. 
Parser A: Assume that we first enumerate all 
readings of a sentence-automaton. Each read- 
tng Is, In turn, fed to each of the rule-automala. 
Those readings that are accepted by all rule- 
autonmta form the set of parses. 
Parser A is clearly Infeasible In practice because 
of the immense number of readings repre- 
sented by tile sentence-automaton (millions 
even in relatively simple sentences, and tile 
number grows exponentially wllh sentence 
length). 
A second elementary mad theoretical approach: 
Parser B: Take the sentence automaton and 
Intersect with each rule-autonmton In turn. 
This is more feasible, but experiments have 
shown that the number of states In the inter- 
mediate results tends to grow prohibitively 
large when we work with full scale grmnmars 
and complex sentences {Tapanainen 1991). 
This is ml Important property of llnite-state 
automata. All automata Involved are reasona- 
bly small, and even tile end result Is very small, 
but file Intermediate results can be extremely 
large imore than 100,000 states and beyond 
the capacity of tile machines and algorithms we 
have\]. 
A fm-ther refinement of the above strategy I3 
would be to carefully choose tile order in which 
the Intersecting Is done: 
Pcu'ser C: Intersect the rule-automata with the 
sentence automaton In the order where you 
first evaluate each of the remaining automata 
according to how much they reduce the 
number of readings remaining. "lhe one which 
makes the greatest reduction is chosen at each 
step. 
This strategy seems to be feasible but much 
effort Is spent on the repeated evaluation. It 
turns out that one nlay even use a one-time 
estimation for the order: 
ParserD:. Perform a tentatWe Intersection of the 
sentence autmnalon and each of the rules first. 
Then Intersect the rules with the sentence 
automaton one by one tn the decreasing order 
of their capacity to reduce the number of read- 
hags from the or~Ttnal sentence automaton. 
We may also choice to operate In parallel 
Instead of rule by rule: 
Parser E: Simulate the Intersection of all rules 
and the sentence automaton by trying to enu- 
merate readings In the sentence automaton but 
constraining the process by the rule-automata. 
Each tune when a taale rejects the next token 
Acl~.s DE COL1NG-92, NANVES, 23-28 AO~r 1992 1 6 1 PI~OC. ov COL1NG~92, NANTES. AU(;. 23-28, 1992 
proposed, the corresponding branch In the 
search process is abandoned. 
This strategy seems to work fairly satisfactorily. 
It was used In the initial stages of the grammar 
development and testing together with two 
other principles: 
• merging of automata into a smaller set of 
automata during the compflatlon phase 
using the activation alphabet of each 
automaton as a guideline 
• excluding some automata before the parsing 
of each sentence according to the presence 
of tokens in the sentence and the activation 
alphabets of the merged automata. 
Some further improvements were achieved by 
the following: 
Parser I~. Manually separate a set of rules defin- 
ing the coarse clause structure into a phase to 
be first intersected with the sentence automa- 
ton. Then use the strategy E with the remaining 
rules. The initial step establishes a fairly good 
approximation of feasible clause boundaries. 
This helps the parsing of the rest of the rules by 
rejecting many incorrect readings earlier. 
Parsing simple sentences like "time flies like an 
arrow" takes some 1.5 seconds, whereas the 
following fairly complex sentence takes some 10 
seconds to parse on a SUN SPARCstation2: 
Nevertheless the number of cases in which 
assessment could not be related to factual rental 
evidence has so far not been so great as to render the 
whole system suspect. 
The sentence automaton Is small in terms of 
the number of states, but it represents some 
10 a5 distinct readings. 
5. Acknowledgments 
The work ofTapanainen \] is a part of the activity 
of the Research Unit for Computational Lin- 
guistics (RUCL) partly sponsored by the Acad- 
emy of Finland. Voutilainen is a member of the 
SIMPR project at RUCL, sponsored by the Finn- 
ish Technology Development Center (TEKES). 
The SIMPR CG parser, grammars and diction- 
aries were designed and written by F. Karlsson, 
A. Voutilainen, J. Helkklltl, and A. Anttila. 
Many of these results and innovations are 
either directly used here, or have had a direct 
influence on the present results. 
1. Electronic mall addressee of the authors 
are: Klmmo. Ko,ske nnleml @HelsLnkl.Fl, 
Pasl.Tapanalnen@Helslnkl.Fl, 
avoutlla@llng.helslnkl, fl 
AcrEs DE COLING-92, NANTES, 23-28 ^Ot3T 1992 
1 6 2 PRec. OF COLING-92, NANTES, AUG. 23-28, 1992 

References 

E. EJerhed, K. Church: Finite-State Parsing. 
F. Karlsson (ed.) Papers from the Seventh Scan- 
dlnavlan Conference on Linguisi~s. University 
of Helsinki, Department of General Linguistics, 
Publications, No. 10. pp. 410-432. 

F. Karlsson 1985. Parsing Finnish in terms 
of Process Grammar. F. Karlsson (ed.) Computa- 
lionel Morphosyntax: Report on Research 1981- 
84. University of Helsinkl, Department of Gen- 
eral Linguistics, Publications, No. 13. 

F. Karlsson 1990. Constraint Grammar as a 
Framework for Parsing Running Text. H. Karl- 
gren (ed.) COLING-90: Papers Presented to the 
13th International Conference on Computational 
L/tu3uist/cs. Helsinkl, Vol. 3, pp. 168-173. 

F. Karlsson, A. Voutilalnen, J. Helkkil~i, A. 
Anttfla, 7 January 1991. Na/ural Language Pro- 
cesslruJ for Information Retrleval Purposes. 
SIMPR Document No. SIMPR-RUCL-1990- 
13.4e. Research Unit for Computational Lin- 
guistics, University of Helslnkl, Finland. 220 
PP. 

F. Karlsson, A. Voutflalnen, J. Helkkilgt, A. 
Anttila (forthcoming.l, Constraint Grammar: A 
Language-Independent System for Parsing Run- 
nlng TexL 

L. Karttunen, K.Koskermiemi, IL Kaplan 
1987. A Compiler for Two-level Phonological 
Rules. Tools for Morphological Analysis (M. Dal- 
rymple, IL Kaplan, L, Karttunen, K. Kosken- 
nlemi, S. Shalo, M. Wescoat}. Center for the 
Study of Language and Information, Stanford. 
Report No. CSLI-87-108. 

K. Koskenniemi 1983. Two-level Morphology: 
A General Computational Model for Word-Form 
Recognition and Production. University of Hel- 
sinki, Department of General Linguistics, Pub- 
lications, No. 11. 160 pp. 

K. Koskennierni 1990. Finite-state Parsing 
and Disambiguation. H. Karlgren (ed.) COLING- 
90: Papers Presented to the 13th International 
Conference on Computational Linguistics. Hel- 
sinki, Vol. 2, pp. 229-232. 

R. Quirk, S. Greenbaum, G. Leech, J. Svart- 
vik 1985. A Comprehensive Grammar of the 
Eng//sh Language. London, Longman. 

P. Tapanainen 1991. P, dre//isind automaat- 
teina esitett~en kielioppisOfintfjen sovel- 
tamioen luonnollisen kielen JOsent(ljdssd. 
("Natural language parsing with finite-state 
syntactic rules.'} Master's Thesis. Department 
of Computer Science, University of Helsinki. 
