Proceedings of EACL '99 
A Cascaded Finite-State Parser 
for Syntactic Analysis of Swedish 
Dimitrios Kokkinakis and Sofie Johansson Kokkinakis 
Department of Swedish/Språkdata
Box 200, SE-405 30
Göteborg University, Göteborg
SWEDEN 
{svedk,svesj}@svenska.gu.se 
Abstract 
This report describes the development 
of a parsing system for written Swedish 
and is focused on a grammar, the 
main component of the system, semi- 
automatically extracted from corpora. A 
cascaded, finite-state algorithm is ap- 
plied to the grammar in which the input 
contains coarse-grained semantic class 
information, and the output produced 
reflects not only the syntactic structure 
of the input, but grammatical functions 
as well. The grammar has been tested 
on a variety of random samples of dif- 
ferent text genres, achieving precision 
and recall of 94.62% and 91.92% respec- 
tively, and average crossing rate of 0.04, 
when evaluated against manually disam- 
biguated, annotated texts. 
1 Introduction 
This report describes a parsing system for fast 
and accurate analysis of large bodies of written 
Swedish. The grammar has been implemented 
in a modular fashion as finite-state, cascaded 
machines, henceforth called Cass-SWE, a name 
adopted from the parser used, Cascaded analy- 
sis of syntactic structure, (Abney, 1996). Cass- 
SWE operates on part-of-speech annotated texts 
and is coupled with a pre-processing mechanism, 
which distinguishes thousands of phrasal verbs, 
idioms, and multi-word expressions. Cass-SWE
is designed so that semantic information,
inherited from named-entity (NE) identification
software, is taken into consideration, and
grammatical functions are extracted heuristically
using finite-state transducers. The grammar has
been manually acquired from open-source texts 
by observing legitimately adjacent, part-of-speech 
chains, and how and which function words sig- 
nal boundaries between phrasal constituents and 
clauses. 
2 Background 
2.1 Cascaded Finite-State Automata 
Finite-state technology has had a great impact on 
a variety of Natural Language Processing applica- 
tions, as well as in industrial and academic Lan- 
guage Engineering. Attractive properties, such as 
conceptual simplicity, flexibility, and space and 
time efficiency, have motivated researchers to cre- 
ate grammars for natural language using finite- 
state methods: Koskenniemi et al. (1992); Ap- 
pelt et al. (1993); Roche (1996); Roche & Schabes 
(1997). The cascaded, finite-state mechanism we 
use in this work is described in Abney (1997):
"...a finite-state cascade consists of a
sequence of strata, each stratum being
defined by a set of regular-expression
patterns for recognizing phrases. [...] The
output of stratum 0 consists of parts of
speech. The patterns at level l are applied
to the output of level l-1 in the manner
of a lexical analyzer [...] longest match
is selected (ties being resolved in favour
of the first pattern listed), the matched
input symbols are consumed from the
input, the category of the matched pattern
is produced as output, and the cycle
repeats...", (p. 130).
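As an illustration only, the cascade described in the quotation can be sketched in Python as follows. The tag names (PREP, NUM, VERB, ...), the three toy levels and the function names are hypothetical simplifications, not the Cass-SWE rule set or Abney's implementation. Python's backtracking regular expressions only approximate longest match for simple patterns such as these, whereas the actual system compiles each level into a deterministic automaton.

```python
import re

# Hypothetical toy grammar: three levels of (category, regex) patterns.
# Every symbol in a pattern is followed by one space, so that matches
# always end on a symbol boundary. Tag names are illustrative only.
LEVELS = [
    [("np", r"(ART |NUM )?((ADJ|PART) )*(NOUN|PNOUN) "), ("np", r"NUM ")],
    [("pp", r"PREP np ")],
    [("vg", r"(AUX )*(ADV )*VERB ")],
]


def apply_level(symbols, patterns):
    """Rewrite one stratum: longest match wins, ties go to the first pattern."""
    compiled = [(cat, re.compile(rx)) for cat, rx in patterns]
    output, i = [], 0
    while i < len(symbols):
        rest = "".join(sym + " " for sym in symbols[i:])
        best_cat, best_len = None, 0
        for cat, rx in compiled:
            m = rx.match(rest)
            if m:
                consumed = m.group(0).count(" ")  # symbols covered by the match
                if consumed > best_len:  # strict '>' keeps the earlier pattern on ties
                    best_cat, best_len = cat, consumed
        if best_cat is None:
            output.append(symbols[i])  # no pattern applies: pass the symbol through
            i += 1
        else:
            output.append(best_cat)  # emit the category, consume the matched input
            i += best_len
    return output


def run_cascade(tags, levels=LEVELS):
    """Feed the output of each level to the next one."""
    symbols = list(tags)
    for patterns in levels:
        symbols = apply_level(symbols, patterns)
    return symbols


tags = ["PREP", "NUM", "VERB", "NUM", "NOUN",
        "PREP", "NOUN", "PREP", "PNOUN", "PUNCT"]
print(run_cascade(tags))
```

Symbols that no pattern covers are passed through unchanged, so each level sees a mixture of phrase categories and leftover tags from the level below.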
2.2 Swedish Finite-State Grammars 
There have been few attempts in the past to model 
Swedish grammars using finite-state methods. K. 
Church at MIT implemented a Swedish, regular- 
expression grammar, inspired by ideas from Ejer- 
hed & Church (1983). Unfortunately, the lexicon 
and the rules were designed to parse a very lim- 
ited set of sentences. In Ejerhed (1985), a very 
general description of Swedish grammar was pre- 
sented. Its algorithmic details were unclear, and 
we are unaware of any descriptions in the liter- 
ature of large scale applications or implementa- 
tions of the models presented. It seems to us 
that Swedish language researchers are satisfied 
with the description and, apparently, the imple- 
mentation on a small scale of finite-state meth- 
ods for noun phrases only, (Cooper, 1984; Rauch, 
1993). However, large scale grammars for Swedish 
do exist, employing other approaches to parsing, 
either radically different, such as the Swedish Core 
Language Engine, (Gambäck & Rayner, 1992), or
slightly different, such as the Swedish Constraint 
Grammar, (Birn, 1998). 
2.3 Pre-Processing 
By pre-processing we mean: (i) the recognition of
multi-word tokens, phrasal verbs and idioms; (ii)
sentence segmentation; (iii) part-of-speech tagging
using Brill's (1994) part-of-speech tagger
and the EAGLES tagset for Swedish (Johansson-Kokkinakis
& Kokkinakis, 1996). The general accuracy
of the tagger is at the 96% level (98.7%
for the evaluation presented in Table (1)). Tagging
errors do not critically influence the performance
of Cass-SWE¹ (cf. Voutilainen, 1998); (iv) semantic
inheritance in the form of NE labels: time
sequences, locations, persons, organizations,
communication and transportation means, money
expressions and body parts. The recognition is
performed using finite-state recognizers based on
trigger words, typical contexts, and typical predicates
associated with the entities. The performance of
the NE recognition for Swedish is 97.4% precision
and 93.5% recall, tested within the AVENTINUS²
domain. Cass-SWE has been integrated
in the General Architecture for Text Engineering
(GATE), Cunningham et al. (1995).

¹ The parser can be tolerant of the erroneous
annotation returned by the tagger, e.g. in the distinction
between Swedish adjective-participles in (-t). This is
accomplished by constructing rules that contain either
adjective or participle in the following manner:
np → ARTICLE (ADJECTIVE|PARTICIPLE) NOUN
² AVENTINUS (LE-2238), Advanced Information
System for Multilingual Drug Enforcement.
(http://svenska.gu.se/aventinus)

3 The Grammar Framework
The Swedish grammar has been semi-automatically
extracted from written text
corpora by observing two phenomena: (i) which
part-of-speech n-grams are not allowed to be
adjacent to each other in a constituent, and (ii)
how and which function words signal boundaries
between phrases and clauses. (i) uses
Mutual Information statistics based on
the n-grams. Low n-gram frequencies, such
as verb/noun-determiner, gave reliable cues
for a clause boundary, while high values, such as
numeral-noun, did not and were thus rejected.
Observation (i) is related to the notion of distituent
grammars, "...a distituent grammar is a list
of tag pairs which cannot be adjacent within a
constituent...", Magerman & Marcus (1990); (ii)
is a supplement to (i), which recognizes formal
indicators of subordination/co-ordination, such
as conjunctions, subjunctions, and punctuation.
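To make the mutual-information observation concrete, the following is a minimal sketch, assuming pointwise mutual information over adjacent tag pairs. The toy tag sequences, the threshold and the function names are hypothetical and serve only as illustration; they are not the corpus statistics or the procedure actually used for Cass-SWE.

```python
import math
from collections import Counter


def tag_bigram_pmi(tag_sequences):
    """Pointwise mutual information (in bits) for each adjacent tag pair."""
    unigrams, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def boundary_candidates(pmi, threshold):
    """Tag pairs whose association is weak: candidate distituents."""
    return sorted(pair for pair, value in pmi.items() if value < threshold)


# Hypothetical toy corpus of tag sequences (one list per sentence).
corpus = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["NUM", "NOUN", "VERB", "NUM", "NOUN"],
    ["DET", "NOUN", "VERB", "NUM", "NOUN"],
]
pmi = tag_bigram_pmi(corpus)
print(boundary_candidates(pmi, threshold=1.5))
```

In this toy corpus the verb-determiner pair has the weakest association and is the only candidate returned, while numeral-noun scores higher and is kept inside a constituent.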
3.1 Syntactic Labelling and the 
Underlying Corpus 
The syntactic analysis is completed through the 
recognition of a variety of phrasal constituents, 
sentential clauses, and subclauses. We follow 
the proposal defined by the EAGLES (1996), 
Syntactic Annotation Group, which recognizes 
a number of syntactic, metasymbolic categories 
that are subsumed in most current categories of 
constituency-based syntactic annotation. The la- 
belled bracketing consists of the syntactic cate- 
gory of the phrasal constituent enclosed between 
brackets. Unlabelled bracketing is only adopted 
in cases of unrecognized syntactic constructions. 
The corpora we used consisted of a variety of 
different sources, about 200,000 tokens, collected 
in AVENTINUS. The rules are divided into lev- 
els, with each level consisting of groups of pat- 
terns ordered according to their internal complex- 
ity and length. A pattern consists of a category 
and a regular expression. The regular expressions 
are translated into finite-state automata, and the 
union of the automata yields a single, determin- 
istic, finite-state, level recognizer, (Abney, 1996). 
Moreover, there is also the possibility of grouping 
words and/or part-of-speech tags using morpho- 
logical and semantic criteria. 
3.2 Grammar Rules 
Some of the most important groups include: 
• Noun Phrases, Grammar0: the number
of patterns in Grammar0 is 180, divided into six
different groups, depending on the length and
complexity of the patterns. A large number
of (parallel) coordination rules are also imple- 
mented at this level, depending on the simi- 
larity of the conjuncts with respect to several 
different characteristics, (cf. Nagao, 1992). 
• Prepositional Phrases, Grammar1: the 
majority of prepositional phrases are noun 
phrases preceded by a preposition. Trapped 
adverbials, belonging to the noun phrase and 
not identified while applying grammar0, are 
merged within the np. Both simple and multi- 
word prepositions are used. 
• Verbal Groups, Grammar2: identifies and
labels phrasal, non-phrasal, and complex verbal
formations. The rules allow for any number
of auxiliary verbs and possible intervening
adverbs, and end with a main verb or particle.
A distinction is made between finite/non-finite
and active/passive verbal groups.
• Clauses, Grammar3 and Grammar4: the
clause resolution is based on the surface criteria
outlined at the beginning of this section,
and on the rather fixed word order of Swedish.
Grammar3 distinguishes different types of
subordinate clauses, while Grammar4 recognizes
main clauses. A unique level is designated
for each type of clause.
3.3 Grammatical Functions 
Grammatical functions are heuristically recog- 
nized using the topographical scheme, originally 
developed for Danish, in which the relative po- 
sition of all functional elements in the clause is 
mapped in the sentence, (Diderichsen, 1966). 
3.4 An Example 
The following short example illustrates the input 
and output to Cass-SWE: 
'Under 1998 gick 8 799 företag i konkurs i
Sverige.', i.e. 'During 1998, 8 799 companies
went bankrupt in Sweden.'
The input to Cass-SWE is an annotated version
of the text:
'Under/S 1998/MC/tim gick/VMISA 8_799/MC
företag/NCN(SP)NI/org i/S konkurs/NCUSNI
i/S Sverige/NP/icg ./F'.
Output: 
[main_clause
  TIME=[pp head=Under sem=tim
    [S head=Under sem=n/a Under]
    [np head=1998 sem=tim
      [MC head=1998 sem=tim 1998]]]
  [vg-active-finite head=gick sem=n/a
    [VMISA head=gick sem=n/a gick]]
  SUBJ=[np head=företag sem=org
    [MC head=8_799 sem=n/a 8_799]
    [NCN(SP)NI head=företag sem=org företag]]
  P-OBJ=[pp head=i sem=n/a
    [S head=i sem=n/a i]
    [np head=konkurs sem=n/a
      [NCUSNI head=konkurs sem=n/a konkurs]]]
  [pp head=i sem=icg
    [S head=i sem=n/a i]
    [np head=Sverige sem=icg
      [NP head=Sverige sem=icg Sverige]]]
  [F .]]
Here S: preposition; MC: numeral; VMISA: finite,
active verb; NCUSNI/NCN(SP)NI: common nouns; NP:
proper noun and F: punctuation; while tim: time
sequence; org: organization and icg: geographical
location. The output produced reflects the
coarse-grained semantics and part-of-speech used
in the input, as well as the head of each phrase
and the grammatical functions: TIME, SUBJ(ect)
and P-OBJ(ect).
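The annotated input format shown in this example (word, part-of-speech tag and an optional semantic label, separated by slashes) can be read with a few lines of Python. This is a minimal sketch under the assumption that words themselves contain no slash or whitespace; the `Token` type and the `read_tokens` function are hypothetical names, not part of Cass-SWE.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    pos: str
    sem: Optional[str]  # coarse-grained semantic label, if any


def read_tokens(line: str) -> List[Token]:
    """Split 'word/TAG' or 'word/TAG/sem' items into Token records."""
    tokens = []
    for item in line.split():
        parts = item.split("/")
        word, pos = parts[0], parts[1]
        sem = parts[2] if len(parts) > 2 else None
        tokens.append(Token(word, pos, sem))
    return tokens


line = ("Under/S 1998/MC/tim gick/VMISA 8_799/MC företag/NCN(SP)NI/org "
        "i/S konkurs/NCUSNI i/S Sverige/NP/icg ./F")
for token in read_tokens(line):
    print(token)
```

Tokens without a semantic label carry `None` here, which corresponds to `sem=n/a` in the output above.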
4 Evaluation 
The performance of the parser partly depends on
the output of the tagger and the rest of the
pre-processing software. In judging how "correct"
the output of the parser is, we follow
a practical, pragmatic approach, based on consultation
of modern Swedish syntax literature. We
use the metrics precision (P), recall (R), F-value
(F) and cross-bracketed rate, where
$F = (\beta^2 + 1)PR / (\beta^2 P + R)$
and $\beta$ is a parameter encoding the relative
importance of R and P; here $\beta = 1$.
Evaluation is performed automatically using the evalb
evaluation software (Sekine & Collins, 1997).
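The F-value defined above is straightforward to compute. The following is a small sketch with a hypothetical function name and made-up input values; with the weighting parameter set to 1 it reduces to the harmonic mean of precision and recall.

```python
def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); returns 0.0 if undefined."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (b2 + 1) * precision * recall / denominator


# Made-up values for illustration, not results from this paper.
print(f_measure(0.9, 0.8))
```

Since precision and recall are usually reported in rounded form, an F-value recomputed from them may differ in the last decimal from one computed on the raw counts.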
4.1 'Gold Standard' and Error Analysis 
For the evaluation of Cass-SWE we use three
types of texts: (i) a sample taken from a manually
annotated Swedish corpus of 100,000 words
with grammatical information (SynTag, Järborg,
1990); (ii) newspaper material; and (iii) a test
suite for non-common constructions, compiled by
consulting Swedish syntax literature. Texts (ii) and (iii)
were annotated manually. The total number of
tokens was 1,500 and the number of sentences 117.
The evaluation results are given in Table (1), for
both noun phrases (NPs) and full chunk parsing
(All).

Table 1: Cass-SWE, Performance
        P        R        F        Cross
NPs     97.82%   94.52%   96.17%   0.03
All     94.62%   91.92%   93.27%   0.04

The errors found can be divided into: (i)
errors in the texts themselves, which we cannot
control and are difficult to discover if the texts
are not proofread prior to processing; (ii) errors
produced by the tagger; and (iii) grammatical errors
produced by the parser, caused mainly by the
lack of an appropriate pattern in the rules, and
almost exclusively in higher order clauses due to
structural ambiguity and coordination problems.
None of the errors in (i) and (ii) have been manually
corrected. This was a conscious choice, so
that the evaluation of the parsing would be based
on unrestricted data.
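The bracket-based metrics used in this section can be sketched as follows. This is a simplified illustration, not the evalb implementation: brackets are assumed to be given as sets of (label, start, end) triples over token positions, duplicate brackets are ignored, and the function name and example spans are hypothetical.

```python
def bracket_scores(gold, test):
    """Labelled precision and recall, plus the number of crossing test brackets.

    gold, test: sets of (label, start, end) with 'end' exclusive.
    """
    matched = len(gold & test)
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0

    def crosses(a, b):
        # Two spans cross when they overlap but neither contains the other.
        return a[1] < b[1] < a[2] < b[2] or b[1] < a[1] < b[2] < a[2]

    crossing = sum(1 for t in test if any(crosses(t, g) for g in gold))
    return precision, recall, crossing


# Hypothetical bracketings of a seven-token sentence.
gold = {("np", 0, 2), ("vg", 2, 3), ("pp", 3, 6)}
test = {("np", 0, 2), ("vg", 2, 3), ("pp", 3, 5), ("np", 4, 7)}
print(bracket_scores(gold, test))
```

A crossing rate per sentence would then be obtained by dividing the summed crossing counts by the number of sentences evaluated.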
5 Conclusion 
We have described the implementation of a large 
coverage parser for Swedish, following the cas- 
caded finite-state approach. Our main guidance 
towards the grammar development was the obser- 
vation of how and which function words behave 
as delimiters between different phrases, as well as 
which other part-of-speech tags are not allowed 
to be adjacent within a constituent. Cass-SWE
operates on part-of-speech annotated texts using
coarse-grained semantic information, and produces
output that reflects this information as well
as grammatical functions. A corpus,
annotated syntactically, is a rich source of infor- 
mation which we intend to use for a number of 
applications, e.g. information extraction; an inter- 
mediate step in the extraction of lexical semantic 
information; making valency lexicons more com- 
prehensive by extracting sub-categorization infor- 
mation, and syntactic relations. 
References 
Abney, S. 1996. Partial Parsing via Finite-State 
Cascades. In Proceedings of the ESSLLI '96 Ro- 
bust Parsing Workshop, Prague, Czech Rep. 
Abney, S. 1997. Part-of-Speech Tagging and Par- 
tial Parsing, In Corpus-Based Methods in Lan- 
guage and Speech Processing, Young S. and 
Bloothooft G., editors, Kluwer Acad. Publish- 
ers, Chap. 4, pp. 118-136. 
Appelt, D.E., J. Hobbs, J. Bear, D. Israel, and M. 
Tyson. 1993. FASTUS: A Finite-State Proces- 
sor for Information Extraction from Real-World 
Text, In Proceedings of the IJCAI '93, France. 
Birn, J. 1998. Swedish Constraint Grammar, Ling- 
soft Inc., Finland, forthcoming. 
Brill, E. 1994. Some Advances In Rule-Based Part 
of Speech Tagging, In Proceedings of the 12th 
AAAI '94, Seattle, Washington. 
Cooper, R. 1984. Svenska nominalfraser och 
kontext-fri grammatik, In Nordic Journal of 
Linguistics, Vol. 7:115-144, (in Swedish). 
Cunningham, H., R. Gaizauskas, and Y. Wilks.
1995. A General Architecture for Text Engineering
(GATE) - A New Approach to Language
Engineering R&D, Technical report CS-95-21,
University of Sheffield, UK.
Diderichsen, P. 1966. Helhed og Struktur, G.E.C. 
GADS Forlag, (in Danish). 
EAGLES. 1996. Expert Advisory Group on Language
Engineering Standards, EAG-TCWG-SASG/1.8,
http://www.ilc.pi.cnr.it/EAGLES/home.html.
Visited 01/08/1998.
Ejerhed, E. and Church, K. 1983. Finite State 
Parsing, In Papers from the 7th Scandinavian 
Conference of Linguistics, Karlsson F., editor, 
University of Helsinki, Publ. No. 10(2):410-431. 
Ejerhed, E. 1985. En ytstruktur grammatik för
svenska, In Svenskans Beskrivning 15, Allén, S.,
L-G. Andersson, J. Löfström, K. Nordenstam,
and B. Ralph, editors, Göteborg, (in Swedish).
Gambäck, B. and Rayner, M. 1992. The
Swedish Core Language Engine, CRC-025,
http://www.cam.sri.com. Visited 01/10/1998.
Johansson-Kokkinakis, S. and Kokkinakis, D.
1996. Rule-Based Tagging in Språkbanken,
Research Reports from the Department of
Swedish, Göteborg University, GU-ISS-96-5.
Järborg, J. 1990. Användning av SynTag,
Research Reports from the Department of
Swedish, Göteborg University, (in Swedish).
Koskenniemi, K., P. Tapanainen, and A. Voutilainen.
1992. Compiling and Using Finite-State
Syntactic Rules, In Proceedings of COLING '92,
Nantes, France, Vol. 1:156-162.
Magerman, D.M. and Marcus, M.P. 1990. Parsing 
a Natural Language Using Mutual Information 
Statistics, In Proceedings of AAAI '90, Boston, 
Massachusetts. 
Nagao, M. 1992. Are the Grammars so far Developed
Appropriate to Recognize the Real Structure
of a Sentence?, In Proceedings of 4th TMI,
Montréal, Canada, pp. 127-137.
Rauch, B. 1993. Automatisk igenkänning av
nominalfraser i löpande text, In Proceedings of the
9th NODALIDA, Eklund, R., editor, pp. 207-215,
(in Swedish).
Roche, E. 1996. Parsing with Finite-State Transducers,
http://www.merl.com/reports/TR96-30.
Visited 12/03/99.
Roche, E. and Schabes, Y., editors, 1997. Finite- 
State Language Processing, MIT Press. 
Sekine, S. and Collins, M.J. 1997. The evalb Software,
http://cs.nyu.edu/cs/projects/proteus/evalb.
Visited 14/12/97.
Voutilainen, A. 1998. Does Tagging Help Parsing? 
A Case Study on Finite State Parsing, In Pro- 
ceedings of the FSMNLP '98, Ankara, Turkey. 
