AN EXPERIMENT WITH HEURISTIC PARSING OF SWEDISH 
Benny Brodda 
Inst. of Linguistics 
University of Stockholm 
S-106 91 Stockholm, SWEDEN
ABSTRACT 
Heuristic parsing is the art of doing parsing 
in a haphazard and seemingly careless manner but 
in such a way that the outcome is still "good", at 
least from a statistical point of view, or, hope- 
fully, even from a more absolute point of view. 
The idea is to find strategic shortcuts derived 
from guesses about the structure of a sentence 
based on scanty observations of linguistic units
in the sentence. If the guess comes out right, much
parsing time can be saved, and if it does not, 
many subobservations may still be valid for re- 
vised guesses. In the (very preliminary) experi- 
ment reported here the main idea is to make use of 
(combinations of) surface phenomena as much as 
possible as the base for the prediction of the 
structure as a whole. In the parser to be deve- 
loped along the lines sketched in this report main 
stress is put on arriving at independently 
working, parallel recognition procedures. 
The work reported here is aimed both at simu-
lating certain aspects of human language per-
ception and at arriving at effective algorithms
for actual parsing of running text. There is,
indeed, a great need for such fast algorithms,
e.g. for the analysis of the literally millions of
words of running text that already today comprise
the data bases in various large information re-
trieval systems, and which can be expected to
expand several orders of magnitude both in im-
portance and in size in the foreseeable future.
I BACKGROUND 
The general idea behind the system for heu-
ristic parsing now being developed at our group in 
Stockholm dates more than 15 years back, when I 
was making an investigation (together with Hans 
Karlgren, Stockholm) of the possibilities of 
using computers for information retrieval purposes 
for the Swedish Governmental Board for Rationali- 
zation (Statskontoret). In the course of this 
investigation we performed some psycholinguistic
experiments aimed at finding out to what extent 
surface markers, such as endings, prepositions, 
conjunctions and other (bound) elements from 
typically closed categories of linguistic units, 
could serve as a base for a syntactic analysis of 
sentences. We sampled a couple of texts more or 
less at random and prepared them in such a way 
that stems of nouns, adjectives and (main) verbs - 
these categories being thought of as the main 
carriers of semantic information - were replaced
by a mere "-", whereas other formatives
were left in their original shape and place. These 
transformed texts were presented to subjects who 
were asked to fill in the gaps in such a way that 
the texts thus obtained were to be both syntacti- 
cally correct and reasonably coherent. 
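The preparation of the test texts can be sketched in miniature as follows; the stem list and the sample sentence are invented for illustration (the original experiment used real Swedish texts and, presumably, manual preparation):

```python
# Hypothetical mini-lexicon of open-class stems (nouns, adjectives, main
# verbs); invented for this sketch, not taken from the experiment.
OPEN_CLASS_STEMS = ("flick", "hus", "stor", "läs")

def blank_stems(tokens):
    """Blank out open-class stems but keep endings and closed-class words."""
    out = []
    for tok in tokens:
        for stem in OPEN_CLASS_STEMS:
            if tok.startswith(stem):
                out.append("-" + tok[len(stem):])  # keep the ending only
                break
        else:
            out.append(tok)                        # surface marker: kept as is
    return out

print(" ".join(blank_stems("den stora flickan läste i huset".split())))
# → den -a -an -te i -et
```

The subjects saw only the right-hand side and had to reinvent the stems.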
The result of the experiment was rather 
astonishing. It turned out that not only were the
syntactic structures largely restored; in a few
cases the original content was also reestablished,
almost word by word. (It was beyond any possi- 
bility that the subjects could have had access to 
the original text.) Even in those cases when the 
text itself was not restored to this remarkable 
extent, the stylistic value of the various texts 
was almost invariably reestablished; an originally 
lively, narrative story came out as a lively, 
narrative story, and a piece of rather dull,
factual text (from a school text book on socio- 
logy) invariably came out as dull, factual prose. 
This experiment showed quite clearly that at 
least for Swedish the information contained in the 
combinations of surface markers to a remarkably 
high degree reflects the syntactic structure of 
the original text; in almost all cases the
stylistic value was also kept, and in a few cases
even the semantic content. (The extent to which
this is true is probably language dependent; Swe- 
dish is rather rich in morphology, and this 
property is certainly a contributing factor in an
experiment of this type coming out as successfully
as it actually did.)
This type of experiment has since then been 
repeated many times by many scholars; in fact, it
is one of the standard ways to demonstrate the
concept of redundancy in texts. But there are 
several other important conclusions one could draw
from this type of experiment. First of all, of
course, the obvious conclusion that surface
signals do carry a lot of information about the
structure of sentences, probably much more than
one has been inclined to think, and, consequently,
it could be worthwhile to try to capture that
information in some kind of automatic analysis
system. This is the practical side of it. But
there is more to it. One must ask why
a language like Swedish is like this. What are the
theoretical implications? 
Much interest has been devoted in recent years
to theories (and speculations) about human per-
ception of linguistic stimuli, and I do not think 
that one speculates too much if one assumes that 
surface markers of the type that appeared in the 
described experiment together constitute im- 
portant clues concerning the gross syntactic 
structure of sentences (or utterances), clues that 
are probably much less consciously perceived than,
e.g., the actual words in the sentences/utteran- 
ces. To the extent that such clues are actually 
perceived they are obviously perceived simulta- 
neously with, i.e. in parallel with, other units 
(words, for instance). 
The above way of looking upon perception as a 
set of independently operating processes is, of 
course, more or less generally accepted nowadays 
(cf., e.g., Lindsay-Norman 1977), and it is also 
generally accepted in computational linguistics 
that any program that aims at simulating per-
ception in one way or another must have features
that simulate (or, even better, actually per-
form) parallel processing, and the analysis
system to be described below has much emphasis on 
exactly this feature. 
Another common saying nowadays when dis- 
cussing parsing techniques is that one should try 
to incorporate "heuristic devices" (cf., e.g., 
the many subreports related to the big ARPA- 
project concerning Speech Recognition and Under- 
standing 1970-76), although there does not seem
to exist a very precise consensus on what exactly
the term means. (In mathematics it has tradition-
ally been used to refer to informal reasoning,
especially in classroom situations. In a famous
study the Hungarian mathematician Polya (1945) put
forth the thesis that heuristics is one of the
most important psychological driving mechanisms
behind mathematical - or scientific - progress. In
AI literature it is often used to refer to short-
cut search methods in semantic networks/spaces;
cf. Lenat, 1982).
One reason for trying to adopt some kind of 
heuristic device in the analysis procedures is 
that for mathematical reasons one knows that
ordinary, "careful", parsing algorithms inherently 
seem to refuse to work in real time (i.e. in 
linear time), whereas human beings, on the whole, 
seem to be able to do exactly that (i.e. perceive 
sentences or utterances simultaneously with their 
production). Parallel processing may partly be an 
answer to that dilemma, but still, any process 
that claims to actually simulate some part of 
human perception must in some way or other 
simulate the remarkable abilities human beings 
have in grasping complex patterns ("gestalts") 
seemingly in one single operation. 
Ordinary, careful, parsing algorithms are 
often organized according to some general 
principle such as "top-down", "bottom-to-top", 
"breadth first", "depth first", etc., these 
headings referring to some specified type of 
"strategy". The heuristic model we are trying to 
work out has no such preconceived strategy built 
into it. Our philosophy is instead rather 
anarchistic (The Heuristic Principle): Whatever
linguistic unit can be identified at whatever
stage of the analysis, by whatever means
available, is identified, and the significance of
the fact that the unit in question has been
identified is made use of in all subsequent stages
of the analysis. At any time one must be prepared
to reconsider an already established analysis of a
unit on the ground that evidence against that
analysis may successively accumulate from the
analyses other units arrive at.
In the next section we give a brief description
of the analysis system for Swedish that is now 
under development at our group in Stockholm. As 
has been said, much effort is spent on trying to 
make use of surface signals as much as possible. 
Not that we believe that surface signals play a 
more important role than any other type of 
linguistic signals, but rather that we think it is 
important to try to optimize each single sub- 
process (in a parallel system) as much as 
possible, and, as said, it might be worthwhile
to look carefully into this level, because the im-
portance of surface signals might have been under-
estimated in previous research. Our experiments so
far seem to indicate that they constitute ex- 
cellent units to base heuristic guesses on. An- 
other reason for concentrating our efforts on this 
level is that it takes time and requires much hard 
computational work to get such an anarchistic 
system to really work, and this surface level is 
reasonably simple to handle. 
II AN OUTLINE OF AN ANALYZER BASED ON 
THE HEURISTIC PRINCIPLE 
Figure 1 below shows the general outline of 
the system. Each of the various boxes (or sub- 
boxes) represents one specific process, usually a 
complete computer program in itself, or, in some 
cases, independent processes within a program. The 
big "container", labelled "The Pool", contains 
both the linguistic material as well as the 
current analysis of it. Each program or process 
looks into the Pool for things "it" can recognize, 
and when the process finds anything it is trained 
to recognize, it adds its observation to the ma- 
terial in the Pool. This added material may (hope- 
fully) help other processes in recognizing what 
they are trained to recognize, which in its turn 
may again help the first process to recognize more 
of "its" units. And so on. 
The system is now under development and 
during this build-up phase each process is, as was 
said above, essentially a complete, stand-alone 
module, and the Pool exists simply as successively 
updated text files on a disc storage. At the 
moment some programs presuppose that other prog- 
rams have already been run, but this state of 
affairs will be valid just during this build-up
phase. At the end of the build-up phase each 
program shall be able to run completely inde-
pendently of any other program in the system and in
arbitrary order relative to the others (but, of 
course, usually perform better if more information 
is available in the Pool). 
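The mutual feeding of processes via the Pool can be sketched as a simple blackboard loop; the two "demons" and their trigger conditions below are invented placeholders, and the real system runs each process as a separate Beta program over text files:

```python
# Blackboard sketch: each "demon" scans the Pool and may add annotations
# that enable other demons on a later pass.  Both demons and their
# trigger conditions are invented placeholders.
def demon_a(pool):
    # tags the raw token ATT as a potential infinitive marker
    return [("ATT", "inf-marker")] if "ATT" in pool["text"].split() else []

def demon_b(pool):
    # fires only once demon_a's annotation is present in the Pool
    if ("ATT", "inf-marker") in pool["notes"]:
        return [("ATT KOMMA", "infinitive-group")]
    return []

def run(pool, demons):
    changed = True
    while changed:                    # iterate until no demon adds anything
        changed = False
        for demon in demons:
            for note in demon(pool):
                if note not in pool["notes"]:
                    pool["notes"].append(note)
                    changed = True
    return pool

pool = {"text": "HON LOVADE ATT KOMMA", "notes": []}
run(pool, [demon_b, demon_a])   # order irrelevant: b succeeds on a later pass
print(pool["notes"])
# → [('ATT', 'inf-marker'), ('ATT KOMMA', 'infinitive-group')]
```

Note that the demons can be registered in any order; demon_b simply fails until demon_a's annotation has appeared in the Pool.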
In the second phase superordinated control
programs are to be implemented. These programs
will function as "traffic rules", and via these
systems one shall be able to test various strate-
gies, i.e. to test which relative order between
the different subsystems yields the optimal re-
sult under some kind of "performance metric", some
evaluation procedure that takes both speed and
quality into account.
The programs/processes shown in Figure 1 all
represent rather straightforward Finite State 
Pattern Matching (FS/PM) procedures. It is rather 
trivial to show mathematically that a set of 
interacting FS/PM procedures of the type used in 
our system together will yield a system that 
formally has the power of a CF-parser; in practice 
it will yield a system that in some sense is 
stronger, at least from the point of view of 
convenience. Congruence and similar phenomena will 
be reduced to simple local observations. Trans- 
formational variants of sentences will be re- 
cognized directly - there will be no need for 
performing some kind of backward transformational 
operations. (In this respect a system like this
will resemble Gazdar's grammar concept; Gazdar 
1980. ) 
The control structures later to be superim- 
posed on the interacting FS/PM systems will also 
be of a Finite State type. A system of the type 
then obtained - a system of independent Finite
State Automata controlled by another Finite
State Automaton - will in principle have rather
complex mathematical properties. It is, e.g., 
rather easy to see that such a system has stronger 
capacity than a Type 2 device, but it will not 
have the power of a full Type I system. 
Now a few comments on Figure 1.
The "balloons" in the figure represent inde- 
pendent programs (later to be developed into inde- 
pendent processes inside one "big" program). The 
figure displays those programs that so far 
(January 1983) have been implemented and tested 
(to some extent). Other programs will successively 
be entered into the system. 
The big balloon labelled "The Closed Cat" 
represents a program that recognizes closed word 
classes such as prepositions, conjunctions, pro- 
nouns, auxiliaries, and so on. The Closed Cat 
recognizes full word forms directly. The SMURF 
balloon represents the morphological component 
(SMURF = "Swedish Murphology"). SMURF itself is 
organized internally as a complex system of inde- 
pendently operating "demons" - SMURFs - each 
knowing "its" little corner of Swedish word forma-
tion. (The name of the program is an allusion to 
the popular comic strip leprechauns "les 
Schtroumpfs", which in Swedish are called 
"smurfar".) Thus there is one little smurf recog- 
nizing derivational morphemes, one recognizing
flectional endings, and so on. One special smurf, 
Phonotax, has an important controlling function - 
every other smurf must always consult Phonotax 
before identifying one of "its" (potential) forma- 
tives; the word minus this formative must still be
pronounceable, otherwise it cannot be a formative. 
SMURF works entirely without stem lexicon; it 
adheres completely to the "philosophy" of using 
surface signals as far as possible. 
NOMFRAS, VERBAL, IFIGEN, CLAUS and PREPPS are 
other "demons" that recognize different phrases or 
word groups within sentences, viz. noun phrases, 
verbal complexes, infinitival constructions, 
clauses and prepositional phrases, respectively. 
N-lex, V-lex and A-lex represent various (sub)- 
lexicons; so far we have tried to do without them 
as far as possible. One should observe that stem
lexicons are not prerequisites for the system to
work; adding them only enhances its performance.
The format of the material inside the Pool is 
the original text, plus appropriate "labelled 
brackets" enclosing words, word groups, phrases 
and so on. In this way, the form of representation 
is consistent throughout, no matter how many 
different types of analyses have been applied to 
it. Thus, various people can join our group and 
write their own "demons" in whatever language they 
prefer, as long as they can take sentences in text 
format, be reasonably tolerant to what types of 
"brackets" they find in there, do their analysis,
add their own brackets (in the specified format), 
and put the result back into the Pool. 
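A hypothetical reader of this bracket format might look as follows; the tagged line imitates the "n"-bracket convention for noun phrases described in the text:

```python
# Reading one annotation layer back out of the Pool's text format.
# The tagged line below imitates NOMFRAS output as described in the text.
def constituents(line, tag):
    """Extract constituents enclosed in the given lower-case tag letter."""
    found = []
    for word in line.split():
        if len(word) > 2 and word.startswith(tag) and word.endswith(tag):
            found.append(word[1:-1].replace("+", " "))
    return found

tagged = "nDEN+LILLA+FLICKANn SPRANG"
print(constituents(tagged, "n"))
# → ['DEN LILLA FLICKAN']
```

Since input and output are both plain text, any demon (in any language) can parse the format this cheaply.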
Of the various programs SMURF, NOMFRAS and 
IFIGEN are extensively tested (and, of course, The 
Closed Cat, which is a simple lexical lookup 
system), and various examples of analyses of these 
programs will be demonstrated in the next section. 
We hope to reach a crucial stage in this
project during 1983, when CLAUS has been more 
thoroughly tested. If CLAUS performs the way we 
hope (and preliminary tests indicate that it 
will), we will have means to identify very quickly 
the clausal structures of the sentences in an 
arbitrary running text, thus having a firm base 
for entering higher hierarchies in the syntactic 
domains. 
The programs are written in the Beta language 
developed by the present author; cf. Brodda-
Karlsson, 1980, and Brodda, 1983, forthcoming. Of
the actual programs in the system, SMURF was 
developed and extensively tested by B.B. during 
1977-79 (Brodda, 1979), whereas the others are 
(being) developed by B.B. and/or Gunnel Källgren,
Stockholm (mostly "and"). 
III EXPLODING SOME OF THE BALLOONS 
When a "fresh" text is entered into The Pool 
it first passes through a preliminary one-pass- 
program, INIT (not shown in Fig. 1), that "normal-
izes" the text. The original text may be of any 
type as long as it is regularly typed Swedish.
INIT transforms the text so that each graphic 
sentence will make up exactly one physical record. 
(Except in poetry, physical records, i.e. lines,
are usually of marginal linguistic interest.)
Paragraph ends will be represented by empty re- 
cords. Periods used to indicate abbreviations are
just taken away and the abbreviation itself is
contracted to one graphic word, if necessary; thus
"t.ex." ("e.g.") is transformed into "tex", and so
on. Otherwise, periods, commas, question marks and 
other typographic characters are provided with 
preceding blanks. Through this each word is 
guaranteed to be surrounded by blanks, and de- 
limiters like commas, periods and so on are
guaranteed to signal their "normal" textual func- 
tions. Each record is also ended by a sentence 
delimiter (preceded by a blank). Some manual post- 
editing is sometimes needed in order to get the 
text normalized according to the above. In the 
INIT-phase no linguistic analysis whatsoever is
introduced (other than the segmentation into what
appear to be orthographic sentences).
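A minimal sketch of such a normalization step, assuming a small invented abbreviation list (the real INIT is a Beta program, and its rules are certainly more elaborate than this):

```python
import re

# Assumed sample abbreviation list; the real INIT presumably uses a
# much larger one.
ABBREVIATIONS = {"t.ex.": "tex", "bl.a.": "bla"}

def init_normalize(text):
    """INIT sketch: contract abbreviations, pad delimiters with blanks,
    and emit one orthographic sentence per record."""
    for abbr, contracted in ABBREVIATIONS.items():
        text = text.replace(abbr, contracted)   # "t.ex." -> "tex"
    text = re.sub(r"([,.?!])", r" \1", text)    # precede delimiters by blanks
    text = re.sub(r"\s+", " ", text).strip()    # normalize spacing
    # one sentence = one record: split after sentence delimiters
    return [r for r in re.split(r"(?<=[.?!])\s+", text) if r]

print(init_normalize("Hon kom, t.ex. i går. Vem?"))
# → ['Hon kom , tex i går .', 'Vem ?']
```

Every word and delimiter in the output is surrounded by blanks, and each record ends in a sentence delimiter preceded by a blank, as described above.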
INIT also changes all letters in the original 
text to their corresponding upper case variants. 
(Originally capital letters are optionally pro- 
vided with a prefixed "=".) All subsequent ana- 
lysis programs add their analyses in the form of
lower case letters or letter combinations. Thus 
upper case letters or words will belong to the 
object language, and lower case letters or letter 
combinations will signal meta-language informa- 
tion. In this way, strictly text (ASCII) format 
can be kept for the text as well as for the va- 
rious stages of its analysis; the "philosophy" of
using text input and text output for all programs
involved represents the computational solution to
the problem of how to make it possible for each
process to work independently of all others in the
system.
The Closed Cat (CC) has the important role of
marking words belonging to some well defined closed
categories of words. This program makes no in- 
ternal analysis of the words, and only takes full 
words into account. CC makes use of simple rewrite
rules of the type "PÅ => ePÅe / (blank)__(blank)",
where the inserted e's represent the "analysis"
("e" stands for "preposition"; PÅ = "on"). A
sample output from The Closed Cat is shown in
Ill. 2, where the various meta-symbols
are also explained.
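A sketch of such a full-form lookup, with a tiny invented sample of the closed-class lexicon:

```python
# Full-form lookup sketch; the real lexicon covers all closed categories
# (prepositions, conjunctions, pronouns, auxiliaries, ...).
CLOSED = {
    "PÅ":  "e",   # preposition
    "OCH": "k",   # conjunction
    "DET": "r",   # pronoun
    "SKA": "g",   # auxiliary
}

def closed_cat(sentence):
    """Wrap each known closed-class word in its category letter."""
    out = []
    for word in sentence.split():
        tag = CLOSED.get(word)
        out.append(f"{tag}{word}{tag}" if tag else word)
    return " ".join(out)

print(closed_cat("DET LIGGER PÅ BORDET"))
# → rDETr LIGGER ePÅe BORDET
```

No internal analysis of the words is made; the table is consulted for full word forms only, exactly as the rewrite rule above suggests.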
The simple example above also shows the
format of inserted meta-information. Each identi-
fied constituent is "tagged" with surrounding
lower case letters, which can then be conceived of
as labelled brackets. This format is used
throughout the system, also for complex constit-
uents. Thus the nominal phrase "DEN LILLA FLICKAN"
("the little girl") will be tagged as
"nDEN+LILLA+FLICKANn" by NOMFRAS (cf. below; the
pluses are inserted to make the constituent one 
continuous string). We have reserved the letters 
n, v and s for the major categories nouns or noun 
phrases, verbs or verbal groups, and sentences, 
respectively, whereas other more or less transpar- 
ent letters are used for other categories. (A list 
of used category symbols is presented in the 
Appendix: Printout Illustrations.) 
The program SWEMRF (or SMURF, as it is called
here) has been extensively described elsewhere 
(Brodda, 1979). It makes a rather intricate 
morphological analysis word-by-word in running
text (i.e. SMURF analyzes each word in itself, 
disregarding the context it appears in). SMURF can 
be run in two modes, in "segmentation" mode and 
"analysis" mode. In its segmentation mode SMURF 
simply strips off the possible affixes from each 
word; it makes no use of any stem lexicon. (The
affixes it recognizes are common prefixes, suf- 
fixes - i.e. derivational morphemes - and flex-
ional endings.) In analysis mode it also tries to
make an optimal guess of the word class of the
word under inspection, based on what (combinations 
of) word formation elements it finds in the word. 
SMURF in itself is organized entirely according to 
the heuristic principles as they are conceived 
here, i.e. as a set of independently operating 
processes that interactively work on each other's
output. The SMURF system has been the test bench 
for testing out the methods now being used 
throughout the entire Heuristic Parsing Project. 
In its segmentation mode SMURF functions 
formally as a set of interactive transformations, 
where the structural changes happen to be ex-
tremely simple, viz. simple segmentation rules of
the type "P => P-", "S => -S" and "E => -E" for an
arbitrary Prefix, Suffix and Ending, respectively,
but where the "job" essentially consists of
establishing the corresponding structural de-
scriptions. These are shown in Ill. 1, below,
together with sample analyses. It should be noted 
that phonotactic constraints play a central role
in the SMURF system; in fact, one of the main 
objectives in designing the SMURF system was to 
find out how much information actually was carried 
by the phonotactic component in Swedish. (It
turned out to be quite a lot; cf. Brodda 1979. This
probably holds for other Germanic languages as
well, which all have a rather elaborate phono-
taxis.)
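The interplay between ending-stripping and the Phonotax check can be sketched as follows; the ending list and the pronounceability test are drastic simplifications invented for illustration:

```python
# Ending-stripping with a stand-in for the Phonotax demon.  The ending
# list and the pronounceability test are invented simplifications.
ENDINGS = ["ARNA", "ADE", "EN", "ET", "AR", "A"]   # longest candidates first
VOWELS = set("AEIOUYÅÄÖ")

def pronounceable(stem):
    # crude stand-in for Phonotax: a stem needs at least one vowel
    return any(ch in VOWELS for ch in stem)

def segment(word):
    """Strip one (potential) ending if the remainder is still pronounceable."""
    for ending in ENDINGS:
        if word.endswith(ending):
            stem = word[:-len(ending)]
            if pronounceable(stem):          # consult "Phonotax" first
                return stem + "=" + ending   # '=' marks an ending, as in Ill. 1
    return word

print(segment("FLICKARNA"))   # → FLICK=ARNA
print(segment("BRA"))         # → BRA ('BR' fails the phonotactic check)
```

The point of the check is visible in the last example: the formal ending -A is rejected because stripping it would leave an unpronounceable residue.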
NOMFRAS is the next program to be commented 
on. The present version recognizes structures of 
the type 
det/quant + (adj)^n + noun;
where the "det/quant" categories (i.e. determiners 
or quantifiers) are defined explicitly through
enumeration - they are supposed to belong to the 
class of "surface markers" and are as such identi- 
fied by The Closed Cat. Adjectives and nouns on 
the other hand are identified solely on the ground 
of their "cadences", i.e. what kind of (formally) 
ending-like strings they happen to end with. The
number of adjectives that are accepted (n in the 
formula above) varies depending on what (probable) 
type of construction is under inspection. In inde- 
finite noun phrases the substantial content of the 
expected endings is, to say the least, meager, as 
both nouns and adjectives in many situations only 
have zero endings. In definite noun phrases the noun
mostly - but not always - has a more substantial 
and recognizable ending and all intervening ad- 
jectives have either the cadence -A or a cadence
from a small but characteristic set. In a (sup- 
posed) definite noun phrase all words ending in 
any of the mentioned cadences are assumed to be 
adjectives, but in (supposed) indefinite noun 
phrases not more than one adjective is assumed 
unless other types of morphological support are 
present. 
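A sketch of the definite-noun-phrase case, with invented miniature lists of determiners and definite endings:

```python
# Definite noun phrases: an enumerated determiner + any number of
# supposed adjectives (cadence -A) + a noun with a recognizable definite
# ending.  The lists are invented miniatures.
DETERMINERS = {"DEN", "DET", "DE", "DENNA", "DESSA", "ALLA"}
DEF_ENDINGS = ("EN", "ET", "ERNA", "ARNA", "AN")

def nomfras(tokens):
    """Return (tagged phrase, remaining tokens) or None."""
    if not tokens or tokens[0] not in DETERMINERS:
        return None
    i = 1
    while i < len(tokens) and tokens[i].endswith("A"):
        i += 1                                  # supposed adjective
    if i < len(tokens) and tokens[i].endswith(DEF_ENDINGS):
        phrase = "n" + "+".join(tokens[:i + 1]) + "n"
        return phrase, tokens[i + 1:]
    return None

print(nomfras("DEN LILLA FLICKAN SPRANG".split()))
# → ('nDEN+LILLA+FLICKANn', ['SPRANG'])
```

Note that nothing but the determiner enumeration and the cadences is consulted; there is no stem lexicon anywhere in the loop.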
The Finite State Scheme behind NOMFRAS is 
presented in Ill. 2, together with sample outputs; 
in this case the text has been preprocessed by The 
Closed Cat, and it appears that these two programs 
in cooperation are able to recognize noun phrases 
of the discussed type correctly to well over 95% 
in running text (at a speed of about 5 sentences
per second, CPU-time); the errors were shared
about equally between over- and undergeneration.
Preliminary experiments aiming at also including
SMURF and PREPPS (Prepositional Phrases) seem to
indicate that about the same recall and precision
rates could be kept for arbitrary types of (non-
sentential) noun phrases (cf. Ill. 6). (The sys-
tems are not yet trimmed to the extent that they 
can be operatively run together.) 
IFIGEN (Infinitive Generator) is another 
rather straightforward Finite State Pattern 
Matcher (developed by Gunnel Källgren). It recog-
nizes (groups of) nonfinite verbs. Somewhat
simplified it can be represented by the following 
diagram (remember the conventions for upper and 
lower case): 
IFIGEN parsing diagram (simplified): 
[Diagram: an infinitive warner (Aux or ATT), optionally followed by a
noun phrase nXn and adverbials, then a verb with the infinitive cadence
-A or the supine cadence -(A/I)T, up to the word boundary #.]
where "Aux" and "Adv" are categories recognized by
The Closed Cat (tagged "g" and "a", respectively), 
and "nXn" are structures recognized by either 
NOMFRAS or, in the case of personal pronouns, by 
CC. (It may be worth mentioning that the class
of auxiliaries in Swedish is more open than the
corresponding word class in English; besides the
"ordinary" VARA ("to be"), HA ("to have") and the
modals, there is a fuzzy class of semi-auxiliaries
like BÖRJA ("begin") and others; IFIGEN makes use
of about 20 of these in the present version.) The
supine cadence -(A/I)T is supposed to appear only
once in an infinitival group. A sample output of
IFIGEN is given in Ill. 3. Also for IFIGEN we have
reached a recognition level around 95%, which, 
again, is rather astonishing, considering how 
little information actually is made use of in the 
system. 
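The warner-plus-cadence idea can be sketched as follows; the auxiliary and skip lists are invented miniatures of the categories supplied by The Closed Cat:

```python
# Infinitive warner (ATT or an auxiliary) + optional intervening material
# + a word with the infinitive cadence -A.  The lists are invented
# miniatures of the categories supplied by The Closed Cat.
AUX = {"SKA", "KAN", "MÅSTE", "VÅGADE", "BÖRJAR"}
SKIP = {"INTE", "BARA", "VERKLIGEN", "VI", "NI", "DE", "JAG"}  # adv/pron

def ifigen(tokens):
    """Return (warner, infinitive) for the first match, else None."""
    for i, tok in enumerate(tokens):
        if tok == "ATT" or tok in AUX:
            j = i + 1
            while j < len(tokens) and tokens[j] in SKIP:
                j += 1                      # split infinitive: skip material
            if j < len(tokens) and tokens[j].endswith("A"):
                return tok, tokens[j]
    return None

print(ifigen("HON LOVADE ATT KOMMA HEM".split()))   # → ('ATT', 'KOMMA')
print(ifigen("SKA NI VERKLIGEN GÖRA DET".split()))  # → ('SKA', 'GÖRA')
print(ifigen("VI KAN INTE GÅ HEM".split()))         # → None
```

The last example illustrates one source of undergeneration: an infinitive lacking the -A cadence (like GÅ) is simply not found.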
The IFIGEN case illustrates very clearly one 
of the central points in our heuristic approach, 
namely the following: The information that a word 
has a specific cadence, in this case the cadence 
-A, is usually of very little significance in
itself in Swedish. Certainly it is a typical infi-
nitival cadence (at least 90% of all infinitives
in Swedish have it), but on the other hand, it is 
certainly a very typical cadence for other types 
of words as well: FLICKA (noun), HELA (adjective), 
DENNA/DETTA/DESSA (determiners or pronouns) and so 
on, and these other types are by far the
dominant group having this specific cadence in
running text. But, in connection with an "infini- 
tive warner" - an auxiliary, or the word ATT - the 
situation changes dramatically. This can be demon- 
strated by the following figures: In running text,
words having the cadence -A represent infinitives
in about 30% of the cases. ATT is an infinitive
marker (equivalent to "to") in quite exactly 50%
of its occurrences (in the other 50% it is a
subordinating conjunction). The conditional proba-
bility that the configuration ATT ..-A represents
an infinitive is, however, greater than 99%, pro-
vided that characteristic cadences like -ARNA/-
ORNA and quantifiers/determiners like ALLA and
DESSA are disregarded. (In our system they are
marked by SMURF and The Closed Cat, respectively,
and thereby "saved" from being classified as infi-
nitives.) Given this, there is almost no over-
generation in IFIGEN, but Swedish allows for split 
infinitives to some extent. Quite a lot of material
can be put in between the infinitive warner and
the infinitive, and this presently gives rise to
some undergeneration. (Similar observations re-
garding conditional probabilities in configura-
tions of linguistic units have been made by Mats
Eeg-Olofsson, Lund, 1982.)
IV REFERENCES 
Brodda, B. "Något om de svenska ordens fonotax och
morfotax", Papers from the Institute of
Linguistics (PILUS) No. 38, University of Stock-
holm, 1979.
Brodda, B. "Yttre kriterier för igenkänning av
sammansättningar" in Saari, M. and Tandefelt, M.
(eds.) Förhandlingar rörande svenskans beskriv-
ning - Hanaholmen 1981, Meddelanden från Insti-
tutionen för Nordiska Språk, Helsingfors Univer-
sitet, 1981.
Brodda, B. "The BETA System, and some Applica- 
tions", Data Linguistics, Gothenburg, 1983 
(forthcoming). 
Brodda, B. and Karlsson, F. "An Experiment with
Automatic Morphological Analysis of Finnish",
Publications No. 7, Dept. of Linguistics, Uni-
versity of Helsinki, 1981.
Gazdar, G. "Phrase Structure" in Jacobson, P. and
Pullum, G. (eds.), The Nature of Syntactic Repre-
sentation, Reidel, 1982.
Lenat, D.B. "The Nature of Heuristics", Artifi-
cial Intelligence, Vol 19(2), 1982.
Eeg-Olofsson, M. "En språkstatistisk modell för
ordklassmärkning i löpande text" in Källgren, G.
(ed.) TAGGNING, Föredrag från 3:e svenska kollo-
kviet i språklig databehandling i maj 1982,
PILUS 47, Stockholm, 1982.
Polya, G. "How to Solve It", Princeton University
Press, 1945. Also Doubleday Anchor Press, New
York, N.Y. (several editions).
APPENDIX: Some computer illustrations 
The following three pages illustrate some of the parsing diagrams used in
the system: Ill. 1, SMURF, and Ill. 2, NOMFRAS, together with sample analyses.
IFIGEN is represented by sample analyses (Ill. 3; the diagram is given in the
text). The samples are all taken from running text analysis (from a novel by
Ivar Lo-Johansson), and "pruned" only in the way that trivial, recurrent examples
are omitted. Some typical erroneous analyses are also shown (prefixed by **).
In Ill. 1 SMURF is run in segmentation mode only, and the existing tags are
inserted by the Closed Cat. 'A and 'E in word final position indicate the
corresponding cadences (fulfilling the pattern "..V·M·'A/'E", where M denotes a
set of admissible medial clusters).
The tags inserted by CC are: a=(sentence) adverbials, b=particles, d=determiners,
e=prepositions, g=auxiliaries, h=(forms of) HA(VA), i=infinitives, j=adjectives,
n=nouns, k=conjunctions, q=quantifiers, r=pronouns, u=supine verb form, v=verbal
(group).
(For space reasons, Ill. 3 is given first, then Ill. 1 and Ill. 2.)
Ill. 3: PATTERN: aux/ATT + (pron) + (adv) + (adv) + inf + inf ... :
..FLOCKNINGEN eEFTER..IkATTk+iHAi+uG~TTui 
.. rDETr vVARv ORIMLIGT ikATTk+iFINNAI 
rJAGr gSKAg aBARAa IHJALPAi 
- rDETr gKANg ILIGGAI 
gSKAg rVlr iV~GAi 
- rVlr gKANg alNTEa iG~i 
. ..ORNA vHOLLv SIG FARDIGA ikATTk+iKASTAi 
rDEr gV~GADEg aANTLIGENa iLYFTAi 
gSKAg rNlr aNODVANDIGTVISa iGORAi 
..rVlr hHADEh aANNUa alNTEa uHUNNITu iF~i 
..BECKMORKRET eMEDe ikATTk+IFORSOKAi+iF~I 
eMEDe VATGAS eFORe ikATTk+iKUNNAi+IH~LLAi 
SKOGEN, LANDEN gTYCKTESg iST~i 
rDENr hHADEh MISSLYCKATS ele ikATTk+iNAi 
*** qENq kS gV~GADEg IKVlNNORNA+STANNAi 
FRAMATBOJD HELA DAGEN.. 
qETTq KADSTRECK ele .. 
eTILLe ikATTk+iSEi .. 
qENq KARL INUTI? 
VIPPEN? 
HEM eMEDe SKAMMEN ... 
eOMe NARSOMHELST. 
ePAe rDETr. 
N~T eMEDe rDENr, kS~k 
eUPPe POTATISEN. 
BALLONGEN FYLLD. 
SEJ OPPE. 
STILLA eUNDERe OSS. 
SITT M~L. 
Ill. 1: SMURF - PARSING DIAGRAM FOR SWEDISH MORPHOLOGY
PATTERNS ("Structural Descriptions"):
1) ENDINGS (E):   pattern  X = I·V..M·E #;        structural change  E => =E
2) PREFIXES (P):  pattern  X = # P·I·V..X;        structural change  P => (-)P>
3) SUFFIXES (S):  pattern  X = ..V·F·(s)·S·(E) #; structural change  S => /S(-)
where I = (admissible) initial cluster, F = final cluster, M = morpheme-
internal cluster, V = vowel, (s) = the "gluon" S (cf. TIDNINGSMAN),
# = word boundary, (=,>,/,-) = earlier accepted affix segmentations, and
·, finally, denotes ordinary concatenation. (It is the enhanced ele-
ment in each pattern that is tested for its segmentability.)
BAGG'E=vDROGv . REP=ET SLINGR=ADE MELLAN STEN=AR , FOR>BI 
TALLSTAMM AR , MELLAN ROD*A LINGONTUV=OR ele GRON IN>FATT/NING. 
qETTq STORT FORE>M~L hHADEh uRORTu eP~e SIG BORT'A eIe 
SLANT=EN • FORE>M~L=ET NARM=ADE SIG HOTFULL'T dDETd KNASTR= 
=ADE eIe SKOG=EN . - SPRING 
BAGG'E SLAPP=TE kOCHk vSPRANGv . rDEr L~NG'A KJOL=ARNA 
VIRVI=ADE eOVERe 0<PLOCK=ADE LINGONTUV=OR , BAGG'E KVINNO=RNA 
hHADEh STRUMPEBAND FOR>FARDIG=ADE eAVe SOCKERTOPPSSNOR=EN , 
KNUT=NA NEDAN>FOR KNAN'A 
aFORSTa bUPPEb eP~e qENq kS V~G=ADE KVINNO=RNA STANN'A . 
rDEr vSTODv kOCHk STRACK=TE eP~e HALS=ARNA . qENq FRAN 
UT>DUNST/NING eAVe SKRACK SIPPR=ADE bFRAMb . rDEr vHOLLv 
BE>SVARJ/ANDE HAND=ERNA FRAM>FOR SIN'A SKOT=EN 
- dDETd vSERv STORT kOCHk eRUNTe bUTb , vSAv dDENd KORT~A 
eOMe FORE>MAL=ET dDETd vARy aVALa alNTEa qN~GOTq IN>UT>I ? 
- dDETd gKANg LIGG'A qENq KARL IN>UT>I ? dDETd vVETv rMANr 
aVALa kVADk rHANr vGORv eMEDe OSS 
- rJAGr TYCK=TE dDETd ROR=DE eP~e SEJ gSKAg rVlr iV~GAI 
VIPP=EN ? - JA ? ESKAg rVlr iV~GAI VIPP~EN ? 
BAGGE vSMOGv SIG eP~e GLAPP'A KNAN UT>F~R BRANT=EN • kNARk 
rDEr NARM=ADE SIG rDEr FLAT=ADE POTATISKORG=ARNA eMEDe LINGON 
kSOMk vSTODv eP~e LUT eVIDe VARSIN TUVA , vVARv rDEr aREDANa 
UT>OM SIG eAVe SKRACK . oDERASo SANS vVARv BORT'A . 
- PASS eP~e . rVlr KANHAND'A alNTEa vTORSv NARM=ARE ? vSAv 
dDENd MAGR'A RUSTRUN 
- rVlr EKANg alNTEa G~ HEM eMEDe SKAMM=EN aHELLERa • rVlr 
gM~STEE aJUa iHAi BARKORG=ARNA eMEDe . 
- JAVISST , BARKORG=ARNA 
kMENk kNARk rDEr uKOMMITu bNERb eTILLe STALL=ET I<GEN 
uVARTu rDEr NYFIK=NA rDEr vDROGSv eTILLe FORE>M~L=ET ele 
Ill. 2: NOMFRAS - FS-DIAGRAM FOR SWEDISH NOUN PHRASE PARSING
quant + det + "OWN" + adj + noun
[Diagram: enumerated determiners/quantifiers (DENNA, DETTA, ALLA, BÅDA,
DEN, ...), followed by adjectives and a noun with a definite ending
(-EN/-ET/-ER/-NA/-ERNA, ...).]
- PYTT , vSAv nDEN+L~NGAn 
kVADk vVARv NU nDET+DARn kATTk VARA RADD eFORe ? 
nDET+OMF~NGSRIKA+,+SIDENLATTA+TYGETn 
nDEn GJORDE nEN+STOR+PACKEn eAVe dDETd . 
eMEDe SIG SJALVA eOMe kATTk nDET+HELAn alNTEa uVARITu qETTq .. 
.. nDET+NELAn alNTEa uVARITu nETT+DUGGn FARLIGT . 
nDET+FORMENTA+KLADSTRECKETn vVARv kD~k SNOTT FLE.. 
.. GRON eMEDe HANGBJORKAR kSOMk nALLAn FYLLDE FUNKTIONER . 
.. MODERN , nDEN+L~NGA+EGNAHEMSHUSTRUNn kSOMk uVARITu ele SKO.. 
STORA BOKSTAVER nETT+SVENSKT+FIRMANAMNn 
eP~e nDEN+ANDRA+,+FR~NVANDAn , vSTODv ORDEN .. 
nDETn vVARv nEN+LUFTENS+SPILLFRUKTn kSOMk hHADEh uRAMLAT.. 
kOCHk nDEN+ANDRA+EGNAHEMSHUSTRUNS+OGONn VATTNADES eAVe OMSOM 
nETT+STORT+MOSSIGT+BERGn HOJDE SIG eMOTe SKYN.. 
. • SIG eMOTe SKYN eMEDe nEN+DISIG+M~NEn kSOMk qENq RUND LYKTA .. 
eVIDe nDET+STALLEn kDARk LANDNINGSLINAN .. 
SAGA HONOM kATTk nALLA+DESSA+FOREMALn aAND~a alNTEa FORMED.. 
..ARNA kSOMk nEN+AVIGT+SKRUBBANDE+HANDn . 
kSOMk nEN+OFORMLIG+MASSAn VALTRADE SIG BALLONG.. 
- nEN÷RIKTIG+BALLONGn gSKAg VARA FYLLD eMEDe.. 
• *nDETn alNTEa vL~Gv nN~GON+KROPP+GOMDn INUNDER . 
• ** TV~ kSOMk BARGADE ~DEN+TILLSAMMANSn 
