SYNTACTIC ANALYSIS OF NATURAL LANGUAGE USING 
LINGUISTIC RULES AND CORPUS-BASED PATTER.NS 
Pasi Tapanainen* Timo J~rvinen 
Rank Xerox Research Centre 
Grenoble Laboratory 
University of tIelsinki 
Research Unit for Computational Linguistics 
Abstract 
We are concerned with the syntactic annotation 
of unrestricted text. We combine a rule-based 
analysis with subsequent exploitation of empiri- 
cal data. The rule~based surface syntactic anal- 
yser leaves some amount of ambiguity in the out- 
put that is resolved using empirical patterns. We 
have implemented a system for generating and 
applying corpus-based patterns. Somc patterns 
describe the main constituents in the sentence 
and some the local context of the each syntac- 
tic function. There are several (partly) redml- 
tant patterns, and the "pattern" parser selects 
analysis of the sentence ttmt matches the strictest 
possible pattern(s). The system is applied to an 
experimeutal corpus. We present the results and 
discuss possible refinements of the method from 
a linguistic point of view. 
1 INTRODUCTION 
We discuss surface-syntactic analysis of running text. 
Our purpose is to mark each word with a syntactic 
tag. The tags denote subjects, object, main verbs, 
adverbials, etc. They are listed in Appendix A. 
Our method is roughly following 
• Assign to each word all the possible syntactic tags. 
• Disambiguate words as much as possible using lin- 
guistic information (hand-coded rules). Ilere we 
avoid risks; we rather leave words ambiguous than 
guess wrong. 
• Use global patterns to form alternative sentence 
level readings. Those alternatiw~" analyses are se- 
lected that match the strictest global pattern. \[f it 
does not accept any of the remaining readings, the 
second strictest pattern is used, and so on. 
• Use local patterns to rank the remaining readings. 
The local patterns contain possible contexts for syn- 
tactic functions. The ranking of the readings de- 
pends on the length of the contexts associated with 
the syntactic functions of the sentece. 
We use both linguistic knowledge, represented as 
rules, and empirical data collected from tagged cor- 
pora. We describe a new way to collect information 
from a tagged corpus and a way to apply it. In this 
paper, we are mainly concerned with exploiting the 
empirical data and combining two different kinds of 
parsers. 
*This work was done when the author worked in the 
Research Unit for Computational Linguistics at the Uni- 
versity of Itelsinki. 
Our work is based on work done with ENGCG, 
the Constraint Grammar Parser of English \[Karls- 
son, 1990; Karlsson, 1994; Karlsson et al., 1994; 
Voutilainen, 1994\]. It is a rule-h~ed tagger and 
surface-syntactic parser that makes a very small num- 
her of errors but leaves some words ambiguous i.e. it 
prefers ambiguity to guessing wrong. The morpholog- 
ical part-of-speech analyser leaves \[Voutilainen et al., 
1992\] only 0.3 % of all words in running text without 
the correct analysis when 3-6 % of words still have 
two or Inore I analyses. 
Vontilainen, Ileikkil'5. and Anttila \[1992\] reported 
that the syntactic analyser leaves :3-3.5 % of words 
without the correct syntactic tag, and 15-20 % of 
words remain amhiguos. Currently, the error rate has 
been decreased to 2-2.5 % and ambiguity rate to 15 % 
by Tirao Jiirvinen \[1994\], who is responsible for tag- 
ging a 200 million word corpus using I'\]NGCG in the 
Bank of English project. 
Althought, the ENGCG parser works very well in 
part-of-speech tagging, the syntactic descriptions are 
still problematic. In the constraint grammar frame- 
work, it is quite hard to make linguistic generalisations 
that can be applied reliably. To resolve the remaining 
ambiguity we generate, by using a tagged corpus, a 
knowledge-base that contains information about both 
the general structure of the sentences and the local 
contexts of tim syntactic tags. The general structure 
contains information about where, for example, sub- 
jects, objects and main verbs appear and how they 
follow one another. It does not pay any attention to 
their potential modiliers. The modifier-head relations 
are resolved by using the local context i.e. by looking 
at what kinds of words there are in the neighbour- 
hood. 
The method is robust in the sense that it is ahle 
to handle very large corpora. Although rule-b~med 
parsers usually perlbrrn slowly, 0rot is not the ca.qe 
with ENGCG. With the English grammar, the Con- 
straint Granun;~r Parser implementation by Pasi Ta- 
panainen analyses 400 words 2 per second on a Spare- 
Station 10/:30. q'hat is, one million words are pro- 
cessed in about 40 minutes. 'l'he pattern parser for 
empirical patterns runs somewhat slower, about 100 
words per second. 
1 But even then some of tile original ,xlternative analyses 
are removed 
'2InchMing all steps of preprocessing, morphologlcM 
analysis, disambiguation and syntactic analysis. The 
speed of morphological disamblguation alone exceeds 1000 
words per second. 
(,29 
2 KNOWLEDGE ACQUISITION 
We have used two schemes to extract knowledge from 
corpora. Both produce readable patterns that can be 
verified by a linguist. In the first scheme, sentences 
are handled as units and information about the struc- 
ture of the sentence is extracted. 0nly the main con- 
stituents (like subject, objects) of the sentence are 
treated at this stage. The second scheme works with 
local context and looks only a few words to the right 
and to the left. It is used to resolve the nmdifier-t,ead 
dependencies in the phrases. 
First, we form an axis of the sentence using some 
given set of syntactic tags. We collect several layers 
of patterns that may be partly redundant with each 
other. For instance, simplifying a little, we can say 
that a sentence can be of the form subjecl -- main 
verb and there may be other words before and after 
the subject and main verb. We may also say that 
a sentence can be of the form subject -- main verb 
-- object. The latter is totally covered by the former 
because the former statement does not prohibit the 
appearance of an object but does not require it either. 
The redundant patterns are collected on purpose. 
During parsing we try to find the strictest frame for 
the sentence. If we can not apply some pattern be- 
cause it conflicts with the sentence, we may use other, 
possibly more general, pattern. For instance, an axis 
that describes all accepted combinations of subject, 
objects and main verbs in the sentence, is stricter 
than an axis that describes all accepted combinations 
of subjects and main verbs. 
After applying the axes, the parser's output is usu- 
ally still ambiguous because all syntactic tags are not 
taken into account yet (we do not handle, for instance, 
determiners and adjective premodifiers here). The re- 
maining ambiguity is resolved using local information 
derived from a corpus. The second phase has a more 
probabilistie fiavour, although no actual probabilities 
are computed. We represent information in a readable 
form, where all possible contexts, that are common 
enough, are listed for each syntactic tag. The length 
of the contexts may vary. The common contexts arc 
longer than the rare ones. In parsing, we try to find 
a match for each word in a maximally long context. 
Briefly, the relation between tim axes and the joints 
is following. The axes force sentences to comply with 
the established frames. If more than one possibility is 
found, the joints are used to rank them. 
2.1 The sentence axis 
In this section we present a new method to collect 
information from a tagged corpus. We define a new 
concept, a sentence axis. The sentence axis is a pat- 
tern that describes the sentence structure at an ap- 
propriate level. We use it to select a group of possible 
analyses for the sentence. In our implementation, we 
form a group of sentence axes and the parser selects, 
using the axes, those analyses of the sentence that 
match all or as many as possible sentence axes. 
We define the sentence axis in the following way. 
Let S be a set of sentences and T a, set of syntactic 
tags. The sentence axis of S according to tags T shows 
the order of appearance of any tag in T for every sen- 
tence in S. 
Itere, we will demonstrate the usage of a sentence 
axis with one sentence. In our real application we, 
of course, use more text to build up a database of 
sentence axes. Consider the following sentence a 
ISUBJ would_+FAUXV also_ADVL 
increase_-FMAINV child NN> benefiLOBa , 
give_-FMAINV some_QN> help OBJ 
t0_AI)VL the 1)N> car_NN> industry <P 
and CC relax_-FMAINV r~,les OBa 
governing_<NOM-FMAiNV Iocal AN> 
avthority_NN> capital_AN> reeeipts OBJ , 
alIowing_-FMAINV councils SUBJ 
/o_INFMAI{K> spend_-FMAINV more ADVL. 
The axis according to the manually defined set T = 
{ SUBJ +FAUXV +FMAINV } 
is 
• .. SUBJ +FAUXV ... SUBJ ... 
which shows what order the elements of set T ap- 
pear in the sentence above, and where three (lots 
mean that there may be something between words, 
e.g. +FAUXV is not followed (in ttfis c~e) immedi- 
ately by SUBJ. When we have more than one. sen- 
tence, ttm axis contains more than one possible order 
for the elements of set T. 
The axis we have extracted is quite general. It de- 
fines the order in which the finite verbs and subjects 
in the sentence may occur but it does not say anything 
about nmdlnite verbs in the sentence. Notice that the 
second subject is not actually tt,e subject of the fi- 
nite clause, but the subject of nontinite construction 
councils to spend more. This is inconvenient, and a 
question arises whether there should be a specific tag 
to mark suhjects of the nonllnite clauses. Voutilainen 
and Tapanaincn \[1993\] argued that the richer set of 
tags could make parsing more accurate in a rule-based 
system. It may be true he.re as well. 
We can also specify an axis for verbs of the sentence. 
'Fhus the axis according to tim set 
{ +FAUXV +FMAINV 
-FMAINV INFMAI{,K> } 
is 
.... kFAUXV ..... FMAINV ..... FMAINV 
.... FMAINV ..... FMAINV .,. INFMAR, K> 
-FMAINV • .. 
The nonlinite verbs occur in this axis four times one 
after another. We do not want just to list how many 
times a nonllnite verb may occur (or occurs in a cor- 
pus) in this kind of position, so we clearly need some 
generalisations. 
The fundamental rule ofgeneralisation that we used 
is the following: Anything that is repeated may be 
repeated any number of times. 
We mark this using l)rackets and a plus sign. The 
generalised axis for the above axis is 
• .. +FAUXV \[ .... FMAINV \]+ 
• .. INI,'MARK> -FMAINV ... 
aThe tag set is adapted from the Constraint Grammar 
of English as it is. It is more extensive than commonly 
used in tagged corpora projects (see Appendix A). 
630 
We can also repeat longer sequences, for instance the 
set 
{ --FMAINV <N()M-I,'MAINV +FAUXV 
SUBJ OBJ } 
provides the axis 
SUBJ +FAUXV .... FMAINV ... oBa 
..... FMAINV ... oBa .... FMAINV OBJ 
• .. <NOM-FMAINV ... OBa ... 
-FMAINV SUBJ .... FMAINV ... 
And we lbrm a generalisation 
SUBJ +FAUXV \[ ..... FMAINV ... OBJ \]+ 
• .. <NOM-FMAINV ... OBJ ... 
-FMAINV SUBJ .... FMAINV ... 
Note that we added silently an extra (tot be.tweeu 
one -FMAINV and OBY in order not to make, dis- 
tinctions between -FMAINV OBg and -FMAINV. . . 
OBJ here. 
Another generalisation can be made using equiva- 
lence clauses. We can ,assign several syntactic tags to 
the same equivalence class (for instance -I"MAINV, 
< NOM-FMAIN V arrd < P-FMA \[N V), and then gen-. 
crate axes as above. 'l'he result would be 
SUBJ +FAUXV \[ ... nonfinv ... OBJ \]-I- 
• .. nontinv SUBJ ... nonfinv ... 
where nonfinv denotes both -FMAINV ;u,d <NOM 
FMAINV (and also <P-.1,'MAINV). 
The equivalence classes are essential in the present 
tag set because the syntactic arguments of finite verbs 
are not distinguished from the arguments of nontlnite 
verbs. Using equivalence classes for the finite attd non- 
finite verbs, we may tmiht an generallsation that ;tl)- 
plies to both types of clauses. Another way to solve 
the problem, is to add new tags for the arguments of 
the nontinite clauses, arid make several axes for them. 
2.2 Local patterns 
In the second phase of the pattern parsing scheme we 
apply local patterns, the joints. They contain iofor- 
mation about what kinds of modifiers have what kinds 
of heads, and vice versa. 
For instance, in the following sentence 4 the words 
fair and crack are both three ways ambiguous before 
the axes are applied. 
He_SUBJ gives_-t-FMAINV us I-OBa 
a l)N> fairAN>/SUBJ/NN> 
crack_OBJ/+FMAINV/SUIU theT, fll)VL 
we. SUBJ wiII+FAUXV 
be_--FMAINV/-FAUXV in_AI)VL with_A1)VL 
a_l)N> chance<P of <NOM-OF 
ca~ying<P-FMAINV off <NOM/AI)VL 
the DN> World <P/NN> Cvp_<P/Ol3J . 
After the axes have been applied, the noun phr,xse a 
fair crack has the analyses 
a DN> fairAN>/NN> crack OBJ. 
The word fairis still left partly ambiguous. We resolve 
this ambiguity using the joints. 
4This analysis is comparable to the output of I'3NCCG. 
The ambiguity is marked here using the slash. The mor.- 
phological information is not printed. 
in an ideal case we have only one head in each 
l)hrase, although it may not be in its exact location 
yet. r\['he following senLencv, fragment (temonstrate.s 
this 
They SUlta have...+ FAUXV been -VMAINV 
much AD-A> less P(X)Mlq,--,qfAD--A> 
attentive <NOM/PCOMPl,-.S 
to <NOM/AI)VI, theft)N> ... 
In tit(.' analysis, the head of the l)hr~me mvch less 
attentive may be less or altenlive. If it is less the 
word attentive is a postn,odifier, and if the head is at- 
tentive then less is a premodilier. Tim sentence is 
represented internally in the parser in such a way 
that if the axes make this distinction, i.e. force 
the.re to be exactly one subject complement, there 
are only two possil)le paths which the joints can se- 
lect from: less AD-A> attenlive_l'COMPL- S and 
less J'COMPl,- S attentive <NOM. 
Generating the joints is quite straightforward. We 
produce different alternative variants for each syntac: 
tic tag and select some of them. Wc use a couple of 
parameters to validate possible joint candidates. 
* q'he error margin provides the probability for check- 
ing if the context is relevant, i.e., there is enottg\]t 
evidence for it among the existing contexts of the 
tag. This probability may be used in two ways: 
l,'or a syntactic tag, generate all contexts (of 
length n) tl, at appear in the corpora. Select all 
those contexts that are frequent enough. Do this 
with all n's wdues: 1, 2, ... 
-- First generate all contexts of length t. Select 
those contexts that are fregnent enough among 
the generated contexts. Next,, lengtlmn all con- 
texts selected in the previous step by one word. 
Select those contexts that are frequent enough 
among the new generated context, s. R.epeat this 
sulficient malty times. 
lloth algorithms l)roduce a set of contexts of differ- 
ent lengths. Characteristic for t)oth the algorithms 
is that if they haw; gene.rated a context of length 
n that matches a syntactic function in a sentence, 
there is also a context of length n - 1 that matches. 
• The absolute, margin mmd)er of cases that is 
needed for the evidence, of till: generated context. 
If therc is less cvidencc, it is not taken into account 
arm a shorter context is generated. '.\['his is used to 
prevent strange behaviour with syntactic tags that 
are not very common or with a corpus that is not 
big enough. 
® 'l'he maximum length of the context to be gener- 
ated. 
l)uring the parsing, longer contexts are preferred 
to shorter ones. The parsing problem is thus a kind 
of pattern matching problem: we have to match a 
pattern (context) arouml each tag and tlnd a sequence 
of syntactic tags (analysis of the sentence) that h~m 
the best score. The scoring fimetion depends on the 
lengths of the matched patterns. 
631 
I text. II words\]ambiguity rate I error rate \] 
bbl 1734' 12.4 % 2.4 % 
bb2 1674 14.2 % 2.8 % 
1599 18.6 % 1.6 % 
wsj " 2309 16.2 % 2.9 % 
\]. totall\] 7316 I 15.3 % \] 2.2 %-1 
Figure 1: Test corpora after syntactical analysis of 
ENGCG. 
3 EXPERIMENTS WITH REAL 
CORPORA 
Information concerning the axes was acquired from a 
manually checked and fully disambiguated corpus 5 of 
about 30,000 words and 1,300 sentences. Local con- 
text information was derived from corpora that were 
analysed by ENGCG. We generated three different 
parsers using three different corpora 6. Each corpus 
contains about 10 million words. 
For evaluation we used four test samples (in Fig- 
ure 1). Three of them were taken frmn corpora that 
we used to generate the parsers and one is an addi- 
tional sample. The samples that are named bbl, today 
and wsj belong to the corpora from which three dif- 
ferent joint parsers, called BB1, TODAY and WSJ 
respectively, were generated. Sample bb~ is the addi- 
tional sample that is not used during development of 
the parsers. 
The ambiguity rate tells us how much ambiguity is 
left after ENGCG analysis, i.e. how many words still 
have one or more alternative syntactic tags. The error 
rate shows us how many syntactic errors ENGCG has 
made while analysing the texts. Note that the ambi- 
guity denotes the amount of work to be done, and the 
error rate denotes the number of errors that already 
exist in the input of our parser. 
All the samples were analysed with each generated 
parser (in Figure 2). The idea is to find out about 
the effects of different text types on the generation 
of the parsers. The present method is applied to re- 
duce the syntactic ambiguity to zero. Success rates 
variate from 88.5 % to 94.3 % in ditferent samples. 
There is maximally a 0.5 percentage points difference 
in the success rate between the parsers when applied 
to the same data. Applying a parser to a sample from 
the same corpus of which it was generated does not 
generally show better results. 
Some of the distinctions left open by ENGCG may 
not be structurally resolvable (see \[Karlsson et al., 
1994\]). A case in point is the prepositional attach- 
ment ambiguity, which alone represents about 20 % 
of the ambiguity in the ENGCG output. The proper 
way to deal with it in the CG framework is probably 
using lexical information. 
Therefore, as long as there still is structurally un- 
resolvable ambiguity in the ENGCG output, a cer- 
tain amount of processing before the present system 
SOonsisting of 15 individual texts from the Bank of 
English project \[J~.rvinen, 1994\]. The texts were chosen 
to cover a variety of text types but due to small size and 
intuitive sampling it cannot be truly representative. 
6We use here Today newspaper, The t",conomist -k Wall 
Street Journal and British Books. 
_T_Text s ~s e r s \[ riB1 .I TODAY 
~~ 92.5% \[92~ 
%__1 91.9-W-o 
Figure 2: Overall parsing success rate in syntactically 
amdysed samples 
might improve the results considerably, e.g., convert- 
ins structurally unresolvable syntactic tags to a single 
underspecified tag. \[,'or instance, resolving preposi- 
tional attachment ambiguity by other means would 
iruprove the success rate of the current system to 
90.5 % - 95.5 %. In the wsj sample ttLe improvement 
would be as much a.s 2.0 percentage points. 
The differences between success rates in different 
samples are partly explained by tile error types that 
are characteristic of the samples. For example, in 
the Wall Street Journal adverbials of time are easily 
parsed erroneously. This may cause an accumulation 
effect, ms happens in tile following sentence 
MAN AG Tuesday said fiscal 1989 net income 
rose 25% and said it, will raise its dividend for 
lhe year ended June 30 by about the same 
percentage. 
Tile phrase the year ended June 30 gets the analysis 
the_DN> year_NN> ended_AN> 
June_NN> 30_<P 
while the correct (or wanted) result is 
lhe DN> year_<P ended_<NOM-FMAINV 
June_ADVL 30 <NOM 
Different kind of errors appear in text bb! which con- 
tains incomplete sentences. The parser prefers com- 
plete sentences and produces errors in sentences like 
There w~s Provence in mid-autumn. Gold Zints. 
Air so serene you could look out over the sea for 
tens of miles. Rehabilitalion walks with him 
along tim woodland l)aths. 
The errors are: gold tints is parsed a.s svbjeel - main 
verb ~s well ~m r'ehabililation walks, and air is analysed 
,as a main verb, Other words have the appropriate 
analyses. 
The strict sequentiality of morphological and syn- 
tactic analysis in ENGCG does not allow the use of 
syntactic information in morphological disambigua- 
tion. The present method makes it possible to prune 
the remaining morphological ambiguities, i.e. do some 
part-of-speech tagging. Morphological ambiguity re- 
mains unresoNed if the chosen syntactic tag is present 
in two or more morphological readings of the same 
word. Morphological ambiguity 7 is reduced close to 
zero (about 0.3 % in all the samples together) and the 
overall success rate of ENGCG + our pattern parser 
is 98.7 %. 
r After ENGCG the amount of nmrphologic',d ambiguity 
in the test data was 2.9 %, with au error rate of 0.4 %. 
632 
4 CONCLUSION 
We discussed combining a linguistic rule-based parser 
and a corpus-based empirical parser. We divide the 
parsing process into two parts: applying linguistic in- 
formation and applying corpus-based patterns. The 
linguistic rules are regarded ms more reliable than the 
corpus-based generalisations. They are therefore ap- 
plied first. 
The idea is to use reliable linguistic information as 
long as it is possible. After certain phase it comes 
harder and harder to make new linguistic constraints 
to eliminate the remaining ambiguity. Therefore we 
use corpus-based patterns to do the remaining dis- 
and)iguation. The overall success rate of the com- 
bination of the linguistic rule-based parser and the 
corpus-based pattern parser is good. If some unrc- 
solvable ambiguity is left pending (like prepositional 
attachment), the total success rate of our morpho- 
logical and surface-syntactic analysis is only slightly 
worse than that of many probabilistic part-of-speech 
taggers. It is a good result because we do more than 
just label each word with a morphological tags (i.e. 
noun, verb, etc.), we label them also with syntactic 
fimction tags (i.e. subject, object, subject comple- 
ment, etc.). 
Some improvements might be achieved by modi- 
fying the syntactic tag set of ENGCG. As discussed 
above, the (syntactic) tag set of the ENGCG is not 
probably optimal. Some ambiguity is not resolvable 
(like prepositional attachment) and some distinctions 
arc not made (like subjects of the finite and the non- 
finite clauses). A better tag set for surface-syntactic 
parsing is presented in \[Voutilainen and Tapanainen, 
1993\]. But we have not modified the present tag set 
because it is not clear whether small changes would 
improve the result significantly when compared to the 
effort needed. 
Although it is not possible to fully disambiguate the 
syntax in ENGCG, the rate of disambiguation can be 
improved using a more powerful linguistic rule tbrmal- 
ism (see \[Koskenniemi el al., 1992; Koskenniemi, 1990; 
Tapanainen, 1991\]). The results reported in this sudy 
can most likely be improved by writing a syntactic 
grammar in the finite-state framework. The same 
kind of pattern parser could then be used for disam- 
biguating the resulting analyses. 
5 ACKNOWLEDGEMENTS 
The Constraint Grammar framework was originally 
proposed by Fred Karlsson \[1990\]. The extensive work 
on the description of English was (tone by Atro Vouti- 
lainen, Juha tleikkil~ and Arto Anttila \[1992\]. Timo 
J~rvinen \[1994\] has developed the syntactic constraint 
system further. ENGCG uses Kimmo Koskenniemi's 
\[1983\] two-level morphological analyser and Past Ta- 
panainen's implementation of Constraint Grammar 
parser. 
We want to thank Fred Karlsson, Lauri Karttunen, 
Annie Za~nen, Atro Voutilainen and Gregory Grefen- 
stette for commenting this paper. 
References 
\[J~rvinen, 1994\] Timo J~rvinen. Annotating 200 mil- 
lion words: The Bank of English project. \[n pro- 
ceedings of COLING-9\]~. Kyoto, 1994. 
\[Karlsson, 1990\] Fred Karlsson. Constraint Grammar 
as a framework for parsing running text. in tIans 
Karlgren (editor), COLING-90. Papers presented 
to the 13th International Conference on Compv- 
tational Linguistics. Vol. 3, pp. 168-173, IIelsinki, 
1990. 
\[Karlsson, 1994\] Fred Karlsson. Robust parsing of un- 
constrained text. In Nelleke Oostdijk and Pieter 
de IIaan (eds.), Corpus-based Research Into Lan- 
guage., pp. 121-142, l~odopi, Amsterdam-Atlanta, 
1994. 
\[Karlsson et at., 1994\] Fred Karlsson, Atro Voutilai- 
hen, Juha Ileikkilii. and Arto Anttila (eds.) Con- 
straint Grammar: a Language-Independent System 
for Parsing Unrestricted Text. Mouton de Gruyter, 
Berlin, 1994. 
\[Koskenniemi, 1983\] Kimmo Koskenniemi. Two-level 
morphology: a general computational model tbr 
word-form recognition and production. Publica- 
tions nro. 11. Dept. of General Linguistics, Univer- 
sity of Ilelsinki. 1983. 
\[Koskenniemi, 1990\] Kimmo Koskenniemi. Finite- 
state parsing and disambiguation. In lians Karl- 
gren (editor), COLING-90. Papers presented to the 
13th International Conference on Computational 
Linguistics. Vol. 2 pages 229-232, ll\[elsinki, 1999. 
\[Koskenniemi el al., 1992\] Kimmo Koskenniemi, P~i 
Tapanainen and Atro Voutilainen. Compiling and 
using finite-state syntactic rules. In Proceedings of 
the fifteenth International Conference on Computa- 
tional Linguistics. COLING-92. Vol. I, pp. 156-102, 
Nantes, France. 1992. 
\[Tapanainen, 1991\] Past Tapanainen. ~.Srellisinii au- 
tomaatteina esitettyjen kielioppis~i£ntSjen sovelta- 
minen hmnnollisen kielen j,isentKj~sK (Natural lan- 
guage parsing with finite-state syntactic rules). 
Master's thesis. Dept. of computer science, Univer- 
sity of Ilelslnki, i991. 
\[Voutilainen, 1994\] Afro Voutitainen. Three studies 
of grammar-based surface parsing of unrestricted 
english text. Publications nr. 24. Dept. of General 
Linguistics. University of llelsinki. 19!)4. 
\[Voutilainen el al., 1992\] 
Atro Voutilainen, Juha lteikkilii and Arto Anttila. 
Constraint grammar of English -- A l'erformance- 
Oriented Introduction. Publications nr. 21. Dept. 
of General Linguistics, University of Ilelsinki, 1992. 
\[Voutilainen and Tapanainen, 199q\] Atro Voutilainen 
and Past Tapanainen. Ambiguity resolution in a re- 
dnctionistic parser. In Proceedings of of Sixth Con- 
ference of the European Chapter of the Association 
for Computational Linguistics. EACL-93. pp. 394- 
403, Utrecht, Netherlands. 1993. 
A TIIE TAG SET 
2'his appendix contains the syntactic tags we have 
used. Tbe list is adopted from \[Voutilainen et al., 
1992\]. To obtain also the morphological "part-of- 
speech" tags you can send an empty e-mail message 
to engcg-info@ling.helsinki.fi. 
633 
+FAUXV = Finite Auxiliary Predicator: lie ca_~n 
read., 
-FAUXV = Nonfinite Auxiliary Predicator: Fie may 
have_ read., 
+FMAINV = Finite Main Predicator: He reads., 
-FMAINV = Nonfinite Main Predicator: He has 
re a~d., 
NPHIL = Stray NP: Volume I: Syntax, 
SUBJ = Subject: H__~e reads., 
F-SUBJ = Formal Subject: There was some argu- 
ment about that. I_tt is raining., 
OBJ = Object: He read a book., 
I-OBJ = Indirect Object: He gave Mary a book., 
PCOMPL-S = Subject Complement~ is a fool., 
PCOMPL-O = Object Complement: I eonsid~--him 
a fool., 
AI~V-L = Adverbial: He came home late. He is in the 
Ca r., 
O-ADVL = Object Adverbial: lie ran two miles. 
APP = Apposition: Helsinki, the capital of Finland, 
N = Title: King George and Mr. 
DN> - Dete~ner: He read the book., 
NN> = Premodifying Noun: The car park was full., 
AN> -- Premodifying Adjective: The bh, e car is 
mine., 
QN> --- Premodifying Quantifier: He had two sand- 
wiches and some coffee., 
GN> = Premodifying Genitive: M._yy car and flill's 
bike are blue., 
AD-A> = Premodifying Ad-Adjective: She is very 
intelligent., 
<NOM-OF = Postmodifying Of: Five of you will 
pass., 
<NOM-FMAINV = Postmodifying Nonfinite Verb: 
He has the licence lo kill. John is easy to please. The 
man drinking coffee is my uncle., 
<AD---A---~= Postmodifying Ad-Adjective: This is good 
enough., 
<~ = Other Postmodifier: The man with glasses 
is my uncle. He is the president elect. The man i_~n 
the moon fell down too soon., 
INFMAP~K> = Infinitive Marker: John wants t~ 
read., 
<P-FMAINV = Nonfinite Verb as Complement of 
Preposition: This is a brush for cleaning., 
<P = Other Complement of Prel)~ He is in tile 
car., 
CC --- Coordinator: John and Bill are friends., 
CS = Subordinator: /f Johu is there, we shall go, too., 
634 
