NPtool, a detector of English noun phrases *
Atro Voutilainen 
Research Unit for Computational Linguistics 
P.O. Box 4 (Keskuskatu 8) 
FIN-00014 University of Helsinki 
Finland 
E-mail: avoutila@ling.Helsinki.FI 
Abstract 
NPtool is a fast and accurate system for ex- 
tracting noun phrases from English texts 
for the purposes of e.g. information re- 
trieval, translation unit discovery, and cor- 
pus studies. After a general introduction, 
the system architecture is presented in out- 
line. Then follows an examination of a re- 
cently written Constraint Syntax. Section 6 
reports on the performance of the system. 
1 Introduction 
This paper outlines NPtool, a noun phrase detector.
At the heart of this modular system is reductionistic
word-oriented morphosyntactic analysis that expresses
head-modifier dependencies. Previous work on this
approach, largely based on Karlsson's original proposal
\[Karlsson, 1990\], is documented in \[Karlsson et al.,
forthcoming\]. Let us summarise a few key features of
this style of analysis.
• Like most parsing frameworks, the present
style of analysis employs a lexicon and a grammar.
What may distinguish the present approach from 
most other frameworks is the considerable degree of 
attention we pay to the morphological and lexical 
description: morphological analysis is based on an 
extensive and detailed description that employs in- 
flectional and central derivational categories as well 
as other morphosyntactic features that can be useful
for stating syntactic generalisations. In this way
a carefully built and informative lexicon facilitates
the construction of accurate, wide-coverage parsing
grammars.
* This paper is based on a longer manuscript with the
same title. The development of ENGCG was supported
by TEKES, the Finnish Technological Development Center,
and a part of the work on Finite-state syntax has
been supported by the Academy of Finland.
• We use tags to encode morphological distinc- 
tions, part of speech, and also syntactic information; 
for instance: 
I PRON @HEAD
see V PRES @VERB 
a ART @>N 
bird N @HEAD 
FULLSTOP 
In this type of analysis, each word is provided with 
tags indicating e.g. part of speech, inflection, deriva- 
tion, and syntactic function. 
• Morphological and syntactic descriptions are 
based on hand-coded linguistic rules rather than 
on corpus-based statistical models. They employ 
structural categories that can be found in descrip- 
tive grammars, e.g. \[Quirk, Greenbaum, Leech and 
Svartvik, 1985\]. 
Regarding the at times heated methodological debate
on whether statistical or rule-based information
is to be preferred in grammatical analysis of running
text (cf. e.g. \[Sampson, 1987a; Taylor, Grover and
Briscoe, 1989; Church, 1992\]), we do not object to
probabilistic methods in principle; nevertheless, it
seems to us that rule-based descriptions are preferable
because they can provide for more accurate and
reliable analyses than current probabilistic systems,
e.g. part-of-speech taggers \[Voutilainen, Heikkilä and
Anttila, 1992; Voutilainen, forthcoming a\]. 1 Probabilistic
or heuristic techniques may still be a useful
add-on to linguistic information, if potentially remaining
ambiguities must be resolved - though with
a higher risk of error.
• In the design of our grammar schemes, we have
paid considerable attention to the question of the
resolvability of grammatical distinctions. In the
design of accurate parsers of running text, this question
is very important: if the description abounds
with distinctions that can be dependably resolved
only with extrasyntactic knowledge 2, then either the
ambiguities due to these distinctions remain to burden
the structure-based parser (as well as the potential
application based on the analysis), or a guess,
i.e. a misprediction, has to be hazarded.
This descriptive policy brings with it a certain degree
of shallowness; in terms of information content,
a tag-based syntactic analysis is somewhere
between morphological (e.g. part-of-speech) analysis
and a conventional syntactic analysis, e.g. a phrase
structure tree or a feature-based analysis. What we
hope to achieve with this compromise in information
content is the higher reliability of the proposed analyses.
A superior accuracy could be considered as an
argument for postulating a new, 'intermediary' level
of computational syntactic description. For more details,
see e.g. \[Voutilainen and Tapanainen, 1993\].
• Our grammar schemes are also learnable: according
to double-blind experiments on manually assigning
morphological descriptions, a 100% interjudge
agreement is typical \[Voutilainen, forthcoming a\]. 3
• The ability to parse running text is of a high
priority. Not only a structurally motivated description
is important; in the construction of the parsing
grammars and lexica, attention should also be
paid to corpus evidence. Often a grammar rule,
as we express it in our parsing grammars, is formed
as a generalisation 'inspired' by corpus observations;
in this sense the parsing grammar is corpus-based.
However, the description need not be restricted to
the corpus observation: the linguist is likely to generalise
over past experience, and this is not necessarily
harmful - as long as the generalisations can also
be validated against representative test corpora.
1 Consider for instance the question posed in \[Church,
1992\] whether lexical probabilities contribute more to
morphological or part-of-speech disambiguation than
context does. The ENGCG morphological disambiguator,
which is entirely based on context rules, uniquely
and correctly identifies more than 97% of all appropriate
descriptions, and this is considerably more than the
near-90% success rate achieved with lexical probabilities alone
\[Church, 1992\]. Moreover, note that in all, the ENGCG
disambiguator identifies more than 99.5% of all appropriate
descriptions; only, some 2-3% of all analyses remain
ambiguous and thus do not become uniquely identified.
For more details, see \[Voutilainen, forthcoming 1993\].
2 Witness, for instance, ambiguities due to adverbial
attachment or modifier scope.
3 The 95% interjudge agreement rate reported in
\[Church, 1992\] probably indicates that in the case of
debatable constructions, explicit descriptive conventions
have not been consistently established. Only a carefully
defined grammar scheme makes the evaluation of the
accuracy of the parsing system a meaningful enterprise (see
also \[Sampson, 1987b\]).
• At least in a practical application, a parsing
grammar should assign the best available anal- 
ysis to its input rather than leave many of the input 
utterances unrecognised e.g. as ill-formed. This does 
not mean that the concept of well-formedness is irrel- 
evant for the present approach. Our point is simply: 
although we may consider some text utterance as de- 
viant in one respect or another, we may still be inter- 
ested in extracting as much information as possible 
from it, rather than ignore it altogether. To achieve 
this effect, the grammar rules should be used in such 
a manner 4 that no input becomes entirely rejected, 
although the rules as such may express categorical 
restrictions on what is possible or well-formed in the 
language. 
• In our approach, parsing consists of two main 
kinds of operation: 
1. Context-insensitive lookup of (alternative)
descriptions for input words
2. Elimination of unacceptable or contextually il- 
legitimate alternatives. 
Morphological analysis typically corresponds to 
the lookup module: it produces the desired mor- 
phosyntactic analysis of the sentence, along with 
a number of inappropriate ones, by providing each 
word in the sentence with all conventional analyses as 
a list of alternatives. The grammar itself exerts the 
restrictions on permitted sequences of words and de- 
scriptors. In other words, syntactic analysis proceeds 
by way of ambiguity resolution or disambiguation:
the parser eliminates ill-formed readings, and
what 'survives' the grammar is the (syntactic) analysis
of the input utterance. Since the input contains
the desired analysis, no new structure will be built
during syntactic analysis itself.
• Our grammars consist of constraints - partial
distributional definitions of morphosyntactic categories,
such as parts of speech or syntactic functions.
Each constraint expresses a piecemeal
linear-precedence generalisation about the language, and
they are independent of each other. That is, the con- 
straints can be applied in any order: a true grammar 
will produce the same analysis, whatever the order. 
The grammarian is relatively free to select the 
level of abstraction at which (s)he is willing to ex- 
press the distributional generalisation. In particular, 
also reference to very low-level categories is possi- 
ble, and this makes for the accuracy of the parser: 
while the grammar will contain more or less ab- 
stract, feature-oriented rules, often it is also expe- 
dient to state further, more particular restrictions 
on more particular distributional classes, even at 
the word-form level. These 'smaller' rules do not 
contradict the more general rules; often it is simply
the case that further restrictions can be imposed
on smaller lexical classes 5. This flexibility in
the grammar formalism greatly contributes to the
accuracy of the parser \[Voutilainen, forthcoming a;
Voutilainen, forthcoming 1993\].
4 e.g. by ranking the grammar rules in terms of
compromisability
2 Uses of a noun phrase parser 
The recognition and analysis of subclausal structural 
units, e.g. noun phrases, is useful for several pur- 
poses. Firstly, a noun phrase detector can be useful 
for research purposes: automatic large-scale analy- 
sis of running text provides the linguist with better 
means to conduct e.g. quantitative studies over large 
amounts of text. 
An accurate though somewhat superficial analysis 
can also serve as a 'preprocessor' prior to more ambi- 
tious, e.g. feature-based syntactic analysis. This kind 
of division of labour is likely to be useful for technical 
reasons. One major problem with e.g. unification- 
based parsers is parsing time. Now if a substan- 
tial part of the overall problem is resolved with 
more simple and efficient techniques, the task of the 
unification-based parser will become more manage- 
able. In other words, the more expressive but compu- 
tationally heavier machinery of e.g. the unification- 
based parser can be reserved entirely for the analysis 
of the descriptively hardest problems. The less com- 
plex parts of the overall problem can be tackled with 
more simple and efficient techniques. 
Regarding production uses, even lower levels of 
analysis can be directly useful. For instance, the de- 
tection of noun phrases can provide e.g. information 
management and retrieval systems with a suitable 
input for index term generation. 
Noun phrases can also serve as translation units; 
for instance, \[van der Eijk, 1993\] suggests that noun 
phrases are more appropriate translation units than 
words or part-of-speech classes. 
3 Previous work 
This section consists of two subsections. Firstly, a 
performance-oriented survey of some related systems 
is presented. Then follows a more detailed presentation
of ENGCG, a predecessor of the NPtool parser
in an information retrieval system.
3.1 Related systems 
So far, I have found relatively little documentation 
on systems whose success in recognising or parsing 
noun phrases has been reported. I am aware of three 
systems with some relevant evaluations. 
Church's Parts of speech \[Church, 1988\] performs 
not only part-of-speech analysis, but it also identifies
the most simple kinds of noun phrases - mostly
sequences of determiners, premodifiers and nominal 
heads - by inserting brackets around them, e.g. 
\[A/AT former/AP top/NN aide/NN\] to/IN
\[Attorney/NP/NP General/NP/NP Edwin/NP/NP
Meese/NP/NP\] interceded/VBD ...
5 Consider for instance the attachment of prepositional
phrases in general and of of-phrases in particular.
The appendix in \[Church, 1988\] lists the analysis of 
a small text. The performance of the system on the 
text is quite interesting: of 243 noun phrase brackets,
only five are omitted. - The performance of Parts of
speech was also very good in part-of-speech analysis 
on the text: 99.5% of all words got the appropri- 
ate tag. The mechanism for noun phrase identifica- 
tion relies on the part-of-speech analysis; the part-of- 
speech tagger was more successful on the text than 
on an average; therefore the average performance of 
the system in noun phrase identification may not be 
quite as good as the figures in the appendix of the 
paper suggest. 
Bourigault's LECTER \[Bourigault, 1992\] is a 
surface-syntactic analyser that extracts 'maximal-length
noun phrases' - mainly sequences of determiners,
premodifiers, nominal heads, and certain kinds
of postmodifying prepositional phrases and adjec- 
tives - from French texts for terminology applica- 
tions. The system is reported to recognise 95% of all 
maximal-length noun phrases (43,500 out of 46,000 
noun phrases in the test corpus), but no figures are 
given on how much 'garbage' the system suggests as 
noun phrases. It is indicated, however, that manual 
validation is necessary. 
Rausch, Norrback and Svensson \[1992\] have de- 
signed a noun phrase extractor that takes as its in- 
put part-of-speech analysed Swedish text, and inserts 
brackets around noun phrases. In the recognition of 
'Nuclear Noun Phrases' - sequences of determiners, 
premodifiers and nominal heads - the system was 
able to identify 85.9% of all nuclear noun phrases in a 
text collection, some 6,000 words long in all, whereas 
some 15.7% of all the suggested noun phrases were 
false hits, i.e. the precision 6 of the system was 84.3%.
The performance of a real application would probably
be lower because potential misanalyses due to
previous stages of analysis (morphological analysis
and part-of-speech disambiguation, for instance) are
not accounted for by these figures.
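The figures quoted for LECTER and for the Swedish extractor can be checked with a few lines of arithmetic. The following is only a sketch: it assumes the usual reading of the two measures (recall as the share of the noun phrases in the text that the system finds, precision as the share of the system's proposals that are genuine noun phrases), and the function names are illustrative, not part of any of the systems discussed.

```python
def recall(correctly_found, total_in_text):
    """Share of the noun phrases in the text that the system identified."""
    return correctly_found / total_in_text


def precision_from_false_hits(false_hit_share):
    """Share of the proposed noun phrases that are genuine,
    given the share of proposals that were false hits."""
    return 1.0 - false_hit_share


# LECTER: 43,500 of 46,000 maximal-length noun phrases recognised.
lecter_recall = recall(43_500, 46_000)

# Rausch, Norrback and Svensson: 15.7% of the suggested noun phrases
# were false hits.
nuclear_np_precision = precision_from_false_hits(0.157)

print(f"LECTER recall: {lecter_recall:.1%}")
print(f"Nuclear NP precision: {nuclear_np_precision:.1%}")
```

The LECTER ratio comes to roughly 94.6%, which the text rounds to 95%.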
3.2 ENGCG and the SIMPR project
SIMPR, Structured Information Management: Pro- 
cessing and Retrieval, was a 64 person year ESPRIT 
II project (Nr. 2083, 1989-1992), whose objective 
was to develop new methods for the management 
and retrieval of large amounts of electronic texts. A 
central function of such a system is to recognise those 
words in the stored texts that represent it in a con- 
cise fashion - in short, index terms. 
Term indices created with traditional methods 7 
are based on isolated, perhaps truncated words. 
6 For definitions of the terms recall and precision, see
Section 6.
7 See e.g. \[Salton and McGill, 1983\].
These largely string-based statistical methods are 
somewhat unsatisfactory because many content iden- 
tifiers consist of word sequences - compounds, head- 
modifier constructions, even simple verb - noun 
phrase sequences. One of the SIMPR objectives 
was also to employ more complex constructions, the 
recognition of which would require a shallow gram- 
matical analysis. The Research Unit for Computa- 
tional Linguistics at the University of Helsinki par- 
ticipated in this project, and ENGTWOL, a Twol- 
styled morphological analyser as well as ENGCG, 
a Constraint Grammar of English, were written 
1989-1992 by Voutilainen, Heikkilä and Anttila
\[forthcoming\]. The resultant SIMPR system is an 
improvement over previous systems \[Smart (Ed.), 
forthcoming\] - it is not only reasonably accurate, but 
also it operates on more complex constructions, e.g. 
postmodifying constructions and simple verb-object 
constructions. 
There were also some persistent problems. The 
original plan was to use the output of the whole 
ENGCG parser for the indexing module. However, 
the last module of the three sequential modules in 
the ENGCG grammar, namely Constraint Syntax 
proper, was not used in the more mature versions of 
the indexing module - only lexical analysis and
morphological disambiguation were applied. The omission
of the syntactic analysis was mainly due to the
somewhat high error rate (3-4% of all words lost the
proper syntactic tag) and the high rate of remaining
ambiguities (15-25% of all words remained syntactically
ambiguous).
Here, we will not go into a detailed analysis of the
problems 8; suffice it to say that the syntactic grammar
scheme was unnecessarily ambitious for the relatively
simple needs of the indexing application. One
of the improvements in NPtool is a more optimal syntactic
grammar scheme, as will be seen in Section 5.1.
4 NPtool in outline 
In this section, the architecture of NPtool is pre- 
sented in outline. Here is a flow chart of the system: 
             Preprocessing
                   V
         Morphological analysis
                   V
       Constraint Grammar parsing
            V               V
  NP-hostile finite   NP-friendly finite
    state parsing       state parsing
            V               V
    NP extraction       NP extraction
            V               V
     Intersection of noun phrase sets
8 See e.g. \[Voutilainen, Heikkilä and Anttila, 1992\] for
details.
In the rest of this section, we will observe the analysis 
of the following sample sentence, taken from a car 
maintenance manual: 
The inlet and exhaust manifolds are mounted
on opposite sides of the cylinder head, the 
exhaust manifold channelling the gases to a 
single exhaust pipe and silencer system. 
4.1 Preprocessing and morphological 
analysis 
The input ASCII text, preferably SGML-annotated, 
is subjected to a preprocessor that e.g. determines 
sentence boundaries, recognises fixed syntagms 9, 
normalises certain typographical conventions, and 
verticalises the text. 
This preprocessed text is then submitted to mor- 
phological analysis. ENGTWOL, a morphological 
analyser of English, is a Koskenniemi-style morpho- 
logical description that recognises all inflections and 
central derivative forms of English. The present 
lexicon contains some 56,000 word stems, and al- 
together the analyser recognises several hundreds of 
thousands of different word-forms. The analyser also 
employs a detailed parsing-oriented morphosyntac- 
tic description; the feature system is largely derived 
from \[Quirk, Greenbaum, Leech and Svartvik, 1985\]. 
Here is a small sample:
("<*the>"
    ("the" DET CENTRAL ART SG/PL (@>N)))
("<inlet>"
    ("inlet" N NOM SG))
("<and>"
    ("and" CC (@CC)))
("<exhaust>"
    ("exhaust" <SVO> V SUBJUNCTIVE VFIN (@V))
    ("exhaust" <SVO> V IMP VFIN (@V))
    ("exhaust" <SVO> V INF)
    ("exhaust" <SVO> V PRES -SG3 VFIN (@V))
    ("exhaust" N NOM SG))
("<manifolds>"
    ("manifold" N NOM PL))
All running-text word-forms are given on the left- 
hand margin, while all analyses are on indented lines 
of their own. The multiplicity of these lines for a 
word-form indicates morphological ambiguity. 
For words not represented in the ENGTWOL lex- 
icon, there is a 99.5% reliable utility that assigns 
ENGTWOL-style descriptions. These predictions 
are based on the form of the word, but also some 
heuristics are involved. 
4.2 Constraint Grammar parsing 
The next main stage in NPtool analysis is Constraint
Grammar parsing. Parsing consists of two
main phases: morphological disambiguation and 
Constraint syntax. 
9 e.g. multiword prepositions and compounds
• Morphological disambiguation. The task 
of the morphological disambiguator is to discard all 
contextually illegitimate morphological readings in 
ambiguous cohorts. For instance, consider the following
fragment:
("<a>"
    ("a" <Indef> DET CENTRAL ART SG (@>N)))
("<single>"
    ("single" <SVO> V IMP VFIN (@V))
    ("single" <SVO> V INF)
    ("single" A ABS))
Here an unambiguous determiner is directly followed 
by a three-ways ambiguous word, two of the analyses 
being verb readings, and one, an adjective reading. 
- A determiner is never followed by a verb 10; one
of the 1,100-odd constraints in the disambiguation 
grammar \[Voutilainen, forthcoming a\] expresses this 
fact about English grammar; so the verb readings of 
single are discarded here. 
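How a single constraint of this kind prunes a cohort can be sketched in a few lines of Python. This is a toy illustration, not the ENGCG formalism or its parser: the data layout (a cohort as a word-form plus a list of lemma/tag readings), the function name, and the safeguard of never discarding the last remaining reading are all assumptions made for the example, and the exception for no mentioned in the footnote is not modelled.

```python
def discard_verbs_after_determiner(sentence):
    """Apply one illustrative constraint: a word directly preceded by an
    unambiguous determiner loses its verb readings."""
    result = []
    for i, (word, readings) in enumerate(sentence):
        kept = readings
        if i > 0:
            previous_readings = sentence[i - 1][1]
            # The left-hand context must be a determiner in all its readings.
            if previous_readings and all("DET" in tags for _, tags in previous_readings):
                survivors = [(lemma, tags) for lemma, tags in readings if "V" not in tags]
                if survivors:  # assumption: never discard the last reading
                    kept = survivors
        result.append((word, kept))
    return result


# The fragment from the text: each reading is (lemma, tuple of tags).
fragment = [
    ("a", [("a", ("<Indef>", "DET", "CENTRAL", "ART", "SG", "@>N"))]),
    ("single", [
        ("single", ("<SVO>", "V", "IMP", "VFIN", "@V")),
        ("single", ("<SVO>", "V", "INF")),
        ("single", ("A", "ABS")),
    ]),
]

disambiguated = discard_verbs_after_determiner(fragment)
print(disambiguated[1])
```

Only the adjective reading of single survives, while the determiner cohort is passed through unchanged.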
The morphological disambiguator seldom discards 
an appropriate morphological reading: after morpho- 
logical disambiguation, 99.7-100% of all words re- 
tain the appropriate analysis. On the other hand, 
some 3-6% of all words remain ambiguous, e.g. 
head in this sentence. There is also an additional 
set of some 200 constraints - after the application 
of both constraint sets, 97-98% of all words be- 
come fully disambiguated, with an overall error rate 
of up to 0.4% \[Voutilainen, forthcoming b\]. The 
present disambiguator compares quite favourably 
with other known, typically probabilistic, disam- 
biguators, whose maximum error rate is as high as 
5%, i.e. some 17 times as high as that of the ENGCG 
disambiguator. 
• Constraint syntax. After morphological dis- 
ambiguation, the syntactic constraints are applied. 
In the NPtool syntactic description, all syntactic am- 
biguities are introduced directly in the lexicon, so 
no extra lookup module is needed. Like disambigua- 
tion constraints, syntactic constraints seek to discard 
all contextually illegitimate syntactic function tags. 
Here is the syntactic analysis of our sample sentence, 
as produced by the current parser. To save space, 
most of the morphological codes are omitted.
("<*the>"
    ("the" DET (@>N)))
("<inlet>"
    ("inlet" N (@>N @NH)))
("<and>"
    ("and" CC (@CC)))
("<exhaust>"
    ("exhaust" N (@>N)))
("<manifolds>"
    ("manifold" N (@NH)))
("<are>"
    ("be" V (@V)))
("<mounted>"
    ("mount" PCP2 (@V)))
("<on>"
    ("on" PREP (@AH)))
("<opposite>"
    ("opposite" A (@>N)))
("<sides>"
    ("side" N (@NH)))
("<of>"
    ("of" PREP (@N<)))
("<the>"
    ("the" DET (@>N)))
("<cylinder>"
    ("cylinder" N (@>N @NH)))
("<head>"
    ("head" V (@V))
    ("head" N (@NH)))
("<$,>")
("<the>"
    ("the" DET (@>N)))
("<exhaust>"
    ("exhaust" N (@>N)))
("<manifold>"
    ("manifold" N (@NH)))
("<channelling>"
    ("channel" PCP1 (@V)))
("<the>"
    ("the" DET (@>N)))
("<gases>"
    ("gas" N (@NH)))
("<to>"
    ("to" PREP (@AH)))
("<a>"
    ("a" DET (@>N)))
("<single>"
    ("single" A (@>N)))
("<exhaust>"
    ("exhaust" N (@>N)))
("<pipe>"
    ("pipe" N (@NH)))
("<and>"
    ("and" CC (@CC)))
("<silencer>"
    ("silencer" N (@>N)))
("<system>"
    ("system" N (@NH)))
("<$.>")
10 save for no, which can be followed by an -ing-form;
cf. no in There is no going home
All syntactic-function tags are flanked with '@'. For 
instance, the tag '@>N' indicates that the word is 
a determiner or a premodifier of a nominal in the 
right-hand context (e.g. the). The second word, inlet,
remains syntactically ambiguous due to a premodifier
reading and a nominal head @NH reading
- note that the ambiguity is structurally genuine, a 
coordination ambiguity. The tag @V is reserved for 
verbs and auxiliaries, cf. are as well as mounted. The 
syntactic description will be outlined below. 
Pasi Tapanainen 11 has recently made a new imple- 
mentation of the Constraint Grammar parser that 
performs morphological disambiguation and syntac- 
tic analysis at a speed of more than 1,000 words per 
second on a Sun SparcStation 10, Model 30. 
4.3 Treatment of remaining ambiguities 
The Constraint Grammar parser recognises only 
word-level ambiguities, therefore some of the traversals
through an ambiguous sentence representation
may be blatantly ill-formed.
NPtool eliminates locally unacceptable analyses by 
using a finite-state parser \[Tapanainen, 1991\] 12 as a
kind of 'post-processing module' that distinguishes 
between competing sentence readings. The parser 
employs a small finite-state grammar that I have 
written. The speed of the finite-state parser is com- 
parable to that of the Constraint Grammar parser. 
The finite-state parser produces all sentence readings
that are in agreement with the grammar. Consider
the following two adapted readings from the
beginning of our sample sentence:
the/@>N inlet/@>N and/@CC exhaust/@>N
manifolds/@NH are/@V mounted/@V
on/@AH opposite/@>N sides/@NH
of/@N< the/@>N cylinder/@NH head/@V

the/@>N inlet/@>N and/@CC exhaust/@>N
manifolds/@NH are/@V mounted/@V
on/@AH opposite/@>N sides/@NH
of/@N< the/@>N cylinder/@>N head/@NH
The only difference is in the analysis of cylinder head:
the first analysis reports cylinder as a noun phrase 
head which is followed by the verb head, while the 
second analysis considers cylinder head as a noun 
phrase. Now the last remaining problem is, how 
to deal with ambiguous analyses like these: should 
cylinder be reported as a noun phrase, or is cylinder 
head the unit to be extracted? 
The present system provides all proposed noun 
phrase candidates in the output, but each with an 
indication of whether the candidate noun phrase is 
unambiguously analysed as such, or not. In this so- 
lution, I do not use all of the multiple analyses pro- 
posed by the finite-state parser. For each sentence, 
no more than two competing analyses are selected for 
further processing: one with the highest number of 
words as part of a maximally long noun phrase anal- 
ysis, and the other with the lowest number of words 
as part of a maximally short noun phrase analysis. 
This 'weighing' can be done during finite-state 
parsing: the formalism employs a mechanism for im- 
posing penalties on regular expressions, e.g. on tags. 
11 Research Unit for Computational Linguistics,
University of Helsinki
12 For other work in this approach, see also \[Koskenniemi,
1990; Koskenniemi, Tapanainen and Voutilainen,
1992; Voutilainen and Tapanainen, 1993\].
A penalised reading is not discarded as ungrammatical;
the parser simply returns all accepted analyses in
an order where the least penalised analyses are produced
first and the 'worst' ones last.
Thus there is an 'NP-hostile' finite-state parser 
that penalises noun phrase readings; this would 
prefer the sentence reading with cylinder/@NH 
head/@V. The 'NP-friendly' parser, on the other 
hand, penalises all readings which are not part of a 
noun phrase reading, so it would prefer the analysis 
with cylinder/@>N head/@NH. Of all analyses, the
selected two parses are maximally dissimilar with re- 
gard to NP-hood. The motivation for selecting max- 
imally conflicting analyses in this respect is that a 
candidate noun phrase that is agreed upon as a noun 
phrase by the two finite-state parsers just as
it is - neither longer nor shorter - is likely to be an 
unambiguously identified noun phrase. The compar- 
ison of the outputs of the two competing finite-state 
parsers is carried out during the extraction phase. 
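The selection of the two maximally dissimilar readings can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the actual system attaches penalties to regular expressions inside the finite-state parser, whereas here each reading is a plain string of word/tag pairs, every tag counts as one unit of penalty, and the set NP_TAGS and all function names are made up for the example.

```python
# Tags that mark a word as part of a noun phrase (assumed grouping).
NP_TAGS = {"@>N", "@NH", "@N<"}


def tags_of(reading):
    """Return the function tags of a reading written as 'word/@TAG ...'."""
    return [token.rsplit("/", 1)[1] for token in reading.split()]


def np_hostile_penalty(reading):
    """One penalty point per word analysed as part of a noun phrase."""
    return sum(tag in NP_TAGS for tag in tags_of(reading))


def np_friendly_penalty(reading):
    """One penalty point per word analysed as NOT part of a noun phrase."""
    return sum(tag not in NP_TAGS for tag in tags_of(reading))


def select_extremes(readings):
    """Pick the least penalised reading under each penalty scheme."""
    hostile_choice = min(readings, key=np_hostile_penalty)
    friendly_choice = min(readings, key=np_friendly_penalty)
    return hostile_choice, friendly_choice


READINGS = [
    "the/@>N inlet/@>N and/@CC exhaust/@>N manifolds/@NH are/@V mounted/@V "
    "on/@AH opposite/@>N sides/@NH of/@N< the/@>N cylinder/@NH head/@V",
    "the/@>N inlet/@>N and/@CC exhaust/@>N manifolds/@NH are/@V mounted/@V "
    "on/@AH opposite/@>N sides/@NH of/@N< the/@>N cylinder/@>N head/@NH",
]

hostile, friendly = select_extremes(READINGS)
print(hostile.split()[-2:])
print(friendly.split()[-2:])
```

With these unit penalties the NP-hostile scheme keeps the reading in which head is a verb, and the NP-friendly scheme keeps the one in which cylinder head is a noun phrase.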
4.4 Extraction of noun phrases 
An unambiguous sentence reading is a linear se- 
quence of symbols, and extracting noun phrases from 
this kind of data is a simple pattern matching task. 
In the present version of the system, I have used 
the gawk program that allows the use of regular ex- 
pressions. With gawk's gsub function, the bound- 
aries of the longest non-overlapping expressions that 
satisfy the search key can be marked. If we formu- 
late our search query as something like the following 
schematic regular expression:
\[\[M>N+ \[CC M>N+\]*\]* HEAD
 \[N< \[D/M>N+ \[CC D/M>N+\]*\]* HEAD\]*\]
where
'\[' and '\]'   are for grouping,
'+'             stands for one or more
                occurrences of its argument,
'*'             stands for zero or more
                occurrences of its argument,
'M>N'           stands for premodifiers,
'D/M>N'         stands for determiners and
                premodifiers,
'HEAD'          stands for nominal heads
                except pronouns,
'N<'            stands for prepositions
                starting a postmodifying
                prepositional phrase,
and do some additional formatting and 'cleaning',
the above two finite-state analyses will look like the
following 13:
the
    np: inlet and exhaust manifold
are mounted on
    np: opposite side of the cylinder
head, the
    np: exhaust manifold
channelling the
    np: gas
to a
    np: single exhaust pipe
and
    np: silencer system

the
    np: inlet and exhaust manifold
are mounted on
    np: opposite side of the cylinder head
, the
    np: exhaust manifold
channelling the
    np: gas
to a
    np: single exhaust pipe
and
    np: silencer system
13 Note that the noun phrase heads are here given in
the base form, hence the absence of the plural form of
e.g. 'manifold'.
The proposed noun phrases are given on indented 
lines, each marked with the symbol 'np:'. The can- 
didate noun phrases are then subjected to further 
routines: all candidate noun phrases with at least 
one occurrence in the output of both the NP-hostile 
and NP-friendly parsers are labelled with the sym- 
bol 'ok:', and the remaining candidates are labelled 
as uncertain, with the symbol '?:'. From the outputs 
given above, the following list can be produced: 
ok: inlet and exhaust manifold 
ok: exhaust manifold 
ok: gas 
ok: single exhaust pipe 
ok: silencer system 
?: opposite side of the cylinder 
?: opposite side of the cylinder head 
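The extraction and labelling steps can be approximated in Python as follows. This is a sketch, not the gawk script used in the system: Python's re module stands in for gawk's gsub, the word/POS/tag input format and the one-letter symbol coding are assumptions made for the example, and heads are left in their running-text form rather than reduced to base forms. The optional groups in NP_PATTERN accept the same sequences as the starred groups of the schematic pattern in Section 4.4.

```python
import re

# One letter per word: D determiner, M premodifier, C coordinator,
# H nominal head (non-pronoun), P postmodifying preposition, X other.
NP_PATTERN = re.compile(r"(?:M+(?:CM+)*)?H(?:P(?:[DM]+(?:C[DM]+)*)?H)*")


def symbol(pos, tag):
    """Map a part of speech and function tag to a pattern symbol."""
    if tag == "@>N":
        return "D" if pos == "DET" else "M"
    if tag == "@NH":
        return "X" if pos == "PRON" else "H"
    if tag == "@CC":
        return "C"
    if tag == "@N<":
        return "P"
    return "X"


def extract_nps(reading):
    """Return the longest non-overlapping noun phrase candidates."""
    tokens = [token.rsplit("/", 2) for token in reading.split()]
    symbols = "".join(symbol(pos, tag) for _, pos, tag in tokens)
    return [
        " ".join(word for word, _, _ in tokens[m.start():m.end()])
        for m in NP_PATTERN.finditer(symbols)
    ]


def label_candidates(hostile_nps, friendly_nps):
    """Mark candidates found in both outputs 'ok', all others '?'."""
    agreed = set(hostile_nps) & set(friendly_nps)
    ordered = dict.fromkeys(hostile_nps + friendly_nps)  # keeps first-seen order
    return [("ok" if np in agreed else "?", np) for np in ordered]


COMMON = ("the/DET/@>N inlet/N/@>N and/CC/@CC exhaust/N/@>N manifolds/N/@NH "
          "are/V/@V mounted/PCP2/@V on/PREP/@AH opposite/A/@>N sides/N/@NH "
          "of/PREP/@N< the/DET/@>N ")
HOSTILE = COMMON + "cylinder/N/@NH head/V/@V"
FRIENDLY = COMMON + "cylinder/N/@>N head/N/@NH"

labelled = label_candidates(extract_nps(HOSTILE), extract_nps(FRIENDLY))
for mark, np in labelled:
    print(f"{mark}: {np}")
```

Only the first clause of the sample sentence is used here; the candidate that both readings agree on is marked 'ok', and the two competing analyses around cylinder head are marked '?'.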
The linguistic analysis is relatively neutral as to 
what is to be extracted from it. Here we have con- 
centrated on noun phrase extraction, but from this 
kind of input, also many other types of construction 
could be extracted, e.g. simple verb-argument struc- 
tures. 
5 The syntactic description 
This section outlines the syntactic description that I 
have written for NPtool purposes. The ENGTWOL
lexicon or the disambiguation constraints will not be 
described further in this paper; they have been doc- 
umented extensively elsewhere (see the relevant ar- 
ticles in Karlsson & al. \[forthcoming\]). 
According to the SIMPR experiences, the vast majority
of index terms represent relatively few constructions.
By far the most common construction
is a nominal head with optional, potentially coordi- 
nated premodifiers and postmodifying prepositional 
phrases, typically of-phrases. The remainder, less
than 10%, consists almost entirely of relatively sim- 
ple verb-NP patterns. 
The syntactic description used in SIMPR em- 
ployed some 30 dependency-oriented syntactic func- 
tion tags, which differentiate (to some extent) be- 
tween various kinds of verbal constructions, syntac- 
tic functions of nominal heads, and so on. Some of 
the ambiguity that survives ENGCG parsing is in 
part due to these distinctions \[Anttila, forthcoming\]. 
The relatively simple needs of an index term ex- 
traction utility on the one hand, and the relative 
abundance of distinctions in the ENGCG syntactic 
description on the other, suggest that a less distinc- 
tive syntactic description might be more optimal for 
the present purposes: a more shallow description 
would entail less remaining ambiguity without un- 
duly compromising its usefulness e.g. for an indexing 
application. 
5.1 Syntactic tags 
I have designed a new syntactic grammar scheme 
that employs seven function tags. These tags cap- 
italise on the opposition between noun phrases and 
other constructions on the one hand, and between 
heads and modifiers, on the other. Here we will not 
go into details; a gloss with a simple illustration will 
suffice.
• @V represents auxiliary and main verbs as well
as the infinitive marker to in both finite and
non-finite constructions. For instance:
She should/@V know/@V what to/@V do/@V.
• @NH represents nominal heads, especially
nouns, pronouns, numerals, abbreviations and -ing- 
forms. Note that of adjectival categories, only those 
with the morphological feature <Nominal>, e.g. En- 
glish, are granted the @NH status: all other adjec- 
tives (and -ed-forms) are regarded as too unconven- 
tional nominal heads to be granted this status in the 
present description. An example: 
The English/@NH may like the conventional.
• @>N represents determiners and premodifiers
of nominals (the angle-bracket '>' indicates the di- 
rection in which the head is to be found). The head 
is the following nominal with the tag @NH, or a pre- 
modifier in between. An example: 
the/@>N fat/@>N butcher's/@>N wife
• @N< represents prepositional phrases that unambiguously
postmodify a preceding nominal head.
Such unambiguously postmodifying constructions 
are typically of two types: (i) in the absence of cer- 
tain verbs like 'accuse', postnominal of-phrases and 
(ii) preverbal NP--PP sequences, e.g. 
The man in/@N< the moon had
a glass of/@N< ale.
Currently the description does not account for other 
types of postmodifier, e.g. postmodifying adjectives, 
numerals, other nominals, or clausal constructions. 
• @CC and @CS represent co-ordinating and subordinating
conjunctions, respectively:
Either/@CC you or/@CC I will go
if/@CS necessary.
• @AH represents the 'residual': adjectival heads, 
adverbials of various kinds, adverbs (also intensi- 
fiers), and also those of the prepositional phrases that 
cannot be dependably analysed as a postmodifier. 
An example is in order: 
There/@AH have always/@AH been very/@AH
many people in/@AH this area.
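The tag glosses in this section can be collected into a small lookup table. The following Python sketch is only an illustration and is not part of NPtool: the names `NPTOOL_TAGS` and `parse_annotated` are hypothetical, and the `word/@TAG` notation is simply the one used in the examples of this section.

```python
# Illustrative only: the seven function tags as glossed in Section 5.1.
NPTOOL_TAGS = {
    "@V": "auxiliary or main verb, or the infinitive marker 'to'",
    "@NH": "nominal head",
    "@>N": "determiner or premodifier of a nominal (head to the right)",
    "@N<": "unambiguously postmodifying prepositional phrase",
    "@CC": "co-ordinating conjunction",
    "@CS": "subordinating conjunction",
    "@AH": "residual: adjectival head, adverbial, other prepositional phrase",
}

def parse_annotated(text):
    """Split 'word/@TAG' tokens into (word, tag) pairs; tag is None if absent."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if sep and tag in NPTOOL_TAGS:
            pairs.append((word, tag))
        else:
            # No slash, or the part after the slash is not a known tag.
            pairs.append((token, None))
    return pairs

print(parse_annotated("the/@>N fat/@>N butcher's/@>N wife"))
```

Untagged words, such as the head noun in the example, are returned with `None` instead of a tag.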
5.2 Syntactic constraints 
The syntactic grammar contains some 120 syntactic 
constraints. Like the morphological disambiguation 
constraints, these constraints are essentially negative 
partial linear-precedence definitions of the syntactic 
categories. 
The present grammar is a partial expression of four 
general grammar statements: 
1. Part of speech determines the order of determin- 
ers and modifiers. 
2. Only likes coordinate. 
3. A determiner or a modifier has a head. 
4. An auxiliary is followed by a main verb. 
We will give only one illustration of how these 
general statements can be expressed in Constraint 
Grammar. Let us give a partial paraphrase of the 
statement Part of speech determines the order of determiners
and modifiers: 'A premodifying noun occurs
closest to its head'. In other words, premodifiers
from other parts of speech do not immediately fol- 
low a premodifying noun. Therefore, a noun in the 
nominative immediately followed by an adjective is 
not a premodifier. Thus a constraint in the grammar 
would discard the @>N tag of Harry in the following 
sample sentence, where Harry is directly followed by 
an unambiguous adjective: 
("<*is>"
    ("be" <SVC/N> <SVC/A> V PRES SG3 (@V)))
("<*harry>"
    ("harry" <Proper> N NOM SG (@NH @>N)))
("<foolish>"
    ("foolish" A ABS (@AH)))
("<$?>")
We require that the noun in question is a nominative 
because premodifying nouns in the genitive can occur 
also before adjectival premodifiers; witness Harry's 
in Harry's foolish self. 
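The effect of this constraint can be sketched in Python. This is a simplified illustration, not the Constraint Grammar formalism itself: the `Cohort` class and the function name are hypothetical, and the sketch assumes that morphological disambiguation has already left each word with a single reading, so that "unambiguous adjective" reduces to the feature `A` being present.

```python
from dataclasses import dataclass

@dataclass
class Cohort:
    word: str
    features: frozenset  # morphological features of the one remaining reading
    syntags: list        # candidate syntactic function tags

def discard_premodifier_before_adjective(sentence):
    """Remove @>N from a nominative noun directly followed by an adjective."""
    for current, following in zip(sentence, sentence[1:]):
        is_nominative_noun = {"N", "NOM"} <= current.features
        next_is_adjective = "A" in following.features
        if (is_nominative_noun and next_is_adjective
                and "@>N" in current.syntags
                and len(current.syntags) > 1):  # safeguard: keep the last tag
            current.syntags.remove("@>N")
    return sentence

sentence = [
    Cohort("is", frozenset({"V", "PRES", "SG3"}), ["@V"]),
    Cohort("harry", frozenset({"N", "NOM", "SG"}), ["@NH", "@>N"]),
    Cohort("foolish", frozenset({"A", "ABS"}), ["@AH"]),
]
discard_premodifier_before_adjective(sentence)
print(sentence[1].syntags)  # ['@NH']
```

A genitive noun such as Harry's carries `GEN` rather than `NOM`, so the condition does not fire and its @>N tag is left in place.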
5.3 Evaluation 
The present syntax has been applied to large 
amounts of journalistic and technical text (news- 
papers, abstracts on electrical engineering, manuals 
on car maintenance, etc.), and the analysis of some 
20,000-30,000 words has been proofread to get an 
estimate of the accuracy of the parser. 
After the application of the NPtool syntax, some 
93-96% of all words become syntactically unambiguous,
with an error rate of less than 1% 14.
To find out how much ambiguity remains at the 
sentence level, I also applied a 'NP-neutral' version 15 
of the finite-state parser on a 25,500 word text from 
The Grolier Electronic Encyclopaedia. The results 
are given in Figure 1. 
Figure 1: Ambiguity rates after finite-state parsing 
in a text of 1,495 sentences (25,500 words). R in- 
dicates the number of analyses per sentence, and F 
indicates the frequency of these sentences. 
 R    F  |  R    F  |  R    F  |  R    F
 1  960  |  6   19  | 12    6  | 32    2
 2  304  |  7    3  | 14    3  | 48    2
 3   54  |  8   28  | 16    5  | 64    1
 4   93  |  9    3  | 24    1  | 72    1
 5    4  | 10    3  | 28    1  |
Some 64% (960) of the 1,495 sentences became 
syntactically unambiguous, while only some 2% of 
all sentences have more than ten readings, the
worst case being a sentence with 72 analyses.
This compares favourably with the ENGCG performance:
after ENGCG parsing, 23.5% of all sentences
still had more sentence readings than the worst
case in NPtool syntax.
6 Performance of NPtool 
Various kinds of metrics can be proposed for the eval- 
uation of a noun phrase extractor; our main metrics 
are recall and precision, defined as follows 16:
• Recall: the ratio 'retrieved intended NPs' 17 /
'all intended NPs'
• Precision: the ratio 'retrieved intended NPs' /
'all retrieved NPs'
14 This figure also covers errors due to previous stages
of analysis.
15 I.e. a parser which does not contain the mechanism
for penalising or favouring noun phrase analyses; see
Section 4.3 above.
16 This definition also agrees with that used in Rausch
et al. \[1992\].
17 An 'intended NP' is the longest non-overlapping
match of the search query given in the extraction phase.
To paraphrase, a recall of less than 100% indicates 
that the system missed some of the desired noun 
phrases, while a precision of less than 100% indicates 
that the system retrieved something that is not re- 
garded as a correct result. 
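The two metrics can be made concrete with a short Python sketch. It takes the reading paraphrased above, in which precision is the share of retrieved NPs that were intended; the function name and the toy NP sets are hypothetical and are not data from the evaluation.

```python
def recall_and_precision(retrieved, intended):
    """Return (recall, precision) for two collections of noun phrases."""
    retrieved, intended = set(retrieved), set(intended)
    if not retrieved or not intended:
        raise ValueError("both collections must be non-empty")
    hits = retrieved & intended            # retrieved intended NPs
    recall = len(hits) / len(intended)     # share of intended NPs found
    precision = len(hits) / len(retrieved) # share of retrieved NPs wanted
    return recall, precision

# Toy example: both intended NPs are found, plus one superfluous candidate.
intended = {"the man in the moon", "a glass of ale"}
retrieved = {"the man in the moon", "a glass of ale", "had"}
recall, precision = recall_and_precision(retrieved, intended)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case nothing is missed, so recall is 100%, while the one superfluous candidate lowers precision to two thirds.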
The performance of the whole system has been 
evaluated against several texts from different do- 
mains. In all, the analysis of some 20,000 words has 
been manually checked. 
If we wish to extract relatively complex noun 
phrases with optional coordination, premodifiers and 
postmodifiers (see the search query above in Sec- 
tion 4.4), we reach a recall of 98.5-100%, with a 
precision of some 95-98%. 
As indicated in Section 4.4, the extraction utility 
annotates each proposed noun phrase as a 'sure hit' 
('ok:') or as an 'uncertain hit' ('?:'). This distinction 
is quite useful for manual validation: approximately 
95% of all superfluous noun phrase candidates are 
marked with the question mark. 
7 Conclusion 
In terms of accuracy, NPtool is probably one of the 
best in the field. In terms of speed, much remains to 
be optimised. Certainly the computationally most 
demanding tasks - disambiguation and parsing - are 
already carried out quite efficiently, but the more 
trivial parts of the system could be improved. 
8 Acknowledgements 
I wish to thank Krister Linden, Pasi Tapanainen and 
two anonymous referees for useful comments on an 
earlier version of this paper. The usual disclaimers 
hold. 

References 
\[Anttila, forthcoming\] Anttila, A. (forthcoming).
How to recognise subjects in English. In Karlsson
& al.
\[Bourigault, 1992\] Bourigault, D. 1992. Surface 
grammatical analysis for the extraction of termi- 
nological noun phrases. In Proceedings of the fif- 
teenth International Conference on Computational 
Linguistics. COLING-92, Vol. III. Nantes, France.
977-981. 
\[Church, 1988\] Church, K. 1988. A Stochastic Parts 
Program and Noun Phrase Parser for Unrestricted 
Text. Proceedings of the Second Conference on Applied
Natural Language Processing, ACL. 136-143.
\[Church, 1992\] Church, K. 1992. Current Practice in 
Part of Speech Tagging and Suggestions for the 
Future, in Simmons (ed.), Sborník prací: In Honor
of Henry Kučera. Michigan Slavic Studies.
\[van der Eijk, 1993\] van der Eijk, P. 1993. Automating
the acquisition of bilingual terminology. Proceedings
of EACL'93. Utrecht, The Netherlands.
\[Heikkilä, forthcoming a\] Heikkilä, J. (forthcoming
a). A TWOL-Based Lexicon and Feature System
for English. In Karlsson & al.
\[Heikkilä, forthcoming b\] Heikkilä, J. (forthcoming
b). ENGTWOL English lexicon: solutions and
problems. In Karlsson & al.
\[Karlsson, 1990\] Karlsson, F. 1990. 
Constraint Grammar as a Framework for Pars- 
ing Running Text. In H. Karlgren (ed.), Papers 
presented to the 13th International Conference on 
Computational Linguistics, Vol. 3. Helsinki. 168- 
173. 
\[Karlsson, forthcoming\] Karlsson, F. (forthcoming). 
Designing a parser for unrestricted text. In Karlsson
& al.
\[Karlsson et al., forthcoming\] Karlsson, F., Voutilainen,
A., Heikkilä, J. and Anttila, A. Constraint
Grammar: a Language-Independent System
for Parsing Unrestricted Text. Mouton de
Gruyter.
\[Koskenniemi, 1990\] Koskenniemi, 
K. (1990). Finite-state parsing and disambigua- 
tion. In Karlgren, H. (ed.) COLING-90. Papers 
presented to the 13th International Conference on 
Computational Linguistics, Vol. 2. Helsinki, Fin- 
land. 229-232. 
\[Koskenniemi, Tapanainen and Voutilainen, 1992\] 
Koskenniemi, K., Tapanainen, P. and Voutilainen, 
A. (1992). Compiling and using finite-state syntactic
rules. In Proceedings of the fifteenth International
Conference on Computational Linguistics.
COLING-92, Vol. I. Nantes, France. 156-162.
\[Quirk, Greenbaum, Leech and Svartvik, 1985\] 
Quirk, R., Greenbaum, S., Leech, G. and Svartvik,
J. 1985. A Comprehensive Grammar of the English 
Language. London & New York: Longman. 
\[Rausch, Norrback and Svensson, 1992\] Rausch, B.,
Norrback, R., and Svensson, T. 1992. Excerpering
av nominalfraser ur löpande text. Manuscript.
Stockholms universitet, Institutionen för lingvistik.
\[Salton and McGill, 1983\] Salton, G. and McGill,
M. 1983. Introduction to Modern Information Re- 
trieval. McGraw-Hill, Auckland. 
\[Sampson, 1987a\] Sampson, G. 1987. Probabilistic 
Models of Analysis. In Garside, Leech and Samp- 
son (eds.) 1987. 16-29. 
\[Sampson, 1987b\] Sampson, G. 1987. The grammat- 
ical database and parsing scheme. In Garside, 
Leech and Sampson (eds.) 1987. 82-96.
\[Smart (Ed.), forthcoming\] Smart (Ed.) (forthcom- 
ing). Structured Information Management: Pro- 
cessing and Retrieval. (provisional title). 
\[Tapanainen, 1991\] Tapanainen, P. 1991. Äärellisinä
automaatteina esitettyjen kielioppisääntöjen soveltaminen
luonnollisen kielen jäsentäjässä (Natural
language parsing with finite-state syntactic
rules). Master's thesis. Dept. of computer science, 
University of Helsinki. 
\[Taylor, Graver and Briscoe, 1989\] Taylor, L., Gra- 
ver, C. and Briscoe, T. 1989. The Syntactic Regu- 
larity of English Noun Phrases. In Proceedings of 
the Fourth Conference of the European Chapter of 
the ACL. 256-263. 
\[Voutilainen, forthcoming a\] Voutilainen, A. (forth- 
coming a). Context-sensitive disambiguation. In 
Karlsson & al. 
\[Voutilainen, forthcoming b\] Voutilainen, A. (forth- 
coming b). Experiments with heuristics. In Karls- 
son & al. 
\[Voutilainen, forthcoming 1993\] 
Voutilainen, A. (forthcoming 1993). Designing a 
parsing grammar. 
\[Voutilainen, Heikkilä and Anttila, 1992\]
Voutilainen, A., Heikkilä, J. and Anttila, A.
(1992). Constraint Grammar of English: A
Performance-Oriented Introduction. Publication
No. 21, Department of General Linguistics, Uni- 
versity of Helsinki. 
\[Voutilainen and Tapanainen, 1993\] Voutilainen, A.
and Tapanainen, P. 1993. Ambiguity resolution in 
a reductionistic parser. Proceedings of EACL'93. 
Utrecht, Holland. 
