Error-Driven Pruning of Treebank Grammars 
for Base Noun Phrase Identification 
Claire Cardie and David Pierce 
Department of Computer Science 
Cornell University 
Ithaca, NY 14853 
cardie, pierce@cs.cornell.edu 
Abstract 
Finding simple, non-recursive, base noun phrases is 
an important subtask for many natural language 
processing applications. While previous empirical 
methods for base NP identification have been rather 
complex, this paper instead proposes a very simple 
algorithm that is tailored to the relative simplicity 
of the task. In particular, we present a corpus-based 
approach for finding base NPs by matching part-of- 
speech tag sequences. The training phase of the al- 
gorithm is based on two successful techniques: first 
the base NP grammar is read from a "treebank" cor- 
pus; then the grammar is improved by selecting rules 
with high "benefit" scores. Using this simple algo- 
rithm with a naive heuristic for matching rules, we 
achieve surprising accuracy in an evaluation on the 
Penn Treebank Wall Street Journal. 
1 Introduction 
Finding base noun phrases is a sensible first step 
for many natural language processing (NLP) tasks: 
Accurate identification of base noun phrases is ar- 
guably the most critical component of any partial 
parser; in addition, information retrieval systems 
rely on base noun phrases as the main source of 
multi-word indexing terms; furthermore, the psy- 
cholinguistic studies of Gee and Grosjean (1983) in- 
dicate that text chunks like base noun phrases play 
an important role in human language processing. In 
this work we define base NPs to be simple, nonre- 
cursive noun phrases -- noun phrases that do not 
contain other noun phrase descendants. The brack- 
eted portions of Figure 1, for example, show the base 
NPs in one sentence from the Penn Treebank Wall 
Street Journal (WSJ) corpus (Marcus et al., 1993). 
Thus, the string the sunny confines of resort towns 
like Boca Raton and Hot Springs is too complex to 
be a base NP; instead, it contains four simpler noun 
phrases, each of which is considered a base NP: the 
sunny confines, resort towns, Boca Raton, and Hot 
Springs. 
When [it] is [time] for [their biannual powwow] ,
[the nation] 's [manufacturing titans] typically
jet off to [the sunny confines] of [resort towns]
like [Boca Raton] and [Hot Springs].
Figure 1: Base NP Examples
Previous empirical research has addressed the
problem of base NP identification. Several algorithms
identify "terminological phrases" -- certain
base noun phrases with initial determiners and modifiers
removed: Justeson & Katz (1995) look for
repeated phrases; Bourigault (1992) uses a hand- 
crafted noun phrase grammar in conjunction with 
heuristics for finding maximal length noun phrases; 
Voutilainen's NPTool (1993) uses a handcrafted lex- 
icon and constraint grammar to find terminological 
noun phrases that include phrase-final prepositional 
phrases. Church's PARTS program (1988), on the 
other hand, uses a probabilistic model automati- 
cally trained on the Brown corpus to locate core 
noun phrases as well as to assign parts of speech. 
More recently, Ramshaw & Marcus (In press) ap- 
ply transformation-based learning (Brill, 1995) to 
the problem. Unfortunately, it is difficult to directly 
compare approaches. Each method uses a slightly 
different definition of base NP. Each is evaluated on 
a different corpus. Most approaches have been eval- 
uated by hand on a small test set rather than by au- 
tomatic comparison to a large test corpus annotated 
by an impartial third party. A notable exception is 
the Ramshaw & Marcus work, which evaluates their 
transformation-based learning approach on a base 
NP corpus derived from the Penn Treebank WSJ, 
and achieves precision and recall levels of approxi- 
mately 93%. 
This paper presents a new algorithm for identi- 
fying base NPs in an arbitrary text. Like some of 
the earlier work on base NP identification, ours is 
a trainable, corpus-based algorithm. In contrast to 
other corpus-based approaches, however, we hypoth- 
esized that the relatively simple nature of base NPs 
would permit their accurate identification using cor- 
respondingly simple methods. Assume, for example, 
that we use the annotated text of Figure 1 as our 
training corpus. To identify base NPs in an unseen 
text, we could simply search for all occurrences of the 
base NPs seen during training -- it, time, their bian- 
nual powwow, ..., Hot Springs -- and mark them 
as base NPs in the new text. However, this method 
would certainly suffer from data sparseness. Instead, 
we use a similar approach, but back off from lexical 
items to parts of speech: we identify as a base NP 
any string having the same part-of-speech tag se- 
quence as a base NP from the training corpus. The 
training phase of the algorithm employs two previ- 
ously successful techniques: like Charniak's (1996) 
statistical parser, our initial base NP grammar is 
read from a "treebank" corpus; then the grammar 
is improved by selecting rules with high "benefit" 
scores. Our benefit measure is identical to that used 
in transformation-based learning to select an ordered 
set of useful transformations (Brill, 1995). 
Using this simple algorithm with a naive heuristic 
for matching rules, we achieve surprising accuracy 
in an evaluation on two base NP corpora of varying 
complexity, both derived from the Penn Treebank 
WSJ. The first base NP corpus is that used in the 
Ramshaw & Marcus work. The second espouses a 
slightly simpler definition of base NP that conforms 
to the base NPs used in our Empire sentence ana- 
lyzer. These simpler phrases appear to be a good 
starting point for partial parsers that purposely de- 
lay all complex attachment decisions to later phases 
of processing. 
Overall results for the approach are promising. 
For the Empire corpus, our base NP finder achieves 
94% precision and recall; for the Ramshaw & Marcus 
corpus, it obtains 91% precision and recall, which is 
2% less than the best published results. Ramshaw 
& Marcus, however, provide the learning algorithm 
with word-level information in addition to the part- 
of-speech information used in our base NP finder. 
By controlling for this disparity in available knowl- 
edge sources, we find that our base NP algorithm 
performs comparably, achieving slightly worse preci- 
sion (-1.1%) and slightly better recall (+0.2%) than 
the Ramshaw & Marcus approach. Moreover, our 
approach offers many important advantages that 
make it appropriate for many NLP tasks: 
* Training is exceedingly simple.
* The base NP bracketer is very fast, operating
in time linear in the length of the text.
* The accuracy of the treebank approach is good
for applications that require or prefer fairly simple
base NPs.
* The learned grammar is easily modified for use
with corpora that differ from the training texts.
Rules can be selectively added to or deleted
from the grammar without worrying about ordering
effects.
* Finally, our benefit-based training phase offers
a simple, general approach for extracting grammars
other than noun phrase grammars from
annotated text.
Note also that the treebank approach to base NP 
identification obtains good results in spite of a very 
simple algorithm for "parsing" base NPs. This is ex- 
tremely encouraging, and our evaluation suggests at 
least two areas for immediate improvement. First, 
by replacing the naive match heuristic with a proba- 
bilistic base NP parser that incorporates lexical pref- 
erences, we would expect a nontrivial increase in re- 
call and precision. Second, many of the remaining 
base NP errors tend to follow simple patterns; these 
might be corrected using localized, learnable repair 
rules. 
The remainder of the paper describes the specifics 
of the approach and its evaluation. The next section 
presents the training and application phases of the 
treebank approach to base NP identification in more 
detail. Section 3 describes our general approach for 
pruning the base NP grammar as well as two instan- 
tiations of that approach. The evaluation and a dis- 
cussion of the results appear in Section 4, along with 
techniques for reducing training time and an initial 
investigation into the use of local repair heuristics. 
2 The Treebank Approach 
Figure 2 depicts the treebank approach to base NP 
identification. For training, the algorithm requires 
a corpus that has been annotated with base NPs. 
More specifically, we assume that the training corpus 
is a sequence of words w_1, w_2, ..., along with a set of
base NP annotations b(i_1, j_1), b(i_2, j_2), ..., where b(i, j)
indicates that the NP brackets words i through j:
[NP w_i, ..., w_j]. The goal of the training phase is to
create a base NP grammar from this training corpus:
1. Using any available part-of-speech tagger, assign
a part-of-speech tag t_i to each word w_i in
the training corpus.
2. Extract from each base noun phrase b(i, j) in the
training corpus its sequence of part-of-speech
tags t_i, ..., t_j to form base NP rules, one rule
per base NP.
3. Remove any duplicate rules. 
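The three training steps can be illustrated with a minimal Python sketch. It assumes the corpus is already tagged (step 1) and that base NP annotations are given as inclusive word-index pairs; the names `extract_rules`, `tagged`, and `base_nps` are hypothetical and chosen only for this illustration. The sentence and tags are those of the training example in Figure 2.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, ...]   # a base NP rule is a sequence of POS tags
Span = Tuple[int, int]   # inclusive word indices (i, j) of one base NP


def extract_rules(tags: List[str], base_nps: List[Span]) -> Set[Rule]:
    """Step 2: read one rule per annotated base NP.
    Step 3: building a set removes duplicate rules."""
    return {tuple(tags[i:j + 1]) for i, j in base_nps}


# Step 1 is assumed done: the tagged training sentence of Figure 2.
tagged = [
    ("When", "WRB"), ("it", "PRP"), ("is", "VBZ"), ("time", "NN"),
    ("for", "IN"), ("their", "PRP$"), ("biannual", "JJ"), ("powwow", "NN"),
    (",", ","), ("the", "DT"), ("nation", "NN"), ("'s", "POS"),
    ("manufacturing", "VBG"), ("titans", "NNS"), ("typically", "RB"),
    ("jet", "VBP"), ("off", "RP"), ("to", "TO"), ("the", "DT"),
    ("sunny", "JJ"), ("confines", "NNS"), ("of", "IN"), ("resort", "NN"),
    ("towns", "NNS"), ("like", "IN"), ("Boca", "NNP"), ("Raton", "NNP"),
    ("and", "CC"), ("Hot", "NNP"), ("Springs", "NNP"), (".", "."),
]
tags = [tag for _, tag in tagged]

# Base NP annotations b(i, j) as 0-based inclusive index pairs.
base_nps = [(1, 1), (3, 3), (5, 7), (9, 10), (12, 13),
            (18, 20), (22, 23), (25, 26), (28, 29)]

grammar = extract_rules(tags, base_nps)
for rule in sorted(grammar):
    print("<" + " ".join(rule) + ">")
```

Nine base NPs yield eight rules here, because Boca Raton and Hot Springs share the same tag sequence.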
The resulting "grammar" can then be used to iden- 
tify base NPs in a novel text. 
1. Assign part-of-speech tags t_1, t_2, ... to the input
words w_1, w_2, ...
2. Proceed through the tagged text from left
to right, at each point matching the NP
rules against the remaining part-of-speech tags
t_i, t_{i+1}, ... in the text.
3. If there are multiple rules that match beginning
at t_i, use the longest matching rule R. Add the
new base noun phrase b(i, i+|R|-1) to the set of
base NPs. Continue matching at t_{i+|R|}.

Training Phase

Training Corpus
When [it] is [time] for [their biannual powwow],
[the nation]'s [manufacturing titans] typically jet
off to [the sunny confines] of [resort towns] like
[Boca Raton] and [Hot Springs].

Tagged Text
When/WRB [it/PRP] is/VBZ [time/NN] for/IN [their/PRP$
biannual/JJ powwow/NN] ,/, [the/DT nation/NN] 's/POS
[manufacturing/VBG titans/NNS] typically/RB jet/VBP
off/RP to/TO [the/DT sunny/JJ confines/NNS] of/IN
[resort/NN towns/NNS] like/IN [Boca/NNP Raton/NNP]
and/CC [Hot/NNP Springs/NNP] ./.

NP Rules
<PRP>
<NN>
<PRP$ JJ NN>
<DT NN>
<VBG NNS>
<DT JJ NNS>
<NN NNS>
<NNP NNP>

Application Phase

Novel Text
Not this year. National Association of Manufacturers settled
on the Hoosier capital of Indianapolis for its next meeting.
And the city decided to treat its guests more like royalty or
rock stars than factory owners.

Tagged Text
Not/RB this/DT year/NN ./. National/NNP
Association/NNP of/IN Manufacturers/NNP settled/VBD
on/IN the/DT Hoosier/NNP capital/NN of/IN
Indianapolis/NNP for/IN its/PRP$ next/JJ meeting/NN ./.
And/CC the/DT city/NN decided/VBD to/TO treat/VB
its/PRP$ guests/NNS more/JJR like/IN royalty/NN or/CC
rock/NN stars/NNS than/IN factory/NN owners/NNS ./.

NP Bracketed Text
Not [this year]. [National Association] of [Manufacturers]
settled on [the Hoosier capital] of [Indianapolis] for [its next
meeting]. And [the city] decided to treat [its guests] more
like [royalty] or [rock stars] than [factory owners].

Figure 2: The Treebank Approach to Base NP Identification
With the rules stored in an appropriate data struc- 
ture, this greedy "parsing" of base NPs is very fast. 
In our implementation, for example, we store the 
rules in a decision tree, which permits base NP iden- 
tification in time linear in the length of the tagged 
input text when using the longest match heuristic. 
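The greedy longest-match procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: it stores the rules in a nested-dictionary trie keyed by part-of-speech tag, which is one plausible reading of the "decision tree" mentioned above. The names `build_trie`, `bracket`, and the `END` marker are assumptions made for this example. Because each starting position scans at most as many tags as the longest rule, the running time is linear in the text length for a fixed grammar.

```python
from typing import Dict, Iterable, List, Tuple

Rule = Tuple[str, ...]
END = "<end>"  # hypothetical marker key: some rule ends at this trie node


def build_trie(rules: Iterable[Rule]) -> Dict:
    """Store tag-sequence rules in a nested-dict trie keyed by POS tag."""
    root: Dict = {}
    for rule in rules:
        node = root
        for tag in rule:
            node = node.setdefault(tag, {})
        node[END] = True
    return root


def bracket(tags: List[str], trie: Dict) -> List[Tuple[int, int]]:
    """Greedy left-to-right bracketing with the longest matching rule."""
    spans = []
    i = 0
    while i < len(tags):
        node, longest, k = trie, 0, i
        # Walk the trie as far as the upcoming tags allow.
        while k < len(tags) and tags[k] in node:
            node = node[tags[k]]
            k += 1
            if END in node:
                longest = k - i              # remember the longest full rule
        if longest:
            spans.append((i, i + longest - 1))   # b(i, i+|R|-1)
            i += longest                         # continue at t_{i+|R|}
        else:
            i += 1                               # no rule starts here
    return spans


# Rules read from the training sentence of Figure 2.
figure2_rules = [
    ("PRP",), ("NN",), ("PRP$", "JJ", "NN"), ("DT", "NN"),
    ("VBG", "NNS"), ("DT", "JJ", "NNS"), ("NN", "NNS"), ("NNP", "NNP"),
]
trie = build_trie(figure2_rules)

# its/PRP$ next/JJ meeting/NN ./. And/CC the/DT city/NN
spans = bracket(["PRP$", "JJ", "NN", ".", "CC", "DT", "NN"], trie)
print(spans)
```

On this fragment the sketch brackets "its next meeting" and "the city". Note that <NN> alone would also match "meeting" and "city", but the longer rules take precedence.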
Unfortunately, there is an obvious problem with 
the algorithm described above. There will be many 
unhelpful rules in the rule set extracted from the 
training corpus. These "bad" rules arise from four 
sources: bracketing errors in the corpus; tagging er- 
rors; unusual or irregular linguistic constructs (such 
as parenthetical expressions); and inherent ambigu- 
ities in the base NPs -- in spite of their simplicity. 
For example, the rule <VBG NNS>, which was extracted
from manufacturing/VBG titans/NNS in the
example text, is ambiguous, and will cause erroneous
bracketing in sentences such as The execs squeezed
in a few meetings before [boarding/VBG buses/NNS]
again. In order to have a viable mechanism for identifying
base NPs using this algorithm, the grammar
must be improved by removing problematic rules.
The next section presents two such methods for au- 
tomatically pruning the base NP grammar. 
3 Pruning the Base NP Grammar 
As described above, our goal is to use the base NP 
corpus to extract and select a set of noun phrase 
rules that can be used to accurately identify base 
NPs in novel text. Our general pruning procedure is 
shown in Figure 3. First, we divide the base NP cor- 
pus into two parts: a training corpus and a pruning 
corpus. The initial base NP grammar is extracted 
from the training corpus as described in Section 2. 
Next, the pruning corpus is used to evaluate the set 
of rules and produce a ranking of the rules in terms 
of their utility in identifying base NPs. More specif- 
ically, we use the rule set and the longest match 
heuristic to find all base NPs in the pruning corpus. 
Performance of the rule set is measured in terms of 
labeled precision (P): 
P = (# of correct proposed NPs) / (# of proposed NPs)
We then assign to each rule a score that denotes
the "net benefit" achieved by using the rule during
NP parsing of the pruning corpus. The benefit
of rule r is given by B_r = C_r - E_r, where C_r
is the number of NPs correctly identified by r, and
E_r is the number of precision errors for which r is
responsible (see footnote 1). A rule is considered
responsible for an error if it was the first rule to
bracket part of a reference NP, i.e., an NP in the
base NP training corpus.
Thus, rules that form erroneous bracketings are not
penalized if another rule previously bracketed part
of the same reference NP.
[Figure 3: Pruning the Base NP Grammar. Diagram labels: Training Corpus, Pruning Corpus, Improved Rule Set, Final Rule Set.]
For example, suppose the fragment containing 
base NPs Boca Raton, Hot Springs, and Palm Beach 
is bracketed as shown below. 
resort towns like
[NP1 Boca/NNP Raton/NNP, Hot/NNP]
[NP2 Springs/NNP], and
[NP3 Palm/NNP Beach/NNP]
Rule <NNP NNP , NNP> brackets NP1; <NNP>
brackets NP2; and <NNP NNP> brackets NP3. Rule
<NNP NNP , NNP> incorrectly identifies Boca Raton,
Hot as a noun phrase, so its score is -1. Rule
<NNP> incorrectly identifies Springs, but it is not
held responsible for the error because of the previous
error by <NNP NNP , NNP> on the same original
NP Hot Springs, so its score is 0. Finally, rule <NNP
NNP> receives a score of 1 for correctly identifying
Palm Beach as a base NP.
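The scoring in this example can be reproduced with a short Python sketch. It is an interpretation rather than the authors' code, and it makes two assumptions the text leaves open: an erroneous bracket that overlaps no reference NP is still charged to its rule, and an erroneous bracket is excused as soon as any reference NP it overlaps was already touched by an earlier bracket. The function name `benefit_scores` and the span encoding (inclusive token indices) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rule = Tuple[str, ...]
Span = Tuple[int, int]   # inclusive token indices


def benefit_scores(proposed: List[Tuple[Span, Rule]],
                   reference: List[Span]) -> Dict[Rule, int]:
    """Return B_r = C_r - E_r for every rule that fired.

    `proposed` holds (span, rule) pairs in left-to-right order.
    """
    ref_set = set(reference)
    touched = set()                       # reference NPs already bracketed in part
    scores: Dict[Rule, int] = defaultdict(int)
    for (start, end), rule in proposed:
        overlapped = [r for r in reference if r[0] <= end and start <= r[1]]
        if (start, end) in ref_set:
            scores[rule] += 1             # correct NP: counts toward C_r
        elif not any(r in touched for r in overlapped):
            scores[rule] -= 1             # first offender: counts toward E_r
        else:
            scores[rule] += 0             # cascaded error: rule is not charged
        touched.update(overlapped)
    return dict(scores)


# Tokens: Boca(0) Raton(1) ,(2) Hot(3) Springs(4) ,(5) and(6) Palm(7) Beach(8)
reference = [(0, 1), (3, 4), (7, 8)]
proposed = [
    ((0, 3), ("NNP", "NNP", ",", "NNP")),   # NP1: Boca Raton , Hot
    ((4, 4), ("NNP",)),                     # NP2: Springs
    ((7, 8), ("NNP", "NNP")),               # NP3: Palm Beach
]
scores = benefit_scores(proposed, reference)
print(scores)
```

Under these assumptions the three rules receive -1, 0, and 1 respectively, matching the scores described above.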
The benefit scores from evaluation on the pruning 
corpus are used to rank the rules in the grammar. 
With such a ranking, we can improve the rule set 
by discarding the worst rules. Thus far, we have 
investigated two iterative approaches for discarding 
rules, a thresholding approach and an incremental 
approach. We describe each, in turn, in the subsec- 
tions below. 
Footnote 1: This same benefit measure is also used in the R&M study,
but it is used to rank transformations rather than to rank NP
rules.
3.1 Threshold Pruning 
Given a ranking on the rule set, the threshold algo- 
rithm simply discards rules whose score is less than 
a predefined threshold R. For all of our experiments, 
we set R = 1 to select rules that propose more cor- 
rect bracketings than incorrect. The process of eval- 
uating, ranking, and discarding rules is repeated un- 
til no rules have a score less than R. For our evalua- 
tion on the WSJ corpus, this typically requires only 
four to five iterations. 
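The threshold loop can be sketched in a few lines of Python. The sketch assumes a caller-supplied `score` function that re-brackets the pruning corpus with the current rule set and returns a benefit per rule; that function, the name `threshold_prune`, and the toy scorer below are all hypothetical. Rules that never fire are treated as having benefit 0, so with R = 1 they are discarded as well.

```python
from typing import Callable, Dict, Set, Tuple

Rule = Tuple[str, ...]
Scorer = Callable[[Set[Rule]], Dict[Rule, int]]


def threshold_prune(rules: Set[Rule], score: Scorer,
                    threshold: int = 1) -> Set[Rule]:
    """Repeat evaluate/rank/discard until no rule scores below `threshold`."""
    current = set(rules)
    while True:
        benefits = score(current)     # benefits depend on the active rule set
        bad = {r for r in current if benefits.get(r, 0) < threshold}
        if not bad:
            return current
        current -= bad


LONG_RULE = ("NNP", "NNP", ",", "NNP")
SHORT_RULE = ("NNP", "NNP")


def toy_score(active: Set[Rule]) -> Dict[Rule, int]:
    """Made-up benefits: the long rule pre-empts the short one while present."""
    if LONG_RULE in active:
        return {LONG_RULE: -1, SHORT_RULE: 1}
    return {SHORT_RULE: 3}


kept = threshold_prune({LONG_RULE, SHORT_RULE}, toy_score)
print(kept)
```

In the toy run the first iteration discards the long rule, and the second iteration finds no rule below the threshold, so the loop stops.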
3.2 Incremental Pruning 
Thresholding provides a very coarse mechanism for 
pruning the NP grammar. In particular, because 
of interactions between the rules during bracketing, 
thresholding discards rules whose score might increase
in the absence of other rules that are also being
discarded. Consider, for example, the Boca Raton
fragments given earlier. In the absence of <NNP
NNP , NNP>, the rule <NNP NNP> would have received
a score of three for correctly identifying all
three NPs.
As a result, we explored a more fine-grained 
method of discarding rules: Each iteration of incre- 
mental pruning discards the N worst rules, rather 
than all rules whose score is less than some threshold.
In all of our experiments, we set N = 10. As
with thresholding, the process of evaluating, rank- 
ing, and discarding rules is repeated, this time until 
precision of the current rule set on the pruning cor- 
pus begins to drop. The rule set that maximized 
precision becomes the final rule set. 
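A minimal Python sketch of the incremental loop follows. It assumes two caller-supplied functions, `score` (benefit per rule on the pruning corpus) and `precision` (precision of a rule set on the pruning corpus); both, along with the name `incremental_prune` and the toy values, are hypothetical. The sketch also assumes that ties in precision count as "not yet dropping" and that an empty rule set is never evaluated.

```python
from typing import Callable, Dict, Set, Tuple

Rule = Tuple[str, ...]


def incremental_prune(rules: Set[Rule],
                      score: Callable[[Set[Rule]], Dict[Rule, int]],
                      precision: Callable[[Set[Rule]], float],
                      n: int = 10) -> Set[Rule]:
    """Drop the n worst rules per iteration until precision starts to fall."""
    current = set(rules)
    best, best_p = set(current), precision(current)
    while current:
        benefits = score(current)
        # Rank by benefit; the rule itself breaks ties deterministically.
        ranked = sorted(current, key=lambda r: (benefits.get(r, 0), r))
        candidate = current - set(ranked[:n])
        if not candidate:
            break                      # never evaluate an empty grammar
        p = precision(candidate)
        if p < best_p:
            break                      # precision begins to drop: stop
        current, best, best_p = candidate, set(candidate), p
    return best


# Toy setup with made-up benefits and precision values.
TOY_BENEFIT = {("A",): -5, ("B",): -1, ("C",): 2, ("D",): 7}
TOY_PRECISION = {4: 0.50, 3: 0.80, 2: 0.90, 1: 0.85}   # keyed by rule-set size

final = incremental_prune(
    set(TOY_BENEFIT),
    score=lambda active: {r: TOY_BENEFIT[r] for r in active},
    precision=lambda active: TOY_PRECISION[len(active)],
    n=1,
)
print(final)
```

With these toy numbers the two negative-benefit rules are removed one at a time, and the loop stops when removing a third rule would lower precision from 0.90 to 0.85.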
3.3 Human Review 
In the experiments below, we compare the thresh- 
olding and incremental methods for pruning the NP 
grammar to a rule set that was pruned by hand. 
When the training corpus is large, exhaustive re- 
view of the extracted rules is not practical. This 
is the case for our initial rule set, culled from the 
WSJ corpus, which contains approximately 4500 
base NP rules. Rather than identifying and dis- 
carding individual problematic rules, our reviewer 
identified problematic classes of rules that could be 
removed from the grammar automatically. In partic- 
ular, the goal of the human reviewer was to discard 
rules that introduced ambiguity or corresponded to 
overly complex base NPs. Within our partial parsing 
framework, these NPs are better identified by more 
informed components of the NLP system. Our re- 
viewer identified the following classes of rules as pos- 
sibly troublesome: rules that contain a preposition, 
period, or colon; rules that contain WH tags; rules 
that begin/end with a verb or adverb; rules that con- 
tain pronouns with any other tags; rules that contain 
misplaced commas or quotes; rules that end with 
adjectives. Rules covered under any of these classes 
were omitted from the human-pruned rule sets used 
in the experiments of Section 4. 
4 Evaluation 
To evaluate the treebank approach to base NP iden- 
tification, we created two base NP corpora. Each 
is derived from the Penn Treebank WSJ. The first 
corpus attempts to duplicate the base NPs used in the
Ramshaw & Marcus (R&M) study. The second corpus
contains slightly less complicated base NPs --
base NPs that are better suited for use with our
sentence analyzer, Empire (see footnote 2). By evaluating on both
corpora, we can measure the effect of noun phrase 
complexity on the treebank approach to base NP 
identification. In particular, we hypothesize that the 
treebank approach will be most appropriate when 
the base NPs are sufficiently simple. 
For all experiments, we derived the training, prun- 
ing, and testing sets from the 25 sections of Wall 
Street Journal distributed with the Penn Treebank 
II. All experiments employ 5-fold cross validation. 
More specifically, in each of five runs, a different fold 
is used for testing the final, pruned rule set; three of 
the remaining folds comprise the training corpus (to 
create the initial rule set); and the final partition is 
the pruning corpus (to prune bad rules from the ini- 
tial rule set). All results are averages across the five 
folds. Performance is measured in terms of precision 
and recall. Precision was described earlier -- it is a 
standard measure of accuracy. Recall, on the other 
hand, is an attempt to measure coverage: 
P = (# of correct proposed NPs) / (# of proposed NPs)
R = (# of correct proposed NPs) / (# of NPs in the annotated text)
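As a small worked example, the two measures can be computed from sets of spans. The sketch below assumes that a proposed NP is correct only if its span exactly matches a reference span, and it returns 0.0 when a denominator is zero; both choices, and the name `precision_recall`, are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]   # inclusive token indices


def precision_recall(proposed: List[Span],
                     reference: List[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over base NP spans."""
    proposed_set, reference_set = set(proposed), set(reference)
    correct = len(proposed_set & reference_set)
    p = correct / len(proposed_set) if proposed_set else 0.0
    r = correct / len(reference_set) if reference_set else 0.0
    return p, r


# Boca(0) Raton(1) ,(2) Hot(3) Springs(4) ,(5) and(6) Palm(7) Beach(8)
reference = [(0, 1), (3, 4), (7, 8)]
proposed = [(0, 3), (4, 4), (7, 8)]      # only Palm Beach is correct
p, r = precision_recall(proposed, reference)
print(f"P = {p:.3f}, R = {r:.3f}")
```

Here one of three proposed NPs is correct and one of three reference NPs is found, so both measures equal one third.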
Table 1 summarizes the performance of the tree- 
bank approach to base NP identification on the 
R&M and Empire corpora using the initial and 
pruned rule sets. The first column of results shows 
the performance of the initial, unpruned base NP 
grammar. The next two columns show the perfor- 
mance of the automatically pruned rule sets. The 
final column indicates the performance of rule sets 
that had been pruned using the handcrafted pruning 
heuristics. As expected, the initial rule set performs 
quite poorly. Both automated approaches provide 
significant increases in both recall and precision. In 
addition, they outperform the rule set pruned using 
handcrafted pruning heuristics. 
Footnote 2: Very briefly, the Empire sentence analyzer relies on
partial parsing to find simple constituents like base NPs and
verb groups. Machine learning algorithms then operate on
the output of the partial parser to perform all attachment
decisions. The ultimate output of the parser is a semantic case
frame representation of the functional structure of the input
sentence.
R&M (1998) with      R&M (1998) without   Treebank
lexical templates    lexical templates    Approach
93.1P/93.5R          90.5P/90.7R          89.4P/90.9R
Table 2: Comparison of Treebank Approach with
Ramshaw & Marcus (1998) both With and Without
Lexical Templates, on the R&M Corpus
Throughout the table, we see the effects of base 
NP complexity -- the base NPs of the R&M cor- 
pus are substantially more difficult for our approach 
to identify than the simpler NPs of the Empire cor- 
pus. For the R&M corpus, we lag the best pub- 
lished results (93.1P/93.5R) by approximately 3%. 
This straightforward comparison, however, is not en- 
tirely appropriate. Ramshaw & Marcus allow their 
learning algorithm to access word-level information 
in addition to part-of-speech tags. The treebank ap- 
proach, on the other hand, makes use only of part-of- 
speech tags. Table 2 compares Ramshaw & Marcus' 
(In press) results with and without lexical knowl- 
edge. The first column reports their performance 
when using lexical templates; the second when lexi- 
cal templates are not used; the third again shows the 
treebank approach using incremental pruning. The 
treebank approach and the R&M approach without 
lexical templates are shown to perform comparably
(-1.1P/+0.2R). Lexicalization of our base NP finder 
will be addressed in Section 4.1. 
Finally, note the relatively small difference be- 
tween the threshold and incremental pruning meth- 
ods in Table 1. For some applications, this minor 
drop in performance may be worth the decrease in 
training time. Another effective technique to speed 
up training is motivated by Charniak's (1996) ob- 
servation that the benefit of using rules that only 
occurred once in training is marginal. By discard- 
ing these rules before pruning, we reduce the size of 
the initial grammar -- and the time for incremental 
pruning -- by 60%, with a performance drop of only 
-0.3P/-0.1R. 
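Discarding rules that occur only once amounts to counting rule occurrences before deduplication. A minimal sketch, assuming the same tag list and inclusive span encoding as before; the name `frequent_rules` and the `min_count` parameter are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

Rule = Tuple[str, ...]
Span = Tuple[int, int]   # inclusive word indices


def frequent_rules(tags: List[str], base_nps: List[Span],
                   min_count: int = 2) -> Set[Rule]:
    """Keep only rules read from at least `min_count` training NPs."""
    counts = Counter(tuple(tags[i:j + 1]) for i, j in base_nps)
    return {rule for rule, count in counts.items() if count >= min_count}


# Toy tagged text: two <DT NN> phrases and a single <NNP NNP> phrase.
tags = ["DT", "NN", "VBD", "DT", "NN", "IN", "NNP", "NNP"]
base_nps = [(0, 1), (3, 4), (6, 7)]
initial = frequent_rules(tags, base_nps)
print(initial)
```

In the toy text only <DT NN> survives, since <NNP NNP> was seen once.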
4.1 Errors and Local Repair Heuristics 
It is informative to consider the kinds of errors 
made by the treebank approach to bracketing. In 
particular, the errors may indicate options for incor- 
porating lexical information into the base NP finder. 
Given the increases in performance achieved by 
Ramshaw & Marcus by including word-level cues, we 
would hope to see similar improvements by exploit- 
ing lexical information in the treebank approach. 
For each corpus we examined the first 100 or so 
errors and found that certain linguistic constructs 
consistently cause trouble. (In the examples that 
follow, the bracketing shown is the error.) 
Base NP   Initial        Threshold      Incremental    Human
Corpus    Rule Set       Pruning        Pruning        Review
Empire    23.0P/46.5R    91.2P/93.1R    92.7P/93.7R    90.3P/90.5R
R&M       19.0P/36.1R    87.2P/90.0R    89.4P/90.9R    81.6P/85.0R
Table 1: Evaluation of the Treebank Approach Using the Mitre Part-of-Speech Tagger (P = precision; R =
recall)
Base NP   Threshold      Threshold         Incremental    Incremental
Corpus    Improvement    + Local Repair    Improvement    + Local Repair
Empire    91.2P/93.1R    92.8P/93.7R       92.7P/93.7R    93.7P/94.0R
R&M       87.2P/90.0R    89.2P/90.6R       89.4P/90.9R    90.7P/91.1R
Table 3: Effect of Local Repair Heuristics
* Conjunctions. Conjunctions were a major problem
in the R&M corpus. For the Empire
corpus, conjunctions of adjectives proved difficult:
[record/NN] [third-quarter/JJ and/CC
nine-month/JJ results/NNS].
* Gerunds. Even though the most difficult VBG
constructions such as manufacturing titans were
removed from the Empire corpus, there were
others that the bracketer did not handle, like
[chief] operating [officer]. Like conjunctions,
gerunds posed a major difficulty in the R&M
corpus.
* NPs Containing Punctuation. Predictably, the
bracketer has difficulty with NPs containing periods,
quotation marks, hyphens, and parentheses.
* Adverbial Noun Phrases. Especially temporal
NPs such as last month in at [83.6%] of [capacity
last month].
* Appositives. These are juxtaposed NPs such as
of [colleague Michael Madden] that the bracketer
mistakes for a single NP.
* Quantified NPs. NPs that look like PPs are
a problem: at/IN [least/JJS] [the/DT right/JJ
jobs/NNS]; about/IN [25/CD million/CD].
Many errors appear to stem from four underlying
causes. First, close to 20% can be attributed
to errors in the Treebank and in the Base NP corpus,
bringing the effective performance of the algorithm
to 94.2P/95.9R and 91.5P/92.7R for the Empire
and R&M corpora, respectively. For example,
neither corpus includes WH-phrases as base NPs.
When the bracketer correctly recognizes these NPs,
they are counted as errors. Part-of-speech tagging
errors are a second cause. Third, many NPs are
missed by the bracketer because it lacks the appropriate
rule. For example, household products business
is bracketed as [household/NN products/NNS]
[business/NN]. Fourth, idiomatic and specialized expressions,
especially time, date, money, and numeric
phrases, also account for a substantial portion of the
errors.
These last two categories of errors can often be detected
because they produce either recognizable patterns
or unlikely linguistic constructs. Consecutive
NPs, for example, usually denote bracketing errors,
as in [household/NN products/NNS] [business/NN].
Merging consecutive NPs in the correct contexts
would fix many such errors. Idiomatic and specialized
expressions might be corrected by similarly local
repair heuristics. Typical examples might include
changing [effective/JJ Monday/NNP] to effective [Monday];
changing [the/DT balance/NN due/JJ] to
[the balance] due; and changing were/VBP [n't/RB the/DT only/RB losers/NNS]
to were n't [the only
losers].
Given these observations, we implemented three 
local repair heuristics. The first merges consecutive 
NPs unless either might be a time expression. The 
second identifies two simple date expressions. The 
third looks for quantifiers preceding of NP. The first 
heuristic, for example, merges [household products]
[business] to form [household products business], but
leaves increased [15%] [last Friday] untouched. The
second heuristic merges [June 5] , [1995] into [June
5, 1995]; and [June], [1995] into [June, 1995]. The
third finds examples like some of [the companies] and
produces [some] of [the companies]. These heuristics
represent an initial exploration into the effectiveness 
of employing lexical information in a post-processing 
phase rather than during grammar induction and 
bracketing. While we are investigating the latter 
in current work, local repair heuristics have the ad- 
vantage of keeping the training and bracketing algo- 
rithms both simple and fast. 
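The first heuristic can be sketched as a single pass over the proposed spans. The paper does not say how a possible time expression is detected, so the word list `TIME_WORDS` below is purely an assumption for illustration, as are the function names `might_be_time` and `merge_consecutive`.

```python
from typing import List, Tuple

Span = Tuple[int, int]   # inclusive word indices

# Hypothetical cue words; the paper does not specify its time-expression test.
TIME_WORDS = {
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday",
    "sunday", "yesterday", "today", "tomorrow", "week", "month", "year",
}


def might_be_time(words: List[str]) -> bool:
    """Crude check: does the phrase contain any time cue word?"""
    return any(w.lower() in TIME_WORDS for w in words)


def merge_consecutive(words: List[str], spans: List[Span]) -> List[Span]:
    """Merge adjacent base NPs unless either might be a time expression."""
    merged: List[Span] = []
    for start, end in spans:
        if merged:
            prev_start, prev_end = merged[-1]
            adjacent = prev_end + 1 == start
            if (adjacent
                    and not might_be_time(words[prev_start:prev_end + 1])
                    and not might_be_time(words[start:end + 1])):
                merged[-1] = (prev_start, end)   # fuse with the previous NP
                continue
        merged.append((start, end))
    return merged


fixed = merge_consecutive(["household", "products", "business"],
                          [(0, 1), (2, 2)])
kept = merge_consecutive(["increased", "15", "%", "last", "Friday"],
                         [(1, 2), (3, 4)])
print(fixed, kept)
```

The first call merges the two brackets into one span covering household products business, while the second leaves both brackets in place because last Friday contains a time cue.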
The effect of these heuristics on recall and preci- 
sion is shown in Table 3. We see consistent improve- 
ments for both corpora and both pruning methods, 
achieving approximately 94P/R for the Empire cor- 
pus and approximately 91P/R for the R&M corpus. 
Note that these are the final results reported in the 
introduction and conclusion. Although these experi- 
ments represent only an initial investigation into the 
usefulness of local repair heuristics, we are very en- 
couraged by the results. The heuristics uniformly 
boost precision without harming recall; they help 
the R&M corpus even though they were designed in 
response to errors in the Empire corpus. In addi- 
tion, these three heuristics alone recover 1/2 to 1/3 
of the improvements we can expect to obtain from 
lexicalization based on the R&M results. 
5 Conclusions 
This paper presented a new method for identifying 
base NPs. Our treebank approach uses the simple 
technique of matching part-of-speech tag sequences, 
with the intention of capturing the simplicity of the 
corresponding syntactic structure. It employs two 
existing corpus-based techniques: the initial noun 
phrase grammar is extracted directly from an an- 
notated corpus; and a benefit score calculated from 
errors on an improvement corpus selects the best 
subset of rules via a coarse- or fine-grained pruning 
algorithm. 
The overall results are surprisingly good, espe- 
cially considering the simplicity of the method. It 
achieves 94% precision and recall on simple base 
NPs. It achieves 91% precision and recall on the 
more complex NPs of the Ramshaw & Marcus cor- 
pus. We believe, however, that the base NP finder 
can be improved further. First, the longest-match 
heuristic of the noun phrase bracketer could be re- 
placed by more sophisticated parsing methods that 
account for lexical preferences. Rule application, for 
example, could be disambiguated statistically using 
distributions induced during training. We are cur- 
rently investigating such extensions. One approach 
closely related to ours -- weighted finite-state trans- 
ducers (e.g. (Pereira and Riley, 1997)) -- might pro- 
vide a principled way to do this. We could then 
consider applying our error-driven pruning strategy 
to rules encoded as transducers. Second, we have 
only recently begun to explore the use of local re- 
pair heuristics. While initial results are promising, 
the full impact of such heuristics on overall perfor- 
mance can be determined only if they are system- 
atically learned and tested using available training 
data. Future work will concentrate on the corpus- 
based acquisition of local repair heuristics. 
In conclusion, the treebank approach to base NPs 
provides an accurate and fast bracketing method, 
running in time linear in the length of the tagged 
text. The approach is simple to understand, implement,
and train. The learned grammar is easily
modified for use with new corpora, as rules can be 
added or deleted with minimal interaction problems. 
Finally, the approach provides a general framework 
for developing other treebank grammars (e.g., for 
subject/verb/object identification) in addition to 
those for base NPs.
Acknowledgments. This work was supported in 
part by NSF Grants IRI-9624639 and GER-9454149.
We thank Mitre for providing their part-of-speech tag- 
ger. 

References 
D. Bourigault. 1992. Surface Grammatical Anal- 
ysis for the Extraction of Terminological Noun 
Phrases. In Proceedings, COLING-92, pages 977- 
981. 
Eric Brill. 1995. Transformation-Based Error- 
Driven Learning and Natural Language Process- 
ing: A Case Study in Part-of-Speech Tagging. 
Computational Linguistics, 21(4):543-565. 
E. Charniak. 1996. Treebank Grammars. In Pro- 
ceedings of the Thirteenth National Conference on 
Artificial Intelligence, pages 1031-1036, Portland, 
OR. AAAI Press / MIT Press. 
K. Church. 1988. A Stochastic Parts Program and 
Noun Phrase Parser for Unrestricted Text. In Pro- 
ceedings of the Second Conference on Applied Nat- 
ural Language Processing, pages 136-143. Associ- 
ation for Computational Linguistics. 
J. P. Gee and F. Grosjean. 1983. Performance struc- 
tures: A psycholinguistic and linguistic appraisal. 
Cognitive Psychology, 15:411-458. 
John S. Justeson and Slava M. Katz. 1995. Techni- 
cal Terminology: Some Linguistic Properties and 
an Algorithm for Identification in Text. Natural 
Language Engineering, 1:9-27. 
M. Marcus, M. Marcinkiewicz, and B. Santorini. 
1993. Building a Large Annotated Corpus of En- 
glish: The Penn Treebank. Computational Lin- 
guistics, 19(2):313-330. 
Fernando C. N. Pereira and Michael D. Riley. 1997. 
Speech Recognition by Composition of Weighted 
Finite Automata. In Emmanuel Roche and Yves 
Schabes, editors, Finite-State Language Process- 
ing. MIT Press. 
Lance A. Ramshaw and Mitchell P. Marcus. In 
press. Text chunking using transformation-based 
learning. In Natural Language Processing Using 
Very Large Corpora. Kluwer. Originally appeared 
in WVLC95, 82-94. 
A. Voutilainen. 1993. NPTool, A Detector of En- 
glish Noun Phrases. In Proceedings of the Work- 
shop on Very Large Corpora, pages 48-57. Asso- 
ciation for Computational Linguistics. 
