Applying System Combination to Base Noun Phrase Identification 
Erik F. Tjong Kim Sang α, Walter Daelemans α, Hervé Déjean β,
Rob Koeling γ, Yuval Krymolowski δ, Vasin Punyakanok ε, Dan Roth ε
α University of Antwerp
Universiteitsplein 1
B-2610 Wilrijk, Belgium
{erikt,daelem}@uia.ua.ac.be
β Universität Tübingen
Kleine Wilhelmstraße 113
D-72074 Tübingen, Germany
dejean@sfs.nphil.uni-tuebingen.de
γ SRI Cambridge
23 Millers Yard, Mill Lane
Cambridge, CB2 1RQ, UK
koeling@cam.sri.com
δ Bar-Ilan University
Ramat Gan, 52900, Israel
yuvalk@macs.biu.ac.il
ε University of Illinois
1304 W. Springfield Ave.
Urbana, IL 61801, USA
{punyakan,danr}@cs.uiuc.edu
Abstract
We use seven machine learning algorithms for one task: identifying base noun phrases. The results have been processed by different system combination methods and all of these outperformed the best individual result. We have applied the seven learners with the best combinator, a majority vote of the top five systems, to a standard data set and managed to improve the best published result for this data set.
1 Introduction 
Van Halteren et al. (1998) and Brill and Wu (1998) show that part-of-speech tagger performance can be improved by combining different taggers. By using techniques such as majority voting, errors made by the minority of the taggers can be removed. Van Halteren et al. (1998) report that such a combined approach can reduce the error rate of the best individual system by as much as 19%. The positive effect of system combination for non-language processing tasks has been shown in a large body of machine learning work.
In this paper we will use system combination for identifying base noun phrases (baseNPs). We will apply seven machine learning algorithms to the same baseNP task. At two points we will apply combination methods. We will start with making the systems process five output representations and combine the results by choosing the majority of the output features. Three of the seven systems use this approach. After this we will make an overall combination of the results of the seven systems. There we will evaluate several system combination methods. The best performing method will be applied to a standard data set for baseNP identification.
2 Methods and experiments 
In this section we will describe our learning task: recognizing base noun phrases. After this we will describe the data representations we used and the machine learning algorithms that we will apply to the task. We will conclude with an overview of the combination methods that we will test.
2.1 Task description 
Base noun phrases (baseNPs) are noun phrases which do not contain another noun phrase. For example, the sentence

In [ early trading ] in [ Hong Kong ] [ Monday ] , [ gold ] was quoted at [ $ 366.50 ] [ an ounce ] .

contains six baseNPs (marked as phrases between square brackets). The phrase $ 366.50 an ounce is a noun phrase as well. However, it is not a baseNP since it contains two other noun phrases. Two baseNP data sets have been put forward by Ramshaw and Marcus (1995). The main data set consists of four sections of the Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al., 1993) as training material (sections 15-18, 211727 tokens) and one section as test material (section 20, 47377 tokens)¹. The data contains words, their part-of-speech (POS) tags as computed by the Brill tagger and their baseNP segmentation as derived from the Treebank (with some modifications).

¹ This Ramshaw and Marcus (1995) baseNP data set is available via ftp://ftp.cis.upenn.edu/pub/chunker/

In the baseNP identification task, performance is measured with three rates. First, with the percentage of detected noun phrases that are correct (precision). Second, with the percentage of noun phrases in the data that were found by the classifier (recall). And third, with the Fβ=1 rate which is equal to (2*precision*recall)/(precision+recall). The latter rate has been used as the target for optimization.
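The three rates can be illustrated with a small Python sketch. It assumes that baseNPs are represented as (start, end) token index pairs and that a phrase counts as correct only when both boundaries match; the function names are chosen here for illustration and do not come from any evaluation software used in the experiments.

```python
def f_beta1(precision, recall):
    """F rate with beta=1: (2*precision*recall)/(precision+recall)."""
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold_phrases, predicted_phrases):
    """Return (precision, recall, F) in percent for two collections of spans."""
    gold = set(gold_phrases)
    predicted = set(predicted_phrases)
    correct = len(gold & predicted)  # exact boundary matches only
    precision = 100.0 * correct / len(predicted) if predicted else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall, f_beta1(precision, recall)
```

As a check on the formula, `f_beta1(94.18, 93.55)` gives approximately 93.86, which agrees with the precision, recall and Fβ=1 values reported for the best-five combination in Table 3.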
2.2 Data representation 
In our example sentence in section 2.1, noun 
phrases are represented by bracket structures. 
It has been shown by Muñoz et al. (1999)
that for baseNP recognition, the representa- 
tion with brackets outperforms other data rep- 
resentations. One classifier can be trained to 
recognize open brackets (O) and another can 
handle close brackets (C). Their results can be 
combined by making pairs of open and close 
brackets with large probability scores. We have 
used this bracket representation (O+C) as well. 
However, we have not used the combination strategy from Muñoz et al. (1999) but instead used the strategy outlined in Tjong Kim Sang (2000): regard only the shortest possible phrases between candidate open and close brackets as base noun phrases.
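One possible reading of the shortest-phrase strategy is sketched below in Python. It assumes that the O and C classifiers each deliver one boolean per word, and it pairs every close bracket with the nearest preceding unmatched open bracket; the function name and the data layout are illustrative assumptions, not the original implementation.

```python
def pair_brackets(opens, closes):
    """Build (start, end) baseNP spans from per-word open/close predictions.

    Only the shortest candidate phrases are kept: a later open bracket
    replaces an earlier unmatched one, and a close bracket without a
    pending open bracket is ignored.
    """
    phrases = []
    last_open = None
    for i, (is_open, is_close) in enumerate(zip(opens, closes)):
        if is_open:
            last_open = i
        if is_close and last_open is not None:
            phrases.append((last_open, i))
            last_open = None
    return phrases
```

For the example sentence of section 2.1, open brackets at word positions 1, 4, 6, 8, 12 and 14 and close brackets at positions 2, 5, 6, 8, 13 and 15 (counting from 0) yield the six baseNPs. A spurious open bracket at position 0 would be discarded because the open bracket at position 1 is closer to the first close bracket.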
An alternative representation for baseNPs 
has been put forward by Ramshaw and Marcus (1995). They have defined baseNP recognition as a tagging task: words can be inside a
baseNP (I) or outside a baseNP (O). In the case 
that one baseNP immediately follows another 
baseNP, the first word in the second baseNP 
receives tag B. Example: 
In_O early_I trading_I in_O Hong_I Kong_I Monday_B ,_O gold_I was_O quoted_O at_O $_I 366.50_I an_B ounce_I ._O
This set of three tags is sufficient for encod- 
ing baseNP structures since these structures are 
nonrecursive and nonoverlapping. 
Tjong Kim Sang (2000) outlines alternative versions of this tagging representation. First, the B tag can be used for the first word of every baseNP (IOB2 representation). Second, instead of the B tag an E tag can be used to mark the last word of a baseNP immediately before another baseNP (IOE1). And third, the E tag can be used for every noun phrase final word (IOE2). He used the Ramshaw and Marcus (1995) representation as well (IOB1). We will use these four tagging representations and the O+C representation for the system-internal combination experiments.
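The four tagging representations can be made concrete with a short Python sketch that encodes the same set of baseNP spans in each scheme. The function name `encode` and the span format (inclusive start and end indices, counting from 0) are assumptions made for this illustration.

```python
def encode(n_tokens, phrases, scheme):
    """Encode baseNP spans as IOB1, IOB2, IOE1 or IOE2 tags."""
    tags = ["O"] * n_tokens
    starts = {start for start, _ in phrases}
    ends = {end for _, end in phrases}
    # Every word inside a baseNP starts out with tag I.
    for start, end in phrases:
        for i in range(start, end + 1):
            tags[i] = "I"
    for start, end in phrases:
        # IOB2 marks every first word; IOB1 only when the previous word ends a baseNP.
        if scheme == "IOB2" or (scheme == "IOB1" and start - 1 in ends):
            tags[start] = "B"
        # IOE2 marks every last word; IOE1 only when the next word starts a baseNP.
        if scheme == "IOE2" or (scheme == "IOE1" and end + 1 in starts):
            tags[end] = "E"
    return tags
```

With the six baseNPs of the example sentence, `encode(17, phrases, "IOB1")` reproduces the tag sequence shown above, with B on Monday and an, while the IOE1 variant instead puts E on Kong and 366.50.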
2.3 Machine learning algorithms
This section contains a brief description of the seven machine learning algorithms that we will apply to the baseNP identification task: ALLiS, C5.0, IGTree, MaxEnt, MBL, MBSL and SNoW.
ALLiS² (Architecture for Learning Linguistic Structures) is a learning system which uses theory refinement in order to learn non-recursive NP and VP structures (Déjean, 2000). ALLiS generates a regular expression grammar which describes the phrase structure (NP or VP). This grammar is then used by the CASS parser (Abney, 1996). Following the principle of theory refinement, the learning task is composed of two steps. The first step is the generation of an initial grammar. The generation of this grammar uses the notion of default values and some background knowledge which provides general expectations concerning the inner structure of NPs and VPs. This initial grammar provides an incomplete and/or incorrect analysis of the data. The second step is the refinement of this grammar. During this step, the validity of the rules of the initial grammar is checked and the rules are improved (refined) if necessary. This refinement relies on the use of two operations: the contextualization (in which contexts such a tag always belongs to the phrase) and lexicalization (use of information about the words and not only about POS).
C5.0³, a commercial version of C4.5 (Quinlan, 1993), performs top-down induction of decision trees (TDIDT). On the basis of an instance base of examples, C5.0 constructs a decision tree which compresses the classification information in the instance base by exploiting differences in relative importance of different features. Instances are stored in the tree as paths of connected nodes ending in leaves which contain classification information. Nodes are connected via arcs denoting feature values. Feature information gain (mutual information between features and class) is used to determine the order in which features are employed as tests at all levels of the tree (Quinlan, 1993). With the full input representation (words and POS tags), we were not able to run complete experiments. We therefore experimented only with the POS tags (with a context of two left and right). We have used the default parameter setting with decision trees combined with value grouping.

² A demo of the NP and VP chunker is available at http://www.sfb441.uni-tuebingen.de/~dejean/chunker.html
³ Available from http://www.rulequest.com

We have used a nearest neighbor algorithm (IB1-IG, here listed as MBL) and a decision tree algorithm (IGTree) from the TiMBL learning package (Daelemans et al., 1999b). Both algorithms store the training data and classify new items by choosing the most frequent classification among training items which are closest to this new item. Data items are represented as sets of feature-value pairs. Each feature receives a weight which is based on the amount of information which it provides for computing the classification of the items in the training data. IB1-IG uses these weights for computing the distance between a pair of data items and IGTree uses them for deciding which feature-value decisions should be made in the top nodes of the decision tree (Daelemans et al., 1999b). We will use their default parameters except for the IB1-IG parameter for the number of examined nearest neighbors (k) which we have set to 3 (Daelemans et al., 1999a). The classifiers use a left and right context of four words and part-of-speech tags. For the four IO representations we have used a second processing stage which used a smaller context but which included information about the IO tags predicted by the first processing phase (Tjong Kim Sang, 2000).
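The idea of information-based feature weights in a nearest neighbor classifier can be sketched as follows. This is a simplified Python illustration and not the TiMBL implementation: it assumes plain information gain as the weight, a weighted overlap distance, and a vote over the k nearest training items, whereas TiMBL's own handling of ties and of the k parameter may differ. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def information_gain(instances, labels, feature):
    """Class entropy minus the expected entropy after splitting on one feature."""
    by_value = {}
    for inst, label in zip(instances, labels):
        by_value.setdefault(inst[feature], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder


def classify(instances, labels, item, k=3):
    """Vote among the k training items closest to item (weighted overlap distance)."""
    weights = [information_gain(instances, labels, f) for f in range(len(item))]

    def distance(a, b):
        # A mismatching feature value costs that feature's weight.
        return sum(w for w, x, y in zip(weights, a, b) if x != y)

    ranked = sorted(zip(instances, labels), key=lambda pair: distance(pair[0], item))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In this sketch a feature that perfectly predicts the class receives the full class entropy as its weight, while a feature that is independent of the class receives weight 0 and therefore does not influence the distance.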
When building a classifier, one must gather evidence for predicting the correct class of an item from its context. The Maximum Entropy (MaxEnt) framework is especially suited for integrating evidence from various information sources. Frequencies of evidence/class combinations (called features) are extracted from a sample corpus and considered to be properties of the classification process. Attention is constrained to models with these properties. The MaxEnt principle now demands that among all the probability distributions that obey these constraints, the most uniform is chosen. During training, features are assigned weights in such a way that, given the MaxEnt principle, the training data is matched as well as possible. During evaluation it is tested which features are active (i.e. a feature is active when the context meets the requirements given by the feature). For every class the weights of the active features are combined and the best scoring class is chosen (Berger et al., 1996). For the classifier built here the surrounding words, their POS tags and baseNP tags predicted for the previous words are used as evidence. A mixture of simple features (consisting of one of the mentioned information sources) and complex features (combinations thereof) were used. The left context never exceeded 3 words, the right context was maximally 2 words. The model was calculated using existing software (Dehaspe, 1997).
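The evaluation step of a MaxEnt classifier can be sketched in a few lines of Python. The sketch assumes the usual log-linear form, in which the weights of the active features are summed per class and then normalized; the feature names and weight values in the usage note are invented for illustration and do not come from the trained model.

```python
import math


def maxent_classify(active_features, weights, classes):
    """Return the best class and the class probabilities.

    weights maps (feature, class) pairs to real-valued weights; pairs
    that are absent from the mapping contribute nothing to the score.
    """
    scores = {
        c: sum(weights.get((f, c), 0.0) for f in active_features)
        for c in classes
    }
    norm = sum(math.exp(s) for s in scores.values())
    probs = {c: math.exp(s) / norm for c, s in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs
```

For example, with the hypothetical weights `{("pos=NN", "I"): 1.5, ("prev_tag=I", "I"): 0.5, ("pos=NN", "O"): -0.5}` and both features active, class I scores 2.0 against -0.5 for class O and is chosen.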
MBSL (Argamon et al., 1999) uses POS data in order to identify baseNPs. Inference relies on a memory which contains all the occurrences of POS sequences which appear in the beginning, or the end, of a baseNP (including complete phrases). These sequences may include a few context tags, up to a pre-specified max_context. During inference, MBSL tries to 'tile' each POS string with parts of noun phrases from the memory. If the string could be fully covered by the tiles, it becomes part of a candidate list; ambiguities between candidates are resolved by a constraint propagation algorithm. Adding a context extends the possibilities for tiling, thereby giving more opportunities to better candidates. The approach of MBSL to the problem of identifying baseNPs is sequence-based rather than word-based, that is, decisions are taken per POS sequence, or per candidate, but not for a single word. In addition, the tiling process gives no preference to any direction in the sentence. The tiles may be of any length, up to the maximal length of a phrase in the training data, which gives MBSL a generalization power that compensates for the setup of using only POS tags. The results presented here were obtained by optimizing MBSL parameters based on 5-fold CV on the training data.
SNoW uses the Open/Close model, described in Muñoz et al. (1999). As is shown there, this model produced better results than the other paradigm evaluated there, the Inside/Outside paradigm. The Open/Close model consists of two SNoW predictors, one of which predicts the beginning of baseNPs (Open predictor), and the other predicts the end of the phrase (Close predictor). The Open predictor is learned using SNoW (Carlson et al., 1999; Roth, 1998) as a function of features that utilize words and POS tags in the sentence and, given a new sentence, will predict for each word whether it is the first word in the phrase or not. For each Open, the Close predictor is learned using SNoW as a function of features that utilize the words in the sentence, the POS tags and the open prediction. It will predict, for each word, whether it can be the end of the phrase, given the previously predicted Open. Each pair of predicted Open and Close forms a candidate of a baseNP. These candidates may conflict due to overlapping; at this stage, a graph-based constraint satisfaction algorithm that uses the confidence values SNoW associates with its predictions is employed. This algorithm ("the combinator") produces the list of the final baseNPs for each sentence. Details of SNoW, its application in shallow parsing and the combinator's algorithm are in Muñoz et al. (1999).

section 21   MBL                       MaxEnt                    IGTree
             O       C       Fβ=1      O       C       Fβ=1      O       C       Fβ=1
IOB1         97.81%  97.97%  91.68     97.90%  98.11%  92.43     96.62%  96.89%  87.88
IOB2         97.63%  97.96%  91.79     97.81%  98.14%  92.14     97.27%  97.30%  90.03
IOE1         97.80%  97.92%  91.54     97.88%  98.12%  92.37     95.88%  96.01%  82.80
IOE2         97.72%  97.94%  92.06     97.84%  98.12%  92.13     97.19%  97.62%  89.98
O+C          97.72%  98.04%  92.03     97.82%  98.15%  92.26     96.89%  97.49%  89.37
Majority     98.04%  98.20%  92.82     97.94%  98.24%  92.60     97.70%  97.99%  91.92

Table 1: The effects of system-internal combination by using different output representations. A straightforward majority vote of the output yields better bracket accuracies and Fβ=1 rates than any included individual classifier. The bracket accuracies in the columns O and C show what percentage of words was correctly classified as baseNP start, baseNP end or neither.

2.4 Combination techniques 
At two points in our noun phrase recognition process we will use system combination. We will start with system-internal combination: apply the same learning algorithm to variants of the task and combine the results. The approach we have chosen here is the same as in Tjong Kim Sang (2000): generate different variants of the task by using different representations of the output (IOB1, IOB2, IOE1, IOE2 and O+C). The five outputs will be converted to the open bracket representation (O) and the close bracket representation (C) and after this, the most frequent of the five analyses of each word will be chosen (majority voting, see below). We expect the systems which use this combination phase to perform better than their individual members (Tjong Kim Sang, 2000).
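The system-internal combination step can be sketched in Python as a conversion of each tag sequence to per-word open and close flags, followed by a per-word majority vote. The function names are assumptions for this illustration; the conversion rule is written so that the same code handles IOB1, IOB2, IOE1 and IOE2 tag sequences.

```python
from collections import Counter


def to_brackets(tags):
    """Convert IOB1/IOB2/IOE1/IOE2 tags to per-word open (O) and close (C) flags."""
    opens, closes = [], []
    for i, tag in enumerate(tags):
        prev_tag = tags[i - 1] if i > 0 else "O"
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        inside = tag != "O"
        # A baseNP opens at a B tag or after a word that is outside or phrase-final.
        opens.append(inside and (tag == "B" or prev_tag in ("O", "E")))
        # A baseNP closes at an E tag or before a word that is outside or phrase-initial.
        closes.append(inside and (tag == "E" or next_tag in ("O", "B")))
    return opens, closes


def majority(streams):
    """Per-word majority vote over several equally long prediction streams."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*streams)]
```

With five boolean streams per bracket type, one stream that disagrees on a word is outvoted by the other four, which is the effect the combination relies on.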
Our seven learners will generate different classifications of the training data and we need to find out which combination techniques are most appropriate. For the system-external combination experiment, we have evaluated different voting mechanisms, effectively the voting methods as described in Van Halteren et al. (1998). In the first method each classification receives the same weight and the most frequent classification is chosen (Majority). The second method regards as the weight of each individual classification algorithm its accuracy on some part of the data, the tuning data (TotPrecision). The third voting method computes the precision of each assigned tag per classifier and uses this value as a weight for the classifier in those cases that it chooses the tag (TagPrecision). The fourth method uses both the precision of each assigned tag and the recall of the competing tags (Precision-Recall). Finally, the fifth method uses not only a weight for the current classification but it also computes weights for other possible classifications. The other classifications are determined by examining the tuning data and registering the correct values for every pair of classifier results (pair-wise voting, see Van Halteren et al. (1998) for an elaborate explanation).
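The first three voting methods differ only in the weight given to each vote, which the following Python sketch tries to make explicit. It is an illustration under simplifying assumptions: the Precision-Recall and pair-wise methods are not covered, ties are broken by whichever tag was seen first, and all names are hypothetical.

```python
from collections import defaultdict


def tuning_statistics(tuning_outputs, gold):
    """Per-classifier accuracy and per-tag precision measured on tuning data."""
    accuracy, tag_precision = {}, {}
    for classifier, output in tuning_outputs.items():
        hits = [p == g for p, g in zip(output, gold)]
        accuracy[classifier] = sum(hits) / len(gold)
        for tag in set(output):
            assigned = [h for h, p in zip(hits, output) if p == tag]
            tag_precision[classifier, tag] = sum(assigned) / len(assigned)
    return accuracy, tag_precision


def weighted_vote(predictions, weight):
    """Choose the tag with the largest summed weight over all classifiers."""
    totals = defaultdict(float)
    for classifier, tag in predictions.items():
        totals[tag] += weight(classifier, tag)
    return max(totals, key=totals.get)


def vote_majority(predictions):
    return weighted_vote(predictions, lambda clf, tag: 1.0)


def vote_tot_precision(predictions, accuracy):
    return weighted_vote(predictions, lambda clf, tag: accuracy[clf])


def vote_tag_precision(predictions, tag_precision):
    # A tag the classifier never assigned on the tuning data gets weight 0.
    return weighted_vote(predictions, lambda clf, tag: tag_precision.get((clf, tag), 0.0))
```

The design choice here is to pass the weighting scheme as a function, so that the same voting routine serves all three methods and only the statistics taken from the tuning data change.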
Apart from these five voting methods we have also processed the output streams with two classifiers: MBL and IGTree. This approach is called classifier stacking. Like Van Halteren et al. (1998), we have used different input versions: one containing only the classifier output and another containing both classifier output and a compressed representation of the data item under consideration. For the latter purpose we have used the part-of-speech tag of the current word.
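The two input versions for classifier stacking can be illustrated with a small Python sketch. It only shows how the second-level instances might be assembled; the lookup table used as second-level learner is a deliberately simple stand-in chosen for this illustration and is not MBL or IGTree. All names are hypothetical.

```python
from collections import Counter


def stacked_instances(outputs, pos_tags=None):
    """One feature tuple per word: the base classifier outputs, optionally plus the POS tag."""
    rows = [tuple(column) for column in zip(*outputs)]
    if pos_tags is not None:
        rows = [row + (pos,) for row, pos in zip(rows, pos_tags)]
    return rows


def train_lookup(instances, gold):
    """Stand-in second-level learner: most frequent gold tag per feature tuple."""
    table = {}
    for inst, tag in zip(instances, gold):
        table.setdefault(inst, Counter())[tag] += 1
    return {inst: counts.most_common(1)[0][0] for inst, counts in table.items()}
```

Calling `stacked_instances(outputs)` corresponds to the version with classifier output only, and `stacked_instances(outputs, pos_tags)` to the version that adds the part-of-speech tag of the current word.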
3 Results⁴
We want to find out whether system combination could improve performance of baseNP recognition and, if this is the case, we want to select the best combination technique. For this purpose we have performed an experiment with sections 15-18 of the WSJ part of the Penn Treebank as training data (211727 tokens) and section 21 as test data (40039 tokens). Like the data used by Ramshaw and Marcus (1995), this data was retagged by the Brill tagger in order to obtain realistic part-of-speech (POS) tags⁵. The data was segmented into baseNP parts and non-baseNP parts in a similar fashion as the data used by Ramshaw and Marcus (1995). Of the training data, only 90% was used for training. The remaining 10% was used as tuning data for determining the weights of the combination techniques.
For three classifiers (MBL, MaxEnt and IGTree) we have used system-internal combination. These learning algorithms have processed five different representations of the output (IOB1, IOB2, IOE1, IOE2 and O+C) and the results have been combined with majority voting. The test data results can be found in Table 1. In all cases, the combined results were better than that of the best included system.
The results of ALLiS, C5.0, MBSL and SNoW have been converted to the O and the C representation. Together with the bracket representations of the other three techniques, this gave us a total of seven O results and seven C results. These two data streams have been combined with the combination techniques described in section 2.4. After this, we built baseNPs from the O and C results of each combination technique, as described in section 2.2. The bracket accuracies and the Fβ=1 scores for test data can be found in Table 2.

section 21          O        C        Fβ=1
Classifier
ALLiS               97.87%   98.08%   92.15
C5.0                97.05%   97.76%   89.97
IGTree              97.70%   97.99%   91.92
MaxEnt              97.94%   98.24%   92.60
MBL                 98.04%   98.20%   92.82
MBSL                97.27%   97.66%   90.71
SNoW                97.78%   97.68%   91.87
Simple Voting
Majority            98.08%   98.21%   92.95
TotPrecision        98.08%   98.21%   92.95
TagPrecision        98.08%   98.21%   92.95
Precision-Recall    98.08%   98.21%   92.95
Pairwise Voting
TagPair             98.13%   98.23%   93.07
Memory-Based
Tags                98.24%   98.35%   93.39
Tags + POS          98.14%   98.33%   93.24
Decision Trees
Tags                98.24%   98.35%   93.39
Tags + POS          98.13%   98.32%   93.21

Table 2: Bracket accuracies and Fβ=1 scores for section WSJ 21 of the Penn Treebank with seven individual classifiers and combinations of them. Each combination performs better than its best individual member. The stacked classifiers without context information perform best.

⁴ Detailed results of our experiments are available on http://lcg-www.uia.ac.be/~erikt/npcombi/
⁵ The retagging was necessary to assure that the performance rates obtained here would be similar to rates obtained for texts for which no Treebank POS tags are available.

All combinations improve the results of the best individual classifier. The best results were obtained with a memory-based stacked classifier. This is different from the combination results presented in Van Halteren et al. (1998), in which pairwise voting performed best. However, in their later work stacked classifiers outperform voting methods as well (Van Halteren et al., to appear).

section 20                   accuracy             precision  recall   Fβ=1
Best-five combination        O:98.32% C:98.41%    94.18%     93.55%   93.86
Tjong Kim Sang (2000)        O:98.10% C:98.29%    93.63%     92.89%   93.26
Muñoz et al. (1999)          O:98.1%  C:98.2%     92.4%      93.1%    92.8
Ramshaw and Marcus (1995)    IOB1:97.37%          91.80%     92.27%   92.03
Argamon et al. (1999)        -                    91.6%      91.6%    91.6

Table 3: The overall performance of the majority voting combination of our best five systems (selected on tuning data performance) applied to the standard data set put forward by Ramshaw and Marcus (1995) together with an overview of earlier work. The accuracy scores indicate how often a word was classified correctly with the representation used (O, C or IOB1). The combined system outperforms all earlier reported results for this data set.

Based on an earlier combination study (Tjong Kim Sang, 2000) we had expected the voting methods to do better. We suspect that their performance is below that of the stacked classifiers because the difference between the best and the worst individual system is larger than in our earlier study. We assume that the voting methods might perform better if they were only applied to the classifiers that perform well on this task. In order to test this hypothesis, we have repeated the combination experiments with the best n classifiers, where n took values from 3 to 6 and the classifiers were ranked based on their performance on the tuning data. The best performances were obtained with five classifiers: Fβ=1=93.44 for all five voting methods with the best stacked classifier reaching 93.24. With the top five classifiers, the voting methods outperform the best combination with seven systems⁶. Adding extra classification results to a good combination system should not make overall performance worse so it is clear that there is some room left for improvement of our combination algorithms.
We conclude that the best results in this task can be obtained with the simplest voting method, majority voting, applied to the best five of our classifiers. Our next task was to apply the combination approach to a standard data set so that we could compare our results with other work. For this purpose we have used the data put forward by Ramshaw and Marcus (1995). Again, only 90% of the training data was used for training while the remaining 10% was reserved for ranking the classifiers. The seven learners were trained with the same parameters as in the previous experiment. Three of the classifiers (MBL, MaxEnt and IGTree) used system-internal combination by processing different output representations.

⁶ We are unaware of a good method for determining the significance of Fβ=1 differences but we assume that this Fβ=1 difference is not significant. However, we believe that the fact that more combination methods perform well shows that it is easier to get a good performance out of the best five systems than with all seven.

The classifier output was converted to the O and the C representation. Based on the tuning data performance, the classifiers ALLiS, IGTree, MaxEnt, MBL and SNoW were selected for being combined with majority voting. After this, the resulting O and C representations were combined to baseNPs by using the method described in section 2.2. The results can be found in Table 3. Our combined system obtains an Fβ=1 score of 93.86 which corresponds to an 8% error reduction compared with the best published result for this data set (93.26).
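The error reduction figure can be checked with a short Python calculation. The sketch assumes that the error is taken as 100 minus the Fβ=1 score and that the reduction is expressed relative to the error of the earlier result; the function name is chosen for this illustration.

```python
def error_reduction(old_f, new_f):
    """Relative reduction (in percent) of the error 100 - F when moving from old_f to new_f."""
    return 100.0 * (new_f - old_f) / (100.0 - old_f)
```

Under these assumptions, `error_reduction(93.26, 93.86)` evaluates to roughly 8.9, that is a gain of 0.60 on a remaining error of 6.74, which is of the same order as the figure quoted in the text.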
4 Concluding remarks 
In this paper we have examined two methods for combining the results of machine learning algorithms for identifying base noun phrases. In the first method, the learner processed different output data representations and the results were combined by majority voting. This approach yielded better results than the best included classifier. In the second combination approach we have combined the results of seven learning systems (ALLiS, C5.0, IGTree, MaxEnt, MBL, MBSL and SNoW). Here we have tested different combination methods. Each combination method outperformed the best individual learning algorithm and a majority vote of the top five systems performed best. We have applied this approach of system-internal and system-external combination to a standard data set for base noun phrase identification and the performance of our system was better than any other published result for this data set.
Our study shows that the combination methods that we have tested are sensitive to the inclusion of classifier results of poor quality. This leaves room for improvement of our results by evaluating other combinators. Another interesting approach which might lead to a better performance is taking into account more context information, for example by combining complete phrases instead of independent brackets. It would also be worthwhile to evaluate using more elaborate methods for building baseNPs out of open and close bracket candidates.
Acknowledgements 
Déjean, Koeling and Tjong Kim Sang are funded by the TMR network Learning Computational Grammars⁷. Punyakanok and Roth are supported by NSF grants IIS-9801638 and SBR-9873450.

⁷ http://lcg-www.uia.ac.be/

References

Steven Abney. 1996. Partial parsing via finite-state cascades. In Proceedings of the ESSLLI '96 Robust Parsing Workshop.

Shlomo Argamon, Ido Dagan, and Yuval Krymolowski. 1999. A memory-based approach to learning shallow natural language patterns. Journal of Experimental and Theoretical AI, 11(3).

Adam L. Berger, Stephen A. DellaPietra, and Vincent J. DellaPietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1).

Eric Brill and Jun Wu. 1998. Classifier combination for improved lexical disambiguation. In Proceedings of COLING-ACL '98. Association for Computational Linguistics.

A. Carlson, C. Cumby, J. Rosen, and D. Roth. 1999. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, UIUC Computer Science Department, May.

Walter Daelemans, Antal van den Bosch, and Jakub Zavrel. 1999a. Forgetting exceptions is harmful in language learning. Machine Learning, 34(1).

Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. 1999b. TiMBL: Tilburg Memory Based Learner, version 2.0, Reference Guide. ILK Technical Report 99-01. http://ilk.kub.nl/.

Luc Dehaspe. 1997. Maximum entropy modeling with clausal constraints. In Proceedings of the 7th International Workshop on Inductive Logic Programming.

Hervé Déjean. 2000. Theory refinement and natural language processing. In Proceedings of the Coling2000. Association for Computational Linguistics.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2).

Marcia Muñoz, Vasin Punyakanok, Dan Roth, and Dav Zimak. 1999. A learning approach to shallow parsing. In Proceedings of EMNLP-WVLC'99. Association for Computational Linguistics.

J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora. Association for Computational Linguistics.

D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In AAAI-98.

Erik F. Tjong Kim Sang. 2000. Noun phrase recognition by system combination. In Proceedings of the ANLP-NAACL-2000. Seattle, Washington, USA. Morgan Kaufman Publishers.

Hans van Halteren, Jakub Zavrel, and Walter Daelemans. 1998. Improving data driven wordclass tagging by system combination. In Proceedings of COLING-ACL '98. Association for Computational Linguistics.

Hans van Halteren, Jakub Zavrel, and Walter Daelemans. to appear. Improving accuracy in NLP through combination of machine learning systems.
