Learning with Multiple Stacking for Named Entity Recognition
Koji Tsukamoto and Yutaka Mitsuishi and Manabu Sassano
Fujitsu Laboratories Ltd.
{tukamoto,mitsuishi-y,sassano}@jp.fujitsu.com
1 Introduction
In this paper, we present a learning method using multiple stacking
for named entity recognition. In order to take into account the tags
of the surrounding words, we propose a method which employs stacked
learners using the tags predicted by the lower level learners. We have
applied this approach to the CoNLL-2002 shared task to improve a base
system.
2 System Description
Before describing our system, we discuss one aspect of named entity
recognition, outline our method, and relate it to previous work.
The task of named entity recognition can be regarded as a process of
assigning a named entity tag to each given word, taking into account
the patterns of surrounding words. Suppose that a sequence of words is
given as below:
... $W_{-2}$, $W_{-1}$, $W_0$, $W_1$, $W_2$ ...
Then, given that the current position is at word $W_0$, the task is to
assign tag $T_0$ to $W_0$.
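As a minimal sketch of this framing, the following Python function collects the five-word window around the current position. The padding token `PAD` and the function name `word_window` are illustrative assumptions; the paper does not say how positions outside the sentence are handled.

```python
from typing import List

PAD = "<pad>"  # hypothetical filler for positions outside the sentence


def word_window(words: List[str], i: int, size: int = 2) -> List[str]:
    """Return [W_-size, ..., W_0, ..., W_+size] around position i."""
    return [
        words[j] if 0 <= j < len(words) else PAD
        for j in range(i - size, i + size + 1)
    ]


sentence = ["the", "United", "States", "of", "America"]
# Window around "United" (position 1): one padded slot on the left.
window = word_window(sentence, 1)
```

A tagger in this framing would be called once per position, with the window (and any other features) as input and the tag $T_0$ as output.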
In the named entity recognition task, an entity is often made up of a
sequence of words, rather than a single word. For example, the entity
“the United States of America” consists of five words. In order to
allocate a tag to each word, the tags of the surrounding words (we
call these tags the surrounding tags) can be a clue to predict the tag
of the word (we call this tag the current tag). For the test set,
however, these tags are unknown.
In order to take into account the surrounding tags for the prediction
of the current tag, we propose a method which employs multiple stacked
learners, an extension of the stacking method (Wolpert, 1992).
Stacking-based methods for named entity recognition usually employ two
or more levels of learners. The higher level learner uses the current
tags predicted by its lower level learners. In our method, by
contrast, the higher level learner uses not only the current tag but
also the surrounding tags predicted by the lower level learner. Our
aim is to improve the performance of the base system by using the
surrounding tags as features.
At least two groups have previously proposed systems which use the
predicted surrounding tags. One system, proposed by van Halteren et
al. (1998), also uses the stacking method. This system uses four
completely different types of taggers as the first level learners,
because it has been assumed that first level learners should be as
different as possible. The tags predicted by the first level learners
are used as the features of the second level learner.
The other system, proposed by Kudo and Matsumoto (2000) and Yamada et
al. (2001), uses “dynamic features”. In the test phase, the predicted
tags of the preceding (or subsequent) words are used as features;
these are the dynamic features. In the training phase, the system uses
the answer tags of the preceding (or subsequent) words as features.
Our system is described in more detail below.
2.1 Learning Algorithm
As the learning algorithm for all the levels, we use real AdaBoost.MH,
an extension of AdaBoost that handles multiclass problems (Schapire
and Singer, 1999). For weak learners, we use decision stumps (Schapire
and Singer, 1999), each of which selects only one feature to classify
an example.
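To make the role of a decision stump concrete, here is a hedged Python sketch of the stump selection step for confidence-rated boosting over binary features, following the general recipe of Schapire and Singer (1999): for each feature, the examples are split by whether the feature is present, and each (block, class) pair receives a real-valued confidence. The function name `best_stump`, the data layout, and the smoothing value `eps` are assumptions for illustration and are not taken from the authors' implementation. The weight update that follows each round is omitted.

```python
import math
from typing import Dict, List, Set, Tuple


def best_stump(
    examples: List[Set[str]],                # active binary features per example
    labels: List[str],                       # gold class per example
    classes: List[str],
    weights: Dict[Tuple[int, str], float],   # D(i, l): weight per (example, class)
    eps: float = 1e-6,                       # smoothing constant (assumed value)
):
    """Return (Z, feature, confidences) for the stump with the smallest Z."""
    best = None
    for f in sorted(set().union(*examples)):
        z = 0.0
        conf = {}
        for present in (True, False):        # the two blocks of the partition
            for l in classes:
                w_pos = w_neg = 0.0
                for i, (x, y) in enumerate(zip(examples, labels)):
                    if (f in x) != present:
                        continue
                    if y == l:
                        w_pos += weights[i, l]
                    else:
                        w_neg += weights[i, l]
                # Confidence-rated output for this block and class.
                conf[present, l] = 0.5 * math.log((w_pos + eps) / (w_neg + eps))
                z += 2.0 * math.sqrt(w_pos * w_neg)
        if best is None or z < best[0]:
            best = (z, f, conf)
    return best


examples = [{"cap", "x"}, {"cap"}, {"low", "x"}, {"low"}]
labels = ["I-LOC", "I-LOC", "O", "O"]
classes = ["I-LOC", "O"]
uniform = {(i, l): 1.0 / 8 for i in range(4) for l in classes}
z, feature, conf = best_stump(examples, labels, classes, uniform)
```

In this toy data the feature `cap` separates the two classes perfectly, so it is selected in preference to the uninformative feature `x`. Because each round selects exactly one feature, the number of distinct features a boosted classifier can use is bounded by the number of rounds.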
2.2 Features
We use the following types of features for the prediction of the tag
of the word.
• Surface form of $W_{-2}$, $W_{-1}$, $W_0$, $W_1$ and $W_2$.
• One of the eight word features in Table 1. These features are
  similar to those used in (Bikel et al., 1997).
• First and last two/three letters of $W_0$.
• Estimated tag of $W_0$ based on the word unigram model in the
  training set.

Word Feature             Example Text
Digit                    25
Digit+Alphabet           CDG1
Symbol                   .
Uppercase                EFE
Capitalized              Australia
Lowercase (long word)    necesidad
Lowercase (short word)   del
Other                    hoy,
Table 1: Word features and examples
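The per-word features listed here could be computed along the following lines. This is a sketch under stated assumptions: the exact class definitions are inferred from the examples in Table 1, the length threshold separating the two lowercase classes is left as a required parameter because its value is not recoverable from this copy, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def word_class(word: str, length_threshold: int) -> str:
    """Map a word to one of eight classes modelled on Table 1."""
    if word.isdigit():
        return "Digit"
    if word.isalnum() and any(c.isdigit() for c in word) and any(c.isalpha() for c in word):
        return "Digit+Alphabet"
    if not any(c.isalnum() for c in word):
        return "Symbol"
    if word.isalpha() and word.isupper():
        return "Uppercase"
    if word.isalpha() and word[0].isupper() and word[1:].islower():
        return "Capitalized"
    if word.isalpha() and word.islower():
        # length_threshold is an assumed parameter, not the paper's value.
        return "Lowercase(long)" if len(word) > length_threshold else "Lowercase(short)"
    return "Other"


def affixes(word: str) -> Dict[str, str]:
    """First and last two/three letters of the word."""
    return {"prefix2": word[:2], "prefix3": word[:3],
            "suffix2": word[-2:], "suffix3": word[-3:]}


def train_unigram(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Most frequent training tag for each word (a simple unigram model)."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


unigram = train_unigram([("Badajoz", "I-LOC"), ("Badajoz", "I-LOC"),
                         ("Badajoz", "O"), ("de", "O")])
```

How the unigram feature is set for words unseen in training is not specified in the text, so the sketch simply has no entry for them.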
Additionally, we use the surrounding tag feature.
This feature is discussed in Section 2.3.
2.3 Multiple Stacking
In order to take into account the tags of the surrounding words, our
system employs stacked learners. Figure 1 gives the outline of the
learning and applying algorithms of our system. In the learning phase,
the base system is trained first. After that, the higher level
learners are trained using the word features (described in Section
2.2), the current tag $T_0$ and the surrounding tags $T_{-2}$,
$T_{-1}$, $T_1$, $T_2$ predicted by the lower level learner. While
these tags may not be correctly predicted, as the accuracy of the
lower level learner's predictions improves, the features used in each
prediction become more accurate. In the applying phase, all of the
learners are cascaded in order.
Compared to the previous systems (van Halteren et al., 1998; Kudo and
Matsumoto, 2000; Yamada et al., 2001), our system (i) employs more
than two levels of stacking, (ii) uses only one algorithm and trains
only one learner at each level, (iii) uses the surrounding tags given
by the lower level learner, (iv) uses both the preceding and
subsequent tags as features, and (v) uses the predicted tags instead
of the answer tags in the training phase.
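The learning and applying phases of Figure 1 can be sketched in Python as below. The `LookupLearner` class is a deliberately trivial stand-in for real AdaBoost.MH, used only to show how predicted tags flow between levels; it, the padding token, and the function names are assumptions. The sketch also treats the data as one long sequence and ignores sentence boundaries.

```python
from collections import Counter, defaultdict
from typing import List, Tuple

Features = Tuple[str, ...]
PAD = "<pad>"  # hypothetical filler for tags outside the sequence


class LookupLearner:
    """Toy learner: most frequent tag per feature tuple (not AdaBoost.MH)."""

    def fit(self, X: List[Features], y: List[str]) -> "LookupLearner":
        counts = defaultdict(Counter)
        for x, t in zip(X, y):
            counts[x][t] += 1
        self.table = {x: c.most_common(1)[0][0] for x, c in counts.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: List[Features]) -> List[str]:
        return [self.table.get(x, self.default) for x in X]


def add_tag_context(base: List[Features], tags: List[str]) -> List[Features]:
    """Append lower-level tags T_-2, T_-1, T_0, T_1, T_2 to each feature tuple."""
    n = len(tags)
    return [
        x + tuple(tags[j] if 0 <= j < n else PAD for j in range(i - 2, i + 3))
        for i, x in enumerate(base)
    ]


def train_stack(base: List[Features], gold: List[str], n_levels: int):
    learners = [LookupLearner().fit(base, gold)]        # L_0
    tags = learners[0].predict(base)                    # T^(0) on the training set
    for _ in range(n_levels):
        X = add_tag_context(base, tags)                 # predicted, not answer, tags
        learners.append(LookupLearner().fit(X, gold))   # L_k
        tags = learners[-1].predict(X)                  # T^(k) on the training set
    return learners


def apply_stack(learners, base: List[Features]) -> List[str]:
    tags = learners[0].predict(base)
    for learner in learners[1:]:                        # cascade L_1, ..., L_N
        tags = learner.predict(add_tag_context(base, tags))
    return tags


words = ["New", "York", "in", "New", "York", "is", "New"]
gold = ["I-LOC", "I-LOC", "O", "I-LOC", "I-LOC", "O", "O"]
base = [(w,) for w in words]
stack = train_stack(base, gold, n_levels=1)
```

On this toy data the base learner tags every "New" as I-LOC, while the stacked learner can use the neighbouring tags to tag the final "New" as O. A memorizing learner makes this trivially true on its own training data, so the example illustrates the data flow only, not the generalization behaviour reported in the paper.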
3 Experiments and Results
In this section, the experimental conditions and the
results of the proposed method are shown.
In order to improve the performance of the base system, the tag
sequence to be predicted is formatted according to IOB1, even though
the sequence in the original corpus was formatted according to IOB2
(Tjong Kim Sang and Veenstra, 1999).

Let $L_k$ denote the $k$th level learner and let $T^{(k)}_i$ denote
the $k$th level output tag for $W_i$.
Learning:
1. Train the base learner $L_0$ using the features described in
   Table 1.
2. for $k$ = 1,...,N
   • Get $T^{(k-1)}$ of the training set using $L_{k-1}$, the features
     described in Section 2.2 (and $T^{(k-2)}_{-2}$, $T^{(k-2)}_{-1}$,
     $T^{(k-2)}_{0}$, $T^{(k-2)}_{1}$, $T^{(k-2)}_{2}$ if $k > 1$).
   • Train $L_k$ using the features described in Section 2.2 and
     $T^{(k-1)}_{-2}$, $T^{(k-1)}_{-1}$, $T^{(k-1)}_{0}$,
     $T^{(k-1)}_{1}$, $T^{(k-1)}_{2}$.
3. Output $L_0$, $L_1$,...,$L_N$.
Applying:
1. for $k$ = 0,...,N
   • Get $T^{(k)}$ of the test set using $L_k$, the features described
     in Section 2.2 (and $T^{(k-1)}_{-2}$, $T^{(k-1)}_{-1}$,
     $T^{(k-1)}_{0}$, $T^{(k-1)}_{1}$, $T^{(k-1)}_{2}$ if $k > 0$).
2. Output $T^{(N)}$.
Figure 1: Outline of multiple stacking algorithm
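The reformatting from IOB2 to IOB1 can be illustrated with a short Python sketch. In IOB2 every entity starts with a B- tag; in IOB1 a B- tag is kept only when an entity immediately follows another entity of the same type, and is otherwise written as I-. The function name is hypothetical, and the sketch assumes it is called once per sentence.

```python
from typing import List, Optional


def iob2_to_iob1(tags: List[str]) -> List[str]:
    """Convert one sentence of IOB2 tags to IOB1."""
    out = []
    prev_type: Optional[str] = None  # entity type of the previous token, if any
    for tag in tags:
        if tag == "O":
            out.append(tag)
            prev_type = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "B" and prev_type != etype:
            # No adjacent entity of the same type: the B- marker is not needed.
            out.append("I-" + etype)
        else:
            out.append(tag)
        prev_type = etype
    return out


converted = iob2_to_iob1(["B-ORG", "I-ORG", "O", "B-LOC", "B-LOC", "I-LOC"])
```

Only the second B-LOC survives the conversion, because it is the only entity that directly follows another entity of the same type.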
To reduce the computational cost, features appearing fewer than three
times are eliminated in the training phase.
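A frequency cutoff of this kind might look like the following sketch, which assumes that each training example is a list of feature strings and that occurrences are counted over the whole training set. The feature naming scheme and the function name are illustrative only.

```python
from collections import Counter
from typing import List


def prune_rare_features(examples: List[List[str]], min_count: int = 3) -> List[List[str]]:
    """Drop features that occur fewer than min_count times in the training set."""
    counts = Counter(f for x in examples for f in x)
    keep = {f for f, c in counts.items() if c >= min_count}
    return [[f for f in x if f in keep] for x in examples]


train = [
    ["w=de", "cls=short"],
    ["w=de", "cls=short"],
    ["w=de", "cls=short"],
    ["w=hoy", "cls=short"],
]
pruned = prune_rare_features(train)
```

Here `w=hoy` occurs only once and is removed, while `w=de` (three occurrences) and `cls=short` (four) are kept.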
3.1 Base System
To provide a baseline for evaluating the effect of multiple stacking
in the next section, the performance of the base system is shown in
Figure 2. A performance peak is observed after 10,000 rounds of
boosting. Note that a decision stump used in real AdaBoost.MH takes
into account only one feature. Hence the number of features used by
real AdaBoost.MH is at most the number of rounds. In our experiment,
because the number of boosting rounds is always less than the number
of features (about 40,000), a large proportion of the features are not
used by the learners. If the base system were trained with too few
rounds of boosting, the effect of stacking might merely resemble that
of increasing the number of rounds. In Figure 2, however, we can see
that 10,000 rounds are enough.
3.2 Multiple Stacking
We examine the effect of multiple stacking compared to the base
system.

Figure 2: $F_{\beta=1}$ score of the base system (x-axis: rounds of
boosting, 0 to 20,000; y-axis: $F_{\beta=1}$, 50 to 95; curves:
training, test)

Learner  overall  LOC    MISC   ORG    PER
$L_0$    60.94    64.12  33.61  61.37  70.91
$L_1$    65.68    67.35  40.04  66.26  74.82
$L_2$    66.91    68.02  41.69  67.38  75.51
$L_3$    67.24    67.96  41.76  67.95  75.77
$L_4$    67.35    67.81  42.75  67.91  75.89
$L_5$    67.35    67.78  42.30  67.95  75.99
$L_6$    67.30    67.57  42.12  68.01  76.01
Table 2: $F_{\beta=1}$ score of stacked learners

The $F_{\beta=1}$ score of multiple stacking for the Spanish test set
(esp.testa) is shown in Table 2. By stacking learners, the score of
each named entity type is improved. Compared to the overall
$F_{\beta=1}$ score of $L_0$, the score of $L_1$, stacking one learner
over the base system, is improved by 4.74 points. Furthermore,
compared to the score of $L_1$, the score of $L_4$ is higher by 1.67
points. Through five iterations of stacking, the score increases
continuously. The overall scores for the six tests are briefly shown
in Table 3. The effect of two level stacking is higher for the Spanish
tests. However, the effect of multiple stacking is greater for the
Dutch tests, especially for the corpus without part of speech. As
discussed in Section 3.1, the improvement of the score is not due to
the rounds of boosting. Thus, it is due to multiple stacking.
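The text does not define $F_{\beta=1}$. The sketch below assumes the usual CoNLL shared-task definition, $F_\beta = (\beta^2 + 1) P R / (\beta^2 P + R)$, where precision $P$ and recall $R$ are computed over whole entities. The function name and the counts in the example are made up for illustration.

```python
def f_beta(n_correct: int, n_predicted: int, n_gold: int, beta: float = 1.0) -> float:
    """F_beta from entity counts; returns 0.0 when it is undefined."""
    if n_correct == 0 or n_predicted == 0 or n_gold == 0:
        return 0.0
    precision = n_correct / n_predicted   # correct entities among predicted ones
    recall = n_correct / n_gold           # correct entities among gold ones
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 3 correct entities out of 4 predicted and 6 in the gold data.
score = f_beta(n_correct=3, n_predicted=4, n_gold=6)
```

With these counts, precision is 0.75 and recall is 0.5, giving an $F_{\beta=1}$ of 0.6, which would be reported as 60.00 on the scale used in the tables.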
In Table 2, the stacking effects for MISC and ORG appear greater than
those for LOC and PER. It is reasonable to suppose that MISC and ORG
entities consist of relatively long sequences of words, and the
surrounding tags can be good clues for the prediction of the current
tag. Indeed, in the Spanish training set, the ratios of entities which
consist of more than three words are 9.7%, 22.4%, 4.4% and 3.5% for
ORG, MISC, LOC and PER respectively.
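A ratio of this kind could be computed from an IOB1-tagged corpus roughly as follows. This is a sketch with hypothetical function names, applied to a toy tag sequence rather than the Spanish training set, so it does not reproduce the percentages above.

```python
from collections import Counter
from typing import Dict, List, Tuple


def entity_spans(tags: List[str]) -> List[Tuple[str, int]]:
    """Return (entity type, length in words) for each entity in IOB1 tags."""
    spans = []
    cur_type, cur_len = None, 0
    for tag in tags + ["O"]:                       # sentinel closes a final entity
        etype = None if tag == "O" else tag.split("-", 1)[1]
        starts_new = tag.startswith("B-") or etype != cur_type
        if cur_type is not None and (etype is None or starts_new):
            spans.append((cur_type, cur_len))
            cur_type, cur_len = None, 0
        if etype is not None:
            if cur_type is None:
                cur_type, cur_len = etype, 1
            else:
                cur_len += 1
    return spans


def long_entity_ratio(tags: List[str], more_than: int = 3) -> Dict[str, float]:
    """Per type, the fraction of entities with more than `more_than` words."""
    total, long_count = Counter(), Counter()
    for etype, length in entity_spans(tags):
        total[etype] += 1
        if length > more_than:
            long_count[etype] += 1
    return {t: long_count[t] / total[t] for t in total}


toy_tags = ["I-ORG"] * 4 + ["O", "I-ORG", "O", "I-LOC"]
ratios = long_entity_ratio(toy_tags)
```

In the toy sequence, one of the two ORG entities has four words, so the ORG ratio is 0.5, and the single one-word LOC entity gives a LOC ratio of 0.0.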
         esp.a   esp.b   ned.a*  ned.b*  ned.a   ned.b
$L_0$    60.94   65.70   55.50   58.04   54.23   57.43
$L_1$    65.68   69.39   56.72   59.18   56.20   58.84
$L_4$    67.35   71.49   58.23   60.74   58.83   60.93
Table 3: $F_{\beta=1}$ scores for six tests. “*” indicates use of part
of speech tags.

          Answer  $L_0$   $L_1$   $L_2$
:
Colegio   I-ORG   I-ORG   I-ORG   I-ORG
Público   I-ORG   I-ORG   I-ORG   I-ORG
Arias     I-ORG   I-LOC   I-LOC   I-ORG
Montano   I-ORG   I-LOC   I-ORG   I-ORG
,         O       O       O       O
de        O       O       O       O
Badajoz   I-LOC   I-LOC   I-LOC   I-LOC
:
Table 4: Example of the prediction (line 2434 to 2440 in esp.testa)

          Answer  $L_0$   $L_1$   $L_2$   $L_3$
:
el        O       O       O       O       O
libro     O       O       O       O       O
"         I-MISC  I-MISC  I-MISC  I-MISC  I-MISC
Una       I-MISC  O       I-MISC  I-MISC  I-MISC
sonrisa   I-MISC  I-MISC  I-MISC  I-MISC  I-MISC
sin       I-MISC  O       I-MISC  I-MISC  I-MISC
fin       I-MISC  O       O       I-MISC  I-MISC
"         I-MISC  O       O       O       I-MISC
,         O       O       O       O       O
:
Table 5: Example of the prediction (line 16231 to 16239 in esp.testa)

Tables 4 and 5 show examples of the predicted tags through the stacked
levels. Let us see how multiple stacking works using the example in
Table 5. Let the word “fin” be the current position. The answer tag is
“I-MISC”. When we use the base system $L_0$, the predicted tag of the
word is “O”. In the next level, $L_1$ uses the surrounding tag
features “I-MISC, O, (O,) O, O” and also outputs “O”. In the third
level, however, $L_2$ correctly predicts the tag using the surrounding
tag features “I-MISC, I-MISC, (O,) O, O”. Note that no other feature
changes through the levels. The improvement in the example is clearly
caused by multiple stacking. As a result, this MISC entity is
allocated tags correctly by $L_3$. The above effect would not be
achieved by two level stacking. This result clearly shows that the
multiple stacking method has an advantage.
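The walk-through for the word “fin” can be reproduced with a few lines of Python. The tag columns are copied from Table 5; the helper name `tag_context` and the padding token are assumptions, and the elided rows before and after the excerpt are not modelled.

```python
from typing import List

words = ["el", "libro", '"', "Una", "sonrisa", "sin", "fin", '"', ","]
# Outputs of the base system and the first stacked learner, from Table 5.
level0 = ["O", "O", "I-MISC", "O", "I-MISC", "O", "O", "O", "O"]
level1 = ["O", "O", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "O", "O", "O"]


def tag_context(tags: List[str], i: int) -> List[str]:
    """Lower-level tags T_-2, T_-1, T_0, T_1, T_2 around position i."""
    return [tags[j] if 0 <= j < len(tags) else "<pad>" for j in range(i - 2, i + 3)]


i = words.index("fin")
seen_by_level1 = tag_context(level0, i)  # tag features available to the second level
seen_by_level2 = tag_context(level1, i)  # tag features available to the third level
```

The second level sees only one entity tag to the left of “fin”, whereas the third level sees two, because the first stacked learner has already corrected the tag of “sin”. This is how a correction can propagate one position per level.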
Next we examine the effect of the learning algorithm on multiple
stacking. We use real AdaBoost.MH with 300, 1,000, 3,000, 10,000 and
20,000 rounds. Their $F_{\beta=1}$ scores at each stacking level are
plotted in Figure 3. The score improves by stacking for all settings.
The highest score is achieved by 10,000 rounds at every stacking
level. The shapes of the curves in Figure 3 are similar to each other.
This result suggests that the stacking effect is scarcely affected by
the performance of the algorithm.
Figure 3: $F_{\beta=1}$ scores of different base systems (x-axis:
number of stacking learners, 0 to 10; y-axis: $F_{\beta=1}$, 55 to 75;
curves: 300, 1,000, 3,000, 10,000 and 20,000 rounds)
4 Conclusion
We have presented a new method for recognizing named entities by
multiple stacking. This method can improve the performance of the base
system by employing multiple stacked learners and using not only the
current tag but also the surrounding tags predicted by the lower level
learner. By stacking 5 real AdaBoost.MH learners, we obtain an
$F_{\beta=1}$ of 67.35 for the Spanish named entity recognition task.

References
D. M. Bikel, S. Miller, R. Schwartz and R. Weischedel. 1997. Nymble: a
high-performance learning name-finder. In Fifth Conference on Applied
Natural Language Processing.
H. van Halteren, J. Zavrel and W. Daelemans. 1998. Improving data
driven wordclass tagging by system combination. In Proceedings of the
17th COLING and the 36th Annual Meeting of ACL.
T. Kudo and Y. Matsumoto. 2000. Japanese dependency structure analysis
based on support vector machines. In Proceedings of the 2000 Joint
SIGDAT Conference on Empirical Methods in Natural Language Processing
and Very Large Corpora.
R. E. Schapire and Y. Singer. 1999. Improved boosting algorithms using
confidence-rated predictions. Machine Learning, 37(3).
E. F. Tjong Kim Sang and J. Veenstra. 1999. Representing text chunks.
In Proceedings of EACL’99.
D. H. Wolpert. 1992. Stacked generalization. Neural Networks, 5.
H. Yamada, T. Kudo and Y. Matsumoto. 2001. Japanese named entity
extraction using support vector machines. Information Processing
Society of Japan, SIG Notes NL 142-17 (in Japanese).
