In: Proceedings of CoNLL-2000 and LLL-2000, pages 163-165, Lisbon, Portugal, 2000. 
Hybrid Text Chunking 
GuoDong Zhou and Jian Su and TongGuan Tey 
Kent Ridge Digital Labs 
21 Heng Mui Keng Terrace 
Singapore 119613 
{zhougd, sujian, tongguan}@krdl.org.sg
Abstract 
This paper proposes an error-driven HMM-based
text chunk tagger with a context-dependent
lexicon. Compared with a standard HMM-based
tagger, this tagger incorporates more contextual
information into a lexical entry. Moreover, an
error-driven learning approach is adopted to
decrease the memory requirement by keeping only
positive lexical entries, which makes it possible
to incorporate further context-dependent
lexical entries. Finally, memory-based learning
is adopted to further improve the performance
of the chunk tagger.
1 Introduction 
The idea of using statistics for chunking goes
back to Church (1988), who used corpus frequencies
to determine the boundaries of simple
non-recursive noun phrases. Skut and Brants (1998)
modified Church's approach in a way that permits
efficient and reliable recognition of structures of
limited depth, and encoded the structure in such
a way that it can be recognised by a Viterbi
tagger. Our approach follows Skut and Brants
in employing an HMM-based tagging method
to model the chunking process.
2 HMM-based Chunk Tagger with 
Context-dependent Lexicon 
Given a token sequence $G_1^n = g_1 g_2 \cdots g_n$,
the goal is to find an optimal tag sequence
$T_1^n = t_1 t_2 \cdots t_n$ which maximizes $\log P(T_1^n | G_1^n)$:

$$\log P(T_1^n | G_1^n) = \log P(T_1^n) + \log \frac{P(T_1^n, G_1^n)}{P(T_1^n) P(G_1^n)}$$

The second item in the above equation is the
mutual information between the tag sequence
$T_1^n$ and the given token sequence $G_1^n$. By
assuming that the mutual information between
$G_1^n$ and $T_1^n$ is equal to the summation of mutual
information between $G_1^n$ and the individual tag
$t_i$ ($1 \le i \le n$):

$$\log \frac{P(T_1^n, G_1^n)}{P(T_1^n) P(G_1^n)} = \sum_{i=1}^{n} \log \frac{P(t_i, G_1^n)}{P(t_i) P(G_1^n)}$$

$$MI(T_1^n, G_1^n) = \sum_{i=1}^{n} MI(t_i, G_1^n),$$

we have:

$$\begin{aligned}
\log P(T_1^n | G_1^n) &= \log P(T_1^n) + \sum_{i=1}^{n} \log \frac{P(t_i, G_1^n)}{P(t_i) P(G_1^n)} \\
&= \log P(T_1^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i | G_1^n)
\end{aligned}$$
The first item of the above equation can be computed
by the chain rule. Normally, each tag is assumed
to be probabilistically dependent on the N-1
previous tags. Here, a backoff bigram (N=2) model is
used. The second item is the summation of the log
probabilities of all the tags. The first and
second items together constitute the language model
component, while the third item constitutes the
lexicon component. Ideally, the third item can
be estimated recursively by the forward-backward
algorithm (Rabiner 1989) for first-order
(Rabiner 1989) or second-order HMMs.
However, several approximations to it are
attempted later in this paper instead. The
stochastically optimal tag sequence can be found
by maximizing the above equation over all
possible tag sequences using the Viterbi
algorithm.
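
As an illustration of the decoding step, the following is a minimal sketch, assuming that the three terms of the objective are already available as log-probability tables. The names `viterbi_chunk`, `log_trans`, `log_prior` and `log_lex` are hypothetical and do not come from the tagger itself. With a bigram model, log P(T) expands by the chain rule into log P(t_1) plus the sum of log P(t_i | t_{i-1}); backoff smoothing is assumed to have been applied when the tables were built.

```python
import math


def viterbi_chunk(tags, log_trans, log_prior, log_lex):
    """Return the tag sequence maximising
    log P(T) - sum_i log P(t_i) + sum_i log P(t_i | G)
    under a bigram tag model.

    tags      : list of candidate tags
    log_trans : dict (prev_tag, tag) -> log P(tag | prev_tag)
    log_prior : dict tag -> log P(tag)
    log_lex   : list (one dict per position) tag -> log P(tag | G)
    """
    n = len(log_lex)
    if n == 0:
        return []
    neg_inf = float("-inf")
    # Position 1: log P(t_1) - log P(t_1) cancels, leaving the lexical term.
    score = {t: log_lex[0].get(t, neg_inf) for t in tags}
    back = []
    for i in range(1, n):
        new_score, pointers = {}, {}
        for t in tags:
            # Lexicon term minus the unigram tag term for this position.
            local = log_lex[i].get(t, neg_inf) - log_prior[t]
            best_prev, best = tags[0], neg_inf
            for p in tags:
                s = score[p] + log_trans.get((p, t), neg_inf) + local
                if s > best:
                    best_prev, best = p, s
            new_score[t], pointers[t] = best, best_prev
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(score, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example with two hypothetical tags.
toy_tags = ["B", "I"]
toy_prior = {"B": math.log(0.5), "I": math.log(0.5)}
toy_trans = {("B", "B"): math.log(0.2), ("B", "I"): math.log(0.8),
             ("I", "B"): math.log(0.5), ("I", "I"): math.log(0.5)}
toy_lex = [{"B": math.log(0.9), "I": math.log(0.1)},
           {"B": math.log(0.4), "I": math.log(0.6)},
           {"B": math.log(0.7), "I": math.log(0.3)}]
print(viterbi_chunk(toy_tags, toy_trans, toy_prior, toy_lex))  # ['B', 'I', 'B']
```

In the sketch, the subtraction of `log_prior[t]` is the only change relative to a textbook bigram Viterbi decoder; it corresponds to the second item of the equation.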
The main difference between our tagger and
standard taggers is that our tagger has a
context-dependent lexicon, while the others use a
context-independent lexicon.
For the chunk tagger, we have $g_i = p_i w_i$, where
$W_1^n = w_1 w_2 \cdots w_n$ is the word sequence and
$P_1^n = p_1 p_2 \cdots p_n$ is the part-of-speech (POS)
sequence. Here, we use structural tags to
represent the chunking (bracketing and labeling)
structure. The basic idea of representing
the structural tags is similar to Skut and
Brants (1998), and the structural tag consists of
three parts:
1) Structural relation. The basic idea is simple:
structures of limited depth are encoded
using a finite number of flags. Given a sequence
of input tokens (here, the word and POS
pairs), we consider the structural relation
between the previous input token and the current
one. For the recognition of chunks, it is sufficient
to distinguish the following four different
structural relations, which uniquely identify the
sub-structures of depth 1 (Skut and Brants used
seven different structural relations to identify
the sub-structures of depth 2).
• 00: the current input token and the previ- 
ous one have the same parent 
• 90: one ancestor of the current input token 
and the previous input token have the same 
parent 
• 09: the current input token and one an- 
cestor of the previous input token have the 
same parent 
• 99: one ancestor of the current input token
and one ancestor of the previous input token
have the same parent
Compared with the B-Chunk and I-Chunk tags
used in Ramshaw and Marcus (1995), structural
relations 99 and 90 correspond to B-Chunk,
which marks the first word of a chunk,
and structural relations 00 and 09 correspond
to I-Chunk, which marks every other word in the
chunk. In addition, 90 also marks the beginning of the
sentence and 09 the end of the sentence.
2) Phrase category. This is used to identify
the phrase categories of input tokens.
3) Part-of-speech. Because the number of
structural relations and phrase categories is
limited, the POS is added to the structural tag
to give a more accurate model.
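
The three-part structural tag can be sketched as follows. This is only an illustration under stated assumptions: each token is assumed to carry an identifier of the chunk it belongs to (the `chunk_ids` encoding and both function names are hypothetical), and a token outside any phrase is assumed to form its own chunk, as with the O-labeled punctuation in the example sentence of Section 4.

```python
def structural_relations(chunk_ids):
    """Relation codes between adjacent positions of one sentence.

    chunk_ids[i] identifies the chunk containing token i (assumed encoding).
    The result has len(chunk_ids) + 1 entries: entry i relates token i-1 to
    token i, with the sentence boundary standing in at either end.
    """
    if not chunk_ids:
        return []
    relations = ["90"]  # sentence start: the first chunk opens here
    for prev, cur in zip(chunk_ids, chunk_ids[1:]):
        # Same chunk -> same parent (00); otherwise a new chunk begins (99).
        relations.append("00" if prev == cur else "99")
    relations.append("09")  # sentence end: the last chunk closes here
    return relations


def structural_tags(chunk_ids, categories, pos_tags):
    """Three-part tag (relation, phrase category, POS) for each token."""
    relations = structural_relations(chunk_ids)
    # The trailing sentence-end relation belongs to no token, so drop it.
    return list(zip(relations[:len(pos_tags)], categories, pos_tags))


# "He reckons the current account deficit": chunks NP, VP, NP.
print(structural_relations([0, 1, 2, 2, 2, 2]))
# ['90', '99', '99', '00', '00', '00', '09']
```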
In principle, the current chunk depends
on all the context words and their POSs.
However, in order to decrease the memory
requirement and computational complexity, our
baseline HMM-based chunk tagger only considers
the previous POS, the current POS, and the word
tokens whose POSs are of certain kinds, such as
preposition and determiner. The overall
precision, recall and $F_{\beta=1}$ rates of our baseline
tagger on the test data of the shared task are
89.58%, 89.56% and 89.57%.
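
For reference, the F rate with β=1 is the harmonic mean of precision and recall. The short sketch below (the function name `f_beta` is illustrative) reproduces the baseline figure from the reported precision and recall.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Baseline tagger: precision 89.58%, recall 89.56%.
print(round(f_beta(89.58, 89.56), 2))  # 89.57
```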
3 Error-driven Learning 
After analysing the chunking results, we find
that many errors are caused by a limited number of
words. In order to overcome such errors, we
include such words in the chunk dependence
context by using error-driven learning. First,
the above HMM-based chunk tagger is used to
chunk the training data. Secondly, the chunk
tags determined by the chunk tagger are
compared with the given chunk tags in the training
data, and the number of chunking errors is
counted for each word. Finally, those words whose
error counts are equal to or above a given
threshold (i.e. 3) are kept. The HMM-based
chunk tagger is then re-trained with those words
included in the chunk dependence context.
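
The word selection step can be sketched as below, assuming the training data is available as three parallel lists over tokens (words, given chunk tags, and the tags produced by the baseline tagger). The function name and argument layout are hypothetical, and the re-training step itself is not shown.

```python
from collections import Counter


def select_error_words(words, gold_tags, predicted_tags, threshold=3):
    """Return the words whose chunking error count reaches the threshold."""
    errors = Counter()
    for word, gold, pred in zip(words, gold_tags, predicted_tags):
        if gold != pred:
            errors[word] += 1  # one chunking error attributed to this word
    return {word for word, count in errors.items() if count >= threshold}


# Toy data: "that" is mis-tagged three times, "the" once.
words = ["that", "that", "that", "the", "the", "of"]
gold = ["B-SBAR", "B-NP", "B-SBAR", "B-NP", "B-NP", "B-PP"]
pred = ["B-NP", "B-SBAR", "B-NP", "I-NP", "B-NP", "B-PP"]
print(select_error_words(words, gold, pred))  # {'that'}
```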
The overall precision, recall and $F_{\beta=1}$ rates
of our error-driven HMM-based chunk tagger
on the test data of the shared task are 91.53%,
92.02% and 91.77%.
4 Memory-based Learning
Memory-based learning has been widely used
in NLP tasks in the last decade. In principle, it
falls into two paradigms. The first paradigm
represents examples as sets of features and
carries out induction by finding the most similar
cases. Such work includes Daelemans et
al. (1996) for POS tagging and Cardie (1993)
for syntactic and semantic tagging. The second
paradigm makes use of raw sequential data
and generalises by reconstructing test examples
from different pieces of the training data. Such
work includes Bod (1992) for parsing, Argamon
et al. (1998) for shallow natural language
patterns and Daelemans et al. (1999) for shallow
parsing.
The memory-based method presented here 
follows the second paradigm and makes use of 
raw sequential data. Here, generalization is per- 
formed online at recognition time by comparing 
the new pattern to the ones in the training cor- 
pus. 
Given one of the N most probable chunk
sequences extracted by the error-driven
HMM-based chunk tagger, we can extract a set of
chunk patterns, each of them with the format:
$XP = p_0\, r_0\, p_1^n\, r_n\, p_{n+1}$, where $r_i$ is the
structural relation between $p_i$ and $p_{i+1}$.
As an example, from the bracketed and
labeled sentence:
[NP He/PRP ] [VP reckons/VBZ ]
[NP the/DT current/JJ account/NN
deficit/NN ] [VP will/MD narrow/VB ]
[PP to/TO ] [NP only/RB #/#
1.8/CD billion/CD ] [PP in/IN ] [NP
September/NNP ] [O ./. ]
we can extract the following chunk patterns:
NP=NULL 90 PRP 99 VBZ 
VP=PRP 99 VBZ 99 DT 
NP=VBZ 99 DT JJ NN NN 99 MD 
PP=VB 99 TO 99 RB 
NP=TO 99 RB # CD CD 99 IN 
PP=CD 99 IN 99 NNP 
NP=IN 99 NNP 99 . 
O=NNP 99 . 09 NULL 
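
The extraction above can be sketched as follows, assuming the sentence is given as a list of (phrase category, POS list) pairs; this input layout and the name `extract_chunk_patterns` are illustrative only. Note that the sketch emits one pattern per chunk, so for the example sentence it also produces a pattern for the VP "will narrow", which does not appear in the list above.

```python
def extract_chunk_patterns(chunks):
    """Build 'XP=p0 r0 p1 ... pn rn pn+1' strings, one per chunk.

    chunks: list of (label, [POS, ...]) pairs in sentence order, each with
    at least one POS tag (assumed input format).
    """
    patterns = []
    for k, (label, pos_seq) in enumerate(chunks):
        if k == 0:
            left_pos, left_rel = "NULL", "90"  # sentence start
        else:
            left_pos, left_rel = chunks[k - 1][1][-1], "99"
        if k == len(chunks) - 1:
            right_rel, right_pos = "09", "NULL"  # sentence end
        else:
            right_rel, right_pos = "99", chunks[k + 1][1][0]
        parts = [left_pos, left_rel, *pos_seq, right_rel, right_pos]
        patterns.append(label + "=" + " ".join(parts))
    return patterns


sentence = [("NP", ["PRP"]), ("VP", ["VBZ"]),
            ("NP", ["DT", "JJ", "NN", "NN"]), ("VP", ["MD", "VB"]),
            ("PP", ["TO"]), ("NP", ["RB", "#", "CD", "CD"]),
            ("PP", ["IN"]), ("NP", ["NNP"]), ("O", ["."])]
for pattern in extract_chunk_patterns(sentence):
    print(pattern)
```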
For every chunk pattern, we estimate its
probability by using memory-based learning. If the
chunk pattern exists in the training corpus,
its probability is computed as the relative frequency
of that pattern among all the chunk patterns.
Otherwise, its probability is estimated as the
product of the probabilities of its overlapping
sub-patterns. Then the probability of each of the
N most probable chunk sequences is adjusted by
multiplying it by the probabilities of its
extracted chunk patterns.
Table 1 shows the performance of the error-driven
HMM-based chunk tagger with memory-based
learning.
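
The rescoring step can be sketched as below. Several details are assumptions made only for illustration: the text does not specify how a pattern is divided into overlapping sub-patterns, so the sketch splits an unseen pattern into a left part (left context plus chunk body) and a right part (chunk body plus right context); it also assigns a small floor probability to parts never seen in training. All names (`split_pattern`, `PatternMemory`, `rescore`, `floor`) are hypothetical.

```python
import math
from collections import Counter


def split_pattern(pattern):
    """Split (label, p0, r0, p1..pn, rn, pn+1) into two parts that overlap
    on the chunk body. This particular split is an assumption."""
    label, p0, r0, *body, rn, p_next = pattern
    return (label, p0, r0, *body), (label, *body, rn, p_next)


class PatternMemory:
    def __init__(self, training_patterns, floor=1e-6):
        self.full, self.left, self.right = Counter(), Counter(), Counter()
        self.total = 0
        self.floor = floor  # assumed probability for unseen material
        for pattern in training_patterns:
            left, right = split_pattern(pattern)
            self.full[pattern] += 1
            self.left[left] += 1
            self.right[right] += 1
            self.total += 1

    def _prob(self, count):
        return count / self.total if count else self.floor

    def log_prob(self, pattern):
        if self.full[pattern]:
            # Seen pattern: relative frequency among all training patterns.
            return math.log(self._prob(self.full[pattern]))
        # Unseen pattern: product of the two overlapping sub-patterns.
        left, right = split_pattern(pattern)
        return (math.log(self._prob(self.left[left]))
                + math.log(self._prob(self.right[right])))


def rescore(nbest, memory):
    """nbest: list of (hmm_log_prob, [pattern, ...]).
    Returns the index of the candidate with the best adjusted score."""
    def adjusted(item):
        hmm_log_prob, patterns = item
        return hmm_log_prob + sum(memory.log_prob(p) for p in patterns)
    return max(range(len(nbest)), key=lambda i: adjusted(nbest[i]))
```

Working in log space turns the multiplication of probabilities into a sum, which avoids underflow when a candidate sequence contains many patterns.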
5 Conclusion 
Error-driven learning improves the F-measure by
2.20% (from 89.57% to 91.77%), and the
integration of memory-based learning further
improves it by 0.35% to 92.12%.
In future work, experiments on larger-scale
tasks will be carried out.
A closer integration of the memory-based
method with the HMM-based chunk tagger will also
be investigated.
test data   precision   recall    $F_{\beta=1}$
ADJP        76.17%      70.78%    73.37
ADVP        78.25%      78.52%    78.39
CONJP       46.67%      77.78%    58.33
INTJ        20.00%      50.00%    28.57
LST         00.00%      00.00%    00.00
NP          92.19%      92.59%    92.39
PP          96.09%      96.94%    96.51
PRT         72.36%      83.96%    77.73
SBAR        83.56%      79.81%    81.64
VP          92.77%      92.85%    92.81
all         91.99%      92.25%    92.12
Table 1: performance of chunking

References 
S. Argamon, I. Dagan, and Y. Krymolowski. 1998.
A memory-based approach to learning shallow
natural language patterns. In COLING/ACL-1998,
pages 67-73. Montreal, Canada.
R. Bod. 1992. A computational model of lan- 
guage performance: Data-oriented parsing. In 
COLING-1992, pages 855-859. Nantes, France. 
C. Cardie. 1993. A case-based approach to knowledge
acquisition for domain-specific sentence analysis.
In Proceedings of the 11th National Conference
on Artificial Intelligence, pages 798-803.
Menlo Park, CA, USA. AAAI Press.
K.W. Church. 1988. A stochastic parts program and 
noun phrase parser for unrestricted text. In Pro- 
ceedings of Second Conference on Applied Natu- 
ral Language Processing, pages 136-143. Austin, 
Texas, USA. 
W. Daelemans, J. Zavrel, P. Berck, and S. Gillis.
1996. MBT: A memory-based part-of-speech tagger
generator. In Proceedings of the Fourth Workshop
on Large Scale Corpora, pages 14-27. ACL
SIGDAT.
W. Daelemans, S. Buchholz, and J. Veenstra. 1999. 
Memory-based shallow parsing. In CoNLL-1999, 
pages 53-60. Bergen, Norway. 
L.R. Rabiner. 1989. A tutorial on hidden Markov
models and selected applications in speech
recognition. In Proceedings of the IEEE, volume 77,
pages 257-286.
Lance A. Ramshaw and Mitchell P. Marcus. 1995. 
Text chunking using transformation-based learn- 
ing. In Proceedings of the Third ACL Work- 
shop on Very Large Corpora. Cambridge, Mas- 
sachusetts, USA. 
W. Skut and T. Brants. 1998. Chunk tagger:
statistical recognition of noun phrases. In
ESSLLI-1998 Workshop on Automated Acquisition of
Syntax and Parsing. Saarbrücken, Germany.
