In: Proceedings of CoNLL-2000 and LLL-2000, pages 148-150, Lisbon, Portugal, 2000. 
Improving Chunking by Means of Lexical-Contextual Information 
in Statistical Language Models 
Ferran Pla and Antonio Molina and Natividad Prieto 
Universitat Politècnica de València 
Camí de Vera s/n 
46020 València (Spain) 
{fpla, amolina, nprieto}@dsic.upv.es 
1 Introduction 
In this work, we present a stochastic approach 
to shallow parsing. Most of the current ap- 
proaches to shallow parsing have a common 
characteristic: they take the sequence of lex- 
ical tags proposed by a POS tagger as input 
for the chunking process. Our system produces 
tagging and chunking in a single process using 
an Integrated Language Model (ILM) formal- 
ized as Markov Models. This model integrates 
several knowledge sources: lexical probabilities, 
a contextual Language Model (LM) for every 
chunk, and a contextual LM for the sentences. 
We have extended the ILM by adding lexical in- 
formation to the contextual LMs. We have ap- 
plied this approach to the CoNLL-2000 shared 
task, improving the performance of the chunker. 
2 Overview of the system 
The baseline system described in (Pla et al., 
2000a) uses bigrams, formalized as finite-state 
automata. It is a transducer composed of two 
levels (see Figure 1). The upper one (Figure 1a) 
represents the contextual LM for the sentences. 
The symbols associated with the states are POS tags (Ci) and chunk descriptors (Si). The lower one models the different chunks considered (Figure 1b). In this case, the symbols are the 
POS tags (Ci) that belong to the correspond- 
ing chunk (Si). Next, a regular substitution of 
the lower models into the upper level is made 
(Figure 1c). In this way, we get a single Inte- 
grated LM which shows the possible concate- 
nations of lexical tags and chunks. Also, each 
state is relabeled with a tuple (Ci, Sj), where Ci ∈ C and Sj ∈ S. C is the POS tag set used and S = {[Si, Si], Si, S0} is the chunk set defined. [Si and Si] stand for the initial and the final state of the chunk whose descriptor is Si. The label Si is assigned to those states which are inside the Si chunk, and S0 is assigned to those states 
which are outside of any chunk. All the LMs 
involved have been smoothed by using a back- 
off technique (Katz, 1987). We have not spec- 
ified lexical probabilities in every state of the 
different contextual models. We assumed that 
P(Wj | (Ci, Si)) = P(Wj | Ci) for every Si ∈ S. 
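For illustration, a smoothed bigram LM of the kind used for both the sentence-level and the chunk-level models could be sketched as follows. This is a minimal Python stand-in with a deliberately simplified back-off; it is not the exact Katz discounting scheme used in the paper, and all names are ours.

```python
from collections import Counter

class BackoffBigramLM:
    """Bigram LM with a simplified back-off to unigrams.

    Illustrative stand-in for the Katz back-off smoothing cited in the
    paper; the discounting used here is deliberately minimal.
    """

    def __init__(self, sequences, alpha=0.4):
        self.alpha = alpha                    # fixed back-off weight (assumption)
        self.bigrams = Counter()
        self.unigrams = Counter()
        for seq in sequences:                 # seq: list of symbols, e.g. POS tags
            for prev, cur in zip(["<s>"] + seq, seq + ["</s>"]):
                self.bigrams[(prev, cur)] += 1
                self.unigrams[prev] += 1
        self.total = sum(self.unigrams.values())

    def prob(self, cur, prev):
        """Smoothed P(cur | prev)."""
        if self.unigrams[prev] and self.bigrams[(prev, cur)]:
            return (1 - self.alpha) * self.bigrams[(prev, cur)] / self.unigrams[prev]
        # back off to an add-one unigram estimate for unseen bigrams
        return self.alpha * (self.unigrams[cur] + 1) / (self.total + len(self.unigrams))
```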
Once the integrated transducer has been 
made, the tagging and shallow parsing process 
consists of finding the sequence of states of max- 
imum probability on it for an input sentence. 
Therefore, this sequence must be compatible 
with the contextual, syntactic and lexical con- 
straints. This process can be carried out by 
dynamic programming using the Viterbi algo- 
rithm (Viterbi, 1967), which has been appropri- 
ately modified to use our models. From the dy- 
namic programming trellis, we can obtain the 
maximum probability path for the input sen- 
tence through the model, and thus the best se- 
quence of lexical tags and the best segmentation 
in chunks, in a single process. 
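The decoding step can be sketched as follows. This is an illustrative Python version of standard Viterbi search over the integrated states, where `trans_prob`, `emit_prob` and `init_prob` stand in for the smoothed contextual and lexical models; their names and interfaces are our own assumptions, not part of the original system.

```python
import math

def viterbi_decode(words, states, trans_prob, emit_prob, init_prob):
    """Viterbi search over the integrated model (illustrative sketch).

    `states` are (POS tag, chunk label) tuples; trans_prob(s, prev) is
    P(s | prev), emit_prob(w, s) is the lexical probability of word w
    in state s, and init_prob(s) is the initial-state probability.
    """
    V = [{}]                                  # V[t][state] = (best log prob, previous state)
    for s in states:
        V[0][s] = (math.log(init_prob(s) * emit_prob(words[0], s) or 1e-300), None)
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            V[t][s] = max(
                ((lp + math.log(trans_prob(s, prev) * emit_prob(words[t], s) or 1e-300), prev)
                 for prev, (lp, _) in V[t - 1].items()),
                key=lambda x: x[0])
    # backtrack the best path: it yields a POS tag and a chunk label per word
    last = max(V[-1], key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))
```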
3 Specialized Contextual Language 
Models 
The contextual model for the sentences and the 
models for chunks (and, therefore, the ILM) can 
be modified taking into account certain words 
in the context where they appear. This specialization allows us to set certain contextual constraints which modify the contextual LMs and improve the performance of the chunker (as 
shown below). This set of words can be defined 
using some heuristics such as: the most frequent 
words in the training corpus, the words with a 
higher tagging error rate, the words that belong 
to closed classes (prepositions, pronouns, etc.), 
or any other word chosen following some linguistic criterion. 

Figure 1: Integrated Language Model for Tagging and Chunking: (a) contextual LM for the sentences, (b) LMs for the chunks, (c) integrated LM. 
To do this, we added to the POS tag set the 
set of structural tags (Wi, Cj) for each special- 
ized word Wi in all of their possible categories 
Cj. Then, we relabelled the training corpus: if 
a word Wi was labelled with the POS tag Cj, 
we changed Cj to the pair (Wi, Cj). The learn- 
ing process of the bigram LMs was carried out 
from this new training data set. 
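A minimal sketch of this relabelling step (the function name and the lowercasing of specialized words are assumptions made only for the example):

```python
def specialize_corpus(tagged_sentences, specialized_words):
    """Relabel a POS-tagged corpus for the specialized words.

    Each (word, tag) pair whose word is in `specialized_words` gets the
    structural tag (word, tag) instead of the bare POS tag, so that the
    bigram LMs learnt from the relabelled data become word-sensitive.
    """
    relabelled = []
    for sentence in tagged_sentences:          # sentence: list of (word, tag)
        new_sentence = []
        for word, tag in sentence:
            if word.lower() in specialized_words:
                new_sentence.append((word, (word.lower(), tag)))   # (Wi, Cj)
            else:
                new_sentence.append((word, tag))                   # plain Cj
        relabelled.append(new_sentence)
    return relabelled
```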
The contextual LMs obtained have some specific states which are related to the specialized words. In the basic Language Model (ILM), a state was labelled by (Ci, Sj). In the specialized ILM, a state can be specialized for a certain word Wk (only if the word Wk belongs to the category Ci). In this case, the state is relabelled with the tuple (Wk, Ci, Sj) and only the word Wk can be emitted from it, with a probability equal to 1. 
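The resulting emission behaviour can be summarized with the following sketch; the dictionary-based lexical model and the lowercasing are assumptions of the example, not details given in the paper.

```python
def emission_prob(word, state, lexical_prob):
    """Emission probability under the (specialized) ILM, as a sketch.

    Plain states (Ci, Sj) share the lexical model P(w | Ci); specialized
    states (Wk, Ci, Sj) emit only the word Wk, with probability 1.
    """
    if len(state) == 3:                        # specialized state (Wk, Ci, Sj)
        wk, _, _ = state
        return 1.0 if word.lower() == wk else 0.0
    ci, _ = state                              # plain state (Ci, Sj)
    return lexical_prob.get((word, ci), 0.0)   # P(Wj | Ci), shared for every Sj
```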
4 Experimental Work 
We applied both approaches (ILM and spe- 
cialized ILM) using the training and test data 
of the CoNLL-2000 shared task (http://lcg- 
www.uia.ac.be/conll2000). We also evaluated 
how the performance of the chunker varies when 
we modify the specialized word set. Neverthe- 
less, the use of our approach on other corpora 
(including different languages), other lexical tag 
sets or other kinds of chunks can be done in a 
direct way. 
Although our system is able to carry out tag- 
ging and chunking in a single process, we will 
not present tagging results for this task, as the 
POS tags of the data set used are not supervised 
and, therefore, a comparison is not possible. 
We would like to point out that we have simu- 
lated a morphological analyzer for English. We 
have constructed a tag dictionary with the lex- 
icon of the training set and the test set used. 
This dictionary gave us the possible lexical tags 
for each word from the corpus. In no case was the test set used to estimate the lexical probabilities. 
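A sketch of how such a tag dictionary can be collected from tagged data (the function name is illustrative only):

```python
from collections import defaultdict

def build_tag_dictionary(tagged_sentences):
    """Collect the set of possible POS tags for every word form.

    A stand-in for the simulated morphological analyzer: the dictionary
    is built from the lexicon of the tagged corpus only; no lexical
    probabilities are estimated from the test data.
    """
    tag_dict = defaultdict(set)
    for sentence in tagged_sentences:          # sentence: list of (word, tag)
        for word, tag in sentence:
            tag_dict[word].add(tag)
    return tag_dict
```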
As stated above, several criteria can be cho- 
sen to define the set of specialized words. We 
have selected the most frequent words in the 
training data set. We have not taken into ac- 
count certain words such as punctuation sym- 
bols, proper nouns, numbers, etc. This fact did 
not decrease the performance of the chunker and 
also reduced the number of states of the contex- 
tual LMs. Figure 2 shows how the performance 
of the chunker (Fβ=1) improves as a function of 
the size of the specialized word set. The best re- 
sults were obtained with the set of words whose 
frequency in the training corpus was larger than 
80 (about 470 words). We obtained similar re- 
sults when only considering the words of the 
training set belonging to closed classes (that, 
about, as, if, out, while, whether, for, to, ...). 
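The frequency-based selection can be sketched as follows; the concrete list of excluded POS tags is an assumption made for the example.

```python
from collections import Counter

def select_specialized_words(tagged_sentences, min_freq=80,
                             excluded_tags=(".", ",", ":", "NNP", "NNPS", "CD")):
    """Pick the specialized word set by frequency, as described above.

    Words are kept if they occur more than `min_freq` times in the
    training data; punctuation, proper nouns and numbers are excluded
    via their POS tags (the exact excluded tag list is an assumption).
    """
    counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            if tag not in excluded_tags:
                counts[word.lower()] += 1
    return {w for w, c in counts.items() if c > min_freq}
```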
In Table 1 we present the results of chunk- 
ing with the specialized ILM. When comparing 
these results with the results obtained using the 
basic ILM, we observed that, in general, the F- 
score was improved for each chunk. The best 
improvement was observed for SBAR (from 0.37 
to 79.46), PP (from 88.94 to 95.51) and PRT 
(38.82 to 66.67). 
5 Conclusions 
In this paper, we have presented a system for 
Tagging and Chunking based on an Integrated 
Language Model that uses a homogeneous for- 
malism (finite-state machine) to combine differ- 
ent knowledge sources. It is feasible both in terms of performance and in terms of computational efficiency. 
All the models involved are learnt automat- 
ically from data, so the system is very flexible 
with changes in the reference language, changes 
in POS tags or changes in the definition of 
chunks. 
Our approach allows us to use any regular 
model which has been previously defined or 
learnt. In previous work, we have used bi- 
grams (Pla et al., 2000a), and we have com- 
bined them with other more complex models 
which had been learnt using grammatical in- 
ference techniques (Pla et al., 2000b). In this 
work, we used only bigram models improved 
with lexical-contextual information. 
The Fβ=1 score obtained increased from 86.64 to 90.14 when we used the specialized ILM. 
Figure 2: Fβ=1 of the chunker as a function of the number of specialized words (from 50 to 500). 
test data    precision    recall      Fβ=1
ADJP          72.89%      66.89%     69.76
ADVP          79.65%      74.13%     76.79
CONJP         40.00%      66.67%     50.00
INTJ         100.00%     100.00%    100.00
LST            0.00%       0.00%      0.00
NP            90.28%      89.41%     89.84
PP            95.89%      95.14%     95.51
PRT           60.31%      74.53%     66.67
SBAR          82.07%      77.01%     79.46
VP            91.53%      91.58%     91.55
all           90.63%      89.65%     90.14

Table 1: Chunking results using the specialized ILM (Accuracy = 93.79%) 
Nevertheless, we believe that the models could be improved with a more detailed study of the words whose contextual information is really relevant to tagging and chunking. 
6 Acknowledgments 
This work has been partially supported by the 
Spanish Research Project CICYT (TIC97-0671- 
C02-01/02). 

References 
S. M. Katz. 1987. Estimation of Probabilities from 
Sparse Data for the Language Model Component 
of a Speech Recognizer. IEEE Transactions on 
Acoustics, Speech and Signal Processing, 35. 
F. Pla, A. Molina, and N. Prieto. 2000a. Tagging 
and Chunking with Bigrams. In Proceedings of the 
COLING-2000, Saarbrücken, Germany, August. 
F. Pla, A. Molina, and N. Prieto. 2000b. An Inte- 
grated Statistical Model for Tagging and Chunk- 
ing Unrestricted Text. In Proceedings of the Text, 
Speech and Dialogue 2000, Brno, Czech Republic, 
September. 
A. J. Viterbi. 1967. Error Bounds for Convolutional 
Codes and an Asymptotically Optimal Decoding 
Algorithm. IEEE Transactions on Information 
Theory, pages 260-269, April. 
