Tagging and Chunking with Bigrams 
Ferran Pla, Antonio Molina and Natividad Prieto 
Universitat Politbcnica de Val5ncia 
Departament de Sistemes Inform\tics i Computaci6 
Camf de Vera s/n 
46(120 ValSncia 
{fpla,amolina,nprieto}@dsic.upv.es 
Abstract 
In this paper we present an integrated system for 
tagging and chunking texts from a certain . 
The approach is based on stochastic finite-state 
models that are learnt automatically. This includes 
bigrmn models or tinite-state automata learnt using 
grammatical inference techniques. As the models in- 
volved in our system are learnt automatically, this 
is a very flexible and portable system. 
Itl order to show the viability of our approach we 
t)resent results for tagging mid chunking using bi- 
grain models on the Wall Street Journal corpus. We 
have achieved an accuracy rate for tagging of 96.8%, 
and a precision rate tbr NP chunks of 94.6% with a 
recall rate of 93.6%. 
1 Introduction 
Part of Speech Tagging and Shallow Parsing are two 
well-known problems in Natural Language Process- 
ing. A Tagger can be considered as 2 translator that 
reads sentences from a certain  and outputs 
the corresponding sequences of part of speech (POS) 
tags, taking into account the context in which each 
word of the sentence appears. A Shallow Parser in- 
volves dividing sentences into non-overlapping seg- 
ments on the basis of very superticial analysis. It; 
includes discovering the main constituents of the 
sentences (NPs, VPs, PPs, ...) and their heads. 
Shallow Parsing usually identifies non-recnrsive con- 
stituents, also called chunks (Abney, 1991) (such as 
non-recursive Noun Phrases or base NP, base VP, 
and so on). It can include deterlnining syntactical 
relationships such as subject-verb, verb-object, etc., 
Shallow parsing wlfich always follows tlm tagging 
process, is used as a fast and reliable pre-l)rocessing 
phase for full or partial parsing. It can be used for 
hffbrmation Retrieval Systems, Information Extrac- 
tion, Text Summarization and Bilingual Alignment. 
In addition, it is also used to solve colnputational 
linguistics tasks such as disambiguation t)roblems. 
1.1 POS Tagging Approaches 
The different aI)proaches for solving this problem 
can be classified into two main classes depending 
oi1 tile tendencies followed for establishing tile Lan- 
guage Model (LM): tile linguistic apI)roach, based 
oil hand-coded linguistic rules and the learning ap- 
I)roach derived fi'om a corpora (labelled or non- 
labelled). Other at)proximations that use hybrid 
methods have also been proposed (Voutilaiuen and 
Padr6, 1997). 
In tim linguistic apl)roach, an exI)ert linguist is 
needed to formalise the restrictions of the . 
This implies a very lfigh cost and it is very depen- 
dent on each particular . We can lind an 
important contribution (Voutilainen, :1995) that uses 
Constraint Grammar tbrmalism. Supervised learn- 
ing methods were proposed in (Brill, 1995) to learn 
a set, of transforlnation rules that repair tim error 
committed by a probabilistic tagger. The main a(t- 
vantage of the linguistic approach is that the model 
is constructed from a linguistic I)oint of view and 
contains many and complex kinds of knowledge_ 
iI1 tim lem'ning approach, tile most extended 
tbrmalism is based on n-grains or IIMM. In tiffs 
case, the  inodel can be estimated from 
a labelled corpus (supervised methods) (Church, 
1988)(Weisehedel et al., 1.993) or from a non- 
labelled corpus (unsupervised methods) (Cutting et 
21., 1992). In the first; case, the model is trained from 
the relative observed Dequencies. In the second one, 
the model is learned using the Baunl-\¥elch algo- 
rithm from an initial model which is estimated using 
labelled corpora (Merialdo, 1994). The advantages 
of the unsupervised approach are the facility to tmild 
 models, the flexibility of choice of cate- 
gories and the ease of apt)lication to other s. 
We can find some other machine-learning approaches 
that use more sophisticated LMs, such as Decision 
Trees (Mhrquez and Rodrfguez, 1998)(Magerman, 
1996), memory-based approaclms to learn special de- 
cision trees (Daelemans et al., 1996), maximmn en- 
tropy approaches that combine statistical informa- 
tion from different sources (Ratnaparkhi, 1996), fi- 
nite state autonmt2 inferred using Grammatical In- 
ference (Pla and Prieto, 1998), etc. 
The comparison among different al)t)roaches is dif 
ficult due to the nmltiple factors that can be eonsid- 
614 
ered: tile languagK, tile mmfl)er and tyt)e of the tags, 
the size of tilt vocabulary, thK ambiguity, the diiti- 
culty of the test ski, Kte. The best rKsults rel)orted 
on the Wall Street ,lore'hal (WSJ) %'e('.l)ank (\]~'\[al'CllS 
el al., 1993), using statistical  models, have 
an ae(:uracy rack) between 95% and 97% (del)Knding 
on the different factors mKntiono.d al)ove). For the 
linguistic al)proach tim results ark l)etter. For exmn- 
p\]e, in (Voutilaineu, 1995) an accuracy of 99.7% is 
rel)orted , but cKrtain ambiguities ill thK ou|;tnl(; re- 
main unsolved. Some works have recently l)een pul)- 
lished (Brill and Wu, 1998) in which a sel; of taggers 
are combined in order to lint)rove the.Jr l/erfornmn(:e. 
In some cases, these methods achieve an accuracy of 
97.9% (llalterKn (31; al., 1998). 
1,2 Shallow Parsing A1)t)roaches 
Since the early 90's~ sKveral l;Kchni(tues for carry- 
ing out shalh)w parsing have been d(3velol)ed. Tlms(~ 
techniques can also bK classified into two main 
groups: basKd on hand-codKd linguistic rules and 
based on iKarning algorithms. ThKsK approadms 
ll~we a conunon chara(:tcristi(:: thKy take, l;he se- 
(lUKnCK of 1Kxi(:al tags 1)rot)oscd t)y a POS tagger as 
input, for both the h;arning and the (:bunking pro- 
C(~sses. 
1.2.1 Techniques based on hand-coded 
linguistiK rules 
These methods use a hand-written set of rules that 
ark defined llsing POS as tKrnfinals of tim gI'gtlll- 
mar. Most of these works use tinit(! slate \]nel;llo(ls 
for (tel;Kcl;ing (:hunks or f()r a(:(:olni)lishing el;her lin- 
guisti(: l;asks (EjKrhed, 1988), (:\lm(~y, 1996), (At o 
Mokhtar and Chanod, :19!)7). ()ther works use (tit'-- 
ferellI; ~raltllllgd;ical \]'orlllalislllS~ S/l(;h as (:OllSl;r;/illl; 
grmnmars (Voutilainen, 1993), or (:oral)inK th('. gram- 
mar rules with a set of heuristi(:s (Bourigault, :1992). 
ThesK works usually use. a small test SKi that is lllall- 
ually evaluated, so the achieved results are not sig- 
ni\[icant. The regular KXln:cssions defined in (Ejer- 
lied, 1988) identified both non-recursive clauses and 
non-recursive NPs in English text. The cxperimKn- 
tation on l;he Brown (:ortms achiKvKd a prK(:ision ratK 
of 87% (for clauses) and 97.8 % (for NPs). Ab- 
hey introduced the concept of chunk (Almey, 1991) 
m)d l/resentKd an incremental l)artial parser (Abney, 
1996). This parsKr identities chunks l)ase on the 
parts of Sl)eKch, and it then chooses how to con> 
bine them tbr higher level analysis using lexical in- 
tbrmation. ThK average 1)rKcision and recall rates for 
chunks were 87.9% and 87.1%, rest)ectivKly , on a tKst 
set of 1000 sKntKneKS. An iimrenmntal architKcture 
of finite--state transducers for French is pres(mted in 
(At-Mokhtar and Chanod, 1.997). Each transducer 
1)ert'orms a linguisti(; task su(:h as id(3ntif~ying sKg- 
ments or syntactic strueturKs and dKtecting subjects 
and ol)jects. The system was (3wfluated on various 
corpora for subject and object detKction. The pre- 
cision rate varied between 9(,).2% and 92.6%. The 
recall rate varied between 97.8% and 82.6%. 
The NP2bol llarsKr described in (Voutilainen, 
1993) identified nmximal-length noun phrases. 
NPtool gave a precision ral, e of 95-98% and a re- 
call ratK of 98.5-100%. These results were criticised 
in (Raulshaw and Marcus, 1.995) due to some in- 
consistencies and aplmrenl; mistakKs which appeared 
on thK sample given in (Voutilainen, 1993). Bouri- 
gault dKvelopKd the LECTER parser fin" French us- 
ing grmnmatical rules and soum hem'istics (Bouri- 
gault, 1992). lit achieved a recall rate of 95% iden- 
tit~ying maxilnal length ternfinological noun phrases, 
but tie (lid not givK a prKcision ratK, so it is difficult; 
to Kvaluate the actual pKribrmance of tile parsKr. 
1..2.2 LKarning Techniques 
These al)lnoachcs automa.tica.lly (:onstruel; a lan- 
guage model from a labello.d alld brackKted corpus. 
The lirst probabilistic approach was proposed in 
(Church, 1988). This method learn(; a bigram model 
for detecting simph3 noun phrasKs on the Brown cor- 
pus. Civ('n a sequen('e of parts of st)(3eeh as inl)ug , 
the Church program inserts the most prol)able open- 
ings and Kndings of NPs, using a Viterbiqiko. dy- 
namic programming algorithm. Church did not giVK 
precision and recall rates. He showKd that 5 out of 
24:3 NP were omitted, but in a very small test with 
a POS tagging ac(:uraey of 99.5%. 
Transfornlation-based 1Karning (TBI,) was USKd in 
(\]~;unshaw an(l Mar(:us, 1995) to (lc, t(',('t baSK NP. 
In this work ('hunldng was considKre(1 as a tagging 
technique, so that each P()S could be tagged with 
I (inside lmseNP), O (outside baseNl )) or B (inside 
a baseNP, but 1;11(3 pre(:eding word was ill mlother 
basKNP). This at)preach rKsulted in a precision rate 
of 91.8% and a rKcall rate of 92.3%. This iesult 
was automatically Kwlhlat;ed ell ,q. (;est set; (200,000 
words) extracl;Kd from the WS.\] Treebank. The main 
drawlmek to this approach are the high requiremKnts 
tbr tilne and space which ark needed to train ~he sys- 
l;elll; it needs to train 100 tKmplates of combinations 
of words. 
There are s(;v(;ral works that use a m('mory-based 
h,arning algorithm. ThKse at)proaehKs construct a 
classifier tbr a task by storing a sKI; of exmnples in 
inemory. Each (;xamI)le is definKd l)y a set of fhatures 
that havK to 1)c. learnt from a 1)racketed corpus. The 
Memory-Based Learning (MBL) algorithm (l)aele,- 
roans (3t al., 1999) takes into account lexical and POS 
information. It stores the following features: thK 
word form mid POS tag of thK two words to the left, 
the tbeus word and onK word to the right. This sys- 
tKm achiKved a precision rate of 93.7'7o and a recall 
rate of 94.0% on t\]lK WSJ Treebank. HowevKr, when 
only POS information was used the l)erformance de- 
creased a.chiKving a precision rate of 90.3% mid a 
615 
recall rate of 90.1%. Tile Memory-Based Sequence 
Learning (MBSL) algorithm (Argamon et al., 1998) 
learns substrings or sequences of POS and brackets. 
Precision and recall rates were 92.4% on the same 
data used in (Ramshaw and Marcus, 1995). 
A simple approach is presented in (Cardie and 
Pierce, 1998) called Treebank Apl)roach (TA). This 
techtfique matches POS sequences from an initial 
noun phrase grammar which was extracted fl'om an 
annotated corpus. The precision achieved for each 
rule is used to rank and prune the rules, discarding 
those rules whose score is lower than a predefined 
threshold. It uses a longest match heuristic to de- 
termine base NP. Precision and recall on the WSJ 
Treebank was 89.4% and 90.0%, respectively. 
It is difficult to compare the different al)proaches 
due fbr various reasons. Each one uses a different 
definition of base NP. Each one is evaluated on a 
different corpus or on different parts of the same 
cortms. Some systems have even been evaluated by 
hand on a very small test set. Table 1 summarizes 
tile precision and recall rates for learning approaches 
that use data extracted from the WSJ Treebank. 
Method NP-Pl'ecision NP-Recall 
TBL 91.8 92.3 
MBSL 92.4 92.4 
TA 89.4 90.9 
MBL 93.7 94.0 
MBL (only POS) 90.3 90.1 
Tat)le 1: Precision and recall rates tbr diflhrent NP 
parsers. 
2 General Description of our 
Integrated approach to Tagging 
and Chunking 
We propose an integrated system (Figure 1) that 
combines different knowledge sources (lexical prob- 
abilities, LM for chunks and Contextual LM tbr 
the sentences) in order to obtain the correspond- 
ing sequence of POS tags and the shallow parsing 
(\[su WllC~W.~/c~ su\] W.~lC~ ... \[su W, lC,, su\]) 
from a certain input string (1'I:1,I¥.2, ...,I/l:n). Our 
system is a transducer composed by two levels: the 
upper one represents the Contextual LM for tile 
sentences, and the lower one modelize the chunks 
considered. The formalism that we have used in all 
levels are finite-state automata. To be exact, we 
have used models of bigrmns which are smoothed 
using the backoff technique (Katz, 1987) in order to 
achieve flfll coverage of the . The bigrams 
LMs (bigram probabilities) was obtained by means 
of the SLM TOOLKIT (Clarksond and Ronsenfeld, 
LEAIINING ~- 
\[-C,m,zxtuall.~ I2"l"°'~.Chunks \] l l'e×icalPmbabilities J 
CIIUNKIN(; ~~ 
Figure 1: Overview of the System. 
1997) from tile sequences of categories in the 
training set. Then, they have been rei)resented like 
finite-state automata. 
2.1 The learning phase. 
The models have been estimated from labelled and 
bracketed corpora. The training set is composed by 
sentences like: 
\[su w,/c,w.,/c., su\] w~/c~ ... \[su ~,~:,~/c,~ su\] ./. 
where Wi are the words, Ci are part-of-speech tags 
and SU are tile chunks considered. 
Tile models learnt are: 
• Contextual LM: it is a smoothed bigram model 
learnt from tile sequences of part-of speech tags 
(Ci) and chunk descrit)tors (XU) present in the 
training corpus (see Figure 2a). 
• Models for the chunks: they are smoothed bi- 
gram models learnt fl'om the sequences of part- 
of-speech tags eorrest)onding to each chunk of 
the training corpus (see Figure 2b). 
• Lexical Probabilities: they are estilnated from 
the word frequencies, tile tag frequencies and 
the word per tag frequencies. A tag dictio- 
nary is used which is built from the full cor- 
pus which gives us the possible lexical categories 
(POS tags) for each word; this is equivalent to 
having an ideal morphological analyzer. The 
probabilities for each possible tag are assigned 
from this information taking into account the 
obtained statistics. Due to the fact that the 
word cannot have been seen at training, or it 
has only been seen in some of the possible cat- 
egories, it is compulsory to apply a smoothing 
mechanism. In our case, if the word has not 
previously been seen~ the same probability is 
assigned to all the categories given by the die- 
tionary; if it has been seen, but not in all the 
616 
(b) LM for Chunks 
......................... 
i 
, ', z f+@,, --_... 
J 
i 
i 
t 
i 
i .......................... 
(c) Integrated LM 
i I(<SU>\[( ) x * 
Figure 2: Integrated Language Model fin" Tagging and Chunking. 
categories, the smoothing called "add one" is 
applied. Afterwards, a renormalization process 
is carried out. 
Once the LMs have been learnt, a regular substi- 
tution of the lower model(s) into the upper one is 
made. In this way, we get a single Illtegrated LM 
which shows the possible concatenations of lexical 
tags and syntactical uuits, with their own transition 
probabilities which also include the lexical probabil- 
ities ms well (see Figure 2c). Not(', that the models 
in Figure 2 are not smoothed). 
2.2 The Decoding Process: Wagging and 
Parsing 
The tagging and shallow parsing process consists of 
finding out the sequence of states of maximum 1)rob- 
ability on the Integrated LM tor an input sentence. 
Therefore, this sequence must be compatible with 
the contextual, syntactical and lexical constraints. 
This process can be carried out by Dynamic Pro- 
gt'ammiitg using the Viterbi algorithm, which is con- 
veniently modified to allow for (;ransitions between 
certain states of the autotnata without consmning 
any symbols (epsilon l;ransitious). A portion of the 
Dynamic Progranmfing trellis for a generic sentence 
using the Integrated LM shown in Figure 2c can be 
seen in Figure 3. The states of the automata that 
can be reached and that are compatible with the 
lexical constraints are marked with a black circle 
(i.e., fl'om the state Ck it is possible to reach the 
state Ci if the transition is in the automata and the 
lexical probability P(Wi\[Ci) is not null). Also, the 
transitions to initial and final states of the models 
for chunks (i.e., fl'om Ci to < SU >) are allowed; 
these states are marked in Figure 3 with a white cir- 
cle and in this case no symbol is consumed. Ill all 
these cases, the transitions to initial and final pro- 
duce transitions to their successors (the dotted lines 
in Figure 3) where now symbols must be consumed. 
Once the Dynamic Programing trellis is built, we 
can obtain the maximum probability path for the 
input sentence, and thus the best sequence of lexical 
tags and the best segmentation in chunks. 
<s> 
Ci 
cj 
<Is> ....... \]\ ~ "~ \]', `% l{inal 
x\ ". ' State 
<~u> ......... ........ it>, .... ~>;, ..... 
Ci / "" / " ',' 
: ~ t I L 
(:k ' , s', 
{.Jill , / t 
• / / 
c,, ............ ........ 7~-o.. ......... //~ ..... 
</S U> "%3~ 
----t~-- 
hlput: ... Wll-2 Wll-I Wn </S> 
Output: ... Wn~2/Ci I SU Wnq/Cn SUI Wn/Ck </s> 
Figure 3: Partial %'ellis for Programming Decoding 
based oil tile Integrated LM. 
3 Experimental Work 
In this section we will describe a set of experiments 
that we carried out in order to demonstrate the ca- 
pabilities of the proposed approach for tagging and 
shallow parsing. The experiments were carried out 
617 
on the WSJ corpus, using the POS tag set; defined 
in (Marcus etlal. , 1993), considering only the NP 
chunt{s (lefine~l by (Church, 1988) and using tile 
models that we have presented above. Nevertheless, 
the use of this apt)roach on other corpora (chang- 
ing the reference ), other lexical tag sets or 
other kinds of chunks can be done in a direct way. 
3.1 Corpus Description. 
We used a t)ortion of the WSJ corpus (900,000 
words), which was tagged according to the Penn 
Treebank tag set and bracketed with NP markers, 
to train and test the system. 
The tag set contained 45 different tags. About 
36.5% of the words in the cortms were mnbiguous, 
with an ambiguity ratio of 2.44 tag/word over the 
ambiguous words, 1.52 overall. 
3.2 Experimental Results. 
In order to train the models and to test the system, 
we randomly divided the corpora into two parts: ap- 
proximately 800,000 words for training aud 100,000 
words tbr testing. 
Both the bigram models for representing contex- 
tual information mid syntactic description of the NP 
chunk and the lexical probabilities were estimated 
from training sets of different sizes. Due to the fact 
that we did not use a morphological analyser for En- 
glish, we constructed a tag dictionary with the lex- 
icon of the training set and the test set used. This 
dictionary gave us tile possible lexical tags for each 
word fl'om the corpus. In no case, was the test used 
to estimate the lexical probabilities. 
100 
99 
98 
07 
96 
95 
94 
93 
92 
BIG 
BIG-BIG 
\[\[ 
100 200 
\[i (~ {\] {1 LI 
i i i i 
300 400 500 60O 
#Words x 1000 
Figure 4: Accuracy Rate of Tagging on WSJ for 
incrementM training sets. 
In Figure 4, we show the results of tagging on the 
test set in terms of the training set size using three 
at)proaches: the simplest (LEX) is a tagging process 
which does not take contextual information into ac- 
count, so the lexical tag associated to a word will 
100 
99 
90 
97 
06 
95 
04 
93 
92 
Precision • 
Recall 
~, + 
<,, 
+ 
, i , __ i i 
100 200 300 4o0 500 600 7(30 800 
#Words x 1000 
Figure 5: NP-chunldng results on WSJ for incremen- 
tal training sets. 
Tagger 
Tagging 
Accuracy 
BIG-BIG 96.8 
Lex 94.3 
BIG 96.9 
IDEAL 100 (assumed) 
NP-Clmnking 
Precision I Recall 
94.6 193.6 
90.8 91.3 
94.9 94.1 
95.5 94.7 
Table 2: Tagging and NP-Chunking results t'or dif- 
ferents taggers (training set of 800,000 words). 
be that which has aI)peared more often in the train- 
ing set. Tile second method corresponds to a tagger 
based on a bigram model (BIG). The third one uses 
the Integrated LM described in this pai)er (BIG- 
BIG). The tagging accuracy for BIG and BIG-BIG 
was close, 96.9% and 96.8% respectively, whereas 
without the use of the  model (LEX), tile 
tagging accuracy was 2.5 points lower. The trend in 
all the cases was that an increment in the size of the 
training set resulted in an increase in the tagging 
accuracy. After 300,000 training words, the result 
became stabilized. 
In Figure 5, we show the precision (#correct 
proposed NP/#proposed NP) and recall (#correct 
proposed NP/#NP in the reference) rates for NP 
chunking. The results obtained using the Integrated 
LM were very satisfactory achieving a precision rate 
of 94.6% and a recall rate of 93.6%. The perfor- 
mance of the NP chunker improves as the train- 
ing set size increases. This is obviously due to the 
fact that tile model is better learnt when the size 
of the training set increases, and the tagging error 
decreases as we have seen above. 
The usual sequential 1)rocess for chunking a sen- 
tence can also be used. That is, first we tag the sen- 
tence and then we use the Integrated LM to carry 
out the chunking. In this case, only tim contextual 
t)robabilities are taken into account in the decoding 
618 
1)recess. In Table 2, we show the most relevant re- 
suits that we obtained for tagging and tbr NP chunk- 
ing. The first row shows the result when the tagging 
and the chunking are done in a integrated way. The 
following rows show the performmme of the sequen- 
tial process using different taggers: 
• LEX: it takes into account only lexical proba- 
t)ilities. In this case, the tagging accuracy was 
94.3%. 
• BIG: it is based on a bigram model that 
achieved an accuracy of 96.9%. 
• IDEAL: it siinulates a tagger with an accuracy 
rate of 100%. To do this, we used the tagged 
sentences of the WSJ corlms directly. 
These results confirm that precision and recall 
rates increase when the accuracy of the tagger is 
beN;er. The pert'ormmme of 1;he, se(tuential process 
(u:dng the BIG tagger) is slightly 1letter than the 
pet'formance of the integrated process (BIG-BIG). 
We think that this is 1)robably b(;cause of the way 
we combined the I)robabilities of t;he ditthrent mod- 
els. 
4 Conclusions and Future Work 
In this 1)aper, we have t)rcscntcd a system tot" Tag- 
ging and Chunldng based on an Integrated Lan- 
guage Model that uses a homogeneous tbrmalism 
(finite-state machine) to combine different knowl- 
edge sources: lexical, syntacti(:al and contextual 
inodels. It is feasible l)oth in terms of 1)erfl)rmanc(; 
and also in terms of computational (:tliciency. 
All the models involv(:d are learnt automatically 
fi'om data, so the system is very tlexibte and 1)ortable 
and changes in the reference ., lexical tags 
or other kinds of chunks can be made in a direct way. 
The tagging accuracy (96.9% using BIG and 
96.8% using BIG-BIG) is higher tlmn other similar 
alIl)roaches. This is because we have used the tag 
di('tionary (including the test set in it) to restrict 
the possible tags for unknown words, this assmnp- 
lion obviously in(:rease the rates of tagging (we have 
not done a quantitative study of this factor). 
As we have mentioned above, the comparison with 
other approaches is ditficult due mnong other reasons 
to tim following ones: the definitions of base NP are 
not always the stone, the sizes of the train and the 
test sets are difl'erent and the knowledge sources used 
in the learning process are also different. The pre- 
cision for NP-chunking is similm' to other statistical 
at)preaches t)resented in section 1, tbr 1)oth the in- 
tegrated process (94.6%) and l;tm sequential process 
using a tagger based on 1)igrams (94.9%). The recall 
rate is slightly lower than for some apl)roaches using 
the integrated system (93.6%) and is similar for the 
sequential process (94.1%). When we used the se- 
quential system taking an error ti'ee input (IDEAL), 
the performance of the system obviously increased 
(95.5% precision and 94.7% recall). These results 
show the influence of tagging errors on the process. 
Nevertheless, we are studying why the results lie- 
tween the integrated process and the sequential pro- 
cess are diflbrent. We are testing how the introduc- 
tion of soIne adjustnmnt factors among the models 
tk)r we, ighting the difl'erent 1)robability distribution 
can lint)rove the results. 
The models that we have used in this work, are ill- 
grams, but trigrams or any stochastic regular model 
can be used. In this respect, we have worked on a 
more coml)lex LMs, formalized as a. finite-state au- 
tomata which is learnt using Grammatical Inference 
tectufiques. Also, our ai)l)roach would benefit fl'om 
the inclusion of lexical-contextual in%rmation into 
the LM. 
5 Acknowledgments 
This work has been partially supl)orted 1)y the 
Stmnish I{esem'ch Projct:t CICYT (TIC97-0671-C02- 
O11O2). 

References 

S. Abney. 1991. Parsing by Chunks. R. Berwick, S. 
Almey and C. Tcnny (eds.) Principle -based Pars- 
ing. Kluwer Acadenfic Publishers, Dordrecht. 

S. Almey. 1996. Partial Parsing via Finit('.-Sta~e 
Cascades. In Proceedings of the ES,S'LLI'96 Ro- 
bust Parsinfl Workshop, l?rague, Czech l{elmblie. 

S. Argamon, I. Dagan, and Y. Krymolowski. 1.998. 
A Memory based Approach to Learning Shallow 
Natural Language, Patterns. In l~roceedi'ngs of 
t,h,e joint 17th, International Conference on Com- 
putational Linguistics and 36th Annual Meeting 
of the Association for Computational Linguistics, 
COLING-ACL, pages 67 73, Montrdal, Canada. 

S. At-Mokhtar and ,l.P. Chanod. 1997. Incremen- 
tal Finite-State Parsing. In Proceedings of the 5th, 
Conference on Applied Natural Language Process- 
ing, \Vashington D.C., USA. 

D. Bourigault. 1992. Surface Grmnmatical Anal- 
,),sis for tim Extraction of ~l~.~rminological Noml 
Phrases. In Proceedings of the 15th International 
Conference on Computational Linguistics, pages 
977-981. 

Eric Brill and Jun Wu. 1998. Classifier Combi- 
nation for hnproved Lexical Disambiguation. In 
Procccdings of the joint 17th, International Con- 
fcrcncc on Computational Linguistics and 36th 
Annual Meeting of thc Association for Computa- 
tional Linguistics, COLING-ACL, pages 191-195, 
Montrdal, Canada. 

E. Brill. 1995. Transibnnation-based Error-driven 
Learning and Natural Language Processing: A 
Case Study in Part-of-sI)eech Tagging. Compu- 
tational Linguistics, 21 (4) :543-565. 

C. Car(lie and D. Pierce. 1998. Error-Driven Prun- 
ning of Treebank Grammars for Base Noun Phrase 
Identification. In Proceedings of the joint 17th 
International Conference on Computational Lin- 
guistics and 36th Annual Meeting of the Asso- 
ciation for Computational Linguistics, COLING- 
ACL, pages 218 224, Montrdal, Canada, August. 

K. W. Church. 1988. A Stochastic Parts Program 
and Noun Phrase Parser for Unrestricted Text. 
In Proceedings of the 1st Conference on Applied 
Natural Language Processing, ANLP, pages 136- 
143. ACL. 

P. Clarksond and R. Ronsenfeld. 1997. Statistical 
Language Modeling using the CMU-Cambridge 
Toolkit. In Procccdinfls of Eurospccch, Rhodes, 
C,-reece. 

D. Cutting, J. Kut)iec , J. Pederson, and P. Nil)un. 
1992. A Practical Part-of-speech Tagger. In Pfv- 
cccdings of the 3rd Confcrcnce oft Applied Natu- 
ral Language Processing, ANLP, pages 133 140. 
ACL. 

W. Daelelnans, J. Zavrel, P. Berck, and S. Gillis. 
1996. MBT: A MeInory-Based Part of speech 
Tagger Generator. In Proceedings of the /tth 
Workshop on Very Large Cmpora, pages 14-27, 
Copenhagen, Denmark. 

W. Daelemans, S. Buchholz, and J. Veenstra. 1999. 
Memory-Based Shallow Parsing. In Proceedings 
of EMNLP/VLC-99, pages 239 246, University of 
Maryla.nd, USA, June. 

E. Ejerhed. 1988. Finding Clauses in Unrestricted 
Text by Finitary and Stochastic Methods. In Pro- 
cccdings of Second Confcrcncc on Applied Natural 
Language Processing, pages 219-227. ACL. 

H. van Halteren, J. Zavrel, and W. Daelemans. 1998. 
Improving Data Driven Wordclass Tagging by 
System Combination. In Proceedings of the joint 
17th International Confcr'cncc oft Computational 
Linguistics and 36th Annual Mccting of the Asso- 
ciation for Computational Linguistics, COLING- 
ACL, pages 491-497, Montrdal, Canada, August. 

S. M. Katz. 1987. Estimation of Probabilities from 
Sparse Data for tile Language Model Component 
of a Speech Recognizer. IEEE T~nnsactions on 
Acoustics, Speech and Signal Processing, 35. 

D. M. Magerman. 1996. Learning Grammatical 
Structure Using Statistical Decision-Trees. In 
Proceedings of the 3rd International Colloquium 
on GTnmmatical Inference, ICGI, pages 1-21. 
Springer-Verlag Lecture Notes Series in Artificial 
Intelligence 1147. 

M. P. Marcus, M. A. Marcinkiewicz, and B. San- 
torini. 1993. Building a Large Annotated Cortms 
of English: Tile Penn Treebank. Computational 
Linguistics, 19(2). 

Llu/s Mhrquez and Horacio RodHguez. 1998. Part- 
of Speech T~gging Using Decision Trees. In C. 
Nddellee and C. Rouveirol, editor, LNAI 1398: 
Proceedings of thc lOth European Conference 
on Machine Learning, ECML'98, pages 25-36, 
Chemnitz, GermNly. Springer. 

B. Merialdo. 1994. Tagging English Text with a 
Probabilistic Model. Computational Linguistics, 
20(2):155-171. 

F. Pla and N. Prieto. 1998. Using Grammatical 
Inference Methods tbr Automatic Part of speech 
Tagging. In Proceedings of 1st International Con- 
ference on Language Resources and Evaluation, 
LREC, Granada, Spain. 

L. Ramshaw and M. Marcus. 1995. Text Chunking 
Using ~lYansfbrmation-Based Learning. In Pro- 
cccdings of third Workshop on Very Large Col 
pora, pages 82 94, June. 

A. Ratnapm'khi. 1996. A Maximum Entrol)y Part 
of-speech Tagger. In Proceedings of the 1st Con- 
fcrcncc on Empirical Methods in Natural Lan- 
guagc Processing, EMNLP. 

Atro Voutilainen and Llufs Padrd. 1997. Develol)- 
inn a Hybrid NP Parser. In Proceedings of the 5th 
Conference on Applied Natural Language Prvecss- 
ing, ANLP, pages 80 87, Washington DC. ACL. 

Atro Voutilainen. 1993. NPTool, a Detector of En- 
glish Noun Phrases. In Proceedings of the Work- 
shop on Very Lafflc Corpora. ACL, June. 

Atro Voutilainen. 1995. A Syntax-Based Part of 
speech Analyzer. In Prvcccdings of the 7th Con- 
ference of the European Ch, aptcr of the Association 
for Computational Linguistics, EACL, Dut)lin, 
h'eland. 

R. Weischedel, R. Schwartz, J. Pahnueci, M. Meteer, 
and L. Ramshaw. 1993. Coping with Ambiguity 
and Unknown \~or(ls through Probabilistic Mod- 
els. Computational Linguistics, 19(2):260-269. 
