In: Proceedings of CoNLL-2000 and LLL-2000, pages 107-110, Lisbon, Portugal, 2000. 
Shallow Parsing by Inferencing with Classifiers* 
Vasin Punyakanok and Dan Roth 
Department of Computer Science 
University of Illinois at Urbana-Champaign 
Urbana, IL 61801, USA {punyakan, danr}@cs.uiuc.edu 
Abstract 
We study the problem of identifying phrase 
structure. We formalize it as the problem of 
combining the outcomes of several different clas- 
sifiers in a way that provides a coherent in- 
ference that satisfies some constraints, and de- 
velop two general approaches for it. The first 
is a Markovian approach that extends stan- 
dard HMMs to allow the use of a rich obser- 
vations structure and of general classifiers to 
model state-observation dependencies. The sec- 
ond is an extension of constraint satisfaction for- 
malisms. We also develop efficient algorithms 
under both models and study them experimen- 
tally in the context of shallow parsing. 1 
1 Identifying Phrase Structure 
The problem of identifying phrase structure can 
be formalized as follows. Given an input string 
O =< ol, 02,..., On >, a phrase is a substring 
of consecutive input symbols oi, oi+l,...,oj. 
Some external mechanism is assumed to consis- 
tently (or stochastically) annotate substrings as 
phrases 2. Our goal is to come up with a mech- 
anism that, given an input string, identifies the 
phrases in this string, this is a fundamental task 
with applications in natural  (Church, 
1988; Ramshaw and Marcus, 1995; Mufioz et 
al., 1999; Cardie and Pierce, 1998). 
The identification mechanism works by using 
classifiers that process the input string and rec- 
ognize in the input string local signals which 
* This research is supported by NSF grants IIS-9801638, 
SBR-9873450 and IIS-9984168. 
1Full version is in (Punyakanok and Roth, 2000). 
2We assume here a single type of phrase, and thus 
each input symbol is either in a phrase or outside it. All 
the methods we discuss can be extended to deal with 
several kinds of phrases in a string, including different 
kinds of phrases and embedded phrases. 
are indicative to the existence of a phrase. Lo- 
cal signals can indicate that an input symbol o 
is inside or outside a phrase (IO modeling) or 
they can indicate that an input symbol o opens 
or closes a phrase (the OC modeling) or some 
combination of the two. In any case, the lo- 
cal signals can be combined to determine the 
phrases in the input string. This process, how- 
ever, needs to satisfy some constraints for the 
resulting set of phrases to be legitimate. Sev- 
eral types of constraints, such as length and or- 
der can be formalized and incorporated into the 
mechanisms studied here. For simplicity, we fo- 
cus only on the most basic and common con- 
straint - we assume that phrases do not overlap. 
The goal is thus two-fold: to learn classifiers 
that recognize the local signals and to combine 
these in a ways that respects the constraints. 
2 Markov Modeling 
HMM is a probabilistic finite state automaton 
used to model the probabilistic generation of 
sequential processes. The model consists of 
a finite set S of states, a set (9 of observa- 
tions, an initial state distribution P1 (s), a state- 
transition distribution P(s\[s') for s, # E S and 
an observation distribution P(o\[s) for o E (9 
and s 6 S. 3 
In a supervised learning task, an observa- 
tion sequence O --< ol,o2,... On > is super- 
vised by a corresponding state sequence S =< 
sl, s2,. • • sn >. The supervision can also be sup- 
plied, as described in Sec. 1, using the local sig- 
nals. Constraints can be incorporated into the 
HMM by constraining the state transition prob- 
ability distribution P(s\]s'). For example, set 
P(sV) = 0 for all s, s' such that the transition 
from s ~ to s is not allowed. 
aSee (Rabiner, 1989) for a comprehensive tutorial. 
107 
Combining HMM and classifiers (artificial 
neural networks) has been exploited in speech 
recognition (Morgan and Bourlard, 1995), how- 
ever, with some differences from this work. 
2.1 HMM with Classifiers 
To recover the most likely state sequence in 
HMM, we wish to estimate all the required 
probability distributions. As in Sec. 1 we as- 
sume to have local signals that indicate the 
state. That is, we are given classifiers with 
states as their outcomes. Formally, we assume 
that Pt(slo ) is given where t is the time step in 
the sequence. In order to use this information 
in the HMM framework, we compute 
Pt(o\[s) = Pt(slo)Pt(o)/Pt(s). (1) 
instead of observing the conditional probability 
Pt (ols) directly from training data, we compute 
it from the classifiers' output. Pt(s) can be cal- 
culated by Pt(s) = Es'eS P(sls')Pt-l(s') where 
Pl(s) and P(sls' ) are the two required distri- 
bution for the HMM. For each t, we can treat 
Pt(ols ) in Eq. 1 as a constant r/t because our goal 
is only to find the most likely sequence of states 
for given observations which are the same for 
all compared sequences. Therefore, to compute 
the most likely sequence, standard dynamic pro- 
gramming (Viterbi) can still be applied. 
2.2 Projection based Markov Model 
In HMMs, observations are allowed to depend 
only on the current state and long term depen- 
dencies are not modeled. Equivalently, from the 
constraint point of view, the constraint struc- 
ture is restricted by having a stationary proba- 
bility distribution of a state given the previous 
one. We attempt to relax this by allowing the 
distribution of a state to depend, in addition 
to the previous state, on the observation. For- 
mally, we make the independence assumption: 
P( stlSt-l , St- 2, . . . , sl , ot, ot-1, . . . , 01) 
= P(stlSt_l,Ot). (2) 
Thus, we can find the most likely state sequence 
S given O by maximizing 
n 
P(SIO) = II\[P(stls~,..., 8t-1, O)\]Pl(slIO) 
t=2 
n 
= H\[P(stlst_l,ot)\]Pl(sllOl). (3) 
t=2 
Hence, this model generalizes the standard 
HMM by combining the state-transition prob- 
ability and the observation probability into one 
function. The most likely state sequence can 
still be recovered using the dynamic program- 
ming algorithm over the Eq.3. 
In this model, the classifiers' decisions are in- 
corporated in the terms P(sls',o ) and Pl(slo ). 
In learning these classifiers we project P(sls ~, o) 
to many functions Ps' (slo) according to the pre- 
vious states s ~. A similar approach has been 
developed recently in the context of maximum 
entropy classifiers in (McCallum et al., 2000). 
3 Constraint Satisfaction with 
Classifiers 
The approach is based on an extension of 
the Boolean constraint satisfaction formal- 
ism (Mackworth, 1992) to handle variables that 
are outcomes of classifiers. As before, we as- 
sume an observed string 0 =< ol,o2,... On > 
and local classifiers that, w.l.o.g., take two dis- 
tinct values, one indicating the openning a 
phrase and a second indicating closing it (OC 
modeling). The classifiers provide their outputs 
in terms of the probability P(o) and P(c), given 
the observation. 
To formalize this, let E be the set of all possi- 
ble phrases. All the non-overlapping constraints 
can be encoded in: f --/ke~ overlaps ej (-~eiV-~ej). 
Each solution to this formulae corresponds to a 
legitimate set of phrases. 
Our problem, however, is not simply to find 
an assignment   : E -+ {0, 1} that satisfies f 
but rather to optimize some criterion. Hence, 
we associate a cost function c : E ~ \[0,1\] 
with each variable, and then find a solution ~- 
of f of minimum cost, c(~-) = n Ei=l 
In phrase identification, the solution to the op- 
timization problem corresponds to a shortest 
path in a directed acyclic graph constructed 
on the observation symbols, with legitimate 
phrases (the variables in E) as its edges and 
their costs as the weights. Each path in this 
graph corresponds to a satisfying assignment 
and the shortest path is the optimal solution. 
A natural cost function is to use the classi- 
fiers probabilities P(o) and P(c) and define, for 
a phrase e = (o, c), c(e) = 1 - P(o)P(c) which 
means that the error in selecting e is the er- 
ror in selecting either o or c, and allowing those 
108 
to overlap 4. The constant in 1 - P(o)P(c) bi- 
ases the minimization to prefers selecting a few 
phrases, possibly no phrase, so instead we min- 
imize -P(o) P(c). 
4 Shallow Parsing 
The above mentioned approaches are evaluated 
on shallow parsing tasks, We use the OC mod- 
eling and learn two classifiers; one predicting 
whether there should be a open in location t 
or not, and the other whether there should a 
close in location t or not. For technical reasons 
it is easier to keep track of the constraints if 
the cases --1 o and --1 c are separated according to 
whether we are inside or outside a phrase. Con- 
sequently, each classifier may output three pos- 
sible outcomes O, nOi, nOo (open, not open 
inside, not open outside) and C, nCi, nCo, 
resp. The state-transition diagram in figure 1 
captures the order constraints. Our modeling of 
the problem is a modification of our earlier work 
on this topic that has been found to be quite 
successful compared to other learning methods 
attempted on this problem (Mufioz et al., 1999) 
and in particular, better than the IO modeling 
of the problem (Mufioz et al., 1999). 
Figure 1: State-transition diagramfor the 
phrase recognition problem. 
The classifier we use to learn the states as 
a function of the observations is SNoW (Roth, 
1998; Carleson et al., 1999), a multi-class clas- 
sifter that is specifically tailored for large scale 
learning tasks. The SNoW learning architec- 
ture learns a sparse network of linear functions, 
in which the targets (states, in this case) are 
represented as linear functions over a common 
feature space. Typically, SNoW is used as a 
classifier, and predicts using a winner-take-all 
4Another solution in which the classifiers' suggestions 
inside each phrase axe also accounted for is possible. 
mechanism over the activation value of the tax- 
get classes in this case. The activation value 
itself is computed using a sigmoid function over 
the linear sum. In this case, instead, we normal- 
ize the activation levels of all targets to sum to 1 
and output the outcomes for all targets (states). 
We verified experimentally on the training data 
that the output for each state is indeed a dis- 
tribution function and can be used in further 
processing as P(slo ) (details omitted). 
5 Experiments 
We experimented both with base noun phrases 
(NP) and subject-verb patterns (SV) and show 
results for two different representations of the 
observations (that is, different feature sets for 
the classifiers) - part of speech (POS) tags only 
and POS with additional lexical information 
(words). The data sets used are the standard 
data sets for this problem (Ramshaw and Max- 
cus, 1995; Argamon et al., 1999; Mufioz et 
al., 1999; Tjong Kim Sang and Veenstra, 1999) 
taken from the Wall Street Journal corpus in 
the Penn Treebank (Marcus et al., 1993). 
For each model we study three different clas- 
sifiers. The simple classifier corresponds to the 
standard HMM in which P(ols ) is estimated di- 
rectly from the data. The NB (naive Bayes) and 
SNoW classifiers use the same feature set, con- 
junctions of size 3 of POS tags (+ words) in a 
window of size 6 around the target word. 
The first important observation is that the 
SV task is significantly more difficult than the 
NP task. This is consistent for all models and 
all features sets. When comparing between dif- 
ferent models and features sets, it is clear that 
the simple HMM formalism is not competitive 
with the other two models. What is interest- 
ing here is the very significant sensitivity to the 
wider notion of observations (features) used by 
the classifiers, despite the violation of the prob- 
abilistic assumptions. For the easier NP task, 
the HMM model is competitive with the oth- 
ers when the classifiers used are NB or SNOW. 
In particular, a significant improvement in both 
probabilistic methods is achieved when their in- 
put is given by SNOW. 
Our two main methods, PMM and CSCL, 
perform very well on predicting NP and SV 
phrases with CSCL at least as good as any other 
methods tried on these tasks. Both for NPs and 
109 
Table 1: Results (F~=l) of different methods 
and comparison to previous works on NP and 
SV recognition. Notice that, in case of simple, 
the data with lexical features are too sparse to 
directly estimate the observation probability so 
we leave these entries empty. 
Method POS POS 
Model\[ Classifier only +words 
SNoW 90.64 92.89 
HMM NB 90.50 92.26 
Simple 87.83 
SNoW 90.61 92.98 
NP PMM NB 90.22 91.98 
Simple 61.44 
SNoW 90.87 92.88 
CSCL NB 90.49 91.95 
Simple 54.42 
Ramshaw & Marcus 90.6 92.0 
Argamon et al. 91.6 N/A 
Mufioz et al. 90.6 92.8 
Tjong Kim Sang 
Veenstra N/A 92.37 
SNoW 64.15 77.54 
HMM NB 75.40 78.43 
Simple 64.85 
SNoW 74.98 86.07 
PMM NB 74.80 84.80 
Simple 40.18 
SNoW 85.36 90.09 
CSCL NB 80.63 88.28 
Simple 59.27 
Argamon et al. 86.5 N/A 
i Mufioz et al. 88.1 92.0 
SV 
SVs, CSCL performs better than the probabilis- 
tic method, more significantly on the harder, 
SV, task. We attribute it to CSCL's ability to 
cope better with the length of the phrase and 
the long term dependencies. 
Our methods compare favorably with others 
with the exception to SV in (Mufioz et al., 
1999). Their method is fundamentally simi- 
lar to our CSCL; however, they incorporated 
the features from open in the close classifier al- 
lowing to exploit the dependencies between two 
classifiers. We believe that this is the main fac- 
tor of the significant difference in performance. 
6 Conclusion 
We have addressed the problem of combining 
the outcomes of several different classifiers in a 
way that provides a coherent inference that sat- 
isfies some constraints, While the probabilistic 
approach extends standard and commonly used 
techniques for sequential decisions, it seems 
that the constraint satisfaction formalisms can 
support complex constraints and dependencies 
more flexibly. Future work will concentrate on 
these formalisms. 

References 
S. Argamon, I. Dagan, and Y. Krymolowski. 1999. 
A memory-based approach to learning shallow 
natural  patterns. Journal of Experimen- 
tal and Theoretical Artificial Intelligence, 10:1-22. 
C. Cardie and D. Pierce. 1998. Error-driven prun- 
ing of treebanks grammars for base noun phrase 
identification. In Proc. of ACL-98, pages 218-224. 
A. Carleson, C. Cumby, J. Rosen, and D. Roth. 
1999. The SNoW learning architecture. Tech. Re- 
port UIUCDCS-R-99-2101, UIUC Computer Sci- 
ence Department, May. 
K. W. Church. 1988. A stochastic parts program 
and noun phrase parser for unrestricted text. In 
Proe. of A CL Conference on Applied Natural Lan- 
guage Processing. 
A. K. Mackworth. 1992. Constraint Satisfaction. In 
Stuart C. Shapiro, editor, Encyclopedia of Artifi- 
cial Intelligence, pages 285-293. Vol. 1, 2 nd ed. 
M. P. Marcus, B. Santorini, and M. Marcinkiewicz. 
1993. Building a large annotated corpus of En- 
glish: The Penn Treebank. Computational Lin- 
guistics, 19(2):313-330, June. 
A. McCallum, D. Freitag, and F. Pereira. 2000. 
Maximum entropy Markov models for information 
extraction and segmentation. In Proc. of ICML- 
2000. 
N. Morgan and H. Bourlard. 1995. Continuous 
speech recognition. IEEE Signal Processing Mag- 
azine, 12(3):25-42. 
M. Mufioz, V. Punyakanok, D. Roth, and D. Zimak. 
1999. A learning approach to shallow parsing. In 
Proc. of EMNLpo VLC'99. 
V. Punyakanok and D. Roth. 2000. Inference with 
classifiers. Tech. Report UIUCDCS-R-2000-2181, 
UIUC Computer Science Department, July. 
L. R. Rabiner. 1989. A tutorial on hidden Markov 
models and selected applications in speech recog- 
nition. Proc. of the IEEE, 77(2):257-285. 
L. A. Ramshaw and M. P. Marcus. 1995. Text 
chunking using transformation-based learning. In 
Proc. of WVLC'95. 
D. Roth. 1998. Learning to resolve natural lan- 
guage ambiguities: A unified approach. In Proc. 
of AAAI'98, pages 806-813. 
E. F. Tjong Kim Sang and J. Yeenstra. 1999. Rep- 
resenting text chunks. In Proc. of EACL'99. 
