Semantic Tagging using a Probabilistic Context Free Grammar * 
Michael Collins 
Dept. of Computer and Information Science 
University of Pennsylvania 
200 South 33rd Street, Philadelphia, PA 19104 
mcollins@gradient.cis.upenn.edu
Scott Miller 
BBN Technologies 
70 Fawcett Street 
Cambridge, MA 02138 
szmiller@bbn.com
Abstract 
This paper describes a statistical model for extraction of 
events at the sentence level, or "semantic tagging", typi- 
cally the first level of processing in Information Extrac- 
tion systems. We illustrate the approach using a manage- 
ment succession task, tagging sentences with three slots 
involved in each succession event: the post, person com- 
ing into the post, and person leaving the post. The ap- 
proach requires very limited resources: a part-of-speech 
tagger; a morphological analyzer; and a set of training 
examples that have been labeled with the three slots and 
the indicator (verb or noun) used to express the event. 
Training on 560 sentences, and testing on 356 sentences, 
shows the accuracy of the approach is 77.5% (if partial 
slot matches are deemed incorrect) or 87.8% (if partial 
slot matches are deemed correct). 
1 Introduction 
Statistical models have been used quite successfully in 
natural language processing for recovery of hidden struc- 
ture such as part-of-speech tags, or syntactic structure. 
This paper considers semantic tagging of text within 
the context of information extraction, as in the Sixth 
Message Understanding Conference (MUC-6). MUC-6 
looked at extraction of events concerning management 
successions in newspaper texts: recovering the post, 
company, person entering and person leaving the post. 
We will concentrate on the initial stage of processing, 
extraction of events at the sentence level. For example, 
given the sentence 
Last week Hensley West, 59 years old, was 
named as president, a surprising development. 
the desired output from the system would be 
{IN = Hensley West, POST = president, IND = named } 
POST is a slot designating the title of the position, IN 
is the person coming in to fill the post, and IND is an 
"indicator" - usually a verb or a noun - used to express 
the event. Table 1 gives some more examples. 
The traditional approach to this problem, as exemplified
in SRI's FASTUS system (Appelt et al. 93), has been
* The work reported here was supported in part by the Defense Advanced
Research Projects Agency. Technical agents for part of this
work were Fort Huachuca under contract number DABT63-94-C-0062.
The views and conclusions contained in this document are those
of the authors and should not be interpreted as necessarily represent- 
ing the official policies, either expressed or implied, of the Defense 
Advanced Research Projects Agency or the United States Government. 
(IN Jack Bradley) was (IND named) (POST acting president and chief
executive officer) of this computer-network manufacturer.
(IN He) (IND succeeds) as (POST president) (OUT Edward Marinaro),
56, who is retiring.
(IN Mr. Stanley) 's (IND appointment) comes as AK Steel's Middletown
Works is under investigation by OSHA because of its safety
record, which includes one accident that killed four men in April 1994.
The unexpected (IND departure) of (OUT Citicorp's highly regarded
head of retail banking) and the appointment of a tobacco executive to
fill his shoes has caught most Citicorp employees off-guard and confounded
many analysts.
On Friday, the bank said (OUT Pei-yuan Chia), head of retail banking,
will (IND retire) this year.
Table 1 : Some example sentences. 
to use hand-coded rules. These are typically encoded in a 
series of finite-state transducers that progressively build 
information in a bottom-up fashion. We are interested in 
developing a machine-learning approach to this problem 
for two reasons: First, developing hand-coded rules is a 
lengthy task which requires a fairly considerable amount 
of expertise - and a new set of rules must be developed 
for each new domain. Annotating training text examples 
such as those in table 1 can conceivably be done by a 
non-expert. Second, writing accurate rules is difficult, as 
there are many complex interactions between the rules, 
and there are many details to be covered. This task be- 
comes even more complex when the interaction between 
the sentence-level rules and the later stages of process- 
ing (co-reference and merging, see section 1.1) is con- 
sidered. Machine learning techniques have been shown 
to be highly effective at managing this kind of complex- 
ity in applications such as speech recognition, part-of- 
speech tagging and parsing. 
Not surprisingly this problem can be approached us- 
ing finite-state tagging methods, which have previously 
been applied to part of speech tagging (Church 88) and 
named-entity identification (Bikel et al. 97). We initially 
consider this approach, but argue that the Markov ap- 
proximation gives an extremely bad parameterization of 
the problem. Instead, the method uses a Probabilistic 
Context Free Grammar (PCFG), which has the advan- 
tages of being flexible enough to allow a good parameter- 
ization of the problem, while having an efficient decod- 
ing algorithm, a variant of the CKY dynamic program- 
ming algorithm for parsing with context-free grammars. 
The PCFG does not encode linguistic phrase structure; 
rather, it is semantically motivated, modeling choices 
such as the choice of indicator, number of slots, fillers
for the slots, and generation of other "noise" words in 
the sentence. 
While this paper largely concentrates on the man- 
agement succession domain, and motivates many of the 
choices regarding representation with examples from it, 
the principles should be general enough to also work for 
other domains -- in fact, the method was originally developed
for an IE task involving company acquisitions
(identifying the buyer, seller and item being bought), and 
then moved to the management succession domain with 
no domain-specific tuning. 
1.1 The Complete Information Extraction Task 
While a semantic tagging model may be useful for many 
tasks, our primary motivation is to use it within an in- 
formation extraction (IE) system. Information extrac- 
tion tasks involve processing an input text to fill slots 
in an output template. The DARPA-sponsored Message
Understanding Conferences (MUCs) have evaluated IE 
systems from several research sites. Some previous tasks 
attempted at MUC involved extraction of information 
about terrorist attacks (MUC-4); joint ventures (MUC- 
5); and most recently, at MUC-6, management succes- 
sions. In this paper we concentrate on the management 
succession task; figure 1 gives an example input-output
pair from the domain 1.
(a) Who's News: Restor Industries Inc. 
RESTOR INDUSTRIES Inc. (Orlando, Fla.) -- 
Hensley E. West, 50 years old, was named 
president of this telecommunications-product 
concern. Mr. West, who most recently was a 
group vice president for DSC Communications 
Corp. in Dallas, fills a vacancy created by 
the retirement last September of John 
Bradley, 63. 
(b)
Event    Slot       Filler
1        IN         Hensley E. West
         OUT        John Bradley
         POST       president
         COMPANY    RESTOR INDUSTRIES Inc.
2        OUT        Hensley E. West
         POST       group vice president
         COMPANY    DSC Communications Corp.
Figure 1: (a) A sample text from WSJ, involving management
successions. (b) The two succession events in
the text.
Most systems described in the MUC-6 proceedings 
followed the following three stages of processing in map- 
ping an input text to a set of output templates: 
1 We've just shown the most important, "core" slots for the task -
the MUC-6 specification includes additional information such as the 
reason for the change, the title of the people involved etc. 
1) Pattern matching at the sentence level. This is the 
task that is approached in this paper. In the text in fig- 
ure 1, "Hensley E. West, 50 years old, was named presi- 
dent of this telecommunications-product concern" would 
be processed to give { IN = "Hensley E. West", POST
= "president", COMPANY = "this telecommunications-product
concern", VERB = "named" } and "... the retirement
last September of John Bradley ..." would give
{ OUT = "John Bradley", NOUN = "retirement", IND
= "resignation" }.
2) Coreference. Pronouns and definite NPs are
resolved to their antecedents. For example, "this 
telecommunications-product concern" would be re- 
solved to "RESTOR INDUSTRIES Inc.". This stage is 
important because pronouns and definite noun-phrases 
like "this telecommunications-product concern" are not 
informative slot-fillers. 
3) Merging. The information in a template may be 
spread across several sentences. In the merging stage the 
information from multiple mentions of the same event is 
merged into a single template. In the example, the infor- 
mation centered around "named" and "retirement" would 
be identified as referring to the same event, and would 
be combined to give { IN = "Hensley E. West", OUT 
= "John Bradley", POST = "president", COMPANY = 
"RESTOR INDUSTRIES Inc." } 
This paper's work attacks problem (1) alone, and is 
restricted to recovery of the IN, OUT and POST slots. 
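The merging stage (3) is outside the scope of this paper, but its effect on the example can be illustrated with a minimal sketch. The function name `merge`, the `CORE` tuple, and the rule "first mention wins" are assumptions made only for illustration; the sketch also assumes that the two mentions have already been judged to describe the same event and that coreference has already replaced the company description.

```python
# Hypothetical illustration of stage 3 (merging); not the method of any MUC-6 system.
CORE = ("IN", "OUT", "POST", "COMPANY")

def merge(mentions):
    """Combine sentence-level outputs for one event, keeping only the core slots."""
    merged = {}
    for mention in mentions:
        for slot in CORE:
            if slot in mention:
                # Assumption: the first mention that fills a slot wins.
                merged.setdefault(slot, mention[slot])
    return merged

named = {"IN": "Hensley E. West", "POST": "president",
         "COMPANY": "RESTOR INDUSTRIES Inc.", "VERB": "named"}
retirement = {"OUT": "John Bradley", "NOUN": "retirement", "IND": "resignation"}

event = merge([named, retirement])
print(event)
```

The indicator fields (VERB, NOUN, IND) are dropped because the final template in the example lists only the four core slots.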
1.2 Previous Work 
The majority of systems at MUC-6, including SRI's sys- 
tem FASTUS (Appelt et al. 93), and the best-performing 
system, from NYU (Grishman 95), used cascaded finite-state
transducers, which were built by hand. The domain
independent transducers tokenize the text, recognise per- 
son and company names, "chunk" noun and verb groups, 
and finally build some higher level, complete clauses. 
The domain specific rules then extract the slots from the 
sentence, using patterns such as the example on page 244 
of (Appelt et al. 93): "Company hires or recruits person 
from company as position". 
There have been a number of machine learning approaches
to the sentence-level stage of information extraction.
The AutoSlog system (Riloff 93; Riloff 96)
automatically learned "concept node" definitions for use 
on the MUC-4 terrorist events domain. A concept node 
specifies a trigger word, usually a verb, and maps syn- 
tactic roles with respect to this trigger to semantic slots 
- for example, a concept node might specify: if trigger
= "destroyed" and syntax = direct-object then concept 
= Damaged-Object (Damaged-object is the name of the 
slot in this case). A concept node may also specify hard 
or soft constraints on the slot-fillers. The system uses 
the CIRCUS parser (Lehnert et al. 93) to find the syn- 
tactic roles in relation to the trigger. AutoSlog learns 
concept nodes given input-output pairs like those in fig- 
ure 1, so the indicator words do not need to be specified. 
Experiments showed that running AutoSlog followed by 
5 hours of filtering the rules by hand gave a system that 
performed as well as a hand-crafted system. 
The CRYSTAL system (Soderland et al. 95) also
learns rules that map syntactic frames to semantic roles. 
The triggers can be more complicated than those in Au- 
toSlog, in that they can specify whole sequences of 
words, or restrict patterns by specifying words or classes 
in the surrounding context. CRYSTAL learns patterns 
by initially specifying a maximally detailed pattern for 
each training example, then progressively simplifying 
and merging patterns until some error bound is exceeded. 
CRYSTAL uses the BADGER sentence analyzer to give 
syntactic information. 
(Califf and Mooney 97) describe a system for extraction
of information about job postings from a newsgroup.
Relational learning is used to learn rule-based patterns 
that specify: 1) a pre-filler pattern that matches the text 
before the slot; 2) a pattern that must match the actual 
slot filler; and 3) a post-filler pattern that matches the text 
after the slot. The patterns can involve parts of speech, 
semantic classes of words, or the words themselves. An 
example pattern from (Califf and Mooney 97) for identi- 
fying locations is pre-filler = in, filler = 2 or fewer words 
all proper nouns, post-filler = word1 is ",", word2 is a
state. This matches phrases like "in Kansas City, Mis- 
souri" or "in Atlanta, Georgia". The learning algorithm 
starts with the most specific rule for each training exam- 
ple, then generalizes by merging similar rules. 
A major difference between the approaches described 
in (Riloff 93; Riloff 96; Soderland et al. 95) and the ap- 
proach in this paper is that (Riloff 93; Riloff 96; Soder- 
land et al. 95) rely on a syntactic parser to produce at least 
a shallow syntactic analysis. The approach described in 
this paper builds a system from a set of training exam- 
ples, with only a part-of-speech tagger and a morpho- 
logical analyzer as additional resources. The system in 
(Califf and Mooney 97) does not require a parser, but the 
patterns it uses are quite local (the pre-filler and post- 
filler patterns are adjacent to the slot). It isn't clear this 
method would work well for the management succes- 
sions domain where there are often many "noise" words 
between the slots and the indicator. Another major dif- 
ference between the methods is that the PCFG based 
method is probabilistic. This may be an advantage when
the sentence-level stage of processing is combined with 
the later merging and coreference stages, as it gives a 
principled way of combining evidence from the differ- 
ent stages of processing: an uncertainty at the sentence 
level may, for example, be resolved at the merging stage 
-- in this case it is useful for the sentence level system 
to be capable of giving a list of candidate analyses with 
associated probabilities. 
2 Background 
2.1 The Problem 
We assume the following definitions: 
1. A sentence, W, consists of n words, $w_1, w_2, \ldots, w_n$.
2. A template, T, is a non-empty set of slots, where 
each slot is a label together with a tuple giving the 
start and end point of the slot in the sentence. For 
example, T = {IN = (3, 4), OUT = (5, 6)} means 
there is an IN slot spanning words 3 to 4 inclusive, 
and an OUT slot spanning words 5 to 6 inclusive. 
In the management succession domain there are 
three possible slots, IN, OUT and POST (abbrevi- 
ated to I, O and P respectively). IN is the string 
denoting the person who is filling the post, OUT is 
the person who is leaving the post, and POST is the 
name of the post. 
3. In addition, we assume that each template contains 
an additional indicator slot, which is the verb or 
noun used to express the template. 
For example, a (W, T) pair might be 
W = Last week Hensley West, 59 years old, joined 
the company as president, a surprising development. 
T = {IN = (3, 4), POST = (14, 14), IND = (10, 10)}
As alternative notation in this paper we either list the 
strings in the template, for example T = {IN = "Hens- 
ley West", POST = "president", IND = "joined"}, or we 
show the (W, T) pair as a bracketed sentence: 
Last week (IN Hensley West), 59 years old, 
(IND joined) the company as (POST president)
, a surprising development. 
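The two notations describe the same object, and converting between them is mechanical. The following is a minimal sketch, assuming a template is stored as a dictionary from label to an inclusive, 1-indexed (start, end) span; the function name `bracket` and the tokenization (punctuation as separate tokens) are assumptions made for illustration.

```python
def bracket(words, template):
    """Render a (W, T) pair as a bracketed sentence.

    `template` maps a label to an inclusive, 1-indexed (start, end) word span.
    """
    starts = {start: label for label, (start, _) in template.items()}
    ends = {end for (_, end) in template.values()}
    out = []
    for i, word in enumerate(words, start=1):
        if i in starts:
            out.append("(" + starts[i])   # open the slot before its first word
        out.append(word)
        if i in ends:
            out[-1] += ")"                # close the slot after its last word
    return " ".join(out)

words = ("Last week Hensley West , 59 years old , joined the company "
         "as president , a surprising development .").split()
template = {"IN": (3, 4), "POST": (14, 14), "IND": (10, 10)}
print(bracket(words, template))
```

With this tokenization "joined" is word 10 and "president" is word 14, which agrees with the spans given in the (W, T) example.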
Table 1 shows more examples from the management suc- 
cession domain. The machine learning task is to learn a 
function that maps an arbitrary sentence W to a template 
T, given a training set of N pairs $(W_i, T_i)$, $1 \le i \le N$.
A test set of (W, T) pairs is used to evaluate the model. 
In addition to a training set, we assume the following 
resources: 
1. A part of speech (POS) tagger. The POS tagger de- 
scribed in (Ratnaparkhi 96) was used to tag both 
training and test data. 
2. A lexicon which maps each indicator word in train- 
ing data to a class, for example the morphologi- 
cal variants "join", "joins", "joined" and "joining" 
could all be mapped to the JOIN class. This can be 
done automatically by a morphological analyzer as 
in (Karp et al. 94), or by hand. (This resource is 
not strictly necessary, but will help to reduce sparse 
data problems). 
A probabilistic approach defines a conditional probability
$P(T \mid W)$ or a joint probability $P(T, W)$ for every
candidate template for a sentence. The most likely template
for a sentence W is then
$$T_{best} = \arg\max_T P(T \mid W) = \arg\max_T P(T, W) \qquad (1)$$
The major part of this paper will be concerned with for- 
malizing a stochastic model that defines P(T, W). 
3 A Probabilistic Model 
3.1 A Naive Approach - Finite State Tagging
It is useful to note that a (W, T) pair can be represented 
as a tagged sentence $w_1/t_1, w_2/t_2, \ldots, w_n/t_n$ where $T =
t_1, t_2, \ldots, t_n$ is the sequence of tags denoting the semantic
type for each word in the sentence. For example, the tags 
could be I, O, P, IND, for the 3 slots and the indicator, 
and N for other (noise) words, as in 
Last/N week/N Hensley/I West/I ,/N 59/N 
years/N old/N ,/N joined/IND the/N com- 
pany/N as/N president/P ,/N a/N surprising/N 
development/N ./N 
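The tagged form can be produced directly from the span representation of section 2.1. A minimal sketch, assuming the same dictionary-of-spans template as before; `TAG` and `to_tags` are illustrative names, not part of the original system.

```python
# Map slot labels to the one-letter tags used in the tagged sentence.
TAG = {"IN": "I", "OUT": "O", "POST": "P", "IND": "IND"}

def to_tags(n_words, template):
    """Return one tag per word: a slot/indicator tag inside a span, N elsewhere."""
    tags = ["N"] * n_words
    for label, (start, end) in template.items():
        for i in range(start, end + 1):      # spans are 1-indexed and inclusive
            tags[i - 1] = TAG[label]
    return tags

words = ("Last week Hensley West , 59 years old , joined the company "
         "as president , a surprising development .").split()
template = {"IN": (3, 4), "POST": (14, 14), "IND": (10, 10)}
tags = to_tags(len(words), template)
print(" ".join(f"{w}/{t}" for w, t in zip(words, tags)))
```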
As a straw man we consider using a standard bigram
tagging model to tag test-set sentences. (Church 88) used
this to recover part-of-speech tags; a related approach described
in (Bikel et al. 97) gives a useful decomposition
of $P(T, W)$ into two terms:
$$P(T, W) = P(L_1, L_2, \ldots, L_m) \times \prod_{i=1}^{m} P(W_i \mid L_i) \qquad (2)$$
$\{L_1, L_2, \ldots, L_m\}$ is the underlying sequence of tags; in the
above example $m = 7$ and the sequence is {N, I, N, IND,
N, P, N}. $W_i$ is the string of words under label $L_i$, for
example $W_1$ = {Last, week}, $W_2$ = {Hensley, West}.
The two terms are then simplified, using bigram Markov 
independence assumptions, to be 
$$P(L_1, L_2, \ldots, L_m) = P(L_1 \mid Start)\, P(End \mid L_m) \times \prod_{i=2}^{m} P(L_i \mid L_{i-1}) \qquad (3)$$
and (if label $L_i$ covers words $w_{s_i} \ldots w_{e_i}$)
$$P(W_i \mid L_i) = P(w_{s_i}, w_{s_i+1}, \ldots, w_{e_i} \mid L_i) = P(w_{s_i} \mid Start, L_i)\, P(End \mid w_{e_i}, L_i) \times \prod_{j=s_i+1}^{e_i} P(w_j \mid w_{j-1}, L_i) \qquad (4)$$
This finite state approach has been highly effective for 
part of speech tagging (Church 88) and name finding 
(Bikel et al. 97). However, the next section considers 
the characteristics of the task in more detail, and argues 
that a finite-state tagger is a poor model for the task. 
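Equations (2) to (4) can be read as a scoring procedure: collapse the per-word tags into (label, word-string) segments, then multiply label bigrams and within-segment word bigrams. The sketch below works in log space. The function names, the dictionary layout of the probability tables, and all probability values are assumptions for illustration only; no smoothing is applied.

```python
import math
from itertools import groupby

def segments(words, tags):
    """Collapse per-word tags into (L_i, W_i) pairs: a label and its word string."""
    pairs = zip(tags, words)
    return [(label, [w for _, w in group])
            for label, group in groupby(pairs, key=lambda p: p[0])]

def log_joint(segs, p_label, p_word):
    """log P(T, W) under equations (2)-(4).

    p_label[(prev, cur)] = P(cur | prev); p_word[(label, prev, cur)] = P(cur | prev, label).
    "Start" and "End" are the boundary symbols.
    """
    labels = ["Start"] + [label for label, _ in segs] + ["End"]
    total = sum(math.log(p_label[(a, b)]) for a, b in zip(labels, labels[1:]))   # eq. (3)
    for label, ws in segs:
        seq = ["Start"] + ws + ["End"]
        total += sum(math.log(p_word[(label, a, b)]) for a, b in zip(seq, seq[1:]))  # eq. (4)
    return total

# The paper's example: 19 tokens collapse into m = 7 segments.
words = ("Last week Hensley West , 59 years old , joined the company "
         "as president , a surprising development .").split()
tags = ["N", "N", "I", "I"] + ["N"] * 5 + ["IND"] + ["N"] * 3 + ["P"] + ["N"] * 5
example_labels = [label for label, _ in segments(words, tags)]

# A tiny scoring example with made-up probabilities (every factor is 0.5).
toy_segs = segments(["He", "retired"], ["O", "IND"])
p_label = {("Start", "O"): 0.5, ("O", "IND"): 0.5, ("IND", "End"): 0.5}
p_word = {("O", "Start", "He"): 0.5, ("O", "He", "End"): 0.5,
          ("IND", "Start", "retired"): 0.5, ("IND", "retired", "End"): 0.5}
toy_score = log_joint(toy_segs, p_label, p_word)
print(example_labels, toy_score)
```

The toy score is the log of a product of seven factors: three label transitions and two word transitions for each of the two segments.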
3.2 More about the task 
In developing an intuition for the task, and motivating 
the choices made in modeling it, it is useful to consider 
the types of information that may be useful to a system. 
Consider the following 5 points: 
1. There are 7 possible templates corresponding to the
7 non-empty subsets of {I,O,P}. The distribution 
over these alternatives is by no means uniform - see 
table 2 for the distribution. 
2. The different slots tend to contain quite different 
lexical items or strings - for example, the IN and 
OUT slots are most likely to contain a proper name 
or a personal pronoun, whereas the POST slot con- 
tains strings such as "president", "chairman" etc. 
3. The choice of indicator word depends greatly on the 
choice of template. For example "name" is very 
likely to be used to express an event involving a 
{I, P} template; "succeed" is very likely to express 
an {I, O, P} or {I, O} template. See the final column 
of table 2 for more examples. 
4. The relative order of the slots and indicator in the 
text varies considerably depending on the choice of 
indicator. For example, given the template {I} and 
the verb "join" the order is most likely to be { I 
Indicator } (e.g. IN joined the company); whereas 
given the verb "hire" the order is usually {Indicator 
I} (e.g. the company hired IN). 
5. In addition to the central indicator, there are often 
secondary indicators - mainly prepositions - which 
are strong signals of particular slots. For example, 
given the verb is "named" or "succeeded", the post 
is very likely to be preceded by the preposition "as" 
(e.g., the company named her as president, he suc- 
ceeds Jim Smith as president). 
By considering points 1-5 we can see that the finite- 
state tagging approach is deficient for the semantic tag- 
ging task. The lexical probabilities in equation (4) are 
probably sufficient to capture the lexical differences be- 
tween different states (the preference of the IN slot to 
generate proper names, of the POST slot to generate 
words like "president" and so on). But the Markov approximation
in equation (3) is deficient in many ways: it
fails to capture the non-uniform distribution over the 7
possible templates; worse still, it can
label more than one substring with the same slot label;
and it fails to capture the dependence of the slot order on the
indicator word, or the dependence between the template
and indicator.
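The second deficiency can be made concrete. In the minimal sketch below, `possible_templates`, `is_valid` and `bigrams` are illustrative names, and the validity criterion (exactly one indicator, at least one slot, no slot repeated) is our reading of the definitions in section 2.1. The label sequence with two IN slots uses only label transitions that also occur in the valid example sequence, so a bigram label model that gives the valid sequence non-zero probability cannot rule the invalid one out.

```python
from itertools import combinations

SLOTS = ("I", "O", "P")

def possible_templates():
    """The non-empty subsets of {I, O, P}."""
    return [set(c) for r in range(1, len(SLOTS) + 1) for c in combinations(SLOTS, r)]

def is_valid(label_seq):
    """One indicator, at least one slot, and no slot label used twice."""
    slots = [label for label in label_seq if label in SLOTS]
    return (label_seq.count("IND") == 1
            and len(slots) >= 1
            and len(slots) == len(set(slots)))

def bigrams(label_seq):
    """The set of label transitions, including the Start and End boundaries."""
    seq = ["Start"] + list(label_seq) + ["End"]
    return set(zip(seq, seq[1:]))

valid = ["N", "I", "N", "IND", "N", "P", "N"]      # the example sentence
invalid = ["N", "I", "N", "IND", "N", "I", "N"]    # IN generated twice

print(len(possible_templates()), is_valid(valid), is_valid(invalid))
print(bigrams(invalid) <= bigrams(valid))
```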
3.3 A Probabilistic Context-Free Grammar 
Our proposal is to replace the Markov assumption in
(3) with a probabilistic context-free grammar; that is, we
assume that the label sequence has been generated by
the application of r context-free rules $LHS_j \rightarrow RHS_j$,
$1 \le j \le r$ (LHS stands for left hand side, RHS stands
for right hand side), and that
$$P(L_1 L_2 \ldots L_m) = \prod_{i=1}^{r} P(RHS_i \mid LHS_i) \qquad (5)$$
Each LHS is a single non-terminal, and RHS is a string 
of one or more non-terminals. So for each non-terminal 
LHS in the grammar there is a distribution over possible 
RHSs which sums to one. Counts of context-free rules 
can be extracted from a training set of context free trees, 
Template   %age   Typical example             Most frequent indicator/percentage
{I,P}      45.6   IN was named as POST        name/67%
{O}        18.6   OUT retired                 retire/26%
{I,O}      16.6   IN succeeds OUT             succeed/98%
{I}         7.3   IN joined the company       join/31%
{O,P}       7.1   OUT resigned as POST        resign/46%
{I,O,P}     3.5   IN succeeded OUT as POST    succeed/100%
{P}         1.3   the company hired POST
Table 2: The distribution over the possible templates (the 7 non-empty subsets of {I,O,P}), and the most common
indicator for each template. For example, 45.6% of the templates are {I,P}, and in these 45.6% of the cases "name" is
chosen 67% of the time.
and used to estimate the parameters $P(RHS_i \mid LHS_i)$.
Given a test data sentence, the most likely tree (and hence 
the most likely template) can be recovered efficiently us- 
ing a variant of the CKY algorithm. 
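As an illustration of equation (5) and of estimating parameters from rule counts, the following minimal sketch computes unsmoothed relative-frequency estimates and scores a tree as the product of its rule probabilities. Trees are represented simply as lists of (LHS, RHS) rule applications. The non-terminal names S, A, B, C, D and the three toy trees are hypothetical, and `estimate` and `tree_prob` are illustrative names.

```python
import math
from collections import Counter

def estimate(trees):
    """Relative-frequency estimates of P(RHS | LHS) from a list of trees."""
    rule_counts = Counter(rule for tree in trees for rule in tree)
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

def tree_prob(tree, probs):
    """Equation (5): the product of the probabilities of the rules in the tree."""
    return math.prod(probs[rule] for rule in tree)

# Toy training set: each tree is the list of rules applied in it.
trees = [
    [("S", ("A", "B")), ("A", ("C",))],
    [("S", ("A",)), ("A", ("C",))],
    [("S", ("A", "B")), ("A", ("D",))],
]
probs = estimate(trees)
print(probs[("S", ("A", "B"))], tree_prob(trees[0], probs))
```

For each left hand side the estimates sum to one, as required of the distribution over its possible right hand sides.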
4 The Grammar 
This section describes the underlying context-free 
structure 2 that we assume has generated the labels, and 
motivates it in terms of the observations in section 3.2. 
The context-free structure (the tree topology, and the 
choice of non-terminal labels within the tree), is deter- 
ministically derived from the initial labeling of the sen- 
tences -- so given a set of labeled sentences, the context- 
free structures can be recovered and the parameters can 
be estimated. 
4.1 The Leaf Categories 
The tagging model as applied in the above example as- 
sumed five tags - for the IN, OUT, and POST slots, the 
indicator, and for noise (other words). In fact, we used 
rather more categories, which are listed in table 3. These 
labels can still be deterministically recovered from the la- 
beled sentence though, given the additional information 
of a mapping from indicator words to their morphological
stem (for example, the mapping "joined" -> JOIN).
The example sentence would have the following under- 
lying leaf labels: 
[PREN Last week ] [I Hensley West ]
[NOISE+ , 59 years old , ] [IND(JOIN) joined ]
[NOISE- the company ] [P.Prep-(JOIN) as ]
[P president ] [POSTN , a surprising development . ]
4.2 The Context-Free Component - a Brief Sketch
The PCFG model assumes the pre-terminal 3 label sequence
$\{L_1, L_2, \ldots, L_m\}$ has been generated by a stochastic
process with the following steps:
2 The structures in this paper are non-recursive, and could, therefore,
be equivalently handled by a hierarchy of finite-state transducers,
or even a single equivalent non-deterministic finite-state automaton.
However, it is quite possible that extensions to the models could
require recursive structures. 
3 By pre-terminal, we mean a non-terminal that dominates words
rather than other non-terminals. 
1. Decide whether to have noise words (PREN) before
the template TEMP.
2. Decide whether to have noise words (POSTN) after 
the template TEMP. 
3. Decide which slots to have (one of the 7 subsets of 
{I, O, P}). 
4. Decide the class of indicator words. 
5. Decide the order of the slots and indicator word. 
6. For each slot, choose whether to have noise between 
it and the indicator (NOISE+ or NOISE-). 
7. For each slot, choose whether to have a preposition 
directly preceding or following it. 
Figure 2 gives an example tree and describes the 
context-free rules within it. The next section describes 
the grammar in more detail, showing how these 7 types 
of decision can be encoded as context-free rules. 
4.3 The Context-Free Component in Detail 
This section describes the top-down derivation of a se- 
quence of leaves within a PCFG framework. 
Choosing noise at the start/end of the sentence 
This level of the model chooses whether to have noise 
preceding or following the text which expresses the suc- 
cession information. 
TOP   -> PREN TEMP1   #there is noise at start
TOP   -> TEMP1        #or there isn't
TEMP1 -> TEMP POSTN   #there is noise following
TEMP1 -> TEMP         #or there isn't
The TEMP non-terminal covers the span of the succession
information, in the above example "Hensley ... president".
P(PREN TEMP1 | TOP) can be interpreted as
the probability of having noise at the beginning of the
sentence; P(TEMP POSTN | TEMP1) is the probability
of having post-noise.
P(Slots) 
TEMP first re-writes in one of seven ways, correspond- 
ing to the 7 possible templates. The T. non-terminal 
encodes the slots that will be generated below it, for example
T.IO would generate an IN and an OUT slot below
it. So P(RHS | TEMP) will mirror the distribution
in column 2 of table 2. 
Leaf label        Description
I, O, P           The IN, OUT and POST non-terminals
PREN              "Noise" words before the template
POSTN             "Noise" words after the template
NOISE+            "Noise" between slots and the indicator, which comes before the indicator
NOISE-            "Noise" between slots and the indicator, which comes after the indicator
IND(class)        Leaf dominating the indicator; class can be any one of the morphological stems seen in training data.
                  For example, IND(join) could dominate "join", "joins", "joined" or "joining".
I.Prep-(class)    Prepositions for the I, O and P slots, for a particular class, and which follow the indicator. For example,
O.Prep-(class)    P.Prep-(join) would be a preposition for the POST slot, with an indicator in the join class, and would
P.Prep-(class)    most likely be "as".
I.Prep+(class)    Prepositions for the I, O and P slots, for a particular class, and which precede the indicator.
O.Prep+(class)
P.Prep+(class)
Table 3: The pre-terminal labels that are used in the system.
[Tree diagram: the context-free tree rooted at TOP over the sentence "Last week Hensley West , 59 years old , joined the company as president , a surprising development ."]
Rule                                      Interpretation
TOP -> PREN TEMP1                         Choose to have pre-noise (PREN)
TEMP1 -> TEMP POSTN                       Choose to have post-noise (POSTN)
TEMP -> T.IP                              Choose to have IN and POST slots
T.IP -> IND.IP(JOIN)                      Choose to use a member of the JOIN class of indicators
IND.IP(JOIN) -> IND.I(JOIN) P2-(JOIN)     Generate the POST slot to the right of the indicator
IND.I(JOIN) -> I2+(JOIN) IND(JOIN)        Generate the IN slot to the left of the indicator
I2+(JOIN) -> I1+(JOIN) NOISE+             Have noise between the IN slot and the indicator
I1+(JOIN) -> I                            Choose not to have a preposition for the IN slot
P2-(JOIN) -> NOISE- P1-(JOIN)             Have noise between the POST slot and the indicator
P1-(JOIN) -> P.Prep-(JOIN) P              Choose to have a preposition attached to the POST slot
Figure 2: An example context-free tree. The table shows the interpretation of each of the rules in the tree.
TEMP -> T.IOP    TEMP -> T.IO    TEMP -> T.I
TEMP -> T.IP     TEMP -> T.O
TEMP -> T.OP     TEMP -> T.P
P(Class | Slots)
The next step is to choose the Class of indicator that 
is used to express the transaction. Each Class is a set 
of words with the same morphological stem, for exam- 
ple the JOIN class would include join, joins, joined and 
joining. P(Class | Slots) is implemented in the CFG
fragment shown below. The IND non-terminal encodes 
which slots need to be generated, and the Class used to 
express the transaction. Each T. rule can re-write in N 
ways, where N is the number of classes. 
T.IOP -> IND.IOP[Class]    T.I -> IND.I[Class]
T.IO  -> IND.IO[Class]     T.O -> IND.O[Class]
T.IP  -> IND.IP[Class]     T.P -> IND.P[Class]
T.OP  -> IND.OP[Class]
P(Order | Class, Slots)
Having chosen the Slots to be generated, and the Class 
used to express the event, there are many possible orders 
in which the slots and class can appear. In the above ex- 
ample (Slots = {I, P}, Class = JOIN) there are 6 permuta- 
tions ( {JOIN I P}, {I JOIN P}, {I P JOIN } and so on). It 
is necessary to estimate a distribution over these alternatives.
The order is parameterized using a binary branching,
context-free fragment: part of this (all rules with
LHS = IND.IP[Class]) is shown below. The full
grammar specifies similar rules for all IND.X[Class]
where X is any one of the non-empty subsets of {I, O, P}.
IND.IP[Class] -> IND.I[Class] P2-[Class]
IND.IP[Class] -> P2+[Class] IND.I[Class]
IND.IP[Class] -> IND.P[Class] I2-[Class]
IND.IP[Class] -> I2+[Class] IND.P[Class]
The notation is: 
• IND keeps track of which slots still need to be
generated. For example IND.IP[Class] means
that the IN and POST slots need to be generated.
• The I2, O2, and P2 non-terminals will eventually
generate the IN, OUT and POST leaves. The "2"
stands for level 2 - more in the next section on why
this is necessary. "+" means the slot appears before
the head-word, "-" means it appears after. The
Class is propagated to the I2, O2 and P2 non-terminals.
Propagation of the Class and direction
(+ or -) is important because the identity of any 
prepositions is conditioned on this information. 
Each binary rule expresses a choice of which of the 
remaining slots to generate next, and which direction to 
generate it in. So IND.IP[Class] can re-write in 4 ways:
either the IN or POST slot can be generated either to the 
left or right of the head-word itself. 
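The pattern behind these order rules can be written down once for any set of remaining slots. The sketch below is an illustration only: `order_rules` is a hypothetical helper, and naming the non-terminal with no remaining slots `IND[Class]` is an assumption based on the IND leaf used in Figure 2.

```python
def order_rules(slots, cls):
    """Binary rules rewriting IND.<slots>[cls]: choose a remaining slot and a side."""
    def ind(remaining):
        name = "".join(s for s in "IOP" if s in remaining)
        # Assumption: with no slots left, the non-terminal is the indicator itself.
        return f"IND.{name}[{cls}]" if name else f"IND[{cls}]"

    lhs = ind(slots)
    rules = []
    for s in "IOP":
        if s in slots:
            rest = ind(set(slots) - {s})
            rules.append((lhs, (rest, f"{s}2-[{cls}]")))   # slot after the head-word
            rules.append((lhs, (f"{s}2+[{cls}]", rest)))   # slot before the head-word
    return rules

for lhs, rhs in order_rules({"I", "P"}, "Class"):
    print(lhs, "->", " ".join(rhs))
```

For a set of k remaining slots this yields 2k rewrites, so the call above reproduces the four IND.IP[Class] rules and a three-slot non-terminal would have six.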
Choosing to generate noise between the slots 
Noise can appear after any slot preceding the indicator, 
or before any slot following the indicator. The CFG rules 
below encode the decision to have noise in a gap or not, 
for an IN slot generated before or after the indicator. The 
rules for OUT and POST are similar. 
I2+[Class] -> I1+[Class] NOISE+
I2+[Class] -> I1+[Class]
I2-[Class] -> NOISE- I1-[Class]
I2-[Class] -> I1-[Class]
Choosing to generate a preposition (or other 
indicator) linked to a slot 
Any of the slots can have an adjacent "indicator", usu- 
ally a preposition. The rules below encode the binary 
decision of whether to include an indicator for an IN slot 
- the OUT and POST cases are similar. 
I1+[Class] -> I I.Prep+[Class]
I1+[Class] -> I
I1-[Class] -> I.Prep-[Class] I
I1-[Class] -> I
Again, for each I1, O1 or P1 non-terminal there are 
two possible re-writes, one binary, one unary, encoding 
whether or not to generate a preposition. The I.Prep, 
O.Prep and P.Prep non-terminals then generate the in- 
dicator with a bigram model. The non-terminal encodes 
whether the slot appears before or after the head-word 
("+" or "-"), and the Class of the head-word.
5 Training the Model 
There are two steps to training the model: first, recov- 
ering the underlying tree structure from the training data 
labels; second, deriving counts of the CF rule applica- 
tions and bigram sequences and using these to estimate 
the parameters of the model. 
5.1 Deriving the Tree Structures in Training Data 
While the tree structure described in section 4.3 may 
seem complex, it is important to realise that it can be 
deterministically derived from an annotator's labeling of 
the Slots and Indicator. This section describes how the 
structure is derived in a bottom up fashion using the fol- 
lowing annotated sentence as example input to the pro- 
cess: 
Last week [I Hensley West ] , 59 years old ,
[IND joined ] the company as [P president ] , a
surprising development.
The 6 stages are as follows: 
1. Identify the class of the indicator, and add this infor- 
mation to the IND label. Mark any prepositions ad- 
jacent to the slots. Label "noise" words with either 
PREN, POSTN, NOISE+ or NOISE-. The output 
from this stage would be: 
[PREN Last week ] [I Hensley West ] [NOISE+ , 59
years old , ] [IND(JOIN) joined ] [NOISE- the company ]
[P.Prep-(JOIN) as ] [P president ] [POSTN ,
a surprising development. ]
2. Build level I of the slots, by including attached 
prepositions, or just building a unary rule 
[I1+(JOIN) [I Hensley West ] ]
[P1-(JOIN) [P.Prep-(JOIN) as ] [P president ] ]
3. Build level 2 of the slots, by attaching NOISE+ or 
NOISE- leaves to the slots, 
[I2+(JOIN) [I1+(JOIN) [I Hensley West ] ] [NOISE+ , 59 years old , ] ]
[P2-(JOIN) [NOISE- the company ] [P1-(JOIN) [P.Prep-(JOIN) as ] [P president ] ] ]
4. Build the binary-branching context-free structure 
that defines the order of the slots and indicator. 
[IND(JOIN).IP I2+(JOIN) [IND(JOIN).P IND(JOIN) P2-(JOIN) ] ]
5. Add the top level of the tree 
[TOP [PREN Last week ] [T.IP IND(JOIN).IP ] [POSTN , a surprising development. ] ]
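Stages 2 and 3 of this derivation can be sketched in a few lines of code. The example below is illustrative only and assumes the stage 1 output is available as a list of (label, words) pairs; the function names `build_slot_levels` and `show`, and the tuple representation of trees, are hypothetical choices rather than the authors' implementation. Stages 4 and 5 are not covered.

```python
import re

# Stage 1 output for the example sentence, as (label, words) leaves.
SEGMENTS = [
    ("PREN", "Last week"),
    ("I", "Hensley West"),
    ("NOISE+", ", 59 years old ,"),
    ("IND(JOIN)", "joined"),
    ("NOISE-", "the company"),
    ("P.Prep-(JOIN)", "as"),
    ("P", "president"),
    ("POSTN", ", a surprising development."),
]
SLOT_LABELS = {"I", "O", "P"}


def build_slot_levels(segments):
    """Stages 2 and 3: wrap each slot leaf in its level-1 and level-2 nodes.

    A tree is a tuple (label, child, ...); a leaf is (label, words).
    Assumes well-formed stage 1 output with exactly one indicator.
    """
    ind_label = next(lab for lab, _ in segments if lab.startswith("IND("))
    cls = re.fullmatch(r"IND\((\w+)\)", ind_label).group(1)
    out, pending, seen_ind, i = [], [], False, 0
    while i < len(segments):
        leaf = segments[i]
        label = leaf[0]
        i += 1
        if label == ind_label:
            seen_ind = True
            out.append(leaf)
        elif label not in SLOT_LABELS:
            if seen_ind and (label == "NOISE-" or ".Prep-" in label):
                pending.append(leaf)  # attaches to the next slot
            else:
                out.append(leaf)      # PREN or POSTN
        elif not seen_ind:
            # Slot before the indicator: preposition and noise follow it.
            level1 = [f"{label}1+({cls})", leaf]
            if i < len(segments) and segments[i][0] == f"{label}.Prep+({cls})":
                level1.append(segments[i])
                i += 1
            level2 = [f"{label}2+({cls})", tuple(level1)]
            if i < len(segments) and segments[i][0] == "NOISE+":
                level2.append(segments[i])
                i += 1
            out.append(tuple(level2))
        else:
            # Slot after the indicator: noise and preposition precede it.
            noise = [p for p in pending if p[0] == "NOISE-"]
            prep = [p for p in pending if p[0] != "NOISE-"]
            pending = []
            level1 = (f"{label}1-({cls})", *prep, leaf)
            out.append((f"{label}2-({cls})", *noise, level1))
    return out


def show(tree):
    """Render a tree in bracketed notation."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return f"[{label} {children[0]} ]"
    return f"[{label} " + " ".join(show(c) for c in children) + " ]"


for constituent in build_slot_levels(SEGMENTS):
    print(show(constituent))
```

Because every decision is read directly off the annotator's labels, no search is involved; this is what makes the derivation deterministic.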
5.2 Context Free Rule Probabilities 
Once the training data is processed to have full context- 
free trees, the grammar can be automatically read from 
these trees, and event counts can be extracted and used 
to estimate the parameters of the model. The maximum 
likelihood estimate for a CF rule LHS -> RHS is 
$$P(RHS \mid LHS) = \frac{C(RHS, LHS)}{C(LHS)}$$
where C(x) is the number of times event x has been seen
in training data. This estimate can be unreliable,
particularly for low values of C(LHS). So we smooth this
estimate with a "backed off" estimate $P_b$:
$$P(RHS \mid LHS) = \lambda \frac{C(RHS, LHS)}{C(LHS)} + (1 - \lambda) P_b$$
where $0 < \lambda < 1$. The backed off estimate
$$P_b = \frac{C(RHS, LHS_b)}{C(LHS_b)}$$
is based on a subset of the context, so the estimate is
more robust but less detailed. For example,
$P_b$ for P(T.IOP -> IND.IOP[Class] | T.IOP) might be
P(T -> IND[Class] | T), i.e. an estimate that ignores
the slots when choosing the class of indicators. This
method borrows heavily from smoothing techniques in
language modeling for speech recognition; (Jelinek
90) describes methods for estimating $\lambda$.
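The interpolated estimate can be written directly from counts. The sketch below is illustrative only: the function name `smoothed_rule_prob`, the fixed default weight `lam=0.8`, the toy counts, and the choice to return the backed off estimate alone when C(LHS) is zero are all assumptions, not details given in the text (which refers to (Jelinek 90) for estimating the weight).

```python
from collections import Counter


def smoothed_rule_prob(rule_counts, lhs_counts, backoff_prob, lhs, rhs, lam=0.8):
    """Return lam * C(RHS, LHS) / C(LHS) + (1 - lam) * P_b.

    rule_counts maps (lhs, rhs) to C(RHS, LHS); lhs_counts maps lhs to C(LHS).
    backoff_prob is the backed off estimate P_b, computed elsewhere.
    """
    c_lhs = lhs_counts[lhs]
    if c_lhs == 0:
        # Assumption: with no observations of LHS, rely on the back-off alone.
        return backoff_prob
    max_likelihood = rule_counts[(lhs, rhs)] / c_lhs
    return lam * max_likelihood + (1.0 - lam) * backoff_prob


# Toy counts (hypothetical): LHS seen 4 times, rewriting as A 3 times.
rule_counts = Counter({("LHS", "A"): 3, ("LHS", "B"): 1})
lhs_counts = Counter({"LHS": 4})
print(smoothed_rule_prob(rule_counts, lhs_counts, 0.5, "LHS", "A"))
```

With these toy numbers the maximum likelihood estimate is 0.75 and the backed off estimate is 0.5, so the smoothed value lies between the two.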
5.3 Bigram Probabilities 
The bigram model is used at the leaves of the tree
to generate the words themselves, for example to
estimate P(the president | P). The most obvious
way to estimate this is as
$$P(\text{the} \mid START, P) \times P(\text{president} \mid \text{the}, P) \times P(END \mid \text{president}, P)$$
with smoothing being implemented by interpolation between
$P(w \mid w_{-1}, State) \rightarrow P(w \mid State) \rightarrow 1/V$, where V is the
vocabulary size. Unfortunately we do not have space to
go into the full details of the smoothing here (in the
final implementation part-of-speech information was also
used to smooth the estimates).
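A state-conditioned bigram model with this three-level back-off can be sketched as follows. This is a simplified illustration, not the authors' model: it uses fixed linear interpolation weights (`l_bigram`, `l_unigram`, both assumed values), omits the part-of-speech smoothing mentioned above, and the class name `StateBigram` and the training strings in the test are hypothetical.

```python
from collections import Counter

START, END = "<START>", "<END>"


class StateBigram:
    """Bigram model over words, conditioned on a state such as P or NOISE+."""

    def __init__(self, vocab_size, l_bigram=0.6, l_unigram=0.3):
        self.vocab_size = vocab_size
        self.l_bigram = l_bigram    # weight on P(w | w_prev, State)
        self.l_unigram = l_unigram  # weight on P(w | State)
        self.bigram = Counter()
        self.context = Counter()
        self.unigram = Counter()
        self.state_total = Counter()

    def observe(self, words, state):
        """Add the counts for one word string generated from a state."""
        seq = [START, *words, END]
        for prev, w in zip(seq, seq[1:]):
            self.bigram[(state, prev, w)] += 1
            self.context[(state, prev)] += 1
            self.unigram[(state, w)] += 1
            self.state_total[state] += 1

    def word_prob(self, w, prev, state):
        """Interpolate bigram, state-unigram and uniform 1/V estimates."""
        ctx = self.context[(state, prev)]
        p_bi = self.bigram[(state, prev, w)] / ctx if ctx else 0.0
        total = self.state_total[state]
        p_uni = self.unigram[(state, w)] / total if total else 0.0
        l_uniform = 1.0 - self.l_bigram - self.l_unigram
        return (self.l_bigram * p_bi + self.l_unigram * p_uni
                + l_uniform / self.vocab_size)

    def string_prob(self, words, state):
        """Probability of the whole string, including the END transition."""
        seq = [START, *words, END]
        prob = 1.0
        for prev, w in zip(seq, seq[1:]):
            prob *= self.word_prob(w, prev, state)
        return prob


model = StateBigram(vocab_size=1000)
model.observe(["the", "president"], "P")
print(model.string_prob(["the", "president"], "P"))
```

The uniform 1/V term guarantees that a word never seen in training still receives non-zero probability.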
6 Experiments 
This section describes experiments on the management 
successions domain. Before giving the results, we dis- 
cuss how to deal with sentences that have more than one 
indicator. 
6.1 Dealing with Sentences that have more than 
one Indicator 
Thus far the model has assumed that there is only one 
indicator per sentence. However, training data frequently 
has more than one indicator, as in 
Mr. Smith was named president of the com- 
pany, succeeding Fred Jones. 
There are two events in this sentence, one centered 
around named, the other centered around succeeding. 
The solution is to transform sentences in both training 
and test data to give one sentence per indicator, in this 
case the sentence would be expanded to give two sen- 
tences: 
Mr. Smith was *named* president of the com- 
pany, succeeding Fred Jones. 
Mr. Smith was named president of the com- 
pany, *succeeding* Fred Jones. 
The first sentence is for the named event, the second is
for succeeding. The indicator is replaced with *indicator*
to show that it is the one of interest. When decoding
test data, the model either recognises *named* as a
potential indicator but ignores succeeding, or ignores named
and recognises *succeeding*. If the sentence appeared in
training data it would be transformed to give two
training data trees. We should stress that this process is
completely automatic once the indicators have been identified
in the text.
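The expansion into one sentence per indicator is simple to automate. The sketch below assumes a tokenized sentence and a list of indicator token positions; the function name `expand_indicators` and this input format are hypothetical.

```python
def expand_indicators(tokens, indicator_positions):
    """Return one copy of the sentence per indicator, with that indicator starred."""
    expanded = []
    for pos in indicator_positions:
        copy = list(tokens)
        copy[pos] = f"*{copy[pos]}*"
        expanded.append(copy)
    return expanded


tokens = "Mr. Smith was named president of the company , succeeding Fred Jones .".split()
for sentence in expand_indicators(tokens, [3, 9]):
    print(" ".join(sentence))
```

Each output sentence marks exactly one indicator, so the single-indicator assumption of the model holds for every training and test item.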
6.2 Results 
The model was trained on 563 sentences, and tested on 
another 356 sentences. (That is, 563/356 sentences af- 
ter producing one sentence per indicator as described in 
section 6.1). The sentences were taken from the "Who's 
news" section of Wall Street Journal, which is almost ex- 
clusively about management successions. The training 
sentences were taken from 219 Who's News articles in 
the 1996 section, the test sentences were taken from 131 
articles in the 1995 section. The sentence level annota- 
tion was part of an annotation effort for the full extraction 
task, which therefore also marked the relevant corefer- 
ence relationships and the complete output template as 
in figure 1. 
The test data sentences always contain an event, and
every indicator that has 1 or more slots attached to it
is marked as *indicator*; no other potential indicators
are marked. This is an idealization, in that we avoid
problems of false positives, cases where a potential indicator
is not used to express an event. See section 6.4 for
suggestions about how to extend the model to deal with false
positives.
The results are shown in table 4. We define precision
and recall when comparing to the annotated test set
answers (gold standard) as
$$\text{Precision} = \frac{\text{Number of correct slots}}{\text{Number of slots proposed}}$$
$$\text{Recall} = \frac{\text{Number of correct slots}}{\text{Number of slots in the gold standard}}$$
In addition we report the standard "F-Measure", which is
a combination of precision and recall
$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The results are quoted for the IN, OUT and POST slots 
(the IND slot is not scored, as it is marked in test data and 
would score 100% recall/precision, inflating the scores). 
The number of "correct" slots varies depending on how 
partial matches are scored - a partial match is where an 
output slot does not match a gold standard slot exactly, 
but does partially overlap. For example, in
Bill Smith was elected vice president, human
resources.
the gold standard might designate the slot as "vice
president, human resources", whereas the program output
might just mark "vice president". We present three
precision/recall scores, where a partial match scores 0, 0.5
or 1.0, and
$$\text{Number of correct slots} = \text{Number of exact matches} + \text{Score for a partial match} \times \text{Number of partial matches}$$
Score for partial   Precision   Recall   F-Measure
0                   80.6%       74.6%    77.5%
0.5                 85.9%       79.6%    82.6%
1.0                 91.3%       84.5%    87.8%
Table 4: Results on 356 test data sentences, training on
563 sentences
6.3 Analysis of the results 
In this section we look at the errors the system makes 
in more detail. There are two categories of error: preci- 
sion errors (incorrect slots); and recall errors (slots the 
system failed to propose). For these tests we ran ex- 
periments on the training data, jack-knifing (i.e. using 
cross-validation) it into 4 sections, in each case training 
on three-quarters of the training set and testing on the 
other quarter. Tables 5 and 6 show the results on this 
data set. 
Gold   Proposed   Correct   Partial   Correct+Partial
986    944        769       71        840
Table 5: Results on the jack-knifed training set (Counts)
Score for partial   Precision   Recall   F-Measure
0                   81.5%       78.0%    79.7%
0.5                 85.2%       81.6%    83.4%
1.0                 89.0%       85.2%    87.0%
Table 6: Results on the jack-knifed training set
(Percentages)
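The percentages in table 6 follow from the counts in table 5 and the definitions of precision, recall and F-measure in section 6.2. The short sketch below recomputes them; the function name `score` is hypothetical.

```python
def score(gold, proposed, exact, partial, partial_credit):
    """Precision, recall and F-measure with a given credit per partial match."""
    correct = exact + partial_credit * partial
    precision = correct / proposed
    recall = correct / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from table 5: 986 gold slots, 944 proposed, 769 exact, 71 partial.
for credit in (0.0, 0.5, 1.0):
    p, r, f = score(986, 944, 769, 71, credit)
    print(f"{credit:.1f}  {100 * p:.1f}%  {100 * r:.1f}%  {100 * f:.1f}%")
```

Running this prints 81.5% / 78.0% / 79.7%, 85.2% / 81.6% / 83.4% and 89.0% / 85.2% / 87.0% for the three partial-match scores, matching table 6.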
6.3.1 Precision errors
Table 7 shows the 104 precision errors categorized by 
hand into four categories. These four categories were: 
1) Semantically Plausible. Here the model has se- 
lected a slot-filler that looks good semantically, but is 
ruled out for other reasons (usually syntactic). For ex- 
ample, 
The appointment puts (IN Mr. Zwirn) , 41 
years old, in line to succeed the unit's pres- 
ident, Frank R. Bakos, 58, who is (IND retir- 
ing) at year end. 
Here "Mr. Zwirn" is semantically a good filler for "retir- 
ing", but syntactically this is almost impossible. 
Error type                %age     Error sub-type      %age
Semantically Plausible    37.1%    Relative clauses    18.1%
                                   Subject             10.5%
                                   Other                8.6%
"Correct"                 25.7%    Good alternative    17.1%
                                   > 1 reference        8.6%
Bad lexical information    8.6%
Others                    28.6%
Table 7: The percentage of errors in each error category
We sub-divided this class into 3 sub-categories:
problems with relative clauses, as in the example above;
problems with non-relativized subjects, for example
"Brandon Sweitzer, 53, succeeds (IN Mr. Wakefield) as
president of Guy Carpenter and also (IND becomes) (POST
the unit's CEO), succeeding Richard Blum, 56"; and
problems that fell outside these categories.
2) "Correct". These slots were not seen in the gold
standard, but were deemed essentially correct, in that
they would not hurt (and might even help) the score of a
full system. They fall into two sub-categories: "good
alternative", where the model's output is different from the
gold standard but still looks reasonable, either because
the sentence has more than one reasonable answer, or because
the gold standard is simply wrong; and "> 1 reference", where
there is more than one reference to the slot filler in the
sentence, and the model has chosen a different one from
the gold standard. For example,
(OUT Mr. Johnson) , 52 , said he resigned 
(POST his positions as chief executive officer) 
Here the model marked "Mr. Johnson" as OUT, the an- 
notator marked "he", and both are in some sense correct. 
3) Bad Lexical Information. In these cases the model
selected a slot filler that is clearly bad for lexical reasons, 
for example 
Mr. Broeksmit is the (OUT latest) in a string 
of employees to (IND leave) the firm ... 
4) Other. Miscellaneous errors which do not fall into 
the above three categories. 
6.3.2 Recall Problems 
Of the 356 test-set sentences, 330 (92.7%) were pro- 
cessed by the system to give some output -- no output 
was produced for 26 cases. This accounts for the recall 
figures in table 4 being lower than the precision figures 
(for example, with a score of 0 for partials, precision = 
80.6%, recall = 74.6%, and 92.7%*80.6% = 74.7%). Of 
these 26 cases, 24 involved an indicator word that had 
never been seen in training data. The other 2 cases in- 
volved an unusual usage of "succeed", which had never 
been seen in training data and was peculiar enough for 
the system to fail to get an analysis (we set a probability 
threshold such that the machine gives up if it fails to find 
an analysis above this probability). 
6.4 Dealing with False Positives 
This work has made a simplifying assumption, that test 
sentences were marked with indicators that had one or 
more slots. This section considers how this process could 
be automated. 
A first step would be to identify in test data morpho- 
logical variants of words that had been seen as indicators 
in training data. However this would inevitably lead to 
false positives -- that is, potential indicators appearing 
in cases where they don't indicate an event. We could see 
two potential approaches for filtering out these spurious 
cases: first, word-sense disambiguation methods similar 
to those in (Yarowsky 95); second, we could extend the
model to have an eighth, empty, template as a possibility;
the model should then learn how often null templates
occur, and what kind of lexical items tend to produce
them.
We leave this to future work. At least in this dataset 
(Who's News articles) we believe that the false positive 
problem will not be severe, as the articles contain infor- 
mation almost exclusively on management successions, 
and most of the indicators are unambiguous within this 
sub-domain. 
The models have also made the assumption that an
indicator is used to express each event. This may not be
the case in all information extraction tasks; in some there
may not be clear indicator words. Again, we leave dealing
with this limitation to future work.
7 Future Work 
We anticipate two directions for future work: first, re- 
fining the current model to improve its performance, and 
second, extending the current model to encompass the 
complete information extraction task. 
7.1 Refining the Model 
When deciding on the direction of future work, it is
useful to consider the error analysis in table 7. The largest
class of errors (the "semantically plausible" class, 37.1%)
consisted of cases where the model picked a slot that was
semantically plausible, but syntactically impossible. It is unlikely that this
problem can be solved with the approach described here, 
even with vastly increased amounts of training data. Our 
feeling is that a full syntactic parser as a first stage could 
radically improve performance. An improved approach 
might be to fully integrate the recovery of syntactic struc- 
ture and semantic labelings, in a similar way to the ap- 
proach used in BBN's SIFT system (Miller et al. 98). 
7.2 Extending the Model 
As discussed in section 1.1, the standard approach to 
information extraction involves three stages of process- 
ing: sentence level pattern matching, coreference, and 
template merging. Of these stages, our current work ad- 
dresses only sentence level pattern matching. However, 
we believe that the generative statistical framework de- 
scribed in this paper could be extended advantageously 
to the complete information extraction problem. In ex- 
tending the framework, we envision that the information 
extraction task would be performed using an inverted 
"information production" model. 
We can think of this model as approximating, to some
degree, the process by which text is produced by an
author. Specifically, we assume that each message is
produced according to a four stage process:
1) First, the author decides what facts to express. For 
example, the text in figure 1 can be thought of as express- 
ing two succession events: IN = "Hensley E. West", OUT 
= "John Bradley", POST = "president", COMPANY = 
"RESTOR INDUSTRIES Inc.", and OUT = "Hensley 
E. West", POST = "group vice president", COMPANY 
= "DSC Communications Corp.". This process can be 
modeled as a prior probability distribution over sets of 
templates. In this example, the model would give the 
prior probability of a message containing exactly two 
succession templates: one containing slots IN, OUT, 
POST, COMPANY and the other containing slots OUT, 
POST, COMPANY. 
2) After deciding what facts to express, the author 
must decompose them into one or more component 
events. For example, the succession event IN = "Hensley 
E. West", OUT = "John Bradley", POST = "president", 
COMPANY = "RESTOR INDUSTRIES Inc." is decom- 
posed into two smaller events: IN = "Hensley E. West", 
POST = "president", COMPANY = "RESTOR INDUS- 
TRIES Inc." and IN = "Hensley E. West", OUT = "John 
Bradley". This process can be modeled as a probability 
distribution over "template splitting operations", condi- 
tioned on the full template being expressed. Template 
splitting operations are thus the generative analogue of 
the merging operations used in most information extrac- 
tion systems. 
3) Next, each component event must be expressed as a 
linguistic pattern. For example, the event IN = "Hensley 
E. West", POST = "president", COMPANY = "RESTOR 
INDUSTRIES Inc." is expressed as the linguistic pattern 
"IN ... was named POST of COMPANY", and the event 
IN = "Hensley E. West", OUT = "John Bradley" is ex- 
pressed as the linguistic pattern "IN ... fills a vacancy 
created by the retirement ... of OUT". This process can 
be modeled as a probability distribution over linguistic 
patterns, conditioned on the partial template being ex- 
pressed. Modeling this distribution is the subject of the 
main body of this paper. 
4) Finally, the entities involved in events must be
realized as word strings within patterns. For example,
"RESTOR INDUSTRIES Inc." is realized as "this
telecommunications-product concern", and "Hensley E.
West" is realized as "Mr. West". This process can
be modeled as a probability distribution over
"descriptor generating operations", conditioned on the entity
being expressed and other features of the text. For
example, given that the author intends to express "Hensley
E. West", and given that the full name appears earlier
in the text, the model would assign a certain probability
to generating the word string "Mr. West". In this case,
the descriptor generating operation would be [title + last
name].
Clearly, there are many details that would need to be
resolved before a complete generative model of
information extraction could be implemented. In this paper, we
have described a model containing two of the necessary
components: a prior model over templates, and a model
of linguistic patterns conditioned on those templates. A
complete generative model for IE would offer two
potentially powerful advantages. First, the model would
provide principled probability estimates for selecting the
most likely set of templates given an input message:
T = set of templates (the final output)
M = the message, C = components
P = linguistic patterns, S = slot fillers in T
D = descriptions used to express the slots
$$P(T \mid M) = \sum_{C,P} P(T, C, P \mid M)$$
where
$$P(T, C, P \mid M) = \frac{P(T) \times P(C \mid T) \times P(P \mid C) \times P(D \mid S)}{P(M)}$$
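To make the selection rule concrete, the sketch below ranks candidate template sets by summing the product of the four factors over their component and pattern derivations. It is purely illustrative: the function name `best_template_set`, the dictionary keys, and every probability value in the example are hypothetical, and the component models themselves are not implemented. Since P(M) is the same for every candidate, it is dropped from the ranking.

```python
from collections import defaultdict


def best_template_set(derivations):
    """Return the T maximizing sum over (C, P) of P(T) P(C|T) P(P|C) P(D|S).

    Each derivation is a dict naming its template set "T" and giving the
    four factor probabilities. The constant divisor P(M) is omitted.
    """
    totals = defaultdict(float)
    for d in derivations:
        totals[d["T"]] += (d["p_T"] * d["p_C_given_T"]
                           * d["p_P_given_C"] * d["p_D_given_S"])
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical numbers: T1 has two derivations, T2 has one.
derivations = [
    {"T": "T1", "p_T": 0.2, "p_C_given_T": 0.5, "p_P_given_C": 0.4, "p_D_given_S": 0.5},
    {"T": "T1", "p_T": 0.2, "p_C_given_T": 0.5, "p_P_given_C": 0.1, "p_D_given_S": 0.5},
    {"T": "T2", "p_T": 0.3, "p_C_given_T": 1.0, "p_P_given_C": 0.1, "p_D_given_S": 0.5},
]
best, totals = best_template_set(derivations)
print(best, totals)
```

In this toy case T2 has the higher prior, but T1 wins once the mass of its two derivations is summed, which is the effect of marginalizing over C and P.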
The second potential advantage derives from the gen- 
erative aspect of the proposed model. While there is an 
analogue in conventional IE systems for each of stages 2 
through 4 described above, there is no conventional ana- 
logue to stage 1: the prior model. We can think of this 
prior model as encoding domain-specific world knowl- 
edge about the plausibility of proposed sets of relations. 
8 Conclusions 
We have shown that a simple statistical model can iden- 
tify semantic slot-fillers in a management succession task 
with 83% accuracy (F-measure with a score of 0.5 for 
partial matches). The system was trained on only 560 
sentences, with the additional requirements of only a 
part-of-speech tagger and a morphological analyser. We 
initially considered a finite-state approach similar to that 
used for POS tagging (Church 88), or named-entity iden- 
tification (Bikel et al. 97), but argued that the Markov 
approximation gives a poor model for this task. The al- 
ternative, which has a PCFG component to define the 
probability of the underlying sequence of labels, allows 
a good parameterization of the problem, and can be de- 
coded efficiently using the CKY algorithm. Finally, we 
believe that the framework presented in this paper can be 
extended to model the complete information extraction 
process. 
Acknowledgements 
We would like to thank Richard Schwartz and Ralph 
Weischedel for many helpful discussions and sugges- 
tions concerning this work. We would also like to thank 
the anonymous reviewers for several useful comments. 

References 
D. Appelt, J. Hobbs, J. Bear, D. J. Israel, and M. Tyson. 1993. FASTUS:
a finite-state processor for information extraction from real-world
text. In Proceedings of IJCAI-93, (Chambéry, France), September
1993.
D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel. 1997. Nymble: 
a High-Performance Learning Name-finder. In Proceedings of the 
Fifth Conference on Applied Natural Language Processing, pages 
194-201. 
M. E. Califf and R. J. Mooney. 1997. Relational Learning of
Pattern-Match Rules for Information Extraction. In Proceedings of the ACL
Workshop on Natural Language Learning, Madrid, Spain.
K. Church. 1988. A Stochastic Parts Program and Noun Phrase Parser
for Unrestricted Text. Second Conference on Applied Natural
Language Processing, ACL.
R. Grishman. 1995. The NYU System for MUC-6 or Where's the
Syntax? In Proceedings of the Sixth Message Understanding
Conference, Morgan Kaufmann.
F. Jelinek. 1990. Self-organized Language Modeling for Speech
Recognition. In Readings in Speech Recognition. Edited by Waibel and
Lee. Morgan Kaufmann Publishers.
Daniel Karp, Yves Schabes, Martin Zaidel and Dania Egedi. A Freely 
Available Wide Coverage Morphological Analyzer for English. In 
Proceedings of the 15th International Conference on Computational 
Linguistics, 1994. 
W. Lehnert, J. McCarthy, S. Soderland, E. Riloff, C. Cardie, J. Peterson,
F. Feng, C. Dolan, and S. Goldman. 1993. University of
Massachusetts/Hughes: Description of the CIRCUS system as used for
MUC-5. In Proceedings of the Fifth Message Understanding
Conference (MUC-5), pages 277-290.
M. Marcus, B. Santorini and M. Marcinkiewicz. 1993. Building a Large
Annotated Corpus of English: the Penn Treebank. Computational
Linguistics, 19(2):313-330.
S. Miller, M. Crystal, H. Fox, L. Ramshaw, R. Schwartz, R. Stone,
R. Weischedel and the Annotation Group. 1998. Algorithms that
Learn to Extract Information. BBN: Description of the SIFT
System as used for MUC-7. In Proceedings of the Seventh Message
Understanding Conference.
Proceedings of the Third, Fourth, Fifth and Sixth Message
Understanding Conferences (MUC-3, MUC-4, MUC-5 and MUC-6). Morgan
Kaufmann, San Mateo, CA.
A. Ratnaparkhi. 1996. A Maximum Entropy Model for Part-Of-Speech 
Tagging. Conference on Empirical Methods in Natural Language 
Processing, May 1996. 
E. Riloff. 1993. Automatically Constructing a Dictionary for
Information Extraction Tasks. In Proceedings of the Eleventh National
Conference on Artificial Intelligence, Washington, DC. AAAI Press /
MIT Press. 811-816.
E. Riloff. 1996. Automatically Generating Extraction Patterns from 
Untagged Text. In Proceedings of the Thirteenth National Confer- 
ence on Artificial Intelligence, Portland, OR. AAAI Press / MIT 
Press. 1044-1049. 
S. Soderland, D. Fisher, J. Aseltine and W. Lehnert. 1995. CRYSTAL:
Inducing a Conceptual Dictionary. In Proceedings of the Fourteenth
International Conference on Artificial Intelligence. AAAI Press /
MIT Press. 1314-1319.
Yarowsky, D. 1995. Decision Lists for Lexical Ambiguity Resolution: 
Application to Accent Restoration in Spanish and French. In Pro- 
ceedings of the 32nd Annual Meeting of the Association for Compu- 
tational Linguistics. Las Cruces, NM, pp. 88-95, 1994. 
