Nymble: 
a High-Performance Learning Name-finder 
Daniel M. Bikei 
BBN Corporation 
70 Fawcett Street 
Cambridge, MA 02138 
dbikel@bbn.com 
Scott Miller 
BBN Corporation 
70 Fawcett Street 
Cambridge, MA 02138 
szmiller@bbn.com 
Richard Schwartz 
BBN Corporation 
70 Fawcett Street 
Cambridge, MA 02138 
schwartz@bbn.com 
Ralph Weischedel 
BBN Corporation 
70 Fawcett Street 
Cambridge, MA 02138 
weisched@bbn.com 
Abstract 
This paper presents a statistical, learned 
approach to finding names and other non- 
recursive entities in text (as per the MUC-6 
definition of the NE task), using a variant of 
the standard hidden Markov model. We 
present our justification for the problem and 
our approach, a detailed discussion of the 
model itself and finally the successful results 
of this new approach. 
1. Introduction 
In the past decade, the speech recognition commu- 
nity has had huge successes in applying hidden 
Markov models, or HMM's to their problems. More 
recently, the natural language processing community 
has effectively employed these models for part-of- 
speech tagging, as in the seminal (Church, 1988) and 
other, more recent efforts (Weischedel et al., 1993). 
We would now propose that HMM's have success- 
fully been applied to the problem of name-finding. 
We have built a named-entity (NE) recognition 
system using a slightly-modified version of an 
HMM; we call our system "Nymble". To our 
knowledge, Nymble out-performs the best published 
results of any other learning name-finder. Further- 
more, it performs at or above the 90% accuracy level, 
often considered "near-human performance". 
The system arose from the NE task as specified in 
the last Message Understanding Conference (MUC), 
where organization names, person names, location 
names, times, dates, percentages and money amounts 
were to be delimited in text using SGML-markup. 
We will describe the various models employed, the 
methods for training these models and the method for 
"decoding" on test data (the term "decoding" borrowed 
from the speech recognition community, since one 
goal of traversing an HMM is to recover the hidden 
state sequence). To date, we have successfully gained 
and used the model on both English and Spanish, the 
latter for MET, the multi-lingual entity task. 
2. Background 
2.1 Name-finding as an Information- 
theoretic Problem 
The basic premise of the approach is to consider 
the raw text encountered when decoding as though it 
had passed through a noisy channel, where it had been 
originally marked with named entities. ~ The job of 
the generative model is to model the original process 
which generated the name-class-annotated words, 
before they went through the noisy channel. 
More formally, we must find the most likely 
sequence of name-classes (NC) given a sequence of 
words (W): 
Pr(NC I W) (2.1) 
In order to treat this as a generative model (where it 
generates the original, name-class-annotated words), 
we use Bayes' Rule: 
Pr(NC \[ W)- Pr(W, NC) (2.2) 
Pr(W) 
and since the a priori probability of the word 
sequence--the denominator--is constant for any 
given sentence, we can maxi-mize Equation 2,2 by 
maximizing the numerator alone. 
1 See (Cover and Thomas, 1991), ch. 2, for an excellent overview 
of the principles of information theory. 
194 
2.2 Previous 
Approaches to Name- 
finding 
Previous approaches have 
typically used manually 
constructed finite state 
patterns (Weischedel, 1995, 
Appelt et al., 1995). For START-OF-SENTENCE. 
every new language and every ~ 
new class of new information 
to spot, one has to write a 
new set of rules to cover the 
new language and to cover the 
new class of information. A 
finite-state pattern rule 
attempts to match against a 
sequence of tokens (words), in 
much the same way as a 
general regular expression 
matcher. 
In addition to these finite- 
state pattern approaches, a 
variant of Brill rules has been applied to the problem, 
as outlined in (Aberdeen et al., 1995). 
2.3 Interest in Problem and Potential 
Applications 
The atomic elements of information extraction-- 
indeed, of language as a whole---could be considered 
the who, where, when and how much in a sentence. 
A name-finder performs what is known as surface- or 
lightweight-parsing, delimiting sequences of tokens 
that answer these important questions. It can be used 
as the first step in a chain of processors: a next level 
of processing could relate two or more named entities, 
or perhaps even give semantics to that relationship 
using a verb. In this way, further processing could 
discover the "what" and "how" of a sentence or body 
of text. 
Furthermore, name-finding can be useful in its 
own right: an Internet query system might use name- 
finding to construct more appropriately-formed 
queries: "When was Bill Gates born?" could yield the 
query "BS.1\]_ Gat:es"+born. Also, name-finding 
can be directly employed for link analysis and other 
information retrieval problems. 
3. Model 
We will present the model twice, first in a 
conceptual and informal overview, then in a more- 
detailed, formal description of it as a type of HMM. 
The model bears resemblance to Scott Miller's novel 
work in the Air Traffic Information System (ATIS) 
task, as documented in (Miller et al., 1994). 
PERSON 
ORGANIZATION 
(five other n'ame-classes) 
', 
NOT-A-NAME 
END-OF-SENTENCE 
Figure 3.1 Pictorial representation of conceptual model. 
3.1 Conceptual Model 
Figure 3.1 is a pictorial overview of our model. 
Informally, we have an ergodic HMM with only 
eight internal states (the name classes, including the 
NOT-A-NAME class), with two special states, the 
START- and END-OF-SENTENCE states. Within each 
of the name-class states, we use a statistical bigram 
language model, with the usual one-word-per-state 
emission. This means that the number of states in 
each of the name-class states is equal to the 
vocabulary size, IVI. 
The generation of words and name-classes proceeds 
in three steps: 
1. Select a name-class NC, conditioning on the 
previous name-class and the previous word. 
2. Generate the first word inside that name-class, 
conditioning on the current and previous name- 
classes. 
3. Generate all subsequent words inside the current 
name-class, where each subsequent word is 
conditioned on its immediate predecessor. 
These three steps are repeated until the entire observed 
word sequence is generated. Using the Viterbi 
algorithm, we efficiently search the entire space of all 
possible name-class assignments, maximizing the 
numerator of Equation 2.2, Pr(W, NC). 
Informally, the construction of the model in this 
manner indicates that we view each type of "name" to 
be its own language, with separate bigram 
probabilities for generating its words. While the 
number of word-states within each name-class is equal 
195 
to IVI, this "interior" bigram language model is 
ergodic, i.e., there is a probability associated with 
every one of the IVI 2 transitions. As a parameter- 
ized, trained model, if such a transition were never 
observed, the model "backs off" to a less-powerful 
model, as described below, in §3.3.3 on p. 4. 
3.2 Words and Word-Features 
Throughout most of the model, we consider words 
to be ordered pairs (or two-element vectors), 
composed of word and word-feature, denoted (w, f). 
The word feature is a simple, deterministic computa- 
tion performed on each word as it is ~ to or 
feature computation is an extremely small part of the 
implementation, at roughly ten lines of code. Also, 
most of the word features are used to distinguish 
types of numbers, which are language-independent. 2 
The rationale for having such features is clear: in 
Roman languages, capitalization gives good evidence 
of names. 3 
3.3 Formal Model 
This section describes the model formally, 
discussing the transition probabilities to the word- 
states, which "generate" the words of each name-class. 
3.3.1 Top Level Model 
As with most trained, probabilistic models, we 
WordFeamre 
i 
twoDi~itNum 
fourDigitNum 
containsDigitAndAlpha 
containsDigitAndDash 
containsDigitAndSlash 
containsDigitAndComma 
containsDigitAndPeriod 
otherNum 
allCaps 
capPeriod 
firstWord 
initCap 
lowercase 
other 
Example Text 
90 
1990 
A8956-67 
09-96 
11/9/89 
23,000.00 
1.00 
456789 
BBN 
M. 
first word of sentence 
Sally 
call 
Intuition 
Two-digit year 
Four digit year 
Product code 
Date 
Date 
Monetary amount 
Monetary amount, percentage 
Other number 
Organization 
Person name initial 
No useful capitalization information 
Capitalized word 
Uncapitalized word 
Punctuation marks, all other words 
Table 3.1 Word features, examples and intuition behind them 
looked up in the vocabulary. It produces one of the 
fourteen values in Table 3.1. 
These values are computed in the order listed, so 
that in the case of non-disjoint feature-classes, such as 
containsDigitAndAlpha and 
containsDigitAndDash, the former will take 
precedence. The first eight features arise from the 
need to distinguish and annotate monetary amounts, 
percentages, times and dates. The rest of the features 
distinguish types of capitalization and all other words 
(such as punctuation marks, which are separate 
tokens). In particular, the firstWord feature arises 
from the fact that if a word is capitalized and is the 
first word of the sentence, we have no good 
information as to why it is capitalized (but note that 
allCaps and capPeriod are computed before 
fir s tWord, and therefore take precedence). 
The word feature is the one part of this model 
which is language-dependent. Fortunately, the word 
196 
have a most accurate, most powerful model, which 
will "back off" to a less-powerful model when there is 
insufficient training, and ultimately back-off to 
unigram probabilities. 
In order to generate the first word, we must make 
a transition from one name-class to another, as well 
as calculate the likelihood of that word. Our intuition 
was that a word preceding the start of a name-class 
(such as "Mr.", "President" or other titles preceding 
the PERSON name-class) and the word following a 
name-class would be strong indicators of the 
subsequent and preceding name-classes, respectively. 
2 Non-english languages tend to use the comma and period in the reverse way in which English does, i.e., the comma is a decimal 
point and the period separates groups of three digits in large numbers. However, the re-ordering of the precedence of the two 
relevant word-features had little effect when decoding Spanish, so they were left as is. 
3 Although Spanish has many lower-case words in organization names. See §4.1 on p. 6 for more details. 
Accordingly, the probabilitiy for generating the first 
word of a name-class is factored into two parts: 
Pr(NC I NC_I, w_,)- Pr((w,f):rs, I NC, NC_I). 
(3.1) 
The top level model for generating all but the first 
word in a name-class is 
Pr((w,f) I (w,f)_,, NO). (3.2) 
There is also a magical "+end+" word, so that the 
probability may be computed for any current word to 
be the final word of its name-class, i.e., 
pr((+end+,other) l(w,f):,~,NC ). (3.3) 
As one might imagine, it would be useless to have 
the first factor in Equation 3.1 be conditioned off of 
the +end+ word, so the probability is conditioned on 
the previous real word of the previous name-class, 
i.e., we compute 
w_ = +end + if 
Pr(NC\[ NO_l, W_l ) 1 NC_, =START-OF-SENTENCE 
\[W I = last observed word otherwise 
(3.4) 
Note that the above probability is not conditioned on 
the word-feature of w_ l, the intuition of which is 
that in the cases where the previous word would help 
the model predict the next name-class, the word 
feature---capitalization in particular--is not impor- 
tant: "Mr." is a good indicator of the next word 
beginning the PERSON name-class, regardless of 
capitalization, especially since it is almost never seen 
as "mr.". 
3.3.2 Calculation of Probabilities 
The calculation of the above probabilities is 
straightforward, using events/sample-size: 
Pr(NC I NC_, ' W_l)= c(NC, NC_,,w_,) (3.5) 
c(NC_I,W_ 1 ) 
pr((w,f):r., \] NC, NC_,)= c((w'f)e'*:NC'NC-O (3.6) 
c( NC, NC_, ) c(lw, f>,(w,:>_,, ,vc) 
pr((w,f) I (w,f)_l, gC)= C((w'f)-l'gC) (3.7) 
where c0 represents the number of times the events 
occurred in the training data (the count). 
3.3.3 Back-off Models and Smoothing 
Ideally, we would have sufficient training (or at 
least one observation of!) every event whose 
conditional probability we wish to calculate. Also, 
ideally, we would have sufficient samples of that 
upon which each conditional probability is 
conditioned, e.g., for Pr(NC I NC_,, w_,), we would 
like to have seen sufficient numbers of NC_I, w_~. 
Unfortunately, there is rarely enough training data to 
compute accurate probabilities when "decoding" on 
new data. 
3.3.3.1 Unknown Words 
The vocabulary of the system is built as it trains. 
Necessarily, then, the system knows about all words 
for which it stores bigram counts in order to compute 
the probabilities in Equations 3.1 - 3.3. The 
question arises how the system should deal with 
unknown words, since there are three ways in which 
they can appear in a bigram: as the current word, as 
the previous word or as both. A good answer is to 
train a separate, unknown word-model off of held-out 
data, to gather statistics of unknown words occurring 
in the midst of known words. 
Typically, one holds out 10-20% of one's 
training for smoothing or unknown word-training. 
In order to overcome the limitations of a small 
amount of training data--particularly in Spanish--we 
hold out 50% of our data to train the unknown word- 
model (the vocabulary is built up on the first 50%), 
save these counts in training data file, then hold out 
the other 50% and concatentate these bigram counts 
with the first unknown word-training file. This way, 
we can gather likelihoods of an unknown word 
appearing in the bigram using all available training 
data. This approach is perfectly valid, as we am 
trying to estimate that which we have not 
legitimately seen in training. When decoding, if 
either word of the bigram is unknown, the model used 
to estimate the probabilities of Equations 3.1-3 is the 
unknown word model, otherwise it is the model from 
the normal training. The unknown word-model can 
be viewed as a first level of back-off, therefore, since 
it is used as a backup model when an unknown word 
is encountered, and is necessarily not as accurate as 
the bigram model formed from the actual training. 
3.3.3.2 Further Back-off Models and 
Smoothing 
Whether a bigram contains an unknown word or 
not, it is possible that either model may not have 
seen this bigram, in which case the model backs off 
to a less-powerful, less-descriptive model. Table 3.2 
shows a graphic illustration of the back-off scheme: 
197 
Name-class Bi~rams 
Pr(NC I NC_ I, w_,) 
Pr(NC I NC_,) 
Pr( NC) 
1 
number of name- classes 
First-word Biisrams 
Pr((w,:>,,,, I NC, l,:C_,) 
Pr((w,f) l (+begin+, other), NC) 
Pr((w,f) \[ NC) 
Pr(w I NC). Pr(f I NC) 
1 1 
IVl number of word features 
Table 3.2 Back-off strategy 
The weight for each back-off model is computed on- 
the-fly, using the following formula: 
If computing Pr(XIY), assign weight of ;~ to 
the direct computation (using one of the 
formulae of §3.3.2) and a weight of (1 - ;t.) to 
the back-off model, where ( i , 
/q, = I c(Y) J I + unique outcomes of Y 
c(Y) 
(3.8) 
where "old c(Y)" is the sample size of the model from 
which we are backing off. This is a rather simple 
method of smoothing, which tends to work well 
when there are only three or four levels of back-off: 
This method also overcomes the problem when a 
back-off model has roughly the same amount of 
training as the current model, via the first factor of 
Equation 3.8, which essentially ignores the back-off 
model and puts all the weight on the primary model, 
in such an equi-trained situation. 
As an example---disregarding the first factor--if 
we saw the bigram "come hither" once in training and 
we saw "come here" three times, and nowhere else did 
we see the word "come" in the NOT-A-NAME class, 
when computing 
Pr("hither" I "come", NOT-A-NAME), 
we would back off to the unigram probability 
Pr("hither" I NOT-A-NAME) 
with a weight of ½, since the number of unique 
outcomes for the word-state for "come" would be two, 
and the total number of times "come" had been the 
preceding word in a bigram would be four (a 
4 Any more levels of back-off might require a more sophisticated 
smoothing technique, such as deleted interpolation. No matter 
what smoothing technique is used, one must remember that 
smoothing is the art of estimating the probability of that which is 
unknown (i.e., not seen in training). 
Non-first-word Bi~rams 
Pr((w,S> I (w,S/_,, NC) 
Pr((w,S) I ::C) 
Pr(w l NC). Pr(fl NC) 
1 1 
IV\[ number of word features 
1/(1 + ¼) = ~ weight for the bigram probability, a 
l 1 - ~ = 3 weight for the back-off model). 
3.4 Comparison with a traditional HMM 
Unlike a traditional HMM, the probability of 
generating a particular word is 1 for each word-state 
inside each of the name-class states. An alternative--- 
and more traditional--model would have a small 
number of states within each name-class, each 
having, perhaps, some semantic signficance, e.g., 
three states in the PERSON name-class, representing a 
first, middle and last name, where each of these three 
states would have some probability associated with 
emitting any word from the vocabulary. We chose to 
use a bigram language model because, while less 
semantically appealing, such n-gram language models 
work remarkably well in practice. Also, as a first 
research attempt, an n-gram model captures the most 
general significance of the words in each name-class, 
without presupposing any specifics of the structure of 
names, ~i la the PERSON name-class example, above. 
More important, either approach is mathematically 
valid, as long as all transitions out of a given state 
sum to one. 
3.5 Decoding 
All of this modeling would be for naught were it 
not for the existence of an efficient algorithm for 
finding the optimal state sequence, thereby "decoding" 
the original sequence of name-classes. The number of 
possible state sequences for N states in an ergodic 
model for a sentence of m words is N m, but, using 
dynamic programming and an appropriate merging of 
multiple theories when they converge on a particular 
state--the Viterbi decoding algorithm--a sentence can 
be "decoded" in time linear to the number of tokens in 
the sentence, O(m) (Viterbi, 1967). Since we are 
198 
interested in recovering the name-class state sequence, 
we pursue eight theories at every given step of the 
algorithm. 
4. Implementation and Results 
4.1 Development History 
Initially, the word-feature was not in the model; 
instead the system relied on a third-level back-off part- 
of-speech tag, which in turn was computed by our 
stochastic part-of-speech tagger. The tags were taken 
at face value: there were not k-best tags; the system 
treated the part-of-speech tagger as a "black box". 
Although the part-of-speech tagger used capitalization 
to help it determine proper-noun tags, this feature was 
only implicit in the model, and then only after two 
levels of back-off! Also, the capitalization of a word 
was submerged in the muddiness of part-of-speech 
tags, which can "smear" the capitalization probability 
mass over several tags. Because it seemed that 
capitalization would be a good name-predicting 
feature, and that it should appear earlier in the model, 
we eliminated the reliance on part-of-speech 
altogether, and opted for the more direct, word-feature 
model described above, in §3. Originally, we had a 
very small number of features, indicating whether the 
word was a number, the first word of a sentence, all 
uppercase, inital-capitalized or lower-case. We then 
expanded the feature set to its current state in order to 
capture more subtleties related mostly to numbers; 
due to increased performance (although not entirely 
dramatic) on every test, we kept the enlarged feature 
set. 
Contrary to our expectations (which were based on 
our experience with English), Spanish contained 
many examples of lower-case words in organization 
and location names. For example, departamento 
("Department") could often start an organization 
name, and adjectival place-names, such as coreana 
("Korean") could appear in locations and by 
convention are not capitalized. 
4.2 Current Implementation 
The entire system is implemented in C++, atop a 
"home-brewed", general-purpose class library, 
providing a rapid code-compile-train-test cycle. In 
fact, many NLP systems suffer from a lack of 
software and computer-science engineering effort: run- 
time efficiency is key to performing numerous 
experiments, which, in turn, is key to improving 
performance. A system may have excellent 
performance on a given task, but if it takes long to 
compile and/or run on test data, the rate of 
improvement of that system will be miniscule 
compared to that which can run very efficiently. On a 
Spare20 or SGI Indy with an appropritae amount of 
RAM, Nymble can compile in 10 minutes, train in 5 
minutes and run at 6MB/hr. There were days in 
which we had as much as a 15% reduction in error 
rate, to borrow the performance measure used by the 
speech community, where error rate = 100% - F- 
measure. (See §4.3 for the definition of F-measure.) 
4.3 Results of evaluation 
In this section we report the results of evaluating 
the final version of the learning software. We report 
the results for English and for Spanish and then the 
results of a set of experiments to determine the 
impact of the training set size on the algorithm's 
performance in both English and Spanish. 
For each language, we have a held-out 
development test set and a held-out, blind test set. 
We only report results on the blind test set for each 
respective language. 
4.3.1 F-measure 
The scoring program measures both precision and 
recall, terms borrowed from the information-retrieval 
community, where 
P = number of correct responses and 
number responses 
R = number of correct responses (4.1) 
number correct in key 
Put informally, recall measures the number of "hits" 
vs. the number of possible correct answers as 
specified in the key file, whereas precision measures 
how many answers were correct ones compared to the 
number of answers delivered. These two measures of 
performance combine to form one measure of 
performance, the F-measure, which is computed by 
the weighted harmonic mean of precision and recall: 
F = (f12 + 1)RP (4.2) ( 2R)+P 
where ff represents the relative weight of recall to 
precision (and typically has the value 1). To our 
knowledge, our learned name-finding system has 
achieved a higher F-measure than any other learned 
system when compared to state-of-the-art manual 
(rule-based) systems on similar data. 
4.3.2 English and Spanish Results 
Our test set of English data for reporting results is 
that of the MUC-6 test set, a collection of 30 WSJ 
documents (we used a different test set during 
development). Our Spanish test set is that used for 
MET, comprised of articles from the news agency 
AFP. Table 4.1 illustrates Nymble's performance as 
compared to the best reported scores for each category. 
199 
Case LanBual~e 
Mixed English 
Upper English 
Mixed Spanish 
Best Reported 
Score \[ Nymble 
96 93 
89 9 1 
93 9 0 
Table 4.1 F-measure Scores 
4.3.3 The Amount of Training Data 
Required 
With any learning technique one of the important 
questions is how much training data is required to get 
acceptable performance. More generally how does 
performance vary as the training set size is increased 
or decreased? We ran a sequence of experiments in 
English and in Spanish to try to answer this question 
for the final model that was implemented. 
For English, there were 450,000 words of training 
data. By that we mean that the text of the document 
itself (including headlines but not including SGML 
tags) was 450,000 words long. Given this maximum 
size of training available to us, we successfully 
divided the training material in half until we were 
using only one eighth of the original training set size 
or a training set of 50,000 words for the smallest 
experiment. To give a sense of the size of 450,000 
words, that is roughly half the length of one edition 
of the Wall Street Journal. 
The results are shown in a histogram in Figure 
4.1 below. The positive outcome of the experiment 
is that half as much training data would have given 
almost equivalent performance. Had we used only 
one quarter of the data or approximately 100,000 
words, performance would have degraded slightly, 
only about 1-2 percent. Reducing the training set 
size to 50,000 words would have had a more 
significant decrease in the performance of the system; 
however, the performance is still impressive even 
with such a small training set. 
100. 
90. 
80. 
70. 
60, 
50. 
40, 
30, 
20, 
10 
0 ~.7~ i ~ 
i o= o~c • - o~ "2 "E 
Figure 4.1: Impact of Various Training 
Set Sizes on Performance in English. The 
learning algorithm performs remarkable well, nearly 
comparable to handcrafted systems with as little as 
100,000 words of training data. 
On the other hand, the result also shows that 
merely annotating more data will not yield dramatic 
improvement in the performance. With increased 
training data it would be possible to use even more 
detailed models that require more data and could 
achieve significantly improved overall system 
performance with those more detailed models. 
For Spanish we had only 223,000 words of 
training data. We also measured the performance of 
the system with half the training data or slightly 
more than 100,000 words of text. Figure 4.2 shows 
the results. There is almost no change in 
performance by using as little as 100,000 words of 
training data. 
Therefore the results in both languages were 
comparable. As little as 100,000 words of training 
data produces performance nearly comparable to 
handcrafted systems. 
200 
100 
90 
80 
70 
60 
50 
40 
30 
20 
10 
0 K~ 
tT~ .~ .~_ o~ .=- 
C.) t.- O. ~ _ 
~ ~~ 
Figure 4.2: Impact of Training Set Size on 
Performance in Spanish 
5. Further Work 
While our initial results have been quite favorable, 
there is still much that can be done potentially to 
improve performance and completely close the gap 
between learned and rule-based name-finding systems. 
We would like to incorporate the following into the 
current model: 
• lists of organizations, person names and 
locations 
• an aliasing algorithm, which dynamically updates 
the model (where e.g. IBM is an alias of 
International Business Machines) 
• longer-distance information, to find names not 
captured by our bigram model 
6. Conclusions 
We have shown that using a fairly simple 
probabilistic model, finding names and other 
numerical entities as specified by the MUC tasks can 
be performed with "near-human performance", often 
likened to an F of 90 or above. We have also shown 
that such a system can be gained efficiently and that, 
given appropriately and consistently marked answer 
keys, it can be trained on languages foreign to the 
trainer of the system; for example, we do not speak 
Spanish, but trained Nymble on answer keys marked 
by native speakers. None of the formalisms or 
techniques presented in this paper is new; rather, the 
approach to this task the model itself--is wherein 
lies the novelty. Given the incredibly difficult nature 
of many NLP tasks, this example of a learned, 
stochastic approach to name-finding lends credence to 
the argument that the NLP community ought to push 
these approaches, to find the limit of phenomena that 
may be captured by probabilistic, finite-state 
methods. 
8. Acknowledqements 
The work reported here was supported in part by 
the Defense Advanced Research Projects Agency; a 
technical agent for part of the work was Fort 
Huachucha under contract number DABT63-94-C- 
0062. The views and conclusions contained in this 
document are those of the authors and should not be 
interpreted as necessarily representing the official 
policies, either expressed or implied, of the Defense 
Advanced Research Projects Agency or the United 
States Government. 
We would also like to give special acknowledge- 
ment to Stuart Shieber, McKay Professor of 
Computer Science at Harvard University, who 
endorsed and helped foster the completion of this, the 
first phase of Nymble's development. 

References 

Aberdeen, J., Burger, J., Day, D., Hirschman, L., 
Robinson, P. and Vilain, M. (1995) In 
Proceedings of the Sixth Message 
Understanding Conference (MUC-6) Morgan 
Kaufmann Publishers, Inc., Columbia, 
Maryland, pp. 141-155. 

Appelt, D. E., Jerry R. Hobbs, Bear, J., Israel, D., 
Kameyama, M., Kehler, A., Martin, D., 
Myers, K. and Tyson, M. (1995) In 
Proceedings of the Sixth Message 
Understanding Conference (MUC-6) Morgan 
Kaufmann Publishers, Inc., Columbia, 
Maryland, pp. 237-248. 

Church, K. (1988) In Second Conference on Applied 
Natural Language Processing, Austin, Texas. 

Cover, T. and Thomas, J. A. (1991) Elements of 
Information Theory, John Wiley & Sons, Inc., 
New York. 

Miller, S., Bobrow, R., Schwartz, R. and Ingria, R. 
(1994) In Human Language Technology 
Workshop, Morgan Kaufmann Publishers, 
Plainsboro, New Jersey, pp. 278-282. 

Viterbi, A. J. (1967) IEEE Transactions on 
Information Theory, IT-13(2), 260-269. 

Weischedel, R. (1995) In Proceedings of the Sixth 
Message Understanding Conference (MUC-6)Morgan Kaufmann Publishers, Inc., 
Columbia, Maryland, pp. 55-69. 

Weischedel, R., Meteer, M., Schwartz, R., 
Ramshaw, L. and Palmucci, J. (1993) 
Computational Linguistics, 19(2), 359-382. 
