Fertility Models for Statistical Natural Language Understanding 
Stephen Della Pietra*, Mark Epstein, Salim Roukos, Todd Ward
IBM Thomas J. Watson Research Center 
P.O. Box 218 
Yorktown Heights, NY 10598, USA 
(*Now with Renaissance Technologies, Stony Brook, NY, USA)
sdella@rentec.com
[meps/roukos/tward]@watson.ibm.com
Abstract 
Several recent efforts in statistical nat- 
ural language understanding (NLU) have 
focused on generating clumps of English 
words from semantic meaning concepts 
(Miller et al., 1995; Levin and Pierac- 
cini, 1995; Epstein et al., 1996; Epstein, 
1996). This paper extends the IBM Ma- 
chine Translation Group's concept of fertil- 
ity (Brown et al., 1993) to the generation 
of clumps for natural language understanding.
The basic underlying intuition is that a single
concept may be expressed in English as many
disjoint clumps of words. We
present two fertility models which attempt 
to capture this phenomenon. The first is 
a Poisson model which leads to appeal- 
ing computational simplicity. The second 
is a general nonparametric fertility model. 
The general model's parameters are boot- 
strapped from the Poisson model and up- 
dated by the EM algorithm. These fertility 
models can be used to impose clump fertil- 
ity structure on top of preexisting clump 
generation models. Here, we present results
for adding fertility structure to unigram,
bigram, and headword clump generation models
on ARPA's Air Travel Information Service
(ATIS) domain.
1 Introduction 
The goal of a natural language understanding (NLU) 
system is to interpret a user's request and respond 
with an appropriate action. We view this interpretation
as translation from a natural language expression, E,
into an equivalent expression, F, in an unambiguous
formal language. Typically, this formal language will be
hand-crafted to enhance performance on some task-specific
domain. A statistical NLU system translates a request E
as the most likely formal expression $\hat{F}$ according
to a probability model p,

$$\hat{F} = \arg\max_{F} p(F \mid E) = \arg\max_{F} p(F, E).$$
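Maximizing the conditional p(F | E) and maximizing the joint p(F, E) select the same F. A one-line justification, added here for convenience and using only the definition of conditional probability:

$$\arg\max_{F} p(F \mid E) = \arg\max_{F} \frac{p(F, E)}{p(E)} = \arg\max_{F} p(F, E), \qquad \text{since } p(E) > 0 \text{ does not depend on } F.$$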
We have previously built a fully automatic statis- 
tical NLU system (Epstein et al., 1996) based on the 
source-channel factorization of the joint distribution
p(F, E):

$$p(F, E) = p(F)\,p(E \mid F).$$

This factorization, which has proven effective in
speech recognition (Bahl, Jelinek, and Mercer,
1983), partitions the joint probability into an a priori
intention model p(F), and a translation model
p(E | F) which models how a user might phrase a
request F in English.
For the ATIS task, our formal language is a mi- 
nor variant of the NL-Parse (Hemphill, Godfrey, and 
Doddington, 1990) used by ARPA to annotate the 
ATIS corpus. An example of a formal and natural 
language pair is: 
• F: List flights from New Orleans to Memphis 
flying on Monday departing early_morning 
• E: do you have any flights going to Memphis 
leaving New Orleans early Monday morning 
Here, the evidence for the formal language concept 
'early_morning' resides in the two disjoint clumps of 
English 'early' and 'morning'. In this paper, we
introduce the notion of concept fertility into our
translation models p(E | F) to capture this effect and the
more general linguistic phenomenon of embedded
clauses. Basically, this entails augmenting the
translation model with terms of the form p(n | f), where n
is the number of clumps generated by the formal
language word f. The resulting model can be trained
automatically from a bilingual corpus of English and 
formal language sentence pairs. 
Other attempts at statistical NLU systems have 
used various meaning representations such as con- 
cepts in the AT&T system (Levin and Pieraccini, 
1995) or initial semantic structure in the BBN sys- 
tem (Miller et al., 1995). Both of these systems re- 
quire significant rule-based transformations to pro- 
duce disambiguated interpretations which are then 
used to generate the SQL query for ATIS. More re- 
cently, BBN has replaced handwritten rules with de- 
cision trees (Miller et al., 1996). Moreover, both sys- 
tems were trained using English annotated by hand 
with segmentation and labeling, and both systems 
produce a semantic representation which is forced 
to preserve the time order expressed in the Eng- 
lish. Interestingly, both the AT&T and BBN sys- 
tems generate words within a clump according to 
bigram models. Other statistical approaches to NLU
include decision trees (Kuhn and Mori, 1995) and 
neural nets (Gorin et al., 1991). 
In earlier IBM translation systems (Brown et al., 
1993) each English word would be generated by, 
or "aligned to", exactly one formal language word. 
This mapping between the English and formal lan- 
guage expressions is called the "alignment". In the 
simplest case, the translation model is simply pro- 
portional to the product of word-pair translation 
probabilities, one per element in the alignment. In 
these models, the alignment provides all of the struc- 
ture in the translation model. The alignment is a 
"hidden" quantity which is not annotated in the 
training data and must be inferred indirectly. The 
EM algorithm (Dempster, Laird, and Rubin, 1977) 
used to train such "hidden" models requires us to 
sum an expression over all possible alignments. 
These early models were developed for French to 
English translation. However, in NLU there is a fun- 
damental asymmetry between the natural language 
and the unambiguous formal language. Most notably,
one formal language word may frequently correspond
to whole English phrases. We added the
"clump", an extra layer of structure, to accommodate
this phenomenon (Epstein et al., 1996). In this para- 
digm, formal language words first generate a clump- 
ing, or partition, of the word slots of the English 
expression. Then, each clump is filled in according 
to a translation model as before. The alignment is 
defined between the formal language words and the 
clumps. Then, both the alignment and the clumping 
are hidden structures which must be summed over 
to train the models. 
Already, these models represent significant 
progress. They learn automatically from a bilin- 
gual corpus of English and formal language sen- 
tences. They do not require linguistically knowl- 
edgeable experts to tediously annotate a training 
corpus. Rather, they rely upon a group of trans- 
lators with significantly less linguistic knowledge to 
produce a bilingual training corpus. The fertility 
models introduced below maintain these benefits 
while slightly improving performance. 
2 Fertility Clumping Translation Models
The rationale behind a clumping model is that 
the input English can be clumped or bracketed into 
phrases. Each clump is then generated from a sin- 
gle formal language word using a translation model. 
The notion of what constitutes a natural clumping 
depends on the formal language. For example, sup- 
pose the English sentence were: 
I want to fly to Memphis please. 
If the formal language for this sentence were: 
LIST FLIGHTS TO LOCATION, 
then the most plausible clumping would be: 
[I want] [to fly] [to] [Memphis] [please],
for which we would expect "[I want]" and "[please]"
to be generated from "LIST", "[to fly]" from
"FLIGHTS", "[to]" from "TO", and "[Memphis]"
from "LOCATION". Similarly, if the formal language
were: 
LIST FLIGHTS DESTINATION_LOC 
then the most natural clumping would be: 
[I want] [to fly] [to Memphis] [please],
in which we would now expect "[to Memphis]" to be
generated by "DESTINATION_LOC". 
Although these clumpings are perhaps the most
natural, neither the clumping nor the alignment is 
annotated in our training data. Instead, both the 
alignment and the clumping are viewed as "hidden" 
quantities for which all values are possible with some 
probability. The EM algorithm is used to produce a 
maximum likelihood estimate of the model parame- 
ters, taking into account all possible alignments and 
clumpings. 
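To get a feel for the size of the hidden clumping space, the sketch below enumerates every partition of a sentence into contiguous clumps. The function name `clumpings` is illustrative only; nothing here is taken from the authors' implementation, and training would not enumerate the space this way.

```python
from itertools import combinations
from typing import Iterator, List

def clumpings(words: List[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `words` into contiguous, non-empty clumps."""
    n = len(words)
    for k in range(n):  # k = number of internal clump boundaries
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield [words[a:b] for a, b in zip(bounds, bounds[1:])]

sentence = "I want to fly to Memphis please".split()
all_clumpings = list(clumpings(sentence))
```

For the seven-word example sentence this yields 2^6 = 64 clumpings, one of which is [I want] [to fly] [to] [Memphis] [please].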
In the discussion of fertility models we denote an
English sentence by E, which consists of $\ell(E)$ words.
Similarly, we denote the formal language expression by F, a
tuple of order $\ell(F)$, whose individual elements are
denoted by $f_i$. A clumping for a sentence partitions
E into a tuple of clumps C. The number of clumps
in C is denoted by $\ell(C)$, and is an integer in the
range $1 \ldots \ell(E)$. A particular clump is denoted by
$c_i$, where $i \in \{1 \ldots \ell(C)\}$. The number of words in
$c_i$ is denoted by $\ell(c_i)$; $c_1$ begins at the first word
in the sentence, and $c_{\ell(C)}$ ends at the last word in
the sentence. The clumps form a proper partition
of E. All the words in a clump c must align to the
same f. An alignment between E and F determines
which f generates each clump of E in C. Similarly,
A denotes the alignment, with $\ell(A) = \ell(C)$, and
$a_i$ denotes the formal language word to which each e
in $c_i$ aligns. The individual words in a clump c are
represented by $e_1 \ldots e_{\ell(c)}$.
For all fertility models, the fundamental parameters
are the joint probabilities p(E, C, A, F). Since
the clumping and alignment are hidden, to compute
the probability that E is generated by F, one calculates:

$$p(E \mid F) = \sum_{C,A} p(E, C, A \mid F)$$
3 General and Poisson Fertility 
In the general fertility model, the translation probability
with "revealed" alignment and clumping is

$$p(E, C, A \mid F) = \frac{1}{L!} \prod_{i=1}^{\ell(F)} p(n_i \mid f_i)\, n_i! \prod_{j=1}^{\ell(C)} p(c_j \mid f_{a_j}) \tag{1}$$

$$p(c \mid f) = p(\ell(c) \mid f) \prod_{i=1}^{\ell(c)} p(e_i \mid f_c) \tag{2}$$

where $p(n_i \mid f_i)$ is the fertility probability of generating
$n_i$ clumps by formal word $f_i$, and $f_c$ is the formal word
that generates clump c. Note that $\sum_i n_i = L$, the total
number of clumps. The factorial terms combine to give an
inverse multinomial coefficient which is the uniform
probability distribution for the alignment A of F to
C.
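As a concrete reading of equations 1 and 2, the following sketch scores a single, fully revealed clumping and alignment. The function names and dictionary layouts are assumptions made for illustration; the fertility values follow the early_morning row of Table 1, while the length and word probabilities are invented.

```python
from collections import Counter
from math import factorial, prod

def clump_prob(clump, f, p_len, p_word):
    """Equation 2: clump-length probability times a unigram word product."""
    return p_len[f][len(clump)] * prod(p_word[f][e] for e in clump)

def revealed_prob(formal, clumps, align, p_fert, p_len, p_word):
    """Equation 1 for one fixed clumping and alignment.

    align[j] is the position in `formal` of the word generating clumps[j].
    """
    counts = Counter(align)  # n_i: number of clumps generated by position i
    fert = prod(p_fert[f][counts[i]] * factorial(counts[i])
                for i, f in enumerate(formal))
    gen = prod(clump_prob(c, formal[a], p_len, p_word)
               for c, a in zip(clumps, align))
    return fert * gen / factorial(len(clumps))

# Fertilities from the early_morning row of Table 1; the rest is made up.
p_fert = {"early_morning": {0: 0.00, 1: 0.16, 2: 0.69}}
p_len = {"early_morning": {1: 0.8, 2: 0.2}}
p_word = {"early_morning": {"early": 0.5, "morning": 0.5}}

score = revealed_prob(["early_morning"], [["early"], ["morning"]], [0, 0],
                      p_fert, p_len, p_word)
```

In this toy case the n! term for the single formal word and the 1/L! prefactor are both 2!, so they cancel and the score reduces to 0.69 times the product of the two clump probabilities.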
It appears that the computation of the likelihood, 
which is the sum of $\ell(F)(\ell(F)+1)^{\ell(E)-1}$ product
terms, is exponential. Although dynamic programming
can reduce the complexity, there remains an
exponentially large number of terms to evaluate in 
each iteration of the EM algorithm. We resort to 
a top-N approximation to the EM sum for the gen- 
eral model, summing over candidate clumpings and 
alignments proposed by the Poisson fertility model 
developed below. 
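The size of the hidden space can be counted directly. The following count is a sketch that assumes every clump may align to any of the $\ell(F)$ formal words, and uses the shorthand $n = \ell(E)$, $m = \ell(F)$ (introduced here, not in the original notation):

$$\sum_{L=1}^{n} \binom{n-1}{L-1}\, m^{L} = m\,(m+1)^{n-1}$$

There are $\binom{n-1}{L-1}$ ways to cut n words into L contiguous clumps and $m^{L}$ alignments for each; the closed form follows from the binomial theorem. For a seven-word sentence and a four-word formal expression this is already $4 \cdot 5^{6} = 62{,}500$ terms.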
If one assumes that the fertility is modeled by the
Poisson distribution with mean fertility $\lambda_f$:

$$p(n \mid f) = \frac{e^{-\lambda_f} \lambda_f^{n}}{n!} \tag{3}$$

then a polynomial time training algorithm exists.
The simplicity arises from the fortuitous cancellation
of n! between the Poisson distribution and the
uniform alignment probability. Substituting equation 3
into equation 1 yields:

$$p(E, C, A \mid F) = \frac{1}{L!} \prod_{i=1}^{\ell(F)} e^{-\lambda_{f_i}} \lambda_{f_i}^{n_i} \prod_{j=1}^{\ell(C)} p(c_j \mid f_{a_j}) \tag{4}$$

$$= \frac{1}{L!} \prod_{i=1}^{\ell(F)} e^{-\lambda_{f_i}} \prod_{j=1}^{\ell(C)} q(c_j \mid f_{a_j}) \tag{5}$$

$$q(c \mid f) \equiv \lambda_f \, p(c \mid f), \tag{6}$$

where $\lambda_f^{n}$ has been absorbed into the effective
clump score q(c | f). In this form, it is particularly
simple to explicitly sum over all alignments A
to obtain p(E, C | F) by repeated application of the
distributive law. The resulting polynomial time
expressions are:

$$p(E, C \mid F) = \frac{1}{L!} \prod_{i=1}^{\ell(F)} e^{-\lambda_{f_i}} \prod_{j=1}^{\ell(C)} q(c_j \mid F) \tag{7}$$

$$q(c \mid F) = \sum_{f \in F} q(c \mid f) \tag{8}$$
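The effect of the distributive law can be checked numerically. The sketch below compares a brute-force sum over every alignment with the factored form, for one fixed clumping. All function names, the mean fertilities other than 0.62 for "LIST", and the stand-in clump model `toy_p_clump` are assumptions for illustration only.

```python
from itertools import product
from math import exp, factorial, isclose, prod

def _norm(formal, lam, L):
    """Shared prefactor: product of exp(-lambda_f) over F, divided by L!."""
    return prod(exp(-lam[f]) for f in formal) / factorial(L)

def brute_force(formal, clumps, lam, p_clump):
    """Sum the Poisson-model score over every alignment explicitly."""
    total = 0.0
    for align in product(range(len(formal)), repeat=len(clumps)):
        total += prod(lam[formal[a]] * p_clump(c, formal[a])
                      for c, a in zip(clumps, align))
    return _norm(formal, lam, len(clumps)) * total

def factored(formal, clumps, lam, p_clump):
    """Factored form: one sum over formal words per clump."""
    per_clump = (sum(lam[f] * p_clump(c, f) for f in formal) for c in clumps)
    return _norm(formal, lam, len(clumps)) * prod(per_clump)

def toy_p_clump(clump, f):
    # Arbitrary positive stand-in for p(c | f); not a trained model.
    return 1.0 / (1 + len(clump) + len(f))

formal = ["LIST", "FLIGHTS", "TO"]
lam = {"LIST": 0.62, "FLIGHTS": 1.0, "TO": 1.0}
clumps = [("i", "want"), ("to", "fly"), ("to",)]

slow = brute_force(formal, clumps, lam, toy_p_clump)
fast = factored(formal, clumps, lam, toy_p_clump)
```

The brute-force version visits $\ell(F)^{\ell(C)}$ alignments, while the factored version needs only $\ell(F) \cdot \ell(C)$ clump scores.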
The q(c | F) values for all possible clumps
can be calculated in $O(\ell(E)^2 \ell(F))$ time if the maximum
clump size is unbounded, and in $O(\ell(E)\ell(F))$
if bounded. The Viterbi decoding algorithm (Forney,
1973) is used to calculate p(E | L, F) from
these expressions. The Viterbi algorithm produces
a score which is the sum over all possible clumpings
for a fixed L. This score must then be normalized
by the $\exp(-\sum_{i=1}^{\ell(F)} \lambda_{f_i})/L!$ factor. The EM
count accumulation is done using an adaptation
of the Baum-Welch algorithm (Baum, 1972) which
searches through the space of all possible clumpings,
first considering 1 clump, then 2, and so forth.
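One way to organise the sum over clumpings for each fixed L is a sum-product recursion over clump boundaries. The sketch below is written under that assumption; whether the authors' procedure is structured exactly this way is not stated. The callback `q` stands for the effective score q(c | F), and the normalization factor is left to the caller.

```python
from typing import Callable, List, Sequence

def clumping_sums(words: Sequence[str],
                  q: Callable[[Sequence[str]], float],
                  max_clump: int = 5) -> List[float]:
    """Return s, where s[L] is the sum over all clumpings of `words`
    into exactly L clumps of the product of q(clump)."""
    n = len(words)
    # alpha[L][t]: total score of splitting words[:t] into L clumps.
    alpha = [[0.0] * (n + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for L in range(1, n + 1):
        for t in range(L, n + 1):
            for size in range(1, min(max_clump, t) + 1):
                alpha[L][t] += alpha[L - 1][t - size] * q(words[t - size:t])
    return [alpha[L][n] for L in range(n + 1)]

# With q == 1 the sums simply count the clumpings for each L.
counts = clumping_sums("show me early morning flights".split(),
                       lambda clump: 1.0)
```

With a constant score of 1 the result counts clumpings, which gives a quick sanity check: a five-word sentence has 1, 4, 6, 4, and 1 clumpings into 1 through 5 clumps.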
Initial values for p(e | f) are bootstrapped from
Model 1 (Epstein et al., 1996) with the initial mean
fertilities $\lambda_f$ set to 1. We also fixed the maximum
clump size at 5 words. Empirically, we found it beneficial
to hold the p(e | f) parameters fixed for 20
iterations to allow the other parameters to train to 
reasonable values. After training, the translation 
probabilities and clump lengths are smoothed using 
deleted interpolation (Bahl, Jelinek, and Mercer, 
1983). 
Since we have been unable to find a polynomial 
time algorithm to train the general fertility model, 
we use the Poisson model to "expose" the hidden 
alignments. The Poisson fertility model gives the 
most likely 1000 clumpings and alignments, which 
are then rescored according to the current general
fertility model parameters. This gives fractional
counts for each of the 1000 alignments, which are
then used to update the general fertility model
parameters. 
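The rescoring step can be pictured as turning an N-best list into posterior weights. The following is a minimal sketch, assuming the candidates come from some Poisson-model search and that `general_score` returns the general-model joint probability of a candidate; both are placeholders here, and the toy scores are invented.

```python
from typing import Callable, Hashable, List

def fractional_counts(candidates: List[Hashable],
                      general_score: Callable[[Hashable], float]) -> List[float]:
    """Rescore an N-best list and normalise the scores to sum to one."""
    scores = [general_score(c) for c in candidates]
    total = sum(scores)
    if total == 0.0:
        raise ValueError("all candidates have zero probability")
    return [s / total for s in scores]

# Hypothetical candidates and general-model scores.
toy_scores = {"cand_a": 0.02, "cand_b": 0.01, "cand_c": 0.01}
weights = fractional_counts(list(toy_scores), toy_scores.get)
```

Each weight would then be added to the count of every parameter event (fertility, clump length, word translation) that the corresponding candidate uses.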
4 Improved Clump Modeling 
In both the Poisson and general fertility models, the
computation of p(c | f) in equation 2 uses a unigram
model. Each English word $e_i$ is generated with probability
$p(e_i \mid f_c)$. Two more powerful techniques
for modeling clump generation are n-gram
language models (Miller et al., 1995; Levin and Pieraccini,
1995; Epstein, 1996), and headword language
models (Epstein, 1996). A bigram language model
uses:

$$p(c \mid f) = p(\ell(c) \mid f)\, p(e_1 \mid \mathrm{bdy}, f_c)\, p(\mathrm{bdy} \mid e_{\ell(c)}, f_c) \prod_{i=2}^{\ell(c)} p(e_i \mid e_{i-1}, f_c)$$

where bdy is a special marker to delimit the beginning
and end of the clump.
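The bigram clump score can be sketched as follows. The boundary token spelling, the dictionary layout, and all probability values are assumptions made for illustration.

```python
from math import prod

BDY = "<bdy>"  # boundary marker; the token spelling is an assumption

def bigram_clump_prob(clump, f, p_len, p_bigram):
    """p(c | f) under a bigram clump model.

    p_bigram[f][(prev, cur)] holds p(cur | prev, f).
    """
    first = p_bigram[f][(BDY, clump[0])]       # first word given boundary
    last = p_bigram[f][(clump[-1], BDY)]       # boundary given last word
    inner = prod(p_bigram[f][(a, b)] for a, b in zip(clump, clump[1:]))
    return p_len[f][len(clump)] * first * last * inner

# Invented parameters for a single formal word.
p_len = {"DESTINATION_LOC": {2: 0.5}}
p_bigram = {"DESTINATION_LOC": {(BDY, "to"): 0.8,
                                ("to", "Memphis"): 0.25,
                                ("Memphis", BDY): 0.9}}

score = bigram_clump_prob(["to", "Memphis"], "DESTINATION_LOC",
                          p_len, p_bigram)
```

For a one-word clump the inner product is empty and evaluates to 1, so only the length term and the two boundary terms remain.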
A headword language model uses two unigram
models, a headword model and a non-headword
model. Each clump is required to have a headword.
All other words are non-headwords. The identity of
a clump's headword is hidden, hence it is necessary
to sum over all possible headwords:

$$p(c \mid f) = p(\ell(c) \mid f) \sum_{i=1}^{\ell(c)} p_{head}(e_i \mid f_c) \prod_{j \neq i} p_{nonhead}(e_j \mid f_c)$$

Word            λ      p(n=0)  p(n=1)  p(n=2)  p(n>=3)
late            1.49   .00     .62     .28     .10
early           1.55   .00     .89     .03     .08
morning         1.40   .01     .85     .11     .03
afternoon       1.62   .00     .85     .12     .03
early_morning   2.50   .00     .16     .69     .15
Table 1: Trained Poisson and General Fertility

Word      Top p(e|f)  Score   Top p_head(e|f)  Score   Top p_nonhead(e|f)  Score
early     early       .37     early            .68     an                  .30
          an          .22     i                .23     flight              .29
          i           .09     day              .06     would               .10
morning   morning     .63     morning          .75     the                 .43
          in          .12     leaving          .06     in                  .37
          leaving     .05     flights          .05     of                  .08
List      the         .21     show             .49     me                  .45
          me          .19     what             .17     the                 .19
          show        .18     give             .07     all                 .12
          what        .17     you              .06     are                 .05
          please      .04     list             .05     please              .05
Table 2: Trained Translation Probabilities using Poisson Fertility

Model        DEC93   DEC93a
1            75.00   75.22
Clump        74.78   77.01
Clump-HW     75.89   78.35
Clump-BG     76.79   78.12
Poisson      78.12   81.25
Poisson-HW   78.12   81.25
Poisson-BG   78.12   81.25
General      79.91   82.59
General-HW   79.91   79.91
General-BG   73.21   83.04
Table 3: Class A CAS on Patterns for DEC93
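The sum over hidden headwords in the headword model of Section 4 can be sketched as below. This assumes the clump-length term of equation 2 carries over and that no extra prior is placed on the headword position; both are assumptions. The nonzero word probabilities follow the "morning" rows of Table 2, while the zero entries and the length probability are invented.

```python
from math import prod

def headword_clump_prob(clump, f, p_len, p_head, p_nonhead):
    """Sum over every word of the clump acting as the hidden headword."""
    total = 0.0
    for i, head in enumerate(clump):
        others = prod(p_nonhead[f][e] for j, e in enumerate(clump) if j != i)
        total += p_head[f][head] * others
    return p_len[f][len(clump)] * total

p_len = {"morning": {3: 0.2}}                                  # invented
p_head = {"morning": {"morning": 0.75, "in": 0.0, "the": 0.0}}
p_nonhead = {"morning": {"in": 0.37, "the": 0.43, "morning": 0.0}}

score = headword_clump_prob(["in", "the", "morning"], "morning",
                            p_len, p_head, p_nonhead)
```

With these values only the term that treats "morning" as the headword contributes, so the score is 0.2 × 0.75 × 0.37 × 0.43.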
5 Example Fertilities 
To illustrate how well fertility captures simple cases 
of embedding, trained fertilities are shown in table 1 
for several formal language words denoting time in- 
tervals. As expected, "early_morning" dominantly 
produces two clumps, but can produce either one or 
three clumps with reasonable probability. "morn- 
ing" and "afternoon" train to comparable fertilities 
and preferentially generate a single clump. Another 
interesting case is the formal language token "List" 
which trains to a $\lambda$ of 0.62 indicating that it frequently
generates no English text. As a further
check, the $\lambda$ values for "from", "to", and the two
special classed words "CITY-1" and "CITY-2" are
near 1, ranging between 0.96 and 1.17. 
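To see what a mean fertility implies under the Poisson form of equation 3, the sketch below evaluates it at the trained values quoted above. Only the two mean fertilities come from the text; the function name is illustrative.

```python
from math import exp, factorial

def poisson_fertility(n: int, lam: float) -> float:
    """Equation 3: probability that a formal word generates n clumps."""
    return exp(-lam) * lam ** n / factorial(n)

p_list_silent = poisson_fertility(0, 0.62)  # "List" generating no clump
p_em_two = poisson_fertility(2, 2.50)       # "early_morning" generating two
```

A mean of 0.62 puts roughly 0.54 of the probability mass on zero clumps, which is the sense in which "List" frequently generates no English text.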
Some trained translation probabilities are shown 
for the unigram and headword models in table 2. 
The formal language words have captured reason- 
able English words for their most likely transla- 
tion or headword translation. However, "early" 
and "morning" have fairly undesirable looking sec- 
ond and third choices. The reason for this is that 
these undesirable words are frequently adjacent to 
the English words "early" and "morning"; hence 
the training algorithm includes contributions with 
two word clumps containing these extraneous words. 
This is the price we pay for not using supervised 
training data. Intriguingly, the headword model is 
more strongly biased towards the likely translations 
and has a smoother tail than the unigram model. 
6 Results 
The translation models were trained with 5627 
context-independent ATIS sentences and smoothed 
with 600 sentences. In addition, 3567 training sen- 
tences were manually aligned and included in a sep- 
arate training experiment. This allows comparison 
between an unannotated corpus and a partially an- 
notated one. 
We employ a trivial decoder and language model 
since our emphasis is on evaluating the performance 
of different translation models. Our decoder is a sim- 
ple pattern matcher. That is, we accumulate the dif- 
ferent formal language patterns seen in the training 
set, and score each of them on the test set. The lan- 
guage model is just the unsmoothed unigram prob- 
ability distribution of the patterns. This LM has a 
10% chance of not including a test pattern and its 
use leads to pessimistic performance estimates. A 
more general language model for ATIS is presented 
in (Koppelman et al., 1995). Answers are generated
by an SQL program which is deterministically
constructed from the formal language of our system. 
The accuracy of these database answers is measured 
using ARPA's Common Answer Specification (CAS) 
metric. 
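The pattern-matching decoder and its unsmoothed unigram language model can be sketched in a few lines. The training patterns and the stand-in `toy_translation_prob` are hypothetical; a real system would plug in one of the translation models p(E | F) described above.

```python
from collections import Counter
from typing import Callable, Dict, List

def train_pattern_lm(train_patterns: List[str]) -> Dict[str, float]:
    """Unsmoothed relative frequencies of formal patterns seen in training."""
    counts = Counter(train_patterns)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

def decode(english: str, lm: Dict[str, float],
           translation_prob: Callable[[str, str], float]) -> str:
    """Score every training pattern against the request; keep the best."""
    return max(lm, key=lambda pat: lm[pat] * translation_prob(english, pat))

# Hypothetical training patterns and a stand-in translation model.
lm = train_pattern_lm(["LIST FLIGHTS", "LIST FLIGHTS",
                       "LIST FLIGHTS", "LIST FARES"])

def toy_translation_prob(english: str, pattern: str) -> float:
    keyword = pattern.split()[-1].lower()
    return 0.9 if keyword in english.lower().split() else 0.1

best = decode("show me the fares", lm, toy_translation_prob)
```

Because the candidate set is exactly the set of training patterns, a test request whose correct pattern never occurred in training cannot be decoded correctly, which is why this setup gives pessimistic estimates.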
The results are presented in table 3 for ARPA's 
December 1993 blind test set. The column headed 
DEC93 reports results on unsupervised training 
data, while the column entitled DEC93a contains the 
results from using models trained on the partially 
annotated corpus. The rows correspond to various 
translation models. Model 1 is the word-pair trans- 
lation model used in simple machine translation and 
understanding models (Brown et al., 1993; Epstein 
et al., 1996). The models labeled "Clump" use a 
basic clumped model without fertility. The mod- 
els labeled "Poisson" and "General" use the Poisson 
and general fertility models presented in this paper. 
The "HW" and "BG" suffixes indicate the results 
when p(e | f) is computed with a headword or bigram
model. 
The partially annotated corpus provides an increase
in performance of about 2-3% for most models.
For General-BG, results increased by 8-10%.
The Poisson and general fertility models show a 2- 
5% gain in performance over the basic clump model 
when using the partially annotated corpus. This is 
a reduction of the error rate by 10-20%. The unan- 
notated corpus also shows a comparable gain. 
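The step from an absolute accuracy gain to a relative error-rate reduction is simple arithmetic. The sketch below works it through for one pair of figures, taking the Clump and Poisson rows of the DEC93a column as read from Table 3; the function name is illustrative.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original error removed; accuracies are percentages."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before

# Clump (77.01) versus Poisson (81.25) on the partially annotated corpus.
reduction = relative_error_reduction(77.01, 81.25)
```

Here a gain of 4.24 points removes about 18% of the original 22.99% error, consistent with the 10-20% range quoted above.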
Acknowledgement: This work was sponsored 
in part by ARPA and monitored by Fort Huachuca 
HJ1500-4309-0513. The views and conclusions con- 
tained in this document should not be interpreted 
as representing the official policies of the U.S. Gov- 
ernment. 

References 
Bahl, Lalit R., Frederick Jelinek, and Robert L. 
Mercer. 1983. A maximum likelihood approach 
to continuous speech recognition. IEEE Trans- 
actions on Pattern Analysis and Machine Intelli- 
gence, PAMI-5(2):179-190, March. 
Baum, L.E. 1972. An inequality and associated 
maximization technique in statistical estimation 
of probabilistic functions of a Markov process. In- 
equalities, 3:1-8. 
Brown, Peter F., Stephen A. DellaPietra, Vincent J. 
DellaPietra, and Robert L. Mercer. 1993. The 
mathematics of statistical machine translation: 
Parameter estimation. Computational Linguistics,
19(2):263-311, June.
Dempster, A.P., N.M. Laird, and D.B. Rubin. 1977. 
Maximum likelihood from incomplete data via the 
EM algorithm. Journal of the Royal Statistical 
Society, 39(B):1-38. 
Epstein, M. 1996. Statistical Source Channel Mod- 
els for Natural Language Understanding. Ph.D. 
thesis, New York University, September. 
Epstein, M., K. Papineni, S. Roukos, T. Ward, and 
S. Della Pietra. 1996. Statistical natural lan- 
guage understanding using hidden clumpings. In 
Proceedings of the IEEE International Conference 
on Acoustics, Speech and Signal Processing, pages 
176-179, Atlanta, Georgia, May. 
Forney, G. David. 1973. The Viterbi algorithm.
Proceedings of the IEEE, 61:268-278, March.
Gorin, A., S. Levinson, A. Gertner, and E. Goldman. 
1991. Adaptive acquisition of language. Com- 
puter Speech and Language, 5:101-132. 
Hemphill, C., J. Godfrey, and G. Doddington. 1990. 
The ATIS spoken language systems pilot corpus. 
In Proceedings of the DARPA Speech and Natural 
Language Workshop, pages 96-101, Hidden Valley, 
PA, June. Morgan Kaufmann Publishers, Inc. 
Koppelman, J., S. Della Pietra, M. Epstein, and 
S. Roukos. 1995. A statistical approach to lan- 
guage modeling for the ATIS task. In Proceedings 
of the Spoken Language Systems Workshop, pages 
1785-1788, Madrid, Spain, September. 
Kuhn, R. and R. De Mori. 1995. The application of 
semantic classification trees to natural language 
understanding. IEEE Transactions on Pattern 
Analysis and Machine Intelligence, 17(5):449-460, 
May. 
Levin, E. and R. Pieraccini. 1995. Chronus, the next 
generation. In Proceedings of the Spoken Lan- 
guage Systems Workshop, pages 269-271, Austin, 
Texas, January. 
Miller, S., M. Bates, R. Bobrow, R. Ingria, 
J. Makhoul, and R. Schwartz. 1995. Recent 
progress in hidden understanding models. In Pro- 
ceedings of the Spoken Language Systems Work- 
shop, pages 276-279, Austin, Texas, January. 
Miller, S., D. Stallard, R. Bobrow, and R. Schwartz. 
1996. A fully statistical approach to natural lan- 
guage interfaces. In Proceedings of the 34th An- 
nual Meeting of the Association for Computa- 
tional Linguistics, pages 55-61, Santa Cruz, CA, 
June. Morgan Kaufmann Publishers, Inc. 
