Lexicalized Stochastic Modeling of Constraint-Based Grammars
using Log-Linear Measures and EM Training
Stefan Riezler
IMS, Universität Stuttgart
riezler@ims.uni-stuttgart.de
Detlef Prescher
IMS, Universität Stuttgart
prescher@ims.uni-stuttgart.de
Jonas Kuhn
IMS, Universität Stuttgart
jonas@ims.uni-stuttgart.de
Mark Johnson
Cog. & Ling. Sciences, Brown University
Mark_Johnson@brown.edu
Abstract
We present a new approach to
stochastic modeling of constraint-
based grammars that is based on log-
linear models and uses EM for esti-
mation from unannotated data. The
techniques are applied to an LFG
grammar for German. Evaluation on
an exact match task yields 86% pre-
cision for an ambiguity rate of 5.4,
and 90% precision on a subcat frame
match for an ambiguity rate of 25.
Experimental comparison to train-
ing from a parsebank shows a 10%
gain from EM training. Also, a new
class-based grammar lexicalization is
presented, showing a 10% gain over
unlexicalized models.
1 Introduction
Stochastic parsing models capturing contex-
tual constraints beyond the dependencies of
probabilistic context-free grammars (PCFGs)
are currently the subject of intensive research.
An interesting feature common to most such
models is the incorporation of contextual de-
pendencies on individual head words into rule-
based probability models. Such word-based
lexicalizations of probability models are used
successfully in the statistical parsing mod-
els of, e.g., Collins (1997), Charniak (1997),
or Ratnaparkhi (1997). However, it is still
an open question which kind of lexicaliza-
tion, e.g., statistics on individual words or
statistics based upon word classes, is the best
choice. Secondly, these approaches have in
common the fact that the probability models
are trained on treebanks, i.e., corpora of man-
ually disambiguated sentences, and not from
corpora of unannotated sentences. In all of the
cited approaches, the Penn Wall Street Journal
Treebank (Marcus et al., 1993) is used,
the availability of which obviates the standard
effort required for treebank training, namely
hand-annotating large corpora of specific domains
of specific languages with specific parse types.
Moreover, common wisdom is that training
from unannotated data via the expectation-
maximization (EM) algorithm (Dempster et
al., 1977) yields poor results unless at
least partial annotation is applied. Experimental
results confirming this wisdom have
been presented, e.g., by Elworthy (1994) and
Pereira and Schabes (1992) for EM training
of Hidden Markov Models and PCFGs.
In this paper, we present a new lexicalized
stochastic model for constraint-based gram-
mars that employs a combination of head-
word frequencies and EM-based clustering
for grammar lexicalization. Furthermore, we
make crucial use of EM for estimating the
parameters of the stochastic grammar from
unannotated data. Our usage of EM was initiated
by the current lack of large unification-based
treebanks for German. However, our experimental
results also show an exception to
the common wisdom of the insufficiency of EM
for highly accurate statistical modeling.
Our approach to lexicalized stochastic mod-
eling is based on the parametric family of log-
linear probability models, which is used to define
a probability distribution on the parses
of a Lexical-Functional Grammar (LFG) for
German. In previous work on log-linear models
for LFG by Johnson et al. (1999), pseudo-
likelihood estimation from annotated corpora
has been introduced and experimented with
on a small scale. However, to our knowledge,
to date no large LFG annotated corpora of
unrestricted German text are available. For-
tunately, algorithms exist for statistical infer-
ence of log-linear models from unannotated
data (Riezler, 1999). We apply this algorithm
to estimate log-linear LFG models from large
corpora of newspaper text. In our largest ex-
periment, we used 250,000 parses which were
produced by parsing 36,000 newspaper sen-
tences with the German LFG. Experimental
evaluation of our models on an exact-match
task (i.e. percentage of exact match of most
probable parse with correct parse) on 550
manually examined examples with on average
5.4 analyses gave 86% precision. Another eval-
uation on a verb frame recognition task (i.e.
percentage of agreement between subcatego-
rization frames of main verb of most proba-
ble parse and correct parse) gave 90% pre-
cision on 375 manually disambiguated exam-
ples with an average ambiguity of 25. Clearly,
a direct comparison of these results to state-
of-the-art statistical parsers cannot be made
because of different training and test data and
other evaluation measures. However, we would
like to draw the following conclusions from our
experiments:
• The problem of chaotic convergence behaviour
  of EM estimation can be solved
  for log-linear models.
• EM does help constraint-based grammars,
  e.g. using about 10 times more sentences
  and about 100 times more parses
  for EM training than for training from an
  automatically constructed parsebank can
  improve precision by about 10%.
• Class-based lexicalization can yield a gain
  in precision of about 10%.
In the rest of this paper we intro-
duce incomplete-data estimation for log-linear
models (Sec. 2), and present the actual design
of our models (Sec. 3) and report our experi-
mental results (Sec. 4).
2 Incomplete-Data Estimation for
Log-Linear Models
2.1 Log-Linear Models
A log-linear distribution $p_\lambda(x)$ on the set of
analyses $\mathcal{X}$ of a constraint-based grammar can
be defined as follows:

$$p_\lambda(x) = Z_\lambda^{-1} \, e^{\lambda \cdot \nu(x)} \, p_0(x)$$

where $Z_\lambda = \sum_{x \in \mathcal{X}} e^{\lambda \cdot \nu(x)} p_0(x)$ is a normalizing
constant, $\lambda = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n$ is a
vector of log-parameters, $\nu = (\nu_1, \ldots, \nu_n)$ is
a vector of property-functions $\nu_i : \mathcal{X} \to \mathbb{R}$ for
$i = 1, \ldots, n$, $\lambda \cdot \nu(x)$ is the vector dot product
$\sum_{i=1}^{n} \lambda_i \nu_i(x)$, and $p_0$ is a fixed reference
distribution.
The task of probabilistic modeling with log-linear
distributions is to build salient properties
of the data as property-functions $\nu_i$ into
the probability model. For a given vector $\nu$ of
property-functions, the task of statistical
inference is to tune the parameters $\lambda$ to best
reflect the empirical distribution of the training
data.
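As an illustration of the definition above, the following minimal sketch evaluates a log-linear distribution over a small finite set of analyses. The function name `log_linear` and the toy property vectors are assumptions made for this example and do not come from the experiments reported here.

```python
import math

def log_linear(lam, nu, p0):
    """Return the log-linear distribution over a finite set of analyses.

    lam: list of n log-parameters
    nu:  list of property vectors, nu[x][i] is the value of property i on analysis x
    p0:  list of reference probabilities, one per analysis
    """
    # Unnormalized weights e^{lambda . nu(x)} * p0(x)
    weights = [
        math.exp(sum(l * v for l, v in zip(lam, nu_x))) * p0_x
        for nu_x, p0_x in zip(nu, p0)
    ]
    z = sum(weights)  # normalizing constant Z_lambda
    return [w / z for w in weights]

# Hypothetical toy data: three analyses, two properties, uniform reference model.
nu = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p0 = [1 / 3, 1 / 3, 1 / 3]
p_zero = log_linear([0.0, 0.0], nu, p0)  # all parameters zero: reproduces p0
p_tilt = log_linear([1.0, 0.0], nu, p0)  # favours analyses carrying property 1
```

With all parameters set to zero the model coincides with the reference distribution; raising one parameter shifts probability mass towards the analyses on which the corresponding property is active.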
2.2 Incomplete-Data Estimation
Standard numerical methods for statistical
inference of log-linear models from
fully annotated data (so-called complete
data) are the iterative scaling methods
of Darroch and Ratcliff (1972) and
Della Pietra et al. (1997). For data consisting
of unannotated sentences (so-called incomplete
data), the iterative method of the EM
algorithm (Dempster et al., 1977) has to be
employed. However, since even complete-data
estimation for log-linear models requires
iterative methods, an application of EM to
log-linear models results in an algorithm
which is expensive since it is doubly-iterative.
A singly-iterative algorithm interleaving EM
and iterative scaling into a mathematically
well-defined estimation method for log-linear
models from incomplete data is the IM
algorithm of Riezler (1999). Applying this
algorithm to stochastic constraint-based
grammars, we assume the following to be
given: A training sample of unannotated sentences
$y$ from a set $\mathcal{Y}$, observed with empirical
probability $\tilde{p}(y)$, a constraint-based grammar
yielding a set $X(y)$ of parses for each sentence
$y$, and a log-linear model $p_\lambda(\cdot)$ on the parses
$\mathcal{X} = \sum_{y \in \mathcal{Y} \mid \tilde{p}(y) > 0} X(y)$ for the sentences in
the training corpus, with known values of
property-functions $\nu$ and unknown values
of $\lambda$. The aim of incomplete-data maximum
likelihood estimation (MLE) is to find a value
$\lambda^*$ that maximizes the incomplete-data
log-likelihood
$L = \sum_{y \in \mathcal{Y}} \tilde{p}(y) \ln \sum_{x \in X(y)} p_\lambda(x)$,
i.e.,

$$\lambda^* = \arg\max_{\lambda \in \mathbb{R}^n} L(\lambda).$$

Closed-form parameter-updates for this problem
can be computed by the algorithm of Fig.
1, where $\nu_\#(x) = \sum_{i=1}^{n} \nu_i(x)$, and
$k_\lambda(x|y) = p_\lambda(x) / \sum_{x \in X(y)} p_\lambda(x)$
is the conditional probability
of a parse $x$ given the sentence $y$ and
the current parameter value $\lambda$.

Input: Reference model $p_0$, property-functions vector $\nu$ with constant $\nu_\#$, parses
  $X(y)$ for each $y$ in incomplete-data sample from $\mathcal{Y}$.
Output: MLE model $p_{\lambda^*}$ on $\mathcal{X}$.
Procedure:
  Until convergence do
    Compute $p_\lambda$, $k_\lambda$, based on $\lambda = (\lambda_1, \ldots, \lambda_n)$,
    For $i$ from 1 to $n$ do
      $\gamma_i := \frac{1}{\nu_\#} \ln \frac{\sum_{y \in \mathcal{Y}} \tilde{p}(y) \sum_{x \in X(y)} k_\lambda(x|y) \nu_i(x)}{\sum_{x \in \mathcal{X}} p_\lambda(x) \nu_i(x)}$,
      $\lambda_i := \lambda_i + \gamma_i$,
  Return $\lambda^* = (\lambda_1, \ldots, \lambda_n)$.
Figure 1: Closed-form version of IM algorithm
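The closed-form update of Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the reference model is taken to be uniform (so it cancels), the property values are non-negative and sum to the same constant on every parse, the loop runs for a fixed number of iterations instead of testing convergence, and the function name `im_estimate` and the toy corpus are hypothetical.

```python
import math

def im_estimate(corpus, n, iterations=50):
    """Sketch of the closed-form IM update.

    corpus: list of (p_tilde, parses) pairs, one per sentence y, where
            parses is the list of property vectors for the parses in X(y).
    Assumes a uniform reference model and a constant property sum.
    """
    nu_sharp = sum(corpus[0][1][0])      # constant sum of property values
    lam = [0.0] * n                      # all parameters zero: start at p0
    for _ in range(iterations):          # fixed count instead of a convergence test
        # Unnormalized weights e^{lambda . nu(x)} for every parse of every sentence.
        weights = [[math.exp(sum(l * v for l, v in zip(lam, x))) for x in parses]
                   for _, parses in corpus]
        z = sum(sum(ws) for ws in weights)
        gamma = []
        for i in range(n):
            # Numerator: sum_y p~(y) sum_{x in X(y)} k(x|y) nu_i(x)
            num = sum(p_y * sum(w * x[i] for w, x in zip(ws, parses)) / sum(ws)
                      for (p_y, parses), ws in zip(corpus, weights))
            # Denominator: sum_{x in X} p(x) nu_i(x)
            den = sum(w * x[i]
                      for (_, parses), ws in zip(corpus, weights)
                      for w, x in zip(ws, parses)) / z
            gamma.append(math.log(num / den) / nu_sharp)
        lam = [l + g for l, g in zip(lam, gamma)]
    return lam

# Hypothetical toy corpus: two properties, each parse carries exactly one of them.
corpus = [
    (0.50, [[1, 0]]),            # unambiguous sentence
    (0.25, [[0, 1]]),            # unambiguous sentence
    (0.25, [[1, 0], [0, 1]]),    # ambiguous sentence with two parses
]
lam = im_estimate(corpus, n=2)
```

For this toy corpus the incomplete-data log-likelihood has its maximum where the first parameter exceeds the second by ln 2, and the iteration approaches that value within a few steps.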
The constancy requirement on $\nu_\#$ can be
enforced by adding a "correction" property-function
$\nu_l$:
Choose $K = \max_{x \in \mathcal{X}} \nu_\#(x)$ and
$\nu_l(x) = K - \nu_\#(x)$ for all $x \in \mathcal{X}$.
Then $\sum_{i=1}^{l} \nu_i(x) = K$ for all $x \in \mathcal{X}$.
Note that because of the restriction of X to
the parses obtainable by a grammar from the
training corpus, we have a log-linear probability
measure only on those parses and not on
all possible parses of the grammar. We shall
therefore speak of mere log-linear measures in
our application of disambiguation.
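The correction property-function described in this section amounts to padding every property vector up to a common sum. A minimal sketch, in which the helper name `add_correction` and the example vectors are assumptions made for illustration:

```python
def add_correction(property_vectors):
    """Append a correction value so that every vector sums to the same constant K."""
    k = max(sum(v) for v in property_vectors)   # K = largest property sum over all parses
    return [v + [k - sum(v)] for v in property_vectors], k

# Hypothetical property vectors for three parses.
vectors = [[1, 0, 2], [0, 1, 0], [1, 1, 1]]
extended, k = add_correction(vectors)
```

After the call every extended vector sums to K, which is the constancy condition the closed-form update relies on.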
2.3 Searching for Order in Chaos
For incomplete-data estimation, a sequence
of likelihood values is guaranteed to converge
to a critical point of the likelihood function
L. This is shown for the IM algorithm in
Riezler (1999). The process of finding likelihood
maxima is chaotic in that the final likelihood
value is extremely sensitive to the starting
values of $\lambda$, i.e. limit points can be local
maxima (or saddlepoints), which are not
necessarily also global maxima. A way to
search for order in this chaos is to search for
starting values which are hopefully attracted
by the global maximum of L. This problem
can best be explained in terms of the mini-
mum divergence paradigm (Kullback, 1959),
which is equivalent to the maximum likeli-
hood paradigm by the following theorem. Let
$p[f] = \sum_{x \in \mathcal{X}} p(x) f(x)$ be the expectation of
a function $f$ with respect to a distribution $p$:

    The probability distribution $p^*$ that
    minimizes the divergence $D(p \| p_0)$ to
    a reference model $p_0$ subject to the
    constraints $p[\nu_i] = q[\nu_i]$, $i = 1, \ldots, n$,
    is the model in the parametric family
    of log-linear distributions $p_\lambda$ that
    maximizes the likelihood
    $L(\lambda) = q[\ln p_\lambda]$ of the training data.¹

Reasonable starting values for minimum divergence
estimation are obtained by setting $\lambda_i = 0$ for
$i = 1, \ldots, n$. This yields a distribution which
minimizes the divergence to $p_0$, over the
set of models $p$ to which the constraints
$p[\nu_i] = q[\nu_i]$, $i = 1, \ldots, n$ have yet to be applied.
Clearly, this argument applies to both
complete-data and incomplete-data estimation.
Note that for a uniformly distributed
reference model $p_0$, the minimum divergence
model is a maximum entropy model (Jaynes,
1957). In Sec. 4, we will demonstrate that
a uniform initialization of the IM algorithm
shows a significant improvement in likelihood
maximization as well as in linguistic performance
when compared to standard random
initialization.

¹ If the training sample consists of complete data
$x \in \mathcal{X}$, the expectation $q[\cdot]$ corresponds to the
empirical expectation $\tilde{p}[\cdot]$. If we observe incomplete data
$y \in \mathcal{Y}$, the expectation $q[\cdot]$ is replaced by the
conditional expectation $\tilde{p}[k_{\lambda'}[\cdot]]$ given the observed data $y$
and the current parameter value $\lambda'$.

3 Property Design and
Lexicalization
3.1 Basic Configurational Properties
The basic 190 properties employed in our
models are similar to the properties of
Johnson et al. (1999) which incorporate general
linguistic principles into a log-linear
model. They refer to both the c(onstituent)-structure
and the f(eature)-structure of the
LFG parses. Examples are properties for
• c-structure nodes, corresponding to standard
  production properties,
• c-structure subtrees, indicating argument
  versus adjunct attachment,
• f-structure attributes, corresponding to
  grammatical functions used in LFG,
• atomic attribute-value pairs in f-structures,
• complexity of the phrase being attached
  to, thus indicating both high and low
  attachment,
• non-right-branching behavior of nonterminal
  nodes,
• non-parallelism of coordinations.
3.2 Class-Based Lexicalization
Our approach to grammar lexicalization is
class-based in the sense that we use class-based
estimated frequencies $f_c(v, n)$ of head-verbs
$v$ and argument head-nouns $n$ instead
of pure frequency statistics or class-based
probabilities of head word dependencies.
Class-based estimated frequencies are introduced
in Prescher et al. (2000) as the frequency
$f(v, n)$ of a $(v, n)$-pair in the training
corpus, weighted by the best estimate of
the class-membership probability $p(c|v, n)$ of
an EM-based clustering model on $(v, n)$-pairs,
i.e.,

$$f_c(v, n) = \max_{c \in C} p(c|v, n) \, (f(v, n) + 1).$$

As is shown in Prescher et al. (2000) in an
evaluation on lexical ambiguity resolution, a
gain of about 7% can be obtained by using
the class-based estimated frequency $f_c(v, n)$
as disambiguation criterion instead of class-based
probabilities $p(n|v)$. In order to make
the most direct use possible of this fact, we
incorporated the decisions of the disambiguator
directly into 45 additional properties for
the grammatical relations of the subject, direct
object, indirect object, infinitival object,
oblique and adjunctival dative and accusative
preposition, for active and passive forms of the
first three verbs in each parse. Let $v_r(x)$ be the
verbal head of grammatical relation $r$ in parse
$x$, and $n_r(x)$ the nominal head of grammatical
relation $r$ in $x$. Then a lexicalized property $\nu_r$
for grammatical relation $r$ is defined as

$$\nu_r(x) = \begin{cases} 1 & \text{if } f_c(v_r(x), n_r(x)) \geq f_c(v_r(x'), n_r(x')) \ \forall x' \in X(y), \\ 0 & \text{otherwise.} \end{cases}$$

The property-function $\nu_r$ thus pre-disambiguates
the parses $x \in X(y)$ of a
sentence $y$ according to $f_c(v, n)$, and stores
the best parse directly instead of taking the
actual estimated frequencies as its value. In
Sec. 4, we will see that an incorporation of
this pre-disambiguation routine into the models
improves performance in disambiguation
by about 10%.
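The two definitions of this section can be made concrete with a small sketch. The function names, the raw counts and the class posteriors below are hypothetical values chosen for illustration; in the models described here the posteriors come from the EM-based clustering model of Prescher et al. (2000).

```python
def class_based_frequency(freq, class_posteriors):
    """Class-based estimated frequency: best class posterior times (raw frequency + 1)."""
    return max(class_posteriors) * (freq + 1)

def lexicalized_property(fc_values):
    """Binary property values for the parses of one sentence.

    fc_values: the class-based estimated frequency of the (verb, noun) pair
               filling the grammatical relation in each parse.
    A parse gets 1 if no other parse of the sentence has a higher value.
    """
    best = max(fc_values)
    return [1 if fc >= best else 0 for fc in fc_values]

# Hypothetical (raw frequency, class posteriors) for one relation in three parses.
candidates = [(3, [0.7, 0.3]), (0, [0.5, 0.5]), (5, [0.6, 0.4])]
fc = [class_based_frequency(f, post) for f, post in candidates]
values = lexicalized_property(fc)
```

Only the winning parse (or all parses tied for the maximum) receives the value 1, so the property records the decision of the pre-disambiguation step and not the frequency itself.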
exact match evaluation      basic model         lexicalized model   selected + lexicalized model
complete-data estimation    P: 68    E: 59.6    P: 73.9  E: 71.6    P: 74.3  E: 71.8
incomplete-data estimation  P: 73    E: 65.4    P: 86    E: 85.2    P: 86.1  E: 85.4
Figure 2: Evaluation on exact match task for 550 examples with average ambiguity 5.4
frame match evaluation      basic model         lexicalized model   selected + lexicalized model
complete-data estimation    P: 80.6  E: 70.4    P: 82.7  E: 76.4    P: 83.4  E: 76
incomplete-data estimation  P: 84.5  E: 73.1    P: 88.5  E: 84.9    P: 90    E: 86.3
Figure 3: Evaluation on frame match task for 375 examples with average ambiguity 25
4 Experiments
4.1 Incomplete Data and Parsebanks
In our experiments, we used an LFG grammar
for German² for parsing unrestricted text.
Since training was faster than parsing, we
parsed in advance and stored the resulting
packed c/f-structures. The low ambiguity rate
of the German LFG grammar allowed us to
restrict the training data to sentences with
at most 20 parses. The resulting training corpus
of unannotated, incomplete data consists
of approximately 36,000 sentences of online
available German newspaper text, comprising
approximately 250,000 parses.
In order to compare the contribution of unambiguous
and ambiguous sentences to the estimation
results, we extracted a subcorpus of
4,000 sentences, for which the LFG grammar
produced a unique parse, from the full training
corpus. The average sentence length of
7.5 for this automatically constructed parsebank
is only slightly smaller than that of
10.5 for the full set of 36,000 training sentences
and 250,000 parses. Thus, we conjecture
that the parsebank includes a representative
variety of linguistic phenomena. Estimation
from this automatically disambiguated
parsebank enjoys the same complete-data
estimation properties³ as training from manually
disambiguated treebanks. This makes a
comparison of complete-data estimation from
this parsebank to incomplete-data estimation
from the full set of training data interesting.

² The German LFG grammar is being implemented
in the Xerox Linguistic Environment (XLE,
see Maxwell and Kaplan (1996)) as part of the Parallel
Grammar (ParGram) project at the IMS Stuttgart.
The coverage of the grammar is about 50% for unrestricted
newspaper text. For the experiments reported
here, the effective coverage was lower, since the corpus
preprocessing we applied was minimal. Note that
for the disambiguation task we were interested in,
the overall grammar coverage was of subordinate
relevance.

³ For example, convergence to the global maximum
of the complete-data log-likelihood function is guaranteed,
which is a good condition for highly precise
statistical disambiguation.

4.2 Test Data and Evaluation Tasks
To evaluate our models, we constructed
two different test corpora. We first parsed
with the LFG grammar 550 sentences
which are used for illustrative purposes in
the foreign language learner's grammar of
Helbig and Buscha (1996). In a next step, the
correct parse was indicated by a human disambiguator,
according to the reading intended
in Helbig and Buscha (1996). Thus a precise
indication of correct c/f-structure pairs was
possible. However, the average ambiguity of
this corpus is only 5.4 parses per sentence, for
sentences with on average 7.5 words. In order
to evaluate on sentences with higher ambiguity
rate, we manually disambiguated a further
375 sentences of LFG-parsed newspaper text.
The sentences of this corpus have on average
25 parses and 11.2 words.
We tested our models on two evalua-
tion tasks. The statistical disambiguator was
tested on an exact match task, where ex-
act correspondence of the full c/f-structure
pair of the hand-annotated correct parse and
the most probable parse is checked. Another
evaluation was done on a frame match task,
where exact correspondence only of the sub-
categorization frame of the main verb of the
most probable parse and the correct parse is
checked. Clearly, the latter task involves a
smaller effective ambiguity rate, and is thus
to be interpreted as an evaluation of the com-
bined system of highly-constrained symbolic
parsing and statistical disambiguation.
Performance on these two evaluation tasks
was assessed according to the following evaluation
measures:

$$\text{Precision} = \frac{\#\text{correct}}{\#\text{correct} + \#\text{incorrect}},$$

$$\text{Effectiveness} = \frac{\#\text{correct}}{\#\text{correct} + \#\text{incorrect} + \#\text{don't know}}.$$

"Correct" and "incorrect" specify a success/failure
on the respective evaluation tasks;
"don't know" cases are cases where the system
is unable to make a decision, i.e. cases with
more than one most probable parse.
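The two measures can be computed as in the following sketch. The function name `evaluate` and the example probabilities are assumptions made for illustration; each test item is represented by the model probabilities of its parses and the index of the correct parse.

```python
def evaluate(examples):
    """examples: list of (probabilities, gold_index) pairs, one per test sentence."""
    correct = incorrect = dont_know = 0
    for probs, gold in examples:
        best = max(probs)
        winners = [i for i, p in enumerate(probs) if p == best]
        if len(winners) > 1:
            dont_know += 1          # tie: no unique most probable parse
        elif winners[0] == gold:
            correct += 1
        else:
            incorrect += 1
    precision = correct / (correct + incorrect)
    effectiveness = correct / (correct + incorrect + dont_know)
    return precision, effectiveness

# Hypothetical test items.
examples = [
    ([0.7, 0.2, 0.1], 0),   # correct
    ([0.5, 0.3, 0.2], 1),   # incorrect
    ([0.4, 0.4, 0.2], 0),   # tie, counted as "don't know"
    ([0.1, 0.9], 1),        # correct
]
p, e = evaluate(examples)
```

Here two of the three decided items are correct, so precision is 2/3, while effectiveness also charges the undecided item and drops to 2/4.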
4.3 Experimental Results
For each task and each test corpus, we calculated
a random baseline by averaging over
several models with randomly chosen parameter
values. This baseline measures the
disambiguation power of the pure symbolic
parser. The results of an exact-match evaluation
on the Helbig-Buscha corpus are shown
in Fig. 2. The random baseline was around
33% for this case. The columns list different
models according to their property-vectors.
"Basic" models consist of 190 configurational
properties as described in Sec. 3.1. "Lexicalized"
models are extended by 45 lexical
pre-disambiguation properties as described in Sec.
3.2. "Selected + lexicalized" models result
from a simple property selection procedure
where a cutoff on the number of parses with
non-negative value of the property-functions
was set. Estimation of basic models from complete
data gave 68% precision (P), whereas
training lexicalized and selected models from
incomplete data gave 86.1% precision, which
is an improvement of 18%. Comparing lexicalized
models in the estimation method
shows that incomplete-data estimation gives
an improvement of 12% precision over training
from the parsebank. A comparison of models
trained from incomplete data shows that
lexicalization yields a gain of 13% in precision.
Note also the gain in effectiveness (E)
due to the pre-disambiguation routine included
in the lexicalized properties. The gain due to
property selection both in precision and effectiveness
is minimal. A similar pattern of performance
arises in an exact match evaluation
on the newspaper corpus with an ambiguity
rate of 25. The lexicalized and selected model
trained from incomplete data achieved here
60.1% precision and 57.9% effectiveness, for a
random baseline of around 17%.
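A random baseline of the kind described at the start of this section can be sketched as follows. The sampling range for the parameters, the number of sampled models, the fixed seed and the toy data are all assumptions of this sketch, since the text does not specify them; a uniform reference model is assumed as well, so that parses of a sentence can be ranked by their weighted property sums alone.

```python
import random

def random_baseline(examples, n, models=100, seed=0):
    """Average precision over models with randomly drawn parameters.

    examples: list of (parses, gold_index), parses being property vectors.
    """
    rng = random.Random(seed)
    precisions = []
    for _ in range(models):
        lam = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # assumed sampling range
        correct = incorrect = 0
        for parses, gold in examples:
            # Within one sentence the normalizer is shared, so the score
            # lambda . nu(x) is enough to rank the parses.
            scores = [sum(l * v for l, v in zip(lam, x)) for x in parses]
            best = max(scores)
            winners = [i for i, s in enumerate(scores) if s == best]
            if len(winners) > 1:
                continue                 # "don't know" cases do not enter precision
            if winners[0] == gold:
                correct += 1
            else:
                incorrect += 1
        if correct + incorrect:
            precisions.append(correct / (correct + incorrect))
    return sum(precisions) / len(precisions)

# Hypothetical test items with two properties.
examples = [
    ([[1, 0], [0, 1]], 0),
    ([[1, 1], [0, 1], [1, 0]], 2),
    ([[2, 0], [0, 1]], 1),
]
baseline = random_baseline(examples, n=2)
```

Because the parameters carry no information about the data, the resulting figure reflects only how far the grammar itself has already narrowed down the candidate parses.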
As shown in Fig. 3, the improvement in per-
formance due to both lexicalization and EM
training is smaller for the easier task of frame
evaluation. Here the random baseline is 70%
for frame evaluation on the newspaper corpus
with an ambiguity rate of 25. An overall gain
of roughly 10% can be achieved by going from
unlexicalized parsebank models (80.6% preci-
sion) to lexicalized EM-trained models (90%
precision). Again, the contribution to this im-
provement is about the same for lexicalization
and incomplete-data training. Applying the
same evaluation to the Helbig-Buscha corpus
shows 97.6% precision and 96.7% effectiveness
for the lexicalized and selected incomplete-
data model, compared to around 80% for the
random baseline.
Optimal iteration numbers were decided by
repeated evaluation of the models at every
fifth iteration. Fig. 4 shows the precision of
lexicalized and selected models on the exact
match task plotted against the number of iterations
of the training algorithm. For parsebank
training, the maximal precision value
is obtained at 35 iterations. Iterating further
shows a clear overtraining effect. For
incomplete-data estimation more iterations
are necessary to reach a maximal precision
value. A comparison of models with random
or uniform starting values shows an increase
in precision of 10% to 40% for the latter.
In terms of maximization of likelihood, this
corresponds to the fact that uniform starting
values immediately push the likelihood up to
nearly its final value, whereas random starting
values yield an initial likelihood which has to
be increased by factors of 2 to 20 to an often
lower final value.

[Figure 4: Precision on exact match task in number of training iterations. Plot of precision (y-axis, 68 to 88) against number of iterations (x-axis, 10 to 90), with one curve for complete-data estimation and one for incomplete-data estimation.]
5 Discussion
The most direct points of comparison
of our method are the approaches
of Johnson et al. (1999) and
Johnson and Riezler (2000). In the first approach,
log-linear models on LFG grammars
using about 200 configurational properties
were trained on treebanks of about 400
sentences by maximum pseudo-likelihood
estimation. Precision was evaluated on an
exact match task in a 10-way cross validation
paradigm for an ambiguity rate of 10,
and achieved 59% for the first approach.
Johnson and Riezler (2000) achieved a gain
of 1% over this result by including a class-based
lexicalization. Our best models clearly
outperform these results, both in terms of
precision relative to ambiguity and in terms
of relative gain due to lexicalization. A
comparison of performance is more difficult
for the lexicalized PCFG of Beil et al. (1999)
which was trained by EM on 450,000 sentences
of German newspaper text. There, a
70.4% precision is reported on a verb frame
recognition task on 584 examples. However,
the gain achieved by Beil et al. (1999) due to
grammar lexicalization is only 2%, compared
to about 10% in our case. A comparison
is difficult also for most other state-of-the-art
PCFG-based statistical parsers, since
different training and test data, and most
importantly, different evaluation criteria were
used. A comparison of the performance gain
due to grammar lexicalization shows that our
results are on a par with those reported in
Charniak (1997).
6 Conclusion
We have presented a new approach to stochastic
modeling of constraint-based grammars.
Our experimental results show that EM training
can in fact be very helpful for accurate
stochastic modeling in natural language
processing. We conjecture that this result is due
partly to the fact that the space of parses
produced by a constraint-based grammar is
only mildly incomplete, i.e. the ambiguity
rate can be kept relatively low. Another rea-
son may be that EM is especially useful for
log-linear models, where the search space in
maximization can be kept under control. Fur-
thermore, we have introduced a new class-
based grammar lexicalization, which again
uses EM training and incorporates a pre-
disambiguation routine into log-linear models.
An impressive gain in performance could also
be demonstrated for this method. Clearly, a
central task of future work is a further explo-
ration of the relation between complete-data
and incomplete-data estimation for larger,
manually disambiguated treebanks. An inter-
esting question is whether a systematic vari-
ation of training data size along the lines
of the EM-experiments of Nigam et al. (2000)
for text classification will show similar results,
namely a systematic dependence of the relative
gain due to EM training on the relative
sizes of unannotated and annotated data.
Furthermore, it is important to show that EM-
based methods can be applied successfully
also to other statistical parsing frameworks.
Acknowledgements
We thank Stefanie Dipper and Bettina
Schrader for help with disambiguation of the
test suites, and the anonymous ACL review-
ers for helpful suggestions. This research was
supported by the ParGram project and the
project B7 of the SFB 340 of the DFG.
References
Franz Beil, Glenn Carroll, Detlef Prescher, Stefan
Riezler, and Mats Rooth. 1999. Inside-outside
estimation of a lexicalized PCFG for German.
In Proceedings of the 37th ACL, College Park,
MD.
Eugene Charniak. 1997. Statistical parsing with
a context-free grammar and word statistics. In
Proceedings of the 14th AAAI, Menlo Park, CA.
Michael Collins. 1997. Three generative, lexicalised
models for statistical parsing. In Proceedings
of the 35th ACL, Madrid.
J.N. Darroch and D. Ratcliff. 1972. Generalized
iterative scaling for log-linear models. The
Annals of Mathematical Statistics, 43(5):1470-1480.
Stephen Della Pietra, Vincent Della Pietra, and
John Lafferty. 1997. Inducing features of random
fields. IEEE PAMI, 19(4):380-393.
A. P. Dempster, N. M. Laird, and D. B. Rubin.
1977. Maximum likelihood from incomplete
data via the EM algorithm. Journal of
the Royal Statistical Society, 39(B):1-38.
David Elworthy. 1994. Does Baum-Welch re-estimation
help taggers? In Proceedings of the
4th ANLP, Stuttgart.
Gerhard Helbig and Joachim Buscha. 1996.
Deutsche Grammatik. Ein Handbuch für den
Ausländerunterricht. Langenscheidt, Leipzig.
Edwin T. Jaynes. 1957. Information theory
and statistical mechanics. Physical Review,
106:620-630.
Mark Johnson and Stefan Riezler. 2000. Exploiting
auxiliary distributions in stochastic
unification-based grammars. In Proceedings of
the 1st NAACL, Seattle, WA.
Mark Johnson, Stuart Geman, Stephen Canon,
Zhiyi Chi, and Stefan Riezler. 1999. Estimators
for stochastic unification-based grammars. In
Proceedings of the 37th ACL, College Park, MD.
Solomon Kullback. 1959. Information Theory and
Statistics. Wiley, New York.
Mitchell P. Marcus, Beatrice Santorini, and
Mary Ann Marcinkiewicz. 1993. Building
a large annotated corpus of English: The
Penn treebank. Computational Linguistics,
19(2):313-330.
John Maxwell and R. Kaplan. 1996. Unification-based
parsers that automatically take advantage
of context freeness. Unpublished
manuscript, Xerox Palo Alto Research Center.
Kamal Nigam, Andrew McCallum, Sebastian
Thrun, and Tom Mitchell. 2000. Text classification
from labeled and unlabeled documents
using EM. Machine Learning, 39(2/4):103-134.
Fernando Pereira and Yves Schabes. 1992. Inside-outside
reestimation from partially bracketed
corpora. In Proceedings of the 30th ACL,
Newark, Delaware.
Detlef Prescher, Stefan Riezler, and Mats Rooth.
2000. Using a probabilistic class-based lexicon
for lexical ambiguity resolution. In Proceedings
of the 18th COLING, Saarbrücken.
Adwait Ratnaparkhi. 1997. A linear observed
time statistical parser based on maximum entropy
models. In Proceedings of EMNLP-2.
Stefan Riezler. 1999. Probabilistic Constraint
Logic Programming. Ph.D. thesis, Seminar
für Sprachwissenschaft, Universität Tübingen.
AIMS Report, 5(1), IMS, Universität Stuttgart.
