Inside-Outside Estimation of a Lexicalized PCFG for German 
Franz Beil, Glenn Carroll, Detlef Prescher, Stefan Riezler and Mats Rooth 
Institut ffir Maschinelle Sprachverarbeitung, University of Stuttgart 
Abstract 
The paper describes an extensive experiment in 
inside-outside estimation of a lexicalized proba- 
bilistic context free grammar for German verb- 
final clauses. Grammar and formalism features 
which make the experiment feasible are de- 
scribed. Successive models are evaluated on pre- 
cision and recall of phrase markup. 
1 Introduction 
Charniak (1995) and Carroll and Rooth (1998) 
present head-lexicalized probabilistic context 
free grammar formalisms, and show that they 
can effectively be applied in inside-outside es- 
timation of syntactic language models for En- 
glish, the parameterization of which encodes 
lexicalized rule probabilities and syntactically 
conditioned word-word bigram collocates. The 
present paper describes an experiment where a 
slightly modified version of Carroll and Rooth's 
model was applied in a systematic experiment 
on German, which is a language with rich in- 
flectional morphology and free word order (or 
rather, compared to English, free-er phrase or- 
der). We emphasize techniques which made it 
practical to apply inside-outside estimation of 
a lexicalized context free grammar to such a 
language. These techniques relate to the treat- 
ment of argument cancellation and scrambled 
phrase order; to the treatment of case features in 
category labels; to the category vocabulary for 
nouns, articles, adjectives and their projections; 
to lexicalization based on uninflected lemmata 
rather than word forms; and to exploitation of 
a parameter-tying feature. 
2 Corpus and morphology 
The data for the experiment is a corpus of Ger- 
man subordinate clauses extracted by regular 
expression matching from a 200 million token 
newspaper corpus. The clause length ranges be- 
tween four and 12 words. Apart from infiniti- 
val VPs as verbal arguments, there are no fur- 
ther clausal embeddings, and the clauses do 
not contain any punctuation except for a ter- 
minal period. The corpus contains 4128873 to- 
kens and 450526 clauses which yields an average 
of 9.16456 tokens per clause. Tokens are auto- 
matically annotated with a list of part-of-speech 
(PoS) tags using a computational morpholog- 
ical analyser based on finite-state technology 
(Karttunen et al. (1994), Schiller and StSckert 
(1995)). 
A problem for practical inside-outside esti- 
mation of an inflectional language like German 
arises with the large number of terminal and 
low-level non-terminal categories in the gram- 
mar resulting from the morpho-syntactic fea- 
tures of words. Apart from major class (noun, 
adjective, and so forth) the analyser provides an 
ambiguous word with a list of possible combina- 
tions of inflectional features like gender, person, 
number (cf. the top part of Fig. 1 for an exam- 
ple ambiguous between nominal and adjectival 
PoS; the PoS is indicated following the '+' sign). 
In order to reduce the number of parameters 
to be estimated, and to reduce the size of the 
parse forest used in inside-outside estimation, 
we collapsed the inflectional readings of adjec- 
tives, adjective derived nouns, article words, and 
pronouns to a single morphological feature (see 
of Fig. 1 for an example). This reduced the num- 
ber of low-level categories, as exemplified in Fig. 
2: das has one reading as an article and one as 
a demonstrative; westdeutschen has one reading 
as an adjective, with its morphological feature 
N indicating the inflectional suffix. 
We use the special tag UNTAGGED indicating 
that the analyser fails to provide a tag for the 
word. The vast majority of UNTAGGED words are 
proper names not recognized as such. These 
gaps in the morphology have little effect on our 
experiment. 
3 Grammar 
The grammar is a manually developed headed 
context-free phrase structure grammar for Ger- 
man subordinate clauses with 5508 rules and 
269 
analyze> Deutsche 
i. deutsch'ADJ.Pos+NN.Fem.Akk.Sg 
2. deutsch^ADJ.Pos+NN.Fem.Nom.Sg 
3. deutsch^ADJ.Pos+NN.Masc.Nom. Sg. Sw 
4. deutsch^ADJ.Pos+NN.Neut.Akk.Sg. Sw 
5. deutsch^ADJ.Pos+NN.Neut.Nom. Sg.Sw 
6. deutsch-ADJ.Pos+NN.NoGend.Akk.Pi.St 
7. deutsch^ADJ.Pos+NN.NoGend.Nom.Pl.St 
8. *deutsch+ADJ.Pos.Fem.Akk.Sg 
9. *deutsch+ADJ.Pos.Fem.Nom.Sg 
i0. *deutsch+ADJ.Pos.Masc.Nom.Sg.Sw 
ii. *deutsch+ADJ.Pos.Neut.Akk.Sg.Sw 
12. *deutsch+ADJ.Pos.Neut.Nom.Sg. Sw 
13. *deutsch+ADJ.Pos.NoGend.Akk.Pi.St 
14. *deutsch+ADJ.Pos.NoGend.Nom.Pl.St 
==> Deutsche { ADJ.E, NNADJ.E } 
Figure 1: Collapsing Inflectional Features 
w~hrend { ADJ.Adv, ADJ.Pred, KOUS, 
APPR.Dat, APPR.Gen } 
sich { PRF.Z } 
das { DEMS.Z, ART.Def.Z } 
Preisniveau { NN.Neut.NotGen. Sg } 
dem { DEMS.M, ART.Def.M } 
westdeutschen { ADJ.N } 
snn~dlere { VVFIN } 
{ PER } 
Figure 2: Corpus Clip 
562 categories, 209 of which are terminal cat- 
egories. The formalism is that of Carroll and 
Rooth (1998), henceforth C+R: 
mother -> non-heads head' non-heads (freq) 
The rules are head marked with a prime. The 
non-head sequences may be empty, freq is a 
rule frequency, which is initialized randomly and 
subsequently estimated by the inside outside- 
algorithm. To handle systematic patterns re- 
lated to features, rules were generated by Lisp 
functions, rather than being written directly in 
the above form. With very few exceptions (rules 
for coordination, S-rule), the rules do not have 
more than two daughters. 
Grammar development is facilitated by a 
chart browser that permits a quick and efficient 
discovery of grammar bugs (Carroll, 1997a). Fig. 
3 shows that the ambiguity in the chart is quite 
considerable even though grammar and corpus 
are restricted. For the entire corpus, we com- 
puted an average 9202 trees per clause. In the 
chart browser, the categories filling the cells in- 
dicate the most probable category for that span 
with their estimated frequencies. The pop-up 
window under IP presents the ranked list of all 
possible categories for the covered span. Rules 
(chart edges) with frequencies can be viewed 
with a further menu. In the chart browser, colors 
are used to display frequencies (between 0 and 1) 
estimated by the inside-outside algorithm. This 
allows properties shared across tree analyses to 
be checked at a glance; often grammar and es- 
timation bugs can be detected without mouse 
operations. 
The grammar covers 88.5~o of the clauses and 
87.9% of the tokens contained in the corpus. 
Parsing failures are mainly due to UNTAGGED 
words contained in 6.6% of the failed clauses, 
the pollution of the corpus by infinitival con- 
structions (~1.3%), and a number of coordina- 
tions not covered by the grammar (~1.6%). 
3.1 Case features and agreement 
On nominal categories, in addition to the four 
cases Nom, Gen, Dat, and Akk, case features 
with a disjunctive interpretation (such as Dir 
for Nom or Akk) are used. The grammar is writ- 
ten in such a way that non-disjunctive features 
are introduced high up in the tree. This results 
in some reduction in the size of the parse forest, 
and some parameter pooling. Essentially the full 
range of agreement inside the noun phrase is en- 
forced. Agreement between the nominative NP 
and the tensed verb (e.g. in number) is not en- 
forced by the grammar, in order to control the 
number of parameters and rules. 
For noun phrases we employ Abney's chunk 
grammar organization (Abney, 1996). The noun 
chunk (NC) is an approximately non-recursive 
projection that excludes post-head complements 
and (adverbial) adjuncts introduced higher than 
pre-head modifiers and determiners but in- 
cludes participial pre-modifiers with their com- 
plements. Since we perform complete context 
free parsing, parse forest construction, and 
inside-outside estimation, chunks are not moti- 
vated by deterministic parsing. Rather, they fa- 
cilitate evaluation and graphical debugging, by 
tending to increase the span of constituents with 
high estimated frequency. 
270 
daEI 
:: i: i ::::: :(:: 
: ~: 
0 
,LXJ9 VPP.np.np 
;495 VPP.n 
n86 VPK.n 
%'¢JZ VPP.dp.dp 
1743 VPP.d 
1556 VPP,nd.nd 
10Z$ VPP 
Figure 3: Chart browser 
Word-by-word gloss of the clause: 'that Sarajevo over the airport with the essentials supplied will can' 
class # frame types VPA.na.na VPA.na.na 
VPA 15 n, na, nad, nai, nap, nar, nd, ndi, ~ 
ndp, ndr, ni, nir, np, npr, nr 
/ \ / \ 
NP.Nom VPA.na.a NP.Akk VPA.na.n 
VPP 13 d, di, dp, dr, i, ir, n, nd, ni, np, p, 
pr, r ~ /~ 
VPI 10 a, ad, ap, ar, d, dp, dr, p, pr, r NP.Akk VPA.na NP.Nom VPA.na 
VPK 2 i, n 
Figure 4: Number and types of verb frames 
Figure 5: Coding of canonical and scrambled ar- 
gument order 
3.2 Subcategorisation frames of verbs 
The grammar distinguishes four subcategorisa- 
tion frame classes: active (VPA), passive (VPP), 
infinitival (VPI) frames, and copula construc- 
tions (VPK). A frame may have maximally three 
arguments. Possible arguments in the frames are 
nominative (n), dative (d) and accusative (a) 
NPs, reflexive pronouns (r), PPs (p), and infini- 
tival VPs (i). The grammar does not distinguish 
plain infinitival VPs from zu-infinitival VPs. The 
grammar is designed to partially distinguish dif- 
ferent PP frames relative to the prepositional 
head of the PP. A distinct category for the spe- 
cific preposition becomes visible only when a 
subcategorized preposition is cancelled from the 
subcat list. This means that specific prepositions 
do not figure in the evaluation discussed below. 
The number and the types of frames in the dif- 
ferent frame classes are given in figure 4. 
German, being a language with comparatively 
free phrase order, allows for scrambling of ar- 
guments. Scrambling is reflected in the particu- 
lar sequence in which the arguments of the verb 
frame are saturated. Compare figure 5 for an ex- 
ample of a canonical subject-object order in an 
active transitive frame and its scrambled object- 
subject order. The possibility of scrambling verb 
arguments yields a substantial increase in the 
number of rules in the grammar (e.g. 102 com- 
binatorically possible argument rules for all in 
VPA frames). Adverbs and non-subcategorized 
PPs are introduced as adjuncts to VP categories 
which do not saturate positions in the subcat 
frame. 
In earlier experiments, we employed a flat 
clausal structure, with rules for all permutations 
of complements. As the number of frames in- 
creased, this produced prohibitively many rules, 
particularly with the inclusion of adjuncts. 
4 Parameters 
The parameterization is as in C+R, with one 
significant modification. Parameters consist of 
(i) rule parameters, corresponding to right hand 
271 
sides conditioned by parent category and par- 
ent head; (ii) lexical choice parameters for non- 
head children, corresponding to child lemma 
conditioned by child category, parent category, 
and parent head lemma. See C+R or Charniak 
(1995) for an explanation of how such parame~ 
ters define a probabilistic weighting of trees. The 
change relative to C+R is that lexicalization is 
by uninflected lemma rather than word form. 
This reduces the number of lexical parameters, 
giving more acceptable model sizes and elimi- 
nating splitting of estimated frequencies among 
inflectional forms. Inflected forms are generated 
at the leaves of the tree, conditioned on termi- 
nal category and lemma. This results in a third 
family of parameters, though usually the choice 
of inflected form is deterministic. 
A parameter pooling feature is used for argu- 
ment filling where all parent categories of the 
form VP.x.y are mapped to a category VP.x in 
defining lexical choice parameters. The conse- 
quence is e.g. that an accusative daughter of a 
nominative-accusative verb uses the same lexical 
choice parameter, whether a default or scram- 
bled word order is used. (This feature was used 
by C÷R for their phrase trigram grammar, not 
in the linguistic part of their grammar.) Not all 
desirable parameter pooling can be expressed in 
this way, though; for instance rule parameters 
are not pooled, and so get split when the parent 
category bears an inflectional feature. 
5 Estimation 
The training of our probabilistic CFG proceeds 
in three steps: (i) unlexicalized training with 
the supar parser, (ii) bootstrapping a lexical- 
ized model from the trained unlexicalized one 
with the ultra parser, and finally (iii) lexical- 
ized training with the hypar parser (Carroll, 
1997b). Each of the three parsers uses the inside- 
outside algorithm, supar and ultra use an un- 
lexicalized weighting of trees, while hypar uses a 
lexicalized weighting of trees, ultra and hypar 
both collect frequencies for lexicalized rule and 
lexical choice events, while supar collects only 
unlexicalized rule frequencies. 
Our experiments have shown that training an 
unlexicalized model first is worth the effort. De- 
spite our use of a manually developed grammar 
that does not have to be pruned of superfluous 
rules like an automatically generated grammar, 
A B C 
1: 52.0199 
2: 25.3652 
3: 24.5905 
: : 
13: 24.2872 
14: 24.2863 
15: 24.2861 
16: 24.2861 
17: 24.2867 
1: 53.7654 1: 49.8165 
2: 26.3184 2: 23.1008 
3: 25.5035 3: 22.4479 
: : : : 
55: 25.0548 70: 22.1445 
56: 25.0549 80: 22.1443 
57: 25.0549 90: 22.1443 
58: 25.0549 95: 22.1443 
59: 25.055 96: 22.1444 
Figure 6: Overtraining 
on heldout data) 
(iteration: cross-entropy 
the lexicalized model is notably better when 
preceded by unlexicalized training (see also Er- 
san and Charniak (1995) for related observa- 
tions). A comparison of immediate lexicalized 
training (without prior training of an unlexical- 
ized model) and our standard training regime 
that involves preliminary unlexicalized training 
speaks in favor of our strategy (cf. the differ- 
ent 'lex 0' and 'lex 2' curves in figures 8 and 9). 
However, the amount of unlexicalized training 
has to be controlled in some way. 
A standard criterion to measure overtraining 
is to compare log-likelihood values on held-out 
data of subsequent iterations. While the log- 
likelihood value of the training data is theo- 
retically guaranteed to converge through sub- 
sequent iterations, a decreasing log-likelihood 
value of the held-out data indicates over- 
training. Instead of log-likelihood, we use the 
inversely proportional cross-entropy measure. 
Fig. 6 shows comparisons of different sizes of 
training and heldout data (training/heldout): 
(A) 50k/50k, (B) 500k/500k, (C) 4.1M/500k. 
The overtraining effect is indicated by the in- 
crease in cross-entropy from the penultimate to 
the ultimate iteration in the tables. Overtraining 
results for lexicalized models are not yet avail- 
able. 
However, a comparison of precision/recall 
measures on categories of different complexity 
through iterative unlexicalized training shows 
that the mathematical criterion for overtraining 
may lead to bad results from a linguistic point 
of view. While we observed more or less con- 
verging precision/recall measures for lower level 
structures such as noun chunks, iterative unlexi- 
calized training up to the overtraining threshold 
turned out to be disastrous for the evaluation of 
complex categories that depend on almost the 
272 
"° 0 0 0 0 0 0 - ooooo 
"= 0 0 0 0 0 0 
'. O 0 0 0 0 0 
0 0 0 
0 0 0 
-- ~000 
-- 0 0 0 
0 
0 0 
0 0 
0 0 
0 0 
0 0 
0 
0 0 0 
0 0 0 
O. 0 
0 
Figure 7: Chart browser for manual NC labelling 
O 
entire span of the clause. The recognition of sub- 
categorization frames through 60 iterations of 
unlexicalized training shows a massive decrease 
in precision/recall from the best to the last iter- 
ation, even dropping below the results with the 
randomly initialized grammar (see Fig. 9). 
5.1 Training regime 
We compared lexicalized training with respect 
to different starting points: a random unlexi- 
calized model, the trained unlexicalized model 
with the best precision/recall results, and an un- 
lexicalized model that comes close to the cross- 
entropy overtraining threshold. The details of 
the training steps are as follows: 
(1) 0, 2 and 60 iterations of unlexicalized pars- 
ing with supar; 
(2) lexicalization with ultra using the entire 
corpus; 
(3) 23 iterations of lexicalized parsing with 
hypar. 
The training was done on four machines (two 
167 MHz UltraSPARC and two 296 MHz SUNW 
UltraSPARC-II). Using the grammar described 
here, one iteration of supar on the entire corpus 
takes about 2.5 hours, lexicalization and gen- 
erating an initial lexicalized model takes more 
than six hours, and an iteration of lexicalized 
parsing can be done in 5.5 hours. 
6 Evaluation 
For the evaluation, a total of 600 randomly se- 
lected clauses were manually annotated by two 
labelers. Using a chart browser, the labellers 
filled the appropriate cells with category names 
of NCs and those of maximal VP projections 
(cf. Figure 7 for an example of NC-labelling). 
Subsequent alignment of the labelers decisions 
resulted in a total of 1353 labelled NC categories 
(with four different cases). The total of 584 la- 
belled VP categories subdivides into 21 differ- 
ent verb frames with 340 different lemma heads. 
The dominant frames are active transitive (164 
occurrences) and active intransitive (117 occur- 
rences). They represent almost half of the an- 
notated frames. Thirteen frames occur less than 
ten times, five of which just once. 
6.1 Methodology 
To evaluate iterative training, we extracted 
maximum probability (Viterbi) trees for the 600 
clause test set in each iteration of parsing. For 
extraction of a maximal probability parse in 
unlexicalized training, we used Schmid's lopar 
parser (Schmid, 1999). Trees were mapped to 
a database of parser generated markup guesses, 
and we measured precision and recall against 
the manually annotated category names and 
spans. Precision gives the ratio of correct guesses 
over all guesses, and recall the ratio of correct 
guesses over the number of phrases identified by 
human annotators. Here, we render only the pre- 
cision/recall results on pairs of category names 
and spans, neglecting less interesting measures 
on spans alone. For the figures of adjusted re- 
call, the number of unparsed misses has been 
subtracted from the number of possibilities. 
273 
0.88 
O.86 
0.84 
0.82 
0.8 
0 .78 
0.76 
0 .74 
0.72 
0.7 
0.68 
0 
:::::::::::::::::::::: ..................................... 
i/ 
precision lex 02 -- . ............. .. .... 
precision unlex ...... 
precision lax O0 ........ 
precision lex 60 ...... 
recall lax 02 ....... 
recall unlax ....... 
recall lex O0 ........ 
recall lex 60 ........ 
i , = , i i i i , 
I0 20 30 40 50 60 70 80 90 
iteration # 
Figure 8: Precision/recall measures on NC cases 
0.72 
0.7 
0.68 
0.66 
0.64 
0.62 
"~ 0 . 6 
~. 0.58 
0.56 
0.54 
0.52 
.... ........... ... .... 
\ 
\ \ 
i I 
i0 20 
precision lex 02 --. 
pracision unlax ...... 
precision lex O0 ........ . 
precimion lex 60 ....... 
= = i 
50 40 50 60 70 80 90 
iteration # 
Figure 9: Precision measures on all verb frames 
In the following, we focus on the combination 
of the best unlexicalized model and the lexical- 
ized model that is grounded on the former. 
6.2 NC Evaluation 
Figure 8 plots precision/recall for the training 
runs described in section 5.1, with lexicalized 
parsing starting after 0, 2, or 60 unlexicalized it- 
erations. The best results are achieved by start- 
ing with lexicalized training after two iterations 
of unlexicalized training. Of a total of 1353 an- 
notated NCs with case, 1103 are correctly recog- 
nized in the best unlexicalized model and 1112 
in the last lexicalized model. With a number 
of 1295 guesses in the unlexicalized and 1288 
guesses in the final lexicalized model, we gain 
1.2% in precision (85.1% vs. 86.3%) and 0.6% 
in recall (81.5% vs. 82.1%) through lexicalized 
training. Adjustment to parsed clauses yields 
88% vs. 89.2% in recall. As shown in Figure 8, 
the gain is achieved already within the first it- 
eration; it is equally distributed between correc- 
tions of category boundaries and labels. 
The comparatively small gain with lexical- 
ized training could be viewed as evidence that 
the chunking task is too simple for lexical infor- 
mation to make a difference. However, we find 
about 7% revised guesses from the unlexicalized 
to the first lexicalized model. Currently, we do 
not have a clear picture of the newly introduced 
errors. 
The plots labeled "00" are results for lexi- 
calized training starting from a random initial 
grammar. The precision measure of the first lex- 
icalized model falls below that of the unlexi- 
calized random model (74%), only recovering 
through lexicalized training to equalize the pre- 
cision measure of the random model (75.6%). 
This indicates that some degree of unlexicalized 
initialization is necessary, if a good lexica\]ized 
model is to be obtained. 
(Skut and Brants, 1998) report 84.4% recall 
and 84.2% for NP and PP chunking without case 
labels. While these are numbers for a simpler 
problem and are slightly below ours, they are 
figures for an experiment on unrestricted sen- 
tences. A genuine comparison has to await ex- 
tension of our model to free text. 
6.3 Verb Frame Evaluation 
Figure 9 gives results for verb frame recogni- 
tion under the same training conditions. Again, 
we achieve best results by lexicalising the sec- 
ond unlexicalized model. Of a total of 584 anno- 
tated verb frames, 384 are correctly recognized 
in the best unlexicalized model and 397 through 
subsequent lexicalized training. Precision for the 
best unlexicalized model is 68.4%. This is raised 
by 2% to 70.4% through lexicalized training; re- 
call is 65.7%/68%; adjustment by 41 unparsed 
misses makes for 70.4%/72.8% in recall. The 
rather small improvements are in contrast to 
88 differences in parser markup, i.e. 15.7%, be- 
tween the unlexicalized and second lexicalized 
model. The main gain is observed within the 
first two iterations (cf. Figure 9; for readability, 
we dropped the recall curves when more or less 
parallel to the precision curves). 
Results for lexicalized training without prior 
unlexicalized training are better than in the NC 
evaluation, but fall short of our best results by 
more than 2%. 
The most notable observation in verb frame 
evaluation is the decrease of precision of frame 
recognition in unlexicalized training from the 
second iteration onward. After several dozen it- 
274 
0.8 
0,75 
~'~ 0.7 
0.65 
0.6 
0.55 
0.5 
Figure 
F.f -f'/-" ....... 
I precision lex 02 -- 
k\ precision unlex ...... " 
precision lex O0 ........ 
' precision lex 60 .............. 
recall unlex ....... \ 
\ 
, , "-~ ......... ; ........... ~ .......... ,, , , , 
i0 20 30 40 50 60 70 80 90 
iteration # 
10: Precision measures on non-PP frames 
erations, results are 5% below a random model 
and 14% below the best model. The primary 
reason for the decrease is the mistaken revi- 
sion of adjoined PPs to argument PPs. E.g. 
the required number of 164 transitive frames 
is missed by 76, while the parser guesses 64 
VPt.nap frames in the final iteration against 
the annotator's baseline of 12. In contrast, lexi- 
calized training generally stabilizes w.r.t, frame 
recognition results after only few iterations. 
The plot labeled "lex 60" gives precision for a 
lexicalized training starting from the unlexical- 
ized model obtained with 60 iterations, which 
measured by linguistic criteria is a very poor 
state. As far as we know, lexicalized EM esti- 
mation never recovers from this bad state. 
6.4 Evaluation of non-PP Frames 
Because examination of individual cases showed 
that PP attachments are responsible for many 
errors, we did a separate evaluation of non-PP 
frames. We filtered out all frames labelled with 
a PP argument from both the maximal proba- 
bility parses and the manually annotated frames 
(91 filtered frames), measuring precision and re- 
call against the remaining 493 labeller anno- 
tated non-PP frames. 
For the best lexicalized model, we find some- 
what but not excessively better results than 
those of the evaluation of the entire set of 
frames. Of 527 guessed frames in parser markup, 
382 are correct, i.e. a precision of 72.5%. The 
recall figure of 77.5~0 is considerably better 
since overgeneration of 34 guesses is neglected. 
The differences with respect to different start- 
ing points for lexicalization emulate those in the 
evaluation of all frames. 
The rather spectacular looking precision and 
recall differences in unlexicalized training con- 
firm what was observed for the full frame 
set. From the first trained unlexicalized model 
throughout unlexicalized training, we find a 
steady increase in precision (70% first trained 
model to 78% final model) against a sharp drop 
in recall (78% peek in the second model vs. 
50% in the final). Considering our above re- 
marks on the difficulties of frame recognition 
in unlexicalized training, the sharp drop in re- 
call is to be expected: Since recall measures the 
correct parser guesses against the annotator's 
baseline, the tendency to favor PP arguments 
over PP-adjuncts leads to a loss in guesses when 
PP-frames are abandoned. Similarly, the rise in 
precision is mainly explained by the decreas- 
ing number of guesses when cutting out non-PP 
frames. For further discussion of what happens 
with individual frames, we refer the reader to 
(Beil et al., 1998). 
One systematic result in these plots is that 
performance of lexicalized training stabilizes af- 
ter a few iterations. This is consistent with 
what happens with rule parameters for individ- 
ual verbs, which are close to their final values 
within five iterations. 
7 Conclusion 
Our principal result is that scrambling-style 
free-er phrase order, case morphology and sub- 
categorization, and NP-internal gender, num- 
ber and case agreement can be dealt with in 
a head-lexicalized PFCG formedism by means 
of carefully designed categories and rules which 
limit the size of the packed parse forest and give 
desirable pooling of parameters. Hedging this, 
we point out that we made compromises in the 
grammar (notably, in not enforcing nominative- 
verb agreement) in order to control the number 
of categories, rules, and parameters. 
A second result is that iterative lexicalized 
inside-outside estimation appears to ,be bene- 
ficial, although the precision/recall increments 
are small. We believe this is the first substan- 
tial investigation of the utility of iterative lexi- 
calized inside-outside estimation of a lexicalized 
probabilistic grammar involving a carefully built 
grammar where parses can be evaluated by lin- 
guistic criteria. 
A third result is that using too many unlexi- 
calized iterations (more than two) is detrimen- 
tal. A criterion using cross-entropy overtraining 
275 
on held-out data dictates many more unlexical- 
ized iterations, and this criterion is therefore in- 
appropriate. 
Finally, we have clear cases of lexicalized 
EM estimation being stuck in linguistically bad 
states. As far as we know, the model which gave 
the best results could also be stuck in a compar- 
atively bad state. We plan to experiment with 
other lexicalized training regimes, such as ones 
which alternate between different training cor- 
pora. 
The experiments are made possible by im- 
provements in parser and hardware speeds, the 
carefully built grammar, and evaluation tools. 
In combination, these provide a unique environ- 
ment for investigating training regimes for lexi- 
calized PCFGs. Much work remains to be done 
in this area, and we feel that we are just begin- 
ning to develop understanding of the time course 
of parameter estimation, and of the general effi- 
cacy of EM estimation of lexicalized PCFGs as 
evaluated by linguistic criteria. 
We believe our current grammar of Ger- 
man could be extended to a robust free-text 
chunk/phrase grammar in the style of the En- 
glish grammar of Carroll and Rooth (1998) 
with about a month's work, and to a free-text 
grammar treating verb-second clauses and addi- 
tional complementation structures (notably ex- 
traposed clausal complements) with about one 
year of additional grammar development and 
experiment. These increments in the grammar 
could easily double the number of rules. How- 
ever this would probably not pose a problem for 
the parsing and estimation software. 

References 
Steven Abney. 1996. Chunk stylebook. Techni- 
cal report, SfS, Universit~t Tiibingen. 
Franz Beil, Glenn Carroll, Detlef Prescher, Ste- 
fan Riezler, and Mats Rooth. 1998. Inside- 
outside estimation of a lexicalized PCFG for 
German. -Gold-. In Inducing Lexicons with 
the EM Algorithm. AIMS Report 4(3), IMS, 
Universit~t Stuttgart. 
Glenn Carroll and Mats Rooth. 1998. Valence 
induction with a head-lexicalized PCFG. In 
Proceedings of EMNLP-3, Granada. 
Glenn Carroll, 1997a. Manual pages for charge, 
hyparCharge, and tau. IMS, Universit~t 
Stuttgart. 
Glenn Carroll, 1997b. Manual pages for supar, 
ultra, hypar, and genDists. IMS, Stuttgart 
University. 
E. Charniak. 1995. Parsing with context-free 
grammars and word statistics. Technical Re- 
port CS-95-28, Department of Computer Sci- 
ence, Brown University. 
M. Ersan and E. Charniak. 1995. A statistical 
syntactic disambiguation program and what 
it learns. TechnicM Report CS-95-29, Depart- 
ment of Computer Science, Brown University. 
Lauri Karttunen, Todd Yampol, and Gregory 
Grefenstette, 1994. INFL Morphological An- 
alyzer~Generator 3.2.9 (3. 6.4). Xerox Corpo- 
ration. 
Anne Schiller and Chris StSckert, 1995. DMOR. 
IMS, Universit~t Stuttgart. 
Helmut Schmid, 1999. Manual page for lopar. 
IMS, Universit~t Stuttgart. 
Wojciech Skut and Thorsten Brants. 1998. 
A maximum-entropy partiM parser for un- 
restricted text. In Proceedings o/ the Sixth 
Workshop on Very Large Corpora, Montreal, 
Quebec. 
