Robust German Noun Chunking 
With a Probabilistic Context-Free Grammar 
Helmut Schmid and Sabine Schulte im Walde* 
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
Azenbergstraße 12, 70174 Stuttgart, Germany
{schmid, schulte}@ims.uni-stuttgart.de
Abstract 
We present a noun chunker for German which is based on a head-lexicalised probabilistic context-free grammar. A manually developed grammar was semi-automatically extended with robustness rules in order to allow parsing of unrestricted text. The model parameters were learned from unlabelled training data by a probabilistic context-free parser. For extracting noun chunks, the parser generates all possible noun chunk analyses, scores them with a novel algorithm which maximizes the best chunk sequence criterion, and chooses the most probable chunk sequence. An evaluation of the chunker on 2,140 hand-annotated noun chunks yielded 92% recall and 93% precision.
1 Introduction 
A noun chunker marks the noun chunks in a sentence as in the following example:
(Wirtschaftsbosse) mit (zweifelhaftem Ruf)
economy chefs with doubtable reputation
sind an (der in (Engpässen) angewandten Führung)
are in the in bottlenecks applied guidance
(des Landes) beteiligt.
of the country involved.
'Leading economists with doubtable reputations are involved in guiding the country in times of bottlenecks.'
A tool which identifies noun chunks is useful for term extraction (most technical terms are nouns or complex noun groups), for lexicographic purposes (see (Tapanainen and Järvinen, 1998) on syntactically organised concordancing), and as index terms for information retrieval. Chunkers may also mark other types of chunks like verb groups, adverbial phrases or adjectival phrases.
Several methods have been developed for noun chunking. Church's noun phrase tagger (Church, 1988), one of the first noun chunkers, was based on a Hidden Markov Model (HMM) similar to those used for part-of-speech tagging. Another HMM-based approach has been developed by Mats Rooth (Rooth, 1992). It integrates two HMMs; one of them models noun chunks internally, the other models the context of noun chunks. Abney's cascaded finite-state parser (Abney, 1996) also contains a processing step which recognises noun chunks and other types of chunks. Ramshaw and Marcus (Ramshaw and Marcus, 1995) successfully applied Eric Brill's transformation-based learning method to the chunking problem. Voutilainen's NPtool (Voutilainen, 1993) is based on his constraint-grammar system. Finally, Brants (Brants, 1999) described a German chunker which was implemented with cascaded Markov Models.

* Thanks to Mats Rooth and Uli Heid for many helpful comments.
In this paper, a probabilistic context-free parser is applied to the noun chunking task. The German grammar used in the experiments was semi-automatically extended with robustness rules in order to be able to process arbitrary input. The grammar parameters were trained on unlabelled data. A novel algorithm is used for noun chunk extraction. It maximises the probability of the chunk set.
The following section introduces the grammar framework, followed by a description of the chunking algorithm in section 3, and the experiments and their evaluation in section 4.
2 The Grammar 
The German grammar is a head-lexicalised probabilistic context-free grammar. Section 2.1 defines probabilistic context-free grammars and their head-lexicalised refinement. Section 2.2 introduces our grammar architecture, focusing on noun chunks. The robustness rules for the chunker are described in section 2.3.
2.1 (Head-Lexicalised) Probabilistic Context-Free Grammars
A probabilistic context-free grammar (PCFG) is a context-free grammar which assigns a probability $P$ to each context-free grammar rule in the rule set $R$. The probability of a parse tree $T$ is defined as $\prod_{r \in R} P(r)^{|r|}$, where $|r|$ is the number of times rule $r$ was applied to build $T$. The parameters of PCFGs can be learned from unparsed corpora using the Inside-Outside algorithm (Lari and Young, 1990).
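The tree probability defined above can be illustrated with a minimal Python sketch. The toy rules, their probabilities and the nested-tuple tree encoding are assumptions made for this illustration only; they are not taken from the German grammar, and lexical rules are left out of the product for brevity.

```python
from collections import Counter
from math import prod

# Toy PCFG: hypothetical rules and probabilities, for illustration only.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N",)): 0.7,
    ("NP", ("Det", "N")): 0.3,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V",)): 0.4,
}

def rules_used(tree):
    """Yield one (lhs, rhs) pair per phrasal node of a nested-tuple tree.

    A tree is (label, child, ...); a preterminal is (tag, "word") and
    contributes no rule here (lexical rules are ignored in this sketch).
    """
    label, *children = tree
    if children and all(isinstance(c, tuple) for c in children):
        yield (label, tuple(c[0] for c in children))
        for child in children:
            yield from rules_used(child)

def tree_probability(tree, rule_prob):
    """Product over rules r of P(r) ** |r|, with |r| the usage count in the tree."""
    counts = Counter(rules_used(tree))
    return prod(rule_prob[r] ** n for r, n in counts.items())

TREE = ("S",
        ("NP", ("N", "Hans")),
        ("VP", ("V", "trinkt"), ("NP", ("N", "Wasser"))))

print(tree_probability(TREE, RULE_PROB))  # 1.0 * 0.7**2 * 0.6
```

The rule NP → N is applied twice in the example tree, so its probability enters the product squared.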
Head-lexicalised probabilistic context-free grammars (H-L PCFG) (Carroll and Rooth, 1998) extend the PCFG approach by incorporating information about the lexical head of constituents into the probabilistic model.¹ Each node in a parse of a H-L PCFG is labelled with a category and the lexical head of the category. A H-L PCFG rule looks like a PCFG rule in which one of the daughters has been marked as the head. The rule probabilities $P_{rule}(C \to \alpha \mid C)$ are replaced by lexicalised rule probabilities $P_{rule}(C \to \alpha \mid C, h)$ where $h$ is the lexical head of the mother constituent $C$. The probability of a rule therefore depends not only on the category of the mother node, but also on its lexical head. Assume that the grammar has two rules VP → V NP and VP → V. Then the transitive verb buy should have a higher probability for the former rule whereas the latter rule should be more likely for intransitive verbs like sleep. H-L PCFGs incorporate another type of parameters called lexical choice probabilities. The lexical choice probability $P_{choice}(h_d \mid C_d, C_m, h_m)$ represents the probability that a node of category $C_d$ with a mother node of category $C_m$ and lexical head $h_m$ bears the lexical head $h_d$. The probability of a parse tree is obtained by multiplying lexicalised rule probabilities and lexical choice probabilities for all nodes. Since it is possible to transform H-L PCFGs into PCFGs, the PCFG algorithms are applicable to H-L PCFGs.
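The VP example can be made concrete with a small Python sketch. All probability values and the nouns book and idea are hypothetical. The sketch also assumes that a lexical choice probability is applied to each non-head daughter only, since the head daughter shares the lexical head of its mother.

```python
# Hypothetical lexicalised rule probabilities P_rule(C -> alpha | C, h).
P_RULE = {
    ("VP", "buy"): {("V", "NP"): 0.9, ("V",): 0.1},
    ("VP", "sleep"): {("V", "NP"): 0.05, ("V",): 0.95},
}

# Hypothetical lexical choice probabilities P_choice(h_d | C_d, C_m, h_m),
# keyed by (C_d, C_m, h_m).
P_CHOICE = {
    ("NP", "VP", "buy"): {"book": 0.3, "idea": 0.01},
}

def node_score(mother_cat, mother_head, rhs, non_head_daughters):
    """Contribution of one node to the tree probability.

    non_head_daughters lists (category, lexical head) pairs; the head
    daughter is skipped (an assumption of this sketch, see lead-in).
    """
    score = P_RULE[(mother_cat, mother_head)][rhs]
    for cat, head in non_head_daughters:
        score *= P_CHOICE[(cat, mother_cat, mother_head)][head]
    return score

def preferred_rule(cat, head):
    """Most probable expansion of category cat given its lexical head."""
    options = P_RULE[(cat, head)]
    return max(options, key=options.get)

print(preferred_rule("VP", "buy"))    # ('V', 'NP')
print(preferred_rule("VP", "sleep"))  # ('V',)
print(node_score("VP", "buy", ("V", "NP"), [("NP", "book")]))
```

With these toy values the same category VP prefers different expansions depending on its head, which is the effect the lexicalised rule probabilities are meant to capture.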
2.2 Noun Chunks in the German Grammar 
Currently, the German grammar contains 4,619 rules and covers 92% of our 15 million words of verb final and relative clauses². The structural noun chunk concept in the grammar is defined according to Abney's chunk style (Abney, 1991) who describes chunks as syntactic units which correspond in some way to prosodic patterns, containing a content word surrounded by some function word(s): all words from the beginning of the noun phrase to the head noun are included.³ The different kinds of noun chunks covered by our grammar are listed below and illustrated with examples:
• a combination of a non-obligatory determiner, optional adjectives or cardinals and the noun itself:
(1) eine gute Idee
a good idea
(2) vielen Menschen
for many people
(3) deren künstliche Stimme
whose artificial voice
(4) elf Ladungen
eleven cargos
(5) Wasser
water
and prepositional phrases where the definite article of the embedded noun chunk is morphologically combined with a preposition, so the pure noun chunk could not be separated:
(6) zum Schluss
at the end
• personal pronouns: ich (I), mir (me)
• reflexive pronouns: mich (myself), sich (himself/herself/itself)
• possessive pronouns:
(7) Meins ist sauber.
Mine is clean.
• demonstrative pronouns:
(8) Jener fährt viel schneller.
That one goes much faster.
• indefinite pronouns:
(9) Einige sind durchgefallen.
Some failed.
• relative pronouns:
(10) Ich mag Menschen, die ehrlich sind.
I like people who are honest.
• nominalised adjectives: Wichtigem (important things)
• proper names: Christoph, Kolumbus
• a noun chunk refined by a proper name:
(11) der Eroberer Christoph Kolumbus
the conquerer Christoph Columbus
• cardinals indicating a year:
(12) Ich begann 1996.
I started 1996.
The chunks may be recursive in case they appear as complement of an adjectival phrase, as in (der (im Regen) wartende Sohn) (the son who was waiting in the rain).

¹ Other types of lexicalised PCFGs have been described in (Charniak, 1997), (Collins, 1997), (Goodman, 1997), (Chelba and Jelinek, 1998) and (Eisner and Satta, 1999).
² The restricted corpora were extracted automatically from the Huge German Corpus (HGC), a collection of German newspapers as well as specialised magazines for industry, law, computer science.
³ As you will see below, there is one exception, noun chunks refined by a proper name, which end with the name instead of the head noun.

Noun chunks have features for case, without further agreement features for nouns and verbs. The case is constrained by the function of the noun chunk, as verbal or adjectival complement with nominative, accusative, dative or genitive case, as modifier with genitive case, or as part of a prepositional phrase (also in the special case representing a prepositional phrase itself) with accusative or dative case.
Both structure and case of noun phrases may be ambiguous and have to be disambiguated:
• ambiguity concerning structure: diesen (this) is, disregarding the context, a demonstrative pronoun ambiguous between representing a standalone noun chunk (cf. example (8)) or a determiner within a noun chunk (cf. example (2))
• ambiguity concerning case: die Beiträge (the contributions) is, disregarding the context, ambiguous between nominative and accusative case
The disambiguation is learned during grammar training, since the lexicalised rule probabilities as well as the lexical choice probabilities tend to enforce the correct structure and case information. Considering the above examples, the trained grammar should be able to parse diesen Krieg (this war) as one noun chunk instead of two (with diesen representing a standalone noun chunk) because of (i) the preferred use of demonstrative pronouns as determiners (→ lexicalised rule probabilities), and (ii) the lexical coherence between the two words (→ lexical choice probabilities); in a sentence like er zahlte die Beiträge (he paid the contributions) the accusative case of the latter noun chunk should be identified because of the lexical coherence between the verb zahlen (pay) and the lexical head of the subcategorised noun phrase Beitrag (contribution) as related direct object head (→ lexical choice probabilities).
2.3 Robustness Rules 
The German grammar covers over 90% of the clauses of our verb final and relative clause corpora. This is sufficient for the extraction of lexical information, e.g. the subcategorisation of verbs (see (Beil et al., 1999)). For chunking, however, it is usually necessary to analyse all sentences. Therefore, the grammar was augmented with a set of robustness rules. Three types of robustness rules have been considered, namely unigram rules, bigram rules and trigram rules.
Unigram rules are rules of the form X → YP X, where YP is a grammatical category and X is a new category. If such a rule is added for each grammar category⁴, the coverage is 100% because the grammar is then able to generate any sequence of category labels. In practice, some of the rules can be omitted while still retaining full coverage: e.g. the rule X → ADV X is not necessary if the grammar already contains the rules ADVP → ADV and X → ADVP X. Unigram rules are insensitive to their context so that all permutations of the categories which are generated by the X chain have the same probability.

⁴ Also needed are two rules which start and terminate the "X chain". We used the rules TOP → START X and X → END. START and END expand to SGML tags which mark the beginning and the end of a sentence, respectively.

The second type of robustness rules, called trigram rules (Carroll and Rooth, 1998), is more context sensitive. Trigram rules have the form X:Y → Y Y:Z where X, Y, Z are categories and X:Y and Y:Z are new categories. Trigram rules choose the next category on the basis of the two preceding categories. Therefore the number of rules grows as the number of categories raised to the third power. For example, 125,000 trigram rules are needed to generate 50 different categories in arbitrary order.
Since unigram rules are context insensitive and trigram rules are too numerous, a third type of robustness rules, called bigram rules, was developed. A bigram rule actually consists of two rules, a rule of the form :Y → Y Y: which generates the constituent Y deterministically, and a rule Y: → :Z which selects the next constituent Z based on the current one. Given $n$ categories, we obtain $n$ rules of the first form and $n^2$ rules of the second form. Even when categories which directly project to some other category were omitted in the generation of the bigram rules for our German grammar, the number of rules was still fairly large. Hence we generalised some of the grammatical categories by adding additional chain rules. For example, the prepositional phrase categories PP.Akk:an, PP.Akk:auf, PP.Akk:gegen etc. were generalised to PPX by adding the rules PPX → PP.Akk:an etc. Instead of $n + 1$ bigram rules for each of the 23 prepositional categories, we now obtained only $n + 2$ rules with the new category PPX. Altogether, 3,332 robustness rules were added.
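The growth of the three rule types can be checked with a short Python sketch that generates the rules for a list of category names. The category names C0 to C49 are placeholders, the start and end rules of the chains are left out, and the naming of the new categories simply mirrors the notation used above; none of this is the actual rule generation code used for the German grammar.

```python
from itertools import product

def unigram_rules(cats):
    """X -> YP X for every category YP."""
    return [("X", (c, "X")) for c in cats]

def bigram_rules(cats):
    """:Y -> Y Y:  (generate Y) plus  Y: -> :Z  (select the next category Z)."""
    generate = [(f":{y}", (y, f"{y}:")) for y in cats]
    select = [(f"{y}:", (f":{z}",)) for y, z in product(cats, repeat=2)]
    return generate + select

def trigram_rules(cats):
    """X:Y -> Y Y:Z for all category triples (X, Y, Z)."""
    return [(f"{x}:{y}", (y, f"{y}:{z}")) for x, y, z in product(cats, repeat=3)]

cats = [f"C{i}" for i in range(50)]  # 50 placeholder category names
print(len(unigram_rules(cats)))  # n
print(len(bigram_rules(cats)))   # n + n**2
print(len(trigram_rules(cats)))  # n**3
```

For 50 categories this gives 50 unigram rules, 2,550 bigram rules and the 125,000 trigram rules mentioned in the text.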
3 Chunking 
A head-lexicalised probabilistic context-free parser, called LoPar (Schmid, 1999), was used for parsing. The functionality of LoPar encompasses purely symbolic parsing as well as Viterbi parsing, inside-outside computation, POS tagging, chunking and training with PCFGs as well as H-L PCFGs. Because of the large number of parameters, in particular of H-L PCFGs, the parser smoothes the probability distributions in order to avoid zero probabilities. The absolute discounting method (Ney et al., 1994) was adapted to fractional counts for this purpose. LoPar also supports lemmatisation of the lexical heads of a H-L PCFG. The input to the parser consists of ambiguously tagged words. The tags are provided by a German morphological analyser (Schiller and Stöckert, 1995).
The best chunk set of a sentence is defined as the set of chunks (with category, start and end position) for which the sum of the probabilities of all parses which contain exactly that chunk set is maximal. The chunk set of the most likely parse (i.e. Viterbi parse) is not necessarily the best chunk set according to this definition, as the following PCFG shows.
S → A   0.6      B → x   1.0
S → B   0.4      C → x   1.0
A → C   0.5      D → x   1.0
A → D   0.5
This grammar generates the three parse trees (S (A (C x))), (S (A (D x))), and (S (B x)). The parse tree probabilities are 0.3, 0.3 and 0.4, respectively. The last parse is therefore the Viterbi parse of x. Now assume that {A,B} is the set of chunk categories. The most likely chunk set is then {(A,0,1)} because the sum of the probabilities of all parses which contain A is 0.6, whereas the sum over the probabilities of all parses containing B is only 0.4.
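The example can be verified by brute force with a few lines of Python. The sketch enumerates the three parses by hand (each written as the chain of categories from the root down to x) and sums parse probabilities per chunk set; the encoding and the names are chosen for this illustration only.

```python
from collections import defaultdict

# The three parses of "x" under the example PCFG with their probabilities.
PARSES = {
    ("S", "A", "C"): 0.6 * 0.5 * 1.0,
    ("S", "A", "D"): 0.6 * 0.5 * 1.0,
    ("S", "B"): 0.4 * 1.0,
}
CHUNK_CATEGORIES = {"A", "B"}

def chunk_set(parse):
    """Chunks (category, start, end) of a parse; every node spans 0 to 1 here."""
    return frozenset((c, 0, 1) for c in parse if c in CHUNK_CATEGORIES)

# Viterbi parse: the single most probable parse.
viterbi_parse = max(PARSES, key=PARSES.get)

# Best chunk set: maximise the summed probability of all parses sharing it.
mass = defaultdict(float)
for parse, prob in PARSES.items():
    mass[chunk_set(parse)] += prob
best_chunk_set = max(mass, key=mass.get)

print(viterbi_parse, chunk_set(viterbi_parse))
print(best_chunk_set, mass[best_chunk_set])
```

The Viterbi parse yields the chunk set {(B,0,1)}, while the summed probability mass favours {(A,0,1)}, as stated in the text.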
computeChunks is a slightly simplified pseudo-code version of the actual chunking algorithm:
computeChunks(G, P_rule)
  Initialize float array p[G_V]
  Initialize chunk set array chunks[G_V]
  for each vertex v in G_V in bottom-up order do
    if v is an or-node then
      Initialize float array prob[chunks[d(v)]] to 0
      for each daughter u ∈ d(v) do
        prob[chunks[u]] ← prob[chunks[u]] + p[u]
      chunks[v] ← argmax_c prob[c]
      p[v] ← prob[chunks[v]]
    else
      p[v] ← P_rule(rule(v)) · ∏_{u ∈ d(v)} p[u]
      chunks[v] ← ∪_{u ∈ d(v)} chunks[u]
    if v is labelled with a chunk category C then
      chunks[v] ← chunks[v] ∪ {(C, start(v), end(v))}
  return chunks[root(G)]
computeChunks takes two arguments. The first argument is a parse forest G which is represented as an and-or-graph. G_V is the set of vertices. The second argument is the rule probability vector. d is a function which returns the daughters of a vertex. The algorithm computes the best chunk set chunks[v] and the corresponding probability p[v] for all vertices v in bottom-up order. chunks[d(v)] returns the set of chunk sets of the daughter nodes of vertex v. rule(v) returns the rule which created v and is only defined for and-nodes. start(v) and end(v) return the start and end position of the constituent represented by v.
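A runnable Python sketch of computeChunks is given below. It is not LoPar's implementation: the Vertex class, the list-based forest encoding (vertices listed bottom-up, root last) and the choice to attach chunk labels to or-nodes are assumptions of this sketch. The forest encodes the example PCFG from above for the input x.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from math import prod
from typing import List, Optional, Tuple

@dataclass
class Vertex:
    kind: str                                  # "or" or "and"
    daughters: List[int] = field(default_factory=list)
    rule_prob: float = 1.0                     # P_rule(rule(v)), and-nodes only
    chunk: Optional[Tuple[str, int, int]] = None  # (category, start, end)

def compute_chunks(vertices):
    """Return (best chunk set, its probability) for the root (last vertex)."""
    p = [0.0] * len(vertices)
    chunks = [frozenset()] * len(vertices)
    for v, node in enumerate(vertices):        # bottom-up order
        if node.kind == "or":
            # Sum the probabilities of alternatives with the same chunk set.
            prob = defaultdict(float)
            for u in node.daughters:
                prob[chunks[u]] += p[u]
            chunks[v] = max(prob, key=prob.get)
            p[v] = prob[chunks[v]]
        else:
            # Rule probability times daughter probabilities; union of chunks.
            p[v] = node.rule_prob * prod(p[u] for u in node.daughters)
            chunks[v] = frozenset().union(*(chunks[u] for u in node.daughters))
        if node.chunk is not None:
            chunks[v] = chunks[v] | {node.chunk}
    return chunks[-1], p[-1]

FOREST = [
    Vertex("and"),                              # 0: C -> x
    Vertex("or", [0]),                          # 1: C
    Vertex("and"),                              # 2: D -> x
    Vertex("or", [2]),                          # 3: D
    Vertex("and", [1], 0.5),                    # 4: A -> C
    Vertex("and", [3], 0.5),                    # 5: A -> D
    Vertex("or", [4, 5], chunk=("A", 0, 1)),    # 6: A
    Vertex("and"),                              # 7: B -> x
    Vertex("or", [7], chunk=("B", 0, 1)),       # 8: B
    Vertex("and", [6], 0.6),                    # 9: S -> A
    Vertex("and", [8], 0.4),                    # 10: S -> B
    Vertex("or", [9, 10]),                      # 11: S (root)
]

best, probability = compute_chunks(FOREST)
print(best, probability)
```

At the or-node for A the two analyses share the empty chunk set, so their probabilities are added before the chunk (A,0,1) is attached; this is what lets the root prefer {(A,0,1)} with probability 0.6 over {(B,0,1)} with 0.4.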
The chunking algorithm was experimentally compared with chunk extraction from Viterbi parses. In 35 out of 41 evaluation runs with different parameter settings⁵, the f-score of the chunking algorithm was better than that of the Viterbi algorithm. The average f-score of the chunking algorithm was 84.7% compared to 84.0% for the Viterbi algorithm.

⁵ The runs differed wrt. training strategy and number of iterations. See section 4 for details.

4 Experiments 
We performed two main chunking experiments. Initially, the parser trained the chunk grammar based on the restricted grammar described in section 2 according to four different training strategies. A preferred training strategy was then applied to investigate the potential of grammar refinement and extended training data.
4.1 Training 
In the first experiment, the chunker version of the grammar was trained on a corpus comprising a 1 million word subcorpus of relative clauses, a 1 million word subcorpus of verb final clauses and 2 million words of consecutive text. All data had been extracted from the Huge German Corpus. The test data used for the later evaluation was not included in the training corpus.
For training strategy 1, the chunker grammar was first trained on the whole corpus in unlexicalised mode, i.e. like a PCFG. The parameters were reestimated once in the middle and once at the end of the corpus. In the next step, the grammar was lexicalised, i.e. the parser computed the parse probabilities with the unlexicalised model, but extracted frequencies for the lexicalised model. These frequencies were summed over the whole corpus. Three more iterations on the whole corpus followed in which the parameters of the lexicalised model were reestimated.
The parameters of the unlexicalised chunker grammar were initialised in the following way: a frequency of 7500 was assigned to all original grammar rules and 0 to the majority of robustness rules. The parameters were then estimated on the basis of these frequencies. Because of the smoothing, the probabilities of the robustness rules were small but not zero.
For training strategy 2, the chunker rules were initialised with frequencies from a grammar without robustness rule extensions, which had been trained unlexicalised on a 4 million word subcorpus of verb final clauses and a 4 million word subcorpus of relative clauses.
Training strategy 3 again set the frequency of the original rules to 7500 and of the robustness rules to 0. The parser trained with three unlexicalised iterations over the whole training corpus, reestimating the parameters only at the end of the corpus, in order to find out whether the lexicalised probabilistic parser had been better than the fully trained unlexicalised parser on the task of chunk parsing. Training strategy 4 repeated this procedure, but with initialising the chunker frequencies on the basis of a trained grammar.
For each training strategy, further iterations were 
added until the precision and recall values ceased to 
improve. 
For the second part of the experiments, the base grammar was extended with a few simple verb-first and verb-second clause rules. Strategy 4 was applied for training the chunker
(A) on the same training corpus as before, i.e. 2 million words of relative and verb final clauses, and 2 million words of unrestricted corpus data from the HGC,
(B) on a training corpus consisting of 10 million words of unrestricted corpus data from the HGC.
4.2 Evaluation 
The evaluation of the chunker was carried out on noun chunks from 378 unrestricted sentences from the German newspaper Frankfurter Allgemeine Zeitung (FAZ). Two persons independently annotated all noun chunks in the corpus (a total of 2,140 noun chunks), according to the noun chunk definition in section 2.2, without considering grammar coverage, i.e. noun chunks not actually covered by the grammar (e.g. noun chunk ellipsis such as die kleinen [ ]_N) were annotated as such. As labels, we used the identifier NC plus case information: NC.Nom, NC.Acc, NC.Dat, NC.Gen. In addition, we included identifiers for prepositional phrases where the preposition is morphologically merged with the definite article (cf. example (6)), also including case information: PNC.Acc, PNC.Dat.
For each training strategy described in section 4.1 we evaluated the chunker before the training process and after each training iteration: the model in its current training state parsed the test sentences and extracted the most probable chunk sequence as defined in section 3. We then compared the extracted noun chunks with the hand-annotated data, according to
• the range of the chunks, i.e. did the chunker find a chunk at all?
• the range and the identifier of the chunks, i.e. did the chunker find a chunk and identify the correct syntactic category and case?
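The two comparison criteria can be sketched in Python as follows. Chunks are encoded as (label, start, end) triples, and the f-score is taken to be the usual harmonic mean of precision and recall; the paper does not spell out its formula, so this is an assumption. The example chunks are invented for illustration.

```python
def evaluate(predicted, gold, with_label=True):
    """Precision, recall and f-score of predicted chunks against gold chunks.

    Both arguments are sets of (label, start, end) triples. With
    with_label=False only the range (start, end) has to match.
    """
    if not with_label:
        predicted = {(start, end) for _, start, end in predicted}
        gold = {(start, end) for _, start, end in gold}
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Invented example: one case error and one missed chunk.
gold = {("NC.Nom", 0, 1), ("NC.Dat", 2, 4), ("NC.Gen", 9, 11)}
predicted = {("NC.Acc", 0, 1), ("NC.Dat", 2, 4)}

print(evaluate(predicted, gold, with_label=False))  # range only
print(evaluate(predicted, gold, with_label=True))   # range and label
```

In the example the range-only evaluation counts both predicted chunks as correct, whereas the stricter evaluation rejects the chunk whose case label is wrong.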
Figures 1 and 2 display the results of the evaluation in the first experiment,⁶ according to noun chunk range only and according to noun chunk range, syntactic category and case, respectively. Bold font highlights the best versions.
Training strategy 2 with two iterations of lexicalised training produced the best f-scores for noun chunk boundary recognition if unlexicalised parsing was done. The respective precision and recall values were 93.06% and 92.19%. For recognising noun chunks with range, category and case, the best chunker version was created by training strategy 4, after five iterations of unlexicalised training; precision and recall values were 79.28% and 76.75%, respectively.

⁶ The lexicalised chunker versions obtained by strategy 2 were also utilised for parsing the test sentences unlexicalised.

From the experimental results, we can conclude that:
1. initialisation of the chunker grammar frequencies on the basis of a trained grammar improves the untrained version of the chunker, but the difference vanishes in the training process
2. unlexicalised parsing is sufficient for noun chunk extraction; for extraction of chunks with case information, unlexicalised training turned out to be even more successful than a combination with lexicalised training
Figures 3 and 4 display the results of the evaluation concerning the second experiment, compared to the initial values from the first experiment.
Extending the base grammar and the training corpus slightly increased precision and recall values for recognising noun chunks according to range only. The main improvement was in noun chunk recognition according to range, category and case: precision and recall values increased to 83.88% and 83.21%, respectively.
4.3 Failure Analysis 
A comparison of the parsed noun chunks with the annotated data showed that failure in detecting a noun chunk was mainly caused by proper names, for example Netanjahu, abbreviations like OSZE, or composita like South China Morning Post. The diversity of proper names makes it difficult for the chunker to learn them properly. On the one hand, the lexical information for proper names is unreliable because many proper names were not recognised as such. On the other hand, most proper names are too rare to learn reliable statistics for them.
Minor mistakes were caused by (a) articles which are morphologically identical to noun chunks consisting of a pronoun, for example den Rentnern (the pensioners, dative) was analysed as two noun chunks, den (demonstrative pronoun) and Rentnern, (b) capital letter confusion: since German nouns typically start with capital letters, sentence beginnings are wrongly interpreted as nouns, for example Würden as the conditional of the auxiliary werden (to become) is interpreted as the dative case of Würde (dignity), (c) noun chunk internal punctuation as in seine ' Partner ' (his ' partners ').
Failure in assigning the correct syntactic category and case to a noun chunk was mainly caused by (a) assigning accusative case instead of nominative case, and (b) assigning dative case or nominative case instead of accusative case. The confusion between nominative and accusative case is due to the fact that both cases are expressed by identical morphology in the feminine and neutral genders in German. The morphologic similarity between accusative and dative is less substantial, but especially proper names and bare nouns are still subject to confusion. As the evaluation results show, the distinction between the cases could be learned in general, but morphological similarity and in addition the relatively free word order in German impose high demands on the necessary probability model.

            Strategy 1        Strategy 2        Strategy 2        Strategy 3        Strategy 4
                                                parsed unlex
            prec     rec      prec     rec      prec     rec      prec     rec      prec     rec
untrained   83.63%   83.63%   90.22%   90.18%   90.22%   90.18%   83.63%   83.63%   90.22%   90.18%
unlex1      91.88%   89.62%   92.84%   91.58%   92.84%   91.58%   91.88%   89.62%   92.84%   91.58%
unlex2      -        -        -        -        -        -        92.55%   90.04%   93.01%   91.49%
unlex3      -        -        -        -        -        -        92.78%   90.22%   92.95%   91.25%
unlex4      -        -        -        -        -        -        -        -        93.09%   90.79%
unlex5      -        -        -        -        -        -        -        -        93.29%   90.32%
lex0        89.13%   89.71%   90.12%   90.41%   93.01%   91.49%   -        -        -        -
lex1        88.37%   89.52%   88.97%   89.76%   93.02%   91.67%   -        -        -        -
lex2        88.25%   89.57%   89.79%   90.46%   93.06%   92.19%   -        -        -        -
lex3        88.17%   89.62%   89.42%   90.13%   93.05%   92.05%   -        -        -        -
Figure 1: Comparing training strategies: noun chunk evaluation according to range only

            Strategy 1        Strategy 2        Strategy 2        Strategy 3        Strategy 4
                                                parsed unlex
            prec     rec      prec     rec      prec     rec      prec     rec      prec     rec
untrained   63.52%   63.52%   72.02%   71.98%   72.02%   71.98%   63.52%   63.52%   72.02%   71.98%
unlex1      74.50%   78.11%   75.87%   74.84%   75.87%   74.84%   74.50%   73.11%   75.87%   74.84%
unlex2      -        -        -        -        -        -        76.88%   74.79%   78.27%   76.99%
unlex3      -        -        -        -        -        -        77.97%   75.82%   78.70%   77.27%
unlex4      -        -        -        -        -        -        -        -        78.80%   76.85%
unlex5      -        -        -        -        -        -        -        -        79.28%   76.75%
lex0        73.68%   73.15%   75.10%   75.35%   78.27%   76.99%   -        -        -        -
lex1        72.02%   72.97%   74.69%   75.35%   77.27%   76.15%   -        -        -        -
lex2        72.76%   73.85%   75.26%   75.82%   77.48%   76.75%   -        -        -        -
lex3        71.97%   73.15%   75.03%   75.63%   77.45%   76.61%   -        -        -        -
Figure 2: Comparing training strategies: noun chunk evaluation according to range and label

5 Summary 
We presented a German noun chunker for unrestricted text. The chunker is based on a head-lexicalised probabilistic context-free grammar and trained on unlabelled data. The base grammar was semi-automatically augmented with robustness rules in order to cover unrestricted input. An algorithm for chunk extraction was developed which maximises the probability of the chunk sets rather than the probability of single parses like the Viterbi algorithm.
German noun chunks were detected with 93% precision and 92% recall. Asking the chunker to additionally identify the syntactic category and the case of the chunks resulted in recall of 83% and precision of 84%. A comparison of different training strategies showed that unlexicalised parsing information was sufficient for noun chunk extraction with and without case information. The base grammar played an important role in the chunker development: (i) building the chunker on the basis of an already trained grammar improved the chunker rules, and (ii) refining the base grammar with even simple verb-first and verb-second rules improved accuracy, so it should be worthwhile to further extend the grammar rules. Increasing the amount of training data also improved noun chunk recognition, especially case disambiguation. Better heuristics for guessing the parts-of-speech of unknown words should further improve the noun chunk recognition, since many errors were caused by unknown words.

            Strategy 4
            Initial           A                 B
            prec     rec      prec     rec      prec     rec
untrained   90.22%   90.18%   90.43%   90.60%   90.43%   90.60%
unlex1      92.84%   91.58%   91.65%   91.35%   91.52%   91.35%
unlex2      93.01%   91.49%   92.45%   91.58%   91.89%   91.67%
unlex3      92.95%   91.25%   92.64%   91.21%   92.21%   91.86%
unlex4      93.09%   90.79%   93.11%   91.07%   92.73%   91.86%
unlex5      93.29%   90.32%   93.20%   91.02%   92.73%   91.86%
unlex6      -        -        -        -        92.91%   91.96%
unlex7      -        -        -        -        92.83%   92.10%
Figure 3: Grammar and training data extensions: noun chunk evaluation according to range only

            Strategy 4
            Initial           A                 B
            prec     rec      prec     rec      prec     rec
untrained   72.02%   71.98%   74.42%   74.56%   74.42%   74.56%
unlex1      75.87%   74.84%   78.51%   78.25%   77.60%   77.46%
unlex2      78.27%   76.99%   80.74%   79.98%   79.89%   79.70%
unlex3      78.70%   77.27%   81.24%   79.98%   81.17%   80.87%
unlex4      78.80%   76.85%   81.83%   80.03%   82.44%   81.67%
unlex5      79.28%   76.75%   81.85%   79.93%   82.53%   81.76%
unlex6      -        -        -        -        82.94%   82.09%
unlex7      -        -        -        -        83.88%   83.21%
Figure 4: Grammar and training data extensions: noun chunk evaluation according to range and label

References 

Steven Abney. 1991. Parsing by Chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers.

Steven Abney. 1996. Partial Parsing via Finite-State Cascades. In Proceedings of the ESSLLI '96 Robust Parsing Workshop.

Franz Beil, Glenn Carroll, Detlef Prescher, Stefan Riezler, and Mats Rooth. 1999. Inside-Outside Estimation of a Lexicalized PCFG for German. In Proceedings of the 37th Annual Meeting of the ACL, pages 269-276.

Thorsten Brants. 1999. Cascaded Markov Models. In Proceedings of EACL '99.

Glenn Carroll and Mats Rooth. 1998. Valence Induction with a Head-Lexicalized PCFG. In Proceedings of Third Conference on Empirical Methods in Natural Language Processing.

Eugene Charniak. 1997. Statistical Parsing with a Context-Free Grammar and Word Statistics. In Proceedings of the 14th National Conference on Artificial Intelligence.

Ciprian Chelba and Frederick Jelinek. 1998. Exploiting Syntactic Structure for Language Modeling. In Proceedings of the 35th Annual Meeting of the ACL.

Kenneth W. Church. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136-143.

Michael Collins. 1997. Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of the 35th Annual Meeting of the ACL.

Jason Eisner and Giorgio Satta. 1999. Efficient Parsing for Bilexical Context-Free Grammars and Head Automaton Grammars. In Proceedings of the 37th Annual Meeting of the ACL, pages 457-464.

Joshua Goodman. 1996. Parsing Algorithms and 
Metrics. In Proceedings of the 34th Annual Meet- 
ing of the ACL, pages 177-183. 

Joshua Goodman. 1997. Probabilistic Feature Grammars. In Proceedings of the 5th International Workshop on Parsing Technologies, pages 89-100.

K. Lari and S. Young. 1990. The Estimation of Stochastic Context-Free Grammars using the Inside-Outside Algorithm. Computer Speech and Language, 4:35-56.

Hermann Ney, U. Essen, and R. Kneser. 1994. On Structuring Probabilistic Dependencies in Stochastic Language Modelling. Computer Speech and Language, 8:1-38.

L. Ramshaw and M. Marcus. 1995. Text Chunking using Transformation-Based Learning. In Proceedings of the Third Workshop on Very Large Corpora, pages 82-94.

Mats Rooth. 1992. Statistical NP Tagging. Unpublished manuscript.

Anne Schiller and Chris Stöckert. 1995. DMOR. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.

Helmut Schmid. 1999. LoPar: Design and Implementation. Technical report, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.

Pasi Tapanainen and Timo Järvinen. 1998. Dependency Concordances. International Journal of Lexicography, 11(3):187-203.

Atro Voutilainen. 1993. NPtool, a Detector of English Noun Phrases. In Proceedings of the Workshop on Very Large Corpora, pages 48-57.
