Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 515–522, Vancouver, October 2005. c©2005 Association for Computational Linguistics
Disambiguation of Morphological Structure using a PCFG
Helmut Schmid
Institute for Natural Language Processing (IMS)
University of Stuttgart
Germany
schmid@ims.uni-stuttgart.de
Abstract
German has a productive morphology and
allows the creation of complex words
which are often highly ambiguous. This
paper reports on the development of a
head-lexicalized PCFG for the disam-
biguation of German morphological anal-
yses. The grammar is trained on unla-
beled data using the Inside-Outside algo-
rithm. The parser achieves a precision
of more than 68% on difficult test data,
which is 23% more than the baseline ob-
tained by randomly choosing one of the
simplest analyses. Remarkable is the fact
that precision drops to 52% without lexi-
calization.
1 Introduction
German words may be as complex as the fol-
lowing title of a bill: Rindfleischetikettierungs-
¨uberwachungsaufgaben¨ubertragungsgesetz (law for
the transfer of the task of controlling the labeling
of beef). The complexity is due to the productive
morphological processes of derivation (e.g. Etiket-
tierung (labeling) = Etikett (label) + ier (deriva-
tional suffix) + ung (nominalization suffix)) and
compounding (e.g. Rindfleisch (beef) = Rind (cattle)
+ Fleisch (meat)). For many words, there is more
than one possible analysis. The German SMOR
morphology (Schmid et al., 2004) e.g. generates 24
analyses for the word Abteilungen. If differences in
the case feature are ignored, there are still six analy-
ses, all of them plural:
• Abt (abbot) Ei (egg) Lunge (lung) n (plural in-
flectional ending)
• Abt (abbot) ei (abbot → abbey) Lunge (lung) n
(plural inflectional ending)
• Abt (abbot) eil (hurry) ung (nominalization suf-
fix) en (plural inflectional ending)
• Abtei (abbey) Lunge (lung) n (plural inflec-
tional ending)
• Abteilung (department) en (plural inflectional
ending)
• ab (separable verb prefix) teil (divide) ung
(nominalization suffix) en (plural inflectional
ending)
Here – and in many other cases, as well – the least
complex analysis (defined as the number of deriva-
tion and compounding steps), namely the plural of
Abteilung (department), is the correct one. This
heuristic is not always successful, however. The
word Reisende e.g. is analyzed as the compound of
Reis (rice) + Ende (end), and alternately as the nom-
inalization of the present participle reisend (travel-
ing). The latter one is correct although it requires
two derivational steps (formation of the participle
plus nominalization), while the former requires only
one compounding step.
The least complex analysis is not necessarily
unique. One reason is, that German word forms are
often ambiguous wrt. number, gender and case. The
adjective kleine (small) e.g. receives 7 analyses by
SMOR which differ only in the agreement features.
515
Another reason is that word forms are ambiguous
wrt. the part of speech. The word gerecht e.g. is ei-
ther an adjective (fair) or the past participle of the
verb rechen (to rake). Similarly, the word gerade
is either an adjective (straight) or an adverb (just).
These ambiguities can be resolved based on the con-
text e.g. with a part-of-speech tagger.
Other types of ambiguities are not disambiguated
by the syntactic context because the morphosyntac-
tic features are invariant. Compounds with three el-
ements like Sonderpreisliste, for instance, are sys-
tematicallyambiguousbetweenaleft-branching(list
of special prices) and a right-branching structure
(special price-list), but unambiguous regarding their
part of speech and agreement features. Some word
forms like Schmerzfreiheit (absence of pain) can ei-
ther be analyzed as derivations (schmerzfrei-heit –
“painless-ness”) or as compounds (Schmerz-freiheit
– “pain freedom”). Again, there is no difference in
the morphosyntactic features. A further source of
ambiguity are the stems: Consider the word Mit-
telzuweisung (allocation of resources). The com-
pounding stem mittel could either originate from the
adjective mittel (average) or from the noun Mittel
(means). All these ambiguities are not resolvable by
the syntactic context because their syntactic prop-
erties are identical. However, most of these words
have a preferred reading. Nahverkehrszug (com-
muter train) e.g. is likely to have a left-branching
structure, whereas the correct analysis of Computer
bildschirm (computer monitor) is right-branching.
Considering these features of German morphol-
ogy, the following disambiguation strategy for mor-
phological ambiguities is proposed: Frequent words
should be manually disambiguated and the cor-
rect analysis/analyses should be explicitly stored in
the lexicon. Ambiguities involving different mor-
phosyntactic features should be resolved by a tagger
orparser. Theeliminationoftheremainingambigui-
ties, namely ambiguities of rare words which are not
reflected by the morphosyntactic features, requires a
different method. A general strategy is to generate
the set of possible analyses, to rank them accord-
ing to some criterion and to return the best analysis
(or analyses). One very simple ranking criterion is
the complexity of the analysis e.g. measured by the
number of derivational and compounding steps. We
will use this criterion as a baseline to which we com-
pare our method.
Given an FST-based morphological analyzer and
a training corpus consisting of manually disam-
biguatedanalyses, itisalsopossibletoestimatetran-
sition probabilities for the finite state transducer and
to disambiguate by choosing the most probable path
through the transducer network for a given word. A
drawback of this approach is the limitation of the
type of analyses that finite state transducers can gen-
erate. A finite state transducer maps a regular lan-
guage, the set of word forms, to another regular lan-
guage, the set of analyses. Therefore it is not able to
produce structured analyses as shown in figure 1 (for
arbitrary depths). It also fails to represent non-local
dependencies, like the one between Vertrag (con-
tract) and L¨osung (solution) in the second analysis
of figure 1.
ung
auf
Losung..
N
N N
V N V N
miet VertragsaufV
los..
N N
N
V N
mietVertrags
V Suff
Pref
Figure 1: Two morphological analyses of the Ger-
man word Mietvertragsaufl¨osung (leasing contract
cancellation); the first one is correct.
Given the limitations of weighted finite-state
transducers, we propose to use a more power-
ful formalism, namely head-lexicalized probabilis-
tic context-free grammars (Carroll and Rooth, 1998;
Charniak, 1997) to rank the analyses. Context-free
grammars have, of course, no difficulties to generate
the analyses shown in figure 1. By assigning prob-
abilities to the grammar rules, we obtain a proba-
bilistic context-free grammar (PCFG) which allows
the parser to distinguish between frequent and rare
morphological constructions. Nouns e.g. are much
more likely to be compounds than verbs. In head-
lexicalized PCFGs (HL-PCFGs), the probability of
a rule also depends on the lexical head of the con-
stituent. HL-PCFGs are therefore able to learn that
nouns headed by Problem (problem) are more likely
to be compounds (e.g. Schulprobleme (school prob-
lems)) than nouns headed by Samstag (Saturday).
Moreover, HL-PCFGs represent lexical dependen-
516
cies like that between Vertrag and L¨osung in fig-
ure 1.
a98 a98 a98 a98
a98 a98
a98
a98
Abt
Ab ung
lunge ennei
eil
Abtei
Abteilung
teil
Figure 2: Morpheme lattice
In this paper, we present a HL-PCFG-based dis-
ambiguator for German. Using the SMOR morpho-
logical analyzer, the input words are first split into
morpheme sequences and then analyzed with a HL-
PCFG parser. Due to ambiguities, the parser’s input
is actually a lattice rather than a sequence (see the
example in figure 2).
The rest of the paper is organized as follows:
In section 2, we briefly review head-lexicalized
PCFGs. Section 3 summarizes some important fea-
tures of SMOR. The development of the grammar
will be described in section 4. Section 5 explains the
trainingstrategy, andsection6reportsontheannota-
tion of the test data. Section 7 presents the results of
an evaluation, section 8 comments on related work,
and section 9 summarizes the main points of the pa-
per. Finally, section 10 gives an outlook on future
work.
2 Head-Lexicalized PCFGs
A head-lexicalized parse tree is a parse tree in which
each constituent is labeled with its category and its
lexical head. The lexical head of a terminal symbol
is the symbol itself and the lexical head of a non-
terminal symbol is the lexical head of its (unique)
head child.
In a head-lexicalized PCFG (HL-PCFG) (Carroll
and Rooth, 1998; Charniak, 1997), one symbol on
the right-hand side of each rule is marked as the
head. A HL-PCFG assumes that (i) the probability
of a rule depends on the category and the lexical
head of the expanded constituent and (ii) that the
lexical head of a non-head node depends on its
own category, and the category and the lexical
head of the parent node. The probability of a
head-lexicalized parse tree is therefore:
pstart(cat(root)) pstart(head(root)|cat(root))∗producttext
n∈N prule(rule(n)|cat(n),head(n))∗producttext
n∈A phead(head(n)|cat(n),pcat(n),phead(n))
where
root is the root node of the parse tree
cat(n) is the syntactic category of node n
head(n) is the lexical head of node n
rule(n) is the grammar rule which expands node n
pcat(n) is the syntactic category of the parent of n
phead(n) is the lexical head of the parent of n
HL-PCFGs have a large number of parameters
which need to estimated from training data. In
order to avoid sparse data problems, the parameters
usually have to be smoothed. HL-PCFGs can either
be trained on labeled data (supervised training) or
on unlabeled data (unsupervised training) using
the Inside-Outside algorithm, an instance of the
EM algorithm. Training on labeled data usually
gives better results, but it requires a treebank
which is expensive to create. In our experiments,
we used unsupervised training with the LoPar
parser which is available at http://www.ims.uni-
stuttgart.de/projekte/gramotron/SOFTWARE/LoPar-
en.html.
3 SMOR
SMOR (Schmid et al., 2004) is a German FST-
based morphological analyzer which covers inflec-
tion, compounding, and prefix as well as suffix
derivation. It builds on earlier work reported in
(Schiller, 1996) and (Schmid et al., 2001).
SMOR uses features to represent derivation con-
straints. German derivational suffixes select their
base in terms of part of speech, the stem type
(derivation or compounding stem)1, the origin (na-
tive, classical, foreign), and the structure (simplex,
compound, prefix derivation, suffix derivation) of
the stem which they combine with. This informa-
tion is encoded with features. The German deriva-
tion suffix lich e.g. combines with a simplex deriva-
tion stem of a native noun to form an adjective. The
feature constraints of lich are therefore (1) part of
speech = NN (2) stem type = deriv (3) origin = na-
tive and (4) structure = simplex.
1Suffixes which combine with compounding stems histori-
cally evolved from compounding constructions.
517
4 The Grammar
The grammar used by the morphological disam-
biguator has a small set of rather general cate-
gories for prefixes (P), suffixes (S), uninflected base
stems (B), uninflected base suffixes (SB), inflec-
tional endings (F) and other morphemes (W). There
is only one rule for compounding and prefix and
suffix derivation, respectively, and two rules for the
stem and suffix inflection. Additional rules intro-
duce the start symbol TOP and generate special
word forms like hyphenated (Thomas-Mann-Straße)
or truncated words (Vor-). Overall, the base gram-
mar has 13 rules. Inflection is always attached low
in order to avoid spurious ambiguities. The part of
speech is encoded as a feature.
Like SMOR, the grammar encodes derivation
constraints with features. Number, gender and case
are not encoded. Ambiguities in the agreement fea-
tures are therefore not reflected in the parses which
the grammar generates. This allows us to abstract
away from this type of ambiguity which cannot be
resolved without contextual information. If some
application requires agreement information, it has to
be reinserted after disambiguation.
The feature grammar is compiled into a context-
free grammar with 1973 rules. In order to reduce
the grammar size, the features for origin and com-
plexity were not compiled out. Figure 3 shows a
compounding rule (building a noun base stem from
a noun compounding stem and a noun base stem),
a suffix derivation rule (building an adjective base
stem from a noun derivation stem and a derivation
suffix), a prefix derivation rule (prefixing a verbal
compounding stem) and two inflection rules (for the
inflection of a noun and a nominal derivation suffix,
respectively) from the resulting grammar. The quote
symbol marks the head of a rule.
W.NN.base → W.NN.compound W.NN.base’
W.ADJ.base → W.NN.deriv S.NN.deriv.ADJ.base
W.V.compound → P.V W.V.compound’
W.NN.base → B.NN.base’ F.NN
S.ADJ.deriv.NN.base → SB.ADJ.deriv.NN.base’
F.NN
Figure 3: Some rules from the context-free grammar
The parser retrieves the categories of the mor-
phemes from a lexicon which also contains infor-
mation about the standard form of a morpheme.
The representation of the morphemes returned by
the FST-based word splitter is close to the surface
form. Only capitalization is taken over from the
standardform. Theadjectiveurs¨achlich(causal), for
instance, is split into Urs¨ach and lich. The lexicon
assigns to Urs¨ach the category W.NN.deriv and the
standard form Ursache (cause).
5 PCFG Training
PCFG training normally requires manually anno-
tated training data. Because a treebank of Ger-
man morphological analyses was not available,
we decided to try unsupervised training using the
Inside-Outside algorithm (Lari and Young, 1990).
We worked with unlexicalized as well as head-
lexicalized PCFGs (Carroll and Rooth, 1998; Char-
niak, 1997). The lexicalized models used the stan-
dard form of the morphemes (see the previous sec-
tion) as heads.
The word list from a German 300 million word
newspaper corpus was used as training data. From
the 3.2 million tokens in the word list, SMOR suc-
cessfully analyzed 2.3 million tokens which were
used in the experiment. Training was either type-
based(witheachwordformhavingthesameweight)
or token-based (with weights proportional to the fre-
quency). We experimented with uniform and non-
uniform initial distributions. In the uniform model,
each rule had an initial frequency of 1 from which
the probabilities were estimated. In the non-uniform
model, the frequency of two classes of rules was in-
creased to 1000. The first class are the rules which
expand the start symbol TOP to an adjective or ad-
verb, leading to a preference of these word classes
over other word classes, in particular verbs. The
second class is formed by rules generating inflec-
tional endings, which induces a preference for sim-
pler analyses.
6 Test Data
The test data was extracted from a corpus of the Ger-
man newspaper Die Zeit which was not part of the
training corpus. We prepared two different test cor-
pora. The first test corpus (data1) consisted of 425
words extracted from a randomly selected part of the
518
corpus. We only extracted words with at least one
letter which were ambiguous (ignoring ambiguities
in number, gender and case) and either nouns, verbs
or adjectives and not from the beginning of a sen-
tence. Duplicates were retained. The words were
parsed and manually disambiguated. We looked at
the context of a word, where this was necessary for
disambiguation. Words without a correct analysis
were deleted.
In order to obtain more information on the types
of ambiguity and their frequency, 200 words were
manually classified wrt. the class of the ambiguity.
The following results were obtained:
• 39 words (25%) were ambiguous between an
adjective and a verb like gerecht - “just” (ad-
jective) vs. past participle of rechen (to rake).
• 28 words (18%) were ambiguous between a
noun and a proper name like Mann - “man” vs.
Thomas Mann
• 19wordswereambiguousbetweenanadjective
and an adverb like gerade - “straight” vs. “just”
(adverb)
• 14 words (9%) showed a complex ambiguity
involving derivation and compounding like the
word ¨uberlieferung (tradition) which is either
a nominalization of the prefix verb ¨uberliefern
(to bequeath) or a compound of the stems ¨uber
(over) and Lieferung (delivery).
• 13 words (8%) were compounds which were
ambiguous between a left-branching and a
right-branching structure like Weltrekordh¨ohe
(world record height)
• In 10 words (5%), there was an ambiguity be-
tween an adjective and a proper name or noun
stem - as in H¨ochstleistung (maximum perfor-
mance) where h¨ochst can be derived from the
proper name H¨ochst (a German city) or the su-
perlative h¨ochst (highest)
• 6 words (3%) showed a systematic ambigu-
ity between an adjective and a noun caused
by adding the suffix er to a city name, like
Moskauer - “Moskau related” vs. “person from
Moskau”
• Another 6 words were ambiguous between two
different noun stems like Halle which is either
singular form of Halle (hall) or the plural form
of Hall (reverberation)
Overall 50% of the ambiguities involved a part-of-
speech ambiguity.
The second set of test data (data2) was designed
to contain only infrequent words which were not
ambiguous wrt. part of speech. It was extracted
from the same newspaper corpus. Here, we ex-
cluded words which were (1) sentence-initial (in or-
der to avoid problems with capitalized words) (2)
not analyzed by SMOR (3) ambiguous wrt. part of
speech (4) from closed word classes or (5) simplex
words. Furthermore, we extracted only words with
more than one simplest2 analysis, in order to make
the test data more challenging. The extracted words
were sorted by frequency and a block of 1000 word
forms was randomly selected from the lower fre-
quency range. All of them had occurred 4 times. We
focussed on rare words because frequent words are
better disambiguated manually and stored in a table
(see the discussion in the introduction).
The 1000 selected word forms were parsed and
manually disambiguated. 193 problematic words
were deleted from the evaluation set because either
(1) no analysis was correct (e.g. Elsevier, which was
not analyzed as a proper name) or (2) there was a
true ambiguity (e.g. Rottweiler which is either a dog
breed or a person from the city of Rottweil or (3)
the lemma was not unique (Dreht¨ur (revolving door)
could be lemmatized to Dreht¨ur or Dreht¨ure with no
difference in meaning.) or (4) several analyses were
equivalent. The disambiguation was often difficult.
Even among the words retained in the test set, there
were many that we were not fully sure about. An
example is the compound Natureisbahn (“natural
ice rink”) which we decided to analyze as Natur-
Eisbahn (nature ice-rink) rather than Natureis-Bahn
(nature-ice rink).
7 Results
The parser was trained using the Inside-Outside al-
gorithm. By default, (a) the initialization of the rule
2The complexity of an analysis is measured by the number
of derivation and compounding steps.
519
probabilities was non-uniform as described in sec-
tion 5, (b) training was based on tokens (i.e. the
frequency of the training items was taken into ac-
count), and (c) all training iterations were lexical-
ized. Training was quite fast. One training iteration
on 2.3 million word forms took about 10 minutes on
a Pentium IV running at 3 GHz.
Figure 4 shows the exact match accuracy of the
Viterbi parses depending on the number of training
iterations, whichrangesfrom0(theinitial, untrained
model) to 15. For comparison, a baseline result is
shown which was obtained by selecting the set of
simplest analyses and choosing one of them at ran-
dom3. The baseline accuracy was 45.3%. The pars-
ingaccuracyofthedefaultmodeljumpsfromastart-
ing value of 41.8% for the untrained model (which is
below the baseline) to 58.5% after a single training
iteration. The peak performance is reached after 8
iterations with 65.4%. The average accuracy of the
models obtained after 6-25 iterations is 65.1%.
40
45
50
55
60
65
70
0 2 4 6 8 10 12 14 16
accuracy
iterations
defaulttype-based
uniformunlexicalised
baseline
Figure 4: Exact match accuracy on data2
Results obtained with type-based training, where
each word receives the same weight ignoring its
frequency, were virtually identical to those of the
default model. If the parser training was started
with a uniform initial model, however, the accu-
racy dropped by about 6 percentage points. Figure 4
also shows that the performance of an unlexicalized
3In fact, we counted a word with n simplest analyses as 1/n
correct instead of actually selecting one analysis at random, in
ordertoavoidadependencyofthebaselineresultontherandom
number generation.
PCFG is about 13% lower.
We also experimented with a combination of un-
lexicalized and lexicalized training. Lexicalized
models have a huge number of parameters. There-
fore, there is a large number of locally optimal pa-
rameter settings to which the unsupervised training
can be attracted. Purely lexicalized training is likely
to get stuck in a local optimum which is close to the
starting point. Unlexicalized models, on the other
hand, have fewer parameters, a smaller number of
local optima and a smoother search space. Unlex-
icalized training is therefore more likely to reach a
globally (near-)optimal point and provides a better
starting point for the lexicalized training.
Figure 5 shows that initial unlexicalized training
indeed improves the accuracy of the parser. With
one iteration of unlexicalized training (see “unlex 1”
in figure 5), the accuracy increased by about 3%.
The maximum of 68.4% was reached after 4 iter-
ations of lexicalized training. The results obtained
with 2 iterations of unlexicalized training were very
similar. With 3 iterations, the performance dropped
almost to the level of the default model. It seems
that some of the general preferences learned during
unlexicalized training are so strong after three itera-
tions that they cannot be overruled anymore by the
lexeme-specific preferences learned in the lexical-
ized training.
40
45
50
55
60
65
70
0 2 4 6 8 10 12 14 16 18
accuracy
iterations
defaultunlex 1
unlex 2unlex 3
Figure 5: Results on data2 with 0 (default), 1, 2,
or 3 iterations of unlexicalized training, followed by
lexicalized training
In order to assess the parsing results qualitatively,
520
100 parsing errors of version “unlex 2” were ran-
domly selected and inspected. It turned out that
the parser always preferred right-branching struc-
tures over left-branching structures in complex com-
pounds with three or more elements, which resulted
in 57 errors caused by left-branching structures.
Grammars trained without the initial unlexicalized
training showed no systematic preference for right-
branching structures. In the test data, left-branching
structures were two times more frequent than right-
branching structures.
29 disambiguation errors resulted from selecting
the wrong stem although the structure of the analy-
sis was otherwise correct. In the word Rechtskon-
struktion (legal construction), for instance, the first
elementRechtswasderivedfromtheadjectiverechts
(right) rather than the noun Recht (law). Similarly,
the adjective quelloffen (open-source) was derived
from the verb quellen (to swell) rather than the noun
Quelle (source).
Six errors involved a combination of compound-
ing and suffix derivation (e.g. the word Flugbe-
gleiterin (stewardess)). The parser preferred the
analysis where the derivation is applied first (Flug-
Begleiterin (flight attendant[female])), whereas in
the gold standard analysis, the compound is formed
first (Flugbegleiter-in (steward-ess).
In order to better understand the benefits of unlex-
icalized training, we also examined the differences
between the best model obtained with one iteration
of unlexicalized (unlex1), and the best model ob-
tained without unlexicalized training (default).
30 cases involved left-branching vs. right-
branching compounds. The unlex1 model showed a
higherpreferenceforright-branchingstructuresthan
the default model, but produced also left-branching
structures (unlike the model unlex2). In 15 of
the 30 cases, unlex1 correctly decided for a right-
branching structure; in 13 cases, unlex1 was wrong
with proposing a right-branching structure. In two
cases, unlex1 correctly predicted a left-branching
structure and the default model predicted a right-
branching structure.
32 differences were caused by lexical ambigui-
ties. In 24 cases, only one stem was ambiguous. 15
times unlex1 was right (e.g. Moskaureise - Moskow
trip[sg] vs. Moskow rice[pl]) and nine times the de-
fault model was right (e.g. Jodtabletten - iodine pill
vs. iodine tablet). In 8 cases, two morphemes were
involved in the ambiguity. In all these cases, un-
lex1 generated the correct analysis (e.g. Sportraum -
“sport room” vs. “Spor[name] dream”).
Nine ambiguities involved the length of verb pre-
fixes. Six times, unlex1 correctly decided for a
longerprefix(e.g.gegen¨uber-stehen(toface)instead
of gegen-¨uberstehen (to “counter-survive”).
51
52
53
54
55
56
57
58
59
0 5 10 15 20 25
accuracy
iterations
defaultunlex 1
unlex 2
Figure 6: Accuracy on data1 after 0, 1, or 2 itera-
tions of unlexicalized training followed by lexical-
ized training
In another experiment, we tested the parser on the
first test data set (data1) where simplex words, part-
of-speech ambiguities, frequent words and repeated
occurrences were not removed. The baseline accu-
racy on this data was 43.75%. Figure 6 shows the
results obtained with different numbers of unlexical-
izedtrainingiterationsanalogoustofigure5. Strictly
lexicalized training produced the best results, here.
The maximal accuracy was 58.59% which was ob-
tained after 7 iterations. In contrast to the exper-
iments on data2, the accuracy decreased by more
than 1.5% when the training was continued. As said
in the introduction, we think that part-of-speech am-
biguities are better resolved by a part-of-speech tag-
ger and that frequent words can be disambiguated
manually.
8 Related Work
New methods are often first developed for English
and later adapted to other languages. This might ex-
plain why morphological disambiguation has been
521
so rarely addressed in the past: English morphology
is seldom ambiguous except for noun compounds.
We are not aware of any work on the disam-
biguation of morphological analyses which is di-
rectly comparable to ours. Mark Lauer (1995) only
considered English noun compounds and applied a
different disambiguation strategy based on word as-
sociation scores.
Koehn and Knight (2003) proposed a splitting
method for German compounds and showed that
it improves statistical machine translation. Com-
pounds are split into smaller pieces (which have to
be words themselves) if the geometric mean of the
word frequencies of the pieces is higher than the fre-
quency of the compound. Information from a bilin-
gualcorpusisusedtoimprovethesplittingaccuracy.
Andreas Eisele (unpublished work) implemented
a statistical disambiguator for German based on
weighted finite-state transducers as described in the
introduction. However, his system fails to represent
and disambiguate the ambiguities observed in com-
pounds with three or more elements and similar con-
structions with structural ambiguities.
9 Summary
We presented a disambiguation method for German
morphological analyses which is based on a head-
lexicalized probabilistic context-free grammar. The
words are split into morpheme lattices by a finite
state morphology, and then parsed with the prob-
abilistic context-free grammar. The grammar was
trained on unlabeled data using the Inside-Outside
algorithm and evaluated on 807 manually disam-
biguated analyses of infrequent words. Different
training strategies have been compared. A com-
bination of one iteration of unlexicalized training
and four iterations of lexicalized training returned
the best results with over 68% exact match accu-
racy, compared to a baseline of 45% which was ob-
tained by randomly choosing one of the minimally
complex analyses. Without lexicalization, the ac-
curacy dropped by 15 percentage points, indicating
that lexicalization is essential for morphological dis-
ambiguation.
10 Future Work
There are several starting points for improvement.
Guidelines should be developed for the manual an-
notationofdatainordertomakeitlessdependenton
the annotator’s intuitions. More data should be an-
notated to create a treebank of morphological anal-
yses. Given such a treebank, the parser could be
trained on labeled data or on a combination of la-
beled and unlabeled data, which presumably would
further increase the parsing accuracy.
References
Glenn Carroll and Mats Rooth. 1998. Valence induc-
tion with a head-lexicalized PCFG. In Proceedings of
theThirdConferenceonEmpiricalMethodsinNatural
Language Processing, Granada, Spain.
Eugene Charniak. 1997. Statistical parsing with a
context-free grammar and word Statistics. In Proceed-
ings of the 14th National Conference on Artificial In-
telligence, Menlo Parc.
Philipp Koehn and Kevin Knight. 2003. Empirical meth-
ods forcompound splitting. In Proceedings of the 10th
Conference of the European Chapter of the Associ-
ation for Computational Linguistics, Budapest, Hun-
gary.
K. Lari and S. Young. 1990. The estimation of stochas-
tic context-free grammars using the inside-outside al-
gorithm. Computation Speech and Language Process-
ing, 4:35–56.
Mark Lauer. 1995. Corpus statistics meet the noun com-
pound: Some empirical results. In Proceedings of the
33rd Annual Meeting of the ACL, Massachusetts In-
stitute of Technology, pages 47–54, Cambridge, Mass.
electronically available at http://xxx.lanl.gov/abs/cmp-
lg/9504033.
Anne Schiller. 1996. Deutsche Flexions- und Kom-
positionsmorphologie mit PC-KIMMO. In Roland
Hausser, editor, Proceedings, 1. Morpholympics, Er-
langen, 7./8. Mrz 1994, T¨ubingen. Niemeyer.
Tanja Schmid, Anke Ldeling, Bettina Suberlich, Ulrich
Heid, and Bernd Mbius. 2001. DeKo: Ein System zur
Analyse komplexer Wrter. In GLDV - Jahrestagung
2001, pages 49–57.
Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004.
SMOR: A German computational morphology cover-
ing derivation, composition and inflection. In Pro-
ceedings of the 4th International Conference on Lan-
guage Resources and Evaluation, volume 4, pages
1263–1266, Lisbon, Portugal.
522
