Tagging Grammatical Functions 
Thorsten Brants, Wojciech Skut, Brigitte Krenn 
Universit~t des Saarlandes 
Computational Linguistics 
D-66041 Saarbr/icken, Germany 
{brant s, skut,krenn}@coli.uni-sb, de 
Abstract 
This paper addresses issues in automated 
treebank construction. We show how stan- 
dard part-of-speech tagging techniques ex- 
tend to the more general problem of struc- 
tural annotation, especially for determi- 
ning grammatical functions and syntactic 
categories. Annotation is viewed as an in- 
teractive process where manual and auto- 
matic processing alternate. Efficiency and 
accuracy results are presented. We also dis- 
cuss further automation steps. 
1 Introduction 
The aim of the work reported here is to construct a 
corpus of German annotated with syntactic structu- 
res (treebank). The required size of the treebank and 
granularity of encoded information make it neces- 
sary. to ensure high annotation efficiency and accu- 
racy. Annotation automation has thus become one 
of the central issues of the project. 
In this section, we discuss the relation between au- 
tomatic and manual annotation. Section 2 focuses 
on the annotation format employed in our treebank. 
The annotation software is presented in section 3. 
Sections 4 and 5 deal with automatic assignment of 
grammatical functions and phrasal categories. Ex- 
periments on automating the annotation are presen- 
ted in section 6. 
1.1 Automatic vs. Manual Annotation 
A problem for corpus annotation is the trade-off bet- 
ween efficiency, accuracy and coverage. Although 
accuracy increases significantly as annotators gain 
expertise, incorrect hand-parses still occur. Their 
frequency depends on the granularity of the enco- 
ded information. 
Due to this residual error rate, automatic anno- 
tation of frequently occurring phenomena is likely 
to yield better results than even well-trained hu- 
man annotators. For infrequently occurring con- 
structions, however, manual annotation is more re- 
liable, as is manual annotation of phenomena invol- 
ving non-syntactic information (e.g., resolution of 
attachment ambiguities based on world knowledge). 
As a consequence, efficiency and reliability of an- 
notation can be significantly increased by combining 
automatic annotation with human processing skills 
and supervision, especially if this combination is im- 
plemented as an interactive process. 
2 Annotation Scheme 
Existing treebanks of English ((Marcus et al., 1994), 
(Sampson, 1995), (Black et al., 1996)) contain con- 
ventional phrase-structure trees augmented with an- 
notations for discontinuous constituents. As this en- 
coding strategy is not well-suited to a free word or- 
der language like German, we have focussed on a less 
surface-oriented level of description, most closely re- 
lated to the LFG f-structure, and representations 
used in dependency grammar. To avoid confusion 
with theory-specific constructs, we use the generic 
term argument structure to refer to our annotation 
format. The main advantages of the model are: it is 
relatively theory-independent and closely related to 
semantics. For more details on the linguistic speci- 
fications of the annotation scheme see (Skut et al., 
1997). A similar approach has been also successfully 
applied in the TSNLP database, cf. (Lehmann et al., 
1996). 
In contrast to conventional phrase-structure 
grammars, argument structure annotations are not 
influenced by word order. Local and non-local de- 
pendencies are represented in the same way, the 
latter indicated by crossing branches in the hier- 
archical structure, as shown in figure 1 where in 
the VP the terminals of the direct object OA (den 
Traum yon der kleinen Gastst~tte) are not adjacent 
to the head HD aufgegeben 1. For a related handling 
1 See appendix A for a description of tags used throng- 
64 
Den 
ART 
The 
Traum 
NN 
dream 
+ 
von 
APPR 
of 
der kleinen Gastst"atte hat er noch 
ART ADJA NN VAFIN PPER ADV 
the small inn has he yet 
'He has not yet given up the dream of a small inn.' 
nicht 
PTKNEG 
not 
aufgegeben 
VVPP 
given up 
Figure 1: Example sentence 
of non-projective phenomena see (Tapanainen and 
J/irvinen, 1997). 
Such a representation permits clear separation of 
word order (in the surface string) and syntactic de- 
pendencies (in the hierarchical structure). Thus 
we avoid explicit explanatory statements about the 
complex interrelation between word order and syn- 
tactic structure in free word order languages. Such 
statements are generally theory-specific and there- 
fore are not appropriate for a descriptive approach 
to annotation. The relation between syntactic de- 
pendencies and surface order can nontheless be in- 
ferred from the data. This provides a promising way 
of handling free word order phenomena. 2. 
3 Annotation Tool 
Since syntactic annotation of corpora is time- 
consuming, a partially automated annotation tool 
has been developed in order to increase efficiency. 
3.1 The User Interface 
For optimal human-machine interaction, the tool 
supports immediate graphical representation of the 
structure being annotated. 
Since keyboard input is most efficient for assigning 
categories to words and phrases, cf. (Lehmann et al., 
1996; Marcus et al., 1994), and structural manipula- 
tions are executed most efficiently using the mouse, 
both an elaborate keyboard and optical interface is 
provided. As suggested by Robert MacIntyre 3, it is 
hout this paper. 
2'Free' word order is a function of several interacting 
parameters such as category, case and topic-focus arti- 
culation. Varying the order of words in a sentence yields 
a continuum of grammaticality judgments rather than a 
simple right-wrong distinction. 
3personal communication, Oct. 1996 
most efficient to use one hand for structural com- 
mands with the mouse and the other hand for short 
keyboard input. 
By additionally offering online menus for com- 
mands and labels, the tool suits beginners as well 
as experienced users. Commands such as "group 
words", "group phrases", "ungroup", "change la- 
bels", "re-attach nodes", "generate postscript out- 
put", etc. are available. 
The three tagsets (word, phrase, and edge labels) 
used by the annotation tool are variable. They are 
stored together with the corpus, which allows easy 
modification and exchange of tagsets. In addition, 
appropriateness checks are performed automatically. 
Comments can be added to structures. 
Figure 2 shows a screen dump of the graphical 
interface. 
3.2 Automating Annotation 
Existing treebank annotation tools are characterised 
by a high degree of automation. The task of the 
annotator is to correct the output of a parser, i.e., 
to eliminate wrong readings, complete partial parses, 
and adjust partially incorrect ones. 
Since broad-coverage parsers for German, espe- 
cially robust parsers that assign predicate-argument 
structure and allow crossing branches, are not availa- 
ble, or require an annotated traing corpus (cf. (Col- 
lins, 1996), (Eisner, 1996)). 
As a consequence, we have adopted a bootstrap- 
ping approach, and gradually increased the degree 
of automation using already annotated sentences as 
training material for a stochastic processing module. 
This aspect of the work has led to a new model 
of human supervision. Here automatic annotation 
and human supervision are combined interactively 
whereby annotators are asked to confirm the local 
65 
- G_enm'al: 
Corpus: IRefCorpus Tes~,ople I J~\] 
Editor: IThorsten J~\] 
I-~-Ii _,,oa, li E.,t i i O~,o°. i 
-Sentence: 
No.: 4 / 1269 
Comment: I 
Odgln: refcorp.tt 
Last edited: Thorsten, 28/05/97, 14:08:48 
Es o spleit I ebe~ keine 3 Roll% 
PPER WFIN ADV PlAT NN 
I<U 
511 
KOUS ART NN 
gef"~llg 9 iS~o 
ADJD VAFIN 
-Move: 
I .~r~, II _"..' I ~_o'o:' , 
I -,o II +,o I D ~,,,e, 
I -'°° II +,oo I Mat~eo:o 
i r--_Dependeney: / -s¢~''°n: I 
_Command: I 
| i ~-"~'° I 
I 
IB 
_ our o ,o "°u°"' ;i' mu, 
I i 
T ag: 
Node no.: I J 
Zag: I IB 
I-""' II "-'°~ I1-~-I 
I Switchin~ to sentence no, 4.,. Done. 
Figure 2: Screen dump of the annotation tool 
predictions of the parser. The size of such 'super- 
vision increments' varies from local trees of depth 
one to larger chunks, depending on the amount of 
training data available. 
We distinguish six degrees of automation: 
0) Completely manual annotation. 
1) The user determines phrase boundaries and 
syntactic categories (S, NP, VP, ...). The pro- 
gram automatically assigns grammatical func- 
tions. The annotator can alter the assigned tags 
(cf. figure 3). 
2) The user only determines the components of a 
new phrase (local tree of depth 1), while both 
category and function labels are assigned auto- 
matically. Again, the annotator has the option 
of altering the assigned tags (cf. figure 4). 
3) The user selects a substring and a category, 
whereas the entire structure covering the sub- 
string is determined automatically (cf. figure 5). 
4) The program performs simple bracketing, i.e., 
finds 'kernel phrases' without the user having 
to explicitly mark phrase boundaries. The task 
can be performed by a chunk parser that is 
equipped with an appropriate finite state gram- 
mar (Abney, 1996). 
5) The program suggests partiM or complete par- 
ses. 
A set of 500 manually annotated training sent- 
ences (step 0) was sufficient for a statistical tagger 
to reliably assign grammatical functions, provided 
the user determines the elements of a phrase and 
its category (step 1). Approximately 700 additio- 
nal sentences have been annotated this way. An- 
notation efficiency increased by 25 %, namely from 
an average annotation time of 4 minutes to 3 minu- 
tes per sentence (300 to 400 words per hour). The 
1,200 sentences were used to train the tagger for au- 
tomation step 2. Together with improvements in the 
user interface, this increased the efficiency by ano- 
ther 33%, from approximately 3 to 2 minutes (600 
words per hour). The fastest annotators cover up to 
66 
das 1993 startende Bonusprogramm for Vielflieger 
ART CARD ADJA NN APPR NN 
'the bonus program for .h'equent fliers starting in 1993' 
Figure 3: Example for automation level 1: the user 
has marked das, the AP, Bonusprogramm, and the 
PP as a constituent of category NP, and the tool's 
task is to determine the new edge labels (marked 
with question marks), which are, from left to right, 
NK, NK, NK, MNR. 
das 1993 startende Bonusprogramm ffir Vielflieger 
ART CARD ADJA NN APPR NN 
'the bonus program for frequent fliers starting in 1993' 
Figure 4: Example for automation level 2: the user 
has marked das, the AP, Bonusprogramm and the PP 
as a constituent, and the tool's task is to determine 
the new node and edge labels (marked with question 
marks). 
1000 words per hour. 
At present, the treebank comprises 3000 sent- 
ences, each annotated independently by two anno- 
tators. 1,200 of the sentences are compared with 
the corresponding second annotation and are clea- 
ned, 1,800 are currently cleaned. 
In the following sections, the automation steps 1 
and 2 are presented in detail. 
4 Tagging Grammatical Functions 
4.1 The Tagger 
In contrast to a standard part-of-speech tagger 
which estimates lexical and contextual probabilities 
of tags from sequences of word-tag pairs in a corpus, 
(e.g. (Cutting et al., 1992; Feldweg, 1995)), the tag- 
ger for grammatical functions works with lexical and 
contextual probability measures Pq(.) depending on 
the category of the mother node (Q). Each phrasal 
category (S, VP, NP, PP etc.) is represented by a 
different Markov model. The categories of the dau- 
+++®+ ++ 
das 1993 startende Bonusprograrnm for Vielflieger 
ART CARD ADJA NN APPR NN 
'the bonus program for frequent fliers starting in 1993' 
Figure 5: Example for automation level 3: the user 
has marked the words as a constituent, and the tool's 
task is to determine simple sub-phrases (the AP and 
PP) as well as the new node and edge labels (cf. 
previous figures ~br the resulting structure). 
Selbst 
ADV 
himself 
l"l 
besucht hat Peter 
VVPP VAFIN NE 
visited has Peter 
+l 
Sabine nie 
NE ADV 
Sabine never 
'Peter never visited Sabine himself' 
Figure 6: Example sentence 
ghter nodes correspond to the outputs of the Mar- 
kov model, while grammatical functions correspond 
to states. 
The structure of a sample sentence is shown in 
figure 6. Figure 7 shows those parts of the Markov 
models for sentences (S) and verb phrases (VP) that 
represent the correct paths for the example. 4 
Given a sequence of word and phrase categories 
T = T1...Tk and a parent category Q, we cal- 
culate the sequence of grammatical functions G = 
G1 ... Gk that link T and Q as 
argmaxPQ(GIT ) (1) 
G 
Pq(a). Pq(TIC) = argmax 
a PQ(T) 
= argm xPq(a). Pq(TJG) 
G 
Assuming the Markov property we have 
4cf. appendix A for a description of tags used in the 
example 
67 
VP VA FIN NE A D V 
&--@--®---®--@--® 
O ~m m ~a 
o ~ .2. ~ ~- 
ADV VVPP NE 
Ps(ADVIMO) 1 PVp (VVPP IHD) l Pvp(N~IOA) 1 
N ~ o 
o ~ d d 
Figure 7: Parts of the Markov models used in Selbst besucht hat Peter Sabine hie (cf. figure 6). All unused 
states, transitions and outputs are omitted. 
and 
k 
PQ(TIG) = II PQ(~qlG,) (2) 
i=1 
k 
Pq(a) = II P (a, lC,) (3) 
i=1 
The contexts Ci are modeled by a fixed number of 
surrounding elements. Currently, we use two gram- 
matical functions, which results in a trigram model: 
PO(G) = H Po(GiIGi-2, Gi-1) (4) 
i=1 
The contexts are smoothed by linear interpolation 
of unigrams, bigrams, and trigrams. Their weights 
are calculated by deleted interpolation (Brown et al., 
1992). 
The predictions of the tagger are correct in ap- 
prox. 94% of Ml cases. In section 4.3, we demons- 
trate how to cope with wrong predictions. 
4.2 Serial Order 
As the annotation format permits trees with crossing 
branches, we need a convention for determining the 
relative position of overlapping sibling phrases in or- 
der to assign them a position in a Markov model. For 
instance, in figure 6 the range of the terminal node 
positions of VP overlaps with those of the subject 
$B and the finite verb HD. Thus there is no single 
a-priori position for the VP node 5. 
The position of a phrase depends on the position 
of its descendants. We define the relative order of 
two phrases recursively as the order of their anchors, 
i.e., some specified daughter nodes. If the anchors 
are words, we simply take their linear order. 
The exact definition of the anchor is based on lin- 
guistic knowledge. We choose the most intuitive al- 
ternative and define the anchor as the head of the 
phrase (or some equivalent function). Noun phrases 
do not necessarily have a unique head; instead, we 
use the last element in the noun kernel (elements 
of the noun kernel are determiners, adjectives, and 
nouns) to mark the anchor position. Except for NPs, 
we employ a default rule that takes the leftmost ele- 
ment as the anchor in case the phrase has no (uni- 
que) head. 
Thus the position of the VP in figure 6 is defined 
as equal to the string position of besucht. The po- 
sition of the VP node in figure 1 is equal to that of 
anfgegeben, and the position of the NP in figure 3 is 
equivalent to that of Bonusprograrara. 
4.3 Reliability 
Experience gained from the development of the Penn 
Treebank (Marcus et al., 1994) has shown that au- 
SWithout crossing edges, the serial order of phrases 
is trivial: phrase Q1 precedes phrase Q2 if and only if 
all terminal nodes derived from Qa precede those of Q2. 
This suffices to uniquely determine the order of sibling 
nodes. 
68 
tomatic annotation is useful only if it is absolutely 
correct, while wrong analyses are often difficult to 
detect and their correction can be time-consuming. 
To prevent the human annotator from missing er- 
rors, the tagger for grammatical functions is equip- 
ped with a measure for the reliability of its output. 
Given a sequence of categories, the tagger cal- 
culates the most probable sequence of grammatical 
functions. In addition, it computes the probabili- 
ties of the second-best functions of each daughter 
node. If some of these probabilities are close to that 
of the best sequence, the alternatives are regarded 
as equally suited and the most probable one is not 
taken to be the sole winner, the prediction is marked 
as unreliable in the output of the tagger. 
These unreliable predictions can be further classi- 
fied in that we distinguish "unreliable" sequences as 
opposed to "almost reliable" ones. 
The distance between two probabilities for the 
best and second-best alternative, Pbest and Pse¢ond, 
is measured by their quotient. The classification of 
reliability is based on thresholds. In the current im- 
plementation we employ three degrees of reliability 
which are separated by two thresholds 01 and 02. 01 
separating unreliable decisions from those conside- 
red almost reliable. 02 marks the difference between 
almost and fully reliable predictions. 
Unreliable: 
Pbes-----k- < 01 
Pseeond 
The probabilities of alternative assignments are wi- 
thin some small specified distance. In this case, it 
is the annotator who has to specify the grammatical 
function. 
Almost reliable: 
01 < Pbes_____t__ < 02 
Psecond 
The probability of an alternative is within some lar- 
ger distance. In this case, the most probable func- 
tion is displayed, but the annotator has to confirm 
it. 
Reliable: 
Pbes-----L- __> 02 
Psecond 
The probabilitiesof all alternatives are much smaller 
than that of the best assignment, thus the latter is 
assigned. 
For efficiency, an extended Viterbi algorithm is 
used. Instead of keeping track of the best path only 
(of. (Rabiner, 1989)), we keep track of all paths that 
fall into the range marked by the probability of the 
best path and 02, i.e., we keep track of all alternative 
paths with probability Palt for which 
Pbest 
Part _> 02 " 
Suitable values for 01 and 02 were determined em- 
pirically (cf. section 6). 
5 Tagging Phrase Categories 
The second level of automation (cf. section 3) au- 
tomates the recognition of phrasal categories, and 
so frees the annotator from typing phrase labels. 
The task is performed by an extension of the tag- 
ger presented in the previous section where different 
Markov models for each category were introduced. 
The annotator determines the category of the cur- 
rent phrase, and the tool runs the appropriate model 
to determine the edge labels. 
To assign the phrase label automatically, we run 
all models in parallel. Each model assigns gramma- 
tical functions and, more important for this step, 
a probability to the phrase. The model assigning 
the highest probability is assumed to be most ade- 
quate, and the corresponding label is assigned to the 
phrase. 
Formally, we calculate the phrase category Q (and 
at the same time the sequence of grammatical func- 
tions G = G1 ... Gk) on the basis of the sequence of 
daughters T = T1 ... Tk with 
argmax maXPQ(G\]T). O G 
This procedure is equivalent to a different view 
on the same problem involving one large (combined) 
Markov model that enables a very efficient calcula- 
tion of the maximum. 
Let ~Q be the set of all grammatical functions 
that can occur within a phrase of type Q. Assume 
that these sets are pairwise disjoint. One can easily 
achieve this property by indexing all used gramma- 
tical functions with their associated phrases and, if 
necessary, duplicating labels, e.g., instead of using 
HD, MO, ..., use the indexed labels HDs, HDvp, 
MONp, ...This property makes it possible to deter- 
mine a phrase category by inspecting the gramma- 
tical functions involved. 
When applied, the combined model assigns gram- 
matical functions to the elements of a phrase (not 
knowing its category in advance). If transitions bet- 
ween states representing labels with different indices 
are forced to zero probability (together with smoo- 
thing applied to other transitions), all labels assi- 
gned to a phrase get the same index. This uniquely 
identifies a phrase category. 
The two additional conditions 
G e GQi :=v G ¢ GQ2 (Qi ¢ Q2) 
and 
G1 E CO A G2 ~ GQ :::V P(G2\[G1) = 0 
69 
are sufficient to calculate 
argmax P( G\[T) 
G 
using the Viterbi algorithm and to identify both 
the phrase category and the respective grammatical 
functions. 
Again, as described in section 4, we calculate pro- 
babilities for alternative candidates in order to get 
reliability estimates. 
The overall accuracy of this approach is approx. 
95%, and higher if we only consider the reliable ca- 
ses. Details about the accuracy are reported in the 
next section. 
6 Experiments 
To investigate the possibility of automating anno- 
tation, experiments were performed with the clea- 
ned part of the treebank 6 (approx. 1,200 sentences, 
24,000 words). The first run of experiments was car- 
ried out to test tagging of grammatical functions, the 
second run to test tagging of phrase categories. 
6.1 Grammatical Functions 
This experiment tested the reliability of assigning 
grammatical functions given the category of the 
phrase and the daughter nodes (supplied by the an- 
notator). 
Let us consider the sentence in figure 6: two se- 
quences of grammatical functions are to be determi- 
ned, namely the grammatical functions of the dau- 
ghter nodes of S and VP. The information given 
for selbst besucht Sabine is its category (VP) and 
the daughter categories: adverb (ADV), past parti- 
ciple (wee), and proper noun (NE). The task is 
to assign the functions modifier (MO) to ADV, head 
(SO) to wee and direct (accusative) object (OA) 
to NE. Similarly, function tags are assigned to the 
components of the sentence (S). 
The tagger described in section 4 was used. 
The corpus was divided into two disjoint parts, 
one for training (90% of the respective corpus), and 
one for testing (10%). This procedure was repeated 
10 times with different partitions. Then the average 
accuracy was calculated. 
The thresholds for search beams were set to 61 = 5 
and 62 = 100, i.e., a decision is classified as reliable 
if there is no alternative with a probability larger 
than 1~0 of the best function tag. The prediction 
is classified as unreliable if the probability of an al- 
ternative is larger than ~ of the most probable tag. 
6The corpus is part of the German newspaper text 
provided on the ECI CD-ROM. It has been part-of- 
speech tagged and manually corrected previously, cf. 
(Thielen and Schiller, 1995). 
Table 1: Levels of reliability and the percentage ca- 
ses where the tagger assigned a correct grammati- 
cal function (or would have assigned if a decision is 
forced). 
reliable 
marked 
unreliable 
overall 
cases correct 
89% 96.7% 
7% 84.3% 
4% 57.3% 
100% 94.2% 
If there is an akernative between these two thres- 
holds, the prediction is classified as almost reliable 
and marked in the output (cf. section 4.3: marked 
assignments are to be confirmed by the annotator, 
unreliable assignments are deleted, annotation is left 
to the annotator). 
Table 1 shows tagging accuracy depending on the 
three different levels of reliability. The results con- 
firm the choice of reliability measures: the lower the 
reliability, the lower the accuracy. 
Table 2 shows tagging accuracy depending on the 
category of the phrase and the level of reliability. 
The table contains the following information: the 
number of all mother-daughter relations (i.e., num- 
ber of words and phrases which are immediately do- 
minated by a mother node of a particular category), 
the overall accuracy for that phrasal category and 
the accuraciees for the three reliability intervals. 
6.2 Error Analysis for Function 
Assignment 
The inspection of tagging errors reveals several sour- 
ces of wrong assignments. Table 3 shows the 10 most 
frequent errors 7 which constitute 25% of all errors 
(1509 errors occurred during 10 test runs). 
Read the table in the following way: line 2 shows 
the second-most frequent error. It concerns NPs oc- 
curring in a sentence (S); this combination occurred 
1477 times during testing. In 286 of these occur- 
rences the N P is assigned the grammatical function 
OA (accusative object) manually, but of these 286 
cases the tagger assigned the function SB (subject) 
56 times. 
The errors fall into the following classes: 
1. There is insufficient information in the node la- 
bels to disambiguate the grammatical function. 
Line 1 is an example for insufficient information. 
The tag NP is uninformative about its case and the- 
refore the tagger has to distinguish SB (subject) and 
7See appendix A for a description of tags used in the 
table. 
70 
Table 2: Tagging accuracy for assigning grammatical 
functions depending on the category of the mother 
node. For each category, the first row shows the per- 
centage of branches that occur within this category 
and the overall accuracy, the following rows show the 
relative percentage and accuracy for different levels 
of reliability. 
cases correct 
S 26% 89.1% 
decision 85% 92.7% 
marked 8% 81.9% 
no decision 7% 52.9% 
VP 7% 90.9% 
decision 97% 92.2% 
marked 1% 57.7% 
no decision 2% 52.3% 
NP 26% 96.4% 
decision 86% 98.6% 
marked 10% 86.8% 
no decision 4% 73.0% 
PP 24% 97.9% 
decision 92% 99.2% 
marked 6% 85.8% 
no decision 2% 75.5% 
others 18% 94.7% 
decision 91% 98.0% 
marked 6% 82.8% 
no decision 3% 22.1% 
Table 3: The 10 most frequent errors in assigning 
grammatical functions. The table shows a mother 
and a daughter node category, the frequency of this 
particular combination (sum over 10 test runs), the 
grammatical function assigned manually (and its fre- 
quency) and the grammatical function assigned by 
the tagger (and its frequency). 
phrase elem f original assigned 
5. 
6. 
7. 
8. 
9. 
10. 
1. S 
2. S 
3. NP 
4. S 
PP 
VP 
S 
S 
S 
VP 
NP 1477 
NP 1477 
PP 470 
VP 613 
PP 252 
NP 286 
NP 1477 
NP 1477 
S 186 
PP 453 
SB 
OA 
PG 
PD 
PG 
DA 
PD 
MO 
MO 
SBP 
894 OA 
286 SB 
52 MNR 
47 OC 
30 MNR 
32 OA 
72 SB 
33 SB 
78 PD 
21 MO 
65 
56 
50 
42 
30 
26 
25 
21 
21 
21 
OA (accusative object) on the basis of its position, 
which is not very reliable in German. Missing in- 
formation in the labels is the main source of errors. 
Therefore, we currently investigate the benefits of a 
morphological component and percolation of selec- 
ted information to parent nodes. 
2. Due to the n-gram approach, the tagger only 
sees a local window of the sentences. 
Some linguistic knowledge is inherently global, e.g., 
there is at most one subject in a sentence and one 
head in a VP. Errors of this type may be reduced by 
introducing finite state constraints that restrict the 
possible sequences of functions within each phrase. 
3. The manual annotation is wrong, and a correct 
tagger prediction is counted as an error. 
At earlier stages of annotation, the main source of 
errors was wrong or missing manual annotation. In 
some cases, the tagger was able to abstract from 
these errors during the training phase and subse- 
quently assigned the correct tag for the test data. 
However, when performing a comparison against the 
corpus, these differences are marked as errors. Most 
of these errors were eliminated by comparing two 
independent annotations and cleaning up the data. 
6.3 Phrase Categories 
In this experiment, the reliability of assigning phrase 
categories given the categories of the daughter nodes 
(they are supplied by the annotator) was tested. 
Consider the sentence in figure 6: two phrase ca- 
tegories are to be determined (VP and S). The in- 
formation given for selbst besucM Sabine is the se- 
quence of categories: adverb (ADV), past participle 
71 
Table 4: Levels of reliability and the percentage of 
cases in which the tagger assigned a correct phrase 
category (or would have assigned if a decision is 
forced). 
reliable 
marked 
unreliable 
overall 
cases correct 
79% 98.5% 
16% 90.4% 
5% 65.9% 
100% 95.4% 
(VVPP), and proper noun (NE). The task is to as- 
sign category VP. Subsequently, S is to be assigned 
based on the categories of the daughters VP, VAFIN, 
NE, and ADV. 
The extended tagger using a combined model as 
described in section 5 was applied. 
Again, the corpus is divided into two disjoint 
parts, one for training (90% of the corpus), and 
one for testing (10%). The procedure is repeated 
10 times with different partitions. Then the average 
accuracy was calculated. 
The same thresholds for search beams as for the 
first set of experiments were used. 
Table 4 shows tagging accuracy depending on the 
three different levels of reliability. 
Table 5 shows tagging accuracy depending on the 
category of the phrase and the level of reliability. 
The table contains the following information: the 
percentage of occurrences of the particular phrase, 
the overall accuracy for that phrasal category and 
the accuracy for each of the three reliability inter- 
vals. 
6.4 Error Analysis for Category 
Assignment 
When forced to make a decision (even in unrelia- 
ble cases) 435 errors occured during the 10 test 
runs (4.5% error rate). Table 6 shows the 10 most- 
frequent errors which constitute 50% of all errors. 
The most frequent error was the confusion of S 
and VP. They differ in that sentences S contain fi- 
nite verbs and verb phrases VP contain non-finite 
verbs. But the tagger is trained on data that con- 
tain incomplete sentences and therefore sometimes 
erroneously assumes an incomplete S instead of a 
VP. To avoid this type of error, the tagger should 
be able to take the neighborhood of phrases into ac- 
count. Then, it could detect the finite verb that 
completes the sentence. 
Adjective phrases AP and noun phrases NP are 
confused by the tagger (line 5 in table 6), since al- 
most all AP's can be NP's. This error could also 
Table 5: Tagging accuracy for assigning phrase cate- 
gories, depending on the manually assigned category. 
For each category, the first row shows the percentage 
of phrases belongi:lg to a specific category (accor- 
ding to manual ~,zsignment) and the percentage of 
correct assignments. The following rows show the 
relative percentage and accuracy for different levels 
of reliability. 
cases correct 
S 20% 97.5% 
decision 96% 99.7% 
marked 2% 63.2% 
no decision 2% 29.0% 
VP 9% 93.2% 
decision 71% 96.4% 
marked 24% 91.3% 
no decision 5% 60.9% 
NP 29% 96.1% 
decision 81% 99.3% 
marked 13% 91.8% 
no decision 6% 64.9% 
PP 24% 98.7% 
decision 94% 99.6% 
marked 4% 92.5% 
no decision 2% 70.8% 
others 18% 89.0% 
decision 42% 91.7% 
marked 45% 90.6% 
no decision 12% 73.2% 
72 
Table 6: The 10 most frequent errors in assigning 
phrase categories (summed over reliability levels). 
The table shows the phrase category assigned manu- 
ally (and its frequency) and the category erroneously 
assigned by the tagger (and its frequency). 
I° 
2. 
3. 
4. 
5. 
6. 
7. 
8. 
9. 
10. 
phrase f assigned 
VP 828 S 
NP 2812 NM 
NP 2812 PP 
NP 2812 S 
AP 419 NP 
DL 20 CS 
PP 2298 NP 
S 1910 NP 
AP 419 PP 
MPN 293 NP 
46 
32 
31 
25 
15 
15 
15 
15 
11 
11 
be fixed by inspecting the context and detecting the 
associated NP. 
As for assigning grammatical functions, insuffi- 
cient information in the labels is a significant source 
of errors, cf. the second-most frequent error. A 
large number of cardinal-noun pairs forms a nume- 
rical component (NM), like 7 Millionen, 50 Prozent, 
etc (7 million, 50 percent). But this combination 
also occurs in NPs like 20 Leule, 3 Monate, ... (20 
people, 3 months), which are mis-tagged since they 
are less frequent. This can be fixed by introducing 
an extra tag for nouns denoting numericals. 
7 Conclusion 
A German newspaper corpus is currently being an- 
notated with a new annotation scheme especially de- 
signed for free word order languages. 
Two levels of automatic annotation (level 1: assi- 
gning grammatical functions and level 2: assigning 
phrase categories) have been presented and evalua- 
ted in this paper. 
The overall accuracy for assigning grammatical 
functions is 94.2%, ranging from 89% to 98%, de- 
pending on the type of phrase. The least accuracy 
is achieved for sentences, the best for prepositional 
phrases. By suppressing unreliable decisions, pre- 
cision can be increased to range from 92% to 99%. 
The overall accuracy for assigning phrase catego- 
ries is 95.4%, ranging from 89% to 99%, depending 
the category. By suppressing unreliable decisions, 
precision can also be increased to range from 92% to 
over 99%. 
In the error analysis, the following sources of mi- 
sinterpretation could be identified: insufficient lin- 
guistic information in the nodes (e.g., missing case 
information), and insufficient information about the 
global structure of phrases (e.g., missing valency in- 
formation). Morphological information in the tag- 
set, for example, helps to identify the objects and 
the subject of a sentence. Using a more fine-grained 
tagset, however, requires methods for adjusting the 
granularity of the tagset to the size (and coverage) 
of the corpus, in order to cope with the sparse data 
problem. 
8 Acknowledgements 
This work is part of the DFG Sonderforschungs- 
bereich 378 Resource-Adaptive Cognitive Processes, 
Project C3 Concu rent Grammar Processing. 
We wish to tl~ank the universities of Stuttgart 
and Tiibingen for kindly providing us with a hand- 
corrected part-of-speech tagged corpus. We also 
wish to thank Jason Eisner, Robert MacIntyre and 
Ann Taylor for valuable discussions on dependency 
parsing and the Penn Treebank annotation. Special 
thanks go to Oliver Plaehn, who implemented the 
annotation tool, and to our six fearless annotators. 

References 
Abney, Steven. 1996. Partial parsing via finite-state 
cascades. In Proceedings of the ESSLLI'96 Robust 
Parsing Workshop, Prague, Czech Republic. 
Black, Ezra, Stephen Eubank, Hideki Kashioka, 
David Magerman, Roger Garside, and Geoffrey 
Leech. 1996. Beyond skeleton parsing: Producing 
a comprehensive large-scale general-english tree- 
bank with full grammaticall analysis. In Proc. of 
COLING-96, pages 107-113, Kopenhagen, Den- 
mark. 
Brown, P. F., V. J. Della Pietra, Peter V. deSouza, 
Jenifer C. Lai, and Robert L. Mercer. 1992. Class- 
based n-gram models of natural language. Com- 
putational Linguistics, 18(4):467-479. 
Collins, Michael. 1996. A new statistical parser ba- 
sed on bigram lexical dependencies. In Procee- 
dings ofACL-96, Sant Cruz, CA, USA. 
Cutting, Doug, Julian Kupiee, Jan Pedersen, and 
Penelope Sibun. 1992. A practical part-of-speech 
tagger. In Proceedings of the 3rd Conference on 
Applied Natural Language Processing (ACL), pa- 
ges 133-140. 
Eisner, Jason M. 1996. Three new probabilistic 
models for dependency parsing: An exploration. 
In Proceedings of COLING-96, Kopenhagen, Den- 
mark. 
Feldweg, Helmut. 1995. Implementation and eva- 
luation of a german hmm for pos disambiguation. 
In Proceedings of EACL-SIGDAT-95 Workshop, 
Dublin, Ireland. 
Lehmann, Sabine, Stephan Oepen, Sylvie Regnier- 
Prost, Klaus Netter, Veronika Lux, Judith Klein, 
Kirsten Falkedal, Frederik Fouvry, Dominique 
Estival, Eva Dauphin, I-Ierv~ Compagnion, Judith 
Baur, Lorna Balkan, and Doug Arnold. 1996. TS- 
NLP -- Test Suites for Natural Language Proces- 
sing. In Proceedings of COLING 1996, Kopenha- 
gen. 
Marcus, Mitchell, Grace Kim, Mary Ann Marcinkie- 
wicz, Robert MacIntyre, Ann Bies, Mark Fergu- 
son, Karen Katz, and Britta Schasberger. 1994. 
The penn treebank: Annotating predicate argu- 
ment structure. In Proceedings of the Human Lan- 
guage Technology Workshop, San Francisco, Mor- 
gan Kaufmann. 
Rabiner, L. R. 1989. A tutorial on hidden markov 
models and selected applications in speech reco- 
gnition. In Proceedings of the IEEE, volume 77(2), 
pages 257-285. 
Sampson, Geoffrey. 1995. English for the Computer. 
Oxford University Press, Oxford. 
Skut, Wojciech, Brigitte Krenn, Thorsten Brants, 
and Hans Uszkoreit. 1997. An annotation scheme 
for free word order languages. In Proceedings of 
ANLP-97, Washington, DC. 
Tapanainen, Pasi and Timo J~irvinen. 1997. A non- 
projective dependency parser. In Proceedings of 
ANLP-97, Washington, DC. 
Thielen, Christine and Anne Schiller. 1995. Ein klei- 
nes und erweitertes Tagset ffirs Deutsche. In Ta- 
gungsberichte des Arbeitstreffens Lezikon + Text 
17./18. Februar 1994, Schlofl Hohent~bingen. Le- 
zicographica Series Maior, Tfibingen. Niemeyer. 
