A Maximum-Entropy Partial Parser for Unrestricted Text 
Wojciech Skut and Thorsten Brants 
Universität des Saarlandes
Computational Linguistics
D-66041 Saarbrücken, Germany
{skut, brants}@coli.uni-sb.de
Abstract 
This paper describes a partial parser that as- 
signs syntactic structures to sequences of part- 
of-speech tags. The program uses the maximum 
entropy parameter estimation method, which allows
a flexible combination of different knowledge
sources: the hierarchical structure, parts
of speech and phrasal categories. In effect, the 
parser goes beyond simple bracketing and recog- 
nises even fairly complex structures. We give 
accuracy figures for different applications of the 
parser. 
1 Introduction 
The maximum entropy framework has proved 
to be a powerful modelling tool in many ar- 
eas of natural language processing. Its ap- 
plications range from sentence boundary dis- 
ambiguation (Reynar and Ratnaparkhi, 1997) 
to part-of-speech tagging (Ratnaparkhi, 1996), 
parsing (Ratnaparkhi, 1997) and machine trans- 
lation (Berger et al., 1996). 
In the present paper, we describe a partial 
parser based on the maximum entropy mod- 
elling method. After a synopsis of the maximum 
entropy framework in section 2, we present the 
motivation for our approach and the techniques 
it exploits (sections 3 and 4). Applications and 
results are the subject of sections 5 and 6.
2 Maximum Entropy Modelling 
The expressiveness and modelling power of the
maximum entropy approach arise from its ability
to combine information coming from different
knowledge sources. Given a set X of possible
histories and a set Y of futures, we can
characterise events from the joint event space
X × Y by defining a number of features, i.e.,
equivalence relations over X × Y. By defining
these features, we express our insights about
information relevant to modelling.
In such a formalisation, the maximum en- 
tropy technique consists in finding a model that 
(a) fits the empirical expectations of the pre- 
defined features, and (b) does not assume any- 
thing specific about events that are not sub- 
ject to constraints imposed by the features. In 
other words, we search for the maximum en- 
tropy probability distribution p*: 
$$p^* = \arg\max_{p \in P} H(p)$$
where P = {p:p meets the empirical feature ex- 
pectations} and H(p) denotes the entropy of p. 
For parameter estimation, we can use the Improved
Iterative Scaling (IIS) algorithm (Berger et al.,
1996), which assumes p to have the form:

$$p(x, y) = \frac{1}{Z} \cdot e^{\sum_i \lambda_i f_i(x, y)}$$

where f_i : X × Y → {0, 1} is the indicator
function of the i-th feature, λ_i the weight
assigned to this feature, and Z a normalisation
constant. IIS iteratively adjusts the weights
(λ_i) of the features; the model converges to the
maximum entropy distribution.
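
The exponential form above can be illustrated on a toy event space. The following Python sketch is not part of the original system: the histories, futures, features and weights are invented for illustration, and the weights are set by hand rather than estimated with IIS. It only shows how the normalisation constant Z and the model expectation of a feature follow from the weights.

```python
import math
from itertools import product

# Toy event space (illustrative values only, not taken from the NeGra data).
X = ["ART", "NN"]   # histories
Y = ["0", "+"]      # futures

# Binary indicator features f_i : X x Y -> {0, 1}.
features = [
    lambda x, y: int(x == "ART" and y == "0"),
    lambda x, y: int(y == "+"),
]
weights = [1.2, -0.4]   # lambda_i, picked by hand rather than estimated by IIS


def unnormalised(x, y):
    """exp(sum_i lambda_i * f_i(x, y))."""
    return math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))


# Z normalises over the whole joint event space X x Y.
Z = sum(unnormalised(x, y) for x, y in product(X, Y))


def p(x, y):
    """Joint probability of the event (x, y) under the exponential model."""
    return unnormalised(x, y) / Z


def model_expectation(f):
    """E_p[f], the quantity that is matched against the empirical expectation."""
    return sum(p(x, y) * f(x, y) for x, y in product(X, Y))
```

An estimation procedure such as IIS would adjust `weights` until `model_expectation(f)` agrees with the relative frequency of each feature in the training sample.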
One of the most attractive properties of the 
maximum entropy approach is its ability to cope 
with feature decomposition and overlapping fea- 
tures. In the following sections, we will show 
how these advantages can be exploited for par- 
tial parsing, i.e., the recognition of syntactic 
structures of limited depth. 
3 Context Information for Parsing 
An interesting feature of many partial parsers 
is that they recognise phrase boundaries mainly 
on the basis of cues provided by strictly local 
contexts. Regardless of whether or not abstrac- 
tions such as phrases occur in the model, most 
of the relevant information is contained directly 
in the sequence of words and part-of-speech tags 
to be processed. 
An archetypal representative of this approach 
is the method described by Church (1988), who 
used corpus frequencies to determine the boundaries
of simple non-recursive NPs. For each pair of
part-of-speech tags t_i, t_j, the probability of an
NP boundary ('[' or ']') occurring between t_i and
t_j is computed. On the basis of these context
probabilities, the program inserts the symbols
'[' and ']' into sequences of part-of-speech tags.
Information about lexical contexts also sig- 
nificantly improves the performance of deep 
parsers. For instance, Joshi and Srinivas 
(1994) encode partial structures in the Tree Ad- 
joining Grammar framework and use tagging 
techniques to restrict a potentially very large 
number of alternative structures. Here, the context
incorporates information about both the
terminal yield and the syntactic structure built 
so far. 
Local configurations of words and parts of
speech are a particularly important knowledge 
source for lexicalised grammars. In the Link 
Grammar framework (Lafferty et al., 1992;
Della Pietra et al., 1994), strictly local contexts 
are naturally combined with long-distance in- 
formation coming from long-range trigrams. 
Since modelling syntactic context is a very
knowledge-intensive problem, the maximum entropy
framework seems to be a particularly appropriate
approach. Ratnaparkhi (1997) introduces several
contextual predicates which provide rich
information about the syntactic context of nodes
in a tree (basically, the structure and category
of nodes dominated by or dominating the current
phrase). These predicates are used to guide the
actions of a parser.
The use of a rich set of contextual features is 
also the basic idea of the approach taken by Her- 
mjakob and Mooney (1997), who employ predi- 
cates capturing syntactic and semantic context 
in their parsing and machine translation system. 
4 A Partial Parser for German 
The basic idea underlying our approach to partial
parsing can be characterised as follows:
• An appropriate encoding format makes it 
possible to express all relevant lexical, cat- 
egorial and structural information in a fi- 
nite alphabet of structural tags assigned to 
words (section 4.1). 
• Given a sequence of words tagged with 
part-of-speech labels, a Markov model is 
used to determine the most probable se- 
quence of structural tags (section 4.2). 
• Parameter estimation is based on the max- 
imum entropy technique, which takes full 
advantage of the multi-dimensional charac- 
ter of the structural tags (section 4.3). 
The details of the method employed are ex- 
plained in the remainder of this section. 
4.1 Relevant Contextual Information 
Three pieces of information associated with a
word w_i are considered relevant to the parser:
• the part-of-speech tag t_i assigned to w_i
• the structural relation r_i between w_i and
its predecessor w_{i-1}
• the syntactic category c_i of parent(w_i)
On the basis of these three dimensions, structural
tags are defined as triples of the form
S_i = (t_i, r_i, c_i). For better readability, we
will sometimes use attribute-value matrices to
denote such tags:

$$S_i = \begin{bmatrix} \text{CAT:} & c_i \\ \text{REL:} & r_i \\ \text{TAG:} & t_i \end{bmatrix}$$
Since we consider structures of limited depth, 
only seven values of the REL attribute are dis- 
tinguished. 
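$$r_i = \begin{cases}
\text{0} & \text{if } parent(w_i) = parent(w_{i-1}) \\
\text{+} & \text{if } parent(w_i) = parent^2(w_{i-1}) \\
\text{++} & \text{if } parent(w_i) = parent^3(w_{i-1}) \\
\text{-} & \text{if } parent^2(w_i) = parent(w_{i-1}) \\
\text{-\,-} & \text{if } parent^3(w_i) = parent(w_{i-1}) \\
\text{=} & \text{if } parent^2(w_i) = parent^2(w_{i-1}) \\
\text{1} & \text{else}
\end{cases}$$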
If more than one of the conditions above are 
met, the first of the corresponding tags in the 
list is assigned. Figure 1 exemplifies the encod- 
ing format. 
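
The definition of the REL attribute can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the tree is represented as a hypothetical child-to-parent dictionary, the first word of a sequence is given the value 1 because it has no predecessor, and the example tree for the phrase mit etwa 50 000 DM pro Jahr is an assumed analysis (NM and PP2 are invented node names).

```python
def ancestor(parent, node, k):
    """Apply the parent function k times; return None if the tree is not that deep."""
    for _ in range(k):
        node = parent.get(node)
        if node is None:
            return None
    return node


# (tag, k_cur, k_prev): the tag applies if parent^k_cur(w_i) = parent^k_prev(w_{i-1}).
# The order matters: the first condition that holds determines the tag.
CONDITIONS = [
    ("0", 1, 1),
    ("+", 1, 2),
    ("++", 1, 3),
    ("-", 2, 1),
    ("--", 3, 1),
    ("=", 2, 2),
]


def rel(parent, w_prev, w):
    """Return the REL value of word w given its predecessor w_prev."""
    if w_prev is None:          # assumption: a word without predecessor gets "1"
        return "1"
    for tag, k_cur, k_prev in CONDITIONS:
        node = ancestor(parent, w, k_cur)
        if node is not None and node == ancestor(parent, w_prev, k_prev):
            return tag
    return "1"


# Assumed analysis: [PP mit [AP etwa [NM 50 000]] DM [PP2 pro Jahr]]
parent = {
    "mit": "PP", "etwa": "AP", "50": "NM", "000": "NM", "NM": "AP",
    "AP": "PP", "DM": "PP", "pro": "PP2", "Jahr": "PP2", "PP2": "PP",
}
words = ["mit", "etwa", "50", "000", "DM", "pro", "Jahr"]
tags = [rel(parent, prev, w) for prev, w in zip([None] + words, words)]
# tags == ["1", "-", "-", "0", "++", "-", "0"]
```

Using plain word forms as node names only works here because no word occurs twice; a real encoder would use token positions instead.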
Figure 1: Tags r_2 assigned to word w_2 (tree diagrams for r_2 = 0, +, ++, -, -- and =).
These seven values of the ri attribute are 
mostly sufficient to represent the structure of 
even fairly complex NPs, PPs and APs, involv- 
ing PP and genitive NP attachment as well as 
complex prenominal modifiers. The only NP 
components that are not treated here are rela- 
tive clauses and infinitival complements. A Ger- 
man prepositional phrase and its encoding are 
shown in figure 2. 
word:   mit     etwa      50      000     DM      pro     Jahr
TAG:    APPR    ADV       CARD    CARD    NN      APPR    NN
REL:    1       …         …       …       ++      -       0
gloss:  with    approx.   50      000     DM      per     year
4.2 A Markovian Parser 
The task of the parser is to determine the best
sequence of triples (t_i, r_i, c_i) for a given
sequence of part-of-speech tags (t_0, t_1, ..., t_n).
Since the attributes TAG, REL and CAT can take only
a finite number of values, the number of such
triples will also be finite, and they can be used
to construct a second-order Markov model. The
triples S_i = (t_i, r_i, c_i) are states of the
model, which emits POS tags (t_j) as signals.
In this respect, our approach does not differ much
from standard part-of-speech tagging techniques.
We simply assign the most probable sequence of
structural tags S = (S_0, S_1, ..., S_n) to a
sequence of part-of-speech tags T =
(t_0, t_1, ..., t_n). Assuming the Markov property,
we obtain:

$$\begin{aligned}
& \arg\max_{S} P(S \mid T) \qquad\qquad (1) \\
={} & \arg\max_{S} P(S) \cdot P(T \mid S) \\
={} & \arg\max_{S} \prod_{i=1}^{n} P(S_i \mid S_{i-2}, S_{i-1}) \, P(t_i \mid S_i)
\end{aligned}$$
The part-of-speech tags are encoded in the
structural tag (the t_i dimension), so S uniquely
determines T. Therefore, we have P(t_i | S_i) = 1
if S_i = (t_i, r_i, c_i) and 0 otherwise, which
simplifies calculations.
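
The search for the best tag sequence in equation (1) can be sketched as a second-order Viterbi pass. The Python code below is a simplified illustration, not the authors' implementation: `viterbi2`, `STATES` and `toy_prob` are hypothetical names, sentence-initial contexts are marked with None, and the toy transition probabilities are invented. The deterministic emission P(t_i | S_i) shows up as a filter on the candidate states.

```python
import math


def viterbi2(pos_tags, states, trans_prob):
    """Most probable sequence of structural tags (t, r, c) for a POS sequence.

    trans_prob(s2, s1, s) must return P(s | s2, s1); None marks a missing
    predecessor at the start of the sequence.
    """
    if not pos_tags:
        return []
    # Deterministic emission: a state can only emit its own TAG component.
    candidates = [[s for s in states if s[0] == t] for t in pos_tags]
    # best maps the last two states to (log probability, best path so far).
    best = {}
    for s in candidates[0]:
        p = trans_prob(None, None, s)
        if p > 0:
            best[(None, s)] = (math.log(p), [s])
    for cands in candidates[1:]:
        new_best = {}
        for (s2, s1), (logp, path) in best.items():
            for s in cands:
                p = trans_prob(s2, s1, s)
                if p <= 0:
                    continue
                score = logp + math.log(p)
                if (s1, s) not in new_best or score > new_best[(s1, s)][0]:
                    new_best[(s1, s)] = (score, path + [s])
        best = new_best
    if not best:
        return None          # no sequence with non-zero probability
    return max(best.values(), key=lambda item: item[0])[1]


# Toy model: three structural tags and hand-set probabilities.
STATES = [("ART", "1", "NP"), ("NN", "0", "NP"), ("NN", "1", "NP")]


def toy_prob(s2, s1, s):
    if s1 is None:                       # first word: only REL = 1 is allowed
        return 1.0 if s[1] == "1" else 0.0
    return 0.8 if s[1] == "0" else 0.2   # prefer attaching to the same parent


best_path = viterbi2(["ART", "NN"], STATES, toy_prob)
# best_path == [("ART", "1", "NP"), ("NN", "0", "NP")]
```

Storing whole paths in the table keeps the sketch short; an implementation aiming at the linear-time behaviour mentioned in the conclusion would keep back-pointers instead.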
4.3 Parameter Estimation 
The more interesting aspect of our parser is the
estimation of contextual probabilities, i.e.,
calculating the probability of a structural tag S_i
(the "future") conditional on its immediate
predecessors S_{i-1} and S_{i-2} (the "history").
history                               future
[CAT: c_{i-2}]   [CAT: c_{i-1}]       [CAT: c_i]
[REL: r_{i-2}]   [REL: r_{i-1}]       [REL: r_i]
[TAG: t_{i-2}]   [TAG: t_{i-1}]       [TAG: t_i]
Figure 2: A sample structure. The labels are explained in Appendix B.

In the following two subsections, we contrast
the traditional HMM estimation method and 
the maximum entropy approach. 
4.3.1 Linear Interpolation 
One possible way of parameter estimation is to
use standard HMM techniques while treating
the triples S_i = (t_i, r_i, c_i) as atoms. Trigram
probabilities are estimated from an annotated
corpus by using relative frequencies r:

$$r(S_i \mid S_{i-2}, S_{i-1}) = \frac{f(S_{i-2}, S_{i-1}, S_i)}{f(S_{i-2}, S_{i-1})}$$
A standard method of handling sparse data is to
use a linear combination of unigrams, bigrams,
and trigrams $\tilde{p}$:

$$\tilde{p}(S_i \mid S_{i-2}, S_{i-1}) = \lambda_1 r(S_i) + \lambda_2 r(S_i \mid S_{i-1}) + \lambda_3 r(S_i \mid S_{i-2}, S_{i-1})$$

The λ_i denote weights for different context sizes
and sum up to 1. They are commonly estimated
by deleted interpolation (Brown et al., 1992).
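
The interpolation scheme can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the weights are fixed by hand instead of being estimated by deleted interpolation, and the structural tags are replaced by single letters to keep the example small.

```python
from collections import Counter


def train_counts(sequences):
    """Collect unigram, bigram and trigram frequencies f(.) from tag sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for seq in sequences:
        for i, s in enumerate(seq):
            uni[s] += 1
            if i >= 1:
                bi[(seq[i - 1], s)] += 1
            if i >= 2:
                tri[(seq[i - 2], seq[i - 1], s)] += 1
    return uni, bi, tri


def interpolated(s2, s1, s, counts, lambdas):
    """lambda_1 r(s) + lambda_2 r(s | s1) + lambda_3 r(s | s2, s1)."""
    uni, bi, tri = counts
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    # Relative frequencies; an unseen context contributes 0 instead of dividing by 0.
    r1 = uni[s] / total if total else 0.0
    r2 = bi[(s1, s)] / uni[s1] if uni[s1] else 0.0
    r3 = tri[(s2, s1, s)] / bi[(s2, s1)] if bi[(s2, s1)] else 0.0
    return l1 * r1 + l2 * r2 + l3 * r3


counts = train_counts([["a", "b", "c"], ["a", "b", "d"]])
lambdas = (0.2, 0.3, 0.5)            # hand-set weights that sum to 1
p_seen = interpolated("a", "b", "c", counts, lambdas)
p_unseen_context = interpolated("x", "y", "c", counts, lambdas)
```

For the unseen context only the unigram term survives, which is the back-off behaviour the interpolation is meant to provide.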
4.3.2 Features 
A disadvantage of the traditional method is that
it considers only full n-grams S_{i-n+1}, ..., S_i
and ignores a lot of contextual information, such
as regular behaviour of the single attributes
TAG, REL and CAT. The maximum entropy
approach offers an attractive alternative in this
respect since we are now free to define features
accessing different constellations of the
attributes. For instance, we can abstract over one
or more dimensions, as in the context description
in table 1.
history                future
[REL: 0, TAG: NN]      [REL: …]

Table 1: A partial trigram feature
Such "partial n-grams" permit a better
exploitation of information coming from contexts
observed in the training data. We say that a
feature f_k defined by the triple
(M_{i-2}, M_{i-1}, M_i) of attribute-value matrices
is active on a trigram context
(S'_{i-2}, S'_{i-1}, S'_i) (i.e.,
f_k(S'_{i-2}, S'_{i-1}, S'_i) = 1) iff M_j unifies
with the attribute-value matrix M'_j encoding the
information contained in S'_j for j = i-2, i-1, i.
A novel context would on average activate more
features than in the standard HMM approach,
which treats the (t_i, r_i, c_i) triples as atoms.
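
The activation condition can be illustrated with a small Python sketch. It assumes that attribute-value matrices are represented as dictionaries and that all values are atomic, so that unification with a fully specified tag reduces to an agreement check; the feature and the two contexts are invented examples.

```python
def unifies(matrix, tag):
    """A partial matrix unifies with a fully specified tag iff every
    attribute it mentions has the same value in the tag."""
    return all(tag[attr] == value for attr, value in matrix.items())


def feature_value(feature, context):
    """Return 1 if the feature is active on the trigram context, else 0."""
    return int(all(unifies(m, s) for m, s in zip(feature, context)))


# Hypothetical partial trigram: nothing is required of S_{i-2}, S_{i-1} must
# be a noun with REL 0, and S_i must have REL "-".
feature = ({}, {"REL": "0", "TAG": "NN"}, {"REL": "-"})

context_a = (
    {"TAG": "ART", "REL": "1", "CAT": "NP"},
    {"TAG": "NN", "REL": "0", "CAT": "NP"},
    {"TAG": "APPR", "REL": "-", "CAT": "PP"},
)
context_b = (
    {"TAG": "ADJA", "REL": "0", "CAT": "NP"},
    {"TAG": "NN", "REL": "0", "CAT": "NP"},
    {"TAG": "ART", "REL": "-", "CAT": "NP"},
)
context_c = (
    {"TAG": "ART", "REL": "1", "CAT": "NP"},
    {"TAG": "NN", "REL": "0", "CAT": "NP"},
    {"TAG": "NN", "REL": "0", "CAT": "NP"},
)

values = [feature_value(feature, c) for c in (context_a, context_b, context_c)]
# values == [1, 1, 0]
```

The feature fires on both `context_a` and `context_b` although they share no complete trigram, which is the generalisation that full n-grams over atomic triples cannot express.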
The actual features are extracted from the
training corpus in the following way: we first
define a number of feature patterns that say which
attributes of a trigram context are relevant. All
feature pattern instantiations that occur in the
training corpus are stored; this procedure yields
several thousand features for each pattern.
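
The extraction procedure just described might look as follows in Python. The sketch rests on assumptions that the paper does not spell out: a pattern is encoded as a triple of attribute-name tuples, only complete trigrams are used (no padding at the sequence start), and the two patterns shown are invented rather than taken from Appendix A.

```python
def instantiate(pattern, context):
    """Fill the attributes named by the pattern with the values found in the context."""
    return tuple(
        tuple((attr, tag[attr]) for attr in attrs)
        for attrs, tag in zip(pattern, context)
    )


def extract_features(patterns, corpus):
    """Collect every pattern instantiation that occurs in the training corpus."""
    found = set()
    for sentence in corpus:
        for i in range(2, len(sentence)):
            context = sentence[i - 2:i + 1]
            for k, pattern in enumerate(patterns):
                # The pattern index keeps instantiations of different patterns apart.
                found.add((k, instantiate(pattern, context)))
    return found


def tag(t, r, c):
    return {"TAG": t, "REL": r, "CAT": c}


# Two hypothetical patterns: "REL of the future only" and
# "TAG of the predecessor plus TAG of the future".
patterns = [
    ((), (), ("REL",)),
    ((), ("TAG",), ("TAG",)),
]
corpus = [[
    tag("ART", "1", "NP"), tag("ADJA", "0", "NP"),
    tag("NN", "0", "NP"), tag("NN", "0", "NP"),
]]
features = extract_features(patterns, corpus)
# The two trigrams yield one instantiation of the first pattern and two of
# the second, so len(features) == 3.
```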
After computing the weights λ_k of the features
occurring in the training sample, we can
calculate the contextual probability of a
multi-dimensional structural tag S_i following the
two tags S_{i-2} and S_{i-1}:

$$p(S_i \mid S_{i-2}, S_{i-1}) = \frac{1}{Z} \cdot e^{\sum_k \lambda_k f_k(S_{i-2}, S_{i-1}, S_i)}$$
We achieved the best results with 22 empirically
determined feature patterns comprising
full and partial n-grams, n ≤ 3. These patterns
are listed in Appendix A.
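
How the contextual probability of section 4.3.2 is obtained from the feature weights can be sketched as follows. The Python code is illustrative only: it assumes that Z is computed per history by summing over all candidate futures, which the paper does not state explicitly, and the single feature and its weight are invented.

```python
import math


def conditional_prob(history, future, futures, features, weights):
    """p(future | history) for an exponential model with binary features.

    history is the pair (S_{i-2}, S_{i-1}); each feature is a function
    f(s2, s1, s) returning 0 or 1.
    """
    s2, s1 = history

    def score(s):
        return math.exp(sum(w * f(s2, s1, s) for f, w in zip(features, weights)))

    # Assumption: Z normalises over the candidate futures for this history.
    z = sum(score(s) for s in futures)
    return score(future) / z


# Toy set-up with one hypothetical feature: "REL 0 directly after REL 1".
futures = [("NN", "0", "NP"), ("NN", "1", "NP")]
features = [lambda s2, s1, s: int(s1[1] == "1" and s[1] == "0")]
weights = [math.log(3.0)]            # hand-set weight, not an IIS estimate
history = (None, ("ART", "1", "NP"))

p_attach = conditional_prob(history, futures[0], futures, features, weights)
p_new = conditional_prob(history, futures[1], futures, features, weights)
```

With the weight log 3, the active feature makes the first future three times as likely as the second, so the two probabilities are 0.75 and 0.25.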
5 Applications 
Below, we discuss two applications of our max- 
imum entropy parser: treebank annotation and 
chunk parsing of unrestricted text. For precise 
results, see section 6. 
5.1 Treebank Annotation 
The partial parser described here is used for
corpus annotation in a treebank project, cf. (Skut
et al., 1997). The annotation process is more
interactive than in the Penn Treebank approach
(Marcus et al., 1994), where a sentence is first
preprocessed by a partial parser and then edited
by a human annotator. In our method, manual
and automatic annotation steps are closely
interleaved. Figure 3 exemplifies the
human-computer interaction during annotation.
The annotations encode four kinds of linguis- 
tic information: 1) parts of speech and inflec- 
tion, 2) structure, 3) phrasal categories (node 
labels), 4) grammatical functions (edge labels). 
Part-of-speech tags are assigned in a prepro- 
cessing step. The automatic instantiation of la- 
bels is integrated into the assignment of struc- 
tures. The annotator marks the words and 
phrases to be grouped into a new substructure, 
and the node and edge labels are inserted by the 
program, cf. (Brants et al., 1997). 
Das  Volumen  lag    in    besseren  Zeiten  bei   etwa  acht  Millionen  Tonnen  .
ART  NN       VVFIN  APPR  ADJA      NN      APPR  ADV   CARD  NN         NN      $.
Figure 3: A chunked sentence (in better times, the volume was around eight million tons). Gram- 
matical function labels: NK nominal kernel component, AC adposition, NMC number component, 
MO modifier. 
Initially, such annotation increments were 
just local trees of depth one. In this mode, the 
annotation of the PP bei etwa acht Millionen 
Tonnen (\[at\] around eight million tons) involves 
three annotation steps (first the number phrase 
acht Millionen, then the AP, and the PP). Each 
time, the annotator highlights the immediate 
constituents of the phrase being constructed. 
The use of the partial parser described in this 
paper makes it possible to construct the whole 
PP in only one step: The annotator marks the 
words dominated by the PP node, and the inter- 
nal structure of the new phrase is assigned auto- 
matically. This significantly reduces the amount 
of manual annotation work. The method yields 
reliable results in the case of phrases that ex- 
hibit a fairly rigid internal structure. More than 
88% of all NPs, PPs and APs are assigned the 
correct structure, including PP attachment and 
complex prenominal modifiers. 
Further examples of structures recognised by 
the parser are shown in figure 4. A more de- 
tailed description of the annotation mode can 
be found in (Brants and Skut, 1998). 
5.2 NP Chunker 
Apart from treebank annotation, our partial 
parser can be used to chunk part-of-speech 
tagged text into major phrases. Unlike in the 
previous application, the tool now has to deter- 
mine not only the internal structure, but also 
the external boundaries of phrases. This makes
the task more difficult, especially the
determination of PP attachment.
However, if we restrict the coverage of the 
parser to the prenominal part of the NP/PP, it 
performs quite well, correctly assigning almost 
95% of all structural tags, which corresponds to 
a bracketing precision of ca. 87%. 
6 Results 
In this section, we report the results of a
cross-validation of the parser carried out on the
NeGra Treebank (Skut et al., 1997). The corpus
was converted into structural tags and
partitioned into a training and a testing part (90%
and 10%, respectively). We repeated this
procedure ten times with different partitionings; the
results of these test runs were averaged.
The weights of the features used by the
maximum entropy parser were determined with
the help of the Maximum Entropy Modelling
Toolkit, cf. (Ristad, 1996). The number of
features reached 120,000 for the full training
corpus (12,000 sentences). Interestingly, tagging
accuracy decreased after 4-5 iterations of
Improved Iterative Scaling, so only 3 iterations
were carried out in each of the test runs.
The accuracy measures employed are ex- 
plained as follows. 
tags: the percentage of structural tags with the
correct value r_i of the REL attribute,
bracketing: the percentage of correctly recog- 
nised nodes, 
labelled bracketing: like bracketing, but in- 
cluding the syntactic category of the nodes, 
structural match: the percentage of correctly 
recognised tree structures (top-level chunks 
only, labelling is ignored). 
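
The bracketing measures can be made concrete with a short Python sketch. It assumes that a node counts as correctly recognised when a node with the same span (and, for labelled bracketing, the same category) occurs in the gold-standard tree; the paper does not spell out the matching criterion, and the spans below are invented.

```python
def precision_recall(gold, predicted):
    """Precision and recall of a set of predicted nodes against the gold nodes."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def unlabelled(nodes):
    """Drop the category so that only the (start, end) spans are compared."""
    return {(start, end) for start, end, _ in nodes}


# Nodes as (first word, last word, category); hypothetical example.
gold = {(0, 1, "NP"), (3, 7, "PP"), (4, 6, "AP")}
predicted = {(0, 1, "NP"), (3, 7, "NP"), (5, 6, "NM"), (4, 5, "AP")}

lab_precision, lab_recall = precision_recall(gold, predicted)
precision, recall = precision_recall(unlabelled(gold), unlabelled(predicted))
```

In this example the span (3, 7) is found but mislabelled, so it counts for unlabelled bracketing only: unlabelled precision and recall are 2/4 and 2/3, the labelled figures 1/4 and 1/3.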
[Tree diagrams not reproduced; only the token rows are given, and "…" marks material that is illegible in the source.]

(1) Ein  geradezu  pathetischer  Aufruf  zum      gemeinsamen  Kampf  für   einen  gerechten  Frieden
    ART  ADV       ADJA          NN      APPRART  ADJA         NN     APPR  ART    ADJA       NN
    'An almost pathetic call for a joint fight for a just peace'

(2) Fragments: Eine Kostprobe aus einem Programm (ART NN APPR ART NN, 'A sample of a program');
    für Oktober geplanten (APPR NN ADJA, 'planned for October');
    … zwischen Rock- und Chormusik (NN APPR TRUNC KON NN, '… between rock and choir music')

(3) Fragments: im Nachtragshaushalt vorgesehene (APPRART NN ADJA, 'in the additional budget planned');
    über die Stellenverteilung (APPR ART NN, 'about the allocation of jobs');
    in der Verwaltung (APPR ART NN, 'in the administration')
Figure 4: Examples of complex NPs and PPs correctly recognised by the parser. In the treebank
application, such phrases are part of larger structures. The external boundaries (the first and
the last word of the examples) are highlighted by an annotator; the parser recognises the internal
boundaries and assigns labels.
6.1 Treebank Application 
In the treebank application, information about 
the external boundaries of a phrase is supplied 
by an annotator. To imitate this situation, we 
extracted from the NeGra corpus all sequences 
of part-of-speech tags spanned by NPs, PPs,
APs and complex adverbials. Other tags were 
left out since they do not appear in chunks 
recognised by the parser. Thus, the sentence 
shown in figure 3 contributed three substrings 
to the chunk corpus: ART NN, APPR ADJA NN 
and APPR ADV CARD NN NN, which would also 
be typical annotator input. A designated sepa- 
rator character was used to mark chunk bound- 
aries. 
Table 2 shows the performance of the parser 
on the chunk corpus. 
Table 2: Recall and precision results for the
interactive annotation mode.

measure          total     correct   recall   prec.
tags             129822    123435        95.1%
bracketing        56715     49715    87.7%    89.1%
lab. brack.       56715     47415    83.6%    84.8%
struct. match     37942     33450    88.2%    88.0%
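
As a plausibility check, the recall column of Table 2 can be recomputed from the absolute counts. The short Python sketch below assumes that "total" is the number of items in the gold standard, so that recall equals correct divided by total; the precision column cannot be checked in this way because the number of items proposed by the parser is not reported.

```python
# (total, correct) pairs copied from Table 2.
table2 = {
    "tags": (129822, 123435),
    "bracketing": (56715, 49715),
    "lab. brack.": (56715, 47415),
    "struct. match": (37942, 33450),
}

# Recall in percent, rounded to one decimal place as in the table.
recall = {
    measure: round(100 * correct / total, 1)
    for measure, (total, correct) in table2.items()
}
# recall == {"tags": 95.1, "bracketing": 87.7,
#            "lab. brack.": 83.6, "struct. match": 88.2}
```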
6.2 Chunking Application 
Table 3 shows precision and recall for the chunk- 
ing application, i.e., the recognition of kernel 
NPs and PPs in part-of-speech tagged text. 
Post-nominal PP attachment is ignored. Un- 
like in the treebank application, there is no pre- 
editing by a human expert. The absolute num- 
bers differ from those in table 2 because cer- 
tain structures are ignored. The total number 
of structural tags is higher since we now parse 
whole sentences rather than separate chunks.
In addition to the four accuracy measures 
defined above, we also give the percentage 
of chunks with correctly recognised external 
boundaries (irrespective of whether or not there 
are errors concerning their internal structure). 
Table 3: Recall and precision for the chunk- 
ing application. The parser recognises only the 
prenominal part of the NP/PP (without focus 
adverbs such as also, only, etc.). 
measure          total     correct   recall   prec.
tags             166995    158541        94.9%
bracketing        51912     45241    87.2%    86.9%
lab. brack.       51912     43813    84.4%    84.2%
struct. match     46599     41422    88.9%    87.6%
ext. bounds       46599     43833    94.1%    93.4%
6.3 Comparison to a Standard Tagger 
In the following, we compare the performance of 
the maximum-entropy parser with the precision 
of a standard HMM-based approach trained on 
the same data, but using only the frequencies 
of complete trigrams, bigrams and unigrams, 
whose probabilities are smoothed by linear in- 
terpolation, as described in section 4.3.1. 
Figure 5 shows the percentage of correctly
assigned values r_i of the REL attribute depending
on the size of the training corpus. Generally, the
maximum entropy approach outperforms the
linear interpolation technique by about 0.5% -
1.5%, which corresponds to a 1% - 3% difference
in structural match. The difference decreases as
the size of the training sample grows. For the
full corpus consisting of 12,000 sentences, the
linear interpolation tagger is still inferior to the
maximum entropy one, but the difference in
precision becomes insignificant (0.2%). Thus, the
maximum entropy technique seems to be
particularly advantageous in the case of sparse data.
7 Conclusion 
We have demonstrated a partial parser capa- 
ble of recognising simple and complex NPs, 
PPs and APs in unrestricted German text. 
The maximum entropy parameter estimation 
method allows us to optimally use the con- 
text information contained in the training sam- 
ple. On the other hand, the parser can still be 
viewed as a Markov model, which guarantees 
high efficiency (processing in linear time). The 
program can be trained even with a relatively
small amount of treebank data; then it can be
used for parsing unrestricted pre-tagged text.
As far as coverage is concerned, our parser 
can handle recursive structures, which is an ad- 
vantage compared to simpler techniques such 
as that described by Church (1988). On the 
other hand, the Markov assumption underlying 
our approach means that only strictly local de- 
pendencies are recognised. For full parsing, one 
would probably need non-local contextual infor- 
mation, such as the long-range trigrams in Link 
Grammar (Della Pietra et al., 1994). 
Our future research will focus on exploiting 
morphological and lexical knowledge for partial 
parsing. Lexical context is particularly relevant 
for the recognition of genitive NP and PP at- 
tachment, as well as complex proper names. We 
hope that our approach will benefit from re- 
lated work on this subject, cf. (Ratnaparkhi 
et al., 1994). Further precision gain can also 
be achieved by enriching the structural context, 
e.g. with information about the category of the 
grandparent node. 
Figure 5: Tagging precision achieved by the maximum entropy parser and a tagger using
linear interpolation. Precision is shown for different numbers of training sentences
(x-axis: # training sentences, 0 to 12,000; y-axis: % correct; the two curves end at
95.1% and 94.9%).

8 Acknowledgements
This work is part of the DFG Collaborative
Research Programme 378 Resource-Adaptive
Cognitive Processes, Project C3 Concurrent
Grammar Processing.
Many thanks go to Eric S. Ristad. We used
his freely available Maximum Entropy Modelling
Toolkit to estimate context probabilities.

References 
Adam L. Berger, Stephen A. Della Pietra, and
Vincent J. Della Pietra. 1996. A maximum
entropy approach to natural language
processing. Computational Linguistics,
22(1):39-71.
Thorsten Brants and Wojciech Skut. 1998. Au- 
tomation of treebank annotation. In Proceed- 
ings of NeMLaP-3, Sydney, Australia. 
Thorsten Brants, Wojciech Skut, and Brigitte 
Krenn. 1997. Tagging grammatical func- 
tions. In Proceedings of EMNLP-97, Provi- 
dence, RI, USA. 
P. F. Brown, V. J. Della Pietra, Peter V.
de Souza, Jenifer C. Lai, and Robert L.
Mercer. 1992. Class-based n-gram models of
natural language. Computational Linguistics,
18(4):467-479.
Kenneth Ward Church. 1988. A stochastic 
parts program and noun phrase parser for 
unrestricted text. In Proc. Second Confer- 
ence on Applied Natural Language Process- 
ing, pages 136-143, Austin, Texas, USA. 
S. Della Pietra, V. Della Pietra, J. Gillett, 
J. Lafferty, H. Printz, and L. Ures. 1994. 
Inference and estimation of a long-range tri- 
gram model. In Proceedings of the Second In- 
ternational Colloquium on Grammatical In- 
ference and Applications, Lecture Notes in 
Artificial Intelligence. Springer Verlag. 
Ulf Hermjakob and Raymond J. Mooney. 1997.
Learning parse and translation decisions from
examples with rich contexts. In Proceedings
of ACL-97, pages 482-489, Madrid, Spain.
Aravind K. Joshi and B. Srinivas. 1994.
Disambiguation of super parts of speech (or
supertags). In Proceedings of COLING 94, Kyoto,
Japan.
John Lafferty, Daniel Sleator, and Davy Tem- 
perley. 1992. Grammatical trigrams: A prob- 
abilistic model of link grammar. In Proceed- 
ings of the AAAI Conference on Probabilistic 
Approaches to Natural Language. 
Mitchell Marcus, Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1994. Building a 
large annotated corpus of English: the Penn 
Treebank. In Susan Armstrong, editor, Using 
Large Corpora. MIT Press. 
Adwait Ratnaparkhi, Jeff Reynar, and Salim
Roukos. 1994. A maximum entropy model
for prepositional phrase attachment. In
Proceedings of the ARPA Human Language
Technology Workshop, pages 250-255.
Adwait Ratnaparkhi. 1996. A maximum entropy
model for part-of-speech tagging. In
Proceedings of EMNLP-96, Philadelphia, Pa.,
USA.
Adwait Ratnaparkhi. 1997. A linear observed 
time statistical parser based on maximum en- 
tropy models. In Proceedings of EMNLP-97, 
Providence, RI, USA. 
Jeffrey Reynar and Adwait Ratnaparkhi. 1997. 
A maximum entropy approach to identify- 
ing sentence boundaries. In Proceedings of 
ANLP-97, Washington, DC, USA. 
Eric Sven Ristad. 1996. Maximum Entropy
Modelling Toolkit, User's Manual. Princeton
University, Princeton. Available at
cmp-lg/9612005.
Wojciech Skut, Brigitte Krenn, Thorsten 
Brants, and Hans Uszkoreit. 1997. An anno- 
tation scheme for free word order languages. 
In Proceedings of ANLP-97, Washington, DC. 
Christine Thielen and Anne Schiller. 1995.
Ein kleines und erweitertes Tagset fürs
Deutsche. In Tagungsberichte des
Arbeitstreffens Lexikon + Text 17./18. Februar 1994,
Schloß Hohentübingen. Lexicographica Series
Maior, Tübingen. Niemeyer.
