Best Analysis Selection in Inflectional Languages
Aleš Horák and Pavel Smrž
Faculty of Informatics, Masaryk University Brno
Botanická 68a, 60200 Brno, Czech Republic
E-mail: {hales,smrz}@fi.muni.cz
Abstract
Ambiguity is a fundamental property of
natural language. Perhaps the most
burdensome case of ambiguity manifests itself
on the syntactic level of analysis. To cope
with the high number of derivation trees
obtained, this paper describes several
techniques for evaluating figures of merit,
which define a sort order on parsing trees.
The presented methods are based on
language-specific features of synthetic
languages, and they improve the results of
simple stochastic approaches.
1 Introduction
Ambiguity on all levels of representation is
an inherent property of natural languages,
and it also forms a central problem of
natural language parsing. A consequence of
natural language ambiguity is a high number
of possible parser outputs, which are
usually represented by labeled trees. The
average number of parsing trees per input
sentence strongly depends on the background
grammar and thence on the language. There
are natural language grammars producing
at most hundreds or thousands of parsing
trees, but also highly ambiguous grammar
systems producing an enormous number of
results. For example, a grammar extracted
from the Penn Treebank and tested on a
set of sentences randomly generated from a
probabilistic version of the grammar has on
average 7.2×10^27 parses per sentence
according to Moore’s work (Moore, 2000). Such
a mammoth number of results is also no
exception in parsing of Czech (Smrž and
Horák, 2000) (see Fig. 1), due to free word
order and the rich morphology of word forms,
whose grammatical case often cannot be
determined unambiguously.
Figure 1: The dependence of the number of
resulting analyses on the number of words in
the input sentence
A traditional solution for these problems
is presented by probabilistic parsing tech-
niques (Bunt and Nijholt, 2000) aiming at
finding the most probable parse of a given
input sentence. This methodology is usually
based on the relative frequencies of occur-
rences of the possible relations in a repre-
sentative corpus. “Best” trees are judged by
a probabilistic figure of merit (FOM).
The term “figure of merit” is usually used
to refer to a function that prunes implausi-
ble partial analyses during parsing. In this
paper, we rather take figure of merit as a
measure bounding the true probabilities of
the complete parses.
[Derivation tree with nonterminals S, NP1, AP
and NP4 over the terminals
ADJ and ADJ N1 V ADJ N4 N2]
selected trigrams: [ADJ,and,ADJ]
[ADJ,N1,V]
[N1,V,N4]
[V,ADJ,N4]
[ADJ,N4,N2]
Figure 2: Lexical heads as n-gram elements.
The standard methods of best analysis
selection (Caraballo and Charniak, 1998)
usually use simple stochastic functions that
are independent of the peculiarities of the
underlying language. This approach seems to
work satisfactorily in the case of analytical
languages. On the other hand, the obstacles
that synthetic languages pose to those simple
statistical techniques cannot be neglected.
Therefore, we try to improve the standard
FOMs taking into consideration specific fea-
tures of free word order languages. The fol-
lowing text discusses the assets of three fig-
ures of merit that reflect selected phenomena
of the Czech language.
2 Figures of Merit
The overall figure of merit of the syntactic
analysis results is determined as a combina-
tion of several contributory FOMs that re-
flect particular language features such as
• frequency of syntactic constructs repre-
sented by pre-computed rule probabili-
ties
• augmented n-gram model based on the
occurrence of adjacent lexical heads
standing for the corresponding subtrees
• affinity between constituents modeled
by valency frames of verbs, adjectives
and nouns
The selected FOMs participate in the
determination of the most probable analysis.
A straightforward approach lies in the linear
combination of FOMs:
ξ = λ1 ·ξ1 + λ2 ·ξ2 + λ3 ·ξ3
where ξi are the FOMs’ contributions and
λi are empirically assigned weights (usually
taken as normalizing coefficients). However,
our experiments showed that the weights λi
need to reflect the behaviour of particular
lexical items, their categories or even anal-
ysed constituents. We thus need to handle
the λi variables as functions of various pa-
rameters.
ξ = λ1( )·ξ1 + λ2( )·ξ2 + λ3( )·ξ3
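As a rough illustration of the combination
above, the following Python sketch treats each
weight λi as a function of a context object
rather than a constant. The Context fields,
the weight functions and all numeric values
are hypothetical placeholders; the text does
not specify which parameters the weights
depend on.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Context:
    """Hypothetical parameters a weight may depend on (assumed, not from the paper)."""
    head_category: str    # e.g. "V", "N", "ADJ"
    sentence_length: int


WeightFn = Callable[[Context], float]


def combined_fom(foms: Sequence[float],
                 weights: Sequence[WeightFn],
                 ctx: Context) -> float:
    """Compute xi = sum_i lambda_i(ctx) * xi_i."""
    if len(foms) != len(weights):
        raise ValueError("one weight function is needed per FOM")
    return sum(w(ctx) * xi for w, xi in zip(weights, foms))


# Assumed example weights: constant, length-dependent, category-dependent.
weights = [
    lambda ctx: 0.5,
    lambda ctx: 0.3 if ctx.sentence_length > 10 else 0.2,
    lambda ctx: 0.4 if ctx.head_category == "V" else 0.1,
]

ctx = Context(head_category="V", sentence_length=12)
score = combined_fom([0.2, 0.5, 0.9], weights, ctx)
```

With constant weight functions the sketch
reduces to the plain linear combination.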
The following sections deal with the figures
of merit that play a crucial role in the search
for the best output analysis.
2.1 Rule-tied Actions and ξ1 FOM
A key question is what good candidates for
FOMs are. With probabilistic context-free
grammars (PCFGs), the simple CF rule
probabilities are used to form a FOM
(Chitrao and Grishman, 1990; Bobrow, 1991).
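For reference, the usual PCFG score of a
derivation tree is the product of the
probabilities of the rules used in it. A
minimal Python sketch follows; the toy grammar
and its probabilities are invented for
illustration, and log probabilities are summed
to avoid numerical underflow on large trees.

```python
import math
from typing import Dict, Tuple, Union

# A leaf is a tag string; an inner node is (label, child, child, ...).
Tree = Union[str, tuple]

# Toy rule probabilities (assumed values), keyed by (lhs, rhs labels).
RULE_PROB: Dict[Tuple[str, Tuple[str, ...]], float] = {
    ("S", ("NP", "V", "NP")): 0.6,
    ("NP", ("ADJ", "N")): 0.3,
    ("NP", ("N",)): 0.7,
}


def label(tree: Tree) -> str:
    """Return the category label of a leaf or an inner node."""
    return tree if isinstance(tree, str) else tree[0]


def log_prob(tree: Tree) -> float:
    """Sum the log probabilities of all rules used in the tree."""
    if isinstance(tree, str):
        return 0.0  # leaves contribute no rule
    lhs, children = tree[0], tree[1:]
    rhs = tuple(label(child) for child in children)
    return math.log(RULE_PROB[(lhs, rhs)]) + sum(log_prob(c) for c in children)


tree = ("S", ("NP", "N"), "V", ("NP", "ADJ", "N"))
p = math.exp(log_prob(tree))  # product 0.6 * 0.7 * 0.3
```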
The evaluation of the first FOM is based
on the mechanism of contextual actions built
into the metagrammar conception (Smrž and
Horák, 2000). It distinguishes four kinds of
contextual actions, tests or constraints:
1. rule-tied actions
2. agreement fulfilment constraints
3. post-processing actions
4. actions based on derivation tree
The rule-based probability estimations are
solved on the first level by the rule-tied ac-
tions, which also serve as rule parameteriza-
tion modifiers.
Agreement fulfilment constraints are used
in generating the expanded grammar (Smrž
and Horák, 1999), or they also serve as
chart pruning actions. In terms of Maxwell
III and Kaplan (1991), the agreement
fulfilment constraints represent the
functional constraints, whose processing can
be interleaved with that of phrasal
constraints.
The post-processing actions are not
triggered until the chart is completed.
The main part of the FOM computation for a
particular input sentence is driven by
actions on this level. Some figures of merit
(e.g. the verb valency FOM, see Section 2.3)
demand exponential resources for computation
over the whole chart structure. This problem
is solved by splitting the calculation
process into a pruning part, which is run on
the level of post-processing actions, and a
reordering part, which is postponed until the
actions based on the derivation tree.
The actions that do not need to work with
the whole chart structure are run after the
best or n most probable derivation trees are
selected. These actions are used, for exam-
ple, for determination of possible verb va-
lencies within the input sentence, which can
produce a new ordering of the selected trees.
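The split into a pruning part and a
reordering part can be pictured with the
following Python sketch. It is a
simplification that works on an explicit list
of candidate trees instead of a chart, and the
two scoring functions and their values are
made-up stand-ins for a cheap FOM and a costly
one such as the verb valency FOM.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def prune_then_reorder(candidates: Sequence[T],
                       cheap_score: Callable[[T], float],
                       costly_score: Callable[[T], float],
                       n: int) -> List[T]:
    """Keep the n best candidates by the cheap score, then re-rank them."""
    # Pruning part: only the n most probable candidates survive.
    shortlist = heapq.nlargest(n, candidates, key=cheap_score)
    # Reordering part: the costly score is evaluated on the survivors only.
    return sorted(shortlist, key=costly_score, reverse=True)


# Hypothetical candidate trees, identified by name, with invented scores.
cheap = {"t1": 0.9, "t2": 0.8, "t3": 0.1, "t4": 0.7}
costly = {"t1": 0.2, "t2": 0.6, "t3": 0.99, "t4": 0.5}

ranked = prune_then_reorder(list(cheap), cheap.get, costly.get, n=3)
```

Here t3 is pruned before the costly score is
ever computed for it, which is the price paid
for bounding the expensive computation.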
2.2 Augmented n-grams and ξ2 FOM
The ξ1 FOM is based on rule frequencies and
is not capable of describing the contextual
information in the input. A popular tech-
nique for capturing the relations between
sentence constituents is the n-gram method,
which takes advantage of a fast and efficient
evaluation algorithm.
For instance, Caraballo and Charniak (1998)
present and evaluate different figures of
merit in the context of best-first chart
parsing. They recommend the boundary trigram
estimate, which achieved the best performance
on two testing grammars. This technique, as
well as stochastic POS tagging based on
n-gram statistics, achieves satisfactory
results for analytical languages (like
English). However, in the case of free word
order languages, current studies suggest that
these simple stochastic techniques suffer
considerably from the data sparseness problem
and require a huge amount of training data.
A reduction of the number of possible
training schemata that correctly keeps the
correspondence with the syntactic tree
structure is achieved by an elaborate
selection of n-gram candidates. While the
standard n-gram techniques work on the
surface level, this approach allows us to
move up to the syntactic tree level. We take
advantage of the ability of a lexical head to
represent the key features of the subtree
formed by its dependants (see Figure 2). The
principle of lexical heads has proved
fruitful in the analysis of free word order
languages. The resulting reduction in the
amount of training data needed may also be
crucial to the usability of this stochastic
technique.
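One way to read Figure 2 is sketched below in
Python. Both the tree encoding and the
padding rule are assumptions: the tree is one
plausible reading of the figure, and the rule
(take the heads of a node's children and, for
a two-child node, pad with the head of the
left neighbour or, failing that, the right
neighbour) was chosen only because it yields
the five trigrams listed in the figure. It is
not the authors' documented selection
procedure.

```python
from typing import List, Optional, Tuple, Union

# A leaf is a tag string; an inner node is (label, head_child_index, children).
Tree = Union[str, Tuple[str, int, list]]


def head(tree: Tree) -> str:
    """Return the lexical head tag that stands for a subtree."""
    if isinstance(tree, str):
        return tree
    _, head_idx, children = tree
    return head(children[head_idx])


def head_trigrams(tree: Tree,
                  left: Optional[str] = None,
                  right: Optional[str] = None) -> List[Tuple[str, str, str]]:
    """Collect trigrams over the heads of each node's children (assumed scheme)."""
    if isinstance(tree, str):
        return []
    _, _, children = tree
    heads = [head(child) for child in children]
    grams: List[Tuple[str, str, str]] = []
    if len(heads) >= 3:
        grams += [tuple(heads[i:i + 3]) for i in range(len(heads) - 2)]
    elif len(heads) == 2:
        if left is not None:        # pad with the left neighbour's head
            grams.append((left, heads[0], heads[1]))
        elif right is not None:     # otherwise with the right neighbour's head
            grams.append((heads[0], heads[1], right))
    for i, child in enumerate(children):
        child_left = heads[i - 1] if i > 0 else left
        child_right = heads[i + 1] if i + 1 < len(heads) else right
        grams += head_trigrams(child, child_left, child_right)
    return grams


# Assumed reading of the tree in Figure 2.
FIGURE2_TREE: Tree = ("S", 1, [
    ("NP1", 1, [("AP", 0, ["ADJ", "and", "ADJ"]), "N1"]),
    "V",
    ("NP4", 1, ["ADJ", ("NP4", 0, ["N4", "N2"])]),
])

trigrams = head_trigrams(FIGURE2_TREE)
```

Because every subtree is represented by a
single head tag, the number of distinct
trigram patterns stays far below that of
surface trigrams over full word sequences.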
2.3 Verb Valencies and ξ3 FOM
Our experiments have shown that, in the case
of a truly free word order language, the FOMs
ξ1 and ξ2 are not always able to discover
the correct reordering of analyses. To cope
with the above-mentioned difficulties in
Slavonic languages (namely Czech), we propose
to exploit language-specific features.
Preliminary results indicate that the most
advantageous approach is the one based upon
valencies of the verb phrase, a crucial
concept in traditional linguistics.
The part of the system dedicated to
exploitation of information obtained from a
list of verb valencies (Pala and Ševeček,
1997) is necessary for solving the
prepositional attachment problem in
particular. During the analysis of noun
groups and prepositional noun groups in the
role of verb valencies in a given input
sentence one needs to be able to distinguish
free adjuncts or modifiers from obligatory
valencies. We are testing a set of heuristic
rules that determine whether a found noun
group typically serves as a free adjunct. The
heuristics are based on the lexico-semantic
constraints (Smrž and Horák, 1999).
With Charles   Peter angered       at the last meeting   about the lost advance for payroll
Na Karla       se Petr rozhněval   na poslední schůzi    kvůli ztracené záloze na mzdu.
<HUMAN>                            <ACTIVITY>            <RECOMPENSE>
Figure 3: Free adjuncts identification by means of lexico-semantic constraints.
An example of the application of the
heuristics is depicted in Figure 3. In the
presented Czech sentence, the expression
na Karla (with Charles) is denoted as a verb
argument by the valency list of the verb
rozhněvat se (anger), while the prepositional
noun phrase na schůzi (at the meeting) is
classified as a free adjunct by the rule
specifying that the preposition na (at) in
combination with an <ACTIVITY> class member
(in locative) forms a location expression.
The remaining constituent na mzdu (for
payroll) is finally recommended as a modifier
of the preceding noun phrase záloze ([about
the] advance).
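The classification in this example can be
mimicked by a small Python sketch. The rule
table, the frame entry and the case tags
"acc" for na Karla and na mzdu are
assumptions made for illustration; only the
locative rule for <ACTIVITY> nouns is stated
in the text, and the "slot already filled"
test is a simplification of whatever the
actual heuristics do.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class PrepGroup:
    preposition: str
    noun: str
    case: str        # morphological case tag, e.g. "acc", "loc"
    sem_class: str   # lexico-semantic class of the noun


# (preposition, case, semantic class) -> kind of free adjunct (assumed table).
ADJUNCT_RULES: Dict[Tuple[str, str, str], str] = {
    ("na", "loc", "ACTIVITY"): "location",
}

# Prepositional slots licensed by each verb (assumed lexicon entry).
VALENCY_SLOTS: Dict[str, Set[Tuple[str, str]]] = {
    "rozhněvat se": {("na", "acc")},
}


def classify_groups(verb: str, groups: List[PrepGroup]) -> List[str]:
    """Label each prepositional group, filling every valency slot at most once."""
    free_slots = set(VALENCY_SLOTS.get(verb, set()))
    labels = []
    for g in groups:
        adjunct = ADJUNCT_RULES.get((g.preposition, g.case, g.sem_class))
        slot = (g.preposition, g.case)
        if adjunct is not None:
            labels.append(f"free adjunct ({adjunct})")
        elif slot in free_slots:
            free_slots.remove(slot)
            labels.append("verb argument")
        else:
            labels.append("noun modifier candidate")
    return labels


groups = [
    PrepGroup("na", "Karla", "acc", "HUMAN"),
    PrepGroup("na", "schůzi", "loc", "ACTIVITY"),
    PrepGroup("na", "mzdu", "acc", "RECOMPENSE"),
]
labels = classify_groups("rozhněvat se", groups)
```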
Certainly, we also need to remove the
dependence on the surface order. Therefore,
before the system compares the actual verb
valencies from the input sentence with the
list of valency frames found in the lexicon,
all the valency expressions are reordered. By
using the standard ordering of participants,
the valency frames can be handled as pure
sets independent of the current position of
verb arguments.
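A minimal Python sketch of such
order-independent matching follows. The case
ranking used as the "standard ordering" and
the lexicon entry are hypothetical; the point
is only that observed arguments are
normalized before the lookup, so their
surface position no longer matters.

```python
from typing import Dict, Iterable, List, Tuple

Slot = Tuple[str, str]  # (preposition or "", case tag)

# Hypothetical standard ordering of participants, expressed as a case ranking.
CASE_RANK: Dict[str, int] = {
    "nom": 0, "acc": 1, "dat": 2, "gen": 3, "loc": 4, "ins": 5,
}


def canonical(slots: Iterable[Slot]) -> Tuple[Slot, ...]:
    """Reorder valency expressions into the standard order."""
    return tuple(sorted(slots, key=lambda s: (CASE_RANK[s[1]], s[0])))


# Assumed lexicon: each verb lists its frames in canonical form.
LEXICON: Dict[str, List[Tuple[Slot, ...]]] = {
    "rozhněvat se": [canonical([("", "nom"), ("na", "acc")])],
}


def matches_frame(verb: str, observed: Iterable[Slot]) -> bool:
    """Check the observed arguments against the verb's frames, ignoring order."""
    return canonical(observed) in LEXICON.get(verb, [])


in_surface_order = [("na", "acc"), ("", "nom")]   # "Na Karla se Petr rozhněval"
found = matches_frame("rozhněvat se", in_surface_order)
```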
2.4 Preferred Word Order
In analytical languages, the word order is
usually taken as rather fixed and that is why
it can be employed in parsing tree prun-
ing algorithms. However, in case of inflec-
tional languages, the approaches to word or-
der analysis are diverse. The most influen-
tial theory works with the topic-focus artic-
ulation (Sgall et al., 1986). Although nearly
all rules that could limit the order of con-
stituents in Czech sentences can be fully re-
laxed, a standard order of participants can
be defined. A corpus analysis of general
texts affirms that this preferred word order
is often followed and that it can be advanta-
geously used as an arbiter for best analysis
selection.
Cases where the ξi FOMs do not
unambiguously select the best candidates can
be resolved by the preferred word order,
expressed in the form of functional weights
λi( ) with appropriate parameters.
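The role of the preferred word order as an
arbiter can be illustrated by the Python
sketch below. It is a simpler variant than
functional weights: among candidates whose
scores tie, it picks the one whose
participant order deviates least from a
standard order. The participant labels and
the standard order itself are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Hypothetical standard order of participants.
STANDARD_ORDER = ["actor", "addressee", "patient"]
RANK = {role: i for i, role in enumerate(STANDARD_ORDER)}


def inversions(order: Sequence[str]) -> int:
    """Count pairs of participants that appear in non-standard relative order."""
    ranks = [RANK[role] for role in order]
    return sum(1 for a, b in combinations(ranks, 2) if a > b)


def pick_best(candidates: List[Tuple[float, List[str]]],
              eps: float = 1e-6) -> Tuple[float, List[str]]:
    """Among the top-scoring candidates, prefer the most standard word order."""
    top = max(score for score, _ in candidates)
    tied = [c for c in candidates if top - c[0] <= eps]
    return min(tied, key=lambda c: inversions(c[1]))


candidates = [
    (0.8, ["patient", "actor"]),
    (0.8, ["actor", "patient"]),
    (0.5, ["actor", "patient"]),
]
best = pick_best(candidates)
```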
3 Results
This section presents the results of
experiments with the stated figures of merit
for the best analysis selection algorithm.
First, the acquisition of a training data set
derived from a standard dependency tree bank
for Czech is described. Then we compare the
running times of our parser with those of
another available parser.
3.1 The Training Set Acquisition
A common approach to acquiring the
statistical data for the analysis of syntax
is to learn the values from a fully tagged
tree bank training corpus. Building such
corpora is tedious and expensive work, and
it requires the cooperation of a team of
linguists and computer scientists. At present
the only source of Czech tree bank data is
the Prague Dependency Tree Bank (PDTB)
(Hajič, 1998), which includes dependency
analyses of about 100000 Czech sentences.
First, in order to be able to exploit the
data from PDTB, we have supplemented our
grammar with the dependency specification
for constituents. Thus the output of the
analysis can be presented in the form of a
pure dependency tree. At the same time we
unify classes of derivation trees that
correspond to one dependency structure. We
then define a canonical form of the
derivation to select one representative of
the class that is used for assigning the edge
probabilities.
precision on sentences                       percentage
of 1-10 words                                86.9%
of 11-20 words                               78.2%
of more than 20 words                        63.1%
overall precision                            79.3%
number of sentences with mistakes in input    8.0%
Table 1: Precision estimate
This technique enables us to relate the
output of our parser to the PDTB data.
However, the information in the dependency
structures can be exploited further, and in
an automatically controlled environment. For
this purpose, we use the mechanism of pruning
constraints. A set of strict limitations is
given to the syntactic analyser, which passes
on just the compliant parses. The constraints
can be either supplied manually for a
particular sentence by linguists, or obtained
from the transformed dependency tree in
PDTB.
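The effect of pruning constraints can be
sketched in Python as a filter over candidate
parses. The encoding of a parse as a set of
(head position, dependent position) edges and
the split into required and forbidden edges
are assumptions made for this illustration;
the actual constraint language of the
analyser is not described here.

```python
from typing import FrozenSet, Iterable, List, Tuple

Edge = Tuple[int, int]  # (head position, dependent position), assumed encoding


def compliant(parse_edges: FrozenSet[Edge],
              required: FrozenSet[Edge],
              forbidden: FrozenSet[Edge] = frozenset()) -> bool:
    """A parse complies if it has every required edge and no forbidden one."""
    return required <= parse_edges and not (forbidden & parse_edges)


def prune(parses: Iterable[FrozenSet[Edge]],
          required: FrozenSet[Edge],
          forbidden: FrozenSet[Edge] = frozenset()) -> List[FrozenSet[Edge]]:
    """Pass on just the compliant parses."""
    return [p for p in parses if compliant(p, required, forbidden)]


# Three hypothetical parses of a four-word sentence.
parses = [
    frozenset({(2, 1), (2, 3), (3, 4)}),
    frozenset({(2, 1), (2, 3), (2, 4)}),
    frozenset({(3, 1), (3, 2), (3, 4)}),
]
kept = prune(parses, required=frozenset({(2, 1)}), forbidden=frozenset({(2, 4)}))
```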
Table 1 summarizes the precision estimates
computed on real corpus data. The
measurements presented here may understate
the actual benefits of our approach because
an estimated 8% of the sentences in the input
corpus contain mistakes.
3.2 Running Time Comparison
Comparing the efficiency of different
parsers and parsing techniques gives a
strong impulse to improving the actual
implementations. Since there is no other
generally applicable and available NL parser
for Czech, we have compared the running times
of our syntactic analyser on the data
provided at
http://www.cogs.susx.ac.uk/lab/nlp/carroll/cfg-resources/.
These WWW pages resulted from discussions
at the Efficiency in Large Scale Parsing
Systems Workshop at COLING’2000, where
one of the main conclusions was the need for
a bank of data for standardization of parser
benchmarking. The best results reported
on the standard data sets (ATIS and PT
grammars) to date are the comparison data
by Robert C. Moore (Moore, 2000). The
package contains only the testing grammars
with input sentences; the release of a
reference implementation of the parser is
currently being prepared (Moore, personal
communication).
Grammar   Parser               Time (s)
ATIS      Moore’s LC3 + UTF    11.6
ATIS      our system            7.2
PT        Moore’s LC3 + UTF    41.8
PT        our system           57.2
Table 2: Running times comparison (in seconds)
Since we could not run the reference
implementation of Moore’s parser on the same
machine, the above-mentioned times are not
fully comparable (we assume that our tests
were run on a slightly faster machine than
Moore’s tests). We are preparing a detailed
comparison, which will try to explain the
differences in results when parsing with
grammars of varying ambiguity levels.
4 Conclusions
The methods of the best analysis selection
algorithm described in this paper show that
the parsing of inflectional languages calls for
sensitive approaches to the evaluation of the
appropriate figures of merit. The case study
of Czech suggests that the use of language
specific features can improve the results of
simple stochastic techniques on annotated
corpus data.
Future directions of our research lead to
improving the quality of the training data
set so that it covers all the most frequent
language phenomena. Our investigations
indicate that, in addition to verbs, the
best analysis selection algorithms could also
take advantage of valency frames of other
POS categories (nouns, adjectives).

References

R. J. Bobrow. 1991. Statistical agenda
parsing. In Proceedings of the February
1991 DARPA Speech and Natural Lan-
guage Workshop, pages 222-224. San Ma-
teo: Morgan Kaufmann.

H. Bunt and A. Nijholt, editors. 2000. Ad-
vances in Probabilistic and Other Parsing
Technologies. Kluwer Academic Publish-
ers.

S. Caraballo and E. Charniak. 1998. New
figures of merit for best-first probabilistic
chart parsing. Computational Linguistics,
24(2):275-298.

M. Chitrao and R. Grishman. 1990. Statisti-
cal parsing of messages. In Proceedings of
the Speech and Natural Language Work-
shop, pages 263-266, Hidden Valley, PA.

J. Hajič. 1998. Building a syntactically
annotated corpus: The Prague Dependency
Treebank. In Issues of Valency and Meaning,
pages 106-132, Prague. Karolinum.

J. T. Maxwell III and R. M. Kaplan. 1991.
The interface between phrasal and func-
tional constraints. In M. Rosner, C. J.
Rupp, and R. Johnson, editors, Proceed-
ings of the Workshop on Constraint Prop-
agation, Linguistic Description, and Com-
putation, pages 105-120. Instituto Dalle
Molle IDSIA, Lugano. Also in Computa-
tional Linguistics, Vol. 19, No. 4, 571-590,
1994.

R. C. Moore. 2000. Improved left-corner
chart parsing for large context-free gram-
mars. In Proceedings of the 6th IWPT,
pages 171-182, Trento, Italy.

K. Pala and P. Ševeček. 1997. Valencies of
Czech verbs. In Proceedings of Works of
Philosophical Faculty at the University of
Brno, pages 41-54. Brno. (in Czech).

P. Sgall, E. Hajičová, and J. Panevová.
1986. The Meaning of the Sentence
and Its Semantic and Pragmatic Aspects.
Academia/Reidel Publishing Company,
Prague, Czech Republic/Dordrecht,
Netherlands.

P. Smrž and A. Horák. 1999. Implementation
of efficient and portable parser for
Czech. In Text, Speech and Dialogue:
Proceedings of the Second International
Workshop TSD’1999, Pilsen, Czech Republic.
Springer Verlag, Lecture Notes in
Computer Science, Volume 1692.

P. Smrž and A. Horák. 2000. Large
scale parsing of Czech. In Proceedings of
Efficiency in Large-Scale Parsing Systems
Workshop, COLING’2000, pages 43-50,
Saarbrücken: Universität des Saarlandes.
