Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, pages 33–40,
New York City, New York, June 2006. ©2006 Association for Computational Linguistics
A Probabilistic Search for the Best Solution Among Partially Completed
Candidates
Filip Ginter, Aleksandr Mylläri, and Tapio Salakoski
Turku Centre for Computer Science (TUCS) and
Department of Information Technology
University of Turku
Lemminkäisenkatu 14 A
20520 Turku, Finland
first.last@it.utu.fi
Abstract
We consider the problem of identifying
among many candidates a single best so-
lution which jointly maximizes several
domain-specific target functions. Assum-
ing that the candidate solutions can be
generated incrementally, we model the er-
ror in prediction due to the incomplete-
ness of partial solutions as a normally
distributed random variable. Using this
model, we derive a probabilistic search al-
gorithm that aims at finding the best solu-
tion without the necessity to complete and
rank all candidate solutions. We do not as-
sume a Viterbi-type decoding, allowing a
wider range of target functions.
We evaluate the proposed algorithm on the
problem of best parse identification, combining
simple heuristics with more complex machine-learning
based target functions. We show that the search algorithm
is capable of identifying candidates with a
very high score without completing a sig-
nificant proportion of the candidate solu-
tions.
1 Background
Most of the current NLP systems assume a pipeline
architecture, where each level of analysis is imple-
mented as a module that produces a single, locally
optimal solution that is passed to the next module in
the pipeline. There has recently been an increased
interest in the application of joint inference, which
identifies a solution that is globally optimal through-
out the system and avoids some of the problems of
the pipeline architecture, such as error propagation.
We assume, at least conceptually, a division of
the joint inference problem into two subproblems:
that of finding a set of solutions that are structurally
compatible with each of the modules, and that of se-
lecting the globally best of these structurally correct
solutions. Many of the modules define a target func-
tion that scores the solutions by some domain cri-
teria based on local knowledge. The globally best
solution maximizes some combination of the target
functions, for example a sum.
For illustration, consider a system consisting of
two modules: a POS tagger and a parser. The POS
tagger generates a set of tag sequences that are com-
patible with the sentence text. Further, it may im-
plement a target function, based, for instance, on
tag n-grams, that scores these sequences according
to POS-centric criteria. The parser produces a set
of candidate parses and typically also implements a
target function that scores the parses based on their
structural and lexical features. Each parse that is
compatible with both the POS tagger and the parser
is structurally correct. The best solution may be
defined, for instance, as the solution that maximizes
the sum of the scores of the POS- and parser-centric
target functions.
In practice, the set of structurally correct solu-
tions may be computed, for example, through the
intersection or composition of finite-state automata
as in the formalism of finite-state intersection gram-
mars (Koskenniemi, 1990). Finding the best so-
lution may be implemented as a best-path search
through Viterbi decoding, given a target function
that satisfies the Viterbi condition.
Most recent approaches to NLP tasks such as parse
re-ranking, however, use feature-based representations
and machine-learning induced target functions, which do
not allow efficient search strategies that are guaranteed
to find the global optimum. In the general case, all
structurally correct solutions have to be generated and
scored by the target functions in order to guarantee that
the globally optimal solution is found. Further, each of
the various problems in natural language processing is
typically approached with a different class of models,
ranging from n-gram statistics to complex regressors
and classifiers such as support vector machines.
These different approaches need to be combined in
order to find the globally optimal solution. Therefore,
in our study we aim to develop a search strategy that
allows a wider range of target functions to be combined.
An alternative approach is that of propagating the n
best solutions through the pipeline system, where
each step re-ranks the solutions by local criteria
(Ji et al., 2005). Another commonly used method is to
incorporate a wide range of features representing
information from all levels of analysis into a single
master classifier (Kambhatla, 2004; Zelenko et al., 2004).
In this paper, we assume the possibility of gen-
erating the structurally correct solutions incremen-
tally, through a sequence of partially completed so-
lutions. We then derive a probabilistic search algo-
rithm that attempts to identify the globally best solu-
tion, without fully completing all structurally correct
solutions. Further, we do not impose strong restric-
tions, such as the Viterbi assumption, on the target
functions.
To a certain extent, this approach is related to the
problem of cost-sensitive learning, where obtaining
a feature value is associated with a cost and the
objective is to minimize the cost of training data
acquisition and the cost of instance classification
(Melville et al., 2004). The crucial difference, however,
is that we do not assume any control over which feature
is obtained next when a partial solution is advanced.
2 Method
Let us consider a system in which there are N so-
lutions s1, . . ., sN ∈ S to a problem and M tar-
get functions f1, . . ., fM, where fk : S → R, that
assign a score to each of the solutions. The score
fk(si) expresses the extent to which the solution
si satisfies the criterion implemented by the target
function fk. The overall score of a solution si
\[ f(s_i) = \sum_{k=1}^{M} f_k(s_i) \tag{1} \]
is the sum of the scores given by the individual target
functions. The objective is to identify ŝ, the best
among the N possible solutions, that maximizes the
overall score:
\[ \hat{s} = \arg\max_{s_i} f(s_i) . \tag{2} \]
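As a point of reference, Equations 1 and 2 can be evaluated exhaustively when all N solutions are available in completed form. The following minimal Python sketch does just that; the names `overall_score` and `best_solution` and the toy target functions are illustrative assumptions and not part of the system described here.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # type of a complete solution


def overall_score(solution: S, target_functions: Sequence[Callable[[S], float]]) -> float:
    """Equation 1: the overall score is the sum of the target function scores."""
    return sum(f_k(solution) for f_k in target_functions)


def best_solution(solutions: Sequence[S], target_functions: Sequence[Callable[[S], float]]) -> S:
    """Equation 2: exhaustive arg max over the completed solutions."""
    return max(solutions, key=lambda s: overall_score(s, target_functions))


# Toy example (hypothetical): solutions are strings scored by two criteria.
toy_solutions = ["ab", "abcd", "abc"]
toy_targets = [lambda s: float(len(s)), lambda s: -0.5 * s.count("d")]
toy_best = best_solution(toy_solutions, toy_targets)
```

This exhaustive evaluation is the cost that the search derived below tries to avoid, since it requires every candidate to be completed and scored.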
Suppose that the solutions are generated in-
crementally so that each solution si can be
reached through a sequence of F partial solutions
si,1, si,2, . . ., si,F , where si,F = si. Let further
u : S → (0, 1] be a measure of the degree of completion
of a particular solution. For a complete solution si,
u(si) = 1, and for a partial solution si,n with n < F,
u(si,n) < 1. For instance, when assigning POS tags
to the words of a sentence, the degree of completion
could be defined as the number of words assigned
with a POS tag so far, divided by the total number of
words in the sentence.
The score of a partial solution si,n is, to a certain
extent, a prediction of the score of the correspond-
ing complete solution si. Intuitively, the accuracy of
this prediction depends on the degree of completion.
The score of a partial solution with a high degree
of completion is generally closer to the final score,
compared to a partial solution with a low degree of
completion.
Let
δk(si,n) = fk(si) − fk(si,n) (3)
be the difference between the scores of si and si,n.
That is, δk(si,n) is the error in score caused by the in-
completeness of the partial solution si,n. As the so-
lutions are generated incrementally, the exact value
of δk(si,n) is not known at the moment of generating
si,n because the solution si has not been completed
yet. However, we can model the error based on the
knowledge of si,n. We assume that, for a given si,n,
the error δk(si,n) is a random variable distributed ac-
cording to a probability distribution with a density
function ∆k, denoted as
δk(si,n) ∼ ∆k(δ; si,n) . (4)
The partial solution si,n is a parameter to the distri-
bution and, in theory, each partial solution gives rise
to a different distribution of the same general shape.
We assume that the error δ(si,n) is distributed
around a mean value and for a ‘reasonably behav-
ing’ target function, the probability of a small error
is higher than the probability of a large error. Ideally,
the target function will not exhibit any systematic error,
and the mean value would thus be zero¹. A non-zero mean
indicates a systematic bias: for instance, a positive mean
error indicates that the partial scores systematically
underestimate the final score. The mean
error should approach 0 as the degree of completion
increases and the error of a complete solution is al-
ways 0. We have further argued that the reliability
of the prediction grows with the degree of comple-
tion. That is, the error of a partial solution with a
high degree of completion should exhibit a smaller
variance, compared to that of a largely incomplete
solution. The variance of the error for a complete
solution is always 0.
Knowing the distribution ∆k of the error δk, the
density of the distribution dk(f; si,n) of the final
score fk(si) is obtained by shifting the density of
the error δk(si,n) by fk(si,n), that is,
fk(si) ∼ dk(f; si,n) , (5)
where
dk(f; si,n) = ∆k(f − fk(si,n) ; si,n) . (6)
So far, we have discussed the case of a single tar-
get function fk. Let us now consider the general
case of M target functions. Knowing the final score
density dk for the individual target functions fk, it is
now necessary to find the density of the overall score
f(si). By Equation 1, it is distributed as the sum
[Footnote 1: We will see in our evaluation experiments that this is not
the case, and the target functions may exhibit a systematic bias
in the error δ.]
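[Figure 1 plot: the density d(f; si,n) over the score axis, with marks at 0, f(si,n), µ(si,n), and η(si,n); the offset between f(si,n) and µ(si,n) is the systematic bias in the error δ, the spread is σ2(si,n), and the tail area above η(si,n) is ε.]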
Figure 1: The probability density d(f; si,n) of the
distribution of the final score f(si), given a partial
solution si,n. The density is assumed normally dis-
tributed, with mean µ(si,n) and variance σ2(si,n).
With probability 1 − ε, the final score is less than
η(si,n).
of the random variables f1(si) , . . ., fM(si). There-
fore, assuming independence, its density is the con-
volution of the densities of these variables, that is,
given si,n,
d(f; si,n) = (d1 ∗ . . . ∗ dM)(f; si,n) , (7)
and
f(si) ∼ d(f; si,n) . (8)
We have assumed the independence of the target
function scores. Further, we will make the assump-
tion that d takes the form of the normal distribution,
which is convolution-closed, a property necessary
for efficient calculation by Equation 7. We thus have
\[ d(f; s_{i,n}) = n\big(f;\ \mu(s_{i,n}),\ \sigma^2(s_{i,n})\big) , \tag{9} \]
where n is the normal density function. While it
is unlikely that independence and normality hold
strictly, it is a commonly used approximation, nec-
essary for an analytical solution of (7). The notions
introduced so far are illustrated in Figure 1.
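Under the normality and independence assumptions, the convolution in Equation 7 reduces to adding means and variances. The sketch below illustrates this, assuming that each error density ∆k is normal with mean µk(si,n) and variance σ2k(si,n); the function name and the example numbers are hypothetical.

```python
from typing import Sequence, Tuple


def overall_score_distribution(
    partial_scores: Sequence[float],
    error_means: Sequence[float],
    error_variances: Sequence[float],
) -> Tuple[float, float]:
    """Return (mu, sigma2) of the normal density d(f; s_in) of Equation 9.

    Each target function k contributes a normal density with mean
    f_k(s_in) + mu_k (Equation 6 shifts the error density by the partial
    score) and variance sigma2_k. The convolution of independent normal
    densities (Equation 7) is normal with summed means and variances.
    """
    if not (len(partial_scores) == len(error_means) == len(error_variances)):
        raise ValueError("one score, mean and variance per target function")
    mu = sum(f + m for f, m in zip(partial_scores, error_means))
    sigma2 = sum(error_variances)
    return mu, sigma2


# Two hypothetical target functions evaluated on one partial solution.
mu, sigma2 = overall_score_distribution([0.5, -1.0], [0.25, 0.0], [1.0, 0.5])
```

With these example values, the overall final score is modeled as normal with mean −0.25 and variance 1.5.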
2.1 The search algorithm
We will now apply the model introduced in the pre-
vious section to derive a probabilistic search algo-
rithm.
Let us consider two partial solutions si,n and sj,m
with the objective of deciding which one of them is
‘more promising’, that is, more likely to lead to a
complete solution with a higher score. The condi-
tion of ‘more promising’ can be defined in several
ways. For instance, once again assuming indepen-
dence, it is possible to directly compute the proba-
bility P(f(si) < f(sj)):
\[ P(f(s_i) < f(s_j)) = P(f(s_i) - f(s_j) < 0) = \int_{-\infty}^{0} \big(d_{s_{i,n}} * (-d_{s_{j,m}})\big)(f)\, df , \tag{10} \]
where dsi,n refers to the function d(f; si,n). Since
d is the convolution-closed normal density, Equa-
tion 10 can be directly computed using the normal
cumulative distribution. The disadvantage of this
definition is that the cumulative distribution needs
to be evaluated separately for each pair of partial
solutions. Therefore, we assume an alternate defi-
nition of ‘more promising’ in which the cumulative
distribution is evaluated only once for each partial
solution.
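For comparison, the pairwise probability of Equation 10 can be sketched as follows. The sketch assumes that the two final scores are independent and normally distributed with strictly positive variances, so that their difference is again normal; the function names are illustrative.

```python
import math


def normal_cdf(x: float, mean: float, variance: float) -> float:
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


def prob_first_scores_lower(mu_i: float, var_i: float, mu_j: float, var_j: float) -> float:
    """P(f(s_i) < f(s_j)) for independent normal final scores.

    The difference f(s_i) - f(s_j) is normal with mean mu_i - mu_j and
    variance var_i + var_j, so Equation 10 is its CDF evaluated at 0.
    """
    return normal_cdf(0.0, mu_i - mu_j, var_i + var_j)


p_equal = prob_first_scores_lower(0.0, 1.0, 0.0, 1.0)
p_lower = prob_first_scores_lower(-1.0, 1.0, 1.0, 1.0)
```

One such evaluation is needed for every pair of partial solutions, which motivates the per-solution definition that follows.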
Let ε ∈ [0, 1] be a probability value and η(si,n)
be the score such that P(f(si) > η(si,n)) = ε. The
value of η(si,n) can easily be computed from the in-
verse cumulative distribution function correspond-
ing to the density function d(f; si,n). The interpre-
tation of η(si,n) is that with probability of 1 − ε,
the partial solution si,n, once completed, will lead
to a score smaller than η(si,n). The constant ε is
a parameter, set to an appropriate small value. See
Figure 1 for illustration.
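The value η(si,n) can be sketched with the inverse cumulative distribution function from the Python standard library. The function name is an assumption, and the sketch restricts ε to the open interval (0, 1), where the inverse is defined; a zero variance is treated as a complete solution whose score is known exactly.

```python
from statistics import NormalDist


def maximal_expected_score(mu: float, sigma2: float, eps: float) -> float:
    """Return eta such that P(final score > eta) = eps under N(mu, sigma2)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if sigma2 == 0.0:
        # A complete solution has no remaining error: the score is certain.
        return mu
    return NormalDist(mu, sigma2 ** 0.5).inv_cdf(1.0 - eps)


eta_strict = maximal_expected_score(0.0, 1.0, 0.01)
eta_loose = maximal_expected_score(0.0, 1.0, 0.2)
```

A smaller ε gives a larger η, so a partial solution stays promising for longer.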
We will refer to η(si,n) as the maximal expected
score of si,n. Of the two partial solutions, we consider
as ‘more promising’ the one whose maximal
expected score is higher. As illustrated in Figure 2,
it is possible for a partial solution si,n to be more
promising even though its score f(si,n) is lower than
that of some other partial solution sj,m.
Further, given a complete solution si and a partial
solution sj,m, a related question is whether sj,m is a
promising solution, that is, whether it is likely that
advancing it will lead to a score higher than f(si).
Using the notion of maximal expected score, we say
that a solution is promising if η(sj,m) > f(si).
With the definitions introduced so far, we are
[Figure 2 plot: the densities d(f; si,n) and d(f; sj,m) over the score axis, with marks at f(si,n), f(sj,m), η(si,n), and η(sj,m).]
Figure 2: Although the score of si,n is lower than
the score of sj,m, the partial solution si,n is more
promising, since η(si,n) > η(sj,m). Note that for
the sake of simplicity, a zero systematic bias of the
error δ is assumed, that is, the densities are centered
around the partial solution scores.
now able to perform two basic operations: compare
two partial solutions, deciding which one of them is
more promising, and compare a partial solution with
some complete solution, deciding whether the par-
tial solution is still promising or can be disregarded.
These two basic operations are sufficient to devise
the following search algorithm.
• Maintain a priority queue of partial solutions,
ordered by their maximal expected score.
• In each step, remove from the queue the par-
tial solution with the highest maximal expected
score, advance it, and enqueue any resulting
partial solutions.
• Iterate while the maximal expected score of the
most promising partial solution remains higher
than the score of the best complete solution dis-
covered so far.
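The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the evaluation: the callbacks `advance`, `is_complete`, `score`, and `error_model` are hypothetical, and `error_model` is assumed to return the mean and variance of the overall error δ for a partial solution.

```python
import heapq
import itertools
from statistics import NormalDist
from typing import Callable, Iterable, Optional, Tuple, TypeVar

P = TypeVar("P")  # type of a (partial or complete) solution


def probabilistic_search(
    initial: P,
    advance: Callable[[P], Iterable[P]],
    is_complete: Callable[[P], bool],
    score: Callable[[P], float],
    error_model: Callable[[P], Tuple[float, float]],
    eps: float = 0.05,
) -> Tuple[Optional[P], float]:
    """Return the best complete solution found and its score."""
    z = NormalDist().inv_cdf(1.0 - eps)  # standard normal quantile

    def eta(p: P) -> float:
        # Maximal expected score: partial score shifted by the error model.
        mean, variance = error_model(p)
        return score(p) + mean + z * variance ** 0.5

    tie = itertools.count()  # tie-breaker so the heap never compares solutions
    queue = [(-eta(initial), next(tie), initial)]  # max-heap via negated eta
    best, best_score = None, float("-inf")
    # Iterate while the most promising partial solution is still promising.
    while queue and -queue[0][0] > best_score:
        _, _, partial = heapq.heappop(queue)
        for successor in advance(partial):
            if is_complete(successor):
                if score(successor) > best_score:
                    best, best_score = successor, score(successor)
            else:
                heapq.heappush(queue, (-eta(successor), next(tie), successor))
    return best, best_score


# Toy problem (hypothetical): choose 4 bits; the score is a weighted sum.
WEIGHTS = [1.0, -2.0, 3.0, 0.5]


def toy_score(p: Tuple[int, ...]) -> float:
    return sum(w * b for w, b in zip(WEIGHTS, p))


toy_best, toy_best_score = probabilistic_search(
    initial=(),
    advance=lambda p: [p + (0,), p + (1,)],
    is_complete=lambda p: len(p) == len(WEIGHTS),
    score=toy_score,
    error_model=lambda p: (0.0, 4.0 * (len(WEIGHTS) - len(p))),
    eps=0.01,
)
```

In the toy problem the error variance is deliberately generous, so the maximal expected score never falls below the best reachable final score and the optimum is found; with a tighter error model or a larger ε the search may stop before reaching it.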
The parameter ε primarily affects how early the
algorithm stops; however, it also influences the order
in which the solutions are considered. Low values
of ε result in higher maximal expected scores
and therefore partial solutions need to be advanced
to a higher degree of completion before they can be
disregarded as unpromising.
While there are no particular theoretical restric-
tions on the target functions, there is an important
practical consideration. Since the target function is
evaluated every time a partial solution si,n is ad-
vanced into si,n+1, being able to use the informa-
tion about si,n to efficiently compute fk(si,n+1) is
necessary.
The algorithm is to a large extent related to the A*
search algorithm, which maintains a priority queue
of partial solutions, ordered according to a score
g(x) + h(x), where g(x) is the score of x and h(x)
is a heuristic overestimate² of the final score of the
goal reached from x. Here, the maximal expected
score of a partial solution is an overestimate with
the probability of 1 − ε and can be viewed as a
probabilistic counterpart of the A* heuristic component
h(x). Note that A* only guarantees to find the best
solution if h(x) never underestimates, which is not
the case here.
2.2 Estimation of µk(si,n) and σ2k(si,n)
So far, we have assumed that for each partial so-
lution si,n and each target function fk, the density
∆k(δ; si,n) is defined as a normal density specified
by the mean µk(si,n) and variance σ2k(si,n). This
density models the error δk(si,n) that arises due to
the incompleteness of si,n. The parameters µk(si,n)
and σ2k(si,n) are, in theory, different for each si,n and
reflect the behavior of the target function fk as well
as the degree of completion and possibly other at-
tributes of si,n. It is thus necessary to estimate these
two parameters from data.
Let us, for each target function fk, consider a
training set of observations Tk ⊂ S × R. Each
training observation tj = (sj,nj, δk(sj,nj)) ∈ Tk
corresponds to a solution sj,nj with a known error
δk(sj,nj) = fk(sj) − fk(sj,nj).
[Footnote 2: In the usual application of A* to shortest-path search, h(x)
is a heuristic underestimate since the objective is to minimize
the score.]
Before we introduce the method to estimate the
density ∆k(δ; si,n) for a particular si,n, we discuss
data normalization. The overall score f(si,n) is
defined as the sum of the scores assigned by the
individual target functions fk. Naturally, it is desirable
that these scores are of comparable magnitudes.
Therefore, we normalize the target functions using
the z-normalization
\[ z(x) = \frac{x - \mathrm{mean}(x)}{\mathrm{stdev}(x)} . \tag{11} \]
Each target function fk is normalized separately,
based on the data in the training set Tk. Throughout
our experiments, the values of the target function are
always z-normalized.
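A minimal sketch of the z-normalization of Equation 11, fitted on training scores and then applied to new values. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def fit_z_normalizer(training_scores: Sequence[float]) -> Callable[[float], float]:
    """Return z(x) = (x - mean) / stdev fitted on the training scores."""
    m = mean(training_scores)
    sd = pstdev(training_scores)  # population standard deviation (an assumption)
    if sd == 0.0:
        raise ValueError("scores are constant; z-normalization is undefined")
    return lambda x: (x - m) / sd


# Hypothetical raw scores of one target function on its training set.
z = fit_z_normalizer([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

One normalizer would be fitted per target function, so that the summed scores are of comparable magnitude.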
Let us now consider the estimation of the mean
µk(si,n) and variance σ2k(si,n) that define the den-
sity ∆k(δ; si,n). Naturally, it is not possible to es-
timate the distribution parameters for each solution
si,n separately. Instead, we approximate the parame-
ters based on two most salient characteristics of each
solution: the degree of completion u(si,n) and the
score fk(si,n). Thus,
µk(si,n) ≈ µk(u(si,n) , fk(si,n)) (12)
σ2k(si,n) ≈ σ2k(u(si,n) , fk(si,n)) . (13)
Let us assume the following notation: ui = u(si,n),
fi = fk(si,n), uj = u(sj,nj), fj = fk(sj,nj), and
δj = δk(sj,nj). The estimate is obtained from Tk
using kernel smoothing (Silverman, 1986):
\[ \mu_k(u_i, f_i) = \frac{\sum_{t_j \in T_k} \delta_j K}{\sum_{t_j \in T_k} K} \tag{14} \]
and
\[ \sigma^2_k(u_i, f_i) = \frac{\sum_{t_j \in T_k} (\delta_j - \mu_k(u_i, f_i))^2 K}{\sum_{t_j \in T_k} K} , \tag{15} \]
where K stands for the kernel value Kui,fi(uj, fj).
The kernel K is the product of two Gaussians, cen-
tered at ui and fi, respectively.
\[ K_{u_i,f_i}(u_j, f_j) = n\big(u_j;\ u_i,\ \sigma^2_u\big) \cdot n\big(f_j;\ f_i,\ \sigma^2_f\big) , \tag{16} \]
where n(x; µ, σ2) is the normal density function.
The variances σ2u and σ2f control the degree of
smoothing along the u and f axes, respectively.
High variance results in stronger smoothing, com-
pared to low variance. In our evaluation, we set the
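[Figure 3 plots: four panels, µA, σ2A, µB, and σ2B, each over the degree of completion (horizontal axis, ticks 0.2 to 1.0) and the score (vertical axis, ticks −4 to 1 for the A panels and −1 to 5 for the B panels).]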
Figure 3: Mean and variance of the error δ(si,n).
By (12) and (13), the error is approximated as a
function of the degree of completion u(si,n) and the
score fk(si,n). The degree of completion is on the
horizontal and the score on the vertical axis. The
estimates (µA, σ2A) and (µB, σ2B) correspond to the
RLSC regressor and average link length target func-
tions, respectively.
variance such that σu and σf are equal to 10% of the
distance from min(uj) to max(uj) and from min(fj)
to max(fj), respectively.
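The kernel-smoothed estimates of Equations 14 to 16 can be sketched as follows. The function names and the training triples are illustrative assumptions; the sketch also assumes that the training values span a non-zero range along both axes and that at least one kernel weight is non-zero.

```python
import math
from typing import Sequence, Tuple


def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Normal density n(x; mean, variance)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def estimate_error_params(
    u_i: float,
    f_i: float,
    training: Sequence[Tuple[float, float, float]],  # triples (u_j, f_j, delta_j)
    bandwidth_fraction: float = 0.1,
) -> Tuple[float, float]:
    """Kernel-smoothed mean and variance of the error (Equations 14-16)."""
    us = [u for u, _, _ in training]
    fs = [f for _, f, _ in training]
    # sigma_u and sigma_f are 10% of the observed range along each axis.
    var_u = (bandwidth_fraction * (max(us) - min(us))) ** 2
    var_f = (bandwidth_fraction * (max(fs) - min(fs))) ** 2
    # Equation 16: product of two Gaussians centered at u_i and f_i.
    weights = [normal_pdf(u, u_i, var_u) * normal_pdf(f, f_i, var_f) for u, f, _ in training]
    total = sum(weights)
    mu = sum(w * d for w, (_, _, d) in zip(weights, training)) / total
    sigma2 = sum(w * (d - mu) ** 2 for w, (_, _, d) in zip(weights, training)) / total
    return mu, sigma2


# Two hypothetical training observations with opposite errors.
toy_training = [(0.0, 0.0, -1.0), (1.0, 1.0, 1.0)]
mid_mu, mid_sigma2 = estimate_error_params(0.5, 0.5, toy_training)
near_mu, near_sigma2 = estimate_error_params(0.0, 0.0, toy_training)
```

A query point halfway between the two observations weights them equally, while a query at the first observation is dominated by it.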
The kernel-smoothed estimates of µ and σ2 for
two of the four target functions used in the evaluation
experiments are illustrated in Figure 3. While
both estimates demonstrate the decrease in both
mean and variance for u approaching 1, the target
functions generally exhibit a different behavior.
Note that the values are clearly dependent on
both the score and the degree of completion, indicating
that the degree of completion alone is not sufficiently
representative of the partial solutions. Ideally,
the values of both the mean and variance should
be strictly 0 for u = 1; however, due to the effect of
smoothing, they remain non-zero.
3 Evaluation
We test the proposed search algorithm on the prob-
lem of dependency parsing. We have previously de-
veloped a finite-state implementation (Ginter et al.,
2006) of the Link Grammar (LG) parser (Sleator and
Temperley, 1991) which generates the parse through
the intersection of several finite-state automata. The
resulting automaton encodes all candidate parses.
The parses are then generated from left to right, pro-
ceeding through the automaton from the initial to
the final state. A partial parse is a sequence of n
words from the beginning of the sentence, together
with a string encoding of their dependencies. Advancing
a partial parse corresponds to appending the next word
to it. The degree of completion is then defined
as the number of words currently generated in the
parse, divided by the total number of words in the
sentence.
To evaluate the ability of the proposed method to
combine diverse criteria in the search, we use four
target functions: a complex state-of-the-art parse re-
ranker based on a regularized least-squares (RLSC)
regressor (Tsivtsivadze et al., 2005), and three mea-
sures inspired by the simple heuristics applied by
the LG parser. The criteria are the average length of
a dependency, the average level of nesting of a de-
pendency, and the average number of dependencies
linking a word. The RLSC regressor, on the other
hand, employs complex features and word n-gram
statistics.
The dataset consists of 200 sentences ran-
domly selected from the BioInfer corpus of
dependency-parsed sentences extracted from ab-
stracts of biomedical research articles (Pyysalo et
al., 2006). For each sentence, we have randomly
selected a maximum of 100 parses. For sentences
with fewer than 100 parses, all parses were selected.
The average number of parses per sentence is 62.
Further, we perform 5 × 2 cross-validation, that is,
in each of five replications, we divide the data randomly
into two sets of 100 sentences and use one set to
estimate the probability distributions and the other
set to measure the performance of the search algo-
rithm. The RLSC regressor is trained once, using a
different set of sentences from the BioInfer corpus.
The results presented here are averaged over the 10
folds. As a comparative baseline, we use a simple
greedy search algorithm that always advances the
partial solution with the highest score until all so-
lutions have been generated.
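The 5 × 2 cross-validation split described above can be sketched as follows. The function name and the fixed random seed are assumptions made for reproducibility of the example.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def five_by_two_folds(
    items: Sequence[T], replications: int = 5, seed: int = 0
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (estimation_set, test_set) pairs for 5 x 2 cross-validation."""
    rng = random.Random(seed)  # the seed is an assumption, for reproducibility
    for _ in range(replications):
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        # Each half serves once for estimation and once for testing.
        yield first, second
        yield second, first


# 200 sentence indices give 10 folds of 100 + 100 sentences.
folds = list(five_by_two_folds(range(200)))
```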
3.1 Results
For each sentence s with parses S = {s1, . . ., sN}, let
SC ⊆ S be the subset of parses fully completed before
the algorithm stops and SN = S \ SC the subset
of parses not fully completed. Let further TC be
the number of iterations taken before the algorithm
stops, and T be the total number of steps needed to
generate all parses in S. Thus, |S| is the size of the
search space measured in the number of parses, and
T is the size of the search space measured in the
number of steps.
For a single parse si, rank(si) is the number of parses
in S with a score higher than f(si) plus 1. Thus, the
rank of all solutions with the maximal score is 1.
Finally, ord(si) corresponds to the order in which the
parses were completed by the algorithm (disregarding
the stopping criterion). For example, if the parses were
completed in the order s3, s8, s1, then ord(s3) = 1,
ord(s8) = 2, and ord(s1) = 3. While two solutions have
the same rank if their scores are equal, no two solutions
have the same order. The best completed solution
ŝC ∈ SC is the solution with the best (numerically
lowest) rank in SC and the lowest order among solutions
with the same rank. The best solution ŝ is the solution
with rank 1 and the lowest order among solutions with
rank 1. If ŝ ∈ SC, then ŝC = ŝ and the objective of the
algorithm to find the best solution was fulfilled.
We use the following measures of performance: rank(ŝC),
ord(ŝ), |SC|/|S|, and TC/T. The most important criteria
are rank(ŝC), which measures how good the best
found solution is, and TC/T, which measures the
proportion of steps actually taken by the algorithm out of
the total number of steps needed to complete all the
candidate solutions. Further, ord(ŝ), the number
of parses completed before the global optimum was
reached regardless of the stopping criterion, indicates
the ability of the search to reach the global
optimum early among the completed parses. Note
that all measures except for ord(ŝ) are equal to 1 for
the baseline greedy search, since it lacks a stopping
criterion.
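The measures defined above can be sketched for a single sentence as follows. The function names and the example scores are hypothetical; the sketch assumes that SC is the prefix of the completion order that was finished before the algorithm stopped, and that at least one parse was completed.

```python
from typing import Dict, Sequence


def rank(solution: str, scores: Dict[str, float]) -> int:
    """Number of parses with a strictly higher score, plus 1."""
    return 1 + sum(1 for other in scores.values() if other > scores[solution])


def performance(
    scores: Dict[str, float],
    completion_order: Sequence[str],
    n_completed: int,
    steps_taken: int,
    total_steps: int,
) -> Dict[str, float]:
    """Compute rank(s_C), ord(s), |S_C|/|S| and T_C/T for one sentence.

    completion_order lists all parses in the order they would be completed
    when the stopping criterion is disregarded; the first n_completed of
    them form S_C.
    """
    order = {s: i + 1 for i, s in enumerate(completion_order)}
    completed = completion_order[:n_completed]
    # Best solution: best (lowest) rank, earliest order among ties.
    best_completed = min(completed, key=lambda s: (rank(s, scores), order[s]))
    best_overall = min(completion_order, key=lambda s: (rank(s, scores), order[s]))
    return {
        "rank_best_completed": rank(best_completed, scores),
        "ord_best": order[best_overall],
        "completed_fraction": n_completed / len(completion_order),
        "step_fraction": steps_taken / total_steps,
    }


# Completion order from the example in the text; the scores are made up.
toy_scores = {"s1": 0.9, "s3": 0.4, "s8": 0.9}
result = performance(toy_scores, ["s3", "s8", "s1"], n_completed=2, steps_taken=12, total_steps=20)
```

Here s8 and s1 share rank 1, and s8 is the best solution because it is completed first.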
The average performance values for four settings
of the parameter ε are presented in Table 1. Clearly,
ε      rank(ŝC)  ord(ŝ)  |SC|/|S|  TC/T
0.01   1.6       8.8     0.78      0.94
0.05   2.8       11.2    0.62      0.85
0.10   4.0       12.2    0.53      0.79
0.20   6.0       13.5    0.41      0.73
Base   1.0       28.7    1.00      1.00
Table 1: Average results over all sentences.
the algorithm behaves as expected with respect to
the parameter ε. While with the strictest setting
ε = 0.01, 94% of the search space is explored, with
the least strict setting of ε = 0.2, 73% is explored,
thus pruning one quarter of the search space. The
proportion of completed parses is generally consid-
erably lower than the proportion of explored search
space. This indicates that the parses are generally
advanced to a significant level of completion, but
then ruled out. The behavior of the algorithm is
thus closer to a breadth-first, rather than depth-first
search. We also notice that the average rank of the
best completed solution is very low, indicating that
although the algorithm does not necessarily identify
the best solution, it generally identifies a very good
solution. In addition, the order of the best solution is
low as well, suggesting that generally good solutions
are identified before low-score solutions. Further,
compared to the baseline, the globally optimal solu-
tion is reached earlier among the completed parses,
although this does not imply that it is reached earlier
in the number of steps. Apart from the overall averages,
we also consider the performance with respect
to the number of alternative parses for each sentence
(Table 2). Here we see that even for the sentences with
the most alternative parses (91-100) and with the least
strict setting ε = 0.2, the search finds a reasonably good
solution (average rank 18.9) while reducing the explored
search space to 48%.
             ε = 0.01                            ε = 0.2                             Base
|S|     #    rank(ŝC)  ord(ŝ)  |SC|/|S|  TC/T    rank(ŝC)  ord(ŝ)  |SC|/|S|  TC/T    ord(ŝ)
1-10    40   1.0       1.6     1.00      1.00    1.2       1.8     0.84      0.95    2.85
11-20   18   1.1       4.4     0.88      0.97    2.8       7.0     0.54      0.79    9.82
21-30   8    1.0       2.9     1.00      1.00    1.0       2.4     0.80      0.98    14.75
31-40   9    1.2       7.8     0.79      0.95    2.6       10.8    0.48      0.74    20.67
41-50   6    1.0       4.4     0.80      0.89    4.9       9.8     0.28      0.61    18.07
51-60   3    1.0       2.3     0.64      0.88    7.1       5.9     0.30      0.59    38.67
61-70   5    1.1       26.9    0.86      0.99    3.4       23.2    0.22      0.68    32.60
71-80   3    1.0       8.7     0.78      0.98    9.2       19.6    0.30      0.71    49.67
81-90   6    2.5       8.2     0.61      0.94    9.3       16.6    0.24      0.76    47.67
91-100  102  5.2       20.9    0.50      0.81    18.9      38.2    0.15      0.48    52.69
Table 2: Average results with respect to the number of alternative parses. The column # contains the number
of sentences in the dataset which have the given number of parses.
4 Conclusions and future work
We have considered the problem of identifying a
globally optimal solution among a set of candidate
solutions, jointly optimizing several target functions
that implement domain criteria. Assuming the solutions
are generated incrementally, we have derived
a probabilistic search algorithm that aims to identify
the globally optimal solution without completing all
of the candidate solutions. The algorithm is based on
a model of the error in prediction caused by the
incompleteness of a partial solution. Using the model,
the order in which partial solutions are explored is
defined, as well as a stopping criterion for the algorithm.
We have performed an evaluation using best parse
identification as the model problem. The results in-
dicate that the method is capable of combining sim-
ple heuristic criteria with a complex regressor, iden-
tifying solutions with a very low average rank.
The crucial component of the method is the model
of the error δ. Improving the accuracy of the model
may potentially further improve the performance of
the algorithm, allowing a more accurate stopping
criterion and better order in which the parses are
completed. We have assumed independence between
the scores assigned by the target functions. As
future work, a multivariate model will be considered
that takes into account the mutual dependencies
of the target functions.

References
Filip Ginter, Sampo Pyysalo, Jorma Boberg, and Tapio
Salakoski. 2006. Regular approximation of Link
Grammar. Manuscript under review.
Heng Ji, David Westbrook, and Ralph Grishman. 2005.
Using semantic relations to refine coreference deci-
sions. In Proceedings of Human Language Technol-
ogy Conference and Conference on Empirical Methods
in Natural Language Processing (HLT/EMNLP’05),
Vancouver, Canada, pages 17–24. ACL.
Nanda Kambhatla. 2004. Combining lexical, syntactic,
and semantic features with maximum entropy models
for information extraction. In The Companion Volume
to the Proceedings of 42nd Annual Meeting of the
Association for Computational Linguistics (ACL’04),
Barcelona, Spain, pages 178–181. ACL.
Kimmo Koskenniemi. 1990. Finite-state parsing and
disambiguation. In Proceedings of the 13th In-
ternational Conference on Computational Linguis-
tics (COLING’90), Helsinki, Finland, pages 229–232.
ACL.
Prem Melville, Maytal Saar-Tsechansky, Foster Provost,
and Raymond Mooney. 2004. Active feature-value
acquisition for classifier induction. In Proceedings
of the Fourth IEEE International Conference on Data
Mining (ICDM’04), pages 483–486. IEEE Computer
Society.
Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari
Björne, Jorma Boberg, Jouni Järvinen, and Tapio
Salakoski. 2006. Bio Information Extraction Resource:
A corpus for information extraction in the
biomedical domain. Manuscript under review.
Bernard W. Silverman. 1986. Density Estimation for
Statistics and Data Analysis. Chapman & Hall.
Daniel D. Sleator and Davy Temperley. 1991. Pars-
ing English with a link grammar. Technical Report
CMU-CS-91-196, Department of Computer Science,
Carnegie Mellon University, Pittsburgh, PA.
Evgeni Tsivtsivadze, Tapio Pahikkala, Sampo Pyysalo,
Jorma Boberg, Aleksandr Mylläri, and Tapio
Salakoski. 2005. Regularized least-squares for parse
ranking. In Proceedings of the 6th International
Symposium on Intelligent Data Analysis (IDA’05),
Madrid, Spain, pages 464–474. Springer, Heidelberg.
Dmitry Zelenko, Chinatsu Aone, and Jason Tibbets.
2004. Binary integer programming for information ex-
traction. In Proceedings of the ACE Evaluation Meet-
ing.
