Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 155–163,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Extremely Lexicalized Models for Accurate and Fast HPSG Parsing
Takashi Ninomiya
Information Technology Center
University of Tokyo
Takuya Matsuzaki
Department of Computer Science
University of Tokyo
Yoshimasa Tsuruoka
School of Informatics
University of Manchester
Yusuke Miyao
Department of Computer Science
University of Tokyo
Jun’ichi Tsujii
Department of Computer Science, University of Tokyo
School of Informatics, University of Manchester
SORST, Japan Science and Technology Agency
Hongo 7-3-1, Bunkyo-ku, Tokyo, 113-0033, Japan
{ninomi, matuzaki, tsuruoka, yusuke, tsujii}@is.s.u-tokyo.ac.jp
Abstract
This paper describes an extremely lexi-
calized probabilistic model for fast and
accurate HPSG parsing. In this model,
the probabilities of parse trees are de-
fined with only the probabilities of select-
ing lexical entries. The proposed model
is very simple, and experiments revealed
that the implemented parser runs around
four times faster than the previous model
and that the proposed model has a high
accuracy comparable to that of the previ-
ous model for probabilistic HPSG, which
is defined over phrase structures. We
also developed a hybrid of our probabilis-
tic model and the conventional phrase-
structure-based model. The hybrid model
is not only significantly faster but also sig-
nificantly more accurate by two points of
precision and recall compared to the pre-
vious model.
1 Introduction
For the last decade, accurate and wide-coverage
parsing for real-world text has been intensively
and extensively pursued. In most of state-of-the-
art parsers, probabilistic events are defined over
phrase structures because phrase structures are
supposed to dominate syntactic configurations of
sentences. For example, probabilities were de-
fined over grammar rules in probabilistic CFG
(Collins, 1999; Klein and Manning, 2003; Char-
niak and Johnson, 2005) or over complex phrase
structures of head-driven phrase structure gram-
mar (HPSG) or combinatory categorial grammar
(CCG) (Clark and Curran, 2004b; Malouf and van
Noord, 2004; Miyao and Tsujii, 2005). Although
these studies vary in the design of the probabilistic
models, the fundamental conception of probabilis-
tic modeling is intended to capture characteristics
of phrase structures or grammar rules. Although
lexical information, such as head words, is known
to significantly improve the parsing accuracy, it
was also used to augment information on phrase
structures.
Another interesting approach to this problem
was using supertagging (Clark and Curran, 2004b;
Clark and Curran, 2004a; Wang and Harper, 2004;
Nasr and Rambow, 2004), which was originally
developed for lexicalized tree adjoining grammars
(LTAG) (Bangalore and Joshi, 1999). Supertag-
ging is a process where words in an input sen-
tence are tagged with ‘supertags,’ which are lex-
ical entries in lexicalized grammars, e.g., elemen-
tary trees in LTAG, lexical categories in CCG,
and lexical entries in HPSG. Supertagging was,
in the first place, a technique to reduce the cost
of parsing with lexicalized grammars; ambiguity
in assigning lexical entries to words is reduced
by the light-weight process of supertagging be-
fore the heavy process of parsing. Bangalore and
Joshi (1999) claimed that if words can be assigned
correct supertags, syntactic parsing is almost triv-
ial. What this means is that if supertags are cor-
rectly assigned, syntactic structures are almost de-
155
termined because supertags include rich syntac-
tic information such as subcategorization frames.
Nasr and Rambow (2004) showed that the accu-
racy of LTAG parsing reached about 97%, assum-
ing that the correct supertags were given. The
concept of supertagging is simple and interesting,
and the effects of this were recently demonstrated
in the case of a CCG parser (Clark and Curran,
2004a) with the result of a drastic improvement in
the parsing speed. Wang and Harper (2004) also
demonstrated the effects of supertagging with a
statistical constraint dependency grammar (CDG)
parser. They achieved accuracy as high as the
state-of-the-art parsers. However, a supertagger it-
self was used as an external tagger that enumerates
candidates of lexical entries or filters out unlikely
lexical entries just to help parsing, and the best
parse trees were selected mainly according to the
probabilistic model for phrase structures or depen-
dencies with/without the probabilistic model for
supertagging.
We investigate an extreme case of HPSG pars-
ing in which the probabilistic model is defined
with only the probabilities of lexical entry selec-
tion; i.e., the model is never sensitive to charac-
teristics of phrase structures. The model is simply
defined as the product of the supertagging proba-
bilities, which are provided by the discriminative
method with machine learning features of word
trigrams and part-of-speech (POS) 5-grams as de-
fined in the CCG supertagging (Clark and Curran,
2004a). The model is implemented in an HPSG
parser instead of the phrase-structure-based prob-
abilistic model; i.e., the parser returns the parse
tree assigned the highest probability of supertag-
ging among the parse trees licensed by an HPSG.
Though the model uses only the probabilities of
lexical entry selection, the experiments revealed
that it was as accurate as the previous phrase-
structure-based model. Interestingly, this means
that accurate parsing is possible using rather sim-
ple mechanisms.
We also tested a hybrid model of the su-
pertagging and the previous phrase-structure-
based probabilistic model. In the hybrid model,
the probabilities of the previous model are mul-
tiplied by the supertagging probabilities instead
of a preliminary probabilistic model, which is in-
troduced to help the process of estimation by fil-
tering unlikely lexical entries (Miyao and Tsujii,
2005). In the previous model, the preliminary
probabilistic model is defined as the probability
of unigram supertagging. So, the hybrid model
can be regarded as an extension of supertagging
from unigram to n-gram. The hybrid model can
also be regarded as a variant of the statistical CDG
parser (Wang, 2003; Wang and Harper, 2004), in
which the parse tree probabilities are defined as
the product of the supertagging probabilities and
the dependency probabilities. In the experiments,
we observed that the hybrid model significantly
improved the parsing speed, by around three to
four times speed-ups, and accuracy, by around two
points in both precision and recall, over the pre-
vious model. This implies that finer probabilistic
model of lexical entry selection can improve the
phrase-structure-based model.
2 HPSG and probabilistic models
HPSG (Pollard and Sag, 1994) is a syntactic the-
ory based on lexicalized grammar formalism. In
HPSG, a small number of schemata describe gen-
eral construction rules, and a large number of
lexical entries express word-specific characteris-
tics. The structures of sentences are explained us-
ing combinations of schemata and lexical entries.
Both schemata and lexical entries are represented
by typed feature structures, and constraints repre-
sented by feature structures are checked with uni-
fication.
An example of HPSG parsing of the sentence
“Spring has come” is shown in Figure 1. First,
each of the lexical entries for “has” and “come”
is unified with a daughter feature structure of the
Head-Complement Schema. Unification provides
the phrasal sign of the mother. The sign of the
larger constituent is obtained by repeatedly apply-
ing schemata to lexical/phrasal signs. Finally, the
parse result is output as a phrasal sign that domi-
nates the sentence.
Given a set W of words and a set F of feature
structures, an HPSG is formulated as a tuple, G =
〈L,R〉, where
L = {l = 〈w,F〉|w ∈W,F ∈F} is a set of
lexical entries, and
R is a set of schemata; i.e., r ∈ R is a partial
function: F ×F →F.
Given a sentence, an HPSG computes a set of
phrasal signs, i.e., feature structures, as a result of
parsing. Note that HPSG is one of the lexicalized
grammar formalisms, in which lexical entries de-
termine the dominant syntactic structures.
156
Spring
HEAD  nounSUBJ  < >COMPS  < >2HEAD  verbSUBJ  <    >COMPS  <    >1
has
HEAD  verbSUBJ  <    >COMPS  < >1
come
2head-comp
HEAD  verbSUBJ  < >
COMPS  < >
HEAD  nounSUBJ  < >COMPS  < >1
=⇒
Spring
HEAD  nounSUBJ  < >COMPS  < >2HEAD  verbSUBJ  <    >COMPS  <    >1
has
HEAD  verbSUBJ  <    >COMPS  < >1
come
2
HEAD  verbSUBJ  <    >COMPS  < >1
HEAD  verbSUBJ  < >COMPS  < >
1
subject-head
head-comp
Figure 1: HPSG parsing.
Previous studies (Abney, 1997; Johnson et al.,
1999; Riezler et al., 2000; Malouf and van Noord,
2004; Kaplan et al., 2004; Miyao and Tsujii, 2005)
defined a probabilistic model of unification-based
grammars including HPSG as a log-linear model
or maximum entropy model (Berger et al., 1996).
The probability that a parse result T is assigned to
a given sentence w = 〈w1,...,wn〉 is
phpsg(T|w) = 1Z
w
exp
parenleftBiggsummationdisplay
u
λufu(T)
parenrightBigg
Zw = summationdisplay
Tprime
exp
parenleftBiggsummationdisplay
u
λufu(Tprime)
parenrightBigg
,
where λu is a model parameter, fu is a feature
function that represents a characteristic of parse
tree T, and Zw is the sum over the set of all pos-
sible parse trees for the sentence. Intuitively, the
probability is defined as the normalized product
of the weights exp(λu) when a characteristic cor-
responding to fu appears in parse result T. The
model parameters, λu, are estimated using numer-
ical optimization methods (Malouf, 2002) to max-
imize the log-likelihood of the training data.
However, the above model cannot be easily es-
timated because the estimation requires the com-
putation of p(T|w) for all parse candidates as-
signed to sentence w. Because the number of
parse candidates is exponentially related to the
length of the sentence, the estimation is intractable
for long sentences. To make the model estimation
tractable, Geman and Johnson (Geman and John-
son, 2002) and Miyao and Tsujii (Miyao and Tsu-
jii, 2002) proposed a dynamic programming algo-
rithm for estimating p(T|w). Miyao and Tsujii
HEAD  verbSUBJ  <>
COMPS <>
HEAD  nounSUBJ  <>
COMPS <>
HEAD  verbSUBJ  <   >
COMPS <>
HEAD  verbSUBJ  <   >
COMPS <   >
HEAD  verbSUBJ  <   >
COMPS <>
subject-head
head-comp
Spring/NN has/VBZ come/VBN
1
1 11 22
froot= <S, has, VBZ,                  >HEAD  verbSUBJ  <NP>COMPS <VP>
fbinary= head-comp,1, 0,1,VP, has,VBZ,                    ,
1,VP, come,VBN,
HEAD  verbSUBJ  <NP>
COMPS <VP>HEAD  verb
SUBJ  <NP>COMPS <>
flex= <spring,NN,                    > HEAD  nounSUBJ  <>COMPS <>
Figure 2: Example of features.
(2005) also introduced a preliminary probabilistic
model p0(T|w) whose estimation does not require
the parsing of a treebank. This model is intro-
duced as a reference distribution of the probabilis-
tic HPSG model; i.e., the computation of parse
trees given low probabilities by the model is omit-
ted in the estimation stage. We have
(Previous probabilistic HPSG)
phpsgprime(T|w) = p0(T|w) 1Z
w
exp
parenleftBiggsummationdisplay
u
λufu(T)
parenrightBigg
Zw = summationdisplay
Tprime
p0(Tprime|w)exp
parenleftBiggsummationdisplay
u
λufu(Tprime)
parenrightBigg
p0(T|w) =
nproductdisplay
i=1
p(li|wi),
where li is a lexical entry assigned to word wi in T
and p(li|wi) is the probability of selecting lexical
entry li for wi.
In the experiments, we compared our model
with the probabilistic HPSG model of Miyao and
Tsujii (2005). The features used in their model are
combinations of the feature templates listed in Ta-
ble 1. The feature templates fbinary and funary
are defined for constituents at binary and unary
branches, froot is a feature template set for the
root nodes of parse trees, and flex is a feature tem-
plate set for calculating the preliminary probabilis-
tic model. An example of features applied to the
parse tree for the sentence “Spring has come” is
shown in Figure 2.
157
fbinary =
angbracketleftBigg r,d,c,
spl,syl,hwl,hpl,hll,
spr,syr,hwr,hpr,hlr
angbracketrightBigg
funary = 〈r,sy,hw,hp,hl〉
froot = 〈sy,hw,hp,hl〉
flex = 〈wi,pi,li〉
combinations of feature templates for fbinary
〈r,d,c,hw,hp,hl〉,〈r,d,c,hw,hp〉,〈r,d,c,hw,hl〉,
〈r,d,c,sy,hw〉,〈r,c,sp,hw,hp,hl〉,〈r,c,sp,hw,hp〉,
〈r,c,sp,hw,hl〉,〈r,c,sp,sy,hw〉,〈r,d,c,hp,hl〉,
〈r,d,c,hp〉,〈r,d,c,hl〉,〈r,d,c,sy〉,〈r,c,sp,hp,hl〉,
〈r,c,sp,hp〉,〈r,c,sp,hl〉,〈r,c,sp,sy〉
combinations of feature templates for funary
〈r,hw,hp,hl〉,〈r,hw,hp〉,〈r,hw,hl〉,〈r,sy,hw〉,
〈r,hp,hl〉,〈r,hp〉,〈r,hl〉,〈r,sy〉
combinations of feature templates for froot
〈hw,hp,hl〉,〈hw,hp〉,〈hw,hl〉,
〈sy,hw〉,〈hp,hl〉,〈hp〉,〈hl〉,〈sy〉
combinations of feature templates for flex
〈wi,pi,li〉,〈pi,li〉
r name of the applied schema
d distance between the head words of the daughters
c whether a comma exists between daughtersand/or inside daughter phrases
sp number of words dominated by the phrase
sy symbol of the phrasal category
hw surface form of the head word
hp part-of-speech of the head word
hl lexical entry assigned to the head word
wi i-th word
pi part-of-speech for wi
li lexical entry for wi
Table 1: Features.
3 Extremely lexicalized probabilistic
models
In the experiments, we tested parsing with the pre-
vious model for the probabilistic HPSG explained
in Section 2 and other three types of probabilis-
tic models defined with the probabilities of lexi-
cal entry selection. The first one is the simplest
probabilistic model, which is defined with only
the probabilities of lexical entry selection. It is
defined simply as the product of the probabilities
of selecting all lexical entries in the sentence; i.e.,
the model does not use the probabilities of phrase
structures like the previous models.
Given a set of lexical entries, L, a sentence,
w = 〈w1,...,wn〉, and the probabilistic model
of lexical entry selection, p(li ∈ L|w,i), the first
model is formally defined as follows:
(Model 1)
pmodel1(T|w) =
nproductdisplay
i=1
p(li|w,i),
where li is a lexical entry assigned to word wi
in T and p(li|w,i) is the probability of selecting
lexical entry li for wi.
The second model is defined as the product of
the probabilities of selecting all lexical entries in
the sentence and the root node probability of the
parse tree. That is, the second model is also de-
fined without the probabilities on phrase struc-
tures:
(Model 2)
pmodel2(T|w) =
1
Zmodel2pmodel1(T|w)exp

 summationdisplay
u
(fu∈froot)
λufu(T)


Zmodel2 =
summationdisplay
Tprime
pmodel1(Tprime|w)exp

 summationdisplay
u
(fu∈froot)
λufu(Tprime)

,
where Zmodel2 is the sum over the set of all pos-
sible parse trees for the sentence.
The third model is a hybrid of model 1 and the
previous model. The probabilities of the lexical
entries in the previous model are replaced with the
probabilities of lexical entry selection:
(Model 3)
pmodel3(T|w) =
1
Zmodel3pmodel1(T|w)exp
parenleftBiggsummationdisplay
u
λufu(T)
parenrightBigg
Zmodel3 =
summationdisplay
Tprime
pmodel1(Tprime|w)exp
parenleftBiggsummationdisplay
u
λufu(Tprime)
parenrightBigg
.
In this study, the same model parameters used
in the previous model were used for phrase struc-
tures.
The probabilities of lexical entry selection,
p(li|w,i), are defined as follows:
(Probabilistic Model of Lexical Entry Selection)
p(li|w,i) = 1Z
w
exp
parenleftBiggsummationdisplay
u
λufu(li,w,i)
parenrightBigg
158
fexlex =
angbracketleftbigg
wi−1,wi,wi+1,
pi−2,pi−1,pi,pi+1,pi+2
angbracketrightbigg
combinations of feature templates
〈wi−1〉,〈wi〉,〈wi+1〉,
〈pi−2〉,〈pi−1〉,〈pi〉,〈pi+1〉,〈pi+2〉,〈pi+3〉,
〈wi−1,wi〉,〈wi,wi+1〉,
〈pi−1,wi〉,〈pi,wi〉,〈pi+1,wi〉,
〈pi,pi+1,pi+2,pi+3〉,〈pi−2,pi−1,pi〉,
〈pi−1,pi,pi+1〉,〈pi,pi+1,pi+2〉
〈pi−2,pi−1〉,〈pi−1,pi〉,〈pi,pi+1〉,〈pi+1,pi+2〉
Table 2: Features for the probabilities of lexical
entry selection.
procedure Parsing(〈w1,...,wn〉, 〈L,R〉, α, β, κ, δ, θ)
for i = 1 to n
foreach Fprime ∈ {F|〈wi,F〉 ∈ L}
p =
summationtext
u λufu(F
prime)
pi[i−1,i] ← pi[i−1,i]∪{Fprime}
if (p > ρ[i−1,i,Fprime]) then
ρ[i−1,i,Fprime] ← p
LocalThresholding(i−1, i,α, β)
for d = 1 to n
for i = 0 to n−d
j = i+d
for k = i+1 to j −1
foreach Fs ∈ φ[i,k], Ft ∈ φ[k,j], r ∈ R
if F = r(Fs,Ft) has succeeded
p = ρ[i,k,Fs]+ρ[k,j,Ft]+
summationtext
u λufu(F)pi[i,j] ← pi[i,j]∪{F}
if (p > ρ[i,j,F]) then
ρ[i,j,F] ← p
LocalThresholding(i, j,κ, δ)
GlobalThresholding(i, n, θ)
procedure IterativeParsing(w, G, α0, β0, κ0, δ0, θ0, ∆α, ∆β, ∆κ,
∆δ, ∆θ, αlast, βlast, κlast, δlast, θlast)
α ← α0; β ← β0; κ ← κ0; δ ← δ0; θ ← θ0;
loop while α ≤ αlast and β ≤ βlast and κ ≤ κlast and δ ≤ δlast
and θ ≤ θlast
call Parsing(w, G, α, β, κ, δ, θ)
if pi[1,n] negationslash= ∅ then exit
α ← α+∆α; β ← β +∆β;
κ ← κ+∆κ; δ ← δ +∆δ; θ ← θ +∆θ;
Figure 3: Pseudo-code of iterative parsing for
HPSG.
Zw =
summationdisplay
lprime
exp
parenleftBiggsummationdisplay
u
λufu(lprime,w,i)
parenrightBigg
,
where Zw is the sum over all possible lexical en-
tries for the word wi. The feature templates used
in our model are listed in Table 2 and are word
trigrams and POS 5-grams.
4 Experiments
4.1 Implementation
We implemented the iterative parsing algorithm
(Ninomiya et al., 2005) for the probabilistic HPSG
models. It first starts parsing with a narrow beam.
If the parsing fails, then the beam is widened, and
parsing continues until the parser outputs results
or the beam width reaches some limit. Though
the probabilities of lexical entry selection are in-
troduced, the algorithm for the presented proba-
bilistic models is almost the same as the original
iterative parsing algorithm.
The pseudo-code of the algorithm is shown in
Figure 3. In the figure, the pi[i,j] represents
the set of partial parse results that cover words
wi+1,...,wj, and ρ[i,j,F] stores the maximum
figure-of-merit (FOM) of partial parse result F
at cell (i,j). The probability of lexical entry
F is computed as summationtextu λufu(F) for the previous
model, as shown in the figure. The probability
of a lexical entry for models 1, 2, and 3 is com-
puted as the probability of lexical entry selection,
p(F|w,i). The FOM of a newly created partial
parse, F, is computed by summing the values of
ρ of the daughters and an additional FOM of F if
the model is the previous model or model 3. The
FOM for models 1 and 2 is computed by only sum-
ming the values of ρ of the daughters; i.e., weights
exp(λu) in the figure are assigned zero. The terms
κ and δ are the thresholds of the number of phrasal
signs in the chart cell and the beam width for signs
in the chart cell. The terms α and β are the thresh-
olds of the number and the beam width of lexical
entries, and θ is the beam width for global thresh-
olding (Goodman, 1997).
4.2 Evaluation
We evaluated the speed and accuracy of parsing
with extremely lexicalized models by using Enju
2.1, the HPSG grammar for English (Miyao et al.,
2005; Miyao and Tsujii, 2005). The lexicon of
the grammar was extracted from Sections 02-21 of
the Penn Treebank (Marcus et al., 1994) (39,832
sentences). The grammar consisted of 3,797 lex-
ical entries for 10,536 words1. The probabilis-
tic models were trained using the same portion of
the treebank. We used beam thresholding, global
thresholding (Goodman, 1997), preserved iterative
parsing (Ninomiya et al., 2005) and other tech-
1An HPSG treebank is automatically generated from the
Penn Treebank. Those lexical entries were generated by ap-
plying lexical rules to observed lexical entries in the HPSG
treebank (Nakanishi et al., 2004). The lexicon, however, in-
cluded many lexical entries that do not appear in the HPSG
treebank. The HPSG treebank is used for training the prob-
abilistic model for lexical entry selection, and hence, those
lexical entries that do not appear in the treebank are rarely
selected by the probabilistic model. The ‘effective’ tag set
size, therefore, is around 1,361, the number of lexical entries
without those never-seen lexical entries.
159
No. of tested sentences Total No. of Avg. length of tested sentences
≤ 40 words ≤ 100 words sentences ≤ 40 words ≤ 100 words
Section 23 2,162 (94.04%) 2,299 (100.00%) 2,299 20.7 22.2
Section 24 1,157 (92.78%) 1,245 (99.84%) 1,247 21.2 23.0
Table 3: Statistics of the Penn Treebank.
Section 23 (≤ 40 + Gold POSs) Section 23 (≤ 100 + Gold POSs)
LP LR UP UR Avg. time LP LR UP UR Avg. time
(%) (%) (%) (%) (ms) (%) (%) (%) (%) (ms)
previous model 87.65 86.97 91.13 90.42 468 87.26 86.50 90.73 89.93 604
model 1 87.54 86.85 90.38 89.66 111 87.23 86.47 90.05 89.27 129
model 2 87.71 87.02 90.51 89.80 109 87.38 86.62 90.17 89.39 130
model 3 89.79 88.97 92.66 91.81 132 89.48 88.58 92.33 91.40 152
Section 23 (≤ 40 + POS tagger) Section 23 (≤ 100 + POS tagger)
LP LR UP UR Avg. time LP LR UP UR Avg. time
(%) (%) (%) (%) (ms) (%) (%) (%) (%) (ms)
previous model 85.33 84.83 89.93 89.41 509 84.96 84.25 89.55 88.80 674
model 1 85.26 84.31 89.17 88.18 133 85.00 84.01 88.85 87.82 154
model 2 85.37 84.42 89.25 88.26 134 85.08 84.09 88.91 87.88 155
model 3 87.66 86.53 91.61 90.43 155 87.35 86.29 91.24 90.13 183
Table 4: Experimental results for Section 23.
niques for deep parsing2. The parameters for beam
searching were determined manually by trial and
error using Section 22: α0 = 4,∆α = 4,αlast =
20,β0 = 1.0,∆β = 2.5,βlast = 11.0,δ0 =
12,∆δ = 4,δlast = 28, κ0 = 6.0,∆κ =
2.25,κlast = 15.0, θ0 = 8.0,∆θ = 3.0, and
θlast = 20.0. With these thresholding parame-
ters, the parser iterated at most five times for each
sentence.
We measured the accuracy of the predicate-
argument relations output of the parser. A
predicate-argument relation is defined as a tu-
ple 〈σ,wh,a,wa〉, where σ is the predicate type
(e.g., adjective, intransitive verb), wh is the head
word of the predicate, a is the argument label
(MODARG, ARG1, ..., ARG4), and wa is the
head word of the argument. Labeled precision
(LP)/labeled recall (LR) is the ratio of tuples cor-
rectly identified by the parser3. Unlabeled pre-
cision (UP)/unlabeled recall (UR) is the ratio of
tuples without the predicate type and the argu-
ment label. This evaluation scheme was the
same as used in previous evaluations of lexicalized
grammars (Hockenmaier, 2003; Clark and Cur-
2Deep parsing techniques include quick check (Malouf
et al., 2000) and large constituent inhibition (Kaplan et al.,
2004) as described by Ninomiya et al. (2005), but hybrid
parsing with a CFG chunk parser was not used. This is be-
cause we did not observe a significant improvement for the
development set by the hybrid parsing and observed only a
small improvement in the parsing speed by around 10 ms.
3When parsing fails, precision and recall are evaluated,
although nothing is output by the parser; i.e., recall decreases
greatly.
ran, 2004b; Miyao and Tsujii, 2005). The ex-
periments were conducted on an AMD Opteron
server with a 2.4-GHz CPU. Section 22 of the
Treebank was used as the development set, and
the performance was evaluated using sentences of
≤ 40 and 100 words in Section 23. The perfor-
mance of each parsing technique was analyzed us-
ing the sentences in Section 24 of ≤ 100 words.
Table 3 details the numbers and average lengths of
the tested sentences of ≤ 40 and 100 words in Sec-
tions 23 and 24, and the total numbers of sentences
in Sections 23 and 24.
The parsing performance for Section 23 is
shown in Table 4. The upper half of the table
shows the performance using the correct POSs in
the Penn Treebank, and the lower half shows the
performance using the POSs given by a POS tag-
ger (Tsuruoka and Tsujii, 2005). The left and
right sides of the table show the performances for
the sentences of ≤ 40 and ≤ 100 words. Our
models significantly increased not only the pars-
ing speed but also the parsing accuracy. Model
3 was around three to four times faster and had
around two points higher precision and recall than
the previous model. Surprisingly, model 1, which
used only lexical information, was very fast and
as accurate as the previous model. Model 2 also
improved the accuracy slightly without informa-
tion of phrase structures. When the automatic POS
tagger was introduced, both precision and recall
dropped by around 2 points, but the tendency to-
wards improved speed and accuracy was again ob-
160
76.00%
78.00%
80.00%
82.00%
84.00%
86.00%
88.00%
0 100 200 300 400 500 600 700 800 900Parsing time (ms/sentence)
F-sc
ore
previous model
model 1
model 2
model 3
Figure 4: F-score versus average parsing time for sentences in Section 24 of ≤ 100 words.
served.
The unlabeled precisions and recalls of the pre-
vious model and models 1, 2, and 3 were signifi-
cantly different as measured using stratified shuf-
fling tests (Cohen, 1995) with p-values < 0.05.
The labeled precisions and recalls were signifi-
cantly different among models 1, 2, and 3 and
between the previous model and model 3, but
were not significantly different between the previ-
ous model and model 1 and between the previous
model and model 2.
The average parsing time and labeled F-score
curves of each probabilistic model for the sen-
tences in Section 24 of≤100 words are graphed in
Figure 4. The superiority of our models is clearly
observed in the figure. Model 3 performed sig-
nificantly better than the previous model. Models
1 and 2 were significantly faster with almost the
same accuracy as the previous model.
5 Discussion
5.1 Supertagging
Our probabilistic model of lexical entry selection
can be used as an independent classifier for select-
ing lexical entries, which is called the supertag-
ger (Bangalore and Joshi, 1999; Clark and Curran,
2004b). The CCG supertagger uses a maximum
entropy classifier and is similar to our model.
We evaluated the performance of our probabilis-
tic model as a supertagger. The accuracy of the re-
sulting supertagger on our development set (Sec-
tion 22) is given in Table 5 and Table 6. The test
sentences were automatically POS-tagged. Re-
sults of other supertaggers for automatically ex-
test data accuracy (%)
HPSG supertagger 22 87.51
(this paper)
CCG supertagger 00/23 91.70 / 91.45
(Curran and Clark, 2003)
LTAG supertagger 22/23 86.01 / 86.27
(Shen and Joshi, 2003)
Table 5: Accuracy of single-tag supertaggers. The
numbers under “test data” are the PTB section
numbers of the test data.
γ tags/word word acc. (%) sentence acc. (%)
1e-1 1.30 92.64 34.98
1e-2 2.11 95.08 46.11
1e-3 4.66 96.22 51.95
1e-4 10.72 96.83 55.66
1e-5 19.93 96.95 56.20
Table 6: Accuracy of multi-supertagging.
tracted lexicalized grammars are listed in Table 5.
Table 6 gives the average number of supertags as-
signed to a word, the per-word accuracy, and the
sentence accuracy for several values of γ, which is
a parameter to determine how many lexical entries
are assigned.
When compared with other supertag sets of au-
tomatically extracted lexicalized grammars, the
(effective) size of our supertag set, 1,361 lexical
entries, is between the CCG supertag set (398 cat-
egories) used by Curran and Clark (2003) and the
LTAG supertag set (2920 elementary trees) used
by Shen and Joshi (2003). The relative order based
on the sizes of the tag sets exactly matches the or-
der based on the accuracies of corresponding su-
pertaggers.
161
5.2 Efficacy of extremely lexicalized models
The implemented parsers of models 1 and 2 were
around four times faster than the previous model
without a loss of accuracy. However, what sur-
prised us is not the speed of the models, but
the fact that they were as accurate as the previ-
ous model, though they do not use any phrase-
structure-based probabilities. We think that the
correct parse is more likely to be selected if the
correct lexical entries are assigned high probabil-
ities because lexical entries include specific infor-
mation about subcategorization frames and syn-
tactic alternation, such as wh-movement and pas-
sivization, that likely determines the dominant
structures of parse trees. Another possible rea-
son for the accuracy is the constraints placed by
unification-based grammars. That is, incorrect
parse trees were suppressed by the constraints.
The best performer in terms of speed and ac-
curacy was model 3. The increased speed was,
of course, possible for the same reasons as the
speeds of models 1 and 2. An unexpected but
very impressive result was the significant improve-
ment of accuracy by two points in precision and
recall, which is hard to attain by tweaking param-
eters or hacking features. This may be because
the phrase structure information and lexical in-
formation complementarily improved the model.
The lexical information includes more specific in-
formation about the syntactic alternation, and the
phrase structure information includes information
about the syntactic structures, such as the dis-
tances of head words or the sizes of phrases.
Nasr and Rambow (2004) showed that the accu-
racy of LTAG parsing reached about 97%, assum-
ing that the correct supertags were given. We ex-
emplified the dominance of lexical information in
real syntactic parsing, i.e., syntactic parsing with-
out gold-supertags, by showing that the proba-
bilities of lexical entry selection dominantly con-
tributed to syntactic parsing.
The CCG supertagging demonstrated fast and
accurate parsing for the probabilistic CCG (Clark
and Curran, 2004a). They used the supertag-
ger for eliminating candidates of lexical entries,
and the probabilities of parse trees were calcu-
lated using the phrase-structure-based model with-
out the probabilities of lexical entry selection. Our
study is essentially different from theirs in that the
probabilities of lexical entry selection have been
demonstrated to dominantly contribute to the dis-
ambiguation of phrase structures.
We have not yet investigated whether our results
can be reproduced with other lexicalized gram-
mars. Our results might hold only for HPSG be-
cause HPSG has strict feature constraints and has
lexical entries with rich syntactic information such
as wh-movement.
6 Conclusion
We developed an extremely lexicalized probabilis-
tic model for fast and accurate HPSG parsing.
The model is very simple. The probabilities of
parse trees are defined with only the probabili-
ties of selecting lexical entries, which are trained
by the discriminative methods in the log-linear
model with features of word trigrams and POS 5-
grams as defined in the CCG supertagging. Ex-
periments revealed that the model achieved im-
pressive accuracy as high as that of the previous
model for the probabilistic HPSG and that the im-
plemented parser runs around four times faster.
This indicates that accurate and fast parsing is pos-
sible using rather simple mechanisms. In addi-
tion, we provided another probabilistic model, in
which the probabilities for the leaf nodes in a parse
tree are given by the probabilities of supertag-
ging, and the probabilities for the intermediate
nodes are given by the previous phrase-structure-
based model. The experiments demonstrated not
only speeds significantly increased by three to four
times but also impressive improvement in parsing
accuracy by around two points in precision and re-
call.
We hope that this research provides a novel ap-
proach to deterministic parsing in which only lex-
ical selection and little phrasal information with-
out packed representations dominates the parsing
strategy.

References
Steven P. Abney. 1997. Stochastic attribute-value
grammars. Computational Linguistics, 23(4):597–
618.
Srinivas Bangalore and Aravind Joshi. 1999. Su-
pertagging: An approach to almost parsing. Com-
putational Linguistics, 25(2):237–265.
Adam Berger, Stephen Della Pietra, and Vincent Della
Pietra. 1996. A maximum entropy approach to nat-
ural language processing. Computational Linguis-
tics, 22(1):39–71.
Eugene Charniak and Mark Johnson. 2005. Coarse-
to-fine n-best parsing and maxent discriminative
reranking. In Proc. of ACL’05, pages 173–180.
Stephen Clark and James R. Curran. 2004a. The im-
portance of supertagging for wide-coverage CCG
parsing. In Proc. of COLING-04.
Stephen Clark and James R. Curran. 2004b. Parsing
the WSJ using CCG and log-linear models. In Proc.
of ACL’04, pages 104–111.
Paul R. Cohen. 1995. Empirical Methods for Artificial
Intelligence. The MIT Press.
Michael Collins. 1999. Head-Driven Statistical Mod-
els for Natural Language Parsing. Ph.D. thesis,
Univ. of Pennsylvania.
James R. Curran and Stephen Clark. 2003. Investigat-
ing GIS and smoothing for maximum entropy tag-
gers. In Proc. of EACL’03, pages 91–98.
Stuart Geman and Mark Johnson. 2002. Dynamic pro-
gramming for parsing and estimation of stochastic
unification-based grammars. In Proc. of ACL’02,
pages 279–286.
Joshua Goodman. 1997. Global thresholding and mul-
tiple pass parsing. In Proc. of EMNLP-1997, pages
11–25.
Julia Hockenmaier. 2003. Parsing with generative
models of predicate-argument structure. In Proc. of
ACL’03, pages 359–366.
Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi
Chi, and Stefan Riezler. 1999. Estimators for
stochastic “unification-based” grammars. In Proc.
of ACL ’99, pages 535–541.
R. M. Kaplan, S. Riezler, T. H. King, J. T. Maxwell
III, and A. Vasserman. 2004. Speed and accuracy
in shallow and deep stochastic parsing. In Proc. of
HLT/NAACL’04.
Dan Klein and Christopher D. Manning. 2003. Ac-
curate unlexicalized parsing. In Proc. of ACL’03,
pages 423–430.
Robert Malouf and Gertjan van Noord. 2004. Wide
coverage parsing with stochastic attribute value
grammars. In Proc. of IJCNLP-04 Workshop “Be-
yond Shallow Analyses”.
Robert Malouf, John Carroll, and Ann Copestake.
2000. Efficient feature structure operations with-
out compilation. Journal of Natural Language En-
gineering, 6(1):29–46.
Robert Malouf. 2002. A comparison of algorithms for
maximum entropy parameter estimation. In Proc. of
CoNLL-2002, pages 49–55.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1994. Building a large annotated
corpus of English: The Penn Treebank. Computa-
tional Linguistics, 19(2):313–330.
Yusuke Miyao and Jun’ichi Tsujii. 2002. Maximum
entropy estimation for feature forests. In Proc. of
HLT 2002, pages 292–297.
Yusuke Miyao and Jun’ichi Tsujii. 2005. Probabilis-
tic disambiguation models for wide-coverage HPSG
parsing. In Proc. of ACL’05, pages 83–90.
Yusuke Miyao, Takashi Ninomiya, and Jun’ichi Tsujii,
2005. Keh-Yih Su, Jun’ichi Tsujii, Jong-Hyeok Lee
and Oi Yee Kwong (Eds.), Natural Language Pro-
cessing - IJCNLP 2004 LNAI 3248, chapter Corpus-
oriented Grammar Development for Acquiring a
Head-driven Phrase Structure Grammar from the
Penn Treebank, pages 684–693. Springer-Verlag.
Hiroko Nakanishi, Yusuke Miyao, and Jun’ichi Tsujii.
2004. An empirical investigation of the effect of lex-
ical rules on parsing with a treebank grammar. In
Proc. of TLT’04, pages 103–114.
Alexis Nasr and Owen Rambow. 2004. Supertagging
and full parsing. In Proc. of the 7th International
Workshop on Tree Adjoining Grammar and Related
Formalisms (TAG+7).
Takashi Ninomiya, Yoshimasa Tsuruoka, Yusuke
Miyao, and Jun’ichi Tsujii. 2005. Efficacy of beam
thresholding, unification filtering and hybrid pars-
ing in probabilistic hpsg parsing. In Proc. of IWPT
2005, pages 103–114.
Carl Pollard and Ivan A. Sag. 1994. Head-Driven
Phrase Structure Grammar. University of Chicago
Press.
Stefan Riezler, Detlef Prescher, Jonas Kuhn, and Mark
Johnson. 2000. Lexicalized stochastic modeling
of constraint-based grammars using log-linear mea-
sures and EM training. In Proc. of ACL’00, pages
480–487.
Libin Shen and Aravind K. Joshi. 2003. A SNoW
based supertagger with application to NP chunking.
In Proc. of ACL’03, pages 505–512.
Yoshimasa Tsuruoka and Jun’ichi Tsujii. 2005. Bidi-
rectional inference with the easiest-first strategy for
tagging sequence data. In Proc. of HLT/EMNLP
2005, pages 467–474.
Wen Wang and Mary P. Harper. 2004. A statisti-
cal constraint dependency grammar (CDG) parser.
In Proc. of ACL’04 Incremental Parsing work-
shop: Bringing Engineering and Cognition To-
gether, pages 42–49.
Wen Wang. 2003. Statistical Parsing and Language
Modeling based on Constraint Dependency Gram-
mar. Ph.D. thesis, Purdue University.
