Probabilistic CFG with latent annotations
Takuya Matsuzaki†  Yusuke Miyao†  Jun'ichi Tsujii†‡
†Graduate School of Information Science and Technology, University of Tokyo
Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033
‡CREST, JST (Japan Science and Technology Agency)
Honcho 4-1-8, Kawaguchi-shi, Saitama 332-0012
{matuzaki, yusuke, tsujii}@is.s.u-tokyo.ac.jp
Abstract
This paper defines a generative probabilis-
tic model of parse trees, which we call
PCFG-LA. This model is an extension of
PCFG in which non-terminal symbols are
augmented with latent variables. Fine-
grained CFG rules are automatically in-
duced from a parsed corpus by training a
PCFG-LA model using an EM-algorithm.
Because exact parsing with a PCFG-LA is
NP-hard, several approximations are de-
scribed and empirically compared. In ex-
periments using the Penn WSJ corpus, our
automatically trained model gave a per-
formance of 86.6% (F₁, sentences ≤ 40
words), which is comparable to that of an
unlexicalized PCFG parser created using
extensive manual feature selection.
1 Introduction
Variants of PCFGs form the basis of several broad-
coverage and high-precision parsers (Collins, 1999;
Charniak, 1999; Klein and Manning, 2003). In those
parsers, the strong conditional independence as-
sumption made in vanilla treebank PCFGs is weak-
ened by annotating non-terminal symbols with many
‘features’ (Goodman, 1997; Johnson, 1998). Exam-
ples of such features are head words of constituents,
labels of ancestor and sibling nodes, and subcatego-
rization frames of lexical heads. Effective features
and their good combinations are normally explored
through trial and error.
This paper defines a generative model of parse
trees that we call PCFG with latent annotations
(PCFG-LA). This model is an extension of PCFG
models in which non-terminal symbols are anno-
tated with latent variables. The latent variables work
just like the features attached to non-terminal sym-
bols. A fine-grained PCFG is automatically induced
from parsed corpora by training a PCFG-LA model
using an EM-algorithm, which replaces the manual
feature selection used in previous research.
The main focus of this paper is to examine the
effectiveness of the automatically trained models in
parsing. Because exact inference with a PCFG-LA,
i.e., selection of the most probable parse, is NP-hard,
we are forced to use some approximation of it. We
empirically compared three different approximation
methods. One of the three methods gives a perfor-
mance of 86.6% (F₁, sentences ≤ 40 words) on the
standard test set of the Penn WSJ corpus.
Utsuro et al. (1996) proposed a method that auto-
matically selects a proper level of generalization of
non-terminal symbols of a PCFG, but they did not
report the results of parsing with the obtained PCFG.
Henderson's parsing model (Henderson, 2003) has a motivation similar to ours in that a derivation history of a parse tree is compactly represented by induced hidden variables (the hidden layer activations of a neural network), although the details of his approach are quite different from ours.
2 Probabilistic model
PCFG-LA is a generative probabilistic model of
parse trees. In this model, an observed parse tree
is considered as incomplete data, and the corre-
Figure 1: Tree with latent annotations T[X] (complete data) and observed tree T (incomplete data). The complete tree (left) is (S[x₁] (NP[x₂] (DT[x₄] the) (N[x₅] cat)) (VP[x₃] (V[x₆] grinned))); the observed tree (right) is (S (NP (DT the) (N cat)) (VP (V grinned))).
sponding complete data is a tree with latent annotations. Each non-terminal node in the complete data is labeled with a complete symbol of the form A[x], where A is the non-terminal symbol of the corresponding node in the observed tree and x is a latent annotation symbol, which is an element of a fixed set H.
A complete/incomplete tree pair of the sentence, "the cat grinned," is shown in Figure 1. The complete parse tree, T[X] (left), is generated through a process just like the one in ordinary PCFGs, but the non-terminal symbols in the CFG rules are annotated with latent symbols, X = ⟨x₁, x₂, …⟩. Thus, the probability of the complete tree (T[X]) is

  P(T[X]) = π(S[x₁]) × β(S[x₁] → NP[x₂] VP[x₃])
            × β(NP[x₂] → DT[x₄] N[x₅])
            × β(DT[x₄] → the) × β(N[x₅] → cat)
            × β(VP[x₃] → V[x₆]) × β(V[x₆] → grinned),
where π(S[x₁]) denotes the probability of an occurrence of the symbol S[x₁] at a root node and β(r) denotes the probability of a CFG rule r. The probability of the observed tree, P(T), is obtained by summing P(T[X]) for all the assignments to latent annotation symbols, X:

  P(T) = Σ_{x₁∈H} Σ_{x₂∈H} ⋯ Σ_{x₆∈H} P(T[X]).   (1)
Using dynamic programming, the theoretical bound of the time complexity of the summation in Eq. 1 is reduced to be proportional to the number of non-terminal nodes in a parse tree. However, the calculation at node n still has a cost that grows exponentially with the number of n's daughters, because we must sum up the probabilities of |H|^(d+1) combinations of latent annotation symbols for a node with d daughters. We thus took a kind of transforma-
tion/detransformation approach, in which a tree is
binarized before parameter estimation and restored
to its original form after parsing. The details of the
binarization are explained in Section 4.
Using syntactically annotated corpora as training
data, we can estimate the parameters of a PCFG-
LA model using an EM algorithm. The algorithm
is a special variant of the inside-outside algorithm
of Pereira and Schabes (1992). Several recent works also use estimation algorithms similar to ours, i.e., inside-outside re-estimation on parse trees (Chiang and Bikel, 2002; Shen, 2004).
The rest of this section precisely defines PCFG-
LA models and briefly explains the estimation algo-
rithm. The derivation of the estimation algorithm is
largely omitted; see Pereira and Schabes (1992) for
details.
2.1 Model definition
We define a PCFG-LA G as a tuple G = ⟨N_nt, N_t, H, R, π, β⟩, where

  N_nt : a set of observable non-terminal symbols
  N_t : a set of terminal symbols
  H : a set of latent annotation symbols
  R : a set of observable CFG rules
  π(A[x]) : the probability of the occurrence of a complete symbol A[x] at a root node
  β(r) : the probability of a rule r ∈ R[H].

We use A, B, … for non-terminal symbols in N_nt; w₁, w₂, … for terminal symbols in N_t; and x, y, … for latent annotation symbols in H. N_nt[H] denotes the set of complete non-terminal symbols, i.e., N_nt[H] = {A[x] | A ∈ N_nt, x ∈ H}. Note that latent annotation symbols are not attached to terminal symbols.

In the above definition, R is a set of CFG rules of observable (i.e., not annotated) symbols. For simplicity of discussion, we assume that R is a CNF grammar, but extending to the general case is straightforward. R[H] is the set of CFG rules of complete symbols, such as V[x] → grinned or S[x] → NP[y] VP[z]. More precisely,

  R[H] = {(A[x] → w) | (A → w) ∈ R, x ∈ H}
       ∪ {(A[x] → B[y] C[z]) | (A → B C) ∈ R; x, y, z ∈ H}.
We assume that non-terminal nodes in a parse tree T are indexed by integers i = 1, …, I, starting from the root node. A complete tree is denoted by T[X], where X = (x₁, …, x_I) ∈ H^I is a vector of latent annotation symbols and x_i is the latent annotation symbol attached to the i-th non-terminal node.
We do not assume any structured parametrizations in β and π; that is, each β(r) (r ∈ R[H]) and π(A[x]) (A[x] ∈ N_nt[H]) is itself a parameter to be tuned. Therefore, an annotation symbol, say, x, generally does not express any commonalities among the complete non-terminals annotated by x, such as A[x], B[x], etc.
The probability of a complete parse tree T[X] is defined as

  P(T[X]) = π(A₁[x₁]) ∏_{r∈R_{T[X]}} β(r),   (2)

where A₁[x₁] is the label of the root node of T[X] and R_{T[X]} denotes the multiset of annotated CFG rules used in the generation of T[X]. We have the probability of an observable tree T by marginalizing out the latent annotation symbols in T[X]:

  P(T) = Σ_{X∈H^I} π(A₁[x₁]) ∏_{r∈R_{T[X]}} β(r),   (3)

where I is the number of non-terminal nodes in T.
2.2 Forward-backward probability
The sum in Eq. 3 can be calculated using a dynamic
programming algorithm analogous to the forward al-
gorithm for HMMs. For a sentence w₁w₂ ⋯ w_n and its parse tree T, backward probabilities b_i^T(x) are recursively computed for the i-th non-terminal node and for each x ∈ H. In the definition below, N_i ∈ N_nt denotes the non-terminal label of the i-th node.

- If node i is a pre-terminal node above a terminal symbol w_j, then b_i^T(x) = β(N_i[x] → w_j).
- Otherwise, let j and k be the two daughter nodes of i. Then

    b_i^T(x) = Σ_{y,z∈H} β(N_i[x] → N_j[y] N_k[z]) b_j^T(y) b_k^T(z).

Using backward probabilities, P(T) is calculated as P(T) = Σ_{x₁∈H} π(N₁[x₁]) b₁^T(x₁).

We define forward probabilities f_i^T(x), which are used in the estimation described below, as follows:

- If node i is the root node (i.e., i = 1), then f_i^T(x) = π(N_i[x]).
- If node i has a right sibling k, let j be the mother node of i. Then

    f_i^T(x) = Σ_{y,z∈H} β(N_j[y] → N_i[x] N_k[z]) f_j^T(y) b_k^T(z).

- If node i has a left sibling, f_i^T(x) is defined analogously.
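To make the recursions above concrete, here is a minimal Python sketch of the backward pass and of P(T); the tree encoding, the dictionary layout of the parameters, and the toy grammar values are illustrative choices of ours, not notation from the paper.

```python
# Backward probabilities b_i(x) and the tree marginal P(T) for a toy
# PCFG-LA over the sentence of Figure 1. Trees are nested tuples, and a
# pre-terminal node is written (label, word).
H = ["x1", "x2"]  # latent annotation symbols

# beta[(A, x, B, y, C, z)] = beta(A[x] -> B[y] C[z]); lex[(A, x, w)] =
# beta(A[x] -> w); pi[(A, x)] = root probability of A[x]. Toy values.
beta = {("S", x, "NP", y, "VP", z): 0.25 for x in H for y in H for z in H}
beta.update({("NP", x, "DT", y, "N", z): 0.25
             for x in H for y in H for z in H})
lex = {(A, x, w): 1.0
       for (A, w) in [("DT", "the"), ("N", "cat"), ("VP", "grinned")]
       for x in H}
pi = {("S", x): 1.0 / len(H) for x in H}

def backward(t):
    """b_i(x): probability that node i generates its yield, given x."""
    A = t[0]
    if isinstance(t[1], str):                   # pre-terminal above a word
        return {x: lex.get((A, x, t[1]), 0.0) for x in H}
    bl, br = backward(t[1]), backward(t[2])     # daughter nodes j and k
    B, C = t[1][0], t[2][0]
    return {x: sum(beta.get((A, x, B, y, C, z), 0.0) * bl[y] * br[z]
                   for y in H for z in H)
            for x in H}

def tree_prob(t):
    """P(T) = sum over x1 of pi(N1[x1]) b_1(x1): the marginal of Eq. 3."""
    b1 = backward(t)
    return sum(pi.get((t[0], x), 0.0) * b1[x] for x in H)

T = ("S", ("NP", ("DT", "the"), ("N", "cat")), ("VP", "grinned"))
print(tree_prob(T))  # 1.0 for this deterministic toy grammar
```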
2.3 Estimation
We now derive the EM algorithm for PCFG-LA,
which estimates the parameters θ = (β, π). Let 𝒯 = {T₁, T₂, …} be the training set of parse trees and N^i_1, …, N^i_{I_i} be the labels of the non-terminal nodes in T_i. Like the derivations of the EM algorithms for other latent variable models, the update formulas for the parameters, which update the parameters from θ to θ′ = (β′, π′), are obtained by constrained optimization of Q(θ′|θ), which is defined as

  Q(θ′|θ) = Σ_{T_i∈𝒯} Σ_{X_i∈H^{I_i}} P_θ(X_i|T_i) log P_{θ′}(T_i[X_i]),

where P_θ and P_{θ′} denote probabilities under θ and θ′, and P(X|T) is the conditional probability of latent annotation symbols given an observed tree T, i.e., P(X|T) = P(T[X]) / P(T). Using the Lagrange multiplier method and re-arranging the results using the backward and forward probabilities, we obtain the update formulas in Figure 2.
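To illustrate the quantities involved, the sketch below computes the posterior expected usage of one annotated rule at one node, which is the building block of the updates in Figure 2; the expected-count form shown here is the standard one for latent-variable grammars, and the function names, data layout, and the single-rule simplification in the M-step are our own.

```python
H = ["x1", "x2"]

def rule_posteriors(f_i, b_j, b_k, beta_rule, p_tree):
    """E-step at one node i with daughters j, k, for an observable rule
    A -> B C: the posterior probability that A[x] -> B[y] C[z] is used
    at i is f_i(x) * beta(A[x] -> B[y] C[z]) * b_j(y) * b_k(z) / P(T),
    with f and b the forward/backward probabilities of Section 2.2."""
    return {(x, y, z):
            f_i[x] * beta_rule[(x, y, z)] * b_j[y] * b_k[z] / p_tree
            for x in H for y in H for z in H}

def m_step(counts):
    """M-step: renormalize accumulated expected counts so that the rule
    probabilities for each complete symbol A[x] sum to one. (Simplifying
    assumption: A -> B C is the only observable rule expanding A;
    otherwise the counts of all rules sharing the left-hand side A[x]
    must be pooled before dividing.)"""
    total = {x: 0.0 for x in H}
    for (x, y, z), c in counts.items():
        total[x] += c
    return {(x, y, z): c / total[x] for (x, y, z), c in counts.items()}
```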
3 Parsing with PCFG-LA
In theory, we can use PCFG-LAs to parse a given
sentence w by selecting the most probable parse:

  T_best = argmax_{T∈𝒢(w)} P(T|w) = argmax_{T∈𝒢(w)} P(T),   (4)

where 𝒢(w) denotes the set of possible parses for w under the observable grammar R. While the optimization problem in Eq. 4 can be efficiently solved
  β′(A[x] → B[y]C[z]) ∝ Σ_{T_i∈𝒯} (1/P(T_i)) Σ_{j∈Covered(A→BC; T_i)} f_j^{T_i}(x) β(A[x] → B[y]C[z]) b_{j₁}^{T_i}(y) b_{j₂}^{T_i}(z)

  β′(A[x] → w) ∝ Σ_{T_i∈𝒯} (1/P(T_i)) Σ_{j∈Covered(A→w; T_i)} f_j^{T_i}(x) β(A[x] → w)

  π′(A[x]) ∝ Σ_{T_i∈𝒯} (1/P(T_i)) Σ_{j∈Root(A; T_i)} π(A[x]) b_j^{T_i}(x)

where the proportionality constants normalize β′ over the annotated rules expanding each complete symbol A[x] (the total posterior mass of A[x] being Σ_{T_i∈𝒯} (1/P(T_i)) Σ_{j∈Labeled(A; T_i)} f_j^{T_i}(x) b_j^{T_i}(x)) and π′ over root symbols, and

  Covered(A→BC; T_i) = {j | the j-th node of T_i is labeled A and its daughters j₁ and j₂ are labeled B and C}
  Covered(A→w; T_i) = {j | the j-th node of T_i is a pre-terminal node labeled A above the word w}
  Labeled(A; T_i) = {j | the j-th node of T_i is labeled A}
  Root(A; T_i) = {1} if the root of T_i is labeled A, and ∅ otherwise.

Figure 2: Parameter update formulas.
for PCFGs using dynamic programming algorithms, the sum-of-products form of P(T) in PCFG-LA models (see Eq. 2 and Eq. 3) makes it difficult to apply such techniques to solve Eq. 4.
Actually, the optimization problem in Eq. 4 is NP-
hard for general PCFG-LA models. Although we
omit the details, we can prove the NP-hardness by
observing that a stochastic tree substitution grammar
(STSG) can be represented by a PCFG-LA model in
a way similar to the one described by Goodman (1996a), and then using the NP-hardness of STSG parsing (Sima'an, 2002).
The difficulty of the exact optimization in Eq. 4
forces us to use some approximations of it. The rest
of this section describes three different approxima-
tions, which are empirically compared in the next
section. The first method simply limits the number
of candidate parse trees compared in Eq. 4; we first
create N-best parses using a PCFG and then, within
the N-best parses, select the one with the highest
probability in terms of the PCFG-LA. The other two
methods are a little more complicated, and we ex-
plain them in separate subsections.
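Once a PCFG N-best parser and the PCFG-LA marginal P(T) are available, the first method amounts to a few lines. In the sketch below, n_best_parses and tree_prob are placeholders assumed to be supplied by the caller (tree_prob, for example, as sketched in Section 2.2); neither is an interface defined in the paper.

```python
def rerank(w, n, n_best_parses, tree_prob):
    """Approximation 1: among the N-best parses of sentence w under a
    plain PCFG, return the tree with the highest PCFG-LA marginal P(T)."""
    candidates = n_best_parses(w, n)  # N-best list from the PCFG (placeholder)
    return max(candidates, key=tree_prob)
```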
3.1 Approximation by Viterbi complete trees
The second approximation method selects the best
complete tree T′[X′], that is,

  T′[X′] = argmax_{T∈𝒢(w), X} P(T[X]).   (5)

We call T′[X′] a Viterbi complete tree. Such a tree can be obtained in O(|w|³) time by regarding the PCFG-LA as a PCFG with annotated symbols.¹
The observable part of the Viterbi complete tree T′[X′] (i.e., T′) does not necessarily coincide with the best observable tree T_best in Eq. 4. However, if T_best has some ‘dominant’ assignment ν to its latent annotation symbols such that P(T_best[ν]) ≈ P(T_best), then P(T′) ≈ P(T_best) because P(T_best[ν]) ≤ P(T′[X′]) and P(T′[X′]) ≤ P(T′), and thus T′ and T_best are almost equally ‘good’ in terms of their marginal probabilities.
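A Viterbi complete tree can be computed with any off-the-shelf Viterbi parser for PCFGs once the PCFG-LA is written out as a PCFG over annotated symbols. The sketch below shows that expansion only; the data layout follows the earlier sketches and is illustrative.

```python
from itertools import product

H = ["x1", "x2"]

def expand_to_pcfg(binary_rules, lex_rules, beta, lex):
    """Expand observable rules into the complete-symbol rules R[H]:
    A -> B C becomes A[x] -> B[y] C[z] for all x, y, z in H, and
    A -> w becomes A[x] -> w. Feeding the result to an ordinary
    CKY/Viterbi parser yields the Viterbi complete tree of Eq. 5."""
    rules = {}
    for (A, B, C) in binary_rules:
        for x, y, z in product(H, repeat=3):
            rules[((A, x), (B, y), (C, z))] = beta[(A, x, B, y, C, z)]
    for (A, w) in lex_rules:
        for x in H:
            rules[((A, x), w)] = lex[(A, x, w)]
    return rules
```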
3.2 Viterbi parse in approximate distribution
In the third method, we approximate the true distribution P(T|w) by a cruder distribution Q(T|w), and then find the tree with the highest Q(T|w) in polynomial time. We first create a packed representation of 𝒢(w) for a given sentence w.² Then, the approximate distribution Q(T|w) is created using the packed forest, and the parameters in Q(T|w) are adjusted so that Q(T|w) approximates P(T|w) as closely as possible. The form of Q(T|w) is that of a product of the parameters, just like the form of a PCFG model, and it enables us to use a Viterbi algorithm to select the tree with the highest Q(T|w).
A packed forest is defined as a tuple ⟨I, F⟩. The first component, I, is a multiset of chart items of the form (A, b, c). A chart item (A, b, c) ∈ I indicates that there exists a parse tree in 𝒢(w) that contains a constituent with the non-terminal label A that spans
¹For efficiency, we did not actually parse sentences with R[H] but selected a Viterbi complete tree from a packed representation of candidate parses in the experiments in Section 4.
²In practice, fully constructing a packed representation of 𝒢(w) has an unrealistically high cost for most input sentences. Alternatively, we can use a packed representation of a subset of 𝒢(w), which can be obtained by parsing with beam thresholding, for instance. An approximate distribution Q(T|w) on such subsets can be derived in almost the same way as one for the full 𝒢(w), but the conditional distribution, P(T|w), is re-normalized so that the total mass for the subset sums to 1.
The two trees for w = w₁w₂w₃ are (A (B (C w₁) (D w₂)) (E w₃)) and (A (C w₁) (B (D w₂) (E w₃))), and their packed representation is:

  I = {i₁, i₂, i₃, i₄, i₅, i₆}
  i₁ = (A, 1, 3), i₂ = (B, 1, 2), i₃ = (B, 2, 3),
  i₄ = (C, 1, 1), i₅ = (D, 2, 2), i₆ = (E, 3, 3)
  F(i₁) = {(i₂, i₆), (i₄, i₃)}, F(i₂) = {(i₄, i₅)}, F(i₃) = {(i₅, i₆)},
  F(i₄) = {w₁}, F(i₅) = {w₂}, F(i₆) = {w₃}

Figure 3: Two parse trees and a packed representation of them.
from the b-th to the c-th word in w. The second component, F, is a function on I that represents dominance relations among the chart items in I; F(i) is a set of possible daughters of i if i is not a pre-terminal node, and F(i) = {w_k} if i is a pre-terminal node above w_k. Two parse trees for a sentence w = w₁w₂w₃ and a packed representation of them are shown in Figure 3.

We require that each tree T ∈ 𝒢(w) has a unique representation as a set of connected chart items in I. A packed representation satisfying the uniqueness condition is created using the CKY algorithm with the observable grammar R, for instance.
The approximate distribution, Q(T|w), is defined as a PCFG whose set of CFG rules, R_w, is defined as R_w = {(i → α) | i ∈ I, α ∈ F(i)}. We use q(r) to denote the rule probability of a rule r ∈ R_w and q_root(i) to denote the probability with which i ∈ I is generated as a root node. We define Q(T|w) as

  Q(T|w) = q_root(i₁) ∏_{k=1}^{n} q(i_k → α_k),

where the set of connected items {i₁, …, i_n} ⊆ I is the unique representation of T.
To measure the closeness of the approximation by Q(T|w), we use the ‘inclusive’ KL-divergence, KL(P‖Q) (Frey et al., 2000):

  KL(P‖Q) = Σ_{T∈𝒢(w)} P(T|w) log [P(T|w) / Q(T|w)].

Minimizing KL(P‖Q) under the normalization constraints on q_root and q yields closed form solutions for q_root and q, as shown in Figure 4.
P_in and P_out in Figure 4 are similar to ordinary inside/outside probabilities. We define P_in as follows:

- If i = (A, k, k) ∈ I is a pre-terminal node above w_k, then P_in(i[x]) = β(A[x] → w_k).
- Otherwise,

    P_in(i[x]) = Σ_{(j,k)∈F(i)} Σ_{y,z∈H} β(A[x] → B_j[y] C_k[z]) P_in(j[y]) P_in(k[z]),

  where B_j and C_k denote the non-terminal symbols of chart items j and k.

The outside probability, P_out, is calculated using P_in and the PCFG-LA parameters along the packed structure, like the outside probabilities for PCFGs. Once we have computed q(i → α) and q_root(i), the parse tree T that maximizes Q(T|w) is found using a Viterbi algorithm, as in PCFG parsing.
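The closed-form solutions of Figure 4 are cheap to evaluate once inside/outside tables over the packed forest are available. Below is a sketch with illustrative data structures: items are opaque ids, label[i] gives the non-terminal of item i, and p_in[i][x], p_out[i][x] hold P_in(i[x]) and P_out(i[x]).

```python
H = ["x1", "x2"]

def q_rule(i, j, k, label, p_in, p_out, beta):
    """q(i -> (j, k)) of Figure 4: the posterior mass of this branching
    divided by the total posterior mass passing through item i."""
    A, B, C = label[i], label[j], label[k]
    num = sum(p_out[i][x] * beta[(A, x, B, y, C, z)]
              * p_in[j][y] * p_in[k][z]
              for x in H for y in H for z in H)
    den = sum(p_out[i][x] * p_in[i][x] for x in H)
    return num / den

def q_root(i, label, p_in, pi, p_w):
    """q_root(i) = (1 / P(w)) * sum over x of pi(A[x]) P_in(i[x])."""
    A = label[i]
    return sum(pi[(A, x)] * p_in[i][x] for x in H) / p_w
```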
Several parsing algorithms that also use inside-outside calculations on packed charts have been proposed (Goodman, 1996b; Sima'an, 2003; Clark and Curran, 2004). Those algorithms optimize some evaluation metric of parse trees other than the posterior probability P(T|w), e.g., (expected) labeled constituent recall or the (expected) recall rate of dependency relations contained in a parse. This is in contrast with our approach, in which the (approximate) posterior probability is optimized.
4 Experiments
We conducted four sets of experiments. In the first
set of experiments, the degree of dependency of
trained models on initialization was examined be-
cause EM-style algorithms yield different results
with different initial values of parameters. In the
second set of experiments, we examined the rela-
tionship between model types and their parsing per-
formances. In the third set of experiments, we com-
pared the three parsing methods described in the pre-
vious section. Finally, we show the result of a pars-
ing experiment using the standard test set.
We used sections 2 through 20 of the Penn WSJ
corpus as training data and section 21 as heldout
data. The heldout data was used for early stop-
ping; i.e., the estimation was stopped when the rate
- If i ∈ I is not a pre-terminal node, then for each α = (j, k) ∈ F(i), let A, B, and C be the non-terminal symbols of i, j, and k. Then,

    q(i → α) = [Σ_{x,y,z∈H} P_out(i[x]) β(A[x] → B[y]C[z]) P_in(j[y]) P_in(k[z])] / [Σ_{x∈H} P_out(i[x]) P_in(i[x])].

- If i ∈ I is a pre-terminal node above the word w_k, then q(i → w_k) = 1.
- If i ∈ I is a root node, let A be the non-terminal symbol of i. Then

    q_root(i) = (1 / P(w)) Σ_{x∈H} π(A[x]) P_in(i[x]).

Figure 4: Optimal parameters of the approximate distribution Q.
Figure 5: Original subtree (A → B₁ B₂ H C₁ C₂; H is the head daughter).
of increase in the likelihood of the heldout data be-
came lower than a certain threshold. Section 22 was
used as test data in all parsing experiments except
in the final one, in which section 23 was used. We
stripped off all function tags and eliminated empty
nodes in the training and heldout data, but no other pre-processing, such as comma raising or base-NP marking (Collins, 1999), was done except for binarization.
4.1 Dependency on initial values
To see the degree of dependency of trained mod-
els on initializations, four instances of the same
model were trained with different initial values of
parameters.³ The model used in this experiment was created by CENTER-PARENT binarization, and |H| was set to 16. Table 1 lists training/heldout data log-
likelihood per sentence (LL) for the four instances
and their parsing performances on the test set (sec-
tion 22). The parsing performances were obtained
using the approximate distribution method in Sec-
tion 3.2. Different initial values were shown to affect
the results of training to some extent (Table 1).
³The initial value for an annotated rule probability, β(A[x] → B[y]C[z]), was created by randomly multiplying the maximum likelihood estimate of the corresponding PCFG rule probability, P(A → B C), as follows:

  β(A[x] → B[y]C[z]) = c e^r P(A → B C),

where r is a random number that is uniformly distributed in [−log 3, log 3] and c is a normalization constant.
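A sketch of this randomized initialization follows; the data layout (a table of observable rules grouped by left-hand side) is our own, and the normalization mirrors the footnote's constant c by renormalizing within each complete symbol A[x].

```python
import math
import random

def init_beta(pcfg_prob, rules_by_lhs, H):
    """pcfg_prob[(A, B, C)]: MLE of P(A -> B C); rules_by_lhs[A]: the
    (B, C) right-hand sides of observable rules expanding A."""
    beta = {}
    for A, rhs_list in rules_by_lhs.items():
        for x in H:
            group = []  # annotated rules expanding the complete symbol A[x]
            for (B, C) in rhs_list:
                for y in H:
                    for z in H:
                        r = random.uniform(-math.log(3), math.log(3))
                        key = (A, x, B, y, C, z)
                        beta[key] = math.exp(r) * pcfg_prob[(A, B, C)]
                        group.append(key)
            total = sum(beta[k] for k in group)  # normalization constant
            for k in group:
                beta[k] /= total
    return beta
```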
                 1      2      3      4     average ± σ
  training LL  -115   -114   -115   -114    -114 ± 0.41
  heldout LL   -114   -115   -115   -114    -114 ± 0.29
  LR           86.7   86.3   86.3   87.0    86.6 ± 0.27
  LP           86.2   85.6   85.5   86.6    86.0 ± 0.48

Table 1: Dependency on initial values.
Figure 6: Four types of binarization (H: head daughter): CENTER-PARENT, CENTER-HEAD, LEFT, and RIGHT. The intermediate nodes introduced by binarization are labeled with a bracketed copy of the parent symbol (CENTER-PARENT, LEFT, RIGHT) or of the head daughter (CENTER-HEAD).
4.2 Model types and parsing performance
We compared four types of binarization. The original form is depicted in Figure 5 and the four binarized forms are shown in Figure 6. In the first two methods, called CENTER-PARENT and CENTER-HEAD, the head-finding rules of Collins (1999) were used. We obtained an observable grammar R for each model by reading off grammar rules from the binarized training trees. For each binarization method, PCFG-LA models with different numbers of latent annotation symbols, |H| = 1, 2, 4, 8, and 16, were trained.
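As an illustration, here is a sketch of the RIGHT scheme, the simplest of the four: daughters are peeled off from the left and the remainder is packed under an intermediate symbol. The "<A>"-style naming of intermediate nodes is our own shorthand for the bracketed symbols in Figure 6.

```python
def binarize_right(tree):
    """RIGHT binarization of an n-ary node (label, child1, ..., childn)."""
    label = tree[0]
    kids = [binarize_right(c) if isinstance(c, tuple) else c
            for c in tree[1:]]
    if len(kids) <= 2:
        return (label,) + tuple(kids)
    aux = "<" + label + ">"                 # intermediate symbol (illustrative)
    node = (aux, kids[-2], kids[-1])        # rightmost pair first
    for child in reversed(kids[1:-2]):      # then fold leftwards
        node = (aux, child, node)
    return (label, kids[0], node)

print(binarize_right(("A", "B1", "B2", "H", "C1", "C2")))
# ('A', 'B1', ('<A>', 'B2', ('<A>', 'H', ('<A>', 'C1', 'C2'))))
```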
Figure 7: Model size vs. parsing performance (F₁ plotted against the number of parameters, for CENTER-PARENT, CENTER-HEAD, RIGHT, and LEFT).
The relationships between the number of param-
eters in the models and their parsing performances
are shown in Figure 7. Note that models created
using different binarization methods have different
numbers of parameters for the same |H|. The parsing performances were measured using F₁ scores of
the parse trees that were obtained by re-ranking of
1000-best parses by a PCFG.
We can see that the parsing performance gets bet-
ter as the model size increases. We can also see that
models of roughly the same size yield similar perfor-
mances regardless of the binarization scheme used
for them, except the models created using LEFT binarization with small numbers of parameters (|H| = 1 and 2). Taking into account the dependency on ini-
tial values at the level shown in the previous exper-
iment, we cannot say that any single model is supe-
rior to the other models when the sizes of the models
are large enough.
The results shown in Figure 7 suggest that we
could further improve parsing performance by in-
creasing the model size. However, both the memory
size and the training time are more than linear in |H|, and the training time for the largest (|H| = 16) models was about 15 hours for the models created using CENTER-PARENT, CENTER-HEAD, and LEFT and about 20 hours for the model created using RIGHT. To deal with larger (e.g., |H| = 32 or 64)
models, we therefore need to use a model search that
reduces the number of parameters while maintaining
the model’s performance, and an approximation dur-
ing training to reduce the training time.
Figure 8: Comparison of parsing methods (F₁ plotted against average parsing time in seconds, for N-best re-ranking, Viterbi complete tree, and approximate distribution).
4.3 Comparison of parsing methods
The relationships between the average parse time
and parsing performance using the three parsing
methods described in Section 3 are shown in Fig-
ure 8. A model created using CENTER-PARENT
with |H| = 16 was used throughout this experiment.
The data points were made by varying config-
urable parameters of each method, which control the
number of candidate parses. To create the candi-
date parses, we first parsed input sentences using a
PCFG,⁴ using beam thresholding with beam width b. The data points on a line in the figure were created by varying b with other parameters fixed. The first method re-ranked the N-best parses enumerated from the chart after the PCFG parsing. The two lines for the first method in the figure correspond to N = 100 and N = 300. In the second and the third
methods, we removed all the dominance relations
among chart items that did not contribute to any
parses whose PCFG-scores were higher than γ·P_max, where P_max is the PCFG-score of the best parse in
the chart. The parses remaining in the chart were the
candidate parses for the second and the third meth-
ods. The different lines for the second and the third
methods correspond to different values of γ.
The third method outperforms the other two meth-
ods unless the parse time is very limited (i.e., < 1
⁴The PCFG used in creating the candidate parses is roughly
the same as the one that Klein and Manning (2003) call a
‘markovised PCFG with vertical order = 2 and horizontal or-
der = 1’ and was extracted from Section 02-20. The PCFG itself
gave a performance of 79.6/78.5 LP/LR on the development set.
This PCFG was also used in the experiment in Section 4.4.
81
≤ 40 words       LR    LP    CB    0 CB
This paper 86.7 86.6 1.19 61.1
Klein and Manning (2003) 85.7 86.9 1.10 60.3
Collins (1999) 88.5 88.7 0.92 66.7
Charniak (1999) 90.1 90.1 0.74 70.1
≤ 100 words      LR    LP    CB    0 CB
This paper 86.0 86.1 1.39 58.3
Klein and Manning (2003) 85.1 86.3 1.31 57.2
Collins (1999) 88.1 88.3 1.06 64.0
Charniak (1999) 89.6 89.5 0.88 67.6
Table 2: Comparison with other parsers.
sec is required), as shown in the figure. The superi-
ority of the third method over the first method seems
to stem from the difference in the number of can-
didate parses from which the outputs are selected.⁵
The superiority of the third method over the second
method is a natural consequence of the consistent
use of a61 a51a62a38 a59 both in the estimation (as the objective
function) and in the parsing (as the score of a parse).
4.4 Comparison with related work
Parsing performance on section 23 of the WSJ cor-
pus using a PCFG-LA model is shown in Table 2.
We used the instance of the four compared in the
second experiment that gave the best results on the
development set. Several previously reported results
on the same test set are also listed in Table 2.
Our result is below those of the state-of-the-art lexicalized PCFG parsers (Collins, 1999; Charniak, 1999), but comparable to that of the unlexicalized PCFG parser of Klein and Manning (2003). Klein and
Manning’s PCFG is annotated by many linguisti-
cally motivated features that they found using ex-
tensive manual feature selection. In contrast, our
method induces all parameters automatically, except
that manually written head-rules are used in bina-
rization. Thus, our method can extract a consider-
able amount of hidden regularity from parsed cor-
pora. However, our result is worse than the lexical-
ized parsers despite the fact that our model has ac-
cess to the words in the sentences. This suggests that certain types of information used in those lexicalized
⁵Actually, the number of parses contained in the packed forest is more than 1 million for over half of the test sentences when b = 10⁻⁴ and γ = 10⁻³, while the number of parses for which the first method can compute the exact probability in a comparable time (around 4 sec) is only about 300.
parsers are hard to learn with our approach.
References
Eugene Charniak. 1999. A maximum-entropy-inspired
parser. Technical Report CS-99-12.
David Chiang and Daniel M. Bikel. 2002. Recovering
latent information in treebanks. In Proc. COLING,
pages 183–189.
Stephen Clark and James R. Curran. 2004. Parsing the
WSJ using CCG and log-linear models. In Proc. ACL,
pages 104–111.
Michael Collins. 1999. Head-Driven Statistical Models
for Natural Language Parsing. Ph.D. thesis, Univer-
sity of Pennsylvania.
Brendan J. Frey, Relu Patrascu, Tommi Jaakkola, and
Jodi Moran. 2000. Sequentially fitting “inclusive”
trees for inference in noisy-OR networks. In Proc.
NIPS, pages 493–499.
Joshua Goodman. 1996a. Efficient algorithms for pars-
ing the DOP model. In Proc. EMNLP, pages 143–152.
Joshua Goodman. 1996b. Parsing algorithms and metrics.
In Proc. ACL, pages 177–183.
Joshua Goodman. 1997. Probabilistic feature grammars.
In Proc. IWPT.
James Henderson. 2003. Inducing history representa-
tions for broad coverage statistical parsing. In Proc.
HLT-NAACL, pages 103–110.
Mark Johnson. 1998. PCFG models of linguis-
tic tree representations. Computational Linguistics,
24(4):613–632.
Dan Klein and Christopher D. Manning. 2003. Accurate
unlexicalized parsing. In Proc. ACL, pages 423–430.
Fernando Pereira and Yves Schabes. 1992. Inside-
outside reestimation from partially bracketed corpora.
In Proc. ACL, pages 128–135.
Libin Shen. 2004. Nondeterministic LTAG derivation
tree extraction. In Proc. TAG+7, pages 199–203.
Khalil Sima'an. 2002. Computational complexity of
probabilistic disambiguation. Grammars, 5(2):125–
151.
Khalil Sima'an. 2003. On maximizing metrics for syn-
tactic disambiguation. In Proc. IWPT.
Takehito Utsuro, Syuuji Kodama, and Yuji Matsumoto.
1996. Generalization/specialization of context free
grammars based on entropy of non-terminals. In Proc.
JSAI (in Japanese), pages 327–330.
