<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1010"> <Title>Probabilistic CFG with latent annotations</Title> <Section position="3" start_page="0" end_page="76" type="metho"> <SectionTitle> 2 Probabilistic model </SectionTitle> <Paragraph position="0"> PCFG-LA is a generative probabilistic model of parse trees. In this model, an observed parse tree is regarded as incomplete data, and the corresponding complete data is a tree with latent annotations. Each non-terminal node in the complete data is labeled with a complete symbol of the form $A[x]$, where $A$ is the non-terminal symbol of the corresponding node in the observed tree and $x$ is a latent annotation symbol, which is an element of a fixed set $H$. [Figure caption, partly lost in extraction: ... (complete data) and observed tree $T$ (incomplete data).]</Paragraph> <Paragraph position="1"> A complete/incomplete tree pair for the sentence &quot;the cat grinned&quot; is shown in Figure 2. The complete parse tree, $T[X]$ (left), is generated through a process just like the one in ordinary PCFGs, but the non-terminal symbols in the CFG rules are annotated with latent symbols, $X = (x_1, x_2, \ldots)$. Thus, the probability of the complete tree $T[X]$ is</Paragraph> <Paragraph position="3"> where $\pi(S[x_1])$ denotes the probability of an occurrence of the symbol $S[x_1]$ at a root node and $\beta(r)$ denotes the probability of a CFG rule $r$. The probability of the observed tree, $P(T)$, is obtained by summing $P(T[X])$ over all assignments to the latent annotation symbols $X$: $$P(T) = \sum_{X} P(T[X]). \qquad (1)$$</Paragraph> <Paragraph position="5"> With dynamic programming, the time complexity of the summation in Eq. 1 is reduced to a bound proportional to the number of non-terminal nodes in the parse tree. 
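As a hedged illustration of the summation in Eq. 1, the following Python sketch compares brute-force enumeration of all latent assignments with a memoized bottom-up computation on a toy tree. The tree shape, the two-symbol annotation set H, and the weight functions pi and beta are assumptions made for this example (the weights are arbitrary and unnormalized); none of them are values from the paper.

```python
from functools import lru_cache
from itertools import product

H = (0, 1)  # hypothetical latent annotation set

# Assumed observed tree for "the cat grinned": (label, child, ...);
# a pre-terminal node is (label, word).
TREE = ("S",
        ("NP", ("DT", "the"), ("N", "cat")),
        ("VP", ("V", "grinned")))


def pi(label, x):
    """Illustrative root weight for the complete symbol label[x]."""
    return 0.7 if x == 0 else 0.3


def beta(parent, x, rhs):
    """Illustrative rule weight; parent is unused in this toy function.

    rhs is a word, or a tuple of (label, annotation) pairs.
    """
    if isinstance(rhs, str):
        return 0.6 if x == 0 else 0.4
    return 0.2 + 0.1 * ((x + sum(y for _, y in rhs)) % 3)


def count_nodes(node):
    """Number of non-terminal nodes in the tree."""
    kids = node[1:]
    if isinstance(kids[0], str):
        return 1
    return 1 + sum(count_nodes(k) for k in kids)


def complete_prob(tree, xs):
    """Weight of the complete tree T[X]; xs holds one annotation per node."""
    it = iter(xs)

    def walk(node, x):
        kids = node[1:]
        if isinstance(kids[0], str):
            return beta(node[0], x, kids[0])
        ys = [next(it) for _ in kids]
        rhs = tuple((k[0], y) for k, y in zip(kids, ys))
        p = beta(node[0], x, rhs)
        for k, y in zip(kids, ys):
            p *= walk(k, y)
        return p

    x_root = next(it)
    return pi(tree[0], x_root) * walk(tree, x_root)


def marginal_brute(tree):
    """Sum over every assignment X, as written in Eq. 1."""
    m = count_nodes(tree)
    return sum(complete_prob(tree, xs) for xs in product(H, repeat=m))


@lru_cache(maxsize=None)
def backward(node, x):
    """Summed weight of the subtree below node, given annotation x."""
    kids = node[1:]
    if isinstance(kids[0], str):
        return beta(node[0], x, kids[0])
    total = 0.0
    for ys in product(H, repeat=len(kids)):
        rhs = tuple((k[0], y) for k, y in zip(kids, ys))
        p = beta(node[0], x, rhs)
        for k, y in zip(kids, ys):
            p *= backward(k, y)
        total += p
    return total


def marginal_dp(tree):
    """Same sum as marginal_brute, computed node by node."""
    return sum(pi(tree[0], x) * backward(tree, x) for x in H)


print(marginal_brute(TREE), marginal_dp(TREE))
```

Both functions return the same value up to floating-point rounding. The memoized version evaluates each (node, annotation) pair once, which is why the number of node visits is linear in the size of the tree.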
However, the calculation at node $n$ still has a cost that grows exponentially with the number of $n$'s daughters, because we must sum the probabilities of $|H|^{d+1}$ combinations of latent annotation symbols for a node with $d$ daughters. We therefore took a transformation/detransformation approach, in which a tree is binarized before parameter estimation and restored to its original form after parsing. The details of the binarization are explained in Section 4.</Paragraph> <Paragraph position="6"> Using syntactically annotated corpora as training data, we can estimate the parameters of a PCFG-LA model with an EM algorithm. The algorithm is a special variant of the inside-outside algorithm of Pereira and Schabes (1992). Several recent studies also use estimation algorithms similar to ours, i.e., inside-outside re-estimation on parse trees (Chiang and Bikel, 2002; Shen, 2004).</Paragraph> <Paragraph position="7"> The rest of this section precisely defines PCFG-LA models and briefly explains the estimation algorithm. 
The derivation of the estimation algorithm is largely omitted; see Pereira and Schabes (1992) for details.</Paragraph> <Section position="1" start_page="75" end_page="76" type="sub_section"> <SectionTitle> 2.1 Model definition </SectionTitle> <Paragraph position="0"> We define a PCFG-LA $G$ as a tuple $G = \langle N_{nt}, N_t, H, R, \pi, \beta \rangle$, where</Paragraph> <Paragraph position="2"> $N_{nt}$: a set of observable non-terminal symbols; $N_t$: a set of terminal symbols; $H$: a set of latent annotation symbols; $R$: a set of observable CFG rules; $\pi(A[x])$: the probability of the occurrence of a complete symbol $A[x]$ at a root node; $\beta(r)$: the probability of a rule $r \in R[H]$. We use $A, B, \ldots$ for non-terminal symbols in $N_{nt}$; $w_1, w_2, \ldots$ for terminal symbols in $N_t$; and $x, y, \ldots$ for latent annotation symbols in $H$. $N_{nt}[H]$ denotes the set of complete non-terminal symbols, i.e., $N_{nt}[H] = \{A[x] \mid A \in N_{nt}, x \in H\}$. Note that latent annotation symbols are not attached to terminal symbols.</Paragraph> <Paragraph position="3"> In the above definition, $R$ is a set of CFG rules of observable (i.e., not annotated) symbols. For simplicity of discussion, we assume that $R$ is a CNF grammar, but extending to the general case is straightforward. $R[H]$ is the set of CFG rules of complete symbols, such as $V[x] \to \mathrm{grinned}$ or</Paragraph> <Paragraph position="5"> We assume that non-terminal nodes in a parse tree $T$ are indexed by integers $i = 1, \ldots, m$, starting from the root node. 
A complete tree is denoted by $T[X]$, where $X = (x_1, \ldots, x_m) \in H^m$ is a vector of latent annotation symbols and $x_i$ is the latent annotation symbol attached to the $i$-th non-terminal node.</Paragraph> <Paragraph position="6"> We do not assume any structured parametrization of $\beta$ and $\pi$; that is, each $\beta(r)$ $(r \in R[H])$ and each $\pi(A[x])$ $(A[x] \in N_{nt}[H])$ is itself a parameter to be tuned. Therefore, an annotation symbol, say $x$, generally does not express any commonalities among the complete non-terminals annotated by $x$, such as $A[x]$, $B[x]$, etc.</Paragraph> <Paragraph position="7"> The probability of a complete parse tree $T[X]$ is defined as $$P(T[X]) = \pi(A_1[x_1]) \prod_{r \in D_{T[X]}} \beta(r), \qquad (2)$$</Paragraph> <Paragraph position="9"> where $A_1[x_1]$ is the label of the root node of $T[X]$ and $D_{T[X]}$ denotes the multiset of annotated CFG rules used in the generation of $T[X]$. We obtain the probability of an observable tree $T$ by marginalizing out the latent annotation symbols in $T[X]$: $$P(T) = \sum_{X \in H^m} P(T[X]), \qquad (3)$$</Paragraph> <Paragraph position="11"> where $m$ is the number of non-terminal nodes in $T$.</Paragraph> </Section> <Section position="2" start_page="76" end_page="76" type="sub_section"> <SectionTitle> 2.2 Forward-backward probability </SectionTitle> <Paragraph position="0"> The sum in Eq. 3 can be calculated using a dynamic programming algorithm analogous to the forward algorithm for HMMs. For a sentence $w_1 w_2 \ldots w_n$ and its parse tree $T$, backward probabilities $b_i$</Paragraph> <Paragraph position="2"> are recursively computed for the $i$-th non-terminal node and for each $x \in H$. 
In the definition below, $N_i \in N_{nt}$ denotes the non-terminal label of the $i$-th node.</Paragraph> </Section> <Section position="3" start_page="76" end_page="76" type="sub_section"> <SectionTitle> 2.3 Estimation </SectionTitle> <Paragraph position="0"> We now derive the EM algorithm for PCFG-LA, which estimates the parameters $\theta = (\beta, \pi)$. Let $\mathcal{T} = \{T_1, T_2, \ldots\}$ be the training set of parse trees and</Paragraph> <Paragraph position="2"> $N^i_1, \ldots, N^i_{m_i}$ be the labels of non-terminal nodes in $T_i$. As in the derivations of EM algorithms for other latent variable models, the update formulas, which update the parameters from $\theta$ to $\theta' = (\beta', \pi')$, are obtained by constrained optimization of $Q(\theta' \mid \theta)$, which is defined as</Paragraph> <Paragraph position="4"> where $P_{\theta}$ and $P_{\theta'}$ denote probabilities under $\theta$ and $\theta'$, and $P(X \mid T)$ is the conditional probability of latent annotation symbols given an observed tree $T$, i.e., $P(X \mid T) = P(T[X]) / P(T)$. Using the Lagrange multiplier method and re-arranging the results using the backward and forward probabilities, we obtain the update formulas in Figure 2.</Paragraph> </Section> </Section> <Section position="4" start_page="76" end_page="78" type="metho"> <SectionTitle> 3 Parsing with PCFG-LA </SectionTitle> <Paragraph position="0"> In theory, we can use PCFG-LAs to parse a given sentence $w$ by selecting the most probable parse: $$T_{\mathrm{best}} = \arg\max_{T \in G(w)} P(T \mid w) = \arg\max_{T \in G(w)} P(T), \qquad (4)$$</Paragraph> <Paragraph position="2"> where $G(w)$ denotes the set of possible parses for $w$ under the observable grammar $R$. While the optimization problem in Eq. 
4 can be efficiently solved for PCFGs using dynamic programming algorithms, the sum-of-products form of $P(T)$ in PCFG-LA models (see Eq. 2 and Eq. 3) makes it difficult to apply such techniques to solve Eq. 4.</Paragraph> <Paragraph position="3"> In fact, the optimization problem in Eq. 4 is NP-hard for general PCFG-LA models. Although we omit the details, the NP-hardness can be proved by observing that a stochastic tree substitution grammar (STSG) can be represented by a PCFG-LA model in a way similar to the one described by Goodman (1996a), and then using the NP-hardness of STSG parsing (Sima'an, 2002).</Paragraph> <Paragraph position="4"> The difficulty of exact optimization in Eq. 4 forces us to approximate it. The rest of this section describes three different approximations, which are empirically compared in the next section. The first method simply limits the number of candidate parse trees compared in Eq. 4: we first create N-best parses using a PCFG and then, within the N-best parses, select the one with the highest probability under the PCFG-LA. The other two methods are a little more complicated, and we explain them in separate subsections.</Paragraph> <Section position="1" start_page="77" end_page="77" type="sub_section"> <SectionTitle> 3.1 Approximation by Viterbi complete trees </SectionTitle> <Paragraph position="0"> The second approximation method selects the best complete tree $T'[X']$, that is,</Paragraph> <Paragraph position="2"> We call $T'[X']$ a Viterbi complete tree. Such a tree can be obtained in $O(|w|^3)$ time by regarding the PCFG-LA as a PCFG with annotated symbols. The observable part of the Viterbi complete tree $T'[X']$ (i.e., $T'$) does not necessarily coincide with the best observable tree $T_{\mathrm{best}}$ in Eq. 4. 
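To illustrate why the two trees can differ, here is a minimal Python sketch with two hypothetical candidate parses, T_a and T_b, and made-up values of P(T[X]) for two latent assignments each. The names and numbers are assumptions chosen for the example and do not come from the paper.

```python
# Hypothetical complete-tree probabilities P(T[X]) for two candidate parses
# of one sentence; keys are latent assignments, values are made-up numbers.
complete_probs = {
    "T_a": {("x0",): 0.30, ("x1",): 0.02},
    "T_b": {("x0",): 0.20, ("x1",): 0.20},
}


def viterbi_complete(table):
    """Return the (tree, assignment) pair with the largest P(T[X])."""
    pairs = ((t, x) for t, xs in table.items() for x in xs)
    return max(pairs, key=lambda tx: table[tx[0]][tx[1]])


def best_marginal(table):
    """Return the tree with the largest P(T), summing P(T[X]) over X."""
    return max(table, key=lambda t: sum(table[t].values()))


viterbi_tree, viterbi_x = viterbi_complete(complete_probs)
marginal_tree = best_marginal(complete_probs)
print(viterbi_tree, marginal_tree)
```

In this toy table the single most probable complete tree belongs to T_a, while T_b has the larger marginal because its probability mass is spread over two assignments.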
However, if $T_{\mathrm{best}}$ has some 'dominant' assignment $Y$ to its latent annotation symbols such that $P(T_{\mathrm{best}}[Y]) \approx P(T_{\mathrm{best}})$, then $P(T') \approx P(T_{\mathrm{best}})$</Paragraph> <Paragraph position="4"> (because $P(T_{\mathrm{best}}[Y]) \le P(T'[X']) \le P(T')$), and thus $T'$ and $T_{\mathrm{best}}$ are almost equally 'good' in terms of their marginal probabilities.</Paragraph> </Section> <Section position="2" start_page="77" end_page="78" type="sub_section"> <SectionTitle> 3.2 Viterbi parse in approximate distribution </SectionTitle> <Paragraph position="0"> In the third method, we approximate the true distribution $P(T \mid w)$ by a cruder distribution $Q(T \mid w)$, and then find the tree with the highest $Q(T \mid w)$ in polynomial time. We first create a packed representation of $G(w)$ for a given sentence $w$. Then, the approximate distribution $Q(T \mid w)$ is created using the packed forest, and the parameters in $Q(T \mid w)$ are adjusted so that $Q(T \mid w)$ approximates $P(T \mid w)$ as closely as possible. The form of $Q(T \mid w)$ is that of a product of the parameters, just like the form of a PCFG model, which enables us to use a Viterbi algorithm to select the tree with the highest $Q(T \mid w)$. A packed forest is defined as a tuple $\langle I, \delta \rangle$. The first component, $I$, is a multiset of chart items of the form $(A, b, e)$. A chart item $(A, b, e) \in I$ indicates that there exists a parse tree in $G(w)$ that contains a constituent with the non-terminal label $A$ that spans</Paragraph> <Paragraph position="2"> from the $b$-th to the $e$-th word in $w$. 
The second component, $\delta$, is a function on $I$ that represents dominance relations among the chart items in $I$; $\delta(i)$ is a set of possible daughters of $i$ if $i$ is not a pre-terminal node, and $\delta(i) = \{w_k\}$ if $i$ is a pre-terminal node above $w_k$. Two parse trees for a sentence $w = w_1 w_2 w_3$ and a packed representation of them are shown in the accompanying figure. We require that each tree $T \in G(w)$ has a unique representation as a set of connected chart items in $I$. A packed representation satisfying the uniqueness condition is created using the CKY algorithm with the observable grammar $R$, for instance.</Paragraph> <Paragraph position="3"> The approximate distribution, $Q(T \mid w)$, is defined as a PCFG whose set of CFG rules $R_w$ is defined as</Paragraph> <Paragraph position="5"> [...] to denote the rule probability of rule $r \in R_w$ and [...] where the set of connected items $\{i_1, \ldots, i_m\} \subset I$ is the unique representation of $T$.</Paragraph> <Paragraph position="6"> To measure the closeness of approximation by [...] $P_{\mathrm{out}}$ in Figure 4 are similar to ordinary inside/outside probabilities. We define $P_{\mathrm{in}}$ as follows: [...] where $B_j$ and $C_k$ denote non-terminal symbols of chart items $j$ and $k$.</Paragraph> <Paragraph position="7"> The outside probability, $P_{\mathrm{out}}$, is calculated using $P_{\mathrm{in}}$ and the PCFG-LA parameters along the packed structure, like the outside probabilities for PCFGs. Once we have computed $q(i \to \eta)$ and $q_r(i)$, the parse tree $T$ that maximizes $Q(T \mid w)$ is found using a Viterbi algorithm, as in PCFG parsing.</Paragraph> <Paragraph position="8"> Several parsing algorithms that also use inside-outside calculation on a packed chart have been proposed (Goodman, 1996b; Sima'an, 2003; Clark and Curran, 2004). 
Those algorithms optimize some evaluation metric of parse trees other than the posterior probability $P(T \mid w)$, e.g., (expected) labeled constituent recall or (expected) recall rate of dependency relations contained in a parse. This contrasts with our approach, in which the (approximated) posterior probability is optimized.</Paragraph> </Section> </Section> </Paper>