<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2048"> <Title>Exploring the Potential of Intractable Parsers</Title> <Section position="5" start_page="369" end_page="372" type="metho"> <SectionTitle> 3 The Generative Model </SectionTitle> <Paragraph position="0"> The goal of this section is to develop a probabilistic process that generates labeled trees in a manner considerably different from PCFGs. We will use the tree in Figure 2 to motivate our model. In this example, nodes of the tree are labeled with either an A or a B. We can represent this tree using two charts. One chart labels each span with a boolean value, such that a span is labeled true iff it is a constituent in the tree. The other chart labels each span with a label from our labeling scheme (A or B) or with the value null (to represent that the span is unlabeled). We show these charts in Figure 3. Notice that we may want to have more than one labeling scheme. For instance, in the parse tree of Figure 1, there are three different types of labels: word labels, preterminal labels, and nonterminal labels. Thus we would use four 5x5 charts instead of two 3x3 charts to represent that tree.</Paragraph> <Paragraph position="1"> We will pause here and generalize these concepts. De ne a labeling scheme as a set of symbols including a special symbol null (this will desig- null the left chart tells us which spans are tree constituents, and the right chart tells us the labels of the spans (null means unlabeled).</Paragraph> <Paragraph position="2"> nate that a given span is unlabeled). For instance, we can de ne L1 = {null,A,B} to be a labeling scheme for the example tree.</Paragraph> <Paragraph position="3"> Let L = {L1,L2,...Lm} be a set of labeling schemes. De ne a model variable of L as a symbol of the form Sij or Lkij, for positive integers i, j, k, such that i [?] j and k [?] m. Model variables of the form Sij indicate whether span (i,j) is a tree constituent, hence the domain of Sij is {true,false}. Such variables correspond to entries in the left chart of Figure 3. Model variables of the form Lkij indicate which label from scheme Lk is assigned to span (i,j), hence the domain of model variable Lkij is Lk. Such variables correspond to entries in the right chart of Figure 3. Here we have only one labeling scheme.</Paragraph> <Paragraph position="4"> Let VL be the (countably in nite) set of model variables of L. Usually we are interested in trees over a given sentence of nite length n. Let VnL denote the nite subset of VL that includes precisely the model variables of the form Sij or Lkij, where j [?] n.</Paragraph> <Paragraph position="5"> Basically then, our model consists of two types of decisions: (1) whether a span should be labeled, and (2) if so, what label(s) the span should have.</Paragraph> <Paragraph position="6"> Let us proceed with our example. To generate the tree of Figure 2, the rst decision we need to make is how many leaves it will have (or equivalently, how large our tables will be). We assume that we have a probability distribution PN over the set of positive integers. For our example tree, we draw the value 3, with probability PN (3).</Paragraph> <Paragraph position="7"> Now that we know our tree will have three leaves, we can now decide which spans will be constituents and what labels they will have. In other words, we assign values to the variables in V3L. First we need to choose the order in which we will make these assignments. 
<Paragraph position="7"> Now that we know our tree will have three leaves, we can decide which spans will be constituents and what labels they will have. In other words, we assign values to the variables in V3L. First we need to choose the order in which we will make these assignments. For our example, we will assign the model variables in the following order: S11, L111, S22, L122, S33, L133, S12, L112, S23, L123, S13, L113. A detailed look at this assignment process should help clarify the details of the model.</Paragraph>
<Paragraph position="8"> Assigning S11: The first model variable in our order is S11. In other words, we need to decide whether the span (1, 1) should be a constituent. We could let this decision be probabilistically determined, but recall that we are trying to generate a well-formed tree, thus the leaves and the root should always be considered constituents. To handle situations in which we would like to make deterministic variable assignments, we supply an auxiliary function A that tells us (given a model variable X and the history of decisions made so far) whether X should be automatically determined, and if so, what value it should be assigned. In our running example, we ask A whether S11 should be automatically determined, given the previous assignments made (so far, only the value chosen for n, which was 3). The so-called auto-assignment function A responds (since S11 is a leaf span) that S11 should be automatically assigned the value true, making span (1, 1) a constituent.</Paragraph>
<Paragraph position="9"> Assigning L111: Next we want to assign a label to the first leaf of our tree. There is no compelling reason to deterministically assign this label. Therefore, the auto-assignment function A declines to assign a value to L111, and we proceed to assign its value probabilistically. For this task, we would like a probability distribution over the labels of labeling scheme L1 = {null, A, B}, conditioned on the decision history so far. The difficulty is that it is clearly impractical to learn conditional distributions over every conceivable history of variable assignments. So first we distill the important features from an assignment history.</Paragraph>
<Paragraph position="10"> For instance, one such feature (though possibly not a good one) could be whether an odd or an even number of nodes have so far been labeled with an A. Our conditional probability distribution is conditioned on the values of these features, instead of the entire assignment history. Consider specifically model variable L111. We compute its features (an even number of nodes, namely zero, have so far been labeled with an A), and then we use these feature values to access the relevant probability distribution over {null, A, B}. Drawing from this conditional distribution, we probabilistically assign the value A to variable L111.</Paragraph>
<Paragraph position="11"> Assigning S22, L122, S33, L133: We proceed in this way to assign values to S22, L122, S33, L133 (the S-variables deterministically, and the L1-variables probabilistically).</Paragraph>
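The feature-based conditioning just described can be pictured with a small Python sketch. The parity feature and the probability table are invented for illustration; they are not the features or distributions used in the paper (those are supplied by the feature functions F introduced below).

```python
import random

def even_number_of_As(history):
    """One distilled feature of the assignment history: have an even number
    of nodes been labeled 'A' so far?"""
    return sum(1 for value in history.values() if value == "A") % 2 == 0

# Conditional distribution over the labeling scheme L1 = {null, A, B},
# indexed by the feature value (hypothetical probabilities).
P_L1 = {
    True:  {"null": 0.1, "A": 0.6, "B": 0.3},   # an even number of A's so far
    False: {"null": 0.1, "A": 0.3, "B": 0.6},   # an odd number of A's so far
}

def draw_label(history, rng=random):
    """Access the relevant conditional distribution via the feature value,
    then draw a label from it."""
    dist = P_L1[even_number_of_As(history)]
    labels, probs = zip(*dist.items())
    return rng.choices(labels, weights=probs, k=1)[0]

# Assigning L1_11 in the running example: no node is labeled yet (zero is even),
# so we consult P_L1[True]; in the walkthrough the draw happens to come out 'A'.
history = {}            # probabilistic assignments made so far
print(draw_label(history))
```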
<Paragraph position="12"> Assigning S12: Next comes model variable S12. Here, there is no reason to deterministically dictate whether span (1, 2) is a constituent or not. Both should be considered options. Hence we treat this situation the same way as the L1 variables. First we extract the relevant features from the assignment history. We then use these features to access the correct probability distribution over the domain of S12 (namely {true, false}). Drawing from this conditional distribution, we probabilistically assign the value true to S12, making span (1, 2) a constituent in our tree.</Paragraph>
<Paragraph position="13"> Assigning L112: We proceed to probabilistically assign the value B to L112, in the same manner as we did with the other L1 model variables.</Paragraph>
<Paragraph position="14"> Assigning S23: Now we must determine whether span (2, 3) is a constituent. We could again probabilistically assign a value to S23 as we did for S12, but this could result in a hierarchical structure in which both spans (1, 2) and (2, 3) are constituents, which is not a tree. For trees, we cannot allow two model variables Sij and Skl to both be assigned true if they properly overlap, i.e. their spans overlap and one is not a subspan of the other. Fortunately we have already established the auto-assignment function A, and so we simply need to ensure that it automatically assigns the value false to model variable Skl if a properly overlapping model variable Sij has previously been assigned the value true.</Paragraph>
<Paragraph position="15"> Assigning L123, S13, L113: In this manner, we can complete our variable assignments: L123 is automatically determined (since span (2, 3) is not a constituent, it should not get a label), as is S13 (to ensure a rooted tree), while the label of the root is probabilistically assigned.</Paragraph>
<Paragraph position="16"> We can summarize this generative process as a general modeling tool. Define a hierarchical labeling process (HLP) as a 5-tuple <L, <, A, F, P> where:</Paragraph>
<Paragraph position="17"> * L = {L1, L2, ..., Lm} is a finite set of labeling schemes.</Paragraph>
<Paragraph position="18"> * < is a model order, defined as a total ordering of the model variables VL such that for all i, j, k: Sij < Lkij (i.e. we decide whether a span is a constituent before attempting to label it).</Paragraph>
<Paragraph position="19"> * A is an auto-assignment function. Specifically, A takes three arguments: a model variable Y of VL, a partial assignment x of VL, and an integer n. The function A maps this 3-tuple to false if the variable Y should not be automatically assigned a value based on the current history, or to the pair <true, y>, where y is the value in the domain of Y that should be automatically assigned to Y.</Paragraph>
<Paragraph position="20"> * F = {FS, F1, F2, ..., Fm} is a set of feature functions. Specifically, Fk (resp., FS) takes four arguments: a partial assignment x of VL, and integers i, j, n such that 1 ≤ i ≤ j ≤ n. It maps this 4-tuple to a full assignment fk (resp., fS) of some finite set Fk (resp., FS) of feature variables.</Paragraph>
<Paragraph position="21"> * P = {PN, PS, P1, P2, ..., Pm} is a set of probability distributions. PN is a marginal probability distribution over the set of positive integers, whereas {PS, P1, P2, ..., Pm} are conditional probability distributions. Specifically, Pk (respectively, PS) is a function that takes as its argument a full assignment fk (resp., fS) of feature set Fk (resp., FS) and returns a probability distribution over Lk (resp., {true, false}).</Paragraph>
<Paragraph position="22"> Figure 4 (pseudocode for HLPGEN): HLPGEN(HLP H = <L, <, A, F, P>): 1. Choose a positive integer n from distribution PN. Let x be the trivial assignment of VL. 2. In the order defined by <, compute step 3 for each model variable Y of VnL. 3. If A(Y, x, n) = <true, y> for some y in the domain of model variable Y, then let x = x[Y = y]. Otherwise assign a value to Y from its domain: (a) If Y = Sij, then let x = x[Sij = sij], where sij is a value drawn from distribution PS(s | FS(x, i, j, n)). (b) If Y = Lkij, then let x = x[Lkij = lkij], where lkij is a value drawn from distribution Pk(lk | Fk(x, i, j, n)). 4. Return <n, x>.</Paragraph>
<Paragraph position="23"> Figure 5 (the auto-assignment function A): A(variable Y, assignment x, int n): 1. If Y = Sij, and there exists a properly overlapping model variable Skl such that x(Skl) = true, then return <true, false>. 2. If Y = Sii or Y = S1n, then return <true, true>. 3. If Y = Lkij, and x(Sij) = false, then return <true, null>. 4. Else return false.</Paragraph>
<Paragraph position="24"> An HLP probabilistically generates an assignment of its model variables using the generative process shown in Figure 4. Taking an HLP H = <L, <, A, F, P> as input, HLPGEN outputs an integer n and an H-labeling x of length n, defined as a full assignment of VnL.</Paragraph>
<Paragraph position="25"> Given the auto-assignment function in Figure 5, every H-labeling generated by HLPGEN can be viewed as a labeled tree using the interpretation: span (i,j) is a constituent iff Sij = true; span (i,j) has label lk ∈ dom(Lk) iff Lkij = lk.</Paragraph>
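To see how these pieces fit together, the following is a minimal runnable Python sketch of HLPGEN together with the auto-assignment function of Figure 5. The single labeling scheme, the bottom-up model order, the fixed n (in place of a draw from PN), and the placeholder probabilities stand in for the learned distributions; they are assumptions made for illustration, not the paper's trained model.

```python
import random

L1 = ["null", "A", "B"]                        # labeling scheme ("null" = unlabeled)

def model_order(n):
    """One model order satisfying the constraint Sij < L1ij: spans are visited
    bottom-up by width, and constituency is decided before labeling."""
    for width in range(n):
        for i in range(1, n - width + 1):
            yield ("S", i, i + width)
            yield ("L1", i, i + width)

def properly_overlap(a, b):
    (i, j), (k, l) = a, b
    disjoint = j < k or l < i
    nested = (k <= i and j <= l) or (i <= k and l <= j)
    return not disjoint and not nested

def auto_assign(var, x, n):
    """Figure 5: deterministic assignments that keep every labeling tree-shaped.
    Returns (True, value) for an automatic assignment, else (False, None)."""
    kind, i, j = var
    if kind == "S":
        if any(v and properly_overlap((i, j), (k, l))
               for (t, k, l), v in x.items() if t == "S"):
            return True, False                 # overlaps an existing constituent
        if i == j or (i, j) == (1, n):
            return True, True                  # leaves and the root are constituents
    if kind == "L1" and x.get(("S", i, j)) is False:
        return True, "null"                    # non-constituents stay unlabeled
    return False, None

def hlpgen(n, rng=random):
    x = {}                                     # starts as the trivial assignment
    for var in model_order(n):
        fixed, value = auto_assign(var, x, n)
        if fixed:
            x[var] = value
        elif var[0] == "S":                    # placeholder for PS(s | FS(x, i, j, n))
            x[var] = rng.random() < 0.5
        else:                                  # placeholder for P1(l | F1(x, i, j, n))
            x[var] = rng.choice(L1)
    return x

labeling = hlpgen(3)
print(sorted((i, j) for (kind, i, j), v in labeling.items() if kind == "S" and v))
```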
</Section> <Section position="6" start_page="372" end_page="372" type="metho"> <SectionTitle> 4 Learning </SectionTitle>
<Paragraph position="0"> The generative story from the previous section allows us to express the probability of a labeled tree as P(n,x), where x is an H-labeling of length n.</Paragraph>
<Paragraph position="1"> For model variable X, define V<L(X) as the subset of VL appearing before X in model order <.</Paragraph>
<Paragraph position="2"> With the help of this terminology, we can decompose P(n,x) into the following product: P(n,x) = PN(n) × ∏ PS(x(Sij) | FS(x<Sij, i, j, n)) × ∏ Pk(x(Lkij) | Fk(x<Lkij, i, j, n)), where the products range over the model variables of VnL that are not auto-assigned, and x<Y denotes x restricted to V<L(Y). In other words, P(n,x) is PN(n) times the probability of every assignment drawn in step 3 of HLPGEN.</Paragraph>
<Paragraph position="3"> Usually in parsing, we are interested in computing the most likely tree given a specific sentence. In our framework, this generalizes to computing argmax_x P(x|n,w), where w is a subassignment of an H-labeling x of length n. In natural language parsing, w could specify the constituency and word labels of the leaf-level spans. This would be equivalent to asking: given a sentence, what is its most likely parse? Let W = dom(w) and suppose that we choose a model order < such that for every pair of model variables W ∈ W, X ∈ VL\W, either W < X or W is always auto-assigned. Then P(x|n,w) can be expressed as: P(x|n,w) = ∏ PS(x(Sij) | FS(x<Sij, i, j, n)) × ∏ Pk(x(Lkij) | Fk(x<Lkij, i, j, n)), where the products now range only over the model variables of VnL outside W that are not auto-assigned.</Paragraph>
<Paragraph position="4"> Hence the distributions we need to learn are the probability distributions PS(sij|fS) and Pk(lkij|fk). This is fairly straightforward. Given a data bank consisting of labeled trees (such as the Penn Treebank), we simply convert each tree into its H-labeling and use the probabilistically determined variable assignments to compile our training instances. In this way, we compile m + 1 sets of training instances that we can use to induce PS and the Pk distributions. The choice of learning technique is up to the user; the only requirement is that it must return a conditional probability distribution, and not a hard classification. Techniques that allow this include relative frequency, maximum entropy models, and decision trees.</Paragraph>
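As a concrete picture of this compilation step, here is a short Python sketch that replays each H-labeling in model order and keeps one training instance for every label variable that the auto-assignment function declines, then fits a conditional model. The names treebank_h_labelings and F1 are hypothetical, model_order and auto_assign refer to the toy sketch above, and scikit-learn's logistic regression stands in for the maximum entropy learner; none of this is the paper's implementation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def compile_instances(h_labelings, feature_fn):
    """Replay each H-labeling (n, x) in model order and keep one
    (features, outcome) pair for every L1 variable that HLPGEN would have
    assigned probabilistically (i.e. that auto-assignment declines)."""
    X, y = [], []
    for n, x in h_labelings:
        history = {}
        for var in model_order(n):
            fixed, _ = auto_assign(var, history, n)
            if not fixed and var[0] == "L1":
                _, i, j = var
                X.append(feature_fn(history, i, j, n))   # a dict of named features
                y.append(x[var])
            history[var] = x[var]
    return X, y

# Hypothetical usage:
# X_train, y_train = compile_instances(treebank_h_labelings, F1)
# vec = DictVectorizer()
# maxent_P1 = LogisticRegression(max_iter=1000)
# maxent_P1.fit(vec.fit_transform(X_train), y_train)
# def P1(features):                 # conditional distribution over L1, given features
#     probs = maxent_P1.predict_proba(vec.transform([features]))[0]
#     return dict(zip(maxent_P1.classes_, probs))
```

An analogous pass over the S variables yields the training set for PS.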
<Paragraph position="5"> For our experiments, we used maximum entropy learning. Specifics are deferred to Section 6.</Paragraph> </Section>
<Section position="7" start_page="372" end_page="373" type="metho"> <SectionTitle> 5 Decoding </SectionTitle>
<Paragraph position="0"> For the PCFG parsing model, we can find argmax_tree P(tree|sentence) using a cubic-time dynamic programming-based algorithm. By adopting a more flexible probabilistic model, we sacrifice polynomial-time guarantees. The central question driving this paper is whether we can jettison these guarantees and still obtain good performance in practice. For the decoding of the probabilistic model of the previous section, we choose a depth-first branch-and-bound approach, specifically because of two advantages. First, this approach takes linear space. Second, it is anytime, i.e. it finds a (typically good) solution early and improves this solution as the search progresses. Thus if one does not wish to spend the time to run the search to completion (and ensure optimality), one can easily use this algorithm as a heuristic by halting prematurely and taking the best solution found thus far.</Paragraph>
<Paragraph position="1"> The search space is simple to define. Given an HLP H, the search algorithm simply makes assignments to the model variables (depth-first) in the order defined by <.</Paragraph>
<Paragraph position="2"> This search space can clearly grow to be quite large; however, in practice the search speed is improved drastically by using branch-and-bound backtracking. Namely, at any choice point in the search space, we first choose the least-cost child to expand (i.e. we make the most probable assignment). In this way, we quickly obtain a greedy solution (in linear time). After that point, we continue to keep track of the best solution we have found so far, and if at any point we reach an internal node of our search tree with partial cost greater than the total cost of our best solution, we can discard this node and discontinue exploration of that subtree. This technique can result in a significant aggregate savings of computation time, depending on the nature of the cost function.</Paragraph>
<Paragraph position="3"> Figure 6 shows the pseudocode for the depth-first branch-and-bound decoder. For an HLP H = <L, <, A, F, P>, a positive integer n, and a partial assignment w of VnL, the call HLPDECODE(H, n, w) returns the H-labeling x of length n such that P(x|n,w) is maximized.</Paragraph>
<Paragraph position="4"> Figure 6 (pseudocode for HLPDECODE): HLPDECODE(HLP H, int n, assignment w): 1. Initialize stack S with the pair <x0, 1>, where x0 is the trivial assignment of VL. Let xbest = x0; let pbest = 0. Until stack S is empty, repeat steps 2 to 4. 2. Pop the topmost pair <x, p> from stack S. 3. If p > pbest and x is an H-labeling of length n, then: let xbest = x; let pbest = p. 4. If p > pbest and x is not yet an H-labeling of length n, then: (a) Let Y be the earliest variable in VnL (according to model order <) unassigned by x. (b) If Y ∈ dom(w), then push the pair <x[Y = w(Y)], p> onto stack S. (c) Else if A(Y, x, n) = <true, y> for some value y ∈ dom(Y), then push the pair <x[Y = y], p> onto stack S. (d) Otherwise, for every value y ∈ dom(Y), push the pair <x[Y = y], p·q(y)> onto stack S in ascending order of the value of q(y), where q(y) = PS(y | FS(x, i, j, n)) if Y = Sij and q(y) = Pk(y | Fk(x, i, j, n)) if Y = Lkij.</Paragraph> </Section> </Paper>
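To complement the pseudocode, here is a minimal Python sketch of the same depth-first branch-and-bound search. It is written against the toy model_order, auto_assign, and placeholder distributions from the earlier sketches, so the names and probabilities are illustrative rather than the paper's implementation.

```python
def hlpdecode(n, w, order, auto, q):
    """Return the most probable full assignment consistent with the partial
    assignment w, together with its probability.  `order(n)` yields the model
    variables of VnL in model order, `auto(var, x, n)` is the auto-assignment
    function, and `q(var, x, n)` maps each value in dom(var) to its
    conditional probability (all stand-ins for the paper's components)."""
    variables = list(order(n))
    best_x, best_p = None, 0.0
    stack = [({}, 1.0)]                        # pairs <partial assignment, probability>
    while stack:
        x, p = stack.pop()
        if p <= best_p:
            continue                           # bound: this branch cannot beat the best
        if len(x) == len(variables):
            best_x, best_p = x, p              # new best complete H-labeling
            continue
        y = variables[len(x)]                  # earliest unassigned variable
        fixed, value = auto(y, x, n)
        if y in w:
            stack.append(({**x, y: w[y]}, p))
        elif fixed:
            stack.append(({**x, y: value}, p))
        else:
            # Push children in ascending probability so the most probable value
            # is expanded first; the first leaf reached is the greedy solution.
            for value, prob in sorted(q(y, x, n).items(), key=lambda kv: kv[1]):
                stack.append(({**x, y: value}, p * prob))
    return best_x, best_p

# Hypothetical usage with the toy model from the HLPGEN sketch:
# toy_q = lambda y, x, n: ({True: 0.6, False: 0.4} if y[0] == "S"
#                          else {"null": 0.2, "A": 0.5, "B": 0.3})
# tree, prob = hlpdecode(3, {}, model_order, auto_assign, toy_q)
```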