<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1054"> <Title>A fast finite-state relaxation method for enforcing global constraints on sequence decoding</Title>
<Section position="5" start_page="423" end_page="424" type="metho">
<SectionTitle> 3 A brute-force finite-state decoder </SectionTitle>
<Paragraph position="0"> To find the best constrained labeling y* in a lattice, according to (1), we could simply intersect the lattice with all the constraints, then extract the best path.</Paragraph>
<Paragraph position="1"> Weighted FSA intersection is a generalization of ordinary unweighted FSA intersection (Mohri et al., 1996). It is customary in NLP to use the so-called tropical semiring, where weights are represented by their natural logarithms and summed rather than multiplied. The intersected automaton L ∩ C1 ∩ C2 ∩ ··· then assigns each labeling y the total weight L(y) + Σj Cj(y), where L(y) is the weight the lattice assigns to y.</Paragraph>
<Paragraph position="2"> To find y*, one would extract the best path in L ∩ C1 ∩ C2 ∩ ··· using the Viterbi algorithm, or Dijkstra's algorithm if the lattice is cyclic. This step is fast if the intersected automaton is small.</Paragraph>
<Paragraph position="3"> The problem is that the multiple intersections in L ∩ C1 ∩ C2 ∩ ··· can quickly lead to an FSA with an intractable number of states. The intersection of two finite-state automata produces an automaton with the cross-product state set: if F has m states and G has n states, then F ∩ G has up to mn states (fewer if some of the mn possible states do not lie on any accepting path).</Paragraph>
<Paragraph position="4"> Intersecting many such constraints, even if each has only a few states, quickly leads to a combinatorial explosion. In the worst case, the size, in states, of the resulting lattice is exponential in the number of constraints. To deal with this, we present a constraint relaxation algorithm.</Paragraph>
</Section>
<Section position="6" start_page="424" end_page="424" type="metho">
<SectionTitle> 4 Hard constraints </SectionTitle>
<Paragraph position="0"> The simplest kind of constraint is the hard constraint. Hard constraints are necessarily binary: either the labeling satisfies the constraint, or it violates it. Violation is fatal--the labeling produced by decoding must satisfy each hard constraint.</Paragraph>
<Paragraph position="1"> Formally, a hard constraint is a mapping C: Y* ↦ {0, -∞}, encoded as an unweighted FSA. If a string satisfies the constraint, recognition of the string will lead to an accepting state. If it violates the constraint, recognition will end in a non-accepting state.</Paragraph>
<Paragraph position="2"> Here we give an algorithm for decoding with a set of such constraints. Later (§6), we discuss the case of binary soft constraints. In what follows, we will assume that there is always at least one path in the lattice that satisfies all of the constraints.</Paragraph>
<Section position="1" start_page="424" end_page="424" type="sub_section">
<SectionTitle> 4.1 Decoding by constraint relaxation </SectionTitle>
<Paragraph position="0"> Our decoding algorithm first relaxes the global constraints and solves a simpler problem: we find the best labeling according to the model alone, y*_0 = argmax_y L(y), ignoring all the constraints in C.</Paragraph>
<Paragraph position="1"> Next, we check whether y*_0 satisfies the constraints. If so, then we are done--y*_0 is also y*. If not, then we reintroduce the constraints.</Paragraph>
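<Paragraph> Before turning to how the constraints are reintroduced, note that this relaxed decode is ordinary best-path extraction. The following is a minimal runnable sketch of Viterbi search over an acyclic lattice in the tropical semiring, where a path's weight is the sum of its arcs' log-weights; the dictionary encoding of the lattice and all names here are illustrative, not our actual implementation on top of a finite-state toolkit. </Paragraph>
<Paragraph>
# Minimal Viterbi best-path over an acyclic weighted lattice.
# A lattice maps each state to a list of (label, weight, next_state) arcs;
# weights are log-probabilities: summed along a path, maximized over paths.
def best_path(lattice, start, final):
    """Return (score, labels) of the highest-weight start-to-final path."""
    # Topological order of reachable states via DFS postorder.
    order, seen = [], set()
    def visit(s):
        if s in seen:
            return
        seen.add(s)
        for _, _, t in lattice.get(s, []):
            visit(t)
        order.append(s)
    visit(start)
    order.reverse()
    best = {start: (0.0, [])}   # state -> (best score so far, labels taken)
    for s in order:
        if s not in best:
            continue            # not reachable from start
        score, labels = best[s]
        for label, w, t in lattice.get(s, []):
            cand = score + w
            if t not in best or cand > best[t][0]:
                best[t] = (cand, labels + [label])
    return best.get(final, (float("-inf"), None))

# A toy three-word lattice over semantic-role labels.
lattice = {
    0: [("A0", -0.2, 1), ("O", -1.8, 1)],
    1: [("O", -0.1, 2), ("A1", -1.2, 2)],
    2: [("A1", -0.7, 3), ("O", -0.9, 3)],
}
print(best_path(lattice, 0, 3))   # (-1.0, ['A0', 'O', 'A1'])
</Paragraph>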
<Paragraph position="2"> However, rather than including them all at once, we introduce the constraints only as they are violated by successive solutions to the relaxed problems: y*_0, y*_1, etc. We define y*_1 = argmax_{y: C(y) = 0} L(y) for some constraint C that y*_0 violates. Similarly, y*_2 satisfies an additional constraint that y*_1 violates, and so on, until eventually some y*_i satisfies all the constraints; this path is returned.</Paragraph>
<Paragraph position="3"> To determine whether a labeling y satisfies a constraint C, we represent y as a straight-line automaton and intersect it with C, checking the result for nonemptiness. This is equivalent to string recognition. Our hope is that, although intractable in the worst case, the constraint relaxation algorithm will operate efficiently in practice. The success of traditional sequence models on NLP tasks suggests that, for natural language, much of the correct analysis can be recovered from local features and constraints alone. We suspect that, as a result, global constraints will often be easy to satisfy.</Paragraph>
<Paragraph position="4"> Pseudocode for the algorithm appears in Figure 2:
HARD-CONSTRAIN-LATTICE(L, C):
1. y := Best-Path(L)
2. while ∃ C ∈ C such that C(y) = -∞:
3.    L := L ∩ C
4.    C := C \ {C}
5.    y := Best-Path(L)
6. return y
Note that line 2 does not specify how to choose C from among multiple violated constraints; this is discussed in §7. Our algorithm resembles the method of Koskenniemi (1990) and later work. The difference is that there, the lattices are unweighted and may not contain a path that satisfies all constraints, so the order of constraint intersection matters.</Paragraph>
</Section> </Section>
<Section position="7" start_page="424" end_page="425" type="metho">
<SectionTitle> 5 Semantic role labeling </SectionTitle>
<Paragraph position="0"> The semantic role labeling task (Carreras and Màrquez, 2004) involves choosing instantiations of verb arguments from a sentence for a given verb. The verb and its arguments form a proposition. We use data from the CoNLL-2004 shared task--the PropBank (Palmer et al., 2005) annotations of the Penn Treebank (Marcus et al., 1993), with sections 15-18 as the training set and section 20 as the development set. Unless otherwise specified, all measurements are made on the development set.</Paragraph>
<Paragraph position="1"> We follow Roth and Yih (2005) exactly, in order to compare system runtimes. They, in turn, follow Hacioglu et al. (2004) and others in labeling only the heads of syntactic chunks rather than all words. We label only the core arguments (A0-A5), treating anything else--adjuncts and references--as O.</Paragraph>
<Paragraph position="2"> [Figure 4 caption, partially recovered: constraint automata for NO DUPLICATE A0, KNOWN VERB POSITION[2], and DISALLOW ARGUMENTS[A4,A5].]</Paragraph>
<Paragraph position="3"> Figure 3 shows an example sentence from the shared task. It is marked with an IOB phrase chunking, the heads of the phrases, and the correct semantic role labeling. Heads are taken to be the rightmost words of chunks. On average, there are 18.8 phrases per proposition, vs. 23.5 words per sentence. Sentences may contain multiple propositions. There are 4305 propositions in section 20.</Paragraph>
<Section position="1" start_page="425" end_page="425" type="sub_section">
<SectionTitle> 5.1 Constraints </SectionTitle>
<Paragraph position="0"> Roth and Yih use five global constraints on label sequences for the semantic role labeling task. We express these constraints as FSAs.</Paragraph>
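<Paragraph> For concreteness, here is a minimal sketch of one such constraint automaton, together with the string-recognition check from §4.1: a labeling satisfies an unweighted constraint iff running the automaton over it ends in an accepting state. The DFA encoding is illustrative only; the example is a three-state version of NO DUPLICATE A0, described below. </Paragraph>
<Paragraph>
# A hard constraint as an unweighted DFA. Checking a labeling is plain
# string recognition, equivalent to intersecting the constraint with the
# straight-line automaton for the labeling and testing for nonemptiness.
class DFA:
    def __init__(self, start, accepting, transitions):
        self.start = start
        self.accepting = accepting
        self.transitions = transitions     # (state, label) -> state

    def satisfies(self, labels):
        state = self.start
        for label in labels:
            # Labels with no listed transition self-loop (stay in place).
            state = self.transitions.get((state, label), state)
        return state in self.accepting

# NO DUPLICATE A0 as a three-state DFA:
# state 0 = no A0 seen, state 1 = one A0 seen, state 2 = dead (two A0s).
NO_DUP_A0 = DFA(start=0, accepting={0, 1},
                transitions={(0, "A0"): 1, (1, "A0"): 2, (2, "A0"): 2})

print(NO_DUP_A0.satisfies(["A0", "O", "A1"]))   # True
print(NO_DUP_A0.satisfies(["A0", "O", "A0"]))   # False
</Paragraph>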
<Paragraph position="1"> The first two are general, and the seven automata encoding them can be constructed offline:</Paragraph>
<Paragraph position="2"> * NO DUPLICATE ARGUMENT LABELS (Fig. 4(a)) requires that each verb have at most one argument of each type in a given sentence. We separate this into six individual constraints, one for each core argument type. Thus, we have constraints called NO DUPLICATE A0, NO DUPLICATE A1, etc. Each of these is represented as a three-state FSA.</Paragraph>
<Paragraph position="3"> * AT LEAST ONE ARGUMENT (Fig. 1) simply requires that the label sequence is not O*. This is a two-state automaton as described in §2.</Paragraph>
<Paragraph position="4"> The last three constraints require information about the example, and the automata must be constructed on a per-example basis (a construction sketch appears below):</Paragraph>
<Paragraph position="5"> * ARGUMENT CANDIDATES (Fig. 5) encodes a set of position spans, each of which must receive only a single label type. These spans were proposed using a high-recall heuristic (Xue and Palmer, 2004).</Paragraph>
<Paragraph position="6"> * KNOWN VERB POSITION (Fig. 4(b)) simply encodes the position of the verb in question, which must be labeled O.</Paragraph>
<Paragraph position="7"> * DISALLOW ARGUMENTS (Fig. 4(c)) specifies argument types that are compatible with the verb in question, according to PropBank.</Paragraph>
</Section>
<Section position="2" start_page="425" end_page="427" type="sub_section">
<SectionTitle> 5.2 Experiments </SectionTitle>
<Paragraph position="0"> We implemented our hard constraint relaxation algorithm, using the FSA toolkit (Kanthak and Ney, 2004) for finite-state operations. FSA is an open-source C++ library providing a useful set of algorithms on weighted finite-state acceptors and transducers. For each example we decoded, we chose a random order in which to apply the constraints.</Paragraph>
<Paragraph position="1"> Lattices are generated from what amounts to a unigram model--the voted perceptron classifier of Roth and Yih. The features used are a subset of those commonly applied to the task.</Paragraph>
<Paragraph position="2"> Our system produces output identical to that of Roth and Yih. Table 1 shows F-measure on the core arguments. Table 2 shows a runtime comparison. The ILP runtime was provided by the authors (personal communication). Because the systems were run under different conditions, the times are not directly comparable. However, constraint relaxation is more than sixteen times faster than ILP despite running on a slower platform.</Paragraph>
<Paragraph position="3"> Roth and Yih's linear program has two kinds of numeric constraints. Some encode the shortest-path problem structure; the others encode the global constraints of §5.1. The ILP solver works by relaxing to a (real-valued) linear program, which may obtain a fractional solution that represents a path mixture instead of a path. It then uses branch-and-bound to seek the optimal rounding of this fractional solution to an integer solution (Guéret et al., 2002) that represents a single path satisfying the global constraints. Our method avoids fractional solutions: a relaxed solution is always a true single path, which either satisfies or violates each global constraint.</Paragraph>
<Paragraph position="4"> In effect, we are using two kinds of domain knowledge. First, we recognize that this is a graph problem, and insist on true paths so we can use Viterbi decoding. Second, we choose to relax only domain-specific constraints that are likely to be satisfied anyway (in our domain), in contrast to the meta-constraint of integrality relaxed by ILP. Thus it is cheaper on average for us to repair a relaxed solution. (Our repair strategy--finite-state intersection in place of branch-and-bound search--remains expensive in the worst case, as the problem is NP-hard.)</Paragraph>
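<Paragraph> As promised in §5.1, here is a small sketch of two of the per-example constraints. For brevity, this illustration expresses them as plain predicates over the label sequence rather than as the FSAs we actually intersect with the lattice; the helper names are hypothetical. </Paragraph>
<Paragraph>
# Per-example constraints as predicates (illustrative, not the FSA encoding).
def known_verb_position(verb_index):
    """KNOWN VERB POSITION: the verb's position must be labeled O."""
    def satisfies(labels):
        return len(labels) > verb_index and labels[verb_index] == "O"
    return satisfies

def disallow_arguments(disallowed):
    """DISALLOW ARGUMENTS: none of the given argument types may appear."""
    banned = set(disallowed)
    def satisfies(labels):
        return not banned.intersection(labels)
    return satisfies

c1 = known_verb_position(2)              # cf. KNOWN VERB POSITION[2]
c2 = disallow_arguments({"A4", "A5"})    # cf. DISALLOW ARGUMENTS[A4,A5]
print(c1(["A0", "O", "O", "A1"]), c2(["A0", "O", "O", "A1"]))   # True True
print(c1(["A0", "O", "A1", "O"]), c2(["A0", "A4", "O", "O"]))   # False False
</Paragraph>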
<Paragraph position="5"> The y*_0's, generated with only local information, satisfy most of the global constraints most of the time. Table 3 shows the violations by type. The majority of best labelings according to the local model don't violate any global constraints--a fact especially remarkable because there are no label-sequence features in Roth and Yih's unigram model. This confirms our intuition that natural language structure is largely apparent locally. Table 4 shows the breakdown. The majority of examples are very efficient to decode, because they don't require intersection of the lattice with any constraints--y*_0 is extracted and is good enough. Those examples where constraints are violated are still relatively efficient, because they require only a small number of intersections. In total, the average number of intersections needed, even with the naive randomized constraint ordering, was only 0.65. The order doesn't matter very much, since 75% of examples have one violation or fewer.</Paragraph>
<Paragraph position="6"> [Figure 6 caption, partially recovered: mean lattice size over the course of decoding. Vertical bars show the number of examples over which each mean is computed.]</Paragraph>
<Paragraph position="7"> Figure 6 shows the effect of intersection with violated constraints on the average size of lattices, measured in arcs. The vertical bars at k = 0, 1, ..., 5 show the number of examples for which constraint relaxation had to intersect k constraints (i.e., y* = y*_k). The trajectory ending at (for example) k = 3 shows how the average lattice size for that subset of examples evolved over the 3 intersections. The X at k = 3 shows the final size of the brute-force lattice on the same subset of examples.</Paragraph>
<Paragraph position="8"> For the most part, our lattices do stay much smaller than those produced by the brute-force algorithm. (The uppermost curve, k = 5, is an obvious exception; however, that curve describes only the seven hardest examples.) Note that plotting only the final size of the brute-force lattice obscures the long trajectory of its construction, which involves 10 intersections and, like the trajectories shown, includes larger intermediate automata.[2] This explains the far longer runtime of the brute-force method (Table 2). [Footnote 2, partially recovered: ...the inclusion of, for example, DISALLOW ARGUMENTS, which can only remove arcs. That constraint is rarely included in the relaxation lattices because it is rarely violated (see Table 3).]</Paragraph>
<Paragraph position="9"> Harder examples (corresponding to longer trajectories) have larger lattices, on average. This is partly just because it is disproportionately the longer sentences that are hard: they have more opportunities for a relaxed decoding to violate global constraints. Hard examples are rare. The left three columns, requiring only 0-2 intersections, constitute 96% of examples.
The vast majority can be decoded without much more than doubling the local-lattice size.</Paragraph>
</Section> </Section>
<Section position="10" start_page="427" end_page="428" type="metho">
<SectionTitle> 6 Soft constraints </SectionTitle>
<Paragraph position="0"> The gold-standard labels ŷ occasionally violate the hard global constraints that we are using. Counts for the development set appear in Table 5. Counts for violations of NO DUPLICATE A* do not include discontinuous arguments, of which there are 104 instances, since we ignore them.</Paragraph>
<Paragraph position="1"> Because such violations are infrequent, the hard constraints still help most of the time. However, on a small subset of the examples, they preclude us from inferring the correct labeling.</Paragraph>
<Paragraph position="2"> We can apply these constraints with weights, rather than making them inviolable. This constitutes a transition from hard to soft constraints. Formally, a soft constraint is a mapping C: Y* ↦ R≤0 from a label sequence to a non-positive penalty. Soft constraints present a new difficulty for decoding, because instead of eliminating paths of L from contention, they just reweight them.</Paragraph>
<Paragraph position="3"> In what follows, we consider only binary soft constraints--they are either satisfied or violated, and the same penalty is assessed whenever a violation occurs. That is, ∀C ∈ C, ∃ wC < 0 such that ∀y, C(y) ∈ {0, wC}.</Paragraph>
<Section position="1" start_page="428" end_page="428" type="sub_section">
<SectionTitle> 6.1 Soft constraint relaxation </SectionTitle>
<Paragraph position="0"> The decoding algorithm for soft constraints is a generalization of that for hard constraints. The difference is that, whereas with hard constraints a violation meant disqualification, here a violation simply means a penalty. We therefore must find and compare two labelings: the best that satisfies the constraint, and the best that violates it.</Paragraph>
<Paragraph position="1"> We present a branch-and-bound algorithm (Lawler and Wood, 1966), with pseudocode in Figure 7 (lines 10-14 reconstructed from the description below):
SOFT-CONSTRAIN-LATTICE(L, C):
1. (y*, Score(y*)) := (empty, -∞)
2. branches := [(L, C, 0)]
3. while (L, C, penalty) := Dequeue(branches):
4.    L := Prune(L, Score(y*) - penalty)
5.    unless Empty(L):
6.       y := Best-Path(L)
7.       for C ∈ C:
8.          if C(y) < 0: (* so C(y) = wC *)
9.             C := C \ {C}
10.            Enqueue(branches, (L ∩ C, C, penalty))
11.            penalty := penalty + wC
12.      if Score(y) + penalty > Score(y*):
13.         (y*, Score(y*)) := (y, Score(y) + penalty)
14. return y*
At line 9, we process and eliminate a currently violated constraint C ∈ C by considering two cases. On the first branch, we insist that C be satisfied, enqueuing L ∩ C for later exploration. On the second branch, we assume C is violated by all paths, and so continue considering L unmodified, but accept a penalty for doing so; we immediately explore the second branch by returning to the start of the for loop.[3]</Paragraph>
<Paragraph position="2"> Not every branch needs to be completely explored. Bounding is handled by the PRUNE function at line 4, which shrinks L by removing some or all paths that cannot score better than Score(y*), the score of the best path found on any branch so far. [Footnote 3: It is possible that a future best path on the second branch will not actually violate C, in which case we have overpenalized it; but in that case we will also find it, with the correct penalty, on the first branch.]</Paragraph>
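<Paragraph> To make the branch-and-bound control flow concrete, the following runnable sketch mirrors SOFT-CONSTRAIN-LATTICE over an explicit list of scored labelings instead of a weighted lattice, so that Best-Path and PRUNE become list operations. This simplification and all names are illustrative. </Paragraph>
<Paragraph>
from collections import deque

def soft_constrain(candidates, constraints):
    """candidates: list of (score, labels) pairs standing in for lattice paths.
    constraints: list of (satisfies, w) with satisfies a predicate and w < 0.
    Returns (best total, best labels) maximizing score + violated penalties."""
    best_score, best_labels = float("-inf"), None
    # Each branch carries (surviving candidates, remaining constraints, penalty).
    branches = deque([(candidates, list(constraints), 0.0)])
    while branches:
        cands, cons, penalty = branches.popleft()
        # Prune: drop candidates that cannot beat the incumbent even if they
        # incur no further penalties (remaining penalties are all negative).
        cands = [(s, y) for (s, y) in cands if s + penalty > best_score]
        if not cands:
            continue
        score, y = max(cands)                     # Best-Path on this branch
        i = 0
        while i < len(cons):
            sat, w = cons[i]
            if sat(y):
                i += 1                            # satisfied: no branching needed
            else:
                del cons[i]                       # C := C \ {C}
                # Branch 1: insist C is satisfied (the analogue of L := L ∩ C).
                branches.append(
                    ([(s, c) for (s, c) in cands if sat(c)], list(cons), penalty))
                # Branch 2: assume C is violated everywhere; pay wC and continue.
                penalty += w
        if score + penalty > best_score:          # the test at line 12
            best_score, best_labels = score + penalty, y
    return best_score, best_labels

# Toy usage: two scored labelings and one soft NO DUPLICATE A0 constraint.
cands = [(-1.0, ["A0", "O", "A0"]), (-1.3, ["A0", "O", "A1"])]
no_dup_a0 = (lambda y: y.count("A0") <= 1, -0.9)
print(soft_constrain(cands, [no_dup_a0]))   # (-1.3, ['A0', 'O', 'A1'])
</Paragraph>
<Paragraph> As in footnote 3, the second branch here may overpenalize a labeling that in fact satisfies C; such a labeling is still found with its correct penalty on the first branch. </Paragraph>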
<Paragraph position="3"> Our experiments used almost the simplest possible PRUNE: replace L by the empty lattice if its best path falls below the bound, else leave L unchanged.[4] A similar bounding would be possible in the implicit branches: if, during the for loop, we find that the test at line 12 would fail, we can quit the for loop and immediately move to the next branch in the queue at line 3. [Footnote 4: Partial pruning is also possible: by running the Viterbi version of the forward-backward algorithm, one can discover for each edge the weight of the best path on which it appears. One can then remove all edges that do not appear on any sufficiently good path.]</Paragraph>
<Paragraph position="4"> Two factors in this algorithm help us avoid considering all of the exponentially many leaves corresponding to the power set of constraints. First, bounding stops evaluation of subtrees. Second, only violated constraints require branching: if a lattice's best path satisfies a constraint, then the best path that violates it can be no better, since by assumption ∀y, C(y) ≤ 0.</Paragraph>
</Section>
<Section position="2" start_page="428" end_page="428" type="sub_section">
<SectionTitle> 6.2 Runtime experiments </SectionTitle>
<Paragraph position="0"> Using the ten constraints from §5.1, weighted naively by their log odds of violation, the soft constraint relaxation algorithm runs in 58.40 seconds. It is, as expected, slower than hard constraint relaxation, but only by a factor of about two.</Paragraph>
<Paragraph position="1"> As a side note, softening these particular constraints in this particular way did not improve decoding quality in this case. It might help to jointly train the relative weights of these constraints and the local model--e.g., using a perceptron algorithm (Freund and Schapire, 1998), which repeatedly extracts the best global path (using our algorithm), compares it to the gold standard, and adjusts the constraint weights. An obvious alternative is maximum-entropy training, but the partition function would have to be computed using the large brute-force lattices, or else approximated by a sampling method.</Paragraph>
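<Paragraph> The "log odds of violation" weighting can be read as follows; this is one natural interpretation of the phrase, and the count in the example is illustrative rather than taken from Table 5. </Paragraph>
<Paragraph>
import math

# w_C = log(p / (1 - p)), where p is the empirical probability that a gold
# labeling violates constraint C; negative whenever violations are rare.
def log_odds_weight(n_violations, n_examples):
    p = n_violations / n_examples
    return math.log(p / (1 - p))

# Illustrative count only (not from Table 5): a constraint violated by the
# gold labeling in 39 of the 4305 development propositions.
print(round(log_odds_weight(39, 4305), 2))   # -4.69
</Paragraph>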
The method should also be evaluated on a task with longer sequences: though the finite-state operations we use do scale up linearly with the sequence length, longer sequences have more chance of violating a global constraint somewhere in the sequence, requiring us to apply that constraint explicitly.</Paragraph> </Section> class="xml-element"></Paper>