<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2602">
  <Title>Constraint Satisfaction Inference: Non-probabilistic Global Inference for Sequence Labelling</Title>
  <Section position="3" start_page="9" end_page="10" type="metho">
    <SectionTitle>
2 Theoretical background
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
2.1 Class Trigrams
</SectionTitle>
      <Paragraph position="0"> A central weakness of approaches considering each token of a sequence as a separate classification case is their inability to coordinate labels assigned to neighbouring tokens. Due to this, invalid label sequences, or ones that are highly unlikely may result. Van den Bosch and Daelemans (2005) propose to resolve parts of this issue by predicting trigrams of labels as a single atomic class label, thereby labelling three tokens at once, rather than classifying each token separately. Predicting sequences of three labels at once makes sure that at least these short subsequences are known to be syntactically valid sequences according to the training data.</Paragraph>
      <Paragraph position="1"> Applying this general idea, Van den Bosch and Daelemans (2005) label each token with a complex class label composed of the labels for the preceding token, the token itself, and the one following it in the sequence. If such class trigrams are assigned to all tokens in a sequence, the actual label for each of those is effectively predicted three times, since every token but the first and last is covered by three class trigrams. Exploiting this redundancy, a token's possibly conflicting predictions are resolved by voting over them. If two out of three trigrams suggest the same label, this label is selected; in case of three different candidate labels, a classifier-specific confidence metric is used to break the tie.</Paragraph>
      <Paragraph position="2"> Voting over class trigrams is but one possible approach to taking advantage of the redundancy obtained with predicting overlapping trigrams. A disadvantage of voting is that it discards one of the main benefits of the class trigram method: predicted class trigrams are guaranteed to be syntactically correct according to the training data. The voting technique splits up the predicted trigrams, and only refers to their unigram components when deciding ontheoutput labelforatoken; noattempt is made to keep the trigram sequence intact in the final output sequence. The alternative to voting presented later in this paper does try to retain predicted trigrams as part of the output sequence.</Paragraph>
    </Section>
    <Section position="2" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
2.2 Memory-based learning
</SectionTitle>
      <Paragraph position="0"> The name memory-based learning refers to a class of methods based on the k-nearest neighbour rule.</Paragraph>
      <Paragraph position="1"> At training time, all example instances are stored in memory without attempting to induce an abstract representation of the concept to be learned.</Paragraph>
      <Paragraph position="2"> Generalisation is postponed until a test instance is classified. For a given test instance, the class predicted is the one observed most frequently among a number of most-similar instances in the instance base. By only generalising when confronted with the instance to be classified, a memory-based learner behaves as a local model, specifically suited for that part of the instance space that the test instance belongs to. In contrast, learners that abstract at training time can only generalise globally. This distinguishing property makes memory-based learners especially suited for tasks where different parts of the instance space are structured according to different rules, as is often the case in natural-language processing.</Paragraph>
      <Paragraph position="3"> For the experiments performed in this study we used the memory-based classifier as implemented by TiMBL (Daelemans et al., 2004). In TiMBL, similarity is defined by two parameters: a featurelevelsimilarity metric, whichassigns areal-valued  score to pairs of values for a given feature, and a set of feature weights, that express the importance of the various features for determining the similarity of two instances. Further details on both of these parameters can be found in the TiMBL manual. To facilitate the explanation of our inference procedure in Section 3, we will formally define some notions related to memory-based classification. null The function Ns,w,k(x) maps a given instance x to the set of its nearest neighbours; here, the parameters s, w, and k are the similarity metric, the feature weights, and the number k of nearest neighbours, respectively. They will be considered given in the following, so we will refer to this specific instantiation simply as N(x). The function wd(c,N(x)) returns the weight assigned to class c in the given neighbourhood according to the distance metric d; again we will use the notation w(c,N(x)) to refer to a specific instantiation ofthisfunction. Using thesetwofunctions, wecan formulate the nearest neighbour rule as follows.</Paragraph>
      <Paragraph position="5"> The class c maximising the above expression is returned as the predicted class for the instance x.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="10" end_page="11" type="metho">
    <SectionTitle>
3 Constraint Satisfaction Inference
</SectionTitle>
    <Paragraph position="0"> A strength of the class trigram method is the guarantee that any trigram that is predicted by the base classifier represents a syntactically valid subsequence of length three. This does not necessarily mean the trigram is a correct label assignment within the context of the current classification, but it does reflect the fact that the trigram has been observed in the training data, and, moreover, is deemed most likely according to the base classifier's model. For this reason, it makes sense to try to retain predicted trigrams in the output label sequence as much as possible.</Paragraph>
    <Paragraph position="1"> The inference method proposed in this section seeks to attain this goal by formulating the class trigram disambiguation task as a weighted constraint satisfaction problem (W-CSP). Constraint satisfaction isawell-studied research areawithapplications in numerous fields both inside and outside of computer science. Weighted constraint satisfaction extends the traditional constraint satisfaction framework with soft constraints; such constraints are not required to be satisfied for a solution to be valid, but constraints a given solution does satisfy, are rewarded according to weights assigned to them.</Paragraph>
    <Paragraph position="2"> Formally, a W-CSP is a tuple (X,D,C,W).</Paragraph>
    <Paragraph position="3"> Here, X = {x1,x2,...,xn} is a finite set of variables. D(x) is a function that maps each variable to its domain, that is, the set of values that variable can take on. C is the set of constraints. While a variable's domain dictates the values a single variable is allowed to take on, a constraint specifies which simultaneous value combinations over a number of variables are allowed. For a traditional (non-weighted) constraint satisfaction problem, a valid solution would be an assignment of values to the variables that (1) are a member of the corresponding variable's domain, and (2) satisfy all constraints in the set C. Weighted constraint satisfaction, however, relaxes this requirement to satisfy all constraints. Instead, constraints are assigned weights that may be interpreted as reflecting the importance of satisfying that constraint. Let a constraint c [?] C be defined as a function that maps each variable assignment to 1 if the constraint is satisfied, or to 0 if it is not. In addition, let W: C-IR+ denote a function that maps each constraint to a positive real value, reflecting theweight of thatconstraint. Then, the optimal solution to a W-CSP is given by the following equation. null</Paragraph>
    <Paragraph position="5"> That is, the assignment of values to its variables that maximises the sum of weights of the constraints that have been satisfied.</Paragraph>
    <Paragraph position="6"> Translating the terminology introduced earlier in this paper to the constraint satisfaction domain, each token of a sequence maps to a variable, the domain of which corresponds to the three candidate labels for this token suggested by the trigrams covering the token. This provides us with a definition of the function D, mapping variables to their domain. In the following, yi,j denotes the candidate label for token xj predicted by the trigram assigned to token xi.</Paragraph>
    <Paragraph position="8"> Constraints are extracted from the predicted trigrams. Given the goal of retaining predicted trigramsintheoutput label sequence asmuchaspossible, the most important constraints are simply  the trigrams themselves. A predicted trigram describes a subsequence of length three of the entire output sequence; by turning such a trigram into a constraint, weexpressthewishtohavethistrigram</Paragraph>
    <Paragraph position="10"> No base classifier is flawless though, and therefore not all predicted trigrams can be expected to be correct. Nevertheless, even an incorrect trigram may carry some useful information regarding the output sequence: one trigram also covers two bigrams, and three unigrams. An incorrect trigram may still contain smaller subsequences, of length one or two, that are correct. Therefore, all of these are also mapped to constraints.</Paragraph>
    <Paragraph position="12"> With such an amount of overlapping constraints, the satisfaction problem obtained easily becomes over-constrained, that is, no variable assignment exists that can satisfy all constraints without breaking another. Only one incorrectly predicted class trigram already leads to two conflicting candidate labels for one of the tokens at least. Yet, without conflicting candidate labels no inference would be needed to start with.</Paragraph>
    <Paragraph position="13"> The choice for the weighted constraint satisfaction method always allows a solution to be found, even in the presence of conflicting constraints. Rather than requiring all constraints to be satisfied, each constraint is assigned acertain weight; the optimal solution to the problem is an assignment of values to the variables that optimises the sum of weights of the constraints that are satisfied.</Paragraph>
    <Paragraph position="14"> Constraints can directly be traced back to a prediction made by the base classifier. If two constraints are in conflict, the one which the classifier was most certain of should preferably be satisfied. In the W-CSP framework, this preference can be expressed by weighting constraints according to the classifier confidence for the originating trigram. For the memory-based learner, we define the classifier confidence for a predicted class ci as the weight assigned to that class in the neighbourhood of the test instance, divided by the total weight of all classes.</Paragraph>
    <Paragraph position="16"> Let x denote a test instance, and c[?] its predicted class. Constraints derived from this class are weighted according to the following rules.</Paragraph>
    <Paragraph position="17"> * for a trigram constraint, the weight is simply the base classifier's confidence value for the class c[?] * for a bigram constraint, the weight is the sum of the confidences for all trigram classes in the nearest-neighbour set of x that assign the same label bigram to the tokens spanned by the constraint * for a unigram constraint, the weight is the sum of the confidences for all trigram classes in the nearest-neighbour set of x that assign the same label to the token spanned by the constraint</Paragraph>
  </Section>
  <Section position="5" start_page="11" end_page="13" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> To thoroughly evaluate our new inference procedure, and to show that it performs well over a wide range of natural-language sequence labelling tasks, we composed a benchmark set consisting of six different tasks, covering four areas in natural language processing: syntax (syntactic chunking), morphology (morphological analysis), phonology (grapheme-to-phoneme conversion), and information extraction (general, medical, and biomedical named-entity recognition). Below, thesixdatasets used for these tasks are introduced briefly.</Paragraph>
    <Paragraph position="1"> CHUNK is the task of splitting sentences into non-overlapping syntactic phrases or constituents.</Paragraph>
    <Paragraph position="2"> The data set, extracted from the WSJ Penn Treebank, and first used in the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000), contains 211,727 training examples and 47,377 test instances.</Paragraph>
    <Paragraph position="3"> NER, named-entity recognition, involves identifying and labelling named entities in text. We employ the English NER shared task data set used in the CoNLL-2003 conference (Tjong Kim Sang and De Meulder, 2003). This data set discriminates four name types: persons, organisations, locations, and a rest category of &amp;quot;miscellany names&amp;quot;. The data set is a collection of newswire  articles from the Reuters Corpus, RCV11. The given training set contains 203,621 examples; as test set we use the &amp;quot;testb&amp;quot; evaluation set which contains 46,435 examples.</Paragraph>
    <Paragraph position="4"> MED is a data set extracted from a semantic annotation of (parts of) two Dutch-language medical encyclopedias. On the chunk-level of this annotation, there are labels for various medical concepts, such asdisease names, body parts, andtreatments, forming a set of twelve concept types in total. Chunk sizes range from one to a few tokens.</Paragraph>
    <Paragraph position="5"> The data have been split into training and test sets, resulting in 428,502 training examples and 47,430 test examples.</Paragraph>
    <Paragraph position="6"> The GENIA corpus (Tateisi et al., 2002) is a collection of annotated abstracts taken from the National Library of Medicine's MEDLINE database.</Paragraph>
    <Paragraph position="7"> Apart from part-of-speech tagging information, the corpus annotates a subset of the substances and the biological locations involved in reactions of proteins. Using a 90%-10% split for producing training and test sets, there are 458,593 training examples and 50,916 test examples.</Paragraph>
    <Paragraph position="8"> PHONEME refers tographeme-to-phoneme conversion for English. The sequences to be labelled are words composed of letters (rather than sentences composed of words). We based ourselves on the English part of the CELEX-2 lexical data base (Baayen et al., 1993), from which we extracted 65,467 word-pronunciation pairs.</Paragraph>
    <Paragraph position="9"> This pair list has been aligned using expectation-maximisation to obtain sensible one-to-one mappings between letters and phonemes (Daelemans and Van den Bosch, 1996). The classes to predict are 58 different phonemes, including some diphones such as [ks] needed to keep the letter-phoneme alignment one-to-one. The resulting datasethas beensplit intoatraining setof515,891 examples, and a test set of 57,279 examples.</Paragraph>
    <Paragraph position="10"> MORPHO refers to morphological analysis of Dutch words. We collected the morphological analysis of 336,698 Dutch words from the CELEX-2 lexical data base (Baayen et al., 1993), and represented the task such that it captures the three most relevant elements of a morphological analysis: (1) the segmentation of the word into morphemes (stems, derivational morphemes, and inflections), (2) the part-of-speech tagging information contained by each morpheme; and (3) all  the trigram method combined both with majority voting, and with constraint satisfaction inference.</Paragraph>
    <Paragraph position="11"> Thelast columnshows theperformance ofthe (hypothetical) oracle inference procedure.</Paragraph>
    <Paragraph position="12"> spelling changes due to compounding, derivation, or inflection that would enable the reconstruction of the appropriate root forms of the involved morphemes. null For CHUNK, and the three information extraction tasks, instances represent a seven-token window of words and their (predicted) part-of-speech tags. Each token is labelled with a class using the IOB type of segmentation coding as introduced by Ramshaw and Marcus (1995), marking whether the middle word is inside (I), outside (O), or at the beginning (B) of a chunk, or named entity. Performance is measured by the F-score on correctly identified and labelled chunks, or named entities.</Paragraph>
    <Paragraph position="13"> Instances for PHONEME, and MORPHO consist of aseven-letter window of letters only. Thelabels assigned to an instance are task-specific and have been introduced above, together with the tasks themselves. Generalisation performance is measured on the word accuracy level: if the entire phonological transcription ofthe word ispredicted correctly, or if all three aspects of the morphological analysis are predicted correctly, the word is counted correct.</Paragraph>
    <Section position="1" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
4.1 Results
</SectionTitle>
      <Paragraph position="0"> For the experiments, memory-based learners were trained and automatically optimised with wrapped progressive sampling (Van den Bosch, 2004) to predict class trigrams for each of the six tasks introduced above. Table 1 lists the performances of constraint satisfaction inference, and majority votingapplied totheoutput ofthebaseclassifiers, and compares them with the performance of a naive baseline method that treats each token as a separate classification case without coordinating decisions over multiple tokens.</Paragraph>
      <Paragraph position="1"> Without exception, constraint satisfaction infer- null ence outperforms majority voting by a considerable margin. This shows that, given the same sequence of predicted trigrams, the global constraint satisfaction inference manages better to recover sequential correlation, than majority voting. On the other hand, the error reduction attained by majority voting with respect to the baseline is in all cases more impressive than the one obtained by constraint satisfaction inference with respect to majority voting. However, it should be emphasised that, while both methods trace back their origins to the work of Van den Bosch and Daelemans (2005), constraint satisfaction inference is not applied after, but instead of majority voting. This means, that the error reduction attained by majority voting is also attained, independently by constraint satisfaction inference, but in addition constraint satisfaction inference manages to improve performance on top of that.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="13" end_page="45" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> The experiments reported upon in the previous section showed that by globally evaluating the quality of possible output sequences, the constraint satisfaction inference procedure manages to attain better results than the original majority voting approach. In this section, we attempt to further analyse the behaviour of the inference procedure. First, we will discuss the effect that the performance of the trigram-predicting base classifier has on the maximum performance attainable by any inference procedure. Next, we will consider specifically the effect of base classifier accuracy on the performance of constraint satisfaction inference.</Paragraph>
    <Section position="1" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
5.1 Base classifier accuracy and inference procedure upper-bounds
</SectionTitle>
      <Paragraph position="0"> procedure upper-bounds Aftertrigrams havebeen predicted, foreachtoken, at most three different candidate labels remain. As a result, if the correct label is not among them, the best inference procedure cannot correct that. This suggests that there is an upper-bound on the performance attainable by inference procedures operating on less than perfect class trigram predictions. To illustrate what performance is still possible after a base classifier has predicted the trigrams for a sequence, we devised an oracle inference procedure. null An oracle has perfect knowledge about the true label ofatoken; therefore itisable toselect this label if it is among the three candidate labels. If the correct label is absent among the candidate labels, no inference procedure can possibly predict the correct label for the corresponding token, so the oracle procedure just selects randomly among the candidate labels, which will be incorrect anyway.</Paragraph>
      <Paragraph position="1"> Table1compares the performance ofmajority voting, constraint satisfaction inference, and the oracle after an optimised base classifier has predicted class trigrams.</Paragraph>
    </Section>
    <Section position="2" start_page="13" end_page="45" type="sub_section">
      <SectionTitle>
5.2 Base classifier accuracy and constraint satisfaction inference performance
</SectionTitle>
      <Paragraph position="0"> satisfaction inference performance There is a subtle balance between the quality of the trigram-predicting base classifier, and the gain that any inference procedure for trigram classes can reach. If the base classifier's predictions are perfect, all three candidate labels will agree for all tokens inthe sequence; consequently theinference procedure can only choose from one potential output sequence. On the other extreme, if all three candidate labels disagree for all tokens in the sequence, the inference procedure's task is to select the best sequence among 3n possible sequences, where n denotes the length of the sequence; it is likely that such a huge amount of candidate label sequences cannot be dealt with appropriately.</Paragraph>
      <Paragraph position="1"> Table 2 collects the base classifier accuracies, and the average number of potential output sequences per sentence resulting from its predictions. For all tasks, the number of potential sequences is manageable; far from the theoretical maximum 3n, even for GENIA, that, compared with the other tasks, has a relatively large number of potential output sequences. The factors that have an effect on the number of sequences are rather complex. One important factor is the accuracy of the trigram predictions made by the base classifier. To illustrate this, Figure 1 shows the number ofpotential output sequences asafunction of the base classifier accuracy for the PHONEME task. There is an almost linear decrease of the number of possible sequences as the classifier accuracy improves. This shows that it is important to optimise the performance of the base classifier, since it decreases the number of potential output sequences to consider for the inference procedure.</Paragraph>
      <Paragraph position="2"> Other factors affecting the number of potential output sequences are the length of the sequence, and the number of labels defined for the task. Unlike classifier accuracy, however, these two factors  sequences that result from class trigram predictions made by a memory-based base classifier.</Paragraph>
      <Paragraph position="3"> are inherent properties of the task, and cannot be optimised.</Paragraph>
      <Paragraph position="4"> While we have shown that improved base classifier accuracy has a positive effect on the number of possible output sequences; we have not yet established a positive relation between the number of possible output sequences and the performance of constraint satisfaction inference. Figure 2 illustrates, again for the PHONEME task, that there is indeed a positive, even linear relation between the accuracy of the base classifier, and the performance attained by inference. This relation exists for all inference procedures: majority voting, as well as constraint satisfaction inference, and the oracle procedure. It is interesting to see how the curves for those three procedure compare with each other.</Paragraph>
      <Paragraph position="5"> The oracle always outperforms the other two procedures by a wide margin. However, its increase is less steep. Constraint satisfaction inference consistently outperforms majority voting, though the difference between the two decreases as the base classifier's predictions improve. This is to be expected, since more accurate predictions means more majorities will appear among candi- null straint satisfaction inference, and the oracle inference procedure as a function of base classifier accuracy on the PHONEME task.</Paragraph>
      <Paragraph position="6"> date labels, and the predictive quality of such majorities improves as well. In the limit -with a perfect base classifier- all three curves will meet.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="45" end_page="45" type="metho">
    <SectionTitle>
6 Related work
</SectionTitle>
    <Paragraph position="0"> Many learning techniques specifically designed for sequentially-structured data exist. Given our goal of developing a method usable with non-probabilistic classifiers, we will not discuss the obvious differences with the many probabilistic methods. In this section, wewillcontrast our work with two other approaches that also apply principles of constraint satisfaction to sequentially-structured data.</Paragraph>
    <Paragraph position="1"> Constraint Satisfaction with Classifiers (Punyakanok and Roth, 2001) performs the somewhat more specific task of identifying phrases in a sequence. Like our method, the task of coordinating local classifier decisions is formulated as a constraint satisfaction problem. The variables encode whether or not a certain contiguous span of tokens forms a phrase. Hard constraints enforce that no two phrases in a solution overlap.</Paragraph>
    <Paragraph position="2"> Similarly to our method, classifier confidence estimates are used to rank solutions in order of preference. Unlike in our method, however, both the domains of the variables and the constraints are prespecified; the classifier is used only to estimate the cost of potential variable assignments. In our approach, the classifier predicts the domains of the variables, the constraints, and the weights of those.</Paragraph>
    <Paragraph position="3"> Roth and Yih (2005) replace the Viterbi algo- null rithm for inference in conditional random fields with an integer linear programming formulation.</Paragraph>
    <Paragraph position="4"> This allows arbitrary global constraints to be incorporated in the inference procedure. Essentially, the method adds constraint satisfaction functionality on top of the inference procedure. In our method, constraint satisfaction is the inference procedure. Nevertheless, arbitrary global constraints (both hard and soft) can easily be incorporated in our framework as well.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML