<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3206">
  <Title>constraint satisfaction inference</Title>
  <Section position="4" start_page="41" end_page="45" type="metho">
    <SectionTitle>
2 Class trigrams and constraint satisfaction inference
</SectionTitle>
    <Paragraph position="0"> satisfaction inference Both the letter-phoneme conversion and the morphological analysis tasks treated in this paper can be seen as sequentially-structured classification tasks, where sequences of letters are mapped to sequences of phonemes or morphemes. Such sequence-to-sequence mappings are a frequently reoccurring phenomenon in natural language processing, which suggests that it is preferable to take care of the issue of classifying sequential data once at the machine learning level, rather than repeatedly and in different ways at the level of practical applications. Recently, a machine learning approach for sequential data has been proposed by Van den Bosch and Daelemans (2005) that is suited for discrete machine-learning algorithms such as memory-based learners, which have been shown to perform well on word phonemization and morphological analysis before (Van den Bosch and Daelemans, 1993; Van den Bosch and Daelemans, 1999). In the remainder of this paper, we use as our classifier of choice the IB1 algorithm (Aha et al., 1991) with feature weighting, as implemented in the TiMBL software package1 (Daelemans et al., 2004).</Paragraph>
    <Paragraph position="1"> In the approach to sequence processing proposed by Van den Bosch and Daelemans (2005), the elements of the input sequence (in the remainder of this paper, we will refer to words and letters rather than the more general terms sequences and sequence elements) are assigned overlapping subsequences of output symbols. This subsequence corresponds to the output symbols for a focus letter, and one letter to its left and one letter to its right. Predicting such trigram subsequences for each letter of a word eventually results in three output symbol predictions for each letter. In many cases, those three predictions will not agree, resulting in a number of potential output sequences. We will refer to the procedure for selecting the final output sequence from the space of alternatives spanned by the predicted trigrams as an inference procedure, analogously to the use of this term in probabilistic sequence classification methods (Punyakanok and Roth, 2001).</Paragraph>
    <Paragraph position="2"> The original work on predicting class trigrams implemented a simple inference procedure by voting over the three predicted symbols (Van den Bosch and Daelemans, 2005).</Paragraph>
    <Paragraph position="3"> Predicting trigrams of overlapping output symbols has been shown to be an effective approach  to improve sequence-oriented natural language processing tasks such as syntactic chunking and named-entity recognition, where an input sequence of tokens is mapped to an output sequence of symbols encoding a syntactic or semantic segmentation of the sentence. Letter-phoneme conversion and morphological analysis, though sequentially structured on another linguistic level, may be susceptible to benefiting from this approach as well.</Paragraph>
    <Paragraph position="4"> In addition to the practical improvement shown to be obtained with the class trigram method, there is also a more theoretical attractiveness to it. Since the overlapping trigrams that are predicted are just atomic symbols to the underlying learning algorithm, a classifier will only predict output symbol trigrams that are actually present in the data it was trained on. Consequently, predicted trigrams are guaranteed to be syntactically valid subsequences in the target task. There is no such guarantee in approaches to sequence classification where an isolated local classifier predicts single output symbols at a time, without taking into account predictions made elsewhere in the word.</Paragraph>
    <Paragraph position="5"> While the original voting-based inference procedure proposed by Van den Bosch and Daelemans (2005) manages to exploit the sequential information stored in the predicted trigrams to improve upon the performance of approaches that do not consider the sequential structure of their output at all, it does so only partly. Essentially, the voting-based inference procedure just splits the overlapping trigrams into their unigram components, thereby retaining only the overlapping symbols for each individual letter. As a result, the guaranteed validity of the trigram subsequences is not put to use. In this section we describe an alternative inference procedure, based on principles of constraint satisfaction, that does manage to use the sequential information provided by the trigram predictions.</Paragraph>
    <Paragraph position="6"> At the foundation of this constraint-satisfaction-based inference procedure, more briefly constraint satisfaction inference, is the assumption that the output symbol sequence should preferably be constructed by concatenating the predicted trigrams of output symbols, rather than by chaining individual symbols. However, as the underlying base classifier is by no means perfect, predicted trigrams should not be copied blindly to the output sequence; they may be incorrect. If a trigram prediction is considered to be of insufficient quality, the procedure backs off to symbol bigrams or even symbol unigrams.</Paragraph>
    <Paragraph position="7"> The intuitive description of the inference procedure is formalized by expressing it as a weighted constraint satisfaction problem (W-CSP). Constraint satisfaction is a well-studied research area with many diverse areas of application. Weighted constraint satisfaction extends the traditional constraint satisfaction framework with soft constraints; such constraints are not required to be satisfied for a solution to be valid, but constraints a given solution does satisfy are rewarded according to weights assigned to them. Soft constraints are perfect for expressing our preference for symbol trigrams, with the possibility of a back off to lower-degree n-grams if there is reason to doubt the quality of the trigram predictions. null Formally, a W-CSP is a tuple (X,D,C,W).</Paragraph>
    <Paragraph position="8"> Here, X = {x1,x2,...,xn} is a finite set of variables. D(x) is a function that maps each variable to its domain, that is, the set of values that variable can take on. C is the set of constraints. While a variable's domain dictates the values a single variable is allowed to take on, a constraint specifies which simultaneous value combinations over a number of variables are allowed. For a traditional (nonweighted) constraint satisfaction problem, a valid solution would be an assignment of values to the variables that (1) are a member of the corresponding variable's domain, and (2) satisfy all constraints in the set C. Weighted constraint satisfaction, however, relaxes this requirement to satisfy all constraints. Instead, constraints are assigned weights that may be interpreted as reflecting the importance of satisfying that constraint.</Paragraph>
    <Paragraph position="9"> Let a constraint c [?] C be defined as a function that maps each variable assignment to 1 if the constraint is satisfied, or to 0 if it is not. In addition, let W: C-IR+ denote a function that maps each constraint to a positive real value, reflecting the weight of that constraint. Then, the optimal solution to a W-CSP is given by the following equation.</Paragraph>
    <Paragraph position="11"> hand. The constraints on the right have been marked with a number (between parentheses) that refers to the trigram prediction on the left from which the constraint was derived.</Paragraph>
    <Paragraph position="12"> That is, the assignment of values to its variables that maximizes the sum of weights of the constraints that have been satisfied.</Paragraph>
    <Paragraph position="13"> Translating the terminology used in morpho-phonological tasks to the constraint satisfaction domain, each letter maps to a variable, the domain of which corresponds to the three overlapping candidate symbols for this letter suggested by the trigrams covering the letter. This provides us with a definition of the function D, mapping variables to their domain. In the following, yi,j denotes the candidate symbol for letter xj predicted by the trigram assigned to letter xi.</Paragraph>
    <Paragraph position="15"> Constraints are extracted from the predicted trigrams. Given the goal of retaining predicted tri-grams in the output symbol sequence as much as possible, the most important constraints are simply the trigrams themselves. A predicted trigram describes a subsequence of length three of the entire output sequence; by turning such a trigram into a constraint, we express the wish to have this trigram</Paragraph>
    <Paragraph position="17"> No base classifier is flawless though, and therefore not all predicted trigrams can be expected to be correct. Yet, even an incorrect trigram may carry some useful information regarding the output sequence: one trigram also covers two bigrams, and three unigrams. An incorrect trigram may still contain smaller subsequences of length one or two that are correct. Therefore, all of these are also mapped to constraints.</Paragraph>
    <Paragraph position="19"> To illustrate the above procedure, Figure 1 shows the constraints yielded by a given output sequence of class trigrams for the word &amp;quot;hand&amp;quot;. With such an amount of overlapping constraints, the satisfaction problem obtained easily becomes over-constrained, that is, no variable assignment exists that can satisfy all constraints without breaking another. Even only one incorrectly predicted class trigram already leads to two conflicting candidate symbols for one of the letters at least. In Figure 1, this is the case for the letter &amp;quot;d&amp;quot;, for which both the symbol &amp;quot;d&amp;quot; and &amp;quot;t&amp;quot; are predicted. On the other hand, without conflicting candidate symbols, no inference would be needed to start with. The choice for the weighted constraint satisfaction method always allows a solution to be found, even in the presence of conflicting constraints. Rather than requiring all constraints to be satisfied, each constraint is assigned a certain weight; the optimal solution to the problem is an assignment of values to the variables that optimizes the  for the word booking.</Paragraph>
    <Paragraph position="20"> sum of weights of the constraints that are satisfied.</Paragraph>
    <Paragraph position="21"> As weighted constraints are defined over overlapping subsequences of the output sequence, the best symbol assignment for each letter with respect to the weights of satisfied constraints is decided upon on a global sequence level. This may imply taking into account symbol assignments for surrounding letters to select the best output symbol for a certain letter.</Paragraph>
    <Paragraph position="22"> In contrast, in non-global approaches, ignorant of any sequential context, only the local classifier prediction with highest confidence is considered for selecting a letter's output symbol. By formulating our inference procedure as a constraint satisfaction problem, global output optimization comes for free: in constraint satisfaction, the aim is also to find a globally optimal assignment of variables taking into account all constraints defined over them. Yet, for such a constraint satisfaction formulation to be effective, good constraint weights should be chosen, that is, weights that favor good output sequences over bad ones.</Paragraph>
    <Paragraph position="23"> Constraints can directly be traced back to a prediction made by the base classifier. If two constraints are in conflict, the one which the classifier was most certain of should preferably be satisfied.</Paragraph>
    <Paragraph position="24"> In the W-CSP framework, this preference can be expressed by weighting constraints according to the classifier confidence for the originating trigram. For the memory-based learner, we define the classifier confidence for a predicted class as the weight assigned to that class in the neighborhood of the test instance, divided by the total weight of all classes.</Paragraph>
    <Paragraph position="25"> Let x denote a test instance, and c[?] its predicted class. Constraints derived from this class are weighted according to the following rules: * for a trigram constraint, the weight is simply the base classifier's confidence value for the class c[?]; * for a bigram constraint, the weight is the sum of the confidences for all trigram classes in the nearest-neighbor set of x that assign the same symbol bigram to the letters spanned by the constraint; * for a unigram constraint, the weight is the sum of the confidences for all trigram classes in the nearest-neighbor set of x that assign the same symbol to the letter spanned by the constraint.</Paragraph>
    <Paragraph position="26"> This weighting scheme results in an inference procedure that behaves exactly as we already described intuitively in the beginning of this section. The preference for retaining the predicted trigrams in the output sequence is translated into high rewards for output sequences that do so, since such output sequences not only receive credit for the satisfied tri-gram constraints, but also for all the bigram and unigram constraints derived from that trigram; they are necessarily satisfied as well. Nonetheless, this preference for trigrams may be abandoned if composing a certain part of the output sequence from several symbol bigrams or even unigrams results in higher rewards than when trigrams are used. The latter may happen in cases where the base classifier is not confident about its trigram predictions.</Paragraph>
  </Section>
  <Section position="5" start_page="45" end_page="46" type="metho">
    <SectionTitle>
3 Data preparation
</SectionTitle>
    <Paragraph position="0"> In our experiments we train classifiers on English and Dutch letter-phoneme conversion and morphological analysis. All data for the experiments described in this paper are extracted from the CELEX lexical databases for English and Dutch (Baayen et al., 1993). We encode the examples for our base classifiers in a uniform way, along the following procedure. Given a word and (i) an aligned phonemic transcription or (ii) an aligned encoding of a morphological analysis, we generate letter-by-letter windows. Each window takes one letter in focus, and includes three neighboring letters to the left and to the right. Each seven-letter input window is associated to a trigram class label, composed of the focus class label aligned with the middle letter, plus its immediately preceding and following class labels. Table 1 displays the seven examples made on the basis of the word booking, with tri-gram classes (as explained in Section 2) both for the letter-phoneme conversion task and for the morphological analysis task. The full aligned phonemic transcription of booking is [bu-kIN-] (using the SAMPA coding of the international phonetic alphabet), and the morphological analysis of booking is [book]stem[ing]inflection. The dashes in the phonemic transcription are inserted to ensure a one-to-one mapping between letters and phonemes; the insertion was done by automatical alignment through expectation-maximization (Dempster et al., 1977).</Paragraph>
    <Paragraph position="1"> The English word phonemization data, extracted from the CELEX lexical database, contains 65,467 words, on the basis of which we create a database of 573,170 examples. The Dutch word phonemization data set consists of 293,825 words, totaling to 3,181,345 examples. Both data sets were aligned using the expectation-maximization algorithm (Dempster et al., 1977), using a phonemic null character to equalize the number of symbols in cases in which the phonemic transcription is shorter than the orthographic word, and using &amp;quot;double phonemes&amp;quot; (e.g. [X] for [ks]) in cases where the phonemic transcription is longer, as in taxi - [tAksi].</Paragraph>
    <Paragraph position="2"> CELEX contains 336,698 morphological analyses of Dutch (which we converted to 3,209,090 examples), and 65,558 analyses of English words (573,544 examples). We converted the available</Paragraph>
    <Paragraph position="4"> gram classes derived from the example word abnormaliteiten. null morphological information for the two languages in a coding scheme which is rather straightforward in the case of English, and somewhat more complicated for Dutch. For English, as exemplified in Table 1, a simple segmentation label marks the beginning of either a stem, an inflection (&amp;quot;s&amp;quot; and &amp;quot;i&amp;quot; in Table 1), a stress-affecting affix, or a stress-neutral affix (&amp;quot;1&amp;quot; and &amp;quot;2&amp;quot;, not shown in Table 1). The coding scheme for Dutch incorporates additional information on the part-of-speech of every stem and noninflectional affix, the type of inflection, and also encodes all spelling changes between the base lemma forms and the surface word form.</Paragraph>
    <Paragraph position="5"> To illustrate the more complicated construction of examples for Dutch morphological analysis, Table 2 displays the 15 instances derived from the Dutch example word abnormaliteiten (abnormalities) and their associated classes. The class of the first instance is A, which signifies that the morpheme starting in a is an adjective (A). The class of the eighth instance, 0+Da, indicates that at that position no segment starts (0), but that an a was deleted at that position (+Da, &amp;quot;delete a&amp;quot; here). Next to deletions, insertions (+I) and replacements (+R, with a deletion and an insertion argument) can also occur. Together  for the four tasks.</Paragraph>
    <Paragraph position="6"> these two classification labels code that the first morpheme is the adjective abnormaal. The second morpheme, the suffix iteit, has class A -N. This complex tag, which is in fact a rewrite rule, indicates that when iteit attaches right to an adjective (encoded by A ), the new combination becomes a noun (-N).</Paragraph>
    <Paragraph position="7"> Rewrite rule class labels occur exclusively with suffixes, that do not have a part-of-speech tag of their own, but rather seek an attachment to form a complex morpheme with the part-of-speech tag. Finally, the third morpheme is en, which is a plural inflection that by definition attaches to a noun.</Paragraph>
    <Paragraph position="8"> Logically, the number of trigram classes for each task is larger than the number of atomic classes; the actual numbers for the four tasks investigated here are displayed in Table 3. The English morphological analysis task has the lowest number of tri-gram classes, 80, due to the fact that there are only five atomic classes in the original task, but for the other tasks the number of trigram classes is quite high; above 10,000. With these numbers of classes, several machine learning algorithms are practically ruled out, given their high sensitivity to numbers of classes (e.g., support vector machines or rule learners). Memory-based learning algorithms, however, are among a small set of machine learning algorithms that are insensitive to the number of classes both in learning and in classification.</Paragraph>
  </Section>
class="xml-element"></Paper>