<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1047">
  <Title>Learning a syntagmatic and paradigmatic structure from language data with a bi-multigram model</Title>
  <Section position="4" start_page="0" end_page="301" type="metho">
    <SectionTitle>
2 Theoretical formulation of the multigrams
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="300" type="sub_section">
      <SectionTitle>
2.1 Variable-length phrase distribution
</SectionTitle>
      <Paragraph position="0"> In the multigram framework, the assumption is made that sentences result from the concatenation of variable-length phrases, called multigrams. The likelihood of a sentence is computed by summing the likelihood values of all possible segmentations of the sentence into phrases. The likelihood computa- null tion for any particular segmentation into phrases depends on the model assumed to describe the dependencies between the phrases. We call bi-multigram model the model where bigram dependencies are assumed between the phrases. For instance, by limiting to 3 words the maximal length of a phrase, the bi-multigram likelihood of the string &amp;quot;a b c d&amp;quot; is:</Paragraph>
      <Paragraph position="2"> To )resent the general formalism of the model in this section, we assume ~-gram correlations between the phrases, and we note n the maximal length of a phrase (in the above example, ~=2 and n=3). Let W denote a string of words, and {S} the set of possible segmentations on W. The likelihood of W is:</Paragraph>
      <Paragraph position="4"> with s(~) denoting the phrase of rank (r) in the segmentation S. The model is thus fully defined by the set of ~-gram probabilities on the set {8i} i of all the phrases which can be formed by combining 1, 2, ...up to n words of the vocabulary. Maximum likelihood (ML) estimates of these probabilities can be obtained by formulating the estimation problem as a ML estimation from incomplete data (Dempster et al., 1977), where the unknown data is the underlying segmentation S. Let Q(k, k+ 1) be the following auxiliary function computed with the likelihoods of iterations k and k + 1 :</Paragraph>
      <Paragraph position="6"> It has been shown in (Dempster et al., 1977) that if Q(k,k + 1) &gt; Q(k,k), then PS(k+l)(W) &gt; PS(k)(W). Therefore the reestimation equation of p(sir I si, ...sir_,), at iteration (k + 1), can be derived by maximizing Q(k, k + 1) over the set of parameters of iteration (k + 1), under the set of constraints ~&amp;quot;~'.a&amp;quot; p(sir \[si,...sir_,) = 1, hence:</Paragraph>
      <Paragraph position="8"> ~sels} c(si, sir_,, S) x PS(k)(S I W) where c(si, ... si-~, S) is the number ofoccurences of the combination of phrases sl, ... siw in the segmentation S. Reestimation equation (4) can be implemented by means of a forward-backward algorithm, such as the one described for bi-multigrams (~ = 2) in the appendix of this paper. In a decision-oriented scheme, the reestimation equation reduces to: c(si, ... si.~_, s~, S &amp;quot;(k)) p(k+l)(si-~ I si, ...sir_,) = c(si, ...sir_,, S &amp;quot;(k)) (5) where S *(k), the segmentation maximizing PS:(k)(S \] W), is retrieved with a Viterbi algorithm. null Since each iteration improves the model in the sense of increasing the likelihood /:(k)(W), it eventually converges to a critical point (possibly a local maximum).</Paragraph>
    </Section>
    <Section position="2" start_page="300" end_page="301" type="sub_section">
      <SectionTitle>
2.2 Variable-length phrase clustering
</SectionTitle>
      <Paragraph position="0"> Recently, class-phrase based models have gained some attention (Ries et al., 1996), but usually it assumes a previous clustering of the words.</Paragraph>
      <Paragraph position="1"> Typically, each word is first assigned a word-class label &amp;quot;&lt; Ck &gt;&amp;quot;, then variable-length phrases \[Ck,Ck2...Ck,\] of word-class labels are retrieved, each of which leads to define a phrase-class label which can be denoted as &amp;quot;&lt; \[Ck,Ck2...Ch\] &gt;&amp;quot;. But in this approach only phrases of the same length can be assigned the same phrase-class label. For instance, the phrases &amp;quot;thank you for&amp;quot; and &amp;quot;thank you very much for&amp;quot; cannot be assigned the same class label. We propose to address this limitation by directly clustering phrases instead of words.</Paragraph>
      <Paragraph position="2"> For this purpose, we assume bigram correlations between the phrases (~ = 2), and we modify the learning procedure of section 2.1, so that each iteration consists of 2 steps:  Step 1 takes a phrase distribution as an input, assigns each phrase sj to a class Cq(,.), and outputs the corresponding class dmtnbutmn. In our experiments, the class assignment is performed by maximizing the mutual information between adjacent phrases, following the line described in (Brown  et al., 1992), with only the modification that candidates to clustering are phrases instead of words. The clustering process is initialized by assigning each phrase to its own class. The loss in average mutual information when merging 2 classes is computed for every pair of classes, and the 2 classes for which the loss is minimal are merged. After each merge, the loss values are updated and the process is repeated till the required number of classes is obtained.</Paragraph>
      <Paragraph position="3"> Step _2 consists in reestimating a phrase distribution using the bi-multigram reestimation equation (4) or (5), with the only difference that the likelihood of a parse, instead of being computed as in Eq. (2), is now computed with the class estimates, i.e. as:</Paragraph>
      <Paragraph position="5"> This is equivalent to reestimating p(k+l)(sj \[ Si) from p(k)(Cq(, D \[ Cq(,,)) x p(k)(sj \[ Cq(,D), instead ofp(k)(sj \[ si) as was the case in section 2.1.</Paragraph>
      <Paragraph position="6"> Overall, step 1 ensures that the class assignment based on the mutual information criterion is optimal with respect to the current estimates of the phrase distribution and step _2 ensures that the phrase distribution optimizes the likelihood computed according to (6) with the current estimates of the ciass distribution. The training data are thus iteratively structured in a fully integrated way, at both a paradigmatic level (step 1) and a syntagmatic level (step 2_).</Paragraph>
    </Section>
    <Section position="3" start_page="301" end_page="301" type="sub_section">
      <SectionTitle>
2.3 Interpolation of stochastic class-phrase and phrase models
</SectionTitle>
      <Paragraph position="0"> and phrase models With a class model, the probabilities of 2 phrases belonging to the same class are distinguished only according to their unigram probability. As it is unlikely that this loss of precision be compensated by the improved robustness of the estimates of the class distribution, class based models can be expected to deteriorate the likelihood of not only train but also test data, with respect to non-class based models.</Paragraph>
      <Paragraph position="1"> However, the performance of non-class models can be enhanced by interpolating their estimates with the class estimates. We first recall the way linear interpolation is performed with conventional word ngram models, and then we extend it to the case of our stochastic phrase-based approach. Usually, linear interpolation weights are computed so as to maximize the likelihood of cross evaluation data (Jelinek and Mercer, 1980). Denoting by A and (1 - A) the interpolation weights, and by p+ the interpolated estimate, it comes for a word bigram model:  I i) = a p(w i I w,) + (l-a) p(Cq(wj) I cq(w,)) I with A having been iteratively estimated on a cross evaluation corpus l,VC/~o,, as: 1 A (k) p(wj \[ wi) A(k+l) - - TC/',.o,, Z c(wiwj) p(~)(wj I wi) (8) ij  where Tcro,, is the number of words in Weros,, and c(wiwj) the number of co-occurences of the words wi and wj in Wero,~.</Paragraph>
      <Paragraph position="2"> In the case of a stochastic phrase based model where the segmentation into phrases is not known a priori - the above computation of the interpolation weights still applies, however, it has to be embedded in dynamic programming to solve the ambiguity on the segmentation:</Paragraph>
      <Paragraph position="4"> where S &amp;quot;(k) the most likely segmentation of Wero,s given the current estimates p(~)(sj I si) can be retrieved with a Viterbi algorithm, and where c(S*(k)) is the number of sequences in the segmentation S &amp;quot;(k). A more accurate, but computationally more involved solution would be to compute A (~+1) as the</Paragraph>
      <Paragraph position="6"> over the set of segmentations {S} on Wcross, using for this purpose a forward-backward algorithm.</Paragraph>
      <Paragraph position="7"> However in the experiments reported in section 4, we use Eq (9) only.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="301" end_page="303" type="metho">
    <SectionTitle>
3 Experiments with phrase based models
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="301" end_page="302" type="sub_section">
      <SectionTitle>
3.1 Protocol and database
</SectionTitle>
      <Paragraph position="0"> Evaluation protocol A motivation to learn bi-gram dependencies between variable length phrases is to improve the predictive capability of conventional word bigram models, while keeping the number of parameters in the model lower than in the word trigram case. The predictive capability is usually evaluated with the perplexity measure:</Paragraph>
      <Paragraph position="2"> where T is the number of words in W. The lower PP is, the more accurate the prediction of the model is. In the case of a stochastic model, there are actually 2 perplexity values PP and PP* computed respectively from ~&amp;quot;\]~s PS(W,S) and PS(W,S*). The difference PP* - PP is always positive or zero, and measures the average degree of ambiguity on a parse S of W, or equivalently the loss in terms of prediction accuracy, when the sentence likelihood is approximated with the likelihood of the best parse, as is done in a speech recognizer.</Paragraph>
      <Paragraph position="3">  In section 3.2, we first evaluate the loss (PP&amp;quot; - PP) using the forward-backward estimation procedure, and then we study the influence of the estimation procedure itself, i.e. Eq. (4) or (5), in terms of perplexity and model size (number of distinct 2-uplets of phrases in the model). Finally, we compare these results with the ones obtained with conventional n-gram models (the model size is thus the number of distinct n-uplets of words observed), using for this purpose the CMU-Cambridge toolkit (Clarkson and Rosenfeld, 1997).</Paragraph>
      <Paragraph position="4"> Training protocol Experiments are reported for phrases having at most n = 1, 2, 3 or 4 words (for n =1, bi-multigrams correspond to conventional bigrams). The bi-multigram probabilities are initialized using the relative frequencies of all the 2-uplets of phrases observed in the training corpus, and they are reestimated with 6 iterations. The dictionaries of phrases are pruned by discarding all phrases occuring less than 20 times at initialization, and less than 10 times after each iteration s, except for the 1-word phrases which are kept with a number of occurrences set to 1. Besides, bi-multigram and n-gram probabilities are smoothed with the backoff smoothing technique (Katz, 1987) using Witten-Bell discounting (Witten and Bell, 1991) 3.</Paragraph>
      <Paragraph position="5"> Database Experiments are run on ATR travel arrangement data (see Tab. 1). This database consists of semi-spontaneous dialogues between a hotel clerk and a customer asking for travel/accomodation informations. All hesitation words and false starts were mapped to a single marker &amp;quot;*uh*&amp;quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="302" end_page="303" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> Ambiguity on a parse (Table 2) The difference (PP&amp;quot; - PP) usually remains within about 1 point of perplexity, meaning that the average ambiguity on a parse is low, so that relying on the single best parse should not decrease the accuracy of the prediction very much.</Paragraph>
      <Paragraph position="1"> Influence of the estimation procedure (Table 3) As far as perplexity values are concerned,  the estimation scheme seems to have very little influence, with only a slight advantage in using the forward-backward training. On the other hand, the size of the model at the end of the training is about 30% less with the forward-backward training: approximately 40 000 versus 60 000, for a same test perplexity value. The bi-multigram results tend to indicate that the pruning heuristic used to discard phrases does not allow us to fully avoid overtraining, since perplexities with n =3, 4 (i.e. dependencies possibly spanning over 6 or 8 words) are higher than with n =2 (dependencies limited to 4 words).</Paragraph>
      <Paragraph position="2">  Comparison with n-grams (Table 4) The lowest bi-multigram perplexity (43.9) is still higher than the trigram score, but it is much closer to the tri-gram value (40.4) than to the bigram one (56.0) 4 The number of entries in the bi-multigram model is much less than in the trigram model (45000 versus 75000), which illustrates the ability of the model to  values and model size.</Paragraph>
      <Paragraph position="3"> 4Besides, the trig-ram score depends on the discounted scheme: with a linear discounting, the trlg'ram perplexity on our test data was 48.1.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="303" end_page="303" type="metho">
    <SectionTitle>
4 Experiments with class-phrase based models
</SectionTitle>
    <Paragraph position="0"> based models</Paragraph>
    <Section position="1" start_page="303" end_page="303" type="sub_section">
      <SectionTitle>
4.1 Protocol and database
</SectionTitle>
      <Paragraph position="0"> Evaluation protocol In section 4.2, we compare class versions and interpolated versions of the bigram, trigram and bi-multigram models, in terms of perplexity values and of model size. For bigrams (resp. trigrams) of classes, the size of the model is the number of distinct 2-uplets (resp. 3-uplets) of word-classes observed, plus the size of the vocabulary. For the class version of the bi-multigrams, the size of the model is the number of distinct 2-uplets of phrase-classes, plus the number of distinct phrases maintained. In section 4.3, we show samples from classes of up to 5-word phrases, to illustrate the potential benefit of clustering relatively long and variable-length phrases for issues related to language understanding.</Paragraph>
      <Paragraph position="1"> Training protocol All non-class models are the same as in section 3. The class-phrase models are trained with 5 iterations of the algorithm described in section 2.2: each iteration consists in clustering the phrases into 300 phrase-classes (step 1), and in reestimating the phrase distribution (step 2) with Eq. (4). The bigrams and trigrams of classes are estimated based on 300 word-classes derived with the same clustering algorithm as the one used to cluster the phrases. The estimates of all the class ditributions are smoothed with the backoff technique like in section 3. Linear interpolation weights between the class and non-class models are estimated based on Eq. (8) in the case of the bigram or trigram models, and on Eq.(9) in the case of the bi-multigram model.</Paragraph>
      <Paragraph position="2"> Database The training and test data used to train and evaluate the models are the same as the ones described in Table 1. We use an additional set of 7350 sentences and 55000 word tokens to estimate the interpolation weights of the interpolated models.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>