<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2054">
  <Title>Exploiting Non-local Features for Spoken Language Understanding</Title>
  <Section position="4" start_page="412" end_page="413" type="metho">
    <SectionTitle>
2 Spoken Language Understanding as a Sequential Labeling Problem
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="412" end_page="412" type="sub_section">
      <SectionTitle>
2.1 Spoken Language Understanding
</SectionTitle>
      <Paragraph position="0"> The goal of SLU is to extract semantic meanings from recognized utterances and to fill the correct values into a semantic frame structure.</Paragraph>
      <Paragraph position="1"> A semantic frame (or template) is a well-formed and machine readable structure of extracted information consisting of slot/value pairs. An example of such a reference frame is as follows.</Paragraph>
      <Paragraph position="2"> &lt;s&gt; i wanna go from denver to new york on  november eighteenth &lt;/s&gt; FROMLOC.CITY NAME = denver TOLOC.CITY NAME = new york MONTH NAME = november DAY NUMBER = eighteenth  This example from air travel data (CU-Communicator corpus) was automatically generated by a Phoenix parser and manually corrected (Pellom et al., 2000; He and Young, 2005). In this example, the slot labels are two-level hierarchical; such as FROMLOC.CITY NAME. This hierarchy differentiates the semantic frame extraction problem from the named entity recognition (NER) problem.</Paragraph>
      <Paragraph position="3"> Regardless of the fact that there are some differences between SLU and NER, we can still apply well-known techniques used in NER to an SLU problem. Following (Ramshaw and Marcus, 1995), the slot labels are drawn from a set of classes constructed by extending each label by three additional symbols, Beginning/Inside/Outside (B/I/O). A two-level hierarchical slot can be considered as an integrated flattened slot. For example, FROMLOC.CITY NAME and TOLOC.CITY NAMEare different on this slot definition scheme.</Paragraph>
      <Paragraph position="4"> Now, we can formalize the SLU problem as a sequential labeling problem, y[?] = argmaxy P(y|x). In this case, input word sequences x are not only lexical strings, but also multiple linguistic features. To extract semantic frames from utterance inputs, we use a linear-chain CRF model; a model that assigns a joint probability distribution over labels which is conditional on the input sequences, where the distribution respects the independent relations encoded in a graph (Lafferty et al., 2001).</Paragraph>
      <Paragraph position="5"> A linear-chain CRF is defined as follows. Let G be an undirected model over sets of random variables x and y. The graph G with parameters L = {l,...} defines a conditional probability for a state (or label) sequence y = y1,...,yT , given an input x = x1,...,xT , to be</Paragraph>
      <Paragraph position="7"> parenrightBigg where Zx is the normalization factor that makes the probability of all state sequences sum to one. fk(yt[?]1,yt,x,t) is an arbitrary linguistic feature function which is often binary-valued in NLP tasks. lk is a trained parameter associated with feature fk. The feature functions can encode any aspect of a state transition, yt[?]1 - yt, and the observation (a set of observable features), x, centered at the current time, t. Large positive values for lk indicate a preference for such an event, while large negative values make the event unlikely. null Parameter estimation of a linear-chain CRF is typically performed by conditional maximum loglikelihood. To avoid overfitting, the 2-norm regularization is applied to penalize on weight vector whose norm is too large. We used a limited memory version of the quasi-Newton method (L-BFGS) to optimize this objective function. The L-BFGS method converges super-linearly to the solution, so it can be an efficient optimization technique on large-scale NLP problems (Sha and Pereira, 2003).</Paragraph>
      <Paragraph position="8"> A linear-chain CRF has been previously applied to obtain promising results in various natural language tasks, but the linear-chain structure is deficient in modeling long-distance dependencies because of its limited structure (n-th order Markov chains).</Paragraph>
    </Section>
    <Section position="2" start_page="412" end_page="413" type="sub_section">
      <SectionTitle>
2.2 Long-distance Dependency in Spoken Language Understanding
</SectionTitle>
      <Paragraph position="0"> In most sequential supervised learning problems including SLU, the feature function fk(yt[?]1,yt,xt,t) indicates only local information  for practical reasons. With sufficient local context (e.g. a sliding window of width 5), inference and learning are both efficient.</Paragraph>
      <Paragraph position="1"> However, if we only use local features, then we cannot model long-distance dependencies.</Paragraph>
      <Paragraph position="2"> Thus, we should incorporate non-local information into the model. For example, figure 1 shows the long-distance dependency problem in an SLU task. The same two word tokens &amp;quot;dec.&amp;quot; should be classified differently, DEPART.MONTH and RETURN.MONTH. The dotted line boxes represent local information at the current decision point (&amp;quot;dec.&amp;quot;), but they are exactly the same in two distinct examples. Moreover, the two states share the same previous sequence (O, O, FROMLOC.CITY NAME-B, O, TOLOC.CITY NAME-B, O). If we cannot obtain higher-order dependencies such as &amp;quot;fly&amp;quot; and &amp;quot;return,&amp;quot; then the linear-chain CRF cannot classify the correct labels between the two same tokens. To solve this problem, we propose an approach to exploit non-local information in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="413" end_page="415" type="metho">
    <SectionTitle>
3 Incorporating Non-local Information
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="413" end_page="413" type="sub_section">
      <SectionTitle>
3.1 Using Trigger Features
</SectionTitle>
      <Paragraph position="0"> To exploit non-local information to sequential labeling for a statistical SLU, we can use two approaches; a syntactic parser-based and a data-driven approach. Traditionally, information extraction and language understanding fields have usually used a syntactic parser to encode global information (e.g. parse tree path, governing category, or head word) over a local model. In a semantic role labeling task, the syntax and semantics are correlated with each other (Gildea and Jurafsky, 2002), that is, the global structure of the sentence is useful for identifying ambiguous semantic roles. However the problem is the poor accuracy of the syntactic parser with this type of feature. In addition, recognized utterances are erroneous and the spoken language has no capital letters, no additional symbols, and sometimes no grammar, so it is difficult to use a parser in an SLU problem.</Paragraph>
      <Paragraph position="1"> Another solution is a data-driven method, which uses statistics to find features that are approximately modeling long-distance dependencies. The simplest way is to use identical words in history or lexical co-occurrence, but we wish to use a more general tool; triggering. The trigger word pairs are introduced by (Rosenfeld, 1994). A trigger pair is the basic element for extracting information from the long-distance document history. In language modeling, n-gram based on the Markovian assumption cannot represent higher-order dependencies, but it can automatically extract trigger word pairs from data. The pair (A - B) means that word A and B are significantly correlated, that is, when A occurs in the document, it triggers B, causing its probability estimate to change.</Paragraph>
      <Paragraph position="2"> To select reasonable pairs from arbitrary word pairs, (Rosenfeld, 1994) used averaged mutual information (MI). In this scheme, the MI score of one pair is MI(A;B) =</Paragraph>
      <Paragraph position="4"> Using the MI criterion, we can select correlated word pairs. For example, the trigger pair (dec.-return) was extracted with score 0.001179 in the training data1. This trigger word pair can represent long-distance dependency and provide a cue to identify ambiguous classes. The MI approach, however, considers only lexical collocation without reference labels y, and MI based selection tends to excessively select the irrelevant triggers. Recall that our goal is to find the significantly correlated trigger pairs which improve the model. Therefore, we use a more appropriate selection method for sequential supervised learning.</Paragraph>
    </Section>
    <Section position="2" start_page="413" end_page="415" type="sub_section">
      <SectionTitle>
3.2 Selecting Trigger Features
</SectionTitle>
      <Paragraph position="0"> We present another approach to extract relevant triggers and exploit them in a linear-chain CRF.</Paragraph>
      <Paragraph position="1"> Our approach is based on an automatic feature induction algorithm, which is a novel method to select a feature in an exponential model (Pietra et al., 1997; McCallum, 2003). We follow McCallum's work which is an efficient method to induce features in a linear-chain CRF model. Following the framework of feature inducing, we start the algorithm with an empty set, and iteratively increase the bundle of features including local features and trigger features. Our basic assumption, however, is that the local information should be included because the local features are the basis of the decision to identify the classes, and they reduce the 1In our experiment, the pair (dec.-fly) cannot be selected because this MI score is too low. However, the trigger pair is a binary type feature, so the pair (dec.-return) is enough to classify the two cases in the previous example.</Paragraph>
      <Paragraph position="2">  this case, a word token &amp;quot;dec.&amp;quot; with local feature set (dotted line box) is ambiguous for determining the correct label (DEPART.MONTH or RETURN.MONTH).</Paragraph>
      <Paragraph position="3"> mismatch between training and testing tasks. Furthermore, this assumption leads us to faster training in the inducing procedure because we can only consider additional trigger features.</Paragraph>
      <Paragraph position="4"> Now, we start the inducing process with local features rather than an empty set. After training the base model L(0), we should calculate the gains, which measure the effect of adding a trigger feature, based on the local model parameter L(0). The gain of the trigger feature is defined as the improvement in log-likelihood of the current model L(i) at the i-th iteration according to the following formula:</Paragraph>
      <Paragraph position="6"> where u is a parameter of a trigger feature to be found and g is a corresponding trigger feature function. The optimal value of u can be calculated by Newton's method.</Paragraph>
      <Paragraph position="7"> By adding a new candidate trigger, the equation of the linear-chain CRF model is changed to an additional feature model as PL(i)+g,u(y|x) =</Paragraph>
      <Paragraph position="9"> Note that Zx(L(i),g,u) is the marginal sum over all states of yprime. Following (Pietra et al., 1997; Mc-Callum, 2003), the mean field approximation and agglomerated features allows us to treat the above calculation as the independent inference problem rather than sequential inference. We can evaluate the probability of state y with an adding trigger pair given observation x separately as follows.</Paragraph>
      <Paragraph position="11"> Here, we introduce a second approximation. We use the individual inference problem over the unstructured maximum entropy (ME) model whose state variable is independent from other states in history. The background of our approximation is that the state independent problem of CRF can be relaxed to ME inference problem without the state-structured model. In the result, we calculate the gain of candidate triggers, and select trigger features over a light ME model instead of a huge computational CRF model2.</Paragraph>
      <Paragraph position="12"> We can efficiently assess many candidate trigger features in parallel by assuming that the old features remain fixed while estimating the gain.</Paragraph>
      <Paragraph position="13"> The gain of trigger features can be calculated on the old model that is trained with the local and added trigger pairs in previous iterations. Rather than summing over all training instances, we only need to use the mislabeled N tokens by the current parameter L(i) (McCallum, 2003). From mis-classified instances, we generate the candidates of trigger pairs, that is, all pairs of current words and others within the sentence. With the candidate feature set, the gain is</Paragraph>
      <Paragraph position="15"> Using the estimated gains, we can select a small portion of all candidates, and retrain the model with selected features. We iteratively perform the selection algorithm with some stop conditions (excess of maximum iteration or no added feature up to the gain threshold). The outline of the induction 2The ME model cannot represent the sequential structure and the resulting model is different from CRF. Nevertheless, we empirically prove that the effect of additional trigger features on both ME and approximated CRF (without regarding edge-state) are similar (see the experiment section).</Paragraph>
      <Paragraph position="17"> algorithms is described in figure 2. In the next section, we empirically prove the effectiveness of our algorithm.</Paragraph>
      <Paragraph position="18"> The trigger pairs introduced by (Rosenfeld, 1994) are just word pairs. Here, we can generalize the trigger pairs to any arbitrary pairs of features. For example, the feature pair (of-B-PP) is useful in deciding the correct answer PERIOD OF DAY-Iin &amp;quot;in the middle of the day.&amp;quot; Without constraints on generating the pairs (e.g.</Paragraph>
      <Paragraph position="19"> at most 3 distant tokens), the candidates can be arbitrary conjunctions of features3. Therefore we can explore any features including local conjunction or non-local singleton features in a uniform framework.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="415" end_page="417" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="415" end_page="415" type="sub_section">
      <SectionTitle>
4.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> We evaluate our method on the CU-Communicator corpus. It consists of 13,983 utterances. The semantic categories correspond to city names, time-related information, airlines and other miscellaneous entities. The semantic labels are automatically generated by a Phoenix parser and manually corrected. In the data set, the semantic category has a two-level hierarchy: 31 first level classes  tions because we wish to capture the effect of long-distance entities.</Paragraph>
      <Paragraph position="1"> city names, a tenth are state and country names, and a fifth are airline and airport names. For the second level hierarchy, approximately three quarters of the entities are &amp;quot;NONE&amp;quot;, a tenth are &amp;quot;TOLOC&amp;quot;, a tenth are &amp;quot;FROMLOC&amp;quot;, and the remaining are &amp;quot;RETURN&amp;quot;, &amp;quot;DEPERT&amp;quot;, &amp;quot;ARRIVE&amp;quot;, and &amp;quot;STOPLOC.&amp;quot; For spoken inputs, we used the open source speech recognizer Sphinx2. We trained the recognizer with only the domain-specific speech corpus. The reported accuracy for Sphinx2 speech recognition is about 85%, but the accuracy of our speech recognizer is 76.27%; we used only a subset of the data without tuning and the sentences of this sub-set are longer and more complex than those of the removed ones, most of which are single-word responses. null All of our results have averaged over 5-fold cross validation with an 80/20 split of the data.</Paragraph>
      <Paragraph position="2"> As it is standard, we compute precision and recall, which are evaluated on a per-entity basis and combined into a micro-averaged F1 score (F1 = 2PR/(P+R)).</Paragraph>
      <Paragraph position="3"> A final model (a first-order linear chain CRF) is trained for 100 iterations with a Gaussian prior variance of 20, and 200 or fewer trigger features (down to a gain threshold of 1.0) for each round of inducing iteration (100 iterations of L-BFGS for the ME inducer and 10[?]20 iterations of L-BFGS for the CRF inducer). All experiments are implemented in C++ and executed on Linux with XEON 2.8 GHz dual processors and 2.0 Gbyte of main memory.</Paragraph>
    </Section>
    <Section position="2" start_page="415" end_page="417" type="sub_section">
      <SectionTitle>
4.2 Empirical Results
</SectionTitle>
      <Paragraph position="0"> We list the feature templates used by our experiment in figure 3. For local features, we use the indicators for specific words at location i, or locations within five words of i ([?]2,[?]1,0,+1,+2 words on current position i). We also use the part-of-speech (POS) tags and phrase labels with partial parsing. Like words, the two basic linguistic features are located within five tokens. For comparison, we exploit the two groups of non-local syntax parser-based features; we use Collins parser and extract this type of features from the parse trees. The first consists of the head word and POS-tag of the head word. The second group includes governing category and parse tree paths introduced by semantic role labeling (Gildea and Jurafsky, 2002). Following the previous studies  of semantic role labeling, the parse tree path improves the classification performance of semantic role labeling. Finally, we use the trigger pairs that are automatically extracted from the training data.</Paragraph>
      <Paragraph position="1"> Avoiding the overlap of local features, we add the constraint |i[?]j |&gt; 2 for the target word wj. Note that null pairs are equivalent to long-distance singleton word features wj.</Paragraph>
      <Paragraph position="2"> To compute feature performance, we begin with word features and iteratively add them one-by-one so that we achieve the best performance. Table 1 shows the empirical results of local features, syntactic parser-based features, and trigger features respectively. The two F1 scores for text transcripts (Text) and outputs recognized by an automatic speech recognizer (ASR) are listed. We achieved F1 scores of 94.79 and 71.79 for Text and ASR inputs using only word features. The performance is decreased by adding the additional local features (POS-tags and chunk labels) because the pre-processor brings more errors to the system for spoken dialog.</Paragraph>
      <Paragraph position="3"> The parser-based and trigger features are added to two baselines: word only and all local features.</Paragraph>
      <Paragraph position="4"> The result shows that the trigger feature is more robust to an SLU task than the features generated from the syntactic parser. The parse tree path and governing category show a small improvement of performance over local features, but it is rather insignificant (word vs. word+path, McNemar's test (Gillick and Cox, 1989); p = 0.022). In contrast, the trigger features significantly improve the performance of the system for both Text and ASR inputs. The differences between the trigger and the others are statistically significant (McNemar's test; p &lt; 0.001 for both Text and ASR).</Paragraph>
      <Paragraph position="5">  Next, we compared the two trigger selection methods; mutual information (MI) and feature induction (FI). Table 2 shows the experimental results of the comparison between MI and FI approaches (with the local feature set; w+p+c). For the MI-based approach, we should calculate an averaged MI for each word pair appearing in a sentence and cut the unreliable pairs (down to threshold of 0.0001) before training the model. In contrast, the FI-based approach selects reliable triggers which should improve the model in training time. Our method based on the feature induction algorithm outperforms simple MI-based methods. Fewer features are selected by FI, that is, our method prunes the event pairs which are highly correlated, but not relevant to models. The extended feature trigger (fi - fj) and null triggers (e - wj) improve the performance over word trigger pairs (wi - wj), but they are not statistically significant (vs. (fi - fj); p = 0.749, vs. ({e,wi} - wj); p = 0.294). Nevertheless, the null pairs are effective in reducing the size of trigger features.</Paragraph>
      <Paragraph position="6"> Figure 4 shows a sample of triggers selected by MI and FI approaches. For example, the trigger &amp;quot;morning - return&amp;quot; is ranked in first of FI but 66th of MI. Moreover, the top 5 pairs of MI are not meaningful, that is, MI selects many functional word pairs. The MI approach considers only lexical collocation without reference labels, so the FI method is more appropriate to sequential supervised learning.</Paragraph>
      <Paragraph position="7"> Finally, we wish to justify that our modified  version of an inducing algorithm is efficient and maintains performance without any drawbacks.</Paragraph>
      <Paragraph position="8"> We proposed two approximations: starting with local features (Approx. 1) and using an unstructured model on the selection stage (Approx. 2), Table 3 shows the results of variant versions of the algorithm. Surprisingly, the selection criterion based on ME (the unstructured model) is better than CRF (the structured model) not only for time cost but also for the performance on our experiment4. This result shows that local information provides the fundamental decision clues. Our modification of the algorithm to induce features for CRF is sufficiently fast for practical usage.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>