<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1624">
  <Title>A Weakly Supervised Learning Approach for Spoken Language Understanding</Title>
  <Section position="4" start_page="199" end_page="203" type="metho">
    <SectionTitle>
2 The System Architecture
</SectionTitle>
    <Paragraph position="0"> The semantic representation of an application domain is usually defined in terms of the semantic frame, which contains a frame type representing the topic of the input sentence, and some slots representing the constraints the query goal has to satisfy. Then, the goal of the SLU system is to translate an input utterance into a semantic frame. Besides the two key components, i.e., topic classifier and semantic classifier, our system also contains a preprocessor and a slot-value merger. Figure 1 illustrates the overall system architecture. It also describes the whole SLU procedure using an example sentence.</Paragraph>
    <Paragraph position="1"> Preprocessor Please tell me how can I go from the people's square to the bund by bus</Paragraph>
    <Paragraph position="3"/>
    <Section position="1" start_page="199" end_page="199" type="sub_section">
      <SectionTitle>
2.1 The Preprocessor
</SectionTitle>
      <Paragraph position="0"> Usually, the preprocessor is to look for the sub-strings in a sentence that correspond to a semantic class or matching a regular expression and to replace them with the class label, e.g., &amp;quot;Huashan Road&amp;quot; and &amp;quot;1954&amp;quot; are replaced with two class labels [road_name] and [number] respectively. In our system, the preprocessor can recognize more complex word sequences, e.g., &amp;quot;1954 Huashan Road&amp;quot; can be recognized as [address] through matching a rule like &amp;quot;[address] null [number] [road_name]&amp;quot;. The preprocessor is implemented with a local chart parser, which is a variation of the robust parser introduced in (Wang, 1999).</Paragraph>
      <Paragraph position="1"> The robust local parser can skip noise words in the sentence, which ensures that the system has the low level robustness. For example, &amp;quot;1954 of the Huashan Road)&amp;quot; can also be recognized as  Because the length is limited, in this paper we only illustrate all the example sentences in English, which are Chinese sentences, in fact.</Paragraph>
      <Paragraph position="2"> [address] by skipping the words &amp;quot;of the&amp;quot;. However, the robust local parser possibly skips the words in the sentence by mistake and produces an incorrect class label. To avoid this side-effect, this local parser exploits an embedded decision tree for pruning, of which the details can be seen in (Wu et al., 2005). According to our experience, it is fairly easy for a general developer with good understanding of the application to author the small grammar used by the local chart parser and annotate the training cases for the embedded decision tree. The work can be finished in several hours.</Paragraph>
    </Section>
    <Section position="2" start_page="199" end_page="200" type="sub_section">
      <SectionTitle>
2.2 Topic Classification
</SectionTitle>
      <Paragraph position="0"> Given the representation of semantic frame, topic classification can be regarded as identifying the frame type. It is suited to be dealt using pattern recognition techniques. The application of statistical pattern techniques to topic classification can improve the robustness of the whole understanding system. Also, in our system, topic classification can greatly reduce the search space and hence improve the performance of subsequent semantic classification. For example, the total number of slots into which the concept [location] can be filled in all topics is 33 and the corresponding maximum number of slots in a single topic is decreased to 10.</Paragraph>
      <Paragraph position="1"> Many statistical pattern recognition techniques have been applied to similar tasks, such as Naive Bayes, N-Gram and Support Vector Machines (SVMs) (Wang et al., 2002). According to the literature (Wang et al., 2002) and our experiments, the SVMs showed the best performance among many other statistical classifiers. Also, it has been showed that active learning can be effectively applied to the SVMs (Schohn and Cohn, 2000; Tong and Koller, 2000). Therefore, we choose the SVMs as the topic classifier. We resorted to the LIBSVM toolkit (Chang and Lin, 2001) to construct the SVMs for our experiments.</Paragraph>
      <Paragraph position="2"> Following the practice in (Wang et al., 2002), the SVMs use a binary valued features vector. If the simplest feature (Chinese character) is used, each query is converted into a feature vector  is the total number of Chinese characters occur in the corpus) with binary valued elements: 1 if a given Chinese character is in this input sentence or 0 otherwise. Due to the existence of the preprocessor, we can also include semantic class labels (e.g., [location]) as features for topic classification. Intuitively, the class label features are more informative than the  Chinese character features. At the same time, including class labels as features can also relieve the data sparseness problem.</Paragraph>
    </Section>
    <Section position="3" start_page="200" end_page="200" type="sub_section">
      <SectionTitle>
2.3 Topic-dependent Semantic Classification
      <Paragraph position="0"> tion The job of semantic classification is to assign the concepts with the most likely slots. It can also be modeled as a classification problem since the number of possible slot names for each concept is limited. Let's consider the example sentence in Figure 1. After the preprocessing and topic classification, we get the preprocessed result &amp;quot;Please tell me how can I go from [location]</Paragraph>
      <Paragraph position="2"> by [bus]?&amp;quot; and the topic ShowRoute. We have to work out which slots are to be filled with the values such as [location]  . The first clue is the surrounding literal context. Intuitively, we can infer that it is a [destination] since a [destination] indicator &amp;quot;to&amp;quot; is before it. If [location]  has already been recognized as a [origin], it is another clue to imply that [location]  is a [destina tion]. Since initially the slot context is not available, the slot context is only employed for the semantic re-classification, which will be described in latter section.</Paragraph>
      <Paragraph position="3"> To learn the topic-dependent semantic classifiers, the training sentences need to be annotated against the semantic frame. Our annotating scenario is relatively simple and can be performed by general developers. For example, for the sentence &amp;quot;Please tell me how can I go from the people's square to the bund by bus?&amp;quot;, the annotated results are like the following: The corresponding slot names can be automatically extracted from the domain model. A domain model is usually a hierarchical structure of the relevant concepts in the application domain. For every occurrence of a concept in the domain model graph, we list all the concept names along the path from the root to its occurrence position and regard their concatenation as a slot name. Thus, the slot name is not flat since it inherits the hierarchy from the domain model.</Paragraph>
      <Paragraph position="4"> With provision of the annotated data, we can collect all the literal and slot context features related to each concept. The examples of features for the concept [location] are illustrated as follows: null  (1) to within the -3 windows (2) from _ to (3) ShowRoute.[route].[origin] within the 2+windows null  The former two are literal context features. Feature (1) is a context-word that tends to indicate ShowRoute.[route].[destination]. Feature (2) is a collocation that checks for the pattern &amp;quot;from&amp;quot; and &amp;quot;to&amp;quot; immediately before and after the concept [location] respectively, and tends to indicate ShowRoute.[route].[origin]. The third one is a slot context feature, which tends to imply the target concept [location] is of type Show-Route.[route].[destination]. In nature, these features are equivalent to the rules in the semantic grammar used by the robust rule-based parser. For example, the feature (2) has the same function as the semantic rule &amp;quot;[origin] null from [location] to&amp;quot;. The advantage of our approach is that we can automatically learn the semantic &amp;quot;rules&amp;quot; from the training data rather than manually authoring them. Also, the learned &amp;quot;rules&amp;quot; are intrinsically robust since they may involves gaps, for example, feature (1) allows skipping some noise words between &amp;quot;to&amp;quot; and [location].</Paragraph>
      <Paragraph position="5"> The next problem is how to apply these features when predicting a new case since the active features for a new case may make opposite predictions. One simple and effective strategy is employed by the decision list (Rivest, 1987), i.e., always applying the strongest features. In a decision list, all the features are sorted in order of descending confidence. When a new target concept is classified, the classifier runs down the list and compares the features against the contexts of the target concept. The first matched feature is applied to make a predication. Obviously, how to measure the confidence of features is a very important issue for the decision list. We use the metric described in (Yarowsky, 1994; Golding,</Paragraph>
    </Section>
    <Section position="4" start_page="200" end_page="201" type="sub_section">
      <SectionTitle>
2.4 Slot-value Merging and Semantic Re-classification
      <Paragraph position="0"> classification The slot-value merger is to combine the slots assigned to the concepts in an input sentence. Another simultaneous task of the slot-value merger is to check the consistency among the identified slot-values. Since the topic-dependent classifiers corresponding to different concepts FRAME: ShowRoute Slots: [route].[origin].[location].( the people's square)</Paragraph>
      <Paragraph position="2"> are training and running independently, it possibly results in inconsistent predictions. Considering the preprocessed word sequence &amp;quot;Please tell me how can I go from [location]  are both classified as ShowRoute.[route].[origin]. To relieve this problem, we can use the semantic classifier based on the slot context feature. We apply the context features like, for example, &amp;quot;Show-Route.[route].[origin] within the k+- windows&amp;quot;, which tends to imply Show-Route.[route].[destination]. The literal contexts reflect the local lexical semantic dependency.</Paragraph>
      <Paragraph position="3"> The slot contexts, however, are good at capturing the long distance dependency. Therefore, when the slot-value merger finds that two or more slot-value pairs clash, it first anchors the one with the highest confidence. Then, it extracts the slot contexts for the other concepts and passes them to the semantic classification module for reclassification. If the re- classification results still clash, the dialog system can involve the user in an interactive dialog for clarity.</Paragraph>
      <Paragraph position="4"> The idea of semantic classification and re-classification can be understood as follows: it first finds the concept or slot islands (like partial parsing) and then links them together. This mechanism is well-suited for SLU since the spoken utterance usually consists of several phrases and noises (restart, repeats and filled pauses, etc) are most often between them (Ward and Issar, 1994). Especially, this phenomena and the out-of-order structures are very frequent in the spo- null As stated before, to train the classifiers for topic identification and slot-filling, we need to label each sentence in the training set against the semantic frame. Although this annotating scenario is relatively minimal, the labeling process is still time-consuming and costly. Meanwhile unlabeled sentences are relatively easy to collect. Therefore, to reduce the cost of labeling training utterances, we employ weakly supervised techniques for training the topic and semantic classifiers. null The weakly supervised training of the two classifiers is successive. Assume that a small amount of seed sentences are manually labeled against the semantic frame. We first exploit the labeled frame types (e.g. ShowRoute) of the seed sentences to train a topic classifier through the combination of active learning and selftraining. The resulting topic classifier is used to label the remaining training sentences with the corresponding topic, which are not selected by active learning. Then, we use all the sentences annotated against the semantic frame (including the seed sentences and sentences labeled by active learning) and the remaining training sentences labeled the topic to train the semantic classifiers using a practical bootstrapping technique. null</Paragraph>
    </Section>
    <Section position="5" start_page="201" end_page="202" type="sub_section">
      <SectionTitle>
3.1 Combining Active Learning and Self-training for Topic Classification
</SectionTitle>
      <Paragraph position="0"> training for Topic Classification We employ the strategy of combining active learning and self-training for training the topic classifier, which was firstly proposed in (Tur et al., 2005) and applied to a similar task.</Paragraph>
      <Paragraph position="1"> One way to reduce the number of labeling examples is active learning, which have been applied in many domains (McCallum and Nigam, 1998; Tang et al., 2002; Tur et al., 2005). Usually, the classifier is trained by randomly sampling the training examples. However, in active learning, the classifier is trained by selectively sampling the training examples (Cohn et al., 1994). The basic idea is that the most informative ones are selected from the unlabeled examples for a human to label. That is to say, this strategy tries to always select the examples, which will have the largest improvement on performance, and hence minimizes the human labeling effort whilst keeping performance (Tur et al., 2005). According to the strategy of determining the informative level of an example, the active learning approaches can be divided into two categories: uncertainty-based and committeebased. Here, we employ the uncertainty-based strategy for selective sampling. It is assumed that a small amount of labeled examples is initially available, which is used to train a basic classifier. Then the classifier is applied to the unannotated examples. Typically the most unconfident examples are selected for a human to label and then added to the training set. The classifier is re-trained and the procedure is repeated until the system performance converges.</Paragraph>
      <Paragraph position="2"> Another alternative for reducing human labeling effort is self-training. In self-training, an initial classifier is built using a small amount of annotated examples. The classifier is then used to label the unannotated training examples. The examples with classification confidence scores  over a certain threshold, together with their predicted labels, are added to the training set to re-train the classifier. This procedure repeated until the system performance converges.</Paragraph>
      <Paragraph position="3"> These two strategies are complementary and hence can be combined. The combination strategy is quite straightforward for pool-based training. At each iteration, the current classifier is applied to the examples in the current pool. The most unconfident examples in the pool are selected by active learning and labeled by a human.</Paragraph>
      <Paragraph position="4"> The remaining examples in the pool are automatically labeled by the current classifier. Then, these two parts of labeled examples are both added into the training set and used for retraining the classifier. Since the LIBSVM toolkit provides the class probability, we directly use the class probability as the confidence score. Our dynamic pool-based (the pool size is n ) algorithm of combining active learning and self-training for training the topic classifier is as follows: null  (b) Apply the current classifier to n unlabeled sentences (c) Select m examples which are most informative to the current classifier and manually label the selected m examples null (d) Add the m human-labeled examples and the remaining nm[?] machine-labeled examples to the training set t S (e) Train a new classifier on all labeled ex-</Paragraph>
    </Section>
    <Section position="6" start_page="202" end_page="203" type="sub_section">
      <SectionTitle>
3.2 Bootstrapping the Topic-dependent Semantic Classifiers
</SectionTitle>
      <Paragraph position="0"> Bootstrapping refers to a problem of inducing a classifier given a small set of labeled data and a large set of unlabeled data (Abney, 2002). It has been applied to problems such as word-sense disambiguation (Yarowsky, 1995), web-page classification (Blum and Mitchell, 1998), named-entity recognition (Collins and Singer, 1999) and automatic construction of semantic lexicon (Thelen and Riloff, 2003). The key to the bootstrapping methods is to exploit the redundancy in the unlabeled data (Collins and Singer, 1999).</Paragraph>
      <Paragraph position="1"> Thus, many language processing problems can be dealt using the bootstrapping methods since language is highly redundant (Yarowsky, 1995).</Paragraph>
      <Paragraph position="2"> The semantic classification problem here also exhibits the redundancy. In the example &amp;quot;Please tell me how can I go from [location]  of type ShowRoute.[route].[origin], such as: (1) from within the -1 windows; (2) from _ to ; (3) to within the +1 windows.</Paragraph>
      <Paragraph position="3">  If the [location]  has already be recognized as ShowRoute.[route].[destination], thus the slot context feature &amp;quot;ShowRoute.[route].[origin] within the 2+- windows&amp;quot; is also a strong evidence that [location]  is of type Show-Route.[route].[origin]. That is to say, the literal context and slot context features above effectively overdetermine the slot of a concept in the input sentence. Especially, the literal and slot context features can be seen as two natural &amp;quot;views&amp;quot; of an example from the respective of &amp;quot;Co-Training&amp;quot; (Blum and Mitchell, 1998). Our bootstrapping algorithm exploits the property of redundancy to incrementally identify the features for assigning slots of a concept, given a few annotated seed sentences.</Paragraph>
      <Paragraph position="4"> The bootstrapping algorithm is performed on</Paragraph>
      <Paragraph position="6"> the number of concepts appears in the sentences of topic</Paragraph>
      <Paragraph position="8"> (1.1) Build the two initial classifiers based on literal and slot context features respectively using a small amount of labeled seed sentences.</Paragraph>
      <Paragraph position="9"> (1.2) Apply the current classifier based on the  literal context feature to the remaining unlabeled concepts in the training sentences belong to topic</Paragraph>
      <Paragraph position="11"> classified slots with confidence score above a certain threshold (In this paper, the threshold is fixed on 0.5).</Paragraph>
      <Paragraph position="12">  2. Check the consistency of the classified slots in each sentence. If some slots in a sentence clashed, take the one with the highest confidence score among them and leave the others unlabeled.</Paragraph>
      <Paragraph position="13"> 3. For each concept</Paragraph>
      <Paragraph position="15"> classifier based on the slot context to the residual unlabeled concepts. Keep those classi- null fied slots with confidence score above a certain threshold. Repeat Step 3.</Paragraph>
      <Paragraph position="16"> 4. Augment the new classified cases into the training set and retrain the two classifiers based on literal and slot context features respectively. null 5. If new slots are classified from the training data, return to step 2. Otherwise, repeat 2-5 to label training data and keep all new classified slots regardless of the confidence score. Train the two final semantic classifiers based on the literal and context features respectively using the new labeled training data.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>