<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3003">
  <Title>Interactive Machine Learning Techniques for Improving SLU Models</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 System Overview
</SectionTitle>
    <Paragraph position="0"> In this section, we will give a system overview and show how automation has sped up and improved the existing process. The UE expert no longer needs to keep track of utterances or call type labels in spreadsheets. Our system allows the UE expert to more easily and efficiently label collected utterances in order to automatically build a SLU model and an electronic annotation guide (see System Diagram in Figure 1). The box in Figure 1 contains the new components used in the improved and more automated process of creating SLU models and annotation guides.</Paragraph>
    <Paragraph position="1"> After data collection, the Preprocessing steps (the data reduction and clustering steps are described in more detail below) reduce the data that the UE expert needs to work with thus saving time and money. The Processed Data, which initially only contains the transcribed utterances but later will also contain call types, is stored in an XML database which is used by the Web Interface. At this point, various components of the Web Interface are applied to create call types from utterances and the Processed Data (utterances and call types) continue to get updated as these changes are applied. These include the Clustering Tool to fine-tune the optimal clustering performance by adjusting the clustering threshold. Using this tool, the UE expert can easily browse the utterances within each cluster and compare the members of one cluster with those of its neighboring clusters. The Relevance Feedback component is implemented by the Call Type Editor Tool. This tool provides an efficient way to move utterances between two call types and to search relevant utterances for a specific call type. The Search Engine is used to search text in the utterances in order to facilitate the use of relevance feedback. It is also used to get a handle on utterance and</Paragraph>
    <Paragraph position="3"> call type proximity using utterance and cluster distances. null After a reasonable percentage of the utterances are populated or labeled into call types, an initial SLU model can be built and tested using the SLU Toolset. Although a reduced dataset is used for labeling (see discussion on clone families and reduced dataset below), all the utterances are used when building the SLU model in order to take advantage of more data and variations in the utterances. The UE expert can iteratively refine the SLU model. If certain test utterances are being incorrectly classified or are not providing sufficient differentiability among certain call types (the SLU metric described below is used to improve call type differentiation), then the UE expert can go back and modify the problem call types (by adding utterances from other call types or by removing utterances using the Web Interface). The updated Processed Data can then be used to rebuild the SLU model and it can be retested to ensure the desired result. This initial SLU model can also be used as a guide in determining the call flow for the application.</Paragraph>
    <Paragraph position="4"> The Reporting component of the Web Interface can automatically create the annotation guide from the  the Annotation Guide Generation Tool. If changes are made to utterances or call types, then the annotation guide can be regenerated almost instantly. Thus, this Web Interface allows the UE expert to easily and more efficiently create the annotation guide in an automated fashion unlike the manual process that was used before.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Components
</SectionTitle>
    <Paragraph position="0"> Many SLU systems require data collection and some form of utterance preprocessing and possibly utterance clustering. Our system uses relevance feedback and SLU tools to improve the SLU process.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Data Collection
</SectionTitle>
      <Paragraph position="0"> Natural language data exists in a variety of forms such as documents, e-mails, and text chat logs. We will focus here on transcriptions of telephone conversations, and in particular, on data collected in response to the first prompt from an open dialogue system. The utterances collected are typically short phrases or single sentences, although in some cases, the caller may make several statements. It is assumed that there may be multiple intents for each utterance. We have also found that the methods presented here work well when used with the one-best transcription from a large vocabulary automatic speech recognition system instead of manual transcription. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Preprocessing
</SectionTitle>
      <Paragraph position="0"> Our tools add structure to the raw collected data through a series of preprocessing steps. Utterance redundancy (and even repetition) is inherent in the collection process and this is tedious for UE experts to deal with as they examine and work with the dataset. This section describes taking the original utterance set and reducing the redundancy (using text normalization, named entity extraction, and feature extraction) and thereby the volume of data to be examined. The end product of this processing is a subset of the original utterances that represents the diversity of the input data in a concise way. Sets of identical or similar utterances are formed and one utterance is selected at random to represent each set (alternative selection methods are also possible, see the Future Work section). UE experts may choose to expand these clone families to view individual members, but the bulk of the interaction needs to only involve a single representative utterance from each set.</Paragraph>
      <Paragraph position="1"> Text Normalization There is a near continuous degree of similarity between utterances. At one extreme are exact text duplicates (data samples in which two different callers say the exact same thing). At the next level, utterances may differ only by transcription variants like &amp;quot;100&amp;quot; vs. &amp;quot;one hundred&amp;quot; or &amp;quot;$50&amp;quot; vs. &amp;quot;fifty dollars.&amp;quot; Text normalization is used to remove this variation. Moving further, utterances may differ only by the inclusion of verbal pauses or of transcription markup such as: &amp;quot;uh, eh, background noise.&amp;quot; Beyond this, for many applications it is insignificant if the utterances differ only by contraction: &amp;quot;I'd vs. I would&amp;quot; or &amp;quot;I wanna&amp;quot; vs. &amp;quot;I want to.&amp;quot; Acronym expansions can be included here: &amp;quot;I forgot my personal identification number&amp;quot; vs. &amp;quot;I forgot my P I N.&amp;quot; Up to this point it is clear that these variations are not relevant for the purposes of intent determination (but of course they are useful for training a SLU classifier). We could go further and include synonyms or synonymous phrases: &amp;quot;I want&amp;quot; vs. &amp;quot;I need.&amp;quot; Synonyms however, quickly become too powerful at data reduction, collapsing semantically distinct utterances or producing other undesirable effects (&amp;quot;I am in want of a doctor.&amp;quot;) Also, synonyms may be application specific.</Paragraph>
      <Paragraph position="2"> Text normalization is handled by string replacement mappings using regular expressions. Note that these may be represented as context free grammars and composed with named entity extraction (see below) to perform both operations in a single step. In addition to one-to-one replacements, the normalization includes many-to-one mappings (you to-null mappings (to remove noise words).</Paragraph>
      <Paragraph position="3">  Utterances that differ only by an entity value should also be collapsed. For example &amp;quot;give me extension 12345&amp;quot; and &amp;quot;give me extension 54321&amp;quot; should be represented by &amp;quot;give me extension extension_value.&amp;quot; Named entity extraction is implemented through rules encoded using context free grammars in Backus-Naur form. A library of generic grammars is available for such things as phone numbers and the library may be augmented with application-specific grammars to deal with account number formats, for example. The grammars are viewable and editable, through an interactive web interface. Note that any grammars developed or selected at this point may also be used later in the deployed application but that the named entity extraction process may also be data driven in addition to or instead of being rule based.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Feature Extraction
</SectionTitle>
      <Paragraph position="0"> To perform processing such as clustering, relevance feedback, or building prototype classifiers, the utterances are represented by feature vectors. At the simplest level, individual words can be used as features (i.e., a unigram language model). In this case, a lexis or vocabulary for the corpus of utterances is formed and each word is assigned an integer index. Each utterance is then converted to a vector of indices and the subsequent processing operates on these feature vectors. Other methods for deriving features include using bi-grams or tri-grams as features, weighting features based upon the number of times a word appears in an utterance or how unusual the word is in the corpus (TF, TF-IDF), and performing word stemming (Porter, 1980). When the dataset available for training is very small (as is the case for relevance feedback) it is best to use less restrictive features to effectively amplify the training data. In this case, we have chosen to use features that are invariant to word position, word count and word morphology and we ignore noise words. With this, the following two utterances have identical feature vector representations: * I need to check medical claim status * I need check status of a medical claim Note that while these features are very useful for the process of initially analyzing the data and defining call types, it is appropriate to use a different set of features when training classifiers with large amounts of data when building the SLU model to be fielded. In that case, tri-grams may be used, and stemming is not necessary since the training data will contain all of the relevant morphological variations.</Paragraph>
      <Paragraph position="1"> Clustering After the data reductions steps above, we use clustering as a good starting point to partition the dataset into clusters that roughly map to call types.</Paragraph>
      <Paragraph position="2"> Clustering is grouping data based on their intrinsic similarities. After the data reduction steps described above, clustering is used as a bootstrapping process to create a reasonable set of call types.</Paragraph>
      <Paragraph position="3"> In any clustering algorithm, we need to define the similarity (or dissimilarity, which is also called distance) between two samples, and the similarity between two clusters of samples. Specifically, the data samples in our task are call utterances. Each utterance is converted into a feature vector, which is an array of terms and their weights. The distance of two utterances is defined as the cosine distance between corresponding feature vectors.</Paragraph>
      <Paragraph position="4"> Assume x and y are two feature vectors, the distance d(x,y) between them is given by</Paragraph>
      <Paragraph position="6"> As indicated in the previous section, there are different ways to extract a feature vector from an utterance.</Paragraph>
      <Paragraph position="7"> The options include named entity extraction, stop word removal, word stemming, N-gram on terms, and binary or TF-IDF (Term frequency - inverse document frequency) based weights. Depending on the characteristics of the applications in hand, certain combinations of these options are appropriate. For all the results presented in this paper, we applied named entity extraction, stop word removal, word stemming, and 1-gram term with binary weights to extract the feature vectors.</Paragraph>
      <Paragraph position="8"> The cluster distance is defined as the maximum distance between any pairs of two utterances, one from each cluster. Figure 2 illustrates the definition of the  The range of utterance distance is from 0 to 1, and the range of the cluster distance is the same. When the cluster distance is 1, it means that there exists at least one pair of utterances, one from each cluster, that are totally different (sharing no common term).</Paragraph>
      <Paragraph position="9"> The clustering algorithm we adopted is the Hierarchical Agglomerative Clustering (HAC) method. The details of agglomerative hierarchical clustering algorithm can be found in (Jan and Dubes, 1988). The following is a brief description of the HAC procedure.</Paragraph>
      <Paragraph position="10"> Initially, each utterance is a cluster on its own. Then, for each iteration, two clusters with a minimum distance value are merged. This procedure continues until the minimum cluster distance exceeds a preset threshold.</Paragraph>
      <Paragraph position="11"> The principle of HAC is straightforward, yet the computational complexity and memory requirements may be high for large size datasets. We developed an efficient implementation of HAC by on-the-fly cluster/utterance distance computation and by keeping track of the cluster distances from neighboring clusters, such that the memory usage is effectively reduced and the speed is significantly increased.</Paragraph>
      <Paragraph position="12"> Our goal is to partition the dataset into call types recognized by the SLU model and the clustering results provide a good starting point. It is easier to transform a set of clusters into call types than to create call types directly from a large set of flat data. Depending on the distance threshold chosen in the clustering algorithm, the clustering results may either be conservative (with small threshold) or aggressive (with large threshold). If the clustering is conservative, the utterances of one call type may be scattered into several clusters, and the UE expert has to merge these clusters to create the call type. On the other hand, if the cluster is aggressive, there may be multiple call types in one cluster, and the UE expert needs to manually split the mixture cluster into different call types. In real applications, we tend to set a relatively low threshold since it is easier to merge small homogeneous clusters than to split one big heterogeneous cluster.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Relevance Feedback
</SectionTitle>
      <Paragraph position="0"> Although clustering provides a good starting point, finding all representative utterances belonging to one call type is not a trivial task. Effective data mining tools are desirable to help the UE expert speed up this manual procedure. Our solution is to provide a relevance feed-back mechanism based on support vector machine (SVM) techniques for the UE expert to perform this tedious task.</Paragraph>
      <Paragraph position="1"> Relevance feedback is a form of query-free retrieval where documents are retrieved according to a measure of relevance to given documents. In essence, a UE expert indicates to the retrieval system that it should retrieve &amp;quot;more documents like the ones desired, not the ones ignored.&amp;quot; Selecting relevant documents based on UE expert's inputs is basically a classification (relevant/irrelevant) problem. We adopted support vector machine as the classifier for to two reasons: First, SVM efficiently handles high dimensional data, especially a text document with a large vocabulary. Second, SVM provides reliable performance with small amount of training data. Both advantages perfectly match the task at hand. For more details about SVM, please refer to (Vapnik, 1998; Drucker et al, 2002).</Paragraph>
      <Paragraph position="2"> Relevance feedback is an iterative procedure. The UE expert starts with a cluster or a query result by certain keywords, and marks each utterance as either a positive or negative utterance for the working call type.</Paragraph>
      <Paragraph position="3"> The UE expert's inputs are collected by the relevance feedback engine, and they are used to build a SVM classifier that attempts to capture the essence of the call type. The SVM classifier is then applied to the rest of the utterances in the dataset, and it assigns a relevance score for each utterance. A new set of the most relevant utterances are generated and presented to the UE expert, and the second loop of relevance feedback begins. During each loop, the UE expert does not need to mark all the given utterances since the SVM is capable of building a reasonable classifier based on very few, e.g., 10, training samples. The superiority of relevance feedback is that instead of going through all the utterances one by one to create a specific call type, the UE expert only needs to check a small percentage of utterances to create a satisfactory call type.</Paragraph>
      <Paragraph position="4">  The relevance feedback engine is implemented by the Call Type Editor Tool. This tool provides an integrated environment for the UE expert to create a variety of call types and assign relevant utterances to them.</Paragraph>
      <Paragraph position="5"> The tool provides an efficient way to move utterances between two call types and to search relevant utterances for a specific call type. The basic search function is to search a keyword or a set of keywords within the dataset and retrieve all utterances containing these search terms. The UE expert can then assign these utterances into the appropriate call types. Relevance feedback serves as an advanced searching option. Relevance feedback can be applied to the positive and negative utterances of a clus- null ter or call type or can be applied to utterances, from a search query, which are marked as positive or negative.</Paragraph>
      <Paragraph position="6"> The interface for the relevance feedback is shown in Figure 3. In the interface, the UE expert can mark the utterances as positive or negative samples. The UE expert can also control the threshold of the relevance value such that the relevance feedback engine only returns utterances with high enough relevance values. In the tool, we are using an internally developed package for learning large margin classifiers to implement the SVM classifier (Haffner et al, 2003).</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 SLU Toolset
</SectionTitle>
      <Paragraph position="0"> The SLU toolset is based on an internally developed NLU Toolset. The underlying boosting algorithm for text classification used, BoosTexter, is described elsewhere (Freund and Schapire, 1999; Schapire and Singer, 2000; Rochery et al, 2002). We added interactive input and display capabilities via a Web interface allowing the UE expert to easily build and test SLU models.</Paragraph>
      <Paragraph position="1"> Named entity grammars are constructed as described above. About 20% of the labeled data is set aside for testing. The remaining data is used to build the initial SLU model which is used to test the utterances set aside for testing. The UE expert can interactively test utterances typed into a Web page or can evaluate the test results of the test data. For each of the tested utterances in the test data, test logs show the classification confidence scores for each call type. The confidence scores are replaced by probability thresholds that have been computed using a logistic function. These scores are then used to calculate a simple metric which is a measure of call type differentiability. If the test utterance labeled by the UE expert is correctly classified, then the call type is the truth call type. The SLU metric is calculated as follows and it is averaged over the utterances: * if the call type is the truth, the score is the difference (positive) between the truth probability and the next highest probability * if the call type is not the truth, the score is the difference (negative) between the truth probability and the highest probability This metric allows the UE expert to easily spot problem call types or those that might give potential problems in the field. It is critical that call types are easily differentiable in order to properly route the call. The UE expert can iteratively build and test the initial SLU models until the UE expert has a set of self-consistent call types before creating the final annotation guide.</Paragraph>
      <Paragraph position="2"> The final annotation guide would then be used by the labelers to label all the utterance data needed to build the final SLU model. Thus, the SLU Toolset is critical for creating the call types defined in the annotation guide which in turn is needed to label the data for creating the final SLU.</Paragraph>
      <Paragraph position="3"> Alternatively, the labeled utterances can easily be exported in a format compatible with the internally developed NLU Toolset if further SLU model tuning is to be performed by the NLU expert using just the command line interface.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.5 Reporting
</SectionTitle>
    <Paragraph position="0"> One of the reporting components is the Annotation Guide Generation Tool. The UE expert can use this at any time to automatically generate the annotation guide from the Processed Data. Other reporting components include summary statistics and spreadsheets containing utterance and call type information.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The performance of the preprocessing techniques has been evaluated on several datasets from various industry sectors. Approximately 10,000 utterances were collected for each application and the results of the data reduction at each processing stage are shown in Table 1.</Paragraph>
    <Paragraph position="1"> The Redundancy R is given by</Paragraph>
    <Paragraph position="3"> where U is the number of unique utterances after feature extraction and N is the number of original utterances.</Paragraph>
    <Paragraph position="4">  Initial UE experts of the tools have been successful in producing annotation guides more quickly and with very good initial F-measures.</Paragraph>
    <Paragraph position="6"> They have also reported that the task is much less tedious and that they have done a better job of covering all of the significant utterance clusters. Further studies are required to generate quantitative measures of the performance of the toolset.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"> In the future, the system could be improved using other representative utterance selection algorithms (e.g., selecting the utterance with the minimum string edit distance to all others).</Paragraph>
    <Paragraph position="1"> The grammars for entity extraction were not tuned for these applications and it is expected that further data reduction will be obtained with improved grammars.</Paragraph>
  </Section>
class="xml-element"></Paper>