<?xml version="1.0" standalone="yes"?> <Paper uid="J99-3003"> <Title>Lucent Technologies Bell Laboratories</Title> <Section position="4" start_page="364" end_page="365" type="metho"> <SectionTitle> 5 Although the call center had nearly 100 departments, in our corpus of 4,500 calls, only 23 departments </SectionTitle> <Paragraph position="0"> received more than 10 calls. We chose to base our experiments on these 23 destinations. 6 In most calls, we analyzed the utterances given in the operator's second turn in the dialogue. However, in situations where the operator generates an acknowledgment, such as uh-huh, midway through the caller's request, we analyzed utterances in the next operator turn.</Paragraph> </Section> <Section position="5" start_page="365" end_page="385" type="metho"> <SectionTitle> 4. Vector-based Call Routing </SectionTitle> <Paragraph position="0"> In addition to notifying the caller of a selected destination or querying the caller for further information, an automatic call router should be able to identify when it is unable to handle a call and route the call to a human operator for further processing.</Paragraph> <Paragraph position="1"> The process of determining whether to route a call, generate a disambiguation query, or redirect the call to an operator is carried out by two modules in our system, the routing module and the disambiguation module, as shown in Figure 3. Given a caller request, the routing module selects a set of candidate destinations to which it believes the call can reasonably be routed. If there is exactly one such destination, the call is routed to that destination and the caller is notified; if there is no appropriate destination, the call is sent to an operator; and if there are multiple candidate destinations, the disambiguation module is invoked. In the last case, the disambiguation module attempts to formulate a query that it believes will solicit relevant information from the caller to allow the revised request to be routed to a unique destination. If such a query is successfully formulated, it is posed to the caller, and the system makes another attempt at routing the revised request, which includes the original request and the caller's response to the follow-up question; otherwise, the call is sent to a human operator.</Paragraph> <Paragraph position="2"> 7 Note that the corpus analysis described in this section was conducted with the purpose of determining guidelines for system design in order to achieve reasonable coverage of phenomena in actual human-human dialogues. The call classification schemes presented in this section do not come into play in the actual training or testing of our system, nor do we discard any part of our training corpus as a result of this analysis.</Paragraph> [Figure 4: Two-dimensional vector representation for the routing module.] <Paragraph position="3"> Our approach to call routing is novel in its application of vector-based information retrieval techniques to the routing process, and in its extension of the vector-based representation for dynamically generating disambiguation queries (Chu-Carroll and Carpenter 1998). The routing and disambiguation mechanisms are detailed in the following sections.</Paragraph> <Section position="1" start_page="366" end_page="375" type="sub_section"> <SectionTitle> 4.1 The Routing Module </SectionTitle> <Paragraph position="0"> In vector-based information retrieval, the database contains a large collection of documents, each of which is represented as a vector in n-dimensional space.
Given a query, a query vector is computed and compared to the existing document vectors, and those documents whose vectors are similar to the query vector are returned. We apply this technique to call routing by treating each destination as a document, and representing the destination as a vector in n-dimensional space. Given a caller request, an n-dimensional request vector is computed. The similarity between the request vector and each destination vector is then computed and those destinations that are close to the request vector are then selected as the candidate destinations. This vector representation for destinations and query is illustrated in a simplified two-dimensional space in Figure 4.</Paragraph> <Paragraph position="1"> In order to carry out call routing with the aforementioned vector representation, three issues must be addressed. First, we must determine the vector representation for each destination within the call center. Once computed, these destination vectors should remain constant as long as the organization of the call center remains unchanged. 8 Second, we must determine how a caller request will be mapped to the same vector space for comparison with the destination vectors. Finally, we must decide how the similarity between the request vector and each destination vector will be measured in order to select candidate destinations.</Paragraph> <Paragraph position="2"> termine the values of the destination vectors (and term vectors) that will subsequently be used in the routing process. Our training process, depicted in Figure 5, requires a corpus of transcribed calls, each of which is routed to the appropriate destination. 9 These routed calls are processed by five domain-independent procedures to obtain the desired document (destination) and term vectors.</Paragraph> <Paragraph position="3"> Document Construction. Since our goal is to represent each destination as an n-dimensional vector, we must create one (virtual) document per destination. The document for a destination contains the raw text of the callers' contributions in all calls routed to that destination, since these are the utterances that provided vital information for routing purposes. For instance, the document for deposit services may contain utterances such as I want to check the balance in my checking account and I would like to stop payment on a check. In our experiments, the corpus contains 3,753 calls routed to 23 destinations. 1deg 8 One may consider allowing the call router to constantly update the destination vectors as new data are being collected while the system is deployed. We leave adding learning capabilities to the call router for future work. 9 The transcription process can be carried out by humans or by an automatic speech recognizer. In the experiments reported in this paper, we used human transcriptions. 10 These calls are a subset of the 4,500 calls used in our corpus analysis. We included calls of all semantic types, but excluded calls to destinations that were not represented by more than 10 calls, as well as ambiguous calls that were not resolved by the operator.</Paragraph> <Paragraph position="4"> Chu-Carroll and Carpenter Vector-based Natural Language Call Routing Morphological Filtering and Stop Word Filtering. For routing purposes, we are concerned with the semantics of the words present in a document, but not with the morphological forms of the words themselves. 
Thus we filter each (virtual) document, produced by the document construction process, through the morphological processor of the Bell Labs Text-to-Speech synthesizer (Sproat 1998) to extract the root form of each word in the corpus. This process will reduce singulars, plurals, and gerunds to their root forms, such as reducing service, services, and servicing to the root service. Also, the various verb forms are also reduced to their root forms, such as reducing going, went, and gone to go. 11 Next, the root forms of caller utterances are filtered through two lists, the ignore list and the stop list, in order to build more accurate n-gram term models for subsequent processing. The ignore list consists of noise words, which are common in spontaneous speech and can be removed without altering the meaning of an utterance, such as um and uh. These words sometimes get in the way of proper n-gram extraction, as in I'd like to speak to someone about a car uh loan. When the noise word uh is filtered out of the utterance, we can then properly extract the bigram car+loan.</Paragraph> <Paragraph position="5"> The stop list enumerates words that are ubiquitous and therefore do not contribute to discriminating between destinations, such as the, be, for, and morning. We modified the standard stop list distributed with the SMART information retrieval system (Salton 1971) to include domain-specific terms and proper names that occurred in our training corpus. 12 Note that when a word on the ignore list is removed from an utterance, it allows words preceding and succeeding the removed word to form n-grams, such as car+loan in the example above. On the other hand, when a stop word is removed from an utterance, a placeholder is inserted into the utterance to prevent the words preceding and following the removed stop word from forming n-grams. For instance, after stop word filtering, the caller utterance I want to check on an account becomes (sw) (sw) (sw) check (sw) (sw) account, resulting in the two unigrams check and account. Without the placeholders, we would extract the bigram check+account, just as if the caller had used the term checking account in the utterance.</Paragraph> <Paragraph position="6"> In our experiments, the ignore list contains 25 words, which are variations of common transcriptions of speech disfluencies, such as ah, aah, and ahh. The stop list contains over 1,200 words, including function words, proper names, greetings, etc.</Paragraph> <Paragraph position="7"> Term Extraction. The output of the filtering processes is a set of documents, one for each destination, containing the root forms of the content words extracted from the raw texts originally in each document. In order to capture word co-occurrence, n-gram terms are extracted from the filtered texts. First, a list of n-gram terms and their counts are generated from all filtered texts. Thresholds are then applied to the n-gram counts to select as salient terms those n-gram terms that occurred sufficiently frequently. Next, these salient terms are used to reduce the filtered text for each document to a bag of salient terms, i.e., a collection of n-gram terms along with their respective counts.</Paragraph> <Paragraph position="8"> Note that when an n-gram term is extracted, all of the lower order k-grams, where 1<k<n, are also extracted. 
For instance, the word sequence checking account balance will result in the trigram check+account+balance, as well as the bigrams check+account and account+balance and the unigrams check, account, and balance.</Paragraph> <Paragraph position="9"> 11 Not surprisingly, confusion among morphological variants was a source of substantial error from the recognizer. Details can be found in Reichl et al. (1998). 12 The idea of a standard stop list in the information retrieval literature is to eliminate terms that do not contribute to discriminating among documents. We extend this notion to our application to include additional proper names such as Alaska and Houston, as well as domain- or application-specific terms such as bye and cadet. The modification to the standard stop list is performed by manually examining the unigram terms extracted from the training corpus.</Paragraph> <Paragraph position="10"> Computational Linguistics Volume 25, Number 3 In our experiments, we selected as salient terms unigrams that occurred at least twice and bigrams and trigrams that occurred at least three times. This resulted in 62 trigrams, 275 bigrams, and 420 unigrams. In our training corpus, no four-gram occurred three times. Manual examination of these n-gram terms indicates that almost all of the selected salient terms are relevant for routing purposes. 13 Term-Document Matrix Construction. Once the bag of salient terms for each destination is constructed, it is very straightforward to construct an m x n term-document frequency matrix A, where m is the number of salient terms, n is the number of destinations, and an element at,d represents the number of times the term t occurred in calls to destination d. This number indicates the degree of association between term t and destination d, and our underlying assumption is that if a term occurred frequently in calls to a destination in our training corpus, then occurrence of that term in a caller's request indicates that the call should be routed to that destination.</Paragraph> <Paragraph position="11"> In the term-document frequency matrix A, a row At is an n-dimensional vector representing the term t, while a column Ad is an m-dimensional vector representing the destination d. However, by using the raw frequency counts as the elements of the matrix, more weight is given to terms that occurred more often in the training corpus than to those that occurred less frequently. For instance, a unigram term such as account, which occurs frequently in calls to multiple destinations will have greater frequency counts than say, the trigram term social+security+number. As a result, when the two vectors representing account and social+security+number are combined, as will be done in the routing process, the term vector for account contributes more to the combined vector than that for social+security+number. In order to balance the contribution of each term, the term-document frequency matrix is normalized so that each term vector is of unit length (later weightings do not preserve this normalization, though). 
Let B be the result of normalizing the term-document frequency matrix, whose elements are given as follows: B_t,d = A_t,d / sqrt(sum_d' A_t,d'^2).</Paragraph> <Paragraph position="13"> Our second weighting is based on the notion that a term that only occurs in a few documents is more important in routing than a term that occurs in many documents.</Paragraph> <Paragraph position="14"> For instance, the term stop+payment, which occurred only in calls to deposit services, should be more important in discriminating among destinations than check, which occurred in many destinations. Thus, we adopted the inverse document frequency (IDF) weighting scheme (Sparck Jones 1972), whereby a term is weighted inversely to the number of documents in which it occurs. This score is given by: IDF(t) = log2(n/d(t)),</Paragraph> <Paragraph position="16"> where t is a term, n is the number of documents in the corpus, and d(t) is the number of documents containing the term t. If t only occurred in one document, IDF(t) = log2 n; if t occurred in every document, IDF(t) = log2 1 = 0. Thus, using this weighting scheme, terms that occur in every document will be eliminated. 14 We weight the matrix B by multiplying each row t by IDF(t) to arrive at the matrix C: C_t,d = IDF(t) * B_t,d.</Paragraph> <Paragraph position="17"> 13 It would have been possible to hand-edit the set of n-gram terms at this point to remove unwanted terms. The results we report in this paper use the automatically selected terms without any hand-editing. 14 To preserve all terms, we could have used a common variant of the IDF weighting where IDF(t) = log2((n + epsilon)/d(t)) for some nonnegative epsilon.</Paragraph> <Paragraph position="18"> Singular Value Decomposition and Vector Representation. In the weighted term-document frequency matrix C, terms are represented as n-dimensional vectors (in our system, n = 23), and destinations are represented as m-dimensional vectors (in our system, m = 757). In order to provide a uniform representation of term and document vectors and to reduce the dimensionality of the document vectors, we applied the singular value decomposition to the m x n matrix C (Deerwester et al. 1990) to obtain: 15 C = U * S * V^T, where r is the rank of C and the diagonal elements of S (the singular values of C) are arranged in descending order S_1,1 >= S_2,2 >= ... >= S_r,r > 0.</Paragraph> <Paragraph position="19"> Figure 6 illustrates the results of singular value decomposition according to the above equation. The shaded portions of the matrices are what we use as the basis for our term and document vector representations, as follows:</Paragraph> <Paragraph position="21"> Ur is an m x r matrix, in which each row forms the basis of our term vector representation; Vr is an n x r matrix, in which each row forms the basis of our document vector representation; and Sr is an r x r positive diagonal matrix whose values are used for appropriate scaling in the term and document vector representations. The actual representations of the term and document vectors are Ur and Vr scaled (or not) by elements in Sr, depending on whether the representation is intended for comparisons between terms, between documents, or between a term and a document.
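To make the training computation above concrete, the following sketch builds the term-document frequency matrix A, the row-normalized matrix B, the IDF-weighted matrix C, and its singular value decomposition with NumPy. It is an illustration only: the variable names and the tiny example counts are ours, not the paper's, and the real matrix is 757 x 23.

    import numpy as np

    # Term-document frequency matrix A (m salient terms x n destinations).
    # The counts below are illustrative only.
    A = np.array([[12.,  0.,  3.],
                  [ 0.,  7.,  1.],
                  [ 5.,  2.,  9.]])

    # B: normalize each term (row) vector of A to unit length.
    B = A / np.linalg.norm(A, axis=1, keepdims=True)

    # C: weight each row by its inverse document frequency, IDF(t) = log2(n/d(t)).
    n_docs = A.shape[1]
    d_t = (A > 0).sum(axis=1)            # number of destinations containing each term
    C = np.log2(n_docs / d_t)[:, None] * B

    # Singular value decomposition C = U S V^T, keeping the first r singular values.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    r = int(np.linalg.matrix_rank(C))
    U_r, S_r, V_r = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

    term_vectors = U_r                   # rows: term representation (Ur space)
    doc_vectors = V_r @ S_r              # rows: destination representation, scaled by Sr

The rows of doc_vectors play the role of the destination vectors that requests are compared against at routing time.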
For instance, since the similarity between two documents can be measured by the dot product between vectors representing the two documents (Salton 1971), and C^T * C contains the dot products of all pairwise column vectors in the weighted term-document frequency matrix C, the similarity between the ith and jth documents can be obtained as the (i, j)th element of C^T * C. Since U is orthonormal, C^T * C = (U * S * V^T)^T * (U * S * V^T) = (V * S) * (V * S)^T. 15 Our original intent was to apply singular value decomposition and to reduce the dimensionality of the resulting vectors below the rank of the original matrix in order to carry out latent semantic indexing (Deerwester et al. 1990). Benefits cited for latent semantic indexing include the clustering of "synonyms" leading to improved recall. In the end, dimensionality reduction degraded performance for our data. However, our method is not equivalent to the standard approach in vector-based information retrieval, which simply uses the rows or columns of the term-document matrix (see Salton \[1971\] for definitions of the standard case). The difference arises through the cosine measure of similarity, which is operating over different spaces in the standard vector case and with the result of SVD in our case. Although we did not run the experiments, we believe similar results would be obtained by using cosine to compare rows or columns of the term-document matrix directly.</Paragraph> Because only the first r diagonal elements of S are nonzero, we have: (V * S) * (V * S)^T = (Vr * Sr) * (Vr * Sr)^T. The above equations suggest that scaling the vectors Vr with elements in Sr, i.e., representing documents as row vectors in Vr * Sr, facilitates comparisons between documents. The same reasoning holds for representing terms as row vectors in Ur * Sr for comparisons between terms, although in this particular application, we are not interested in term-term comparisons.</Paragraph> <Paragraph position="23"> To measure the degree of association between a term and a document, we look up an element in the weighted term-document frequency matrix. Because S is a diagonal matrix with only the first r elements nonzero, we have: C = U * S * V^T = Ur * Sr * Vr^T = Ur * (Vr * Sr)^T.</Paragraph> <Paragraph position="25"> Therefore, representing terms simply by row vectors in Ur and documents by row vectors in Vr * Sr allows us to make comparisons between documents, as well as between terms and documents.</Paragraph> <Paragraph position="26"> 4.1.3 Call Routing. As discussed earlier, two subprocesses need to be carried out during the call routing process. First, a pseudodocument vector must be constructed to represent the caller's request in order to facilitate the comparisons between the request and each document vector. Second, a method for comparison must be established to measure the similarity between the pseudodocument vector and the document vectors in Vr * Sr, and a threshold must be determined to allow for the selection of candidate destinations.</Paragraph> <Paragraph position="27"> Pseudodocument Generation. Given a caller utterance (either in text form from a keyboard interface or as the output from an automatic speech recognizer), we first perform the morphological and stop word filtering and the term extraction procedures as in the training process to extract the relevant n-gram terms from the utterance.
Since higher-level n-gram terms are, in general, better indicators of potential destinations, we further allow trigrams to contribute more to constructing the pseudodocument than bigrams, which in turn contribute more than unigrams. Thus we assign a weight w3 to trigrams, w2 to bigrams, and w1 to unigrams, 16 and each extracted n-gram term is then weighted appropriately to create a bag of terms in which each extracted n-gram term occurs wn times. As a result, when we construct a pseudodocument from the bag of terms, we get the effect of weighting each n-gram term by wn.</Paragraph> <Paragraph position="28"> Given the extracted n-gram terms, we can represent the request as an m x 1 vector Q where each element Qi in the vector represents the number of times the ith term occurred in the bag of terms. The vector Q is then added as an additional column vector in our original weighted term-document frequency matrix C, as shown in Figure 7, and we want to find the new corresponding column vector in V, Vq, that represents the pseudodocument in the reduced r-dimensional space. Since U is orthonormal and S is a diagonal matrix, we can solve for Vq by setting Q = U * S * Vq^T. Because we want a representation of the query in the document space, we transpose Q to yield: Q^T = Vq * S * U^T.</Paragraph> <Paragraph position="30"> Finally, multiplying both sides on the right by U, we have: Q^T * U = Vq * S * U^T * U = Vq * S.</Paragraph> <Paragraph position="32"> Note also that for our query representation in the document space, we have Q^T * U = Q^T * Ur and Vq * S = Vq * Sr. Vq * Sr is a pseudodocument representation for the caller utterance in r-dimensional space, and is scaled appropriately for comparison between documents. This vector representation is simply obtained by multiplying Q^T and Ur, or equivalently, summing the vector representing each term in the bag of n-gram terms.</Paragraph> <Paragraph position="33"> Candidate Destination Selection. Once the pseudodocument vector representing the caller request is computed, we measure the similarity between each document vector in Vr * Sr and the pseudodocument vector. There are a number of ways one may measure the similarity between two vectors, such as using the cosine score between the vectors, the Euclidean distance between the vectors, the Manhattan distance between the vectors, etc. We follow the standard technique adopted in the information retrieval community and select the cosine score as the basis for our similarity measure. The cosine score between two n-dimensional vectors x and y is given as follows:</Paragraph> <Paragraph position="35"> cos(x, y) = (x . y) / (||x|| * ||y||), where x . y is the dot product of x and y and ||x|| is the Euclidean length of x.</Paragraph> <Paragraph position="37"> Using cosine reduces the contribution of each vector to its angle by normalizing for length. Thus the key in maximizing cosine between two vectors is to have them point in the same direction. However, although the raw vector cosine scores give some indication of the closeness of a request to a destination, we noted that the absolute value of such closeness does not translate directly into the likelihood for correct routing.</Paragraph> <Paragraph position="38"> Instead, some destinations may require a higher cosine value, i.e., a closer degree of similarity, than others in order for a request to be correctly associated with those destinations. We applied the technique of logistic regression (see Lewis and Gale \[1994\]) in order to transform the cosine score for each destination using a sigmoid function specifically fitted for that destination.
This allows us to obtain a score that represents the router's confidence that the call should be routed to that destination.</Paragraph> <Paragraph position="39"> From each call in the training data, we generate, for each destination, a cosine value/routing value pair, where the cosine value is that between the destination vector and the request vector, and the routing value is 1 if the call was routed to that destination in the training data and 0 otherwise. Thus, for each destination, we have a set of cosine value/routing value pairs equal to the number of calls in the training data.</Paragraph> <Paragraph position="40"> The subset of these value pairs whose routing value is 1 will be equal to the number of calls routed to that destination in the training set. Then, we used least squared error to fit a sigmoid function, 1/(1 + e^-(ax+b)), to the set of cosine value/routing value pairs. 17 Assuming da and db are the coefficients of the fitted sigmoid function for destination d, we have the following confidence function for a destination d and cosine value x: Conf(da, db, x) = 1/(1 + e^-(da*x+db)). Thus the score for a request and a destination d is Conf(da, db, cos(r, d)), where d is the vector corresponding to destination d and r is the vector corresponding to the request. To obtain a preliminary evaluation of the effectiveness of cosine vs. confidence scores, we tested routing performance on transcriptions of 307 unseen unambiguous requests. In each case, we selected the destination with the highest cosine/confidence score to be the target destination. Using raw cosine scores, 92.2% of the calls are routed to the correct destination. On the other hand, using sigmoid confidence fitting, 93.5% of the calls are correctly routed. This yields an error reduction rate of 16.7%, illustrating the advantage of transforming the raw cosine scores to more uniform confidence scores that allow for more accurate comparisons between destinations.</Paragraph> <Paragraph position="41"> 17 Maximum likelihood fitting is often used rather than least squared error in logistic regression.</Paragraph> [Figure 8: Router performance vs. confidence threshold.] <Paragraph position="42"> We can compute the kappa statistic for the pure routing component of our system using the accuracies given above. 18 Recall that kappa is defined by (a - e)/(1 - e), where a is the system's accuracy and e is the expected agreement by chance. Selecting destinations from the prior distribution and guessing destinations using the same distribution leads to a chance performance of e = sum_d P(d)^2 = 0.2858, where the summation is over all destinations d and P(d) is the percentage of calls routed to destination d. The resulting kappa score is (0.935 - 0.2858)/(1 - 0.2858) = 0.909.</Paragraph> <Paragraph position="43"> Once we have obtained a confidence value for each destination, the final step in the routing process is to compare the confidence values to a predetermined threshold and return those destinations whose confidence values are greater than the threshold as candidate destinations. To determine the optimal value for this threshold, we ran a series of experiments to compute the upper bound and lower bound of the router's performance by varying the threshold from 0 to 0.9 at 0.1 intervals.
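The run-time routing step described above can be sketched as follows. This is a minimal illustration, not the deployed system: the function and argument names are ours, the per-destination sigmoid coefficients are assumed to have been fitted beforehand (for example, by the least-squared-error fit described in the text), and U_r and doc_vectors are the matrices from the training sketch above.

    import numpy as np

    def cosine(x, y):
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / denom) if denom else 0.0

    def route(request_terms, term_index, U_r, doc_vectors, sigmoid_coeffs,
              threshold=0.2):
        """Return {destination: confidence} for destinations above the threshold.

        request_terms:  dict mapping each extracted n-gram term to its weight
                        (w3 for trigrams, w2 for bigrams, w1 for unigrams).
        term_index:     dict mapping each salient term to its row index in U_r.
        sigmoid_coeffs: per-destination (a, b) coefficients.
        """
        # Pseudodocument vector Q^T * Ur: the weighted sum of the term vectors.
        q = np.zeros(U_r.shape[1])
        for term, weight in request_terms.items():
            if term in term_index:
                q += weight * U_r[term_index[term]]

        candidates = {}
        for dest, d_vec in enumerate(doc_vectors):      # rows of Vr * Sr
            a, b = sigmoid_coeffs[dest]
            conf = 1.0 / (1.0 + np.exp(-(a * cosine(q, d_vec) + b)))
            if conf > threshold:
                candidates[dest] = conf
        return candidates

If the returned dictionary contains exactly one destination the call is routed; if it is empty the call goes to an operator; otherwise the disambiguation module is invoked.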
The lower bound represents the percentage of calls that are routed correctly, while the upper bound indicates the percentage of calls that have the potential to be routed correctly after disambiguation (see Section 5 for details on upper bound and lower bound measures).</Paragraph> <Paragraph position="44"> Figure 8 illustrates the results of this set of experiments and shows that a threshold of 0.2 yields optimal performance. Thus we adopt 0.2 as our confidence threshold for selecting candidate destinations in the rest of our discussion.</Paragraph> <Paragraph position="45"> To illustrate the routing process, suppose the caller responds to the operator's prompt with I am calling to apply for a new car loan. First the caller's utterance is passed through morphological filtering to obtain the root forms of the words in the utterance, resulting in I am call to apply for a new car loan. Next, words on the stop list are removed and replaced with a placeholder, resulting in (sw) (sw) call (sw) apply (sw) (sw) new car loan. From the filtered utterance, the router extracts the salient n-gram terms to form a bag of terms as follows: new+car+loan, new+car, car+loan, call, apply, new, car, and loan. A request vector is then computed by taking the weighted sum of the term vectors representing the salient n-gram terms, and the cosine value between this request vector and each destination vector is computed. The cosine value for each destination is subsequently transformed using the destination-specific sigmoid function to obtain a confidence score for each destination. 18 See Siegel and Castellan (1988) and Carletta (1996) for a definition or discussion of the kappa statistic, and Walker et al. (1998) for an application of the kappa statistic to dialogue evaluation.</Paragraph> [Figure 10: Two-dimensional vector representation for the disambiguation module.] <Paragraph position="46"> Figures 9(a) and 9(b) show the cosine scores and the confidence scores for the top five destinations, respectively. Given a confidence threshold of 0.2, the only candidate destination selected is Consumer Lending. Thus, the caller's request is routed unambiguously to that destination.</Paragraph> </Section> <Section position="2" start_page="375" end_page="379" type="sub_section"> <SectionTitle> 4.2 The Disambiguation Module </SectionTitle> <Paragraph position="0"> When the routing module returns more than one candidate destination, the disambiguation module is invoked. The disambiguation module attempts to formulate an appropriate query to solicit further information from the caller to determine a unique destination to which the call should be routed. As discussed earlier, this occurs when two or more destination vectors are close to the request vector, as illustrated in reduced two-dimensional space in Figure 10. In the example, the caller's request car loans please is ambiguous since the caller does not specify whether he is interested in existing or new car loans.</Paragraph> <Paragraph position="1"> Therefore, the vector representation for the request falls between the vectors representing the two candidate destinations, Consumer Lending and Loan Services, and is close to both of them. The goal of the disambiguation process is to solicit an n-gram term from the caller so that when the vector representing this new n-gram term is added to the original request vector, the refined request vector will be unambiguously routed to one of the two candidate destinations.
In terms of our vector representation, this means that our goal is to find term vectors that are close to the differences between the candidate destination vectors and the request vector, i.e., the highlighted vectors in Figure 10. These difference vectors, which are simply the pairwise differences of elements in each vector and are dynamically generated from the destination and request vectors, form the basis from which the disambiguation queries will be generated.</Paragraph> <Paragraph position="2"> n-gram terms from which the query will be generated. The subset of n-gram terms are those related to the original query that can likely be used to disambiguate among the candidate destinations. They are chosen by filtering all n-gram terms based on the following three criteria, as shown in Figure 11: Closeness Since the goal of the disambiguation process is to solicit terms whose corresponding vectors are close to the difference vectors, the first step in the term selection process is to compare each n-gram term vector with the difference vectors and select those n-gram term vectors that are close to the difference vectors by the cosine measure. Since both the destination vectors and the request vector are scaled for document-document comparison in Vr * Sr space, the difference vectors are also represented in Vr * SF space. As discussed in Section 4.1.2, documents represented in Vr * Sr space are suitable for comparison with terms represented in Ur space. In our system, for each difference vector, we compute the cosine score between the difference vector and each term vector, and select the 30 terms with the highest cosine scores as the set of close terms. The reasons for selecting a threshold on the number of terms instead of on the cosine score are twofold. First, in situations where many term vectors are close to the difference vector, we avoid generating an overly large set of close terms but instead focus on a smaller set of most promising terms. Second, in situations where few term vectors are close to the difference vector, we still select a set of close terms in the hope that they may contribute to formulating a reasonable query, instead of giving up on the disambiguation process outright.</Paragraph> <Paragraph position="3"> Relevance From the set of close terms, we select a set of relevant terms, which are terms that further specify a term in the original request. If a term in the set of close terms can be combined with a term in the original request to form a valid n-gram term, then the resulting n-gram term is considered a relevant term. For instance, if car+loan is a term in the original request, then both new and new+car would produce the Computational Linguistics Volume 25, Number 3 relevant term new+car+loan. This mechanism for selecting relevant terms allows us to focus on selecting n-gram terms for noun phrase disambiguation by eliminating close terms that are semantically related to underspecified n-gram noun phrases in the original request but do not contribute to further disambiguating the noun phrases.</Paragraph> <Paragraph position="4"> Disambiguating power The final criterion that we use for term selection is to restrict attention to relevant terms that can be added to the original request to result in an unambiguous routing decision using the routing mechanism described in Section 4.1.3. 
In other words, we augment the bag of n-gram terms extracted from the original request with each relevant term, and the routing module is invoked to determine if this augmented set of n-gram terms can be unambiguously routed to a unique destination. The set of relevant terms with disambiguating power then forms the set of selected terms from which the system's query will be formulated. If none of the relevant terms satisfy this criterion, then we include all relevant terms. Thus, instead of giving up the disambiguation process when no one term is predicted to resolve the ambiguity, the system poses a question to solicit information from the caller to move the original request one step toward being an unambiguous request.</Paragraph> <Paragraph position="5"> After the first disambiguation query is answered, the system subsequently selects a new set of terms from the refined, though still ambiguous, request and formulates a follow-up disambiguation query. 19 The result of this selection process is a finite set of terms that are relevant to the original ambiguous request and, when added to it, are likely to resolve the ambiguity. The actual query is formulated based on the number of terms in this set as well as features of the selected terms. As shown in Figure 11, if the three selection criteria ruled out all n-gram terms, then the call is sent to a human operator for further processing. If there is only one selected term, then a yes-no question is formulated based on this term. If there is more than one selected term in the set, and a significant number of these terms share a common headword, 2deg X, the system generalizes the query to ask the wh-question For what type of X? Otherwise, a yes-no question is formed based on the term in the selected set that occurred most frequently in the training data, based on the heuristic that a more common term is more likely to be relevant than an obscure term. 21 A third alternative would be to ask a disjunctive question, but we have not yet explored this possibility. Figure 3 shows that after the system poses its query, it attempts to route the refined request, which is the caller's original request plus the response to the disambiguation query posed by the system. In the case of whquestions, n-gram terms are extracted from the response. For instance, if the system asks For what type of loan? and the user responds It's a car loan, then the bigram car+loan 19 Although it captures a similar property of each term, this criterion is computationally much more expensive than the closeness criterion. Thus, we adopt the closeness criterion to select a fixed number of candidate terms and then apply the more expensive, but more accurate, criterion to the much smaller set of candidate terms. 20 In our implemented system, this path is selected if 1) there are five or less selected terms and they all share a common headword, or 2) there are more than five terms and at least five of them share a common headword. 21 The generation of natural language queries is based on templates for wh-questions and yes-no questions. The generation process consults a manually constructed mapping between n-gram morphologically reduced noun phrases and their expanded forms. For instance, it maps the n-gram term exist + car + loan to an existing car loan. 
This mapping is the only manual effort needed to port the disambiguation module to a new domain.</Paragraph> <Paragraph position="6"> Chu-Carroll and Carpenter Vector-based Natural Language Call Routing is extracted from the response. In the case of yes-no questions, the system determines whether a yes or no answer is given. 22 In the case of a yes response, the term selected to formulate the disambiguation query is considered the caller's response, while in the case of a no response, the response is treated as in responses to wh-questions. For instance, if the user says yes in response to the system's query Is this an existing car loan?, then the trigram term exist+car+loan in the system's query is considered the user's response.</Paragraph> <Paragraph position="7"> Note that our disambiguation mechanism, like our training process for basic routing, is domain-independent (except for the manual construction of a mapping between n-gram noun phrases and their expanded forms). It utilizes the set of n-gram terms, as well as term and document vectors that were obtained by the training of the call router. Thus, the call router can be ported to a new task with only very minor domain-specific work on the disambiguation module.</Paragraph> <Paragraph position="8"> router, consider the request Loans please. This request is ambiguous because the call center we studied handles mortgage loans separately from all other types of loans, and for all other loans, existing loans and new loans are also handled by different departments.</Paragraph> <Paragraph position="9"> Given this request, the call router first performs morphological, ignore word, and stop word filterings on the input, resulting in the filtered utterance of loan Iswl. N-gram terms are then extracted from the filtered utterance, resulting in the unigram term loan. Next, the router computes a pseudodocument vector that represents the caller's request, which is compared in turn with the destination vectors. The cosine values between the request vector and each destination vector are then mapped into confidence values. Using a confidence threshold of 0.2, we have two candidate destinations, Loan Services and Consumer Lending; thus the disambiguation module is invoked.</Paragraph> <Paragraph position="10"> Our disambiguation module first selects from all n-gram terms those whose term vectors are close to the difference vectors, i.e., the differences between each candidate destination vector and the request vector. This results in a list of 60 close terms, the vast majority of which are semantically close to loan, such as auto+loan, payoff, and owe. Next, the relevant terms are constructed from the set of close terms by selecting those close terms that form a valid n-gram term with loan. This results in a list of 27 relevant terms, including auto+loan and loan+payoff, but excluding owe, since neither loan+owe nor owe+loan constitutes a valid bigram. The third step is to select those relevant terms with disambiguating power, resulting in 18 disambiguating terms. Since 11 of these terms share a head noun loan, a wh-question is generated based on this headword, resulting in the query For what type of loan? Suppose in response to the system's query, the user answers Car loan. The router then adds the new bigram car+loan and the two unigrams car and loan to the original request and attempts to route the refined request. 
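The three-stage term selection used in this example (closeness, relevance, disambiguating power) can be sketched as follows. This is an illustration under simplifying assumptions: the names are ours, route_fn stands in for the routing procedure of Section 4.1.3, and the way close terms are combined with request terms into candidate n-grams is cruder than the paper's mechanism.

    import numpy as np

    def cosine(x, y):
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / denom) if denom else 0.0

    def select_query_terms(request_terms, candidate_dests, request_vec,
                           doc_vectors, term_vectors, term_index, route_fn,
                           n_close=30):
        """Pick the n-gram terms a disambiguation query should ask about."""
        # 1. Closeness: for each difference vector (candidate destination minus
        #    request), keep the n_close terms with the highest cosine score.
        close = set()
        for dest in candidate_dests:
            diff = doc_vectors[dest] - request_vec
            ranked = sorted(term_index,
                            key=lambda t: cosine(term_vectors[term_index[t]], diff),
                            reverse=True)
            close.update(ranked[:n_close])

        # 2. Relevance: keep close terms that combine with a term already in the
        #    request to form a known salient n-gram (combination simplified here).
        relevant = set()
        for c in close:
            for r in request_terms:
                for joined in (c + "+" + r, r + "+" + c):
                    if joined in term_index:
                        relevant.add(joined)

        # 3. Disambiguating power: keep relevant terms whose addition lets the
        #    router return exactly one candidate destination; if none do, fall
        #    back to the full relevant set.
        disambiguating = {t for t in relevant
                          if len(route_fn(set(request_terms) | {t})) == 1}
        return disambiguating or relevant

The returned set then drives query formulation: one term yields a yes-no question, while several terms sharing a headword yield a wh-question of the form For what type of X?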
This refined request is again ambiguous between Loan Services and Consumer Lending because the caller did not specify whether it was an existing or new car loan. Again, the disambiguation module selects the close, relevant, and disambiguating terms, resulting in a unique trigram exist+car+loan. Thus, the system generates the yes-no question Is this about an existing 22 In our current system, a response is considered a yes response only if it explicitly contains the word yes. However, as discussed in Green and Carberry, (1994) and Hockey et al. (1997), responses to yes-no questions may not explicitly contain a yes or no term. We leave incorporating a more sophisticated response understanding model, such as Green and Carberry (1994), into our system for future work. Computational Linguistics Volume 25, Number 3 car loan? 23 If the user responds yes, then the trigram exist+car+loan is added to the refined request and the call unambiguously routed to Loan Services; if the user says No, it's a new car loan, then the trigram new+car+loan is extracted from the response and the call routed to Consumer Lending. 24 5. Evaluation of the Call Router</Paragraph> </Section> <Section position="3" start_page="379" end_page="382" type="sub_section"> <SectionTitle> 5.1 Routing Module Performance </SectionTitle> <Paragraph position="0"> We performed an evaluation of the routing module of our call router on a set of 389 calls disjoint from the training corpus. Of the 389 requests, 307 were unambiguous and routed to their correct destinations, and 82 were ambiguous and annotated with a list of potential destinations. Unfortunately, in this test set, only the caller's utterance in response to the system's initial prompt of How may I direct your call? was recorded and transcribed; thus we have no information about where the ambiguous calls should be routed after disambiguation. We evaluated the routing module performance on both transcriptions of caller utterances as well as output of the Bell Labs Automatic Speech Recognizer (Reichl et al. 1998) based on speech input of caller utterances (Carpenter and Chu-Carroll 1998).</Paragraph> <Paragraph position="1"> is computed based on the term vectors representing the n-gram terms extracted from the requests, the performance of our call router is directly tied to the the accuracy of terms extracted from each caller utterance. Given the set of n-gram terms obtained from the training process, the accuracy of extraction of such terms based on transcriptions of caller utterances is 100%. However, when using the output of an automatic speech recognizer as input to our call router, deletions of terms present in the caller's request as well as insertions of terms that did not occur in the request affect the term extraction accuracy and thus the routing performance.</Paragraph> <Paragraph position="2"> We evaluated the output of the automatic speech recognizer based on both word accuracy and term accuracy, as shown in Table 3. 25 Word accuracy is measured by taking into account all words in the transcript and in the recognized string. Two sets of results are given for word accuracy, one based on raw forms of words and the other based on comparisons of the root forms of words, i.e., after both the transcript and the recognized string are sent through the morphological filter. 
Term accuracy is measured by taking into account only the set of actual/recognized words that contribute to routing performance, i.e., after both the transcript and the recognized string are sent through the term extraction process.</Paragraph> <Paragraph position="3"> For each evaluation dimension, we measured the recognizer performance by calculating the precision and recall. Precision is the percentage of words/terms in the recognizer output that are actually in the transcription, i.e., percentage of found words/terms 23 Recall that our current system uses simple template filling for response generation by utilizing manually constructed mappings from n-gram terms to their inflected forms, such as from exist+car+loan to an existing car loan. 24 The current implementation of the system requires that the user specify the correct answer when providing a no answer to a yes-no question, in order for the call to be properly disambiguated. However, it is possible that a system may attempt to disambiguate given a simple no answer by considering the n-gram term being queried (exist+car+loan in the above example) as a negative feature, subtracting its vector representation from the query, and attempting to route the resulting vector representation. 25 In computing the precision and recall figures, we did not take into account multiple occurrences of the same word. In other words, we consider a word in the recognized string correct if the word occurs in the transcribed text. For comparison purposes, the standard speech recognition accuracy on raw ASR output is 69.94%.</Paragraph> <Paragraph position="4"> that are correct, while recall is the percentage of words/terms in the transcription that are correctly returned by the recognizer, i.e., percentage of actual words/terms that are found. Table 3 shows that using the root forms of words results in a 1% absolute improvement (approximately 5% error reduction) in both precision and recall over using the raw forms of words.</Paragraph> <Paragraph position="5"> A comparison of the rooted word accuracy and the unigram accuracy shows that the recognizer performs much better on content words than on all words combined.</Paragraph> <Paragraph position="6"> Furthermore, comparisons among term accuracies for various n-gram terms show that as n increases, precision increases while recall decreases. This is because finding a correct trigram requires that all three unigrams that make up the trigram be correctly recognized in order, hence the low recall. On the other hand, this same feature makes it less likely for the recognizer to postulate a trigram by chance, hence the high precision. An overall observation in the results presented in Table 3 is that the speech recognizer misses between 12-17% of the n-gram terms used by the call router, and introduces an extra 1-6% of n-gram terms that should not have existed. In the next section, we show how these deletions and insertions of n-gram terms affect the call router's performance.</Paragraph> <Paragraph position="7"> module, we compare the list of candidate destinations with the manually annotated correct destination(s) for each call. The routing decision for each call is classified into one of eight classes, as shown in Figure 12. For instance, class 2a contains those calls that 1) are actually unambiguous, 2) are considered ambiguous by the router, and 3) have the potential to be routed to the correct destination, i.e., the correct destination is one of the candidate destinations. 
On the other hand, class 3b contains those calls that 1) are actually ambiguous, 2) are considered unambiguous by the router, and 3) are routed to a destination that is not one of the potential destinations.</Paragraph> <Paragraph position="8"> We evaluated the router's performance on three subsets of our test data: unambiguous requests alone, ambiguous requests alone, and all requests combined. For each set of data, we calculated a lower bound performance, which measures the percentage of calls that are correctly routed, and an upper bound performance, which measures the percentage of calls that are either correctly routed or have the potential to be correctly routed. Table 4 shows how the upper bounds and lower bounds are computed based on the classification in Figure 12 for each of the three data sets. For instance, for unambiguous requests (classes 1 and 2), the lower bound is the number of calls actually routed to the correct destination (class la) divided by the number of total unambiguous requests, while the upper bound is the number of calls actually routed to the correct destination (class la) plus the number of calls that the router finds to be ambiguous between the correct destination and some other destination(s) (class 2a), divided by the number of unambiguous requests. The calls in 2a are con- null sidered potentially correct because it is likely that the call will be routed to the correct destination after disambiguation.</Paragraph> <Paragraph position="9"> Tables 5(a) and 5(b) show the number of calls in our testing corpus that fell into the classes illustrated in Figure 12 based on transcriptions of caller requests and the output of an automatic speech recognizer, respectively. Tables 6(a) and 6(b) show the upper bound and lower bound performance for the three test sets based on the results in Tables 5(a) and (b), as well as the evaluation mechanism in Table 4. These results show that the system's overall performance in the case of perfect recognition falls somewhere between 75.6% and 97.2%, while the performance using our current automatic speech recognizer (ASR) output falls between 72.2% and 92.5%. The actual performance of the system is determined by two factors: 1) the performance of the disambiguation module, which determines the correct routing rate of the unambiguous calls that are considered ambiguous by the router (class 2a, 16.6% of all unambiguous calls with transcription and 15.9% with ASR output), and 2) the percentage of calls that were routed correctly out of the ambiguous calls that were considered unambiguous by the router (class 3a, 40.4% of all ambiguous calls with transcription and 36.6% with ASR output). Note that the performance figures given in Tables 6(a) and 6(b) are (b) Performance on ASR output based on 100% automatic routing. In the next section, we discuss the performance of the disambiguation module, which determines the overall system performance, and show how allowing calls to be redirected to human operators affects the system's performance.</Paragraph> </Section> <Section position="4" start_page="382" end_page="383" type="sub_section"> <SectionTitle> 5.2 Disambiguation Module Performance </SectionTitle> <Paragraph position="0"> To evaluate our disambiguation module, we needed dialogues that satisfy two criteria.</Paragraph> <Paragraph position="1"> First, the caller's first utterance must be ambiguous. 
Second, the operator must have asked a follow-up question to disambiguate the request and have subsequently routed the call to the appropriate destination. We used 157 calls that met these two criteria as our test set for the disambiguation module. Note that this test set is disjoint from the test set used in the evaluation of the call router, since none of the calls in that set satisfied the second criterion (those calls were not recorded or transcribed beyond the caller's response to the operator's prompt). Furthermore, for this test set, we only had access to the transcriptions of the calls but not the original speech files.</Paragraph> <Paragraph position="2"> For each ambiguous call, the first caller utterance was given to the router as input.</Paragraph> <Paragraph position="3"> The outcome of the router was classified as follows: Unambiguous if the call was routed to the selected destination. This routing was considered correct if the selected destination was the same as the actual destination and incorrect otherwise.</Paragraph> <Paragraph position="4"> Ambiguous if the router attempted to initiate disambiguation. The outcome of the routing of these calls was determined as follows: Correct if a disambiguation query was generated which, when answered, led to the correct destination. 26 Incorrect if a disambiguation query was generated which, when answered, could not lead to a correct destination.</Paragraph> <Paragraph position="5"> 26 Since our corpus consists of human-human dialogues, we do not have human responses to the exact disambiguation questions that our system generates. We consider a disambiguation query correct if it attempts to solicit the same type of information as the human operator, regardless of syntactic phrasing, and if answered based on the user's response to the human operator's question, led to the correct destination.</Paragraph> <Paragraph position="6"> Reject if the router could not form a sensible query or was unable to gather sufficient information from the user after its queries and routed the call to a human operator.</Paragraph> <Paragraph position="7"> Table 7 shows the number of calls that fall into each of the five categories. Out of the 157 calls, the router automatically routed 115 either with or without disambiguation (73.2%). Furthermore, 87.0% of these automatically routed calls were sent to the correct destination. Notice that out of the 52 ambiguous calls that the router considered unambiguous, 40 were routed correctly (76.9%). This is because our statistically trained call router is able to distinguish between cases where a semantically ambiguous request is equally likely to be routed to two or more destinations, and situations where the likelihood of one potential destination overwhelms that of the other(s). In the latter case, the router routes the call to the most likely destination instead of initiating disambiguation, which has been shown to be an effective strategy; not surprisingly, human operators are also prone to guess the destination based on likelihood and route calls without disambiguation.</Paragraph> </Section> <Section position="5" start_page="383" end_page="384" type="sub_section"> <SectionTitle> 5.3 Overall Performance </SectionTitle> <Paragraph position="0"> Our final evaluation of the overall performance of the call router is calculated by applying the results for evaluating the disambiguation module in Section 5.2 to the results for the routing module in Section 5.1. 
Tables 8(a) and 8(b) show the percentage of calls that will be correctly routed, incorrectly routed, and rejected, if we apply the performance of the disambiguation module (Table 7) to the calls that fall into each class in the evaluation of the routing module (Table 5). 27 For instance, the performance of transcribed class 2 calls (unambiguous calls that the router considered ambiguous) is computed as follows:</Paragraph> <Paragraph position="2"> 27 Note that the results in Table 8(b) are only a rough upper bound for the system's overall performance on recognizer output, since the performance of the disambiguation module presented in Table 7 is evaluated on transcribed texts (because we were not able to obtain any speech data that were recorded and transcribed beyond the caller's initial response to the system's prompt). In reality, the insertions and deletions of n-gram terms in the recognizer output may lead to some inappropriate disambiguation queries or more rejections to human operators. In addition, users may provide useful information not solicited by the system's query.</Paragraph> <Paragraph position="3"> The results in Table 8(a) show that, with perfect recognition, our call router sends 84.2% of all calls in our test set to the correct destination either with or without disambiguation, sends 5.6% of all calls to the incorrect destination, and redirects 10.2% of the calls to a human operator. In other words, our system attempts to automatically handle 89.8% of the calls, of which 93.8% are routed to their correct destinations. When speech recognition errors are introduced to the routing module, the percentage of calls correctly routed decreases while that of calls incorrectly routed increases. However, it is interesting to note that the rejection rate decreases, indicating that the system attempted to handle a larger portion of calls automatically.</Paragraph> </Section> <Section position="6" start_page="384" end_page="385" type="sub_section"> <SectionTitle> 5.4 Performance Comparison with Existing Systems </SectionTitle> <Paragraph position="0"> As discussed in Section 2, Gorin and his colleagues have experimented with various methodologies for relating caller utterances with call types (destinations). Their system performance is evaluated by comparing the most likely destination returned by their call type classifier given the first caller utterance with a manually annotated list of destinations labeled based again on the first caller utterance. A call is considered correctly classified if the destination returned by their classifier is present in the list of possible destinations. In other words, their evaluation scheme is similar to our method for computing the upper bound performance of our router discussed in Section 5.1.2.</Paragraph> <Paragraph position="1"> We evaluated our router using their evaluation scheme with a rejection threshold of 0.2 on both transcriptions and recognition output on our original set of 389 calls used in evaluating the routing module. Table 9 shows a comparison of our system's performance and the best-performing version of their system as reported in Wright, Gorin, and Riccardi (1997); henceforth WGR97Y Without other measures of task complexity, it is impossible to directly compare our results with those of WGR97. In several respects, their task is substantially different than ours. 
Their task is simpler in that there are fewer possible activities that a caller might request and fewer overall destinations; but it is more complex in that vocabulary 28 Wright, Gorin, and Riccardi (1997) presents system performance in the form of a rejection rate vs. correct classification rate graph, with rejection rate ranging between 10-55% and correct classification rate ranging between 63-94%. We report on two sets of results from their graph in Table 9, one with the lowest rejection rate and one that they chose to emphasize in their paper.</Paragraph> <Paragraph position="2"> items like cities are far more open-ended. Furthermore, it appears that they have many more instances of callers requesting services from more than one destination.</Paragraph> <Paragraph position="3"> Comparison with human operators was not possible for our task as their routing accuracy has not been evaluated. Our transcriptions clearly indicate that they all make a substantial number of routing errors (5-10% or more), with a large degree of variation among operators.</Paragraph> </Section> </Section> <Section position="6" start_page="385" end_page="386" type="metho"> <SectionTitle> 6. Future Work </SectionTitle> <Paragraph position="0"> In our current system, we perform morphological reduction context-independently without regard to word class. Ideally, we would have distinguished the uses of the word check as a verb from its uses as a noun, requiring both training and run-time category disambiguation.</Paragraph> <Paragraph position="1"> We are also interested in further clustering words that are similar in meaning, such as car, auto, and automobile, even though they are not related by regular morphological processes. For our application, digits or sequences of digits might be conflated into a single term, as might states, car makes and models, and so on. This kind of application-specific lexical clustering, whether done by hand or with the help of resources such as thesauri or semantic networks, should improve performance by overcoming inherent data sparseness problems. Classes might also prove helpful in dealing with changing items such as movie titles. In our earlier experiments, we used latent semantic analysis (Deerwester et al. 1990) for dimensionality reduction in an attempt to automatically cluster words that are semantically similar. This involved selecting dimensionality k, which is less than the rank r of the original term-document matrix. We found performance degrades for any k < r.</Paragraph> <Paragraph position="2"> In the current version of our system, the interface between the automatic speech recognizer and the call router is the top hypothesis of the speech recognizer for the speech input. As reported in Table 3, this top hypothesis has an approximately 10% error rate on salient unigrams. One way to improve this error rate is to allow the speech recognizer to produce an n-best list of the top n recognition hypotheses or even a probabilistic word graph rather than a single best hypothesis. The n-gram terms can then be extracted from the graph in a straightforward manner and weighted according to their scores from the recognizer. Our prediction is that this will lead to increased recall, with perhaps a slight degradation in precision. 
However, since increased recall will, at the very least, increase the chance that the disambiguation module can formulate reasonable queries, we expect the system's overall performance to improve as a result.</Paragraph> <Paragraph position="3"> Chu-Carroll and Carpenter Vector-based Natural Language Call Routing</Paragraph> </Section> class="xml-element"></Paper>