File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-0706_metho.xml
Size: 25,890 bytes
Last Modified: 2025-10-06 14:10:33
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0706"> <Title>Automating Help-desk Responses: A Comparative Study of Information-gathering Approaches</Title> <Section position="4" start_page="41" end_page="43" type="metho"> <SectionTitle> 2 Information-gathering Methods </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 2.1 Retrieve a Complete Answer </SectionTitle> <Paragraph position="0"> This method retrieves a complete document (answer) on the basis of request lemmas. We use cosine similarity to determine a retrieval score, and use a minimal retrieval threshold that must be surpassed for a response to be accepted.</Paragraph> <Paragraph position="1"> We have considered three approaches to indexing the answers in our corpus: according to the content lemmas in (1) requests, (2) answers, or (3) requests&answers. The results in Section 3 are for the third approach, which proved best. To illustrate the difference between these approaches, consider request-answer pair RA2. If we received a new request similar to that in RA2, the answer in RA2 would be retrieved if we had indexed according to requests or requests&answers. However, if we had indexed only on answers, then the response would not be retrieved.</Paragraph> </Section> <Section position="2" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 2.2 Predict a Complete Answer </SectionTitle> <Paragraph position="0"> This prediction method rst groups similar answers in the corpus into answer clusters. For each request, we then predict an answer cluster on the basis of the request features, and select the answer that is most representative of the cluster (closest to the centroid). This method would predict a group of answers similar to the answer in RA2 from the input lemmas compaq and cp-2w .</Paragraph> <Paragraph position="1"> The clustering is performed in advance of the prediction process by the intrinsic classi cation program Snob (Wallace and Boulton, 1968), using the content lemmas (unigrams) in the answers as features. The predictive model is a Decision Graph (Oliver, 1993) trained on (1) input features: unigram and bigram lemmas in the request,3 and (2) target feature the identi er of the answer cluster that contains the actual answer for the request.4 The model provides a prediction of which response cluster is most suitable for a given request, as well as a level of con dence in this prediction. We do not attempt to produce an answer if the con dence is not suf ciently high.</Paragraph> <Paragraph position="2"> In principle, rather than clustering the answers, the predictive model could have been trained on individual answers. However, on one hand, the tion features because these parts of the system were developed at different times. In the near future, we will align these features.</Paragraph> <Paragraph position="3"> dimensionality of this task is very high, and on the other hand, answers that share signi cant features would be predicted together, effectively acting as a cluster. By clustering answers in advance, we reduce the dimensionality of the problem, at the expense of some loss of information (since somewhat dissimilar answers may be grouped together). 
</Section> <Section position="3" start_page="41" end_page="42" type="sub_section"> <SectionTitle> 2.3 Predict Sentences </SectionTitle> <Paragraph position="0"> This method looks at each answer sentence as though it were a separate document, and groups similar sentences into clusters in order to obtain meaningful sentence abstractions and avoid redundancy. (We did not cluster request sentences, as requests are often ungrammatical, which makes it hard to segment them into sentences, and the language used in requests is more diverse than the corporate language used in responses.) For instance, the last sentence in A3 and the first sentence in A4 are assigned to the same sentence cluster. As with Answer Prediction (Section 2.2), this clustering process also reduces the dimensionality of the problem.</Paragraph> <Paragraph position="1"> Each request is used to predict promising clusters of answer sentences, and an answer is composed by extracting a sentence from such clusters. Because the sentences in each cluster originate in different response documents, the process of selecting them for a new response corresponds to multi-document summarization. In fact, our selection mechanism, described in more detail in (Marom and Zukerman, 2005), is based on a multi-document summarization formulation proposed by Filatova and Hatzivassiloglou (2004).</Paragraph> <Paragraph position="2"> In order to be able to generate appropriate answers in this manner, the sentence clusters should be cohesive, and they should be predicted with high confidence. A cluster is cohesive if the sentences in it are similar to each other. This means that it is possible to obtain a sentence that represents the cluster adequately (which is not the case for an uncohesive cluster). A high-confidence prediction indicates that the sentence is relevant to many requests that share certain regularities. Owing to these requirements, the Sentence Prediction method will often produce partial answers (i.e., it will have a high precision, but often a low recall).</Paragraph> <Paragraph position="3"> The clustering is performed by applying Snob using the following sentence-based and word-based features, all of which proved significant for at least some datasets. The sentence-based features are:</Paragraph> <Paragraph position="4"> Number of syntactic phrases in the sentence (e.g., prepositional, subordinate), which gives an idea of sentence complexity.</Paragraph> <Paragraph position="5"> Grammatical mood of the main clause (5 states: imperative, imperative-step, declarative, declarative-step, unknown), which indicates the function of the sentence in the answer, e.g., an isolated instruction, part of a sequence of steps, part of a list of options.</Paragraph> <Paragraph position="6"> Grammatical person in the subject of the main clause (4 states: first, second, third, unknown), which indicates the agent (e.g., organization or client) or patient (e.g., product).</Paragraph> <Paragraph position="7"> The word-based features are binary: Significant lemma bigrams in the subject of the main clause and in the augmented object in the main clause. This is the syntactic object if it exists, or the subject of a prepositional phrase in an imperative sentence with no object, e.g., "click on the following link".</Paragraph> <Paragraph position="8"> The verbs in the sentence and their polarity (asserted or negated).</Paragraph> <Paragraph position="9"> All unigrams in the sentence, excluding verbs.</Paragraph>
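As a rough illustration of how features of this kind might be extracted, the sketch below uses spaCy as an assumed parsing backend (the paper does not name its tools); the mood and person rules are deliberate simplifications of the five-state and four-state features above, and the significant-bigram features are omitted.

    # Illustrative only: spaCy is an assumed backend, and the heuristics below
    # approximate the mood/person features rather than reproduce them exactly.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def sentence_features(sentence):
        doc = nlp(sentence)
        # number of prepositional/subordinate phrases as a crude complexity measure
        n_phrases = sum(1 for tok in doc
                        if tok.dep_ in ("prep", "advcl", "ccomp", "xcomp", "relcl"))
        root = next(tok for tok in doc if tok.dep_ == "ROOT")
        subjects = [tok for tok in root.children if tok.dep_ in ("nsubj", "nsubjpass")]
        # a base-form root verb with no subject is treated as an imperative
        if root.tag_ == "VB" and not subjects:
            mood = "imperative"
        elif subjects:
            mood = "declarative"
        else:
            mood = "unknown"
        # grammatical person of the subject, when the subject is a pronoun
        person = "unknown"
        if subjects:
            person = {"i": "first", "we": "first", "you": "second"}.get(subjects[0].lower_, "third")
        # binary word-based features: verbs with polarity, and non-verb unigrams
        verbs = {tok.lemma_ + ("_neg" if any(c.dep_ == "neg" for c in tok.children) else "")
                 for tok in doc if tok.pos_ == "VERB"}
        unigrams = {tok.lemma_ for tok in doc if tok.is_alpha and tok.pos_ != "VERB"}
        return {"n_phrases": n_phrases, "mood": mood, "person": person,
                "verbs": verbs, "unigrams": unigrams}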
<Paragraph position="10"> To measure the textual cohesion of a cluster, we inspect the centroid values corresponding to the word features. Due to their binary representation, the centroid values correspond to probabilities of the words appearing in the cluster. Our measure is similar to entropy, in the sense that it yields non-zero values for extreme probabilities (Marom and Zukerman, 2005). It implements the idea that a cohesive group of sentences should agree strongly on both the words that appear in these sentences and the words that are omitted. Hence, it is possible to obtain a sentence that adequately represents a cohesive sentence cluster, while this is not the case for a loose sentence cluster. For example, the italicized sentences in A3 and A4 belong to a highly cohesive sentence cluster (0.93), while the opening answer sentence in RA1 belongs to a less cohesive cluster (0.7) that contains diverse sentences about the Rompaq power management.</Paragraph> <Paragraph position="11"> Unlike Answer Prediction, we use a Support Vector Machine (SVM) for predicting sentence clusters. A separate SVM is trained for each sentence cluster, with unigram and bigram lemmas in a request as input features, and a binary target feature specifying whether the cluster contains a sentence from the response to this request.</Paragraph> <Paragraph position="12"> During the prediction stage, the SVMs predict zero or more clusters for each request. One representative sentence (closest to the centroid) is then extracted from each highly cohesive cluster predicted with high confidence. These sentences will appear in the answer (at present, these sentences are treated as a set, and are not organized into a coherent reply).</Paragraph> </Section> <Section position="4" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 2.4 Retrieve Sentences </SectionTitle> <Paragraph position="0"> As with Sentence Prediction (Section 2.3), this method looks at each answer sentence as though it were a separate document. For each request sentence, we retrieve candidate answer sentences on the basis of the match between the content lemmas in the request sentence and the answer sentence. For example, while the first answer sentence in RA1 might match the first request sentence in RA1, an answer sentence from a different response (about re-installing Win98) might match the second request sentence. The selection of individual text units from documents implements ideas from question-answering approaches.</Paragraph> <Paragraph position="1"> We are mainly interested in answer sentences that cover request sentences, i.e., the terms in the request should appear in the answer. Hence, we use recall as the measure for the goodness of a match, where recall is defined as follows.</Paragraph> <Paragraph position="2"> recall = (TF.IDF of lemmas in request sentence & answer sentence) / (TF.IDF of lemmas in request sentence). We initially retain the answer sentences whose recall exceeds a threshold. (To assess the goodness of a sentence, we experimented with f-scores that had different weights for recall and precision; our results were insensitive to these variations.) Once we have the set of candidate answer sentences, we attempt to remove redundant sentences. This requires the identification of sentences that are similar to each other, a task for which we use the sentence clusters described in Section 2.3.</Paragraph>
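A minimal sketch of this scoring and filtering step is given below, including the cluster-based redundancy filter detailed in the next paragraph. The data structures (lemma sets, a TF-IDF weight table, precomputed cluster assignments and cohesion values) and the threshold values are assumptions made for the example, not details taken from the system itself.

    # Illustrative sketch: lemma sets, TF-IDF weights, cluster assignments and
    # cohesion values are assumed inputs; threshold values are placeholders.
    def recall_score(request_lemmas, answer_lemmas, tfidf):
        """TF.IDF mass of request lemmas that also occur in the answer sentence,
        normalized by the TF.IDF mass of all request lemmas."""
        total = sum(tfidf.get(lem, 0.0) for lem in request_lemmas)
        shared = sum(tfidf.get(lem, 0.0) for lem in request_lemmas if lem in answer_lemmas)
        return shared / total if total > 0 else 0.0

    def retrieve_sentences(request_sentences, answer_sentences, tfidf,
                           cluster_of, cohesion, threshold=0.5, cohesive=0.7):
        """Retain high-recall candidates, then keep one sentence per cohesive cluster."""
        candidates = []
        for request_lemmas in request_sentences:
            for sent_id, answer_lemmas in answer_sentences.items():
                score = recall_score(request_lemmas, answer_lemmas, tfidf)
                if score >= threshold:                   # minimal recall threshold
                    candidates.append((score, sent_id))
        kept, best_per_cluster = [], {}
        for score, sent_id in sorted(candidates, reverse=True):
            cluster = cluster_of.get(sent_id)
            if cluster is not None and cohesion.get(cluster, 0.0) >= cohesive:
                best_per_cluster.setdefault(cluster, sent_id)   # highest-recall member wins
            elif sent_id not in kept:
                kept.append(sent_id)                            # no cohesive cluster: keep it
        return kept + list(best_per_cluster.values())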
<Paragraph position="3"> Again, this redundancy-removal step essentially casts the task as multi-document summarization.</Paragraph> <Paragraph position="4"> Given a group of answer sentences that belong to the same cohesive cluster, we retain the sentence with the highest recall (in our current trials, a cluster is sufficiently cohesive for this purpose if its cohesion is at least 0.7). In addition, we retain all the answer sentences that do not belong to a cohesive cluster. All the retained sentences will appear in the answer.</Paragraph> </Section> <Section position="5" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 2.5 Hybrid Predict-Retrieve Sentences </SectionTitle> <Paragraph position="0"> It is possible that the Sentence Prediction method predicts a sentence cluster that is not sufficiently cohesive for a confident selection of a representative sentence, but that the ambiguity can be resolved through cues in the request. For example, selecting between a group of sentences concerning the installation of different drivers might be possible if the request mentions a specific driver. Thus the Sentence Prediction method is complemented with the Sentence Retrieval method to form a hybrid, as follows.</Paragraph> <Paragraph position="1"> For highly cohesive clusters predicted with high confidence, we select a representative sentence as before.</Paragraph> <Paragraph position="2"> For clusters with medium cohesion predicted with high confidence, we attempt to match the sentences with the request sentences, using the Sentence Retrieval method but with a lower recall threshold. This reduction takes place because the high prediction confidence provides a guarantee that the sentences in the cluster are suitable for the request, so there is no need for a conservative recall threshold. The role of retrieval is now to select the sentence whose content lemmas best match the request.</Paragraph> <Paragraph position="3"> For uncohesive clusters or clusters predicted with low confidence, we have to resort to word matches, which means reverting to the higher, more conservative recall threshold, because we no longer have the prediction confidence.</Paragraph>
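The three cases above amount to a small decision rule, sketched below. The cohesion function is only a guess at the entropy-like measure described in Section 2.3 (the exact formulation is given in Marom and Zukerman (2005)), the cluster object is a hypothetical holder for a precomputed representative sentence and a recall-based retrieval helper, and all threshold values are placeholders.

    # Illustrative decision rule; thresholds and the cohesion formula are assumptions.
    import math

    def cohesion(word_probabilities):
        """Entropy-like cohesion over a cluster's binary word-feature centroid:
        close to 1 when every word is almost always present or almost always absent."""
        if not word_probabilities:
            return 0.0
        def agreement(p):
            if p <= 0.0 or p >= 1.0:
                return 1.0
            return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)  # 1 - binary entropy
        return sum(agreement(p) for p in word_probabilities) / len(word_probabilities)

    def hybrid_sentence(cluster, request, prediction_confidence, high_confidence=0.9,
                        high_cohesion=0.9, medium_cohesion=0.7, low_recall=0.3, high_recall=0.5):
        """Choose between prediction and retrieval for one predicted sentence cluster."""
        c = cohesion(cluster.centroid_word_probs)
        if prediction_confidence >= high_confidence and c >= high_cohesion:
            return cluster.representative                    # cohesive: take the centroid-closest sentence
        if prediction_confidence >= high_confidence and c >= medium_cohesion:
            return cluster.best_match(request, threshold=low_recall)   # retrieval, relaxed threshold
        return cluster.best_match(request, threshold=high_recall)      # conservative word matching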
</Section> </Section> <Section position="5" start_page="43" end_page="45" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"> As mentioned in Section 1, our corpus was divided into topic-based datasets. We have observed that the different datasets lend themselves differently to the various information-gathering methods described in the previous section. In this section, we examine the overall performance of the five methods across the corpus, as well as their performance for different datasets.</Paragraph> <Section position="1" start_page="43" end_page="44" type="sub_section"> <SectionTitle> 3.1 Measures </SectionTitle> <Paragraph position="0"> We are interested in two performance indicators: coverage and quality.</Paragraph> <Paragraph position="1"> Coverage is the proportion of requests for which a response can be generated. The various information-gathering methods presented in the previous section have acceptance criteria that indicate that there is some level of confidence in generating a response. A request for which a planned response fails to meet these criteria is not covered, or addressed, by the system. We are interested in seeing whether the different methods are applicable in different situations, that is, how exclusively they address different requests. Note that the sentence-based methods generate partial responses, which are considered acceptable so long as they contain at least one sentence generated with high confidence. In many cases these methods produce obvious and non-informative sentences such as "Thank you for contacting HP", which would be deemed an acceptable response. We have manually excluded such sentences from the calculation of coverage, in order to have a more informative comparison between the different methods.</Paragraph> <Paragraph position="2"> Ideally, the quality of the generated responses should be measured through a user study, where people judge the correctness and appropriateness of answers generated by the different methods.</Paragraph> <Paragraph position="3"> However, we intend to refine our methods further before we conduct such a study. Hence, at present we rely on a text-based quantitative measure. Our experimental setup involves a standard 10-fold validation procedure, where we repeatedly train on 90% of a dataset and test on the remaining 10%. We then evaluate the quality of the answers generated for the requests in each test split, by comparing them with the actual responses given by the help-desk operator for these requests.</Paragraph> <Paragraph position="4"> We are interested in two quality measures: (1) the precision of a generated response, and (2) its overall similarity to the actual response. The reason for this distinction is that the former does not penalize for a low recall; it simply measures how correct the generated text is. As stated in Section 1, a partial but correct response may be better than a complete response that contains incorrect units of information. On the other hand, more complete responses are favoured over partial ones, and so we use the second measure to get an overall indication of how correct and complete a response is. We use the traditional Information Retrieval precision and f-score measures (Salton and McGill, 1983), employed on a word-by-word basis, to evaluate the quality of the generated responses. (We have also employed sequence-based measures using the ROUGE tool set (Lin and Hovy, 2003), with similar results to those obtained with the word-by-word measure.)</Paragraph>
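For concreteness, word-by-word precision, recall and f-score can be computed along the following lines. This is a sketch under simple whitespace tokenization and equal word weighting, which the paper does not specify; the example strings are invented.

    # Illustrative sketch: word-level overlap between a generated response and
    # the operator's actual response, with simple whitespace tokenization.
    from collections import Counter

    def word_by_word_scores(generated, actual):
        gen, act = Counter(generated.lower().split()), Counter(actual.lower().split())
        overlap = sum((gen & act).values())              # words shared by both responses
        precision = overlap / sum(gen.values()) if gen else 0.0
        recall = overlap / sum(act.values()) if act else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f_score

    # A partial but correct response scores high precision but a lower f-score.
    print(word_by_word_scores(
        "please call technical support to arrange a repair",
        "please call technical support to arrange a repair your case number will follow"))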
</Section> <Section position="2" start_page="44" end_page="45" type="sub_section"> <SectionTitle> 3.2 Results </SectionTitle> <Paragraph position="0"> Table 1 shows the overall results obtained using the different methods. We see that, combined, the different methods can address 72% of the requests.</Paragraph> <Paragraph position="1"> That is, at least one of these methods can produce some non-empty response to 72% of the requests.</Paragraph> <Paragraph position="2"> Looking at the individual coverages of the different methods, we see that they must be applicable in different situations, because the highest individual coverage is 43%.</Paragraph> <Paragraph position="3"> The Answer Retrieval method addresses 43% of the requests, and in fact, about half of these (22%) are uniquely addressed by this method. However, in terms of the quality of the generated response, we see that the performance is very poor (both precision and f-score have very low averages). Nevertheless, there are some cases where this method uniquely addresses requests quite well. In three of the datasets, Answer Retrieval is the only method that produces good answers, successfully addressing 15-20 requests (about 5% of the requests in these datasets). These requests include several cases similar to RA2, where the request was sent to the wrong place. We would expect Answer Prediction to be able to handle such cases as well.</Paragraph> <Paragraph position="4"> However, when there are not enough similar cases in the dataset (as is the case with the three datasets referred to above), Answer Prediction is not able to generalize from them, and therefore we can only rely on a new request closely matching an old request or an old answer.</Paragraph> <Paragraph position="5"> The Answer Prediction method can address 29% of the requests. Only about a tenth of these are uniquely addressed by this method, but the generated responses are of a fairly high quality, with an average precision and f-score of 0.82.</Paragraph> <Paragraph position="6"> Notice the large standard deviation of these averages, suggesting a somewhat inconsistent behaviour. This is due to the fact that this method gives good results only when complete template responses are found. In this case, any re-used response will have a high similarity to the actual response. However, when this is not the case, the performance degrades substantially, resulting in inconsistent behaviour. This behaviour is particularly prevalent for the product replacement dataset, which comprises 18% of the requests.</Paragraph> <Paragraph position="7"> The vast majority of the requests in this dataset ask for a return shipping label to be mailed to the customer, so that he or she can return a faulty product. Although these requests often contain detailed product descriptions, the responses rarely refer to the actual products, and often contain the following generic answer.</Paragraph> <Paragraph position="8"> A5: Your request for a return airbill has been received and has been sent for processing. Your replacement airbill will be sent to you via email within 24 hours.</Paragraph> <Paragraph position="9"> Answer Retrieval fails in such cases, because each request has precise information about the actual product, so a new request can neither match an old request (about a different product) nor can it match the generic response. In contrast, Answer Prediction can ignore the precise information in the request, and infer from the mention of a shipping label that the generic response is appropriate. When we exclude this dataset from the calculations, both the average precision and f-score for the Answer Prediction method fall below those of the Sentence Prediction and Hybrid methods. This means that Answer Prediction is suitable when requests that share some regularity receive a complete template answer.</Paragraph> <Paragraph position="10"> The Sentence Prediction method can find regularities at the sub-document level, and therefore deal with cases when partial responses can be generated.
It produces such responses for 34% of the requests, and does so with a consistently high precision (average 0.94, standard deviation 0.13).</Paragraph> <Paragraph position="11"> Only an overall 1% of the requests are uniquely addressed by this method; however, for the cases that are shared between this method and other ones, it is useful to compare the actual quality of the generated response. In 5% of the cases, the Sentence Prediction method either uniquely addresses requests, or jointly addresses requests together with other methods but has a higher f-score. This means that in some cases a partial response has a higher quality than a complete one.</Paragraph> <Paragraph position="12"> Like the document-level Answer Retrieval method, the Sentence Retrieval method performs poorly. It is difficult to find an answer sentence that closely matches a request sentence, and even when this is possible, the selected sentences tend to be different to the ones used by the help-desk operators, hence the low precision and f-score.</Paragraph> <Paragraph position="13"> This is discussed further below in the context of the Sentence Hybrid method.</Paragraph> <Paragraph position="14"> The Sentence Hybrid method extends the Sentence Prediction method by employing sentence retrieval as well, and thus has a higher coverage (45%). In fact, the retrieval component serves to disambiguate between groups of candidate sentences, thus enabling more sentences to be included in the generated response. This, however, is at the expense of precision, as we also saw for the pure Sentence Retrieval method. Although retrieval selects sentences that closely match a given request, this selection can differ from the selections made by the operator in the actual response. Precision (and hence f-score) penalizes such sentences, even when they are more appropriate than those in the model response. For example, consider request-answer pair RA6. The answer is quite generic, and is used almost identically for several other requests. The Hybrid method almost reproduces this answer, replacing the first sentence with A7. This sentence, which matches more request words than the first sentence in the model answer, was selected from a sentence cluster that is not highly cohesive, and contains sentences that describe different reasons for setting up a repair (the matching word in A7 is "screen").</Paragraph> <Paragraph position="15"> RA6: My screen is coming up reversed (mirrored). There must be something loose electronically because if I put the stylus in it's hole and move it back and forth, I can get the screen to display properly momentarily. Please advise where to send for repairs.</Paragraph> <Paragraph position="16"> To get the iPAQ serviced, you can call 1-800-phone-number, options 3, 1 (enter a 10 digit phone number), 2. Enter your phone number twice and then wait for the routing center to put you through to a technician with Technical Support.
They can get the unit picked up and brought to our service center.</Paragraph> <Paragraph position="17"> A7: To get the iPAQ repaired (battery, stylus lock and screen), please call 1-800-phone-number, options 3, 1 (enter a 10 digit phone number), 2.</Paragraph> <Paragraph position="18"> The Hybrid method outperforms the other methods in about 10% of the cases, where it either uniquely addresses requests, or addresses them jointly with other methods but produces responses with a higher f-score.</Paragraph> </Section> <Section position="3" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 3.3 Summary </SectionTitle> <Paragraph position="0"> In summary, our results show that each of the different methods is applicable in different situations, all occurring significantly in the corpus, with the exception of the Sentence Retrieval method. The Answer Retrieval method uniquely addresses a large portion of the requests, but many of its attempts are spurious, thus lowering the combined overall quality shown at the bottom of Table 1 (average f-score 0.50), calculated by using the best-performing method for each request. The Answer Prediction method is good at addressing situations that warrant complete template responses. However, its confidence criteria might need refining to lower the variability in quality. The combined contribution of the sentence-based methods is substantial (about 15%), suggesting that partial responses of high precision may be better than complete responses with a lower precision.</Paragraph> </Section> </Section> <Section position="6" start_page="45" end_page="46" type="metho"> <SectionTitle> 4 Related Research </SectionTitle> <Paragraph position="0"> There are very few reported attempts at corpus-based automation of help-desk responses. The retrieval system eResponder (Carmel et al., 2000) is similar to our Answer Retrieval method, in that the system retrieves a list of request-response pairs and presents a ranked list of responses to the user. Our results show that, due to the repetitions in the responses, multi-document summarization can be used to produce a single (possibly partial) representative response. This is recognized by Berger and Mittal (2000), who employ query-relevant summarization to generate responses. However, their corpus consists of FAQ request-response pairs, a significantly different corpus to ours in that it lacks repetition and redundancy, and in that its responses are not personalized. Lapalme and Kosseim (2003) propose a retrieval approach similar to our Answer Retrieval method, and a question-answering approach, but applied to a corpus of technical documents rather than request-response pairs. The methods presented in this paper combine different aspects of document retrieval, question-answering and multi-document summarization, applied to a corpus of repetitive request-response pairs.</Paragraph> </Section> </Paper>