<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1128">
  <Title>Detection of Question-Answer Pairs in Email Conversations</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Automatic Question Detection
</SectionTitle>
    <Paragraph position="0"> While the detection of questions in email messages is not as dif cult a problem as in speech conversations where features such as the question mark character are absent, relying on the use of question mark character for identifying questions in email messages is not adequate. The need for special attention in detecting questions in email messages arises due to three reasons. First, the use of informal language means users might use the question mark character in cases other than questions (for example, to denote uncertainty) and may overlook using a question mark after a question. Second, a question may be stated in a declarative form, as in, I was wondering if you are free at 5pm today. Third, not every question, whether in an interrogative form or in a declarative form, is meant to be answered. For example, rhetorical questions are used for purposes other than to obtain the information the question asked, and are not required to be associated with answer segments.</Paragraph>
    <Paragraph position="1"> We used supervised rule induction for the detection of interrogative questions. Training examples were extracted from the transcribed SWITCHBOARD corpus annotated with DAMSL tags.1 This particular corpus was chosen not only because an adequate number of training examples could be extracted from the manual annotations, but also because of the use of informal language in speech that is also characteristic of email conversation. Utterances with DAMSL tags sv (speech act statement-opinion ) and sd (speech act statement-non-opinion ) were used to extract 5,000 negative examples. Similarly, utterances with tags qy ( yes-no-question ), qw ( Wh-question ), and qh ( rhetorical-question ) were used to extract 5,000 positive examples. Each utterance was then represented by a feature vector which included the following features: POS tags for the rst ve terms (shorter utterances were padded with dummies) POS tags for the last ve terms (shorter utterances were padded with dummies) length of the utterance  interrogative form POS-bigrams from a list of 100 most discriminating POS-bigrams list.</Paragraph>
    <Paragraph position="2"> The list of most discriminating POS-bigrams was obtained from the training data by following the same procedure that (Zechner and Lavie, 2001) used.</Paragraph>
    <Paragraph position="3"> We then used Ripper (Cohen, 1996) to learn rules for question detection. Like many learning programs, Ripper takes as input the classes to be learned, a set of feature names and possible values, and training data specifying the class and feature values for each training example. In our case, the training examples are the speech acts extracted from the SWITCHBOARD corpus as described above.</Paragraph>
    <Paragraph position="4"> Ripper outputs a classi cation model for predicting the class (i.e., whether a speech act is a question or not) of future examples; the model is expressed as an ordered set of if-then rules. For testing, we manually extracted 300 questions in interrogative form and 300 statements in declarative form from the ACM corpus.2 We show our test results with recall, precision and F1-score3 in Table 1 on the rst column.</Paragraph>
    <Paragraph position="5"> While the test results show that the precision was very good, the recall score could be further improved. Upon further investigation on why the recall was so low, we found that unlike the positive examples we used in our training data, most of the questions in the test data that were missed by the rules learned by Ripper started with a declarative phrase. For example, both I know its on 108th, but after that not sure, how exactly do we get there? , and By the way, are we shutting down clic? begin with declarative phrases and were missed by the Ripper learned rules. Following this observation, 2More information on the ACM corpus will be provided in Section 4.1. At the time of the development of the question detection module the annotations were not available to us, so we had to manually extract the required test speech acts.</Paragraph>
    <Paragraph position="7"> we manually updated our question detection module to break a speech act that was not initially predicted as question into phrases separated by comma characters. Then we applied the rules on the rst phrase of the speech act and if that failed on the last phrase. For example, the rules would fail on the phrase I know its on 108th , but would be able to classify the phrase how exactly do we get there as a question. In doing this we were able to increase the recall score to 0.72, leading to a F1-score of 0.82 as shown in Table 1 in the second column.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="6" type="metho">
    <SectionTitle>
4 Automatic Answer Detection
</SectionTitle>
    <Paragraph position="0"> While the automatic detection of questions in email messages is relatively easier than the detection of the same in speech conversations, the asynchronous nature of email conversations makes detection and pairing of question and answer pairs a more dif cult task. Whereas in speech a set of heuristics can be used to identify answers as shown by (Zechner and Lavie, 2001), such heuristics cannot be readily applied to the task of question and answer pairing in email conversations. First, more than one topic can be discussed in parallel in an email thread, which implies that questions relating to more than a single topic can be pursued in parallel. Second, even when an email thread starts with the discussion of a single topic, the thread may eventually be used to initiate a different topic just because the previous topic's list of recipients closely matched those required for the newly initiated topic. Third, because of the use of Reply and ReplyAll functions in email clients, a user may be responding to an issue posed earlier in the thread while using one of the email messages subsequent to the message posing the issue to reply back to that issue. So, while it may seem from the structure of the email thread that a person is replying back to a certain email, the person may actually be replying back to an email earlier in the thread.</Paragraph>
    <Paragraph position="1"> This implies that when several persons answer a question, there may be answers which appear several emails after the email posing the question. Finally, the fact that the question and its corresponding answers may have few words in common further complicates answer detection. This is possible when a person uses the context of the email conversation to ask questions and make answers, and the semantics of such questions and answers have to be interpreted with respect to the context they appear in. Such context is readily available for a reader through the use of quoted material from past email messages. All of these make the task of detecting and linking question and answer pairs in email conversations a complicated task. However, this task is not as complicated a task as automatic question answering where the search space for candidate answers is much wider and more sophisticated measures than those based on lexical similarity have to be employed.</Paragraph>
    <Paragraph position="2"> Our approach to automatic answer detection in email conversations is based on the observation that while a number of issues may be pursued in parallel, users tend to use separate paragraphs to address separate issues in the same email message. While a more complicated approach to segmentation of email messages could be possible, we have used this basic observation to delineate discourse segments in email messages. Further, because discourse segments contain more lexical context than their individual sentences, our approach detects associations between pairs of discourse segments rather than pairs of sentences.</Paragraph>
    <Paragraph position="3"> We now present our machine learning approach to automatic detection of question and answer pairs.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Corpus
</SectionTitle>
      <Paragraph position="0"> Our corpus consists of approximately 300 threads of about 1000 individual email messages sent during one academic year among the members of the board of the student organization of the ACM at Columbia University. The emails dealt mainly with planning events of various types, though other issues were also addressed. On average, each thread contained 3.25 email messages, with all threads containing at least two messages, and the longest thread containing 18 messages. Threads were constructed from the individual email messages using the In-Reply-To header information to link parent and child email messages.</Paragraph>
      <Paragraph position="1"> Two annotators (DB and GR) each were asked to highlight and link question and answer pairs in the corpus. Our work presented here is based on the work these annotators had completed at the time of this writing. GR has completed work on 200 threads of which there are 81 QA threads (threads with question and answer pairs), 98 question segments, and 142 question and answer pairs. DB has completed work on 138 threads of which there are 62 QA threads, 72 question segments, and 92 question and answer pairs. We consider a segment to be a question segment if a sentence in that segment has been highlighted as a question. Similarly, we consider a segment to be an answer segment if a sentence in that segment has been paired with a question to form a question and answer pair. The kappa statistic (Carletta, 1996) for identifying question segments is 0.68, and for linking question and answer segments given a question segment is 0.81.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="6" type="sub_section">
      <SectionTitle>
4.2 Features
</SectionTitle>
      <Paragraph position="0"> For each question segment in an email message, we make a list of candidate answer segments.</Paragraph>
      <Paragraph position="1"> This is basically just a list of original (content that is not quoted from past emails)4 segments in all the messages in the thread subsequent to the message of the question segment. Let the thread in consideration be called t, the container message of the question segment be called mq, the container message of the candidate answer segment be called ma, the question segment be called q, and the candidate answer segment be called a. For each question and candidate answer pair, we compute the following sets of features:  (a) number of non stop words in segment q and segment a; (b) cosine similarity5 and euclidean distance6 between segment q and a; 4.2.2 Features derived from the structure of the thread t (c) the number of intermediate messages between mq and ma in t; (d) the ratio of the number of messages in t sent ear-</Paragraph>
      <Paragraph position="3"> where cxi is the count of word i in segment x, and cyi is the count of word i in segment y.</Paragraph>
      <Paragraph position="5"> where cxi is the count of word i in segment x, and cyi is the count of word i in segment y.</Paragraph>
      <Paragraph position="6"> lier than mq and all the messages in t, and similarly for ma; (e) whether a is the rst segment in the list of candidate answer segments of q (this is true if a segment is the rst segment in the rst message sent in reply to mq); 4.2.3 Features based on the other candidate answer segments of q (f) number of candidate answer segments of q and the number of candidate answer segments of q after a (a segment x is considered to be after another segment y if x is from a message sent later than that of y, or if x appears after y in the same message); (g) the ratio of the number of candidate answer segments before a and the number of all candidate answer segments (a segment x is considered to be before another segment y if x is from a message sent earlier than that of y, or if x appears before y in the same message); and (h) whether q is the most similar segment of a among all segments from ancestor messages of ma based on cosine similarity (the list of ancestor messages of a message is computed by recursively following the In-Reply-To header information that points to the parent message of a message).</Paragraph>
      <Paragraph position="7"> While the contribution of a single feature to the classi cation task may not be intuitively apparent, we hope that a combination of a subset of these features, in some way, would help us in detecting question-answer pairs. For example, when the number of candidate answer segments for a question segment is less than or equal to two, feature (e) may be the best contributor to the classi cation task. But, when the number of candidate answer segments is high, a collection of some features may be the best contributor.</Paragraph>
      <Paragraph position="8"> We categorized each feature vector for the pair q and a as a positive example if a has been marked as an answer and linked with q.</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.3 Training Data
</SectionTitle>
      <Paragraph position="0"> We computed four sets of training data. Two for each of the annotators separately, which we call DB and GR according to the label of their respective annotator. One taking the union of the annotators, which we call Union, and another taking the intersection, which we call Inter. For the rst two sets, we collected the threads that had at least one question and answer pair marked by the respective annotator. For each question that has an answer marked  (some of the highlighted questions do not have corresponding answers in the thread), we computed a list of feature vectors as described above with all of its candidate answer segments. For the union set, we collected all the threads that had question and answer pairs marked by either annotator, and computed the feature vectors for each such question segment. A feature vector was categorized positive if any of the two annotators have marked the respective candidate answer segment as an answer. For the intersection set, we collected all the threads that had question and answer pairs marked by both annotators. Here, a feature vector was labelled positive only if both the annotators marked the respective candidate answer segment as an answer. Table 2 summarizes the information on the four sets of training data.</Paragraph>
    </Section>
    <Section position="4" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.4 Experiments and Results
</SectionTitle>
      <Paragraph position="0"> This section describes experiments using Ripper to automatically induce question and candidate answer pair classi ers, using the features described in Secion 4.2. We obtained the results presented here using ve-fold cross-validation.</Paragraph>
      <Paragraph position="1"> Table 3 shows the precision, recall and F1-score for the four datasets using the cosine similarity feature only. We use these results as the baseline against which we compare the results for the full set of features shown in Table 4. While precision using the full feature set is comparable to that of the baseline measure, we get a signi cant improvement on recall with the full feature set. The base- null line measure predicts that the candidate answer segment whose similarity with the question segment is above a certain threshold will be an actual answer segment. Our results suggest that lexical similarity cannot alone capture the rules associated with question and answer pairing, and that the use of various features based on the structure of the thread of email conversations can be used to improve upon lexical similarity of discourse segments. Further, while the results do not seem to suggest a clear preference for the data set DB over the data set GR (this could be explained by their high kappa score of 0.81), taking the union of the two datasets does seem to be better than taking the intersection of the two datasets. This could be because the intersection greatly reduces the number of positive data points from what is available in the union, and hence makes the learning of rules more dif cult with Inter.</Paragraph>
      <Paragraph position="2"> Finally, on observing that some questions had at most 2 candidate answers, and others had quite a few, we investigated what happens when we divide the data set Union into two data sets, one for question segments with 2 or less candidate answer segments which we call the data set Union a, and the other with the rest of the data set which we call the data set Union b. Union a has, on average, 1.5 candidate answer segments, while Union b has 5.7. We show the results of this experiment with the full feature set in Table 5. Our results show that it is much easier to learn rules for questions with the data set Union a, which we show in the rst row, than otherwise. We compare our results for the baseline measure of predicting the majority class, which we show in the second row, to demonstrate that the results obtained with the dataset Union a were not due to majority class prediction. While the results for the other subset, Union b, which we show in the third row, compare well with the results for Union, when the results for the data sets Union a and Union b are combined, which we show in the fourth row, we achieve better results than without the splitting,  shown in the last row.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="6" end_page="6" type="metho">
    <SectionTitle>
5 Conclusion and Future Work
</SectionTitle>
    <Paragraph position="0"> We have presented an approach to detect question-answer pairs with good results. Our approach is the rst step towards a system which can highlight question-answer pairs in a generated summary. Our approach works well with interrogative questions, but we have not addressed the automatic detection of questions in the declarative form and rhetorical questions. People often pose their requests in a declarative form in order to be polite among other reasons. Such requests could be detected with their use of certain key phrases some of which include Please let me know... , I was wondering if... , and If you could....that would be great. . And, because rhetorical questions are used for purposes other than to obtain the information the question asked, such questions do not require to be paired with answers.</Paragraph>
    <Paragraph position="1"> The automatic detection of these question types are still under investigation.</Paragraph>
    <Paragraph position="2"> Further, while the approach to the detection of question-answer pairs in threads of email conversation we have presented here is quite effective, as shown by our results, the use of such pairs of discourse segments for use in summarization of email conversations is an area of open research to us.</Paragraph>
    <Paragraph position="3"> As we discussed in Section 1, generation of summaries for email conversations that are devoted to question-answer exchanges and that integrate identi ed question-answer pairs as part of a full summary is also needed.</Paragraph>
  </Section>
  <Section position="7" start_page="6" end_page="6" type="metho">
    <SectionTitle>
6 Acknowledgements
</SectionTitle>
    <Paragraph position="0"> We are grateful to Owen Rambow for his helpful advice. We also thank Andrew Rosenberg for his discussion on kappa statistic as it relates to the ACM corpus. This work was supported by the National Science Foundation under the KDD program. Any opinions, ndings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily re ect the views of the</Paragraph>
  </Section>
class="xml-element"></Paper>