<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1605">
<Title>Interrogative Reformulation Patterns and Acquisition of Question Paraphrases</Title>
<Section position="5" start_page="3" end_page="6" type="evalu">
<SectionTitle> 4 Evaluation </SectionTitle>
<Paragraph position="0"> Using the paraphrase data described in the previous section, we evaluated our question reformulation patterns on coverage and on the paraphrase recognition task. From the data, we selected all paraphrases derived from the original questions of type PRC, RSN and ATR. There were 306 such examples, and they constituted the testset for the evaluation. The examples below show two of the original questions (with their types) and the paraphrases collected for them:
1. Where can I get British tea in the United States? [ATR]
   a. How can I locate some British tea in the United States?
   b. Who sells English tea in the U.S.?
   c. What stores carry British tea in the United States?
   d. Where is the best place to find English tea in the U.S.?
   e. Where exactly should I go to buy British tea in the U.S.?
   f. How can an American find British tea?
2. Why does the Moon always show the same face to the Earth? [RSN]
   a. What is the reason why the Moon show only one side to the Earth?
   b. Why is the same side of the Moon facing the Earth all the time?
   c. How come we do not see the other side of the Moon from Earth?
   d. Why do we always see the same side of the Moon?
   e. Why do the Moon always look the same from here?
   f. Why is there the dark side of Moon?</Paragraph>
<Section position="1" start_page="3" end_page="5" type="sub_section">
<SectionTitle> 4.1 Coverage </SectionTitle>
<Paragraph position="0"> We first applied the transformation patterns to all examples in the testset and generated their case frame representations. Of the 306 examples, 289 matched at least one pattern. If an example matched two or more patterns, the one with the highest priority was selected. Thus the coverage was 94%.</Paragraph>
<Paragraph position="1"> However, after inspecting the results, we observed that in some successful matches the syntactic structure of the question did not exactly correspond to the pattern as intended. For example, &quot;How can I learn to drink less tea and coffee?&quot; (a paraphrase of the original question &quot;How can I get rid of a caffeine habit?&quot;) matched pattern (1) shown in Figure 2 and produced a frame where &quot;I&quot; was the actor, &quot;learn&quot; was the verb, and the theme was null (because the shallow parser analyzed &quot;to drink less tea and coffee&quot; as a verb modifier). Although the difficulty with this example was caused by inadequate pre-processing or by the inherent difficulty of shallow parsing, the end result was a spurious match nonetheless. Of the 289 matches, 15 were such false matches.</Paragraph>
<Paragraph position="2"> As for the 17 examples that failed to match any pattern, one example is &quot;What internet resources exist regarding copyright?&quot;, which can be paraphrased as &quot;Where can I find information about copyright on the internet?&quot;: there were patterns that matched the interrogative part (&quot;What internet resources&quot;), but all of them had constrained variables for the verb which did not match &quot;exist&quot;. Other failed matches were due to elusive paraphrasing. For example, for the original question &quot;Why is evaporative emissions a problem?&quot;, web users entered &quot;What's up with evaporative emissions?&quot; and &quot;What is wrong with evaporative emissions?&quot;. Those paraphrases seem to be keyed off &quot;problem&quot; rather than &quot;why&quot;.</Paragraph>
</Section>
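The matching procedure above can be sketched as follows. This is a minimal illustration only: the pattern set, slot names, and helper functions are hypothetical stand-ins, since the paper's patterns operate on shallow-parsed structures rather than raw strings.

    import re

    # Hypothetical stand-ins for the reformulation patterns: each entry
    # pairs a priority with a regular expression whose named groups
    # instantiate case frame slots (actor, verb, theme).
    PATTERNS = [
        (2, re.compile(r"^How can (?P<actor>\w+) (?P<verb>\w+)(?P<theme>.*)\?$", re.I)),
        (1, re.compile(r"^Where can (?P<actor>\w+) (?P<verb>\w+)(?P<theme>.*)\?$", re.I)),
    ]

    def to_frame(question):
        """Return the case frame from the highest-priority matching
        pattern, or None when no pattern matches (a failed match)."""
        best = None
        for priority, regex in PATTERNS:
            m = regex.match(question)
            # If two or more patterns match, keep the highest priority.
            if m and (best is None or priority > best[0]):
                best = (priority, m.groupdict())
        return best[1] if best else None

    def coverage(questions):
        """Fraction of questions for which some pattern produced a frame."""
        frames = [to_frame(q) for q in questions]
        return sum(f is not None for f in frames) / len(questions)

On the testset above, this procedure corresponds to 289 matches out of 306 questions, i.e., 94% coverage.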
<Section position="2" start_page="5" end_page="6" type="sub_section">
<SectionTitle> 4.2 Paraphrase Recognition </SectionTitle>
<Paragraph position="0"> Using the case frame representations derived from the first experiment, we applied a frame similarity measure to all pairs of frames. This measure is rather rudimentary, and we plan to fine-tune it in future work. It focuses on the effect of the paraphrase patterns: how much the canonical representations, after the variations of interrogatives are factored out, can bring the (true) paraphrases (i.e., questions generated from the same original question) closer together, thereby possibly improving the recognition of paraphrases.</Paragraph>
<Paragraph position="1"> The frame similarity between a pair of frames is defined as a weighted sum of two similarity scores: one for the interrogative part (which we call interrogative similarity) and another for the sentence part (which we call case role similarity). The interrogative similarity is obtained by computing the average slot-wise correspondence of the empty categories (slots whose value is '?'), where the correspondence value of a slot is 1 if both frames have '?' for the slot and 0 otherwise. The case role similarity, on the other hand, is obtained by comparing two term vectors, where the terms are the union of the words that appear in the remaining slots (i.e., the non-empty category slots) of the two frames. These terms/words are treated as a bag of words (as in Information Retrieval), irrespective of the order or the slots in which they appeared. We chose this scheme for the non-empty category slots because our current work does not address paraphrases in the sentence part of the questions (as mentioned earlier). The value of each term in a frame is 1 if the word is present in the frame and 0 otherwise, and the cosine of the two vectors is returned as the score. The final frame similarity value, after applying weights which sum to 1, falls between 0 and 1, where 1 indicates the strongest similarity.</Paragraph>
<Paragraph position="2"> Using the frame similarity measure, we computed two versions: one with 0.5 for the weight of the interrogative similarity and another with 0.0. In addition, we also computed a baseline metric, sentence similarity. It was computed as the term vector similarity where the terms in the vectors were taken from the phrase representation of the questions (i.e., the syntactic phrases generated by the shallow parser). Thus the terms here included the various wh-interrogative words as well as words that were dropped or changed by the paraphrase patterns (e.g., words instantiated with <methodN> in pattern (3) in Figure 2). This metric also produces a value between 0 and 1, so it is comparable to the frame similarity.</Paragraph>
<Paragraph position="3"> The determination of whether or not two frames (or questions) are paraphrases of each other depends on a threshold value: if the similarity value is above the threshold, the two frames/questions are judged to be paraphrases.</Paragraph>
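For concreteness, the similarity computation just described can be sketched as follows. This is our own minimal reading of the measure, not the paper's implementation: the slot inventory and the frame encoding (dictionaries mapping slot names to '?' or word strings) are assumptions, as is averaging the interrogative correspondence over the slots that are empty in either frame.

    import math

    SLOTS = ["q-focus", "actor", "verb", "theme"]  # hypothetical slot inventory

    def interrogative_similarity(f1, f2, slots=SLOTS):
        """Average slot-wise correspondence of empty categories: a slot
        scores 1 if both frames have '?' there, 0 otherwise.  Averaging
        over the slots empty in either frame is our assumption."""
        empty = [s for s in slots if f1.get(s) == "?" or f2.get(s) == "?"]
        if not empty:
            return 0.0
        return sum(f1.get(s) == "?" and f2.get(s) == "?" for s in empty) / len(empty)

    def case_role_similarity(f1, f2, slots=SLOTS):
        """Cosine of binary term vectors over the bag of words found in
        the non-empty-category slots of the two frames."""
        def bag(f):
            return {w for s in slots if f.get(s) not in (None, "?")
                      for w in f[s].split()}
        w1, w2 = bag(f1), bag(f2)
        if not w1 or not w2:
            return 0.0
        # For binary vectors the cosine reduces to |w1 & w2| / sqrt(|w1|*|w2|).
        return len(w1 & w2) / math.sqrt(len(w1) * len(w2))

    def frame_similarity(f1, f2, w_int=0.5):
        """Weighted sum of the two scores; the weights sum to 1.  A null
        frame (failed pattern match) yields similarity 0, per the paper."""
        if f1 is None or f2 is None:
            return 0.0
        return (w_int * interrogative_similarity(f1, f2)
                + (1.0 - w_int) * case_role_similarity(f1, f2))

Pairs whose frame_similarity exceeds the chosen threshold are judged paraphrases; w_int=0.5 and w_int=0.0 correspond to the two versions of the measure evaluated below.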
<Paragraph position="4"> With the 306 case frames in the testset, there were a total of 46665 (= 306 x 305 / 2) distinct combinations of frames, and 3811 of them were (true) paraphrases. (If either one of the frames is null, i.e., the pattern matching failed for the question, the frame similarity is 0.) After computing the three metrics (two versions of frame similarity, plus sentence similarity) for all pairs, we evaluated their performance by examining the trade-off between recall and rejection for varying threshold values. Recall is defined in the usual way, as the ratio of true positives (= # classified as paraphrase / # true paraphrases), and rejection is defined as the ratio of true negatives (= # classified as non-paraphrase / # true non-paraphrases). We chose rejection instead of precision or accuracy because those measures are not normalized for the number of instances in each classification category (# true paraphrases vs. # true non-paraphrases); since our testset had a skewed distribution (8% paraphrases, 92% non-paraphrases), those measures would have given scores in which the results for paraphrases were overshadowed by those for non-paraphrases.</Paragraph>
<Paragraph position="5"> Figure 5 shows the recall vs. rejection curves for the three metrics. As the figure shows, both versions of the frame similarity (FrSim 0.5 and FrSim 0.0 in the figure) outperformed the sentence similarity (Sent), suggesting that the semantic representation was considerably more effective for recognizing paraphrases than the syntactic representation. For example, FrSim 0.5 correctly recognized 90% of the true paraphrases while misclassifying only 10% of the non-paraphrases as paraphrases, whereas Sent misclassified slightly over 20% at the same 90% recall level. This is quite an encouraging result.</Paragraph>
<Paragraph position="6"> The figure also shows that FrSim 0.5 performed much better than FrSim 0.0. This means that the explicit representation of empty categories (or question types) contributed significantly to the paraphrase recognition. It also underscores the importance of considering the formulation of interrogatives when analyzing question sentences.</Paragraph>
</Section>
</Section>
</Paper>
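To make the evaluation procedure above concrete, here is a minimal sketch of computing the recall vs. rejection trade-off from pairwise similarity scores; the function and variable names are ours, not from the paper.

    def recall_rejection(scores, labels, threshold):
        """scores: one similarity value per frame pair; labels: True for
        the 3811 true paraphrase pairs, False for the rest.  Recall is
        the fraction of true paraphrases classified as paraphrases;
        rejection is the fraction of true non-paraphrases classified as
        non-paraphrases."""
        tp = sum(1 for s, y in zip(scores, labels) if y and s > threshold)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s <= threshold)
        pos = sum(1 for y in labels if y)
        neg = len(labels) - pos
        return tp / pos, tn / neg

    # Sweeping the threshold from 0 to 1 traces one recall-rejection
    # curve per metric (FrSim 0.5, FrSim 0.0, Sent), as in Figure 5:
    # for t in (i / 100 for i in range(101)):
    #     recall, rejection = recall_rejection(scores, labels, t)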