<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1605"> <Title>Interrogative Reformulation Patterns and Acquisition of Question Paraphrases</Title> <Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> 2 Paraphrasing Patterns for Questions </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.1 Training Data </SectionTitle> <Paragraph position="0"> Paraphrasing patterns were extracted from a large corpus of question sentences which we had used in our previous work (Tomuro and Lytinen, 2001; Lytinen and Tomuro, 2002). It consisted of 12938 example questions taken from 485 Usenet FAQ files. In the current work, we used a subset of that corpus consisting of examples whose question types were PRC (procedure), RSN (reason) or ATR (atrans).</Paragraph> <Paragraph position="1"> Those question types are members of the 12 question types we had defined in our previous work (Tomuro and Lytinen, 2001). As described in that paper, PRC questions are typical 'how-to' questions and RSN questions are 'why' questions. The type ATR (for ATRANS in Conceptual Dependency (Schank, 1973)) is essentially a special case of PRC, where the (desire for the) transfer of possession is strongly implied. An example question of this type would be &quot;How can I get tickets for the Indy 500?&quot;. Not only do ATR questions undergo the paraphrasing patterns of PRC questions, they also allow reformulations which ask for the (source or destination) location or entity of the thing(s) being sought, for instance, &quot;Where can I get tickets for the Indy 500?&quot; and &quot;Who sells tickets for the Indy 500?&quot;. We had observed that such ATR questions were in fact asked quite frequently in question-answering systems. null Also those question types seem to have a richer set of paraphrasing patterns than other types (such as definition or simple reference questions given in TREC competitions (Voorhees, 2000)) with regard to the interrogative reformulation. In the corpus, there were 2417, 1022 and 968 questions of type PRC, RSN, ATR respectively, and they constituted the training data in the current work.</Paragraph> <Paragraph position="2"> Although we did not use it in the current work, we also had access to the user log of AskJeeves system (http://www.askjeeves.com). We observed that a large portion of the user questions were ATR questions.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Paraphrase Patterns </SectionTitle> <Paragraph position="0"> The aim of our paraphrasing patterns is to account for different syntactic variations of interrogative words. As we showed examples in section 1, the interrogative part of a question adds a syntactic superstructure to the sentence part, thereby making it difficult for an automatic system to get to the core of the question. By removing this syntactic overhead, we can derive the canonical representations of questions, and by using them we can perform a many-to-one matching instead of many-to-many when we compare questions for similarity.</Paragraph> <Paragraph position="1"> In the pre-processing stage, we first applied a shallow parser to each question in the training data and extracted its phrase structure. 
<Paragraph position="1"> In the pre-processing stage, we first applied a shallow parser to each question in the training data and extracted its phrase structure. The parser we used is customized for interrogative sentences, and its complexity is equivalent to a finite-state machine.</Paragraph>
<Paragraph position="2"> The output of the parser is a list of phrases in which each phrase is labeled with its syntactic function in the question (subject, verb, object, etc.). Passive questions are converted to active voice in the last step of the parser by inverting the subject and object noun phrases. Then, using the pre-processed data, we manually inspected all questions and defined patterns which seemed to apply to more than two instances. By this enumeration process, we derived a total of 127 patterns, consisting of 18, 23 and 86 patterns for PRC, RSN and ATR respectively.</Paragraph>
<Paragraph position="3"> Each pattern is expressed in the form of a rule, where the left-hand side (LHS) expresses the phrase structure of a question and the right-hand side (RHS) expresses the semantic case frame representation of the question. When a rule is matched against a question, the LHS of the rule is compared with the question first, and if they match, the RHS is generated using the variable binding obtained from the LHS. Figure 2 shows some example patterns.</Paragraph>
<Paragraph position="4"> In a pattern, both the LHS and the RHS are a set of slot-value tuples. In each tuple, the first element, which is always prefixed with :, is the slot name and the remaining elements are the values. Slot names which appear on the LHS (:S, :V, :O, etc.) relate to syntactic phrases, while those on the RHS (:actor, :theme, :source, etc.) indicate semantic cases. A slot value can be either a variable, indicated by a symbol enclosed in <..> (e.g. <NPS>), or a constant (e.g. how). A variable can be either constrained (e.g. <obtainV>) or unconstrained (e.g. <NPS>, <NPO>). Constrained variables are defined separately, and they specify that a phrase to be matched must satisfy certain conditions. Most of the conditions are lexical constraints - a phrase must contain a word of a certain class. For instance, <obtainV> denotes a word class 'obtainV', which includes words such as "obtain", "get", "buy" and "purchase". Word classes are groupings of words that appeared in the training data and have similar meanings (i.e., synonyms), and they were developed in tandem with the paraphrase patterns. Whether constrained or unconstrained, a variable gets bound to one or more words in the matched question (if possible for constrained variables). A constant indicates a word and requires that word to exist in the tuple. 'NIL' and '?' are special constants: 'NIL' requires the tuple (the phrase in the matched question) to be empty, and '?' indicates that the slot is an empty category. Each rule is also given a priority level (e.g. 3 in pattern (2)), with a larger number indicating a higher priority.</Paragraph>
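The following sketch (ours, not the authors' implementation) illustrates the rule formalism just described. The specific slot names, the toy word classes and the hand-written shallow-parse dictionaries are assumptions made for illustration; only the LHS/RHS matching idea itself comes from the paper.

```python
WORD_CLASSES = {
    # Word classes group synonyms seen in the training data; 'obtainV' is
    # the example class mentioned in the text, 'sellV' is hypothetical.
    "obtainV": {"obtain", "get", "buy", "purchase"},
    "sellV": {"sell", "sells"},
}

def match_slot(pattern_val, phrase, binding):
    """Match one LHS slot value against one parsed phrase (list of words)."""
    if pattern_val == "NIL":                          # phrase must be empty
        return phrase == []
    if pattern_val.startswith("<"):                   # variable
        name = pattern_val.strip("<>")
        if name in WORD_CLASSES:                      # constrained variable
            if not any(w in WORD_CLASSES[name] for w in phrase):
                return False
        binding[pattern_val] = phrase                 # bind variable to words
        return True
    return pattern_val in phrase                      # constant word

def apply_rule(rule, parsed_question):
    """Return the RHS case frame if the LHS matches, else None."""
    binding = {}
    for slot, value in rule["lhs"].items():
        if not match_slot(value, parsed_question.get(slot, []), binding):
            return None
    # Instantiate the RHS under the binding; '?' marks an empty category.
    return {case: " ".join(binding[v]) if v in binding else v
            for case, v in rule["rhs"].items()}

# A rule in the spirit of pattern (2): "How can I get X?" -> ATR case frame.
rule2 = {"priority": 3,
         "lhs": {":WH": "how", ":S": "<NPS>", ":V": "<obtainV>", ":O": "<NPO>"},
         "rhs": {":actor": "<NPS>", ":verb": "obtain", ":theme": "<NPO>",
                 ":proc": "?", ":source": "?"}}

# A rule in the spirit of pattern (4): "Who sells X?" -> the same ATR frame,
# with the verb changed to "obtain" and the implicit actor filled with "I".
rule4 = {"priority": 3,
         "lhs": {":WH": "who", ":S": "NIL", ":V": "<sellV>", ":O": "<NPO>"},
         "rhs": {":actor": "I", ":verb": "obtain", ":theme": "<NPO>",
                 ":proc": "?", ":source": "?"}}

# Hypothetical shallow-parser outputs (phrases labeled by syntactic role).
q1 = {":WH": ["how"], ":S": ["I"], ":V": ["get"],
      ":O": ["tickets", "for", "the", "Indy", "500"]}
q2 = {":WH": ["who"], ":V": ["sells"],
      ":O": ["tickets", "for", "the", "Indy", "500"]}

print(apply_rule(rule2, q1) == apply_rule(rule4, q2))   # True: same frame
```

Note how the two rules send two different surface forms to one and the same case frame, which is exactly what makes the many-to-one comparison above possible.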
<Paragraph position="6"> In the example patterns shown in Figure 2, pattern (1) matches a typical 'how-to' question such as "How do I make beer?". Its meaning, according to the case frame generated by the RHS, would be "I" for the actor, "make" for the verb, "beer" for the theme, and the empty category :proc (for procedure). Patterns (2) through (4) are rules for ATR questions. Notice that they all have two empty categories - :proc and :source - consistent with our definition of type ATR. Also notice that the semantic case roles are taken from various syntactic phrases: pattern (2) takes the actor and theme from the syntactic subject and object straightforwardly, while pattern (3), which matches a question such as "What is a good way to buy tickets for the Indy 500?", takes the theme from the object in the infinitival phrase (:NP) and fills the actor with "I", which is implicit in the question. Pattern (4), which matches a question such as "Who sells tickets for the Indy 500?", changes the verb to "obtain" as well as filling the implicit actor with "I". In this way, ATR paraphrases are mapped to identical case frames (modulo variable binding).</Paragraph>
</Section>
</Section>
<Section position="4" start_page="1" end_page="3" type="metho">
<SectionTitle>3 Acquisition of Question Paraphrases</SectionTitle>
<Paragraph position="0"> To evaluate the question paraphrase patterns, we used as test data the set of question paraphrases which we had acquired in our previous work (Tomuro and Lytinen, 2001). In that work, we obtained question paraphrases in the following way. First, we selected a total of 35 questions from 5 FAQ categories: astronomy, copyright, gasoline, mutual fund and tea. Then we created a web site where users could enter paraphrases for any of the 35 questions. (In order to give a context to a question, we put a link ("wanna know the answer?") to the actual Q&A pair in the FAQ file for each sample question.) Figure 3 shows a snapshot of the site when the astronomy FAQ is displayed.</Paragraph>
<Paragraph position="1"> During the two weeks the site was kept public, a total of 1000 paraphrases were entered. We then inspected each entry and discarded ill-formed ones (such as keywords or boolean queries) and incorrect paraphrases. This process left us with 714 correct paraphrases (including the original 35 questions).</Paragraph>
<Paragraph position="2"> Figure 4 shows two sets of example paraphrases entered by the site visitors. In each set, the first sentence in bold-face is the original question (with its question type). In the paraphrases of the first question, we see more variations of the interrogative part of ATR questions. For instance, 1c explicitly refers to the source location/entity as "store" and 1d uses "place". Those words are essentially hyponyms/specializations of the concept 'location'.</Paragraph>
<Paragraph position="3"> Paraphrases of the second question, on the other hand, show variations in the sentence part of the questions. The expression "same face" in the original question is rephrased as "one side" (2a), "same side" (2b), "not .. other side" (2c) and "dark side" (2f). The verb is changed from "show" to "face" (2b), "see" (2c, 2d) and "look" (2e). Those rephrasings are rather subtle, requiring deep semantic knowledge and inference beyond lexical semantics, that is, common-sense knowledge.</Paragraph>
<Paragraph position="4"> To see the kinds of rephrasing the web users entered, we categorized the 679 (= 714 - 35) paraphrased questions roughly into the following 6 categories:
(1) Lexical substitution - synonyms; involves no or minimal sentence transformation
(2) Passivization
(3) Verb denominalization - e.g. "destroy" vs. "destruction"
(4) Lexical semantics & inference - e.g. "show" vs. "see"
(5) Interrogative reformulation - variations in the interrogative part
(6) Common-sense - e.g. "dark side of the Moon"
Table 1 shows the breakdown by those categories. (If a paraphrase fell under two or more categories, the one with the highest number was chosen.)</Paragraph>
<Paragraph position="5">
Table 1: Breakdown of the 679 paraphrases by category
(1) Lexical substitution            168   (25%)
(2) Passivization                    37    (5%)
(3) Verb denominalization            18    (3%)
(4) Lexical semantics & inference   107   (16%)
(5) Interrogative reformulation     339   (50%)
(6) Common-sense                     10    (1%)
Total                               679  (100%)
</Paragraph>
<Paragraph position="6"> As the table shows, interrogative reformulation had the largest proportion. This was partly because all transformations into questions that start with "What" were classified as this category. But the data indeed contained many instances of transformation between different interrogatives (why ↔ how ↔ where ↔ who, etc.). From the statistics above, we can thus see the importance of understanding the reformulations of the interrogatives. As for the other categories, lexical substitution had the next largest proportion, which means that a fair number of users entered relatively simple transformations. On this point, Lin and Pantel (2001) make a comment on manually generated paraphrases (as opposed to automatically extracted ones): "It is difficult for humans to generate a diverse list of paraphrases, given a starting formulation and no context". Our data is indeed in agreement with their observation.</Paragraph>
</Section>
</Paper>