<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1070"> <Title>Using Machine Learning Techniques to Interpret WH-questions</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Data Collection </SectionTitle> <Paragraph position="0"> Our models were built from questions identified in a log of Web queries submitted to the Encarta encyclopedia service. These questions include traditional WH-questions, which begin with &quot;what&quot;, &quot;when&quot;, &quot;where&quot;, &quot;which&quot;, &quot;who&quot;, &quot;why&quot; and &quot;how&quot;, as well as imperative statements starting with &quot;name&quot;, &quot;tell&quot;, &quot;find&quot;, &quot;define&quot; and &quot;describe&quot;. We extracted 97,640 questions (removing consecutive duplicates), which constitute about 6% of the 1,649,404 queries in the log files collected during a period of three weeks in the year 2000. A total of 6,436 questions were tagged by hand. Two types of tags were collected for each question: (1) tags describing linguistic features, and (2) tags describing high-level informational goals of users. The former were obtained automatically, while the latter were tagged manually.</Paragraph> <Paragraph position="1"> We considered three classes of linguistic features: word-based, structural and hybrid.</Paragraph> <Paragraph position="2"> Word-based features indicate the presence of specific words or phrases in a user's question, which we believed showed promise for predicting components of his/her informational goals. These are words like &quot;make&quot;, &quot;map&quot; and &quot;picture&quot;. Structural features include information obtained from an XML-encoded parse tree generated for each question by NLPWin (Heidorn, 1999) - a natural language parser developed by the Natural Language Processing Group at Microsoft Research. We extracted a total of 21 structural features, including the number of distinct parts-of-speech (PoS) - NOUNs, VERBs, NPs, etc - in a question, whether the main noun is plural or singular, which noun (if any) is a proper noun, and the PoS of the head verb post-modifier.</Paragraph> <Paragraph position="3"> Hybrid features are constructed from structural and word-based information. Two hybrid features were extracted: (1) the type of head verb in a question, e.g., &quot;know&quot;, &quot;be&quot; or action verb; and (2) the initial component of a question, which usually encompasses the first word or two of the question, e.g., &quot;what&quot;, &quot;when&quot; or &quot;how many&quot;, but for &quot;how&quot; may be followed by a PoS, e.g., &quot;how ADVERB&quot; or &quot;how ADJECTIVE.&quot; We considered the following variables representing high-level informational goals: Information Need, Coverage Asked, Coverage Would Give, Topic, Focus, Restriction and LIST. Information about the state of these variables was provided manually by three people, with the majority of the tagging being performed under contract by a professional outside the research team.</Paragraph> <Paragraph position="4"> Information Need is a variable that represents the type of information requested by a user. We provided fourteen types of information need, including Attribute, IDentification, Process, Intersection and Topic Itself (which, as shown in Section 5, are the most common information needs), plus the additional category OTHER. 
<Paragraph position="4"> Information Need is a variable that represents the type of information requested by a user. We provided fourteen types of information need, including Attribute, IDentification, Process, Intersection and Topic Itself (which, as shown in Section 5, are the most common information needs), plus the additional category OTHER. As examples, the question &quot;What is a hurricane?&quot; is an IDentification query; &quot;What is the color of sand in the Kalahari?&quot; is an Attribute query (the attribute is &quot;color&quot;); &quot;How does lightning form?&quot; is a Process query; &quot;What are the biggest lakes in New Hampshire?&quot; is an Intersection query (a type of IDentification, where the returned item must satisfy a particular Restriction - in this case &quot;biggest&quot;); and &quot;Where can I find a picture of a bay?&quot; is a Topic Itself query (interpreted as a request for accessing an object directly, rather than obtaining information about the object).</Paragraph>
<Paragraph position="5"> Coverage Asked and Coverage Would Give are variables that represent the level of detail in answers. Coverage Asked is the level of detail of a direct answer to a user's question. Coverage Would Give is the level of detail that an information provider would include in a helpful answer. For instance, although the direct answer to the question &quot;When did Lincoln die?&quot; is a single date, a helpful information provider might add other details about Lincoln, e.g., that he was the sixteenth president of the United States, and that he was assassinated. This additional level of detail depends on the request itself and on the available information. However, here we consider only the former factor, viewing it as an initial filter that will guide the content planning process of an enhanced QA system. The distinction between the requested level of detail and the provided level of detail makes it possible to model questions for which the preferred level of detail in a response differs from the detail requested by the user. We considered three levels of detail for both coverage variables: Precise, Additional and Extended, plus the additional category OTHER. Precise indicates that an exact answer has been requested, e.g., a name or date (this is the value of Coverage Asked in the above example); Additional refers to a level of detail characterized by a one-paragraph answer (this is the value of Coverage Would Give in the above example); and Extended indicates a longer, more detailed answer.</Paragraph>
<Paragraph position="6"> Topic, Focus and Restriction each contain a PoS in the parse tree of a user's question. These variables represent the topic of discussion, the type of the expected answer, and information that restricts the scope of the answer, respectively. These variables take 46 possible values, e.g., NOUN, VERB and NP, plus the category OTHER. For each question, the tagger selected the most specific PoS that contains the portion of the question which best matches each of these informational goals. For instance, given the question &quot;What are the main traditional foods that Brazilians eat?&quot;, the Topic is NOUN (Brazilians), the Focus is ADJ+NOUN (traditional foods) and the Restriction is ADJ (main).</Paragraph>
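The tagging convention just described, selecting the most specific PoS whose span covers the words matching a goal, can be pictured with a small sketch. ParseNode and covering_constituent are hypothetical names, and the tree layout is greatly simplified relative to NLPWin's XML output.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ParseNode:
    label: str                       # constituent or PoS label, e.g. "NP", "ADJ", "NOUN"
    start: int                       # index of the first word covered by this node
    end: int                         # index one past the last word covered
    children: List["ParseNode"] = field(default_factory=list)

def covering_constituent(node: ParseNode, start: int, end: int) -> Optional[ParseNode]:
    """Return the most specific (deepest) node whose span contains [start, end)."""
    if not (node.start <= start and end <= node.end):
        return None
    for child in node.children:
        deeper = covering_constituent(child, start, end)
        if deeper is not None:
            return deeper            # a smaller constituent also covers the span
    return node                      # no child covers it, so this node is the most specific

# For "What are the main traditional foods that Brazilians eat?", asking for the
# constituent covering "traditional foods" would return the node whose label the
# tagger records as the Focus (ADJ+NOUN in the example above).
```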
<Paragraph position="7"> As shown in this example, it was sometimes necessary to assign more than one PoS to these target variables. At present, these composite assignments are classified as the category OTHER.</Paragraph>
<Paragraph position="8"> LIST is a boolean variable which indicates whether the user is looking for a single answer (False) or multiple answers (True).</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Predictive Model </SectionTitle>
<Paragraph position="0"> We built decision trees to infer high-level informational goals from the linguistic features of users' queries. One decision tree was constructed for each goal: Information Need, Coverage Asked, Coverage Would Give, Topic, Focus, Restriction and LIST. Our decision trees were built using dprog (Wallace and Patrick, 1993) - a procedure based on the Minimum Message Length principle (Wallace and Boulton, 1968).</Paragraph>
<Paragraph position="1"> The decision trees described in this section are those that yield the best predictive performance (obtained from a training set comprised of &quot;good&quot; queries, as described in Section 5). The trees themselves are too large to be included in this paper. However, we describe the main attributes identified in each decision tree. Table 2 shows, for each target variable, the size of the decision tree (in number of nodes) and its maximum depth, the attribute used for the first split, and the attributes used for the second split. Table 1 shows examples and descriptions of the attributes in Table 2. (Note that &quot;what&quot; and &quot;who&quot; are tagged as PRONOUNs, and that the clue attributes, e.g., Comparison clues, represent groupings of different clues that at design time were considered helpful in identifying certain target variables.) We note that the decision tree for Focus splits first on the initial component of a question, e.g., &quot;how ADJ&quot;, &quot;where&quot; or &quot;what&quot;, and that one of the second-split attributes is the PoS following the initial component. These attributes were also used to build the hand-crafted rules employed by the QA systems described in Section 2, which concentrate on determining the type of the expected answer (which is similar to our Focus).</Paragraph>
<Paragraph position="2"> However, our Focus decision tree includes additional attributes in its second split (these attributes are added by dprog because they improve predictive performance on the training data).</Paragraph>
</Section>
</Paper>
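The per-goal training setup described in Section 4 above can be sketched as follows. This is not the authors' dprog/MML implementation; it swaps in scikit-learn's entropy-based DecisionTreeClassifier purely for illustration, and load_tagged_questions(), the feature columns and the minimum-leaf setting are assumptions introduced for this sketch. Reading off each fitted tree's size, depth and root split mirrors the kind of summary reported in Table 2.

```python
from typing import Dict, List

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The seven high-level informational goals, one classifier per goal.
GOALS = ["InformationNeed", "CoverageAsked", "CoverageWouldGive",
         "Topic", "Focus", "Restriction", "LIST"]

def load_tagged_questions() -> pd.DataFrame:
    """Hypothetical loader: one row per hand-tagged question, with
    linguistic-feature columns plus one column per goal variable."""
    raise NotImplementedError

def train_goal_trees(data: pd.DataFrame,
                     feature_cols: List[str]) -> Dict[str, DecisionTreeClassifier]:
    # Turn categorical linguistic features into indicator columns.
    X = pd.get_dummies(data[feature_cols])
    trees = {}
    for goal in GOALS:
        tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5)
        tree.fit(X, data[goal])
        trees[goal] = tree
        root = tree.tree_.feature[0]
        first_split = X.columns[root] if root >= 0 else "(single leaf)"
        print(f"{goal}: {tree.tree_.node_count} nodes, depth {tree.get_depth()}, "
              f"first split on {first_split}")
    return trees
```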