<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1024">
  <Title>A Hybrid Approach to Natural Language Web Search</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Data Analysis
</SectionTitle>
    <Paragraph position="0"> To generate optimal keyword queries from natural language questions, we first analyzed a set of 502 questions related to the purchasing and support of ThinkPads (notebook computers) and their accessories, such as &quot;How do I set up hibernation for my ThinkPad?&quot; and &quot;Show me all p3 laptops.&quot; Our analysis focused on three tasks. First, we attempted to identify an exhaustive set of correct webpages for each question, where a correct webpage is one that contains either an answer to the question or a hyperlink to such a page. Second, we manually formulated successful keyword queries from the question, i.e., queries which retrieved at least one correct webpage. Third, we attempted to discover general patterns in how the natural language questions may be transformed into successful keyword queries.</Paragraph>
    <Paragraph position="1"> Our analysis eliminated 110 questions for which no correct webpage was found. Of the remaining 392 questions, we identified, on average, 4.37 correct webpages and 1.58 successful queries per question. We found that the characteristics of successful queries varied greatly. In the simplest case, a successful query may contain all the content bearing NPs in the question, such as thinkpad AND &quot;answering machine&quot; for &quot;Can I use my ThinkPad as an answering machine?&quot;2 In the vast majority of cases, however, more complex transformations were applied to the question to result in a successful query. For instance, a successful query for &quot;How do I hook an external mouse to my laptop?&quot; is (mouse OR mice) AND thinkpad AND +url:support. In this case, the head noun mouse was inflected,3 the premodifier external was dropped, hook was deleted, laptop was replaced by thinkpad, and a URL constraint was applied.</Paragraph>
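The transformations in the "external mouse" example can be sketched as a small pipeline. This is an illustrative reconstruction, not the paper's implementation; the synonym and inflection tables are toy stand-ins for the real dictionaries:

```python
SYNONYMS = {"laptop": "thinkpad"}    # toy synonym dictionary
INFLECTIONS = {"mouse": "mice"}      # toy morphological variants

def transform(nps, verb=None, keep_modifiers=False, keep_verb=False, url=None):
    """Apply the transformations from the example above: modifier and verb
    dropping, synonym replacement, morphological expansion, and an optional
    URL constraint. Rule inventory and defaults are illustrative."""
    terms = []
    for np in nps:
        words = np.split() if keep_modifiers else [np.split()[-1]]
        words[-1] = SYNONYMS.get(words[-1], words[-1])   # laptop -> thinkpad
        head = words[-1]
        term = " ".join(words)
        if head in INFLECTIONS:                          # mouse -> (mouse OR mice)
            term = "(%s OR %s)" % (term, INFLECTIONS[head])
        terms.append(term)
    if keep_verb and verb:
        terms.append(verb)
    query = " AND ".join(terms)
    if url:
        query += " AND +url:" + url
    return query
```

Running this on the question's NPs with the verb dropped reproduces the successful query from the text.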
    <Paragraph position="2"> We observed that in our corpus, most successful queries can be derived by applying one or more transformation rules to the NPs and verbs in the questions. Table 1 shows the manually induced commonly-used transformation rules based on our corpus analysis.</Paragraph>
    <Paragraph position="3"> Though the rules were quite straightforward to identify, the order in which they should be applied to yield optimal overall performance was non-intuitive. In fact, the best order we manually derived did not yield sufficient performance improvement over our baseline (see Section 7). We further hypothesize that the optimal rule application sequence may be dependent on question characteristics. For example, DropVerb may be a higher priority rule for buy questions than for support questions, since the verbs indicative of buy questions (typically &quot;buy&quot; or &quot;sell&quot;) are often absent in the target product pages. Therefore, we investigated a machine learning approach to automatically obtain the optimal rule application sequence.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 A Reinforcement Learning Approach to
Query Formulation
</SectionTitle>
    <Paragraph position="0"> Our problem consists of obtaining an optimal strategy for choosing transformation rules to generate successful queries. A key feature of this problem is that feedback during training is often delayed, i.e., the positive effect of applying a rule may not be apparent until a successful query is constructed after the application of other rules. Thus, we adopt a reinforcement learning approach to obtain this optimal strategy.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Q Learning
</SectionTitle>
      <Paragraph position="0"> We adopted the Q learning paradigm (Watkins, 1989; Mitchell, 1997) to model our problem as a set of possible states, S, and a set of actions, A, which can be performed to alter the current state. While in state s ∈ S and performing action a ∈ A, the learner receives a reward r(s,a) and advances to state s′ = δ(s,a).</Paragraph>
      <Paragraph position="1"> To learn an optimal control strategy that maximizes the cumulative reward over time, an evaluation function Q(s,a) is defined as follows:</Paragraph>
      <Paragraph position="2"> Q(s,a) = r(s,a) + γ max_a′ Q(δ(s,a), a′)    (1)</Paragraph>
      <Paragraph position="3"> In other words, Q(s,a) is the immediate reward, r(s,a), plus the discounted maximum future reward starting from the new state δ(s,a).</Paragraph>
      <Paragraph position="4"> The Q learning algorithm iteratively selects an action and updates ^Q, an estimate of Q, as follows:</Paragraph>
      <Paragraph position="5"> ^Q_n(s,a) ← (1 − α_n) ^Q_{n−1}(s,a) + α_n [r(s,a) + γ max_a′ ^Q_{n−1}(s′,a′)]    (2)</Paragraph>
      <Paragraph position="6"> where s′ = δ(s,a) and α_n is inversely proportional to the number of times a state/action pair &lt;s,a&gt; has been visited up to the nth iteration of the algorithm.4 Once the system learns ^Q, it can select from the possible actions in state s based on ^Q(s,a_i).</Paragraph>
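The update rule above, with its visit-count-decayed weight, can be sketched in a few lines of Python. This is a generic tabular Q-learning sketch under our own assumptions (a uniformly random exploration policy, a discount of 0.9), not the paper's implementation:

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, reward, gamma=0.9,
               episodes=200, max_steps=15, seed=0):
    """Tabular Q-learning with a learning rate alpha_n inversely
    proportional to the number of times <s, a> has been visited,
    as in update rule (2)."""
    rng = random.Random(seed)
    Q = defaultdict(float)      # ^Q estimates, keyed by (state, action)
    visits = defaultdict(int)   # visit counts per (state, action) pair
    for _ in range(episodes):
        s = rng.choice(states)
        for _ in range(max_steps):
            a = rng.choice(actions)          # exploratory action selection
            s2 = delta(s, a)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]     # decaying weighted average
            target = reward(s, a) + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

On a toy deterministic chain, the learned values prefer the action that leads toward the rewarding transition.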
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Query Formulation Using Q Learning
</SectionTitle>
      <Paragraph position="0"> To formulate our problem in the Q learning paradigm, we represent a state as a 6-tuple, &lt;qtype, url constraint, np phrase, num nps, num modifiers, num verbs&gt;, where: + qtype is buy or support depending on question classification.</Paragraph>
      <Paragraph position="1"> + url constraint is true or false, and determines if manually predefined URL restrictions will be applied in the query.</Paragraph>
      <Paragraph position="2"> + np phrase is true or false, and determines whether each NP will be searched for as a phrase or a conjunction of words.</Paragraph>
      <Paragraph position="3"> + num nps is an integer between 1 and 3, and determines how many NPs will be included in the query.</Paragraph>
      <Paragraph position="4"> + num modifiers is an integer between 0 and 2, and indicates the maximum number of premodifiers in each NP.</Paragraph>
      <Paragraph position="5"> + num verbs is 0 or 1, and determines if the verb will be included in the query.</Paragraph>
      <Paragraph position="6"> 4Equation (2) modifies (1) by taking a decaying weighted average of the current ^Q value and the new value to guarantee convergence of ^Q in non-deterministic environments. We explain in the next section why our query formulation problem in the Q learning framework is non-deterministic. This representation is chosen based on the rules identified in Section 3. The actions, A, include the first 5 actions in Table 1, and the &quot;undo&quot; counterpart for each action.5 Except for qtype, which remains static for a question, each remaining element in the tuple can be altered by one of the 5 pairs of actions in a straightforward manner. The state, s, and the question, q, generate a unique keyword query which results in a hit list, h(s,q). The hit list is evaluated for correctness, and the result is used to define the reward function, where s′ = δ(s,a). Note that our system operates in a non-deterministic environment because the reward is dependent not only on s and a, but also on q.6 Having defined S, A, δ, and r, ^Q is determined by applying the Q learning algorithm, using the update function in (2), to our corpus of 392 questions. For each question, an initial state is randomly selected within the bounds of the question. The system then iteratively selects and applies actions, and updates ^Q until a successful query is generated or the maximum number of iterations is reached (in our implementation, 15). The training algorithm iterates over all questions in the training set and terminates when ^Q converges.</Paragraph>
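As a concrete illustration, the state tuple and the relax/undo action pairs described above might be encoded as follows. Only ReinstateModifier, RelaxNP, ConstrainNP, RelaxURL, and DropVerb are named in the paper; the remaining action names and the Python field names are our own guesses for the rules in Table 1:

```python
from collections import namedtuple

# State tuple mirroring Section 4.2; field names are ours.
State = namedtuple(
    "State",
    ["qtype", "url_constraint", "np_phrase", "num_nps", "num_modifiers", "num_verbs"],
)

# Five relax/undo action pairs, one per mutable field; qtype is never altered.
# Values stay within the bounds given in the text (num_nps in 1..3, etc.).
ACTIONS = {
    "RelaxURL":          lambda s: s._replace(url_constraint=False),
    "ConstrainURL":      lambda s: s._replace(url_constraint=True),
    "RelaxNP":           lambda s: s._replace(np_phrase=False),
    "ConstrainNP":       lambda s: s._replace(np_phrase=True),
    "DropNP":            lambda s: s._replace(num_nps=max(1, s.num_nps - 1)),
    "ReinstateNP":       lambda s: s._replace(num_nps=min(3, s.num_nps + 1)),
    "DropModifier":      lambda s: s._replace(num_modifiers=max(0, s.num_modifiers - 1)),
    "ReinstateModifier": lambda s: s._replace(num_modifiers=min(2, s.num_modifiers + 1)),
    "DropVerb":          lambda s: s._replace(num_verbs=0),
    "ReinstateVerb":     lambda s: s._replace(num_verbs=1),
}

def apply_action(state, name):
    """Transition function delta(s, a): returns the altered state."""
    return ACTIONS[name](state)
```

Because the state is an immutable tuple, it can serve directly as a key in a tabular ^Q estimate.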
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5 RISQUE: A Hybrid System for Natural Language Search
5.1 System Overview
</SectionTitle>
      <Paragraph position="0"> In addition to motivating machine learning based query transformation as our central approach to natural language search, our analysis revealed the need for several other key system components. As shown in Figure 1, RISQUE adopts a hybrid architecture that combines the utility of traditional knowledge-based methods and statistical approaches. 5Morphological and synonym expansions are applied at the outset, which was shown to result in better performance than optional application of those rules.</Paragraph>
      <Paragraph position="1"> Given a question, RISQUE first performs question analysis by extracting pertinent information to be used in query formulation, such as the NPs, VPs, and question type, and then orders the NPs in terms of their relative salience. This information is then used for hit list construction by two modules. The first component is the hub-page identifier, which retrieves, if possible, a hub page for the most salient NP in the question. The second component is the Q learning based query formulation and retrieval module that iteratively generates queries via transformation rule application and issues them to the search engine.</Paragraph>
      <Paragraph position="2"> The results from both processes are combined and accumulated until n distinct hits are retrieved.</Paragraph>
      <Paragraph position="3"> In addition to the above components, RISQUE employs an ontology for the ThinkPad domain, which consists of 1) a hierarchy of about 500 domain objects, 2) nearly 400 instances of relationships, such as isa and accessory-of, between objects, and 3) a synonym dictionary containing about 1000 synsets. The ontology was manually constructed and took approximately 2 person-months for coverage in the ThinkPad domain. It provides pertinent information to the question pre-processing and query formulation modules, which we will describe in the next sections.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Question Pre-Processing
5.2.1 Question Understanding
</SectionTitle>
      <Paragraph position="0"> RISQUE's question understanding component is based primarily on a rule-driven parser in the slot grammar framework (McCord, 1989). The resulting parse tree is first analyzed for NP/VP extraction. Each NP includes the head noun and up to two premodifiers, which covers most NPs in our domain. The NPs are further processed by a named-entity recognizer (Prager et al., 2000; Wacholder et al., 1997), with reference to domain-specific proper names in our ontology. Recognized compound terms, such as &quot;hard drive&quot;, are treated as single entities, rather than as head nouns (&quot;drive&quot;) with premodifiers (&quot;hard&quot;). This prevents part of the compound term from being dropped when the DropModifier transformation rule is applied.</Paragraph>
      <Paragraph position="1"> The parse tree is also used to classify the question as buy or support. The classifier utilizes a set of rules based on lexical and part-of-speech information. For example, &quot;how&quot; tagged as an adverb (as in &quot;How do I ...&quot;) suggests a support question, while &quot;buy/sell&quot; used as a verb indicates a buy question. These rules were manually derived based on our training data.</Paragraph>
      <Paragraph position="2">  Our analysis showed that when a successful query is to contain fewer NPs than in the question, it is not straightforward to determine which NPs to eliminate, as it requires both domain and content knowledge. However, we observed that less salient NPs are often removed first, where salience indicates the importance of the term in the search process. The relative salience of NPs in this context can, for the most part, be determined based on the ontological relationship between the NPs and knowledge about the website organization. For instance, if A is an accessory-of B, then A is more salient than B since, on our website, accessories typically have their own webpages with significantly more information than pages about, for instance, the ThinkPads with which they are compatible.</Paragraph>
      <Paragraph position="3"> Our NP sequencer utilizes a rule-based reasoning mechanism to rank a set of NPs based on their relative salience, as determined by their relationship in the ontology.7 Objects not present in the ontology are considered less important than those present. This process produces a list of NPs ranked in decreasing order of salience.</Paragraph>
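A minimal sketch of salience ranking under these assumptions, with a set of known terms and an accessory-of mapping standing in for the real ontology's relationship instances:

```python
def rank_nps(nps, ontology, accessory_of):
    """Rank NPs in decreasing salience. `ontology` is a set of known terms;
    `accessory_of` maps an accessory to its host object. Both structures
    are toy simplifications of the ontology described in Section 5.1."""
    def salience(np):
        if np not in ontology:
            return 0                 # unknown objects rank last
        depth, cur = 1, np
        while cur in accessory_of:   # an accessory outranks its host
            depth, cur = depth + 1, accessory_of[cur]
        return depth
    return sorted(nps, key=salience, reverse=True)
```

For the USB hub example of Section 6, this ranks the accessory above the ThinkPad it attaches to.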
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Hub-Page Identifier
</SectionTitle>
      <Paragraph position="0"> As with most websites, the ThinkPad pages on www.ibm.com are organized hierarchically, with a dozen or so hub-pages that serve as good starting points for specific sub-topics, such as mobile accessories and personal printers. However, since these hub-pages are typically not content-rich, they often do not receive high scores from the search engine (over which we have no control). Thus, we developed a mechanism to explicitly retrieve these hub-pages when appropriate, and to combine its results with the outcome of the actual search process.</Paragraph>
      <Paragraph position="1"> The hub-page identifier consists of a mapping from a subset of the named entities in the ontology to their corresponding hub-pages.8 For each question, the hub-page identifier retrieves the hub-page for the most salient NP, if possible, which is presented as the first entry in the hit list.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Reinforcement Learning Based Query
Formulation
</SectionTitle>
      <Paragraph position="0"> This main component of RISQUE iteratively formulates queries, issues them to the search engine, and accumulates the results to construct the hit list. The query formulation process starts with the most constrained query, and each new query is a relaxation of a previously issued query, obtained by applying one or more transformation rules to the current query.</Paragraph>
      <Paragraph position="1"> The transformation rules are applied in the order obtained by the Q learning algorithm as described in Section 4.2.</Paragraph>
      <Paragraph position="2"> The initial state of the query formulation process is as follows: url constraint and np phrase are set to true, while the other attributes are set to their respective maximum values based on the outcome of the question understanding process. This initial state represents the most constrained query possible for the given question, and allows for subsequent relaxation via the application of transformation rules.</Paragraph>
      <Paragraph position="3"> 7We are aware that factors involving deeper question understanding come into play in determining relative salience. We leave investigation of such features as future work.</Paragraph>
      <Paragraph position="4"> 8For reasons of robustness, we actually map a named entity to manually selected keywords which, when issued to the search engine, retrieve the desired hub-page as the first hit.</Paragraph>
      <Paragraph position="5"> When a state s is visited, a query is generated based on s and the question. The query terms are instantiated based on the values of np phrase, num nps, num modifiers, and num verbs in s and the question itself, while URL constraints may be applied based on url constraint and qtype. Finally, synonym expansion is applied based on the synonym dictionary in the ontology, while morphological expansion is performed on all NPs using a rule-based inflection procedure.</Paragraph>
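The instantiation step described above might look as follows. The state tuple, the synonym handling, and the buy-side URL keyword are assumptions; only the +url:support form is attested in the paper's examples:

```python
from collections import namedtuple

# Minimal stand-in for the state tuple of Section 4.2; field names are ours.
QueryState = namedtuple(
    "QueryState",
    "qtype url_constraint np_phrase num_nps num_modifiers num_verbs",
)

def build_query(state, ranked_nps, verbs, synonyms=None):
    """Instantiate a keyword query from a state and the pre-processed
    question, in the AND/OR/+url: syntax of the paper's examples."""
    synonyms = synonyms or {}
    terms = []
    for np in ranked_nps[:state.num_nps]:
        words = np.split()
        kept = words[-(state.num_modifiers + 1):]   # head + allowed premodifiers
        term = '"%s"' % " ".join(kept) if state.np_phrase else " AND ".join(kept)
        head = words[-1]
        if head in synonyms:                        # synonym expansion
            term = "(%s OR %s)" % (term, synonyms[head])
        terms.append(term)
    if state.num_verbs and verbs:
        terms.append(verbs[0])
    query = " AND ".join(terms)
    if state.url_constraint:
        # "+url:shop" for buy questions is our guess; "+url:support" is attested
        query += " AND +url:" + ("support" if state.qtype == "support" else "shop")
    return query
```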
      <Paragraph position="6"> After a query is issued, the search results are incorporated into the hit list, and duplicate hits are removed. A transformation rule a_max = argmax_a ^Q(s,a) is applied to yield the new state.</Paragraph>
      <Paragraph position="7"> ^Q(s,a_max) is then decreased to remove it from further consideration. This iterative process continues until the hit list contains 10 or more elements.</Paragraph>
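The retrieval loop of this section can be sketched as follows, with run_query and apply_action standing in for the search engine and the transition function (both hypothetical callables):

```python
def populate_hit_list(question, state, Q, actions, run_query, apply_action,
                      n=10, max_steps=15):
    """Iterative relaxation loop of Section 5.4: issue the current state's
    query, merge new hits, then apply the transformation rule with the
    highest ^Q value and suppress that (state, action) pair from reuse."""
    hits, q = [], dict(Q)
    for _ in range(max_steps):
        for h in run_query(state, question):
            if h not in hits:                # remove duplicate hits
                hits.append(h)
        if len(hits) >= n:                   # stop at n or more elements
            break
        best = max(actions, key=lambda a: q.get((state, a), 0.0))
        q[(state, best)] = float("-inf")     # remove a_max from consideration
        state = apply_action(state, best)
    return hits[:n]
```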
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Example
</SectionTitle>
    <Paragraph position="0"> To illustrate RISQUE's end-to-end operation, consider the question &quot;Do you sell a USB hub for a ThinkPad?&quot; The question is classified as a buy question, given the presence of the verb sell. In addition, two NPs are identified:
NP1: head = USB hub
NP2: head = ThinkPad
Note that &quot;USB hub&quot; is identified as a compound noun by our named-entity recognizer. The NP sequencer determines that USB hub is more salient than ThinkPad since the former is an accessory of the latter.</Paragraph>
    <Paragraph position="1"> The hub-page identifier finds the networking devices hub-page for USB hub, presented as the first entry in the hit list in Figure 2, where correct webpages are boldfaced.</Paragraph>
    <Paragraph position="2"> Next, RISQUE invokes its iterative query formulation process to populate the remaining hit list entries. The initial state is &lt;qtype = buy, url constraint = true, np phrase = true, num nps = 2, num modifiers = 0, num verbs = 0&gt;. This state generates the query shown as &quot;Query 2&quot; in Figure 2, which results in 6 hits, of which 4 are correct. RISQUE selects the optimal transformation rule for the current state, which is ReinstateModifier. Since neither NP has any modifier, a second rule, RelaxNP, is attempted, which results in a new query that does not retrieve any previously unseen hits. Next, RISQUE selects ConstrainNP and RelaxURL, resulting in the query shown as &quot;Query 3&quot; in Figure 2.9 Note that relaxing the URL constraint results in retrieval of USB hub support pages.</Paragraph>
  </Section>
</Paper>