<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1024">
  <Title>A Hybrid Approach to Natural Language Web Search</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
7 Performance Evaluation and Analysis
</SectionTitle>
    <Paragraph position="0"> We evaluated RISQUE's performance on 102 questions in the ThinkPad domain that were previously unseen by both RISQUE's knowledge-based and statistical components. The top 10 hits returned by RISQUE for each question were manually evaluated for correctness as in Section 3. A 2NP baseline was obtained by extracting up to two of the most salient NPs in each question, searching for the conjunction of all words in the NPs, and manually evaluating the top 10 hits returned.</Paragraph>
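As an illustration of the 2NP baseline, the following minimal sketch builds a conjunctive keyword query from at most two noun phrases. It assumes the noun phrases have already been extracted and ranked by salience (the extraction step is not shown). The function name `two_np_query` and the `AND` operator are hypothetical, since the search engine's actual query syntax is not specified in this section.

```python
def two_np_query(ranked_nps, max_nps=2):
    """Build a conjunctive keyword query from the top-ranked noun phrases.

    ranked_nps: noun phrases ordered from most to least salient (assumed
    to come from an upstream NP extractor that is not shown here).
    """
    words = []
    for phrase in ranked_nps[:max_nps]:
        for word in phrase.lower().split():
            if word not in words:  # keep each keyword once, in order
                words.append(word)
    # "AND" is a hypothetical conjunction operator used for illustration.
    return " AND ".join(words)


if __name__ == "__main__":
    print(two_np_query(["USB hub", "ThinkPad"]))
```

Any noun phrases beyond the first two are ignored, which keeps the query within the short keyword lengths the baseline is meant to approximate.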
    <Paragraph position="1"> We selected the 2NP baseline based on statistics from keyword query logs on our website, which show that 98.2% of all queries contain four keywords or fewer. Furthermore, most three- and four-word queries contain two distinct noun phrases, such as &quot;visualage for java&quot; and &quot;application framework for e-business&quot;. Thus, we use the 2NP baseline as an approximation of user keyword search performance for our natural language questions.[10] We compared RISQUE's performance to the baseline using three metrics:[11] 1. Total correct: number of questions for which at least one correct webpage is retrieved.</Paragraph>
    <Paragraph position="2"> 2. Average correct: average number of correct webpages retrieved per question.</Paragraph>
    <Paragraph position="3"> 3. Average rank: average rank of the first correct webpage in the hit list.</Paragraph>
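The three metrics can be computed as in the following minimal sketch. It assumes each question's judged hit list is a list of booleans in rank order (True for a correct page), that ranks are 1-based, and that the average rank is taken only over questions with at least one correct hit. These representation choices and the function name `evaluate` are assumptions for illustration, not details given in the paper.

```python
def evaluate(judged_hits):
    """Compute total correct, average correct, and average rank.

    judged_hits: one list per question; each inner list holds booleans for
    the top hits in rank order (True means the page was judged correct).
    """
    total_correct = 0   # questions with at least one correct page
    correct_pages = 0   # correct pages summed over all questions
    first_ranks = []    # 1-based rank of the first correct page, if any

    for hits in judged_hits:
        correct_pages += sum(hits)
        if any(hits):
            total_correct += 1
            first_ranks.append(hits.index(True) + 1)

    n = len(judged_hits)
    avg_correct = correct_pages / n if n else 0.0
    # Assumption: questions with no correct page are left out of the average rank.
    avg_rank = sum(first_ranks) / len(first_ranks) if first_ranks else None
    return total_correct, avg_correct, avg_rank
```

Note that a lower average rank is better, since it means the first correct page appears closer to the top of the hit list.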
    <Paragraph position="4"> The evaluation results are summarized in Table 2, where the first and last rows show the 2NP baseline and RISQUE's performance, respectively. The results show that RISQUE correctly answered 71 questions, a 137% relative improvement over the baseline. Furthermore, the average number of correct answers found nearly tripled, while, on average, the rank of the first correct answer improved from 4.0 to 2.11.</Paragraph>
    <Paragraph position="5"> [9] A set of negative URL constraints is applied at all times to best exclude parts of the website unrelated to ThinkPads. [10] This is likely too high an estimate for current keyword search performance, since the majority of user queries employ only one noun phrase. [11] We chose not to evaluate our results using the traditional IR recall measure because for our task, it is often sufficient to return one page that answers the question instead of attempting to retrieve all relevant pages.</Paragraph>
    <Paragraph position="6"> Question: Do you have a USB hub for a ThinkPad?</Paragraph>
    <Paragraph position="7"> Table 2 further shows performance figures that evaluate the individual contributions of RISQUE's two main components, the hub-page identifier and the iterative query formulation module. A comparison of the last two rows in Table 2 shows the effectiveness of the hub-page identifier, which substantially increased the number of questions correctly answered but yielded only minor gains on the other two performance metrics. To assess the effectiveness of the query formulation module, we used the best manually derived rule application sequence obtained in Section 3. We compared these fixed-order performance figures to those for RISQUE without the hub-page identifier; the comparison shows that applying Q-learning to derive an optimal state-dependent rule application order resulted in fairly substantial improvement us- [...] One of RISQUE's parameters, maxq, specifies the maximum number of distinct queries it can issue to the search engine for each question. Table 3 shows the average number of queries actually issued for select values of maxq.[12] Figure 3 shows how performance degrades when fewer queries are issued as a result of lowering maxq, for both RISQUE and RISQUE without the hub-page identifier. [12] Maxq is 10 for the results in Table 2.</Paragraph>
    <Paragraph position="8"> Figure 3 shows that, with the exception of RISQUE's performance when only one query is issued,[13] the number of questions answered has a near-linear relationship with the number of queries issued for both systems. Notice that without the hub-page identifier, RISQUE's performance when issuing an average of 1.93 queries per question is nearly the same as that of the 2NP baseline, while it performs worse than the baseline when issuing only one query per question. This is because our iterative query formulation process intentionally begins with the most constrained query, which results in an empty hit list in many cases.</Paragraph>
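The interaction between maxq and the most-constrained-first ordering can be illustrated with the minimal sketch below. It assumes that the queries for a question are already ordered from most to least constrained and that the loop stops at the first non-empty hit list. The stopping rule, the `search` callable, and the function name `answer_question` are hypothetical simplifications, and the Q-learning policy that chooses the relaxation order is not modeled.

```python
def answer_question(queries, search, maxq=10):
    """Issue up to maxq queries, most constrained first.

    queries: candidate queries ordered from most to least constrained.
    search: callable mapping a query string to a list of hits (a stand-in
            for the real search engine).
    Returns (hits, number_of_queries_issued).
    """
    issued = 0
    for query in queries[:maxq]:
        issued += 1
        hits = search(query)
        if hits:  # assumed stopping rule: first non-empty hit list wins
            return hits, issued
    return [], issued


if __name__ == "__main__":
    # Toy index: only the most relaxed query matches any page.
    index = {"usb hub": ["page-1", "page-2"]}

    def search(query):
        return index.get(query, [])

    queries = ["usb hub thinkpad accessories", "usb hub thinkpad", "usb hub"]
    print(answer_question(queries, search, maxq=1))
    print(answer_question(queries, search, maxq=3))
```

In this toy setup, maxq=1 issues only the most constrained query and returns an empty hit list, which mirrors the one-query behavior described above; raising maxq lets the relaxed query succeed.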
  </Section>
</Paper>