<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1414">
  <Title>Generic Querying of Relational Databases using Natural Language Generation Techniques</Title>
  <Section position="6" start_page="99" end_page="100" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="99" end_page="100" type="sub_section">
      <SectionTitle>
4.1 Usability
</SectionTitle>
      <Paragraph position="0"> A recent study of the usability of a WYSIWYM type of interface for querying databases (Hallett et al., 2006) has shown that users can learn how to use the interface after a very brief training and succeed in composing queries of quite a high level of complexity. They achieve near-perfect query construction after the first query they compose. The study also showed that the queries as they appear in the WYSIWYM feedback text are unambiguous -- not only to the back-end system -- but also to the user, i.e., users are not misled into constructing queries that may have a different meaning than the one intended. Additionally, it appears that expert users of SQL , with expert knowledge of the underlying database, find the query interface easier to use than querying the database directly in SQL . We consider that most of the conclusions drawn in (Hallett et al., 2006) apply to the current system. The only difference may appear in assessing the ambiguity of the feedback text. Since the query construction rules used for our system are generated automatically, it is likely that the feed-back text may be less fluent and, potentially, more ambiguous than a feedback text generated using manually constructed rules, as in (Hallett et al., 2006). We have not yet addressed this issue in a formal evaluation of the current system.</Paragraph>
    </Section>
    <Section position="2" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
4.2 Coverage
</SectionTitle>
      <Paragraph position="0"> We have assessed the coverage of the system using as our test set a set of English questions posed over a database of geographical information GEOBASE, as in (Tang and Mooney, 2001) and (Popescu et al., 2003). Our first step was to convert the original Prolog database (containing about 800 facts) into a relational database. Then we tested how many of the 250 human produced questions in the test set can be constructed using our system.</Paragraph>
      <Paragraph position="1"> There are several issues in using this particular dataset for testing. Since we do not provide a pure natural language interface, the queries our system can construct are not necessarily expressed in the same way or using the same words as the questions in the test set. For example, the question &amp;quot;How high is Mount McKinley?&amp;quot; in the test set is equivalent to &amp;quot;What is the height of Mount McKinley?&amp;quot; produced by our system. Similarly, &amp;quot;Name all the rivers in Colorado.&amp;quot; is equivalent to &amp;quot;Which rivers flow through Colorado?&amp;quot;. Also, since the above test set was designed for testing and evaluating natural language interfaces, many of the questions have equivalent semantic content.</Paragraph>
      <Paragraph position="2"> For example, &amp;quot;How many people live in California?&amp;quot; is semantically equivalent to &amp;quot;What is the population of California?&amp;quot;. Similarly, there is no difference in composing and analysing &amp;quot;What is the population of Utah?&amp;quot; and &amp;quot;What is the population of New York City?&amp;quot;.</Paragraph>
      <Paragraph position="3"> Out of 250 test questions, 100 had duplicate semantic content and the remaining 150 had original content. On the whole test set of 250 questions, our system was able to generate query frames that allow the construction of 145 questions, therefore 58%. The remaining 42% of questions belong to a single type of questions that our current implementation cannot handle, which is questions that require inferences over numerical types, such as Which is the highest point in Alaska? or What is the combined area of all 50 states?.</Paragraph>
      <Paragraph position="4"> Similar results are achieved when testing the system on the 150 relevant questions only: 60% of the questions can be formulated, while the remaining 40% cannot.</Paragraph>
    </Section>
    <Section position="3" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
4.3 Correctness
</SectionTitle>
      <Paragraph position="0"> The correctness of the SQL generated queries was assessed on the subset of queries that our system can formulate out of the total number of queries in the test set. We found that the correct SQL was produced for all the generated WYSIWYM queries produced.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>