<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1317"> <Title>Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing</Title> <Section position="6" start_page="138" end_page="139" type="evalu"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="138" end_page="138" type="sub_section"> <SectionTitle> 5.1 The Domains </SectionTitle> <Paragraph position="0"> Three different domains are used to demonstrate the performance of the new approach.</Paragraph> <Paragraph position="1"> The first is the U.S. Geography domain.</Paragraph> <Paragraph position="2"> The database contains about 800 facts about U.S. states like population, area, capital city, neighboring states, major rivers, major cities, and so on. A hand-crafted parser, GEOBASE, was previously constructed for this domain as a demo product for Turbo Prolog. The second application is the restaurant query system illustrated in Figure 1. The database contains information about thousands of restaurants in Northern California, including the name of the restaurant, its location, its specialty, and a guide-book rating. The third domain consists of a set of 300 computer-related jobs automatically extracted from postings to the USENET newsgroup austin.jobs. The database contains the following information: the job title, the company, the recruiter, the location, the salary, the languages and platforms used, and required or desired years of experience and degrees.</Paragraph> </Section> <Section position="2" start_page="138" end_page="138" type="sub_section"> <SectionTitle> 5.2 Experimental Design </SectionTitle> <Paragraph position="0"> The geography corpus contains 560 questions.</Paragraph> <Paragraph position="1"> Approximately 100 of these were collected from a log of questions submitted to the web site, and the rest were collected in studies involving students in undergraduate classes at our university. We also included results for the subset of 250 sentences originally used in the experiments reported in (Zelle and Mooney, 1996). The remaining questions were specifically collected to be more complex than the original 250, and generally require one or more meta-predicates. The restaurant corpus contains 250 questions automatically generated from a hand-built grammar constructed to reflect typical queries in this domain. The job corpus contains 400 questions automatically generated in a similar fashion. The beam width for TABULATE was set to five for all the domains. The deterministic parser used only the best hypothesis found. The experiments were conducted using 10-fold cross validation.</Paragraph> <Paragraph position="2"> For each domain, the average recall (a.k.a. accuracy) and precision of the parser on disjoint test data are reported, where:

Recall = # of correct queries produced / # of test sentences

Precision = # of correct queries produced / # of complete parses produced

A complete parse is one which contains an executable query (which could be incorrect). A query is considered correct if it produces the same answer set as the gold-standard query supplied with the corpus.</Paragraph>
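To make the metrics concrete, the following is a minimal Python sketch of the evaluation just described. The parse and run callables are stand-ins (assumptions for illustration), not part of the authors' system: parse maps a sentence to an executable query or None, and run returns the answer set a query retrieves from the database.

from typing import Callable, Optional

def evaluate(sentences: list[str],
             gold_queries: list[str],
             parse: Callable[[str], Optional[str]],
             run: Callable[[str], frozenset]) -> tuple[float, float]:
    complete = 0  # parses containing an executable query (possibly incorrect)
    correct = 0   # queries whose answer set matches the gold query's
    for sentence, gold in zip(sentences, gold_queries):
        query = parse(sentence)
        if query is None:
            continue  # no complete parse produced for this sentence
        complete += 1
        # Correctness is judged by comparing answer sets, not query syntax.
        if run(query) == run(gold):
            correct += 1
    recall = correct / len(sentences)
    precision = correct / complete if complete else 0.0
    return recall, precision

Note that precision is undefined when no complete parses are produced; the sketch returns 0.0 in that case.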
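The role of the beam width can likewise be illustrated with a short sketch. This is not the authors' Prolog implementation; expand, score, and is_complete are hypothetical callables standing for the parser's action proposer, its probability estimate, and the completeness test.

import heapq

def beam_parse(initial_state, expand, score, is_complete, beam_width=5):
    """Return the most probable complete parse found, or None."""
    beam = [initial_state]
    while beam:
        finished = [s for s in beam if is_complete(s)]
        if finished:
            return max(finished, key=score)  # best complete parse in the beam
        # Propose all successor states reachable by one parser action.
        successors = [t for s in beam for t in expand(s)]
        # Keep only the beam_width most probable partial parses;
        # beam_width == 1 keeps just the single best action at each step.
        beam = heapq.nlargest(beam_width, successors, key=score)
    return None  # no path from the sentence to a complete parse

With beam_width = 1 the search commits to the single most probable action at every step, which, as discussed below, can underperform the purely deterministic parser.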
</Section> <Section position="3" start_page="138" end_page="139" type="sub_section"> <SectionTitle> 5.3 Results </SectionTitle> <Paragraph position="0"> The results are presented in Table 1 and Figure 3. By switching from deterministic to probabilistic parsing, the system increased the number of correct queries it produced. Recall increases almost monotonically with parsing beam width in most of the domains. Improvement is most apparent in the Jobs domain, where probabilistic parsing significantly outperformed the deterministic system (80% vs. 68%). However, using a beam width of one (so that the probabilistic parser picks only the best action) results in worse performance than using the original purely logic-based deterministic parser. This suggests that the probability estimates could be improved, since overall they do not indicate the single best action as well as the non-probabilistic approach. Precision of the system decreased with beam width, but not significantly except for the larger Geography corpus. Since the system conducts a more extensive search for a complete parse, it risks increasing the number of incorrect as well as correct parses. The importance of recall vs. precision depends on the relative cost of providing an incorrect answer versus no answer at all. Individual applications may require emphasizing one or the other.</Paragraph> <Paragraph position="1"> All of the experiments were run on a 167MHz UltraSparc workstation under Sicstus Prolog. Although results on the parsing time of the different systems are not formally reported here, the difference between using a beam width of three and the original system was less than two seconds in all domains, but increased to around twenty seconds when using a beam width of twelve.</Paragraph> <Paragraph position="2"> However, the current Prolog implementation is not highly optimized.</Paragraph> <Paragraph position="4"> While there was an overall improvement in recall using the new approach, its performance varied significantly from domain to domain.</Paragraph> <Paragraph position="5"> As a result, the recall did not always improve dramatically by using a larger beam width.</Paragraph> <Paragraph position="6"> Domain factors possibly affecting the performance are the quality of the lexicon, the relative amount of data available for calculating probability estimates, and the problem of "parser incompleteness" with respect to the test data (i.e., there is no path from a sentence to a correct query). The performance of all systems was basically equivalent in the restaurant domain, where they were near perfect in both recall and precision. This is because this corpus is relatively easy given the restricted range of possible questions due to the limited information available about each restaurant. The systems achieved > 90% recall and precision given only roughly 30% of the training data in this domain. Finally, GEOBASE performed the worst on the original geography queries, since it is difficult to hand-craft a parser that handles a sufficient variety of questions.</Paragraph> </Section> </Section> </Paper>