<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1414">
  <Title>Generic Querying of Relational Databases using Natural Language Generation Techniques</Title>
  <Section position="4" start_page="95" end_page="95" type="metho">
    <SectionTitle>
2 WYSIWYM interfaces for database querying
</SectionTitle>
    <Paragraph position="0"> Conceptual authoring through WYSIWYM editing alleviates the need for expensive syntactic and semantic processing of the queries by providing the users with an interface for editing the conceptual meaning of a query instead of the surface text (Power and Scott, 1998).</Paragraph>
    <Paragraph position="1"> The WYSIWYM interface presents the contents of a knowledge base to the user in the form of a natural language feedback text. In the case of query editing, the content of the knowledge base is an as yet incomplete formal representation of the user's query. The interface presents the user with a natural language text that corresponds to the incomplete query and guides them towards editing a semantically consistent and complete query. In this way, the users are able to control the interpretation that the system gives to their queries. The user starts by editing a basic query frame, where concepts to be instantiated (anchors) are clickable spans of text with associated pop-up menus containing options for expanding the query.</Paragraph>
    <Paragraph position="2"> Previously, WYSIWYM interfaces have proved to be valid solutions for querying databases of legal documents and medical records (Piwek et al., 2000; Piwek, 2002; Hallett et al., 2005).</Paragraph>
    <Paragraph position="3"> As a query-formulation method, WYSIWYM provides most of the advantages of NLIDBs, but overcomes the problems associated with natural language interpretation and with users attempting to pose questions that are beyond the capability of the system or, conversely, refraining from asking useful questions that are in fact within the system's capability. However, one of the disadvantages of the WYSIWYM method is that domain knowledge has to be manually encoded. In order to construct a querying interface for a new database, one has to analyse the database and manually model the queries that can be posed, then implement grammar rules for the construction of these queries. Also, the process of transforming WYSIWYM queries into SQL or another database querying language has previously been database-specific. These issues have made it expensive to port the interface to new databases and new domains. The research reported here addresses both these shortcomings by providing a way of automatically inferring the type of possible queries from a graph representation of the database model and by developing a generic way of translating internal representations of WYSIWYM-constructed queries into SQL.</Paragraph>
  </Section>
  <Section position="5" start_page="95" end_page="99" type="metho">
    <SectionTitle>
3 Current approach
</SectionTitle>
    <Paragraph position="0"> In the rest of the paper, we will use the following terms. A query frame is a system-generated query that has not yet been edited by the user and therefore contains only unfilled WYSIWYM anchors. An anchor, in WYSIWYM terminology, is a span of text in a partially formulated query that the user can edit to expand a concept. Anchors are displayed in square brackets (see examples in section 3.3).</Paragraph>
    <Paragraph position="1"> To exemplify the system behaviour, we will use as a case study the MEDIGAP database, which is a freely downloadable repository of information concerning medical insurance companies in the United States. We have chosen this particular database because it contains a relatively wide range of entity and relation types and can yield a large number of types of queries. In practice we have often noticed that large databases tend to be far less complex.</Paragraph>
    <Section position="1" start_page="95" end_page="96" type="sub_section">
      <SectionTitle>
3.1 System architecture
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows the architecture of the querying system. It receives as input a model of the database semantics (the semantic graph) and it automatically generates some of the components and resources (highlighted in grey) that in previous WYSIWYM querying systems were constructed manually. Finally, it implements a module that translates the user-composed query into SQL.</Paragraph>
      <Paragraph position="1"> The components highlighted in grey are those that are constructed by the current system.</Paragraph>
      <Paragraph position="2"> The T-box describes the high-level components of the queries. It is represented in Profit notation (Erbach, 1995) and describes the composition of the query frames (the elements that contribute to a query and their type). The fragment of the semantic graph displayed in Figure 2 will generate the following fragment of T-box:
query &gt; [about_company, about_state, about_phone, about_ext].</Paragraph>
      <Paragraph position="3"> about_company &gt; [company_state, company_phone, company_ext].</Paragraph>
      <Paragraph position="4"> company_state intro [company:company_desc].
company_desc intro [comp:comp_desc, phone:phone_desc, ext:ext_desc].</Paragraph>
      <Paragraph position="5"> state_desc &gt; external('dbo_vwOrgsByState_StateName').
comp_desc &gt; external('dbo_vwOrgsByState_org_name').
phone_desc &gt; external('dbo_vwOrgsByState_org_phone').
ext_desc &gt; external('dbo_vwOrgsByState_org_ext').
The grammar rules are also expressed in Profit, and they describe the query formulation procedure. For example, the following rule will be generated automatically to represent the construction procedure for the query in Example (1.1):
rule(english, company_state,</Paragraph>
      <Paragraph position="7"> layout!level!word]).</Paragraph>
      <Paragraph position="8"> In addition to the grammar rules automatically generated by the system, the WYSIWYM package also contains a set of English grammar rules (for example, rules for the construction of definite noun phrases or attachment of prepositional phrases). These rules are domain independent, and therefore a constant resource for the system.</Paragraph>
      <Paragraph position="9"> The lexicon consists of a list of concepts together with their lexical form and syntactic category. For example, the lexicon entry for insurance company will look like:
word(english, meaning!company &amp; syntax!(category!noun &amp; form!name) &amp; cset!'insurance company')).</Paragraph>
    </Section>
    <Section position="2" start_page="96" end_page="97" type="sub_section">
      <SectionTitle>
3.2 Semantic graph
</SectionTitle>
      <Paragraph position="0"> The semantics of a relational database is specified as a directed graph where the nodes represent elements and the edges represent relations between elements. Each table in the database can be seen as a subgraph, with edges between subgraphs representing a special type of join relation.</Paragraph>
      <Paragraph position="1"> Each node has to be described in terms of its semantics and, at least for the present, in terms of its linguistic realisation. The semantic type of a node is restricted by the data type of the corresponding entity in the database. A database entity of type String can belong to one of the following semantic categories: person, organization, location (town, country), address (street or complete address), telephone number, other name, other object. Similarly, numerical entities can have the semantic type: age, time (year, month, hour), length, [...]. Entities of the date data type have only one possible semantic type, which is date. These semantic types have proved sufficient in our experiments; however, this list can be expanded if necessary. Apart from the semantic type, each node must specify the linguistic form used to express that node in a query. For example, in our case study, the field StateName will be realised as state, with the semantic category location. Additionally, each node will contain the name of the table it belongs to and the name of the column it describes.</Paragraph>
      <Paragraph position="2"> Relations in the semantic graph are also described in terms of their semantic type. Since relations are always realised as verbs, their semantic type also defines the subcategorisation frame associated with that verb. For the moment, subcategorisation frames are imported from a locally compiled dictionary of 50 frequent verbs. The user only needs to specify the semantic type of the verb and, optionally, the verb to use. The system automatically retrieves the appropriate subcategorisation frame from the dictionary. The dictionary has the disadvantage of being rather restricted in coverage; however, it spares the user from entering subcategorisation frames manually, a task which may prove tedious for a user without the necessary linguistic knowledge. Users can nevertheless enter new frames in the dictionary, should their verb or category of choice not be present. A relation must also specify its arity.</Paragraph>
      <Paragraph position="3"> This model of the database semantics is partially constructed automatically by extracting database metadata such as data types, value ranges and foreign keys. The manual effort involved in creating the semantic graph is reduced to the input of semantic and linguistic information.</Paragraph>
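As an illustration of the description above, a semantic graph of this kind could be held in memory roughly as follows. This is a minimal Python sketch, not the representation used by the system itself: the class names Node, Relation and SemanticGraph, their field names, the verbs, and the split of the T-box identifiers into a table name (dbo_vwOrgsByState) and column names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """One database column annotated with semantic and linguistic information."""
    name: str            # node identifier, e.g. "orgName"
    table: str           # table (or view) the column belongs to (assumed name)
    column: str          # column the node describes (assumed name)
    data_type: str       # database data type, e.g. "String"
    semantic_type: str   # e.g. "organization", "location"
    realisation: str     # linguistic form used to express the node in a query


@dataclass(frozen=True)
class Relation:
    """A directed edge between two nodes, realised as a verb."""
    source: str
    target: str
    semantic_type: str   # would select the subcategorisation frame
    verb: str            # hypothetical verb choice
    arity: int = 2


@dataclass
class SemanticGraph:
    nodes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_relation(self, relation):
        # Both endpoints must already be present in the graph.
        if relation.source not in self.nodes or relation.target not in self.nodes:
            raise KeyError("relation refers to an unknown node")
        self.relations.append(relation)

    def neighbours(self, name):
        """Nodes reachable from `name` over one outgoing edge, in insertion order."""
        return [self.nodes[r.target] for r in self.relations if r.source == name]


graph = SemanticGraph()
graph.add_node(Node("orgName", "dbo_vwOrgsByState", "org_name",
                    "String", "organization", "insurance company"))
graph.add_node(Node("StateName", "dbo_vwOrgsByState", "StateName",
                    "String", "location", "state"))
graph.add_node(Node("phone", "dbo_vwOrgsByState", "org_phone",
                    "String", "telephone number", "phone number"))
graph.add_relation(Relation("orgName", "StateName", "location", "be located"))
graph.add_relation(Relation("orgName", "phone", "possession", "have"))

linked = [n.realisation for n in graph.neighbours("orgName")]
```

Here the fields table, column and data_type are the ones that could be filled from database metadata, while semantic_type, realisation and the relation attributes correspond to the information entered manually.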
    </Section>
    <Section position="3" start_page="97" end_page="99" type="sub_section">
      <SectionTitle>
3.3 Constructing queries
</SectionTitle>
      <Paragraph position="0"> In this paper we focus on the construction of the most difficult type of queries: complex wh-queries over multiple database tables and containing logical operators. The only restriction on the range of wh-queries we currently construct is that we omit queries that require inferences over numerical and date types.</Paragraph>
      <Paragraph position="1"> Each node in the semantic graph can be used to generate a number of query frames equal to the number of nodes it is connected to in the graph.</Paragraph>
      <Paragraph position="2"> Each query frame is constructed by pairing the current node with each of the other nodes it is linked to. By generation of query frames we designate the process of automatically generating Profit code for the grammar rule or set of rules used by WYSIWYM, together with the T-box entries required by that particular rule.</Paragraph>
      <Paragraph position="3"> If we consider the graph presented in Figure 2 and focus on the node orgName, the system will construct the following query frames:
Example (1):
1. In which state is [some insurance company] located?
2. What phone number does [some insurance company] have?
3. What extension does [some insurance company] have?
If we consider the first query in the example above, the user can further specify details about the company by selecting the [some insurance company] anchor and choosing one of the options available (which are themselves generated automatically from the database in question). This information may come from one or more tables. For example, one table in our database contains information about the insurance companies' contact details, whilst another describes the services provided by the insurance companies. Therefore, the user can choose between three options: contact details, services and all. Each selection triggers a text regeneration process, where the feedback text is transformed to reflect the user selection, as in the example below:
Example (2):
1. In which state is [some insurance company] that has [some phone number] and [some extension] located?
2. In which state is [some insurance company] that offers [some medical insurance plan] and [is available] to people over 65 located?
3. In which state is the insurance company with the following features located:
* It has [some phone number] and [some extension] and
* It offers [some medical insurance plan] and [is available] to people over 65
Figure 3 shows a snapshot of the query editing interface where query (2.1) is being composed. Each query frame is syntactically realised by using specially designed grammar rules. The generation of high-level queries such as the one in Example (1.1) relies on basic query syntax rules. The semantic type of each linked element determines the type of wh-question that will be constructed. For example, if the element has the semantic type location, we will construct where-questions, whilst a node with the semantic type person will give rise to a who-question. In order to avoid ambiguities, we impose further restrictions on the realisation of the query frames. If there is more than one location-type element linked to a node, the system will not generate two where query frames, which would be ambiguous, but more specific which-queries. For example, our database contains two nodes of semantic type location linked to the node OrgName. The first describes the state where an insurance company is located, the second its address. The query frames generated will be:
Example (3):
1. In which states is some insurance company located?
2. At what addresses is some insurance company located?
The basic grammar rule pattern for queries based on a single table states that elements linked to a particular node will be realised in relative clauses modifying that node. For example, in Example (2.1), the nodes phones and ext are accessible from the node orgName and will therefore be realised in a relative clause that modifies insurance company.</Paragraph>
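The frame-generation procedure described above can be sketched as follows. This is a hedged Python illustration rather than the Profit code the system emits: the function name query_frames, the question templates and the (realisation, semantic type) input layout are hypothetical, and only the location and person cases discussed in the text are modelled explicitly.

```python
from collections import Counter

# Hypothetical templates keyed by the semantic type of the linked node.
# Each entry holds (generic wh-question, more specific which-question).
TEMPLATES = {
    "location": ("Where is {anchor} located?",
                 "In which {noun} is {anchor} located?"),
    "person": ("Who is associated with {anchor}?",
               "Which {noun} is associated with {anchor}?"),
}
# Fallback for semantic types without a dedicated wh-word.
DEFAULT = "What {noun} does {anchor} have?"


def query_frames(focus_realisation, linked):
    """Return one query frame per node linked to the focus node.

    `linked` is a list of (realisation, semantic_type) pairs.
    """
    anchor = "[some " + focus_realisation + "]"
    type_counts = Counter(sem_type for _, sem_type in linked)
    frames = []
    for noun, sem_type in linked:
        if sem_type not in TEMPLATES:
            template = DEFAULT
        elif type_counts[sem_type] > 1:
            # Several linked nodes share this semantic type, so the generic
            # wh-question would be ambiguous: use the specific variant.
            template = TEMPLATES[sem_type][1]
        else:
            template = TEMPLATES[sem_type][0]
        frames.append(template.format(anchor=anchor, noun=noun))
    return frames


single = query_frames("insurance company",
                      [("state", "location"),
                       ("phone number", "telephone number")])
double = query_frames("insurance company",
                      [("state", "location"),
                       ("address", "location"),
                       ("extension", "other object")])
```

With a single location-type neighbour the sketch produces a where-question; with two it falls back to one which-question per neighbour, and in both cases the number of frames equals the number of linked nodes. The wording of the real frames (for instance the choice of preposition in Example (3)) comes from the grammar rules and is not reproduced by these templates.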
      <Paragraph position="4"> In the case where the information comes from more than one table, it is necessary to introduce more complex layout features in order to make the query readable. For each table that provides information about the focused element we generate bulleted lines as in Example (2.3).</Paragraph>
      <Paragraph position="5"> Each question frame contains a bound element, i.e., an element whose value the user cannot edit. This corresponds to the information that represents the answer to the question. In Example (2), the bound element is state. All other nodes will be realised in the feedback text as anchors that are editable by users. One exception is represented by nodes that correspond to database elements of boolean type. In this case, the anchor will not be associated with a node, but with a relation, as in Example (2.3) (the availability of an insurance plan is a boolean value). This is to allow the verb to take a negative form: in our example, one can have is available to people over 65 or is not available to people over 65.</Paragraph>
      <Paragraph position="6"> Since not all anchors have to be filled in, one query frame can in fact represent more than one real-life question. In example (4), one can edit the query to compose any of the following corresponding natural language questions:
Example (4):
1. In which state is the insurance company with the phone number 8008474836 located?
2. In which state is the insurance company Thrivent Financial for Lutherans with the phone number 8008474836 and extension 8469 located?
In fact, a single element can be replaced by any number of elements of the same type linked by conjunctions or disjunctions. However, we will refer to a single element by way of simplification. The process of inferring queries remains essentially the same.</Paragraph>
      <Paragraph position="7"> The actual values of anchors are extracted from the database and transformed into correct lexicon entries as needed. The strings associated with a value (e.g. Thrivent Financial for Lutherans) are retrieved from the database table and column indicated in the description of the node that was used for generating the anchor (e.g. orgName), and the syntactic role (e.g. proper noun) is given by the syntactic information associated with the node.</Paragraph>
    </Section>
    <Section position="4" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
3.4 Query translation module
</SectionTitle>
      <Paragraph position="0"> Once a query has been constructed, it is represented internally as a directed acyclic graph.</Paragraph>
      <Paragraph position="1"> Moreover, each node in the graph can be mapped into a node in the semantic graph of the database.</Paragraph>
      <Paragraph position="2"> The translation module transforms a constructed query into an SQL statement by parsing the query graph and combining it with the corresponding elements in the semantic graph.</Paragraph>
      <Paragraph position="3"> The SELECT portion of the statement contains the focused element. The WHERE portion contains those nodes in the question graph that correspond to edited anchors. For constructing the FROM portion of the statement, we extract from the semantic graph, for each selected element, information about its corresponding database table. For example, if we assume that in Example (2.1) the user has specified the name of the company and its phone number, the SQL statement generated will be:</Paragraph>
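The sketch below is not the statement produced by the system, nor the authors' translation module. It only illustrates, in Python, the SELECT/WHERE/FROM assembly described above. The function name to_sql, the (table, column) pair layout, the use of ? placeholders, and the table and column names (inferred from the T-box identifiers in section 3.1) are assumptions made for the example.

```python
def to_sql(focus, edited_anchors):
    """Build a parameterised SQL query string and its parameter list.

    `focus` is a (table, column) pair for the bound element;
    `edited_anchors` is a list of ((table, column), value) pairs for
    the anchors the user has filled in.
    """
    def qualified(element):
        table, column = element
        return table + "." + column

    # FROM lists every table that is mentioned, once, in order of first use.
    # Join conditions between different tables are omitted in this sketch.
    tables = []
    for table, _ in [focus] + [element for element, _ in edited_anchors]:
        if table not in tables:
            tables.append(table)

    sql = "SELECT " + qualified(focus) + " FROM " + ", ".join(tables)

    # WHERE holds one equality test per edited anchor; values are passed
    # separately so that they are never spliced into the SQL text.
    conditions = [qualified(element) + " = ?" for element, _ in edited_anchors]
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    params = [value for _, value in edited_anchors]
    return sql, params


sql, params = to_sql(
    ("dbo_vwOrgsByState", "StateName"),
    [(("dbo_vwOrgsByState", "org_name"), "Thrivent Financial for Lutherans"),
     (("dbo_vwOrgsByState", "org_phone"), "8008474836")],
)
```

For this input, sql is a single-table SELECT over dbo_vwOrgsByState with two conditions joined by AND, and params carries the company name and phone number in the same order. Unfilled anchors simply contribute no condition, which is how one query frame can stand for several concrete questions.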
    </Section>
  </Section>
</Paper>