<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1170">
  <Title>Kaplan, J., Cooperative Responses from a Portable</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The WEBCOOP project and its
environment
</SectionTitle>
    <Paragraph position="0"> This research is part of the WEBCOOP system (Benamara et al. 2004), which makes heavy use of a rich domain ontology: a concept ontology structured by means of several semantic relations (Cruse 1986). To better understand the perspective in which this work has been carried out, let us say a few words about the WEBCOOP project, a system designed to provide cooperative answers to queries on the Web.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Typology of the domain ontology
</SectionTitle>
      <Paragraph position="0"> The ontology on which our project is based is a conceptual ontology in which nodes are associated with concept lexicalizations and essential properties, as is now often the case in restricted-domain ontologies (http://www.daml.org/ontologies/). In our implementation, each node is represented by the predicate onto-node(concept, lex, prop), where concept has properties prop, represented as attribute-value pairs, and lexicalisations lex.</Paragraph>
      <Paragraph position="1"> Most lexicalisations are entries in the lexicon, where morphological and grammatical aspects are described. For example, for hotel, we have (coded in Prolog): onto-node(hotel, [[hotel], [residence, hoteliere]],</Paragraph>
      <Paragraph position="3"> There are several well-designed public domain ontologies on the net. Our ontology is a synthesis of two existing French ontologies, which we customized: TourinFrance (www.tourinfrance.net) and the bilingual (French and English) thesaurus of tourism and leisure activities (www.iztzg.hr/indokibiblioteka/THESAUR.PDF), which includes 2800 French terms. We manually integrated these ontologies into WEBCOOP (Benamara et al. 2004a) and added most properties ourselves by hand. We also slightly revised the ontology by removing concepts that are either too specific (i.e. too low level), such as some basic aspects of ecology, or rarely considered, such as the economy of tourism.</Paragraph>
      <Paragraph position="4"> We also removed quite surprising classifications, such as sanatorium under tourist accommodation. We then reorganized some concept hierarchies so that they look more intuitive to a general audience. Finally, we found some hierarchies a little odd: for example, accommodation capacity and tourist accommodation appeared at the same level whereas, in our case, we consider capacity to be a property of the concept tourist accommodation.</Paragraph>
      <Paragraph position="5"> We have, at the moment, 1000 concepts in our tourism ontology, which describe the different aspects of accommodation and transportation and a few other satellite elements (geography, health, immigration). Besides the traditional 'isa' relation, we also coded the 'part-of' relation and opposition. We also have a few proportional series in order to encode graduality. In the tourism domain there is a reasonable number of part-of pairs, and a small number of opposites and proportional series. Synonymy is encoded via the list of lexicalizations. These relations are encoded straightforwardly by introducing facts, which operate at the concept level, not at the lexicalization level: part-of(journey, trip).</Paragraph>
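      <Paragraph> As an illustration only, the node and relation facts described above could be mirrored as follows. This is a minimal sketch in Python rather than the Prolog of the actual implementation. The lexicalisations of hotel and the part-of pair come from the text; the names OntoNode, NODES, PART_OF, ISA, synonyms and parts, the capacity value, and the reading of part-of(journey, trip) with the part as first argument are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class OntoNode:
    """One ontology node: a concept, its lexicalisations and its properties."""
    concept: str
    lex: list                                   # each lexicalisation is a list of tokens
    props: dict = field(default_factory=dict)   # attribute-value pairs


# Node for 'hotel'. The lexicalisations come from the text; the property
# value is a hypothetical placeholder.
NODES = {
    "hotel": OntoNode("hotel", [["hotel"], ["residence", "hoteliere"]],
                      {"capacity": 40}),
}

# Relations are facts over concepts, not over lexicalisations.
# Assumed reading of part-of(journey, trip): the first argument is the part.
PART_OF = {("journey", "trip")}
ISA = {("hotel", "tourist_accommodation")}


def synonyms(concept):
    """Synonymy is read off the list of lexicalisations of a node."""
    return [" ".join(tokens) for tokens in NODES[concept].lex]


def parts(whole):
    """All concepts recorded as parts of `whole`, in alphabetical order."""
    return sorted(part for part, w in PART_OF if w == whole)
```

With these toy facts, synonyms("hotel") yields both lexicalisations of the concept and parts("trip") yields the single recorded part, journey.</Paragraph>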
      <Paragraph position="6"> As shall be seen below, this ontology is further annotated for lexicalization tasks.</Paragraph>
      <Paragraph position="7"> Finally, in (Benamara et al. 2004), we define a conceptual metric that determines the distance between two concepts, evaluated in terms of the distance in the hierarchy between the two concepts and of property differences (nature of properties and values associated).</Paragraph>
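      <Paragraph> The exact definition of the metric is given in the cited paper and is not reproduced here. The following is only a sketch of one way to combine the two ingredients named above, assuming a weighted sum of the number of 'isa' edges separating two concepts and the number of properties on which they differ. The weights, the toy hierarchy and the property values are hypothetical.

```python
# Toy 'isa' hierarchy (child to parent) and properties, both hypothetical.
PARENT = {
    "bungalow": "tourist_accommodation",
    "hut": "tourist_accommodation",
    "chalet": "tourist_accommodation",
}
PROPS = {
    "bungalow": {"comfort": "basic", "setting": "campsite"},
    "hut": {"comfort": "basic", "setting": "campsite"},
    "chalet": {"comfort": "high", "setting": "mountain"},
}


def ancestors(concept, parent):
    """Chain from `concept` up to the root, concept included."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def hierarchy_distance(a, b, parent):
    """Number of 'isa' edges joining a and b through their closest common ancestor."""
    up_a = ancestors(a, parent)
    up_b = ancestors(b, parent)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("concepts are not in the same hierarchy")


def property_difference(props_a, props_b):
    """Count attributes present on one side only, or carrying different values."""
    keys = set(props_a) | set(props_b)
    return sum(1 for key in keys if props_a.get(key) != props_b.get(key))


def concept_distance(a, b, parent, props, w_hier=1.0, w_prop=0.5):
    """Assumed combination: weighted sum of the two components."""
    return (w_hier * hierarchy_distance(a, b, parent)
            + w_prop * property_difference(props[a], props[b]))
```

On this toy data all three concepts are sisters, so the hierarchy component is the same for each pair and only the property component separates them: bungalow and hut are at distance 2.0, bungalow and chalet at distance 3.0.</Paragraph>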
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Outline of WEBCOOP
</SectionTitle>
      <Paragraph position="0"> Following Grice's maxims (Grice, 1975) and related works, e.g. (Searle, 1975), a number of forms of cooperative responses were identified in the early 1990s. Most of the efforts in these studies and systems focussed on the foundations and on the implementation of reasoning procedures (Gal, 1988), (Minock et al., 1996), while little attention was paid to question analysis and NL response generation. An overview of these systems can be found in (Gasterland et al., 1994) and in (Webber et al., 2002), based on works by (Hendrix et al., 1978), (Kaplan, 1982), (Mays et al., 1982), among others. These systems include e.g. the identification of false presuppositions and of various types of misunderstandings found in questions. They also include reasoning schemas based e.g. on constraint relaxation to provide approximate or alternative, but relevant, answers when the direct question has no response.</Paragraph>
      <Paragraph position="1"> Intensional reasoning schemas can also be used to generalize over lists of basic responses or to construct summaries.</Paragraph>
      <Paragraph position="2"> The framework of Advanced Reasoning for Question Answering (QA) systems, as described in a recent road map, raises new challenges since answers can no longer be only directly extracted from texts (as in TREC) or databases, but require the use of a domain knowledge base, including a conceptual ontology, and dedicated inference mechanisms. Such a perspective, obviously, reinforces and gives a whole new insight to cooperative answering.</Paragraph>
      <Paragraph position="3"> In WEBCOOP, user questions may range from keywords to comprehensive natural language expressions. Parsing them produces a semantic logical representation that includes the conceptual category of the question, its focus and its contents. Responses are structured in two parts. The first part contains explanation elements in natural language. It is a first level of cooperativity that reports user misconceptions, if any, in relation to the domain knowledge. The second part is the most important and the most original. It reflects the know-how of the cooperative system, going beyond the cooperative statements given in the first part. It is based on intensional description techniques and on intelligent relaxation procedures going beyond classical generalization methods used in AI. This component also includes additional dedicated cooperative rules that make thorough use of the domain ontology and of general knowledge. Responses provided to users are built in web style, by integrating natural language generation (NLG) techniques with hypertext links to produce &quot;dynamic&quot; responses. Hyperlinks are dynamically created at generation time. This leaves the high-level planning tasks inherent to NLG up to the user and improves readability and information access. The standard generation difficulties (lexicalisation, aggregation, argumentation) remain crucial to generate cooperative responses, but the web style greatly reduces the overall complexity.</Paragraph>
      <Paragraph position="4"> The goal of the present contribution is to clarify and model the lexicalization aspects, which are, in fact, closely dependent on the aggregation functions.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Investigating lexicalisations from cooperative QA corpora
</SectionTitle>
    <Paragraph position="0"> To carry out our study, we considered three typical sources of cooperative discourse: Frequently Asked Questions (FAQ), Forums, and email question-answer pairs (EQAP), the latter obtained by sending emails ourselves to relevant services (e.g. for tourism: tourist offices, airlines, hotels). Our study was carried out on 670 cooperative question-answer pairs: about 50% come from FAQ, 25% from Forums and 25% from EQAP. The domains considered are basically large-public applications: tourism (45%), health (22%), sport, shopping and education (19%). In all these corpora, no user model is assumed, and there is no dialogue: QA pairs are isolated, with no context, in a way similar to web search engine querying.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Lexicalisation strategies deployed by humans
</SectionTitle>
    <Paragraph position="0"> From the analysis of these QA pairs, we have identified the main lexicalisation strategies presented below. The XML-type labels used to annotate our corpora are given in parentheses.</Paragraph>
    <Paragraph position="1"> * Identity (ID): the term in the question is kept unchanged in the response,
* Quasi-synonyms (SYN): under this term, we observe a large spectrum of variations:
SYN(a) terms which are almost equivalent (have / possess),
SYN(b) usages with a specific orientation or connotation, such as the use of commercially oriented terms to attract customers or of more technical terms, possibly with slight nuances, on the part of the responder (refund / financial compensation, breathing stops / respiration pauses), quite typical of financial or health corpora where the responder is often a professional,
SYN(c) transcategorial variations or short paraphrases (international - from the entire world, from abroad, from all the countries in the world). This latter subclass is much more difficult to characterize; it often indicates a form of insistence or stress, or the desire to make information more explicit (e.g. international is really from all the countries in the world),
* Opposites (OPP): used on a limited scale to account e.g. for the differences in point of view between the questioner and the responder (e.g. lend / borrow, pay / charge, stay / leave) (Cruse 1986).</Paragraph>
    <Paragraph position="2"> * Subtypes (SUB): often subtypes in the ontology (accommodation - hotel, motel, bungalow, country cottage, etc.) or enumerations of basic terms (e.g. different types of credit cards). We have the following main subtypes: SUB(a) lower-level elements in an ontology (possibly terminal elements), SUB(b) focuses on sub-events or sub-activities of the larger event lexicalised in the question, or (simple) procedure description instead of the procedure name, and SUB(c) subtyping via modification (official guide, mountain guide).</Paragraph>
    <Paragraph position="3"> * Sisters (SIS) are used when the response introduces minor superficial corrections to the terms used in the question (not as important as false presuppositions), often related to 'lexical approximations' (e.g. bungalow used instead of hut or cabin at a campsite). These should not be confused with SYN(b). SIS is also useful in negative responses, where the correct response is a sister of the concept used in the query (is my metro ticket valid on the bus? no, you need to buy a bus ticket).</Paragraph>
    <Paragraph position="4"> * Generalizations (GEN): their main use is in intensional answers, where generalizations are realized to make the answer more compact.</Paragraph>
    <Paragraph position="5"> Generalizations can also be used e.g. to insist on a certain facet of the term (castle / building), or to provide a certain form of help (where is the Dominican Republic ? - in the Antilles).</Paragraph>
    <Paragraph position="6"> It is possible, in addition, to indicate the generalization level w.r.t. the ontology.</Paragraph>
    <Paragraph position="7"> * Metonymies (MET): are relatively frequent, and focus on a particular 'facet' of the term used in the question to better emphasize a particular aspect, such as a means or an instrument of an action, or use of make for object, container for containee.</Paragraph>
    <Paragraph position="8"> * Fuzzy term interpretation (FTI): close interpreted as 50 meters.</Paragraph>
    <Paragraph position="9"> These lexical variations are relatively well distributed over the above different categories, roughly as follows (all situations included) in % :</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ID SYN OPP SUB SIS GEN MET FTI
</SectionTitle>
    <Paragraph position="0"> ID 32%, SYN 21%, OPP 2%, SUB 21%, SIS 4%, GEN 5%, MET 14%, FTI 2%. As the reader may note, almost none of these lexical variations are neutral. They convey a specific connotation, point of view or focus, or they may activate a concept property or facet usually in the background.</Paragraph>
    <Paragraph position="1"> To have a more systematic analysis of lexicalisation, we manually annotated lexicalisation forms in QA pairs. The examples in Fig. 1, on the next page, illustrate several forms frequently encountered. Terms considered are labelled as T1, T2, etc. (including terms within the focus); a special annotation is reserved for the whole structure identified as the focus (FOC), and op= specifies the variation operation at stake (e.g. SYN(a), SUB(b), etc.). Q1-R1 have several layers of lexicalisation: guided tour is a subtype of visit, guided tours at fixed hours 11 AM and 3 PM is a subtype of visit hours, while building is a generalization of castle, necessary to allow for the reference to town hall, since building is the least upper bound in the ontology for castle and town hall. Q2-R2 stresses and makes more explicit drinkable in water from the tap, while Q3-R3 requires the GEN strategy, which is the simplest way to locate a country or any object, via its relations with others.</Paragraph>
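    <Paragraph> The least upper bound used in Q1-R1 can be computed by walking up the 'isa' hierarchy. The following is a minimal sketch, assuming a single-parent hierarchy stored as a child-to-parent mapping. The mapping PARENT, the node tourist_site and the function names are hypothetical; only the fact that building dominates castle and town hall comes from the text.

```python
# Hypothetical fragment of the 'isa' hierarchy (child to parent).
PARENT = {
    "castle": "building",
    "town_hall": "building",
    "building": "tourist_site",
}


def ancestors(concept, parent):
    """Chain from `concept` up to the root, concept included."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def least_upper_bound(concepts, parent):
    """Lowest concept that dominates (or equals) every concept in the list.

    Returns None when the concepts share no ancestor.
    """
    if not concepts:
        raise ValueError("need at least one concept")
    chains = [ancestors(c, parent) for c in concepts]
    # Candidates are visited bottom-up, so the first common one is the lowest.
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return candidate
    return None
```

For castle and town_hall the function returns building, which is the term a GEN lexicalisation would then realise.</Paragraph>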
    <Paragraph position="2"> 5 Enriching the ontology for lexicalisation tasks in NLG
Let us consider again the ontology presented in section 2.1. It needs some refinements so that the lexicalization situations identified can be automated. Let us consider the main cases.</Paragraph>
    <Paragraph position="3"> SYN(a) is based on the set of lexicalizations associated with the concept considered. SYN(b) requires lexicalisations to be marked according to their language level, in our case: slang, popular term, standard term (the one chosen by default) and technical term. SYN(c) requires another type of notation: the category of the lexicalisation: word versus expression + category. For example, possess is marked technical (contrasted with have), and from the entire world is identified as an expression of type PP. In terms of coding, the last example is coded as: lex([from,the,entire,world], standard, expr:pp).</Paragraph>
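    <Paragraph> To make the marking concrete, the following sketch stores each lexicalisation with its language level and its category, then selects one on request. It is an assumption-laden illustration, not the WEBCOOP lexicon: the expr:pp entry and the technical marking of possess come from the text, while the names Lex, LEXICON and choose, the concept keys, and the word:v and word:adj category labels are hypothetical.

```python
from typing import NamedTuple


class Lex(NamedTuple):
    tokens: tuple
    register: str   # "slang", "popular", "standard" or "technical"
    category: str   # "word:..." or "expr:...", e.g. "expr:pp"


LEXICON = {
    "have": [
        Lex(("have",), "standard", "word:v"),
        Lex(("possess",), "technical", "word:v"),
    ],
    "international": [
        Lex(("international",), "standard", "word:adj"),
        Lex(("from", "the", "entire", "world"), "standard", "expr:pp"),
    ],
}


def choose(concept, register="standard", kind="word"):
    """Pick a lexicalisation with the requested language level and kind.

    Falls back to the first standard entry, the default level in the text.
    """
    entries = LEXICON[concept]
    for entry in entries:
        if entry.register == register and entry.category.startswith(kind + ":"):
            return " ".join(entry.tokens)
    for entry in entries:
        if entry.register == "standard":
            return " ".join(entry.tokens)
    raise KeyError(concept)
```

Here choose("have", register="technical") gives possess, as in SYN(b), and choose("international", kind="expr") gives the paraphrase from the entire world, as in SYN(c).</Paragraph>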
    <Paragraph position="4"> Opposites are encoded via a specific set of facts in the ontology at the concept level; the same holds for SUB(b), where subactivities of an activity are considered as parts. Treating SIS is more delicate since not all the sisters of a node may be appropriate. At this level, we use the conceptual metric introduced in section 2.1, with a quite high threshold, which we parameterize to evaluate its best level. The metric is highly dependent on the number of properties encoded, but it seems hard to have another strategy, since sampling methods would require too much corpus work. As an example, hut and cabin are considered as close sisters of bungalow, but chalet and country cottage are more remote. This aspect is crucial when relaxing concepts to find alternative solutions. Finally, metonymies will not be treated at the moment and FTI are dealt with in a more or less ad hoc way, using simple and naive geometrical considerations.
6 A lexicalisation model for QA
As shown in Fig. 2, lexicalisation strategies in QA pairs occur at five levels: L1: between the query and the direct response, L2: between the question and the cooperative know-how part, L3: within the direct response (insistence, paraphrases, exclusions, etc.), L4: between the direct response and the know-how part, and, finally, L5: within the know-how part, which may be composed of several components, as shown in (Benamara et al. 2004b).</Paragraph>
    <Paragraph position="5"> It seems that L1, L2, L3, L4 and L5 are organized hierarchically in the lexicalisation process.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 The relation L1
</SectionTitle>
      <Paragraph position="0"> Let us concentrate on the question focus, which is the term most sensitive to lexical variation (variations on other terms are essentially synonyms). Let us first consider L1, the lexicalisation strategies between the question and the direct response.
Figure 1 (annotated examples):
Q1: What are the &lt;FOC&gt;&lt;T2&gt;&lt;T3&gt;visit&lt;/T3&gt; hours&lt;/T2&gt; of the &lt;T1&gt;castle ?&lt;/T1&gt;&lt;/FOC&gt;
R1: This &lt;T1 op=GEN&gt;building&lt;/T1&gt; being the town hall, there are only &lt;T2 op=SUB&gt;&lt;T3 op=SUB&gt;guided tours&lt;/T3&gt; at fixed hours: 11 AM and 3 PM&lt;/T2&gt;.
Q2: Is water &lt;FOC&gt;drinkable&lt;/FOC&gt; in the Dominican Republic ?
R2: No, you must not &lt;FOC op=SYN(c)&gt;drink water from the tap&lt;/FOC&gt;, always water from bottles.
Q3: &lt;FOC&gt;Where&lt;/FOC&gt; is the Dominican Republic ?
R3: It is in the very heart of &lt;FOC op=GEN&gt;the Caribbean&lt;/FOC&gt;, it is the second largest &lt;[...]
The [...] overlap, observed in the domain ontology, between the response elements and the question focus. This is determined by comparing the subtree ST1 in the domain ontology whose root is the concept associated with the question focus with the subtree ST2 associated with the response or, in case of an enumeration, the smallest subtree that includes all the solutions. We basically have four situations:
* Focus adequate w.r.t. the response: both subtrees coincide, or ST2 is a proper subtree of ST1 (Q: what kind of tourist accommodation in Chamonix? R: we have all kinds of hotels, bungalows, chalets and campsites),
* Focus too large: ST2 is only a portion of a proper subtree of ST1 (Q: what kind of tourist accommodation in La Trairie? R: we have only cabins and bungalows),</Paragraph>
      <Paragraph position="1"> * Focus too narrow: ST2 is larger than ST1, i.e. ST2 dominates ST1 or has elements not included in ST1 (Q: An express train to Marseille at 11 PM? R: We have an express train at 11.15, but also a TGV, which is much faster, at 11.25), where there is a generalization over the types of trains,
* Shifts: ST2 is not included in ST1 (but, in some cases, a few solutions can be common) (Q: what hotel categories? R: we only have campsites and bungalows).</Paragraph>
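      <Paragraph> The four situations above can be approximated by set operations on subtrees. The sketch below is one possible operationalisation under simplifying assumptions, not the procedure used in WEBCOOP: the response is reduced to a list of answer concepts, a focus is taken as adequate when every direct subtype of the focus is represented, and a shift is recognised only when no answer falls inside ST1 (the case where a few solutions are common is not handled). The hierarchy CHILDREN and all function names are hypothetical.

```python
# Hypothetical fragment of the 'isa' hierarchy (parent to children).
CHILDREN = {
    "tourist_accommodation": ["hotel", "bungalow", "chalet", "campsite"],
    "campsite": ["cabin"],
    "train": ["express_train", "tgv"],
}


def subtree(root, children):
    """Set of concepts in the subtree rooted at `root` (root included)."""
    nodes, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in nodes:
            nodes.add(node)
            stack.extend(children.get(node, []))
    return nodes


def focus_situation(focus, answers, children):
    """Classify the overlap between the focus subtree ST1 and the answers."""
    if not answers:
        raise ValueError("need at least one answer concept")
    st1 = subtree(focus, children)
    inside = [a for a in answers if a in st1]
    outside = [a for a in answers if a not in st1]
    if not inside:
        return "shift"          # nothing in the response falls under the focus
    if outside:
        return "too narrow"     # the response goes beyond the focus subtree
    if focus in inside:
        return "adequate"       # the response is stated on the focus itself
    # Every direct subtype of the focus must be represented in the answer.
    covered = all(
        any(a in subtree(child, children) for a in inside)
        for child in children.get(focus, [])
    )
    return "adequate" if covered else "too large"
```

On the examples of the list above, the Chamonix answer is classified as adequate, the answer limited to cabins and bungalows as too large, the express train plus TGV answer as too narrow, and campsites and bungalows offered for a question on hotels as a shift.</Paragraph>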
      <Paragraph position="2"> The above situations are categorized in terms of the lexicalisation functions below. The percentage per focus situation is given for L1 only, and the more pragmatic aspects are briefly commented on:
Focus situation | Lexicalisation functions and comments
(1) [...] | [...] response via enumeration
(2) Too large (21%) | SUB(c), MET, SUB(a,b) for enumerations
(3) Too restricted (6%) | GEN, reference to generic concepts
(4) Fuzzy or not interpreted (2%) | FTI, SYN, needs an interpretation
(5) Adequate with conditions (18%) | SUB(a)(b) for enumerations
(6) Minor shifts (13%) | SIS for adjustments
In an NLG system, more schematic than real-life corpora, the above strategies allow for the production of adequate lexicalisations for the following situations, which seem to cover all the cases we found for yes/no and entity (what is, who) questions. Illustration: Q: what kinds of hotels in Megève?
* direct response, simple or complex (ID, SYN): we have all kinds of hotels in Megève.</Paragraph>
      <Paragraph position="3"> * response with restrictions (SYN, ID + SUB, GEN + SUB): We have all kinds of tourist accommodation except 4* hotels and huts, or simple adaptations (SIS),
* response scoping on precise subtopics or subtypes (SUB); the response can be negative on the query focus and then positive on a subtype: We do not have hotels, but comfortable, high standard, chalets,
* response with some form of generalization (as in example Q1): We have all types of tourist accommodation, from the cheapest to the fanciest (GEN, least upper bound),
* response with interpretation of fuzzy term (FTI),
* response with conditions, or case structures (SUB): we have 2* hotels at 30-60 Euros, 3* hotels at 50-80 Euros, etc.</Paragraph>
      <Paragraph position="4"> * response with forms of stress, insistence, explicitation via enumerations or paraphrases (SYN, SUB, OPP, GEN): we have all types of hotels: 2*, 3*, 4* and even 5* hotels.</Paragraph>
      <Paragraph position="5"> The implementation of L1 is functionally quite straightforward. The reasoning component produces a logical formula, from which the NLG functions (lexicalisation, but also aggregation and microplanning) operate. As indicated in the above table, the generator evaluates the relative overlap between the question focus and the response. It can then activate, for the terms to generate, the relevant lexicalisation function(s). The GEN function is quite specific since it also has some stylistic motivations; it is described in depth in (Benamara 2004). There are, at this level, several choices, which we manage via preferences given a priori since we do not have any user model. Stress and various additional enumerations have essentially a pragmatic motivation, as indicated above. At the implementation level, these can just be parameterized.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 The relations L2, L3 and L4
</SectionTitle>
      <Paragraph position="0"> Within the direct response, the most direct lexicalisation forms (L3, Fig. 2), can be paired with categories like OPP, SUB, FTI or MET. OPP is used within the response to reinforce some terms by saying what they are not. MET is in general a form of stress or focus on a particular point. MET is produced via the concept properties of the ontology.</Paragraph>
      <Paragraph position="1"> SUB(a)(b) are also used as forms of elaboration, to make a term more explicit (e.g. we accept all forms of payment: cheques, credit cards, cash, money orders, etc.), or to give a precision, possibly using FTI. Such lexicalisations have a strong pragmatic basis.</Paragraph>
      <Paragraph position="2"> However, in NLG, the use of OPP or of SYN(a)(b) can be motivated either by the user model or, if it does not exist, by the need for some elaboration if some concepts in the response are complex or quite remote from the original question focus. The implementation relies on the conceptual metric, also used for concept relaxation (Benamara et al. 2004a), and also on concepts manually marked as really complex in the ontology. The lexicalization programme is simply able to propose additional lexicalizations, structured by the aggregation procedure, e.g. as a list of SUB or OPP.</Paragraph>
      <Paragraph position="3"> In the know-how part of the cooperative response, lexicalisation strategies (L2) are more rigid, since they are guided by the reasoning procedures. We observe a large number of relaxations, involving lexicalisations based on SUB, SIS or GEN, depending on how the ontology is traversed by relaxation. Several intensional operations are also developed to make the response more compact, involving lexicalisations based on GEN (Benamara 2004). Lexicalization choices totally depend on the reasoning procedures and are just direct language realizations of the concepts introduced by relaxation or generalization inference schemas.</Paragraph>
      <Paragraph position="4"> There are lexicalisation relations between the direct response and the know-how part (L4, which has some relations with L2), in particular when the direct response is negative. The know-how part involves realisations which are in: - GEN, SUB, SIS or FTI relation with the term used in the direct response when relaxation or intensional calculus is used in the know-how part, - SYN or SUB relation when the know-how part is a warning or a restriction. Pronominal or other forms of reference are possible at this level (yes, we do have cabins, but they can only be rented on a weekly basis).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 The relation L5
</SectionTitle>
      <Paragraph position="0"> The relation L5 has a specific status: it covers a large number of situations, so we simply evoke prototypical situations here. Its motivation comes from the fact that the know-how part of a response is composed, in general, of several segments, called meaningful units. As explained in detail in (Benamara et al. 2004b), the know-how part is in general composed of a response elaboration (related to constraint relaxation, or any other type of inference) followed by several pieces of additional information in the form of justifications, warnings, restrictions, counter-proposals, etc.</Paragraph>
      <Paragraph position="1"> In the following example, we have several segments: Q: How can I get to the Borme castle ? Direct response + elaboration: You must take the GR90 from the old castle: additional information - precision: walking distance: 30 minutes.</Paragraph>
      <Paragraph position="2"> warning: There is no possibility to get there by car. In this example, the warning segment uses a kind of opposite of walking, namely driving, to warn the user e.g. about the lack of alternative.</Paragraph>
      <Paragraph position="3"> The following example has a know-how part based on relaxation, and then proposes an alternative to cinemas by means of a sister concept: Q: Is there a cinema in Borme ? Direct response: No, Know-how (relaxation on the localization): the closest cinema is at Londes (8 kms) or at Hyeres, Cinema Olbia (at 20 kms).</Paragraph>
      <Paragraph position="4"> Additional information (counter-proposal on the activity): however, there is a nice open-air theater in Borme.</Paragraph>
      <Paragraph position="5"> In this latter example, we see that, again, the additional information segment introduces a sister concept of cinema.</Paragraph>
      <Paragraph position="6"> Implementing L5 is largely dependent on the reasoning procedures, on the knowledge found in the database, and on the way the general form of the response is planned. The role of lexicalisation here is essentially choosing the appropriate terms, with the appropriate level in the ontology.</Paragraph>
    </Section>
  </Section>
</Paper>