<?xml version="1.0" standalone="yes"?>
<Paper uid="A83-1007">
  <Title>Isolating Domain Dependencies in Natural Language Interfaces</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ISOLATING DOMAIN DEPENDENCIES IN NATURAL LANGUAGE INTERFACES
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="47" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Isolating the domain-dependent information within a large natural language system offers the general advantages of modular design and greatly enhances the portability of the system to new domains. We have explored the problem of isolating the domain dependencies within two large natural language systems, one for generating a tabular data base from text (&amp;quot;information formatting&amp;quot;), the other for retrieving information from a data base. We describe the domain information schema which is used to capture the domain-specific information, and indicate how this information is used throughout the two systems.</Paragraph>
    <Paragraph position="1"> Prologue Computational linguistics is an interesting blend of science and engineering. It is science insofar as we are trying to understand a natural process -- verbal communication. It is engineering insofar as we are trying to manage complexity -- the complexity which, from our present viewpoint, seems inherent in natural language systems.</Paragraph>
    <Paragraph position="2"> One tool we have for managing complexity is modular design -- dividing a large system into components of manageable size. This may mean, in particular, separating procedures from knowledge sources and then separating different sources of knowledge. If we &amp;quot;factor&amp;quot; our system in an appropriate way, we may be able to reduce the size not just of individual components but of the system as a whole. Because of our involvement with large natural language systems, we have long been concerned with these issues of modularity \[Grishman 1980\]. Attacking substantial natural language applications will require that we scale up our already large systems, and we believe that this will be feasible only with systems which have been carefully divided into modules.</Paragraph>
    <Paragraph position="3">  One such division is the separation of domain-specific knowledge (sometimes called &amp;quot;world knowledge&amp;quot;) from knowledge of the language in general. Such a division not only confers the usual advantages of modularity in facilitating development of a system for a single application, but also greatly enhances the portability of a system to new domains.</Paragraph>
    <Paragraph position="4"> Portability is especially enhanced if the domain-specific information can be empirically verified and its discovery at least partially automated. What we require is a compact representation of the domain-specific information needed by the components of a natural language system, in a form which can be efficiently utilized by these components. Before we review our own efforts in this direction, a few historical comments on isolating domain-dependent information are in order.</Paragraph>
    <Paragraph position="5"> In the early 1970's, the prime concern for most designers was getting these domain-specific (&amp;quot;semantic&amp;quot;) constraints into their system; little emphasis was placed on isolating this information from the rest of the system.</Paragraph>
    <Paragraph position="6"> For example, in the LUNAR system the constraints of operating in a moon rocks world were interwoven with the semantic interpretation procedures \[Woods 1972\].</Paragraph>
    <Paragraph position="7"> Interestingly, one trend of the mid-70's was a tighter integration of domain-specific constraints with general grammatical constraints, in the form of semantic grammars \[Burton 1976\]. By merging grammatical and 'semantic' constraints, semantic grammars facilitated the construction of small natural language systems. On the other hand, they impeded the capture of grammatical regularities; adding a new syntactic pattern (e.g., reduced relatives) might require adding dozens of productions (one for each allowable combination of semantic classes). They also made it difficult to transport a system to a new domain. As a result, the most recent trend has been towards a careful isolation of domain-specific knowledge (e.g., the RUS System \[Bobrow 1980\]). In particular, some groups which developed substantial semantic-grammar-based systems, such as LADDER at SRI \[Hendrix 1978\] and PLANES at the Univ. of Illinois \[Waltz 1978\], have now developed syntactic grammars with separate domain information components.</Paragraph>
    <Paragraph position="8"> Our systems For all of the reasons mentioned above -- reduced size and complexity, better capture of grammatical regularities, greater portability, empirical verifiability -- we have been working for the past few years to factor out the domain dependencies from two large natural language systems. One of these is a system for information formatting -- the mapping of natural language text into a tabular data base, for subsequent use in information retrieval and statistical analysis; this system has been used to process radiology reports and hospital discharge summaries \[Sager 1978, Hirschman 1982b\]. The other is a question-answering system for data retrieval from relational data bases, including in particular those generated by information formatting \[Grishman 1978\].</Paragraph>
    <Paragraph position="9"> In both systems the initial processing -- parsing and transformational decomposition -- is performed by the Linguistic String Parser \[Sager 1981\]. In formatting, the transformationally regularized parse tree is mapped into an information format; the format is then &amp;quot;normalized&amp;quot; to recover zeroed information and analyze the time structure of the narrative \[Hirschman 1981\]. For question-answering the operator-operand tree (produced by transformational decomposition) is first translated into a logical form based on first-order predicate calculus; anaphoric phrases are resolved; the logical form is translated into a data base retrieval request; the data is retrieved; and, if necessary, a full-sentence answer is generated incorporating the retrieved data.</Paragraph>
    <Paragraph position="10"> The domain information schema (DIS) The domain-dependent information used by our systems has two basic aspects.</Paragraph>
    <Paragraph position="11"> First, it characterizes the structure of the information in the domain. Second, it specifies the correspondence between the information structures as they appear in the text and the various internal representations of information in the system.</Paragraph>
    <Paragraph position="12"> We call the characterization of the structure of information in the domain a domain information schema or DIS \[Grishman 1982\]. This characterization consists primarily of a set of semantic classes, the words and phrases which are members of these classes, and the allowable predicate-argument relationships among  these classes in this domain. For example, a schema for a medical domain would contain classes such as PT (patient), VPT (patient-verb), INDIC (indicator of sign or symptom), and</Paragraph>
  </Section>
  <Section position="3" start_page="47" end_page="49" type="metho">
    <SectionTitle>
BODY-PART INDIC
</SectionTitle>
    <Paragraph position="0"> neck stiff (here class names are given in upper case and members of the classes in lower case). Certain other properties of these predicates, such as functional dependencies among arguments, are also included in the DIS. For example, in the medical domain there is a functional relationship from tests to patients because each test is of one and only one patient. The DIS is thus similar to data base schemata and to the frame-slot structures of frame-based systems.</Paragraph>
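    <Paragraph> The DIS just described can be pictured as a small set of tables. The following Python sketch is our own modern illustration (the class names, members, and the functional dependency follow the examples in the text; the dictionary layout is an assumption, not the original implementation):

```python
# A toy domain information schema (DIS) for the medical sublanguage:
# word classes, allowable predicate-argument triples, and functional
# dependencies among arguments, as described in the text.
WORD_CLASS = {
    "patient": "PT", "complained": "VPT", "stiff": "INDIC",
    "neck": "BODY-PART", "x-ray": "TEST", "biopsy": "TEST",
}

# Allowable subject-verb-object patterns: verb class maps to the
# permitted (subject-class, object-class) pairs for this domain.
VSO = {
    "VPT": {("PT", "INDIC")},
    "VHAVE": {("PT", "TEST")},
}

# Functional dependencies: in VHAVE-PT-TEST, each TEST belongs to
# exactly one PT (each test is of one and only one patient).
FUNC_DEP = {("VHAVE", "TEST"): "PT"}

def word_class(word):
    """Look up the semantic class of a word (None if unknown)."""
    return WORD_CLASS.get(word.lower())

print(word_class("neck"))   # BODY-PART
```

A schema of this shape is close to both a data base schema and a frame-slot structure, as the text notes.
</Paragraph>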
    <Paragraph position="1"> Using the DIS The domain information schema is used most extensively in the parsing stage of the two systems. The predicate-argument constraints of the DIS yield a sublanguage in the sense of Harris \[Harris 1968\]. These constraints are enforced by a set of selectional restrictions added to the Linguistic String Project English grammar. The task of enforcing these constraints is complicated by the wide variety of surface structures in which a subject-verb-object pattern may appear: declarative, interrogative, and imperative sentences; active and passive voice; in main clauses, relatives, and reduced relatives; etc. The complexity of the restrictions is reduced by the power of the Restriction Language to operate in terms of the string relations (e.g., subject-verb-object or host-modifier relations) \[Sager 1975\], but it is still substantial. The virtue of this approach, however, is that these restrictions are essentially constant across sublanguages, while the DIS, which will change and grow for new applications, is kept to a minimum.</Paragraph>
    <Paragraph position="2"> The following sentence fragment illustrates the use of selectional restrictions to obtain the correct parse: Blood cultures obtained in the visit to the emergency room prior to admission.</Paragraph>
    <Paragraph position="3"> Here the problem is the placement of the prepositional phrase prior to admission, which could modify the directly adjacent noun phrase (the emergency room), but in fact modifies the preceding noun phrase the visit. The selection for prepositional phrases on their hosts is given by the P-NSTGO-HOST list (part of the DIS). The portion of this list relevant for the preposition prior to is:  This list contains the information that a time preposition (PREPTIME, e.g., prior to) can appear with a VMD (medical action word) as its prepositional object (e.g., prior to admission), with the prepositional phrase modifying another VMD word (e.g., visit); this corresponds to the P-NSTGO-HOST pattern PREPTIME: (VMD: (VMD)). There is no pattern PREPTIME: (VMD: (INST)) which would allow prior to admission to modify the INST word (institution word) emergency room. The application of the selectional constraints ensures that the incorrect parse will be eliminated and the correct one produced.</Paragraph>
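    <Paragraph> The host-selection check on prepositional phrases can be sketched as a table lookup. This Python fragment is an illustrative reconstruction, not the original Restriction Language implementation; the single table entry is taken from the worked example in the text:

```python
# Sketch of the P-NSTGO-HOST check: a prepositional phrase may attach
# to a candidate host only if the pattern
# PREP-CLASS: (OBJECT-CLASS: (HOST-CLASS)) appears in the DIS list.
P_NSTGO_HOST = {
    # (prep class, prep-object class) maps to the allowed host classes
    ("PREPTIME", "VMD"): {"VMD"},   # "prior to admission" may modify "visit"
}

def allowed_host(prep_class, obj_class, host_class):
    """True if the DIS licenses this PP attachment."""
    return host_class in P_NSTGO_HOST.get((prep_class, obj_class), set())

# "prior to admission" (PREPTIME with VMD object):
print(allowed_host("PREPTIME", "VMD", "VMD"))   # True: attaches to "the visit"
print(allowed_host("PREPTIME", "VMD", "INST"))  # False: not "emergency room"
```
</Paragraph>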
    <Paragraph position="4"> In order to verify these constraints, the restrictions must determine the semantic class membership of the noun phrases in the sentence. Usually the class of a noun phrase is that of the &amp;quot;core&amp;quot; noun of the phrase. In certain cases, however, the class of the noun phrase as a whole is a function of the classes of both core and adjunct. For instance, a BODY-PART word modified by an INDIC word becomes an INDIC phrase, as in stiff neck. In some cases, the core is &amp;quot;transparent&amp;quot; and the class of the noun phrase is determined by the class of its right or left adjunct, so that, for example, onset of swelling would be in the same class as swelling. To accommodate these situations, the DIS contains rules for such phrasal or &amp;quot;computed&amp;quot; attributes (see the N-LN-COMP-ATT and N-RN-COMP-ATT lists in the appendix). Each time a noun is encountered which can participate in a computed attribute construction, its adjuncts are checked and, if appropriate, a computed attribute is assigned to be used in further selectional restrictions.</Paragraph>
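    <Paragraph> The computed-attribute rules can be sketched as follows. The rule entries stand in for the N-LN-COMP-ATT and N-RN-COMP-ATT lists of the appendix; the function itself is our illustrative assumption:

```python
# Sketch of "computed attributes": the class of a noun phrase may be
# a function of core and adjunct rather than the core alone.
LN_COMP_ATT = {("BODY-PART", "INDIC"): "INDIC"}  # stiff neck is INDIC
TRANSPARENT = {"onset"}  # class is taken from the right adjunct

def phrase_class(core, core_class, left_adj_class=None, right_adj_class=None):
    if core in TRANSPARENT and right_adj_class is not None:
        return right_adj_class          # onset of swelling: class of swelling
    computed = LN_COMP_ATT.get((core_class, left_adj_class))
    if computed is not None:
        return computed                 # left adjunct recomputes the class
    return core_class                   # default: class of the core noun

print(phrase_class("neck", "BODY-PART", left_adj_class="INDIC"))  # INDIC
print(phrase_class("onset", "BEGIN", right_adj_class="INDIC"))    # INDIC
```
</Paragraph>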
    <Paragraph position="5"> The selectional restrictions serve to exclude many incorrect but syntactically well-formed parses. Constraints on coordinate conjunction (requiring the conjoining of phrases from identical or similar semantic classes), acting together with the computed attribute mechanism, serve to reduce the structural ambiguity of conjoined constructs, always a serious problem \[Hirschman 1982a\]. The noun phrase anorexia and onset of a stiff neck illustrates this process. There are two possible parses for this phrase, namely (anorexia and onset) of a stiff neck, which is incorrect; and the correct analysis (anorexia) and (onset of a stiff neck). The conjunction mechanism requires that only &amp;quot;similar&amp;quot; elements be conjoined; this rules out the conjunction of anorexia (an INDIC word) and onset (a BEGIN word).</Paragraph>
    <Paragraph position="6"> However, the phrase stiff neck receives a computed attribute INDIC; and onset is &amp;quot;transparent&amp;quot;, so it receives a computed attribute INDIC from its right adjunct stiff neck. Therefore the entire noun phrase onset of a stiff neck has a computed attribute INDIC, and can conjoin (as a phrase) to anorexia, giving the correct parse.</Paragraph>
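    <Paragraph> The conjunction constraint of the worked example reduces to a class comparison. The check below is deliberately simplified (the full system also admits &amp;quot;similar&amp;quot; rather than only identical classes); the classes are those from the text:

```python
# Simplified sketch of the conjunction constraint: only phrases of the
# same semantic class may conjoin (the real system also allows merely
# "similar" classes).
def can_conjoin(class_a, class_b):
    return class_a == class_b

# "anorexia" is INDIC; bare "onset" is BEGIN, so the bracketing
# (anorexia and onset) is rejected, while "onset of a stiff neck"
# gets computed attribute INDIC and conjoins as a phrase.
print(can_conjoin("INDIC", "BEGIN"))   # False
print(can_conjoin("INDIC", "INDIC"))   # True
```
</Paragraph>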
    <Paragraph position="7"> In addition, these selectional patterns can be used to resolve most homographs, that is, to determine the class assignment of words which are members of several classes. For example, the word injection is ambiguous: it can mean inflammation (an INDIC word), as in throat injection, or it can mean shot (a VTR word), as in penicillin injection.</Paragraph>
    <Paragraph position="8"> The DIS enables a homograph to be disambiguated, provided that sufficient context is present. For example, in the phrase throat injection, the combination INDIC: (BODY-PART) is allowed in the compound noun (N-NPOS) relation, whereas the combination VTR: (BODY-PART) is not allowed (see the appendix for the N-NPOS list). This disambiguation is important because the subsequent mapping into an internal representation (information format or predicate calculus expression) is dependent on the correct identification of the semantic class of each information-carrying word.</Paragraph>
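    <Paragraph> Homograph resolution thus amounts to filtering the candidate classes of a word through the allowable patterns. A minimal sketch, with the N-NPOS entry and the homograph taken from the text (the function and its layout are our assumptions):

```python
# Sketch of homograph resolution via the N-NPOS (compound noun) list:
# keep only those class assignments of an ambiguous head noun that
# yield an allowable HEAD-CLASS: (MODIFIER-CLASS) pattern.
N_NPOS = {"INDIC": {"BODY-PART"}}   # INDIC:(BODY-PART) allowed; VTR:(BODY-PART) not
HOMOGRAPHS = {"injection": {"INDIC", "VTR"}}

def resolve_homograph(head, modifier_class):
    """Return the surviving class assignments for an ambiguous head."""
    candidates = HOMOGRAPHS.get(head, set())
    return {c for c in candidates if modifier_class in N_NPOS.get(c, set())}

print(resolve_homograph("injection", "BODY-PART"))  # {'INDIC'}: throat injection
```
</Paragraph>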
    <Paragraph position="9"> The anaphora resolution procedure in the question-answering system relies crucially on the DIS. The same mechanism which uses context to resolve homographs also serves to determine the possible class assignment(s) for an anaphoric phrase. For example, if given the question Did it show swelling?  the procedure would consult the DIS to determine what classes of subjects can occur with a VSHOW verb (show) and an INDIC object (swelling). The DIS (Appendix, section 2) indicates that the subject in this context can be a BODY-PART or a TEST. The anaphora resolution procedure then searches the current and prior sentences for an antecedent belonging to one of those classes.</Paragraph>
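    <Paragraph> The anaphora step can be sketched as a constrained search. The allowed subject classes for VSHOW with an INDIC object are as stated in the text; the search loop and data layout are our illustrative assumptions:

```python
# Sketch of DIS-driven anaphora resolution for "Did it show swelling?":
# the DIS constrains the subject of a VSHOW verb with an INDIC object,
# and the antecedent search is limited to those classes.
VSO_SUBJ = {("VSHOW", "INDIC"): {"BODY-PART", "TEST"}}

def find_antecedent(verb_class, obj_class, prior_nouns):
    """prior_nouns: (word, class) pairs from current and prior
    sentences, most recent first."""
    allowed = VSO_SUBJ.get((verb_class, obj_class), set())
    for word, cls in prior_nouns:
        if cls in allowed:
            return word
    return None

prior = [("swelling", "INDIC"), ("chest x-ray", "TEST"), ("patient", "PT")]
print(find_antecedent("VSHOW", "INDIC", prior))  # chest x-ray
```
</Paragraph>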
    <Paragraph position="10"> The DIS also plays a role in the translation of queries into predicate calculus. Specifically, the information on functional dependencies between arguments of predicates is used in determining the scope of quantifiers and conjunctions. Consider the following two sentences, which have similar syntactic structures:  (1) How many patients have had an X-ray and a biopsy? (2) How many biopsies did patient X and patient Y have?  The first question asks for a single number; in other words, the scope of how many is wider than the scope of and. The second question, however, asks for two numbers: the number of biopsies X had and the number Y had; in this case, the scope of and is wider than that of how many. We know that this is the only possible interpretation of the second question because there are no &amp;quot;group biopsies&amp;quot; -- each biopsy is of one and only one patient. This fact is encoded in the DIS as a functional dependency from TEST to PT (patient) in the triple VHAVE-PT-TEST. By using this functional dependency information, the system is able to assign the correct interpretation to the two questions.</Paragraph>
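    <Paragraph> The scope decision reduces to a functional-dependency test. A minimal sketch of that test, with the VHAVE-PT-TEST dependency from the text (the decision rule as written here is our simplification of the full scoping procedure):

```python
# Sketch of scope determination from functional dependencies: if the
# counted (how-many) argument is functionally dependent on the
# conjoined argument -- each biopsy has exactly one patient -- then
# the conjunction outscopes the quantifier.
FUNC_DEP = {("VHAVE", "TEST"): "PT"}   # TEST determines its PT

def conjunction_wide_scope(pred, counted_arg, conjoined_arg):
    """True if 'and' takes wider scope than 'how many'."""
    return FUNC_DEP.get((pred, counted_arg)) == conjoined_arg

# (1) How many patients had an X-ray and a biopsy?  -- one count
print(conjunction_wide_scope("VHAVE", "PT", "TEST"))   # False
# (2) How many biopsies did patient X and patient Y have?  -- two counts
print(conjunction_wide_scope("VHAVE", "TEST", "PT"))   # True
```
</Paragraph>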
    <Paragraph position="11"> A further application of the DIS (not yet implemented) is the retrieval of &amp;quot;implicit&amp;quot; or omitted information. For example, certain compound noun constructions can be considered to result from the omission of the connector between the two nouns. In these cases, it may be possible to use the verb-subject-object list of the DIS to identify the omitted verb. This can be done by assuming that the head noun of the compound noun phrase will be the subject of the verb, and the modifying noun the object. Thus, given the phrase infectious disease consultant, we have a compound noun whose head is in the DOCTOR class, and the modifying noun in the INDIC class. If we search the V-S-O list of the DIS (see appendix) for candidate verbs, we find that a verb of class VTR (treatment) can take a DOCTOR subject and an INDIC object. If, in addition, each class has a distinguished &amp;quot;default&amp;quot; member (e.g., treat for the VTR class), it may be possible to regularize the compound noun by restoring the omitted information (infectious disease consultant &lt;= consultant who treats infectious disease).</Paragraph>
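    <Paragraph> Since this application is explicitly unimplemented in the paper, the sketch below is purely speculative; it follows the proposed procedure on the text's own example, with the table contents and the default member as illustrative assumptions:

```python
# Speculative sketch of recovering an omitted verb in a compound noun:
# search the V-S-O list for a verb class taking the head's class as
# subject and the modifier's class as object, then substitute that
# class's distinguished "default" member.
VSO = {"VTR": {("DOCTOR", "INDIC")}}   # treatment verb: DOCTOR treats INDIC
DEFAULT_MEMBER = {"VTR": "treats"}

def restore_verb(head_class, modifier_class):
    for verb_class, pairs in VSO.items():
        if (head_class, modifier_class) in pairs:
            return DEFAULT_MEMBER.get(verb_class)
    return None

# infectious disease consultant: consultant who treats infectious disease
print(restore_verb("DOCTOR", "INDIC"))   # treats
```
</Paragraph>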
    <Section position="1" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
Generating Internal Representations
</SectionTitle>
      <Paragraph position="0"> The semantic classes, and the subject-verb-object and host-adjunct patterns are also used to specify the correspondence between the textual and internal representations.</Paragraph>
      <Paragraph position="1"> In the current information format for hospital records, most classes map into a unique format column. The formatting procedure records this correspondence as a list of semantic class - format column pairs. For some modifiers, however, such as time modifiers, aspectuals, and negation, the placement of the modifier in the format depends on the class (and hence the placement) of the host; special transformations are provided for the formatting of these modifiers.</Paragraph>
      <Paragraph position="2"> The question-answering system has provided slightly greater generality within a two-stage mapping. Syntactically analyzed queries are first mapped into an extended predicate calculus. For each subject-verb-object and host-adjunct pattern in the DIS, we specify a predicate (or set of predicates) and a correspondence between syntactic roles (subject, object, sentence adjunct) and argument positions. Later (after anaphora resolution) the predicate calculus expression is mapped into a retrieval request on the information format; each predicate is defined as a projection of the information format.</Paragraph>
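      <Paragraph> The first mapping stage can be sketched as a pattern-to-predicate table plus a role correspondence. Predicate name, table layout, and function are all illustrative assumptions; only the pattern-to-predicate idea comes from the text:

```python
# Sketch of the first mapping stage: each V-S-O pattern in the DIS is
# paired with a predicate and a correspondence from syntactic roles
# (subject, object, ...) to argument positions.
PATTERN_TO_PRED = {
    ("VHAVE", "PT", "TEST"): ("has_test", ("subject", "object")),
}

def to_predicate(verb_class, subj_class, obj_class, roles):
    """Map a classified S-V-O triple to a predicate-calculus atom."""
    pred, arg_order = PATTERN_TO_PRED[(verb_class, subj_class, obj_class)]
    return (pred,) + tuple(roles[r] for r in arg_order)

roles = {"subject": "patient X", "object": "biopsy"}
print(to_predicate("VHAVE", "PT", "TEST", roles))
# ('has_test', 'patient X', 'biopsy')
```
</Paragraph>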
      <Paragraph position="3"> Automated verification and discovery procedures One of the attractive features of the DIS is that it is empirically verifiable; some of our current research also addresses the possibility of (at least partial) automation of a discovery procedure for portions of the DIS.</Paragraph>
      <Paragraph position="4"> Semantic classes can be identified within a sublanguage, using techniques of distribution analysis \[Hirschman 1975\] : for each pair of words in a parsed, regularized sample corpus of a sublanguage, a similarity coefficient is computed based on how many common syntactic environments the two words occurred in (e.g., as the subject of the same verb). &amp;quot;Clusters&amp;quot; of similar words are then formed by grouping together words whose similarity coefficients exceed a certain threshold value. This technique has been used to identify the frequently occurring members of the major semantic classes of a radiology report domain.</Paragraph>
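      <Paragraph> The distributional step can be sketched as follows. The Jaccard-style coefficient and the greedy grouping below are our own choices of measure and algorithm, not necessarily those of \[Hirschman 1975\]; the data is invented for illustration:

```python
# Sketch of distributional analysis: a similarity coefficient from
# shared syntactic environments, then threshold clustering.
def similarity(envs_a, envs_b):
    """Fraction of syntactic environments the two words share."""
    shared = len(envs_a.intersection(envs_b))
    total = len(envs_a.union(envs_b))
    return shared / total if total else 0.0

ENVS = {                                # word: environments observed
    "x-ray": {"subj-of-show", "obj-of-have"},
    "biopsy": {"subj-of-show", "obj-of-have"},
    "neck": {"host-of-stiff"},
}

def cluster(envs, threshold):
    """Greedily group words whose pairwise similarity exceeds threshold."""
    groups = []
    for w in sorted(envs):
        placed = False
        for g in groups:
            if all(similarity(envs[w], envs[v]) >= threshold for v in g):
                g.append(w)
                placed = True
                break
        if not placed:
            groups.append([w])
    return groups

print(cluster(ENVS, 0.5))   # [['biopsy', 'x-ray'], ['neck']]
```
</Paragraph>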
      <Paragraph position="5"> Given the semantic classes, it is then possible to identify the selectional patterns, simply by recording those patterns that occur in (good) parses.</Paragraph>
      <Paragraph position="6"> This provides verification of the DIS selectional patterns. It also allows collection of data on the relative frequency of occurrence of the various patterns. The frequency data would permit use of a weighting algorithm, in order to &amp;quot;prefer&amp;quot; a parse with more frequently occurring patterns to an alternate parse with less frequently occurring patterns. The &amp;quot;preferential&amp;quot; approach may allow significant enhancement of parsing robustness compared to the &amp;quot;accept/reject&amp;quot; approach currently used. (In the &amp;quot;accept/reject&amp;quot; approach, a parse is either acceptable, or if it violates any constraints, it is rejected; there is no notion of &amp;quot;relative goodness&amp;quot; of several parses). The preferential approach would be particularly useful for incremental development of a DIS in a new sublanguage, where only partial data on selectional patterns is available, and also in highly non-deterministic parsing, such as speech understanding.</Paragraph>
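      <Paragraph> A minimal sketch of such a preferential scorer, under the assumptions that pattern frequencies are simply multiplied and unseen patterns receive a small smoothing count (both choices are ours, since the text only proposes the idea):

```python
# Sketch of "preferential" parse scoring: rank competing parses by the
# observed frequency of the selectional patterns they use, instead of
# an all-or-nothing accept/reject decision. Counts are invented.
PATTERN_FREQ = {
    ("PREPTIME", "VMD", "VMD"): 40,
    ("PREPTIME", "VMD", "INST"): 1,
}

def parse_score(patterns):
    """Product of pattern frequencies; unseen patterns get a small
    smoothing count rather than outright rejection."""
    score = 1.0
    for p in patterns:
        score *= PATTERN_FREQ.get(p, 0.1)
    return score

good = [("PREPTIME", "VMD", "VMD")]
bad = [("PREPTIME", "VMD", "INST")]
print(parse_score(good) > parse_score(bad))   # True
```
</Paragraph>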
      <Paragraph position="7"> One of the issues in automating the discovery procedure for the selectional patterns of the DIS is how to prevent patterns from bad parses from being included in the DIS (and thus allowing even more bad parses). The use of weighted patterns may provide a means for automating the discovery of the DIS, since &amp;quot;correct&amp;quot; patterns are more likely to outnumber random &amp;quot;incorrect&amp;quot; patterns from bad parses. These issues are the subject of an ongoing research project.</Paragraph>
      <Paragraph position="8">  of our systems. Our experience with the rather different student transcript data base has indicated that not all domain dependencies have yet been isolated, particularly in specifying the mapping from textual to internal representation. Problems arose with the characterization of sentence adjuncts, units of time (semesters instead of days and months), and nouns or noun phrases implying computations (grade point average, enrollment), which we intend to rectify shortly by enriching the DIS.</Paragraph>
      <Paragraph position="9"> Our experiments also indicated that relatively limited domain-specific information (primarily a characterization of the structure of information in a domain, rather than specific facts about the domain) can be adequate for certain natural language applications, such as those described. Problems arose more often because the selectional constraints were too &amp;quot;tight&amp;quot; than because constraints deducible from specific facts of the domain were not available. As a result, we are now beginning to experiment with the automatic selective relaxation of these restrictions in order to improve parsing performance.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="49" end_page="49" type="metho">
    <SectionTitle>
Acknowledgements
</SectionTitle>
    <Paragraph position="0"> This research was supported in part by National Science Foundation grants MCS 80-02453 from the Division of Mathematical and Computer Sciences and IST 81-15669 from the Division of Information Science and Technology; in part by National Library of Medicine grant I-R01-LM03933, awarded by the National Institutes of Health, Department of Health and Human Services; and in part by Office of Naval Research contract N00014-75-C-0571, NR 049-347.</Paragraph>
    <Paragraph position="1"> Both systems described above have been extensively tested. The formatting procedure has been applied to a set of 14 hospital discharge summaries containing over 700 sentences; it is currently being used to process other types of hospital records. The question-answering system has been used on a data base of simplified formatted radiology records. In addition, to test its portability to quite different domains, we have applied the system to a simple data base of student transcripts.* In the course of this work, we have developed a simple, compact representation  of domain-specific knowledge and have thereby substantially reduced the complexity and increased the portability * The student data base was developed by V. K. Lamson as her master's thesis \[Lamson 82\].</Paragraph>
    <Paragraph position="2"> APPENDIX AN EDITED DOMAIN INFORMATION SCHEMA FOR A MEDICAL SUBLANGUAGE 1. SUBLANGUAGE SEMANTIC CLASSES * Below are some of the sublanguage classes used in the medical * domain information schema; note that classes may contain * words from different syntactic classes. The 15 classes * shown below were selected for illustrative purposes from * the over 50 classes in the full DIS.</Paragraph>
    <Paragraph position="3"> * Classes are given in the format: (abbreviated) CLASS NAME, * \[explanation of name\], followed by a few class members.</Paragraph>
  </Section>
  <Section position="5" start_page="49" end_page="49" type="metho">
    <SectionTitle>
2. ALLOWABLE PREDICATE-ARGUMENT RELATIONSHIPS
</SectionTitle>
    <Paragraph position="0"> describes which classes of head noun can be modified by which classes of compound noun (NPOS) modifier in the form: HEAD-NOUN1: (MODIFIER-NOUN11,...,MODIFIER-NOUN1n) , HEAD-NOUN2: (MODIFIER-NOUN21,...,MODIFIER-NOUN2m), Thus the compound noun INDIC :(BODY-PART), as in throat injection, is allowable, but the compound noun BODY-PART :(INDIC), as in injection throat, is not.</Paragraph>
  </Section>
</Paper>