<?xml version="1.0" standalone="yes"?> <Paper uid="C82-1014"> <Title>NATURAL LANGUAGE INTERFACES USING LIMITED SEMANTIC INFORMATION</Title> <Section position="1" start_page="0" end_page="89" type="metho"> <SectionTitle> NATURAL LANGUAGE INTERFACES USING LIMITED SEMANTIC INFORMATION </SectionTitle> <Paragraph position="0"> In order to analyze their input properly, natural language interfaces require access to domain-speciflc semantic information. However, design considerations for practical systems -- in particular, the desire to construct interfaces which are readily portable to new domains -require us to limit and segregate this domain-specific information. We consider here the possibility of limiting ourselves to a characterization of the structure of information in a domain. This structure is captured in a domain information schema, which specifies the semantic classes of the domain, the words and phrases which belong to these classes, and the predicate-argument relationships among members of these classes which are meaningful in the domain. We describe how this schema is used by the various stages of two large natural language processingsystems.</Paragraph> <Paragraph position="1"> The necessity of incorporating domain-specific semantic information into natural language processing systems is now generally recognized. The task we face as computational linguists lies in selecting this information, organizing it, and integrating it into a natural language processing system.</Paragraph> <Paragraph position="2"> In principle, no limit can be placed on the semantic knowledge needed for natural language analysis -- given essentially any fact, one can devise a natural language input which requires knowledge of that fact for its correct interpretation. For the construction of operational systems, however, there are practical limitations on our ability to collect and organize the domain-specific knowledge for any substantial domain. Rather than ignore such limitations, we should use them as a motivation for identifying manageable components of this domaln-specific knowledge. Such considerations are especially important if we are aiming to construct _portable systems -- systems which can be readily moved from one domain to another.</Paragraph> <Paragraph position="3"> What properties should such a component have? It should * be effective in providing the information needed to guide the analysis of the input text; * have a ~ structure, to facilitate both the collection of the information and its use in the language analysis procedures; * have a discoverv procedure -- a systematic way of collecting * Present affiliation: Research and Development Activity, Federal and Special Systems Group, Burroughs Corp., Paoli, PA.</Paragraph> <Paragraph position="4"> R.G~SHMAN, L. HIRSCHMANandC. FRIEDMAN this information for a new domain.</Paragraph> <Paragraph position="5"> We suggest that a characterization of the structure of information in a domain is such a semantic component. We call this component a domain information schema (DIS). A DIS specifies a set of semantic classes, the words and phrases which belong to these classes, and the predicate-argument relationships among members of these classes which are meaningful in this domain. Some features of these relationships, such as functional dependencies between semantic classes, are also noted.</Paragraph> <Paragraph position="6"> This is not a novel assemblage of information. 
<Paragraph position="6"> This is not a novel assemblage of information. The DIS is perhaps most similar to data base schemata, which also seek to separate a description of the structure of information in a domain from the specific facts about a domain. In frame-based systems, this information is essentially captured by the top-level frames, although the delineation here between structural description and specific facts is not as precise. Semantic grammars embed much of the information of the DIS, although there it is mixed with general linguistic knowledge. Certain parsers (e.g., the RUS parser \[1\]) also make use of aspects of information stored in a separate semantic component. Thus information similar to a DIS has been used, at least implicitly, by other natural language systems; however, little research has been explicitly concerned with the task of choosing a subset of the domain-specific information and evaluating it using criteria such as those mentioned above. We therefore decided to address this question with respect to the DIS in our recent research.</Paragraph> <Paragraph position="7"> To this end, we have recently modified portions of two large natural language systems so that all domain-specific knowledge is isolated in a DIS. One of these is a system for the information formatting of natural language medical reports; the other, a &quot;question-answering&quot; system for data base retrieval using natural language. We shall report here on how information from the DIS is used in the various stages of analysis.*</Paragraph> <Paragraph position="8"> * We have concurrently been investigating discovery procedures for DIS's; some of our early work in this area was reported in \[6\].</Paragraph> </Section> <Section position="2" start_page="89" end_page="89" type="metho"> <SectionTitle> THE SYSTEMS </SectionTitle> <Paragraph position="0"> The information formatting system \[2\] is designed to accept natural language text in some scientific or technical domain and map the text into a domain-specific structure (an information format) which is suitable for subsequent retrieval operations. In essence, the format is a set of tables in which each category of domain information (for example, for hospital reports: laboratory tests, laboratory findings, diagnoses, treatments, etc.) is assigned a separate column. This formatting procedure has been successfully applied to radiology reports and to hospital discharge summaries. The question-answering system \[3\] accepts natural language queries regarding the data in the text and retrieves the requested information from the formatted data base. Both systems use the Linguistic String Parser and grammar \[4\] to obtain a parse and transformational decomposition of the input sentence. The grammar is an augmented context-free grammar written in Restriction Language \[5\]. In the formatting procedure, the decomposition tree is mapped into the information format; the format then goes through a normalization component which fills in implicit information and a component to analyze the time structure of the narrative. For question answering, the decomposition tree is mapped into an extended predicate calculus formula; this is followed by anaphora resolution and translation of the formula into a data base retrieval request.</Paragraph>
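<Paragraph> As a purely illustrative sketch (the column names, word classes, and helper function are hypothetical, not the original implementation), an information format can be pictured as a table with one column per category of domain information, and the mapping from an analyzed sentence into the format can be driven by a small table of correspondences between semantic classes and format columns, an idea elaborated in the FORMATTING section below.
# Hypothetical sketch: map (word, semantic class) pairs from an analyzed
# sentence into one row of an information format.
CLASS_TO_COLUMN = {
    "DIAGNOSIS": "DIAGNOSIS",
    "MEDICATION": "TREATMENT",
    "SIGN-SYMPTOM": "FINDING",
    "LAB-TEST": "LAB-TEST",
}

def format_row(classified_words):
    """classified_words: list of (word, semantic_class) pairs for one sentence."""
    row = {column: None for column in CLASS_TO_COLUMN.values()}
    for word, semantic_class in classified_words:
        column = CLASS_TO_COLUMN.get(semantic_class)
        if column is not None:
            row[column] = word
    return row

# e.g. format_row([("heart disease", "DIAGNOSIS"), ("cardiac meds", "MEDICATION")])
# yields {"DIAGNOSIS": "heart disease", "TREATMENT": "cardiac meds", ...}
</Paragraph>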
</Section> <Section position="3" start_page="89" end_page="89" type="metho"> <SectionTitle> SELECTION </SectionTitle> <Paragraph position="0"> The domain information schema is most directly reflected in the syntax of the language, forming a sublanguage as described by Harris \[7\]. The semantic classes and relationships, as defined by the DIS, are used to formulate sublanguage selectional constraints.</Paragraph> <Paragraph position="1"> These constraints rule out incorrect syntactic analyses, many of which are caused by structural ambiguity due to adjunct placement and conjunction, and by lexical ambiguity due to homographs.</Paragraph> <Paragraph position="2"> The selection mechanism is list driven to provide for portability from one sublanguage to another. These lists specify, for each basic linguistic relation, such as SUBJECT-VERB-OBJECT or HOST-ADJECTIVE, the patterns of word classes which are permissible in the sublanguage. Each basic linguistic relation has many surface realizations for which selection must be checked. The SUBJECT-VERB-OBJECT relation, for instance, may appear in declaratives and questions, in main and relative clauses, in active and passive voice, in perfect and progressive forms, etc. This task is greatly simplified, however, by the linguistic routines of the Restriction Language \[4,5\], which locate the elements of the parse tree bearing the underlying SUBJECT-VERB, VERB-OBJECT, and HOST-ADJUNCT relations.</Paragraph> <Paragraph position="3"> An example of how the DIS eliminates incorrect parses in the medical sublanguage can be seen in the sentence from a medical text, &quot;Brother 18 also has heart disease, on cardiac meds&quot;, which has two analyses: one where &quot;on cardiac meds&quot; is an adjunct of &quot;heart disease&quot; and the other where it is an adjunct of &quot;brother&quot;. There is a HOST-ADJUNCT pattern for the classes FAMILY-MEMBER ON MEDICATION but not for DIAGNOSIS ON MEDICATION; thus only the second analysis has a pattern matching one in the DIS.</Paragraph>
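<Paragraph> As an illustration only (the data and function below are hypothetical, not the actual list-driven implementation), the selection check for the example above can be sketched as a lookup in the DIS pattern lists: a candidate HOST-ADJUNCT analysis is retained only if the classes of its host and adjunct form a permissible pattern.
# Hypothetical sketch of sublanguage selection for the HOST-ADJUNCT relation.
HOST_ADJUNCT_PATTERNS = {
    ("FAMILY-MEMBER", "ON", "MEDICATION"),   # "brother ... on cardiac meds"
    ("MED-VERB", "FROM", "INSTITUTION"),     # "discharge from hospital"
}

def host_adjunct_allowed(host_class, preposition, adjunct_class):
    """Return True if the DIS lists this class pattern as meaningful."""
    return (host_class, preposition, adjunct_class) in HOST_ADJUNCT_PATTERNS

# The two competing analyses of the example sentence:
host_adjunct_allowed("DIAGNOSIS", "ON", "MEDICATION")       # False: parse rejected
host_adjunct_allowed("FAMILY-MEMBER", "ON", "MEDICATION")   # True: parse retained
</Paragraph>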
<Paragraph position="4"> Matching the patterns is only one function of the selection procedure. When a match is successful, those classes which match the pattern are recorded as &quot;selected attributes&quot; so that they may be referenced at a later point in processing. Once a pattern is established, the &quot;selected attribute&quot; classes are preferred to the original ones. Additional selectional constraints will refer to the &quot;selected attributes&quot; of a word if they exist. How this procedure aids in the disambiguation of homographs can be shown using the homograph &quot;discharge&quot;. &quot;Discharge&quot; can be a medical administrative action (MED-VERB) as in &quot;discharge from hospital&quot; or a SIGN-SYMPTOM word as in &quot;discharge from wound&quot;. The phrase &quot;discharge from hospital&quot; will be successfully matched by the pattern MED-VERB FROM INSTITUTION; there is, in contrast, no pattern SIGN-SYMPTOM FROM INSTITUTION. Thus in this phrase &quot;discharge&quot; is assigned a &quot;selected attribute&quot; MED-VERB and the SIGN-SYMPTOM class of &quot;discharge&quot; will be ignored. This will be particularly helpful in the information formatting stage, since the mapping into the format is based primarily on a word's selected sublanguage class.</Paragraph> <Paragraph position="5"> The selectional constraints are complicated by the fact that the class of a noun phrase is sometimes determined by the entire phrase and not by the head noun alone. In some cases the class of the phrase is the class of one of its constituents. For example, &quot;stiff neck&quot; has the same class as &quot;stiff&quot;, which belongs to the SIGN-SYMPTOM class. In other cases words from two classes combine to form a phrase with a different class. In the medical domain, &quot;temperature of 103&quot; is of the FINDING class because &quot;temperature&quot; is in the BODY-FUNCTION class and &quot;103&quot; is a quantifier. This computation of a phrasal attribute is called the &quot;computed attribute&quot; construction. This attribute plays an important role in eliminating incorrect parses which arise with coordinate conjunction. Noun phrase conjunction is restricted to phrases which are of the same or closely related classes. In &quot;Patient had stiff neck and fever&quot; there are two readings. The reading in which &quot;stiff&quot; is the left adjunct of both &quot;neck&quot; and &quot;fever&quot; is eliminated because &quot;neck&quot; and &quot;fever&quot; have different subclasses: &quot;fever&quot; is a SIGN-SYMPTOM word whereas &quot;neck&quot; is a BODY-PART word. However, the phrase &quot;stiff neck&quot; has a SIGN-SYMPTOM &quot;computed attribute&quot; and is in the same class as &quot;fever&quot;; therefore we do get the analysis where &quot;fever&quot; is conjoined to &quot;stiff neck&quot;. A more detailed description of the constraints on noun phrase conjunction is given by Hirschman \[8\].</Paragraph> </Section> <Section position="4" start_page="89" end_page="89" type="metho"> <SectionTitle> FORMATTING </SectionTitle> <Paragraph position="0"> The format itself can be viewed as a derivative of the DIS, obtained by merging several predicate-argument relations into a single larger relation. Because the formats, like the predicate-argument relations, are based on the semantic classes of the DIS, the mapping from decomposition trees into formats can be driven by a table of the correspondences between semantic classes and format columns.</Paragraph> </Section> <Section position="5" start_page="89" end_page="89" type="metho"> <SectionTitle> QUESTION-ANSWERING </SectionTitle> <Paragraph position="0"> The predicate names used in the predicate calculus representation within the question-answering system correspond to the predicate-argument patterns of semantic classes in the DIS, so the mapping from decomposition trees to predicate calculus expressions is also DIS-driven. In addition, this mapping uses the information on functional dependencies recorded in the DIS: quantifier scoping is determined primarily by surface word order and syntactic structure, but functional dependencies may take precedence. For example, in the medical domain, because there is a functional relation from &quot;X-rays&quot; to &quot;patients&quot; (each X-ray is of one and only one patient), the phrase &quot;the X-rays of the patients&quot; is correctly analyzed with the quantifier over &quot;patients&quot; having wider scope than the quantifier over &quot;X-rays&quot;.</Paragraph>
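<Paragraph> As a minimal sketch of the scoping decision just described (the representation and function are hypothetical, not the system's extended predicate calculus machinery), the default quantifier order follows surface word order, but it is inverted when the DIS records a functional dependency from the first class to the second.
# Hypothetical sketch: choose quantifier scope order for a phrase like
# "the X-rays of the patients", given functional dependencies from the DIS.
FUNCTIONAL_DEPENDENCIES = {("X-RAY", "PATIENT")}   # each X-ray is of exactly one patient

def scope_order(first_class, second_class):
    """Return the two classes outermost-first for quantifier scoping."""
    if (first_class, second_class) in FUNCTIONAL_DEPENDENCIES:
        # the dependent class is scoped inside the class it depends on
        return [second_class, first_class]
    return [first_class, second_class]             # default: surface word order

scope_order("X-RAY", "PATIENT")   # ["PATIENT", "X-RAY"]: wider scope for "patients"
</Paragraph>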
<Paragraph position="1"> The anaphora resolution component relies on the selection mechanism described earlier (and hence on the DIS) to determine from context the possible semantic classes for the referent of an anaphoric phrase; the antecedent search is then restricted to members of these classes. In addition, the word classes are used in distinguishing between definite and &quot;one&quot; anaphora (as defined by Webber \[9\]), and in resolving &quot;one&quot; anaphora correctly \[10\].</Paragraph> </Section> </Paper>