<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-4204">
  <Title>APPLYING AND IMPROVING THE RESTRICTION GRAMMAR APPROACH FOR DUTCH PATIENT DISCHARGE SUMMARIES</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
APPLYING AND IMPROVING THE RESTRICTION GRAMMAR APPROACH
FOR DUTCH PATIENT DISCHARGE SUMMARIES
PETER SPYNS [1,2] GEERT ADRIAENS [1,3]
</SectionTitle>
    <Paragraph position="0"> This paper starts by giving a short overview of one existing NLP project for the medical sublanguage (1).</Paragraph>
    <Paragraph position="1"> After having presented our objectives (2), we will describe the Restriction Grammar formalism (3), the data structure we use for parsing (4) and our parser (5) enhanced with a special control structure (6). An attempt to build a bootstrap dictionary of medical terminology in a semi-automatic way will also be discussed (7). A brief evaluation (8) and a short outline of our future research (9) will conclude this article.</Paragraph>
    <Paragraph position="2"> 1. context and state of the art In medicine, the use of natural language for compiling reports and patient documents is a widespread habit.</Paragraph>
    <Paragraph position="3"> The importance of the information embedded in a patient discharge summary (PDS) has led to the creation of various storage and retrieval programs (e.g. Dorda 1990). Basically, these programs use pattern-matching (enhanced with boolean selection possibilities) to retrieve the required information. A more scientifically based method to extract the information from a patient discharge summary is a linguistically based analysis, which captures all the subtleties of the free text. This "intelligent" approach permits questions that imply deductive reasoning and is therefore preferred to simple pattern-matching based techniques (Zweigenbaum et al. 1990).</Paragraph>
    <Paragraph position="4"> A team from New York University, directed by Sager, has developed an NLP system that analyzes PDSs, structures their information and stores the whole in a relational database. The data can thus be accessed by other programs in an organized way. This Medical Language Processor (MLP) is an extension of the Linguistic String Project (LSP) (Sager 1981), which aimed at analyzing technical and specialized language by means of String Grammar (Sager et al. 1987). More recently, the LSP-MLP was successfully introduced in Europe and it is currently functioning in the Hôpital Cantonal de Genève, where it handles French patient documents (Sager et al. 1989).</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. objectives
</SectionTitle>
    <Paragraph position="0"> Starting from an existing grammar for English, an attempt was made to develop a grammar for the Dutch language by making use of the experience gained during the LSP. This implies the creation of a set of grammar rules, the implementation of an interpreter/translator for these rules and the generation of a limited dictionary.</Paragraph>
    <Paragraph position="1"> On the basis of a set of 6 PDSs, a limited grammar has been built. Every word constitutes a separate entry in the dictionary; no morphological analysis takes place.</Paragraph>
    <Paragraph position="2"> Conjunctions are not yet handled, but the possibility exists (Hirschman 1986). Relative clauses also fall outside the scope of the current grammar. It is our intention to use the grammar to analyze free input (as it occurs in PDSs) with an eye to extracting the relevant medical information. From there on, the system will be used to help medical secretaries in the classification of the PDSs with respect to medical database systems.</Paragraph>
    <Paragraph position="3"> 3. restriction grammar 3.1. historical background Restriction Grammar (RG) is the Prolog version of String Grammar, which emerged during the 1960s as the grammar formalism of the distributionalist school promoted by Harris (1962). One might wonder why this theory is being reused. An up-to-date theoretical formalism suffers to a large extent from the limited and experimental character of its applications. The LSP-MLP, by contrast, has proved to lead to large-scale, useful NLP systems. That is why we adopted the same theoretical background (distributionalism) to develop an analogous grammar for the Dutch language.</Paragraph>
    <Paragraph position="4"> 3.2. relation with other logic formalisms The RG-formalism is related to Definite Clause Grammar (DCG). In both formalisms, a grammar consists of a set of context-free production rules interspersed with Prolog predicates that function as restrictions. Advantages of RG are the absence of parameters in the rules and the separate treatment of context-sensitive information, which is stored in a tree-structure that is gradually built up during the sentence-analysis. [ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992, 1264; PROC. OF COLING-92, NANTES, AUG. 23-28, 1992]</Paragraph>
    <Paragraph position="5"> Flexibility in creating, adapting and checking the grammar is thus guaranteed (Hirschman 1986, Hirschman &amp; Puder 1986).</Paragraph>
    <Paragraph position="6"> 3.3. some type definitions The RG-grammar contains different types of structures.</Paragraph>
    <Paragraph position="7"> A linguistic string is a single option definition which consists of a sequence of "required" elements in direct correspondence with an elementary word in the sentence, interspersed with elements of the adjunct set type. For instance, the definition for an affirmation is the following: affirmation ::= np, sa_rec_opt, vp, sa_rec_opt.</Paragraph>
    <Paragraph position="8"> This rule states that the affirmation consists of an np and vp-string (both required) as well as optional sentence adjunct strings.</Paragraph>
    <Paragraph position="9"> An adjunct set definition has several options, which are the names of string definitions or sets of string definitions. The strings of the adjunct set are optional additions to the sentence. In contrast to Sager and Hirschman, the optionality and recursivity of some strings (sa, rn, rv) are embedded in our grammar and no special parser mechanisms are needed (see below).</Paragraph>
    <Paragraph position="10"> sa ::= pdate; ptime; pn; nstgt; csstg; dstg.</Paragraph>
    <Paragraph position="11"> The rule states that a sentence adjunct can consist of prepositional strings (indicating a date or moment in time), an expression of time (nstgt), a subordinate sentence-structure (csstg), or an adverbial string (dstg). Structures of the LXR type consist of a word class "X", which is the core or head of the structure, occurring with optional left and right adjuncts (Sager 1981).</Paragraph>
    <Paragraph position="12"> lnr ::= ln, nvar, rn_rec_opt.</Paragraph>
    <Paragraph position="13"> Here, a nominal constituent is surrounded by its left and right adjunct strings.</Paragraph>
    <Paragraph position="14"> Positional variants are auxiliary definitions which group together linguistically related classes of strings. nvar ::= lnamer; {d_nvar}, *n; {d_inf}, lo_vinf_ro; {dnl}, *nulln; *pro, {w_pro2}.</Paragraph>
    <Paragraph position="15"> The core of a nominal constituent can be formed by a proper name with its adjuncts, a noun, a nominalized verb string, a null noun or a pronoun. Note the existence of the zero noun (*nulln), which is justified only by the economy principle applied to the grammar rather than on purely linguistic grounds.</Paragraph>
    <Paragraph position="16"> New optional and recursive structures were created and included in our grammar. This made it possible to avoid the extra machinery needed by Sager and Hirschman to cope with the "empty" and repeated adjunct strings (Sager 1972). These new strings allow the parser to adopt a uniform strategy for the complete grammar. All the strings of the adjunct set have a corresponding optional and recursive structure.</Paragraph>
    <Paragraph position="17"> sa_rec_opt ::= sa_rec; []. (optional definition) sa_rec ::= sa, sa_rec_opt. (recursive definition) These constructs allow a transparent treatment of an optional recursive sentence-adjunct.</Paragraph>
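    <Paragraph> The interplay of the optional and the recursive definition can be illustrated with a minimal Python sketch. It is not the authors' Prolog code: the function names, the toy word list SA_TOKENS and the assumption that a sentence adjunct is a single token are all hypothetical, and the sketch is greedy rather than backtracking.

```python
# Toy illustration of:  sa_rec_opt ::= sa_rec; [].   sa_rec ::= sa, sa_rec_opt.
# Assumption: a sentence adjunct (sa) is one token found in SA_TOKENS.

SA_TOKENS = {"gisteren", "vandaag", "thuis"}  # hypothetical adjunct words


def parse_sa(tokens, pos):
    """Return the position after one sa, or None if no sa starts at pos."""
    if len(tokens) > pos and tokens[pos] in SA_TOKENS:
        return pos + 1
    return None


def parse_sa_rec(tokens, pos):
    """sa_rec ::= sa, sa_rec_opt.  At least one sa is required."""
    after_sa = parse_sa(tokens, pos)
    if after_sa is None:
        return None
    return parse_sa_rec_opt(tokens, after_sa)


def parse_sa_rec_opt(tokens, pos):
    """sa_rec_opt ::= sa_rec; [].  The empty option always succeeds."""
    after = parse_sa_rec(tokens, pos)
    return pos if after is None else after


if __name__ == "__main__":
    print(parse_sa_rec_opt(["gisteren", "thuis", "koorts"], 0))
```

    Because the empty option is part of the grammar itself, the caller never needs a special case for "no adjunct present": the optional rule simply returns the position it was given.</Paragraph>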
    <Paragraph position="18"> Sometimes, the theoretical frontiers between the different structures are slightly blurred in grammar rules, but this will be cleaned up during the further development of the grammar.</Paragraph>
    <Paragraph position="19"> As opposed to non-terminal categories (the structures already mentioned), a terminal category constitutes a leaf of the parse tree. The variable Word (see below) of every leaf is instantiated. In the grammar, a terminal category is marked by an asterisk.</Paragraph>
    <Paragraph position="20"> nvar ::= lnamer; {d_nvar}, *n; {d_inf}, lo_vinf_ro; {dnl}, *nulln; *pro, {w_pro2}.</Paragraph>
    <Paragraph position="21"> Literals are words which are directly integrated in the grammar and function more or less as fixed expressions. The example shows the definition of a left adjunct to a proper noun. The literals appear between square brackets.</Paragraph>
    <Paragraph position="22"> lname ::= [Prof, '.', dr, '.']; [dr, '.'].</Paragraph>
    <Paragraph position="23"> 3.4. restrictions The application of the grammar rules to a sentence can result in various parse trees. These are not necessarily correct from a pragmatic-linguistic point of view.</Paragraph>
    <Paragraph position="24"> Thus, the need emerges to distinguish the acceptable analyses from the bad ones. The restrictions prune the combinatorial explosion of parses permitted by the context-free grammar rules. When a restriction fails, the next positional variant of the category which functions as the head of a grammar rule is considered.</Paragraph>
    <Paragraph position="25"> Restrictions appear between curly brackets.</Paragraph>
    <Paragraph position="26"> nvar ::= lnamer; {d_nvar}, *n; {d_inf}, lo_vinf_ro; {dnl}, *nulln; *pro, {w_pro2}.</Paragraph>
    <Paragraph position="27"> There exist two kinds of restrictions. The well-formedness restrictions check if the parse tree meets syntactic and semantic constraints imposed on the leaves or on the syntactic relations between various nodes of the parse tree (e.g. w_pro2). Disqualify restrictions consider the input stream (e.g. d_inf fails if no infinitive is present in the remaining input stream) and enhance efficient parsing by blocking directions which lead to failure. The restrictions are implemented by means of special functions (locating routines) that allow navigating from one node in the tree to another.</Paragraph>
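    <Paragraph> The effect of a disqualify restriction can be sketched in Python as a look-ahead test on the remaining input. This is only an illustration under stated assumptions: the toy LEXICON, its word class labels and the helper try_options are hypothetical, and only the name d_inf and its described behaviour come from the text.

```python
# Assumption: the lexicon maps each word to a set of word classes.
LEXICON = {
    "patient": {"n"},
    "wil": {"tv"},
    "eten": {"vinf", "n"},
}


def d_inf(remaining_words, lexicon=LEXICON):
    """Disqualify restriction: fail when no infinitive is left in the input."""
    return any("vinf" in lexicon.get(word, set()) for word in remaining_words)


def try_options(options, remaining_words):
    """Return the option names whose disqualify restrictions all succeed."""
    viable = []
    for name, restrictions in options:
        if all(check(remaining_words) for check in restrictions):
            viable.append(name)
    return viable


# Two options of a rule; only the second is guarded by a restriction.
NVAR_OPTIONS = [("lnamer", []), ("lo_vinf_ro", [d_inf])]

if __name__ == "__main__":
    print(try_options(NVAR_OPTIONS, ["patient", "wil", "eten"]))
    print(try_options(NVAR_OPTIONS, ["patient"]))
```

    The guarded option is dropped before any parsing work is spent on it, which is the efficiency gain the text attributes to disqualify restrictions.</Paragraph>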
    <Paragraph position="28"> Currently, the still limited grammar for the Dutch language consists of some 157 RG-clauses completed with 41 restrictions. As already mentioned, conjunctions are not yet covered; neither are relative clauses or interrogative sentences. This grammar was the result of confronting a larger theoretically built grammar with the linguistic reality of 6 patient discharge summaries.</Paragraph>
    <Paragraph position="29"> 4. the data structure of the parse tree The data structure used for the parse tree is the one proposed by Hirschman and Puder (1986). The parse tree together with its operators is defined as an abstract data type. The abstraction function can be defined as follows:  relation links a node to the path in the parse tree that must be traversed to reach the root starting from that node. Movements in the parse tree are executed by means of the up/2, down/2, right/2 and left/2 operators. As subtrees are successfully generated, the path gradually shrinks during the movement upwards in the tree.</Paragraph>
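    <Paragraph> One way to picture a node linked to its path back to the root is the following Python sketch. It is an analogy, not the representation of Hirschman and Puder: the types Tree, Step and Loc are hypothetical, and the four functions only mirror the names of the up/2, down/2, right/2 and left/2 operators.

```python
from collections import namedtuple

# Assumption: a tree is (label, children); a location pairs the current
# subtree with the list of steps that lead back up to the root.
Tree = namedtuple("Tree", "label children")
Step = namedtuple("Step", "parent_label left_siblings right_siblings")
Loc = namedtuple("Loc", "tree path")


def down(loc):
    """Move to the leftmost child, or return None at a leaf."""
    if not loc.tree.children:
        return None
    first, rest = loc.tree.children[0], loc.tree.children[1:]
    return Loc(first, (Step(loc.tree.label, (), tuple(rest)),) + loc.path)


def up(loc):
    """Move to the parent, rebuilding it from the stored siblings."""
    if not loc.path:
        return None
    step = loc.path[0]
    children = step.left_siblings + (loc.tree,) + step.right_siblings
    return Loc(Tree(step.parent_label, children), loc.path[1:])


def right(loc):
    """Move to the next sibling, or return None if there is none."""
    if not loc.path or not loc.path[0].right_siblings:
        return None
    step = loc.path[0]
    moved = Step(step.parent_label, step.left_siblings + (loc.tree,),
                 step.right_siblings[1:])
    return Loc(step.right_siblings[0], (moved,) + loc.path[1:])


def left(loc):
    """Move to the previous sibling, or return None if there is none."""
    if not loc.path or not loc.path[0].left_siblings:
        return None
    step = loc.path[0]
    moved = Step(step.parent_label, step.left_siblings[:-1],
                 (loc.tree,) + step.right_siblings)
    return Loc(step.left_siblings[-1], (moved,) + loc.path[1:])


# A two-branch example tree and a location at its root.
EXAMPLE = Tree("affirmation", (Tree("np", ()), Tree("vp", ())))
ROOT = Loc(EXAMPLE, ())
```

    In this sketch every move upwards removes one step from the path, so the path is empty again once the root is reached.</Paragraph>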
    <Paragraph position="30"> 5. the implementation of the parser Basically, an RG translator-interpreter does not differ substantially from a DCG translator-interpreter. The main difference resides in the handling of the parameters needed for the construction of the parse tree (absent in RG-rules). The translator-interpreter takes care of the parameter bindings instead of the grammar rules as is the case in DCG. An example of the output after parsing a sentence can be found below.</Paragraph>
    <Paragraph position="32"> 6. the control structure 6.1. general outline  The interpreter/translator mechanism as it is described in Hirschman 1986 or Dowding &amp; Hirschman 1988 works in a depth-first fashion that backtracks when necessary. We added a control structure that records the options that are unsuccessfully tried as well as the options that have led to a null-node.</Paragraph>
    <Paragraph position="33"> The sentence to be analyzed is logically considered as a graph where the words of the sentence function as arcs between the vertices. The arcs will be labeled with grammatical categories. An arc can span various vertices. The "sentence-arc", e.g., spans all the vertices. An arc is uniquely identified by its label and coordinates in the tree (number of the starting vertex and tree depth). The fourth characteristic of an arc, its mode, serves to distinguish a void from an empty arc. A void arc means that a grammatical category under a certain vertex leads to no success, whereas an empty arc only leads to success if it is realized as an empty string. In contrast to a chart parser that remembers which grammatical structure spans which vertices, the only function of the control structure is to record which grammatical category leads to failure or realization as an empty string under a certain vertex (= a word).</Paragraph>
    <Paragraph position="34"> 6.2. the data structure of the control mechanism The backbone consists of a kind of sparse matrix, of which each element is a stack of grammatical labels.</Paragraph>
    <Paragraph position="35"> The stack contains all the grammatical labels of the arcs that were (or still are) considered on the current depth in the parse tree under the current vertex. The Last In First Out ordering principle provides the advantage that arcs recently added below a vertex under construction can be modified or discarded without the need for time-consuming search routines.</Paragraph>
    <Paragraph position="36"> Here again the control structure is conceived as an abstract data type. There exist only five public predicates: check_option/3, treat_most_recent_arc/4, empty_arc/3, initialize_control/1 and remove_control/0.</Paragraph>
    <Paragraph position="37"> 6.3. the parsing algorithm When the parser tests if an arc can span two vertices (check_option/3), the control structure is checked first to see if the same grammatical category is not already present under the starting vertex. If this is the case, a failure blocks this option and the parser will consider the next category in the grammar rule as the new candidate arc. In the other case, the arc is included as a void arc in the control structure. This prevents the parser from ending up in an infinite loop, by trying to satisfy a grammatical category already functioning as a candidate arc but on a higher level of the parse tree.</Paragraph>
    <Paragraph position="38"> If a category is present as an empty arc (empty_arc/3) in the control structure, this means that all the other underlying grammatical categories have already been tried, but only the empty string can satisfy this category. When the parser backtracks and retries this category under the same vertex, useless search paths can be pruned as the previous parsing attempt only allowed the empty string.</Paragraph>
    <Paragraph position="39"> On successful completion of a parsing attempt by a terminal element or a literal, the grammatical category is either removed from the control structure to allow the same category to be retried on backtracking, or marked as an empty arc (treat_most_recent_arc/4).</Paragraph>
    <Paragraph position="40"> Backtracking does not affect the control structure, as it is implemented by means of global Prolog "record keys".</Paragraph>
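    <Paragraph> The bookkeeping described in this section can be approximated by the following Python sketch. It is a reading of the prose, not the authors' Prolog implementation: the class ControlStructure, the argument order of its methods and the flag consumed_input are assumptions, and only the method names echo the public predicates listed above.

```python
class ControlStructure:
    """Sparse matrix of stacks, keyed by (vertex, depth). Hypothetical layout."""

    def __init__(self):
        # (vertex, depth) -> stack (list) of [label, mode] entries,
        # where mode is "void" or "empty".
        self.cells = {}

    def check_option(self, label, vertex, depth):
        """Fail if label is already a candidate under this vertex;
        otherwise record it as a void arc and succeed."""
        for (cell_vertex, _cell_depth), stack in self.cells.items():
            if cell_vertex == vertex and any(e[0] == label for e in stack):
                return False
        self.cells.setdefault((vertex, depth), []).append([label, "void"])
        return True

    def empty_arc(self, label, vertex, depth):
        """True if label was only satisfiable by the empty string here."""
        stack = self.cells.get((vertex, depth), [])
        return any(entry == [label, "empty"] for entry in stack)

    def treat_most_recent_arc(self, label, vertex, depth, consumed_input):
        """After a success: drop the entry so it can be retried, or mark it
        as empty when the category was realized as the empty string."""
        stack = self.cells.get((vertex, depth), [])
        # LIFO: the arc under construction is expected near the top.
        for index in range(len(stack) - 1, -1, -1):
            if stack[index][0] == label:
                if consumed_input:
                    del stack[index]
                else:
                    stack[index][1] = "empty"
                return True
        return False
```

    A category that is still pending under a vertex blocks any attempt to start the same category deeper in the tree at that vertex, which is how the sketch avoids the infinite loop mentioned in 6.3.</Paragraph>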
    <Paragraph position="41"> 7. semi-automatic dictionary buildup As was noted by Wolff, the major lexical category of a word is determined by its final constituent part (Wolff 1984). In order to semi-automatically build up our dictionary (currently only containing word class information), we ordered a set of technical medical terms alphabetically starting from their ends. This allowed us to distinguish relevant suffixes and determine the associated grammatical category (e.g. the suffix -tara is an adjectival suffix). The data about the link between suffix and word class can be entered into the system by means of a short interactive program.</Paragraph>
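    <Paragraph> Ordering terms alphabetically from their ends amounts to sorting on the reversed spelling. A minimal Python sketch, with an example word list that is assumed for illustration and not taken from the paper's corpus:

```python
def sort_by_ending(terms):
    """Sort terms alphabetically, reading each term from its last letter."""
    return sorted(terms, key=lambda term: term[::-1])


# Hypothetical sample of medical terms.
TERMS = ["gastritis", "hepatisch", "bronchitis", "allergisch", "nefritis"]

if __name__ == "__main__":
    for term in sort_by_ending(TERMS):
        print(term.rjust(12))
```

    Right-aligning the sorted output makes shared endings line up in a column, so that candidate suffixes become visible at a glance.</Paragraph>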
    <Paragraph position="42"> Some words belonging to a closed grammatical category (e.g. articles) are also integrated in the suffix file. A distinction is made between the real suffixes and the closed category words. The latter are stored in the form of a closed Prolog list while the former are entered as open-ended Prolog lists.</Paragraph>
    <Paragraph position="43"> The suffix entries are alphabetically classified, whereby the ending character functions as the sort key. The "ending groups" created by this process contain two subgroups: the real morphemes versus the real suffixes.</Paragraph>
    <Paragraph position="44"> The former have a higher ordering value than the latter. The latter are in addition ordered according to their decreasing length. Concerning the suffixes, this means that the principle of the longest match is applied. "The general strategy can be summarized as follows: (...) the table is looked at starting with the most specific entry, and ending with the least specific one" (Adriaens &amp; Lemmens 1990).</Paragraph>
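    <Paragraph> The lookup order described above (whole closed-class words first, then suffixes from longest to shortest) can be sketched in Python as follows. The tables CLOSED_CLASS and SUFFIXES and the word class labels are hypothetical; the paper does not list its actual suffix file.

```python
# Assumption: illustrative entries only, not the paper's suffix file.
CLOSED_CLASS = {"de": "t", "het": "t", "een": "t"}  # whole words (articles)
SUFFIXES = {"itis": "n", "isch": "adj", "tie": "n", "e": "adj"}


def classify(word):
    """Return a word class: whole-word entries first, then the longest
    matching suffix, else 'unknown' for later manual correction."""
    word = word.lower()
    if word in CLOSED_CLASS:
        return CLOSED_CLASS[word]
    # Longest match: try the most specific (longest) suffix first.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            return SUFFIXES[suffix]
    return "unknown"


if __name__ == "__main__":
    for example in ["gastritis", "operatie", "acute", "de", "koorts"]:
        print(example, classify(example))
```

    With these toy tables, "operatie" is matched by the three-letter suffix before the one-letter suffix is ever consulted, which is the longest-match behaviour the text describes.</Paragraph>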
    <Paragraph position="45"> The output of the classifying program is a (primitive) dictionary file that must be completed, e.g. by adding semantic features. Some words of the dictionary file can be marked as "category unknown" or can be attributed the wrong category. To cope with this problem, an interactive program was added to allow corrections to be carried out. Despite the fact that the semi-automatic classification of words proved to be highly effective and time-saving (partly due to the high degree of regularity in medical terminology), it still functions on a too limited basis to be fully integrated in an NLP system.</Paragraph>
    <Paragraph position="46"> 8. evaluation and results The most serious problem for our parsing approach appears to be structural ambiguity. A branch is attached to the first possible node in the parse tree instead of a subsequent node on the same or higher level. This is due to the depth-first mechanism of the parser. The generation of more than one parse tree does not provide a solution, because then the question on which basis a final parse tree will be selected remains unanswered. The LSP-MLP did also encounter this problem, but allows the adjuncts to be integrated freely (Sager 1981). Subsequently, the parse tree is passed to a "sublanguage selection module" and rearranged on the basis of lists of allowed syntagmatic combinations of semantic classes of the medical language as they appear in the distribution pattern of a word.</Paragraph>
    <Paragraph position="47"> 9. further research Further research will be oriented towards the integration of conjunction handling, which implies the use of a meta-grammar (Hirschman 1986). The existing grammar will be continuously extended in two phases.</Paragraph>
    <Paragraph position="48"> After a survey of various grammar books, theoretically sound grammar rules will be developed. These rules will subsequently be checked against a corpus of PDSs, to see how well the paper grammars fare when confronted with random input. A more elaborate grammar requires more and refined restrictions, which need to eliminate dead ends more accurately during the parsing process. A sublanguage module must be added to account for the syntactic specificities of the medical language as well as to rearrange some branches of the parse tree. To transcend the prototype status, a large scale dictionary should be developed, including a morphological analyzer.</Paragraph>
    <Paragraph position="49"> Furthermore, as the ultimate goal is to represent the meaning of the sentence, the syntactic analysis needs to be completed by a semantic counterpart. A distinction should be made between general purpose and medical sublanguage concepts. The already mentioned rearrangement of the parse tree by the sublanguage component uses semantic classes, but these are created on the basis of a syntactic analysis of the distribution of the various lexical terms. This does not include any modeling of the medical domain nor any deductive reasoning as is pointed out by Zweigenbaum (1990). In the long run, these two aspects need to be included in an intelligent information retrieval system.</Paragraph>
    <Paragraph position="50"> Note: For reasons of space limitations, only a restricted set of grammar rules was shown. The complete grammar as well as the full Prolog code can be found in (Spyns 1991).</Paragraph>
  </Section>
</Paper>