File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-1415_metho.xml
Size: 26,018 bytes
Last Modified: 2025-10-06 14:15:13
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1415"> <Title>Clause Aggregation Using Linguistic Knowledge</Title> <Section position="3" start_page="138" end_page="139" type="metho"> <SectionTitle> 2 Corpus Analysis </SectionTitle> <Paragraph position="0"> We conducted a corpus analysis to study various styles and types of aggregation. The corpus consists of the first few sentences in the discharge summaries for 54 patients in the medical domain.</Paragraph> <Paragraph position="1"> These sentences describe patients' demographics and medical conditions pertinent to patient care in the Intensive Care Unit. In our study, the first step was to find out how many propositions were combined in each sentence. A proposition is defined as a piece of information that the physician (the speaker) might choose to convey in a stand-alone sentence to the nurses in the Intensive Care Unit (the hearer). For example, the sentence &quot;The patient is a 40 year old female admitted for heart surgery.&quot; contains three propositions: &quot;The patient is a female.&quot;, &quot;The patient is 40 years old.&quot;, and &quot;The patient was admitted for heart surgery.&quot; The small corpus contained 121 sentences with 2262 words. From the 121 sentences, we obtained 418 propositions after manual decomposition, with a maximum of 12 propositions in a single sentence, as shown in Figure 1. On average, there are 3.5 propositions per sentence. In the 54 summary sentences (the first sentence in each patient's discharge summary), doctors prefer to use prepositional phrases (PPs) (&quot;with aortic stenosis&quot;) rather than relative clauses (&quot;who likely has endocarditis...&quot;) to insert medical conditions into a sentence (35 occurrences vs. 4). In only two cases, both PPs and relative clauses were used; all others had neither. Our studies revealed the following: The analysis also indicates that people prefer using linguistic devices that are simpler (e.g., words over phrases over clauses) \[Scott and de Souza, 1990, Hovy, 1993\].</Paragraph> <Paragraph position="2"> We encountered sentences from the corpus which could be formulated more concisely. The doctors did very little editing to the discharge summaries. In this respect, the summaries are somewhat similar to speech. As a result, doctors prefer to use more flexible linguistic constructions, such as PPs, instead of producing the most concise sentences. Concepts such as &quot;hypertension&quot; and &quot;diabetes&quot; have both noun and adjective forms. Even though the noun form is longer (it is always used together with other words, as in &quot;patient with hypertension&quot; or &quot;patient who has hypertension&quot;), the shorter adjective form (&quot;hypertensive patient&quot;) did not appear in the corpus. In only one case, the adjective &quot;obese&quot; was used instead of the PP &quot;with obesity&quot; to indicate a medical condition. Since many medical conditions, such as &quot;peptic ulcers&quot;, have no adjective forms, the speaker is more likely to use noun forms to group together all medical conditions. In addition, more information can be attached to nouns than to adjectives.
In the noun form, the medical condition &quot;diabetes&quot; might be modified in the corpus, as in &quot;type 1 diabetes with extensive end organ damage&quot; and &quot;borderline diabetes&quot;. Such flexibility explains the popularity of nouns over adjectives.</Paragraph> <Paragraph position="3"> In summary, our analysis shows that a high level of aggregation is typical in the domain. Judging from the number of PPs used in comparison to relative clauses, clause aggregation using simpler syntactic constituents is preferred. Doctors generate summaries in real time without examining all the information right in front of them. As a result, they might not generate the most concise sentences. MAGIC, on the other hand, generates text off-line, with all the information to be conveyed available. This allows MAGIC to generate more concise text by taking advantage of linguistic opportunities.</Paragraph> </Section> <Section position="4" start_page="139" end_page="139" type="metho"> <SectionTitle> 3 Semantic Representation </SectionTitle> <Paragraph position="0"> CASPER uses a representation influenced by Lexical-Functional Grammar (LFG) \[Kaplan and Bresnan, 1982\] and Semantic Structures \[Jackendoff, 1990\]. An example of the semantic representation is provided in Figure 2. In our representation, the roles for each event or state are PRED, ARG1, ARG2, ARG3, and MOD. The slot PRED stores the verb concept. Depending on the concept in PRED, ARG1, ARG2, and ARG3 can take on different thematic roles, such as Actor, Goal, and Beneficiary, respectively, as in &quot;John gave a red book to Mary yesterday.&quot; The optional slot MOD stores modifiers of the PRED. It can contain one or more circumstantial elements, including MANNER, PLACE, or TIME. Each argument slot likewise has its own MOD slot to store information such as adjectives or PPs.</Paragraph> </Section> <Section position="5" start_page="139" end_page="141" type="metho"> <SectionTitle> 4 Hypotactic Operators </SectionTitle> <Paragraph position="0"> We will use an example from MAGIC to demonstrate how hypotactic operators work. The surface forms of the propositions from the content planner are shown in Figure 3. In addition to the propositions, the content planner also indicates that the focus of the discourse is &quot;the patient&quot;, with an entity-id, ID1. CASPER picks the first proposition, 1a, as the dominant proposition because it contains the focus entity and it has a C-NAME entity. Since the entity in focus should appear as early as possible to provide a context, proposition 1a is transformed from &quot;The patient has name - Jones&quot; into the semantic representation for &quot;Jones is a patient&quot;. The PRED of the proposition is changed from C-HAS-ATTRIBUTE to C-IS-INSTANCE, in addition to swapping ARG1 and ARG2. Each proposition is represented similarly to the one shown in Figure 2.
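To make the representation concrete, the following minimal sketch shows a proposition structure and the focus-fronting step just described; the class, slot encodings, and helper are illustrative assumptions, not the system's actual code.

```python
# A minimal sketch (not the system's actual code) of the representation
# described in Section 3 and of the focus-fronting step just described.
# The class, slot encoding, and helper below are illustrative assumptions.

class Proposition:
    def __init__(self, pred, arg1=None, arg2=None, arg3=None, mod=None):
        self.pred = pred        # state/event concept, e.g. "C-HAS-ATTRIBUTE"
        self.arg1 = arg1        # thematic role depends on PRED (Actor, ...)
        self.arg2 = arg2
        self.arg3 = arg3
        self.mod = mod or {}    # circumstantials such as MANNER, PLACE, TIME

def front_focus(prop):
    """Recast a C-HAS-ATTRIBUTE proposition whose attribute is a C-NAME
    as C-IS-INSTANCE, swapping ARG1 and ARG2, so that the name appears
    first: "The patient has name - Jones" becomes "Jones is a patient"."""
    if prop.pred == "C-HAS-ATTRIBUTE" and prop.arg2.get("concept") == "C-NAME":
        return Proposition("C-IS-INSTANCE", arg1=prop.arg2, arg2=prop.arg1,
                           mod=dict(prop.mod))
    return prop

# Hypothetical encoding of proposition 1a:
p1a = Proposition("C-HAS-ATTRIBUTE",
                  arg1={"id": "ID1", "concept": "C-PATIENT"},
                  arg2={"concept": "C-NAME", "value": "Jones"})
dominant = front_focus(p1a)   # semantic representation for "Jones is a patient"
```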
We use the concept C-HAS-ATTRIBUTE to denote that the entity in ARG1 has the attribute stored in ARG2.</Paragraph> <Paragraph position="1"> Depending on the lexical properties of the attribute in ARG2, proposition 1e in Figure 3 can be realized as &quot;the patient has diabetes&quot; (noun form) or &quot;the patient is diabetic&quot; (adjective form).</Paragraph> <Paragraph position="3"> [Figure 3: surface forms of the input propositions - 1a: patient has name - Jones. 1b: patient has gender - female. 1c: patient has age - 80 years. 1d: patient has hypertension. 1e: patient has diabetes. ...]</Paragraph> <Paragraph position="9"> To aggregate two propositions using hypotactic operators, the propositions must share some entities in common. When they do, hypotactic operators try to transform one of the clauses into a modifier. Since the goal is to generate concise text, CASPER prefers transforming a proposition into an adjective if possible, then a PP, then a participle clause, and, if all else fails, a relative clause.</Paragraph> <Paragraph position="10"> This preference for syntactically simpler expressions over more complex ones was also proposed in \[Scott and de Souza, 1990\]. In the future, we plan to incorporate constraints from the corpus to determine which aggregation operators to apply and in what order.</Paragraph> <Paragraph position="11"> To transform a proposition into an adjective, a proposition must satisfy the following two preconditions. First, the slot PRED of the proposition being transformed must be C-HAS-ATTRIBUTE (the patient has age - 80 years). The other requirement is that the ARG2 of the proposition (age 80 years) can be mapped to an adjective, as permitted in the lexicon. Using the algorithm, propositions 1b, 1c, 1d, and 1e can all be transformed into adjectives and attached to proposition 1a, resulting in &quot;Jones is an 80 year old hypertensive diabetic female patient.&quot; There are two interesting things to note here. First, because the PRED of the dominant proposition is C-IS-INSTANCE, the transformed modifiers (age, gender, etc.) are attached to the ARG2 slot of the dominant proposition (&quot;a patient&quot;) instead of ARG1 (&quot;Jones&quot;). Second, the sequential order of the modifiers is not determined yet at this stage. The goal of CASPER is to produce a concise semantic representation for a set of propositions and to guarantee that there is at least one way to express the result in the later generation modules. To guarantee expressibility \[Meteer, 1991\], CASPER looks ahead into the lexicon, but it does not make detailed lexical decisions, for efficiency reasons. The exact lexical and syntactic decisions, including the ordering among modifiers, are made later in the lexical chooser.</Paragraph> <Paragraph position="12"> Consider another proposition: &quot;the patient has peptic ulcers&quot;. This proposition cannot be transformed into an adjective because there is no adjective form for C-PEPTIC-ULCER in the lexicon.</Paragraph> <Paragraph position="13"> A proposition can be transformed into a PP with the general preposition &quot;with&quot; if the PRED of the proposition is C-HAS-ATTRIBUTE and the concept in its ARG2 can be mapped into a noun phrase.
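The preference order and its lexical preconditions can be summarized procedurally; the sketch below is illustrative, and the lexicon interface it assumes (has_adjective_form, has_noun_form, is_event) is a hypothetical stand-in for CASPER's lexicon lookahead, not its actual API.

```python
# Illustrative sketch of the operator preference just described
# (adjective, then PP, then participle, then relative clause).

def transform_to_modifier(prop, lexicon):
    """Return the simplest modifier type that can express `prop`,
    falling back to a relative clause if nothing simpler is expressible."""
    if prop.pred == "C-HAS-ATTRIBUTE":
        attr = prop.arg2["concept"]
        if lexicon.has_adjective_form(attr):   # e.g. C-DIABETES -> "diabetic"
            return ("adjective", attr)
        if lexicon.has_noun_form(attr):        # e.g. C-PEPTIC-ULCER -> "with peptic ulcers"
            return ("pp", attr)
    if lexicon.is_event(prop.pred):            # e.g. "undergoing CABG"
        return ("participle", prop)
    return ("relative-clause", prop)           # last resort
```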
If we apply the PP operator to the proposition, we would have &quot;Jones is an 80 year old hypertensive diabetic female patient with peptic ulcers.&quot; CASPER currently uses an ontology which can identify that C-PEPTIC-ULCER, C-HYPERTENSION, and C-DIABETES are all medical disorders and group them together for cohesion. Since all these medical conditions can be mapped to nouns but not to adjectives, they will all be realized as PPs: &quot;Jones is an 80 year old female patient with hypertension, diabetes and peptic ulcers.&quot;</Paragraph> <Paragraph position="15"> [...] Smith&quot;). The POSSESSOR modifier in ARG1, as shown in Figure 2, can be transformed into a PP using the of-genitive \[Quirk et al., 1985\]. This phenomenon holds for relationships similar to patient/doctor, such as advisor/advisee and boss/employee.</Paragraph> <Paragraph position="16"> Any proposition can be transformed into a relative clause of another as long as they share a common entity. In the example, proposition 1g does not satisfy the preconditions of the previous hypotactic operators. In this case, it is combined as a present participle clause because a present participle clause is simpler and shorter. The result of the hypotactic operators is a semantic representation for &quot;Jones is an 80 year old hypertensive diabetic female patient of Smith undergoing CABG.&quot; Similar to parsing long sentences, efficiency is an important issue in generating long and complex sentences. The search space grows exponentially with respect to length in both cases. CASPER is able to generate complex sentences efficiently because it delays the difficult detailed lexical decisions until absolutely needed. At the sentence planning level, CASPER looks ahead into the lexicon and merges those propositions that satisfy the required lexical constraints. This prevents the lexical chooser from trying to combine incompatible clauses later. By determining sentence boundaries before carrying out detailed lexical decisions, CASPER cuts down the search space of the lexical chooser drastically. In STREAK \[Robin, 1995\], a generation system which also implements hypotactic aggregation, detailed lexical decisions are made whenever a proposition is aggregated. This is costly because the best lexical decisions for n propositions might not be useful or correct for n + 1 propositions. The strategy generates impressive complex sentences, but for some complex sentences, STREAK took more than half an hour. Since CASPER does not use detailed lexical information when it determines sentence boundaries, it trades some aggregation optimality for efficiency. Even though the lexicon is accessed twice in our system, CASPER prunes the search space drastically by delaying expensive detailed lexical decisions until after it knows how many concepts are involved in the desired sentence. Efficiency issues in generation were also addressed in \[McDonald et al., 1987, Elhadad et al., 1997\].</Paragraph> </Section> <Section position="6" start_page="141" end_page="144" type="metho"> <SectionTitle> 5 Paratactic Operators </SectionTitle> <Paragraph position="0"> We will use an imaginary human resource report system for a technical support team as an example to illustrate our paratactic algorithm. The example shown in Figure 4 has the following slots: PRED, ARG1, ARG2, MOD-BENEFICIARY, and MOD-TIME. We currently have two approaches to combine propositions using coordinate constructions.
In the first approach, adjacent propositions that have only one slot containing distinct elements are collapsed into one proposition with one conjoined slot containing the distinct elements. For example, the following sentence is the result of collapsing two propositions with distinct elements in their MOD-BENEFICIARY slot: &quot;Alice installed Quicken for Mary and Peter on Tuesday.&quot; \[McCawley, 1981\] described this phenomenon as Conjunction Reduction.</Paragraph> <Paragraph position="2"> In the second approach, the propositions being combined are distinct in more than one slot. To combine them, each conjoined proposition is generated, but deletion rules (described later in Section 5.4) are used to ensure the resulting sentence has the correct ellipsis. In the following sentence, the two propositions are distinct at both PRED and ARG2: &quot;John finished his work and \[John\] went home.&quot; The ARG1 of the second proposition, &quot;John&quot;, is deleted.</Paragraph> <Paragraph position="3"> Due to limited space, we only describe the algorithm used in CASPER to produce sentences with coordinations. For a more detailed discussion with relevant linguistic motivations, please see \[Shaw, 1998\]. We have divided the algorithm into four stages, where the first three stages take place in the sentence planner and the last stage takes place in the lexical chooser: Stage 1: group propositions and order them according to their similarities while satisfying pragmatic and contextual constraints.</Paragraph> <Paragraph position="4"> Stage 2: determine recurring elements in the ordered propositions that will be combined.</Paragraph> <Paragraph position="5"> Stage 3: create a sentence boundary when the combined clause reaches pre-determined thresholds.</Paragraph> <Paragraph position="6"> Stage 4: determine which recurring elements are redundant and should be deleted.</Paragraph> <Paragraph position="7"> We describe each stage in detail in the following four sections.</Paragraph> <Section position="1" start_page="142" end_page="143" type="sub_section"> <SectionTitle> 5.1 Group and Order Propositions </SectionTitle> <Paragraph position="0"> Coordination allows the deletion of recurring entities at the surface level, but only if they are adjacent; that is, the propositions containing the entities are sequentially next to each other. As a result, the sequential order of the propositions being coordinated affects the length of the output text. In Stage 1, CASPER orders the propositions to allow the maximum number of adjacent recurring entities, producing concise text.</Paragraph> <Paragraph position="1"> For the propositions in Figure 5, the semantic representations have the following slots: PRED, ARG1, ARG2, MOD-BENEFICIARY, and MOD-TIME. To identify which slot has the most similarity among its elements, we calculate the number of distinct elements (NDE) in each slot across the propositions. For the purpose of generating concise text, CASPER prefers to group propositions which result in as many slots with NDE = 1 as possible.
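Stated procedurally, Stage 1 amounts to the following sketch; propositions are assumed here to be simple slot-to-string mappings, which is an illustration rather than the system's actual representation.

```python
# A minimal sketch (hypothetical representation) of the Stage 1 bookkeeping:
# count the distinct elements per slot (NDE), then sort the propositions slot
# by slot, from the highest-NDE slot to the lowest, so that similar
# propositions end up adjacent to each other.

SLOTS = ["PRED", "ARG1", "ARG2", "MOD-BENEFICIARY", "MOD-TIME"]

def nde(propositions, slot):
    """Number of distinct elements appearing in `slot` across all propositions."""
    return len({p.get(slot) for p in propositions})

def order_propositions(propositions):
    # Slots with the most variation come first in the sort key, so propositions
    # sharing elements in the remaining slots end up next to each other.
    slots_by_nde = sorted(SLOTS, key=lambda s: nde(propositions, s), reverse=True)
    return sorted(propositions,
                  key=lambda p: tuple(p.get(s, "") for s in slots_by_nde))
```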
For the propositions in Figure 5, the NDE of MOD-BENEFICIARY is 1 because all the beneficiaries are &quot;John&quot;; the NDEs for both PRED and MOD-TIME are 2 because there are two actions, &quot;install&quot; and &quot;remove&quot;, which occurred on either &quot;Monday&quot; or &quot;Tuesday&quot;; the NDE for ARG2 is 4 because it contains &quot;Excel&quot;, &quot;WordPerfect&quot;, &quot;Powerpoint&quot;, and &quot;Access&quot;; similarly, the NDE of ARG1, the agent, is 3.</Paragraph> <Paragraph position="3"> The algorithm re-orders the propositions by sorting the elements in each slot using comparison operators which can determine that Monday is smaller than Tuesday, or that &quot;Alice&quot; is smaller than &quot;Bob&quot; alphabetically. Starting from the slot with the highest NDE and proceeding to the lowest, the algorithm re-orders the propositions based on the elements of each particular slot. In this case, the propositions will be ordered according to their ARG2 first, followed by ARG1, MOD-TIME, PRED, and MOD-BENEFICIARY. The sorting process will put similar propositions adjacent to each other, as shown in Figure 6.</Paragraph> </Section> <Section position="2" start_page="143" end_page="144" type="sub_section"> <SectionTitle> 5.2 Identify Recurring Elements </SectionTitle> <Paragraph position="0"> The current algorithm tries to combine only two propositions at once. In Stage 2, CASPER is concerned with how many slots have distinct values and which slots they are. When multiple adjacent propositions have only one slot with distinct elements, these propositions are 1-distinct.</Paragraph> <Paragraph position="1"> Propositions that are 1-distinct can be replaced with one proposition with one slot conjoining the distinct elements of that slot. In our example, the first and second propositions are 1-distinct at ARG2, and they are combined into a semantic structure representing &quot;Alice installed Excel and Powerpoint for John on Monday.&quot; When propositions have more than one distinct slot, or their 1-distinct slot is different from the previous 1-distinct slot, the two propositions are said to be multiple-distinct. Our approach to combining multiple-distinct propositions is different from previous linguistic analyses. Instead of removing recurring entities immediately based on transformation or substitution, the current system generates every conjoined multiple-distinct proposition. During the lexicalization of the conjoined sentence, the lexical chooser prevents the realization component from generating any string for the redundant elements. Our multiple-distinct coordination produces what linguists describe as ellipsis and gapping. Figure 7 shows the result of combining two propositions, which will be realized as &quot;Alice installed Excel on Monday and Outlook on Friday.&quot; Some readers might notice that PRED and ARG1 in both propositions are marked as RECURRING. The process that deletes only subsequent recurring elements at the surface level will be explained in Section 5.4.</Paragraph> </Section> <Section position="3" start_page="144" end_page="144" type="sub_section"> <SectionTitle> 5.3 Determine Sentence Boundary </SectionTitle> <Paragraph position="0"> Unless combining the next proposition into the result proposition will exceed the pre-determined parameters for the complexity of a sentence, the algorithm will keep on combining more propositions into the result proposition using 1-distinct or multiple-distinct coordination. Based on examining PLANDoc output, we limit the number of propositions conjoined by multiple-distinct coordination to two in normal cases.
A higher threshold renders some of the sentences difficult to comprehend.</Paragraph> <Paragraph position="1"> In special cases where the same slots across multiple propositions are multiple-distinct, the pre-determined limit is ignored. By taking advantage of parallel structures, these propositions can be combined using multiple-distinct procedures without making the coordinate structure more difficult to understand. For example, the sentence &quot;John took aspirin on Monday, penicillin on Tuesday, and Tylenol on Wednesday.&quot; is long but quite understandable. Similarly, conjoining a long list of 3-distinct propositions produces understandable sentences too: &quot;John played tennis on Monday, drove to school on Tuesday, and won the lottery on Wednesday.&quot; These constraints allow CASPER to produce easily understandable complex sentences containing a lot of information.</Paragraph> </Section> <Section position="4" start_page="144" end_page="144" type="sub_section"> <SectionTitle> 5.4 Delete Redundant Elements </SectionTitle> <Paragraph position="0"> Stage 4 handles ellipsis. In the previous stages, adjacent elements that occur more than once among the propositions are marked as RECURRING, but the actual deletion decisions have not been made because CASPER lacks the necessary information. The importance of the surface sequential order can be demonstrated by the following example. In the sentence &quot;On Monday, Alice installed Excel and \[on Monday,\] \[Alice\] removed Lotus 123.&quot;, the elements in MOD-TIME delete forward (i.e., the subsequent occurrence of the identical constituent disappears). When MOD-TIME elements are realized at the end of the clause, the same elements in MOD-TIME delete backward (i.e., the antecedent occurrence of the identical constituent disappears): &quot;Alice installed Excel \[on Monday,\] and \[Alice\] removed Lotus 123 on Monday.&quot; In general, if a slot is realized at the front or in the middle of a clause, the recurring elements in that slot delete forward. In the first example, MOD-TIME is realized as the front adverbial while ARG1, &quot;Alice&quot;, appears in the middle of the clause, so elements in both slots delete forward. On the other hand, if a slot is realized at the end of a clause, the recurring elements in that slot delete backward, as with MOD-TIME in the second example.</Paragraph> <Paragraph position="1"> Our extended directionality constraint, an extension of \[Tai, 1969\]'s Directionality Constraint, also applies to conjoined premodifiers and postmodifiers, as demonstrated by &quot;in Aisle 3 and \[in Aisle\] 4&quot; and &quot;at 3 \[PM\] and \[at\] 9 PM&quot;.</Paragraph> <Paragraph position="2"> Using the algorithm just described, the result is concise and easily understandable: &quot;On Monday, Alice installed Excel and Powerpoint and Cindy removed Word for John. Bob removed WordPerfect for John on Tuesday.&quot; Further discourse processing can replace the beneficiary &quot;John&quot; in the second sentence with the pronoun &quot;him&quot;.</Paragraph> </Section> </Section>
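The deletion behavior of Stage 4 can be summarized with a rough sketch; as before, the slot-to-string clause representation, the function, and the example are illustrative assumptions rather than CASPER's implementation.

```python
# A rough sketch of the extended directionality constraint of Section 5.4.
# Elements marked RECURRING delete forward when their slot is realized at the
# front or in the middle of the clause, and backward when it is realized at
# the end of the clause.

def elide(conjoined, slot, position):
    """conjoined: list of clause dicts sharing a RECURRING element in `slot`.
    position: 'front', 'middle', or 'end' realization of that slot."""
    if position in ("front", "middle"):
        keep = 0                      # delete forward: keep the first occurrence
    else:
        keep = len(conjoined) - 1     # delete backward: keep the last occurrence
    for i, clause in enumerate(conjoined):
        if i != keep:
            clause[slot] = None       # suppress realization of the redundant copy
    return conjoined

# Hypothetical example mirroring "Alice installed Excel [on Monday,] and
# [Alice] removed Lotus 123 on Monday.":
clauses = [{"ARG1": "Alice", "PRED": "install", "ARG2": "Excel", "MOD-TIME": "Monday"},
           {"ARG1": "Alice", "PRED": "remove", "ARG2": "Lotus 123", "MOD-TIME": "Monday"}]
elide(clauses, "MOD-TIME", position="end")    # clause-final: keep only the last "on Monday"
elide(clauses, "ARG1", position="middle")     # clause-medial: keep only the first "Alice"
```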
<Section position="7" start_page="144" end_page="145" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> Both hypotactic and paratactic constructions described in this paper have received a lot of attention in linguistics \[Quirk et al., 1985, Halliday, 1994, Carpenter, 1998\]. Much generation literature on aggregation was disguised under the topic &quot;revision&quot; \[Meteer, 1991, Robin, 1995, Callaway and Lester, 1997\]. We consider clause aggregation an integral part of a text generation system, not a revision. The term &quot;revision&quot; implies that something has been generated and then improved upon, which is not the case in these systems. We prefer the term optimization used by \[Dale, 1992\], which describes the phenomenon of aggregation more appropriately - it uses fewer words to convey the same amount of information.</Paragraph> <Paragraph position="1"> In earlier systems, clause aggregation was implemented in the strategic component \[Mann and Moore, 1980, Dale, 1992, Horacek, 1992\]. Logical derivations were used to combine clauses and remove easily inferable clauses in \[Mann and Moore, 1980\]. In such systems, aggregation decisions are made without lexical information. Newer systems, such as \[Shaw, 1995, Wanner and Hovy, 1996, Huang and Fiedler, 1997\], use a sentence planner to make decisions at the clause level between the strategic and tactical components.</Paragraph> <Paragraph position="2"> With the exception of \[Scott and de Souza, 1990\] and \[Robin, 1995\], most research in aggregation did not transform clauses into modifiers, such as adjectives, PPs, or relative clauses, in a systematic manner. \[Scott and de Souza, 1990\] proposed heuristics for carrying out clause combining based on RST and specifically identified which rhetorical relations are appropriate for &quot;embedding&quot;, which corresponds to our hypotactic operators. We will incorporate rhetorical aggregation in the future. Robin's work on revision operators is similar to ours; we described his work earlier in Section 4.</Paragraph> <Paragraph position="3"> Because sentences with coordination constructions can express a lot of information with few words, many text generation systems have implemented the generation of coordination expressions with varying complexity \[Dale, 1992, Dalianis and Hovy, 1993, Huang and Fiedler, 1997, Shaw, 1995, Callaway and Lester, 1997\]. Most systems handle simple coordination, which involves only one conjoined syntactic constituent, such as the subject, verb, or object. None of them handles ellipsis as CASPER does. CASPER tries to systematically find the most concise way to express the propositions by looking through all the propositions. In contrast, the aggregation operators proposed in other work are local and do not handle such complex cases. In addition, the possibility of too much information in a sentence has not received much attention. Most research simply ignores this possibility because the input to their sentence planners never exceeds a few clauses.</Paragraph> </Section> </Paper>