<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1071">
  <Title>Information Fusion in the Context of Multi-Document Summarization</Title>
  <Section position="4" start_page="551" end_page="553" type="metho">
    <SectionTitle>
3 Content Selection: Theme Intersection
</SectionTitle>
    <Paragraph position="0"> To avoid redundant statements in a summary, we could select one sentence from the set of similar sentences that meets some criteria (e.g., a threshold number of common content words).</Paragraph>
    <Paragraph position="1"> Unfortunately, any representative sentence usually includes embedded phrases containing information that is not common to other similar sentences. Therefore, we need to intersect the theme sentences to identify the common phrases and then generate a new sentence. Phrases produced by theme intersection will form the content of the generated summary.</Paragraph>
    <Paragraph position="2"> Given the theme shown in Figure 2, how can we determine which phrases should be selected to form the summary content? For our example theme, the problem is to determine that only the phrase &quot;On Friday, U.S. F-16 fighter jet was shot down by a Bosnian Serb missile&quot; is common across all sentences.</Paragraph>
    <Paragraph position="3"> The first sentence includes the clause; however, in other sentences, it appears in different paraphrased forms, such as &quot;A Bosnian Serb missile shot down a U.S. F-16 on Friday.&quot; Hence, we need to identify similarities between phrases that are not identical in wording, but do report the same fact. If paraphrasing rules are known, we can compare the predicate-argument structure of the sentences and find common parts. Finally, having selected the common parts, we must decide how to combine phrases, whether additional information is needed for clarification, and how to order the resulting sentences to form the summary.</Paragraph>
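    <Paragraph position="4"> [Figure 3: DSYNT tree for the sentence &quot;... was shot by missile.&quot; Node features recoverable from the figure: shoot (class: verb; voice: passive; tense: past; polarity: +); fighter (class: noun); missile (class: noun); U.S. (class: noun); one noun node also carries definite: yes.]</Paragraph>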
    <Section position="1" start_page="551" end_page="552" type="sub_section">
      <SectionTitle>
3.1 An Algorithm for Theme Intersection
</SectionTitle>
      <Paragraph position="0"> In order to identify theme intersections, sentences must be compared. To do this, we need a sentence representation that emphasizes sentence features that are relevant for comparison, such as dependencies between sentence constituents, while ignoring irrelevant features such as constituent ordering. Since predicate-argument structure is a natural way to represent constituent dependencies, we chose a dependency-based representation called DSYNT (Kittredge and Mel'čuk, 1983). An example of a sentence and its DSYNT tree is shown in Figure 3. Each non-auxiliary word in the sentence has a node in the DSYNT tree, and this node is connected to its direct dependents. Grammatical features of each word are also kept in the node. In order to facilitate comparison, words are kept in canonical form.</Paragraph>
      <Paragraph position="1"> In order to construct a DSYNT, we first run our sentences through Collins' robust statistical parser (Collins, 1996). We developed a rule-based component that transforms the phrase-structure output of the parser into a DSYNT representation. Function words (determiners and auxiliaries) are eliminated from the tree and the corresponding syntactic features are updated.</Paragraph>
      <Paragraph position="2"> The comparison algorithm starts with all sentence trees rooted at verbs from the input DSYNT, and traverses them recursively: if two nodes are identical, they are added to the output tree, and their children are compared. Once a full phrase (a verb with at least two constituents) has been found, it is added to the intersection. If nodes are not identical, the algorithm tries to apply an appropriate paraphrasing rule from a set of rules described in the next section. For example, if the phrases &quot;group of students&quot; and &quot;students&quot; are compared, then the omit empty head rule is applicable, since &quot;group&quot; is an empty noun and can be dropped from the comparison, leaving two identical words, &quot;students&quot;. If there is no applicable paraphrasing rule, then the comparison is finished and the intersection result is empty.</Paragraph>
      <Paragraph position="3"> All the sentences in the theme are compared in pairs. These intersections are then sorted by frequency, and all intersections whose frequency is above a given threshold form the theme intersection.</Paragraph>
      <Paragraph position="4"> For the theme in Figure 2, the intersection result is &quot;On Friday, a U.S. F-16 fighter jet was shot down by Bosnian Serb missile.&quot; 1</Paragraph>
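The recursive comparison and the pairwise frequency filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the Node class, the EMPTY_HEADS list, the greedy child matching, and the "count at or above threshold" test are all hypothetical simplifications, and only the omit empty head rule is modeled.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations

# Hypothetical list of "empty" head nouns; the paper does not enumerate them.
EMPTY_HEADS = {"group", "number", "set"}


@dataclass(frozen=True)
class Node:
    """A simplified DSYNT node: canonical word plus unordered dependents."""
    word: str
    children: tuple = ()


def strip_empty_head(node):
    """Omit-empty-head rule: 'group of students' is compared as 'students'."""
    if node.word in EMPTY_HEADS and len(node.children) == 1:
        return node.children[0]
    return node


def intersect(a, b):
    """Return the common subtree of a and b, or None if the roots differ."""
    a, b = strip_empty_head(a), strip_empty_head(b)
    if a.word != b.word:
        return None  # no applicable rule: empty intersection
    common = []
    remaining = list(b.children)
    for child_a in a.children:
        for child_b in remaining:
            shared = intersect(child_a, child_b)
            if shared is not None:
                common.append(shared)
                remaining.remove(child_b)
                break
    # Sort so that constituent order does not affect equality.
    return Node(a.word, tuple(sorted(common, key=repr)))


def theme_intersection(trees, threshold):
    """Compare all verb-rooted trees pairwise; keep frequent full phrases."""
    counts = Counter()
    for a, b in combinations(trees, 2):
        shared = intersect(a, b)
        # A full phrase is a verb with at least two constituents.
        if shared is not None and len(shared.children) >= 2:
            counts[shared] += 1
    return [phrase for phrase, n in counts.most_common() if n >= threshold]


# Toy theme: three sentences about the same event, in canonical form.
s1 = Node("shoot", (Node("jet"), Node("missile"), Node("Friday")))
s2 = Node("shoot", (Node("missile"), Node("jet"), Node("Friday"), Node("Bosnia")))
s3 = Node("shoot", (Node("jet"), Node("missile")))
result = theme_intersection([s1, s2, s3], threshold=2)
```

In this toy theme, the phrase with "jet" and "missile" is found in two of the three pairwise comparisons, so it survives a threshold of 2, while the variant that also includes "Friday" is found only once and is dropped.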
    </Section>
    <Section position="2" start_page="552" end_page="553" type="sub_section">
      <SectionTitle>
3.2 Paraphrasing Rules Derived from Corpus Analysis
</SectionTitle>
      <Paragraph position="0"> Identification of theme intersection requires collecting paraphrasing patterns which occur in our corpus. Paraphrasing is defined as alternative ways a human speaker can choose to &quot;say the same thing&quot; by using linguistic knowledge (as opposed to world knowledge) (Iordanskaja et al., 1991). Paraphrasing has been widely investigated in the generation community (Iordanskaja et al., 1991; Robin, 1994).</Paragraph>
      <Paragraph position="1"> (Dras, 1997) considered sets of paraphrases required for text transformation in order to meet external constraints such as length or readability. (Jacquemin et al., 1997) investigated morphology-based paraphrasing in the context of a term recognition task. However, there is no general algorithm capable of identifying a sentence as a paraphrase of another.</Paragraph>
      <Paragraph position="2"> In our case, such a comparison is less difficult since theme sentences are a priori close semantically, which significantly constrains the kinds of paraphrasing we need to check. In order to verify this assumption, we analyzed paraphrasing patterns in the themes of our training corpus, derived from the Topic Detection and Tracking corpus (Allan et al., 1998). Overall, 200 pairs of sentences conveying the same information were analyzed. We found that 85% of the paraphrasing is achieved by syntactic and lexical transformations. Examples of paraphrasing that require world knowledge are presented below:</Paragraph>
      <Paragraph position="3">
        1. &quot;... last week at Zvornik&quot; and &quot;Bosnian Serb leaders freed about one-third of the U.N. personnel&quot;
        2. &quot;Sheinbein showed no visible reaction to the ruling.&quot; and &quot;Samuel Sheinbein showed no reaction when Chief Justice Aharon Barak read the 3-2 decision&quot;
      </Paragraph>
      <Paragraph position="4"> Since &quot;surface&quot; level paraphrasing comprises the vast majority of paraphrases in our corpus and is easier to identify than paraphrasing that requires world knowledge, we studied paraphrasing patterns in the corpus. We found the following most frequent paraphrasing categories:
        1. ordering of sentence components: &quot;Tuesday they met...&quot; and &quot;They met ... Tuesday&quot;;
        2. main clause vs. a relative clause: &quot;...a building was devastated by the bomb&quot; and &quot;...a building, devastated by the bomb&quot;;
        3. realization in different syntactic categories, e.g., classifier vs. apposition: &quot;Palestinian leader Arafat&quot; and &quot;Arafat, Palestinian leader&quot;, &quot;Pentagon speaker&quot; and &quot;speaker from the Pentagon&quot;;
        4. change in grammatical features (active/passive, time, number): &quot;...a building was devastated by the bomb&quot; and &quot;...the bomb devastated a building&quot;;
        5. head omission: &quot;group of students&quot; and &quot;students&quot;;
        6. transformation from one part of speech to another: &quot;building devastation&quot; and &quot;... building was devastated&quot;;
        7. using semantically related words such as synonyms: &quot;return&quot; and &quot;alight&quot;, &quot;regime&quot; and &quot;government&quot;.
        [Footnote 1: ... that linearizes as this sentence.]
      </Paragraph>
      <Paragraph position="5"> The patterns presented above cover 82% of the syntactic and lexical paraphrases (which is, in turn, 70% of all variants). These categories form the basis for paraphrasing rules used by our intersection algorithm.</Paragraph>
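As a quick check of how the two reported percentages combine, assuming the 82% is taken over the 85% of sentence pairs handled by syntactic and lexical transformations:

```latex
\[
0.85 \times 0.82 = 0.697 \approx 0.70
\]
```

That is, roughly 70% of all analyzed paraphrase pairs fall under the seven listed categories.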
      <Paragraph position="6"> The majority of these categories can be identified automatically. However, some of the rules can only be approximated to a certain degree. For example, identification of similarity based on semantic relations between words depends on the coverage of the thesaurus. We identify word similarity using synonym relations from WordNet. Currently, paraphrasing using part of speech transformations is not supported by the system. All other paraphrase classes we identified are implemented in our algorithm for theme intersection.</Paragraph>
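The dependence of the synonym rule on thesaurus coverage can be illustrated with a small sketch. The system described above uses WordNet; to stay self-contained, this example substitutes a hypothetical in-memory synonym table (TOY_SYNSETS), whose entries are illustrative assumptions and not WordNet data.

```python
# Toy stand-in for a thesaurus; the system described above uses WordNet.
# These synonym sets are illustrative assumptions, not WordNet data.
TOY_SYNSETS = [
    {"regime", "government"},
    {"devastate", "destroy"},
]


def are_similar(word_a, word_b, synsets=TOY_SYNSETS):
    """True if the canonical forms match or share a synonym set."""
    a, b = word_a.lower(), word_b.lower()
    if a == b:
        return True
    return any(a in s and b in s for s in synsets)


covered = are_similar("regime", "government")
# A pair that is missing from the table is reported as dissimilar,
# which is the coverage limitation mentioned in the text.
uncovered = are_similar("return", "alight")
```

With this table, "regime" and "government" match, while "return" and "alight" do not, simply because the toy thesaurus has no entry for them.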
    </Section>
    <Section position="3" start_page="553" end_page="553" type="sub_section">
      <SectionTitle>
3.3 Temporal Ordering
</SectionTitle>
      <Paragraph position="0"> A property that is unique to multi-document summarization is the effect of time perspective (Radev and McKeown, 1998). When reading an original text, it is possible to retrieve the correct temporal sequence of events which is usually available explicitly. However, when we put pieces of text from different sources together, we must provide the correct time perspective to the reader, including the order of events, the temporal distance between events and correct temporal references.</Paragraph>
      <Paragraph position="1"> In single-document summarization, one of the possible orderings of the extracted information is provided by the input document itself. However, in the case of multiple-document summarization, some events may not be described in the same article. Furthermore, the order between phrases can change significantly from one article to another. For example, in a set of articles about the Oklahoma bombing from our training set, information about the &quot;bombing&quot; itself, &quot;the death toll&quot; and &quot;the suspects&quot; appears in three different orders in the articles. This phenomenon can be explained by the fact that the order of the sentences is highly influenced by the focus of the article.</Paragraph>
      <Paragraph position="2"> One possible discourse strategy for summaries is to base ordering of sentences on chronological order of events. To find the time an event occurred, we use the publication date of the phrase referring to the event. This gives us the best approximation to the order of events without carrying out a detailed interpretation of temporal references to events in the article, which are not always present. Typically, an event is first referred to on the day it occurred.</Paragraph>
      <Paragraph position="3"> Thus, for each phrase, we must find the earliest publication date in the theme, create a &amp;quot;time stamp&amp;quot;, and order phrases in the summary according to this time stamp.</Paragraph>
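The time-stamp heuristic can be sketched as follows. This is a minimal illustration, assuming each phrase is already associated with the publication dates of the articles it came from; the function name, the input mapping, and the example dates are hypothetical.

```python
from datetime import date


def order_by_time_stamp(theme_dates):
    """Order phrases chronologically by their earliest publication date.

    theme_dates maps each phrase to the publication dates of its sources.
    Returns (phrase, time_stamp) pairs sorted by time stamp.
    """
    stamped = [(phrase, min(dates)) for phrase, dates in theme_dates.items()]
    return sorted(stamped, key=lambda pair: pair[1])


# Hypothetical publication dates, for illustration only.
ordered = order_by_time_stamp({
    "the pilot was rescued": [date(1995, 6, 9), date(1995, 6, 8)],
    "the jet was shot down": [date(1995, 6, 3), date(1995, 6, 2)],
})
```

Taking the minimum date reflects the assumption stated above that an event is typically first reported on the day it occurred.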
      <Paragraph position="4"> Temporal distance between events is an essential part of the summary. For example, in the summary in Figure 1 about a &quot;U.S. pilot downed in Bosnia&quot;, the lengthy duration between &quot;the helicopter was shot down&quot; and &quot;the pilot was rescued&quot; is the main point of the story. We want to identify significant time gaps between events, and include them in the summary. To do so, we compare the time stamps of the themes, and when the difference between two subsequent time stamps exceeds a certain threshold (currently two days), the gap is recorded. A time marker will be added to the output summary for each gap, for example &quot;According to a Reuters report on the 10/21&quot;.
      Another time-related issue that we address is normalization of temporal references in the summary. If the word &quot;today&quot; is used twice in the summary, and each time it refers to a different date, then the resulting summary can be misleading. Time references such as &quot;today&quot; and &quot;Monday&quot; are clear in the context of a source article, but can be ambiguous when extracted from the article. This ambiguity can be corrected by substituting the temporal reference with the full time/date reference, such as &quot;10/21&quot;. By corpus analysis, we collected a set of patterns for identification of ambiguous dates. However, we currently do not handle temporal references requiring inference to resolve (e.g., &quot;the day before the plane crashed,&quot; &quot;around Christmas&quot;).</Paragraph>
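Gap detection and the simplest case of reference normalization can be sketched as follows. This is an illustration under assumptions: the function names and example dates are hypothetical, only the word "today" is handled (the system's corpus-derived patterns are not reproduced here), and the month/day output format simply mirrors the "10/21" example in the text.

```python
import re
from datetime import date, timedelta

GAP_THRESHOLD = timedelta(days=2)  # "currently two days" in the text


def find_gaps(time_stamps, threshold=GAP_THRESHOLD):
    """Return positions i (in sorted order) that follow a significant gap."""
    ordered = sorted(time_stamps)
    return [
        i
        for i in range(1, len(ordered))
        if ordered[i] - ordered[i - 1] > threshold
    ]


def normalize_today(text, publication_date):
    """Replace the ambiguous word 'today' with an explicit month/day."""
    explicit = f"{publication_date.month}/{publication_date.day}"
    return re.sub(r"\btoday\b", explicit, text)


# Hypothetical time stamps: the third one follows a five-day gap.
gaps = find_gaps([date(1995, 6, 2), date(1995, 6, 3), date(1995, 6, 8)])
normalized = normalize_today("The pilot was rescued today", date(1995, 10, 21))
```

A time marker would then be generated for each position returned by find_gaps; how that marker is worded is left to the sentence generator.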
    </Section>
  </Section>
  <Section position="5" start_page="553" end_page="554" type="metho">
    <SectionTitle>
4 Sentence Generation
</SectionTitle>
    <Paragraph position="0"> The input to the sentence generator is a set of phrases that are to be combined and realized as a sentence. Input features for each phrase are determined by the information recovered by shallow analysis during content planning. Because this input structure and the requirements on the generator are quite different from typical language generators, we had to address the design of the input language specification and its interaction with existing features in a new way, instead of using the existing SURGE syntactic realization in a &quot;black box&quot; manner.</Paragraph>
    <Paragraph position="1"> As an example, consider the case of temporal modifiers. The DSYNT for an input phrase will simply note that it contains a prepositional phrase. FUF/SURGE, our language generator, requires that the input contain a semantic role, circumstantial, which in turn contains a temporal feature.</Paragraph>
    <Paragraph position="2"> The labelling of the circumstantial as time allows SURGE to make the following decisions given a sentence such as: &quot;After they made an emergency landing, the pilots were reported missing.&quot;
      * The selection of the position of the time circumstantial in front of the clause.
      * The selection of the mood of the embedded clause as &quot;finite&quot;.
    </Paragraph>
    <Paragraph position="3"> The semantic input also provides a solid basis to authorize sophisticated revisions to a base input. If the sentence planner decides to adjoin a source to the clause, SURGE can decide to move the time circumstantial to the end of the clause, leading to: &quot;According to Reuters on Thursday night, the pilots were reported missing after making an emergency landing.&quot; Without such paraphrasing ability, which might be decided based on the semantic roles, time and sources, the system would have to generate an awkward sentence with both circumstantials appearing one after another at the front of the sentence.</Paragraph>
    <Paragraph position="4"> While in the typical generation scenario above, the generator can make choices based on semantic information, in our situation, the generator has only a low-level syntactic structure, represented as a DSYNT. It would seem at first glance that realizing such an input should be easier for the syntactic realization component. The generator in that case is left with little more to do than just linearizing the input specification. The task we had to solve, however, is more difficult for two reasons:
      1. The input specification we define must allow the sentence planner to perform revisions; that is, to attach new constituents (such as source) to a base input specification without taking into account all possible syntactic interactions between the new constituent and existing ones;
      2. SURGE relies on semantic information to make decisions and verify that these decisions are compatible with the rest of the sentence structure. When the semantic information is not available, it is more difficult to predict that the decisions are compatible with the input provided in syntactic form.
    </Paragraph>
    <Paragraph position="5"> We modified the input specification language for FUF/SURGE to account for these problems.</Paragraph>
    <Paragraph position="6"> We added features that indicate the ordering of circumstantials in the output. Ordering of circumstantials can easily be derived from their ordering in the input. Thus, we label circumstantials with the features front-i (i-th circumstantial at the front of the sentence) and end-i (i-th circumstantial at the end), where i indicates the relative ordering of the circumstantial within the clause.</Paragraph>
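The front-i and end-i labelling can be sketched as follows. This is only an illustration in Python, not FUF/SURGE input syntax: the function name, the split of circumstantials into those before and after the main verb, the dictionary output, and numbering from 1 are all assumptions.

```python
def label_circumstantials(before_verb, after_verb):
    """Label circumstantials by position, preserving their input order.

    before_verb and after_verb hold the circumstantials found before and
    after the main verb in the input phrase (a hypothetical split).
    """
    labels = {}
    for i, circ in enumerate(before_verb, start=1):
        labels[f"front-{i}"] = circ
    for i, circ in enumerate(after_verb, start=1):
        labels[f"end-{i}"] = circ
    return labels


labels = label_circumstantials(
    ["According to Reuters on Thursday night"],
    ["after making an emergency landing"],
)
```

Because the labels are derived purely from input order, no semantic typing of the circumstantial is needed to reproduce the original arrangement.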
    <Paragraph position="7"> In addition, if possible, when mapping input phrases to a SURGE syntactic input, the sentence planner tries to determine the semantic type of circumstantial by looking up the preposition (for example: &quot;after&quot; indicates a &quot;time&quot; circumstantial). This allows FUF/SURGE to map the syntactic category of the circumstantial to the semantic and syntactic features expected by SURGE. However, in cases where the preposition is ambiguous (e.g., &quot;in&quot; can indicate &quot;time&quot; or &quot;location&quot;) the generator must rely solely on ordering circumstantials based on ordering found in the input.</Paragraph>
    <Paragraph position="8"> We have modified SURGE to accept this type of input: in all places where SURGE checks the semantic type of the circumstantial before making choices, we verified that the absence of the corresponding input feature would not lead to an inappropriate default being selected. In summary, this new application for syntactic realization highlights the need for supporting hybrid inputs of variable abstraction levels. The implementation benefited from the bidirectional nature of FUF unification in the handling of hybrid constraints and required little change to the existing SURGE grammar. While we used circumstantials to illustrate the issues, we also handled revision for a variety of other categories in the same manner.</Paragraph>
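The preposition lookup with its ambiguity fallback can be sketched as follows. The table is hypothetical: only the entries for "after" and "in" come from the text, and returning None to signal "fall back to input ordering" is an assumed convention.

```python
# Hypothetical preposition table; only "after" and "in" come from the text.
PREPOSITION_TYPES = {
    "after": {"time"},
    "in": {"time", "location"},
}


def circumstantial_type(preposition):
    """Return the semantic type, or None when unknown or ambiguous."""
    types = PREPOSITION_TYPES.get(preposition.lower(), set())
    if len(types) == 1:
        return next(iter(types))
    return None  # caller falls back to input ordering (front-i / end-i)


after_type = circumstantial_type("after")
in_type = circumstantial_type("in")
```

Here "after" resolves to a time circumstantial, while "in" yields None, so the generator would rely on the ordering labels alone.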
  </Section>
</Paper>