<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1057">
  <Title>A Formal Model for Information Selection in Multi-Sentence Text Extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Formal Model for Information Selection and Packing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Our model for selecting and packing information across multiple text units relies on three components that are specified by each application. First, we assume that there is a finite set T of textual units t1, t2, ..., tn, a subset of which will form the answer or summary. For most approaches to summarization and question answering, which follow the extraction paradigm, the textual units ti will be obtained by segmenting the input text(s) at an application-specified granularity level, so each ti would typically be a sentence or paragraph.</Paragraph>
      <Paragraph position="1"> Second, we posit the existence of a finite set C of conceptual units c1, c2, ..., cm. The conceptual units encode the information that should be present in the output, and they can be defined in different ways according to the task at hand and the priorities of each system. Obviously, defining the appropriate conceptual units is a core problem, akin to feature selection in machine learning: there is no exact definition of what an important concept is that would apply to all tasks. Current summarization systems often represent concepts indirectly via textual features that give high scores to the textual units that contain important information and should be used in the summary, and low scores to those textual units that are unlikely to contain information worth including in the final output. Thus, many summarization approaches use as conceptual units lexical features like the tf*idf weighting of words in the input text(s), words used in the titles and section headings of the source documents (Luhn, 1959; Edmundson, 1968), or certain cue phrases like significant, important, and in conclusion (Kupiec et al., 1995; Teufel and Moens, 1997). Conceptual units can also be defined out of more basic conceptual units, based on the co-occurrence of important concepts (Barzilay and Elhadad, 1997) or syntactic constraints between representations of concepts (Hatzivassiloglou et al., 2001). Conceptual units do not have to be directly observable as text snippets; they can represent abstract properties that particular text units may or may not satisfy, for example, status as the first sentence of a paragraph or, more generally, position in the source text (Lin and Hovy, 1997). Some summarization systems assume that the importance of a sentence is derivable from a rhetorical representation of the source text (Marcu, 1997), while others leverage information from multiple texts to re-score the importance of conceptual units across all the sources (Hatzivassiloglou et al., 2001).</Paragraph>
      <Paragraph position="2"> No matter how these important concepts are defined, different systems use text-observable features that either correspond to the concepts of interest (e.g., words and their frequencies) or point out those text units that potentially contain important concepts (e.g., position or discourse properties of the text unit in the source document). The former class of features can be directly converted to conceptual units in our representation, while the latter can be accounted for by postulating abstract conceptual units associated with a particular status (e.g., first sentence) for a particular textual unit. We assume that each conceptual unit has an associated importance weight wi that indicates how important unit ci is to the overall summary or answer.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 A first model: Full correspondence
</SectionTitle>
      <Paragraph position="0"> Having formally defined the sets T and C of textual and conceptual units, what remains to complete the picture of the constraints imposed by the data and the summarization approach is the mapping between textual units and conceptual units.</Paragraph>
      <Paragraph position="1"> This mapping, a function f : T × C → [0, 1], tells us how well each conceptual unit is covered by a given textual unit. Presumably, different approaches will assign different coverage scores even for the same sentences and conceptual units, and the consistency and quality of these scores would be one way to determine the success of each competing approach.</Paragraph>
      <Paragraph position="2"> We first examine the case where the function f is limited to zero or one values, i.e., each textual unit either contains/matches a given conceptual feature or not. This is the case with many simple features, such as words and sentence position. Then, we define the total information covered by any given subset S of T as</Paragraph>
      <Paragraph position="3"> I(S) = Σ_{i : ∃ tj ∈ S, f(tj, ci) = 1} wi   (1)</Paragraph>
      <Paragraph position="4"> In other words, the information contained in a summary is the sum of the weights of the conceptual units covered by at least one of the textual units included in the summary.</Paragraph>
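The zero-one case of equation (1) is simple to compute directly. The following sketch illustrates it; all unit names, sets, and weights are hypothetical, not taken from the paper:

```python
# Sketch of equation (1) under a 0-1 mapping: each textual unit is modeled
# as the set of conceptual units it covers, and I(S) sums the weight of
# every concept covered by at least one selected unit. All names, units,
# and weights here are hypothetical.

weights = {"c1": 3.0, "c2": 2.0, "c3": 1.0}               # importance weights wi
textual_units = {"t1": {"c1", "c2"}, "t2": {"c2", "c3"}}  # 0-1 mapping f

def information_content(selected):
    """I(S): total weight of concepts covered by at least one unit in S."""
    covered = set().union(*(textual_units[t] for t in selected))
    return sum(weights[c] for c in covered)

print(information_content({"t1"}))        # c1 and c2 covered: 5.0
print(information_content({"t1", "t2"}))  # all concepts covered: 6.0
```

Note that adding t2 contributes only the weight of c3, since c2 is already covered by t1; this behavior is what later makes the model redundancy-aware.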
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Partial correspondence between textual and conceptual units
</SectionTitle>
      <Paragraph position="0"> Depending on the nature of the conceptual units, the assumption of a 0-1 mapping between textual and conceptual units may or may not be practical or even feasible. For many relatively simple representations of concepts, this restriction poses no difficulties: the concept is uniquely identified and can be recognized as present or absent in a text passage. However, it is possible that the concepts have some structure and can be decomposed into more elementary conceptual units, or that partial matches between concepts and text are natural. For example, if the conceptual units represent named entities (a common occurrence in list-type long answers), a partial match between a name found in a text and another name is possible; handling the two names as entirely distinct concepts would be inaccurate. Similarly, an event can be represented as a concept with components corresponding to participants, time, location, and action, with only some of these components found in a particular piece of text.</Paragraph>
      <Paragraph position="1"> Partial matches between textual and conceptual units introduce a new problem, however: if two textual units partially cover the same concept, it is not apparent to what extent the coverage overlaps.</Paragraph>
      <Paragraph position="2"> Thus, there are multiple ways to revise equation (1) in order to account for partial matches, depending on how conservative we are about the expected overlap. One such way is to assume that partial covers of the same concept overlap as much as possible (the most conservative assumption) and define the total information in the summary as</Paragraph>
      <Paragraph position="3"> I(S) = Σ_{i=1..m} wi · max_{tj ∈ S} f(tj, ci)   (2)</Paragraph>
      <Paragraph position="4"> An alternative is to consider that f(tj, ci) represents the extent of the [0, 1] interval corresponding to concept ci that tj covers, and to assume that the coverage is spread over that interval uniformly and independently across textual units. Then the combined coverage of two textual units tj and tk is f(tj, ci) + f(tk, ci) - f(tj, ci) · f(tk, ci). This operator can be naturally extended to more than two textual units and plugged into equation (2) in place of the max operator, resulting in an equation we will refer to as equation (3). Note that both of these equations reduce to our original formula for information content (equation (1)) if the mapping function f only produces 0 and 1 values.</Paragraph>
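The two combination rules for partial matches can be sketched side by side; the coverage values and weights below are hypothetical:

```python
# Sketch of the two combination rules for partial matches (hypothetical
# data): equation (2) takes the per-concept max of the partial covers,
# while the independent-coverage rule combines them as a + b - a*b.

weights = {"c1": 2.0, "c2": 1.0}
f = {("t1", "c1"): 0.5, ("t1", "c2"): 1.0,
     ("t2", "c1"): 0.4}        # unlisted pairs have coverage 0.0

def coverage(t, c):
    return f.get((t, c), 0.0)

def info_max(selected):
    """Equation (2): conservative, per-concept max over selected units."""
    return sum(w * max(coverage(t, c) for t in selected)
               for c, w in weights.items())

def info_independent(selected):
    """Equation (3): combine covers independently via a + b - a*b."""
    total = 0.0
    for c, w in weights.items():
        combined = 0.0
        for t in selected:
            a = coverage(t, c)
            combined = combined + a - combined * a
        total += w * combined
    return total

print(info_max({"t1", "t2"}))                    # 2.0*0.5 + 1.0*1.0 = 2.0
print(round(info_independent({"t1", "t2"}), 6))  # 2.0*0.7 + 1.0*1.0 = 2.4
```

With coverage values restricted to 0 and 1, both functions return the same result as equation (1), matching the reduction noted in the text.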
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Length and textual constraints
</SectionTitle>
      <Paragraph position="0"> We have provided formulae that measure the information covered by a collection of textual units under different mapping constraints. Obviously, we want to maximize this information content. However, this can only sensibly happen when additional constraints on the number or length of the selected textual units are introduced; otherwise, the full set of available textual units would be a solution that proffers a maximal value for equations (1)-(3), i.e.,</Paragraph>
      <Paragraph position="1"> I(T) ≥ I(S) for every S ⊆ T</Paragraph>
      <Paragraph position="2"> We therefore constrain the optimization by assigning a cost pi to each textual unit ti, i = 1, ..., n, and defining a function P over a set of textual units that provides the total penalty associated with selecting those textual units as the output. In our abstraction, replacing a textual unit with one or more textual units that provide the same content should only affect the penalty, and it makes sense to assign the same cost to a long sentence as to two sentences produced by splitting the original sentence. Also, a shorter sentence should be preferable to a longer sentence with the same information content. Hence, our operational definitions for pi and P are</Paragraph>
      <Paragraph position="3"> pi = length(ti),   P(S) = Σ_{ti ∈ S} pi</Paragraph>
      <Paragraph position="4"> i.e., the total penalty is equal to the total length of the answer in some basic unit (e.g., words).</Paragraph>
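These operational definitions amount to a word count, as the following sketch shows (the sentences are hypothetical):

```python
# Sketch of the operational cost definitions (hypothetical sentences): the
# cost pi of a textual unit is its length in basic units (here, words),
# and the total penalty P(S) is the sum of the costs of the selected units.

textual_units = {
    "t1": "the storm hit the coast on monday",
    "t2": "thousands were evacuated",
}

def cost(t):
    """pi: length of textual unit t in words."""
    return len(textual_units[t].split())

def penalty(selected):
    """P(S): total length of the answer in words."""
    return sum(cost(t) for t in selected)

print(penalty({"t1", "t2"}))  # 7 + 3 = 10 words
```

Splitting t1 into two shorter sentences containing the same words would leave P(S) unchanged, which is exactly the property the abstraction above requires.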
      <Paragraph position="5"> Note, however, that in the general case the pi's need not depend solely on length, and the total penalty need not be a linear combination of them. The cost function can depend on features other than length, for example, the number of pronouns: the more pronouns used in a textual unit, the higher the risk of dangling references and the higher the price should be. Finding the best cost function is an interesting research problem in itself.</Paragraph>
      <Paragraph position="6"> With the introduction of the cost function P(S) our model has two generally competing components. One approach is to set a limit on P(S) and optimize I(S) while keeping P(S) under that limit.</Paragraph>
      <Paragraph position="7"> This approach is similar to that taken in evaluations that keep the length of the output summary within certain bounds, such as the recent major summarization evaluations in the Document Understanding Conferences from 2001 to the present (Harman and Voorhees, 2001). Another approach would be to combine the two components and assign a composite score to each summary, essentially mandating a specific tradeoff between recall and precision; for example, the total score can be defined as a linear combination of I(S) and P(S), in which case the weights specify the relative importance of coverage and precision/brevity, as well as accounting for scale differences between the two metrics. This approach is similar to the calculation of recall, precision, and F-measure adopted in the recent NIST evaluation of long answers for definitional questions (Voorhees, 2003). In this paper, we will follow the first tactic of maximizing I(S) with a limit on P(S) rather than attempting to solve the thorny issues of weighing the two components appropriately.</Paragraph>
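The two ways of resolving the trade-off can be sketched concretely; the candidate summaries and their scores below are hypothetical:

```python
# Sketch of the two ways to resolve the trade-off between the competing
# components (all numbers hypothetical): maximize I(S) under a cap on
# P(S), or score candidates by a linear combination of the two metrics.

candidates = {              # candidate summary: (I(S), P(S))
    "S1": (8.0, 100),
    "S2": (6.0, 60),
    "S3": (9.0, 150),
}

def best_under_limit(limit):
    """First approach: maximize I(S) while keeping P(S) within a limit."""
    feasible = {s: i for s, (i, p) in candidates.items() if limit >= p}
    return max(feasible, key=feasible.get)

def best_composite(alpha):
    """Second approach: maximize the combination I(S) - alpha * P(S)."""
    return max(candidates,
               key=lambda s: candidates[s][0] - alpha * candidates[s][1])

print(best_under_limit(110))   # S1: most information under the cap
print(best_under_limit(70))    # only S2 fits a tighter cap
print(best_composite(0.04))    # the weight alpha sets the trade-off
```

The weight alpha in the composite score plays the role described in the text: it fixes the relative importance of coverage versus brevity and absorbs scale differences between the two metrics.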
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Handling Redundancy in Summarization
</SectionTitle>
    <Paragraph position="0"> Redundancy of information has been found useful in determining what text pieces should be included during summarization, on the basis that information that is repeated is likely to be central to the topic or event being discussed. Earlier work has also recognized that, while it is a good idea to select among the passages repeating information, it is also important to avoid repetition of the same information in the final output.</Paragraph>
    <Paragraph position="1"> Two main approaches have been proposed for avoiding redundancy in the output. One approach relies on grouping together potential output text units on the basis of their similarity, and outputting only a representative from each group (Hatzivassiloglou et al., 2001). Sentences can be clustered in this manner according to word overlap, or by using additional content similarity features. This approach has been recently applied to the construction of paragraph-long answers (e.g., (Blair-Goldensohn et al., 2003; Yu and Hatzivassiloglou, 2003)).</Paragraph>
    <Paragraph position="2"> An alternative approach, proposed for the synthesis of information during query-based passage retrieval, is the maximum marginal relevance (MMR) method (Goldstein et al., 2000). This approach assigns to each potential new sentence in the output a similarity score with the sentences already included in the summary. Only those sentences that contain a substantial amount of new information are allowed into the summary. MMR bases this similarity score on word overlap and additional information about the time when each document was released, and thus can fail to identify repeated information when paraphrasing is used to convey the same meaning.</Paragraph>
    <Paragraph position="3"> In contrast to these approaches, our model handles redundancy in the output at the same time it selects the output sentences. It is clear from equations (1)-(3) that each conceptual unit is counted only once whether it appears in one or multiple textual units. Thus, when we find the subset of textual units that maximizes overall information coverage with a constraint on the total number or length of textual units, the model will prefer the collection of textual units that have minimal overlap of covered conceptual units. Our approach offers three advantages versus both clustering and MMR: First, it integrates redundancy elimination into the selection process, requiring no additional features for defining a text-level similarity between selected textual units. Second, decisions are based on the same features that drive the summarization itself, not on additional surface properties of similarity. Finally, because all decisions are informed by the overlap of conceptual units, our approach accounts for partial overlap of information across textual units. To illustrate this last point, consider a case where three features A, B, and C should be covered in the output, and where three textual units are available, covering A and B, A and C, and B and C, respectively.</Paragraph>
    <Paragraph position="4"> Then our model will determine that selecting any two of the textual units is fully sufficient, while this may not be apparent on the basis of text similarity between the three text units; a clustering algorithm may form three singleton clusters, and MMR may determine that each textual unit is sufficiently different from each other, especially if A, B, and C are realized with nearly the same number of words.</Paragraph>
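The A/B/C example above can be checked numerically with the zero-one coverage formula; the equal weights are a hypothetical choice for illustration:

```python
# Numerical check of the A/B/C example above (hypothetical equal weights):
# with units covering {A,B}, {A,C}, and {B,C}, any two of them already
# cover all three concepts, so the third adds no information.
from itertools import combinations

weights = {"A": 1.0, "B": 1.0, "C": 1.0}
units = {"t1": {"A", "B"}, "t2": {"A", "C"}, "t3": {"B", "C"}}

def info(selected):
    """Equation (1) for a 0-1 mapping."""
    covered = set().union(*(units[t] for t in selected))
    return sum(weights[c] for c in covered)

for pair in combinations(units, 2):
    print(pair, info(set(pair)))  # every pair of units scores 3.0
print(info(set(units)))           # all three units still score 3.0
```

Since every pair already reaches the maximum score of 3.0, an optimizer over I(S) with a two-unit budget selects any two units and stops, with no surface-similarity comparison needed.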
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Applying the Model
</SectionTitle>
    <Paragraph position="0"> Having presented a formal metric for the information content (and optionally the cost) of any potential summary or answer, the task that remains is to optimize this metric and select the corresponding set of textual units for the final output. As stated in Section 2.3, one possible way to do this is to focus on the information content metric and introduce an additional constraint, limiting the total cost to a constant. An alternative is to optimize directly the composite function that combines cost and information content into a single number.</Paragraph>
    <Paragraph position="1"> We examine the case of zero-one mappings between textual and conceptual units, where the total information content is specified by equation (1).</Paragraph>
    <Paragraph position="2"> The complexity of the problem depends on the cost function, and whether we optimize I(S) while keeping P(S) fixed or whether we optimize a combined function of both of those quantities. We will only consider the former case in the present paper.</Paragraph>
    <Paragraph position="3"> We start by examining an artificially simple case, where the cost assigned to each textual unit is 1, and the function P for combining costs is their sum. In this case, the total cost is equal to the number of textual units used in a summary.</Paragraph>
    <Paragraph position="4"> This problem, as we have formalized it above, is identical to the Maximum Set Coverage problem studied in theoretical computer science: given C, a finite set of weighted elements, a collection T of subsets of C, and an integer k, find those k sets that maximize the total weight of the elements in the union of the selected sets (Hochbaum, 1997). In our case, the zero-one mapping allows us to view each textual unit as a subset of the conceptual unit space, containing those conceptual units covered by the textual unit, and k is the total target cost. Unfortunately, maximum set coverage is NP-hard, as the classic set cover problem (given a finite set and a collection of subsets of that set, find the smallest subcollection whose members' union is equal to the original set) can be reduced to it (Hochbaum, 1997). It follows that more general formulations of the cost function that are actually more realistic for our problem (such as defining the total cost as the sum of the lengths of the selected textual units and allowing the textual units to have different lengths) also result in an NP-hard problem, as the special case of maximum set coverage can be reduced to them.</Paragraph>
    <Paragraph position="5"> Nevertheless, the correspondence with maximum set coverage provides a silver lining. Since the problem is known to be NP-hard, properties of simple greedy algorithms have been explored, and a straightforward local maximization method has been proved to give solutions within a known bound of the optimal solution. The greedy algorithm for maximum set coverage is as follows: start with an empty solution S, and iteratively add to S the set Ti that maximizes I(S ∪ Ti). It is provable that this algorithm is the best polynomial approximation algorithm for the problem (Hochbaum, 1997), and that it achieves a solution bounded as follows:</Paragraph>
    <Paragraph position="6"> I(GREEDY) ≥ (1 - (1 - 1/k)^k) · I(OPT) > (1 - 1/e) · I(OPT)</Paragraph>
    <Paragraph position="7"> where I(OPT) is the information content of the optimal summary and I(GREEDY) is the information content of the summary produced by this greedy algorithm. For the more realistic case where cost is specified as the total length of the summary, and where we try to optimize I(S) with a limit on P(S) (see Section 2.3), we propose two greedy algorithms inspired by the algorithm above. Both our algorithms operate by first calculating a ranking of the textual units in decreasing order of score. For the first algorithm, which we call the adaptive greedy algorithm, this ranking is identical to the one produced by the basic greedy algorithm: each textual unit receives as its score the increase in I(S) that it generates when added to the output, in the order specified by the basic greedy algorithm. Our second greedy algorithm (dubbed the modified greedy algorithm below) alters this ranking by prioritizing the conceptual units with the highest individual weight wi: it ranks first the textual unit that makes the highest contribution to I(S) while covering the conceptual unit with the highest individual weight, and then iteratively proceeds with the textual unit that makes the highest contribution to I(S) while covering the next most important unaccounted-for conceptual unit.</Paragraph>
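The basic greedy algorithm for the unit-cost case can be sketched as follows; the units and weights are hypothetical, and this is an illustration of the standard greedy scheme rather than the authors' implementation:

```python
# Sketch of the basic greedy algorithm for unit-cost maximum coverage
# (hypothetical data, not the authors' implementation): repeatedly add the
# textual unit that most increases I(S), until k units have been selected
# or no remaining unit adds information.

weights = {"c1": 3.0, "c2": 2.0, "c3": 2.0, "c4": 1.0}
units = {"t1": {"c1", "c2"}, "t2": {"c1"}, "t3": {"c3", "c4"}}

def greedy_select(k):
    selected, covered = [], set()
    for _ in range(k):
        def gain(t):
            # marginal increase in I(S) from adding unit t
            return sum(weights[c] for c in units[t] if c not in covered)
        best = max((t for t in units if t not in selected), key=gain)
        if gain(best) == 0.0:
            break
        selected.append(best)
        covered.update(units[best])
    return selected, sum(weights[c] for c in covered)

print(greedy_select(2))  # picks t1 (gain 5.0), then t3 (gain 3.0): total 8.0
```

The order in which this loop adds units is the ranking used by the adaptive greedy algorithm; the modified variant would instead rank first a unit covering the single heaviest uncovered concept at each step.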
    <Paragraph position="8"> Given the rankings of textual units, we can then produce an output of a given length by adopting an appropriate criterion for when to stop adding textual units (in order according to their ranking) to the output. There is no clear rule for conforming to a specific length (for example, DUC 2001 allowed submitted summaries to go over &quot;a reasonable percentage&quot; of the target length, while DUC 2004 cuts summaries mid-sentence at exactly the target length). As the summary length in DUC is measured in words, in our experiments we extracted the specified number of words out of the top sentences (truncating the last sentence if necessary).</Paragraph>
  </Section>
</Paper>