<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0713"> <Title>From discourse structures to text summaries</Title> <Section position="4" start_page="85" end_page="86" type="intro"> <SectionTitle> 3 An RST-based summarization program </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="85" end_page="86" type="sub_section"> <SectionTitle> 3.1 Implementation </SectionTitle> <Paragraph position="0"> Our summarization program relies on a rhetorical parser that builds RS-trees for unrestricted texts. The mathematical foundations of the rhetorical parsing algorithm rely on a first-order formalization of valid text structures (Marcu, 1997b). The assumptions of the formalization are the following: 1. The elementary units of complex text structures are non-overlapping spans of text. 2. Rhetorical, coherence, and cohesive relations hold between textual units of various sizes. 3. Relations can be partitioned into two classes: paratactic and hypotactic. Paratactic relations are those that hold between spans of equal importance. Hypotactic relations are those that hold between a span that is essential for the writer's purpose, i.e., a nucleus, and a span that increases the understanding of the nucleus but is not essential for the writer's purpose, i.e., a satellite. 4. The abstract structure of most texts is a binary, tree-like structure. 5. If a relation holds between two textual spans of the tree structure of a text, that relation also holds between the most important units of the constituent subspans. The most important units of a textual span are determined recursively: they correspond to the most important units of the immediate subspans when the relation that holds between these subspans is paratactic, and to the most important units of the nucleus subspan when the relation that holds between the immediate subspans is hypotactic. The rhetorical parsing algorithm, which is outlined in figure 1, is based on a comprehensive corpus analysis of more than 450 discourse markers and 7900 text fragments (see 
(Marcu, 1997b) for details). When given a text, the rhetorical parser first determines the discourse markers and the elementary units that make up that text. The parser then uses the information derived from the corpus analysis in order to hypothesize rhetorical relations among the elementary units. In the end, the parser applies a constraint-satisfaction procedure to determine the text structures that are valid. If more than one valid structure is found, the parser chooses one that is the &quot;best&quot; according to a given metric. The details of the algorithms that are used by the rhetorical parser are discussed at length in (Marcu, 1997a; Marcu, 1997b). When the rhetorical parser takes text (1) as input, it produces the RS-tree in figure 2. The convention that we use is that nuclei are surrounded by solid boxes and satellites by dotted boxes; the links between a node and a subordinate nucleus or nuclei are represented by solid arrows, and the links between a node and a subordinate satellite by dotted lines. The nodes with only one satellite denote occurrences of parenthetical information: for example, textual unit 2 is labeled as parenthetical to the textual unit that results from juxtaposing 1 and 3. The numbers associated with each leaf correspond to the numerical labels in text (1).</Paragraph> <Paragraph position="1"> The numbers associated with each internal node correspond to the salient units of that node and are explicitly represented in the RS-tree. By inspecting the RS-tree in figure 2, one can notice that the trees that are built by the program do not have the same granularity as the trees constructed by the analysts. For example, the program treats units 13, 14, and 15 as one elementary unit. However, as we argue in (Marcu, 1997b), the corpus analysis on which our parser is built supports the observation that, in most cases, the global structure of the RS-tree is not affected by the inability of the rhetorical parser to uncover all clauses in a text: most of the clauses 
that are not uncovered are nuclei of JOINT relations. The summarization program takes the RS-tree produced by the rhetorical parser and selects the textual units that are most salient in that text. If the aim of the program is to produce just a very short summary, only the salient units associated with the internal nodes found closer to the root are selected. The longer the summary one wants to generate, the farther the selected salient units will be from the root. In fact, one can see that the RS-trees built by the rhetorical parser induce a partial order on the importance of the textual units. For text (1), the most important unit is 4. The textual units that are salient in the nodes found one level below represent the next level of importance (in this case, unit 12; unit 4 was already accounted for). The next level contains units 5, 6, 16, and 18, and so on.</Paragraph> </Section> <Section position="2" start_page="86" end_page="86" type="sub_section"> <SectionTitle> 3.2 Evaluation </SectionTitle> <Paragraph position="0"> To evaluate our program, we associated with each textual unit in the RS-trees built by the rhetorical parser a score in the same way we did for the RS-trees built by the analysts. For example, the RS-tree in figure 2 has a depth of 6. Because unit 4 is salient for the root, it gets a score of 6. Units 5 and 6 are salient for an internal node found two levels below the root; therefore, their score is 4. Unit 9 is salient for a leaf found five levels below the root; therefore, its score is 1. Table 1 presents the scores associated by our summarization program to each unit in text (1). We used the importance scores assigned by our program to compute statistics similar to those discussed in the previous section. When the program selected only the textual units with the highest scores, in percentages that were equal to those of the judges, the recall was 53% and the precision was 50%. When the program selected the full sentences that were associated with the most important units, in percentages that were equal 
to those of the judges, the recall was 66% and the precision 68%. The lower recall and precision scores associated with clauses seem to be caused primarily by the difference in granularity with respect to the way the texts were broken into subunits: the program does not recover all minimal textual units, and as a consequence, its assignment of importance scores is coarser. When full sentences are considered, the judges and the program work at the same level of granularity, and as a consequence, the summarization results improve significantly.</Paragraph> </Section> </Section> </Paper>
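<!--
Illustrative note on sections 3.1 and 3.2. The recursive definition of salient units and the depth-based scoring described in the text can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the program used in the paper: the Node class, the role field, the function names, the parameter k, and the small example tree are all hypothetical, and the scoring rule (tree depth minus the level of the shallowest node for which a unit is salient) is inferred from the worked example in section 3.2 (depth 6; root gives 6, two levels below gives 4, five levels below gives 1).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    # Hypothetical RS-tree node: a leaf holds a unit id, an internal node holds children.
    unit: Optional[int] = None
    role: str = "nucleus"  # role relative to the parent: "nucleus" or "satellite"
    children: List["Node"] = field(default_factory=list)


def salient_units(node: Node) -> List[int]:
    # A leaf is its own salient unit. An internal node promotes the salient
    # units of its nucleus children (every child, if the relation is paratactic).
    if not node.children:
        return [node.unit]
    units: List[int] = []
    for child in node.children:
        if child.role == "nucleus":
            units.extend(salient_units(child))
    return units


def depth(node: Node) -> int:
    # Number of levels in the tree; a single leaf has depth 1.
    return 1 + max((depth(c) for c in node.children), default=0)


def importance_scores(root: Node) -> Dict[int, int]:
    # Assumed rule: score = tree depth minus the level of the shallowest
    # node for which the unit is salient (the root is at level 0).
    total = depth(root)
    scores: Dict[int, int] = {}
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        for unit in salient_units(node):
            scores[unit] = max(scores.get(unit, 0), total - level)
        stack.extend((child, level + 1) for child in node.children)
    return scores


def summarize(root: Node, k: int) -> List[int]:
    # Keep the k highest-scoring units, then restore text order.
    scores = importance_scores(root)
    best = sorted(scores, key=lambda u: (-scores[u], u))[:k]
    return sorted(best)


def leaf(unit: int, role: str = "nucleus") -> Node:
    return Node(unit=unit, role=role)


# Hypothetical four-unit tree: unit 2 is the nucleus of the nucleus span,
# unit 1 is its satellite, and units 3 and 4 form a paratactic satellite span.
tree = Node(children=[
    Node(role="nucleus", children=[leaf(1, "satellite"), leaf(2)]),
    Node(role="satellite", children=[leaf(3), leaf(4)]),
])
```

With this toy tree, summarize(tree, 1) keeps only the unit promoted to the root, and larger values of k reach units that are salient farther from the root, which mirrors the partial order on importance described in section 3.1.
-->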