File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-2127_metho.xml
Size: 18,681 bytes
Last Modified: 2025-10-06 14:07:14
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2127"> <Title>Toward the &quot;At-a-glance&quot; Summary: Phrase-representation Summarization Method Yoshihiro UEDA, Mamiko OKA, Takahiro KOYAMA and Tadanobu MIYAUCHI</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 The Concept </SectionTitle> <Paragraph position="0"> Examples of an &quot;at-a-glance&quot; summary are the headlines of news articles. The headline provides information for judging whether the article is to be read or not and, in this sense, it is an &quot;at-a-glance&quot; summary. We use &quot;phrases&quot; to represent the simplicity characteristic(1) and set our goal to create phrase-represented summaries, which provide the reader with an outline of the document, avoiding reading stress by enumerating short phrases containing the important words and concepts composed from these words.</Paragraph> <Paragraph position="1"> The method we adopted to achieve this goal is to construct such phrases from the relations between words rather than extracting important sentences from the original document.</Paragraph> </Section> <Section position="3" start_page="0" end_page="880" type="metho"> <SectionTitle> 2 Summarization Method 2.1 Outline of the Algorithm </SectionTitle> <Paragraph position="0"> Here we give a short description of the outline of this method using the example shown in Fig. 1.(2)
(1) The word &quot;phrase&quot; used here is not of the linguistic sense but an expression for &quot;short&quot; and &quot;simple.&quot; In Japanese, there is no rigid distinction between &quot;phrase&quot; and &quot;clause.&quot;
(2) In this paper, Japanese words are represented in English as much as possible. The words left in Japanese are shown in italics, such as &quot;ga&quot; (a particle for AGENT), &quot;jidai&quot; (&quot;era&quot;), etc. Each relation name is constructed from a Japanese particle and its function (shown as a case name or an equivalent English translation).
[Fig. 1: Outline of phrase-representation summarization. (a) Original text: &quot;At the Green Fair held on 24th, a venture company PICORP announced it licenses its environment protection technology to AMICO, the U.S. top company. PICORP's CEO Ken Ono said that...&quot; (b) Analysis graph: (1) analysis of relations, (2) selection of the core relation, (3) addition of relations. (c) Obtained phrase: (4) generation: &quot;PICORP licenses environment protection technology to AMICO.&quot;]</Paragraph> <Paragraph position="1"> The method consists of the following four major steps: (1) Syntactic analysis to extract the relations between words, (2) Selection of the core relation, (3) Adding relations necessary for the unity of the phrase's meaning, and (4) Generating the surface phrase from the constructed graph. First, the sentences in the given document are analyzed to produce directed acyclic graphs (DAGs) constructed from relation units, each of which consists of two nodes (words) and an arc (the relation between the words). Each node is not only a single word but can also be a word sequence (noun group).</Paragraph>
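To make the relation-unit representation concrete, the following is a minimal sketch in Java (the prototype's implementation language, see Section 3) of a graph fragment loosely following the Fig. 1 example. The class names, the record-based encoding, and the particular arcs and labels are illustrative assumptions, not the authors' actual data structures.

```java
import java.util.List;

public class RelationGraphExample {
    // A node is a word or a word sequence (noun group).
    record Node(String surface) {}

    // A relation unit: two nodes plus the arc connecting them,
    // labelled with a Japanese particle and its function.
    record RelationUnit(Node modifier, Node head, String relation) {}

    public static void main(String[] args) {
        Node greenFair = new Node("Green Fair");
        Node picorp    = new Node("venture company PICORP");
        Node tech      = new Node("environment protection technology");
        Node amico     = new Node("AMICO");
        Node license   = new Node("license");
        Node announced = new Node("announced");

        // The whole sentence becomes a DAG of such relation units.
        List<RelationUnit> graph = List.of(
                new RelationUnit(greenFair, announced, "\"nioite\"-AT"),
                new RelationUnit(picorp,    announced, "\"ha\"-THEME"),
                new RelationUnit(picorp,    tech,      "\"no\"-OF"),
                new RelationUnit(tech,      license,   "\"wo\"-OBJ"),
                new RelationUnit(amico,     license,   "\"ni\"-DAT"),
                new RelationUnit(license,   announced, "\"to\"-THAT"));

        // Step (2) selects the highest-scored unit as the core; step (3) attaches
        // further units such as "wo"-OBJ and "ni"-DAT; step (4) generates the
        // phrase "PICORP licenses environment protection technology to AMICO".
        graph.forEach(System.out::println);
    }
}
```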
<Paragraph position="2"> Then an important relation is selected as a &quot;core&quot; relation. In Fig. 1, the arc connecting the two shaded nodes is selected as the &quot;core.&quot; The core relation alone carries insufficient information to convey the content of the original document.</Paragraph> <Paragraph position="3"> Additional arcs (represented by double lines) are attached to narrow the information the phrase supplies.</Paragraph> <Paragraph position="4"> The following short phrase can be generated from the selected nodes and arcs in the graph: PICORP licenses (its) environment protection technology to AMICO.(3) Phrase-representation summarization enumerates such short phrases to give the readers enough information to grasp the outline of a document. This algorithm is explained in the next section.
(3) This short sentence can be expressed as a phrase in the linguistic sense in English: PICORP's licensing (its) environment protection technology to AMICO.</Paragraph> <Section position="1" start_page="878" end_page="879" type="sub_section"> <SectionTitle> 2.2 Further description of each step </SectionTitle> <Paragraph position="0"> The steps shown in the previous section consist of a cycle that produces a single phrase. The cycles are repeated until the generated phrases satisfy a predefined condition (e.g. the length of the summary). The scores of the words used in the cycle are reduced by a predefined cut-down ratio to avoid frequent use of the same words in the summary.</Paragraph> <Paragraph position="1"> The basic algorithm is shown in Fig. 2.</Paragraph> </Section> <Section position="2" start_page="879" end_page="879" type="sub_section"> <SectionTitle> Relation Analysis </SectionTitle> <Paragraph position="0"> Syntactic analysis is applied to each sentence in the document to produce a DAG of the relations of words. We use a simple parser based on pattern matching (Miyauchi, et al. 1995), one of whose rules always judges each case dependent on its nearest verb. Some of the misanalyses will be hidden by &quot;ambiguity packing&quot; in the &quot;additional relation attachment&quot; step.</Paragraph> </Section> <Section position="3" start_page="879" end_page="880" type="sub_section"> <SectionTitle> Relation Scoring </SectionTitle> <Paragraph position="0"> An importance score is provided for each relation unit (two nodes and an arc connecting them).</Paragraph> <Paragraph position="1"> First, every word is scored by its importance. This score is calculated based on the tf*IDF value (Salton, 1989).(4)
(4) IDF is calculated from 1 million WWW documents gathered by a Web search engine.</Paragraph> <Paragraph position="2"> Then, the relation score is calculated as follows:</Paragraph> <Paragraph position="3"> Score(relation) = (W1 x S1 + W2 x S2) x Srel</Paragraph> <Paragraph position="4"> Here, S1 and S2 are the scores of the two words connected by the relation. The score of a word sequence is calculated by decreasing the sum of the scores of its constituent words according to the length of the word sequence.</Paragraph> <Paragraph position="5"> W1 and W2 are the weights given to each word. Currently, all words are treated equally (W1 = W2 = 1).</Paragraph> <Paragraph position="6"> Srel is the importance factor of the relation.</Paragraph> <Paragraph position="7"> The relations that play central roles in the meaning, such as verb cases, are given high scores, and the surrounding relations, such as &quot;AND&quot; relations, are scored low. The relation scores for modifier-modified relations such as adverbs are set to 0 to avoid selecting them as the core relations.</Paragraph>
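A minimal Java sketch of this scoring step, assuming the combination (W1 x S1 + W2 x S2) x Srel reconstructed above. The numeric Srel values in the table are illustrative assumptions, except that modifier-modified relations get 0 as stated in the text.

```java
import java.util.Map;

public class RelationScoring {
    // Relation importance factors Srel: verb cases high, peripheral relations
    // such as "AND" low, modifier-modified relations (e.g. adverbs) 0 so that
    // they are never selected as the core. The numeric values are assumptions.
    static final Map<String, Double> SREL = Map.of(
            "\"ga\"-AGENT", 1.0,
            "\"wo\"-OBJ",   1.0,
            "\"ni\"-DAT",   1.0,
            "AND",          0.3,
            "ADVERB-MOD",   0.0);

    // Word importance based on the tf*IDF value.
    static double wordScore(double tf, double idf) {
        return tf * idf;
    }

    // Relation score from the two word scores s1, s2 and the relation label.
    static double relationScore(double s1, double s2, String relation) {
        double w1 = 1.0, w2 = 1.0;                      // all words treated equally
        double srel = SREL.getOrDefault(relation, 0.5); // default value is an assumption
        return (w1 * s1 + w2 * s2) * srel;
    }

    public static void main(String[] args) {
        // PICORP --"ga"-AGENT--> license: a central verb case, high score.
        System.out.println(relationScore(8.0, 6.5, "\"ga\"-AGENT"));  // 14.5
        // quickly --ADVERB-MOD--> license: never becomes the core (score 0).
        System.out.println(relationScore(3.0, 6.5, "ADVERB-MOD"));    // 0.0
    }
}
```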
<Paragraph position="8"> Core relation selection: The relation unit with the highest score among all relations is selected as the &quot;core relation.&quot;</Paragraph> <Paragraph position="9"> Additional relation attachment: The information that the core relation carries is usually insufficient. Additional relations are attached to make the information the phrase supplies more specific and to give the reader sufficient information to infer the content of the original document. The following relations are a part of the relations to be attached.</Paragraph> <Paragraph position="10"> (1) Mandatory cases: Relations that correspond to mandatory cases are attached to verbs. Mandatory case lists are defined for verbs except for those that share the common mandatory case list, which includes &quot;ga&quot;-AGENT, &quot;wo&quot;-OBJ and &quot;ni&quot;-DATIVE. &quot;Ha&quot;-THEME, &quot;mo&quot;-ALSO, and null-marker relations(5) are also treated as mandatory, because they can appear in place of the mandatory relations.
(5) The null marker &quot;0&quot; shows that there are no particles or any other words connecting the two words. Japanese doesn't require anything like relative pronouns.</Paragraph> <Paragraph position="11"> Ex.) AMICO &quot;ga&quot;-AGENT release</Paragraph> <Paragraph position="12"> (2) Noun modified by a verb: In Japanese, the &quot;verb - noun&quot; structure represents an embedded sentence, and the noun usually fills some gap in the embedded sentence. If the verb in the core relation (noun - verb) forms such a verb - noun relation, the modified noun is also assumed to carry important information, even if it does not fill a mandatory case. The analysis trees often contain errors because the pattern-based parser doesn't resolve ambiguities. For example, the structure V 0-THAT N1 &quot;no&quot;-OF N2 (V-ing N1's N2) is ambiguous in Japanese (V can modify either N1 or N2, but the parser always analyzes N2 as the modified noun). If the V-N1 relation is selected as the core, the N1-N2 relation is always attached to the core to include the possible V-N2 relation.</Paragraph> <Paragraph position="13"> (3) Modifiers of generic nouns: The concepts brought by generic nouns such as &quot;mono&quot; (thing), &quot;koto&quot; (&quot;that&quot; of a that-clause), &quot;baai&quot; (case), &quot;jidai&quot; (era) are not specific enough, so they usually accompany modifiers to be informative. Here such modifiers are attached to make them informative.</Paragraph> <Paragraph position="14"> Ex.) confusion &quot;no&quot;-OF jidai (emerged in the era of confusion)</Paragraph> <Paragraph position="15"> Termination condition: Judges whether the summaries created so far are sufficient. Currently the termination condition is defined by either the number of produced phrases or the total summary length.</Paragraph> <Paragraph position="16"> Re-scoring of relations: If the condition is not fulfilled, the steps from selection of the core relation must be repeated to create another phrase. Before selecting a new core, the scores of the words used in this cycle are reduced to increase the possibility for other words to be used in the next phrase. Score reduction is achieved by multiplying the scores of the words used by the predefined cut-down ratio R (0 < R < 1). Relation scores are re-calculated using the new word scores.</Paragraph>
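The re-scoring step can be sketched as follows in Java; the cut-down ratio value R = 0.5 and the method names are illustrative assumptions (the paper only requires 0 < R < 1).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReScoring {
    // Assumed cut-down ratio; the paper only states 0 < R < 1.
    static final double R = 0.5;

    // Multiply the scores of the words used in the produced phrase by R,
    // so that other words become relatively more attractive in the next cycle.
    static void reduceUsedWordScores(Map<String, Double> wordScores, List<String> usedWords) {
        for (String word : usedWords) {
            wordScores.computeIfPresent(word, (w, s) -> s * R);
        }
        // Relation scores are then re-calculated from the new word scores
        // before the next core relation is selected.
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>(
                Map.of("PICORP", 8.0, "license", 6.5, "AMICO", 7.0));
        reduceUsedWordScores(scores, List.of("PICORP", "license"));
        System.out.println(scores); // PICORP and license halved, AMICO unchanged
    }
}
```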
<Paragraph position="17"> Generation of surface phrases: This process produces DAGs, each of which consists of one core relation and several attached relations. In Japanese, the surface phrases can be easily obtained by connecting the surface strings of the nodes in their original order. See Section 5 for the generation method for English.</Paragraph> </Section> </Section>
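A minimal Java sketch of this generation step for Japanese, assuming the selected nodes carry their original positions; the node surfaces in the example are illustrative.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SurfaceGeneration {
    // A selected node keeps its surface string and its position in the original sentence.
    record SelectedNode(int originalPosition, String surface) {}

    // Order the selected nodes as they appeared in the original sentence and
    // concatenate their surface strings (no spaces are needed in Japanese).
    static String generate(List<SelectedNode> nodes) {
        return nodes.stream()
                .sorted(Comparator.comparingInt(SelectedNode::originalPosition))
                .map(SelectedNode::surface)
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        List<SelectedNode> selected = List.of(
                new SelectedNode(5, "ライセンスする"),    // "license" (verb)
                new SelectedNode(1, "PICORPが"),          // PICORP "ga"-AGENT
                new SelectedNode(3, "環境保護技術を"));   // technology "wo"-OBJ
        System.out.println(generate(selected));
        // -> PICORPが環境保護技術をライセンスする
    }
}
```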
<Section position="4" start_page="880" end_page="880" type="metho"> <SectionTitle> 3 The Prototype </SectionTitle> <Paragraph position="0"> We developed a prototype of the summarization system based on this algorithm. The development language is Java and the system works on Windows 95/98/NT and Solaris 2.6.(6)
(6) Java and Solaris are the trademarks of Sun Microsystems. Windows and Celeron are the trademarks of Microsoft and Intel, respectively.</Paragraph> <Paragraph position="1"> The time consumed by the summarization process is proportional to the text length; it takes about 700 msec to generate a summary for an A4-sized document (2,000 Japanese characters) using a PC with a Celeron processor (500 MHz). Over 95% of the time is consumed in the relation analysis step.</Paragraph> </Section> <Section position="5" start_page="880" end_page="881" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We have conducted an experiment to evaluate the system. This section is a short summary of the experiment reported in (Oka and Ueda, 2000).</Paragraph> <Paragraph position="1"> The aim of a phrase-represented summary is to give fast and accurate sifting of IR results. To evaluate whether the aim was achieved, we adopted a task-based evaluation (Jing, et al. 1998; Mani, et al. 1998). One of the problems of those experiments using human subjects as assessors is inaccuracy caused by the diversity of assessment.</Paragraph> <Paragraph position="2"> To reduce the diversity, first we assigned 10 subjects (experiment participants) to each summary sample. The number of subjects was just 1 or 2 in the previous task-based experiments. Second, we gave the subjects detailed instructions including the situation that led them to search the WWW.</Paragraph> <Section position="1" start_page="880" end_page="881" type="sub_section"> <SectionTitle> 4.1 Experiment Method </SectionTitle> <Paragraph position="0"> The outline of the evaluation is as follows:
* Assume an information need and make a query for the information need.
* Prepare simulated WWW search results with different types of summaries: (A) first 80 characters, (B) important sentence selection (Zechner, 1996), (C) phrase-represented summary, (D) keyword enumeration. The documents in the simulated search result set are selected so that the set includes an appropriate number of relevant documents and irrelevant documents.
* Have subjects judge from the summaries the relevance between the search results and the given information need. The judgement is expressed in four levels (from higher to lower: L3, L2, L1, and L0, which is judged to be irrelevant).
* Compare the relevance with the one that we assumed.</Paragraph> <Paragraph position="1"> The documents the user judges to be relevant compose a subset of the IR results, and it should be more relevant to the information need than the IR results themselves. Because we have introduced three relevance levels, we can assume three kinds of subsets: L3 only, L3+L2, and L3+L2+L1. The subset composed only of the documents with the L3 judgement should have a high precision score and the subset including L1 documents should get a high recall score.</Paragraph> </Section> <Section position="2" start_page="881" end_page="881" type="sub_section"> <SectionTitle> 4.2 Result </SectionTitle> <Paragraph position="0"> Because recall and precision are in a trade-off relation, here we show the result using the f-measure, the balanced score of the two indexes:</Paragraph> <Paragraph position="1"> f-measure = 2 x precision x recall / (precision + recall)</Paragraph> <Paragraph position="2"> The f-measure averages of the experiment results of three different tasks are shown in Fig. 3. It shows that the phrase-represented summaries (C) are more suitable for sifting search results than any other summaries in all cases.</Paragraph> </Section> <Section position="3" start_page="881" end_page="881" type="sub_section"> <SectionTitle> 4.3 Discussion </SectionTitle> <Paragraph position="0"> The result can be explained using the number of summaries that contain clues to the information need. Summaries consisting of short units (phrases (C) and keywords (D)) are gathered from a wide range of the original text and accordingly have many chances to include the clues. The actual average numbers of summaries that contain the clues are 2.0, 4.3 and 4.7 for (B) sentences, (C) phrases and (D) keywords, respectively. Although (D) keywords include more clues than any other samples, they do not get a good f-score. The reason is considered to be the lack of information about the relations among the keywords.</Paragraph> </Section> </Section> <Section position="6" start_page="881" end_page="882" type="metho"> <SectionTitle> 5 Applicability to Other Languages </SectionTitle> <Paragraph position="0"> Although this algorithm was first developed for the Japanese language, the concept of phrase-representation summarization is also applicable to other languages. Here we show the direction toward its extension to English.</Paragraph> <Paragraph position="1"> English has a clear concept of &quot;phrase,&quot; and simply connected words do not produce well-formed phrases. This requires semantic analysis and generation from the semantic structure.</Paragraph> <Paragraph position="2"> We will consider the following example again.</Paragraph> <Paragraph position="3"> Ex.) A venture company PICORP announced to license their environment protection technology to AMICO, a U.S. top company.</Paragraph> <Paragraph position="4"> If &quot;PICORP&quot; and &quot;license&quot; must be included in the summary and &quot;announce&quot; is not so important, &quot;PICORP license(s)&quot; is the core of the desired phrase. Generating it requires subject resolution of &quot;license&quot;, and thus semantic-level analysis is required. Moreover, predicate-argument structures are preferable to syntactic trees because the subject and the object are represented at the same level. Unification grammar frameworks such as LFG (Kaplan and Bresnan, 1982) and HPSG (Pollard and Sag, 1994) fulfill these requirements.</Paragraph> <Paragraph position="5"> Fig. 4 shows a part of the analysis result represented in LFG.</Paragraph> <Paragraph position="6"> A score is calculated for each feature structure, and the core feature structure will be selected by its score instead of selecting a core relation and attaching mandatory relations. In the core feature structure, index [1] is replaced by the SUBJ of the top feature structure.</Paragraph> <Paragraph position="7"> Generating phrases from the feature structure requires templates. Several patterns can be selected to generate phrases:
V-ing (gerund) form: ARG1's PRED-ing ARG2 to ARG3
noun form: ARG1's noun (PRED) of ARG2 to ARG3
to-infinitive form: For ARG1 to PRED ARG2 to ARG3
In this case, the noun form &quot;PICORP's license of the protection technology to AMICO&quot; is avoided because the noun &quot;license&quot; lacks the meaning of &quot;action&quot; or &quot;event.&quot; Other rules specific to headlines, such as &quot;to-infinitive represents future,&quot; can also be introduced.</Paragraph>
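As an illustration of the template-based generation just described, here is a small Java sketch; the PredArg record and the naive gerund morphology are assumptions made for the example, not the authors' implementation.

```java
public class EnglishPhraseTemplates {
    // A predicate-argument structure extracted from the feature structure.
    record PredArg(String pred, String arg1, String arg2, String arg3) {}

    // Naive -ing morphology, sufficient for the running example.
    static String gerund(String verb) {
        return verb.endsWith("e") ? verb.substring(0, verb.length() - 1) + "ing" : verb + "ing";
    }

    // "ARG1's PRED-ing ARG2 to ARG3" (gerund form)
    static String gerundForm(PredArg pa) {
        return pa.arg1() + "'s " + gerund(pa.pred()) + " " + pa.arg2() + " to " + pa.arg3();
    }

    // "For ARG1 to PRED ARG2 to ARG3" (to-infinitive form)
    static String toInfinitiveForm(PredArg pa) {
        return "For " + pa.arg1() + " to " + pa.pred() + " " + pa.arg2() + " to " + pa.arg3();
    }

    public static void main(String[] args) {
        PredArg pa = new PredArg("license", "PICORP",
                "(its) environment protection technology", "AMICO");
        System.out.println(gerundForm(pa));
        // -> PICORP's licensing (its) environment protection technology to AMICO
        System.out.println(toInfinitiveForm(pa));
        // -> For PICORP to license (its) environment protection technology to AMICO
    }
}
```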
</Section> </Paper>