<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0708">
  <Title>Goal-Directed Approach for Text Summarization</Title>
  <Section position="5" start_page="47" end_page="47" type="metho">
    <SectionTitle>
3 Sentence Selection Algorithm
</SectionTitle>
    <Paragraph position="0"> The sentence selection algorithm calculates the 'informativeness' of each sentence in a document. The measurement represents the strength of the relation between the goals and the sentences, and the richness of information in a document. These variables are defined by the following three numerical values:</Paragraph>
  </Section>
  <Section position="6" start_page="47" end_page="47" type="metho">
    <SectionTitle>
3 Total number of sentence expressions not related to the goals
</SectionTitle>
    <Paragraph position="0"> The order of these measurements defines their precedence. The first measurement is given the highest priority. Sentences that satisfy many of the goals are considered more informative. Both the first and second values above represent the amount of information included in a sentence. The third measurement indicates the amount of information in a sentence and roughly simulates the contained amount of explanation or description about the goal. The sentence selection algorithm (shown in Figure 2) relates the highest scored sentences by the informativeness measurement. The measurements are repeatedly evaluated until all the goals are related to the sentences or all relations are found.</Paragraph>
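The precedence ordering described above can be read as a lexicographic comparison of three counts. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. Only the third value is legible in this section, so the first two fields (distinct goals matched, and sentence words matching a goal) are assumptions. The word-level matching and the names `Score` and `informativeness` are also hypothetical. A larger third value is treated as more informative, following the statement that it approximates the amount of explanation about the goal.

```python
from typing import NamedTuple


class Score(NamedTuple):
    """Three values compared in order; earlier fields take precedence."""
    goals_matched: int    # assumed first value: distinct goals found in the sentence
    related_words: int    # assumed second value: sentence words that match a goal
    unrelated_words: int  # third value: sentence words that match no goal


def informativeness(sentence: str, goals: set[str]) -> Score:
    """Score one sentence against a set of lowercase goal words."""
    words = sentence.lower().split()
    related = [w for w in words if w in goals]
    return Score(
        goals_matched=len(set(related)),
        related_words=len(related),
        unrelated_words=len(words) - len(related),
    )


goals = {"disk", "transfer"}
sentences = [
    "the disk supports fast transfer",
    "the disk is a disk",
    "power consumption is low",
]
# Tuples compare field by field, so max() applies the precedence order.
best = max(sentences, key=lambda s: informativeness(s, goals))
```

Using a tuple keeps the precedence rule in one place: a later value only matters when all earlier values tie.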
  </Section>
  <Section position="7" start_page="47" end_page="48" type="metho">
    <SectionTitle>
4 Goal Detection
</SectionTitle>
    <Paragraph position="0"> This system is designed to be built into the text preview menu of a word processor or the query results listing of a document retrieval system. Thus, the contents of a document are unpredictable and the system needs to work in real time. This limitation requires that the system handle rather simple information. For example, the word list compiled from the headlines is used as the goals when processing news articles. The title words are used to extract a text from a report. These simple word lists may be too simple and a little inadequate as goals. Goal-directed summarization includes the processing of the structural information. This includes concept-level goal detection using a thesaurus, document structure, and structural information in the titles (section, subsection).</Paragraph>
    <Paragraph position="1"> Figure 2: All goals are given in the goal list. All sentences of the source text are given in the sentence list.
while (a goal exists in the goal list) {
  The informativeness measurements are applied to each sentence in the sentence list.
  if (a sentence (or sentences) with maximum informativeness exists) {
    The sentence is removed from the sentence list and added to the extract list.
    The goals related to the sentence are removed from the goal list.
  }
}</Paragraph>
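The selection loop of Figure 2 can be sketched in Python as follows. This is an illustrative reading, not the authors' code. The scoring function is a placeholder (the number of remaining goals a sentence matches), the early exit when no sentence matches any remaining goal is an assumption taken from the stopping condition "all relations are found", and returning the extract in source order is also an assumption. The name `select_sentences` is hypothetical.

```python
def select_sentences(sentences, goals):
    """Greedy extraction: take the best-scoring sentence until the goals run out."""
    goal_list = set(goals)
    sentence_list = list(sentences)
    extract_list = []

    def score(sentence):
        # Placeholder measurement (assumption): number of remaining goals matched.
        return len(goal_list.intersection(sentence.lower().split()))

    while goal_list and sentence_list:
        best = max(sentence_list, key=score)
        if score(best) == 0:
            # No remaining sentence relates to any remaining goal.
            break
        sentence_list.remove(best)
        extract_list.append(best)
        # Goals satisfied by the chosen sentence are removed from the goal list.
        goal_list.difference_update(best.lower().split())

    # Assumption: present the extract in the order of the source text.
    return [s for s in sentences if s in extract_list]


news = [
    "Stocks fell sharply on Monday",
    "The weather was mild",
    "Analysts blamed rising oil prices",
    "Stocks and oil moved together",
]
extract = select_sentences(news, ["stocks", "oil", "prices"])
```

Because satisfied goals are removed after each pick, a later sentence is chosen only if it covers a goal that is still open, which is what keeps the extract short.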
  </Section>
  <Section position="8" start_page="48" end_page="49" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> The first experiment is a summary of 13,562 newspaper articles and 62 monthly market survey report articles. Both texts are in Japanese.</Paragraph>
    <Paragraph position="2"> The calculated extraction rates, based on the total number of characters (footnote 1: characters in a summary / characters in a text), are listed in Table 1. On average, the length of a text summarized by this system is 50% of the length produced by the simple title-keyword method. The most frequent compression rate in the results of the simple title-keyword method is 100% (the entire text). By using the informative selection, the rate falls between 20% to ...
Table 2 lists the results for the computer business survey reports. In this case, the differences between the rates are larger than in the newspaper results. The texts of these business reports are longer than the newspaper articles. These experiments are mostly on Japanese documents. Only a few results for English documents are available. Table 3 lists the results of extracting summaries of English news articles. In this case, the extraction rates are calculated based on the total number of words (footnote 2: words in a summary / words in a text).
The nature of this system makes evaluating the contents difficult, and no clear solution can be obtained. The evaluation methods that (Salton and Allan, 93) and (Kupiec et al., 95) applied to their systems use only intrinsic information in a source text. Salton measures the similarity between a summary and an original text. Kupiec compares extracts with manually coded summaries. If the priority of information in a text is equal and informativeness can be calculated uniformly, these evaluations are suitable. However, priority is affected by the context. Determining the appropriateness of the results was difficult. Thus, the extracts were randomly chosen and the inappropriateness was analyzed for 87 newspaper articles and 11 market report articles. Obvious errors were found in 17 summaries (16 news articles, one report). These errors were mainly caused by the failure of synonyms of the title keywords and words in a sentence (e.g., dead body and corpse) to match. The other summaries included enough information to extrapolate the contents of the original texts. Thus, 80% of the summaries contained enough information to serve as a preview.
In a news article, the leading paragraph should be a good summary of the article. Therefore, the extracts of this system and the lead paragraphs of news articles were compared. Among all news articles, 70% of the extracts from this system included sentences from lead paragraphs and 50% of the extracts included only the lead paragraphs. Thus, the system algorithm naturally selected more sentences from lead paragraphs than from other parts of a news article. Next, the appropriateness and compactness of the lead paragraphs and the extracts of this system were compared for the news data. Inappropriate results were found to be 4% higher in the extracts. Double the number of extracts were more compact than the lead paragraph. For the report data, all of the extracts were shorter than the leading paragraphs. Thus, extracts from this system are regarded as being better than leading paragraphs.
In the experiment described above on news articles, the goals were taken from the headlines and titles. Some external source can also serve as the goals of a summary. If summaries are used to compare text contents, text properties (such as tf idf scores) can be used to create the goals of the summary. For example, the extracts will include distinctive information if words with high tf idf scores are given. The extracts will show the common information of the texts if words with high document frequencies are given. Figure 3 shows the results of this experiment using a small number of specification documents of hard disk drives. As shown in Figure 3(a), the high tf idf words determine that the sentences describing the distinctive features of the hard disk are selected. Figure 3(b) shows that the words with high document frequencies are used to select the common information about the general specifications.
Figure 3(a) Extraction by tf idf property.
Words with high tf idf scores: DEs, DMs, F6632A, H, path configuration, MB, GB, path, RANK, F6493, F6429G
Summary by the high tf idf words: Flexible configuration. The F1700B has a four path configuration (connection path to magnetic disks) as a standard feature. In addition, in the F1700B, the path to the channel and the paths to the magnetic disk unit can be increased independently, so a flexible configuration can be found to suit the system environment. High speed data transfer. Data transfer rate between host is high speed, 3.0 MB/sec or 4.5 MB/sec. F1700B + F6425G/H, or F6427G/H, or F6429G/H has to be sold as a subsystem.
Figure 3(b) Extraction by document frequency property.
Words with the highest document frequency:</Paragraph>
    <Paragraph position="3"> table, page, m3, contents, width, weight, temperature, power consumption, KVA, height, heat dissipation, frequency, dimension, depth,</Paragraph>
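The two goal sources used for Figure 3 (words with high tf idf scores and words with high document frequency) can be illustrated with a small Python sketch. The paper does not state its weighting formula, so the common form tf * log(N / df) is an assumption, as are the whitespace tokenization, the alphabetical tie-breaking, and the names `goal_words`, `docs`, `target`, and `top_n`. The sample documents are invented for illustration.

```python
import math
from collections import Counter


def goal_words(docs, target, top_n=3):
    """Return (distinctive, common) goal words for docs[target]."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each word.
    df = Counter(word for words in tokenized for word in set(words))
    # Term frequency within the target document only.
    tf = Counter(tokenized[target])
    n_docs = len(docs)
    tfidf = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    # Sort by descending score, then alphabetically so ties are deterministic.
    distinctive = sorted(tfidf, key=lambda w: (-tfidf[w], w))[:top_n]
    common = sorted(df, key=lambda w: (-df[w], w))[:top_n]
    return distinctive, common


docs = [
    "width height weight four path configuration path",
    "width height weight cache buffer",
    "width height weight rotation speed",
]
distinctive, common = goal_words(docs, target=0)
```

Passing `distinctive` as the goal list would steer extraction toward what sets one document apart, while `common` would steer it toward the specification items shared by all documents.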
  </Section>
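The extraction rates reported in the experiments are ratios of summary size to source size, counted in characters for the Japanese texts and in words for the English texts. A minimal Python sketch of that ratio follows. The function name `extraction_rate`, the `unit` parameter, whitespace word splitting, and counting every character (including spaces) are assumptions made for illustration.

```python
def extraction_rate(summary: str, text: str, unit: str = "chars") -> float:
    """Ratio of summary size to source size, in characters or words."""
    if unit == "chars":
        # Assumption: every character counts, including whitespace.
        numerator, denominator = len(summary), len(text)
    elif unit == "words":
        # Assumption: words are whitespace-separated tokens.
        numerator, denominator = len(summary.split()), len(text.split())
    else:
        raise ValueError("unit must be 'chars' or 'words'")
    if denominator == 0:
        raise ValueError("source text is empty")
    return numerator / denominator


char_rate = extraction_rate("ab", "abcd")
word_rate = extraction_rate("stocks fell", "stocks fell sharply on monday", unit="words")
```

A rate of 1.0 corresponds to the 100% case mentioned for the title-keyword method, where the whole text is returned.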
  <Section position="9" start_page="49" end_page="49" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> This experiment only demonstrates a small part of goal-directed summarization. Many subjects still need to be tested. 1 Using the thesaurus: Most failures in processing news articles were caused by synonyms (such as 'corpse' and 'dead body', 'fishery' and 'fisherman') that failed to match. Most of these errors can be corrected by using the thesaurus.</Paragraph>
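The synonym failures described above could be reduced by mapping surface forms to a shared concept before matching. The Python sketch below shows one way this might look. The `THESAURUS` table is a hypothetical stand-in for a real thesaurus, the plain substring replacement is a simplification, and the names `concepts` and `related` are illustrative only.

```python
# Hypothetical thesaurus: surface form -> shared concept label.
THESAURUS = {
    "dead body": "corpse",
    "corpse": "corpse",
    "fisherman": "fishery",
    "fishery": "fishery",
}


def concepts(text: str) -> set[str]:
    """Return the words of `text` with thesaurus entries replaced by their concept."""
    normalized = text.lower()
    # Replace longer entries first so multi-word phrases win over single words.
    for form in sorted(THESAURUS, key=len, reverse=True):
        normalized = normalized.replace(form, THESAURUS[form])
    return set(normalized.split())


def related(goal: str, sentence: str) -> bool:
    """True when the goal and the sentence share at least one concept."""
    return not concepts(goal).isdisjoint(concepts(sentence))


match = related("corpse found", "Police discovered a dead body near the river")
```

With plain word matching, the headline word 'corpse' and the sentence phrase 'dead body' share nothing; after normalization both reduce to the same concept label.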
  </Section>
  <Section position="10" start_page="49" end_page="49" type="metho">
    <SectionTitle>
2 Processing the structured goals
</SectionTitle>
    <Paragraph position="0"> To summarize structured documents (such as manuals), the hierarchical structure of the sections and subsections can be used to create goals. These goals may control the inheritance of sub-goals to be satisfied in the substructure (such as the 'preface' section).</Paragraph>
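One possible reading of goal inheritance in a section hierarchy is that each subsection is summarized against its own title words plus those of its ancestors. The Python sketch below illustrates only that reading. The paper does not specify the mechanism, so the inheritance rule, the `SectionNode` type, and the function `collect_goals` are all assumptions, and the sample manual is invented.

```python
from dataclasses import dataclass, field


@dataclass
class SectionNode:
    """A section title with its subsections (hypothetical document model)."""
    title: str
    children: list["SectionNode"] = field(default_factory=list)


def collect_goals(node, inherited=frozenset()):
    """Map each section title to its own title words plus inherited ones."""
    goals = inherited | frozenset(node.title.lower().split())
    result = {node.title: set(goals)}
    for child in node.children:
        # Assumption: a subsection inherits every goal of its ancestors.
        result.update(collect_goals(child, goals))
    return result


manual = SectionNode(
    "Disk Unit",
    [SectionNode("Installation"), SectionNode("Power Supply")],
)
goals_by_section = collect_goals(manual)
```

A real system might instead restrict or override inherited goals for particular sections, as the remark about the 'preface' section suggests.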
  </Section>
  <Section position="11" start_page="49" end_page="49" type="metho">
    <SectionTitle>
3 Resolving the anaphoric expression
</SectionTitle>
    <Paragraph position="0"> Fewer problems occurred than in English sentence extraction, because the experiments were mostly on Japanese text, which contains fewer anaphoric expressions. However, person and company names in news articles are often abbreviated and shortened. Resolving these abbreviated and shortened expressions is needed to increase readability. 4 Control of the summary length: Because the main purpose of this system is to offer concise information for previewing document contents, the length of the output cannot be directly controlled. If the length needs to be varied, some methods to extend the results may be added as post-processing. A method for finding sentence relations (such as lexical cohesion) may be suitable for finding sentence chains with related topics.</Paragraph>
  </Section>
</Paper>