<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1009"> <Title>Trainable, Scalable Summarization Using Robust NLP and Machine Learning*</Title> <Section position="3" start_page="0" end_page="62" type="metho"> <SectionTitle> 2 Extracting Features </SectionTitle> <Paragraph position="0"> In this section, we describe how the system counts linguistically-motivated, automatically-derived words and multi-words in calculating worthiness for summarization. We show how the system uses an external corpus to incorporate domain knowledge in contrast to text-only statistics. Finally, we explain how we attempt to increase the cohesiveness of our summaries by using name aliasing, WordNet synonyms, and morphological variants.</Paragraph> <Section position="1" start_page="0" end_page="62" type="sub_section"> <SectionTitle> 2.1 Defining Single and Multi-word Terms </SectionTitle> <Paragraph position="0"> Frequency-based summarization systems typically use a single word string as the unit for counting frequency. Though robust, such a method ignores the semantic content of words and their potential membership in multi-word phrases, and may introduce noise in frequency counting by treating the same strings uniformly regardless of context.</Paragraph> <Paragraph position="1"> Our approach, similar to (Tzoukermann, Klavans, and Jacquemin, 1997), is to apply NLP tools to extract multi-word phrases automatically with high accuracy and use them as the basic unit in the summarization process, including frequency calculation. Our system uses both text statistics (term frequency, or tf) and corpus statistics (inverse document frequency, or idf) (Salton and McGill, 1983) to derive signature words as one of the summarization features. If single words were the sole basis of counting for our summarization application, noise would be introduced both in term frequency and inverse document frequency.</Paragraph> <Paragraph position="2"> First, we extracted two-word noun collocations by pre-processing about 800 MB of L.A. Times/Washington Post newspaper articles with a POS tagger and deriving the collocations using mutual information. Secondly, we employed SRA's NameTag TM system to tag the aforementioned corpus with names of people, entities, and places, and derived a baseline database for tf*idf calculation. Multi-word names (e.g., &quot;Bill Clinton&quot;) are treated as single tokens and disambiguated by semantic type in the database.</Paragraph> </Section> <Section position="2" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 2.2 Acquiring Knowledge of the Domain </SectionTitle> <Paragraph position="0"> Knowledge-based summarization approaches often have difficulty acquiring enough domain knowledge to create conceptual representations for a text. We have automated the acquisition of some domain knowledge from a large corpus by calculating idf values for selecting signature words, deriving collocations statistically, and creating a word association index (Jing and Croft, 1994).</Paragraph> </Section>
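As a concrete illustration of the corpus statistics used in Sections 2.1 and 2.2, the following minimal Python sketch scores adjacent noun-noun pairs by pointwise mutual information over a POS-tagged corpus and computes idf values over a document collection. It is not the authors' code; the function names, the NN tag convention, and the min_count threshold are assumptions made for the example.

import math
from collections import Counter

def noun_bigram_collocations(tagged_sents, min_count=5):
    """Score adjacent noun-noun pairs by pointwise mutual information (PMI)."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sent in tagged_sents:                      # sent: list of (word, pos_tag)
        toks = [(w.lower(), pos) for w, pos in sent]
        unigrams.update(w for w, _ in toks)
        total += len(toks)
        for (w1, p1), (w2, p2) in zip(toks, toks[1:]):
            if p1.startswith("NN") and p2.startswith("NN"):
                bigrams[(w1, w2)] += 1
    scores = {}
    for (w1, w2), n in bigrams.items():
        if n >= min_count:                         # ignore rare pairs
            p_pair = n / total
            p_indep = (unigrams[w1] / total) * (unigrams[w2] / total)
            scores[(w1, w2)] = math.log2(p_pair / p_indep)
    return sorted(scores.items(), key=lambda kv: -kv[1])

def idf_table(documents):
    """idf(term) = log(N / document_frequency(term)) over tokenized documents."""
    df, n_docs = Counter(), 0
    for doc in documents:
        n_docs += 1
        df.update(set(doc))
    return {term: math.log(n_docs / freq) for term, freq in df.items()}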
<Section position="3" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 2.3 Recognizing Sources of Discourse Knowledge through Lexical Cohesion </SectionTitle> <Paragraph position="0"> Our approach to acquiring sources of discourse knowledge is much shallower than those of discourse-based approaches. For a target text for summarization, we tried to capture lexical cohesion of signature words through name aliasing with the NameTag tool, synonyms with WordNet, and morphological variants with morphological pre-processing.</Paragraph> </Section> </Section> <Section position="4" start_page="62" end_page="64" type="metho"> <SectionTitle> 3 Combining Features </SectionTitle> <Paragraph position="0"> We experimented with combining summarization features in two stages. In the first, batch stage, we experimented to identify which features are most effective for signature words. In the second stage, we took the best combination of features determined by the first stage and used it to define &quot;high scoring signature words.&quot; Then, we trained DimSum on the high-score signature word feature, along with conventional length and positional information, to determine which training features are most useful in rendering useful summaries. We also experimented with the effect of training and of different corpus types.</Paragraph> <Section position="1" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 3.1 Batch Feature Combiner 3.1.1 Method </SectionTitle> <Paragraph position="0"> In DimSum, sentences are selected for a summary based upon a score calculated from the different combinations of signature word features and their expansion with the discourse features of aliases, synonyms, and morphological variants. Every token in a document is assigned a score based on its tf*idf value. The token score is used, in turn, to calculate the score of each sentence in the document. The score of a sentence is calculated as the average of the scores of the tokens contained in that sentence.</Paragraph> <Paragraph position="1"> To obtain the best combination of features for sentence extraction, we experimented extensively.</Paragraph> <Paragraph position="2"> The summarizer allows us to experiment with both how we count and what we count for both inverse document frequency and term frequency values. Because different baseline databases can affect idf values, we examined the effect on summarization of multiple baseline databases based upon multiple definitions of the signature words. Similarly, the discourse features (i.e., synonyms, morphological variants, or name aliases) for signature words can affect tf values. Since these discourse features boost the term frequency score within a text when they are treated as variants of signature words, we also examined their impact upon summarization.</Paragraph> <Paragraph position="3"> After every sentence is assigned a score, the top n highest scoring sentences are chosen as a summary of the content of the document. Currently, the DimSum system chooses a number of sentences equal to a power k (between zero and one) of the total number of sentences. This scheme has an advantage over choosing a given percentage of document size as it yields more information for longer documents while keeping summary size manageable.</Paragraph>
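For illustration only, the sentence-scoring and length-selection scheme just described might be realized along the following lines. This is a minimal sketch, not the DimSum implementation; the default k=0.5 and the assumption that token_score already conflates aliases, synonyms, and morphological variants into a single entry are ours.

def summarize(sentences, token_score, k=0.5):
    """Rank sentences by the average score of their tokens and keep the top n,
    where n = (number of sentences) ** k, with k between zero and one."""
    def sentence_score(tokens):
        scores = [token_score.get(t, 0.0) for t in tokens]
        return sum(scores) / len(scores) if scores else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(sentences[i]),
                    reverse=True)
    n = max(1, round(len(sentences) ** k))
    chosen = sorted(ranked[:n])               # restore original document order
    return [sentences[i] for i in chosen]

# token_score maps each token (after name aliasing, WordNet synonym grouping,
# and morphological normalization) to its tf*idf value for the document.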
<Paragraph position="4"> Over 135,000 combinations of the above parameters were run using 70 texts from the L.A. Times/Washington Post. We evaluated the summary results against the human-generated extracts for these 70 texts in terms of F-Measures. As the results in Table 1 indicate, name recognition, alias recognition, and WordNet (for synonyms) all make positive contributions to the system's summary performance.</Paragraph> <Paragraph position="5"> The most significant result of the batch tests was the dramatic improvement in performance from withholding person names from the feature combination algorithm. The most probable reason for this is that personal names usually have high idf values, but they are generally not good indicators of the topics of articles. Even when names of people are associated with certain key events, documents are not usually about these people. Not only do personal names appear to be very misleading in terms of signature word identification, they also tend to mask synonym group performance. WordNet synonyms appear to be effective only when names are suppressed.</Paragraph> </Section> <Section position="2" start_page="62" end_page="64" type="sub_section"> <SectionTitle> 3.2 Trainable Feature Combiner 3.2.1 Method </SectionTitle> <Paragraph position="0"> With our second method, we developed a trainable feature combiner using Bayes' rule. Once we had defined the best feature combination for high scoring tf*idf signature words in a sentence in the first round, we tested the inclusion of commonly acknowledged positional and length information. From manually extracted summaries, the system automatically learns to combine the following extracted features for summarization: sentence length, inclusion in the high scoring tf*idf signature word set, position of the sentence in the document, and position of the sentence in the paragraph (initial, medial, final). Inclusion in the high scoring tf*idf signature word set was determined by a variable system parameter (identical to that used in the pre-trainable version of the system). Unlike Kupiec et al.'s experiment, we did not use the cue word feature. Possible values of the paragraph feature are identical to how Kupiec et al. used this feature, but applied to all paragraphs because of the short length of the newspaper articles. We performed two different rounds of experiments, the first with newspaper sets and the second with a broader set from the TREC-5 collection (Harman and Voorhees, 1996). In both rounds we experimented with different combinations of these training features. In the first round, we trained our system on 70 texts from the L.A. Times/Washington Post (latwp-dev1) and then tested it against 50 new texts from the L.A. Times/Washington Post (latwp-test1) and 50 texts from the Philadelphia Inquirer (pi-test1). The results are shown in Table 2. In both cases, we found that the effects of training increased system scores by 10% F-Measure or more. Our results are similar to those of Mitra (Mitra, Singhal, and Buckley, 1997), but our system with the trainable combiner was able to outperform the lead sentence summaries.</Paragraph>
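The Bayes-rule combination of discrete sentence features described above follows the general shape of Kupiec et al.'s trainable extractor. The sketch below shows that shape only and is not the DimSum implementation; the feature encoding, the smoothing constant, and all names are assumptions made for the example.

from collections import Counter, defaultdict

def train_combiner(labeled_sentences):
    """labeled_sentences: (features, in_summary) pairs, where features is a dict of
    discrete values (e.g. length bucket, high-score-word flag, document position,
    paragraph position) and in_summary marks sentences from the manual extracts."""
    n_total = len(labeled_sentences)
    n_summary = sum(1 for _, y in labeled_sentences if y)
    f_summary = defaultdict(Counter)   # feature-value counts in extract sentences
    f_all = defaultdict(Counter)       # feature-value counts in all sentences
    for feats, y in labeled_sentences:
        for name, value in feats.items():
            f_all[name][value] += 1
            if y:
                f_summary[name][value] += 1
    return {"n": n_total, "n_sum": n_summary, "f_sum": f_summary, "f_all": f_all}

def summary_score(model, feats, smooth=1.0):
    """P(sentence in summary | features), up to a constant, assuming feature
    independence: prior * product over features of P(F_j | in summary) / P(F_j)."""
    score = model["n_sum"] / model["n"]
    for name, value in feats.items():
        p_given_summary = (model["f_sum"][name][value] + smooth) / (model["n_sum"] + smooth)
        p_overall = (model["f_all"][name][value] + smooth) / (model["n"] + smooth)
        score *= p_given_summary / p_overall
    return score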
<Paragraph position="2"> Table 3 shows the results of using different training features on the 70 texts from the L.A. Times/Washington Post (latwp-dev1). It is evident that positional information is the most valuable, while the sentence length feature introduces the most noise. High scoring signature word sentences contribute, especially in conjunction with the positional information and the paragraph feature.</Paragraph> <Paragraph position="3"> High Score refers to using a tf*idf metric with WordNet synonyms and name aliases enabled, person names suppressed, but all other name types active.</Paragraph> <Paragraph position="4"> The second round of experiments was conducted using 100 training and 100 test texts for each of six sources from the TREC-5 corpora (i.e., Associated Press, Congressional Record, Federal Register, Financial Times, Wall Street Journal, and Ziff).</Paragraph> <Paragraph position="5"> Each corpus was trained and tested on a large baseline database created by using multiple text sources. Results on the test sets are shown in Table 4. The discrepancy in results among data sources suggests that summarization may not be equally viable for all data types. This squares with results reported in (Nomoto and Matsumoto, 1997), where learned attributes varied in effectiveness by text type.</Paragraph> </Section> </Section> <Section position="5" start_page="64" end_page="64" type="metho"> <SectionTitle> 4 Task-based Evaluation </SectionTitle> <Paragraph position="0"> The goal of our task-based evaluation was to determine whether it was possible to retrieve automatically generated summaries with precision similar to that of retrieving the full texts. Underpinning this was the intention to examine whether a generic summary could substitute for a full-text document, given that a common application for summarization is assumed to be browsing/scanning summarized versions of retrieved documents. The assumption is that summaries help to accelerate the browsing/scanning without information loss.</Paragraph> <Paragraph position="1"> Miike et al. (1994) described preliminary experiments comparing browsing of original full texts with browsing of dynamically generated abstracts and reported that abstract browsing was about 80% of the original browsing function, with precision and recall about the same. There is also an assumption that summaries, as encapsulated views of texts, may actually improve retrieval effectiveness. (Brandow, Mitze, and Rau, 1995) reported that using programmatically generated summaries improved precision significantly, but with a dramatic loss in recall.</Paragraph> <Paragraph position="2"> We identified 30 TREC-5 topics, classified by the easy/hard retrieval schema of (Voorhees and Harman, 1996): five as hard, five as easy, and the remaining twenty randomly selected. In our evaluation, INQUERY (Allan et al., 1996) retrieved and ranked 50 documents for each of these 30 TREC-5 topics.</Paragraph> <Paragraph position="3"> Our summary system summarized these 1500 texts at 10%, 20%, and 30% reduction, and at what our system considers the BEST reduction. For each level of reduction, a new index database was built for INQUERY, replacing the full texts with summaries.</Paragraph> <Paragraph position="4"> The 30 queries were run against the new database, retrieving 10,000 documents per query. At this point, some of the summarized versions were dropped because these documents no longer ranked in the top 10,000 per topic, as shown in Table 5. For each query, all results except for the summarized documents were thrown away. New rankings were computed with the remaining summarized documents.</Paragraph>
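To make the re-ranking and comparison step concrete, here is a small illustrative sketch, not the actual evaluation harness: it keeps only the originally summarized documents from a summary-index run, preserves their new order, and compares precision at a fixed cutoff against the full-text baseline. The document IDs, scores, relevance judgments, and cutoff are invented for the example.

def precision_at(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def rerank_summarized(summary_run, summarized_docs):
    """Drop documents that were not among the originally summarized set and
    keep the ranking order induced by the summary index."""
    keep = set(summarized_docs)
    return [doc for doc, _score in summary_run if doc in keep]

# One hypothetical topic: full-text baseline ranking vs. summary-index run.
baseline = ["d3", "d7", "d1", "d9", "d2"]                      # INQ.base order
summary_run = [("d7", 4.1), ("d2", 3.9), ("d3", 3.2), ("d5", 2.8), ("d1", 2.0)]
relevant = {"d3", "d7"}
reranked = rerank_summarized(summary_run, summarized_docs=baseline)
print(precision_at(baseline, relevant, 5), precision_at(reranked, relevant, 5))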
<Paragraph position="5"> Precision for the INQUERY baseline (INQ.base) was then compared against each level of reduction.</Paragraph> <Paragraph position="6"> Table 6 shows that at each level of reduction the overall precision dropped for the summarized versions. With more reduction, the drop was more dramatic. However, the BEST summary version performed better than the percentage methods.</Paragraph> <Paragraph position="7"> We examined in more detail document-level averages for five &quot;easy&quot; topics for which the INQUERY system had retrieved a high number of texts. Table 7 reveals that for topics with a high INQUERY retrieval rate the precision is comparable. We posit that when queries have a high number of relevant documents retrieved, the summary system is more likely to reduce information rather than lose information. Query topics with a high retrieval rate are likely to have many documents on the subject matter, and therefore the summary just reduces the information, possibly alleviating the browsing/scanning load.</Paragraph> <Paragraph position="8"> We are currently examining documents lost in the re-ranking process and are cautious in interpreting results because of the difficulty of closely correlating the term selection and ranking algorithms of automatic IR systems with human performance. Our experimental results do indicate, however, that generic summarization is more useful when there are many documents of interest to the user and the user wants to scan summaries and weed out less relevant documents quickly.</Paragraph> </Section> </Paper>