<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1059"> <Title>Producing Biographical Summaries: Combining Linguistic Knowledge with Corpus Statistics</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Producing biographical descriptions </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Preprocessing </SectionTitle> <Paragraph position="0"> Each document in the collection to be summarized is processed by a sentence tokenizer, the Alembic part-of-speech tagger (Aberdeen et al. 1995), the Nametag named entity tagger (Krupka 1995) restricted to people names, and the CASS parser (Abney 1996). The tagged sentences are further analyzed by a cascade of finite state machines that use patterns with lexical and syntactic information to identify constructions such as pre- and post-modifying appositive phrases, e.g., 'Presidential candidate George Bush' and 'Bush, the presidential candidate', and relative clauses, e.g., 'Senator ..., who is running for re-election this Fall'.</Paragraph> <Paragraph position="1"> These appositive phrases and relative clauses capture descriptive information which can correspond variously to a person's age, occupation, or some role a person played in an incident. 
We also extract sentential descriptions in the form of sentences whose (deep) subjects are person names.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Cross-document coreference </SectionTitle> <Paragraph position="0"> The classes of person names identified within each document are then merged across documents in the collection using a cross-document coreference program from the Automatic Content Extraction (ACE) research program (ACE 2000), which compares names across documents based on the similarity of a window of words surrounding each name, as well as specific rules covering the different ways of abbreviating a person's name (Mani and MacMillan 1995). The end result of this process is that, for each distinct person, the set of descriptions found for that person in the collection is grouped together.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Appositives 2.3.1 Introduction </SectionTitle> <Paragraph position="0"> The appositive phrases usually provide descriptions of attributes of a person. However, the preprocessing component described in Section 2.1 does produce errors in appositive extraction, which are filtered out by syntactic and semantic tests. The system also filters out redundant descriptions, both exact duplicates and similar ones. These filtering methods are discussed next.</Paragraph> <Paragraph position="1"> The appositive descriptions are first pruned to keep only one instance of any appositive phrase that is repeated, and to discard descriptions whose head does not appear to refer to a person. The latter test relies on a person typing program which uses semantic information from WordNet 1.6 (Miller 1995) to test whether the head of the description is a person. 
A given string is judged to be a person if at least a threshold percentage th1 (set to 35% in our work) of the senses of the string are descended from the synset for Person in WordNet. For example, this picks out 'counsel' as a person, but 'accessory' as a non-person.</Paragraph> <Paragraph position="2"> The pruning of erroneous and duplicate descriptions still leaves a large number of redundant appositive descriptions across documents. The system compares each pair of appositive descriptions of a person, merging them based on corpus frequencies of the description head stem, syntactic information, and semantic information based on the relationship between the heads in WordNet. The descriptions are merged if they have the same head stem, or if both heads have a common parent below Person in WordNet (in which case the head that is more frequent in the corpus is chosen as the merged head), or if one head subsumes the other under Person in WordNet (in which case the more general head is chosen).</Paragraph> <Paragraph position="3"> When the heads of descriptions are merged, the most frequent modifying phrase that appears in the corpus with the selected head is used. When a person ends up with more than one description, the modifiers are checked for duplication, with distinct modifiers being conjoined, so that 'Wisconsin lawmaker' and 'Wisconsin democrat' yield 'Wisconsin lawmaker and Democrat'.</Paragraph> <Paragraph position="4"> Prepositional phrase variants of descriptions are also merged here, so that 'chairman of the Budget Committee' and 'Budget Committee Chairman' are merged. Modifiers are dropped but their original order is preserved for the sake of fluency.</Paragraph> <Paragraph position="5"> The system then weights the appositives for inclusion in a summary. A person's appositives are grouped into equivalence classes; a single head noun is chosen for each class, and the class is given a weight based on the corpus frequency of that head noun. 
The system then picks descriptions in decreasing order of class weight until either the compression rate is achieved or the head noun is no longer in the top th2% most frequent descriptions (th2 is set to 90% in our work). Note that the summarizer refrains from choosing a subsuming term from WordNet that is not present in the descriptions, preferring not to risk inventing new descriptions and instead confining itself to cutting and pasting actual words used in the document.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Relative Clause Weighting </SectionTitle> <Paragraph position="0"> Once the relative clauses have been pruned for duplicates, the system weights the relative clauses for inclusion in a summary. The weighting is based on how often the relative clause's main verb is strongly associated with a (deep) subject in a large corpus, compared to its total number of appearances in the corpus. The idea here is to weed out 'promiscuous' verbs that are weakly associated with lots of subjects.</Paragraph> <Paragraph position="1"> The corpus statistics are derived from the Reuters portion of the North American News Text Corpus (called 'Reuters' in this paper), nearly three years of wire service news reports containing 105.5 million words.</Paragraph> <Paragraph position="2"> Examples of verbs in the Reuters corpus which show up as promiscuous include get, like, give, intend, add, want, be, do, hope, think, make, dream, have, say, see, tell, try. In a test, detailed below in Section 5.2, this feature fired 40 times in 184 trials.</Paragraph> <Paragraph position="3"> To compute strong associations, we proceed as follows. First, all subject-verb pairs are extracted from the Reuters corpus with a specially developed finite state grammar and the CASS parser. The head nouns and main verbs are reduced to their base forms by stripping plural endings from the nouns and tense markers from the verbs. 
Also included are 'gapped' subjects, such as the subject of 'run' in 'the student promised to run the experiment'; in this example, both pairs 'student-promise' and 'student-run' are recorded. Passive constructions are also recognized and the object of the by-PP following the verb is taken as the deep subject. Strength of association between subject i and verb j is measured using mutual information (Church and Hanks 1990):</Paragraph> <Paragraph position="5"> Here tf_ij is the maximum frequency of subject-verb pair ij in the Reuters corpus, tf_i is the frequency of subject head noun i in the corpus, tf_j is the frequency of verb j in the corpus, and N is the number of terms in the corpus. The associations are only scored for tf counts greater than 4, and a threshold th3 (set to log score > -21 in our work) is used to define a strong association.</Paragraph> <Paragraph position="6"> The relative clauses are thus filtered initially (Filter 1) by excluding those whose main verbs are highly promiscuous. Next, they are filtered (Filter 2) based on various syntactic features, as well as the number of proper names and pronouns. Finally, the relative clauses are scored conventionally (Filter 3) by summing the within-document relative term frequency of content terms in the clause (i.e., relative to the number of terms in the document), with an adjustment for sentence length (achieved by dividing by the total number of content terms in the clause).</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Sentential Descriptions </SectionTitle> <Paragraph position="0"> These descriptions are the relatively large set of sentences which have a person name as a (deep) subject. We filter them based on whether their main verb is strongly associated with any of the head nouns of the appositive descriptions found for that person name (Filter 4). 
The intuition here is that particular occupational roles will be strongly associated with particular verbs. For example, politicians vote and elect, executives resign and appoint, police arrest and shoot; so, a summary of information about a policeman may include an arresting and shooting event he was involved with. (The verb-occupation association isn't manifest in relative clauses because the latter are too few in number.)</Paragraph> <Paragraph position="1"> A portion of the results of doing this is shown in Table 1. The results for executive are somewhat loose, whereas for politician and police, the associations seem tighter, with the associated verbs meeting our intuitions.</Paragraph> <Paragraph position="2"> All sentences which survive Filter 4 are extracted and then scored, just as relative clauses are, using Filter 1 and Filter 3. Filter 4 alone provides a high degree of compression; for example, it reduces a total of 16,000 words in the combined sentences that include Vernon Jordan's name in the Clinton corpus to 578 words in 12 sentences; sentences up to the target length can then be selected from these based on scores from Filter 1 and then Filter 3.</Paragraph> <Paragraph position="3"> However, there are several difficulties with these sentences. First, we are missing many of them because we do not as yet handle pronominal subjects which are coreferential with the proper name. Second, these sentences contain many dangling anaphors, which will need to be resolved. Third, there may be redundancy between the sentential descriptions, on one hand, and the appositive and relative clause descriptions, on the other. Finally, the entire sentence is extracted, including any subordinate clauses, although we are working on refinements involving sentence compaction. 
As a result, we believe that more work is required before the sentential descriptions can be fully integrated into the biographies.</Paragraph> <Paragraph position="4"> [Table 1: verbs associated with particular classes of people (executive, police, politician) in the Reuters corpus, given as negative log scores; table body not recovered.]</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> Methods for evaluating text summarization can be broadly classified into two categories (Sparck-Jones and Galliers 1996). The first, an extrinsic evaluation, tests the summarization based on how it affects the completion of some other task, such as comprehension, e.g., (Morris et al. 1992), or relevance assessment (Brandow et al. 1995) (Jing et al. 1998) (Tombros and Sanderson 1998) (Mani et al. 1998). An intrinsic evaluation, on the other hand, can involve assessing the coherence of the summary (Brandow et al. 1995) (Saggion and Lapalme 2000).</Paragraph> <Paragraph position="1"> Another intrinsic approach involves assessing the informativeness of the summary, based on the extent to which key information from the source is preserved in the system summary at different levels of compression (Paice and Jones 1993), (Brandow et al. 1995). Informativeness can also be assessed in terms of how much information in an ideal (or 'reference') summary is preserved in the system summary, where the summaries being compared are at similar levels of compression (Edmundson 1969).</Paragraph> <Paragraph position="2"> We have carried out a number of intrinsic evaluations of the accuracy of components involved in the summarization process, as well as the succinctness, coherence and informativeness of the descriptions. 
As this is a multi-document summarization (MDS) system, we also evaluate the non-redundancy of the descriptions, since similar information may be repeated across documents.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Person Typing Evaluation </SectionTitle> <Paragraph position="0"> This component evaluation tests how accurately the tagger can identify whether a head noun in a description is appropriate as a person description. The evaluation uses the WordNet 1.6 SEMCOR semantic concordance, which has files from the Brown corpus whose words have semantic tags (created by WordNet's creators) indicating WordNet sense numbers. Evaluation on 6,000 sentences with almost 42,000 nouns compared people tags generated by the program with SEMCOR tags, and gave the following results: right = 41,555, wrong = 1,298, missing = 0, yielding Precision, Recall, and F-Measure of 0.97.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Relative Clause Extraction Evaluation </SectionTitle> <Paragraph position="0"> This component evaluation tests the well-formedness of the extracted relative clauses. For this evaluation, we used the Clinton corpus. The relative clause is judged correct if it has the right extent and the correct coreference index indicating which person the relative clause description pertains to. The judgments are based on 36 instances of relative clauses from 22 documents. The results show 28 correct relative clauses found, plus 4 spurious finds, yielding Precision of 0.87, Recall of 0.78, and F-measure of 0.82. Although the sample is small, the results are very promising.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Appositive Merging Evaluation </SectionTitle> <Paragraph position="0"> This component evaluation tests the system's ability to accurately merge appositive descriptions. 
The score is based on an automatic comparison of the system's merge of system-generated appositive descriptions against a human merge of them. We took all the names that were identified in the Clinton corpus and ran the system on each document in the corpus.</Paragraph> <Paragraph position="1"> We took the raw descriptions that the system produced before merging, and wrote a brief description by hand for each person who had two or more raw descriptions. The hand-written descriptions were written without any reference to the automatically merged descriptions or to the underlying source material.</Paragraph> <Paragraph position="2"> The hand-written descriptions were then compared with the final output of the system (i.e., the result after merging). The comparison was automatic, measuring similarity among vectors of content words (i.e., stop words such as articles and prepositions were removed).</Paragraph> <Paragraph position="3"> Here is an example to further clarify the strict standard of the automatic evaluation (words scored correct are underlined): System: E. Lawrence Barcella is a Washington lawyer, Washington white-collar defense lawyer, former federal prosecutor. Thus, although 'lawyer' and 'prosecutor' are synonymous in WordNet, the automatic scorer doesn't know that, and so 'prosecutor' is penalized as an extra word.</Paragraph> <Paragraph position="4"> The evaluation was carried out over the entire Clinton corpus, with descriptions compared for 226 people who had more than one description. 65 out of the 226 descriptions were Correct (28%), with a further 32 cases being semantically correct 'obviously similar' substitutions which the automatic scorer missed (giving an adjusted accuracy of 42%). As a baseline, a merging program which performed just a string match scored 21% accuracy. 
The major problem areas were errors in coreference (e.g., Clinton family members being put in the same coreference class), lack of good descriptions for famous people (news articles tend not to introduce such people), and parsing limitations (e.g., Senator Clinton being parsed erroneously as an NP in The Senator Clinton disappointed...). Ultimately, of course, domain-independent systems like ours are limited semantically in merging by the lack of world knowledge, e.g., knowing that Starr's chief lieutenant can be a prosecutor.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Description Coherence and Informativeness Evaluation </SectionTitle> <Paragraph position="0"> To assess the coherence and informativeness of the relative clause descriptions (footnote 3), we asked 4 subjects who were unaware of our research to judge descriptions generated by our system from the Clinton corpus. For each relative clause description, the subject was given the description, a person name to whom that description pertained, and a capsule description consisting of merged appositives created by the system. The subject was asked to assess (a) the coherence of the relative clause description in terms of its succinctness (was it a good length?) and its comprehensibility (was it understandable by itself or in conjunction with the capsule?), and (b) its informativeness in terms of whether it was an accurate description (does it conflict with the capsule or with what you know?) and whether it was non-redundant (is it distinct or does it repeat what is in the capsule?).</Paragraph> <Paragraph position="1"> The subjects marked 87% of the descriptions as accurate, 96% as non-redundant, and 65% as coherent. 
A separate 3-subject inter-annotator agreement study, where all subjects judged the same 46 decisions, showed that all three subjects agreed on 82% of the accuracy decisions, 85% of the non-redundancy decisions and 82% of the coherence decisions.</Paragraph> <Paragraph position="2"> Footnote 3: Appositives are not assessed in this way, as few errors of coherence or informativeness were noticed in the appositive extraction.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Learning to Produce Coherent Descriptions </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Overview </SectionTitle> <Paragraph position="0"> To learn rules for coherence for extracting sentential descriptions, we used the examples and judgments we obtained for coherence in the evaluation of relative clause descriptions in Section 4.5. Our focus was on features that might relate to content and specificity: low verb promiscuity scores, presence of proper names, pronouns, definite and indefinite clauses. The entire list is as follows:
badend: boolean. is there an impossible end, indicating a bad extraction ( ... Mr.)?
bestverb: continuous. use the verb promiscuity threshold th3 to find the score of the most non-promiscuous verb in the clause
classes (label): boolean. accept the clause, reject the clause
count pronouns: continuous. number of personal pronouns
count proper: continuous. number of nouns tagged as NP
hasobject: continuous. how many NPs follow the verb?
haspeople: continuous. how many &quot;name&quot; constituents are found?
has possessive: continuous. how many possessive pronouns are there?
hasquote: boolean. is there a quotation?
hassubc: boolean. is there a subordinate clause?
isdefinite: continuous. how many definite NPs are there?
repeater: boolean. is the subject's name repeated, or is there no subject?
timeref: boolean. is there a time reference?
withquit: is there a quit or resign verb?
withsay: boolean. is there a say verb in the clause?</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Accuracy of Learnt Descriptions </SectionTitle> <Paragraph position="0"> Table 2 provides information on different learning methods. The results are for a ten-fold cross-validation on 165 training vectors and 19 test vectors, measured in terms of Predictive Accuracy (percentage of test vectors correctly classified).</Paragraph> <Paragraph position="1"> The best learning methods are comparable with rules created by hand by one of the authors (Barry's rules). In the learners, the bestverb feature is used heavily in tests for the negative class, whereas in Barry's rules it occurs in tests for the positive class.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> Our work on measuring subject-verb associations has a different focus from previous work. Lee and Pereira (1999), for example, examined verb-object pairs. Their focus was on a method that would improve techniques for gathering statistics where there are a multitude of sparse examples. We focus on using the verbs for the specific purpose of finding associations that we have previously observed to be strong, with a view towards selecting a clause or sentence, rather than just measuring similarity. We also try to strengthen the numbers by dealing with 'gapped' constructions.</Paragraph> <Paragraph position="1"> While there has been plenty of work on extracting named entities and relations between them, e.g., (MUC-7 1998), the main previous body of work on biographical summarization is that of (Radev and McKeown 1998). 
The fundamental differences in our work are as follows: (1) We extract not only appositive phrases, but also clauses at large based on corpus statistics; (2) We make heavy use of coreference, whereas they don't use coreference at all; (3) We focus on generating succinct descriptions by removing redundancy and merging, whereas they categorize descriptions using WordNet, without a focus on succinctness.</Paragraph> </Section> </Paper>