<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0902"> <Title>Extracting and Evaluating General World Knowledge from the Brown Corpus</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction: deriving general knowledge from texts </SectionTitle> <Paragraph position="0"> We have been exploring a new method of gaining general world knowledge from texts, including fiction. The method does not depend on full or exact interpretation, but rather tries to glean general facts from particulars by combined processes of compositional interpretation and abstraction. For example, consider a sentence such as the following from the Brown corpus (Kucera and Francis, 1967): Rilly or Glendora had entered her room while she slept, bringing back her washed clothes.</Paragraph> <Paragraph position="1"> From the clauses and patterns of modification of this sentence, we can glean that an individual may enter a room, a female individual may sleep, and clothes may be washed. In fact, given the following Treebank bracketing, our programs produce the output shown:</Paragraph> <Paragraph position="3"/> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> A NAMED-ENTITY MAY ENTER A ROOM. A FEMALE-INDIVIDUAL MAY HAVE A ROOM. A FEMALE-INDIVIDUAL MAY SLEEP. A FEMALE-INDIVIDUAL MAY HAVE CLOTHES. CLOTHES CAN BE WASHED. </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> The results are produced as logical forms (the last five lines above - see Schubert, 2002, for some details), from which the English glosses are generated automatically.</Paragraph> <Paragraph position="3"> Our work so far has focused on data in the Penn Treebank (Marcus et al., 1993), particularly the Brown corpus and some examples from the Wall Street Journal corpus.</Paragraph> <Paragraph position="4"> The advantage is that Treebank annotations allow us to postpone the challenges of reasonably accurate parsing, though we will soon be experimenting with &quot;industrial strength&quot; parsers on unannotated texts.</Paragraph> <Paragraph position="5"> We reported some specifics of our approach and some preliminary results in (Schubert, 2002). Since then we have refined our extraction methods to the point where we can reliably apply them to the Treebank corpora, on average extracting more than 2 generalized propositions per sentence. Applying these methods to the Brown corpus, we have extracted 137,510 propositions, of which 117,326 are distinct. Some additional miscellaneous examples are &quot;A PERSON MAY BELIEVE A PROPOSITION&quot;, &quot;BILLS</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> MAY BE APPROVED BY COMMITTEES&quot;, &quot;A US-STATE MAY HAVE HIGH SCHOOLS&quot;, &quot;CHILDREN MAY LIVE WITH RELATIVES&quot;, &quot;A COMEDY MAY BE DELIGHTFUL&quot;, &quot;A BOOK MAY BE WRITE-ED (i.e., written) BY AN AGENT&quot;, &quot;A FEMALE-INDIVIDUAL MAY HAVE A SPOUSE&quot;, &quot;AN ARTERY CAN BE THICKENED&quot;, &quot;A HOUSE MAY HAVE WINDOWS&quot;, etc.
</SectionTitle> <Paragraph position="0"> The programs that produce these results consist of (1) a Treebank preprocessor that makes various modifications to Treebank trees so as to facilitate the extraction of semantic information (for instance, distinguishing different kinds of &quot;SBAR&quot;, such as S-THAT and S-ALTHOUGH, and identifying certain noun phrases and prepositional phrases, such as &quot;next Friday&quot;, as temporal); (2) a pattern matcher that uses a type of regular-expression language to identify particular kinds of phrase structure patterns (e.g., verb + complement patterns, with possible inserted adverbials or other material); (3) a semantic pattern extraction routine that associates particular semantic patterns with particular phrase structure patterns and recursively instantiates and collects such patterns for the preprocessed tree, in bottom-up fashion; (4) abstraction routines that abstract away modifiers and other &quot;type-preserving operators&quot;, before semantic patterns are constructed at the next-higher level in the tree (for instance, stripping the interpreted modifier &quot;washed&quot; from the interpreted noun phrase &quot;her washed clothes&quot;); (5) routines for deriving propositional patterns from the resulting miscellaneous semantic patterns, and rendering them in a simple, approximate English form; and (6) heuristic routines for filtering out many ill-formed or vacuous propositions. In addition, semantic interpretation of individual words involves some simple morphological analysis, for instance to allow the interpretation of (VBD SLEPT) in terms of a predicate SLEEP[V]. (A toy illustration of this bottom-up extraction and abstraction process is sketched below.)</Paragraph> <Paragraph position="1"> In (Schubert, 2002) we made some comparisons between our project and earlier work in knowledge extraction (e.g., (muc, 1993; muc, 1995; muc, 1998; Berland and Charniak, 1999; Clark and Weir, 1999; Hearst, 1998; Riloff and Jones, 1999)) and in discovery of selectional preferences (e.g., (Agirre and Martinez, 2001; Grishman and Sterling, 1992; Resnik, 1992; Resnik, 1993; Zernik, 1992; Zernik and Jacobs, 1990)). Reiterating briefly, we note that knowledge extraction work has generally employed carefully tuned extraction patterns to locate and extract some predetermined, specific kinds of facts; our goal, instead, is to process every phrase and sentence that is encountered, abstracting from it miscellaneous general knowledge whenever possible. Methods for discovering selectional preferences do seek out conventional patterns of verb-argument combination, but tend to &quot;lose the connection&quot; between argument types (e.g., that a road may carry traffic, a newspaper may carry a story, but a road is unlikely to carry a story); in any event, they have not led so far to amassment of data interpretable as general world knowledge.</Paragraph> <Paragraph position="2"> Our concern in this paper is with the evaluation of the results we currently obtain for the Brown corpus. The overall goal of this evaluation is to gain some idea of what proportion of the extracted propositions are likely to be credible as world knowledge. The ultimate test of this will of course be systems (e.g., QA systems) that use such extracted propositions as part of their knowledge base, but such a test is not immediately feasible.
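To make stages (1)-(6) above more concrete, the following is a minimal sketch, in Python, of the same bottom-up idea: match a simple verb-plus-object pattern in a Treebank-style bracketing, take noun-phrase heads while stripping modifiers, abstract the heads to classes, and verbalize the result. This is only an illustration under assumed representations - the tree encoding, the abstraction table, the lemma table, and all function names are invented here and are not the KNEXT implementation:

# Toy bottom-up extractor in the spirit of stages (1)-(6); all names are hypothetical.
def label(t):
    return t[0] if isinstance(t, list) else t

def children(t):
    return t[1:] if isinstance(t, list) else []

def head_noun(np):
    # Take the last noun of an NP, ignoring modifiers such as "washed" (cf. stage 4).
    noun = None
    for child in children(np):
        if label(child) in ("NN", "NNS", "NNP"):
            noun = child[1].lower()
    return noun

CLASS_LEXICON = {"glendora": "NAMED-ENTITY", "she": "FEMALE-INDIVIDUAL",
                 "room": "ROOM", "clothes": "CLOTHES"}   # invented abstraction table
LEMMAS = {"entered": "enter", "slept": "sleep"}           # stand-in for morphological analysis

def abstract(word):
    # Unknown heads yield None and the pattern is simply dropped (a crude stage-6 filter).
    return CLASS_LEXICON.get(word) if word else None

def extract(tree, found=None):
    # Stages 2, 3 and 5: recursively match subject-verb-object clauses and
    # collect abstracted propositional patterns, bottom-up.
    if found is None:
        found = []
    if label(tree) == "S":
        subj = next((c for c in children(tree) if label(c) == "NP"), None)
        vp = next((c for c in children(tree) if label(c) == "VP"), None)
        if subj is not None and vp is not None:
            verb = next((c[1] for c in children(vp)
                         if label(c) in ("VB", "VBD", "VBZ")), None)
            obj = next((c for c in children(vp) if label(c) == "NP"), None)
            if verb and obj is not None:
                s, o = abstract(head_noun(subj)), abstract(head_noun(obj))
                if s and o:
                    base = LEMMAS.get(verb.lower(), verb.lower())
                    found.append(f"A {s} MAY {base.upper()} A {o}.")
    for child in children(tree):
        if isinstance(child, list):
            extract(child, found)
    return found

# Hand-simplified bracketing of part of the example sentence:
tree = ["S",
        ["NP", ["NNP", "Glendora"]],
        ["VP", ["VBD", "entered"],
               ["NP", ["PRP$", "her"], ["VBN", "washed"], ["NN", "room"]]]]
print(extract(tree))   # -> ['A NAMED-ENTITY MAY ENTER A ROOM.']

The actual system of course handles a far wider range of phrase-structure patterns and applies the filtering heuristics described above; the sketch only shows how a single generalized proposition can fall out of one clause.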
In the meantime it certainly seems worthwhile to evaluate the outputs subjectively with multiple judges, to determine if this approach holds any promise at all as a knowledge acquisition technique.</Paragraph> <Paragraph position="3"> In the following sections we describe the judging method we have developed, and two experiments based on this method, one aimed at determining whether &quot;literary style makes a difference&quot; to the quality of outputs obtained, and one aimed at assessing the overall success rate of the extraction method, in the estimation of several judges.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Judging the output propositions </SectionTitle> <Paragraph position="0"> We have created judging software that can be used by the researchers and other judges to assess the quality and correctness of the extracted information. The current scheme evolved from a series of trial versions, starting with a 3-tiered judging scheme, but this turned out to be difficult to use, and yielded poor inter-judge agreement. We ultimately converged on a simplified scheme, for which ease of use and inter-judge agreement are significantly better.</Paragraph> <Paragraph position="1"> The following are the instructions to a judge using the judger program in its current form: Welcome to the sentence evaluator for the KNEXT knowledge extraction program. Thank you for your participation. You will be asked to evaluate a series of sentences based on such criteria as comprehensibility and truth. Do your best to give accurate responses. The judgement categories are selected to try to ensure that each sentence fits best in one and only one category. Help is available for each menu item, along with example sentences, by selecting 'h'; PLEASE consult this if this is your first time using this program even if you feel confident of your choice. There is also a tutorial available, which should also be done if this is your first time. If you find it hard to make a choice for a particular sentence even after carefully considering the alternatives, you should probably choose 6 (HARD TO JUDGE)! But if you strongly feel none of the choices fit a sentence, even after consulting the help file, please notify Matthew Tong (mtong@cs.rochester.edu) to allow necessary modifications to the menus or available help information to occur. You may quit at any time by typing 'q'; if you quit partway through the judgement of a sentence, that partial judgement will be discarded, so the best time to quit is right after being presented with a new sentence.</Paragraph> <Paragraph position="2"> [here the first sentence to be judged is presented] 1. SEEMS LIKE A REASONABLE GENERAL CLAIM (Of course. Yes.) A grand-jury may say a proposition. A report can be favorable.</Paragraph> <Paragraph position="3"> 2. SEEMS REASONABLE BUT EXTREMELY SPECIFIC OR OBSCURE (I suppose so) A surgeon may carry a cage. Gladiator pecs can be Reeves-type.</Paragraph> <Paragraph position="4"> 3. SEEMS VACUOUS (That's not saying anything) A thing can be a hen. A skiff can be nearest.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. SEEMS FALSE (No. I don't think so. Hardly) </SectionTitle> <Paragraph position="0"> A square can be round. Individual -s may have a world.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5.
SOMETHING IS OBVIOUSLY MISSING (Give me a complete sentence) </SectionTitle> <Paragraph position="0"> A person may ask. A male-individual may attach an importance.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6. HARD TO JUDGE (Huh?? How do you mean that? I don't know.) </SectionTitle> <Paragraph position="0"> A female-individual can be psychic. Supervision can be with a company.</Paragraph> <Paragraph position="1"> Based on this judging scheme, we performed two types of experiments: an experiment to determine whether literary style significantly impacts the percentage of propositions judged favorably; and experiments to assess overall success rate, in the judgement of multiple judges. We obtained clear evidence that literary style matters, and achieved a moderately high success rate - but certainly sufficiently high to assure us that large numbers of potentially useful propositions are extracted by our methods. The judging consistency remains rather low, but this does not invalidate our approach. In the worst case, hand-screening of output propositions by multiple judges could be used to reject propositions of doubtful validity or value. But of course we are very interested in developing less labor-intensive alternatives. The following subsections provide some details.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Dependence of extracted propositions on literary style </SectionTitle> <Paragraph position="0"> The question this experiment addressed was whether different literary styles correlated with different degrees of success in extracting intuitively reasonable propositions. The experiment was carried out twice, first by one of the authors (who was unaware of the contents of the files being sampled) and the second time by an outside recruit. While further experimentation is desirable, we believe that the evidence from two judges that literary style correlates with substantial differences in the perceived quality of extracted propositions demonstrates that future work on larger corpora should control the materials used for literary style.</Paragraph> <Paragraph position="1"> Judgements were based on 4 Brown files (ck01, ck13, cd02, cd01). The 4 files were chosen by one of us on purely subjective grounds. Each contains about 2,200 words of text. (Our extraction methods yield about 1 proposition for every 8 words of text. So each file yields about 250-300 propositions.) The first two, ck01 and ck13, are straightforward, realistic narratives in plain, unadorned English, while cd01 and cd02 are philosophical and theological essays employing much abstract and figurative language. The expectation was that the first two texts would yield significantly more propositions judged to be reasonable general claims about the world than the latter two. To give some sense of the contents, the first few sentences from each of the texts are extracted here: Initial segments of each of the four texts ck01: Scotty did not go back to school. His parents talked seriously and lengthily to their own doctor and to a specialist at the University Hospital- Mr. McKinley was entitled to a discount for members of his family- and it was decided it would be best for him to take the remainder of the term off, spend a lot of time in bed and, for the rest, do pretty much as he chose- provided, of course, he chose to do nothing too exciting or too debilitating.
His teacher and his school principal were conferred with and everyone agreed that, if he kept up with a certain amount of work at home, there was little danger of his losing a term.</Paragraph> <Paragraph position="2"> ck13: In the dim underwater light they dressed and straightened up the room, and then they went across the hall to the kitchen. She was intimidated by the stove. He found the pilot light and turned on one of the burners for her. The gas flamed up two inches high. They found the teakettle.</Paragraph> <Paragraph position="3"> And put water on to boil and then searched through the icebox.</Paragraph> <Paragraph position="4"> cd01: As a result, although we still make use of this distinction, there is much confusion as to the meaning of the basic terms employed. Just what is meant by &quot;spirit&quot; and by &quot;matter&quot;? The terms are generally taken for granted as though they referred to direct and axiomatic elements in the common experience of all. Yet in the contemporary context this is precisely what one must not do. For in the modern world neither &quot;spirit&quot; nor &quot;matter&quot; refer to any generally agreed-upon elements of experience...</Paragraph> <Paragraph position="5"> cd02: If the content of faith is to be presented today in a form that can be &quot;understanded of the people&quot;- and this, it must not be forgotten, is one of the goals of the perennial theological task- there is no other choice but to abandon completely a mythological manner of representation. This does not mean that mythological language as such can no longer be used in theology and preaching. The absurd notion that demythologization entails the expurgation of all mythological concepts completely misrepresents Bultmann's intention.</Paragraph> <Paragraph position="6"> Extracted propositions were uniformly sampled from the 4 files, for a total count of 400, and the number of judgements in each judgement category was then separated out for the four files. In a preliminary version of this experiment, the judgement categories were still the 3-level hierarchical ones we eventually dropped in favor of a 6-alternatives scheme. Still, the results clearly indicated that the &quot;plain&quot; texts yielded significantly more propositions judged to be reasonable claims than the more abstract texts. Two repetitions of the experiment (with newly sampled propositions from the 4 files) using the 6-category judging scheme, and the heuristic postprocessing and filtering routines, yielded the following unequivocal results. (The exact sizes of the samples from files ck01, ck13, cd01, and cd02 in both repetitions were 120, 98, 85, and 97 respectively, where the relatively high count for ck01 reflects the relatively high count of extracted propositions for that text.)
For ck01 and ck13, around 73% of the propositions (159/218 for judge 1 and 162/218 for judge 2) were judged to be in the &quot;reasonable general claim&quot; category; for cd01 and cd02, the figures were much lower, at 41% (35/85 for judge 1 and 40/85 for judge 2) and less than 55% (53/97 for judge 1 and 47/97 for judge 2) respectively.</Paragraph> <Paragraph position="7"> For ck01 and ck13, the counts in the &quot;hard to judge&quot; category were 12.5-15% (15-18/120) and 7.1-8.2% (6-7/85) respectively, while for cd01 and cd02 the figures were substantially higher, viz., 25.9-28.2% (22-24/85) and 19.6-23% (19-34/97) respectively.</Paragraph> <Paragraph position="8"> Thus, as one would expect, simple narrative texts yield more propositions recognized as reasonable claims about the world (nearly 3 out of 4) than abstruse analytical materials (around 1 out of 2). The question, then, is how to control for style when we turn our methods to larger corpora. One obvious answer is to hand-select texts in relevant categories, such as literature for young readers, or from authors whose writings are realistic and stylistically simple (e.g., Hemingway). However, this could be quite laborious since large literary collections available online (such as the works in Project Gutenberg, http://promo.net/pg/, http://www.thalasson.com/gtn/, with expired copyrights) are not sorted by style. Thus we expect to use automated style analysis methods, taking account of such factors as vocabulary (checking for esoteric vocabulary and vocabulary indicative of fairy tales and other fanciful fiction), tense (analytical material is often in present tense), etc. We may also turn our knowledge extraction methods themselves to the task: if, for instance, we find propositions about animals talking, it may be best to skip the text source altogether.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Overall quality of extracted propositions </SectionTitle> <Paragraph position="0"> To assess the quality of extracted propositions over a wide variety of Brown corpus texts, with judgements made by multiple judges, the authors and three other individuals made judgements on the same set of 250 extracted propositions. The propositions were extracted from the third of the Brown corpus (186 files) that had been annotated with WordNet senses in the SEMCOR project (Landes et al., 1998) (chiefly because those were the files at hand when we started the experiment - but they do represent a broad cross-section of the Brown Corpus materials). We excluded the cj-files, which contain highly technical material. Table 1 shows the judgements of the 5 judges (as percentages of counts out of 250) in each of the six judgement categories. The category descriptions have been mnemonically abbreviated at the top of the table. Judge 1 appears twice, and this represents a repetition, as a test of self-consistency, of judgements on the same data presented in different randomized orderings.</Paragraph> <Paragraph position="1"> [Table 1 column headings: reasonable, obscure, vacuous, false, incomplete, hard.] As can be seen from the first column, the judges placed about 49-64% of the propositions in the &quot;reasonable general claim&quot; category. This result is consistent with the results of the style-dependency study described above, i.e., the average lies between the ones for &quot;straightforward&quot; narratives (which was nearly 3 out of 4) and the ones for abstruse texts (which was around 1 out of 2).
This is an encouraging result, suggesting that mining general world knowledge from texts can indeed be productive.</Paragraph> <Paragraph position="2"> One point to note is that the second and third judgement categories need not be taken as an indictment of the propositions falling under them - while we wanted to distinguish overly specific, obscure, or vacuous propositions from ones that seem potentially useful, such propositions would not corrupt a knowledge base in the way the other categories would (false, incomplete, or incoherent propositions). Therefore, we have also collapsed our data into three more inclusive categories, namely &quot;true&quot; (collapsing the first 3 categories), &quot;false&quot; (same as the original &quot;false&quot; category), and &quot;undecidable&quot; (collapsing the last two categories). The corresponding variant of Table 1 would thus be obtained by summing the first 3 and last 2 columns. We won't do so explicitly, but it is easy to verify that the proportion of &quot;true&quot; judgements comprises about three out of four judgements, when averaged over the 5 judges.</Paragraph> <Paragraph position="3"> We now turn to the extent of agreement among the judgements of the five judges (and judge 1 with himself on the same data). The overall pairwise agreement results for classification into six judgement categories are shown in Table 2.</Paragraph> <Paragraph position="4"> A commonly used metric for evaluating interrater reliability in categorization of data is the kappa statistic (Carletta, 1996). As a concession to the popularity of that statistic, we compute it in a few different ways here, though - as we will explain - we do not consider it particularly appropriate. For 6 judgement categories, kappa computed in the conventional way for pairs of judges ranges from .195 to .367, averaging .306. For 3 (more inclusive) judgement categories, the pairwise kappa scores range from .303 to .462, with an average of .375.</Paragraph> <Paragraph position="5"> These scores, though certainly indicating a positive correlation between the assessments of multiple judges, are well below the lower threshold of .67 often employed in deciding whether judgements are sufficiently consistent across judges to be useful. However, to see that there is a problem with applying the conventional statistic here, imagine that we could improve our extraction methods to the point where 99% of extracted propositions are judged by miscellaneous judges to be reasonable general claims.</Paragraph> <Paragraph position="6"> This would be success beyond our wildest dreams - yet the kappa statistic might well be 0 (i.e., no better than chance agreement), if the judges generally reject a different one out of every one hundred propositions! One somewhat open-ended aspect of the kappa statistic is the way &quot;expected&quot; agreement is calculated. In the conventional calculation (employed above), this is based on the observed average frequency in each judgement category. This leads to low scores when one category is overwhelmingly favored by all judges, but the exceptions to the favored judgement vary randomly among judges (as in the hypothetical situation just described). A possible way to remedy this problem is to use a uniform distribution over judgement categories to compute expected agreement. Under such an assumption, our kappa scores are significantly better: for 6 categories, they range from .366 to .549, averaging .482; for 3 categories, they range from .556 to .730, averaging .645.
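To make the two calculations of expected agreement concrete, here is a small sketch of pairwise kappa in Python, using one standard pairwise formulation (Cohen's kappa); the judge data below are invented to reproduce the hypothetical 99%-agreement scenario, and this is not the code behind the reported scores:

from collections import Counter

def kappa(labels_a, labels_b, categories, uniform_expected=False):
    # Pairwise kappa: (observed - expected) / (1 - expected).
    # Conventionally, expected agreement comes from each judge's observed
    # category frequencies; the variant discussed above assumes a uniform
    # distribution over the judgement categories instead.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    if uniform_expected:
        expected = 1.0 / len(categories)
    else:
        fa, fb = Counter(labels_a), Counter(labels_b)
        expected = sum((fa[c] / n) * (fb[c] / n) for c in categories)
    return (observed - expected) / (1.0 - expected)

cats = ["reasonable", "obscure", "vacuous", "false", "incomplete", "hard"]
a = ["reasonable"] * 100   # judge A rejects only proposition 0
b = ["reasonable"] * 100   # judge B rejects only proposition 1
a[0], b[1] = "false", "false"
print(round(kappa(a, b, cats), 3))                         # about -0.01: conventional kappa near 0
print(round(kappa(a, b, cats, uniform_expected=True), 3))  # about 0.98: uniform-expectation variant

On this made-up data the judges agree on 98 of 100 items, yet the conventional kappa is essentially 0 while the uniform-expectation variant is close to 1, mirroring the point made in the preceding paragraph.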
This approaches, and for several pairs of judges exceeds, the minimum threshold for significance of the judgements.1 Since the ideal result, as implied above, would be agreement by multiple judges on the &quot;reasonableness&quot; or truth of a large proportion of extracted propositions, it seems worthwhile to measure the extent of such agreement as well. Therefore we have also computed the &quot;survival rates&quot; of extracted propositions, when we reject those not judged to be reasonable general claims by n judges (or, in the case of 3 categories, not judged to be true by n judges). Figure 1 shows the results, where the survival rate for n judges is averaged over all subsets of size n of the 5 available judges.</Paragraph> <Paragraph position="7"> [Figure 1: fraction of propositions placed in the best category (survival rate), as a function of the number of judges.]</Paragraph> <Paragraph position="8"> Thus we find that the survival rate for &quot;reasonable general claims&quot; starts off at 57%, drops to 43% and then 35% for 2 and 3 judges, and drops further to 31% and 28% for 4 and 5 judges. It appears as if an asymptotic level above 20% might be reached. But this may be an unrealistic extrapolation, since virtually any proposition, no matter how impeccable from a knowledge engineering perspective, might eventually be relegated to one of the other 5 categories by some uninvolved judge. The survival rates based on 2 or 3 judges seem to us more indicative of the likely proportion of (eventually) useful propositions than an extrapolation to infinitely many judges. For the 3-way judgements, we see that 75% of extracted propositions are judged &quot;true&quot; by individual judges (as noted earlier), and this drops to 65% and then 59% for 2 and 3 judges.</Paragraph> <Paragraph position="9"> Though again sufficiently many judges may eventually bring this down to 40% or less, the survival rate is certainly high enough to support the claim that our method of deriving propositions from texts can potentially deliver very large amounts of world knowledge.</Paragraph> <Paragraph position="10"> 1 The fact that for some pairs of judges the kappa-agreement (with this version of kappa) exceeds 0.7 indicates that with more careful training of judges significant levels of agreement could be reached consistently.</Paragraph> </Section> </Section> </Paper>
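As a final aid to the reader, here is a small sketch of the survival-rate computation summarized in Figure 1: a proposition survives at level n if all of n judges place it in the best category, and the rate is averaged over every subset of n of the 5 judges. The judgement records and judge names are invented, and this is not the authors' evaluation script:

from itertools import combinations

def survival_rate(judgements, n, accepted=("reasonable",)):
    # Fraction of propositions whose label is in `accepted` for ALL judges in a
    # subset of size n, averaged over every such subset of the available judges.
    judges = sorted(judgements[0].keys())
    rates = []
    for subset in combinations(judges, n):
        kept = sum(all(row[j] in accepted for j in subset) for row in judgements)
        rates.append(kept / len(judgements))
    return sum(rates) / len(rates)

# Tiny made-up table: 4 propositions judged by 5 judges.
rows = [
    {"j1": "reasonable", "j2": "reasonable", "j3": "reasonable", "j4": "hard", "j5": "reasonable"},
    {"j1": "reasonable", "j2": "false", "j3": "reasonable", "j4": "reasonable", "j5": "reasonable"},
    {"j1": "vacuous", "j2": "reasonable", "j3": "obscure", "j4": "reasonable", "j5": "reasonable"},
    {"j1": "reasonable", "j2": "reasonable", "j3": "reasonable", "j4": "reasonable", "j5": "reasonable"},
]
for n in range(1, 6):
    print(n, round(survival_rate(rows, n), 3))

# For the collapsed 3-way judgements, a proposition counts as "true" if it falls
# in any of the first three categories:
# survival_rate(rows, n, accepted=("reasonable", "obscure", "vacuous"))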