<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1011"> <Title>ing, Word Associations and Typical Predicate-Argument</Title> <Section position="4" start_page="65" end_page="65" type="metho"> <SectionTitle> 2 The Corpus </SectionTitle> <Paragraph position="0"> This paper reports work undertaken for the LRE project SISTA (Semi-automatic Indexing System for Technical Abstracts). This section briefly describes one of the corpora used by this project.</Paragraph> <Paragraph position="1"> The RAPRA corpus comprises some 212,000 technical abstracts pertaining to research and commercial exploitation in the rubber and plastics industry. An average of 15 descriptors, selected from a thesaurus of some 10,000 descriptors, is assigned to each article. The frequency of assignment of descriptors varies roughly in the same way as the frequency of word use varies (the frequencies of descriptor tokens (very) approximately satisfy the Zipf-Mandelbrot law). Descriptors are assigned by expert indexers from the entire article and expert domain knowledge, not just from the abstract, so it is unlikely that any automatic system which analyses only the abstracts can assign all the descriptors which are manually assigned to the abstract.</Paragraph> <Paragraph position="2"> We show a fairly typical example below. It is clear that many of these descriptors must have been assigned from the main text of the article, and not from the abstract alone. Moreover, this is common practice in the technical abstract indexing industry, so it seems unlikely that the situation will be better for other corpora. Nevertheless, we can hope to follow a strategy of assigning descriptors when there is enough information to do so.</Paragraph> <Paragraph position="3"> Macromolecular Deformation Model to Estimate Viscoelastic Flow Effects in Polymer Melts The elastic deformation of polymer macromolecules in a shear field is used as the basis for quantitative predictions of viscoelastic flow effects in a polymer melt. Non-Newtonian viscosity, capillary end correction factor, maximum die swell, and die swell profile of a polymer melt are predicted by the model. All these effects can be reduced to generic master curves, which are independent of polymer type. Macromolecular deformation also influences the brittle failure strength of a processed polymer glass. The model gives simple and accurate estimates of practically important processing effects, and uses fitting parameters with the clear physical identity of viscoelastic constants, which follow well established trends with respect to changes in polymer composition or processing conditions. 12 refs.</Paragraph> <Paragraph position="4"> Original assignment: BRITTLE FAILURE; COMPANY; DATA; DIE SWELL; ELASTIC DEFORMATION; EQUATION; ...; VISCOELASTICITY; VISCOSITY</Paragraph> </Section> <Section position="5" start_page="65" end_page="67" type="metho"> <SectionTitle> 3 Models </SectionTitle> <Paragraph position="0"> Two classes of models for assessing descriptor appropriateness were used. One class comprises variants of Salton's term-weighting models, and the other is more closely allied to fuzzy or default logic insofar as it assigns descriptors on the basis of the presence of certain diagnostic units.
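As a rough illustration of this contrast, the minimal sketch below scores a document under a generic additive term-weighting model and under a rule-based single-term strategy. The function names and toy data are illustrative assumptions, not the SISTA implementation.

```python
# Illustrative sketch only: contrasts a generic additive term-weighting score with a
# rule-based ("diagnostic unit") assignment strategy. Names and data are hypothetical.

def term_weight_score(doc_units, alpha, beta):
    """Additive term-weighting appropriateness score for one descriptor."""
    return beta + sum(alpha.get(w, 0.0) for w in doc_units)

def rule_based_assign(doc_units, rules):
    """Assign every descriptor whose high-precision diagnostic unit is present."""
    return {d for unit, d in rules if unit in doc_units}

doc = {"die swell", "polymer melt", "viscosity"}
alpha = {"die swell": 2.1, "viscosity": 1.4}          # per word-descriptor weights
rules = [("die swell", "DIE SWELL"), ("brittle failure", "BRITTLE FAILURE")]

print(term_weight_score(doc, alpha, beta=-1.0))   # weighted-sum appropriateness: 2.5
print(rule_based_assign(doc, rules))              # {'DIE SWELL'}
```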
What is interesting for us is that term weighting models do not seem able to easily exploit the additional information provided by a more sophisticated representation of a document, while an alternative statistical single term model can.</Paragraph> <Section position="1" start_page="65" end_page="65" type="sub_section"> <SectionTitle> 3.1 Term weighting models </SectionTitle> <Paragraph position="0"> The standard term weighting model is defined by choosing a set of parameters {α_ij} (one for each word-descriptor pair) and {β_i} (one for each descriptor) so that a likelihood or appropriateness function, L, can be defined by</Paragraph> <Paragraph position="1"> L(d_i | x) = β_i + Σ_{w_j ∈ T(x)} α_ij, where T(x) is the set of diagnostic units representing document x.</Paragraph> <Paragraph position="2"> This has been widely used, and is provably equivalent to a large class of probabilistic models (e.g. Van Rijsbergen, 1979) which make various assumptions about the independence between descriptors and diagnostic units (Fuhr & Buckley, 1993). Various strategies for estimating the parameters for this model have been proposed (e.g. Salton & Yang, 1973; Buckley, 1993; Fuhr & Buckley, 1993). Some of these concentrate on the need for re-estimating weights according to relevance feedback information, while some make use of various functions of term frequency, document frequency, maximum within-document frequency, and various other measurements of corpora. Nevertheless, the problem of estimating the huge number of parameters needed for such a model is statistically problematic, and as Buckley (1993) points out, the choice of weights has a large influence on the effectiveness of any model for classification or for retrieval.</Paragraph> <Paragraph position="4"> There are so many variations on the theme of term weighting models that it is impossible to try them all in one experiment, so this paper uses a variation of a model used by Lewis (1992e) in which he reports the results of some experiments using phrases in a term weighting model (which has a probabilistic interpretation). Several term weighting models have been tried, but they all evaluate within 5 points of each other on both precision and recall (when suitably tweaked).</Paragraph> <Paragraph position="5"> The model eventually chosen for the tests reported here was a smoothed logistic model which gave the best results of all the probabilistically inspired term weighting models considered.</Paragraph> </Section> <Section position="2" start_page="65" end_page="67" type="sub_section"> <SectionTitle> 3.2 Single term model </SectionTitle> <Paragraph position="0"> In contrast to making assumptions of independence about the relationship between diagnostic units and descriptors, the next model utilises only those diagnostic units which strongly predict descriptors (i.e. have frequently been associated with descriptors) without making assumptions about the independence of diagnostic units given descriptors.</Paragraph> <Paragraph position="1"> We shall investigate this class of models using probability theory. The main problem with using probability theory for problems in document classification is that while it might be relatively easy to estimate probabilities such as P(d|w) for some diagnostic unit w and some descriptor d, it is not possible to infer much about P(d|w*), where * is some additional information (e.g. the other DUs which represent the document), since these probabilities have not been estimated, and would take a far larger corpus to reliably estimate in any case.
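To make the kind of estimate meant here concrete, the following minimal sketch computes P(d|w) from co-occurrence counts with a small additive constant. The counts are invented, and the smoothing is a simplified form of the add-a-constant contingency-cell scheme reported in section 5.1.

```python
# Illustrative sketch only: empirical estimate of P(d|w) for a descriptor d and a
# diagnostic unit w, from invented document counts. The additive constant loosely
# mirrors the 0.02 reported in section 5.1; the scheme there adjusts every cell of
# the full contingency table, so this is a simplification.

def p_descriptor_given_unit(n_docs_with_d_and_w, n_docs_with_w, smooth=0.02):
    """Smoothed estimate of P(d|w) = count(d, w) / count(w)."""
    return (n_docs_with_d_and_w + smooth) / (n_docs_with_w + smooth)

# A near "sure-fire" diagnostic unit: almost every document containing it carries d.
print(p_descriptor_given_unit(47, 52))   # ~0.90
# A weak diagnostic unit: far from the high-precision rules sought below.
print(p_descriptor_given_unit(3, 40))    # ~0.08
```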
The situation gets exponentially worse as the information we have about the document increases. The exception to this rule is when P(d|w) is close to 1, in which case it is very unlikely that additional information changes its value much. This fact is further investigated now.</Paragraph> <Paragraph position="2"> The strategy explored here is to concentrate on finding &quot;sure-fire&quot; indicators of descriptors, in a somewhat similar manner to how Carnegie's TCS works, by exploiting the fact that with a pre-classified training corpus we can identify sure-fire indicators empirically and &quot;trawl&quot; in a large set of informative diagnostic units for those which identify descriptors with high precision. The basis of the model is the following: We consider a likelihood function, L, defined by:</Paragraph> <Paragraph position="3"> L(d|w) = n(d, w) / n(w).</Paragraph> <Paragraph position="4"> That is, the number of articles in the training corpus in which d was observed to occur with w divided by the number of articles in which w occurred in the training corpus. This is an empirical estimate of the conditional probability, P(d|w). We shall assume (for simplicity's sake) that we have a large enough corpus to reliably estimate these probabilities.</Paragraph> <Paragraph position="5"> The strategy for descriptor assignment we are investigating is to assign a descriptor d if and only if one of a set of predicates over representations of documents is true. We define the rule φ(x) → d to be Probably Correct to degree ε if and only if P(d|φ) > 1 − ε. We wish to keep the precision resulting from using this strategy high while increasing the number of rules to improve recall. The predicates φ we shall consider for this paper will be very simple (they will typically be true iff w ∈ T(x) for some diagnostic unit w), but in principle, they could be arbitrarily complex (as they are in Carnegie's TCS).</Paragraph> <Paragraph position="6"> The primary question of concern is whether the ensemble of rules {φ_i → d} retains precision or not.</Paragraph> <Paragraph position="7"> Unfortunately, the answer to this question is that this is not necessarily the case unless we put some constraints on the predicates.</Paragraph> <Paragraph position="8"> Proposition 1 Let Φ be a set of predicates with the property that, for some fixed descriptor d, φ ∈ Φ implies P(d|φ) > 1 − ε. That is, each of the rules φ_i → d is probably correct to degree ε.</Paragraph> <Paragraph position="9"> The expected precision of the rule (∨ φ_i) → d is at least 1 − nε, where n is the cardinality |Φ|. Proof: [Straightforward and omitted] This proposition asserts that one cannot be guaranteed to be able to keep adding diagnostic units to improve recall without hurting precision, unless the quality of those diagnostic units is also improved (i.e. ε is decreased in proportion to the number of DUs which are considered). This is unfortunate, but nevertheless the question of how much adding diagnostic units to help recall will hurt precision is an entirely empirical matter dependent on the true nature of P; this proposition is a worst case, and gives us reason to be careful. Performance will be expected to be poorest if there are many rules which correspond to the same true positives, but different sets of false positives.
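The omitted proof is presumably the standard union-bound argument; the following sketch is our reconstruction under that assumption, not the authors' own derivation.

\[
P(\neg d \mid \textstyle\bigvee_i \varphi_i)
  = \frac{P(\neg d \wedge \bigvee_i \varphi_i)}{P(\bigvee_i \varphi_i)}
  \le \frac{\sum_i P(\neg d \wedge \varphi_i)}{\max_i P(\varphi_i)}
  \le \frac{\varepsilon \sum_i P(\varphi_i)}{\max_i P(\varphi_i)}
  \le n\varepsilon ,
\]

so the precision \(P(d \mid \bigvee_i \varphi_i) \ge 1 - n\varepsilon\). The bound is approached when the rules share their true positives but contribute disjoint sets of false positives, which is exactly the worst case just described.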
If the predicates are disjoint, for example, then the precision of the disjunction is at least as great as the lowest precision of the individual rules, and so remains above 1 − ε.</Paragraph> <Paragraph position="10"> So if we design our predicates so that they are disjoint, then we retain precision while increasing recall. In practice, this is infeasible, but it is feasible to look more carefully at frequently co-occurring predicates, since these will be most likely to reduce precision [1]. The main moral we can draw from the proposition and the disjointness observation above is that we must be careful about the case where diagnostic units are highly correlated.</Paragraph> <Paragraph position="11"> One situation which is relatively frequent as the sophistication of representation increases is that some diagnostic units always co-occur with others.</Paragraph> <Paragraph position="12"> For example, if the document were represented by sequences of words, then the sequence &quot;olefin polymerisation&quot; always occurs whenever the sequence &quot;high temperature olefin polymerisation&quot; occurs. In this case, it might be thought to pay to look only at the most specific diagnostic units since, if w1 → w2, then P(X|w1, w2, C) = P(X|w1, C) for any distribution P whatsoever (here, C represents any other contextual information we have, for example the other diagnostic units representing the document). However, if w1 is significantly less frequent than w2, estimation errors for P(d|w1) will be larger than those for P(d|w2) for any descriptor d, so there may not be a significant advantage. However, it does give us a theoretical reason to believe that representing a document by its set of most specific predicates is worth investigating, and this shall be investigated below. [1] One classic example is the case of the &quot;New Hampshire Yankee Power Plant&quot;. In a collection of New York Times articles studied by Jacobs & Rau (1990), the word &quot;Yankee&quot; was found to predict NUCLEAR POWER because of the frequent occurrence of articles about this plant. However, &quot;Yankee&quot; on its own without the other words in this phrase is a good predictor of articles about the New York Yankees, a baseball team. If highly mutually informative words are combined into conjunctive predicates (e.g. &quot;Yankee&quot; ∈ x & &quot;Plant&quot; ∈ x), and a document is represented by its most specific predicates only, then when &quot;Yankee&quot; appears alone, it will be a good predictor of the descriptor SPORT. This example can also show that the bound described above is tight. Imagine (suspending disbelief) that each of the five words in the phrase has the same number of occurrences, i, in the document collection without NUCLEAR POWER, where they never occur together pairwise, and that they always occur all together in j true positives of the descriptor. Then the precision of assigning NUCLEAR POWER if any one of them appears in a document is j/(j+5i), and since ε in this case is i/(i+j), the bound follows (for the case n = 5) with a little algebra.</Paragraph> <Paragraph position="13"> If one considers a calculus similar to the one described here, but allows ε to tend to 0, then a weak default logic ensues which has been studied by Adams (1975), and further investigated by Pearl (1988).</Paragraph> </Section> </Section> <Section position="6" start_page="67" end_page="67" type="metho"> <SectionTitle> 4 Adding linguistic description </SectionTitle> <Paragraph position="0"> The simplest way of representing a document is as a set or multiset of words. Many people (e.g.
Lewis 1992bc; Jacobs & Rau 1990) have suggested that a more linguistically sophisticated representation of a document might be more effective for the purposes of statistical keyword assignment. Unfortunately, attempts to do this have not been found to reliably improve performance as measured by recall and precision for the task of document classification. I shall present evidence that a more sophisticated representation makes better predictions from the Single Term model defined above than it does from standard term weighting models.</Paragraph> <Section position="1" start_page="67" end_page="67" type="sub_section"> <SectionTitle> 4.1 Linguistic description </SectionTitle> <Paragraph position="0"> The simplest form of linguistic description of the content of a machine-readable document is in the form of a sequence (or a set) of words. More sophisticated linguistic information comes in several forms, all of which may need to be represented if performance in an automatic categorisation experiment is to be improved. Typical examples of linguistically sophisticated annotation include tagging words with their syntactic category (although this has not been found to be effective for IR), the lemma of the word (e.g. &quot;corpus&quot; for &quot;corpora&quot;), phrasal information (e.g. identifying noun groups and phrases (Lewis 1992c, Church 1988)), and subject-predicate identification (e.g. Hindle 1990). For the RAPRA corpus, we currently identify noun groups and adjective groups.</Paragraph> <Paragraph position="1"> This is achieved in a manner similar to Church's (1988) PARTS algorithm used by Lewis (1992bc), in the sense that its main properties are robustness and corpus sensitivity. All that is important for this paper is that the technique identifies various groupings of words (for example, noun groups, adjective groups, and so on) with a high level of accuracy.</Paragraph> <Paragraph position="2"> Major parts of the technique are described in detail in Finch (1993). As an example, this is some of the linguistic markup which represents the title of the sample document shown earlier.</Paragraph> <Paragraph position="3"> * macromolecular deformation (NG); macromolecular deformation model (NG); deformation (NG); deformation model (NG); model (NG); viscoelastic flow (NG); viscoelastic flow effects (NGS); flow (NG); flow effects (NGS); effects (NGS); polymer (NG); polymer melts (NGS); melts (NGS) It is clear that the markup is far from sophisticated, and is very much a small variation on a simple sequence-based representation. Nevertheless, it is fairly accurate insofar as well over 90% of what are claimed to be noun groups can be interpreted as such. One very useful by-product of using a linguistically based representation is that it can help in linguistic tasks such as terminological collection. I shall present some examples of diagnostic units which are highly associated with descriptors later.</Paragraph> </Section> </Section> <Section position="7" start_page="67" end_page="68" type="metho"> <SectionTitle> 5 Predicting from sophisticated representations </SectionTitle> <Paragraph position="0"> In what follows, we shall compare the relative performance of a term weighting model with the single term model as we vary the sophistication of representation. Proportional assignment (Lewis 1992b) is used to assign the descriptors from statistical measurements of their appropriateness.
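A minimal sketch of proportional assignment, as elaborated in the following paragraph; the quota arithmetic, names and scores are illustrative assumptions rather than the SISTA code.

```python
# Sketch of proportional assignment: descriptor d is given to the N highest-scoring
# test documents, where N is proportional to d's frequency in the training corpus.
# Scores, frequencies and corpus sizes below are invented for illustration.

def proportional_assign(scores, train_freq, n_test, n_train):
    """scores: {descriptor: [(doc_id, score), ...]}; returns {descriptor: set(doc_id)}."""
    assignments = {}
    for d, doc_scores in scores.items():
        quota = round(train_freq[d] * n_test / n_train)      # N for this descriptor
        ranked = sorted(doc_scores, key=lambda p: p[1], reverse=True)
        assignments[d] = {doc for doc, _ in ranked[:quota]}
    return assignments

scores = {"DIE SWELL": [("a1", 0.91), ("a2", 0.40), ("a3", 0.87), ("a4", 0.12)]}
print(proportional_assign(scores, train_freq={"DIE SWELL": 5000}, n_test=4, n_train=10000))
# quota = 2, so the two top-scoring abstracts a1 and a3 receive DIE SWELL
```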
This method ensures that roughly the same number of assignments of particular descriptors is made as is actually made in the test corpus. The strategy is simply to assign descriptor d to the N documents which score highest for this descriptor, where N is chosen in proportion to the occurrence of d in the training corpus. For term weighting models, the score is simply the combined weight of the document; for the single term model, the score for document x is sup_{w ∈ T(x)} P(d|w). The Rule Based assignment strategy applies only to the single term model, and the rule w → d is included just in case P(d|w) > 1 − ε.</Paragraph> <Paragraph position="1"> Figure 1 shows a few of the rules. All of these entries share the property that P(d|w) > 0.8. They were selected at random from the 85,500 associations which were found.</Paragraph> <Section position="1" start_page="67" end_page="68" type="sub_section"> <SectionTitle> 5.1 Representations and models </SectionTitle> <Paragraph position="0"> Five paradigms of document representation and two term appropriateness models will be compared, giving ten combinations.</Paragraph> <Paragraph position="1"> The first representation paradigm is a baseline one: represent documents as the set of the words contained in them. The second paradigm is to represent documents according to word sequences, and the third is to apply a noun-group and adjective-group recogniser. The fourth and fifth representation modes consider representing documents by only their most specific diagnostic units. For example, if the sequence &quot;thermoplastic elastomer compounds&quot; appeared in the abstract, then ordinarily this would also give rise to the sequence &quot;elastomer compounds&quot;, which would be included in the representation; under the most specific modes, only the longer sequence is retained. The results of section 3.2 might encourage us to believe that representing a document by only its most specific diagnostic units will improve performance (or, at least, precision). Consequently, a sequence of words is defined to be most specific if (a) it is a diagnostic unit and (b) it is not properly contained in a token of any other diagnostic unit present in the document [2]. The noun groups are found by performing a simple parse of the documents as described above, and identifying likely noun groups of length 3 or less.</Paragraph> <Paragraph position="3"> The contingency table of diagnostic units versus manually assigned descriptors on a training corpus of 200,000 documents was collected, and this was used as the basis for two term appropriateness models.</Paragraph> <Paragraph position="4"> Probabilities were estimated by adding a constant (a value of 0.02 was usually found to be close to optimal) to each cell, and estimating directly from these slightly adjusted counts.</Paragraph> <Paragraph position="5"> The 50,000 most frequent diagnostic unit types were chosen, and terms which appeared in more than 10% of documents were discarded.</Paragraph> <Paragraph position="6"> [2] If &quot;elastomer compounds&quot; appeared in the document separately from &quot;thermoplastic elastomer compounds&quot;, then both of these sequences would be represented in the experiments reported here.</Paragraph> </Section> </Section> </Paper>