<?xml version="1.0" standalone="yes"?> <Paper uid="W94-0106"> <Title>DO WE NEED LINGUISTICS WHEN WE HAVE STATISTICS? A COMPARATIVE ANALYSIS OF THE CONTRIBUTIONS OF LINGUISTIC CUES TO A STATISTICAL WORD GROUPING SYSTEM</Title> <Section position="4" start_page="43" end_page="44" type="metho"> <SectionTitle> 2. AN OVERVIEW OF THE ADJECTIVE GROUPING SYSTEM </SectionTitle> <Paragraph position="0"> Our adjective grouping system \[Hatzivassiloglou and McKeown, 1993\] starts with a set of adjectives to be clustered into semantically related groups. Ideally, we want highly related words such as synonyms, antonyms, and hyponyms to be the only ones placed in the same group. The system is given the number of groups to form as an input parameter 2, and has access to a text corpus. No semantic information about the adjectives is available to the system. The system operates by extracting pairs of modified nouns for each adjective, and, optionally, pairs of adjectives that we can expect to be semantically unrelated on linguistic grounds 3. From the estimated distribution of modified nouns for each adjective, a similarity score is assigned to each possible pair of adjectives. This is based on Kendall's τ, a nonparametric, robust estimator of correlation \[Kendall, 1938\]. Using the similarity scores and, optionally, the established relationships of nonrelatedness, a non-hierarchical clustering method \[Spath, 1985\] assigns the adjectives to groups in a way that maximizes the within-group similarity (and therefore also maximizes the between-group dissimilarity).</Paragraph> <Paragraph position="1"> 1We thank Johanna Moore for pointing out this application to us.</Paragraph> <Paragraph position="2"> 2Determining this number from the data is probably the hardest problem in cluster analysis in general; see \[Kaufman and Rousseeuw, 1990\].</Paragraph> <Paragraph position="3"> 3These are adjectives that either modify the same noun in the same NP (e.g. 
big white house) or one of them modifies the other (e.g. light blue coat); see \[Hatzivassiloglou and McKeown, 1993\] for a detailed analysis.</Paragraph> <Paragraph position="4"> 1. deadly fatal
2. capitalist socialist
3. clean dirty dumb
4. hazardous toxic
5. insufficient scant
6. generous outrageous unreasonable
7. endless protracted
8. plain
9. hostile unfriendly
10. delicate fragile unstable
11. affluent impoverished prosperous
12. brilliant clever energetic smart stupid
13. communist leftist
14. astonishing meager vigorous
15. catastrophic disastrous harmful
16. dry exotic wet
17. chaotic turbulent
18. confusing misleading
19. dismal gloomy
20. dual multiple pleasant
21. fat slim
22. affordable inexpensive
23. abrupt gradual stunning
24. flexible lenient rigid strict stringent
To evaluate our system, we have developed extended versions of the standard information retrieval measures precision, recall, and fallout. These extended versions score the grouping produced by the system against a set of model groupings (instead of just one) for the same adjectives, supplied by humans. In the experiments reported in this paper, we employ 8 or 9 human-constructed models for each adjective set. We base our comparisons on and report the F-measure scores \[Van Rijsbergen, 1979\] which combine precision and recall in a single number. In addition, since the correct number of groupings is something that the system cannot yet determine (and, incidentally, something that human evaluators disagree about), we run the system for the five cases in the range -2 to +2 around the average number of clusters employed by the humans and average the results. This smoothing operation prevents an accidental high or low score being reported when a small variation in the number of clusters produces very different scores. It should be noted here that the scores reported should not be interpreted as linear percentages. 
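The similarity computation at the heart of the system (Section 2) is based on Kendall's τ over the adjectives' modified-noun distributions. The following is a minimal sketch of that idea, assuming the simple tau-a variant (no tie correction, unlike more careful estimators) and entirely invented noun counts; the paper's actual estimator and data differ.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length count vectors:
    (concordant pairs - discordant pairs) / total pairs."""
    conc = disc = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            conc += 1
        elif s != 0:
            disc += 1
    n = len(xs) * (len(xs) - 1) // 2
    return (conc - disc) / n

# Hypothetical modified-noun counts over the nouns
# (problem, issue, house, coat) for three adjectives.
counts = {
    "big":   [10, 8, 5, 2],
    "large": [9, 7, 4, 1],
    "blue":  [0, 1, 6, 9],
}
sim = kendall_tau(counts["big"], counts["large"])
```

Adjectives whose noun distributions rank the nouns the same way (here, "big" and "large") get a τ near 1, while unrelated adjectives drift toward 0 or below; the clustering then operates on these pairwise scores.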
In other words, a score of 40 is not just twice as good as a score of 20, and going from 30 to 40 is much harder than going from 20 to 30. The latter is true for most applications, but the problem of interpreting the scores is exacerbated in our context because of the structural constraints imposed by the clustering and the presence of multiple models. Furthermore, even the best clustering that could be produced would not receive a score of 100, because of disagreement among humans on what is the correct answer. To clarify the meaning of the scores, we accompany them with lower and upper bounds for each adjective set we examine. These bounds are obtained by the performance of a system that creates random groupings (averaged over many runs) and by the average score of the human-produced partitions when evaluated against the other human models respectively.</Paragraph> <Paragraph position="5"> Figure 1 shows an example clustering produced by our system for one of the adjective sets analyzed in this paper.</Paragraph> </Section> <Section position="5" start_page="44" end_page="46" type="metho"> <SectionTitle> 3. THE LINGUISTIC FEATURES BEING TESTED </SectionTitle> <Paragraph position="0"> We have identified several sources of linguistic knowledge that can be incorporated in our system, augmenting the statistical component. Each such source represents a parameter of the system, i.e. a feature that can be present or absent or more generally take a value from a predefined set. We selected features that can be efficiently computed in a completely automatic way for unrestricted text and do not require extensive amounts of knowledge to be available to the system. Almost all of these features can be generalized to other applications as well, as we discuss in Section 6. 
In this section we discuss first one of these parameters that can take several values, namely the method of extracting data from the corpus, and then several other binary-valued features.</Paragraph> <Section position="1" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 3.1 Extracting data from the corpus </SectionTitle> <Paragraph position="0"> Our adjective clustering system determines the distribution of related (modified) nouns for each adjective and eventually the similarity between adjectives from pairs of the form (adjective,</Paragraph> <Paragraph position="1"> modified noun) observed in the corpus. Direct information about incompatible adjectives (in the form of appropriate adjective-adjective pairs) can also be collected from the corpus. Therefore, a first parameter of the system and a possible dimension for comparisons is the method employed to identify such pairs in free text. This is hardly a unique feature of our system: all word-based statistical systems must first collect data from the corpus about the words of interest, on which the subsequent statistics operate 4.</Paragraph> <Paragraph position="2"> There are several alternate models for this task of data collection, with different degrees of linguistic sophistication. A first model is to use no linguistic knowledge at all: we collect for each adjective of interest all words that fall within a window of some predetermined size. Naturally, no negative data (adjective-adjective pairs) can be collected with this method. However, the method can be implemented easily and does not require the identification of any linguistic constraints so it is completely general. 
It has been used for diverse problems such as machine translation and sense disambiguation \[Gale et al., 1992, Schütze, 1992\].</Paragraph> <Paragraph position="3"> A second model is to restrict the words collected to the same sentence as the adjective of interest and to word class(es) that we expect on linguistic grounds to be relevant to adjectives. For our application, we collect all nouns in the vicinity of an adjective without leaving the current sentence. We assume that these nouns have some relationship with the adjective and that semantically different adjectives will exhibit different collections of such nouns. This model requires only part-of-speech information (to identify nouns) and a method of detecting sentence boundaries. It uses a window of fixed length to define the neighborhood of each adjective. Such a model incorporates minimal linguistic knowledge,</Paragraph> <Paragraph position="4"> namely in determining what constitutes the informative class(es) of words collected (nouns in our problem). Again, negative knowledge such as incompatible adjective pairs cannot be collected with this model. Nevertheless, it has also been widely used, e.g. for collocation extraction \[Smadja, 1993\] and sense disambiguation \[Liddy and Park, 1992\].</Paragraph> <Paragraph position="5"> A third model uses a simple linguistic rule to identify pairs of interest that is even more restrictive and informative than the &quot;nouns in vicinity&quot; 4Although frequently the details of the statistical model employed receive more consideration.</Paragraph> <Paragraph position="6"> approach. Since we are interested in nouns modified by adjectives, such a rule is to collect a noun immediately following an adjective, assuming that this implies a modification relationship. Pairs of consecutive adjectives can also be collected. Up to this point we have successively restricted the collected pairs on linguistic grounds, so that less but cleaner data is collected. 
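The third model's adjacency rule can be sketched as follows, assuming the input sentence is already part-of-speech tagged as (word, tag) tuples; the tag names ADJ and NOUN are illustrative, not the paper's actual tagset.

```python
# Sketch of the "adjacent pairs" rule (third model): an adjective
# immediately followed by a noun is taken as a modification pair,
# and two consecutive adjectives as a candidate incompatible pair.
def extract_pairs(tagged_sentence):
    adj_noun, adj_adj = [], []
    for (w1, t1), (w2, t2) in zip(tagged_sentence, tagged_sentence[1:]):
        if (t1, t2) == ("ADJ", "NOUN"):
            adj_noun.append((w1, w2))     # modification pair
        elif (t1, t2) == ("ADJ", "ADJ"):
            adj_adj.append((w1, w2))      # negative (unrelated) evidence
    return adj_noun, adj_adj

sent = [("the", "DET"), ("big", "ADJ"), ("white", "ADJ"),
        ("house", "NOUN"), ("is", "VERB"), ("old", "ADJ")]
pairs, neg = extract_pairs(sent)
# pairs: [("white", "house")]; neg: [("big", "white")]
```

Note that this simple rule misses "big house" in "big white house"; the fourth and fifth models below exist precisely to recover such pairs.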
For the fourth model, we extend the simple rule given above, using linguistic information to catch more valid pairs without sacrificing accuracy. We employ a pattern matcher that retrieves any sequence of one or more adjectives followed by any sequence of zero or more nouns. These sequences are then analyzed with heuristics based on linguistics to obtain pairs. For example, it can be shown that all adjectives in such a sequence must be semantically unrelated, and that it is best to attach all the adjectives to the final noun.</Paragraph> <Paragraph position="7"> The regular expression and pattern matching rules of the previous model can be extended further, forming a grammar for the constructs of interest. This approach can detect more pairs, and at the same time address known problematic cases not detected by the previous models.</Paragraph> <Paragraph position="8"> We implemented the above five data extraction models, using typical window sizes for the first two methods (50 and 5 on each side of the window respectively) which have been found useful in other problems before. For the fifth model, we developed a finite-state grammar for NPs which is able to handle both predicative and attributive modification of nouns, conjunctions of adjectives, adverbial modification of adjectives, quantifiers, and apposition of adjectives to nouns or other adjectives 5. Unfortunately, the resources required to perform our tests for the first model were too great (e.g. 12,287,320 pairs in a 151 MB file were extracted for the 21 adjectives in our smallest test set) so we dropped that model from further consideration and we use the second model as the baseline of minimal linguistic knowledge. 
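The fourth model's pattern and heuristics can be sketched as below. The tag names and the run-based scan are assumptions for illustration (the paper describes a pattern matcher, not this exact code); the two heuristics from the text are implemented directly: attach every adjective in a run to the final noun, and emit the adjective-adjective pairs as negative evidence.

```python
from itertools import groupby, combinations

def extract(tagged):
    """Find runs of one or more adjectives followed by zero or more
    nouns; attach each adjective to the final noun of the following
    noun run, and pair up the adjectives as unrelated."""
    runs = [(tag, [w for w, _ in grp])
            for tag, grp in groupby(tagged, key=lambda wt: wt[1])]
    adj_noun, neg = [], []
    for i, (tag, adjs) in enumerate(runs):
        if tag != "ADJ":
            continue
        if i + 1 != len(runs) and runs[i + 1][0] == "NOUN":
            final_noun = runs[i + 1][1][-1]          # attach to final noun
            adj_noun += [(a, final_noun) for a in adjs]
        neg += list(combinations(adjs, 2))           # unrelated adjectives
    return adj_noun, neg

sent = [("several", "DET"), ("big", "ADJ"), ("white", "ADJ"),
        ("suburban", "ADJ"), ("houses", "NOUN")]
pos, negative = extract(sent)
```

Unlike the plain adjacency rule, this recovers all three modification pairs for "houses" and three negative adjective pairs from a single NP.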
Other researchers have also reported similar problems of excessive resource demands with the &quot;collect all neighbors&quot; model \[Gale et al., 1992\].</Paragraph> </Section> <Section position="2" start_page="45" end_page="46" type="sub_section"> <SectionTitle> 3.2 Other linguistic features </SectionTitle> <Paragraph position="0"> In addition to the data extraction method, we identified three other areas where linguistic knowledge can be introduced in our system. First, we can employ morphology to convert plural nouns to the corresponding singular ones and adjectives in comparative or superlative degree to their base form. Almost all adjectives and nouns that appear in multiple forms have no semantic difference from their base form except for the number or degree feature. This conversion combines counts of similar pairs, thus raising the expected and estimated frequencies of each pair in any statistical model. We developed a morphology component that produces the singular form of nouns using rules plus a large table of exceptions. For adjectives, a set of rules is again employed but because of the vowel in the suffix -er or -est, many base forms look plausible without a lexicon (e.g.</Paragraph> <Paragraph position="1"> bigger could have been produced from big, bigg, or bigge). We solve this problem by counting the occurrences of each candidate form in our corpus and selecting the one with non-zero frequency.</Paragraph> <Paragraph position="2"> Another potential application of linguistic knowledge is the use of a spell-checking procedure, combined with a word list, to eliminate typographical errors from the corpus. Such errors can produce wrong estimates for the frequencies of modified nouns for an adjective, but most importantly introduce &quot;unique&quot; nouns appearing only with one adjective, skewing the comparison of noun distributions. 
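The candidate-plus-corpus-frequency trick for comparatives can be sketched as follows. The generation rules here are deliberately simplified (only the -er suffix, only the three candidate patterns from the bigger example) and the corpus frequencies are invented.

```python
from collections import Counter

def base_form(adj, corpus_counts):
    """Generate candidate base forms of an '-er' comparative and
    return the one actually attested (non-zero count) in the corpus."""
    stem = adj[:-2]                        # bigger -> bigg
    candidates = [stem, stem + "e"]        # bigg, bigge
    if stem[-1:] == stem[-2:-1]:
        candidates.append(stem[:-1])       # big (undouble the consonant)
    attested = [c for c in candidates if corpus_counts[c] > 0]
    return attested[0] if attested else adj

counts = Counter({"big": 120, "large": 80})   # toy corpus frequencies
base_form("bigger", counts)   # only "big" is attested among bigg/bigge/big
```

The same lookup disambiguates "larger" to "large" (rather than the unattested "larg") without any lexicon, exactly because only real base forms occur in the corpus.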
We implemented this component using the Unix spell program and associated word list, with extensions for hyphenated compounds. Unfortunately, since a fixed and domain-independent lexicon is used for this process, some valid but overspecialized words may be discarded too.</Paragraph> <Paragraph position="3"> Finally, we can use additional sources of knowledge which supplement the primary similarity relationships and are justified on linguistic grounds. We identified several potential sources of additional knowledge that can be extracted from the corpus (e.g. conjunctions of adjectives). In this comparison study we implemented and consider the significance of one of these knowledge sources, namely the negative examples offered by adjective-adjective pairs.</Paragraph> </Section> </Section> <Section position="6" start_page="46" end_page="48" type="metho"> <SectionTitle> 4. THE COMPARISON EXPERIMENTS </SectionTitle> <Paragraph position="0"> In the previous section we identified four parameters of the system, the effects of which we want to analyze. But in addition to these parameters that can be directly varied and have predetermined possible values, several other variables can affect the performance of the system.</Paragraph> <Paragraph position="1"> First, the performance of the system depends naturally on the adjective set that is to be clustered. Presumably variations in the adjective set can be modeled by several parameters, such as size of the set, number of semantic groups in it, and strength of semantic relatedness among its members, plus several parameters describing the properties of the adjectives in the set in isolation, such as frequency, specificity, etc.</Paragraph> <Paragraph position="2"> A second variable that affects the clustering is the corpus that is used as the main knowledge source, through the observed cooccurrence patterns. Again the effects of different corpora can be separated into several factors, e.g. 
the size of the corpus, its generality, the genre of the texts, etc.</Paragraph> <Paragraph position="3"> Since in this paper we are interested in quantifying the effect of the linguistic knowledge in our system, or more precisely of the linguistic knowledge that we can explicitly control through the four parameters discussed above, we did not attempt to model in detail the various factors entering the system as a result of the choice of adjective set and corpus. Rather, we are interested in measuring the effects of the linguistic parameters in a wide range of contexts, and in correlating these effects with variables originating from the choice of corpus and adjective set. For example, we would want to be able to detect that the linguistic parameter &quot;morphology&quot; is significant for small corpora but not for large ones, if that were the case. Therefore, we included in our model two additional parameters, representing the corpus and the adjective set used.</Paragraph> <Paragraph position="4"> We used the Wall Street Journal articles from the ACL-DCI as our corpus. We selected four sub-corpora of decreasing size to study the relationship of corpus size with linguistic feature effects: all the 1987 articles (21 million words), every third of these articles (7 million words), every twenty-first (1 million words), and articles no. 50 and 100 (330,000 words). Since we use subsets of the same corpus, we are essentially modeling the corpus size parameter only.</Paragraph> <Paragraph position="6"> For each corpus, we analyzed three different sets of adjectives, listed in figures 2-4. The first of them was selected from a similar corpus, contains 21 frequent and ambiguous words that all associate strongly with a particular noun (problem), and was analyzed in \[Hatzivassiloglou and McKeown, 1993\]. The second set (43 adjectives) was selected with the constraint that it contain high frequency adjectives (more than 1,000 occurrences in the 21 million word corpus). 
The third set (62 adjectives) satisfies the opposite constraint, containing adjectives of relatively low frequency (between 50 and 250). Figure 1 shows a typical grouping found by our system for the third set of adjectives, when the full corpus and all linguistic modules were used.</Paragraph> <Paragraph position="7"> These three sets of adjectives represent various characteristics of the adjective sets that the system may be called to cluster. First, they explicitly represent increasing sizes of the grouping problem. The second and third sets also contrast the independent frequencies of their member adjectives. Furthermore, we have found that the less frequent adjectives of the third set tend to be more specific than the more frequent ones. The human evaluators reported that the task of classification was easier for the third set, and their models exhibited about the same degree of agreement for the second and third sets although the third set is significantly larger. We plan to investigate the generality of this inverse correlation between frequency and specificity in the future.</Paragraph> <Paragraph position="8"> By including the parameters &quot;corpus size&quot; and &quot;adjective set&quot;, we have six parameters that we can vary in our experiments. Any remaining factors affecting the performance of our system are modeled as random noise, so statistical methods are used to evaluate the effects of the selected parameters. The six chosen parameters are completely orthogonal, with the exception that parameter &quot;negative knowledge&quot; must have the value &quot;not used&quot; when parameter &quot;extraction model&quot; has the value &quot;nouns in vicinity&quot;. In order to avoid introducing imbalance in our experiment, we constructed a complete designed experiment \[Hicks, 1973\] for all the (4x2-1)x2x2x4x3 = 336 valid combinations 6.</Paragraph> </Section> <Section position="7" start_page="48" end_page="50" type="metho"> <SectionTitle> 5. 
RESULTS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="48" end_page="48" type="sub_section"> <SectionTitle> 5.1 Average effect of each linguistic </SectionTitle> <Paragraph position="0"> parameter Space limitations do not allow us to present the scores for every one of the 336 individual experiments performed, corresponding to all valid combinations of the six modeled parameters. Instead we present several summary measures. We measured the effect of eachparticular setting of each linguistic parameter of Section 3 by averaging the scores obtained in all experiments where mat particular parameter had thatparticular value. In this way, Table 1 summarizes the differences in the performance of the system caused by each parameter. Because of the complete design of the experiment, each value in Table 1 is obtained in runs that are identical to the runs used for estimating the other values of the same parameter except for the difference in the parameter itself 7.</Paragraph> <Paragraph position="1"> Table I shows that there is indeed improve.ment with the introduction of any of the proposed linguistic teatures, or with the use of a lingnisticfilly more sophisticated extraction model. To assess the statistical significance of these differences, we compared each run for a particular value of a parameter to the corresponding identical valueo '(exceptflOr that parameter) run for a different u me parameter. Each pair of values for a paran~eter produces~ in this way a set of paired observations, on eacn of these sets, we performed a sign test \[Gibbons and Chakrahorti, 1992\] of the null hypoth~is that there is no real difference in the system s performance between the two values, i.e. that any observed difference is due to chance.</Paragraph> <Paragraph position="2"> We counted the number of times that the first of the two compared values led to superior performance relative to the second, distributing ties equally between the two cases. 
Under the null hypothesis, the number of times that the first value performs better follows the binomial distribution with parameter p=0.5. Table 2 gives the results of these tests along with the probabilities that the same or more extreme results would be encountered by chance. We can see from the table that all types of linguistic knowledge except spell-checking have a beneficial effect that is statistically significant at, or below, the 1% level.</Paragraph> <Paragraph position="3"> Statistical tests of the difference in performance offered by each linguistic feature.</Paragraph> </Section> <Section position="2" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 5.2 Comparison among the linguistic features </SectionTitle> <Paragraph position="0"> In order to measure the significance of the contribution of each linguistic feature relative to the other linguistic features, we fitted a linear regression model \[Draper and Smith, 1981\] to the data.</Paragraph> <Paragraph position="1"> We use the six parameters of our experiments as the predictors, and the measured F-score of the corresponding clustering as the response variable.</Paragraph> <Paragraph position="2"> In such a model the response Y is assumed to be a linear function of the predictors, i.e.</Paragraph> <Paragraph position="3"> Y = b0 + b1·X1 + b2·X2 + ... + bn·Xn (1) where Xi is the i-th predictor and bi is its corresponding weight 8. The weights found by the fitting process (Table 3) indicate by their absolute magnitude and sign how important each predictor is and whether it contributes positively or negatively to the final result. Numerical values such as 8Such a model is appropriate for comparative purposes, although extrapolating response values for prediction outside the range of predictor values used in the fitting may give incorrect results. 
For example, the coefficients in Table 3 cannot be used to predict the score when the corpus is significantly smaller than 0.33 Mbytes or larger than 21 Mbytes. the corpus size enter formula (1) directly as predictors, so Table 3 indicates that each additional megabyte of text increases the performance of the system by 0.9417 on the average.</Paragraph> <Paragraph position="4"> For binary features, the weights in Table 3 indicate the increase in the system's performance when the feature is present, so introduction of morphology improves the system's performance by 0.5371 on the average. For the categorical variables &quot;extraction model&quot; and &quot;adjective set&quot;, the weights show the change in score for the indicated value in contrast to the base case (minimal linguistic knowledge represented by extraction model &quot;nouns in vicinity&quot; and adjective set 1 respectively). For example, using the finite-state parser instead of the &quot;nouns in vicinity&quot; model improves</Paragraph> <Paragraph position="5"> the score by 7.5423 on the average, while going from adjective set 2 to adjective set 3 decreases the score by -(-2.5996-11.4882) = 14.0878 on the average. Finally the intercept b0 gives the baseline performance of a minimal system that uses the base case for each parameter; the effects of corpus size are to be added to this system.</Paragraph> <Paragraph position="6"> From Table 3 we can see that the data extraction model has a significant effect on the quality of the produced clustering, and among the linguistic parameters is the most important one. Increasing the size of the corpus also significantly increases the score. The adjective set that is clustered also has a major influence on the score, with rarer adjectives leading to worse clusterings. 
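The regression of equation (1) can be sketched with ordinary least squares as below. The design matrix and responses are toy values invented for illustration (not the paper's 336 runs or its Table 3 coefficients); binary features enter as 0/1 indicators and the corpus size as a numeric predictor, as in the paper.

```python
import numpy as np

# Columns of X: corpus size (millions of words), morphology (0/1),
# negative knowledge (0/1). y holds the corresponding F-scores.
X = np.array([
    [0.33, 0, 0],
    [1.0,  1, 0],
    [7.0,  0, 1],
    [21.0, 1, 1],
])
y = np.array([20.0, 25.0, 38.0, 52.0])

A = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # b0, b_size, b_morph, b_neg
b0, b_size, b_morph, b_neg = coef
```

Each fitted weight is then read exactly as in the text: b_size is the average score gain per additional unit of corpus, and b_morph the average gain from switching morphology on.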
The two linguistic features &quot;morphology&quot; and &quot;negative knowledge&quot; have less pronounced although still significant effects, while spell-checking offers minimal improvement that probably does not justify the effort of implementing the module and the cost of activating it at run-time.</Paragraph> </Section> <Section position="3" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 5.3 Overall effect of linguistic knowledge </SectionTitle> <Paragraph position="0"> Up to this point we have described averages of scores, taken over many combinations of features that are orthogonal to the one studied. These averages are good for describing the existence of a difference caused by the different values of each feature, across all possible combinations of the other features. They are not, however, representative of the performance of the system in a particular setting of parameters, nor are they suitable for describing the difference in features quantitatively, since they are averages taken over widely differing settings of the system's parameters. In particular, the inclusion of very small corpora drives the average scores down, as we have confirmed in a more detailed analysis where averages were computed separately for each value of the corpus size parameter. To give a feeling of how important the introduction of linguistic knowledge is quantitatively, we compare in Table 4 the results obtained for the full corpus of 21 million words for the two cases of having all or none of the linguistic components active. The scores obtained by a random system that produces partitions of the adjectives with no knowledge except the number of groups are included as a lower bound.</Paragraph> <Paragraph position="1"> These estimates are obtained after averaging the scores of 20,000 such random partitions for each adjective set. 
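The random lower bound can be sketched as follows. As a simplification, scoring here is pairwise precision/recall against a single model partition, whereas the paper's measures aggregate over 8 or 9 human models; the adjectives, model, and run count are toy values.

```python
import random
from itertools import combinations

def random_partition(items, k, rng):
    """Assign each item to one of k groups uniformly at random."""
    groups = [[] for _ in range(k)]
    for it in items:
        groups[rng.randrange(k)].append(it)
    return [g for g in groups if g]

def pair_set(groups):
    return {frozenset(p) for g in groups for p in combinations(g, 2)}

def f_score(system, model):
    """F-measure over co-membership pairs against a single model."""
    sp, mp = pair_set(system), pair_set(model)
    if not sp or not mp:
        return 0.0
    common = len(sp.intersection(mp))
    p, r = common / len(sp), common / len(mp)
    return 2 * p * r / (p + r) if p + r else 0.0

rng = random.Random(0)
adjs = ["deadly", "fatal", "clean", "dirty", "dry", "wet"]
model = [["deadly", "fatal"], ["clean", "dirty"], ["dry", "wet"]]
baseline = sum(f_score(random_partition(adjs, 3, rng), model)
               for _ in range(1000)) / 1000
```

Averaging over many random partitions, as the paper does with 20,000 runs, smooths out the luck of individual random groupings and yields a stable floor against which the system's scores can be judged.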
The average scores that each human model receives when compared to all the other human models are also included, since they provide an estimate of the maximum score that can be achieved by any system. That maximum depends on the disagreement between models for each adjective set. For these measurements we use a smaller smoothing window of size 3 instead of 5, which is fairer to the system when its performance is compared to the humans. We also give in Figure 5 the grouping produced by the system without using any of the linguistic modules for adjective set 3; this is to be contrasted with Figure 1.</Paragraph> </Section> </Section> <Section position="8" start_page="50" end_page="51" type="metho"> <SectionTitle> 6. GENERALIZING TO OTHER APPLICATIONS </SectionTitle> <Paragraph position="0"> In the previous section, we showed that the introduction of linguistic knowledge in our system produces a performance difference, which is not only statistically observable but also quantitatively significant (cf. Table 4). We believe that these positive results should also apply to other corpus-based NLP systems that employ statistical methods. Many of the linguistic components of our system, including the extraction model that was shown to be the most important linguistic parameter, are not specific to the word grouping problem. They can thus be directly incorporated in systems designed for other problems but essentially following the same basic architecture as ours.</Paragraph> <Paragraph position="1"> Many statistical approaches share the same basic methodology with our system: a set of words is preselected, related words are identified in a corpus, the frequencies of words and of pairs of related words are estimated, and a statistical model is used to make predictions for the original words.</Paragraph> <Paragraph position="2"> Across applications, there are differences in what words are selected, how related words are defined, and what kind of predictions is made. 
Nevertheless, the basic components stay the same.
Figure 5: Grouping produced by the system using no linguistic modules:
1. catastrophic harmful
2. dry wet
3. lenient rigid strict stringent
4. communist leftist
5. flexible hostile protracted unfriendly
6. abrupt chaotic disastrous gradual turbulent vigorous
7. affluent affordable inexpensive
For example, in our application the original words are</Paragraph> <Paragraph position="3"> the adjectives and the predictions are their groups: in machine translation, the predictions are the translations of the words in the source language text; in sense disambiguation, the predictions are the senses assigned to the words of interest; in part-of-speech tagging or in classification the predictions are the tags or classes assigned to each word. Because of this underlying similarity, the comparative analysis presented in the paper is relevant to all these problems.</Paragraph> <Paragraph position="4"> For a concrete example, we examine the case of collocation extraction that has been addressed with statistical methods in the past. Smadja \[1993\] describes a system that initially uses the &quot;nouns in vicinity&quot; extraction model to collect cooccurrence information about words, and then identifies collocations on the basis of distributional criteria. A later component filters the retrieved collocations, removing the ones where the participating words are not used consistently in the same syntactic relationship. This post-processing stage doubles the precision of the system. We believe that using from the start a more sophisticated extraction model to collect these pairs of related words will have similar positive effects. 
Other linguistic components, such as a morphology module that combines frequency counts, should also improve the performance of the system. In this way, we can benefit from linguistic knowledge without having to use a separate filtering process after expending the effort to collect the collocations.</Paragraph> <Paragraph position="5"> Similarly, the sense disambiguation problem is typically attacked by comparing the distribution of the neighbors of a word's occurrence to prototypical distributions associated with each of the word's senses \[Gale et al., 1992, Schütze, 1992\]. Usually, no explicit linguistic knowledge is used in defining these neighbors, which are taken as all words appearing within a window of fixed width centered at the word being disambiguated. Many words unrelated to the word of interest are collected in this way. In contrast, identifying appropriate word classes that can be expected on linguistic grounds to convey significant information about the original word should increase the performance of the disambiguation system. Such classes might be modified nouns for adjectives, nouns in a subject or object position for verbs, etc. As we have shown in Section 5, less but cleaner information increases the quality of the results.</Paragraph> <Paragraph position="6"> An interesting topic is the identification of parallels of our linguistic modules for these applications, at least for those modules which, unlike morphology, are not ubiquitous. Negative knowledge for example improves the performance of our system, supplementing the positive information provided by adjective-noun pairs. It could be useful for other systems as well if an appropriate application-dependent method of extracting such information is identified.</Paragraph> </Section> </Paper>