<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0111"> <Title>EXPERIMENTS IN SYNTACTIC AND SEMANTIC CLASSIFICATION AND DISAMBIGUATION USING BOOTSTRAPPING*</Title> <Section position="4" start_page="117" end_page="118" type="metho"> <SectionTitle> THE CORPUS -- TECHNICAL, FOCUSED AND SMALL </SectionTitle> <Paragraph position="0"> In the Biological Knowledge Laboratory we are pursuing a number of projects to analyze, store and retrieve biological research papers, including working with full text and graphics (Futrelle, Kakadiaris, Alexander, Carriero, Nikolakis & Futrelle, 1992; Gauch & Futrelle, 1993). The work is focused on the biological field of bacterial chemotaxis. A biologist has selected approximately 1,700 documents representing all the work done in this field since its inception in 1965. Our study uses the titles of all these documents plus all the abstracts available for them. The resulting corpus contains 227,408 words with 13,309 distinct word forms, including 5,833 words of frequency 1. There are 1,686 titles plus 8,530 sentences in the corpus. The sentence identification algorithm requires two factors -- contiguous punctuation (&quot;.&quot;, &quot;!&quot;, or &quot;?&quot;) and capitalization of the following token. To eliminate abbreviations, the token prior to the punctuation must not be a single capital letter and the capitalized token after the punctuation may not itself be followed by a contiguous &quot;.&quot;.</Paragraph> <Paragraph position="1"> An example of a sentence from the corpus is, &quot;$pre2$ $pre1$ one of the open reading frames was translated into a protein with $pct$ amino acid identity to S. typhimurium FliI and $pct$ identity to the beta subunit of E. coli ATP synthase $pos1$ $pos2$&quot; The positional items $pre... and $pos...</Paragraph> <Paragraph position="2"> have been added to furnish explicit context for sentence initial and sentence final constituents. 
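As a rough illustration, the two-factor sentence boundary test described earlier in this section can be sketched in Python (not the Lisp used for the corpus itself). Whitespace tokenization and the function names are assumptions made for this sketch, not details of the original implementation.

```python
def is_sentence_end(tokens, i):
    """Apply the two-factor boundary rule to tokens[i] (a sketch only)."""
    tok = tokens[i]
    # Factor 1: the token must end in contiguous ".", "!" or "?"
    if i + 1 >= len(tokens) or tok[-1] not in ".!?":
        return False
    prior = tok[:-1]
    following = tokens[i + 1]
    # Abbreviation guard: a single capital letter before the punctuation
    if len(prior) == 1 and prior.isupper():
        return False
    # Factor 2: the following token must be capitalized
    if not following[0].isupper():
        return False
    # Abbreviation guard: the capitalized token must not itself end in "."
    if following.endswith("."):
        return False
    return True


def split_sentences(text):
    """Split whitespace-tokenized text at the boundaries found above."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if is_sentence_end(tokens, i):
            sentences.append(" ".join(current))
            current = []
    if current:  # whatever remains is the final sentence
        sentences.append(" ".join(current))
    return sentences


sample = "The protein resembles S. typhimurium FliI. It was cloned in E. coli strains."
sentences = split_sentences(sample)
```

In this toy input the single-capital guard keeps "S." and "E." from ending a sentence, so `sentences` holds two items. The rule is deliberately conservative: a sentence that begins with an abbreviation such as "E. coli" is not split from the one before it.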
Numbers have been converted to three forms corresponding to integers, reals and percentages ($pct$ in the example above). The machine-readable version of the corpus uses double quoted items to ease processing by Lisp, our language of choice.</Paragraph> <Paragraph position="3"> The terminology we will use for describing words is as follows: * Target word: A word to be classified.</Paragraph> <Paragraph position="4"> * Context words: Words appearing within some distance of a target word, as in &quot;The big ~ cat on the mat...&quot;.</Paragraph> <Paragraph position="5"> * Word class: Any defined set of word forms or labeled instances.</Paragraph> <Paragraph position="6"> * Simset: A word class in which each item, an expansion word, has a similarity greater than some chosen cutoff to a single base word.</Paragraph> <Paragraph position="7"> * Labeled instances: Forms such as &quot;cloned48&quot; or &quot;cloned73VBN&quot;, that would replace an occurrence of &quot;cloned&quot;.</Paragraph> </Section> <Section position="5" start_page="118" end_page="118" type="metho"> <SectionTitle> DESCRIBING AND QUANTIFYING WORD CONTEXTS </SectionTitle> <Paragraph position="0"> In these experiments, the context of a target word is described by the preceding two context words and the following two context words (Figure 1). Each position is represented by a 150-element vector corresponding to the occurrence of the 150 highest frequency words in the corpus, giving a 600-dimensional vector describing the four-word context. Initially, the counts from all instances of a target word form w are summed so that the entry in the corresponding context word position in the vector is the sum of the occurrences of that context word in that position for the corresponding target word form; it is the joint frequency of the context word. 
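The summed positional counts just described can be sketched in Python as follows. This is a minimal illustration only: the function name, the toy token list, and the three-word stand-in for the 150 highest frequency words are all hypothetical.

```python
OFFSETS = (-2, -1, 1, 2)  # the -2C, -1C, +1C and +2C context positions


def context_counts(tokens, target, hf_words):
    """Sum, over every occurrence of `target`, how often each
    high-frequency word appears at each of the four context positions."""
    index = {w: i for i, w in enumerate(hf_words)}
    size = len(hf_words)
    vec = [0] * (len(OFFSETS) * size)  # one block of `size` entries per position
    for pos, tok in enumerate(tokens):
        if tok != target:
            continue
        for block, off in enumerate(OFFSETS):
            j = pos + off
            # Skip positions that fall outside the text or hold a rarer word
            if j >= 0 and len(tokens) > j and tokens[j] in index:
                vec[block * size + index[tokens[j]]] += 1
    return vec


tokens = "the gene and the gene of the operon".split()
hf_words = ["the", "of", "and"]  # toy stand-in for the 150 most frequent words
vec = context_counts(tokens, "gene", hf_words)
# vec[3] is the entry for "the" in the -1C block; it is 2 in this toy text
```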
For example, if the word &quot;the&quot; immediately precedes 10 occurrences of the word &quot;gene&quot; in the corpus then the element corresponding to &quot;the&quot; in the -1C context vector of &quot;gene&quot; is set to 10.</Paragraph> <Paragraph position="1"> Subsequently, a 600-dimensional vector of mutual information values, MI, is computed from the frequencies as follows, $$MI(cw) = \log_2\left(\frac{N f_{cw}}{f_c f_w} + 1\right)$$ This expresses the mutual information value for the context word c appearing with the target word w. The mutual information is large whenever a context word appears at a much higher frequency, fcw, in the neighborhood of a target word than would be predicted from the overall frequencies in the corpus, fc and fw. The formula adds 1 to the frequency ratio, so that a 0 (zero) occurrence corresponds to 0 mutual information. A possibly better strategy (Church, Gale, Hanks & Hindle, 1991) is capable of generating negative mutual information for the non-occurrence or low-frequency occurrence of a very high-frequency word and has the form,</Paragraph> <Paragraph position="3"> In any case, some smoothing is necessary to prevent the mutual information from diverging when fcw = 0.</Paragraph> </Section> <Section position="6" start_page="118" end_page="119" type="metho"> <SectionTitle> SIMILARITY, CLUSTERING AND CLASSIFICATION IN WORD SPACE </SectionTitle> <Paragraph position="0"> When the mutual information vectors are computed for a number of words, they can be compared to see which words have similar contexts. The comparison we chose is the inner product, or cosine measure, which can vary between -1.0 and +1.0 (Myaeng & Li, 1992). Once this similarity is computed for all word pairs in a set, various techniques can be used to identify classes of similar words. The method we chose is hierarchical agglomerative clustering (Jain & Dubes, 1988). 
The two words with the highest similarity are first joined into a two-word cluster.</Paragraph> <Paragraph position="1"> Figure 1. The 600-dimensional context vector around a target word W (the word to be classified, with context positions -2C -1C W +1C +2C). Each subvector describes the frequency and mutual information of the occurrences of the 150 highest frequency words, HFC, in the corpus.</Paragraph> <Paragraph position="2"> A mutual information vector for the cluster is computed and the cluster and remaining words are again compared, choosing the most similar to join, and so on.</Paragraph> <Paragraph position="3"> (To compute the new mutual information vector, the context frequencies in the vectors for the two words or clusters joined at each step are summed, element-wise.) In this way, a binary tree is constructed with words at the leaves leading to a single root covering all words. Each cluster, a node in the binary tree, is described by an integer denoting its position in the sequence of cluster formation, the total number of words, the similarity of the two children that make it up, and its member words. Here, for example, are the first 15 clusters from the analysis described in [reference and cluster listing lost in extraction]. In this sample it is clear that clusters are sometimes formed by the pairing of two individual words, sometimes by pairing one word and a previous cluster, and sometimes by combining two already formed clusters.</Paragraph> <Paragraph position="4"> In normal tagging, a word is viewed as a member of one of a small number of classes.</Paragraph> <Paragraph position="5"> In the classification approach we are using, there can be thousands of classes, from pairs of words up to the root node which contains all words in a single class. Thus, every class generated is viewed extensionally; it is a structured collection of occurrences in the corpus, with their attendant frequencies and contexts. 
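The agglomerative procedure described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the original Lisp implementation: it reads the mutual information formula as MI(cw) = log2(N * fcw / (fc * fw) + 1), takes N to be the corpus size in words, and sums the target frequencies of merged clusters, none of which the text states outright. All function names and the toy frequencies are hypothetical.

```python
import math
from itertools import combinations


def mi_vector(counts, f_c, f_w, n):
    """Assumed form: MI(cw) = log2(N * f_cw / (f_c * f_w) + 1) per element."""
    return [math.log2(n * f_cw / (fc * f_w) + 1.0) for f_cw, fc in zip(counts, f_c)]


def cosine(u, v):
    """Cosine of two vectors. MI entries are non-negative under the assumed
    formula, so this sketch only ever returns values in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def agglomerate(words, f_c, n):
    """words maps each word to (target frequency, summed context counts).
    Returns the merge history as (cluster_id, similarity, members) tuples."""
    active = {(w,): (f_w, list(c)) for w, (f_w, c) in words.items()}
    merges = []
    while len(active) > 1:
        # Recompute MI vectors for every active word or cluster
        mi = {k: mi_vector(c, f_c, f_w, n) for k, (f_w, c) in active.items()}
        # Join the most similar pair
        a, b = max(combinations(sorted(active), 2),
                   key=lambda p: cosine(mi[p[0]], mi[p[1]]))
        sim = cosine(mi[a], mi[b])
        f_a, c_a = active.pop(a)
        f_b, c_b = active.pop(b)
        # Context frequencies of the two children are summed element-wise
        active[a + b] = (f_a + f_b, [x + y for x, y in zip(c_a, c_b)])
        merges.append((len(merges), sim, a + b))
    return merges


f_c = [50, 40, 30]  # toy corpus frequencies of three context words
words = {"high": (10, [5, 0, 1]), "low": (8, [4, 0, 1]), "gene": (20, [0, 9, 0])}
merges = agglomerate(words, f_c, n=1000)
# merges[0] joins "high" and "low"; merges[1] then adds "gene" to form the root
```

Recomputing every pairwise similarity at each step keeps the sketch short but is cubic in the number of words; a practical version would cache similarities between clusters that did not change.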
The classes so formed will reflect the particular word use in the corpus they are derived from.</Paragraph> </Section> <Section position="7" start_page="119" end_page="121" type="metho"> <SectionTitle> EXPERIMENT #1: CLASSIFICATION OF THE 1,000 HIGHEST FREQUENCY WORDS </SectionTitle> <Paragraph position="0"> The first experiment classified the 1,000 highest frequency words in the corpus, producing 999 clusters (0-998) during the process. $pre... and $pos... words were included in the context set, but not in the target set. Near the leaves, words clustered by syntax (part of speech) and by semantics.</Paragraph> <Paragraph position="1"> Later, larger clusters tended to contain words of the same syntactic class, but with less semantic homogeneity. In each example below, the words listed are the entire contents of the cluster mentioned. The most striking property of the clusters produced was the classification of words into coherent semantic fields. Grefenstette has pointed out (Grefenstette, 1992) that the Deese antonyms, such as &quot;large&quot; and &quot;small&quot; or &quot;hot&quot; and &quot;cold&quot;, show up commonly in these analyses. Our methods discovered entire graded fields, rather than just pairs of opposites. The following example shows a cluster of seventeen adjectives describing comparative quantity terms, cluster 756, similarity 0.28: decreased, effective, few, greater, high, higher, increased, large, less, low, lower, more, much, no, normal, reduced, short. Note that pairs such as &quot;high&quot; and &quot;higher&quot; and &quot;low&quot; and &quot;lower&quot; appear. &quot;No&quot;, meaning &quot;none&quot; in this collection, is located at one extreme. The somewhat marginal item, &quot;effective&quot;, entered the cluster late, at cluster 704. It appears in collocations, such as &quot;as effective as&quot; and &quot;effective than&quot;, in which the other terms also appear. 
Comparing the cluster to Roget's (Berrey, 1962) we find that all the items are in the Roget category Comparative Quantity except for &quot;effective&quot; and &quot;no&quot;. The cluster item &quot;large&quot; is not in this Roget category but the category does include &quot;big&quot;, &quot;huge&quot; and &quot;vast&quot;, so the omission is clearly an error in Roget's. With this correction, 88% (15/17) of the items are in the single Roget category.</Paragraph> <Paragraph position="2"> The classification of technical terms from genetics and biochemistry is of particular interest, because many of these terms do not appear in available dictionaries or thesauri.</Paragraph> <Paragraph position="3"> Cluster 374, similarity 0.37, contains these 18 items: che, cheA, cheB, cheR, cheY, cheZ, double, fla, flaA, flaB, flaE, H2, hag, mot, motB, tar, trg, tsr. All of these are abbreviations for specific bacterial mutations, except for &quot;double&quot;. Its appearance drives home the point that the classification depends entirely on usage. 20 of the 30 occurrences of &quot;double&quot; precede the words &quot;mutant&quot; or &quot;mutants&quot;, as do most of the other mutation terms in this cluster.</Paragraph> <Paragraph position="4"> Cluster 240, similarity 0.4, contains these terms: microscopy, electrophoresis, chromatography. Each of these is a noun describing a common technique used in experiments in this domain.</Paragraph> <Paragraph position="5"> The standard Linnean nomenclature of Genus followed by species, such as Escherichia coli, is reflected by cluster 414, which contains 22 species names, and cluster 510, which contains 9 genus names.</Paragraph> <Paragraph position="6"> In scientific research, the determination of causal factors and the discovery of essential elements are major goals. 
Here are six concepts in this semantic field comprising cluster 183, similarity 0.43: required, necessary, involved, responsible, essential, important. These terms are used almost interchangeably in our corpus, but they don't fare as well in Roget's because of anthropocentric attachments to concepts such as fame, duty and legal liability.</Paragraph> <Paragraph position="7"> Given the limited context and modest sized corpus, the classification algorithm is bound to make mistakes, though a study of the text concordance will always tell us why the algorithm failed in any specific case. For example, as the similarity drops to 0.24 at cluster 824 we see the adverb triple &quot;greatly&quot;, &quot;rapidly&quot; and &quot;almost&quot;. This is still acceptable, but by cluster 836 (similarity 0.24) we see the triple &quot;them&quot;, &quot;ring&quot;, &quot;rings&quot;. At the end there is only a single cluster, 998, which must include all words. It comes together stubbornly with a negative similarity of -0.51. One problem encountered in this work was that the later, larger clusters have less coherence than we would hope for, identifying an important research issue.</Paragraph> <Paragraph position="8"> Experiment #1 took 20 hours to run on a Symbolics XL1200.</Paragraph> <Paragraph position="9"> A fundamental problem is to devise decision procedures that will tell us which classes are semantically or syntactically homogeneous; procedures that tell us where to cut the tree. The examples shown earlier broke down soon after, when words or clusters which in our judgment were weakly related began to be added. We are exploring the numerous methods to refine clusters once formed as well as methods to validate clusters for homogeneity (Jain & Dubes, 1988). There are also resampling methods to validate clusters formed by top-down partitioning methods (Jain & Moreau, 1987). 
All of these methods are computationally demanding but they can result in criteria for when to stop clustering. On the other hand, we mustn't assume that word relations are so simple that we can legitimately insist on finding neatly separated clusters. Word relations may simply be too complex and graded for this ever to occur.</Paragraph> <Paragraph position="10"> The semantic fields we discovered were not confined to synonyms. To understand why this is the case, consider the sentences, &quot;The temperature is higher today.&quot; and, &quot;The temperature is lower today.&quot; There is no way to tell from the syntax which word to expect.</Paragraph> <Paragraph position="11"> The choice is dependent on the situation in the world; it represents data from the world. The utterances are informative for just that reason. Taking this reasoning a step further, information theory would suggest that for two contrasting words to be maximally informative, they should appear about equally often in discourse. This is borne out in our corpus (f_higher=58, f_lower=46) and in the Brown corpus (f_higher=147, f_lower=110). The same relations are found for many other contrasting pairs, with some bias towards &quot;positive&quot; terms. The most extreme &quot;positive&quot; bias in our corpus is f_possible=88, f_impossible=0; &quot;never say never&quot; seems to be the catchphrase here -- highly appropriate for the field of biology.</Paragraph> <Paragraph position="12"> Some of the chemical term clusters that were generated are interesting because they contain class terms such as &quot;sugar&quot; and &quot;ion&quot; along with specific members of the classes (hyponyms), such as &quot;maltose&quot; and &quot;Na+&quot;. Comparing these in our KWIC concordance suggests that there may be methodical techniques for identifying some of these generalization hierarchies using machine learning (supervised classification) (Futrelle & Gauch, 1993). 
For another discussion of attempts to generate generalization hierarchies, see (Myaeng & Li, 1992).</Paragraph> <Paragraph position="13"> As a corpus grows and new words appear, one way to classify them is to find their similarity to the N words for which context vectors have already been computed. This requires N comparisons. A more efficient method which would probably give the same result would be to successively compare the word to clusters in the tree, starting at the root. At each node, the child which is most similar to the unclassified word is followed.</Paragraph> <Paragraph position="14"> This is a logarithmic search technique for finding the best matching class, which takes only O(log2 N) steps when the tree is reasonably balanced. In such an approach, the hierarchical cluster is being used as a decision tree, a structure that has been much studied in the machine learning literature (Quinlan, 1993). This is an alternate view of the classification approach as the unsupervised learning of a decision tree.</Paragraph> </Section> <Section position="8" start_page="121" end_page="124" type="metho"> <SectionTitle> EXPERIMENT #2: DISAMBIGUATION OF -ED FORMS </SectionTitle> <Paragraph position="0"> The following experiment is interesting because it shows a specific use for the similarity computations. They are used here to increase the accuracy of term disambiguation, which means selecting the best tag or class for a potentially ambiguous word. Again, this is a bootstrap method; no prior tagging is needed to construct the classes. But if we do identify the tags for a few items by hand or by using a hand-tagged reference corpus, the tags for all the other items in a cluster can be assumed equal to those of the known items.</Paragraph> <Paragraph position="1"> The passive voice is used almost exclusively in the corpus, with some use of the editorial &quot;We&quot;. This results in a profusion of participles such as &quot;detected&quot;, &quot;sequenced&quot; and &quot;identified&quot;. 
But such -ed forms can also be simple past tense forms or adjectives. In addition, we identified their use in a postmodifying participle clause such as, &quot;... the value ~ from this measurement.&quot; Each one of the 88 instances of &quot;cloned&quot; and the 50 instances of &quot;deduced&quot; was hand tagged and given a unique ID. Then clustering was applied to the resulting collection, giving the result shown in Figure 2A. Experiments #2 and #3 took about 15 minutes each to run.</Paragraph> <Paragraph position="2"> The resultant clusters are somewhat complex. There are four tags and we have shown the top four clusters, but two of the clusters contain adjectives exclusively. The past participle and postmodifier occur together in the same cluster. (We studied the children of cluster 4, hoping to find better separation, but they are no better.) The scoring metric we chose was to associate each cluster with the items that were in the majority in the node and score all other items as errors. This is a good approximation to a situation in which a &quot;gold standard&quot; is available to classify the clusters by independent means, such as comparing the clusters to items from a pretagged reference corpus.</Paragraph> <Paragraph position="4"> There is a strong admixture of adjectives in cluster 2 and all the postmodifiers are confounded with the past participles in cluster 4. The total number of errors (minority classes in a cluster) is 23 for a success rate of (138-23)/138 = 83%.</Paragraph> <Paragraph position="5"> All minority members of a cluster are counted as errors. This leads to the 83% success rate quoted in the figure caption.</Paragraph> <Paragraph position="6"> The results shown in Figure 2A can be improved as follows. Because we are dealing with single occurrences, only one element, or possibly zero, in each of the four context word vectors is filled, with frequency 1. The other 149 elements have frequency (and mutual information) 0.0. 
These sparse vectors will therefore have little or no overlap with vectors from other occurrences. To improve the classification, we expanded the context values to produce more overlap, using the following strategy: We proceed as if the corpus is far larger so that in addition to the actual context words already seen, there are many occurrences of highly similar words in the same positions. For each non-zero context in each set of 150, we expand it to an ordered class of similar words in the 150, picking words above a fixed similarity threshold (0.3 for the experiments reported here). Such a class is called a simset, made up of a base word and a sequence of expansion words.</Paragraph> <Paragraph position="7"> As an example of the expansion of context words via simsets, suppose that the occurrence of the frequency 1 word &quot;cheA-cheB&quot; is immediately preceded by &quot;few&quot; and the occurrence of the frequency 1 word &quot;CheA/CheB&quot; is immediately preceded by &quot;less&quot;. The -1C context vectors for each will have 1's in different positions so there will be no overlap between them. If we expanded &quot;few&quot; into a large enough simset, the set would eventually contain &quot;less&quot;, and vice versa. Barring that, each simset might contain a distinct common word such as &quot;decreased&quot;. In either case, there would now be some overlap in the context vectors so that the similar use of &quot;cheA-cheB&quot; and &quot;CheA/CheB&quot; could be detected.</Paragraph> <Paragraph position="8"> The apparent frequency of each expansion word is based on its corpus frequency relative to the corpus frequency of the word being expanded. To expand a single context word instance ci appearing with frequency fik in the context of 1 or more occurrences of center word wk, choose all cj such that cj ∈ {set of high-frequency context words} and the similarity S(ci,cj) ≥ St, a threshold value. 
Set the apparent frequency of each expansion word cj to fjk = S(ci,cj) × fik × fj / fi, where fi and fj are the corpus frequencies of ci and cj.</Paragraph> <Paragraph position="9"> Normalize the total frequency of the context word plus the apparent frequencies of the expansion words to fik. For the example being discussed here, fik = 1, St = 0.3 and the average number of expansion words was 6.</Paragraph> <Paragraph position="10"> Recomputing the classification of the -ed forms with the expanded context words results in the improved classification shown in Figure 2B. The number of classification errors is halved, yielding a success rate of 92%. This is comparable in performance to many stochastic tagging algorithms.</Paragraph> <Paragraph position="11"> This analysis is very similar to part-of-speech tagging. The simsets of only 6 items are far smaller than the part-of-speech categories conventionally used. But since we use high frequency words, they represent a substantial portion of the instances. Also, they have higher specificity than, say, Verb.</Paragraph> <Paragraph position="12"> Many taggers work sequentially and depend on the left context. But some words are best classified by their right context. We supply both. Clearly this small experiment did not reach the accuracy of the very best taggers, but it performed well.</Paragraph> <Paragraph position="13"> This experiment has major ramifications for the future. The initial classifications found merged all identical word forms together, both as targets and contexts. But disambiguation techniques such as those in Experiment #2 can be used to differentially tag word occurrences with some degree of accuracy.</Paragraph> <Paragraph position="14"> These newly classified items can in turn be used as new target and context items (if their frequencies are adequate) and the analysis can be repeated. 
Iterating the method in this way should be able to refine the classes until a fixed point is reached at which no further improvement in classification occurs. The major challenge in using this approach will be to keep it computationally tractable. This approach is similar in spirit to the iterative computational approaches of Hidden Markov Models (Kupiec, 1989; Kupiec, 1992; Rabiner, 1989), though our zeroth order solution begins quite close to the desired result, so it should converge very close to a global optimum.</Paragraph> <Paragraph position="15"> postmodifying form, not isolated before, is fairly well isolated in its own subclass. The total number of errors is reduced from 23 to 11, for a success rate of 92%.</Paragraph> </Section> <Section position="9" start_page="124" end_page="124" type="metho"> <SectionTitle> EXPERIMENT #3: CLASSIFICATION OF SINGLE WORD OCCURRENCES </SectionTitle> <Paragraph position="0"> When classifying multiple instances of a single word form as we did in Experiment #2, there are numerous collocations that aid the classification. For example, 16 of the 50 occurrences of the word &quot;deduced&quot; occur in the phrase, &quot;of the deduced amino acid sequence&quot;.</Paragraph> <Paragraph position="1"> But with words of frequency 1, we cannot rely on such similarities. Nevertheless, we experimented with classifying 100 words of corpus frequency 1 with and without expanding the context words. Though hand scoring the results is difficult, we estimate that there were 8 reasonable pairs found initially and 26 pairs when expansion was used.</Paragraph> <Paragraph position="2"> Examples of words that paired well without expansion are &quot;overlaps&quot; and &quot;flank&quot; (due to a preceding &quot;which&quot;) and &quot;malB&quot; and &quot;cheA-cheB&quot; (due to the context &quot;...the [malB, cheA-cheB] region...&quot;). 
After expansion, pairs such as &quot;setting&quot;, &quot;resetting&quot; appeared (due in part to the expansion of the preceding &quot;as&quot; and &quot;to&quot; context words into simsets which both included &quot;with&quot;, &quot;in&quot; and &quot;by&quot;).</Paragraph> <Paragraph position="3"> The amount of information available about frequency 1 words can vary from a lot to nothing at all, and most frequently tends to the latter, viz., &quot;John and Mary looked at the blork.&quot; Nevertheless, such words are prominent, making up 44% of our corpus' vocabulary.</Paragraph> <Paragraph position="4"> About half of them are non-technical and can therefore be analyzed from other corpora or on-line dictionaries. Word morphology, and Latinate morphology in particular, can be helpful. Online chemical databases, supplemented with rules for chemical nomenclature, will clarify additional items, e.g., &quot;2-epoxypropylphosphonic&quot; or &quot;phosphoglucomutase-deficient&quot;.</Paragraph> <Paragraph position="5"> Furthermore, there are naming conventions for genetic strains and mutants which aid recognition. The combination of all these methods should lead to a reasonable accuracy in the classification of frequency 1 words.</Paragraph> </Section> <Section position="10" start_page="124" end_page="124" type="metho"> <SectionTitle> FURTHER DISCUSSION AND FUTURE DIRECTIONS </SectionTitle> <Paragraph position="0"> Our corpus of 220,000 words is much smaller than ones of 40 million words (Finch & Chater, 1992) and certainly of 360 million (Brown, Della Pietra, deSousa, Lai & Mercer, 1992). But judging by the results we have presented, especially for the full 1,000 word clustering, our corpus appears to make up in specificity for what it lacks in size. Extending this work beyond abstracts to full papers will be challenging because our corpus requires SGML markup to deal with Greek characters, superscripts and subscripts, etc. (Futrelle, Dunn, Ellis & Pescitelli, 1991). 
We have over 500,000 words from the bacterial chemotaxis research papers carefully marked up by hand in this way.</Paragraph> <Paragraph position="1"> The characterization of context can obviously be extended to more context positions or words, and extensions of our word-rooted expansion techniques are potentially very powerful, combining broad coverage with specificity in a &quot;tunable&quot; way. Morphology can be added to the context vectors by using the ingenious suggestion of Brill to collect high-frequency tri-letter word endings (Brill & Marcus, 1992).</Paragraph> <Paragraph position="2"> One of the more subtle problems of the context specification is that it uses summed frequencies, so it may fail to retain important correlations. Thus if only AB or CD sequences occurred, or only AD or CB sequences, they would lead to the same (summed) context vector. The only correlations faithfully retained are those with the target word.</Paragraph> <Paragraph position="3"> Characterizing context n-grams could help work around this problem, but is a non-trivial task.</Paragraph> </Section> </Paper>