<?xml version="1.0" standalone="yes"?>
<Paper uid="J88-1003">
  <Title>GRAMMATICAL CATEGORY DISAMBIGUATION BY STATISTICAL OPTIMIZATION</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
GRAMMATICAL CATEGORY DISAMBIGUATION BY
STATISTICAL OPTIMIZATION
</SectionTitle>
    <Paragraph position="0"> Several algorithms have been developed in the past that attempt to resolve categorial ambiguities in natural language text without recourse to syntactic or semantic level information. An innovative method (called &amp;quot;CLAWS&amp;quot;) was recently developed by those working with the Lancaster -Oslo/Bergen Corpus of British English. This algorithm uses a systematic calculation based upon the probabilities of co-occurrence of particular tags. Its accuracy is high, but it is very slow, and it has been manually augmented in a number of ways. The effects upon accuracy of this manual augmentation are not individually known.</Paragraph>
    <Paragraph position="1"> The current paper presents an algorithm for disambiguation that is similar to CLAWS but that operates in linear rather than in exponential time and space, and which minimizes the unsystematic augments. Tests of the algorithm using the million words of the Brown Standard Corpus of English are reported; the overall accuracy is 96%. This algorithm can provide a fast and accurate front end to any parsing or natural language processing system for English.</Paragraph>
    <Paragraph position="2"> Every computer system that accepts natural language input must, if it is to derive adequate representations, decide upon the grammatical category of each input word. In English and many other languages, tokens are frequently ambiguous. They may represent lexical items of different categories, depending upon their syntactic and semantic context.</Paragraph>
    <Paragraph position="3"> Several algorithms have been developed that examine a prose text and decide upon one of the several possible categories for a given word. Our focus will be on algorithms which specifically address this task of disambiguation, and particularly on a new algorithm called VOLSUNGA, which avoids syntactic-level analysis, yields about 96% accuracy, and runs in far less time and space than previous attempts. The most recent previous algorithm runs in NP (Non-Polynomial) time, while VOLSUNGA runs in linear time. This is provably optimal; no improvements in the order of its execution time and space are possible. VOLSUNGA is also robust in cases of ungrammaticality.</Paragraph>
    <Paragraph position="4"> Improvements to this accuracy may be made, perhaps the most potentially significant being to include some higher-level information. With such additions, the accuracy of statistically-based algorithms will approach 100%; and the few remaining cases may be largely those with which humans also find difficulty.</Paragraph>
    <Paragraph position="5"> In subsequent sections we examine several disambiguation algorithms. Their techniques, accuracies, and efficiencies are analyzed. After presenting the research carried out to date, a discussion of VOLSUNGA's application to the Brown Corpus will follow. The Brown Corpus, described in Kucera and Francis (1967), is a collection of 500 carefully distributed samples of English text, totalling just over one million words. It has been used as a standard sample in many studies of English. Generous advice, encouragement, and assistance from Henry Kucera and W. Nelson Francis in this research is gratefully acknowledged.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 PREVIOUS DISAMBIGUATION ALGORITHMS
</SectionTitle>
    <Paragraph position="0"> The problem of lexical category ambiguity has been little examined in the literature of computational linguistics and artificial intelligence, though it pervades English to an astonishing degree. About 11.5% of types (vocabulary), and over 40% of tokens (running words) in English prose are categorically ambiguous (as measured via the Brown Corpus). The vocabulary breaks down as shown in Table 1 (derived from Francis and Kucera (1982)).</Paragraph>
    <Paragraph position="1"> Copyright 1988 by the Association for Computational Linguistics. Permission to copy without fee all or part of this material is granted provided that the copies are not made for direct commercial advantage and the CL reference and this copyright notice are included on the first page. To copy otherwise, or to republish, requires a fee and/or specific permission. 0362-613X/88/010031-39503.00 Computational Linguistics, Volume 14, Number 1, Winter 1988 31 Steven J. DeRose Grammatical Category Disambiguation by Statistical Optimization Number of words by degree of ambiguity:  A search of the relevant literature has revealed only three previous efforts directed specifically to this problem. The first published effort is that of Klein and Simmons (1963), a simple system using suffix lists and limited frame rules. The second approach to lexical category disambiguation is TAGGIT (Greene and Rubin (1971)), a system of several thousand context-frame rules. This algorithm was used to assign initial tags to the Brown Corpus. Third is the CLAWS system developed to tag the Lancaster -Oslo/Bergen (or LOB) Corpus. This is a corpus of British written English, parallel to the Brown Corpus. Parsing systems always encounter the problem of category ambiguity; but usually the focus of such systems is at other levels, making their responses less relevant for our purposes here.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 KLEIN AND SIMMONS
</SectionTitle>
      <Paragraph position="0"> Klein and Simmons (1963) describe a method directed primarily towards the task of initial categorial tagging rather than disambiguation. Its primary goal is avoiding &amp;quot;the labor of constructing a very large dictionary&amp;quot; (p. 335); a consideration of greater import then than now.</Paragraph>
      <Paragraph position="1"> The Klein and Simmons algorithm uses a palette of 30 categories, and claims an accuracy of 90% in tagging. The algorithm first seeks each word in dictionaries of about 400 function words, and of about 1500 words which &amp;quot;are exceptions to the computational rules used&amp;quot; (p. 339). The program then checks for suffixes and special characters as clues.</Paragraph>
      <Paragraph position="2"> Last of all, context frame tests are applied. These work on scopes bounded by unambiguous words, as do later algorithms. However, Klein and Simmons impose an explicit limit of three ambiguous words in a row. For each such span of ambiguous words, the pair of unambiguous categories bounding it is mapped into a list. The list includes all known sequences of tags occurring between the particular bounding tags; all such sequences of the correct length become candidates. The program then matches the candidate sequences against the ambiguities remaining from earlier steps of the algorithm. When only one sequence is possible, disambiguation is successful.</Paragraph>
      <Paragraph position="3"> The samples used for calibration and testing were limited. First, Klein and Simmons (1963) performed &amp;quot;hand analysis of a sample \[size unspecified\] of Golden Book Encyclopedia text&amp;quot; (p. 342). Later, &amp;quot;\[w\]hen it was run on several pages from that encyclopedia, it correctly and unambiguously tagged slightly over 90% of the words&amp;quot; (p. 344). Further tests were run on small samples from the Encyclopedia Americana and from Scientific American.</Paragraph>
      <Paragraph position="4"> Klein and Simmons (1963) assert that &amp;quot;\[o\]riginal fears that sequences of four or more unidentified parts of speech would occur with great frequency were not substantiated in fact&amp;quot; (p. 3). This felicity, however, is an artifact. First, the relatively small set of categories reduces ambiguity. Second, a larger sample would reveal both (a) low-frequency ambiguities and (b) many long spans, as discussed below.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1.2 GREENE AND RUBIN (TAGGIT)
</SectionTitle>
    <Paragraph position="0"> Greene and Rubin (1971) developed TAGGIT for tagging the Brown Corpus. The palette of 86 tags that TAGGIT uses has, with some modifications, also been used in both CLAWS and VOLSUNGA. The rationale underlying the choice of tags is described on pages 3-21 of Greene and Rubin (1971). Francis and Kucera (1982) report that this algorithm correctly tagged approximately 77% of the million words in the Brown Corpus (the tagging was then completed by human post-editors). Although this accuracy is substantially lower than that reported by Klein and Simmons, it should be remembered that Greene and Rubin were the first to attempt so large and varied a sample.</Paragraph>
    <Paragraph position="1"> TAGGIT divides the task of category assignment into initial (potentially ambiguous) tagging, and disambiguation. Tagging is carried out as follows: first, the program consults an exception dictionary of about 3,000 words.</Paragraph>
    <Paragraph position="2"> Among other items, this contains all known closed-class words. It then handles various special cases, such as words with initial &amp;quot;$&amp;quot;, contractions, special symbols, and capitalized words. The word's ending is then checked against a suffix list of about 450 strings. The lists were derived from lexicostatistics of the Brown Corpus. If TAGGIT has not assigned some tag(s) after these several steps, &amp;quot;the word is tagged NN, VB, or JJ \[that is, as being three-ways ambiguous\], in order that the disambiguation routine may have something to work with&amp;quot; (Greene and Rubin (1971), p. 25).</Paragraph>
    <Paragraph position="3"> After tagging, TAGGIT applies a set of 3300 context frame rules. Each rule, when its context is satisfied, has the effect of deleting one or more candidates from the list of possible tags for one word. If the number of candidates is reduced to one, disambiguation is considered successful subject to human post-editing. Each rule can include a scope of up to two unambiguous words on each side of the ambiguous word to which the rule is being applied. This constraint was determined as follows: In order to create the original inventory of Context Frame Tests, a 900-sentence subset of the Brown University Corpus was tagged.., and its ambiguities were resolved manually; then a program was run 32 Computational Linguistics, Volume 14, Number 1, Winter 1988 Steven J. DeRose Grammatical Category Disambiguation by Statistical Optimization which produced and sorted all possible Context Frame Rules which would have been necessary to perform this disambiguation automatically. The rules generated were able to handle up to three consecutive ambiguous words preceded and followed by two non-ambiguous words \[a constraint similar to Klein and Simmons'\]. However, upon examination of these rules, it was found that a sequence of two or three ambiguities rarely occurred more than once in a given context. Consequently, a decision was made to examine only one ambiguity at a time with up to two unambiguously tagged words on either side. The first rules created were the results of informed intuition (Greene and Rubin (1972), p. 32).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.3 CLAWS
</SectionTitle>
      <Paragraph position="0"> Marshall (1983, p. 139) describes the LOB Corpus tagging algorithm, later named CLAWS (Booth (1985)), as &amp;quot;similar to those employed in the TAGGIT program&amp;quot;. The tag set used is very similar, but somewhat larger, at about 130 tags. The dictionary used is derived from the tagged Brown Corpus, rather than from the untagged. It contains 7000 rather than 3000 entries, and 700 rather than 450 suffixes. CLAWS treats plural, possessive, and hyphenated words as special cases for purposes of initial tagging.</Paragraph>
      <Paragraph position="1"> The LOB researchers began by using TAGGIT on parts of the LOB Corpus. They noticed that While less than 25% of TAGGIT's context frame rules are concerned with only the immediately preceding or succeeding word.., these rules were applied in about 80% of all attempts to apply rules. This relative overuse of minimally specified contexts indicated that exploitation of the relationship between successive tags, coupled with a mechanism that would be applied throughout a sequence of ambiguous words, would produce a more accurate and effective method of word disambiguation (Marshall (1983), p. 141).</Paragraph>
      <Paragraph position="2"> The main innovation of CLAWS is the use of a matrix of eolloeational probabilities, indicating the relative likelihood of co-occurrence of all ordered pairs of tags. This matrix can be mechanically derived from any pre-tagged corpus. CLAWS used &amp;quot;\[a\] large proportion of the Brown Corpus&amp;quot;, 200,000 words (Marshall (1983), pp.</Paragraph>
      <Paragraph position="3"> 141, 150).</Paragraph>
      <Paragraph position="4"> The ambiguities contained within a span of ambiguous words define a precise number of complete sets of mappings from words to individual tags. Each such assignment of tags is called a path. Each path is composed of a number of tag collocations, and each such collocation has a probability which may be obtained from the collocation matrix. One may thus approximate each path's probability by the product of the probabilities of all its collocations. Each path corresponds to a unique assignment of tags to all words within a span.</Paragraph>
      <Paragraph position="5"> The paths constitute a span network, and the path of maximal probability may be taken to contain the &amp;quot;best&amp;quot; tags.</Paragraph>
      <Paragraph position="6"> Marshall (1983) states that CLAWS &amp;quot;calculates the most probable sequence of tags, and in the majority of cases the correct tag for each individual word corresponds to the associated tag in the most probable sequence of tags&amp;quot; (p. 142). But a more detailed examination of the Pascal code for CLAWS revealed that CLAWS has a more complex definition of &amp;quot;most probable sequence&amp;quot; than one might expect. A probability called &amp;quot;SUMSUCCPROBS&amp;quot; is predicated of each word. SUMSUCCPROBS is calculated by looping through all tags for the words immediately preceding, at, and following a word; for each tag triple, an increment is added, defined by:</Paragraph>
      <Paragraph position="8"> GetSucc returns the collocational probability of a tag pair; Get3SeqFactor returns either 1, or a special value from the tag-triple list described below. DownGrade modifies the value of GetSucc in accordance with RTPs as described below.</Paragraph>
      <Paragraph position="9"> The CLAWS documentation describes SUMSUCCPROBS as &amp;quot;the total value of all relationships between the tags associated with this word and the tags associated with the next word... \[found by\] simulating all accesses to SUCCESSORS and ORDER2VALS which will be made .... &amp;quot; The probability of each node of the span network (or rather, tree) is then calculated in the following way as a tree representing all paths through which the span network is built:</Paragraph>
      <Paragraph position="11"> It appears that the goal is to make each tag's probability be the summed probability of all paths passing through it. At the final word of a span, pointers are followed back up the chosen path, and tags are chosen en route.</Paragraph>
      <Paragraph position="12"> We will see below that a simpler definition of optimal path is possible; nevertheless, there are several advantages of this general approach over previous ones. First, spans of unlimited length can be handled (subject to machine resources). Although earlier researchers (Klein and Simmons, Greene and Rubin) have suggested that spans of length over 5 are rare enough to be of little concern, this is not the case. The number of spans of a given length is a function of that length and the corpus size; so long spans may be obtained merely by examining more text. The total numbers of spans in the Brown Corpus, for each length from 3 to 19, are: 397111, 143447, 60224, 26515, 11409, 5128, 2161, 903, 382, 161, 58, 29, 14, 6, 1, 0, I. Graphing the logarithms Computational Linguistics, Volume 14, Number 1, Winter 1988 33 Steven J. DeRose Grammatical Category Disambiguation by Statistical Optimization of these quantities versus the span length for each, produces a near-perfect straight line.</Paragraph>
      <Paragraph position="13"> Second, a precise mathematical definition is possible for the fundamental idea of CLAWS. Whereas earlier efforts were based primarily on ad hoc or subjectively determined sets of rules and descriptions, and employed substantial exception dictionaries, this algorithm requires no human intervention for set-up; it is a systematic process.</Paragraph>
      <Paragraph position="14"> Third, the algorithm is quantitative and analog, rather than artificially discrete. The various tests and frames employed by earlier algorithms enforced absolute constraints on particular tags or collocations of tags. Here relative probabilities are weighed, and a series of very likely assignments can make possible a particular, a priori unlikely assignment with which they are associated.</Paragraph>
      <Paragraph position="15"> In addition to collocational probabilities, CLAWS also takes into account one other empirical quantity: Tags associated with words.., can be associated with a marker @ or %; @ indicates that the tag is infrequently the correct tag for the associated word(s) (less than 1 in 10 occasions), % indicates that it is highly improbable... (less than I in 100 occasions) .... The word disambiguation program currently uses these markers top devalue transition matrix values when retrieving a value from the matrix, @ results in the value being halved, % in the value being divided by eight (Marshall (1983), p. 149).</Paragraph>
      <Paragraph position="16"> Thus, the independent probability of each possible tag for a given word influences the choice of an optimal path. Such probabilities will be referred to as Relative Tag Probabilities, or RTPs.</Paragraph>
      <Paragraph position="17"> Other features have been added to the basic algorithm. For example, a good deal of suffix analysis is used in initial tagging. Also, the program filters its output, considering itself to have failed if the optimal tag assignment for a span is not &amp;quot;more than 90% probable&amp;quot;. In such cases it reorders tags rather than actually disambiguating. On long spans this criterion is effectively more stringent than on short spans. A more significant addition to the algorithm is that a number of tag triples associated with a scaling factor have been introduced which may either upgrade or downgrade values in the tree computed from the one-step matrix. For example, the triple \[1\] 'be' \[2\] adverb \[3\] past-tense-verb has been assigned a scaling factor which downgrades a sequence containing this triple compared with a competing sequence of \[1\] 'be' \[2\] adverb \[3\]-past-participle/adjective, on the basis that after a form of 'be', past participles and adjectives are more likely than a past tense verb (Marshall (1983), p. 146).</Paragraph>
      <Paragraph position="18"> A similar move was used near conjunctions, for which the words on either side, though separated, are more closely correlated to each other than either is to the conjunction itself (Marshall (1983), pp. 146-147).</Paragraph>
      <Paragraph position="19"> For example, a verb/noun ambiguity conjoined to a verb should probably be taken as a verb. Leech, Garside, and Atwell (1983, p. 23) describe &amp;quot;IDIOMTAG&amp;quot;, which is applied after initial tag assignment and before disambiguation. It was developed as a means of dealing with idiosyncratic word sequences which would otherwise cause difficulty for the automatic tagging .... for example, in order that is tagged as a single conjunction .... The Idiom Tagging Program... can look at any combination of words and tags, with or without intervening words. It can delete tags, add tags, or change the probability of tags. Although this program might seem to be an ad hoc device, it is worth bearing in mind that any fully automatic language analysis system has to come to terms with problems of lexical idiosyncrasy.</Paragraph>
      <Paragraph position="20"> IDIOMTAG also accounts for the fact that the probability of a verb being a past participle, and not simply past, is greater when the following word is &amp;quot;by&amp;quot;, as opposed to other prepositions. Certain cases of this sort may be soluble by making the collocational matrix distinguish classes of ambiguities---this question is being pursued. Approximately 1% of running text is tagged by IDIOMTAG (letter, G. N. Leech to Henry Kucera, June 7, 1985; letter, E. S. Atwell to Henry Kucera, June 20, 1985).</Paragraph>
      <Paragraph position="21"> Marshall notes the possibility of consulting a complete three-dimensional matrix of collocational probabilities. Such a matrix would map ordered triples of tags into the relative probability of occurrence of each such triple. Marshall points out that such a table would be too large for its probable usefulness. The author has produced a table based upon more than 85% of the Brown Corpus; it occupies about 2 megabytes (uncompressed).</Paragraph>
      <Paragraph position="22"> Also, the mean number of examples per triple is very low, thus decreasing accuracy.</Paragraph>
      <Paragraph position="23"> CLAWS has been applied to the entire LOB Corpus with an accuracy of &amp;quot;between 96% and 97%&amp;quot; (Booth (1985), p. 29). Without the idiom list, the algorithm was 94% accurate on a sample of 15,000 words (Marshall (1983)). Thus, the pre-processor tagging of 1% of all tokens resulted in a 3% change in accuracy; those particular assignments must therefore have had a substantial effect upon their context, resulting in changes of two other words for every one explicitly tagged.</Paragraph>
      <Paragraph position="24"> But CLAWS is time- and storage-inefficient in the extreme, and in some cases a fallback algorithm is employed to prevent running out of memory, as was discovered by examining the Pascal program code. How often the fallback is employed is not known, nor is it known what effect its use has on overall accuracy.</Paragraph>
      <Paragraph position="25"> Since CLAWS calculates the probability of every path, it operates in time and space proportional to the product of all the degrees of ambiguity of the words in the span. Thus, the time is exponential (and hence Non-Polynomial) in the span length. For the longest span in the Brown Corpus, of length 18, the number of paths examined would be 1,492,992.</Paragraph>
      <Paragraph position="26">  34 Computational Linguistics, Volume 14, Number 1, Winter 1988 Steven J. DeRose Grammatical Category Disambiguation by Statistical Optimization</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 THE LINEAR-TIME ALGORITHM (VOLSUNGA)
</SectionTitle>
    <Paragraph position="0"> The algorithm described here depends on a similar empirically-derived transitional probability matrix to that of CLAWS, and has a similar definition of &amp;quot;optimal path&amp;quot;. The tagset is larger than TAGGIT's, though smaller than CLAWS', containing 97 tags. The ultimate assignments of tags are much like those of CLAWS.</Paragraph>
    <Paragraph position="1"> However, it embodies several substantive changes.</Paragraph>
    <Paragraph position="2"> Those features that can be algorithmically defined have been used to the fullest extent. Other add-ons have been minimized. The major differences are outlined below.</Paragraph>
    <Paragraph position="3"> First, the optimal path is defined to be the one whose component collocations multiply out to the highest probability. The more complex definition applied by CLAWS, using the sum of all paths at each node of the network, is not used.</Paragraph>
    <Paragraph position="4"> Second, VOLSUNGA overcomes the Non-Polynomial complexity of CLAWS. Because of this change, it is never necessary to resort to a fallback algorithm, and the program is far smaller. Furthermore, testing the algorithm on extensive texts is not prohibitively costly. Third, VOLSUNGA implements Relative Tag Probabilities (RTPs) in a more quantitative manner, based upon counts from the Brown Corpus. Where CLAWS scales probabilities by 1/2 for RTP &lt; 0.1 (i.e., where less than 10% of the tokens for an ambiguous word are in the category in question), and by 1/8 for p &lt; 0.01, VOLSUNGA uses the RTP value itself as a factor in the equation which defines probability.</Paragraph>
    <Paragraph position="5"> Fourth, VOLSUNGA uses no tag triples and no idioms. Because of this, manually constructing special-case lists is not necessary. These methods are useful in certain cases, as the accuracy figures for CLAWS show; but the goal here was to measure the accuracy of a wholly algorithmic tagger on a standard corpus.</Paragraph>
    <Paragraph position="6"> Interestingly, if the introduction of idiom tagging were to make as much difference for VOLSUNGA as for CLAWS, we would have an accuracy of 99%. This would be an interesting extension. I believe that the reasons for VOLSUNGA's 96% accuracy without idiom tagging are (a) the change in definition of &amp;quot;optimal path&amp;quot;, and (b) the increased precision of RTPs. The difference in tag-set size may also be a factor; but most of the difficult cases are major class differences, such as noun versus verb, rather than the fine distinction which the CLAWS tag-set adds, such as several subtypes of proper noun. Ongoing research with VOLSUNGA may shed more light on the interaction of these factors.</Paragraph>
    <Paragraph position="7"> Last, the current version of VOLSUNGA is designed for use with a complete dictionary (as is the case when working with a known corpus). Thus, unknown words are handled in a rudimentary fashion. This problem has been repeatedly solved via affix analysis, as mentioned above, and is not of substantial interest here.</Paragraph>
    <Paragraph position="8"> Computational Linguistics, Volume 14, Number 1, Winter 1988</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="35" type="metho">
    <SectionTitle>
2.1 CHOICE OF THE OPTIMAL PATH
</SectionTitle>
    <Paragraph position="0"> Since the number of paths over a span is an exponential function of the span length, it may not be obvious how one can guarantee finding the best path, without examining an exponential number of paths (namely all of them). The insight making fast discovery of the optimal path possible is the use of a Dynamic Programming solution (Dano (1975), Dreyfus and Law (1977)).</Paragraph>
    <Paragraph position="1"> The two key ideas of Dynamic Programming have been characterized as &amp;quot;first, the recognition that a given 'whole problem' can be solved if the values of the best solutions of certain subproblems can be determined...; and secondly, the realization that if one starts at or near the end of the 'whole problem,' the subproblems are so simple as to have trivial solutions&amp;quot; (Dreyfus and Law (1977), p. 5). Dynamic Programming is closely related to the study of Graph Theory and of Network Optimization, and can lead to rapid solutions for otherwise intractable problems, given that those problems obey certain structural constraints. In this case, the constraints are indeed obeyed, and a linear-time solution is available.</Paragraph>
    <Paragraph position="2"> Consider a span of length n = 5, with the words in the path denoted by v, w, x, y, z. Assume that v and z are the unambiguous bounding words, and that the other three words are each three ways ambiguous. Subscripts will index the various tags for each word: w~ will denote the first tag in the set of possible tags for word w. Every path must contain vl and zl, since v and z are unambiguous. Now consider the partial spans beginning at v, and ending (respectively) at each of the four remaining words. The partial span network ending at w contains exactly three paths. One of these must be a portion of the optimal path for the entire span. So we save all three: one path to each tag under w. The probability of each path is the value found in the collocation matrix entry for its tag-pair, namely p(v,w i) for i ranging from one to three.</Paragraph>
    <Paragraph position="3"> Next, consider the three tags under word x. One of these tags must lie on the optimal path. Assume it is x~.</Paragraph>
    <Paragraph position="4"> Under this assumption, we have a complete span of length 3, for x is unambiguous. Only one of the paths to x I can be optimal. Therefore we can disambiguate v...</Paragraph>
    <Paragraph position="5"> w... x~ under this assumption, namely, as MAX (p(v,wi)* p(wi,Xl) ) for all w i.</Paragraph>
    <Paragraph position="6"> Now, of course, the assumption that x~ is on the optimal path is unacceptable. However, the key to VOLSUNGA is to notice that by making three such independent assumptions, namely for x~, x 2, and x 3, we exhaust all possible optimal paths. Only a path which optimally leads to one of x's tags can be part of the optimal path. Thus, when examining the partial span network ending at word y, we need only consider three possibly optimal paths, namely those leading to xt, x2, and x 3, and how those three combine with the tags of y.</Paragraph>
    <Paragraph position="7"> At most one of those three paths can lie along the optimal path to each tag of y; so we have 32, or 9,  comparisons. But only three paths will survive, namely, the optimal path to each of the three tags under y. Each of those three is then considered as a potential path to z, and one is chosen.</Paragraph>
    <Paragraph position="8"> This reduces the algorithm from exponential complexity to linear. The number of paths retained at any stage is the same as the degree of ambiguity at that stage; and this value is bounded by a very small value established by independent facts about the English lexicon. No faster order of speed is po,;sible if each word is to be considered at all.</Paragraph>
    <Section position="1" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
2.2. PROCESSING A SAMPLE SPAN
</SectionTitle>
      <Paragraph position="0"> As an example, we will consider the process by which VOLSUNGA would tag &amp;quot;The man still saw her&amp;quot;. We will omit a few ambiguities, reducing the number of paths to 24 for ease of exposition. The tags for each word are shown in Table 2. The notation is fairly mnemonic, but it is worth clarifying that PPO indicates a n objective personal pronoun, and PP$ the possessive thereof, while VBD is a past-tense verb.</Paragraph>
      <Paragraph position="1"> Examples of the various collocational probabilities are illustrated in Table 3 (VOLSUNGA does not actually consider any collocation truly impossible, so zeros are raised to a minimal non-zero value when loaded).</Paragraph>
      <Paragraph position="2"> The product of 1&amp;quot;2&amp;quot;3&amp;quot;2&amp;quot;2&amp;quot;I ambiguities gives 24 paths through this span. In this case, a simple process of choosing the best successor for each word in order would produce the correct tagging (AT NN RB VBD PPO). But of course this is often not the case.</Paragraph>
      <Paragraph position="3"> Using VOLSUNGA's method we would first stack &amp;quot;the&amp;quot;, with certainty for the tag AT (we will denote this by &amp;quot;p(the-AT) = CERTAIN)&amp;quot;). Next we stack &amp;quot;man&amp;quot;, and look up the collocational probabilities of all tag pairs between the two words at the top of the stack. In this case they will be p(AT, NN) = 186, and p(AT, VB) = 1. We save the best (in this case only) path to each of man-NN and man-VB. It is sufficient to save a pointer to the tag of &amp;quot;the&amp;quot; which ends each of these paths,  making backward-linked lists (which, in this case, converge). null Now we stack &amp;quot;still&amp;quot;. For each of its tags (NN, VB, and RB), we choose either the NN or the VB tag of &amp;quot;man&amp;quot; as better, p(still-NN) is the best of:</Paragraph>
      <Paragraph position="5"> Thus, the best path to still-NN is AT NN NN.</Paragraph>
      <Paragraph position="6"> Similarly, we find that the best path to still-RB is AT NN RB, and the best path to still-VB is AT NN RB.</Paragraph>
      <Paragraph position="7"> This shows the (realistically) overwhelming effect of an article on disambiguating an immediately following noun/verb ambiguity.</Paragraph>
      <Paragraph position="8"> At this point, only the optimal path to each of the tags for &amp;quot;still&amp;quot; is saved. We then go on to match each of those paths with each of the tags for &amp;quot;saw&amp;quot;, discovering the optimal paths to saw-NN and to saw-VB. The next iteration reveals the optimal paths to her-PPO and her-PP$, and the final one picks the optimal path to the period, which this example treats as unambiguous. Now we have the best path between two certain tags (AT and .), and can merely pop the stack, following pointers to optimal predecessors to disambiguate the sequence. The period becomes the start of the next span.</Paragraph>
    </Section>
    <Section position="2" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
2.3 RELATIVE TAG PROBABILITIES
</SectionTitle>
      <Paragraph position="0"> Initial testing of the algorithm used only transitional probability information. RTPs had no effect upon choosing an optimal path. For example, in deciding whether to consider the word &amp;quot;time&amp;quot; to be a noun or a verb, environments such as a preceding article or proper noun, or a following verb or pronoun, were the sole criteria. The fact that &amp;quot;time&amp;quot; is almost always a noun (1901 instances in the Brown Corpus) rather than a verb (16 instances) was not considered. Accuracy averaged 92-93%, with a peak of 93.7%.</Paragraph>
      <Paragraph position="1"> There are clear examples for which the use of RTPs is important. One such case which arises in the Brown Corpus is &amp;quot;so that&amp;quot;. &amp;quot;So&amp;quot; occurs 932 times as a qualifier (QL), 479 times as a subordinating conjunction (CS), and once as an interjection (UH). The standard tagging for &amp;quot;so that&amp;quot; is &amp;quot;CS CS&amp;quot;, but this is an extremely low-frequency collocation, lower than the alternative &amp;quot;UH CS&amp;quot; (which is mainly limited to fiction). Barring strong contextual counter-evidence, &amp;quot;UH CS&amp;quot; is the preferred assignment if RTP information is not used. By weighing the RTPs for &amp;quot;so&amp;quot;, however, the &amp;quot;UH&amp;quot; assignment can be avoided.</Paragraph>
      <Paragraph position="2"> The LOB Corpus would (via idiom tagging) use &amp;quot;CS CS&amp;quot; in this case, employing a special &amp;quot;ditto tag&amp;quot; to indicate that two separate orthographic words constitute (at least for tagging purposes) a single syntactic word. Another example would be &amp;quot;so as to&amp;quot;, tagged 'TO TO TO&amp;quot;. Blackwell comments that &amp;quot;it was difficult to know where to draw the line in defining what constituted an idiom, and some such decisions seemed 36 Computational Linguistics, Volume 14, Number 1, Winter 1988 Steven J. DeRose Grammatical Category Disambiguation by Statistical Optimization to have been influenced by semantic factors. Nonetheless, IDIOMTAG had played a significant part in increasing the accuracy of the Tagging Suite \[i.e., CLAWS\]...&amp;quot; (Blackwell (1985), p. 7). It may be better to treat this class of &amp;quot;idioms&amp;quot; as lexical items which happen to contain blanks; but RTPs permit correct tagging in some of these cases.</Paragraph>
      <Paragraph position="3"> The main difficulty in using RTPs is determining how heavily to weigh them relative to collocational information. At first, VOLSUNGA multiplied raw relative frequencies into the path probability calculations; but the ratios were so high in some cases as to totally swamp collocational data. Thus, normalization is required. The present solution is a simple one; all ratios over a fixed limit are truncated to that limit. Implementing RTPs increased accuracy by approximately 4%, to the range 95-97%, with a peak of 97.5% on one small sample. Thus, about half of the residual errors were eliminated. It is likely that tuning the normalization would improve this figure slightly more.</Paragraph>
    </Section>
    <Section position="3" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
2.4 LEARNABILITY
</SectionTitle>
      <Paragraph position="0"> VOLSUNGA was not designed with psychological reality as a goal, though it has some plausible characteristics. We will consider a few of these briefly. This section should not be interpreted as more than suggestive. null First, consider dictionary learning; the program currently assumes that a full dictionary is available. This assumption is nearly true for mature language users, but humans have little trouble even with novel lexical items, and generally speak of &amp;quot;context&amp;quot; when asked to describe how they figure out such words. As Ryder and Walker (1982) note, the use of structural analysis based on contextual clues allows speakers to compute syntactic structures even for a text such as Jabberwocky, where lexical information is clearly insufficient. The immediate syntactic context severely restricts the likely choices for the grammatical category of each neologism.</Paragraph>
      <Paragraph position="1"> VOLSUNGA can perform much the same task via a minor modification, even if a suffix analysis fails. The most obvious solution is simply to assign all tags to the unknown word and find the optimal path through the containing span as usual. Since the algorithm is fast, this is not prohibitive. Better, one can assign only those tags with a non-minimal probability of being adjacent to the possible tags of neighboring words. Precisely calculating the mean number of tags remaining under this approach is left as a question for further research, but the number is certainly very low. About 3900 of the 9409 theoretically possible tag pairs occur in the Brown Corpus. Also, all tags marking closed classes (about two-thirds of all tags) may be eliminated from consideration. null Also, since VOLSUNGA operates from left to right, it can always decide upon an optimum partial result, and can predict a set of probable successors. For these reasons, it is largely robust against ungrammaticality.</Paragraph>
      <Paragraph position="2"> Shannon (1951) performed experiments of a similar sort, asking human subjects to predict the next character of a partially presented sentence. The accuracy of their predictions increased with the length of the sentence fragment presented.</Paragraph>
      <Paragraph position="3"> The fact that VOLSUNGA requires a great deal of persistent memory for its dictionary, yet very little temporary space for processing, is appropriate. By contrast, the space requirements of CLAWS would overtax the short-term memory of any language user.</Paragraph>
      <Paragraph position="4"> Another advantage of VOLSUNGA is that it requires little inherent linguistic knowledge. Probabilities may be acquired simply through counting instances of collocation. The results will increase in accuracy as more input text is seen. Previous algorithms, on the other hand, have included extensive manually generated lists of rules or exceptions.</Paragraph>
      <Paragraph position="5"> An obvious difference between VOLSUNGA and humans is that VOLSUNGA makes no use whatsoever of semantic information. No account is taken of the high probability that in a text about carpentry, &amp;quot;saw&amp;quot; is more likely a noun than in other types of text. There may also be genre and topic-dependent influences upon the frequencies of various syntactic, and hence categorial, structures. Before such factors can be incorporated into VOLSUNGA, however, more complete dictionaries, including semantic information of at least a rudimentary kind, must be available.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="35" end_page="35" type="metho">
    <SectionTitle>
3 ACCURACY ANALYSIS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
3.1 CALIBRATION
</SectionTitle>
      <Paragraph position="0"> VOLSUNGA requires a tagged corpus upon which to base its tables of probabilities. The calculation of transitional probabilities is described by Marshall (1983).</Paragraph>
      <Paragraph position="1"> The entire Brown Corpus (modified by the expansion of contracted forms) was analyzed in order to produce the tables used in VOLSUNGA. A complete dictionary was therefore available when running the program on that same corpus.</Paragraph>
      <Paragraph position="2"> Since the statistics comprising the dictionary and probability matrix used by the program were derived from the same corpus analyzed, the results may be considered optimal. On the other hand, the Corpus is comprehensive enough so that use of other input text is unlikely to introduce statistically significant changes in the program's performance. This is especially true because many of the unknown words would be (a) capitalized proper names, for which tag assignment is trivial modulo a small percentage at sentence boundaries, or (b) regular formations from existing words, which are readily identified by suffixes. Greene and Rubin (1971) note that their suffix list &amp;quot;consists mainly of Romance endings which are the source of continuing additions to the language&amp;quot; (p. 41).</Paragraph>
      <Paragraph position="3"> A natural relationship exists between the size of a dictionary, and the percentage of words in an average text which it accounts for. A complete table showing the Computational Linguistics, Volume 14, Number 1, Winter 1988 37  vocabulary items occur at least &amp;quot;Freq Limit&amp;quot; times in the Corpus. The &amp;quot;#Tokens&amp;quot; column shows how many tokens are accounted for by those types, and the &amp;quot;%Tokens&amp;quot; column converts this number to a percentage. (See also pp. 358-362 in the same volume for several related graphs.)</Paragraph>
    </Section>
    <Section position="2" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
3.2 OVERALL ACCURACY
</SectionTitle>
      <Paragraph position="0"> Table 5 lists the accuracy for each genre from the Brown Corpus. The total token count differs from Table 4 due to inclusion of non-lexical tokens, such as punctuation. The figure shown deducts from the error count  those particular instances in which the Corpus tag indicates by an affix that the word is part of a headline, title, etc. Since the syntax of such structures is often deviant, such errors are less significant. The difference this makes ranges from 0.09% (Genre L), up to 0.64% (Genre A), with an unweighted mean of 0.31%. Detailed breakdowns of the particular errors made for each genre exist in machine-readable form.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>