<?xml version="1.0" standalone="yes"?> <Paper uid="A88-1005"> <Title>TWO SIMPLE PREDICTION ALGORITHMS TO FACILITATE TEXT PRODUCTION</Title> <Section position="4" start_page="0" end_page="36" type="metho"> <SectionTitle> MOTIVATION </SectionTitle> <Paragraph position="0"> One hundred English words account for 47 per cent of the Brown corpus (about one million words of American English text taken from a wide range of sources).</Paragraph> <Paragraph position="1"> It seems reasonable to suppose that a single individual might in fact require fewer words to account for a large proportion of generated text. From our work on the HWYE system it was known that 75 words accounted for half of all the text of Vanity Fair, a 300,000-word Victorian English novel by Thackeray (which incorporated a fairly involved syntax, much embedded quotation, and passages in dialect and in French) [English and Boggess, 1986]. We further found that 50 words accounted for half of all the verbiage in a 20,000-word set of sentences provided by an individual who collaborated with us. This latter corpus, called the Sherri data, is a set of texts provided by a speech-handicapped individual who uses a typewriter to communicate, even with her family; it is conversational in nature, as can be seen in Figure 1. Most of the work reported in this paper gives special attention to the set of words required to account for half of all the verbiage of a given individual. We refer to this set as the set of high-frequency words.</Paragraph> <Paragraph position="2"> You said something about a magazine that <name1> had about computers that I might like to borrow. I would some time.</Paragraph> <Paragraph position="3"> I think we have to pick up the children while <name2> is in the hospital.</Paragraph> <Paragraph position="4"> I want to visit her in the hospital.</Paragraph> <Paragraph position="5"> But you have to lift me up to the window for me to see the baby.</Paragraph> <Paragraph position="6"> Well, it's May first now. Help! I thought it would not be so busy but it looks like it might be now.</Paragraph> <Paragraph position="7"> It seems reasonable to suppose that for conversational English, approximately 50 words may account for half of the verbiage of most English users. From the standpoint of human factors, an argument could be made that one should simply put the 50 words up on the screen with the alphabet and thus be assured that half of all the words desired by the user were instantly available, in known locations that the user would quickly become accustomed to. Constantly changing menus introduce an element of user fatigue [Gibler and Childress, 1982]. That argument may especially make sense as larger screens with more lines per screen and more characters per line become more common.</Paragraph> <Paragraph position="8"> If we limit ourselves to the top 20 most frequent words as a constant menu, only about 30 per cent of the user's verbiage is accounted for. However, it was observed, while working with the HWYE system, that if one looked at the top 20 words for any given sentence position, one did not see the same set of words occurring. Clearly the high-frequency words (the set that comprises half of word use) are mildly sensitive to &quot;context&quot; even when &quot;context&quot; is so broadly defined as sentence position.</Paragraph> <Paragraph position="9"> Different subsets of the 50-member set of high-frequency words appear in the set of 20 most frequent words for a given sentence position.
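To make the notions of the high-frequency set and the per-position word lists concrete, the following minimal Python sketch (ours, not part of the original HWYE software; it assumes sentences have already been tokenized into lower-cased word lists) computes the smallest set of words covering half of all tokens, and the 20 most frequent words for each sentence position.

```python
from collections import Counter, defaultdict

def high_frequency_set(sentences, coverage=0.5):
    """Smallest set of words whose occurrences cover `coverage` of all tokens."""
    counts = Counter(w for sent in sentences for w in sent)
    total = sum(counts.values())
    covered, chosen = 0, []
    for word, n in counts.most_common():
        if covered / total >= coverage:
            break
        chosen.append(word)
        covered += n
    return chosen

def top_words_by_position(sentences, k=20):
    """Top-k most frequent words observed at each sentence position."""
    by_pos = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            by_pos[i][w] += 1
    return {i: [w for w, _ in c.most_common(k)] for i, c in by_pos.items()}

# Example usage with hypothetical tokenized sentences:
# sentences = [["i", "want", "to", "visit", "her", "in", "the", "hospital"], ...]
# hf = high_frequency_set(sentences)        # roughly 50 words for the Sherri data
# menus = top_words_by_position(sentences)  # per-position top-20 word lists
```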
Moreover, after processing approximately 2000 sentences from the user, it was still the case that some of the top 20 words for a given position were not members of the high-frequency set at all. For example, the word &quot;they&quot;, a member of the menu for the first sentence position (see Figure 2) and hence one of the 20 most frequent words to start a sentence, is not a member of the global high-frequency set. A preliminary analysis by English suggested that, whereas a constant &quot;prediction&quot; of the top 20 most frequent words would yield a success rate of 30 per cent, predicting the top 20 most frequent words per sentence position would yield a success rate of 40 per cent.</Paragraph> <Paragraph position="10"> &quot;CONTEXT&quot; AS SENTENCE POSITION The simplest scheme, which has been built as a prototype on an IBM PC with two floppy disk drives, presents the user with the top 20 most frequent words that the user has employed at whatever position in a sentence is current. For example, Figure 2 shows the screen presented to the user at the beginning of production of a sentence. On the left is a list of the 20 words which that particular user is known to have used most often to begin sentences. On the right is the alphabet, which is normally available to the user; and in other places on the screen are special functions. (Selection of words, letters and functions is made by mouse, though the actual selection mechanism is separated from the bulk of the code so that replacement with another selection mechanism should be relatively easy to implement.) The sentence is built at the bottom of the screen. If the user selects a word from the menu at the left, it is placed in first position in the sentence, and a second menu, consisting of the 20 most frequent words that the user has used in second place in a sentence, appears in the left portion of the screen. After a second word has been produced and added to the sentence, a third menu, consisting of the 20 most frequent words for that user in third place in a sentence, is offered, and so on.</Paragraph> <Paragraph position="11"> At any time the user may reject the left-hand menu by selecting a letter of the alphabet. Figure 3 shows the screen after the user has produced two words of a sentence and has begun to spell a third word by selecting the letter &quot;a&quot;. At this point, the top 20 most frequently used words beginning with &quot;a&quot; have been offered at the left. If the desired word is not in the list, the user continues by selecting the second letter of the desired word (in this case, &quot;n&quot;). The left-hand menu becomes the 20 most frequently used words beginning with the pair of letters given so far. As is shown in Figure 4, there are times when fewer than 20 words of a given two-letter starting combination have been encountered in the user's past history, in which case this algorithm offers a shortened list.</Paragraph> <Paragraph position="12"> In the case illustrated, the desired word was on the list. If it were not, the user would have had to spell out the entire word, and it would have been entered into the sentence.
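A minimal sketch of the menu logic just described (our reconstruction for illustration, not the actual IBM PC implementation; the class and method names are invented) might look as follows in Python.

```python
from collections import Counter, defaultdict

class PositionMenuPredictor:
    """Sketch of the sentence-position / initial-letters menu scheme."""

    def __init__(self, menu_size=20):
        self.menu_size = menu_size
        self.by_position = defaultdict(Counter)  # sentence position -> word counts
        self.by_prefix = defaultdict(Counter)    # spelled prefix    -> word counts

    def observe(self, sentence):
        """Update counts from a finished sentence (a list of words)."""
        for pos, word in enumerate(sentence):
            self.by_position[pos][word] += 1
            for i in range(1, len(word) + 1):
                self.by_prefix[word[:i]][word] += 1

    def menu(self, position, prefix=""):
        """Menu for the current word: by sentence position if nothing has been
        spelled yet, otherwise by the letters spelled so far.  The list may be
        shorter than menu_size if few known words share the prefix."""
        counts = self.by_position[position] if not prefix else self.by_prefix[prefix]
        return [w for w, _ in counts.most_common(self.menu_size)]

# p = PositionMenuPredictor()
# p.observe(["i", "want", "to", "visit", "her"])
# p.menu(0)            # top words used to start a sentence
# p.menu(2, prefix="a")  # top words beginning with "a" after two words produced
```

In this sketch the `observe` call after each completed sentence plays the role of the system's updating of frequency counts; how the real system stores and promotes words is described next.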
In either case, the system subsequently returns to offering the menu of most-frequently-used words for the fourth position, and continues in similar fashion to the end of the sentence.</Paragraph> <Paragraph position="13"> The system keeps track of how often a word has been used and how many times it has occurred in each position in a sentence, so that from time to time a word is promoted to one of the top 20 alphabetic or top 20 position-related sets of words. For details on the file organization scheme that allows this to be done in real time, see Wei [1987]. Details on the mouse-based implementation for IBM PCs are available in Chow [1986].</Paragraph> <Paragraph position="14"> A SECOND ALGORITHM An alternative predictive algorithm has been implemented which replaces the sentence-position-based first menu. It pays special attention to the 50 most frequently used words in the individual's vocabulary (the high-frequency words) and to the words most likely to follow them. By virtue of their frequency, these are precisely the words about which the most is known, with the greatest confidence, after a relatively small body of input such as a few thousand sentences.</Paragraph> <Paragraph position="15"> For each of the 50 high-frequency words, a list is kept of the top 20 most frequent words to follow that word. Let us call these the first-order followers. For each of the first-order followers, there is a list of second-order followers: words known to have followed the two-word sequence consisting of the high-frequency word and its first-order follower.</Paragraph> <Paragraph position="16"> For example, the word &quot;I&quot; is a high-frequency word. The first-order followers for &quot;I&quot; include the word &quot;would&quot;. The second-order followers for &quot;I would&quot; include the word &quot;like&quot;. (See Figure 5.) The second-order followers for &quot;I would&quot; also include many one-time-only followers, so the system maintains a threshold for the number of occurrences below which a word is not included in the list of second-order followers. The reasoning is that a word's having occurred only once in an environment that by definition occurs frequently may be taken as counter-evidence that the word should be predicted.</Paragraph> <Paragraph position="17"> Rather than predict a word with low reliability, one of two alternatives is taken. If the first-order follower is itself a high-frequency word, then low-reliability second-order followers may be replaced with the first-order follower's own followers. (&quot;Would&quot; is a first-order follower of &quot;I&quot; and is itself a high-frequency word. There are relatively few reliable second-order followers to &quot;would&quot; in the left context of &quot;I&quot;, so the list is augmented with first-order followers of &quot;would&quot; to round out a list of 20 words.) The other alternative, taken when the first-order follower is not a high-frequency word, is to fill out any short list of second-order words with the high-frequency words themselves.</Paragraph> <Paragraph position="20"> This algorithm is related to, but takes less memory and is less powerful than, a full-blown second-order Markov model.
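Before turning to the comparison with Markov models, here is a rough Python sketch of the first- and second-order follower scheme as we read the description above (the table layout, the threshold value, and the function names are our assumptions, not the authors' code).

```python
from collections import Counter, defaultdict

def build_follower_tables(sentences, hf_words):
    """First- and second-order follower counts for the high-frequency words."""
    first = defaultdict(Counter)    # hf word             -> next-word counts
    second = defaultdict(Counter)   # (hf word, follower) -> next-word counts
    for sent in sentences:
        for i, w in enumerate(sent[:-1]):
            if w in hf_words:
                first[w][sent[i + 1]] += 1
                if i + 2 < len(sent):
                    second[(w, sent[i + 1])][sent[i + 2]] += 1
    return first, second

def predict(prev2, prev1, first, second, hf_words, k=20, threshold=2):
    """Menu after the two most recent words, with the fallback rules described
    above: drop second-order followers seen fewer than `threshold` times; pad
    with first-order followers of prev1 if prev1 is itself high-frequency,
    otherwise pad with the high-frequency words themselves."""
    menu = [w for w, n in second[(prev2, prev1)].most_common(k) if n >= threshold]
    if len(menu) < k:
        pad = (w for w, _ in first[prev1].most_common()) if prev1 in hf_words else iter(hf_words)
        for w in pad:
            if len(menu) == k:
                break
            if w not in menu:
                menu.append(w)
    return menu

# first, second = build_follower_tables(sentences, set(hf))
# predict("i", "would", first, second, set(hf))   # e.g. includes "like"
```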
Each state in a second-order (trigram) Markov model is uniquely determined by the previous two inputs.</Paragraph> <Paragraph position="21"> For an input vocabulary of 2000 words, the number of mathematically possible states in a trigram Markov model is 4,000,000, with more than 8 billion arcs interconnecting the states. Fortunately, in the real world most of these mathematically possible states and arcs do not actually occur, but a trigram model for the real-world possibilities is still quite large.</Paragraph> <Paragraph position="22"> We experimented with abstracting the input vocabulary by restricting it to the 50 highest-frequency words plus the pseudo-input OTHER onto which all other words were mapped. When we did so, the number of states and arcs in the various-order Markov models was still fairly large for the real-world data [English and Boggess, 1986]. As Figure 6 shows, for example, the rate of growth for a fourth-order abstract Markov model (just the 50 highest-frequency words plus OTHER plus end-of-sentence) is in the neighborhood of 250 new states and 450 new arcs per 1000 new words of text, after 17,000 words of input. This was true for both the Sherri data (conversational English) and the more formal Thackeray data. Moreover, the fourth-order Markov model for the abstracted Thackeray data continued to grow. After 100,000 words of input, with a model of approximately 22,000 states and approximately 45,000 arcs, the rate of growth was still more than 1,000 states and 3,000 arcs per 10,000 words of input.</Paragraph> <Paragraph position="23"> For this particular implementation, however, neither a full-blown Markov model using the total vocabulary nor an abstract model using the 50-word vocabulary seemed appropriate. On the one hand, models of the entire vocabulary confirmed that many multiple-word sequences did occur regularly. Nevertheless, for any but the simplest-order Markov models (orders zero and one), the vast bulk of the networks was taken up by word combinations that occurred only once. On the other hand, restricting the predictive mechanism to only the high-frequency words obviously left out some of the regularly occurring word combinations. Our first- and second-follower algorithm described on the previous pages allows lower-frequency words to be predicted when they occur regularly in combination with high-frequency words.</Paragraph> </Section> <Section position="5" start_page="36" end_page="37" type="metho"> <SectionTitle> PREDICTIVE CAPABILITIES </SectionTitle> <Paragraph position="0"> The data used to test the predictive capabilities of the system were typescripts provided by the user, who was utilizing a manual typewriter; it follows that the results were not biased by the user's favoring sentence patterns that the system itself provided. The system had been given 1750 prior sentences produced by the user and the data collected were for the performance of the system over the next 97 sentences. The 1750 sentences were 14,669 words in length with a vocabulary of 1512 words. Twelve sentences of the 1750 were a single word in length (e.g. &quot;yeah&quot;, &quot;no&quot; and &quot;gesundheit&quot;) and 51 were of length 20 or greater. Average length of sentence for the initial body was 8.4 words per sentence. The first 200 sentences included transcriptions of oral sentences, which were much shorter on average, since the user is speech-handicapped.
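Corpus figures of the kind just quoted (word count, vocabulary size, average sentence length, counts of very short and very long sentences) are straightforward to recompute; a small illustrative Python sketch, assuming tokenized sentences and a function name of our own choosing, follows.

```python
def corpus_statistics(sentences):
    """Summary statistics of the kind reported for the 1750 training sentences.
    `sentences` is assumed to be a list of tokenized sentences (lists of words)."""
    words = [w for sent in sentences for w in sent]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "vocabulary": len(set(words)),
        "avg_sentence_length": len(words) / len(sentences),
        "one_word_sentences": sum(1 for s in sentences if len(s) == 1),
        "long_sentences_20_plus": sum(1 for s in sentences if len(s) >= 20),
    }

# For the initial body described above, such a call would report roughly
# 1750 sentences, 14,669 words, a 1512-word vocabulary, and 8.4 words/sentence.
```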
If the first 200 sentences are omitted, the average sentence length is 8.6 for the following 1550 sentences.</Paragraph> <Paragraph position="1"> Of the next 97 sentences generated, the shortest sentence was &quot;Thanks again.&quot; The longest was &quot;You said something about a magazine that Jenni had about computers that I might like to borrow.&quot; The 97 sentences consisted of 884 words (six of which were numbers in digital form), for an average length of 9.1 words per sentence.</Paragraph> <Paragraph position="2"> Of the 884 words, 350 were presented on the first menu, 373 were presented on the second menu (after one letter had been spelled), 109 were presented on the third menu (after two letters had been spelled), 2 were presented on the fourth menu (after three letters had been spelled), 43 were spelled out in their entirety, and 7 were numbers in digital form, produced using the number screen of the system.</Paragraph> <Paragraph position="4"> From the above, it is obvious that the device of predicting the 20 most frequent words by sentence position is successful 39.6 per cent of the time; 42.2 per cent of the time, the desired word is among the 20 most frequent words of a given initial letter but not in the 20 most frequent words by position; combining these two facts, we see that 81.8 per cent of the time, this simple prediction scheme presents the desired word on a first or second selection. The desired word is offered in the first, second, or third menu 94.1 per cent of the time, and most of the rest of the time (5.7 per cent of the total), the desired word is unknown to the system and is &quot;spelled out&quot;, where &quot;spelling&quot; includes producing numbers.</Paragraph> <Paragraph position="5"> Although the fourth menu, consisting of words with a three-letter initial sequence, presently has a low success rate, it is precisely this category that we expect to see improve as more of the user's words become known to the system through spelling. That is, as time passes, we expect the user to have to resort to complete spelling less and less because the known vocabulary will include more and more of the actual vocabulary of the user. Many of the new words will be low-frequency words that we would expect to find on the menu for three-letter combinations after they are known.</Paragraph> <Paragraph position="6"> The second algorithm, using first- and second-followers of the high-frequency words, was run on 100 sentences, the shortest of which was &quot;Help!&quot; (94 of the 97 test sentences for the first algorithm were represented in the test set for the second.) There were 895 words in the sample, of which 448 were presented on the first menu, 280 were presented on the second (after one letter had been spelled), 83 on the third (after two letters had been spelled), 1 on the fourth, and 83 were spelled out in their entirety (this category included numbers).</Paragraph> <Paragraph position="7"> Running the second test gave us a very quick appreciation for the value of adding new words to the system as they are encountered, since this implementation of the second algorithm did not do so. One especially striking example was a word beginning with &quot;w-o&quot; which had never been used before, but which occurred five times in the 100 test sentences and had to be spelled out each time. This was especially irritating since the &quot;w-o&quot; menu (third menu) had fewer than 20 entries and would have accommodated the new word.
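The percentages quoted above follow directly from the raw menu-hit counts; the short Python snippet below (illustrative only, using the counts reported for the first algorithm's 97-sentence test) reproduces the arithmetic.

```python
# Menu-hit counts reported for the 97-sentence test of the first algorithm.
counts = {
    "first menu (by position)": 350,
    "second menu (one letter)": 373,
    "third menu (two letters)": 109,
    "fourth menu (three letters)": 2,
    "spelled out or number": 43 + 7,
}
total = sum(counts.values())  # 884 words in all
for menu, n in counts.items():
    print(f"{menu}: {100 * n / total:.1f} per cent")

# Cumulative coverage: 350/884 = 39.6%, (350+373)/884 = 81.8%,
# and (350+373+109)/884 = 94.1% of words offered within three menus.
```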
A comparison of the two columns of Figure 7 suggests that, for the text held in common by the two tests, approximately 30 words that had to be spelled out under the second algorithm were selected by menu under the first, because the first algorithm added new words to its data sets as they were encountered.</Paragraph> </Section> <Section position="6" start_page="37" end_page="38" type="metho"> <SectionTitle> PROPOSED EXTENSIONS </SectionTitle> <Paragraph position="0"> We have several plans for the future, most of them involving the second algorithm. Our first task is to increase the number of sentences in the Sherri data to 3000 and determine how much (if at all) an enlarged base of experience improves the ability of the algorithm to predict the desired word on the first try. [Figure 7 (column headings): sentence position algorithm, 97 sentences, 884 words; frequent word/left context algorithm, 100 sentences, 895 words.] In its present form, the system is reliable in its predictions after several hundred sentences by the user have been processed. We intend to take something like the Brown corpus for American English and from it create a vanilla-flavored predictor as a start-up version for a new user, with facilities built in to have the user's own language patterns gradually outweigh the Brown corpus initialization as they are input.</Paragraph> <Paragraph position="1"> Eventually the Brown corpus would have essentially no effect, or at least no effect overriding the user's individual use of language (it might serve as a basic dictionary for text vocabulary not yet seen from the user).</Paragraph> <Paragraph position="2"> We intend to investigate what effect generating sentences while using the system has on our collaborator. To date, she has obligingly been willing to continue to use a typewriter to generate text, but she does own a personal computer and is able to use a mouse. Our own experience in entering her sentences on the system has made it clear that in many instances she would have expressed the same ideas more rapidly on the system with a slight change in wording. Since the preferred words and patterns are derived by the system from her own language history, they should feel normal and natural to her and could influence her to modify her intentions in generating a sentence. On the other hand, a different handicapped individual (a quadriplegic) has informed us that ease of mechanical production of a sentence has little or no effect on his choice of words, and that would appear to be the case for our collaborator while she uses the typewriter.</Paragraph> <Paragraph position="3"> Finally, we wish to make use of the much larger amounts of memory available on personal computers by taking account of the followers for many of the moderate-frequency words. For example, in the sentence &quot;would you be able...&quot; the word &quot;able&quot; is not high frequency. Nevertheless, the system could easily deduce what following word to expect, since every known occurrence of &quot;able&quot; is followed by &quot;to&quot;. As it happens, &quot;to&quot; is one of the top 20 most frequent words and hence fortuitously is on the default menu after the non-high-frequency word &quot;able&quot;, but there are many other examples where the system is not so lucky. For instance, &quot;pick&quot; is usually followed by &quot;up&quot; in the Sherri data, but &quot;pick&quot; is low frequency and &quot;up&quot; is not on the default first menu.
Similarly, &quot;think&quot; is a high-frequency word and has a well-developed set of followers. &quot;Thinks&quot; and &quot;thought&quot; are not high-frequency and hence are followed by the default first menu. Yet virtually every follower for &quot;thinks&quot; and &quot;thought&quot; in the Sherri data happens to belong to the set of followers for &quot;think&quot;. We believe that by storing information on moderate-frequency words with strongly associated followers and on clusters of verb forms we may significantly improve the success of the first menu.</Paragraph> </Section> </Paper>