<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1043"> <Title>SYSTEM SUPPORT IN CHINESE DATA ENTRY</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> SYSTEM SUPPORT IN CHINESE DATA ENTRY </SectionTitle> <Paragraph position="0"> Joseph E. Grimes Cornell University, Ithaca NY, USA Summary Our aim is a viable software system to support language data processing involving non-alphabetic symbols, specifically the Chinese characters. The key to the system is the exploitation of certain linguistic relationships between pairs of those characters in sequence. In the development vehicle for this system, independent modules of data are linked by pointers at two critical interfaces. The first links an input recognizer to a process that recognizes significant pairings. The second links both recognizers to a character generator.</Paragraph> <Paragraph position="1"> Chinese typists are readily trained to end the input sequence that identifies the first character of a pair with a special delimiter for pairs rather than the usual delimiter. The pairs delimiter alerts the system to look up the pairing potential of the first character. Then it matches the second character of the pair against that potential. The result is automatic contextual disambiguation performed on input codes that otherwise might not identify characters uniquely.</Paragraph> <Paragraph position="2"> Overview Up to now, devices for typing Chinese characters or for entering them into computers have not been widely successful. 
The problem is not one of storing character shapes in a computer or of reproducing them once they are stored; it is rather a problem of designating quickly which of a large number of accessible shapes should be reproduced.</Paragraph> <Paragraph position="3"> The companion paper by Paul King on human factors and linguistic considerations shows that a satisfactory solution can be worked out by paying attention not just to the graphic shapes involved, but to the characteristics of the Chinese language that stand behind those graphic shapes and their combinations.</Paragraph> <Paragraph position="4"> The solution has three components.</Paragraph> <Paragraph position="5"> First, it is possible to use nonunique strings of key strokes to identify the shape of a character; that is, many of the identifiers in King's Cornell Code identify two or more characters. Second, since often in Chinese it is a two-character sequence that is significant rather than the individual characters that make it up, access to information about such pairings makes it possible to eliminate, or at least reduce, the ambiguity inherent in the use of nonunique identifiers. Third, in the residual cases where it is still not clear which of several characters or character pairs is intended, it has proved adequate to display the possibilities and interrupt the operator to ask her to indicate which one she wants by typing the number of that one on the screen before her.</Paragraph> <Paragraph position="6"> The results of this approach to Chinese data entry are encouraging.</Paragraph> <Paragraph position="7"> Speakers of Chinese with a middle school education or better learn the keyboard with about half an hour's instruction. After a two week training session or its equivalent, the median speed is around 40 characters per minute and the best speeds are above 50. The error rate by that time approaches zero, and residual errors are correctible by means of a cursor editor that is built into the system. 
Operators can type for several hours at a time without fatigue.</Paragraph> <Paragraph position="8"> The software system that makes this behavior possible is not particularly complex. It derives its power from the amount of information about the Chinese language that it holds in compact form in its internal store: specifically, information about the most likely character pairings in Chinese.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Requirements </SectionTitle> <Paragraph position="0"> To create this system we began with two overall requirements. The first was that it be modular, so that any component of it could be worked on without disturbing the other components, and so that it could be implemented on a variety of physical configurations. Each of the stored data structures is manipulated not only by the main data entry program, but also by utility programs that make it possible to add new characters to the repertoire (for example, to put together a specialized vocabulary for a particular application), or to augment the number of pairings that the system recognizes.</Paragraph> <Paragraph position="1"> Modularity also makes it possible to change output character fonts as desired.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Chinese Data Entry Grimes </SectionTitle> <Paragraph position="0"> The second requirement was, of course, that all functions of the system be accomplished at a speed that would permit typists to achieve their best performance without any limit being imposed on them. 
In the disk oriented prototype, this requirement involved paying special attention to minimizing disk accesses when searching through chains of pointers.</Paragraph> <Paragraph position="1"> The prototype embodies a third requirement that production versions will not have to meet: each typist's performance needs to be logged in a way that slows nothing down. An internal clock puts times and character codes into a buffer that is written to a log file from time to time. A utility program later compares the log with a stored version of what was to be typed. From that comparison it determines the error rate. From the timing information it determines the typing rate. The results of this utility are made available to another that plots the performance of all typists being tested as a function of time.</Paragraph> <Paragraph position="2"> Stored information There are two kinds of information that the system uses in deciding which character the typist wants: a file of identifiers, and a file of pairings. The identifier file is used to find a direct match to strings typed in from the keyboard. A successful match against the identifier file yields one or more Chinese telegraph codes (four-digit numbers) that represent all the characters that that identifier could stand for. The file of pairings, on the other hand, tells what other characters could follow a given character in a close-knit relationship to it when it comes first in a pair.</Paragraph> <Paragraph position="3"> The identifier file is organized as a B*-tree with variable length entries. This type of structure keeps the number of disk accesses needed to match an identifier near the theoretical minimum, with the result that it can be traversed very rapidly. It has the additional advantage that it can be implemented in such a way that the disk blocks nearest the root are kept in a core buffer area, thereby eliminating outright some of the disk accesses that would be needed in a full search. 
Since the tree is formed at the time when new identifiers are being introduced into the system by a utility program, the complex steps needed to keep the B*-tree in balance are performed only at a time when speed is not a factor.</Paragraph> <Paragraph position="4"> The result of a match between an identifier and the B*-tree is a string of one or more telegraph code numbers.</Paragraph> <Paragraph position="5"> These are the same four-digit numbers that have been used for years to transmit Chinese characters over telegraph lines. They are defined by a standard code book. The string of one or more telegraph codes that comes from the identifier file represents all the Chinese characters whose shape matches the identifier string. This string of codes is held in an internal buffer, where later stages of the process work on it.</Paragraph> <Paragraph position="6"> The file of pairings is the heart of the system and the focal point of the patent that has been filed on it. It consists of two parts: index and contents. The index is derived by applying an arithmetic function to the telegraph code that is desired, yielding a fixed offset from the beginning of the file. At that offset is stored an internal pointer to where the contents begin. This makes possible a rapid second access to a nearby location on the disk. The contents, stored at the second location, are variable in length; they are a string of telegraph codes that identify all the characters that are known to pair with the character whose telegraph code forms the original search argument; that is, all the characters whose sequential relationship to the first character is significant in Chinese.</Paragraph> <Paragraph position="7"> The system also stores graphic information that defines character shapes for display on the screen and for printing. The shape information is indexed just as it is in the file of pairings, using the telegraph code of each character as a pointer. 
The same algorithm that is used for the file of pairings converts the telegraph code to a fixed disk offset, and at the position on the disk that is so indicated, the information is found that tells where the actual graphic information begins later in the same file.</Paragraph> <Paragraph position="8"> The form of the graphic information or its display is of no direct concern to the selection logic; it can be treated as a cluster of a data structure that is pointed to, together with the processes necessary to handle it. We have implemented both vector and raster displays, and within each mode any available font can be called up simply by naming the appropriate character file, since all are accessed through the same pointer structure.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Processes </SectionTitle> <Paragraph position="0"> Three major processes select the Chinese character the typist wants. The first recognizes the identifier string that is typed in, using the B*-tree structure of the identifier file. The second process is invoked whenever a character pair is typed in. It recognizes all pairings that match both identifiers in the pair in order. The third process is invoked only if more than one possible result remains after the first two processes have finished; it gets the typist's attention and asks her to make a decision.</Paragraph> <Paragraph position="1"> The process that recognizes an identifier finds an exact match in a B*-tree of known identifiers. If no match is possible it gets the operator's attention: either the identifier was typed wrong, or it is not yet in the system. 
In either case something else needs to be typed.</Paragraph> <Paragraph position="2"> When a pair is typed in, the string up to the special delimiter for pairs is taken as the first identifier and matched separately from the string for the second identifier, which comes between the special delimiter and the final delimiter. For each identifier, the identifier file yields a string of four-digit telegraph codes, one for each of the Chinese characters that that identifier can represent. In about one case out of nine, there is only one telegraph code for an identifier.</Paragraph> <Paragraph position="3"> Frequently, however, there are two; and the number sometimes goes as high as fifteen telegraph codes for one identifier.</Paragraph> <Paragraph position="4"> When a pair of characters is typed in using the special delimiter to separate them, the presence of that delimiter activates the second process that recognizes known pairings. This process goes through the string of telegraph codes that correspond to the first identifier. For each telegraph code in that string, it looks in the file of possible pairings to see what other codes might form possible pairings with the first one. The process then goes through the telegraph codes that correspond to the second identifier to see if any of them actually does form a pairing with the first. If one does, that pair of telegraph codes is copied into a special array which the third process uses for its final selection.</Paragraph> <Paragraph position="5"> If on the other hand the pairing the typist reacts to is not yet in the file of pairings, the third process is automatically applied to 
the first character of the pair so that it can be disambiguated manually, then to the second member of the pair separately for the same process.</Paragraph> <Paragraph position="6"> If the process that recognizes identifiers finds only one telegraph code for an identifier, and there is no pairing, then that telegraph code is accepted as the correct representation for the character the typist intended.</Paragraph> <Paragraph position="7"> If there is a pairing, but after the second process is over only one pair of telegraph codes has been found, those two can be taken as the characters the typist intended.</Paragraph> <Paragraph position="8"> Sometimes, however, an unpaired character identifier corresponds to two or more telegraph codes, and the decision as to which code is intended has to be made by the typist. Much more rarely, two paired identifiers match more than one pair of telegraph codes in the second process, and the decision as to which pair of characters is intended has to be made by the typist. It is also possible for a legitimate pairing to not yet be in the file of pairings, so that each of the two characters the typist typed has to be presented separately for a decision.</Paragraph> <Paragraph position="9"> In all these cases the third process interrupts the typist by emitting an audible signal to indicate that she needs to divert her attention from what she is typing to the question posed on the screen in front of her.</Paragraph> <Paragraph position="10"> The screen displays the entire list of possibilities, either single characters or pairs of characters. 
She in return types a number to tell the process which one of them she wants: &quot;2&quot; for the second one displayed, &quot;5&quot; for the fifth, and so forth.</Paragraph> <Paragraph position="11"> All the telegraph codes that are found by the three processes of matching identifiers, resolving pairs, and deciding among alternatives go into an output buffer in the form of character strings that represent the four-digit telegraph codes. The contents of this buffer are available for any one of several subsequent processes.</Paragraph> <Paragraph position="12"> The first process that operates on the output string of telegraph codes is an editor that allows a cursor to be moved through the string to the place where some change is to be made. The editor then allows deletions and insertions to be made wherever the cursor is located. All display on the screen, of course, is in the form of the Chinese characters that correspond to the stored telegraph codes; the codes themselves are never seen by the typist. The edited string can be stored on a magnetic medium, sent over a communications line, or formatted vertically or horizontally to be sent to a printing device. All these processes are conventional. They give the Chinese data entry system the potential of being used as a computer terminal, a communications terminal, an off line data entry device, or even a simple office typewriter.</Paragraph> <Paragraph position="13"> The work on which this paper was based received support from the NCR</Paragraph> </Section> </Section> </Paper>