<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1037">
  <Title>LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*</Title>
  <Section position="4" start_page="0" end_page="191" type="metho">
    <SectionTitle>
2. RESOURCES
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="5" start_page="191" end_page="191" type="metho">
    <SectionTitle>
3. OVERVIEW OF SYSTEM
ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> An initial implementation of the interactive translation system for Japanese has been completed, running under MS-DOS on PC (486) hardware. In its current form, lexical and syntactic analyses are done in a pre-processing step (initiated by the user) that produces an annotated source document and a document-specific dictionary, which are then presented to the user in a customized word-processing environment.</Paragraph>
    <Paragraph position="1"> The pre-processing step consists of a number of sub- null tasks, including: 1. Breaking the Japanese character stream into words using a maximum-likelihood tokenizer in conjunction with a morphological analyzer (de-inflector) that recognizes all inflected forms of Japanese verbs, adjectives, and nouns 2. Attaching lexical information to the identified words, including inflection codes and roots (for inflected forms), pronunciation, English glosses (some automatically generated from parallel text), and English definitions 3. Finding &amp;quot;best guess&amp;quot; transliterations of katakana words using dynamic-programming techniques 4. Translating numbers with following counters (eliminating a large source of user errors arising from the unusual numbering conventions in Japanese) 5. Using a finite-state parser to identify modifying phrases 6. Creating the annotated document and document null specific dictionary The user's word-processing environment consists normally of two windows, one containing the original Japanese broken into words and annotated with pronunciation and &amp;quot;best guess&amp;quot; glosses, the other for entry of the English translation. Information extracted during pre-processing but not available in the annotated document (longer definitions, inflection information, etc.) can be accessed instantly from the document-specific dictionary using the keyboard or mouse, and is presented in a pop-up window. The interface also allows easy access to browsing resources such as on-line dictionaries and proper name lists.</Paragraph>
  </Section>
  <Section position="6" start_page="191" end_page="193" type="metho">
    <SectionTitle>
4. IMPLEMENTATION DETAILS
</SectionTitle>
    <Paragraph position="0"> Tokenizatlon. Tokenization is done using a maximum-likelihood algorithm that finds the &amp;quot;best&amp;quot; way to break up a given sentence into words. Conceptually, the idea is to find all ways to tokenize a sentence, score each tokenization, then choose the one with the best score. The tokenizer uses a master list of Japanese words with uni-gram frequencies.</Paragraph>
    <Paragraph position="1"> The score of a tokenization is defined to be the sum of the scores assigned to the words it contains, and the score of a word is taken to be proportional to the log of its unigram probability. Any character sequence not in the master list is considered infinitely bad, although to guarantee that a tokenization is always found, an exception is made for single character tokens not in the master list, which are assigned a very low, but finite, score. The tokenizer also assigns a moderate score to unfamiliar strings of ASCII or katakana, as well as to numbers.</Paragraph>
    <Paragraph position="2"> The search for the best tokenization is done using a simple dynamic programming algorithm. Let score(w) and lenflh(w) denote the score and length of the character sequence w. For a sentence of N characters numbered from 0 to N - 1, let best\[i\] denote the score of the best tokenization of the character sequence from 0 to i- 1, and initialize best\[O\] = O, best\[i\] = -oo for 1 &lt; i &lt; N.</Paragraph>
    <Paragraph position="3"> The best tokenization score for the sentence is then given  Note that when two tokenizations both have a word ending at a given position, only the higher scoring solution up to that position is used in subsequent calculations.</Paragraph>
    <Paragraph position="4"> Currently the most serious tokenization errors are caused by kanji proper nouns in the incoming document. Unlike European languages, there is no lexical cue (such as capitalization) to identify such nouns, and since most kanji can appear as words in isolation, the tokenizer will always find some way to break up a multi-kanji name into legal, but probably not sensible, pieces.</Paragraph>
    <Paragraph position="5"> De-inflection. In order to keep the master list relatively small, only root forms of words that inflect have an entry. To recognize inflected forms, the tokenizer calls a de-inflector whenever it fails to find a candidate token in the master list.</Paragraph>
    <Paragraph position="6"> In Japanese there are three classes of words that inflect: verbs (no person or number, but negatives and many tenses), adjectives (no cases or plurals, but negatives, adverbial, and tense), and nani-nouns (adjectival and adverbial). De-inflection is typically a multi-step process, as in tabetakunakalta (didn't want to eat) --~ iabetakunai (doesn't want to eat) tabetai (wants to eat) taberu (eats).</Paragraph>
    <Paragraph position="7"> It may also happen that a particular form can de-inflect along multiple paths to different roots.</Paragraph>
    <Paragraph position="8"> The engine of the LINGSTAT de-inflection module is language-independent (to the extent that words inflect by transformation of their endings), driven by a language-specific de-inflection table. It handles multi-step and multi-path de-inflections, and for a given candidate will return all possible root forms to the tokenizer, along with the probability of the particular inflection for incorporation into the word score. The de-inflector also returns information about the de-inflection path for use by the annotation module. De-inflection tables have been developed for Japanese, Spanish, and English.</Paragraph>
    <Paragraph position="9"> Annotation. The annotation module attaches pronunciations, English glosses, English definitions, and inflection information to each word identified by the tokenizer. Pronunciation information might seem superfluous but is often of value to a Japanese translator. One of the consequences of the difficulty of written Japanese is that most students of the language can speak much better than they can read (recall that the pronunciation of a kanji cannot be deduced from its shape). The verbal cue that LINGSTAT provides through the pronunciation may therefore be enough to allow a user to identify an otherwise unfamiliar kanji word. In any case, having the pronunciation allows the user access to supplementary paper dictionaries ordered by pronunciation, which are much faster to use than radical-and-stroke dictionaries ordered by character shape information.</Paragraph>
    <Paragraph position="10"> The glosses used by LINGSTAT come from three sources: hand entry, the Japanese-English CD-ROM dictionary, and automatic extraction from the definitions in the EDR dictionary. There are two methods of automatic extraction: Pull the gloss out of the definition--for example, A type of financial transaction named leveraged buyout becomes leveraged buyout.</Paragraph>
    <Paragraph position="11"> Use the English and Japanese definitions in the EDR dictionary as sentenced-aligned parallel text and apply CANDIDE's word alignment algorithm (Model 1) \[1\] to determine which English words correspond to each Japanese word.</Paragraph>
    <Paragraph position="12"> The first method is moderately successful because many of the definitions adhere to a particular style. The second method gives good glosses for those Japanese words that occur frequently in the text of the definitions.</Paragraph>
    <Paragraph position="13"> Katakana Transliteration. Words are borrowed so frequently from other languages, particularly English, that their transliterations into katakana rarely appear in even the largest dictionaries. The best way to determine their meaning, therefore, is to transliterate them back into English. This is made difficult by the fact that the transformation to katakana is not invertible: for example, English I and r both map to the Japanese r, r following a vowel is sometimes dropped, and vowels are inserted into consonant clusters.</Paragraph>
    <Paragraph position="14"> The LINGSTAT katakana transliterator attempts to guess what English words might have given rise to an unfamiliar katakana word. It converts the katakana pronunciation into a representation intermediate between Japanese and English, then compares this to a list of 80,000 English words in the same representation. A dynamic programming algorithm is used to identify the English words that most closely match the katakana. These words are then attached to the katakana token in the annotation step.</Paragraph>
    <Paragraph position="15"> This procedure fails for non-English foreign words, and for most proper names (since they rarely appear in the master English list).</Paragraph>
    <Paragraph position="16"> Number Translation. In traditional Japanese, numbers up to 104 are formed by using the kanji digits in  conjunction with the kanji symbols for the various powers of ten up to 1000, e.g., 6542 would be written (6)( 1000)(5)(100) (4)(10) (2), with each number in parentheses replaced by the appropriate kanji symbol. Notice that the powers of ten are explicitly represented, rather than being implied by position. null There are special kanji for the large numbers 104, l0 s, elc. These may be preceded by expressions like that above to form very large numbers, such as (2)(10s)(5)(1000)(5)(100)(104) = 2 x l0 s +5500 x 104 = 255,000,000.</Paragraph>
    <Paragraph position="17"> Modern Japanese often mixes the traditional Japanese representation with the &amp;quot;place-holding&amp;quot; representation used in English. Arabic numerals are freely mixed with kanji symbols in both formats. To ease the burden on the translator LINGSTAT has a function that recognizes numbers in all their styles, including following counters, and translates them into conventional English notation. These translations are then attached to the number token in the annotation step. Comparison of manual and LINGSTAT-aided translations has demonstrated that this feature eliminates a large source of critical errors, particularly in the evaluation domain, which frequently references large monetary transactions.</Paragraph>
    <Paragraph position="18"> Finlte-state parser. As a first pass at helping the user with Japanese sentence structure, LINGSTAT incorporates a simple finite-state parser designed to identify modifying phrases in Japanese sentences. An interface function has also been added to display this information in a structured way. At this stage, the quality of the parse is only fair. This function has not yet been tested for its effect on translation speed.</Paragraph>
  </Section>
class="xml-element"></Paper>