<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-3072">
  <Title>Spelling-checking for Highly Inflective Languages</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. The Model of Inflection
</SectionTitle>
    <Paragraph position="0"> First, we decided to exclude the phonological level, which is usually part of morphological processing, because of the time penalty it would incur. This means that all phonological changes, although some of them are quite regular, have to be treated in a single processing step together with the morphotactics. The space increase caused by this decision is still acceptable (for Czech and, as far as we know, for the other Slavonic languages too).</Paragraph>
    <Paragraph position="1"> The basic model of inflection we use assumes that a word form is a concatenation of a stem and an ending. We define the terms stem and ending in the following "computational" way: the stem is the part of the word which does not change in the course of inflection, and the ending is the part which, when appended to the stem, completes it to a meaningful form. Exactly this model is used for nouns.</Paragraph>
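The stem-plus-ending model can be illustrated with a minimal sketch. It is not the authors' implementation: the stem, the class name toy_noun and the ending set are hypothetical toy data chosen only to show the lookup.

```python
# Hypothetical toy data: the stem, class name and ending set below are
# illustrative only and are not taken from the paper's dictionary.
ENDING_SETS = {
    "toy_noun": {"", "u", "em", "y", "ech"},
}
STEMS = {"hrad": "toy_noun"}  # stem -> name of its ending set


def is_valid_form(word):
    """Return True if word splits into a known stem plus an allowed ending."""
    # Try every split point, including the empty ending (cut == len(word)).
    for cut in range(1, len(word) + 1):
        stem, ending = word[:cut], word[cut:]
        set_name = STEMS.get(stem)
        if set_name is not None and ending in ENDING_SETS[set_name]:
            return True
    return False


print(is_valid_form("hradem"))  # stem "hrad" + ending "em"
print(is_valid_form("hradx"))   # no stem/ending split fits
```

Because phonology is folded into the ending sets, the check reduces to a dictionary lookup per split point, which is presumably what keeps the processing time low at the cost of more ending sets.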
    <Paragraph position="2"> For verbs, it is suitable to extend this basic model to cover negation, as negation is formed by the prefix ne-. Moreover, as a spelling-checker does not need to use the meanings of words, we extended the word form definition further to cover verb prefixes. Of course, it is not economical to consider all possible verb prefixes, because most Czech verbs can have 3 to 8 derivatives formed by prefixes alone. As a compromise, we use the 15 most frequent verb prefixes. All the others, as well as their combinations, are considered to be part of the stem as defined in the previous paragraph.</Paragraph>
    <Paragraph position="3"> Our system uses two types of adjective structure.</Paragraph>
    <Paragraph position="4"> First, proper adjectives are viewed as consisting of a stem and an ending and possibly the superlative prefix (nej-) and/or the negative prefix. Second, verbal adjectives can have a verbal prefix in addition to the parts mentioned above. The latter type of partitioning is the most complicated one in our system.</Paragraph>
    <Paragraph position="5"> For example, the form nejnevykupovávanější (lit.</Paragraph>
    <Paragraph position="6"> 'not the (item which is) mostly bought for speculative purposes iteratively') consists, from the point of view of our model, of five parts: the superlative prefix nej-, the negative prefix ne-, the "speculative" prefix vy-, the stem (of "to buy") kup and the ending ovávanější, which combines the functions of iterativeness, passive, comparison, and nominative singular.</Paragraph>
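The five-part partitioning of a verbal adjective can be sketched as follows. This is a simplified, greedy illustration under stated assumptions: the prefix tuple stands in for the 15 prefixes handled by the system (only vy- is named in the text), the stem and ending sets hold just the example's values, and a real checker would also have to consider readings in which nej or ne belong to the stem.

```python
VERB_PREFIXES = ("vy",)      # hypothetical stand-in for the 15 handled prefixes
STEMS = {"kup"}              # toy stem list
ENDINGS = {"ovávanější"}     # toy ending set


def strip_prefix(text, prefix):
    """Return (prefix, rest) if text starts with prefix, else ("", text)."""
    if text.startswith(prefix):
        return prefix, text[len(prefix):]
    return "", text


def split_verbal_adjective(form):
    """Split form into (superlative, negative, verb prefix, stem, ending).

    Returns None when no partitioning fits. The superlative and negative
    prefixes are stripped greedily, which is a simplification.
    """
    sup, rest = strip_prefix(form, "nej")
    neg, rest = strip_prefix(rest, "ne")
    for verb_prefix in VERB_PREFIXES + ("",):
        if not rest.startswith(verb_prefix):
            continue
        body = rest[len(verb_prefix):]
        for cut in range(1, len(body) + 1):
            if body[:cut] in STEMS and body[cut:] in ENDINGS:
                return sup, neg, verb_prefix, body[:cut], body[cut:]
    return None


print(split_verbal_adjective("nejnevykupovávanější"))
```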
    <Paragraph position="7"> Thus, we had to employ 240 sets of endings. Of course, there are also hundreds of exceptions. For them, as well as for indeclinable word classes, there is a special set consisting only of a zero ending, and the whole form is stored; i.e., in our terms, the whole form is considered to be the "stem".</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. User Interface
</SectionTitle>
    <Paragraph position="0"> As Czech users (not differing from their foreign colleagues in this respect) do not like learning new text processors, we decided to follow the ideas behind Turbo Lightning. By using a memory-resident program which is user-configurable for different text processors, we obtained a unified interface for virtually all users.</Paragraph>
    <Paragraph position="1"> The basic functions of interactive single word/page check and/or correction are accompanied by batch functions, which some users prefer for longer texts and for some types of text processors. The types of texts supported by the batch mode range from simple ASCII files to files produced by WordPerfect 5.0, including the source texts for the TeX typesetting system.</Paragraph>
    <Paragraph position="2"> The system also facilitates the process of adding word forms to the user's own dictionary. For the reasons discussed above, this causes problems, as the other forms of that word cannot be included fully automatically. An algorithm (see below) accomplishes this task with the user's assistance. The idea is similar to Finkler and Neumann (1988), though simplified for our purposes; Carter (1989) in his VEX system also uses the method of asking the user (supposedly a non-linguist) simple questions to learn about a word's behaviour, but it is for English and primarily intended for assigning syntactic rather than morphological properties. The implementation of the algorithm together with its user interface will be included as an off-line utility (in the first version, available in autumn '89, there was no such utility; it should be included in the second version).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. The Semi-automatic Word Classification
</SectionTitle>
    <Paragraph position="0"> Equipping the lexical entries with morphological information is an unpleasant task: very boring for linguists, and error-prone for anybody. And if the dictionary is to be updated primarily by non-linguists, the need for (at least some) automation is obvious.</Paragraph>
    <Paragraph position="1"> Fortunately, some inflectional languages (including Czech, as well as the other Slavonic languages) tend to indicate their morphological properties by (some of) the forms of the word itself, at least statistically. As our purpose is to facilitate morphological classification of new words which are added to a dictionary, and as newly coined words or technical terms not included in the main dictionary are mostly regular, we can suppose that the irregular words are already in the dictionary.</Paragraph>
    <Paragraph position="2"> When classifying a given word from the user dictionary (added to it during the on-line checking/correcting process), the user should first change the ending of the form moved here from the text to create the dictionary form of the word, i.e., nominative singular for nouns, nominative singular masculine for adjectives, and infinitive for verbs. In some cases, the system can provide the dictionary form automatically, but mostly the only help it can offer is to position the cursor under the last character of the word form.</Paragraph>
    <Paragraph position="3"> Then the user should select the basic class to which the word belongs: indeclinable, verb, adjective or noun. There are no further questions for indeclinables, of course. For adjectives, the only further decision concerns the possibility of creating comparative and/or negative forms. For verbs, the user should do two things: first, select all possible prefixes from the 15 prefixes handled by the system, and then assign a perfective/imperfective/both flag to the word and to its prefixed forms (for all the prefixed forms, this flag has the same value). For nouns, where the situation is very complicated, there is a hierarchy of questions and selections, which, for some masculine inanimates, reaches a depth of five questions/selections. Fortunately, thanks to the many investigations performed by mathematical and statistical linguists in the past, we can arrange things so that in most cases the first selection displayed is the right one.</Paragraph>
    <Paragraph position="4"> For an experienced user, there is the possibility of writing directly the name of the appropriate class.</Paragraph>
    <Paragraph position="5"> We used this mode of operation when entering all regular Czech nouns into the dictionary.</Paragraph>
    <Paragraph position="6"> Then the system constructs the stem, assigns the set of endings, and prompts the user to confirm the resulting set of forms. For example, when classifying the form radionuklidy (radionuclides), the user first deletes the ending -y (which is one of the plural endings). Then he/she selects "noun" as the basic class; then "masculine inanimate" is the right choice. Then he/she should select radionuklidu as the right form which can follow the preposition bez (without), and state that radionuklida is not correct in this case. The last selection concerns the preposition o (about), after which radionuklidu is the only possibility (as opposed to the form radionuklidě, which cannot be used after the preposition o). Using this information, the system is able to decide that the stem is radionuklid (i.e., it equals the nominative singular form) and that the set of endings has the identification hdl. The user then confirms that radionuklid, -lidu, -lidem, -lidy, -lidů, -lidům, -lidech are all and only the correct forms of radionuklid.</Paragraph>
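The radionuklid walk-through can be sketched as a lookup from the user's answers to an ending set, followed by generation of the forms to confirm. This is only an illustration under stated assumptions: the decision table holds the single path described in the example, the function names are hypothetical, and the contents of the set hdl are inferred from the forms listed for confirmation (reading the genitive and dative plural endings as -ů and -ům).

```python
# Ending set named in the example; its contents are inferred from the forms
# the user confirms, so treat them as an assumption.
ENDING_SETS = {"hdl": ["", "u", "em", "y", "ů", "ům", "ech"]}

# Hypothetical decision table: (basic class, subclass, genitive sg. ending
# accepted after "bez", locative sg. endings accepted after "o") -> set name.
DECISIONS = {
    ("noun", "masculine inanimate", "u", ("u",)): "hdl",
}


def classify(dictionary_form, basic_class, subclass, gen_sg, loc_sg):
    """Return (stem, set name) for the user's answers, or None if unknown."""
    set_name = DECISIONS.get((basic_class, subclass, gen_sg, tuple(loc_sg)))
    if set_name is None:
        return None
    # For this class the stem equals the nominative singular form.
    return dictionary_form, set_name


def forms(stem, set_name):
    """List the forms shown to the user for confirmation."""
    return [stem + ending for ending in ENDING_SETS[set_name]]


result = classify("radionuklid", "noun", "masculine inanimate", "u", ["u"])
if result is not None:
    print(forms(*result))
```

Showing the generated forms before committing the entry mirrors the confirmation step in the text: a wrong answer in the question hierarchy surfaces as an obviously wrong form in the list.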
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Implementation
</SectionTitle>
    <Paragraph position="0"> As mentioned above, we selected the memory-resident version as the primary mode of operation. The program, together with the approximately 7,000 most frequent Czech words, takes approximately 110K of memory. It is able to check one screenful of a 60-column standard text (approx. 200 words) within 3 seconds on a 10 MHz PC AT with a 28 ms hard disk. When the program runs as an ordinary program (in the mark-only batch mode), it is possible to have almost all the dictionary entries in main memory, and then it runs more than five times faster (100K of text in less than one minute).</Paragraph>
    <Paragraph position="1"> In the first version, the main dictionary, covering 80,000 to 100,000 Czech "dictionary" words, took approximately 290K (not counting the 7,000 most frequent ones, which reside in memory anyway). This means that it can be used even on the oldest floppy-based systems, e.g., in high schools. Since October 1989, the system has been available to anybody wishing to avoid misprints when writing in Czech.</Paragraph>
  </Section>
</Paper>