<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2007"> <Title>A resource-based Korean morphological annotation system</Title> <Section position="5" start_page="38" end_page="40" type="evalu"> <SectionTitle> 4 Morphotactics and connectivity </SectionTitle> <Paragraph position="0"> The final part of some verbal and adjectival stems undergoes phonotactic variations when a suffix is appended to them. For example, the stem keu- &quot;big&quot; becomes k- before the suffix -eoss- (past). In order to reduce the level of redundancy of manually updated resources, lexicons of base-form stems were constructed. Each stem was assigned a structured tag. Stem allomorphs are generated from base-form stems with 71 transducers of the same type as those used to inflect words in inflectional languages (Silberztein, 2000). The input part of the transducer specifies letters to remove or to add in order to obtain the allomorph from the base form. The output part specifies the tag and compatibility symbol (see below) to be assigned to the allomorph. These transducers are viewed and edited in graphical form with the open-source Unitex system (Paumier, 2002; http://www-igm.univ-mlv.fr/~unitex/manuelunitex.pdf).</Paragraph> <Paragraph position="1"> The combination of a stem with a sequence of suffixes obeys a number of constraints.</Paragraph> <Paragraph position="2"> Checking these constraints is necessary to discard wrong segmentations. We distinguish two types of suffixes: derivational and inflectional. Derivational suffixes are markers of verbalization, adjectivalization and adverbialization. They are appended by applying transducers of the same type as above. In our current version, 8 transducers append derivational suffixes. These transducers invoke 5 subgraphs, thus constituting recursive transition networks (RTN). Inflectional suffixes comprise all other types of suffixes. A single (possibly derived) stem can be combined with up to 5,500 different sequences of inflectional suffixes. 
Compatibility between stems and inflectional suffixes is represented by a set of 59 compatibility symbols (CS). Each stem and stem allomorph is assigned a CS, which defines the set of suffix sequences that can be appended to it. The CSs take into account two types of constraints: grammatical and phonotactic. CSs are comparable with adjacency symbols, except that they include the constraints between all the morphemes in a word, not only between adjacent morphemes. They convey more information than adjacency symbols, but they are less numerous: 59, compared with 300 (Lee et al., 2002). The lexicon of stems assigns CSs to base stems. CSs are automatically assigned to stem allomorphs during the generation of allomorphs.</Paragraph> <Paragraph position="3"> Connectivity between suffixes obeys phonotactic and grammatical constraints. Phonotactic constraints affect surface forms, whereas grammatical constraints affect base form/tag pairs. The standard model for representing both types of constraints is the finite-state model. For example, Lee et al. (2002) use a table that encodes connectivity between morphemes with the aid of morpheme tags and adjacency symbols. Such a table can be viewed as a finite-state automaton in which the states are the adjacency symbols and the transitions are labelled by the morpheme tags. In Kim et al. (1994) and in the Klex system of Han Na-rae, these constraints are represented in the two-level formalism, which is equivalent to regular expressions, which are in turn equivalent to finite-state automata. All these forms are computationally relevant, but they are hard to read: including a new item or correcting an error is error-prone. 
Two-level rules have a very low level of redundancy, but they are complex to read because they combine a morphological part and a logical part (the symbols &lt;=&gt;, &lt;=, =&gt;).</Paragraph> <Paragraph position="4"> In our system, connectivity constraints between suffixes are represented in finite-state transducers, i.e. finite-state automata with input/output labels. These transducers describe sequences of suffixes. Their input represents surface forms and their output represents base forms and tags. We introduced two innovations in order to enhance their readability. Firstly, they are edited and viewed graphically. Secondly, since most of the transducers are large and would not display conveniently on a single screen or page, they take the form of RTNs: transitions can be labelled by a call to a sub-transducer instead of an input/output pair. The 59 CSs correspond to 59 transducers. Most of the sub-transducers that they call are shared, which reduces the level of redundancy of the system. The total number of simple graphs making up the RTNs is 230.</Paragraph> <Paragraph position="5"> For several of the RTNs, the graph of calls to sub-transducers admits cycles. Due to these cycles, these RTNs generate an infinite set of endings. The lexicon compiler keeps the set of generated endings finite by breaking all cycles.</Paragraph> <Paragraph position="6"> Word lexicon. The various readable resources described above are compiled into an operational lexicon of words whenever one of them is updated. The lexicon of words has an index for fast matching. This index is a finite-state transducer over the Korean alphabet of letters. This is a transposition of the state-of-the-art technology for representing lexicons of forms in inflectional languages (Appel and Jacobson, 1988; Silberztein, 1991; Revuz, 1992; Lucchesi and Kowaltowski, 1993). Another index structure, the trie, has been tested with the same lexicon. 
The size of the trie (930 Kb) is slightly larger than the size of the transducer (560 Kb), because the representation of endings is repeated many times in the trie.</Paragraph> <Paragraph position="7"> The compilation of the lexicon of words from the readable resources follows several sequential steps. First, all resources are converted from the Korean syllabic alphabet to the Korean alphabet of letters. In a second step, lexicons of stem allomorphs and of derived stems are generated from the base-form stem lexicons by applying the transducers with Unitex. In a third step, the resulting lexicons of stems are compiled by the Unitex lexicon compiler. Each compiled lexicon has an index, which is a finite-state automaton. The final states of the automaton give access to the lexical information, and in particular to the CSs of the stems. In a fourth step, each transducer of sequences of suffixes is converted into a list by a path enumerator, and each of these lists is processed by the lexicon compiler. The names of the compiled ending lexicons contain the corresponding CSs. In the final step, the stem lexicons and the ending lexicons are merged into a word lexicon. This operation links the final states of the stem lexicons to the initial states of the corresponding ending lexicons. The path enumerator and the lexicon link editor were implemented for this experiment and will be released as open source. The path enumerator can break cycles in the graph of calls to sub-transducers, so that the enumeration remains finite.</Paragraph> <Paragraph position="8"> The current version of this compilation process generates a lexicon of one-stem words only. 
Multi-stem words will be represented in later versions.</Paragraph> <Paragraph position="9"> These operations are independent of the text to be annotated; they are performed beforehand.</Paragraph> <Paragraph position="10"> They need to be repeated whenever one of the language resources is updated.</Paragraph> <Paragraph position="11"> The operation of the morphological annotator is simple. The text is pre-processed for sentence segmentation, and tokenised (words are tokens).</Paragraph> <Paragraph position="12"> In each word, Korean syllables are converted into Korean letters; then, the lexicon of words is searched for the word. Lexicon search is efficient: it processes 41,222 words per second on a P4-400 Windows PC. When Chinese ideograms occur in a stem, the lexicon search module directly searches the lexical information attached to stem entries. We did not include any modules for processing words not found in the lexicon.</Paragraph> <Paragraph position="13"> All analyses that conform to phonotactic and grammatical in-word constraints are retained. However, checking these constraints does not suffice to remove all ambiguity from Korean words. A thorough removal of ambiguity requires a syntactic process (Voutilainen, 1995; Laporte, 2001). Our system presents its output in an acyclic finite-state automaton (also called a graph or a lattice) of morphemes, as in Lee et al. (1997b), but displayed graphically.</Paragraph> <Paragraph position="14"> The output for each morpheme is presented in three parts: surface form, base form, and a structured tag providing the general tag of Table 1 and syntactic features. Word separators such as spaces are also present in this automaton.</Paragraph> <Paragraph position="15"> The annotation of an evaluation sample by the system yielded 67 % recall and 46 % precision. The annotation of a morpheme was considered wrong when any of the features was wrong. 
Among these errors, 78 % are resource errors that can be corrected by updating the resources, whereas the correction of the remaining 22 % would involve enhancing the compilation procedure.</Paragraph> </Section> </Paper>