<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2007">
  <Title>A resource-based Korean morphological annotation system</Title>
  <Section position="3" start_page="37" end_page="38" type="intro">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> (HAM) is one of the best Korean morphological analysers. Other fairly representative examples are described in (Shin et al., 1995; Park et al., 1998) and in (Lee et al., 1997a; Cha et al., 1998; Lee et al., 2002). The output for each morpheme is presented in two parts: the morpheme itself, and a grammatical tag. Morphemes are usually presented in their base form if they are stems, and in their surface form if they are functional morphemes. Tags are represented by symbols; they give the POS of stems, and grammatical information about functional morphemes. In the output, 95% to 97% of morpheme/tag pairs are considered correct.</Paragraph>
    <Paragraph position="1"> Morphological annotation of Korean text is usually performed in two steps (Sproat, 1992).</Paragraph>
    <Paragraph position="2"> In the first step, morpheme segmentation is performed with the aid of a lexicon of morphemes.</Paragraph>
    <Paragraph position="3"> This generates all possible ways of segmenting the input word. The second step makes a selection among the segmentations obtained and among the tags attached to the morphemes. The second step involves frequency-based learning from a tagged corpus with statistical models such as hidden Markov models, and sometimes also with error-driven learning of symbolic transformation rules (Brill, 1995; Lee et al., 1997a; Lee et al., 2002). Morphemes not found in the lexicon undergo a special treatment that guesses their properties. A recent variant of this approach (Han and Palmer, 2005) swaps the two main steps: first, a sequence of tags is assigned to each word on the basis of a statistical model; then, morphological segmentation is performed with a lexicon of morphemes. Other approaches are less popular among researchers and language engineering companies. Some systems are based on two-level models, such as (Kim et al., 1994) and the Klex system of Han Na-rae (http://www.cis.upenn.edu/~nrh/klex.html). The delimitation of morphemes is provided, but some morpheme boundaries are usually modified so that they coincide with syllable boundaries. For example, if two suffixes make up a single syllable, like -syeoss:- syoss- which is a contraction of -eusi:- eusi- (honorification towards sentence subject) and -eoss:- oss- (past), they are usually considered as one morpheme.</Paragraph>
    <Paragraph position="4"> Such simplifications make it possible to encode morphemes on the Korean syllable-based alphabet, and are compatible with syllable-based models (Kang and Kim, 1994). However, they are an approximation.</Paragraph>
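The first step of the conventional two-step procedure described above, enumerating every segmentation of a word against a lexicon of morphemes, can be sketched as follows. This is only an illustration under stated assumptions: the lexicon entries, the tag names and the function name are hypothetical, the toy data is English rather than Korean, and the sketch does not reproduce any of the cited systems.

```python
# Toy morpheme lexicon: surface form mapped to a hypothetical tag.
# These entries are illustrative and are not taken from the paper.
LEXICON = {
    "un": "PREFIX",
    "lock": "STEM",
    "unlock": "STEM",
    "able": "SUFFIX",
    "lockable": "STEM",
}


def segmentations(word, lexicon=LEXICON):
    """Return every way to split `word` into morphemes found in `lexicon`."""
    memo = {}

    def split_from(start):
        # Analyses of the remainder of the word, computed once per position.
        if start in memo:
            return memo[start]
        if start == len(word):
            # One empty analysis marks a successful arrival at the word end.
            result = [[]]
        else:
            result = []
            for end in range(start + 1, len(word) + 1):
                piece = word[start:end]
                if piece in lexicon:
                    for rest in split_from(end):
                        result.append([(piece, lexicon[piece])] + rest)
        memo[start] = result
        return result

    return split_from(0)


if __name__ == "__main__":
    for analysis in segmentations("unlockable"):
        print(" + ".join(f"{form}/{tag}" for form, tag in analysis))
```

With this toy lexicon, "unlockable" receives three competing analyses. In the two-step approach, a second component such as a statistical model trained on a tagged corpus would then select one of them.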
    <Paragraph position="5"> We opted for the resource-based approach to obtain more accurate and more informative output. The language resources used in annotators are corpora, rules and lexicons.</Paragraph>
    <Paragraph position="6"> Corpus-based systems have an inherent lack of flexibility. A morphological annotator is not a static infrastructure; it has to evolve over time. Because language evolves over time, especially technical language, regular updates are necessary; a new application may also involve the selection of a domain-specific vocabulary. The flexibility of a resource can be defined as the ability to control its evolution. In order to adapt a corpus-based system, one feeds a new corpus into the training process, since the operation of the system depends on the nature of the training corpus. A training process with a tagged corpus gives much better performance than unsupervised training (Merialdo, 1994). The extension of a system to input texts of new types or of a new period of time therefore involves the costly task of tagging a corpus of new texts. Another type of evolution of a corpus-based system, the refinement of the tag set (for example, the addition of new features), involves re-tagging existing tagged corpora, a task that is seldom carried out.</Paragraph>
    <Paragraph position="7"> The situation is different with rules or lexicons. The flexibility of a manually constructed and updated rule set or lexicon depends on its level of readability and of non-redundancy (see section 4).</Paragraph>
    <Paragraph position="8"> In current practice, words are segmented by a morphological analysis module that accesses a lexicon of morphemes and uses a set of rules. It has been claimed that morphological annotation of Korean text could only be performed this way, because a lexicon of words would be too large (e.g. Lee et al., 2002; Han and Palmer, 2005). We show that it can be performed directly with a lexicon of words; this solution dispenses with rules, thus simplifying and speeding up morphological annotation. The evidence given by Han and Palmer (2005) in support of their claim is the fact that the number of different words in Korean is very large, which is undisputed. In fact, they implicitly assume that the lexicon would be obtained by sequentially generating all words and associated information. Such a naive procedure would surely be impractical. Our system constructs a lexicon of words without generating any list of words at any of the phases of its construction or maintenance.</Paragraph>
    <Paragraph position="9"> In our design, all morphological rules are applied to all possible configurations during the compilation of the resources, and the results are stored in a lexicon of words, which is searched during text annotation. No morphological rules are applied at annotation time. The lexicon of words occupies less than 600 Kb, and specifies 138,000,000 surface forms of words obtained from 39,130 base-form stems. The size of the lexicon does not grow with the number of words, due to our adaptation to Korean of state-of-the-art technology for lexicon management (Appel and Jacobson, 1988; Silberztein, 1991; Revuz, 1992; Lucchesi and Kowaltowski, 1993). Our approach could even be adapted further to allow for constructing a lexicon with infinitely many words.</Paragraph>
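To make plausible how a lexicon can specify far more words than it stores items, the following minimal sketch uses a small acyclic finite-state automaton in which all stems share a single suffix part. It is an assumption-laden illustration, not the system's actual data structure, which this section does not describe: the states, labels, English toy forms and function names are all hypothetical.

```python
# Hypothetical toy automaton (not the paper's resource): state mapped to
# {label: next state}. Every stem leads to the same state, so the suffix
# part is stored once however many stems the lexicon contains.
TRANSITIONS = {
    0: {"walk": 1, "talk": 1, "jump": 1},
    1: {"s": 2, "ed": 2, "ing": 2},
    2: {},
}
FINAL_STATES = {1, 2}


def count_words(transitions=TRANSITIONS, final_states=FINAL_STATES, start=0):
    """Count the accepted words without generating any list of words."""
    memo = {}

    def count(state):
        if state not in memo:
            here = 1 if state in final_states else 0
            memo[state] = here + sum(
                count(nxt) for nxt in transitions[state].values()
            )
        return memo[state]

    return count(start)


def accepts(word, state=0):
    """Check by lookup alone whether some path from `state` spells `word`."""
    if word == "":
        return state in FINAL_STATES
    return any(
        word.startswith(label) and accepts(word[len(label):], nxt)
        for label, nxt in TRANSITIONS[state].items()
    )


def transition_count(transitions=TRANSITIONS):
    """Size of the stored structure, measured in transitions."""
    return sum(len(arcs) for arcs in transitions.values())


if __name__ == "__main__":
    print(count_words(), "words stored in", transition_count(), "transitions")
    print(accepts("walking"), accepts("walkings"))
```

In this toy automaton, six transitions specify twelve words, and adding one stem adds a single transition while adding four words. The number of words is computed from the structure itself, and lookup applies no morphological rule.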
  </Section>
</Paper>