<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2007">
  <Title>A resource-based Korean morphological annotation system</Title>
  <Section position="4" start_page="38" end_page="38" type="metho">
    <SectionTitle>
3 Alphabets and tag set
</SectionTitle>
    <Paragraph position="0"> Our system uses three Unicode character sets: the Korean syllabic alphabet, the Korean alphabet of letters, and Chinese ideograms. The lexicon of words is built from a set of language resources that were constructed manually and are updated manually by Korean linguists (Nam, 1996). To keep these resources readable, they are encoded in the Korean syllabic alphabet. The only case where this is impossible is when a morpheme boundary does not coincide with a syllable boundary. In that case, the morpheme boundary divides the syllable into two parts; one of these parts has no vowel and cannot be encoded in the syllabic alphabet, so it is encoded in the Korean alphabet of letters, which occupies another zone of the Unicode character set. This convention allows an accurate delimitation of the surface forms and base forms of all morphemes, including irregular ones. Chinese ideograms are provided in the information on Sino-Korean stems, which are sometimes spelled in Chinese ideograms in texts.</Paragraph>
    <Paragraph position="1"> In the lexicon of words itself, words are encoded in the Korean alphabet of letters, which makes lexicon search more efficient. During text annotation, words in the text are converted into letters before the lexicon is searched.</Paragraph>
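    <Paragraph> The syllable-to-letter conversion described above can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables. This is a minimal sketch, not the system's actual code: the function name to_letters is hypothetical, and the choice of the conjoining Hangul Jamo block (starting at U+1100) as the "alphabet of letters" is an assumption, since the text does not say which Unicode zone is used.

```python
# Constants for precomposed Hangul syllables (U+AC00..U+D7A3) and for
# conjoining jamo. Using the conjoining jamo block is an assumption.
S_BASE = 0xAC00
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def to_letters(text):
    """Replace each precomposed Hangul syllable by its sequence of letters."""
    out = []
    for ch in text:
        index = ord(ch) - S_BASE
        if index not in range(S_COUNT):
            out.append(ch)  # not a Hangul syllable: keep unchanged
            continue
        lead, rest = divmod(index, V_COUNT * T_COUNT)
        vowel, tail = divmod(rest, T_COUNT)
        out.append(chr(L_BASE + lead))
        out.append(chr(V_BASE + vowel))
        if tail:  # tail == 0 means the syllable has no final consonant
            out.append(chr(T_BASE + tail))
    return "".join(out)
```

    For precomposed syllables this yields the same result as Unicode canonical decomposition (NFD). Once words are in letter form, a morpheme boundary that falls inside a syllable becomes an ordinary position between two characters.</Paragraph>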
    <Paragraph position="2"> Our tag set is more fine-grained than the state of the art: it comprises 173 tags for stems [to be compared to 18 in Lee et al. (2002) and 14 in Han and Palmer (2005)], and 84 tags for functional morphemes [15 in Lee et al. (2002) and in Han and Palmer (2005)]. The tags are therefore more informative. In addition, the tags are structured. They combine a general tag taken from a list of 16 general tags, and 0 to 4 features specifying subcategories. The list of general tags is displayed in  This structure conforms to emerging international standards for the representation of lexical tags (Lee et al., 2004). Tag sets in previous Korean morphological analysers were unstructured or hierarchical (Lee et al., 2002), not feature-based.</Paragraph>
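    <Paragraph> The structured tags described above (one general tag plus 0 to 4 features) can be pictured with the following sketch. It is illustrative only: the class name Tag, the method matches, and the tag and feature names used with it are hypothetical placeholders, since the actual inventory of general tags and features is not reproduced in this section.

```python
from dataclasses import dataclass

MAX_FEATURES = 4  # the text states 0 to 4 features per tag


@dataclass(frozen=True)
class Tag:
    general: str          # one of the 16 general tags (names are placeholders)
    features: tuple = ()  # 0 to 4 features specifying subcategories

    def __post_init__(self):
        if len(self.features) > MAX_FEATURES:
            raise ValueError("a tag carries at most 4 features")

    def matches(self, general, *required):
        """True if the general tag is equal and every required feature is present."""
        return self.general == general and set(required).issubset(self.features)
```

    With a feature-based representation of this kind, a query can target the general tag alone or any subset of features, which an unstructured tag set does not allow directly.</Paragraph>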
  </Section>
</Paper>