<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1027">
  <Title>Compound Noun Segmentation Based on Lexical Data Extracted from Corpus*</Title>
  <Section position="3" start_page="0" end_page="198" type="metho">
    <SectionTitle>
2 Lexical Data Acquisition
</SectionTitle>
    <Paragraph position="0"> Since the compound noun consists of a series of nouns, the probability model using transition among parts of speech is not helpful, and rather lexical information is required for the compound noun segmentation. Our segmentation algorithm is based on a large collection of lexical information that consists of two kinds of data: One is the hand built segmentation dictionary (HBSD) and the other is the simple noun dictionary for segmentation (SND).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Hand-Built Segmentation Dictionary
</SectionTitle>
      <Paragraph position="0"> The first phase of compound noun segmentation uses the built-in dictionary (HBSD). The advantage of using the built-in dictionary is that the segmentation could (1) be very accurate by hand-made data and (2) become more efficient. In Korean compound noun, one syllable noun is sometimes highly ambiguous between suffix and noun, but human can easily identify them using semantic knowledge. For example, one syllable noun 'ssi' in Korean might be used either as a suffix or as a noun which means 'Mr/Ms' or 'seed' respectively. Without any semantic information, the best way to distinguish them is to record all the compound noun examples containing the meaning of seed in the dictionary since the number of compound nouns containing a meaning of 'seed' is even smaller. Besides, we can treat general spacing errors using the dictionary. By the spacing rule for Korean, there should be one content word except noun in an eojeol, but it turns out that one or more content words of short length sometimes appear without space in real texts, which causes the lexical ambiguities. It makes the system inefficient to deal with all these words on the phase of basic morphological analysis.</Paragraph>
      <Paragraph position="1"> compound nouns</Paragraph>
      <Paragraph position="3"> information in built-in dictionary To construct the dictionary, compound nouns axe extracted from corpus and manually elaborated.</Paragraph>
      <Paragraph position="4"> First, the morphological analyzer analyzes 30 million eojeol corpus using only simple noun dictionary, and the failed results are candidates for compound noun. After postpositions, if any, are removed from the compound noun candidates of the failure eojeols, the candidates axe modified and analyzed by hand. In addition, a collection of compound nouns of KAIST (Korea Advanced Institute of Science &amp; Technology) is added to the dictionary in order to supplement them. The number of entries contained in the built-in dictionary is about 100,000. Table 1 shows some examples in the built-in dictionary. _The italic characters such as 'n' or 'x' in analysis information (right column) of the table is used to make distinction between noun and suffix.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="198" type="sub_section">
      <SectionTitle>
2.2 Extraction of Lexical Information for
Segmentation from Corpus
</SectionTitle>
      <Paragraph position="0"> As we said earlier, it is impossible for all compound nouns to be registered in the dictionary, and thus the built-in dictionary cannot cover all compound nouns even though it gives more accurate results. We need some good segmentation model for compound noun, therefore.</Paragraph>
      <Paragraph position="1"> In compound noun segmentation, the thing that we pay attention to was that lexical information is crucial for segmenting noun compounds. Since a compound noun consists only of a sequence of nouns i.e. (noun)+, the transition probability of parts of speech is no use. Namely, the frequency of each noun plays highly important role in compound noun segmentation. Besides, since the parameter space is huge, we cannot extract enough lexicai information from hundreds of thousands of POS tagged corpus 1 even if accurate lexical information can be extracted from annotated corpus. Thus, a large size of corpus should be used to extract proper frequencies of nouns. However, it is difficult to look at a large size of corpus and to assign analyses to it, which makes it difficult to estimate the frequency distribution of words. Therefore, we need another approach for obtaining frequencies of nouns.</Paragraph>
      <Paragraph position="2"> ~It is the size of POS tagged corpus currently publicized by ETRI (Electronics and Telecommunications Research Institute) project.</Paragraph>
      <Paragraph position="3">  It must be noted here that each noun in compound nouns could be easily segmented by human in many cases because it has a prominent figure in the sense that it is a frequently used word and so familiar with him. In other words, nouns prominent in documents can be defined as frequently occurred ones, which we call distinct nouns. Compound nouns contains these distinct nouns in many cases, which makes it easier to segment them and to identify their constituents.</Paragraph>
      <Paragraph position="4"> Empirically, it is well-known that too many words in the dictionary have a bad influence on morphological analysis in Korean. It is because rarely used nouns result in oversegmentation if they are included in compound noun segmentation dictionary. Therefore, it is necessary to select distinct nouns, which leads us to use a part of corpus instead of entire corpus that consists of frequently used ones in the corpus.</Paragraph>
      <Paragraph position="5"> First, we examined distribution of eojeols in corpus in order to make the subset of corpus to extract lexical frequencies of nouns. The notable thing in our experiment is that the number of eojeols in corpus is increased in proportion to the size of corpus, but a small portion of eojeols takes most parts of the whole corpus. For instance, 70% of the corpus consists of just 60 thousand types of eojeols which take 7.5 million of frequency from 10 million eojeol corpus and 20.5 million from 30 million eojeols. The lowest frequency of the 60,000 eojeols is 49 in 30 million eojeol corpus. We decided to take 60,000 eojeols which are manually tractable and compose most parts of corpus (Figure 1).</Paragraph>
      <Paragraph position="6"> Second, we made morphological analyses for the 60,000 eojeols by hand. Since Korean is an agglutinative language, an eojeol is represented by a sequence of content words and functional words as mentioned before. Especially, content words and functional words often have different distribution of syllables. In addition, inflectional endings for predicate and postpositions for nominals also have quite different distribution for syllables. Hence we can distinguish the constituents of eojeols in many cases. Of course, there are also many cases in which the result of morphological analysis has ambiguities. For example, an eojeol 'na-neun' in Korean has ambiguity of 'na/N+neun/P', 'na/PN+neun/P' and 'nal/V+neun/E'. In this example, the parts of speech N, PN, P, V and E mean noun, pronoun, postposition, verb and ending, respectively.</Paragraph>
      <Paragraph position="7"> On the other hand, many eojeols which are analyzed as having ambiguities by a morphological analyzer are actually not ambiguous. For instance, 'ga-geora' (go/imperative) has ambiguities by most morphological analyzer among 'ga/V+geora/E' and 'ga/N+i/C+geora/E' (C is copula), but it is actually not ambiguous. Such morphological ambiguity is caused by overgeneration of the morphological analyzer since the analyzer uses less detailed rules for robustness of the system. Therefore, if we examine and correct the results scrupulously, many ambiguities can be removed through the process.</Paragraph>
      <Paragraph position="8"> As the result of the manual process, only 15% of 60,000 eojeols remain ambiguous at the mid-level of part of speech classification 2. Then, we extracted simple nouns and their frequencies from the data.</Paragraph>
      <Paragraph position="9"> Despite of manual correction, there must be ambiguities left for the reason mentioned above. There may be some methods to distribute frequencies in case of ambiguous words, but we simply assign the equal distribution to them. For instance, gage has two possibilities of analysis i.e. 'gage/N' and 'galV+ge/E', and its frequency is 2263, in which the noun 'gage' is assigned 1132 as its frequency. Table 2 shows examples of manually corrected morphological analyses of eojeols containing a noun 'gage' and their frequencies. We call the nouns extracted in such a way a set of distinct nouns.</Paragraph>
      <Paragraph position="10"> In addition, we supplement the dictionary with other nouns not appeared in the words obtained by the method mentioned above. First, nouns of more than three syllables are rare in real texts in Korean, as shown in Lee and Ahn (1996). Their experiments proved that syllable based bigram indexing model makes much better result than other n-gram model such as trigram and quadragram in Korean IR. It follows that two syllable nouns take an overwhelming majority in nouns. Thus, there are not many such nouns in the simple nouns extracted by the manually corrected nouns (a set of distinct nouns). In particular, since many nouns of more 2At the mid-level of part of speech classification, for example, endings and postpositions are represented just by one tag e.g. E and P. To identify the sentential or clausal type (subordinate or declarative) in Korean, the ending should be subclassified for syntactic analysis more detail which can be done by statistical process. It is beyond the subject of this paper.</Paragraph>
      <Paragraph position="11">  and ending and '@' is marked for representation of ambiguous analysis than three syllables are derived by a word and suffixes and have some syllable features, they are useful for distinguishing the boundaries of constituents in compound nouns. We select nouns of more than three syllables from morphological dictionary which is used for basic morphological analysis and consists of 89,000 words (noun, verb, adverb etc). Second, simple nouns are extracted from hand-built segmentation dictionary. We selected nouns which do not exist in a set of distinct nouns.</Paragraph>
      <Paragraph position="12"> The frequency is assigned equally with some value fq. Since the model is based on min-max composition and the nouns extracted in the first phase are most important, the value does not take an effect on the system performance.</Paragraph>
      <Paragraph position="13"> The nouns extracted in this way are referred to as a set of supplementary nouns. And the SND for compound noun segmentation is composed of a set of distinct nouns and a set of supplementary nouns.</Paragraph>
      <Paragraph position="14"> The number of simple nouns for compound noun segmentation is about 50,000.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="198" end_page="201" type="metho">
    <SectionTitle>
3 Compound Word Segmentation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="198" end_page="199" type="sub_section">
      <SectionTitle>
Algorithm
3.1 Basic Idea
</SectionTitle>
      <Paragraph position="0"> To simply describe the basic idea of our compound noun segmentation, we first consider a compound noun to be segmented into only two nouns. Given a compound noun, it is segmented by the possibility that a sequence of syllables inside it forms a word.</Paragraph>
      <Paragraph position="1"> The possibility that a sequence of syllables forms a word is measured by the following formula.</Paragraph>
      <Paragraph position="2"> Word(si,... sj) - fq(si,.., sj) Iq~ (1) In the formula, fq(s~,...sj) is the frequency of the syllable si...sj, which is obtained from SND constructed on the stages of lexical data extraction.</Paragraph>
      <Paragraph position="3"> And, fqN is the total sum of frequencies of simple nouns. Colloquially, the equation (1) estimates how much the given sequence of syllables are likely to be word. If a sequence of syllables in the set of distinct nouns is included in a compound noun, it is more probable that it is divided around the syllables. If a compound noun consists of, for any combination of syllables, sequences of syllables in the set of supplementary nouns, the boundary of segmentation is somewhat fuzzy. Besides, if a given sequence of syllables is not found in SND, it is not probable that it is a noun.</Paragraph>
      <Paragraph position="4"> Consider a compound noun 'hagdeggyo-saenghwal(school life)'. In case that segmentation of syllables is made into two, there would be four possibilities of segmentation for the example as follows:  1. hag 9yo-saeng-hwal 2. hag-gyo saeng-hwal 3. hag-gyo-saeng hwal 4. hag-gyo-saeng-hwal C/  As we mentioned earlier, it is desirable that the eojeol is segmented in the position where each sequence of syllables to be divided occurs frequently enough in training data. As the length of a sequence of syllables is shorter in Korean, it occurs more frequently. That is, the shorter part usually have higher frequency than the other (longer) part when we divide syllables into two. Moreover, if the other part is the syllables that we rarely see in texts, then the part would not be a word. In the first of the above example, hag is a sequence of syllable appearing frequently, but gyo-saeng-hwa! is not. Actually, gyo-saeng-hwal is not a word. On the other hand, both hag-gyo and saeng-hwal are frequently occurring syllables, and actually they are all words. Put another way, if it is unlikely that one sequence of syllables is a word, then it is more likely that the entire syllables are not segmented. The min-max composition is a suitable operation for this case. Therefore, we first  take the minimum value from the function Word for each possibility of segmentation, and then we choose the maximum from the selected minimums. Also, the argument taking the maximum is selected as the most likely segmentation result.</Paragraph>
      <Paragraph position="5"> Here, Word(si... sj) is assigned the frequency of the syllables si... sj from the dictionary SND. Besides, if two minimums are equal, the entire syllable such as hag-gyo-saeng-hwal, if compared, is preferred, the values of the other sequence of syllables are compared or the dominant pattern has the priority. null</Paragraph>
    </Section>
    <Section position="2" start_page="199" end_page="199" type="sub_section">
      <SectionTitle>
3.2 Segmentation Algorithm
</SectionTitle>
      <Paragraph position="0"> In this section, we generalize the word segmentation algorithm based on data obtained by the training method described in the previous section. The basic idea is to apply min-max operation to each syllable in a compound noun by the bottom-up strategy. That is, if the minimum between Words of two sequences of syllables is greater than Word of the combination of them, the syllables should be segmented. For instance, let us suppose a compound noun consist of two syllable Sl and s2. If min(Word(Sl), Word(s2)) &gt; Word(sis2), then the compound noun is segmented into Sl and s2. It is not segmented, otherwise. That is, we take the maximum among minimums. For example, 'hag' is a frequently occurring word, but 'gyo' is not in Korean.</Paragraph>
      <Paragraph position="1"> In this case, we can hardly regard the sequence of syllable 'hag-gyo' as the combination of two words 'hag' and 'gyo'. The algorithm can be applied recursively from individual syllable to the entire syllable of the compound noun.</Paragraph>
      <Paragraph position="2"> The segmentation algorithm is effectively implemented by borrowing the CYK parsing method.</Paragraph>
      <Paragraph position="3"> Since we use the bottom-up strategy, the execution looks like composition rather than segmentation. After all possible segmentation of syllables being checked, the final result is put in the top of the table. When a compound noun is composed of n syllables, i.e. sis2.., s,~, the composition is started from each si (i = 1... n). Thus, the possibility that the individual syllable forms a word is recorded in the cell of the first row.</Paragraph>
      <Paragraph position="4"> Here, Ci,j is an element of CYK table where the segment result of the syllables sj,...j+i-1 is stored (Figure 2). For instance, the segmentation result such that ar g max(min( W ord( s l ), Word(s2)), Word(s1 s2)) is stored in C1,2. What is interesting here is that the procedure follows the dynamic programming. Thus, each cell C~,j has the most probable segmentation result for a series of syllables sj ..... j+i-1- Namely, C1,2 and C2,3 have the most likely segmentation of sis2 and s2s3 respectively. When the segmentation of sls2s3 is about to be checked, min(value(C2,1), value(C1,3)),</Paragraph>
      <Paragraph position="6"> are compared to determine the segmentation for the syllables, because all Ci,j have the most likely segmentation. Here, value(Ci,j) represents the possibility value of Ci,j.</Paragraph>
      <Paragraph position="7"> Then, we can describe the segmentation algorithm as follows: When it is about to make the segmentation of syllables s~... sj, the segmentation results of less length of syllables like si...sj-1, S~+l... sj and so forth would be already stored in the table. In order to make analysis of si... s j, we combine two shorter length of analyses and the word generation possibilities are computed and checked.</Paragraph>
      <Paragraph position="8"> To make it easy to explain the algorithm, let us take an example compound noun 'hag-gyo-saenghwa~ (school life) which is segmented with 'haggyo' (school) and 'saenghwar (life) (Figure 3). When it comes up to cell C4,1, we have to make the most probable segmentation for 'hag-gyo-saeng-hwal' i.e.</Paragraph>
      <Paragraph position="9"> SlS2S3S4. There are three kinds of sequences of syllables, i.e. sl in CI,1, sis2 in C2,1 and SlS2S3 in C3,1 that can construct the word consisting of 8182s384 which would be put in Ca,1. For instance, the word sls2s3s4 (hag-gyo-saeng-hwal) is made with Sl (hag) combined with sus3s4 (gyo-saeng-hwal). Likewise, it might be made by sis2 combined with s3s4 and sls2s3 combined with s4. Since each cell has the most probable result and its value, it is simple to find the best segmentation for each syllables. In addition, four cases, including the whole sequences of syllables, are compared to make segmentation of SlS2SaS4 as follows:</Paragraph>
      <Paragraph position="11"> Again, the most probable segmentation result is put in C4,1 with the likelihood value for its segmentation. We call it MLS (Most Likely Segmentation)</Paragraph>
      <Paragraph position="13"> From the four cases, the maximum value and the segmentation result are selected and recorded in C4,1. To generalize it, the algorithm is described as shown in Figure 4.</Paragraph>
      <Paragraph position="14"> The algorithm is straightforward. Let Word and MLS be the likelihood of being a noun and the most likely segmentation for a sequence of syllables. In the initialization step, each cell of the table is assigned Word value for a sequence of syllables sj ... sj+i+l using its frequency if it is found in SND. In other words, if the value of Word for the sequence in each cell is greater than zero, the syllables might be as a noun a part of a compound noun and so the value is recorded as MLS. It could be substituted by more likely one in the segmentation process.</Paragraph>
      <Paragraph position="15"> In order to make it efficient, the segmentation result is put as MLS instead of the syllables in case the sequence of syllables exists in the HBND. The minimum of each Word for constituents of the result as Word is recorded.</Paragraph>
      <Paragraph position="16"> Then, the segmenter compares possible analyses to make a larger one as shown in Figure 4. Whenever Word of the entire syllables is less than that of segmented one, the syllables and value are replaced with the segmented result and its value. For instance, sl + s2 and its likelihood substitutes C2,1 if min(Word(sl), Word(s2)) &gt; Word(sis2). When the entire syllables from the first to nth syllable are processed, C,~,x has the segmentation result.</Paragraph>
      <Paragraph position="17"> The overall complexity of the algorithm follows that of CYK parsing, O(n3).</Paragraph>
    </Section>
    <Section position="3" start_page="199" end_page="201" type="sub_section">
      <SectionTitle>
3.3 Default Analysis and Tuning
</SectionTitle>
      <Paragraph position="0"> For the final result, we should take into consideration several issues which are related with the syllables that left unsegmented. There are several reasons that the given string remains unsegmented: i .odeg1 .. .i i .oo I ~equence of ~yllabl~ iu diviner gel default ~ementatitm pointer  chug-sa-si-heom' where 'si-heom' is a very frequently used noun.</Paragraph>
      <Paragraph position="1"> 1. The first one is a case where the string consists of several nouns but one of them is a unregistered word. A compound noun 'geon-chug-sa-si-heom' is composed of 'geon-chug-sa' and 'si-heom', which have the meanings of authorized architect and examination. In this case, the unknown noun is caused by the suffix such as 'sa' because the suffix derives many words. However, it is known that it is very difficult to treat the kinds of suffixes since the suffix like 'sa' is a very frequently used character in Korean and thus prone to make oversegmentation if included in basic morphological analysis.</Paragraph>
      <Paragraph position="2"> 2. The string might consist of a proper noun alad a noun representing a position or geometric information. For instance, a compound noun 'kimdae-jung-dae-tong-ryeong' is composed of 'kimdae-jung' and 'dae-tong-ryeong' where the former is personal name and the latter means president respectively.</Paragraph>
      <Paragraph position="3"> 3. The string might be a proper noun itself. For example, 'willi'amseu' is a transliterated word for foreign name 'Williams' and 'hong-gil-dong' is a personal name in Korean. Generally, since it has a different sequence of syllables from in a general Korean word, it often remains unsegmented. null If the basic segmentation is failed, three procedures would be executed for solving three problems above. For the first issue, we use the set of distinct nouns. That is, the offset pointer is stored in the initialization step as well as frequency of each noun in compound noun is recorded in the table. Attention should be paid to non-frequent sequence of syllables (ones in the set of supplementary nouns) in the default segmentation because it could be found in any proper noun such as personal names, place names, etc or transliterated words. It is known that the performance drops if all nouns in the compound noun segmentation dictionary are considered for default segmentation. We save the pointer to the boundary only when a noun in distinct set appears. For the above example 'geon-chug-sa-si-heom', the default segmentation would be 'geon-chug-sa' and 'si-heom' since 'si-heom' is in the set of distinct nouns and the pointer is set before 'si-heom' (Figure 5).</Paragraph>
      <Paragraph position="4">  If this procedure is failed, the sequence of syllables is checked whether it might be proper noun or not. Since proper noun in Korean could have a kind of nominal suffix such as 'daetongryeong(president)' or 'ssi(Mr/Ms)' as mentioned above, we can identify it by detaching the nominal suffixes. If there does not exist any nominal suffix, then the entire syllables would be regarded just as the transliterated foreign word or a proper noun like personal or place name.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="201" end_page="202" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> For the test of compound noun segmentation, we first extracted compound noun from ETRI POS tagged corpus 3. By the processing, 1774 types of compound nouns were extracted, which was used as a gold standard test set.</Paragraph>
    <Paragraph position="1"> We evaluated our system by two methods: (1) the precision and recall rate, and (2) segmentation accuracy per compound noun which we refer to as SA. They are defined respectively as follows: Precision = number of correct constituents in proposed segment results total number o\] constituents in proposed segment results Recall = number of correct constituents in proposed segment results total number of constituents in compoundnouns SA = number of correctly segmented compound  and Telecommunications Research Institute) project for standardization of natural language processing technology and the corpus presented consists of about 270,000 eojeols at present. What influences on the Korean IR system is whether words are appropriately segmented or not.</Paragraph>
    <Paragraph position="2"> The precision and recall estimate how appropriate the segmentation results are. They are 98.04% and 97.80% respectively, which shows that our algorithm is very effective (Table 3).</Paragraph>
    <Paragraph position="3"> SA reflects how accurate the segmentation is for a compound noun at all. We compared two methods: (1) using only the segmentation algorithm with default analysis which is a baseline of our system and so is needed to estimate the accuracy of the algorithm. (2) using both the built-in dictionary and the segmentation algorithm which reflects system accuracy as a whole. As shown in Table 4, the baseline performance using only distinct nouns and the algorithm is about 94.3% and fairly good. From the results, we can find that the distinct nouns has great impact on compound noun segmentation. Also, the overall segmentation accuracy for the gold standard is about 97.29% which is a very good result for the application system. In addition, it shows that the built-in dictionary supplements the algorithm which results in better segmentation.</Paragraph>
    <Paragraph position="4"> Lastly, we compare our system with the previous work by (Yun et al. , 1997). It is impossible that we directly compare our result with theirs, since the test set is different. It was reported that the accuracy given in the paper is about 95.6%. When comparing the performance only in terms of the accuracy, our system outperforms theirs.</Paragraph>
    <Paragraph position="5"> Embeded in the morphological analyzer, the compound noun segmentater is currently being used for some projects on MT and IE which are worked in several institutes and it turns out that the system is very effective.</Paragraph>
  </Section>
class="xml-element"></Paper>