<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0603"> <Title>Unsupervised Discovery of Morphemes</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Method 1: Recursive Segmentation and MDL Cost </SectionTitle> <Paragraph position="0"> The task is to find the optimal segmentation of the source text into morphs. One can think of this as constructing a model of the data in which the model consists of a vocabulary of morphs, i.e. the codebook, and the data is the sequence of text. We try to find a set of morphs that is concise and that, moreover, gives a concise representation of the data. This is achieved by utilizing an MDL cost function.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Model Cost Using MDL </SectionTitle> <Paragraph position="0"> The total cost consists of two parts: the cost of the source text in this model and the cost of the codebook. Let M be the morph codebook (the vocabulary of morph types) and D = m_1 m_2 \ldots m_n the sequence of morph tokens that makes up the string of words. We then define the total cost C as</Paragraph> <Paragraph position="1"> C = -\sum_{i=1}^{n} \log p(m_i) + \sum_{m_j \in M} k \, l(m_j) </Paragraph> <Paragraph position="2"> The cost of the source text is thus the negative log-likelihood of the morphs, summed over all the morph tokens that comprise the source text. The cost of the codebook is simply the length in bits needed to represent each morph separately as a string of characters, summed over the morphs in the codebook. The length in characters of the morph m_j is denoted by l(m_j), and k is the number of bits needed to code a character (we have used a value of 5, since that is sufficient for coding 32 lower-case letters). For p(m_i) we use the ML estimate, i.e., the token count of m_i divided by the total count of morph tokens.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Search Algorithm </SectionTitle> <Paragraph position="0"> The online search algorithm works by incrementally suggesting changes that could improve the cost function. Each time a new word token is read from the input, different ways of segmenting it into morphs are evaluated, and the one with minimum cost is selected.</Paragraph> <Paragraph position="1"> Recursive segmentation. The search for the optimal morph segmentation proceeds recursively. First, the word as a whole is considered to be a morph and added to the codebook. Next, every possible split of the word into two parts is evaluated.</Paragraph> <Paragraph position="2"> The algorithm selects the split (or no split) that yields the minimum total cost. In case of no split, the processing of the word is finished and the next word is read from the input. Otherwise, the search for a split is performed recursively on the two segments.</Paragraph> <Paragraph position="3"> The order of splits can be represented as a binary tree for each word, where the leaves represent the morphs making up the word, and the tree structure describes the ordering of the splits.</Paragraph>
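<Paragraph position="4"> As a concrete illustration of the cost function of Section 2.1 and the recursive splitting just described, consider the following minimal Python sketch. It is not the authors' implementation: the toy corpus, the helper names, and the greedy one-word update are our own illustrative assumptions; only the cost function (with k = 5 bits per character) follows the text.

import math
from collections import Counter

K = 5  # bits per character; sufficient for 32 lower-case letters

def total_cost(counts):
    # C = -sum over tokens of log p(m_i)  +  sum over codebook of K * l(m_j)
    n = sum(counts.values())
    data_cost = -sum(c * math.log2(c / n) for c in counts.values())
    codebook_cost = sum(K * len(m) for m in counts)
    return data_cost + codebook_cost

def split_recursively(morph, counts):
    # Evaluate every binary split of `morph`; keep the cheapest option.
    best_cost, best_split = total_cost(counts), None
    for i in range(1, len(morph)):
        left, right = morph[:i], morph[i:]
        trial = counts.copy()
        trial[morph] -= 1
        if trial[morph] == 0:
            del trial[morph]  # a chunk with zero count leaves the codebook
        trial[left] += 1
        trial[right] += 1
        cost = total_cost(trial)
        if cost < best_cost:
            best_cost, best_split = cost, (left, right)
    if best_split is None:
        return [morph]  # no split: the chunk stays a morph
    left, right = best_split
    counts[morph] -= 1
    if counts[morph] == 0:
        del counts[morph]
    counts[left] += 1
    counts[right] += 1
    # recurse on both segments, as in the recursive segmentation above
    return split_recursively(left, counts) + split_recursively(right, counts)

counts = Counter()
for word in ["linja-auton"] * 7 + ["autonkuljettajallakaan"] * 2:
    counts[word] += 1  # first consider the word as a whole to be a morph
    print(word, "segmented as", split_recursively(word, counts))
</Paragraph>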
<Paragraph position="5"> During model search, an overall hierarchical data structure is used for keeping track of the current segmentation of every word type encountered so far. Let us assume that we have seen seven instances of linja-auton (Engl. 'of [the] bus') and two instances of autonkuljettajallakaan (Engl. 'not even by/at/with [the] car driver'). Figure 1 then shows a possible structure used for representing the segmentations of the data.</Paragraph> <Paragraph position="6"> Figure 1: Hierarchical structure representing the segmentation of the words linja-auton and autonkuljettajallakaan. The boxes represent chunks. Boxes with bold text are morphs, and are part of the codebook. The numbers above each box are the split location (to the left of the colon sign) and the occurrence count of the chunk (to the right of the colon sign).</Paragraph> <Paragraph position="7"> Each chunk is provided with an occurrence count of the chunk in the data set and the split location in this chunk. A zero split location denotes a leaf node, i.e., a morph. The occurrence counts flow down through the hierarchical structure, so that the count of a child always equals the sum of the counts of its parents.</Paragraph> <Paragraph position="8"> The occurrence counts of the leaf nodes are used for computing the relative frequencies of the morphs.</Paragraph> <Paragraph position="9"> To find out the morph sequence that a word consists of, we look up the chunk that is identical to the word, and trace the split indices recursively until we reach the leaves, which are the morphs.</Paragraph> <Paragraph position="10"> Note that the hierarchical structure is used only during model search: it is not part of the final model, and accordingly no cost is associated with any nodes other than the leaf nodes.</Paragraph> <Paragraph position="11"> Adding and removing morphs. Adding new morphs to the codebook increases the codebook cost. Consequently, a new word token will tend to be split into morphs already listed in the codebook, which may lead to local optima. To better escape local optima, each time a new word token is encountered, it is resegmented, whether or not this word has been observed before. If the word has been observed (i.e., the corresponding chunk is found in the hierarchical structure), we first remove the chunk and decrease the counts of all its children. Chunks with zero count are removed (recall that removal of a leaf node corresponds to removal of a morph from the codebook). Next, we increase the count of the observed word chunk by one and re-insert it as an unsplit chunk. Finally, we apply the recursive splitting to the chunk, which may lead to a new, different segmentation of the word.</Paragraph> <Paragraph position="12"> "Dreaming". Due to the online learning, as the number of processed words increases, the quality of the set of morphs in the codebook gradually improves. Consequently, words encountered in the beginning of the input data, and not observed since, may have a sub-optimal segmentation in the new model, since at some point more suitable morphs have emerged in the codebook. We have therefore introduced a 'dreaming' stage: at regular intervals the system stops reading words from the input and instead iterates over the words already encountered, in random order. These words are resegmented and thus compressed further, if possible. Dreaming continues for a limited time or until no considerable decrease in the total cost can be observed. Figure 2 shows the development of the average cost per word as a function of the increasing amount of source text when processing newspaper text. Dreaming, i.e., the re-processing of the words encountered so far, takes place five times, which can be seen as sudden drops on the curve.</Paragraph>
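<Paragraph position="13"> The chunk bookkeeping described above can be sketched as follows. This is a minimal Python sketch: the data layout, the particular split positions, and the trace helper are our own illustrative assumptions, not the authors' code.

from dataclasses import dataclass

@dataclass
class Chunk:
    count: int  # occurrences of this chunk in the data set
    split: int  # split location; 0 denotes a leaf node, i.e., a morph

def trace(chunks, s):
    # Recover the morph sequence of s by tracing split indices to the leaves.
    node = chunks[s]
    if node.split == 0:
        return [s]
    return trace(chunks, s[:node.split]) + trace(chunks, s[node.split:])

# Hypothetical state after 7 x linja-auton and 2 x autonkuljettajallakaan;
# a child's count is the sum of its parents' counts (auton: 7 + 2 = 9).
chunks = {
    "linja-auton": Chunk(7, 6),
    "linja-": Chunk(7, 0),
    "autonkuljettajallakaan": Chunk(2, 5),
    "auton": Chunk(9, 0),
    "kuljettajallakaan": Chunk(2, 0),
}
print(trace(chunks, "linja-auton"))  # ['linja-', 'auton']
</Paragraph>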
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Method 2: Sequential Segmentation and ML Cost </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Model Cost Using ML </SectionTitle> <Paragraph position="0"> In this case, we use as the cost function the likelihood of the data, i.e., P(data|model). Thus, the model cost is not included. This corresponds to Maximum-Likelihood (ML) learning. The cost is then</Paragraph> <Paragraph position="1"> C = -\sum_{i=1}^{n} \log p(m_i) </Paragraph> <Paragraph position="2"> where the summation is over all morph tokens in the source data. As before, for p(m_i) we use the ML estimate, i.e., the token count of m_i divided by the total count of morph tokens.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Search Algorithm </SectionTitle> <Paragraph position="0"> In this case, we utilize batch learning, where an EM-like (Expectation-Maximization) algorithm is used for optimizing the model. Moreover, splitting is not recursive but proceeds linearly.</Paragraph> <Paragraph position="1"> 1. Initialize the segmentation by splitting words into morphs at random intervals, starting from the beginning of the word. The lengths of the intervals are sampled from the Poisson distribution with λ = 5.5. If the interval is larger than the number of letters in the remaining word segment, the splitting ends.</Paragraph> <Paragraph position="2"> 2. Repeat for a number of iterations: (a) Estimate morph probabilities for the given splitting.</Paragraph> <Paragraph position="3"> (b) Given the current set of morphs and their probabilities, re-segment the text using the Viterbi algorithm to find the segmentation with the lowest cost for each word (see the Python sketch below). (c) If not the last iteration: Evaluate the segmentation of a word against the rejection criteria. If the proposed segmentation is not accepted, segment this word randomly (as in the initialization step).</Paragraph> <Paragraph position="4"> Note that the possibility of introducing a random segmentation at step (c) is the only mechanism that allows new morphs to be added: under the ML probability estimates, a morph not present in the current set has zero probability and thus infinite cost. In fact, without this step the algorithm seems to get seriously stuck in suboptimal solutions. Rejection criteria. (1) Rare morphs. Reject the segmentation of a word if it contains a morph that was used in only one word type in the previous iteration. This is motivated by the fact that extremely rare morphs are often incorrect. (2) Sequences of one-letter morphs. Reject the segmentation if it contains two or more one-letter morphs in a sequence. For instance, accept the segmentation halua + n (Engl. 'I want', i.e., the present stem of the verb 'to want' followed by the ending for the first person singular), but reject the segmentation halu + a + n (the stem of the noun 'desire' followed by a strange sequence of endings). Long sequences of one-letter morphs are usually a sign of a very bad local optimum that may even get worse in future iterations, in case too much probability mass is transferred onto these short morphs3.</Paragraph> <Paragraph position="5"> 3Nevertheless, for Finnish there do exist some one-letter morphemes that can occur in a sequence. However, these morphemes can be thought of as a group that belongs together: e.g., the Finnish talo + j + a (plural partitive of 'house') can also be thought of as talo + ja.</Paragraph>
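<Paragraph position="6"> The Viterbi re-segmentation of step 2(b) can be sketched in Python as follows. This is a minimal sketch: the toy morph probabilities are invented for illustration, and the random initialization and the rejection criteria of steps 1 and 2(c) are omitted. Note that a morph absent from the inventory has infinite cost under the ML estimates, which is why step (c) is the only way new morphs can enter.

import math

def viterbi_segment(word, morph_cost):
    # morph_cost[m] = -log p(m); find the lowest-cost segmentation of `word`.
    INF = float("inf")
    best = [0.0] + [INF] * len(word)  # best[i]: minimum cost of word[:i]
    back = [0] * (len(word) + 1)      # back[i]: start index of the last morph
    for i in range(1, len(word) + 1):
        for j in range(i):
            m = word[j:i]
            if m in morph_cost and best[j] + morph_cost[m] < best[i]:
                best[i] = best[j] + morph_cost[m]
                back[i] = j
    morphs, i = [], len(word)
    while i > 0:
        morphs.append(word[back[i]:i])
        i = back[i]
    return list(reversed(morphs)), best[len(word)]

probs = {"halua": 0.2, "halu": 0.1, "a": 0.3, "n": 0.4}  # toy ML estimates
costs = {m: -math.log(p) for m, p in probs.items()}
print(viterbi_segment("haluan", costs))  # (['halua', 'n'], ...): the cheaper split
</Paragraph>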
</Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Evaluation Measures </SectionTitle> <Paragraph position="0"> We wish to evaluate the method quantitatively from the following perspectives: (1) correspondence with linguistic morphemes, (2) efficiency of compression of the data, and (3) computational efficiency. The efficiency of compression can be evaluated as the total description length of the corpus and the codebook (the MDL cost function). The computational efficiency of the algorithm can be estimated from the running time and memory consumption of the program. However, the linguistic evaluation is in general not so straightforward.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Linguistic Evaluation Procedure </SectionTitle> <Paragraph position="0"> If a corpus with marked morpheme boundaries is available, the linguistic evaluation can be computed as the precision and recall of the segmentation. Unfortunately, we did not have such data sets at our disposal, and for Finnish no such data sets even exist. In addition, it is not always clear exactly where the morpheme boundary should be placed. Several alternatives may be possible, cf. Engl. hope + d vs. hop + ed (past tense of to hope).</Paragraph> <Paragraph position="1"> Instead, we utilized an existing tool that provides a morphological analysis, although not a segmentation, of words, based on the two-level morphology of Koskenniemi (1983). The analyzer is a finite-state transducer that reads a word form as input and outputs the base form of the word together with grammatical tags. Sample analyses are shown in Figure 3.</Paragraph> <Paragraph position="2"> Figure 3: Sample analyses for English and Finnish word forms. The Finnish words are auton (car's), puutaloja ([some] wooden houses) and tehnyt ([has] done). The tags are A (adjective), ACT (active voice), ADV (adverb), CMP (comparative), GEN (genitive), N (noun), PCP2 (2nd participle), PL (plural), PTV (partitive), SG (singular), V (verb), and <DER:ly> (-ly derivative).</Paragraph> <Paragraph position="3"> The tag set consists of tags corresponding to morphological affixes and other tags, for example, part-of-speech tags. We preprocessed the analyses by removing all tags other than those corresponding to affixes, and further split compound base forms (marked using the # character by the analyzer) into their constituents. As a result, we obtained for each word a sequence of labels that corresponds well to a linguistic morphemic analysis of the word. A label can often be considered to correspond to a single word segment, and the labels appear in the order of the segments.</Paragraph> <Paragraph position="4"> The following step consists in retrieving the segmentation produced by one of the unsupervised segmentation algorithms, and trying to align this segmentation with the desired morphemic label sequence (cf. Figure 4).</Paragraph>
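<Paragraph position="5"> The morphemic label sequences used in this alignment can be derived from the analyzer output as in the following minimal Python sketch. The analyzer's output format and the affix tag list here are assumptions for illustration, not the analyzer's actual interface.

AFFIX_TAGS = {"PL", "PTV", "GEN", "SG", "ACT", "PCP2", "CMP"}  # assumed subset

def to_label_sequence(base_form, tags):
    # Split compound base forms at '#' and keep only the affix tags,
    # yielding a morphemic label sequence for the word.
    labels = base_form.upper().split("#")
    labels += [t for t in tags if t in AFFIX_TAGS]
    return labels

print(to_label_sequence("puu#talo", ["N", "PL", "PTV"]))
# ['PUU', 'TALO', 'PL', 'PTV'], the label sequence for puutaloja
</Paragraph>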
<Paragraph position="6"> A good segmentation algorithm will produce morphs that align gracefully with the correct morphemic labels, preferably producing a one-to-one mapping. A one-to-many mapping from morphs to labels is also acceptable when a morph forms a common entity, such as the suffix -ja in puutaloja, which contains both the plural and the partitive element.</Paragraph> <Paragraph position="7"> By contrast, a many-to-one mapping from morphs to a label is a sign of excessive splitting, e.g., t + alo for talo (cf. English h + ouse for house).</Paragraph> <Paragraph position="8"> We assume that the segmentation algorithm has split the word bigger into the morphs bigg + er, hours' into hour + s + ', and puutaloja into puu + t + alo + ja.</Paragraph> <Paragraph position="9"> Alignment procedure. We align the morph sequence with the morphemic label sequence using dynamic programming, namely Viterbi alignment, to find the best sequence of mappings between morphs and morphemic labels. Each possible morph/morphemic-label pair has an associated distance. For each segmented word, the algorithm searches for the alignment that minimizes the total alignment distance for the word. The distance d(M,L) for a pair of morph M and label L is given by</Paragraph> <Paragraph position="10"> d(M,L) = -\log \frac{c_{M,L}}{c_M} </Paragraph> <Paragraph position="11"> where c_{M,L} is the number of word tokens in which the morph M has been aligned with the label L, and c_M is the number of word tokens that contain the morph M in their segmentation. The distance measure can be thought of as the negative logarithm of a conditional probability P(L|M). This indicates the probability that a morph M is a realisation of a morpheme represented by the label L. Put another way, if the unsupervised segmentation algorithm discovers morphs that are allomorphs of real morphemes, a particular allomorph will ideally always be aligned with the same (correct) morphemic label, which leads to a high probability P(L|M) and a short distance d(M,L)4. In contrast, if the segmentation algorithm does not discover meaningful morphs, each of the segments will be aligned with a number of different morphemic labels throughout the corpus, and as a consequence, the probabilities will be low and the distances high.</Paragraph> <Paragraph position="12"> We then utilize the EM algorithm for iteratively improving the alignment. The initial alignment, which is used for computing the initial distance values, is obtained through a string matching procedure: string matching is efficient for aligning the stem of the word with the base form (e.g., the morph puu with the label PUU, and the morphs t + alo with the label TALO). The suffix morphs that do not match well with the base form labels will end up aligned somehow with the morphological tags (e.g., the morph ja with the labels PL + PTV).</Paragraph> <Paragraph position="13"> 4This holds especially for allomorphs of 'stem morphemes', e.g., it is possible to identify the English morpheme easy with a probability of one from both its allomorphs: easy and easi. However, suffixes, in particular, can have several meanings, e.g., the English suffix s can mean either the plural of nouns or the third person singular of the present tense of verbs.</Paragraph> <Paragraph position="14"> Comparison of methods. In order to compare two segmentation algorithms, the segmentation of each is aligned with the linguistic morpheme labels, and the total distance of the alignment is computed (see the sketch below). A shorter total distance indicates a better segmentation.</Paragraph>
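<Paragraph position="15"> The distance estimation and the alignment scoring can be sketched as follows. This is a minimal Python sketch: the paper does not spell out the exact dynamic-programming transitions, so the monotone alignment below (allowing one-to-many and many-to-one mappings) and the maximum-distance value for unseen pairs are our own illustrative choices.

import math

MAX_DIST = 20.0  # assumed penalty for morph/label pairs unseen in training

def distances(pair_counts, morph_counts):
    # d(M, L) = -log(c_{M,L} / c_M), estimated on the training set
    return {(m, l): -math.log(c / morph_counts[m])
            for (m, l), c in pair_counts.items()}

def alignment_distance(morphs, labels, d):
    # Minimum total distance of a monotone alignment of morphs to labels.
    INF = float("inf")
    n, k = len(morphs), len(labels)
    best = [[INF] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            step = d.get((morphs[i - 1], labels[j - 1]), MAX_DIST)
            best[i][j] = step + min(best[i - 1][j - 1],  # one-to-one
                                    best[i - 1][j],      # many morphs, one label
                                    best[i][j - 1])      # one morph, many labels
    return best[n][k]

# e.g. alignment_distance(["puu", "t", "alo", "ja"], ["PUU", "TALO", "PL", "PTV"], d)
</Paragraph>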
<Paragraph position="16"> However, one should note that the distance measure used favors long morphs. If a particular "segmentation" algorithm does not split a single word of the corpus, the total distance can be zero. In such a situation, the single morph that a word is composed of is aligned with all morphemic labels of the word. The morph M, i.e., the word, is unique, which means that all probabilities P(L|M) are equal to one: e.g., the morph puutaloja is always aligned with the labels PUU + TALO + PL + PTV and no other labels, which yields the probabilities P(PUU | puutaloja) = P(TALO | puutaloja) = P(PL | puutaloja) = P(PTV | puutaloja) = 1, and thus a total distance of zero.</Paragraph> <Paragraph position="17"> Therefore, part of the corpus should be used as training data, and the rest as test data. Both data sets are segmented using the unsupervised segmentation algorithms. The training set is then used for estimating the distance values d(M,L). These values are used when the test set is aligned. The better segmentation algorithm is the one that yields a lower alignment distance on the test set.</Paragraph> <Paragraph position="18"> For morph/label pairs that were never observed in the training set, a maximum distance value is assigned. A good segmentation algorithm will find segments that are good building blocks of entirely new word forms, and thus the maximum distance values will occur only rarely.</Paragraph> </Section> </Section> </Paper>