<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1084"> <Title>An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation</Title> <Section position="5" start_page="666" end_page="668" type="metho"> <SectionTitle> 3 Morpheme-Based Model for Hebrew </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="666" end_page="667" type="sub_section"> <SectionTitle> 3.1 Morpheme-Based HMM </SectionTitle> <Paragraph position="0"> The lexical items of word-based models are the words of the language. The implication of this decision is that both the lexical and the syntagmatic relations of the model are based on a word-oriented tagset. With such a tagset, it must be possible to tag any word of the language with at least one tag.</Paragraph> <Paragraph position="1"> Let us consider, for instance, the Hebrew phrase bclm hn'im, which contains two words. The word bclm has several possible morpheme segmentations and analyses, as described in Table 1. In a word-based HMM, we consider such a phrase to be generated by a Markov process, based on a word-oriented tagset of N = 1934 tags/states and a vocabulary of M distinct word forms.</Paragraph> <Paragraph position="3"> Line W of Table 2 describes the size of a first-order word-based HMM, built over our corpus. In this model, we found 834 entries for the P vector (which models the distribution of tags in first position in sentences) out of possibly N = 1934, about 250K entries for the A matrix (which models the transition probabilities from tag to tag) out of possibly N^2 ≈ 3.7M, and about 300K entries for the B matrix (which models the emission probabilities from tag to word) out of possibly M × N ≈ 350M. For a second-order HMM, the size of the A2 matrix (which models the transition probabilities from two tags to a third) grows to about 7M entries, while the size of the B2 matrix (which models the emission probabilities from two tags to a word) is about 5M.</Paragraph> <Paragraph position="4"> Despite the sparseness of these matrices, the number of their entries is still high, since we model the whole set of features of the complex word forms.</Paragraph> <Paragraph position="5"> Let us assume that the correct segmentation of the sentence is provided to us - for example, b clm hn'im - as is the case for English text. In this setting, the observation is composed of morphemes, generated by a Markov process based on a morpheme-based tagset. The size of such a tagset for Hebrew is about 200, and the sizes of the P, A, B, A2, and B2 matrices are reduced to 145, 16K, 140K, 700K, and 1.7M respectively, as described in line M of Table 2 - a reduction of 90% compared with the size of the word-based model.</Paragraph> <Paragraph position="6"> The problem with this approach is that &quot;someone&quot; along the way agglutinates the morphemes of each word, leaving the observed morphemes uncertain.</Paragraph> <Paragraph position="7"> For example, the word bclm can be segmented in four different ways in Table 1, as indicated by the placement of the '-' in the Segmentation column, while the word hn'im can be segmented in two different ways. In the next section, we adapt the parameter estimation and searching algorithms to such uncertain output observations.</Paragraph>
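<Paragraph> Before moving on, the parameter-space counts above are easy to reproduce. The following minimal sketch (ours, not from the paper) computes the possible entry counts of a first-order HMM; the word-form count of roughly 180K is an assumption back-derived from M × N ≈ 350M, and the morpheme lexicon size is only a placeholder:
def possible_hmm_sizes(n_tags, n_symbols):
    """Upper bounds on the entry counts of a first-order HMM (P, A, B)."""
    return {
        "P (initial tag)": n_tags,
        "A (tag to tag)": n_tags ** 2,
        "B (tag to symbol)": n_tags * n_symbols,
    }

# Word-based model: N = 1934 tags over roughly 180K word forms (assumed).
print(possible_hmm_sizes(1934, 180_000))   # A is about 3.7M, B about 350M, as quoted above
# Morpheme-based model: about 200 tags; the morpheme lexicon size below is a placeholder.
print(possible_hmm_sizes(200, 25_000))
The observed matrices are, of course, far sparser than these upper bounds, as the counts in Table 2 show. </Paragraph>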
</Section> <Section position="2" start_page="667" end_page="668" type="sub_section"> <SectionTitle> 3.2 Learning and Searching Algorithms for Uncertain Output Observation </SectionTitle> <Paragraph position="0"> In contrast to a standard HMM, the output observations of the above morpheme-based HMM are ambiguous. We adapted the Baum-Welch (Baum, 1972) and Viterbi (Manning and Schutze, 1999, 9.3.2) algorithms to such uncertain observations. We first formalize the output representation and then describe the algorithms.</Paragraph> <Paragraph position="1"> Output Representation The learning and searching algorithms of an HMM are based on the output sequence of the underlying Markov process. For a morpheme-based model, the output sequence is uncertain - we don't see the emitted morphemes but the words they form. If, for instance, the Markov process emitted the morphemes b clm h n'im, we would see two words (bclm hn'im) instead. In order to handle the output ambiguity, we use static knowledge of how morphemes are combined into a word, such as the four known combinations of the word bclm, the two possible combinations of the word hn'im, and their possible tags within the original words. Based on this information, we encode the sentence into a structure that represents all the possible &quot;readings&quot; of the sentence, according to the possible morpheme combinations of the words and their possible tags.</Paragraph> <Paragraph position="2"> The representation consists of a set of vectors, each vector containing the possible morphemes and their tags for one specific &quot;time&quot; (sequential position within the morpheme expansion of the words of the sentence). A morpheme is represented by a tuple (symbol, state, prev, next), where symbol denotes a morpheme, state is one possible tag for this morpheme, and prev and next are sets of indexes of the morphemes (in the previous and the next vectors) that precede and follow the current morpheme in the overall lattice representing the sentence. Fig. 2 describes the representation of the sentence bclm hn'im. An emission is denoted in this figure by its symbol, its state index, directed edges from its previous emissions, and directed edges to its next emissions.</Paragraph> <Paragraph position="3"> In order to meet the condition of the Baum-Eagon inequality (Baum, 1972) that the polynomial P(O|u) - which represents the probability of an observed sequence O given a model u - be homogeneous, we must add a sequence of special EOS (end of sentence) symbols at the end of each path, up to the last vector, so that all paths reach the same length.</Paragraph> <Paragraph position="4"> The above text representation can also be used to model multi-word expressions (MWEs). Consider the Hebrew sentence hw' 'wrk dyn gdwl, which can be interpreted as composed of 3 units (he lawyer great / he is a great lawyer) or of 4 units (he edits law big / he is editing an important legal decision). In order to select the correct interpretation, we must determine whether 'wrk dyn is an MWE.</Paragraph> <Paragraph position="5"> This is another case of uncertain output observation, which can be represented by our text encoding, as done in Fig. 1.</Paragraph>
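<Paragraph> As an illustration, the following Python sketch (ours, not part of the paper; the tag ids are arbitrary placeholders, and the two readings shown - the fully segmented b clm h n'im mentioned above and, for illustration, a reading that keeps both words unsegmented - are only two of the possible combinations) shows one way such a lattice can be encoded, including the EOS padding that equalizes path lengths:
from dataclasses import dataclass, field

@dataclass
class Emission:
    symbol: str                               # morpheme, or the special EOS symbol
    state: int                                # one candidate tag (state) for this morpheme
    prev: set = field(default_factory=set)    # indexes into the previous vector
    next: set = field(default_factory=set)    # indexes into the next vector

def link(vectors, t, i, j):
    """Connect emission i of vector t to emission j of vector t + 1."""
    vectors[t][i].next.add(j)
    vectors[t + 1][j].prev.add(i)

# One vector per sequential position of the morpheme expansion; tag ids are made up.
vectors = [
    [Emission("b", 3), Emission("bclm", 7)],
    [Emission("clm", 12), Emission("hn'im", 9)],
    [Emission("h", 5), Emission("EOS", 0)],
    [Emission("n'im", 14), Emission("EOS", 0)],
]
# Reading A: b / clm / h / n'im
link(vectors, 0, 0, 0)
link(vectors, 1, 0, 0)
link(vectors, 2, 0, 0)
# Reading B: bclm / hn'im, padded with EOS so that both paths reach the last vector
link(vectors, 0, 1, 1)
link(vectors, 1, 1, 1)
link(vectors, 2, 1, 1)
Every path from the first vector to the last is one candidate reading of the sentence, and the EOS padding keeps all paths at the same length, as required above. </Paragraph>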
<Paragraph position="6"> [Fig. 1, not reproduced here: the lattice for hw' 'wrk dyn gdwl, with one path emitting 'wrk dyn as a single MWE unit followed by gdwl and EOS padding, and another path emitting 'wrk, dyn, and gdwl as separate units.] This representation seems expensive in terms of the number of emissions per sentence.</Paragraph> <Paragraph position="7"> However, we observe in our data that most of the words have only one or two possible segmentations, and most of the segmentations consist of at most one affix. In practice, we found the average number of emissions per sentence in our corpus (where each symbol is counted as the number of its predecessor emissions) to be 455, while the average number of words per sentence is about 18. That is, operating over an ambiguous sentence representation increases the size of the sentence (from 18 to 455), but, on the other hand, it reduces the probabilistic model by a factor of 10 (as discussed above).</Paragraph> <Paragraph position="8"> Morphological disambiguation over such a sequence of vectors of uncertain morphemes is similar to word extraction in automatic speech recognition (ASR) (Jurafsky and Martin, 2000, chp. 5, 7). The states of the ASR model are phones, and each observation is a vector of spectral features.</Paragraph> <Paragraph position="9"> Given a sequence of observations for a sentence, the decoder - based on the lattice formed by the phone distributions of the observations and on the language model - searches for the sequence of words, made of phones, that maximizes the acoustic likelihood and the language model probability. In a similar manner, the supervised training of a speech recognizer combines a training corpus of speech wave files, together with word transcriptions and language model probabilities, in order to learn the phone models.</Paragraph> <Paragraph position="10"> There are two main differences between the typical ASR model and ours: (1) An ASR decoder deals with one aspect - segmentation of the observations into a sequence of words - and this segmentation can be modeled at several levels: subphones, phones, and words. These levels can be trained individually (such as training a language model on a written corpus, and training the phone models for each word type on transcribed wave files), and then combined (in a hierarchical model).</Paragraph> <Paragraph position="11"> Morphological disambiguation over uncertain morphemes, on the other hand, deals with both morpheme segmentation and the tagging of each morpheme with its morphological features. Modeling morpheme segmentation within a given word, without its morphological features, would be insufficient. (2) The supervised resources of ASR are not available for morphological disambiguation: we have neither a model of morphological feature sequences (the equivalent of the ASR language model) nor a tagged corpus (the equivalent of the transcribed wave files).</Paragraph> <Paragraph position="12"> These two differences require a design that combines the two dimensions of the problem, in order to support unsupervised learning (and searching) of morpheme sequences and their morphological features simultaneously.</Paragraph>
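<Paragraph> To make the emission count used in the statistics above concrete, a small helper (ours; it operates over the Emission lattice sketched earlier) computes it exactly as defined - each symbol is counted as the size of its prev set - which is also the quantity that bounds the running time of the algorithms below:
def emission_count(vectors):
    """Total number of symbols in the lattice encoding, each counted as the size of
    its prev set; emissions of the first vector have an empty prev set and are
    charged to the initialization step of the algorithms instead."""
    return sum(len(e.prev) for vector in vectors for e in vector)

For the toy bclm hn'im lattice above this yields 6; the corpus average reported here is 455 such emissions for sentences of about 18 words. </Paragraph>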
<Paragraph position="13"> Parameter Estimation We present a variation of the Baum-Welch algorithm (Baum, 1972) which operates over the lattice representation we have defined above. The algorithm starts with a probabilistic model u (which can be chosen randomly or obtained from good initial conditions), and at each iteration a new model ū is derived that better explains the given output observations. For a given sentence, we define T as the number of words in the sentence, and T̄ as the number of vectors of the output representation O = {o_t}, 1 ≤ t ≤ T̄, where each item in the output is denoted by o_t^l = (sym, state, prev, next), 1 ≤ t ≤ T̄, 1 ≤ l ≤ |o_t|. We define α(t,l) as the probability of reaching o_t^l at time t, and β(t,l) as the probability of ending the sequence from o_t^l. Fig. 3 describes the expectation and maximization steps of the learning algorithm for a first-order HMM. The algorithm works in O(Ṫ) time, where Ṫ is the total number of symbols in the output sequence encoding, each symbol counted as the size of its prev set.</Paragraph> <Paragraph position="14"> Searching for the best state sequence The search algorithm takes an observation sequence O and a probabilistic model u, and looks for the best state sequence that generates the observation.</Paragraph> <Paragraph position="15"> We define δ(t,l) as the probability of the best state sequence that leads to emission o_t^l, and ψ(t,l) as the index of the emission at time t-1 that precedes o_t^l in the best state sequence leading to it. Fig. 4 describes the adaptation of the Viterbi algorithm (Manning and Schutze, 1999, 9.3.2) to our text representation for a first-order HMM, which also works in O(Ṫ) time.</Paragraph>
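<Paragraph> To make the recurrences concrete, here is a sketch of the adapted Viterbi search over the Emission lattice defined earlier (ours, not the paper's code; representing the model u as plain dictionaries pi, A, and B is an assumption made for the example): δ(t,l) is computed only over the predecessors listed in prev, and ψ(t,l) records the best predecessor for backtracking.
def viterbi_lattice(vectors, pi, A, B):
    """Best state sequence over a lattice of Emission objects.
    pi[s]        initial probability of state s
    A[(r, s)]    transition probability from state r to state s
    B[(s, sym)]  emission probability of symbol sym from state s
    Returns the best reading as a list of (symbol, state) pairs."""
    T_bar = len(vectors)
    delta = [[0.0] * len(v) for v in vectors]
    psi = [[None] * len(v) for v in vectors]

    # Initialization over the emissions of the first vector.
    for l, e in enumerate(vectors[0]):
        delta[0][l] = pi.get(e.state, 0.0) * B.get((e.state, e.symbol), 0.0)

    # Recursion: an emission can only be reached through its prev set, so the total
    # work is proportional to the sum of the prev-set sizes (the O(Ṫ) bound above).
    for t in range(1, T_bar):
        for l, e in enumerate(vectors[t]):
            best_k, best_p = None, 0.0
            for k in e.prev:
                p = delta[t - 1][k] * A.get((vectors[t - 1][k].state, e.state), 0.0)
                if p > best_p:
                    best_k, best_p = k, p
            delta[t][l] = best_p * B.get((e.state, e.symbol), 0.0)
            psi[t][l] = best_k

    # Termination and backtracking from the best emission of the last vector
    # (assumes at least one full path has nonzero probability).
    l = max(range(len(vectors[-1])), key=lambda i: delta[-1][i])
    path = []
    for t in range(T_bar - 1, -1, -1):
        e = vectors[t][l]
        path.append((e.symbol, e.state))
        l = psi[t][l]
    path.reverse()
    return path

The expectation step described above can follow the same traversal, with α(t,l) obtained by summing over the prev set instead of maximizing, and β(t,l) by the symmetric backward pass over the next set. </Paragraph> </Section> </Section> </Paper>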