<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2016"> <Title>Markov model</Title> <Section position="4" start_page="120" end_page="122" type="metho"> <SectionTitle> 2 Hierarchical Hidden Markov Model </SectionTitle> <Paragraph position="0"> A HHMM is a structured multi-level stochastic process, and can be visualised as a tree-structured HMM (see Figure 1(b)). There are two types of states: * Production state: a leaf node of the tree structure, which contains only observations (represented in Figure 1(b) as an empty circle).</Paragraph> <Paragraph position="1"> * Internal state: contains several production states or other internal states (represented in Figure 1(b) as a circle with a cross inside). The output of a HHMM is generated by a process of traversing some sequence of states within the model. At each internal state, the automaton traverses down the tree, possibly through further internal states, until it encounters a production state where an observation is contained. Thus, as it continues through the tree, the process generates a sequence of observations. The process ends when a final state is entered. The difference between a standard HMM and a hierarchical HMM is that an individual state in the hierarchical model can expand into a sequence of production states, whereas each state in the standard model is a production state that contains a single observation.</Paragraph> <Section position="1" start_page="120" end_page="121" type="sub_section"> <SectionTitle> 2.1 Merging </SectionTitle> <Paragraph position="0"> Figure 1 shows an example of reconstructing a HMM as a HHMM. Figure 1(a) shows a HMM with 11 states. The two dashed boxes (A) indicate regions of the model that have a repeated structure. These regions are furthermore independent of the other states in the model. Figure 1(b) models the same structure as a hierarchical HMM, where each repeated structure is now grouped under an internal state. This HHMM uses a two-level hierarchical structure to expose more information about the transitions and probabilities within the internal states. These states, as discussed earlier, produce no observations of their own. Instead, that is left to the child production states within them. Figure 1(b) shows that each internal state contains four production states.</Paragraph> <Paragraph position="1"> In some cases, different internal states of a HHMM correspond to exactly the same structure in the output sequence. This is modelled by making them share the same sub-models. Using a HHMM allows for the merging of repeated parts of the structure, which results in fewer states needing to be identified, one of the three fundamental problems of HMM construction (Rabiner and Juang, 1986).</Paragraph> </Section> <Section position="2" start_page="121" end_page="122" type="sub_section"> <SectionTitle> 2.2 Sub-model Calculation </SectionTitle> <Paragraph position="0"> Estimating the parameters for multi-level HHMMs is a complicated process. This section describes a probability estimation method for internal states, which transforms each internal state into three production states. Each internal state Si in the HHMM is transformed by resolving each child production state Si,j into one of the three transformed states Si = {s(i)_in, s(i)_stay, s(i)_out}. The transformation requires re-calculating the new observation and transition probabilities for each of these transformed states; the state types involved are sketched below.</Paragraph>
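A minimal sketch of the two state types and of the three-way transformation container, assuming a plain tree-of-states representation; the class names (ProductionState, InternalState, TransformedState) are illustrative and not from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass
class ProductionState:
    """Leaf node: contains only observations (the empty circles in Figure 1(b))."""
    obs_probs: Dict[str, float]          # observation symbol -> emission probability

@dataclass
class InternalState:
    """Non-leaf node: emits nothing itself and delegates to its children.
    pi[j] is the probability of entering child j, A[j1][j2] the transition
    probability between children, and tau[j] the probability of leaving the
    internal state from child j (corresponding to the pi_j, A and tau_j
    notation used in Section 2.2)."""
    children: List[Union["InternalState", ProductionState]]
    pi: List[float]
    A: List[List[float]]
    tau: List[float]

@dataclass
class TransformedState:
    """An internal state S_i resolved into the three production states
    {s_in, s_stay, s_out}, whose probabilities are re-estimated in steps I-III."""
    s_in: ProductionState
    s_stay: ProductionState
    s_out: ProductionState
```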
<Paragraph position="1"> Figure 2 shows the internal state S2 transformed into s(2)_in, s(2)_stay and s(2)_out.</Paragraph> <Paragraph position="2"> The procedure to transform internal states is: I) calculate the transformed observation probabilities (Ō) for each internal state; II) apply the forward algorithm to estimate the state probabilities (b̄) for the three transformed states; III) reform the transition matrix by including estimated values for the additional transformed internal states (Ā).</Paragraph> <Paragraph position="3"> I. Calculate the observation probabilities Ō: the observation probability of each internal state Si is re-calculated by summing the observation probabilities of its production states Sj (equation (1)), where time t corresponds to a position in the sequence, O is an observation sequence over t, O_j,t is the observation probability for state Sj at time t, and Ni is the number of production states of internal state Si.</Paragraph> <Paragraph position="4"> II. Apply the forward algorithm to estimate the transformed observation values b̄: the transformed observation values are simplified to {b̄(i)_in,t, b̄(i)_stay,t, b̄(i)_out,t}, which are then given as the observation values for the three production states (s(i)_in, s(i)_stay, s(i)_out). The observation probability of entering state Si at time t, i.e. of production state s(i)_in, is estimated from the transition probabilities pi_j of entering each child state Sj (equation (2)). The second probability, of staying in state Si at time t, i.e. of production state s(i)_stay, is estimated from the transitions A_ĵ′,j, where ĵ′ is the state corresponding to ĵ calculated at the previous time t-1 and A_ĵ′,j is the transition probability from state S_ĵ′ to state Sj (equation (3)). The third probability, of exiting state Si at time t, i.e. of production state s(i)_out, is estimated from the probabilities tau_j of leaving each state Sj (equation (4)).</Paragraph> <Paragraph position="5"> III. Reform the transition probabilities Ā(i): each internal state Si forms a new 3 × 3 transition probability matrix Ā, which records the transition status for the transformed state (equations (5)-(8)). Here Ni is the number of child states of state Si; Ā(i)_in,stay is estimated by summing all entry state probabilities for state Si; Ā(i)_in,out is estimated from the observation that 50% of sequences transit from state s(i)_in directly to state s(i)_out; Ā(i)_stay,stay is the sum of all the internal transition probabilities within state Si; and Ā(i)_stay,out is the sum of all exit state probabilities. The remaining entries of the transition matrix Ā are set to zero to prevent illegal transitions.</Paragraph> <Paragraph position="6"> The internal states are processed by a bottom-up algorithm using the values from equations (1)-(8), where lower levels of the hierarchy tree are calculated first to provide information for upper-level states. Once all the internal states have been calculated, the system need only use the top level of the hierarchy tree to estimate the probability of sequences.</Paragraph>
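The sketch below gives one plausible reading of steps I-III in Python; it is an illustrative reconstruction rather than the paper's code, since the equations themselves are not reproduced in this text. It assumes the summation forms suggested by the surrounding prose: Ō sums the child observation probabilities, b̄_in weights them by the entry probabilities pi_j, b̄_stay by the transitions out of the previously selected child, and b̄_out by the exit probabilities tau_j. The function name transform_internal_state is my own.

```python
import numpy as np

def transform_internal_state(obs, pi, A, tau):
    """Resolve one internal state S_i into the three production states
    {s_in, s_stay, s_out} (steps I-III of Section 2.2, assumed forms).

    obs : (N_i, T) array, obs[j, t] = O_{j,t}, observation probability of
          child production state S_j at sequence position t
    pi  : (N_i,) entry probabilities into each child state
    A   : (N_i, N_i) transition probabilities between child states
    tau : (N_i,) probabilities of leaving the internal state from each child
    """
    n, T = obs.shape

    # Step I: transformed observation probability of the internal state,
    # summing over its production states (equation (1), assumed form).
    O_bar = obs.sum(axis=0)

    # Step II: forward-style estimates for the three transformed states
    # (equations (2)-(4), assumed forms).
    b_in = np.zeros(T)
    b_stay = np.zeros(T)
    b_out = np.zeros(T)
    j_prev = int(np.argmax(pi * obs[:, 0]))           # child selected at t-1
    for t in range(T):
        b_in[t] = np.sum(pi * obs[:, t])               # entering S_i at t
        b_stay[t] = np.sum(A[j_prev, :] * obs[:, t])   # staying inside S_i
        b_out[t] = np.sum(tau * obs[:, t])             # leaving S_i at t
        j_prev = int(np.argmax(A[j_prev, :] * obs[:, t]))

    # Step III: reformed 3x3 matrix over (in, stay, out); unlisted cells
    # stay zero so that illegal transitions are forbidden.
    A_bar = np.zeros((3, 3))
    A_bar[0, 1] = pi.sum()        # in -> stay: total entry probability
    A_bar[0, 2] = 0.5             # in -> out: the fixed 50% estimate
    A_bar[1, 1] = A.sum()         # stay -> stay: internal transitions
    A_bar[1, 2] = tau.sum()       # stay -> out: total exit probability

    return O_bar, (b_in, b_stay, b_out), A_bar
```

In a bottom-up pass, the deepest internal states are transformed first so that their estimates can serve as the child observation probabilities of their parents.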
<Paragraph position="7"> With all internal states resolved in this way, the model becomes a linear HMM for the final Viterbi search process (Viterbi, 1967).</Paragraph> </Section> </Section> <Section position="5" start_page="122" end_page="123" type="metho"> <SectionTitle> 3 Partial flattening </SectionTitle> <Paragraph position="0"> Partial flattening is a process for reducing the depth of hierarchical structure trees. The process involves moving sub-trees from one node to another. This section presents an automatic partial flattening process that makes use of the term extractor method of Pantel and Lin (2001).</Paragraph> <Paragraph position="1"> The method discovers ways of more tightly coupling observation sequences within sub-models, thus eliminating rules within the HHMM. This results in a more accurate model. The process involves calculating dependency values to measure the dependency between the elements in the state sequence (or observation sequence).</Paragraph> <Paragraph position="2"> The method uses mutual information and log-likelihood, which Dunning (1993) used to calculate the dependency value between words. Word pairs with a higher dependency value are more likely to be treated as a term. The process involves collecting bigram frequencies from a large dataset and identifying possible two-word candidates as terms. The first measurement used is the mutual information of x and y, computed from the bigram frequency C(x,y), where x and y are words adjacent to each other in the training corpus, and from the marginal counts C(x,∗) and C(∗,y), where ∗ is a wildcard ranging over all the words in the entire training corpus. The second measurement, the log-likelihood ratio of x and y, is computed following Dunning (1993). The system computes dependency values between states (tree nodes) or observations (tree leaves) of the tree in the same way. The mutual information and log-likelihood values are highest when the words are adjacent to each other throughout the entire corpus. By using these two values together, the method is more robust against low-frequency events.</Paragraph> <Paragraph position="3"> Figure 3 is a tree representation of the HHMM; it illustrates the flattening process for the sentence: (S (N[?] A_AT1 graphical_JJ zoo_NN1 (P[?] of_IO (N ( strange_JJ and_CC peculiar_JJ ) attractors_NN2 )))), where only the part-of-speech tags and grammar information are considered. The left-hand side of the figure shows the original structure of the sentence, and the right-hand side shows the transformed structure. The model's hierarchy is reduced by one level, where the state P[?] has become a sub-state of state S instead of N[?]. The process is likely to be useful when state P[?] is highly dependent on state N[?].</Paragraph> <Paragraph position="4"> The flattening process can be applied to the model based on two types of sequence dependency: observation dependency and state dependency (a sketch of how these dependency values can be computed follows this list). * Observation dependency: the observation dependency value is based upon the observation sequence, which in Figure 3 would be the sequence of part-of-speech tags {AT1 JJ NN1 IO JJ CC JJ NN2}. Given that observations NN1 and IO form a term with a high dependency value, the model re-constructs the sub-tree rooted at IO's parent state P[?], moving it to the same level as state N[?], so that states P[?] and N[?] now share the same parent, state S. * State dependency: the state dependency value is based upon the state sequence, which in Figure 3 would be {N[?], P[?], N}. The flattening process occurs when the current state has a high dependency value with the previous state, say N[?] and P[?].</Paragraph>
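Since the paper's exact formulas are not reproduced in this text, the sketch below uses standard forms: pointwise mutual information over relative bigram frequencies and Dunning's (1993) log-likelihood ratio over the usual contingency counts. The function names dependency_scores and top_terms are illustrative, not from the paper.

```python
import math
from collections import Counter

def dependency_scores(tokens):
    """Score adjacent pairs (x, y) in a token sequence (POS tags or states)
    by pointwise mutual information and Dunning's log-likelihood ratio."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    left = Counter(x for x, _ in bigrams.elements())    # C(x, *)
    right = Counter(y for _, y in bigrams.elements())   # C(*, y)
    n = sum(bigrams.values())

    def ll(k, total, p):
        # k*log(p) + (total-k)*log(1-p), with 0*log(0) treated as 0
        eps = 1e-12
        return k * math.log(max(p, eps)) + (total - k) * math.log(max(1 - p, eps))

    scores = {}
    for (x, y), k11 in bigrams.items():
        c_x, c_y = left[x], right[y]
        if c_x == n:                 # degenerate corpus; skip to avoid division by zero
            continue
        # Pointwise mutual information over relative frequencies (assumed form).
        mi = math.log((k11 / n) / ((c_x / n) * (c_y / n)))
        # Dunning (1993) log-likelihood ratio.
        k21 = c_y - k11
        p, p1, p2 = c_y / n, k11 / c_x, k21 / (n - c_x)
        llr = 2 * (ll(k11, c_x, p1) + ll(k21, n - c_x, p2)
                   - ll(k11, c_x, p) - ll(k21, n - c_x, p))
        scores[(x, y)] = (mi, llr)
    return scores

def top_terms(tokens, n_top):
    """Rank adjacent pairs by log-likelihood ratio and keep the top n."""
    scores = dependency_scores(tokens)
    return sorted(scores, key=lambda pair: scores[pair][1], reverse=True)[:n_top]
```

Ranking by either score and keeping the top n pairs then yields the high-dependency terms that trigger flattening.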
<Paragraph position="5"> This paper determines the high dependency values by selecting the top n values from a list of all possible terms, ranked by either observation or state dependency values, where n is a parameter that can be configured by the user for better performance. Table 1 shows the observation dependency values of terms for the part-of-speech tags of Figure 3. The term NN1 IO has a higher dependency value than JJ NN1; therefore state P[?] is joined as a sub-tree of state S. States P[?] and N remain unchanged, since state P[?] has already been moved up a level of the tree. After the flattening process, state P[?] is no longer a child state of state N[?], and is instead joined as a sub-tree of state S, as shown in Figure 3.</Paragraph> </Section> <Section position="6" start_page="123" end_page="124" type="metho"> <SectionTitle> 4 Application </SectionTitle> <Section position="1" start_page="123" end_page="123" type="sub_section"> <SectionTitle> 4.1 Text Chunking </SectionTitle> <Paragraph position="0"> Text chunking involves producing non-overlapping segments of low-level noun groups. The system uses the clause information to construct the hierarchical structure of text chunks, where the clauses represent the phrases within the sentence. The clauses can be embedded in other clauses but cannot overlap one another. Furthermore, each clause contains one or more text chunks.</Paragraph> <Paragraph position="1"> Consider a sentence from the CoNLL-2004 corpus, annotated so that the part-of-speech tag associated with each word is attached with an underscore, the clause information is identified by the S symbol, and the chunk information is identified by the remaining symbols NP (noun phrase), VP (verb phrase), PP (prepositional phrase) and O (null complementizer). The brackets are in Penn Treebank II style. The sentence can be re-expressed just as its part-of-speech tags: {PRP VBZ DT JJ NN NN MD VB TO RB # CD D IN NNP}, where only the part-of-speech tags and grammar information are considered for the extraction tasks. This is done so the system can minimise the computation cost inherent in learning a large number of unrequired observation symbols. Such an approach also maximises the efficiency of the training data by learning the patterns hidden within words (syntax) rather than the words themselves (semantics). Figure 4 shows a tree representation of an HHMM for the text chunking task. This example involves a hierarchy with a depth of three. Note that state NP appears in two different levels of the hierarchy.</Paragraph>
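As a concrete illustration of such a three-level hierarchy, the nested structure below uses a made-up clause/chunk skeleton (not the corpus sentence, which is not reproduced in this text); leaves are part-of-speech observations and, as in the figure, an NP state can appear at more than one level.

```python
# Hypothetical three-level HHMM skeleton for text chunking (illustrative only):
# the root clause S contains chunk states, an embedded clause S contains further
# chunks, and the production-level observations are part-of-speech tags.
chunk_hierarchy = (
    "S", [
        ("NP", ["PRP"]),                     # noun-phrase chunk at level two
        ("VP", ["VBZ"]),
        ("S", [                              # embedded clause
            ("NP", ["DT", "JJ", "NN"]),      # NP appears again, one level deeper
            ("PP", ["IN", "NNP"]),
        ]),
    ],
)
```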
<Paragraph position="2"> In order to build an HHMM, the sentence shown above is restructured into such a nested form, keeping only its clause and chunk labels over the part-of-speech tags, so that the model makes no use of the word information contained in the sentence.</Paragraph> </Section> <Section position="2" start_page="124" end_page="124" type="sub_section"> <SectionTitle> 4.2 Grammar Parsing </SectionTitle> <Paragraph position="0"> Creation of a parse tree involves describing the grammar of a language in a tree representation, where each path of the tree represents a grammar rule.</Paragraph> <Paragraph position="1"> Consider a sentence from the Lancaster Treebank: (S (N A_AT1 graphical_JJ zoo_NN1 (P of_IO (N ( strange_JJ and_CC peculiar_JJ ) attractors_NN2 )))), where the part-of-speech tag associated with each word is attached with an underscore, and the syntactic tag for each phrase occurs immediately after the opening bracket. In order to build the models from the parse tree, the system takes the part-of-speech tags as the observation sequences, and learns the structure of the model using the information expressed by the syntactic tags. During construction, phrases such as the noun phrase &quot;( strange_JJ and_CC peculiar_JJ )&quot; are grouped under a dummy state (N d). Figure 5 illustrates the model in tree representation, with the structure of the model based on the previous sentence from the Lancaster Treebank.</Paragraph>
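A hedged sketch of this construction step (my own illustration, not the authors' code): it parses the bracketed Lancaster-style string, keeps the part-of-speech tags as observations, and wraps any bracketed group that lacks a syntactic tag under a dummy state, written here as "N_d" for the paper's "(N d)".

```python
import re

def build_tree(bracketed, dummy="N_d"):
    """Parse a Lancaster-style bracketing into (label, children); leaves are
    POS tags. Bracketed groups without a syntactic tag (e.g. the coordinated
    adjectives) are grouped under a dummy state label."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)

    def parse(pos):                        # pos points just past an opening "("
        if tokens[pos] not in ("(", ")") and "_" not in tokens[pos]:
            label, pos = tokens[pos], pos + 1          # explicit syntactic tag
        else:
            label = dummy                              # untagged group -> dummy state
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = parse(pos + 1)
                children.append(child)
            else:
                children.append(tokens[pos].rsplit("_", 1)[-1])  # POS observation
                pos += 1
        return (label, children), pos + 1

    tree, _ = parse(1)                     # skip the outermost "("
    return tree

sent = "(S (N A_AT1 graphical_JJ zoo_NN1 (P of_IO (N ( strange_JJ and_CC peculiar_JJ ) attractors_NN2 ))))"
tree = build_tree(sent)
# ('S', [('N', ['AT1', 'JJ', 'NN1',
#               ('P', ['IO', ('N', [('N_d', ['JJ', 'CC', 'JJ']), 'NN2'])])])])
```

</Section> </Section> </Paper>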