<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1046"> <Title>A Trellis-Based Algorithm For Estimating The Parameters Of Hidden Stochastic Context-Free Grammar</Title> <Section position="3" start_page="0" end_page="242" type="metho"> <SectionTitle> REPRESENTATION </SectionTitle> <Paragraph position="0"> The representation of a grammar and the basic trellis structure are discussed in this section. The starting point is the conventional HMM network in which symbols are generated at states (rather than on transitions) as described in Levinson et al. (1983). Such a network is represented by the parameter set (A, B, I) comprising the transition, output and initial matrices. The states in this kind of network will be referred to as terminal states from now on, and will be represented pictorially with single circles. As a shorthand convenience in what follows, if the circle contains a symbol, then it is assumed that only that symbol is ever generated by the state. (The probability of generating it is then unity, and zero for all other symbols.) A single symbol is generated by a transition to a terminal state. For the grammars considered here, terminal states correspond to lexical categories.</Paragraph> <Paragraph position="1"> To this parameter set we will add four other parameters (N, F, Top, L). The boolean Top indicates whether the network is to be considered as the top-level network. Only one network may be assigned as the top-level network, and it is analogous to the root symbol of a grammar. The parameter F is the set of final states, specifying the allowable states in which a network can be considered to have accepted a sequence of observations. A different type of state will now be introduced, called a nonterminal state. It represents a reference to another network and is indicated diagrammatically with two concentric circles. 
When a transition is made to a nonterminal state, the state does not generate any observations per se, but terminal nodes within the referenced network do. A nonterminal state may be associated with a sequence of observation symbols, corresponding to the sequence accepted by the underlying network. The parameter N is a matrix which indicates whether a state is a terminal or nonterminal state. Terminal states have a null entry in the matrix, and nonterminal states have a reference to the network which they represent. A grammar is usually composed of several networks, so each one is referred to with a unique label L.</Paragraph> <Paragraph position="2"> Figure 1 shows how rules in Chomsky normal form are represented as networks using the above scheme. The lexical form of the rules is included, illustrating how the left-hand side of a rule corresponds to a network label, and the network structure is associated with the right-hand side. Terminal states are labeled in lower case and nonterminals in upper case. The numbers associated with the states are their initial probabilities, which are also rule probabilities. For terminal nodes in the top-level network, initial probabilities have the same meaning as in the F/B algorithm.</Paragraph> <Paragraph position="3"> For all other networks, an initial probability corresponds to a production probability. States which have a non-zero initial probability will be termed &quot;initial states&quot; from now on. Any sequence recognized by a network must start on an initial state and end on a final state. In Figure 1, final states are designated with the annotation &quot;F&quot;. 
Figure 2 shows how the terminal symbols in Figure 1 may be represented in a more compact style, by a single state having different B matrix probabilities for the symbols x and y.</Paragraph> <Section position="1" start_page="241" end_page="242" type="sub_section"> <SectionTitle> Terminology </SectionTitle> <Paragraph position="0"> A grammar is represented as a set $\mathcal{N}$ of networks, and a component network labeled n is composed of parameters (A, B, I, N, F, Top, n). To strictly identify an element in the parameter set, each element must be a function of its associated network (e.g. A(n), I(n) etc.). In the following sections, however, where the reference is obvious this notation has been omitted to make formulae less cumbersome.</Paragraph> <Paragraph position="1"> Thus, given a network $n \in \mathcal{N}$, an element of its transition matrix A, from state i to state j, is written a(i, j). Likewise the initial probability for state i is I(i). Assuming that sentences are used as text units, an observation sequence may consist of Y + 1 words, indexed from 0 to Y: $(w_0, w_1, w_2, \ldots, w_Y)$. It is useful to define a lookup function W(y) which returns the index k of the vocabulary entry $v_k$ matching the word $w_y$ at position y in the sentence. The vocabulary entry may be a word or an equivalence class based on categories (Kupiec, 1989). An element of the output matrix B, representing the probability of seeing word $w_y$ in terminal state j, is then b(j, W(y)). In addition, three sets will be mentioned: 1. Term(n) The set of terminal states in network n. 2. Nonterm(n) The set of nonterminal states in network n.</Paragraph> <Paragraph position="2"> 3. Final(n) The set F of final states in network n. The predicate Top(n) indicates that n is a top-level network, and $\neg$Top(n) indicates that it is not. Finally, the function N(p, n) returns the network to which state p in network n refers. 
(If p is a terminal state, it returns a null value.)</Paragraph> </Section> </Section> <Section position="4" start_page="242" end_page="244" type="metho"> <SectionTitle> TRELLIS DIAGRAM </SectionTitle> <Paragraph position="0"> Figure 3 shows the form of the trellis diagrams that are used for the computation of probabilities. In the F/B algorithm a single trellis is used, whose dimensions are the number of states in the network and the length of the sentence. A single trellis spans the whole sentence. In the new algorithm each network has an associated set of trellises, for subsequences starting at different positions in a sentence and ending at subsequent ones. (Only a single trellis starting at $w_0$ is shown in Figure 3.) It can be seen that terminal state 2 has corresponding nodes in the trellis diagram, but nonterminal state 1 is represented by pairs of nodes. One node of the pair is called the start node and the other is termed the end node. Paths exist in the trellis for possible state transitions between successive words. However, it is implicitly understood that paths also exist between the start node and subsequent end nodes for each nonterminal state. These implicit paths are shown as broken lines in Figure 3 and correspond to paths that enter network A at some time, and return from it at the same or a later time. The probabilities associated with the implicit paths are assigned by reference to the trellis diagrams of the appropriate network. An implicit path from a start node at position x to an end node at position y for a nonterminal state p can be thought of as a constituent labeled p that dominates the words from positions x through to y (inclusive) in a sentence. 
A network n is deemed to include the sequence $w_x \ldots w_y$ if paths exist through the network which will generate this sequence or a longer one which includes it as a prefix.</Paragraph> <Paragraph position="1"> Thus it is not necessary to be at a final state of n at word $w_y$ to include $w_x \ldots w_y$.</Paragraph> <Paragraph position="2"> The algorithm makes use of one set of trellis diagrams to compute &quot;alpha&quot; probabilities, and another for &quot;beta&quot; probabilities. These are both split into terminal, nonterminal-start and nonterminal-end probabilities, corresponding to the three different types of nodes in the trellis diagram. For the alpha set, these are labeled $\alpha_t$, $\alpha_{nts}$ and $\alpha_{nte}$ respectively. $\alpha_t(x, y, j, n)$: The probability of generating the words $w_x \ldots w_y$ inclusive and network n includes them, and being at the node for terminal state j at position y.</Paragraph> <Paragraph position="4"> It can be seen that if x = 0 and there are no nonterminal states, the previous expressions are as in the F/B algorithm.</Paragraph> <Paragraph position="5"> $\alpha_{nts}(x, y, p, n)$: The probability of generating the words $w_x \ldots w_{y-1}$ inclusive and network n includes them, and being at the start node of nonterminal state p at position y.</Paragraph> <Paragraph position="7"> $\alpha_{nte}(x, y, p, n)$: The probability of generating the words $w_x \ldots w_y$ inclusive and network n includes them, and being at the end node of nonterminal state p at position y.</Paragraph> <Paragraph position="9"> The quantity $\alpha_{total}(x, y, n)$ refers to the probability that network n generates the words $w_x \ldots w_y$ inclusive and is in a final state of n at position y. The $\alpha_{total}$ probabilities correspond to the &quot;Inner&quot; (bottom-up) probabilities of the I/O algorithm. 
If the network topology for Chomsky normal form shown in Figure 1 is substituted in equation (6), the recursion for the inner probabilities of the I/O algorithm will be produced after further substitutions using equations (1)-(6). In the previous equations (5) and (6) it can be seen that the $\alpha_{nte}$ probabilities for a network are defined in terms of other ones. They will never be self-referential if the grammar is cycle-free (i.e. there are no derivations $A \Rightarrow A$ for any nonterminal production A). In the new algorithm cycles can be detected and self-referencing avoided. This is a similar situation to a chart parser where once a constituent with a given label, start and end position is built, no further instances of it are added.</Paragraph> <Paragraph position="10"> The alpha probabilities are all computed first. The beta probabilities can then be calculated, which, unlike in the F/B algorithm, involve the alpha probabilities because prefixes of a sentence must be accounted for as well as suffixes. The beta probabilities are described below. For convenience in later equations the functions $\beta_{above}$ and $\beta_{side}$ are first defined:</Paragraph> <Paragraph position="12"> $\beta_t(x, y, j, n)$: The probability of generating the prefix $w_0 \ldots w_{x-1}$ and suffix $w_{y+1} \ldots w_Y$ given that network n includes $w_x \ldots w_y$ and is in terminal state j at position y. The indicator function Ind() is used in subsequent equations. Its value is unity when its argument is true and zero otherwise. In addition, elements that are not explicitly referenced by the ranges in the equations are assumed to be zero.</Paragraph> <Paragraph position="14"> The previous equations reduce to the definitions for $\beta$ in the F/B algorithm when x = 0 and there are no nonterminal states in the network.</Paragraph> <Paragraph position="15"> $\beta_{nte}(x, y, p, n)$: The probability of generating the prefix $w_0 \ldots w_{x-1}$ and suffix $w_{y+1} \ldots$ 
$w_Y$ given that network n includes $w_x \ldots w_y$ and is at the end node of state p at position y.</Paragraph> <Paragraph position="16"> $\beta_{nte}(x, y, p, n) = \beta_{side}(x, y, p, n)$. It can be seen that the values for $\beta_{nte}(x, y, p, n)$ are defined in terms of those in other networks which reference n via $\beta_{above}$. As a result this computation has a top-down order. In contrast, the $\alpha_{nte}(x, y, p, n)$ probabilities involve other networks that are referred to by network n and so are assigned in a bottom-up order. If the network topology for Chomsky normal form is substituted in equation (12), the recursion for the &quot;Outer&quot; probabilities of the I/O algorithm can be derived after further substitutions. The $\beta$ probabilities for final states then correspond to the outer probabilities.</Paragraph> <Paragraph position="17"> $\beta_{nts}(x, y, p, n)$: The probability of generating the prefix $w_0 \ldots w_{x-1}$ and suffix $w_y \ldots w_Y$ given that network n includes $w_x \ldots w_{y-1}$ and is at the start node of state p at position y.</Paragraph> </Section> <Section position="5" start_page="244" end_page="244" type="metho"> <SectionTitle> RE-ESTIMATION FORMULAE </SectionTitle> <Paragraph position="0"> Once the alpha and beta probabilities are available, it is straightforward to obtain new parameter estimates (A, B, I).</Paragraph> <Paragraph position="1"> The total probability P of a sentence is found from the top-level network $n_{Top}$.</Paragraph> <Paragraph position="2"> There are four different kinds of transition: 1. Terminal node i to terminal node j.</Paragraph> <Paragraph position="3"> 2. Terminal node i to nonterminal start node p. 3. Nonterminal end node p to nonterminal start node q. 4. Nonterminal end node p to terminal node i.</Paragraph> <Paragraph position="4"> The expected total number of times a transition is made from state i to state j conditioned on the observed sentence is E($i,j). 
The following formulae give E($) for each of the above cases:</Paragraph> <Paragraph position="6"> A new estimate a(i, j) for a typical transition is then: Only B matrix elements for terminal states are used, and are re-estimated as follows. The expected total number of times the k'th vocabulary entry $v_k$ is generated in state i conditioned on the observed sentence is E(qi,k). A new estimate for b(i, k) can then be found: The initial state matrix I is re-estimated as follows:</Paragraph> <Paragraph position="8"/> </Section> <Section position="6" start_page="244" end_page="244" type="metho"> <SectionTitle> IMPLEMENTATION </SectionTitle> <Paragraph position="0"> Inspection of the preceding equations indicates that, in similar fashion to the I/O algorithm, this algorithm has cubic complexity in both the length of a sentence and the number of states in the grammar. It has been implemented as a computer program, and verification was conducted in four stages, to facilitate debugging: 1. Using top-level networks having only terminal states, check for exact numerical agreement of re-estimated parameters with those obtained by applying the F/B algorithm to the same examples.</Paragraph> <Paragraph position="1"> 2. Create examples involving nonterminals, but which have finite-state equivalents, and verify as in stage 1. 3. Create examples with several references to a given network, then build a finite-state equivalent in which the references are supplanted by network copies having tied parameters. Verify as in stage 1.</Paragraph> <Paragraph position="2"> 4. Test using examples in Chomsky normal form and compare with results from the I/O algorithm.</Paragraph> <Paragraph position="3"> Unscaled arithmetic was employed to simplify the initial implementation. Subsequent versions will include logarithmic scaling to prevent inaccuracies due to arithmetic underflow. 
The representation would also benefit from the inclusion of a probability matrix for final states, rather than their use simply as constraints.</Paragraph> <Paragraph position="4"> As the representation used by the algorithm is a superset of that used by the F/B algorithm, it conveniently permits &quot;Staged Training&quot;. Components that are finite-state networks can be pre-trained using the F/B algorithm, and then inserted into a context-free superstructure. This may be done to obtain improved initial estimates for the algorithm, and/or to reduce the total amount of computation involved. Lari and Young (1990) describe experiments using the I/O algorithm in which such pre-training was found useful. Using the algorithm, the parameters of a context-free grammar can be trained from a corpus of untagged text. Values for the production probabilities are directly available, and no conversion of the rules to or from Chomsky normal form is needed. Once trained, a grammar can be used to predict the most likely syntactic structure of new sentences using a corresponding analogue of the Cocke-Younger-Kasami parser.</Paragraph> </Section> </Paper>