<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1062"> <Title>Tree-Based State Tying for High Accuracy Modelling</Title> <Section position="3" start_page="307" end_page="308" type="metho"> <SectionTitle> 2. TIED-STATE HMM SYSTEM </SectionTitle> <Paragraph position="0"> The aim in building a tied-state HMM system is to ensure that there is sufficient training data to robustly estimate each set of state output distribution parameters whilst retaining the important context-dependent acoustic distinctions within each phone class. The method described here uses continuous density mixture Gaussian distributions for two reasons. Firstly, continuous density models are potentially more accurate than discrete (or semi-continuous) systems since they do not require the input feature space to be quantised (or represented by only a few basis functions). This becomes particularly important when derivative features are used since discrete systems have to regard each derivative set as being statistically independent in order to achieve adequate coverage of the feature space. In continuous density systems, derivative features are simply appended to the static parameters and although it is usually necessary to make a diagonal covariance assumption, the feature sets remain coupled through a common set of mixture weights.</Paragraph> <Paragraph position="1"> The second key advantage of continuous density systems is that the modelling accuracy of any particular distribution can be smoothly adjusted by increasing or decreasing the number of mixture components. This allows simple single Gaussian distributions to be used for an initial untied model set where the training data is very patchy. Then once tying has been performed such that every state has an adequate amount of data, more complex mixture Gaussian distributions can be estimated to give increased accuracy.</Paragraph> <Paragraph position="2"> The process of building a tied state HMM system is illustrated by Fig. 1. There are 4 main steps 1. An initial set of a 3 state left-right monophone models with single Gaussian output probability density functions is created and trained.</Paragraph> <Paragraph position="3"> 2. The state output distributions of these monophones are then cloned to initialise a set of untied context dependent triphone models which are then trained using Baum-Welch re-estimation. The transition matrix is not cloned but remains tied across all the triphones of each phone.</Paragraph> <Paragraph position="4"> 3. For each set of triphones derived from the same monophone, corresponding states are clustered. In each resulting cluster, a typical state is chosen as exemplar and all cluster members are tied to this state.</Paragraph> <Paragraph position="5"> 4. The number of mixture components in each state is incremented and the models re-estimated until performance on a development test set peaks or the desired number of mixture components is reached.</Paragraph> <Paragraph position="6"> In the above, all parameter estimation uses embedded Baum-Welch re-estimation for which a transcription is Initial set of untied states R-Liquid? ~ L-Fricative? Tie states in each leaf node needed for every training utterance. Since the dictionary typically has more than one pronunciation per word, transcriptions are derived from the known orthography by using an initial bootstrap set of monophones to do a .forced recognition of each training utterance. 
<Paragraph position="7"> As noted in the introduction, previous work on state tying used a data-driven agglomerative clustering procedure in which the distance metric depended on the Euclidean distance between the state means, scaled by the state variances. This works well, but it provides no easy way of handling unseen triphones. The next section describes an alternative clustering procedure which overcomes this problem.</Paragraph> </Section>
<Section position="4" start_page="308" end_page="308" type="metho"> <SectionTitle> 3. TREE-BASED CLUSTERING </SectionTitle>
<Paragraph position="0"> A phonetic decision tree is a binary tree in which a question is attached to each node. In the system described here, each of these questions relates to the phonetic context to the immediate left or right. For example, in Fig. 2, the question &quot;Is the phone on the left of the current phone a nasal?&quot; is associated with the root node of the tree. One tree is constructed for each state of each phone to cluster all of the corresponding states of all of the associated triphones. For example, the tree shown in Fig. 2 will partition its states into six subsets corresponding to the six terminal nodes. The states in each subset are tied to form a single state, and the questions and the tree topology are chosen to maximise the likelihood of the training data given these tied states whilst ensuring that there is sufficient data associated with each tied state to estimate the parameters of a mixture Gaussian PDF. Once all such trees have been constructed, unseen triphones can be synthesised by finding the appropriate terminal tree nodes for that triphone's contexts and then using the tied states associated with those nodes to construct the triphone.</Paragraph>
<Paragraph position="1"> All of the questions used have the form &quot;Is the left or right phone a member of the set X?&quot; where the set X ranges from broad phonetic classes such as Nasal, Fricative, Vowel, etc. through to singleton sets such as {l}, {m}, etc.</Paragraph>
<Paragraph position="2"> Each tree is built using a top-down sequential optimisation procedure [4,6]. Initially, all of the states to be clustered are placed in the root node of the tree and the log likelihood of the training data is calculated on the assumption that all of the states in that node are tied. This node is then split into two by finding the question which partitions the states in the parent node so as to give the maximum increase in log likelihood. This process is then repeated by splitting the node which yields the greatest increase in log likelihood, until the increase falls below a threshold. To ensure that all terminal nodes have sufficient training data associated with them, a minimum occupation count is applied.</Paragraph>
<Paragraph position="3"> Let S be a set of HMM states and let L(S) be the log likelihood of S generating the set of training frames F under the assumption that all states in S are tied, i.e. they share a common mean \mu(S) and variance \Sigma(S), and that transition probabilities can be ignored. Then, assuming that tying states does not change the frame/state alignment, a reasonable approximation for L(S) is given by

L(S) = -\frac{1}{2} \left( \log\left[ (2\pi)^n \, |\Sigma(S)| \right] + n \right) \sum_{s \in S} \sum_{f \in F} \gamma_s(o_f)

where n is the dimensionality of the data and \gamma_s(o_f) is the a posteriori probability of state s generating frame o_f.</Paragraph>
<Paragraph position="6"> Thus, the log likelihood of the whole data set depends only on the pooled state variance \Sigma(S) and the total state occupancy of the pool, \sum_{s \in S} \sum_{f \in F} \gamma_s(o_f). The former can be calculated from the means and variances of the states in the pool, and the state occupancy counts can be saved during the preceding Baum-Welch re-estimation.</Paragraph>
<Paragraph position="7"> For a given node with states S which is partitioned into two subsets S_y(q) and S_n(q) by question q, the node is split using the question q^* which maximises

\Delta L_q = L(S_y(q)) + L(S_n(q)) - L(S), \qquad q^* = \arg\max_q \Delta L_q,

provided that both \Delta L_{q^*} and the total pooled state occupation counts for both S_y(q^*) and S_n(q^*) exceed their associated thresholds.</Paragraph>
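To make these formulae concrete, here is a minimal numpy sketch, assuming a hypothetical representation of each state as a dict holding the occupancy count, mean and diagonal variance saved from the Baum-Welch pass; the function names are illustrative only. It computes L(S) for a pooled node and picks the question q^* maximising \Delta L_q subject to the occupancy and improvement thresholds.

```python
import numpy as np

def pooled_stats(states):
    """Pool occupancy, mean and diagonal variance over a set of states."""
    occ = sum(s["occ"] for s in states)
    mean = sum(s["occ"] * s["mean"] for s in states) / occ
    # Pooled E[x^2] minus the squared pooled mean gives the pooled variance.
    ex2 = sum(s["occ"] * (s["var"] + s["mean"] ** 2) for s in states) / occ
    return occ, mean, ex2 - mean ** 2

def log_likelihood(states):
    """L(S): log likelihood of the pooled frames under one shared
    diagonal Gaussian, transition probabilities ignored."""
    occ, _, var = pooled_stats(states)
    n = var.shape[0]
    return -0.5 * (n * np.log(2.0 * np.pi) + np.sum(np.log(var)) + n) * occ

def best_question(states, questions, min_occ, min_gain):
    """Return (gain, q, S_y, S_n) for the question maximising
    delta_L = L(S_y) + L(S_n) - L(S), or None if no admissible split."""
    base = log_likelihood(states)
    best = None
    for q in questions:
        s_yes = [s for s in states if q(s)]
        s_no = [s for s in states if not q(s)]
        if not s_yes or not s_no:
            continue
        # Minimum occupation count on both children, as described above.
        if sum(s["occ"] for s in s_yes) < min_occ:
            continue
        if sum(s["occ"] for s in s_no) < min_occ:
            continue
        gain = log_likelihood(s_yes) + log_likelihood(s_no) - base
        if gain >= min_gain and (best is None or gain > best[0]):
            best = (gain, q, s_yes, s_no)
    return best

# Toy usage: each question is a predicate on a state's saved context,
# e.g. "Is the left phone a Nasal?".
NASALS = {"m", "n", "ng"}
states = [
    {"occ": 120.0, "mean": np.array([0.0, 1.0]),
     "var": np.array([1.0, 0.5]), "left": "m", "right": "iy"},
    {"occ": 90.0, "mean": np.array([2.0, -1.0]),
     "var": np.array([1.2, 0.4]), "left": "s", "right": "iy"},
]
questions = [lambda s: s["left"] in NASALS]
print(best_question(states, questions, min_occ=50.0, min_gain=0.0))
```

Growing the full tree is then a greedy loop: repeatedly apply best_question to the leaf whose best admissible split yields the largest gain, stopping when no split exceeds the threshold.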
<Paragraph position="10"> As a final stage, the decrease in log likelihood from merging terminal nodes with differing parents is calculated. Any pair of nodes for which this decrease is less than the threshold used to stop splitting is then merged. In practice, this reduces the number of states by 10-20% without any degradation in performance.</Paragraph>
<Paragraph position="11"> To gain some impression of question usage, Table 1 shows, for a typical system built for the Wall Street Journal task, the six most useful questions calculated for all states of all models, for the entry state of all models, and for the exit state of all consonants. The rating given is the total increase in log likelihood achieved by that question. As can be seen, the presence of a following vowel is the most important context-dependent effect. There were 202 questions in total to choose from, and in the three cases 195, 182 and 152 questions respectively were actually used in at least one decision tree.</Paragraph> </Section> </Paper>