<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1067">
  <Title>Experiments with Tree-Structured MMI Encoders on the RM Task</Title>
  <Section position="2" start_page="0" end_page="346" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Most hidden Markov model systems use minimum distortion (MD) vector quantizers (encoders) to convert continuously valued speech parameters into streams of integer codes. However, MD encoders do not optimize a criterion that is directly related to recognition accuracy. Moreover, they use a single distortion measure that may not be appropriate for all speech classes. In this paper, we propose the use of maximum mutual information (MMI) encoders that are trained to extract phonetic information and thereby minimize phonetic recognition errors. We further compress the frames into larger segments and repeat the encoding.</Paragraph>
    <Paragraph position="1"> Our MMI encoders are binary decision trees built to maximize the average mutual information between the phonetic targets and the codes assigned to them. The task of training such encoders has been extensively addressed in the theory of binary decision trees \[5, 8, 2\]. For example, Breiman et al. systematically consider binary decision trees applied to various classification tasks. The decision (interior) nodes of the tree are allowed to use linear combinations of feature vectors, as well as unordered categorical features. Training criteria (&amp;quot;impurity&amp;quot; criteria) for the binary decision trees include the average leaf-node-conditional class entropy. Training is performed in a top-down node-at-a-time fashion, adding new leaf nodes and maximizing reduction in the average leaf node impurity attained by such additions. It is demonstrated on many practical classification problems that the above procedure results in a suboptimal, but sufficiently accurate tree.</Paragraph>
    <Paragraph position="2"> Labelled data necessary for the supervised training is obtained by aligning speech frames with phonetic transcriptions using dynamic programming. We train a two-stage cascade of binary-tree encoders. In the first stage, frames are encoded to extract maximum information about their target label classes. Feature vectors used in the tree encoder are frame-based. Contiguous runs of frames with the same code are compressed into segments. In the second stage, the resulting segments are encoded to extract maximum information about their target label classes (we assign a single target label class per segment). Segment-based acoustic feature vectors are used in the second-stage tree encoder, along with some categorical features based on the phonetic identities uncovered by the first-stage tree encoder. Segment duration features are also used. Resulting runs of segments with the same code are again compressed into larger segments.</Paragraph>
    <Paragraph position="3"> Speech Systems Incorporated (SSI) has been using a version of this two-stage cascade of the MMI encoders in the Phonetic Engine (r), an integral part of SSI's largevocabulary, continuous speech recognition system \[6, 1\]. The two-stage trees are very fast; they encode one second of speech in one-third of a second on a 16 mHz 68020 microprocessor. In this study, we apply these MMI encoders in a more limited sense -- as vector quantizers for the Sphinx speech recognition system \[3\]. This enables a direct comparison of MMI encoders and standard MD encoders. In our experiments, for the sake of expediency, we used a simplified version of the Sphinx system limited to 48 context-independent phonetic HMMs and 26 acoustic frame features. The two-stage cascade of MMI encoders outperforms the standard MD encoder: Word error rate drops by 33% and recognition is performed roughly 1.6 times faster.</Paragraph>
    <Paragraph position="4"> We also ran a preliminary evaluation of the MMI and MD encoders using the Sphinx 1100 context-dependent (generalized triphone) HMMs. We used the same codes without re-growing the trees for context-dependent class targets. Error rate was reduced by more than half relative to no use of context.</Paragraph>
  </Section>
class="xml-element"></Paper>