
<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1062">
  <Title>Tree-Based State Tying for High Accuracy Modelling</Title>
  <Section position="5" start_page="308" end_page="310" type="evalu">
    <SectionTitle>
4. EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> Experiments have been performed using both the ARPA  Resource Management (RM) and Wall Street Journal (WSJ) databases. Results are presented here for the 1000 word RM task using the standard word pair grammar and for 5k closed vocabulary and 20k open vocabulary WSJ test sets. All tables show the percentage word error rate.</Paragraph>
    <Paragraph position="1"> For both databases the parameterised data consisted of 12 MFCC coefficients and normalised energy plus 1st and 2nd order derivatives. In addition, for the WSJ data, the  cepstral mean was calculated and removed on a sentence by sentence basis.</Paragraph>
    <Paragraph position="2"> The RM systems used the standard SI-109 training data and used the pronunciations and phone set (46 phones plus silence) produced by CMU and listed in \[5\] together with the standard word-pair grammar. The RM systems were tested on the four official evaluation test sets identified by the dates when the tests took place (Feb'89, Oct'89, Feb'91 and Sep'92).</Paragraph>
    <Paragraph position="3"> The WSJ systems used training data from the SI84 or the SI284 data sets and the pronunciations and phone set from the Dragon Wall Street Journal Pronunciation Lexicon Version 2.0 together with the standard bigram and trigram language models supplied by Lincoln Labs. Some locally generated additions and corrections to the dictionary were used and the stress markings were ignored resulting in 44 phones plus silence.</Paragraph>
    <Paragraph position="4"> Both 5k word and 20k word WSJ systems were tested.</Paragraph>
    <Paragraph position="5"> Four 5k closed vocabulary test sets were used. These were the Nov'92 and Nov'93 5k evaluation test sets; 202 sentences from the si_dt_s6 'spoke' development test set and 248 sentences fl'om the si_dt_05 'hub' development test set. At 20k, three test sets were used. These were the Nov'92 and Nov'93 evaluation test sets and a 252 sentence subset of the si_dt_20 development test set. For both the 5k and 20k cases, the Nov'93 test data was used just once for the actual evaluation.</Paragraph>
    <Paragraph position="6"> All phone models had three emitting states and a left-to-right topology. Training was performed using the HTK toolkit\[Ill. All recognition networks enforced silence at the start and end of sentences and allowed optional silences between words. All cross-word triphone systems used a one pass decoder that performed a beam search through a tree-structured dynamically constructed network\[7\]. Word internal systems used the standard HTK decoder, HVite.</Paragraph>
    <Paragraph position="7"> 4.1. Data-Driven vs. Tree-based Clustering null In order to compare top-down tree clustering with the bottom-up agglomerative approach used in previous systems, an RM system was constructed using each of the two methods. Both systems used the same initial set  of untied triphones. Agglomerative data-driven clustering was then applied to create a word-internal triphone system and decision tree-based clustering was used to create a second word-internal triphone system. The cluster thresholds in each case were adjusted to obtain systems with approximately equal numbers of states, 1655 and 1581, respectively. After clustering, the construction of the two systems was completed by applying identical mixture-splitting and Baum-Welch re-estimation procedures to produce systems in which all states had 6 component mixture Gaussian PDFs and both systems had a total of approximately 750k parameters.</Paragraph>
    <Paragraph position="8"> The results are shown in Table 2. As can be seen, the, performance of the tree clustered models is similar to that of the agglomeratively clustered system but the treebased models have the advantage that, were it necessary, they would allow the construction of unseen triphones. null  As noted in the introduction, the traditional approach to reducing the total number of parameters in a system is to use model-based clustering to produce generalised triphones. To compare this with the state-based ap proach, systems of similar complexity were constructed using both methods for the RM task and the 5k closed vocabulary WSJ task. For RM, each system had ap proximately 2400 states with 4 mixture components per state giving about 800k parameters in total. The WSJ  clustering on the 5k WSJ task. Each recogniser used cross-word triphones and a bigram language model, and had approximately 4800 tied-states and 8 mixture components per state.</Paragraph>
    <Paragraph position="9"> systems were trained on the S184 data set and had ap proximately 4800 states with 8 mixture components per state giving about 3000k parameters in total.</Paragraph>
    <Paragraph position="10"> Tables 3 and 4 show the results. As can be seen, the state-clustered systems consistently out-performed the model-clustered systems (by %20% and an average of 14%).</Paragraph>
    <Section position="1" start_page="310" end_page="310" type="sub_section">
      <SectionTitle>
4.3. Overall Performance
</SectionTitle>
      <Paragraph position="0"> To determine the overall performance of the treeclustered tiecl-state approach, a number of systems were constructed for both the RM and WSJ tasks in order to establish absolute performance levels.</Paragraph>
      <Paragraph position="1"> For the RM task, a gender independent cross word triphone system was constructed with 1778 states each with 6 mixture components per state. The performance of this system on the four test sets is shown in Table 5. For the WSJ task, two gender dependent cross-word triphone systems were constructed. The first used the SI-84 training set with 3820 tied-states per gender and 8 mixture components per state. The variances across corresponding male and female states were tied leading to a system with approximately 3600k parameters. The second sys tem was similar but used the larger SI284 training set. It had 7558 tied-states per gender, 10 mixture components per state and about 8900k parameters in total. The re sults for the the 5k tests are shown in Table 6 and for the 20k tests in Table 7. These systems achieved the lowest ' error rates reported for the November 1993 WSJ eval- null models, t denotes systems used for the ARPA November 1993 WSJ evaluation.</Paragraph>
      <Paragraph position="2"> uations on the H2-C1 and H2-P0 5k closed vocabulary tasks, and the H1-C2 20k open vocabulary task; and the second lowest on the HI-C1 20k open vocabulary task.</Paragraph>
      <Paragraph position="3"> A full description of these Wall Street Journal systems can be found in \[9\].</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>