
<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1074">
  <Title>Recent Progress on the SUMMIT System</Title>
  <Section position="2" start_page="382" end_page="384" type="abstr">
    <SectionTitle>
MLP Classifier
</SectionTitle>
    <Paragraph position="0"> Recently, we have been investigating the use of multi-layer perceptrons for phonetic classification. Our work was motivated by the belief that these networks might offer a flexible framework for us to utilize our improved, albeit incomplete, speech knowledge. Until recently, our study was performed on the constrained task of classifying the 16 vowels in American English, spoken by many speakers and excised from continuous speech \[8\]. Our encouraging results suggested that MLP is a promising technique worthy of further investigation.</Paragraph>
    <Paragraph position="1"> We extended our work to one of classifying 38 vowels and consonants. In moving to this larger phonetic classification problem, we discovered some major problems in training the network. In this section, we will describe these problems and suggest some procedures to overcome them.</Paragraph>
    <Paragraph position="2"> Initialization Without any a priori knowledge, the connection weights of MLP are often randomly initialized. Since the transition region of the sigmoid function is relatively narrow while the saturation regions are relatively wide, randomly initializing the network can have the adverse effect of causing most of the basic units to operate in the saturation regions of the sigmoid function, where learning is slower than in the transition region.</Paragraph>
    <Paragraph position="3"> Let z~ = Ej wijxj, where zi is the input to unit i. If we assume that the weights, w~j, are randomly initialized such that they are uncorrelated with zero mean and constant variance.</Paragraph>
    <Paragraph position="4"> It can be shown that \[8\]: 2 2~E\[x~\]. (3) Eiz,\] = ~ E\[w,ilE\[xj\] = 0 and a,, = a~ J J Thus although the expected value of the input to the sigmoid 2 depends on the of a basic unit, E\[z~\], is zero, the variance, a,~ , variance of the random weights, 2 a~, as well as the magnitudes 2 is and the number of dimensions of the input vectors. If a~ large, many basic units may operate in the saturation regions. Normalization of Inputs Several procedures have been suggested to enable the basic units to operate initially in the transition regions. By initializing the network with small random weights or biasing the inputs, a~ can be reduced \[1\]. A method called center initialization has also been suggested that guarantees all the basic units initially operate at the center of the transition region, where learning is fastest \[8\].</Paragraph>
    <Paragraph position="5">  However, as training proceeds, the connection weights are 2 depend progressively more changed, resulting in E\[z~\] and cry, on the set of input vectors, PS. If E\[xj\] and ~jE\[z~\] are large, the hidden units may get driven well into the saturation regions, hindering the learning capability of the network.</Paragraph>
    <Paragraph position="6"> Let {SS} denote the set of training samples, where g = \[sl, s~ .... \]t is the speech-vector for each phoneme token. Furthermore, let</Paragraph>
    <Paragraph position="8"> where C/j is the standard deviation of sj, gj is the mean of sj over all the training tokens, and 7 is a positive constant.</Paragraph>
    <Paragraph position="9"> Thus, E\[xj\] = E\[z~\] = 0. Assuming &amp;quot;7o'j &gt; 1, = &lt; E (51 j &amp;quot;7o'j j Thus by subtracting and normalizing the input patterns according to Equation (4), the hidden units may operate more often in the transition region of the sigmoid function.</Paragraph>
    <Paragraph position="10"> Adaptive Gain Although Equation (4) provides a mechanism to increase the learning capability of the network, it requires that ~j and a,~ be compute d before the network is trained. In this section, we discuss a different technique to deal with the learning capability of the network.</Paragraph>
    <Paragraph position="11"> During training, the connection weights are usually modified according to Aw~j = q$ixj, where r/is the gain term, and $i is the error signal for unit i. (For simplicity, the momentum term is ignored in the following analysis.) Thus IAwij\] depends on txjh which means the training procedure pays more attention to inputs with larger magnitudes than to inputs with smaller magnitudes.</Paragraph>
    <Paragraph position="12"> Alternatively, q can be chosen to be adaptive. Specifically, (6) = I ,1' i where ~0 is a small positive constant. Thus</Paragraph>
    <Paragraph position="14"> As a result, the total change in the connection weights to a hidden unit is independent of the input, ~, thus allowing the training procedure to pay similar attention to all input vectors.</Paragraph>
    <Section position="1" start_page="383" end_page="383" type="sub_section">
      <SectionTitle>
Boundary Classification
</SectionTitle>
      <Paragraph position="0"> In our segmental framework formulated in Equation i, the main difference between classification and recognition is the incorporation of a probability for each segment, p(s). As described previously in Equation 2, we have simplified the problem of estimating p(s) to one of determining the probability that a boundary exists, p(b). Currently in the SUMMIT system, boundary classification is based on the height attained by a boundary in the dendrogram. A small VQ codebook of size 12 is used to quantize the spectral average of each segment, and distributions are collected and parameterized for each possible context.</Paragraph>
      <Paragraph position="1"> Since one of the segment networks we considered was not based on a dendrogram, an alternative classification procedure for the boundaries was adopted. In this procedure, a MLP with two output units was used, one for the valid boundaries and the other for the extraneous boundaries. By referencing the time-aligned phonetic transcription, the desired outputs of the network can be determined. In our current implementation, the probability of a detected boundary, p(b~), is determined using four abutting segments. Let ti stand for the time at which b~ is located, and si stand for the segment between t~ and t~+1, where ti+t &gt; ti. The boundary probability, p(bi), is then determined by using the acoustic measurements in s,_2,si-l,s~, and si+l as inputs to the MLP.</Paragraph>
      <Paragraph position="2"> Data Representation There were two representations used as input for the MLP classifier. The first representation was identical to the SUMMIT system, and consisted of 82 acoustic attributes. The attributes were generated automatically by a search procedure that uses the training data to determine the settings of free parameters of a set of generic property detectors using an optimization procedure\[10\]. The second representation consisted of a vector of three average spectra which corresponded to the left, middle, and right thirds of a segment. Th e spectra were the mean-rate and synchrony outputs of a 40 channel auditory model \[Ii\]. Thus, there were 120 points used for each representation. Finally, segment duration was also included.</Paragraph>
    </Section>
    <Section position="2" start_page="383" end_page="383" type="sub_section">
      <SectionTitle>
Experimental Results
Phonetic Classification
</SectionTitle>
      <Paragraph position="0"> The first experiments which were performed were based on phonetic classification. In these tests the system classified a token taken from a phonetic transcription that had been aligned with the speech waveform. Since there was no detection involved in these experiments only substitution errors were possible. As has been reported previously, the baseline speaker-independent classification performance of su M MIT on the testing data was 70% \[12\]. The performance of the MLP classifier using the same input representation was 74%. In the second set of classification experiments the representation was based on the spectral outputs described previously. Four experiments were performed using i) the synchrony outputs, 2) the mean-rate outputs, 3) the synchrony and mean-rate outputs and 4) the synchrony and mean-rate outputs and segment duration. The results of all experiments have been summarized in Table 3. None of these alternative representations was able to achieve the same level of performance as the SUMMIT acoustic attributes.</Paragraph>
    </Section>
    <Section position="3" start_page="383" end_page="384" type="sub_section">
      <SectionTitle>
Boundary Classification
</SectionTitle>
      <Paragraph position="0"> We have evaluated the MLP boundary classifier using the training and testing data described earlier. The inputs to the network are the averages of the mean rate response in the four abutting segments, resulting in a total of 160 input units. By using 32 hidden units, the network can classify 87% of the boundaries in the test set correctly.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>