<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1003"> <Title>Advances in Meeting Recognition</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. SPEECH RECOGNITION ENGINE </SectionTitle> <Paragraph position="0"> To achieve robust performance over a range of different tasks, we trained our baseline system on Broadcast News (BN). The system deploys a quinphone model with 6000 distributions sharing 2000 codebooks. There are about 105K Gaussians in the system. Vocal Tract Length Normalization and cluster-based Cepstral Mean Normalization are used to compensate for speaker and channel variations. Linear Discriminant Analysis is applied to reduce feature dimensionality to 42, followed by a diagonalization transform (Maximum Likelihood Linear Transform). A 40k vocabulary and trigram language model are used. The baseline language model is trained on the BN corpus.</Paragraph> <Paragraph position="1"> Our baseline system has been evaluated across the above mentioned tasks resulting in the word error rates shown in Table 1. While we achieve a first pass WER of 18.5% on all F-conditions and 9.6% on the F0-conditions in the Broadcast News task, the word error rate of 44.2% on meeting data is quite high, reflecting the challenges of this task. Results on the ESST system [9] are even worse with a WER of 54.1% which results from the fact that ESST is a highly specialized system trained on noise-free but spontaneous speech in the travel domain.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Acoustic and Language Model Adaptation </SectionTitle> <Paragraph position="0"> The BN acoustic models have been adapted to the meeting data thru Viterbi training, MLLR (Maximum Likelihood Linear Regression), and MAP (Maximum A Posteriori) adaptation. To improve the robustness towards the unseen channel conditions, speaking mode and training/test mismatch, we trained a system &quot;BN+ESST&quot; using a mixed training corpus. The comparison of the results indicate that the mixed system is more robust (44.2% a6 42.2%), without loosing the good performance on the original BN test set (18.5% vs. 18.4%).</Paragraph> <Paragraph position="1"> To tackle the lack of training corpus, we investigated linear interpolation of the BN and the meeting (MT) language model. Based on a cross-validation test we calculated the optimal interpolation weight and achieved a perplexity reduction of 21.5% relative compared to the MT-LM and more than 50% relative compared to the BN-LM. The new language model gave a significant improvement decreasing the word error rate to 38.7%. Overall the error rate was reduced by a7a9a8a11a10 a12a14a13 relative (44.2% a6 38.7%) compared to the BN baseline system.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Model Combination based Acoustic Map- </SectionTitle> <Paragraph position="0"> ping (MAM) For the experiments on meeting data reported above we have used comparable recording conditions as each speaker in the meeting has been wearing his or her own lapel microphone. Frequently however this assumption does not apply. We have also carried out experiments aimed at producing robust recognition when microphones are positioned at varying distances from the speaker. In this case data, specific for the microphone distance and SNR found in the test condition is unavailable. 
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Model Combination based Acoustic Mapping (MAM) </SectionTitle>
<Paragraph position="0"> For the experiments on meeting data reported above we have used comparable recording conditions, as each speaker in the meeting has been wearing his or her own lapel microphone. Frequently, however, this assumption does not apply. We have also carried out experiments aimed at producing robust recognition when microphones are positioned at varying distances from the speaker. In this case, data specific to the microphone distance and SNR found in the test condition is unavailable. We therefore apply a new method, Model Combination based Acoustic Mapping (MAM), to the recognition of speech at different distances. MAM was originally proposed for recognition in different car noise environments; please refer to [10, 11] for details.</Paragraph>
<Paragraph position="1"> MAM estimates an acoustic mapping in the log-spectral domain in order to compensate for noise condition mismatches between training and test. During training, the generic acoustic models A_j (j = 1, 2, ..., M) and a variable noise model N are estimated. Then, model combination is applied to combine each A_j with N, yielding new generic models that correspond to noisy speech. During decoding of a given input x, the mapping process requires a classification as a first step, in which a score s_j(x) is computed for each of the combined generic models.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> Acoustic Mapping (MAM) </SectionTitle>
<Paragraph position="0"> We applied MAM to data that was recorded simultaneously by an array of microphones positioned at different distances from the speaker. Each speaker read several paragraphs of text from the Broadcast News corpus. The results of experiments with nine speakers (5 male, 4 female) are summarized in Table 2. The experiments suggest that MAM effectively models the signal condition found in the test data, resulting in substantial performance improvements. It outperforms unsupervised MLLR adaptation while requiring less computational effort.</Paragraph>
</Section>
</Paper>
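As a supplementary illustration of the model-combination step described in Section 4.2, the following Python sketch shows one plausible reading of MAM's training-time combination and decoding-time classification. It is not the authors' implementation: the assumption that each generic model A_j and the noise model N are summarized by mean log-power spectra, the power-domain combination via np.logaddexp, and the diagonal-Gaussian scoring are all illustrative choices, and the symbol names follow the reconstruction above.

# Illustrative sketch of model combination in the log-spectral domain (MAM-style).
import numpy as np

def combine(clean_log_spec, noise_log_spec):
    """Combine a clean-speech model with a noise model in the power domain,
    bin by bin: log(exp(speech) + exp(noise))."""
    return np.logaddexp(clean_log_spec, noise_log_spec)

def score(x_log_spec, model_log_spec, var=1.0):
    """Diagonal-Gaussian log-likelihood of an input log-spectrum under a model."""
    diff = x_log_spec - model_log_spec
    return -0.5 * np.sum(diff * diff) / var

def classify(x_log_spec, generic_models, noise_model):
    """First step of the mapping: pick the combined generic model that best
    explains the (noisy) input frame."""
    scores = [score(x_log_spec, combine(a_j, noise_model)) for a_j in generic_models]
    return int(np.argmax(scores)), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_bins, n_models = 20, 4
    generic_models = [rng.normal(0.0, 1.0, n_bins) for _ in range(n_models)]  # A_j
    noise_model = rng.normal(-1.0, 0.5, n_bins)                               # N
    # Simulate a noisy observation generated from generic model 2.
    x = combine(generic_models[2], noise_model) + rng.normal(0.0, 0.1, n_bins)
    best, _ = classify(x, generic_models, noise_model)
    print("selected generic model:", best)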