<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1646">
  <Title>Corrective Models for Speech Recognition of Inflected Languages</Title>
  <Section position="5" start_page="391" end_page="393" type="metho">
    <SectionTitle>
3 Inflectional Morphology
</SectionTitle>
    <Paragraph position="0"> Inflectional abundance in a language generally corresponds to some flexibility in word order. In a free word-order language, the order of sentential participants is relatively unconstrained. This does not mean a speaker of the language can arbitrarily choose an order. Word-order choice may change the semantic and/or pragmatic interpretation of an utterance. Czech is known as a free word-order language allowing for subject, object, and verbal components to come in any order. Morphological inflection in these languages must include a syntactic case marker to allow the determination of which participants are subjects (nominative case), objects (accusative or dative) and other such entities. Additionally, morphological inflection encodes features such as gender and number.</Paragraph>
    <Paragraph position="1"> The agreement of these features between sentential components (adjectives with nouns, subjects with verbs, etc.) may further disambiguate the target of a modifier (e.g., identifying the noun that is modified by a particular adjective).</Paragraph>
    <Paragraph position="2"> The increased flexibility in word order aggravates the data sparsity of standard n-gram language model for two reasons: first, the number of valid configurations of a group of words increases with the free order; and second, lexical items are decorated with the inflectional morphemes, multiplying the number of word-forms that appear.</Paragraph>
    <Paragraph position="3"> In addition to modeling sequences of wordforms, we model sequences of morphologically reduced lemmas, sequence of morphological tags and sequences of various factored representations of the morphological tags. Factoring a word into the semantics-bearing lemma and syntaxbearing morphological tag alleviates the data sparsity problem to some extent. However, the number of possible factorizations of n-grams is large. The approach adopted in this work is to provide a rich class of features and defer the modeling of their interaction to the learning procedure.</Paragraph>
    <Section position="1" start_page="391" end_page="393" type="sub_section">
      <SectionTitle>
3.1 Extracting Morphological Features
</SectionTitle>
      <Paragraph position="0"> The extraction of reliable morphological features critically effects further morphological modeling.</Paragraph>
      <Paragraph position="1"> Here, we first select the most likely morphological analysis for each word using a morphological  of the closed set of possible values. Not all values are used in the annotated data.</Paragraph>
      <Paragraph position="2"> tagger. In particular, we use the Czech feature-based tagger distributed with the Prague Dependency Treebank (HajiVc et al., 2005). The tagger is based on a morphological analyzer which uses a lexicon and a rule-based tag guesser for words not found in the lexicon. Trained by the maximum entropy procedure, the tagger uses left and right contextual features from the input string. Currently, this is the best available Czech-language tagger.</Paragraph>
      <Paragraph position="3"> See HajiVc and Vidov'a-Hladk'a (1998) for further details on the tagger.</Paragraph>
      <Paragraph position="4"> A disadvantage of such an approach is that the tagger works on strings rather than the word-lattices that we expect from an ASR system.</Paragraph>
      <Paragraph position="5"> Therefore, we must extract a set of strings from the lattices prior to tagging. An alternative approach is to hypothesize all morphological analyses for each word in the lattice, thereby considering the entire set of analyses as features in the model. In the current implementation we have chosen to use a tagger to reduce the complexity of the model by limiting the number of active features while still obtaining relatively reliable features. Moreover, systematic errors in tagging can be potentially compensated by the corrective model.</Paragraph>
      <Paragraph position="6"> The initial stage of feature extraction begins with an analysis of the data on which we train and test our models. The process follows:  1. Extract the n-best hypotheses according to a baseline model, where n varies from 50 to 1000 in the current work.</Paragraph>
      <Paragraph position="7"> 2. Tag each of the hypotheses with the morphological tagger.</Paragraph>
      <Paragraph position="8"> 3. Re-encode the original word strings along  with their tagged morphological analysis in a weighted finite state transducer to allow  Word-form to obdob'i bylo pomVern'e kr'atk'e gloss that period was relatively short lemma ten obdob'i b'yt pomVernVe kr'atk'y tag PDNS1 NNNS1 VpNS- Dg-- AAFS2  form to obdob'i bylo pomVern'e kr'atk'e to obdob'i obdob'i bylo bylo pomVern'e pomVern'e kr'atk'e lemma ten obdob'i b'yt pomVernVe kr'atk'y</Paragraph>
      <Paragraph position="10"> subset of the feature classes is presented here. The morphological feature values are those assigned by the HajiVc tagger.</Paragraph>
      <Paragraph position="11"> an efficient means of projecting the hypotheses from word-form to morphology and vice versa.</Paragraph>
      <Paragraph position="12"> 4. Extract appropriately factored n-gram features for each hypothesis as described below.</Paragraph>
      <Paragraph position="13"> Each word state in the original lattice has an associated lemma/tag from which a variety of n-gram features can be extracted.</Paragraph>
      <Paragraph position="14"> From the morphological features assigned by the tagger, we chose to retain only a subset and discard the less reliable features which are semantic in nature. The basic morphological features used are detailed in Table 2. In the tag-based model, a string of 5 characters representing the 5 morphological fields is used as a unique identifier. The derived features include n-grams of POS, D-POS, gender (gen), number (num), and case features as well as their combinations.</Paragraph>
      <Paragraph position="15"> POS, D-POS Captures the sub-categorization of the part-of-speech tags.</Paragraph>
      <Paragraph position="16"> gen, num Captures complex gender-number agreement features.</Paragraph>
      <Paragraph position="17"> num, case Captures number agreement between specific case markers.</Paragraph>
      <Paragraph position="18"> POS, case Captures associated POS/Case features (e.g., adjectives associated with nominative elements).</Paragraph>
      <Paragraph position="19"> The paired features allow for complex inflectional interactions and are less sparse than the composite 5-component morphological tags. Additionally, the morphologically reduced lemma and n-grams of lemmas are used as features in the models.</Paragraph>
      <Paragraph position="20"> Table 3 presents a morphological analysis of the Czech sentence To obdob'i bylo pomVernVe kr'atk'e. The encoded tags represent the first 5 fields of the Prague Dependency Treebank morphological encoding and correspond to the last 5 rows of Table 2. Features for this sentence include the wordform, lemma, and composite tag features as well as the components of each tag and the above mentioned concatenation of tag fields. Additionally, n-grams of each of these features are included. Bi-gram features extracted from an example sentence are illustrated in Table 4.</Paragraph>
      <Paragraph position="21"> The following section describes how the fea- null tures extracted above are modeled in a discriminative framework to reduce word error rate.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="393" end_page="393" type="metho">
    <SectionTitle>
4 Corrective Model and Estimation
</SectionTitle>
    <Paragraph position="0"> In this work, we adopt the reranking framework of Charniak and Johnson (2005) for incorporating morphological features. The model scores each test hypothesis y using a linear function, vth(y), of features extracted from the hypothesis fj(y) and model parameters thj, i.e., vth(y) = summationtextj thjfj(y). The hypothesis with the highest score is then chosen as the output.</Paragraph>
    <Paragraph position="1"> The model parameters, th, are learned from a training set by maximum entropy estimation of the following conditional model:</Paragraph>
    <Paragraph position="3"> Here, Ys = {yj} is the set of hypotheses for each training utterance s and the function g returns an extrinsic evaluation score, which in our case is the WER of the hypothesis. Pth(yi|Ys) is modeled by a maximum entropy distribution of the form,</Paragraph>
    <Paragraph position="5"> choice simplifies the numerical estimation procedure since the gradient of the log-likelihood with respect to a parameter, say thj, reduces to difference in expected counts of the associated feature,</Paragraph>
    <Paragraph position="7"> To allow good generalization properties, a Gaussian regularization term is also included in the cost function.</Paragraph>
    <Paragraph position="8"> A set of hypotheses Ys is generated for each training utterance using a baseline ASR system. Care is taken to reduce the bias in decoding the training set by following a jack-knife procedure. The training set is divided into 20 subsets and each subset is decoded after excluding the transcripts of that subset from the language model of the decoder. null The model allows the exploration of a large feature space, including n-grams of words, morphological tags, and factored tags. In a large vocabulary system, this could be an enormous space. However, in a discriminative maximum entropy framework, only the observed features are considered. Among the observed features, those associated with words that are correct in all hypotheses do not provide any additional discrimination capability. Mathematically, the gradient of the log-likelihood with respect to the parameters of these features tends to zero and they may be discarded.</Paragraph>
    <Paragraph position="9"> Additionally, the parameters associated with features that are rarely observed in the training set are difficult to learn reliably and may be discarded.</Paragraph>
    <Paragraph position="10"> To avoid redundant features, we focus on words which are frequently incorrect; this is the error region we aim to model. In the training utterance, the error regions of a hypothesis are identified using the alignment corresponding to the minimum edit distance from the reference, akin to computing word error rate. To mark all the error regions in an ASR lattice, the minimum edit distance alignment is obtained using equivalent finite state machine operations (Mohri, 2002). From amongst all the error regions in the training lattices, the most frequent 12k words in error are shortlisted. Features are computed in the corrective model only if they involve words for the shortlist. The parameters, th, are estimated by numerical optimization as in (Charniak and Johnson, 2005).</Paragraph>
  </Section>
  <Section position="7" start_page="393" end_page="394" type="metho">
    <SectionTitle>
5 Feature Selection
</SectionTitle>
    <Paragraph position="0"> The space of features spanned by the cross-product space of words, lemmas, tags, factoredtags and their n-gram can potentially be overwhelming. However, not all of these features are equally important and many of the features may not have a significant impact on the word error rate. The maximum entropy framework affords the luxury of discarding such irrelevant features without much bookkeeping, unlike maximum likelihood models. In the context of modeling morphological features, we investigate the efficacy of simple feature selection based on the kh2 statistics, which has been shown to effective in certain text categorization problems. e.g. (Yang and Pedersen, 1997).</Paragraph>
    <Paragraph position="1"> The kh2 statistics measures the lack of independence by computing the deviation of the observed counts Oi from the expected counts Ei.</Paragraph>
    <Paragraph position="3"> In our case, there are two classes - oracle hypotheses c and competing hypotheses -c. The expected count is the count marginalized over classes.</Paragraph>
    <Paragraph position="5"> This can be simplified using a two-way contingency table of feature and class, where A is the number of times f and c co-occur, B is the number of times f occurs without c, C is the number of times c occurs without f, and D is the number of times neither f nor c occurs, and N is the total number of examples. Then, the kh2 is defined to be:</Paragraph>
    <Paragraph position="7"> The kh2 statistics are computed for all the features and the features with larger value are retained. Alternatives feature selection mechanisms such as those based on mutual information and information gain are less reliable than kh2 statistics for heavy-tailed distributions. More complex feature selection mechanism would entail computing higher order interaction between features which is computationally expensive and so is not explored in this work.</Paragraph>
  </Section>
  <Section position="8" start_page="394" end_page="396" type="metho">
    <SectionTitle>
6 Empirical Evaluation
</SectionTitle>
    <Paragraph position="0"> The corrective model presented in this work is evaluated on a large vocabulary task consisting of spontaneous spoken testimonies in Czech language, which is a subset of the multilingual MALACH corpus (Psutka et al., 2003).</Paragraph>
    <Section position="1" start_page="394" end_page="394" type="sub_section">
      <SectionTitle>
6.1 Task
</SectionTitle>
      <Paragraph position="0"> For acoustic model training, transcripts are available for about 62 hours of speech from 336 speakers, amounting to 507k spoken words from a vocabulary of 79k. A portion of this data containing speech from 44 speakers, about 21k words in all is treated as development set (dev). The test set (eval) consists of about 2 hours of speech from 10 new speakers and contains about 15k words.</Paragraph>
    </Section>
    <Section position="2" start_page="394" end_page="394" type="sub_section">
      <SectionTitle>
6.2 Baseline ASR System
</SectionTitle>
      <Paragraph position="0"> The baseline ASR system uses perceptual linear prediction (PLP) features which is computed on 44KHz input speech at the rate of 10 frames per second, and is normalized to have zero mean and unit variance per speaker. The acoustic models are made of 3-state HMM triphones, whose observation distributions are clustered into about 4500 allophonic (triphone) states. Each state is modeled by a 16 component Gaussian mixture with diagonal covariances. The parameters of the acoustic models are initially estimated by maximum likelihood and then refined by five iterations of maximum mutual information estimation (MMI).</Paragraph>
      <Paragraph position="1"> Unlike other comparable corpora, this corpus contains a relatively high percentage of colloquial words - about 9% of the vocabulary and 7% of the tokens. For the sake of downstream application, the colloquial variants are subsumed in the lexicon. As a result, common words contain several pronunciation variants, and a few have as many as 14 variants.</Paragraph>
      <Paragraph position="2"> For the first pass decoding, a language model was created by interpolating the in-domain model (weight=0.75), estimated from 600k words of transcripts with an out-of-domain model, estimated from 15M words of Czech National Corpus (Psutka et al., 2003). Both models are parameterized by a trigram language model with Katz back-off. The decoding graph was built by composing the language model, the lexical transducer and the context-dependent transducer (phones to triphones) into a single compact finite state machine. null The baseline ASR system decodes test utterance in two passes. A first pass decoding is performed with MMIE acoustic models, whose output transcripts are bootstrapped to estimate two maximum likelihood linear regression transforms for each speaker using five iterations. A second pass decoding is then performed with the new speaker adapted acoustic models. The resulting performance is given in Table 5. The performance reflects the difficulty of transcribing spontaneous speech from the elderly speakers whose speech is also heavily accented and emotional in this corpus.</Paragraph>
      <Paragraph position="3">  system is reported, showing the word error rate of 1-best MAP hypothesis and the oracle in 1000best hypotheses for dev and eval sets.</Paragraph>
    </Section>
    <Section position="3" start_page="394" end_page="395" type="sub_section">
      <SectionTitle>
6.3 Experiments With Morphology
</SectionTitle>
      <Paragraph position="0"> We present a set of contrastive experiments to gauge the performance of the corrective models and the contribution of morphological features.</Paragraph>
      <Paragraph position="1"> For training the corrective models, 50 best hypotheses are generated for each utterance using the  loss in performance, as observed in dev (a) and eval (b) sets. jack-knife procedure mentioned earlier. For each hypothesis, bigram and unigram features are computed which consist of word-forms, lemmas, morphologoical tags, factored morphological tags, and the likelihood from the baseline ASR system. For testing, the baseline ASR system is used to generate 1000 best hypotheses for each utterance. These are then evaluated using the corrective models and the best scored hypothesis is chosen as the output.</Paragraph>
      <Paragraph position="2"> Table 6 summarizes the results on two test sets - the dev and the eval set. A corrective model with word bigram features improve the word error rate by about an absolute 1% over the baseline. Morphological features provide a further gain on both the test sets consistently.</Paragraph>
      <Paragraph position="3">  model is compared with that of the baseline ASR system, illustrating the improvement in performance with morphological features.</Paragraph>
      <Paragraph position="4"> The gains on the dev set are significant at the level of p &lt; 0.001 for three standard NIST tests, namely, matched pair sentence segment, signed pair comparison, and Wilcoxon signed rank tests.</Paragraph>
      <Paragraph position="5"> For the smaller eval set the significant levels were lower for morphological features. The relative gains observed are consistent over a variety of conditions that we have tested including the ones reported below.</Paragraph>
      <Paragraph position="6"> Subsequently, we investigated the impact of reducing the number of features using kh2 statistics, as described in section 5. The experiments with bigram features of word-forms and morphology were repeated using reduced feature sets, and the performance was measured at 10%, 30% and 60% of their original features. The results, as illustrated in Figure 1, show that the word error rate does not change significantly even after the number of features are reduced by 70%. We have also observed that most of the gain can be achieved by evaluating 200 best hypotheses from the baseline ASR system, which could further reduce the computational cost for time-sensitive applications.</Paragraph>
    </Section>
    <Section position="4" start_page="395" end_page="396" type="sub_section">
      <SectionTitle>
6.4 Analysis of Feature Classes
</SectionTitle>
      <Paragraph position="0"> The impact of feature classes can be analyzed by excluding all features from a particular class and evaluating the performance of the resulting model without re-estimation. Figure 2 illustrates the effectiveness of different features class. The y-axis shows the gain in F-score, which is monotonic with the word error rate, on the entire development dataset. In this analysis, the likelihood score from the baseline ASR system was omitted since our interest is in understanding the effectiveness of categorical features such as words, lemmas and tags.</Paragraph>
      <Paragraph position="1"> The most independently influential feature class is the factored tag features. This corresponds with  form, lemma, tag, and factored tag model. Y -axis is the contribution of this feature if added to an otherwise complete model. Feature classes are labeled: TNG - tag n-gram, LNG - lemma n-gram, FNG - form n-gram and TFAC - factored tag ngrams. The number following the # represents the order of the n-gram.</Paragraph>
      <Paragraph position="2"> our belief that modeling morphological features requires detailed models of the morphology; in this model the composite morphological tag n-gram features (TNG) offer little contribution in the presence of the factored features.</Paragraph>
      <Paragraph position="3"> Analysis of feature reduction by the kh2 statistics reveals a similar story. When features are ranked according to their kh2 statistics, about 57% of the factored tag n-grams occur in the top 10% while only 7% of the word n-grams make it. The lemma and composite tag n-grams give about 6.2% and 19.2% respectively. Once again, the factored tag is the most influential feature class.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="396" end_page="396" type="metho">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have proposed a corrective modeling framework for incorporating inflectional morphology into a discriminative language model. Empirical results on a difficult Czech speech recognition task support our claim that morphology can help improve speech recognition results for these types of languages. Additionally, we present a feature selection method that effectively reduces the model size by about 70% while having little or no impact on recognition accuracy. Model size reduction greatly reduces training time which can often be prohibitively expensive for maximum entropy training.</Paragraph>
    <Paragraph position="1"> Analysis of the models learned on our task show that factored morphological tags along with word-forms provide most of the discriminative power; and, in the presence of these features, composite morphological tags are of little use.</Paragraph>
    <Paragraph position="2"> The corrective model outlined here operates on the word lattices produced by an ASR system. The morphological tags are inferred from the word sequences in the lattice. Alternatively, by employing an ASR system that models the morphological constraints in the acoustics as in (Chung and Seneff, 1999), the corrective model could be applied directly to a lattice with morphological tags.</Paragraph>
    <Paragraph position="3"> When dealing with ASR word lattices, the efficacy of the proposed feature selection mechanism can be exploited to eliminate the intermediate tagger, a potential source of errors. Instead of considering the best morphological analysis, the model could consider all possible analyses of the words. Further, the feature space could be enriched with syntactic features which are known to be useful (Collins et al., 2005). The task of modeling is then tackled by feature selection and the maximum entropy training procedure.</Paragraph>
  </Section>
class="xml-element"></Paper>