<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2001"> <Title>Factored Neural Language Models</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Generalization in Language Models </SectionTitle> <Paragraph position="0"> An important task in language modeling is to provide reasonable probability estimates for n-grams that were not observed in the training data. This generalization capability is becoming increasingly relevant in current large-scale speech and NLP systems that need to handle unlimited vocabularies and domain mismatches. The smooth predictor function learned by NLMs can provide good generalization if the test set contains n-grams whose individual words have been seen in similar context in the training data. However, NLMs only have a simplistic mechanism for dealing with words that were not observed at all: OOVs in the test data are mapped to a dedicated class and are assigned the singleton probability when predicted (i.e. at the output layer) and the features of a randomly selected singleton word when occurring in the input. In standard back-off n-gram models, OOVs are handled by reserving a small xed amount of the discount probability mass for the generic OOV word and treating it as a standard vocabulary item. A more powerful backoff strategy is used in factored language models (FLMs) (Bilmes and Kirchhoff, 2003), which view a word as a vector of word features or factors : w = <f1,f2,... ,fk> and predict a word jointly from previous words and their factors: A generalized backoff procedure uses the factors to provide probability estimates for unseen n-grams, combining estimates derived from different backoff paths. This can also be interpreted as a generalization of standard class-based models (Brown et al., 1992).</Paragraph> <Paragraph position="1"> FLMs have been shown to yield improvements in perplexity and word error rate in speech recognition, particularly on sparse-data tasks (Vergyri et al., 2004) and have also outperformed backoff models using a linear decomposition of OOVs into sequences of morphemes. In this study we use factors in the input encoding for NLMs.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Factored Neural Language Models </SectionTitle> <Paragraph position="0"> NLMs de ne word similarity solely in terms of their context: words are assumed to be close in the continuous space if they co-occur with the same (subset of) words. But similarity can also be derived from word shape features (af xes, capitalization, hyphenation etc.) or other annotations (e.g. POS classes). These allow a model to generalize across classes of words bearing the same feature. We thus de ne a factored neural language model (FNLM) (Fig. 2) which takes as input the previous n [?] 1 vectors of factors. Different factors map to disjoint row sets of the matrix. The h and o layers are identical to the standard NLM's. Instead of predicting the probabilities for word and feature indices are mapped to rows in M. The nal multiplicative layer outputs the word probability distribution. all words at the output layer directly, we rst group words into classes (obtained by Brown clustering) and then compute the conditional probability of each word given its class: P(wt) = P(ct) x P(wt|ct).</Paragraph> <Paragraph position="1"> This is a speed-up technique similar to the hierarchical structuring of output units used by (Morin and Bengio, 2005), except that we use a at hierarchy. 
Like the standard NLM, the network is trained to maximize the log-likelihood of the data. We use BKP with cross-validation on the development set and L2 regularization (the sum of squared weight values, penalized by a parameter λ) in the objective function.</Paragraph> </Section> <Section position="6" start_page="0" end_page="2" type="metho"> <SectionTitle> 5 Handling Unknown Factors in FNLMs </SectionTitle> <Paragraph position="0"> In an FNLM setting, a subset of a word's factors may be known or can be reliably inferred from its shape even though the word itself never occurred in the training data. The FNLM can use the continuous representations of these known factors directly in the input. If unknown factors are still present, new continuous representations are derived for them from those of known factors of the same type. This is done by averaging over the continuous vectors of a selected subset of the words in the training data, which places the new item in the center of the region occupied by the subset. For example, proper nouns constitute a large fraction of OOVs, and using the mean of the rows in M associated with words carrying a proper noun tag yields the average proper noun representation for the unknown word. We have experimented with the following strategies for subset selection: NULL (the null subset, i.e. the feature vector components for unknown factors are 0); ALL (the average of all known factors of the same type); TAIL (the average of the least frequently encountered factors of that type, up to a threshold of 10%); and LEAST (the representation of the single least frequent factor of the same type). The prediction of OOVs themselves is unaffected since we use a factored encoding only for the input, not for the output (though this is a possibility for future work).</Paragraph> </Section> <Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> 6 Data and Baseline Setup </SectionTitle> <Paragraph position="0"> We evaluate our approach by measuring perplexity on two different language modeling tasks. The first is the LDC CallHome Egyptian Colloquial Arabic (ECA) Corpus, consisting of transcriptions of phone conversations. ECA is a morphologically rich language that is used almost exclusively in informal spoken communication. Data must be obtained by transcribing conversations and is therefore very sparse. The present corpus has 170K words for training (|V| = 16,026), 32K for development (dev), and 17K for evaluation (eval97). The data was preprocessed by collapsing hesitations, fragments, and foreign words into one class each. The corpus was further annotated with morphological information (stems, morphological tags) obtained from the LDC ECA lexicon. The OOV rates are 8.5% (development set) and 7.7% (eval97 set), respectively.</Paragraph> <Paragraph position="1"> The second corpus consists of Turkish newspaper text that has been morphologically annotated and disambiguated (Hakkani-Tür et al., 2002), thus providing information about the word root, POS tag, number, and case. The vocabulary size is 67,510 (relatively large because Turkish is highly agglutinative). 400K words are used for training, 100K for development (11.8% OOVs), and 87K for testing (11.6% OOVs). The corpus was preprocessed by removing segmentation marks (titles and paragraph boundaries).</Paragraph> </Section> </Paper>
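As a companion illustration of the subset-selection strategies in Section 5, the sketch below derives a continuous representation for an unseen factor value by averaging selected rows of that factor type's block of M. It is not the authors' code: the function name unknown_factor_vector, the arguments M_block and counts, and the reading of the 10% TAIL threshold as cumulative token-frequency mass are assumptions.

# Minimal sketch (not the paper's code) of the NULL / ALL / TAIL / LEAST strategies.
import numpy as np

def unknown_factor_vector(M_block, counts, strategy="ALL", tail_mass=0.10):
    # M_block: rows of M for all known values of one factor type, shape (V_f, d).
    # counts:  training-set frequency of each known value, shape (V_f,).
    if strategy == "NULL":                 # feature components for unknown factors are 0
        return np.zeros(M_block.shape[1])
    if strategy == "ALL":                  # mean over all known values of this type
        return M_block.mean(axis=0)
    order = np.argsort(counts)             # least frequent first
    if strategy == "LEAST":                # single least frequent value
        return M_block[order[0]]
    if strategy == "TAIL":                 # least frequent values up to ~10% of token mass
        cum = np.cumsum(counts[order]) / counts.sum()
        tail = order[: max(1, int(np.searchsorted(cum, tail_mass)) + 1)]
        return M_block[tail].mean(axis=0)
    raise ValueError(strategy)

# Example: average the rows of words tagged as proper nouns to obtain a
# representation for an unseen proper noun (hypothetical data).
rng = np.random.default_rng(0)
proper_noun_rows = rng.normal(size=(200, 16))   # rows of M restricted to proper nouns
freqs = rng.integers(1, 100, size=200)
vec = unknown_factor_vector(proper_noun_rows, freqs, strategy="TAIL")
print(vec.shape)

Restricting M_block to the rows of words carrying a particular tag, as in the proper-noun example of Section 5, reproduces the averaging described there while leaving the output-side handling of OOVs unchanged.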