File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/p03-1066_metho.xml
Size: 24,160 bytes
Last Modified: 2025-10-06 14:08:19
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1066"> <Title>Unsupervised Learning of Dependency Structure for Language Modeling</Title> <Section position="4" start_page="0" end_page="4" type="metho"> <SectionTitle> 3 Dependency Language Model </SectionTitle> <Paragraph position="0"> The DLM attempts to generate the dependency structure incrementally while traversing the sentence left to right. It will assign a probability to every word sequence W and its dependency structure D. The probability assignment is based on an encoding of the (W, D) pair described below.</Paragraph> <Paragraph position="1"> Let W be a sentence of length n words to which we have prepended <s> and appended </s> so that</Paragraph> <Paragraph position="3"> = </s>. In principle, a language model recovers the probability of a sentence P(W) over all possible D given W by estimating the joint probability P(W, D): P(W) = [?] D P(W, D). In practice, we used the so-called maximum approximation where the sum is approximated by a single term Below we restrict the discussion to the most probable dependency structure of a given sentence, and simply use D to represent D * . In the remainder of this section, we first present a statistical dependency parser, which estimates the parsing probability at the word level, and generates D incrementally while traversing W left to right. Next, we describe the elements of the DLM that assign probability to each possible W and its most probable D, P(W, D). Finally, we present an EM-like iterative method for unsupervised learning of dependency structure.</Paragraph> <Section position="1" start_page="0" end_page="4" type="sub_section"> <SectionTitle> 3.1 Dependency parsing </SectionTitle> <Paragraph position="0"> The aim of dependency parsing is to find the most probable D of a given W by maximizing the probability P(D|W). Let D be a set of probabilistic dependencies d, i.e. d [?] D. Assuming that the dependencies are independent of each other, we have</Paragraph> <Paragraph position="2"> where P(d|W) is the dependency probability conditioned by a particular sentence.</Paragraph> <Paragraph position="3"> It is impossible to estimate P(d|W) directly because the same sentence is very unlikely to appear in both training and test data. We thus approximated P(d|W) by P(d), and estimated the dependency probability from the training corpus. Let d</Paragraph> <Paragraph position="5"> ) be the dependency The model in Equation (3) is not strictly probabilistic because it drops the probabilities of illegal dependencies (e.g., crossing dependencies).</Paragraph> <Paragraph position="6"> between w</Paragraph> <Paragraph position="8"> have a dependency relation in a sentence in training data, and C(w</Paragraph> <Paragraph position="10"> are seen in the same sentence. To deal with the data sparseness problem of MLE, we used the backoff estimation strategy similar to the one proposed in Collins (1996), which backs off to estimates that use less conditioning context. More specifically, we used the following three estimates: C=d .</Paragraph> <Paragraph position="11"> in which * indicates a wild-card matching any word. The final estimate E is given by linearly interpolating these estimates:</Paragraph> <Paragraph position="13"> are smoothing parameters.</Paragraph> <Paragraph position="14"> Given the above parsing model, we used an approximation parsing algorithm that is O(n</Paragraph> </Section> </Section> <Section position="5" start_page="4" end_page="23" type="metho"> <SectionTitle> ). Tradi- </SectionTitle> <Paragraph position="0"> tional techniques use an optimal Viterbi-style algorithm (e.g., bottom-up chart parser) that is O(n ).</Paragraph> <Paragraph position="1"> Although the approximation algorithm is not guaranteed to find the most probable D, we opted for it because it works in a left-to-right manner, and is very efficient and simple to implement. In our experiments, we found that the algorithm performs reasonably well on average, and its speed and simplicity make it a better choice in DLM training where we need to parse a large amount of training data iteratively, as described in Section 3.3. The parsing algorithm is a slightly modified version of that proposed in Yuret (1998). It reads a sentence left to right; after reading each new word For parsers that use bigram lexical dependencies, Eisner and Satta (1999) presents parsing algorithms that are ). We thank Joshua Goodman for pointing this out.</Paragraph> <Paragraph position="2"> w j , it tries to link w</Paragraph> <Paragraph position="4"> and push the generated dependency d ij into a stack.</Paragraph> <Paragraph position="5"> When a dependency crossing or a cycle is detected in the stack, the dependency with the lowest dependency probability in conflict is eliminated. The algorithm is outlined in Figures 2 and 3. ) is smaller than P(d</Paragraph> </Section> <Section position="6" start_page="23" end_page="24" type="metho"> <SectionTitle> ) and P(d </SectionTitle> <Paragraph position="0"> ), d is removed (represented as dotted line). (b) An example of a dependency crossing: given that P(d</Paragraph> <Paragraph position="2"> ), d is removed. Let the dependency probability be the measure of the strength of a dependency, i.e., higher probabilities mean stronger dependencies. Note that when a strong new dependency crosses multiple weak dependencies, the weak dependencies are removed even if the new dependency is weaker than the sum of the old dependencies.</Paragraph> <Paragraph position="3"> Although this action results in lower total probability, it was implemented because multiple weak dependencies connected to the beginning of the sentence often pre- null This operation leaves some headwords disconnected; in such a case, we assumed that each disconnected head-word has a dependency relation with its preceding headword.</Paragraph> <Paragraph position="4"> vented a strong meaningful dependency from being created. In this manner, the directional bias of the approximation algorithm was partially compensated for.</Paragraph> <Section position="1" start_page="24" end_page="24" type="sub_section"> <SectionTitle> 3.2 Language modeling </SectionTitle> <Paragraph position="0"> The DLM together with the dependency parser provides an encoding of the (W, D) pair into a sequence of elementary model actions. Each action conceptually consists of two stages. The first stage assigns a probability to the next word given the left context. The second stage updates the dependency structure given the new word using the parsing algorithm in Figure 2. The probability P(W, D) is calculated as:</Paragraph> <Paragraph position="2"> is a dependency structure containing only those dependencies whose two related words are included in the word (j-1)-prefix, W</Paragraph> <Paragraph position="4"> is the word to be predicted. D</Paragraph> <Paragraph position="6"> is the incremental dependency structure that generates D</Paragraph> <Paragraph position="8"> pendency structure built on top of D and the newly predicted word w j (see the for-loop of line 2 in Figure 2). p</Paragraph> <Paragraph position="10"> denotes the ith action of the parser at position j in the word string: to generate a new dependency d ij , and eliminate dependencies with the lowest dependency probability in conflict (see lines 4 - 7 in Figure 2). Ph is a function that maps the</Paragraph> <Paragraph position="12"> ) onto equivalence classes.</Paragraph> <Paragraph position="13"> The model in Equation (8) is unfortunately infeasible because it is extremely difficult to estimate the probability of p</Paragraph> <Paragraph position="15"> due to the large number of parameters in the conditional part. According to the parsing algorithm in Figure 2, the probability of Theoretically, we should arrive at the same dependency structure no matter whether we parse the sentence left to right or right to left. However, this is not the case with the approximation algorithm. This problem is called directional bias.</Paragraph> <Paragraph position="16"> each action p</Paragraph> <Paragraph position="18"> depends on the entire history (e.g.</Paragraph> <Paragraph position="19"> for detecting a dependency crossing or cycle), so any mapping Ph that limits the equivalence classification to less context suitable for model estimation would be very likely to drop critical conditional information for predicting p</Paragraph> <Paragraph position="21"> ). This approximation is probabilistically deficient, but our goal is to apply the DLM to a decoder in a realistic application, and the performance gain achieved by this approximation justifies the modeling decision.</Paragraph> <Paragraph position="22"> Now, we describe the way P(w</Paragraph> <Paragraph position="24"> estimated. As described in Section 2, headwords and function words play different syntactic and semantic roles capturing different types of dependency relations, so the prediction of them can better be done separately. Assuming that each word token can be uniquely classified as a headword or a function word in Japanese, the DLM can be conceived of as a cluster-based language model with two clusters, headword H and function word F. We can then define the conditional probability of w j based on its history as the product of two factors: the probability of the category given its history, and the probability of w j given its category. Let h be the headwords in (j-1)-prefix, i.e., containing only those headwords that are included in W</Paragraph> <Paragraph position="26"> ). The problem is to determine the mapping Ph so as to identify the related words in the left context that we would like to condition on. Based on the discussion in Section 2, we chose a mapping function that retains (1) two preceding words w</Paragraph> <Paragraph position="28"> is determined in two stages: First, the parser updates the dependency structure</Paragraph> <Paragraph position="30"> assuming that the next word is w j . Second, when there are multiple words that have dependency relations with w</Paragraph> <Paragraph position="32"> lected using the following decision rule:</Paragraph> <Paragraph position="34"> given its linguistic related word w</Paragraph> <Paragraph position="36"> We thus have the mapping function Ph(W</Paragraph> <Paragraph position="38"> ). The estimate of headword probability is an interpolation of three probabilities: are the interpolation weights optimized on held-out data.</Paragraph> <Paragraph position="39"> We now come back to the estimate of the other three probabilities in Equation (9). Following the work in Gao et al. (2002b), we used the unigram estimate for word category probabilities, (i.e., All conditional probabilities in Equation (13) are obtained using MLE on training data. In order to deal with the data sparseness problem, we used a backoff scheme (Katz, 1987) for parameter estimation. This backoff scheme recursively estimates the probability of an unseen n-gram by utilizing (n-1)-gram estimates. In particular, the probability of Equation (11) backs off to the estimate of where N is the total number of dependencies in training data, and C(w j , R) is the number of dependencies that contains w j . To keep the model size manageable, we removed all n-grams of count less than 2 from the headword bigram model and the word trigram model, but kept all long-distance dependency bigrams that occurred in the training data.</Paragraph> </Section> <Section position="2" start_page="24" end_page="24" type="sub_section"> <SectionTitle> 3.3 Training data creation </SectionTitle> <Paragraph position="0"> This section describes two methods that were used to tag raw text corpus for DLM training: (1) a method for headword detection, and (2) an unsupervised learning method for dependency structure acquisition.</Paragraph> <Paragraph position="1"> In order to classify a word uniquely as H or F, we used a mapping table created in the following way. We first assumed that the mapping from part-of-speech (POS) to word category is unique and fixed; we then used a POS-tagger to generate a POS-tagged corpus, which are then turned into a category-tagged corpus.</Paragraph> <Paragraph position="2"> Based on this corpus, we created a mapping table which maps each word to a unique category: when a word can be mapped to either H or F, we chose the more frequent category in the corpus. This method achieved a 98.5% accuracy of headword detection on the test data we used.</Paragraph> <Paragraph position="3"> Given a headword-tagged corpus, we then used an EM-like iterative method for joint optimization of the parsing model and the dependency structure of training data. This method uses the maximum likelihood principle, which is consistent with lan- null The tag set we used included 1,187 POS tags, of which 102 counted as headwords in our experiments. Since the POS-tagger does not identify phrases (bunsetsu), our implementation identifies multiple headwords in phrases headed by compounds.</Paragraph> <Paragraph position="4"> guage model training. There are three steps in the algorithm: (1) initialize, (2) (re-)parse the training corpus, and (3) re-estimate the parameters of the parsing model. Steps (2) and (3) are iterated until the improvement in the probability of training data is less than a threshold.</Paragraph> <Paragraph position="5"> Initialize: We set a window of size N and assumed that each headword pair within a headword N-gram constitutes an initial dependency. The optimal value of N is 3 in our experiments. That is, given a . From the initial dependencies, we computed an initial dependency parsing model by Equation (4).</Paragraph> <Paragraph position="6"> (Re-)parse the corpus: Given the parsing model, we used the parsing algorithm in Figure 2 to select the most probable dependency structure for each sentence in the training data. This provides an updated set of dependencies.</Paragraph> <Paragraph position="7"> Re-estimate the parameters of parsing model: We then re-estimated the parsing model parameters based on the updated dependency set.</Paragraph> </Section> </Section> <Section position="7" start_page="24" end_page="24" type="metho"> <SectionTitle> 4 Evaluation Methodology </SectionTitle> <Paragraph position="0"> In this study, we evaluated language models on the application of Japanese Kana-Kanji conversion, which is the standard method of inputting Japanese text by converting the text of a syllabary-based Kana string into the appropriate combination of Kanji and Kana. This is a similar problem to speech recognition, except that it does not include acoustic ambiguity. Performance on this task is measured in terms of the character error rate (CER), given by the number of characters wrongly converted from the phonetic string divided by the number of characters in the correct transcript.</Paragraph> <Paragraph position="1"> For our experiments, we used two newspaper corpora, Nikkei and Yomiuri Newspapers, both of which have been pre-word-segmented. We built language models from a 36-million-word subset of the Nikkei Newspaper corpus, performed parameter optimization on a 100,000-word subset of the Yomiuri Newspaper (held-out data), and tested our models on another 100,000-word subset of the Yomiuri Newspaper corpus. The lexicon we used contains 167,107 entries.</Paragraph> <Paragraph position="2"> Our evaluation was done within a framework of so-called &quot;N-best rescoring&quot; method, in which a list of hypotheses is generated by the baseline language model (a word trigram model in this study), which is then rescored using a more sophisticated language model. We use the N-best list of N=100,g2 whose &quot;oracle&quot; CER (i.e., the CER of the hypotheses with the minimum number of errors) is presented in Table 1, indicating the upper bound on performance. We also note in Table 1 that the performance of the conversion using the baseline tri-gram model is much better than the state-of-the-art performance currently available in the marketplace, presumably due to the large amount of training data we used, and to the similarity between the training and the test data.</Paragraph> <Section position="1" start_page="24" end_page="24" type="sub_section"> <SectionTitle> Baseline Trigram Oracle of 100-best </SectionTitle> <Paragraph position="0"/> </Section> </Section> <Section position="8" start_page="24" end_page="24" type="metho"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> The results of applying our models to the task of Japanese Kana-Kanji conversion are shown in</Paragraph> </Section> <Section position="9" start_page="24" end_page="24" type="metho"> <SectionTitle> HBM </SectionTitle> <Paragraph position="0"> stands for headword bigram model, which does not use any dependency structure (i.e. l = 1 in Equation (13)). DLM_1 is the DLM that does not use head-word bigram (i.e. l = 0 in Equation (13)). DLM_2 is the model where the headword probability is estimated by interpolating the word trigram probability, the headword bigram probability, and the probability given one previous linguistically related word in the dependency structure.</Paragraph> <Paragraph position="1"> )) and the parsing model probability can be combined through simple multiplication, some weighting is desirable in practice, especially when our parsing model is estimated using an approximation by the parsing score P(D|W). We therefore introduced a parsing model weight PW: both DLM_1 and DLM_2 models were built with and without PW. In Table 2, the PWprefix refers to the DLMs with PW = 0.5, and the DLMs without PW- prefix refers to DLMs with PW = 0. For both DLM_1 and DLM_2, models with the parsing weight achieve better performance; we For a detailed description of the baseline trigram model, see Gao et al. (2002a).</Paragraph> <Paragraph position="2"> therefore discuss only DLMs with the parsing weight for the rest of this section.</Paragraph> <Paragraph position="3"> By comparing both HBM and PW-LDM_1 models with the baseline model, we can see that the use of headword dependency contributes greatly to the CER reduction: HBM outperformed the baseline model by 8.8% in CER reduction, and PW-LDM_1 by 7.8%. By combining headword bigram and dependency structure, we obtained the best model PW-DLM_2 that achieves 11.3% CER reduction over the baseline. The improvement achieved by PW-DLM_2 over the HBM is statistically significant according to the t test (P<0.01). These results demonstrate the effectiveness of our parsing technique and the use of dependency structure for language modeling.</Paragraph> </Section> <Section position="10" start_page="24" end_page="24" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> In this section, we relate our model to previous research and discuss several factors that we believe to have the most significant impact on the performance of DLM. The discussion includes: (1) the use of DLM as a parser, (2) the definition of the mapping function Ph, and (3) the method of unsupervised dependency structure acquisition.</Paragraph> <Paragraph position="1"> One basic approach to using linguistic structure for language modeling is to extend the conventional language model P(W) to P(W, T), where T is a parse tree of W. The extended model can then be used as a parser to select the most likely parse by T</Paragraph> <Paragraph position="3"> P(W, T). Many recent studies (e.g., Chelba and Jelinek, 2000; Charniak, 2001; Roark, 2001) adopt this approach. Similarly, dependency-based models (e.g., Collins, 1996; Chelba et al., 1997) use a dependency structure D of W instead of a parse tree T, where D is extracted from syntactic trees.</Paragraph> <Paragraph position="4"> Both of these models can be called grammar-based models, in that they capture the syntactic structure of a sentence, and the model parameters are estimated from syntactically annotated corpora such as the Penn Treebank. DLM, on the other hand, is a non-grammar-based model, because it is not based on any syntactic annotation: the dependency structure used in language modeling was learned directly from data in an unsupervised manner, subject to two weak syntactic constraints (i.e., dependency structure is acyclic and planar).</Paragraph> <Paragraph position="5"> This resulted in capturing the dependency relations that are not precisely syntactic in nature within our model. For example, in the conversion of the string below, the word g6469 ban 'evening' was correctly predicted in DLM by using the long-distance bigram g6586~g6469 asa~ban 'morning~evening', even though these two words are not in any direct syntactic dependency relationship: g2597g5096g2025g10367g3037g2025g3304g2018g6586g1940g5936g9244g2061g2708g1993g1940g6469g2022g4238g3668g2061g11541g1985 'asks for instructions in the morning and submits daily reports in the evening' Though there is no doubt that syntactic dependency relations provide useful information for language modeling, the most linguistically related word in the previous context may come in various linguistic relations with the word being predicted, not limited to syntactic dependency. This opens up new possibilities for exploring the combination of different knowledge sources in language modeling.</Paragraph> <Paragraph position="6"> Regarding the function Ph that maps the left context onto equivalence classes, we used a simple approximation that takes into account only one linguistically related word in left context. An alternative is to use the maximum entropy (ME) approach (Rosenfeld, 1994; Chelba et al., 1997).</Paragraph> <Paragraph position="7"> Although ME models provide a nice framework for incorporating arbitrary knowledge sources that can be encoded as a large set of constraints, training and using ME models is extremely computationally expensive. Our working hypothesis is that the information for predicting the new word is dominated by a very limited set of words which can be selected heuristically: in this paper, Ph is defined as a heuristic function that maps D to one word in D that has the strongest linguistic relation with the word being predicted, as in (8). This hypothesis is borne out by In this sense, our model is an extension of a dependency-based model proposed in Yuret (1998). However, this work has not been evaluated as a language model with error rate reduction.</Paragraph> <Paragraph position="8"> an additional experiment we conducted, where we used two words from D that had the strongest relation with the word being predicted; this resulted in a very limited gain in CER reduction of 0.62%, which is not statistically significant (P>0.05 according to the t test).</Paragraph> <Paragraph position="9"> The EM-like method for learning dependency relations described in Section 3.3 has also been applied to other tasks such as hidden Markov model training (Rabiner, 1989), syntactic relation learning (Yuret, 1998), and Chinese word segmentation (Gao et al., 2002a). In applying this method, two factors need to be considered: (1) how to initialize the model (i.e. the value of the window size N), and (2) the number of iterations. We investigated the impact of these two factors empirically on the CER of Japanese Kana-Kanji conversion. We built a series of DLMs using different window size N and different number of iterations. Some sample results are shown in Table 3: the improvement in CER begins to saturate at the second iteration. We also find that a larger N results in a better initial model but makes the following iterations less effective.</Paragraph> <Paragraph position="10"> The possible reason is that a larger N generates more initial dependencies and would lead to a better initial model, but it also introduces noise that prevents the initial model from being improved. All DLMs in Table 2 are initialized with N = 3 and are run for two iterations.</Paragraph> <Paragraph position="11"> ferent window size N, for 0-3 iterations</Paragraph> </Section> class="xml-element"></Paper>