<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1025"> <Title>A Study on Richer Syntactic Dependencies for Structured Language Modeling</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 SLM Review </SectionTitle> <Paragraph position="0"> An extensive presentation of the SLM can be found in (Chelba and Jelinek, 2000). The model assigns a probability a0a2a1a4a3a6a5a8a7a10a9 to every sentence a3 and every possible binary parse a7 . The terminals of a7 are the words of a3 with POS tags, and the nodes of a7 are annotated with phrase headwords and non-terminal labels. Let a3 be a sentence of length a11</Paragraph> <Paragraph position="2"> Let a3a25a24 a16a26a13 a14a28a27a29a27a29a27 a13 a24 be the word a12 -prefix of the sentence -- the words from the beginning of the sentence up to the current position a12 -- and a3a30a24a31a7a32a24 the word-parse a12 -prefix. Figure 1 shows a word-parse a12 -prefix; h_0, .., h_{-m} are the exposed heads, each head being a pair (headword, non-terminal label), or (word, POS tag) in the case of a root-only tree. The exposed heads at a given position a12 in the input sentence are a function of the word-parse a12 -prefix.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Probabilistic Model </SectionTitle> <Paragraph position="0"> The joint probability a0a2a1a4a3a6a5a8a7a10a9 of a word sequence a3 and a complete parse a7 can be broken up into: a79a87a81 is the number of operations the CONSTRUCTOR executes at sentence position a12 before ...............</Paragraph> <Paragraph position="2"/> <Paragraph position="4"> carried out at position k in the word string; the operations performed by the CONSTRUCTOR are illustrated in Figures 2-3 and they ensure that all possible binary branching parses, with all possible head-word and non-terminal label assignments for the a13 a22a92a27a29a27a29a27 a13 a24 word sequence, can be generated. The</Paragraph> <Paragraph position="6"> a word-parse a12 -prefix.</Paragraph> <Paragraph position="7"> The SLM is based on three probabilities, each estimated using deleted interpolation and parameterized (approximated) as follows:</Paragraph> <Paragraph position="9"> It is worth noting that if the binary branching structure developed by the parser were always right-branching and we mapped the POS tag and non-terminal label vocabularies to a single type, then our model would be equivalent to a trigram language model. Since the number of parses for a given word prefix a3a6a24 grows exponentially with a12 ,</Paragraph> <Paragraph position="11"> a9 , the state space of our model is huge even for relatively short sentences, so we have to use a search strategy that prunes it. One choice is a synchronous multi-stack search algorithm which is very similar to a beam search.</Paragraph> <Paragraph position="12"> The language model probability assignment for the word at position a12a1a0 a81 in the input sentence is which ensures a proper probability normalization over strings a3a16a15 , where a17 a24 is the set of all parses present in our stacks at the current stage a12 . Each model component --WORD-PREDICTOR, TAGGER, CONSTRUCTOR-- is initialized from a set of parsed sentences after undergoing headword percolation and binarization, see Section 2.2. 
<Paragraph position="12"> An N-best EM (Dempster et al., 1977) variant is then employed to jointly re-estimate the model parameters such that the PPL on the training data is decreased, i.e. the likelihood of the training data under our model is increased. The reduction in PPL is shown experimentally to carry over to the test data.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Headword Percolation And Binarization </SectionTitle> <Paragraph position="0"> As explained in the previous section, the SLM is initialized on parse trees that have been binarized and whose non-terminal (NT) tags have been enriched with headwords. We briefly review the headword percolation and binarization procedures here; they are explained in detail in (Chelba and Jelinek, 2000).</Paragraph> <Paragraph position="1"> The position of the headword within a constituent -- equivalent to a context-free production of the type $Z \rightarrow Y_1 \ldots Y_n$, where $Z$ and $Y_1, \ldots, Y_n$ are NT labels or POS tags (POS tags only for the $Y_i$) -- is specified using a rule-based approach.</Paragraph> <Paragraph position="4"> Assuming that the index of the headword on the right-hand side of the rule is $k$, we binarize the constituent as follows: depending on the identity of $Z$ we apply one of the two binarization schemes in Figure 4. The intermediate nodes created by these binarization schemes receive the NT label $Z'$. The choice between the two schemes is made according to a list of rules based on the identity of the label on the left-hand side of the CF rewrite rule.</Paragraph> </Section> </Section> <Section start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Enriching Syntactic Dependencies </SectionTitle> <Paragraph position="0"> Since the SLM parses strictly left to right, its probabilities are conditioned on the left contextual information. There are two main reasons we prefer strict left-to-right parsers for the purpose of language modeling (Roark, 2001). First, when looking for the most likely word string given the acoustic signal (as required in a speech recognizer), the search space is organized as a prefix tree; a language model whose aim is to guide the search must thus operate left to right. Second, previous results (Chelba and Jelinek, 2000; Charniak, 2001; Roark, 2001) show that a grammar-based language model benefits from interpolation with a 3-gram model, and strict left-to-right parsing makes it easy to combine the two at the word level (Chelba and Jelinek, 2000; Roark, 2001) rather than at the sentence level (Charniak, 2001).</Paragraph> <Paragraph position="1"> For these reasons, we prefer enriching the syntactic dependencies with information from the left context. As mentioned in (Roark, 2001), one way of conditioning the probabilities is to annotate the extra conditioning information onto the node labels in the parse tree: we can annotate the training corpus with richer information and, with the same SLM training procedure, estimate the probabilities under the richer syntactic tags. Since the treebank parses allow us to annotate parent information onto the constituents, as Johnson did in (Johnson, 1998), this richer predictive annotation can extend the information slightly beyond the left context. Under the equivalence classifications in Eqs. (2), (3) and (4), the conditional information available to the SLM model components is made up of the two most recent exposed heads, consisting of two NT tags and two headwords (see the sketch below).</Paragraph>
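<Paragraph> As a concrete illustration of this equivalence classification, the following Python sketch (ours, with hypothetical names; not the SLM code) builds the conditioning contexts of Eqs. (2)-(4) from the list of exposed heads.
```python
# Each model component sees only the two most recent exposed heads h_0 and h_{-1},
# i.e. two (headword, NT/POS tag) pairs; the TAGGER additionally sees the word
# being tagged but only the tags of the two heads.
from typing import List, Tuple

Head = Tuple[str, str]  # (headword, NT label or POS tag)

def predictor_context(exposed_heads: List[Head]) -> Tuple[str, str, str, str]:
    """Context for the WORD-PREDICTOR and the CONSTRUCTOR, Eqs. (2) and (4)."""
    (w0, t0), (w1, t1) = exposed_heads[0], exposed_heads[1]
    return (w0, t0, w1, t1)

def tagger_context(word: str, exposed_heads: List[Head]) -> Tuple[str, str, str]:
    """Context for the TAGGER, Eq. (3): the current word plus the two head tags."""
    return (word, exposed_heads[0][1], exposed_heads[1][1])
```
</Paragraph>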
<Paragraph position="2"> In an attempt to extend the syntactic dependencies beyond this level, we enrich the non-terminal tag of a node in the binarized parse tree with the NT tag of the parent node, or with the NT tag of the child node from which the headword is not being percolated (as in (Chelba and Xu, 2001)), or we add the NT tag of the third most recent exposed head to the history of the CONSTRUCTOR component. The three schemes are briefly described as follows:
1. opposite (OP): we enrich the current node with the non-terminal tag of the child node from which the headword is not being percolated;
2. parent (PA): we enrich the current node with the non-terminal tag of the parent node;
3. h-2: we enrich the conditioning information of the CONSTRUCTOR with the non-terminal tag of the third most recent exposed head, but not the headword itself. Consequently, Eq. (4) becomes
$$P(p_i^k \mid W_k T_k) \approx P(p_i^k \mid h_0, h_{-1}, h_{-2}.\mathrm{tag}).$$
</Paragraph> <Paragraph position="9"> We take the example from (Chelba and Xu, 2001) to illustrate our enrichment approaches. Assume that after binarization and headword percolation we have a noun phrase constituent (the example tree and its enriched version are omitted here); in the enriched tree, the annotation added under the parent scheme specifies the NT tag of the parent of the NP node. A given binarized tree is traversed recursively in depth-first order and each constituent is enriched in the parent or opposite manner, or both (a simplified sketch of this traversal is given at the end of this section). Then, from the resulting parse trees, all three components of the SLM are initialized and N-best EM training can be started.</Paragraph> <Paragraph position="10"> Notice that both parent and opposite affect all three components of the SLM, since they change the NT/POS vocabularies, whereas h-2 only affects the CONSTRUCTOR component. So we believe that if h-2 helps in reducing PPL and WER, it is because we have thereby obtained a better parser. We should also note the difference between parent and opposite in the bottom-up parser. In the opposite scheme, POS (part-of-speech) tags are not enriched. As we parse the sentence, the two most recent exposed heads are adjoined under some enriched NT label (Figures 2-3); this NT label has to match the NT tag of the child node from which the headword is not being percolated. Since the NT tags of the children are already known at that moment, the opposite scheme effectively restricts the possible NT labels. In the parent scheme, POS tags are also enriched with the NT tag of the parent node. When a POS tag is predicted by the TAGGER, both the POS tag and the NT tag of the parent node are in effect hypothesized. Then, when the two most recent exposed heads are adjoined under some enriched NT label, that label has to match the parent NT information carried by both exposed heads; in other words, if the two exposed heads bear different information about their parents, they can never be adjoined. Since this restriction on the adjoin moves is very tight, pruning may delete some or all of the good parsing hypotheses early, and the net result may be the later development of inadequate parses, which leads to poor language modeling and poor parsing performance.</Paragraph> <Paragraph position="11"> Since the SLM parses sentences bottom-up while the parsers used in (Charniak, 2000), (Charniak, 2001) and (Roark, 2001) are top-down, it is not clear how to establish a direct correspondence between our schemes for enriching the dependency structure and the ones employed by those parsers. However, it is their &quot;pick-and-choose&quot; strategy that inspired our study of richer syntactic dependencies for the SLM.</Paragraph>
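<Paragraph> To summarize the parent and opposite schemes operationally, here is a simplified Python sketch of the enrichment traversal (ours, not the actual SLM preprocessing; the node fields and the treatment of the parent label are our assumptions). The h-2 scheme does not modify the tree at all: it only appends the tag of the third most recent exposed head to the CONSTRUCTOR context of Eq. (4).
```python
# Depth-first enrichment of a binarized, headword-percolated tree.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    label: str              # NT label, or POS tag at a preterminal
    children: List["Node"]  # two children after binarization; empty at a preterminal
    head_child: int = 0     # index of the child the headword percolates from

def enrich(node: Node, parent_label: Optional[str] = None,
           use_parent: bool = False, use_opposite: bool = False) -> None:
    if not node.children:
        # Preterminal: POS tags are enriched in the parent scheme only.
        if use_parent and parent_label is not None:
            node.label += "+" + parent_label
        return
    original = node.label   # pass the un-enriched label down (a simplifying assumption)
    if use_opposite and len(node.children) == 2:
        # opposite (OP): tag of the child the headword is NOT percolated from
        node.label += "+" + node.children[1 - node.head_child].label
    if use_parent and parent_label is not None:
        # parent (PA): tag of the parent node
        node.label += "+" + parent_label
    for child in node.children:
        enrich(child, original, use_parent, use_opposite)
```
</Paragraph>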
</Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> With the three enrichment schemes described in Section 3 and their combinations, we evaluated the PPL performance of the resulting seven models on the UPenn Treebank and the WER performance on the WSJ setup, respectively. In order to see the correspondence between parsing accuracy and PPL/WER performance, we also evaluated the labeled precision and recall statistics (LP/LR, the standard parsing accuracy measures) on the UPenn Treebank corpus. For every model component in our experiments, deleted interpolation was used for smoothing, with the interpolation weights estimated from separate held-out data. For example, in the UPenn Treebank setup, we used sections 00-20 as training data, sections 21-22 as held-out data, and sections 23-24 as test data.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Perplexity </SectionTitle> <Paragraph position="0"> We have evaluated the perplexity of the seven different models resulting from applying parent, opposite, h-2 and their combinations. For each way of initializing the SLM we have performed 3 iterations of N-best EM training. The SLM is interpolated with a 3-gram model, built from exactly the same training data and word vocabulary, using a fixed interpolation weight. As we mentioned in Section 3, the NT/POS vocabularies for the seven models are different because of the enrichment of the NT/POS tags.</Paragraph> <Paragraph position="1"> Table 1 shows the actual vocabulary size we used for each model (for the parser, the vocabulary is the list of all possible parser operations). The baseline model is the standard SLM as described in (Chelba and Jelinek, 2000).</Paragraph> <Paragraph position="2"> The PPL results are summarized in Table 2. The SLM is interpolated with a 3-gram model as shown in the equation
$$P(w_{k+1} \mid W_k) = \lambda \cdot P_{\mathrm{3gram}}(w_{k+1} \mid w_k, w_{k-1}) + (1 - \lambda) \cdot P_{\mathrm{SLM}}(w_{k+1} \mid W_k).$$
We should note that the PPL of the 3-gram model is 166.6. As we can see from the table, without interpolating with the 3-gram, the opposite scheme performed the best, reducing the PPL of the baseline SLM by almost 5% relative. When the SLM is interpolated with the 3-gram, the h-2+opposite+parent scheme performed the best, reducing the PPL of the baseline SLM by 3.3%. However, the parent and opposite+parent schemes are both worse than the baseline, especially before the EM training and with $\lambda=0.0$. We will discuss the results further in Section 4.4.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Parsing Accuracy Evaluation </SectionTitle> <Paragraph position="0"> Table 3 shows the labeled precision/recall accuracy results. The labeled precision/recall results of our model are much worse than those reported in (Charniak, 2001) and (Roark, 2001). One of the reasons is that the SLM was not aimed at being a parser, but rather a language model. Therefore, in the search algorithm, the end-of-sentence symbol can be predicted before the parse of the sentence is ready for completion (a parse is ready for completion when, at the end of the sentence, there are exactly two exposed headwords, the first being the start-of-sentence symbol and the second an ordinary word), thus completing the parse with a series of special CONSTRUCTOR moves; see (Chelba and Jelinek, 2000) for details about these special rules. The SLM also allows right-branching parses which are not seen in the UPenn Treebank corpus, and thus the evaluation against the UPenn Treebank is inherently biased.</Paragraph>
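<Paragraph> For reference, labeled precision/recall can be computed from the sets of labeled constituent spans of the hypothesized and treebank parses. The following is a bare-bones Python sketch (ours, not the scoring tool used in these experiments) that ignores the usual PARSEVAL preprocessing details such as punctuation handling.
```python
from collections import Counter
from typing import Iterable, Tuple

Span = Tuple[str, int, int]  # (label, start, end) of a constituent

def labeled_precision_recall(gold: Iterable[Span], test: Iterable[Span]) -> Tuple[float, float]:
    g, t = Counter(gold), Counter(test)
    matched = sum(min(g[span], count) for span, count in t.items())  # labeled matches
    precision = matched / sum(t.values()) if t else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    return precision, recall
```
</Paragraph>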
<Paragraph position="1"> It can also be seen that both the LP and the LR dropped after 3 training iterations: the N-best EM variant used for SLM training increases the likelihood of the training data, but it cannot guarantee an increase in LP/LR, since the re-estimation algorithm does not explicitly use parsing accuracy as a criterion.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 N-best Re-scoring Results </SectionTitle> <Paragraph position="0"> To test our enrichment schemes in the context of speech recognition, we evaluated the seven models in the WSJ DARPA'93 HUB1 test setup. The same setup was also used in (Roark, 2001), (Chelba and Jelinek, 2000) and (Chelba and Xu, 2001). The size of the test set is 213 utterances, 3446 words. The 20k-word open vocabulary and the baseline 3-gram model are the standard ones provided by NIST and LDC; see (Chelba and Jelinek, 2000) for details. The lattices and N-best lists were generated using the standard 3-gram model trained on 45M words of WSJ.</Paragraph> <Paragraph position="1"> The N-best size was at most 50 for each utterance, and the average size was about 23. The SLM was trained on 20M words of WSJ text automatically parsed using the parser in (Ratnaparkhi, 1997), binarized and enriched with headwords and NT/POS tag information as explained in Section 2.2 and Section 3. Because SLM training on the 20M words of WSJ text is very expensive, especially after enriching the NT/POS tags, we only evaluated the WER performance of the seven models with initial statistics from the binarized and enriched parse trees. The results are shown in Table 4. The table shows not only the results for different interpolation weights $\lambda$, but also the results corresponding to $\lambda^*$, a virtual interpolation weight. We split the test data into two parts, A and B. The best interpolation weight, estimated on part A, was used to decode part B, and vice versa. We finally put the decoding results of the two parts together to get the final decoding output. The interpolation weight $\lambda^*$ is virtual because the best interpolation weights for the two parts might be different. Ideally, $\lambda$ should be estimated on separate held-out data and then applied to the test data; however, since we have a small number of N-best lists, our approach should give a good estimate of the WER under the ideal interpolation weight (a short sketch of this procedure is given at the end of this subsection).</Paragraph> <Paragraph position="3"> As can be seen, the h-2+opposite scheme achieved the best WER result, with a 0.5% absolute reduction over the performance of the opposite scheme. Overall, the enriched SLM achieves a 10% relative reduction in WER over the 3-gram model baseline result ($\lambda = 1.0$).</Paragraph> <Paragraph position="4"> The SLM enriched with the h-2+opposite scheme outperformed the 3-gram used to generate the lattices and N-best lists, without being interpolated with the 3-gram model. Although the N-best lists are already highly restricted by the 3-gram model during the first recognition pass, this fact still shows the potential of a good grammar-based language model.</Paragraph> <Paragraph position="5"> In particular, we should note that the SLM was trained on 20M words of WSJ while the lattice 3-gram was trained on 45M words of WSJ. However, our results are not indicative of the performance of the SLM as a first-pass language model.</Paragraph>
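<Paragraph> The virtual interpolation weight can be summarized by the following Python sketch (ours; the function rescore and the weight grid are hypothetical): the weight that gives the lowest WER on one half of the test N-best lists is used to rescore the other half, and the two decoded halves are then pooled.
```python
def virtual_lambda_wer(part_a, part_b, rescore, grid=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """rescore(part, lam) -> (word_errors, word_count) after N-best rescoring
    with interpolation weight lam; returns the pooled WER at the virtual weight."""
    def wer(part, lam):
        errors, words = rescore(part, lam)
        return errors / words
    lam_for_b = min(grid, key=lambda lam: wer(part_a, lam))  # weight tuned on part A
    lam_for_a = min(grid, key=lambda lam: wer(part_b, lam))  # weight tuned on part B
    e_a, n_a = rescore(part_a, lam_for_a)                    # decode A with B's best weight
    e_b, n_b = rescore(part_b, lam_for_b)                    # decode B with A's best weight
    return (e_a + e_b) / (n_a + n_b)                         # pooled WER ("lambda*")
```
</Paragraph>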
</Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Discussion </SectionTitle> <Paragraph position="0"> By enriching the syntactic dependencies, we expect the resulting models to be more accurate and thus to give better PPL results. However, Table 2 shows that this is not always the case: the parent and opposite+parent schemes are worse than the baseline at the first iteration when $\lambda=0.0$, and the h-2+parent and h-2+opposite+parent schemes are also worse than the h-2 scheme at the first iteration when $\lambda=0.0$.</Paragraph> <Paragraph position="1"> Why wouldn't more information help? There are two possible reasons that come to mind:
1. Since the size of our training data is small (1M words), the data sparseness problem (over-parameterization) is more serious for the more complicated dependency structures. We can see the problem from Table 1: the NT/POS vocabularies grow much bigger as we enrich the NT/POS tags.
2. As mentioned in Section 3, a potential problem of enriching the NT/POS tags in the parent scheme is that pruning may delete some hypotheses early, and the search may not recover from those early mistakes. The result is a high parsing error and thus a worse language model.</Paragraph> <Paragraph position="2"> In order to validate the first hypothesis, we evaluated the training data PPL for each model scheme. As can be seen from Table 5, over-parameterization is indeed a problem. From scheme h-2 to h-2+opposite+parent, as we add more information to the conditioning context, the training data PPL decreases. The test data PPL in Table 2 does not follow this trend, which is a clear sign of over-parameterization.</Paragraph> <Paragraph position="3"> Over-parameterization might also occur for parent and opposite+parent, but it alone cannot explain the high training data PPL of these two schemes. The LP/LR results in Table 3 show that poor parsing accuracy also plays a role in these situations: the labeled recall results of parent and opposite+parent are much worse than those of the baseline and the other schemes. The end-of-sentence parse completion strategy employed by the SLM is responsible for the high-precision/low-recall operation of the parent and opposite+parent models. Adding h-2 remedies the parsing performance of the SLM in this situation, but not sufficiently.</Paragraph> <Paragraph position="4"> It is very interesting to note that labeled recall and language model performance (WER/PPL) are well correlated. Figure 5 compares PPL, WER ($\lambda=0.0$, at training iteration 0) and labeled precision/recall error (100-LP/LR) for all models. Overall, the labeled recall is well correlated with the WER and PPL values.
Our results suggest that improvements in parsing accuracy can be expected to lead to improvements in WER.</Paragraph> <Paragraph position="5"> Finally, in comparison with the language model in (Roark, 2001), which is based on a probabilistic top-down parser, and with the Bihead/Trihead language models in (Charniak, 2001), which are based on immediate-head parsing, our enriched models are less effective in reducing the test data PPL: the best PPL result of (Roark, 2001) on the same experimental setup is 137.3, and the best PPL result of (Charniak, 2001) is 126.1. We believe that examining the differences between the SLM and these models could help in understanding this performance gap:
1. The parser in (Roark, 2001) uses a &quot;pick-and-choose&quot; strategy for the conditioning information used in the probability models. This allows the parser to choose the conditioning information depending on the constituent that is being expanded. The SLM, on the other hand, always uses the same dependency structure, decided beforehand.
2. The parser in (Charniak, 2001) is not a strict left-to-right parser. Since it is top-down, it is able to use the immediate head of a constituent before the head occurs, whereas this immediate head is not available for conditioning by a strict left-to-right parser such as the SLM. Consequently, the interpolation with the 3-gram model is done at the sentence level, which is weaker than interpolating at the word level.</Paragraph> <Paragraph position="6"> Since the WER results in (Roark, 2001) are based on less training data (2.2M words in total), we do not have a fair comparison between our best model and Roark's model.</Paragraph> </Section> </Section> </Paper>