<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1031"> <Title>The SuperARV Language Model: Investigating the Effectiveness of Tightly Integrating Multiple Knowledge Sources</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 SuperARV Language Model </SectionTitle> <Paragraph position="0"> The SuperARV LM is a highly lexicalized probabilistic LM based on the Constraint Dependency Grammar (CDG) (Harper and Helzerman, 1995). CDG represents a parse as assignments of dependency relations to functional variables (denoted roles) associated with each word in a sentence. Consider the parse for What did you learn depicted in the white box of Figure 1. Each word in the parse has a lexical category and a set of feature values. Also, each word has a governor role (denoted G) which is assigned a role value, comprised of a label as well as a modifiee, which indicates the position of the word's governor or head. For example, the role value assigned to the governor role of did is vp-1, where its label vp indicates its grammatical function and its modifiee 1 is the position of its head what. The need roles (denoted N1, N2, and N3) are used to ensure the grammatical requirements (e.g., subcategorization) of a word are met, as in the case of the verb did, which needs a subject and a base form verb (but since the word takes no other complements, the modifiee of the role value assigned to N3 is set equal to its own position).</Paragraph> <Paragraph position="2"> (Figure 1 depicts the parse, an ARV, an ARVP, and the SuperARV of the word did in the sentence What did you learn. The ARV shown for vp-1 assigned to G for did is: cat1=verb, subcat=base, verbtype=past, voice=active, inverted=yes, type=none, gapp=yes, mood=whquestion, semtype=auxiliary, agr=all, role1=G, label1=vp, (PX1>MX1); an ARVP is also shown for vp-1 assigned to G for did and subj-2 assigned to G for you. Note: G represents the governor role; the need roles, Need1, Need2, and Need3, are used to ensure that the grammatical requirements of the word are met; PX and MX([R]) represent the position of a word and its modifiee (for role R), respectively.)</Paragraph> <Paragraph position="3"> Including need roles also provides a mechanism for using non-headword dependencies to constrain parse structures, which Bod (2001) has shown contributes to improved parsing accuracy.</Paragraph> <Paragraph position="4"> During parsing, the grammaticality of a sentence in a language defined by a CDG is determined by applying a set of constraints to the possible role value assignments (Harper and Helzerman, 1995; Maruyama, 1990). Originally, the constraints were comprised of a set of hand-written rules specifying which role values (unary constraints) and pairs of role values (binary constraints) were grammatical (Maruyama, 1990). In order to derive the constraints directly from CDG annotated sentences, we have developed an algorithm to extract grammar relations using information derived directly from annotated sentences (Harper et al., 2000; Harper and Wang, 2001). Using the relationship between a role value's position and its modifiee's position, unary and binary constraints can be represented as a finite set of abstract role values (ARVs) and abstract role value pairs (ARVPs), respectively. The light gray box of Figure 1 shows an example of an ARV and an ARVP.</Paragraph>
<Paragraph position="5"> The ARV for the governor role value of did indicates its lexical category, lexical features, role, label, and positional relation information. (PX1 > MX1) indicates that did is governed by a word that precedes it. Note that the constraints of a CDG can be extracted from a corpus of parsed sentences.</Paragraph> <Paragraph position="6"> A super abstract role value (SuperARV) is an abstraction of the joint assignment of dependencies for a word, which provides a mechanism for lexicalizing CDG parse rules. The dark gray box of Figure 1 presents an example of a SuperARV for the word did.</Paragraph> <Paragraph position="7"> The SuperARV structure provides an explicit way to organize information concerning one consistent set of dependency links for a word that can be directly derived from its parse assignments. SuperARVs encode lexical information as well as syntactic and semantic constraints in a uniform representation that is much more fine-grained than POS. A SuperARV can be thought of as providing admissibility constraints on syntactic and lexical environments in which a word may be used.</Paragraph> <Paragraph position="8"> A SuperARV is formally defined as a four-tuple for a word, (C, F, (R, L, UC, MC)+, DC), where C is the lexical category of the word, F = {Fname_i = Fvalue_i, i = 1, ..., f} is its set of lexical features (each Fname_i is the name of a lexical feature and Fvalue_i is its corresponding value), (R, L, UC, MC)+ is a list of one or more four-tuples, each representing an abstraction of a role value assignment, where R is a role variable, L is a functionality label, UC represents the relative position relation of a word and its dependent, MC is the lexical category of the modifiee for this dependency relation, and DC represents the relative ordering of the positions of a word and all of its modifiees. The following features are used in our SuperARV LM: agr, case, vtype (e.g., progressive), mood, gapp (e.g., gap or not), inverted, voice, behavior (e.g., mass, count), type (e.g., interrogative, relative). These lexical features constitute a much richer set than the features used by the parser-based LMs in Section 1.</Paragraph> <Paragraph position="11"> Since Harper et al. (1999) found that enforcing modifiee constraints (e.g., the lexical categories of modifiees) in parsing results in efficient pruning, we also include the modifiee lexical category (MC) in our SuperARV structure to impose modifiee constraints.</Paragraph> <Paragraph position="12"> Words typically have more than one SuperARV to indicate different types of word usage. The average number of SuperARVs for words of different lexical categories varies, with verbs having the greatest SuperARV ambiguity. This is mostly due to the variety of feature combinations and variations on complement types and positions. We have observed in several experiments that the number of SuperARVs does not grow significantly as training set size increases; the moderate-sized Resource Management corpus (Price et al., 1988) with 25,168 words produces 328 SuperARVs, compared to 538 SuperARVs for the 1 million word Wall Street Journal (WSJ) Penn Treebank set (Marcus et al., 1993), and 791 for the 37 million word training set of the WSJ continuous speech recognition task.</Paragraph> <Paragraph position="13"> SuperARVs can be accumulated from a corpus annotated with CDG relations and stored directly with words in a lexicon, so we can learn their frequency of occurrence for the corresponding word. A SuperARV can then be selected from the lexicon and used to generate role values that meet their constraints.</Paragraph>
<Paragraph position="14"> Since there are no large benchmark corpora annotated with CDG information, we have developed a methodology to automatically transform constituent bracketing found in available treebanks into CDG annotations. In addition to generating dependency structures by headword percolation (Chelba, 2000), our transformer also utilizes a rule-based method to determine lexical features and need role values for words, as described by Wang et al. (2001).</Paragraph> <Paragraph position="15"> Our SuperARV LM estimates the joint probability of words w_1, ..., w_N and their SuperARV tags t_1, ..., t_N as: P(w_1 t_1 ... w_N t_N) = prod_{i=1}^{N} P(w_i t_i | w_1 t_1 ... w_{i-1} t_{i-1}) (3)</Paragraph> <Paragraph position="17"> Notice we use a joint probabilistic model to enable the joint prediction of words and their SuperARVs so that word form information is tightly integrated at the model level. Our SuperARV LM does not encode the word identity directly at the data structure level as was done in (Galescu and Ringger, 1999) since this could cause serious data sparsity problems.</Paragraph> <Paragraph position="18"> To estimate the probability distributions in Equation (3) from training data, we use recursive linear interpolation among probability estimations of different orders: each multiplicand in Equation (3) is represented as a conditional probability whose smoothed estimate interpolates the order-n n-gram maximum likelihood estimation with the corresponding lower-order estimate, recursing down to the lowest order.</Paragraph> <Paragraph position="19"> Table 1 enumerates the n-grams and their order for the interpolation smoothing of the two distributions in Equation (3). The ordering was based on our hypothesis that n-grams with more fine-grained history information should be ranked higher in the n-gram list since that information should be more helpful for discerning words and SuperARVs based on their history. The SuperARV LM hypothesizes categories for out-of-vocabulary words using the leave-one-out technique (Niesler and Woodland, 1996).</Paragraph> <Paragraph position="22"> In preliminary experiments, we compared several algorithms for smoothing the probability estimations for our SuperARV LM. The best performance was achieved by using the modified Kneser-Ney smoothing algorithm initially introduced in (Chen and Goodman, 1998) and adapting it by employing a heldout data set to optimize parameters, including cutoffs for rare n-grams, by using Powell's search (Press et al., 1988). Parameters are chosen to optimize the perplexity on a heldout set.</Paragraph> <Paragraph position="23"> In order to compare our SuperARV LM with a word-based LM, we must use the following equation to calculate the word perplexity (PPL): P(w_1 ... w_N) = sum over all SuperARV tag sequences t_1 ... t_N of P(w_1 t_1 ... w_N t_N), with PPL = P(w_1 ... w_N)^(-1/N) (4)</Paragraph> <Paragraph position="25"> Equation (4) is used by class-based LMs to calculate word perplexity (Heeman, 1998). Parser-based LMs use a similar procedure that sums over parses.</Paragraph> <Paragraph position="26"> The SuperARV LM is most closely related to the almost-parsing-based LM developed by Srinivas (1997). Srinivas' LM, based on the notion of a supertag, the elementary structure of Lexicalized Tree-Adjoining Grammar, achieved a perplexity reduction compared to a conditional POS n-gram LM (Niesler and Woodland, 1996). By comparison, our LM incorporates dependencies directly on words instead of through nonterminals, uses more lexical features than the supertag LM, uses joint instead of conditional probability estimations, and uses modified Kneser-Ney rather than Katz smoothing.</Paragraph>
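<Paragraph> To summarize the model just described, here is a small, purely illustrative sketch of the joint word/SuperARV decomposition of Equation (3) and the word-probability sum of Equation (4); the candidate-tag table, the toy probability function, and all names (CANDIDATE_TAGS, joint_prob, word_probability) are placeholders rather than the trained, interpolated, Kneser-Ney-smoothed distributions used in the paper.
import itertools
from typing import Dict, List, Sequence, Tuple

# Toy candidate SuperARV tags per word; in the real model these come from the lexicon.
CANDIDATE_TAGS: Dict[str, List[str]] = {"what": ["S1"], "did": ["S2", "S3"],
                                        "you": ["S4"], "learn": ["S5"]}

def joint_prob(word: str, tag: str, history: Sequence[Tuple[str, str]]) -> float:
    # Placeholder joint estimate P(w_i, t_i | history); uniform over candidate tags.
    return 1.0 / (len(CANDIDATE_TAGS[word]) * 10.0)

def word_probability(words: List[str], max_history: int = 2) -> float:
    """Equation (4): sum the joint probability of Equation (3) over tag sequences."""
    total = 0.0
    for tags in itertools.product(*(CANDIDATE_TAGS[w] for w in words)):
        p, history = 1.0, []
        for w, t in zip(words, tags):
            p *= joint_prob(w, t, history[-max_history:])
            history.append((w, t))
        total += p
    return total

print(word_probability(["what", "did", "you", "learn"]))
</Paragraph>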
</Section> <Section position="4" start_page="1" end_page="4" type="metho"> <SectionTitle> 3 Evaluating the SuperARV </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> Language Model </SectionTitle> <Paragraph position="0"> Traditionally, LM quality in speech recognition is evaluated on two metrics: perplexity and WER, with the former commonly selected as a less computationally expensive alternative. We carried out two experiments, one using the Wall Street Journal Penn Treebank (WSJ PTB), a text corpus on which perplexity can be measured and compared to other LMs, and the Wall Street Journal Continuous Speech Recognition (WSJ CSR) task, a speech corpus on which both perplexity and WER can be evaluated after LM rescoring. These two experiments compare our SuperARV LM to a baseline trigram, a POS LM that was implemented using Equation (3) (where for this model t represents POS tags instead of SuperARV tags) and modified Kneser-Ney smoothing (as used in the SuperARV LM), and one or more parser-based LMs. Additionally, we evaluate the performance of a conditional probability SuperARV LM (denoted cSuperARV) implemented following Equation (1) rather than Equation (3) to evaluate the importance of using joint probability estimations.</Paragraph> <Paragraph position="1"> For the WSJ PTB task, we compare the SuperARV LMs to the parser LMs developed by Chelba (2000), Roark (2001), and Charniak (2001). Although Srinivas (1997) developed an almost-parsing supertag-based LM, we cannot compare his LM with the other LMs because he used a small non-standard subset of the WSJ PTB and a trainable supertag LM is unavailable. Because none of the parser LMs has been fully trained for the WSJ CSR task, it is essential that we retrain them for comparison. The availability of a trainable version of Chelba's model enables us to train and test on the CSR task; however, because we do not have access to a trainable version of Charniak's or Roark's LMs, they are not considered in the CSR task. Note, however, that for lattice rescoring Roark found that Chelba's model achieves a greater reduction in WER than his LM (Roark, 2001).</Paragraph> </Section> <Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 3.1 Evaluating on the WSJ PTB </SectionTitle> <Paragraph position="0"> To evaluate the perplexity of the LMs on the WSJ PTB task, we adopted the conventions of Chelba (2000), Roark (2001), and Charniak (2001) for pre-processing the data. The vocabulary is limited to the most common 10K words, with all words outside this vocabulary mapped to &lt;UNK&gt;. All punctuation is removed and no case information is retained. All symbols and digits are replaced by the symbol N.</Paragraph> <Paragraph position="1"> Sections 0-20 (929,564 words) are used as the training set for collecting counts, sections 21-22 (73,760 words) as the development set for tuning parameters, and sections 23-24 (82,430 words) for testing.</Paragraph>
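<Paragraph> The preprocessing conventions just described (10K vocabulary, &lt;UNK&gt; mapping, punctuation and case removal, digits mapped to N) might be approximated as follows; this is only a sketch under stated assumptions, not the exact normalization used by Chelba, Roark, or Charniak, and the regular expressions are illustrative.
import re
from typing import List, Set

def normalize_ptb_line(line: str, vocab: Set[str]) -> List[str]:
    """Approximate WSJ PTB LM preprocessing: lowercase, strip punctuation,
    replace tokens containing digits with N, map OOV words to <UNK>."""
    out = []
    for tok in line.lower().split():
        tok = re.sub(r"[^\w']", "", tok)   # drop punctuation and symbols
        if not tok:
            continue
        if re.search(r"\d", tok):
            tok = "N"                      # numbers collapse to the symbol N
        out.append(tok if tok in vocab else "<UNK>")
    return out

vocab = {"the", "company", "said", "it", "will", "buy", "N"}
print(normalize_ptb_line("The company said it will buy 3,000 shares.", vocab))
</Paragraph>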
<Paragraph position="2"> The baseline trigram uses a Katz back-off model with Good-Turing discounting for smoothing. The POS, cSuperARV, and SuperARV LMs were implemented as described previously. The results for the parser-based LMs were initially taken from the literature. The perplexity on the test set using each LM and its interpolation with the corresponding trigram (and the interpolation weight) are shown in the top six rows of Table 2.</Paragraph> <Paragraph position="3"> As can be seen in Table 2, the SuperARV LM obtains the lowest perplexity of all of the LMs (and so it is depicted in bold face). The SuperARV LM achieves the greatest perplexity reduction of 29.19% compared to the trigram, with Charniak's interpolated trihead LM a close second at 24.91%. The cSuperARV LM is clearly inferior to the SuperARV LM, even after interpolation. This result highlights the value of tight coupling of word, lexical feature, and syntactic knowledge both at the data structure level (which is the same for the SuperARV and cSuperARV LMs) and at the probability model level (which is different).</Paragraph> <Paragraph position="4"> Notice that the cSuperARV, Chelba's, Roark's, and Charniak's LMs obtain an improvement in performance when interpolated with a trigram, whereas the POS LM and the SuperARV LM do not benefit from trigram interpolation. (Using the same 180,000 word training and 20,000 word test set as Srinivas (1997), our SuperARV LM obtains a perplexity of 92.76, compared to a perplexity of 101 obtained by the supertag LM.) To gain more insight into why a trigram is effectively interpolated with some, but not all, of the LMs, we calculate the correlation of the trigram with each LM. A standard correlation is calculated between the probabilities assigned to each test set sentence by the trigram LM and the LM in question. This technique has been used in (Wang et al., 2002) to identify whether two LMs can be effectively interpolated.</Paragraph> <Paragraph position="6"> Since we have access to an executable version of Charniak's LM trained on the WSJ PTB (ftp.cs.brown.edu/pub/nlparser) and a trainable version of Chelba's LM, we are able to calculate their correlations with our trigram LM. Chelba's LM was retrained using more parameter reestimation iterations than in (Chelba, 2000) to optimize its performance. Table 2 shows the correlation between each of the executable LMs and the trigram LM.</Paragraph> <Paragraph position="7"> The POS LM has the highest correlation with the trigram, closely followed by the SuperARV LM. Because these two LMs tightly integrate the word information jointly with the tag distribution, the trigram information is already represented. In contrast, the cSuperARV LM and Chelba's and Charniak's parser-based LMs have much lower correlations, indicating they have much lower overlap with the trigram. Because the cSuperARV LM only uses weak word distribution information in probability estimations, it leaves room for the trigram LM to compensate for the lack of word knowledge. The correlations for the parser-based LMs suggest that they capture different aspects of the words' distributions in the language than the words themselves.</Paragraph>
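<Paragraph> The correlation diagnostic used above can be sketched as follows; computing a Pearson correlation over per-sentence (log) probabilities assigned by two LMs is one straightforward reading of the procedure, and the numbers below are toy values, not results from the paper.
import math
from typing import List

def pearson(x: List[float], y: List[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Per-sentence log probabilities assigned by a trigram and another LM on a test
# set (toy numbers); a high correlation suggests the two LMs overlap heavily and
# would gain little from interpolation.
trigram_logprobs = [-42.1, -17.3, -55.8, -23.0, -61.4]
other_lm_logprobs = [-40.7, -18.9, -52.2, -25.1, -58.8]
print(round(pearson(trigram_logprobs, other_lm_logprobs), 3))
</Paragraph>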
</Section> <Section position="3" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 3.2 Evaluating on the WSJ CSR Task </SectionTitle> <Paragraph position="0"> Next we compare the effectiveness of using the trigram word-based, POS, cSuperARV, SuperARV, and Chelba's LMs in rescoring hypotheses generated by a speech recognizer. The training set of the WSJ CSR task is composed of the 1987-1989 files containing 37,243,300 words. The speech data for the training set is used for building the acoustic model, whereas the parse trees for the training set are generated following the policy that if the context-free grammar constituent bracketing can be found in the WSJ PTB, it becomes the parse tree for the training sentence; otherwise, we use the corresponding tree in the BLLIP treebank (Charniak et al., 2000). Since WSJ CSR is a speech corpus, there is no punctuation or case information. All words outside the provided vocabulary are mapped to &lt;UNK&gt;. Note that the word-level tokenization of treebank texts differs from that used in the speech recognition task, with the major differences being: numbers (e.g., "1.2%" versus "one point two percent"), dates (e.g., "Dec. 20, 2001" versus "December twentieth, two thousand one"), currencies (e.g., "$10.25" versus "ten dollars and twenty-five cents"), common abbreviations (e.g., "Inc." versus "Incorporated"), acronyms (e.g., "I.B.M." versus "I. B. M."), hyphenated and period-delimited phrases (e.g., "red-carpet" versus "red carpet"), and contractions and possessives (e.g., "do n't" versus "don't"). The POS, parser-based, and SuperARV LMs are all trained using the text-based tokenization from the treebank. Hence, during testing, a transformation converts the output of the recognizer to a form compatible with the text-based tokenization (Roark, 2001) for rescoring.</Paragraph> <Paragraph position="1"> For testing the LMs, we use the four available WSJ CSR evaluation sets: 1992 5K closed vocabulary (denoted 92-5k) with 330 utterances and 5,353 words, 1993 5K closed vocabulary (93-5k) with 215 utterances and 3,849 words, 1992 20K open vocabulary (92-20k) with 333 utterances and 5,643 words, and 1993 20K (93-20k) with 213 utterances and 3,446 words. We also employ a development set for each vocabulary size: 93-5k-dt (513 utterances and 8,635 words) and 93-20k-dt (252 utterances and 4,062 words).</Paragraph> <Paragraph position="2"> The trigram provided by LDC for the CSR task was used due to its high quality. Before evaluation, all the other LMs (i.e., the POS LM, the cSuperARV and SuperARV LMs, and Chelba's LM) are retrained on the training set trees for the CSR task. Parameter tuning for the LMs on each task uses the corresponding development set. (The interpolation weight for the cSuperARV LM for lattice rescoring was 0.63 on the 5k tasks and 0.60 on the 20k tasks, and 0.68 and 0.65 for Chelba's LM, respectively.) The resulting perplexities are reported with the best result for each test set in bold face. The SuperARV LM yields the lowest perplexity, with Chelba's LM a close second. The perplexity reductions for the SuperARV LM over the trigram across the test sets are 53.19%, 53.63%, 34.33%, and 32.05%, which is even higher than on the WSJ PTB task. This is probably due to the fact that more training data was used for the CSR task.</Paragraph> </Section> <Section position="4" start_page="4" end_page="4" type="sub_section"> <SectionTitle> Rescoring Lattices </SectionTitle> <Paragraph position="0"> Next, using the same LMs, we rescored the lattices generated by an acoustic recognizer built using HTK (Ent, 1997). For each test set sentence, we generated a word lattice. We tuned the parameters of the LMs using the lattices on the corresponding development sets to minimize WER. Lattices were rescored using a Viterbi search for each LM.</Paragraph>
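<Paragraph> A minimal sketch of rescoring a word lattice with an LM via a Viterbi-style search is given below; the lattice encoding, the score combination, and the function and parameter names (rescore_lattice, lm_logprob, lm_weight) are assumptions made for illustration, not the HTK setup or scoring weights used in the experiments, and only the single best history per node is kept.
from typing import Dict, List, Tuple

# A toy lattice: edges (start_node, end_node, word, acoustic_logprob),
# with node ids assumed to be in topological order.
Edge = Tuple[int, int, str, float]

def rescore_lattice(edges: List[Edge], start: int, end: int,
                    lm_logprob, lm_weight: float = 1.0) -> Tuple[float, List[str]]:
    """Viterbi search over lattice nodes, combining acoustic and LM log scores.
    `lm_logprob(word, history)` supplies the rescoring LM's score."""
    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    for frm, to, word, ac in sorted(edges, key=lambda e: e[0]):
        if frm not in best:
            continue
        score, hyp = best[frm]
        new = score + ac + lm_weight * lm_logprob(word, hyp)
        if to not in best or new > best[to][0]:
            best[to] = (new, hyp + [word])
    return best[end]

edges = [(0, 1, "what", -1.0), (1, 2, "did", -0.8), (1, 2, "dead", -0.7),
         (2, 3, "you", -0.5), (3, 4, "learn", -1.1)]
print(rescore_lattice(edges, 0, 4, lambda w, h: -2.0 if w == "dead" else -0.3))
</Paragraph>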
<Paragraph position="2"> Table 4 shows the WER and sentence accuracy (SAC) after rescoring lattices using each LM, with the lowest WER and highest SAC for each test set presented in bold face. We also give the lattice WER/SAC, which defines the best accuracy possible given perfect knowledge. As can be seen from Table 4, the SuperARV LM produces the best reduction in WER, with Chelba's LM the second best. When rescoring lattices on the 92-5k, 93-5k, 92-20k, and 93-20k test sets, the SuperARV LM yields a relative WER reduction of 13.54%, 9.70%, 8.64%, and 3.12% compared to the trigram, respectively. SAC results are similar: the SuperARV LM achieves an absolute increase in SAC of 4.24%, 6.97%, 2.7%, and 3.75% compared to the trigram. Note that Chelba's LM tied once with the SuperARV LM on 93-20k SAC, but always obtained a higher WER across the four test sets. Because Chelba's LM focuses on developing the complete parse structure for a word sequence, it enforces stricter pruning based on the entire sentence. As can be seen in Table 4, the cSuperARV LM, even when interpolated with a trigram LM, obtains a lower accuracy than our SuperARV LM. This result is consistent with the hypothesis that a conditional model suffers from label bias (Lafferty et al., 2001).</Paragraph> <Paragraph position="3"> The WER reported by Chelba (2000) on the 93-20k test set was 13.0%. This WER is lower than what we obtained for Chelba's retrained LM on the same task. This disparity is due to the fact that a higher quality acoustic decoder was used in (Chelba, 2000), which is not available to us. We further compare the LMs on Dr. Chelba's 93-20k lattices, kindly provided by him, with the rescoring results shown in the last column of Table 4. We observe that Chelba's retrained LM improves his original result, but the SuperARV LM still obtains a greater accuracy. Sign tests show that the differences between the accuracies achieved by the SuperARV LM and the trigram, POS, and cSuperARV LMs are statistically significant. Although there is no significant difference between the SuperARV LM and Chelba's LM, the SuperARV LM has a much lower complexity than Chelba's LM.</Paragraph>
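<Paragraph> The significance comparison mentioned above can be illustrated with a simple two-sided sign test over paired per-utterance error counts; this is a generic sketch, and the exact pairing, statistic, and significance threshold used in the paper are not specified here.
from math import comb
from typing import List

def sign_test(errors_a: List[int], errors_b: List[int]) -> float:
    """Two-sided sign test p-value on paired per-utterance error counts."""
    wins_a = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    wins_b = sum(1 for a, b in zip(errors_a, errors_b) if a > b)
    n, k = wins_a + wins_b, min(wins_a, wins_b)
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy word-error counts per utterance for two rescoring LMs.
superarv_errors = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1]
trigram_errors  = [1, 1, 2, 2, 1, 2, 0, 1, 3, 2]
print(sign_test(superarv_errors, trigram_errors))
</Paragraph>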
</Section> </Section> <Section position="5" start_page="4" end_page="4" type="metho"> <SectionTitle> 4 Investigating the Knowledge </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> Source Contributions </SectionTitle> <Paragraph position="0"> Next, we attempt to explain the contrast between the encouraging results from our SuperARV LM and the reported poor performance of several probabilistic dependency grammar models, i.e., the traditional probabilistic dependency grammar (PDG) LM, the probabilistic link grammar (PLG) LM (Lafferty et al., 1992), and Zeman's probabilistic dependency grammar model (ZPDG) (Hajic et al., 1998). ZPDG was evaluated on the Prague Dependency Treebank (Hajic, 1998) during the 1998 Johns Hopkins summer workshop (Hajic et al., 1998) and produced a much lower parsing accuracy (under 60%) than Collins' probabilistic context-free grammar parser (80%) (Collins, 1996). Fong et al. (1995) evaluated the probabilistic link grammar LM described in (Lafferty et al., 1992) on small artificial corpora and found that the LM has a greater perplexity than a standard bigram. Additionally, only a modest improvement over the bigram was achieved after Fong and Wu (1995) revised the model to make grammar rule learning feasible.</Paragraph> <Paragraph position="1"> One possible reason for their poor performance, especially in light of our SuperARV LM results, is that these probabilistic dependency grammar models do not utilize sufficient knowledge to achieve a high level of accuracy. The knowledge sources the SuperARV LM uses, represented as components of the structure shown in Figure 1, include: lexical category (denoted c), lexical features (denoted f), role label or link type information (denoted L), a governor role dependency relation constraint (R, L, UC) (denoted g), a set of need role dependency relation constraints (R, L, UC)+ (denoted n), and modifiee constraints represented as the lexical category of the modifiee for each role (denoted m). Table 5 summarizes the knowledge sources that each of the probabilistic dependency grammar models uses, compared to our SuperARV LM. (In Table 5, link type is defined as L and link direction as UC in the SuperARV structure.)</Paragraph> <Paragraph position="2"> To determine whether the poor performance of the three probabilistic dependency grammar models results from our hypothesis that they utilize insufficient knowledge, we will evaluate our SuperARV LM after eliminating those knowledge sources that are not used by each of these models. Additionally, we will evaluate the contribution of each of the knowledge sources to the predictiveness of our SuperARV LM.</Paragraph> <Paragraph position="3"> We use the methodology of selectively ignoring different types of knowledge as constraints to evaluate the knowledge source contributions to our SuperARV LM, as well as to approximate the performance of the other probabilistic dependency grammar models.</Paragraph> <Paragraph position="4"> The framework of CDG, on which our SuperARV LM is built, allows constraints to be tightened by adding more knowledge sources or loosened by ignoring certain knowledge. The SuperARV structure inherits this capability from CDG; selective constraint relaxation is implemented by eliminating one or more knowledge sources in K = {c, f, L, g, n, m} from the SuperARV structure. We have constructed nine different LMs based on reduced SuperARV structures, denoted SARV-k (i.e., a SuperARV structure after removing k, with k a subset of K), where -k represents the deletion of a subset of knowledge types (e.g., f, mn, cgmn); a small illustrative sketch of this relaxation appears at the end of this section. Each model is described next.</Paragraph> <Paragraph position="5"> Modifiee constraints potentially hamper grammar generality, and so we consider their impact by deleting them from the LM by using the SARV-m structure. Need roles are important for capturing the structural requirements of different types of words (e.g., subcategorization), and we investigate their effects by using the SARV-n structure. The model based on SARV-L is built to investigate the importance of link type information. We can investigate the contribution of the combination of m and n, fundamental to the enforcement of valency constraints, by using the SARV-mn structure. The model based on SARV-f is used to evaluate whether lexical features improve or degrade LM quality.</Paragraph>
<Paragraph position="6"> The model based on SARV-fmn is very similar to the standard probabilistic dependency grammar LM, in which only word, POS, link type, and link direction information is used for probability estimations. The model based on SARV-gmn uses a feature augmentation of POS, and the model based on SARV-cgmn uses lexical features only. Additionally, we built the model ZPDG-SARV to approximate ZPDG. Zeman's PDG (Hajic et al., 1998) differs significantly from our original SuperARV LM in that it ignores label information L and some lexical feature information (the morphological tags do not include some lexical features having influence on syntax, denoted syntactic lexical features, i.e., gapp, inverted, mood, type, case, voice), and does not enforce valency constraints (instead, the model only counts the number of links associated with a word without discriminating whether the links represent governing or linguistic structural requirements). Also, word identity information is not used; instead, the model uses a loose integration of a word's lemma and its morphological tag. Given this analysis, we built the model ZPDG-SARV based on a structure including lexical category, morphological features, and (G, UC, MC).</Paragraph> <Paragraph position="8"> Table 6 shows the perplexity results on the WSJ CSR test sets, ordered from highest to lowest for each test set, with the best result for each in bold face. The full SuperARV LM yields the lowest perplexity. We found that ignoring modifiee constraints (SARV-m) increases perplexity the least, and ignoring link type information (SARV-L) or need role constraints (SARV-n) is a little worse than that.</Paragraph> <Paragraph position="9"> Ignoring both knowledge sources (SARV-mn) should result in even greater degradation, which is verified by the results. However, ignoring lexical features (SARV-f) produces an even greater increase in perplexity than relaxing both m and n. The SARV-fmn model, which is closest to the traditional probabilistic dependency grammar LM, shows fairly poor quality, not much better than the POS LM. One might hypothesize that lexical features individually contribute the most to the overall performance of the SuperARV LM. However, using this knowledge source by itself (SARV-cgmn) results in a dramatic degradation in perplexity, in fact even worse than that of the POS LM, but still slightly better than the baseline trigram. However, as demonstrated by SARV-gmn, the constraints from lexical features are strengthened by combining them with POS. Given the descriptions in Table 5, we can approximate PLG by a model based on a SuperARV structure eliminating f and m (which should have a quality between SARV-f and SARV-fmn). It is noticeable that without word identity information, syntactic lexical features, and valency constraints, the ZPDG-SARV LM performs worse than the POS-based LM and only slightly better than the LM based on SARV-cgmn.</Paragraph> <Paragraph position="10"> This suggests that ZPDG can be strengthened by incorporating more knowledge.</Paragraph> <Paragraph position="11"> The same ranking of the performance of the LMs was obtained for WER/SAC after rescoring the lattices using each LM, as shown in Table 7. Our experiments with relaxed SuperARV LMs suggest likely methods for improving the PDG, PLG, and ZPDG models. The tight integration of word identity, lexical category, lexical features, and structural dependency constraints is likely to improve their performance.</Paragraph> <Paragraph position="12"> Clearly the investigated knowledge sources are quite synergistic, and their tight integration achieves the greatest improvement on both perplexity and WER.</Paragraph>
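<Paragraph> The selective constraint relaxation used to build the SARV-k models could be sketched roughly as follows, reusing the knowledge-source names K = {c, f, L, g, n, m} from this section; the way each source is encoded and dropped is an assumption made only for this example, not the paper's implementation.
from typing import FrozenSet, Tuple

# A SuperARV as a flat tuple of (knowledge-source key, value) pairs, keyed by
# the paper's K = {c, f, L, g, n, m}.
FullSARV = Tuple[Tuple[str, str], ...]

def relax(sarv: FullSARV, remove: FrozenSet[str]) -> FullSARV:
    """Build a reduced SARV-k structure by dropping the knowledge sources in k."""
    return tuple((k, v) for k, v in sarv if k not in remove)

did = (
    ("c", "verb"),                  # lexical category
    ("f", "mood=whquestion"),       # one of the lexical features (abridged)
    ("L", "vp"),                    # role label / link type
    ("g", "G:PX>MX"),               # governor role dependency constraint
    ("n", "N1:subj,N2:base-verb"),  # need role constraints (abridged)
    ("m", "pronoun"),               # modifiee lexical category constraint
)

# SARV-fmn: the structure closest to a traditional probabilistic dependency LM.
print(relax(did, frozenset({"f", "m", "n"})))
</Paragraph>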
</Section> </Section> <Section position="6" start_page="4" end_page="4" type="metho"> <SectionTitle> 5 Conclusions </SectionTitle> <Paragraph position="0"> We have compared our SuperARV LM to a variety of LMs and found that it achieves both perplexity and WER reductions compared to a trigram, and, despite the fact that it is an almost-parsing LM, it outperforms (or performs comparably to) the more complex parser-based LMs on both perplexity and rescoring accuracy. Additional experiments reveal that selecting a joint instead of a conditional probabilistic model is an important factor in the performance of our SuperARV LM. The SuperARV structure provides a flexible framework that tightly couples a variety of knowledge sources without combinatorial explosion. We found that although each knowledge source contributes to the performance of the LM, it is the tight integration of the word-level knowledge sources (word identity, POS, and lexical features) together with the structural information of governor and subcategorization dependencies that produces the best level of LM performance. We are currently extending the almost-parsing SuperARV LM to a full parser-based LM.</Paragraph> </Section> <Section position="7" start_page="4" end_page="4" type="metho"> <SectionTitle> 6 Acknowledgments </SectionTitle> <Paragraph position="0"> This research was supported by Intel, the Purdue Research Foundation, and the National Science Foundation under Grant Nos. IRI 97-04358, CDA 96-17388, and BCS-9980054. We would like to thank the anonymous reviewers for their comments and suggestions. We would also like to thank Dr. Charniak, Dr. Chelba, and Dr. Srinivas for their help with this research effort. Finally, we would like to thank Yang Liu (Purdue University) for providing us with the</Paragraph> </Section> </Paper>