<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1072">
<Title>The Test Results</Title>
<Section position="4" start_page="366" end_page="366" type="metho">
<SectionTitle> SYSTEM ARCHITECTURE </SectionTitle>
<Paragraph position="0"> The VOYAGER system consists of the TINA natural language understanding system and the SUMMIT speech recognition system. These components will only be described briefly here, as they are more fully documented in [7,9,10].</Paragraph>
<Paragraph position="1"> TINA combines a general English syntax at the top level with a semantic grammar framework at lower levels, to provide an interleaved syntax/semantics analysis that minimizes perplexity. As a result, most sentences in TINA have only one parse. TINA uses a best-first heuristic search in parsing, storing alternate candidate parse paths while it pursues the most promising (most probable) path. In addition, the grammar is trainable from instances of parse trees, as described in the next section.</Paragraph>
<Paragraph position="2"> The SUMMIT system transforms a speech waveform into a segment lattice. Features are extracted for each segment and used to determine a set of acoustic scores for phone candidates. A lexicon provides word pronunciations, which are expanded through phonological rules into a network of alternate pronunciations for each word. The control strategy to align the segmental acoustic-phonetic network with the lexical word-pronunciation network uses an N-best interface which produces the top N candidate word sequences in decreasing order of total path score. It makes use of an A* search algorithm [3,4], with an initial Viterbi search serving as the mechanism for establishing a tight upper-bound estimate of the score for the unseen portion of each active hypothesis.</Paragraph>
<Paragraph position="3"> Language constraints include both a local word-pair constraint and a more global linguistic constraint based on parsability. The word-pair constraint is precompiled into the word network, and limits the set of words that can follow any given word, without regard to sentence context. Allowable word pairs were determined automatically by generating a large number of random sentences from the grammar [7]. The linguistic constraints are incorporated either as a filter on full-sentence hypotheses as they come off the top of the stack, or with a tighter coupling in which active partial theories dynamically prune the set of allowable next-word candidates during the search.</Paragraph>
</Section>
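The word-pair constraint described above can be pictured as a precompiled table of allowed successors. The short Python sketch below is only an illustration under assumed names (the sentence generator and the toy sentences are hypothetical, not part of VOYAGER): it collects adjacent word pairs from randomly generated sentences and then uses the resulting table to prune next-word candidates during a search.

# Illustrative sketch of a precompiled word-pair constraint (hypothetical names).
from collections import defaultdict
import random

def generate_random_sentence():
    # Stand-in for sampling a sentence from the grammar; VOYAGER generates
    # these from the TINA grammar, here we just draw from a toy list.
    toy_sentences = [
        ["show", "me", "the", "nearest", "bank"],
        ["how", "do", "i", "get", "to", "harvard", "square"],
    ]
    return random.choice(toy_sentences)

def compile_word_pairs(num_sentences=100000):
    """Collect, for each word, the set of words allowed to follow it."""
    allowed_next = defaultdict(set)
    for _ in range(num_sentences):
        words = generate_random_sentence()
        for w1, w2 in zip(words, words[1:]):
            allowed_next[w1].add(w2)
    return allowed_next

def prune_candidates(prev_word, candidates, allowed_next):
    """Word-pair constraint: keep only candidates licensed after prev_word,
    without regard to the rest of the sentence context."""
    return [w for w in candidates if w in allowed_next.get(prev_word, ())]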
<Section position="5" start_page="366" end_page="367" type="metho">
<SectionTitle> TRAINING PARSE PROBABILITIES </SectionTitle>
<Paragraph position="0"> This section describes our procedure for training the probabilities in the grammar automatically from a set of parsed training sentences. From each training sentence we derive the set of context-free rules needed to parse that sentence. The entire pool of rules is used to train the grammar probabilities, with each rule occurring one or more times in the training data. By training on a set of some 3500 sentences within the VOYAGER domain, we were able to reduce the perplexity on an independent test set by a factor of three [10].</Paragraph>
<Paragraph position="1"> Our approach to training a grammar differs from that of many current schemes, mainly in that we have intentionally set up a framework that easily produces a probability estimate for the next word given the preceding word sequence.</Paragraph>
<Paragraph position="2"> We feel that a next-word probability is much more appropriate than a rule-production probability for incorporation into a tightly coupled system, since it leads to a simple definition of the total score for the next word as the weighted sum of the language-model probability and the acoustic probability.</Paragraph>
<Paragraph position="3"> While rule-production probabilities can in fact be generated from the probabilities we provide, they will not, in general, agree with the probabilities determined by a procedure such as the inside/outside algorithm [6,2].</Paragraph>
<Paragraph position="4"> In our approach, the grammar is partitioned into rule sets according to the left-hand side (LHS) category of each context-free rule. Within each partition, the categories that appear on the right-hand side (RHS) of all the rules sharing that LHS category are used to form a bigram language model over the categories particular to that partition. Thus the language-model statistics are encoded as a set of two-dimensional tables of category-category transition probabilities, one table for each partition. A direct consequence of this bigram model within each partition is that new sibling "chains" may form, producing, in many cases, combinations that were never explicitly mentioned in the original rule set. The parser is driven only by the set of local node-node transitions for a given LHS, so any new chains take on the same status as sibling sets (RHS) that appeared explicitly in the original grammar. While this property can at times lead to inadvertent rules that are inappropriate, it often yields productive new rules and allows for faster generalization of the grammar. Given a particular parse tree, the probability for the next word is the product of the node-node transition probabilities linking the next word to the previous word. The overall next-word probability for a given initial word sequence is then the sum over all parse trees spanning the entire word sequence.</Paragraph>
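As an illustration of the training step just described, the Python sketch below (assumed data structures and toy rules, not the TINA implementation) partitions rule occurrences by LHS, counts sibling-to-sibling transitions with start and end markers, normalizes them into per-partition bigram tables, and regenerates a rule-production probability as the product of its transitions. The five NP rules in the usage example are hypothetical, chosen only to resemble the kind of case discussed in the worked example that follows.

# Illustrative sketch of training category-bigram transition tables,
# one table per LHS partition (hypothetical data structures).
from collections import defaultdict

START, END = "<start>", "<end>"

def train_transitions(rule_occurrences):
    """rule_occurrences: list of (lhs, rhs_categories) pairs, one entry per
    occurrence of a rule in the parsed training sentences. Returns, for each
    LHS, a table mapping (previous_category, next_category) to a probability."""
    counts = defaultdict(lambda: defaultdict(float))
    for lhs, rhs in rule_occurrences:
        seq = [START] + list(rhs) + [END]
        for prev, nxt in zip(seq, seq[1:]):
            counts[lhs][(prev, nxt)] += 1.0
    tables = {}
    for lhs, pair_counts in counts.items():
        totals = defaultdict(float)
        for (prev, _), c in pair_counts.items():
            totals[prev] += c
        tables[lhs] = {(prev, nxt): c / totals[prev]
                       for (prev, nxt), c in pair_counts.items()}
    return tables

def derived_rule_probability(lhs, rhs, tables):
    """Regenerate a rule-production probability as the product of its
    sibling-to-sibling transition probabilities."""
    seq = [START] + list(rhs) + [END]
    prob = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        prob *= tables[lhs].get((prev, nxt), 0.0)
    return prob

# Toy usage: five hypothetical NP rule occurrences.
if __name__ == "__main__":
    occurrences = [
        ("NP", ["ART", "NOUN"]),
        ("NP", ["ART", "ADJ", "NOUN"]),
        ("NP", ["ART", "ADJ", "NOUN"]),
        ("NP", ["ART", "NOUN", "ADJUNCT"]),
        ("NP", ["ART", "ADJ", "NOUN", "ADJUNCT"]),
    ]
    tables = train_transitions(occurrences)
    # With these toy counts, P(ADJUNCT | NOUN) = 2/5, illustrating the
    # smoothing across rule groups described in the text.
    print(derived_rule_probability("NP", ["ART", "NOUN"], tables))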
<Paragraph position="5"> A specific example should help to elucidate the training process. Imagine that a training set provides the set of five rules shown in Table 1. The training algorithm produces a transition network as shown in Figure 1, with probabilities established by counting and normalizing pair frequencies, as would be done at the word level in a traditional bigram language model. Rule-production probabilities can be regenerated from these pair transition probabilities, giving the result shown in the column "Derived Probability" in the table, to be compared with "Original Probability." The derived probabilities are not the same as what one would get by simply counting rule frequencies. The probabilities are correct up to the point of the category NOUN; that is, there is a 2/5 probability of getting the rule group (1,4) and a 3/5 probability of getting the group (2,3,4). However, the transitions out of NOUN are conditional on the rule group. That is, rules that start with (ART NOUN) have a 50/50 chance of being followed by an ADJUNCT, whereas the remaining rules have a 1/3 chance. The method of ignoring everything except the preceding node has the effect of smoothing these two groups, giving all of them an equal chance (2/5) of finding an ADJUNCT next. This is a form of deleted interpolation [5], and it helps to get around sparse-data problems, although it makes an independence assumption: whether or not a noun is followed by an adjunct is assumed to be independent of the preceding context of the noun. In this example, no new rules were introduced. If, however, we had no training instances for Rule 4, it would still get a nonzero probability, because all of its sibling pairs are available from other rules. That is to say, not only does this method smooth probabilities among rules, but it also creates new rules that had not been explicitly seen in the training set.</Paragraph>
<Paragraph position="6"> The grammar itself includes syntactic and semantic constraints that may cause a particular next-sibling to fail. There is also a trace mechanism that restores a moved category to its deep-structure position. Both of these mechanisms disrupt the probabilities in ways that are ignored by the training method. While it is possible to renormalize on the fly, by checking all next-siblings against the constraints and accumulating the total probability of those that pass, we did not in fact do this, in the interest of computational efficiency. We do plan to incorporate this normalizing step in a future experiment, to assess whether it offers a significant improvement in the probability estimates. We do currently make some corrections for the gap mechanism: rather than using the a priori statistics on the likelihood of a node whose child is a trace, we simply assume that the node occurred with probability 1.0. While neither of these treatments is absolutely correct, the latter is generally much closer to the truth than the former.</Paragraph>
<Paragraph position="7"> The TINA grammar has been trained on more than 3500 sentences. The parse score is computed as the sum of the log probabilities of the node-node transitions in the parse tree.</Paragraph>
<Paragraph position="8"> The probability of a given terminal is taken to be 1/K, where K is the number of lexical items having the same lexical class.</Paragraph>
</Section>
<Section position="6" start_page="367" end_page="367" type="metho">
<SectionTitle> TRAINING RULES </SectionTitle>
<Paragraph position="0"> It would not be difficult to incorporate more sophisticated estimates for lexical items, for example, using unigram probabilities within a given lexical class, but we did not do that here, in order to avoid sparse-data problems. The parse scores reported in the following section are the log probabilities, normalized for the number of terminals to compensate for decreasing probabilities in longer sentences.</Paragraph>
</Section>
</Paper>