<?xml version="1.0" standalone="yes"?> <Paper uid="J99-4003"> <Title>Speech Repairs, Intonational Phrases, and Discourse Markers: Modeling Speakers' Utterances in Spoken Dialogue</Title> <Section position="4" start_page="536" end_page="648" type="intro"> <SectionTitle> 3. POS-based Language Model </SectionTitle> <Paragraph position="0"> In this section, we present a speech recognition language model that incorporates POS tagging. Here, POS tags are viewed as part of the output of the speech recognizer rather than as intermediate objects. Not only is this syntactic information needed for modeling the occurrence of speech repairs and intonational phrases, but it will also be useful for higher-level syntactic and semantic processes. Incorporating POS tagging can also be seen as a first step in tightening the coupling between speech recognition and natural language processing, so as to make use of richer knowledge of natural language than simple word-based language models provide.</Paragraph> <Section position="1" start_page="537" end_page="537" type="sub_section"> <SectionTitle> 3.1 Word-based Language Models </SectionTitle> <Paragraph position="0"> The goal of speech recognition is to find the most probable sequence of words W given the acoustic signal A (Jelinek 1985):</Paragraph> <Paragraph position="1"> \hat{W} = \arg\max_W \Pr(W \mid A) = \arg\max_W \frac{\Pr(A \mid W)\,\Pr(W)}{\Pr(A)} = \arg\max_W \Pr(A \mid W)\,\Pr(W) </Paragraph> <Paragraph position="2"> The first term, Pr(A|W), is the acoustic model and the second term, Pr(W), is the language model. We can rewrite W explicitly as the sequence of words W_1 W_2 W_3 ... W_N, where N is the number of words in the sequence. For expository ease, we use W_{i,j} to refer to W_i ... W_j. We now use the definition of conditional probability to rewrite Pr(W_{1,N}) as follows:</Paragraph> <Paragraph position="3"> \Pr(W_{1,N}) = \prod_{i=1}^{N} \Pr(W_i \mid W_{1,i-1}) </Paragraph> <Paragraph position="4"> The above equation gives us the probability of the word sequence as the product of the probability of each word given its previous lexical context. This probability distribution must be estimated. The simplest approach to estimating the probability of an event given a context is to use a training corpus to compute the relative frequency of the event given the context. However, no matter how large the corpus is, there will always be event-context pairs that have not been seen, or that have been seen too rarely to estimate their probability accurately. To alleviate this problem, one must partition the contexts into equivalence classes and use these to compute the relative frequencies. A common technique is to partition the context based on the last n - 1 words, W_{i-n+1,i-1}, which is referred to as an n-gram language model. One can also mix in smaller-size language models to use when there is not enough data to support the larger context.</Paragraph> <Paragraph position="5"> Two common approaches for doing this are interpolated estimation (Jelinek and Mercer 1980) and the backoff approach (Katz 1987).</Paragraph> </Section>
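As an illustration of the interpolation idea from Section 3.1, the following sketch implements a toy bigram model whose relative-frequency estimates are interpolated with unigram estimates. It is an illustration only, not the model used in the paper: the miniature corpus, the fixed weight `lam`, and all class and function names are assumptions made for this example; in practice the interpolation weights would be optimized on held-out data (Jelinek and Mercer 1980).

```python
from collections import Counter


class InterpolatedBigramLM:
    """Toy bigram language model with interpolated (Jelinek-Mercer style) estimation.

    Illustrative sketch only: the fixed weight `lam` stands in for weights
    that would normally be optimized on held-out data.
    """

    def __init__(self, sentences, lam=0.7):
        self.lam = lam
        self.unigrams = Counter()
        self.bigrams = Counter()
        for words in sentences:
            padded = ["<s>"] + words
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded[:-1], padded[1:]))
        self.total = sum(self.unigrams.values())

    def prob(self, word, prev):
        # Relative-frequency estimates for the bigram and unigram contexts.
        p_bi = (self.bigrams[(prev, word)] / self.unigrams[prev]
                if self.unigrams[prev] else 0.0)
        p_uni = self.unigrams[word] / self.total
        # Interpolated estimate: fall back toward the unigram when the
        # bigram context is sparse.
        return self.lam * p_bi + (1 - self.lam) * p_uni

    def sequence_prob(self, words):
        # Pr(W_{1,N}) as a product of conditional word probabilities.
        p, prev = 1.0, "<s>"
        for w in words:
            p *= self.prob(w, prev)
            prev = w
        return p


if __name__ == "__main__":
    # Hypothetical toy corpus, purely for illustration.
    corpus = [["the", "engine", "takes", "the", "boxcars"],
              ["the", "engine", "leaves", "elmira"]]
    lm = InterpolatedBigramLM(corpus)
    print(lm.sequence_prob(["the", "engine", "leaves"]))
```

A backoff scheme (Katz 1987) differs in that it uses the lower-order estimate only when the higher-order context has insufficient counts, rather than always mixing the two.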
<Section position="2" start_page="537" end_page="538" type="sub_section"> <SectionTitle> 3.2 Incorporating POS Tags and Discourse Marker Identification </SectionTitle> <Paragraph position="0"> Previous attempts to incorporate POS tags into a language model view the POS tags as intermediate objects and sum over all POS possibilities (Jelinek 1985):</Paragraph> <Paragraph position="1"> \Pr(W_{1,N}) = \sum_{D_{1,N}} \Pr(W_{1,N} D_{1,N}) \qquad (5) </Paragraph> <Paragraph position="2"> However, this throws away valuable information that is needed by later processing. Instead, we redefine the speech recognition problem so as to include finding the best POS and discourse marker sequence along with the best word sequence. For the word sequence W, let D be a POS sequence that can include discourse marker tags. The goal of the speech recognition process is now to solve the following:</Paragraph> <Paragraph position="3"> \hat{W}\hat{D} = \arg\max_{WD} \Pr(WD \mid A) = \arg\max_{WD} \Pr(A \mid WD)\,\Pr(WD) \qquad (6) </Paragraph> <Paragraph position="4"> The first term, Pr(A|WD), is the acoustic model, which can be approximated by Pr(A|W).</Paragraph> <Paragraph position="5"> The second term, Pr(WD), is the POS-based language model and accounts for both the sequence of words and their POS assignment. We rewrite this term as follows:</Paragraph> <Paragraph position="6"> \Pr(W_{1,N} D_{1,N}) = \prod_{i=1}^{N} \Pr(W_i D_i \mid W_{1,i-1} D_{1,i-1}) = \prod_{i=1}^{N} \Pr(W_i \mid W_{1,i-1} D_{1,i})\,\Pr(D_i \mid W_{1,i-1} D_{1,i-1}) \qquad (7) </Paragraph> <Paragraph position="7"> Equation 7 involves two probability distributions that need to be estimated. These are the same distributions that are needed by previous POS-based language models (Equation 5) and POS taggers (Church 1988; Charniak et al. 1993). However, these approaches simplify the context so that the lexical probability is conditioned on just the POS category of the word, and the POS probability is conditioned on just the preceding POS tags, which leads to the following two approximations:</Paragraph> <Paragraph position="8"> \Pr(W_i \mid W_{1,i-1} D_{1,i}) \approx \Pr(W_i \mid D_i) \qquad (8) \qquad\qquad \Pr(D_i \mid W_{1,i-1} D_{1,i-1}) \approx \Pr(D_i \mid D_{1,i-1}) \qquad (9) </Paragraph> <Paragraph position="9"> However, to successfully incorporate POS information, we need to account for the full richness of the probability distributions, as will be demonstrated in Section 3.4.4.</Paragraph> </Section> <Section position="3" start_page="538" end_page="648" type="sub_section"> <SectionTitle> 3.3 Estimating the Probabilities </SectionTitle> <Paragraph position="0"> To estimate the probability distributions, we follow the approach of Bahl et al. (1989) and use a decision tree learning algorithm (Breiman et al. 1984) to partition the context into equivalence classes. The algorithm starts with a single node. It then finds a question to ask about the node in order to partition the node into two leaves, each more informative as to which event occurred than the parent node. Information-theoretic metrics, such as minimizing entropy, are used to decide which question to propose.</Paragraph> <Paragraph position="1"> The proposed question is then verified using held-out data: if the split does not lead to a decrease in entropy on the held-out data, the split is rejected and the node is not explored further. This process continues with the new leaves and results in a hierarchical partitioning of the context. After the tree is grown, relative frequencies are calculated for each node, and these probabilities are then interpolated with their parent node's probabilities using a second held-out dataset.</Paragraph> <Paragraph position="2"> Using the decision tree algorithm to estimate probabilities is attractive since the algorithm can choose which parts of the context are relevant, and in what order. Hence, this approach lends itself more readily to allowing extra contextual information to be included, such as both the word identities and POS tags, and even hierarchical clusterings of them. If the extra information is not relevant, it will not be used. The approach of using decision trees will become even more critical in the next two sections, where the probability distributions will be conditioned on even richer context.</Paragraph>
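To make the tree-growing procedure of Section 3.3 concrete, the following is a minimal sketch of greedy entropy-based splitting with a held-out check. It is an illustration only, not the implementation of Bahl et al. (1989): the data representation (lists of (context, label) pairs), the candidate-question predicates, and the function names are all assumptions for this sketch, and the final smoothing step that interpolates each node with its parent is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy (in bits) of a list of event labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_split(data, questions):
    """Pick the candidate question giving the largest drop in entropy.

    `data` is a list of (context, label) pairs; each question is a boolean
    predicate over contexts (both representations are assumptions).
    """
    base = entropy([label for _, label in data])
    best, best_gain = None, 0.0
    for q in questions:
        yes = [label for ctx, label in data if q(ctx)]
        no = [label for ctx, label in data if not q(ctx)]
        if not yes or not no:
            continue
        split_h = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(data)
        if base - split_h > best_gain:
            best, best_gain = q, base - split_h
    return best


def grow(data, heldout, questions):
    """Recursively grow a binary tree, keeping a split only if it also
    lowers entropy on the held-out data."""
    q = best_split(data, questions)
    if q is None:
        return {"leaf": Counter(label for _, label in data)}
    held_yes = [(c, l) for c, l in heldout if q(c)]
    held_no = [(c, l) for c, l in heldout if not q(c)]
    base_h = entropy([l for _, l in heldout]) if heldout else 0.0
    if held_yes and held_no:
        split_h = (len(held_yes) * entropy([l for _, l in held_yes]) +
                   len(held_no) * entropy([l for _, l in held_no])) / len(heldout)
    else:
        split_h = base_h
    if split_h >= base_h:
        # Split rejected by the held-out data: stop exploring this node.
        return {"leaf": Counter(label for _, label in data)}
    return {"question": q,
            "yes": grow([d for d in data if q(d[0])], held_yes, questions),
            "no": grow([d for d in data if not q(d[0])], held_no, questions)}
```

In the procedure described above, the relative frequencies stored at each leaf would additionally be interpolated with those of the parent nodes, using a second held-out set; that step is not shown here.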
<Paragraph position="3"> 3.3.1 Simple Questions. One of the most important aspects of using a decision tree algorithm is the form of the questions that it is allowed to ask. We allow two basic types of information to be used as part of the context: numeric and categorical. For a numeric variable N, the decision tree searches for questions of the form &quot;is N >= n&quot;, where n is a numeric constant. For a categorical variable C, it searches over questions of the form &quot;is C \in S&quot;, where S is a subset of the possible values of C. We also allow composite questions (Bahl et al. 1989), which are Boolean combinations of elementary questions.</Paragraph> <Paragraph position="4"> 3.3.2 Questions about the POS Tags. The context that we use for estimating the probabilities includes both word identities and POS tags. To make effective use of this information, we allow the decision tree algorithm to generalize between words and POS tags that behave similarly. To learn which ones behave similarly, Black et al. (1992) and Magerman (1994) used the clustering algorithm of Brown et al. (1992) to build a hierarchical classification tree. Figure 2 gives the tree that we built for the POS tags. The algorithm starts with each POS tag in a separate class and iteratively finds the two classes whose merger results in the smallest loss of information about POS adjacency. This continues until only a single class remains. The order in which classes were merged, however, gives a binary tree with the root corresponding to the entire tagset, each leaf to a single POS tag, and intermediate nodes to groupings of the tags that are statistically similar. The path from the root to a tag gives the binary encoding for the tag. For instance, the binary encoding of VBG in Figure 2 is 01011100. The decision tree algorithm can ask which partition a tag belongs to by asking questions about its binary encoding.</Paragraph> <Paragraph position="5"> Figure 2: Binary classification tree that encodes the POS tags for the decision tree algorithm. </Paragraph> <Paragraph position="6"> 3.3.3 Questions about Word Identities. For handling word identities, one could follow the approach used for handling the POS tags (e.g., Black et al. 1992; Magerman 1994) and view the POS tags and word identities as two separate sources of information.</Paragraph> <Paragraph position="7"> Instead, we view the word identities as a further refinement of the POS tags. We start the clustering algorithm with a separate class for each word and each tag that it takes on. Classes are only merged if the tags are the same. The result is a word classification tree for each tag. This approach means that the trees will not be polluted by words that are ambiguous as to their tag, as exemplified by the word loads, which is used in the corpus as a third-person present tense verb VBZ and as a plural noun NNS.</Paragraph> <Paragraph position="8"> Furthermore, this approach simplifies the clustering task because the hand annotations of the POS tags resolve a lot of the difficulty that the algorithm would otherwise have to learn. Hence, effective trees can be built even when only a small amount of data is available.</Paragraph> <Paragraph position="9"> Figure 3: Binary classification tree that encodes the personal pronouns (PRP). </Paragraph> <Paragraph position="10"> Figure 3 shows the classification tree for the personal pronouns (PRP). For reference, we also list the number of occurrences of each word for the POS tag. In the figure, we see that the algorithm distinguished between the subjective pronouns I, we, and they, and the objective pronouns me, us, and them. The pronouns you and it can take both cases and were probably clustered according to their most common usage in the corpus. The class low is used to group singleton words, which do not have enough training data to allow effective clustering. In using the word identities with the decision tree algorithm, we restrict the algorithm from asking word questions when the POS tag for the word is not uniquely determined by previous questions.</Paragraph>
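Before walking through an example tree, the following sketch shows how a hierarchical classification tree yields a binary encoding for each tag, and how an elementary question about one bit of that encoding picks out a partition of the tagset. The tiny hand-made tree, the tag subset, and the helper names are assumptions made for illustration; this is not the tree of Figure 2, nor the mutual-information clustering of Brown et al. (1992) that was used to build it.

```python
# A hand-made stand-in for a hierarchical classification tree over a few
# POS tags, represented as nested 2-tuples.  Illustration only.
TAG_TREE = ((("DT", "PRP$"), "WDT"), (("JJ", "CD"), ("CC", "PREP")))


def encode(tree, prefix=""):
    """Map each tag (leaf) to the binary string of its root-to-leaf path:
    0 = first branch, 1 = second branch."""
    if isinstance(tree, str):
        return {tree: prefix}
    codes = {}
    for bit, subtree in enumerate(tree):
        codes.update(encode(subtree, prefix + str(bit)))
    return codes


CODES = encode(TAG_TREE)


def bit_question(tag, position, value):
    """Elementary decision-tree question: does bit `position` (1-based) of
    the tag's binary encoding equal `value`?  Tags whose encoding is shorter
    than `position` fall on the 'no' side."""
    code = CODES[tag]
    return len(code) >= position and code[position - 1] == str(value)


if __name__ == "__main__":
    print(CODES)
    # The partition of tags that answer 'yes' to "is bit 1 of D equal to 1":
    print([t for t in CODES if bit_question(t, 1, 1)])
```

Questions of this form, and Boolean combinations of them, are exactly the kind of partition queries walked through in the example below.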
<Paragraph position="11"> Figure 4: The top part of the decision tree used for estimating the POS probability distribution. </Paragraph> <Paragraph position="12"> Figure 4 shows the top part of the decision tree grown for estimating the probability distribution of the POS tag of the current word. The question at the root node, &quot;is D^1_{i-1} = 0 \vee D^2_{i-1} = 1&quot;, asks whether the POS tag of the previous word has a 0 as the first bit or a 1 as the second bit of its binary encoding. If the answer is no, then the bottom branch is followed, which corresponds to the following partition: D_{i-1} \in {CC, PREP, JJ, JJS, JJR, CD, DT, PRP$, WDT}.</Paragraph> <Paragraph position="13"> Following the bottom branch of the decision tree, we see that the next question is &quot;is D^3_{i-1} = 1&quot;, which gives a true partition of D_{i-1} \in {JJ, JJS, JJR, CD, DT, PRP$, WDT}. Following the top branch, we see that the next question is &quot;is D^4_{i-1} = 1&quot;, whose true partition is D_{i-1} \in {DT, PRP$, WDT}. The next question along the top branch is &quot;is D^5_{i-1} = 1&quot;, which gives a true partition of D_{i-1} = WDT. As indicated in the figure, this is a leaf node, and so no suitable question was found to ask of this context.</Paragraph> </Section> </Section> </Paper>