POS Tagging versus Classes in Language Modeling 
Peter A. Heeman 
Computer Science and Engineering 
Oregon Graduate Institute 
PO Box 91000 Portland OR 97291 
heeman@cse.ogi.edu
Abstract 
Language models for speech recognition concentrate
solely on recognizing the words that were spo-
ken. In this paper, we advocate redefining the 
speech recognition problem so that its goal is to find 
both the best sequence of words and their POS tags, 
and thus incorporate POS tagging. The use of POS 
tags allows more sophisticated generalizations than 
are afforded by using a class-based approach. Fur- 
thermore, if we want to incorporate speech repair 
and intonational phrase modeling into the language 
model, using POS tags rather than classes gives bet- 
ter performance in this task. 
1 Introduction 
For recognizing spontaneous speech, the acoustic 
signal is too weak to narrow down the number of
word candidates. Hence, speech recognizers em- 
ploy a language model that prunes out acoustic al- 
ternatives by taking into account the previous words 
that were recognized. In doing this, the speech 
recognition problem is viewed as finding the most 
likely word sequence Ŵ given the acoustic signal
(Jelinek, 1985).

Ŵ = argmax_W Pr(W|A)    (1)
We can rewrite the above using Bayes' rule. 
Ŵ = argmax_W Pr(A|W) Pr(W) / Pr(A)    (2)
Since Pr(A) is independent of the choice of W, we 
simplify the above as follows. 
Ŵ = argmax_W Pr(A|W) Pr(W)    (3)
The first term, Pr(A|W), is the acoustic model and
the second term, Pr(W), is the language model,
which assigns a probability to the sequence of words 
W. We can rewrite W explicitly as a sequence of
words W_1 W_2 W_3 ... W_N, where N is the number of
words in the sequence. For expository ease, we use
the notation W_{i,j} to refer to the sequence of words
W_i to W_j. We now use the definition of conditional
probabilities to rewrite Pr(W_{1,N}) as follows.
Pr(W_{1,N}) = ∏_{i=1}^{N} Pr(W_i | W_{1,i-1})    (4)
To estimate the probability distribution, a train- 
ing corpus is typically used from which the proba- 
bilities can be estimated using relative frequencies. 
Due to sparseness of data, one must define equiv- 
alence classes amongst the contexts W_{1,i-1}, which
can be done by limiting the context to an n-gram 
language model (Jelinek, 1985). One can also mix 
in smaller size language models when there is not 
enough data to support the larger context by using 
either interpolated estimation (Jelinek and Mercer, 
1980) or a backoff approach (Katz, 1987). A way of 
measuring the effectiveness of the estimated proba- 
bility distribution is to measure the perplexity that it 
assigns to a test corpus (Bahl et al., 1977). Perplex- 
ity is an estimate of how well the language model 
is able to predict the next word of a test corpus in 
terms of the number of alternatives that need to be 
considered at each point. The perplexity of a test set
w_{1,N} is calculated as 2^H, where H is the entropy,
which is defined as follows.

H = - (1/N) ∑_{i=1}^{N} log2 Pr(w_i | w_{1,i-1})    (5)
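The relative-frequency estimation and perplexity computation above can be sketched as follows; the toy corpus, the add-alpha smoothing, and all names here are illustrative stand-ins, not the estimators used in this paper:

```python
import math
from collections import Counter

def train_bigram(corpus, vocab_size, alpha=1.0):
    """Estimate Pr(w_i | w_{i-1}) by relative frequency,
    with add-alpha smoothing so unseen bigrams get nonzero mass."""
    unigrams = Counter(corpus[:-1])
    bigrams = Counter(zip(corpus[:-1], corpus[1:]))
    def prob(prev, word):
        return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

def perplexity(prob, test):
    """Perplexity = 2^H, where H = -(1/N) sum_i log2 Pr(w_i | w_{i-1})."""
    log_sum = sum(math.log2(prob(prev, w)) for prev, w in zip(test[:-1], test[1:]))
    n = len(test) - 1
    return 2 ** (-log_sum / n)

train = "<s> the boxcar is at the station </s>".split()
test = "<s> the boxcar is at the station </s>".split()
p = train_bigram(train, vocab_size=len(set(train)))
print(perplexity(p, test))
```

A lower perplexity means the model spreads less probability mass over the alternatives at each point.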
1.1 Class-based Language Models 
The choice of equivalence classes for a language 
model need not be the previous words. Words 
can be grouped into classes, and these classes can 
be used as the basis of the equivalence classes of 
the context rather than the word identities (Jelinek, 
1985). Below we give the equation usually used for 
a class-based trigram model, where the function g 
maps each word to its unambiguous class. 
Pr(W_i | W_{1,i-1}) ≈ Pr(W_i | g(W_i)) Pr(g(W_i) | g(W_{i-1}) g(W_{i-2}))
Using classes has the potential of reducing the prob- 
lem of sparseness .of data by allowing generaliza- 
tions over similar words, as well as reducing the size 
of the language model. 
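Under this factorization, evaluating a class-based trigram probability only requires the word-given-class membership probabilities and a trigram model over classes. A minimal sketch, in which the class map g and all probabilities are invented toy numbers:

```python
# Toy class-based trigram:
#   Pr(W_i | W_{i-2} W_{i-1}) ~ Pr(W_i | g(W_i)) * Pr(g(W_i) | g(W_{i-2}) g(W_{i-1}))
g = {"can": "MODAL", "will": "MODAL", "i": "PRON", "you": "PRON",
     "help": "VERB", "load": "VERB"}

# Pr(word | class): membership probabilities within each class (invented)
p_word_given_class = {("help", "VERB"): 0.6, ("load", "VERB"): 0.4,
                      ("can", "MODAL"): 0.7, ("will", "MODAL"): 0.3,
                      ("i", "PRON"): 0.5, ("you", "PRON"): 0.5}

# Pr(class | previous two classes) (invented)
p_class = {(("MODAL", "PRON"), "VERB"): 0.8, (("MODAL", "PRON"), "PRON"): 0.2}

def class_trigram_prob(w2, w1, w):
    """Probability of word w given the two previous words, via classes."""
    return p_word_given_class[(w, g[w])] * p_class[((g[w2], g[w1]), g[w])]

print(class_trigram_prob("can", "i", "help"))  # 0.6 * 0.8
```

Note that the model only needs class trigram counts plus per-class word counts, which is the source of the parameter reduction mentioned above.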
To determine the word classes, one can use the 
algorithm of Brown et al. (1992), which finds the 
classes that give high mutual information between 
the classes of adjacent words. In other words, for 
each bigram w_{i-1} w_i in a training corpus, choose
the classes such that the classes for adjacent words 
g(wi-1) and g(wi) lose as little information about 
each other as possible. Brown et al. give a greedy al-
gorithm for finding the classes. They start with each 
word in a separate class and iteratively combine 
classes that lead to the smallest decrease in mutual 
information between adjacent words. Kneser and 
Ney (1993) found that a class-based language model 
results in a perplexity improvement for the LOB 
corpus from 541 for a word-based bigram model to
478 for a class-based bigram model. Interpolating 
the word-based and class-based models resulted in 
an improvement to 439. 
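A naive version of this greedy merging can be sketched as follows; it recomputes the full mutual information for every candidate merge, whereas the real algorithm of Brown et al. uses incremental updates to stay tractable (the corpus and all names are illustrative):

```python
import math
from collections import Counter
from itertools import combinations

def class_mi(bigrams, assign):
    """Mutual information between the classes of adjacent words."""
    pair = Counter((assign[a], assign[b]) for a, b in bigrams)
    left = Counter(c for c, _ in pair.elements())
    right = Counter(c for _, c in pair.elements())
    n = sum(pair.values())
    return sum(p / n * math.log2((p / n) / ((left[a] / n) * (right[b] / n)))
               for (a, b), p in pair.items())

def greedy_cluster(corpus, n_classes):
    """Start with one class per word; repeatedly merge the pair of classes
    whose merge loses the least mutual information (naive O(C^2) search)."""
    bigrams = list(zip(corpus[:-1], corpus[1:]))
    assign = {w: w for w in set(corpus)}
    while len(set(assign.values())) > n_classes:
        best, best_mi = None, float("-inf")
        for a, b in combinations(sorted(set(assign.values())), 2):
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            mi = class_mi(bigrams, trial)
            if mi > best_mi:
                best, best_mi = (a, b), mi
        a, b = best
        assign = {w: (a if c == b else c) for w, c in assign.items()}
    return assign

corpus = "we can take the boxcar we will take the engine".split()
assign = greedy_cluster(corpus, 4)
print(assign)
```

On this toy corpus, "can" and "will" (and likewise "boxcar" and "engine") have identical bigram contexts, so merging them loses no mutual information and they end up in the same class.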
1.2 POS-Based Models 
One can also use POS tags, which capture the syn- 
tactic role of each word, as the basis of the equiv- 
alence classes (Jelinek, 1985). Consider the se- 
quence of words "hello can I help you". Here, 
"hello" is being used as an acknowledgment, "can" 
as a modal verb, 'I' as a pronoun, "help" as an un-
tensed verb, and "you" as a pronoun. To use POS 
tags in language modeling, the typical approach is 
to sum over all of the POS possibilities. Below, we 
give the derivation based on using trigrams. 
Pr(W_{1,N})
  = ∑_{P_{1,N}} Pr(W_{1,N} P_{1,N})
  = ∑_{P_{1,N}} ∏_{i=1}^{N} Pr(W_i | W_{1,i-1} P_{1,i}) Pr(P_i | W_{1,i-1} P_{1,i-1})
  ≈ ∑_{P_{1,N}} ∏_{i=1}^{N} Pr(W_i | P_i) Pr(P_i | P_{i-2,i-1})    (6)
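The sum over all POS sequences in Equation 6 need not be enumerated explicitly: because the model is Markov in the tags, it can be computed incrementally with a forward pass. A sketch using a tag bigram (rather than trigram) model, with invented toy distributions:

```python
# Forward computation of Pr(W_{1,N}) = sum over tag sequences of
#   prod_i Pr(w_i | p_i) * Pr(p_i | p_{i-1})
# All distributions below are invented toy numbers.
TAGS = ["MD", "PRP", "VB"]
p_word = {("can", "MD"): 0.3, ("can", "VB"): 0.05,
          ("i", "PRP"): 0.4, ("help", "VB"): 0.2}
p_tag = {("<s>", "MD"): 0.2, ("<s>", "PRP"): 0.5, ("<s>", "VB"): 0.3,
         ("MD", "MD"): 0.2, ("MD", "PRP"): 0.6, ("MD", "VB"): 0.2,
         ("PRP", "MD"): 0.2, ("PRP", "PRP"): 0.1, ("PRP", "VB"): 0.7,
         ("VB", "MD"): 0.3, ("VB", "PRP"): 0.6, ("VB", "VB"): 0.1}

def forward_prob(words):
    """Sum over all POS sequences without enumerating them:
    alpha[t] = Pr(w_1 .. w_i, p_i = t)."""
    alpha = {t: p_tag[("<s>", t)] * p_word.get((words[0], t), 0.0) for t in TAGS}
    for w in words[1:]:
        alpha = {t: p_word.get((w, t), 0.0) *
                    sum(alpha[s] * p_tag[(s, t)] for s in TAGS)
                 for t in TAGS}
    return sum(alpha.values())

print(forward_prob(["can", "i", "help"]))
```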
The above approach for incorporating POS infor- 
mation into a language model has not been of much 
success in improving speech recognition perfor- 
mance. Srinivas (1996) reports that such a model re-
sults in a 24.5% increase in perplexity over a word- 
based model on the Wall Street Journal; Niesler and 
Woodland (1996) report an 11.3% increase (but a
22-fold decrease in the number of parameters of 
such a model) for the LOB corpus; and Kneser and 
Ney (1993) report a 3% increase on the LOB cor- 
pus. The POS tags remove too much of the lexical 
information that is necessary for predicting the next 
word. Only by interpolating it with a word-based 
model is an improvement seen (Jelinek, 1985). 
In the rest of the paper, we first describe the an- 
notations of the Trains corpus. We next present our 
POS-based language model and contrast its perfor- 
mance with a class-based model. We then augment 
these models to account for speech repairs and in- 
tonational phrases, and show that the POS-based one
performs better than the class-based one for model- 
ing speech repairs and intonational phrases. 
2 The Trains Corpus 
As part of the TRAINS project (Allen et al., 1995), 
a long term research project to build a conversa- 
tionally proficient planning assistant, we collected 
a corpus of problem solving dialogs (Heeman and 
Allen, 1995). The dialogs involve two human par- 
ticipants, one who is playing the role of a user and 
has a certain task to accomplish, and another who is 
playing the role of a planning assistant. The collec- 
tion methodology was designed to make the setting 
as close to human-computer interaction as possible, 
but was not a wizard scenario, where one person 
pretends to be a computer. Table 1 gives informa- 
tion about the corpus. 
Dialogs                  98
Speakers                 34
Turns                  6163
Words                 58298
Fragments               756
Distinct Words          859
Distinct Words/POS     1101
Singleton Words         252
Singleton Words/POS     350
Intonational Phrases  10947
Speech Repairs         2396

Table 1: Size of the Trains Corpus
2.1 POS Annotations 
Our POS tagset is based on the Penn Treebank 
tagset (Marcus et al., 1993), but modified to in- 
clude tags for discourse markers and end-of-turns, 
and to provide richer syntactic information (Hee- 
man, 1997). Table 2 lists our tagset with differ- 
ences from the Penn tagset marked in bold. Con- 
tractions are annotated using 'A' to conjoin the tag 
for each part; for instance, "can't" is annotated as 
'MDARB'. 
AC     Acknowledgement
BE     Base form of "be"
BED    Past tense
BEG    Present participle
BEN    Past participle
BEP    Present
BEZ    3rd person sing. pres.
CC     Co-ordinating conjunct
CC_D   Discourse connective
CD     Cardinal number
DO     Base form of "do"
DOD    Past tense
DOP    Present
DOZ    3rd person sing. present
DP     Pro-form
DT     Determiner
EX     Existential "there"
HAVE   Base form of "have"
HAVED  Past tense
HAVEP  Present
HAVEZ  3rd person sing. pres.
JJ     Adjective
JJR    Relative Adjective
JJS    Superlative Adjective
MD     Modal
NN     Noun
NNS    Plural noun
NNP    Proper Noun
NNPS   Plural proper Noun
PDT    Pre-determiner
POS    Possessive
PPREP  Pre-preposition
PREP   Preposition
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Relative Adverb
RBS    Superlative Adverb
RB_D   Discourse adverbial
RP     Reduced particle
SC     Subordinating conjunct
TO     To-infinitive
TURN   Turn marker
UH_D   Discourse interjection
UH_FP  Filled pause
VB     Verb base form (other than "do", "be", or "have")
VBD    Past tense
VBG    Present participle
VBN    Past participle
VBP    Present tense
VBZ    3rd person sing. pres.
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive Wh-pronoun
WRB    Wh-adverb

Table 2: Part-of-Speech Tags used in the Trains Corpus
2.2 Speech Repair Annotations 
Speech repairs occur where the speaker goes back 
and changes or repeats what was just said (Heeman, 
1997), as illustrated by the following. 
Example 1 (d92a-2.1 utt29) 
the one with the bananas   I mean   that's taking the bananas
\---- reparandum ----/ ip  \- et -/ \----- alteration -----/
Speech repairs have three parts (some of which are 
optional): the reparandum, which are the words the 
speaker wants to replace, an editing term, which 
helps mark the repair, and the alteration, which is 
the replacement of the reparandum. The end of the 
reparandum is referred to as the interruption point. 
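The three-part structure can be represented directly; a sketch in which the field names are ours, not part of the annotation scheme:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeechRepair:
    """A speech repair: reparandum + optional editing term + alteration.
    The interruption point falls at the end of the reparandum."""
    reparandum: List[str]
    editing_term: List[str]
    alteration: List[str]

    @property
    def interruption_point(self) -> int:
        # word index just after the last reparandum word
        return len(self.reparandum)

# Example 1 above, encoded in this representation
repair = SpeechRepair(
    reparandum="the one with the bananas".split(),
    editing_term="I mean".split(),
    alteration="that's taking the bananas".split(),
)
print(repair.interruption_point)  # 5
```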
For annotating speech repairs, we have extended 
the scheme proposed by Bear et al. (1992) so that 
it better deals with overlapping and ambiguous re- 
pairs. Like their scheme, ours allows the annotator 
to capture the word correspondences that exist be- 
tween the reparandum and the alteration. Below, 
we illustrate how a speech repair is annotated. In 
this example, the reparandum is "engine two from 
Elmi(ra)-", the editing term is "or", and the alter- 
ation is "engine three from Elmira". The word 
matches on "engine" and "from" are annotated with 
'm' and the word replacement of "two" by "three" 
is annotated with 'r'. 
Example 2 (d93-15.2 utt42) 
engine two from Elmi(ra)-   or   engine three from Elmira
m1     r2  m3   m4          et   m1     r2    m3   m4
                       ip:mod+
2.3 Intonation Annotations 
Speakers break up their speech into intonational
phrases. This segmentation serves a similar purpose 
as punctuation does in written speech. The ToBI 
annotation scheme (Silverman et al., 1992) involves 
labeling the accented words, intermediate phrases 
and intonational phrases with high and low accents. 
Since we are currently only interested in the intona- 
tional phrase segmentation, we only label the into- 
national phrase endings. 
3 POS-Based Language Model 
In this section, we present an alternative formulation 
for using POS tags in a statistical language model. 
Here, POS tags are viewed as part of the output of 
the speech recognizer, rather than intermediate ob- 
jects (Heeman and Allen, 1997a; Heeman, 1997). 
3.1 Redefining the Recognition Problem 
To add POS tags into the language model, we refrain 
from simply summing over all POS sequences as 
illustrated in Section 1.2. Instead, we redefine the 
speech recognition problem so that it finds the best 
word and POS sequence. Let P be a POS sequence 
for the word sequence W. The goal of the speech 
recognizer is to now solve the following. 
ŴP̂ = argmax_{W,P} Pr(WP|A)
    = argmax_{W,P} Pr(A|WP) Pr(WP) / Pr(A)
    = argmax_{W,P} Pr(A|WP) Pr(WP)    (7)
The first term Pr(A|WP) is the acoustic model,
which traditionally excludes the category assign-
ment. In fact, the acoustic model can probably
be reasonably approximated by Pr(A|W). The
second term Pr(WP) is the POS-based language 
model and this accounts for both the sequence of 
words and the POS assignment for those words. We 
rewrite the sequence WP explicitly in terms of the 
N words and their corresponding POS tags, thus 
giving us the sequence W_{1,N} P_{1,N}. The probabil-
ity Pr(W_{1,N} P_{1,N}) forms the basis for POS taggers,
with the exception that POS taggers work from a 
sequence of given words. 
As in Equation 4, we rewrite the probability
Pr(W_{1,N} P_{1,N}) as follows using the definition of
conditional probability.

Pr(W_{1,N} P_{1,N})
  = ∏_{i=1}^{N} Pr(W_i P_i | W_{1,i-1} P_{1,i-1})
  = ∏_{i=1}^{N} Pr(W_i | W_{1,i-1} P_{1,i}) Pr(P_i | W_{1,i-1} P_{1,i-1})    (8)
Equation 8 involves two probability distributions 
that need to be estimated. Previous attempts at us- 
ing POS tags in a language model as well as POS 
taggers (i.e. (Charniak et al., 1993)) simplify these 
probability distributions, as given in Equations 9 
and 10. However, to successfully incorporate POS 
information, we need to account for the full richness 
of the probability distributions. Hence, we cannot 
use these two assumptions when learning the prob- 
ability distributions. 
Pr(W_i | W_{1,i-1} P_{1,i}) ≈ Pr(W_i | P_i)    (9)
Pr(P_i | W_{1,i-1} P_{1,i-1}) ≈ Pr(P_i | P_{1,i-1})    (10)
3.2 Estimating the Probabilities 
To estimate the probability distributions, we follow 
the approach of Bahl et al. (1989) and use a deci- 
sion tree learning algorithm (Breiman et al., 1984) 
to partition the context into equivalence classes. The 
algorithm starts with a single node. It then finds a 
question to ask about the node in order to partition 
the node into two leaves, each being more informa- 
tive as to which event occurred than the parent node. 
Information theoretic metrics, such as minimizing 
entropy, are used to decide which question to pro- 
pose. The proposed question is then verified using 
heldout data: if the split does not lead to a decrease 
in entropy according to the heldout data, the split is 
rejected and the node is not further explored (Bahl 
et al., 1989). This process continues with the new 
leaves and results in a hierarchical partitioning of 
the context. 
After growing a tree, the next step is to use the 
partitioning of the context induced by the decision 
tree to determine the probability estimates. Using 
the relative frequencies in each node will be biased 
towards the training data that was used in choosing 
the questions. Hence, Bahl et al. smooth these prob- 
abilities with the probabilities of the parent node us- 
ing interpolated estimation with a second heldout 
dataset. 
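The grow-and-verify loop can be sketched as follows; the entropy-gain criterion and the heldout check mirror the description above, while the toy data, the question set, and all names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of event labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def split_gain(data, question):
    """Entropy reduction from partitioning (context, label) pairs by a yes/no question."""
    labels = [y for _, y in data]
    yes = [y for x, y in data if question(x)]
    no = [y for x, y in data if not question(x)]
    n = len(labels)
    return entropy(labels) - (len(yes) / n) * entropy(yes) - (len(no) / n) * entropy(no)

def grow(train, heldout, questions):
    """Pick the best question on training data, but keep the split only if it
    also reduces entropy on heldout data; otherwise the node becomes a leaf."""
    best = max(questions, key=lambda q: split_gain(train, q))
    if split_gain(train, best) <= 0 or split_gain(heldout, best) <= 0:
        return None  # split rejected: node not explored further
    return best

# toy contexts: previous POS tag; labels: next POS tag
train = [("MD", "PRP"), ("MD", "PRP"), ("PRP", "VB"), ("PRP", "VB"), ("MD", "VB")]
heldout = [("MD", "PRP"), ("PRP", "VB")]
questions = [lambda x: x == "MD", lambda x: x == "PRP"]
print(grow(train, heldout, questions) is not None)
```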
Using the decision tree algorithm to estimate 
probabilities is attractive since the algorithm can 
choose which parts of the context are relevant, and 
in what order. Hence, this approach lends itself 
more readily to allowing extra contextual informa- 
tion to be included, such as both the word identi- 
fies and POS tags, and even hierarchical clusterings 
of them. If the extra information is not relevant, it 
will not be used. The approach of using decision 
trees will become even more critical in the next two 
sections where the probability distributions will be 
conditioned on even richer context. 
3.2.1 Simple Questions 
One important aspect of using a decision tree algo-
rithm is the form of the questions that it is allowed to 
ask. We allow two basic types of information to be 
used as part of the context: numeric and categorical. 
For a numeric variable N, the decision tree searches 
for questions of the form 'is N >= n', where n is 
a numeric constant. For a categorical variable C, 
it searches over questions of the form 'is C ∈ S',
where S is a subset of the possible values of C. We 
also allow restricted boolean combinations of ele- 
mentary questions (Bahl et al., 1989). 
3.2.2 Questions about POS Tags 
The context that we use for estimating the probabil- 
ities includes both word identities and POS tags. To 
make effective use of this information, we need to 
allow the decision tree algorithm to generalize be- 
tween words and POS tags that behave similarly. 
To learn which words behave similarly, Black et
al. (1992) and Magerman (1994) used the clustering
algorithm of Brown et al. (1992) to build a hierar- 
chical classification tree. Figure 1 gives the clas- 
sification tree that we built for the POS tags. The 
algorithm starts with each token in a separate class 
and iteratively finds two classes to merge that re- 
sults in the smallest loss of information about POS
adjacency. Rather than stopping at a certain number 
of classes, one continues until only a single class 
remains. However, the order in which classes were 
merged gives a hierarchical binary tree with the root 
corresponding to the entire tagset, each leaf to a sin- 
gle POS tag, and intermediate nodes to groupings of 
tags that are statistically similar. The path from the 
root to a tag gives the binary encoding for the tag. 
The decision tree algorithm can ask which partition 
a word belongs to by asking questions about the bi- 
nary encoding. 
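Deriving the binary encodings from the merge history can be sketched as follows, representing the hierarchical tree as nested pairs (the toy tree here is much smaller than the one in Figure 1):

```python
# Derive binary codes for POS tags from a hierarchical merge tree.
# Each internal node has exactly two children; leaves are single tags.
# (Toy tree; the real tree covers the full tagset.)
tree = ((("NN", "NNS"), ("VB", "VBZ")), ("PRP", "MD"))

def codes(node, prefix=""):
    """Map each leaf tag to the bit string of its root-to-leaf path."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    out = codes(left, prefix + "0")
    out.update(codes(right, prefix + "1"))
    return out

enc = codes(tree)
print(enc["NN"], enc["MD"])  # 000 11

def in_partition(tag, k, bit):
    """A decision-tree question: 'is bit k of the tag's code equal to bit?'
    Each such question picks out a subtree, i.e. a cluster of similar tags."""
    c = enc[tag]
    return len(c) > k and c[k] == bit
```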
Figure 1: Classification Tree for POS Tags 
3.2.3 Questions about Word Identities 
For handling word identities, one could follow 
the approach used for handling the POS tags 
(e.g. (Black et al., 1992; Magerman, 1994)) and
view the POS tags and word identities as two sep- 
arate sources of information. Instead, we view the 
word identities as a further refinement of the POS
tags. We start the clustering algorithm with a sep- 
arate class for each word and each POS tag that it 
takes on and only allow it to merge classes if the 
POS tags are the same. This results in a word clas- 
sification tree for each POS tag. Building a word 
classification tree for each POS tag means that the 
tree will not be polluted by words that are ambigu- 
ous as to their POS tag, as exemplified by the word 
"loads", which is used in the Trains corpus as both 
a third-person present tense verb VBZ and as a plu-
ral noun NNS. Furthermore, building a tree for each
POS tag simplifies the task because the hand an- 
notations of the POS tags resolve a lot of the dif- 
ficulty that the algorithm would otherwise have to 
handle. This allows effective trees to be built even 
when only a small amount of data is available. 
[Figure 2 shows the word classification tree for the personal
pronouns, with occurrence counts for each word, including:
it 64, them 157, me 85, us 176, they 89, we 766; rare words
are grouped into the class low.]
Figure 2: Classification Tree for Personal Pronouns 
Figure 2 shows the classification tree for the per- 
sonal pronouns (PRP). For reference, we list the 
number of occurrences of each word. Notice that 
the algorithm distinguished between the subjective 
pronouns 'I', 'we', and 'they', and the objective pro-
nouns 'me', 'us' and 'them'. The pronouns 'you' 
and 'it' take both cases and were probably clustered 
according to their most common usage in the cor- 
pus. Although we could have added extra POS tags 
to distinguish between these two types of pronouns, 
it seems that the clustering algorithm can make up 
for some of the shortcomings of the POS tagset. The 
class low is used to group singleton words. 
3.3 Results 
Before giving a comparison between our POS-based 
model and a class-based model, we first describe the 
experimental setup and define the perplexity mea- 
sures that we use to measure the performance. 
3.3.1 Experimental Setup 
To make the best use of our limited data, we used 
a six-fold cross-validation procedure: each sixth of 
the data was tested using a model built from the re- 
maining data. Changes in speaker are marked in the 
word transcription with the special token <turn>. 
We treat contractions, such as "that'll" and "gonna",
as separate words, treating them as "that" and "'ll"
for the first example, and "going" and "ta" for the
second.¹ We also changed all word fragments into
the token <fragment>. 
Since current speech recognition rates for sponta- 
neous speech are quite low, we have run the exper- 
iments on the hand-collected transcripts. In search- 
ing for the best sequence of POS tags for the tran- 
scribed words, we follow the technique proposed 
by Chow and Schwartz (1989) and only keep a 
small number of alternative paths by pruning the 
low probability paths after processing each word. 
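This pruning can be sketched as a simple beam search over tag sequences; the scoring function below is an invented stand-in for the decision-tree probability estimates:

```python
import math

def beam_tag(words, tags, logprob, beam=3):
    """Find a high-probability tag sequence for a fixed word sequence,
    keeping only the `beam` best partial paths after each word
    (Chow & Schwartz-style pruning). `logprob(path, word, tag)` is a
    stand-in for the model's probability estimates."""
    paths = [([], 0.0)]
    for w in words:
        extended = [(p + [t], score + logprob(p, w, t))
                    for p, score in paths for t in tags]
        extended.sort(key=lambda x: x[1], reverse=True)
        paths = extended[:beam]  # prune low-probability paths
    return paths[0]

# toy scorer: prefer MD sentence-initially, PRP after MD, VB after PRP
def toy_logprob(path, word, tag):
    prev = path[-1] if path else "<s>"
    good = {("<s>", "MD"), ("MD", "PRP"), ("PRP", "VB")}
    return math.log(0.8) if (prev, tag) in good else math.log(0.1)

best_path, score = beam_tag(["can", "i", "help"], ["MD", "PRP", "VB"], toy_logprob)
print(best_path)
```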
3.3.2 Branching Perplexity 
Our POS-based model is not only predicting the 
next word, but its POS tag as well. To estimate 
¹See Heeman and Damnati (1997) for how to treat contrac-
tions as separate words in a speech recognizer.
the branching factor, and thus the size of the search 
space, we use the following formula for the entropy, 
where d_i is the POS tag for word w_i.

H = - (1/N) ∑_{i=1}^{N} log2 Pr(w_i | w_{1,i-1} d_{1,i}) Pr(d_i | w_{1,i-1} d_{1,i-1})
3.3.3 Word Perplexity 
In order to compare a POS-based model against a 
traditional language model, we should not penalize 
the POS-based model for incorrect POS tags, and 
hence we should ignore them when defining the per- 
plexity. Just as with a traditional model, we base the 
perplexity measure on Pr(w_i | w_{1,i-1}). However, for
our model, this probability is not estimated. Hence, 
we must rewrite it in terms of the probabilities that 
we do estimate. To do this, our only recourse is to 
sum over all possible POS sequences. 
H = - (1/N) ∑_{i=1}^{N} log2 [ ∑_{D_{1,i}} Pr(w_i D_i | w_{1,i-1} D_{1,i-1}) Pr(w_{1,i-1} D_{1,i-1}) / ∑_{D_{1,i-1}} Pr(w_{1,i-1} D_{1,i-1}) ]
3.3.4 Using Richer Context 
Table 3 shows the effect of varying the richness of 
the information that the decision tree algorithm is 
allowed to use in estimating the POS and word prob- 
abilities. The second column uses the approxima- 
tions given in Equation 9 and 10. The third col- 
umn gives the results using the full context. The 
results show that adding the extra context has the 
biggest effect on the perplexity measures, decreas- 
ing the word perplexity from 43.22 to 24.04, a re- 
duction of 44.4%. The effect on POS tagging is less 
pronounced, but still gives an error rate reduction of 
3.8%. Hence, to use POS tags during speech recog- 
nition, one must use a richer context for estimating 
the probabilities than what is typically used. 
                      Eq. 9 & 10     Full Context
Context for W_i       D_i            D_{i-2,i} W_{i-2,i-1}
Context for D_i       D_{i-2,i-1}    D_{i-2,i-1} W_{i-2,i-1}
POS Errors            1778           1711
POS Error Rate        3.04           2.93
Word Perplexity       43.22          24.04
Branching Perplexity  47.25          26.35

Table 3: Using Richer Context
3.3.5 Class-Based Decision-Tree Models 
In this section, we compare the POS-based model 
against a class-based model. To make the compari- 
son as focused as possible, we use the same method- 
ology for estimating the probability distributions as 
we used for the POS-based model. The classes were 
obtained from the word clustering algorithm, but 
stopping once a certain number of classes has been 
reached. Unfortunately, the clustering algorithm of 
Brown et al. does not have a mechanism to decide 
an optimal number of word classes (cf. (Kneser and 
Ney, 1993)). Hence, to give an optimal evaluation 
of the class-based approach, we choose the num- 
ber of classes that gives the best perplexity results, 
which was 100 classes. We then built word clas- 
sification trees, just as we did for the POS-based 
approach, where words from different classes are 
not allowed to be merged. The resulting class-based 
model achieved a perplexity of 25.24 in compari- 
son to 24.04 for the POS-based model. This im- 
provement is due to two factors. First, tracking the 
syntactic role of each word gives valuable informa- 
tion for predicting the subsequent words. Second, 
the classification trees for the POS-based approach, 
which the decision tree algorithm uses to determine 
the equivalence classes, are of higher quality. This 
is due to the POS-based classification trees using the 
hand-annotated POS information, since they take 
advantage of the hand-coded knowledge present in 
the POS tags and are not polluted by words that take 
on more than one syntactic role. 
3.3.6 Preliminary Wall Street Journal Results 
For building a system that partakes in dialogue, 
read-speech corpora, such as the Wall Street Jour- 
nal, are not appropriate. However, to make our 
results more comparable to the literature, we have 
done preliminary tests on the Wall Street Journal 
corpus in the Penn Treebank, which has POS an- 
notations. This corpus has a significantly larger vo- 
cabulary size (55800 words) than the Trains corpus. 
Our current algorithm for clustering the words takes 
space in proportion to the square of the number of 
unique word/POS combinations (minus any that get 
grouped into the low occurring class). More work 
is needed to handle larger vocabulary sizes. Us- 
ing 78,800 words of data, with a vocabulary size 
of 9711, we achieved a perplexity of 250.75 on 
the known words in comparison to a trigram word- 
based backoff model (Katz, 1987) built with the 
CMU toolkit (Rosenfeld, 1995), which achieved a 
perplexity of 296.43. More work is needed to see if 
these results scale up to larger vocabulary and train- 
ing data sizes. 
4 Adding Repairs and Phrasing 
Just as we redefined the speech recognition prob- 
lem so as to account for POS tagging, we do the 
same for modeling intonational phrases and speech 
repairs. We introduce null tokens between each pair 
of words w_{i-1} and w_i (Heeman and Allen, 1997b),
which will be tagged as to the occurrence of these 
events. The variable T_i indicates if word w_{i-1} ends
an intonational phrase (T_i=%), or not (T_i=null).
For detecting speech repairs, we have the prob- 
lem that repairs are often accompanied by an edit- 
ing term, such as "urn", "uh", "okay", or "well", 
and these must be identified as such. Furthermore, 
an editing term might be composed of a number 
of words, such as "let's see" or "uh well". Hence 
we use two tags: an editing term tag E_i and a re-
pair tag R_i. The editing term tag indicates if w_i
starts an editing term (E_i=Push), if w_i continues an
editing term (E_i=ET), if w_{i-1} ends an editing term
(E_i=Pop), or otherwise (E_i=null). The repair tag
R_i indicates whether word w_i is the onset of the al-
teration of a fresh start (R_i=C), a modification re-
pair (R_i=M), or an abridged repair (R_i=A), or there
is no repair (R_i=null). Note that for repairs with
an editing term, the repair is tagged after the extent 
of the editing term has been determined. Below we 
give an example showing all non-null tone, editing 
term and repair tags. 
Example 3 (d93-18.1 utt47) 
it takes one Push you ET know Pop M two hours % 
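The tag inventory can be written down directly; a sketch encoding Example 3, where the enum names and the per-word tuple layout are ours:

```python
from enum import Enum

class Tone(Enum):
    BOUNDARY = "%"      # the previous word ends an intonational phrase
    NULL = None

class Edit(Enum):
    PUSH = "Push"       # w_i starts an editing term
    ET = "ET"           # w_i continues an editing term
    POP = "Pop"         # w_{i-1} ends an editing term
    NULL = None

class Repair(Enum):
    FRESH_START = "C"   # w_i is the alteration onset of a fresh start
    MODIFICATION = "M"
    ABRIDGED = "A"
    NULL = None

# Example 3 as (word, tone, edit, repair), tags attached to the slot
# before each word; "<end>" carries the final boundary tone.
utt = [("it", Tone.NULL, Edit.NULL, Repair.NULL),
       ("takes", Tone.NULL, Edit.NULL, Repair.NULL),
       ("one", Tone.NULL, Edit.NULL, Repair.NULL),
       ("you", Tone.NULL, Edit.PUSH, Repair.NULL),
       ("know", Tone.NULL, Edit.ET, Repair.NULL),
       ("two", Tone.NULL, Edit.POP, Repair.MODIFICATION),
       ("hours", Tone.NULL, Edit.NULL, Repair.NULL),
       ("<end>", Tone.BOUNDARY, Edit.NULL, Repair.NULL)]
n_edit = sum(1 for _, _, e, _ in utt if e is not Edit.NULL)
print(n_edit)  # 3 non-null editing-term tags
```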
If a modification repair or fresh start occurs, we 
need to determine the extent (or the onset) of the 
reparandum, which we refer to as correcting the 
speech repair. Often, speech repairs have strong 
word correspondences between the reparandum and 
alteration, involving word matches and word re- 
placements. Hence, knowing the extent of the 
reparandum means that we can use the reparandum 
to predict the words (and their POS tags) that make 
up the alteration. In our full model, we add three 
variables to account for the correction of speech re- 
pairs (Heeman and Allen, 1997b; Heeman, 1997). 
We also add an extra variable to account for silences 
between words. After a silence has occurred, we can 
use the silence to better predict whether an intona- 
tional boundary or speech repair has just occurred. 
Below we give the redefinition of the speech 
recognition problem (without speech repair correc- 
tion and silence information). The speech recog- 
nition problem is redefined so that its goal is to find 
the maximal assignment for the words as well as the 
POS, intonational, and repair tags. 
ŴP̂R̂ÊT̂ = argmax_{W,P,R,E,T} Pr(WPRET|A)
Just as we did in Equation 8, we rewrite the above in 
terms of five probability distributions, each of which 
need to be estimated. The context for each of the 
probability distributions includes all of the previous 
context. In principle, we could give all of this con-
text to the decision tree algorithm and let it decide 
what information is relevant in constructing equiva- 
lence classes of the contexts. However, the amount 
of training data is limited (as are the learning tech- 
niques) and so we need to encode the context in 
order to simplify the task of constructing meaning- 
ful equivalence classes. Hence we restructure the 
context to take into account the speech repairs and 
boundary tones (Heeman, 1997). 
4.1 Results 
We now contrast the performance of augmenting the 
POS-based model with speech repair and intona- 
tional modeling versus augmenting the class-based 
model. Just as in Section 3, all results were obtained 
using a six-fold cross-validation procedure from
the hand-collected transcripts. We ran these tran-
scripts through a word-aligner (Ent, 1994), a speech 
recognizer constrained to recognize what was tran- 
scribed, in order to automatically obtain silence 
durations. In predicting the end of turn marker 
<turn>, we do not use any silence information. 
4.1.1 Recall and Precision 
We report results on identifying speech repairs and 
intonational phrases in terms of recall, precision 
and error rate. The recall rate is the number of times 
that the algorithm correctly identifies an event over 
the total number of times that it actually occurred. 
The precision rate is the number of times the algo- 
rithm correctly identifies it over the total number of 
times it identifies it. The error rate is the number 
of errors in identifying an event over the number of 
times that the event occurred. 
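These three measures can be computed as follows; the counts in the usage example are invented for illustration:

```python
def scores(n_correct, n_identified, n_actual, n_errors):
    """Recall, precision, and error rate as defined above.
    The error rate counts all errors (misses plus false positives)
    against the number of actual events, so it can exceed 100%."""
    recall = n_correct / n_actual
    precision = n_correct / n_identified
    error_rate = n_errors / n_actual
    return recall, precision, error_rate

# e.g. 80 of 100 actual events found, 90 identified in total
# (10 false positives): 20 misses + 10 false positives = 30 errors
r, p, e = scores(80, 90, 100, 30)
print(r, p, e)
```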
4.1.2 POS Tagging and Perplexity 
The first set of experiments, whose results are given 
in Table 4, explore how POS tagging and word per- 
plexity benefit from modeling boundary tones and 
speech repairs. The second column gives the re- 
suits of the POS-based language model, introduced 
in Section 3. The third column adds in speech re- 
pair detection and correction, boundary tone identi- 
fication, and makes use of silence information in de- 
tecting speech repairs and boundary tones. We see 
that this results in a perplexity reduction of 7.0%, 
and a POS error reduction of 8.1%. As we further
improve the modeling of the user's utterance, we 
                      POS      Full Model
POS Errors            1711     1572
POS Error Rate        2.93     2.69
Word Perplexity       24.04    22.35
Branching Perplexity  26.35    30.26

Table 4: POS Tagging and Perplexity
expect to see further improvements in the language 
model. Of course, there is a penalty to pay in terms 
of increased search space size, as the increase in the 
branching perplexity shows. 
4.1.3 Intonational Phrases 
In Table 5, we demonstrate that modeling intona- 
tional phrases benefits from modeling POS tags. 
Column two gives the results of augmenting the 
class-based model of Section 3.3.5 with intonational 
phrase modeling and column three gives the results 
of augmenting the POS-based model. Contrasting 
the results in column two with those in column 
three, we see that using the POS-based model re-
sults in a reduction in the error rate of 17.2% over 
the class-based model. Hence, we see that modeling 
the POS tags allows much better modeling of into- 
national phrases than can be achieved with a class- 
based model. The fourth column reports the results 
using the full model, which accounts for interac- 
tions with speech repairs and the benefit of using 
silence information (Heeman and Allen, 1997b). 
            Class-Based  POS-Based  Full
            Tones        Tones      Model
Errors      4859         4024       3632
Error Rate  44.38        36.75      33.17
Recall      74.55        81.72      84.76
Precision   79.74        81.55      82.53

Table 5: Detecting Intonational Phrase Boundaries
4.1.4 Detecting Speech Repairs 
In Table 6, we demonstrate that modeling the de- 
tection of speech repairs (and editing terms) bene- 
fits from modeling POS tags. In the results below, 
we ignore errors that are the result of improperly 
identifying the type of repair, and hence score a re- 
pair as correctly detected as long as it was identi-
fied as either an abridged repair, modification re- 
pair or fresh start. Column two gives the results of 
augmenting the class-based model of Section 3.3.5 
with speech repair modeling and column three gives 
the results of augmenting the POS-based model. In 
            Class-Based  POS-Based  Full
            Repairs      Repairs    Model
Errors      1246         1106       839
Error Rate  52.00        46.16      35.01
Recall      64.98        68.61      76.79
Precision   79.27        82.28      86.66

Table 6: Detecting Speech Repairs
terms of overall detection, the POS-based model re- 
duces the error rate from 52.0% to 46.2%, a reduc- 
tion of 11.2%. This shows that speech repair de- 
tection profits from being able to make use of syn- 
tactic generalizations, which are not available from 
a class-based approach. The final column gives the 
results of the full model, which accounts for interac- 
tions with speech repair correction and intonational 
phrasing, and uses silence information. 
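The relative reductions quoted in Sections 4.1.3 and 4.1.4 are plain ratios of error rates, which can be checked directly:

```python
def relative_reduction(baseline, improved):
    """Percent reduction in error rate relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline

# Speech repairs (Table 6): class-based 52.00 vs. POS-based 46.16
print(round(relative_reduction(52.00, 46.16), 1))  # 11.2
# Intonational phrases (Table 5): class-based 44.38 vs. POS-based 36.75
print(round(relative_reduction(44.38, 36.75), 1))  # 17.2
```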
5 Conclusion 
In this paper, we presented a POS-based language 
model. Unlike previous approaches that use POS 
tags in language modeling, we redefine the speech 
recognition problem so that it includes finding the 
best word sequence and best POS tag interpretation 
for those words. Thus this work can be seen as a 
first step towards tightening the integration between
speech recognition and natural language processing. 
In order to make use of the POS tags, we use 
a decision tree algorithm to learn the probability 
distributions, and a clustering algorithm to build 
hierarchical partitionings of the POS tags and the 
word identities. Furthermore, we take advantage 
of the POS tags in building the word classification 
trees and in estimating the word probabilities, which
both improves performance and significantly
speeds up the training procedure. We find that using
the rich context afforded by the decision tree results
in a perplexity reduction of 44.4%. We also find
that the POS-based model gives a 4.2% reduction in 
perplexity over a class-based model, also built with 
the decision tree and clustering algorithms. Prelim- 
inary results on the Wall Street Journal corpus are 
also encouraging. Hence, using a POS-based model
not only improves the language model but also
accomplishes the first part of the task of linguistic
understanding.
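The perplexity figures quoted above follow the standard definition (Bahl et al., 1977): the geometric mean of the inverse conditional word probabilities. A minimal sketch with toy probabilities:

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence given each word's conditional
    probability Pr(w_i | w_1..i-1), i.e. Pr(W)^(-1/N)."""
    n = len(word_probs)
    log_prob = sum(math.log2(p) for p in word_probs)
    return 2.0 ** (-log_prob / n)

# Toy example: each word predicted with probability 1/4
print(perplexity([0.25, 0.25, 0.25]))  # 4.0
```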
We also see that using POS tags in the language 
model aids in the identification of boundary tones 
and speech repairs, which we have also incorpo- 
rated into the model by further redefining the speech 
recognition problem. The POS tags allow these two 
processes to generalize about the syntactic role that 
words are playing in the utterance rather than using 
crude class-based approaches, which do not
distinguish this information. We also see that modeling
these phenomena improves the POS tagging results 
as well as the word perplexity. 
6 Acknowledgments 
We wish to thank Géraldine Damnati. The research
involved in this paper was done while the first author
was visiting at CNET, France Télécom.

References 
J. Allen, L. Schubert, G. Ferguson, P. Heeman, 
C. Hwang, T. Kato, M. Light, N. Martin, B. Miller, 
M. Poesio, and D. Traum. 1995. The Trains project: 
A case study in building a conversational planning 
agent. Journal of Experimental and Theoretical AI,
7:7-48.
L. Bahl, J. Baker, F. Jelinek, and R. Mercer. 1977.
Perplexity: a measure of the difficulty of speech
recognition tasks. In Proceedings of the Meeting of 
the Acoustical Society of America. 
L. Bahl, P. Brown, P. deSouza, and R. Mercer. 1989. A
tree-based statistical language model for natural 
language speech recognition. IEEE Transactions on 
Acoustics, Speech, and Signal Processing, 
36(7): 1001-1008. 
J. Bear, J. Dowding, and E. Shriberg. 1992. Integrating 
multiple knowledge sources for detection and 
correction of repairs in human-computer dialog. In 
Proceedings of the 30th Annual Meeting of the
Association for Computational Linguistics, pages
56-63.
E. Black, F. Jelinek, J. Lafferty, D. Magerman,
R. Mercer, and S. Roukos. 1992. Towards 
history-based grammars: Using richer models for 
probabilistic parsing. In Proceedings of the DARPA 
Speech and Natural Language Workshop, pages 
134-139. 
L. Breiman, J. Friedman, R. Olshen, and C. Stone.
1984. Classification and
Regression Trees. Wadsworth & Brooks. 
P. Brown, V. Della Pietra, P. deSouza, J. Lai, and 
R. Mercer. 1992. Class-based n-gram models
of natural language. Computational Linguistics,
18(4):467-479.
E. Charniak, C. Hendrickson, N. Jacobson, and 
M. Perkowitz. 1993. Equations for part-of-speech 
tagging. In Proceedings of the National Conference 
on Artificial Intelligence. 
Y. Chow and R. Schwartz. 1989. The n-best algorithm: 
An efficient procedure for finding top N sentence
hypotheses. In Proceedings of the DARPA Speech 
and Natural Language Workshop, pages 199-202. 
Entropic Research Laboratory, Inc., 1994. Aligner 
Reference Manual. Version 1.3. 
P. Heeman and J. Allen. 1995. The Trains spoken 
dialog corpus. CD-ROM, Linguistics Data 
Consortium. 
P. Heeman and J. Allen. 1997a. Incorporating POS
tagging into language modeling. In Proceedings of 
the European Conference on Speech Communication 
and Technology, pages 2767-2770. 
P. Heeman and J. Allen. 1997b. Intonational
boundaries, speech repairs, and discourse markers: 
Modeling spoken dialog. In Proceedings of the 
Annual Meeting of the Association for Computational 
Linguistics, pages 254-261.
P. Heeman and G. Damnati. 1997. Deriving
phrase-based language models. In IEEE Workshop 
on Speech Recognition and Understanding, pages 
41-48. 
P. Heeman. 1997. Speech repairs, intonational
boundaries and discourse markers: Modeling 
speakers' utterances in spoken dialog. TR 673, Dept. 
of Computer Science, U. of Rochester. Doctoral 
dissertation. 
F. Jelinek and R. Mercer. 1980. Interpolated estimation 
of Markov source parameters from sparse data. In
Proceedings, Workshop on Pattern Recognition in 
Practice, pages 381-397. 
F. Jelinek. 1985. Self-organized language modeling for 
speech recognition. Technical report, IBM 
T.J. Watson Research Center, Continuous Speech 
Recognition Group, Yorktown Heights, NY. 
S. Katz. 1987. Estimation of probabilities from sparse 
data for the language model component of a speech 
recognizer. IEEE Transactions on Acoustics, Speech, 
and Signal Processing, 35(3):400-401. 
R. Kneser and H. Ney. 1993. Improved clustering 
techniques for class-based statistical language 
modelling. In Proceedings of the European 
Conference on Speech Communication and 
Technology, pages 973-976. 
D. Magerman. 1994. Natural language parsing as
statistical pattern recognition. Doctoral dissertation, 
Dept. of Computer Science, Stanford. 
M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. 
Building a large annotated corpus of English: The 
Penn Treebank. Computational Linguistics, 
19(2):313-330. 
T. Niesler and P. Woodland. 1996. A variable-length 
category-based n-gram language model. In 
Proceedings of the International Conference on 
Audio, Speech and Signal Processing, pages 
164-167. 
R. Rosenfeld. 1995. The CMU statistical language 
modeling toolkit and its use in the 1994 ARPA CSR 
evaluation. In Proceedings of the ARPA Spoken 
Language Systems Technology Workshop. 
K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, 
C. Wightman, P. Price, J. Pierrehumbert, and 
J. Hirschberg. 1992. ToBI: A standard for labelling 
English prosody. In Proceedings of the 2nd 
International Conference on Spoken Language 
Processing, pages 867-870. 
B. Srinivas. 1996. "Almost parsing" techniques for 
language modeling. In Proceedings of the 
International Conference on Spoken Language 
Processing, pages 1169-1172. 
