Utterance Segmentation Using Combined Approach  
Based on Bi-directional N-gram and Maximum Entropy 
 
Ding Liu 
National Laboratory of Pattern Recognition
Institute of Automation 
Chinese Academy of Sciences 
Beijing 100080, China. 
dliu@nlpr.ia.ac.cn 
Chengqing Zong 
National Laboratory of Pattern Recognition
Institute of Automation 
Chinese Academy of Sciences 
Beijing 100080, China. 
cqzong@nlpr.ia.ac.cn  
 
 
 
Abstract 
This paper proposes a new approach to 
segmentation of utterances into sentences 
using a new linguistic model based upon 
Maximum-entropy-weighted Bi-
directional N-grams. The usual N-gram 
algorithm searches for sentence bounda-
ries in a text from left to right only. Thus 
a candidate sentence boundary in the text 
is evaluated mainly with respect to its left 
context, without fully considering its right 
context. Using this approach, utterances 
are often divided into incomplete sen-
tences or fragments. In order to make use 
of both the right and left contexts of can-
didate sentence boundaries, we propose a 
new linguistic modeling approach based 
on Maximum-entropy-weighted Bi-
directional N-grams. Experimental results 
indicate that the new approach signifi-
cantly outperforms the usual N-gram al-
gorithm for segmenting both Chinese and 
English utterances. 
1 Introduction 
Due to the improvement of speech recognition 
technology, spoken language user interfaces, spo-
ken dialogue systems, and speech translation sys-
tems are no longer only laboratory dreams. 
Roughly speaking, such systems have the structure 
shown in Figure 1. 
 
 
Figure 1. System with speech input. 
 
In these systems, the language analysis module 
takes the output of speech recognition as its input, 
representing the current utterance exactly as pro-
nounced, without any punctuation symbols mark-
ing the boundaries of sentences. Here is an 
example: 这边请您坐电梯到9楼服务生将在那
里等您并将您带到913号房间 . (this way please 
please take this elevator to the ninth floor the floor 
attendant will meet you at your elevator entrance 
there and show you to room 913.) As the example 
shows, it will be difficult for a text analysis module 
to parse the input if the utterance is not segmented. 
Further, the output utterance from the speech rec-
ognizer usually contains wrongly recognized 
words or noise words. Thus it is crucial to segment 
the utterance before further language processing. 
We believe that accurate segmentation can greatly 
improve the performance of language analysis 
modules. 
Stevenson et al. have demonstrated the difficul-
ties of text segmentation through an experiment in 
which six people, educated to at least the Bache-
lor’s degree level, were required to segment into 
sentences broadcast transcripts from which all 
punctuation symbols had been removed. The ex-
perimental results show that humans do not always 
agree on the insertion of punctuation symbols, and 
that their segmentation performance is not very 
good (Stevenson and Gaizauskas, 2000). Thus it is 
a great challenge for computers to perform the task 
Output (text 
or speech) Language 
analysis and 
generation 
Speech  
 
recognition
Input speech
automatically. To solve this problem, many meth-
ods have been proposed, which can be roughly 
classified into two categories. One approach is 
based on simple acoustic criteria, such as non-
speech intervals (e.g. pauses), pitch and energy. 
We can call this approach acoustic segmentation. 
The other approach, which can be called linguistic 
segmentation, is based on linguistic clues, includ-
ing lexical knowledge, syntactic structure, seman-
tic information etc. Acoustic segmentation can not 
always work well, because utterance boundaries do 
not always correspond to acoustic criteria. For ex-
ample: 您好<pause>请问<pause>明天的单人间
还有吗<pause>或者<pause>标准间也行. Since 
the simple acoustic criteria are inadequate, linguis-
tic clues play an indispensable role in utterance 
segmentation, and many methods relying on them 
have been proposed. 
This paper proposes a new approach to linguis-
tic segmentation using a Maximum-entropy-
weighted Bi-directional N-gram-based algorithm 
(MEBN). To evaluate the performance of MEBN, 
we conducted experiments in both Chinese and 
English. All the results show that MEBN outper-
forms the normal N-gram algorithm. The remain-
der of this paper will focus on description of our 
new approach for linguistic segmentation. In Sec-
tion 2, some related work on utterance segmenta-
tion is briefly reviewed, and our motivations are 
described. Section 3 describes MEBN in detail. 
The experimental results are presented in Section 4. 
Finally, Section 5 gives our conclusion. 
2 Related Work and Our Motivations  
2.1 Related Work 
Stolcke et al. (1998, 1996) proposed an approach 
to detection of sentence boundaries and disfluency 
locations in speech transcribed by an automatic 
recognizer, based on a combination of prosodic 
cues modeled by decision trees and N-gram lan-
guage models. Their N-gram language model is 
mainly based on part of speech, and retains some 
words which are particularly relevant to segmenta-
tion. Of course, most part-of-speech taggers re-
quire sentence boundaries to be pre-determined; so 
to require the use of part-of-speech information in 
utterance segmentation would risk circularity. Cet-
tolo et al.’s (1998) approach to sentence boundary 
detection is somewhat similar to Stolcke et al.’s. 
They applied word-based N-gram language models 
to utterance segmentation, and then combined 
them with prosodic models. Compared with N-
gram language models, their combined models 
achieved an improvement of 0.5% and 2.3% in 
precision and recall respectively. 
Beeferman et al. (1998) used the CYBERPUNC 
system to add intra-sentence punctuation (espe-
cially commas) to the output of an automatic 
speech recognition (ASR) system. They claim that, 
since commas are the most frequently used punc-
tuation symbols, their correct insertion is by far the 
most helpful addition for making texts legible. 
CYBERPUNC augmented a standard trigram 
speech recognition model with lexical information 
concerning commas, and achieved a precision of 
75.6% and a recall of 65.6% when testing on 2,317 
sentences from the Wall Street Journal. 
Gotoh et al. (1998) applied a simple non-speech 
interval model to detect sentence boundaries in 
English broadcast speech transcripts. They com-
pared their results with those of N-gram language 
models and found theirs far superior. However, 
broadcast speech transcripts are not really spoken 
language, but something more like spoken written 
language. Further, radio broadcasters speak for-
mally, so that their reading pauses match sentence 
boundaries quite well. It is thus understandable that 
the simple non-speech interval model outperforms 
the N-gram language model under these conditions; 
but segmentation of natural utterances is quite dif-
ferent. 
Zong et al. (2003) proposed an approach to ut-
terance segmentation aiming at improving the per-
formance of spoken language translation (SLT) 
systems. Their method is based on rules which are 
oriented toward key word detection, template 
matching, and syntactic analysis. Since this ap-
proach is intended to facilitate translation of Chi-
nese-to-English SLT systems, it rewrites long 
sentences as several simple units. Once again, 
these results cannot be regarded as general-purpose 
utterance segmentation. Furuse et al. (1998) simi-
larly propose an input-splitting method for translat-
ing spoken language which includes many long or 
ill-formed expressions. The method splits an input 
into well-balanced translation units, using a seman-
tic dictionary. 
Ramaswamy et al. (1998) applied a maximum 
entropy approach to the detection of command 
boundaries in a conversational natural language 
user interface. They considered as their features 
words and their distances to potential boundaries. 
They posited 400 feature functions, and trained 
their weights using 3000 commands. The system 
then achieved a precision of 98.2% in a test set of 
1900 commands. However, command sentences 
for conversational natural language user interfaces 
contain much smaller vocabularies and simpler 
structures than the sentences of natural spoken lan-
guage. In any case, this method has been very 
helpful to us in designing our own approach to ut-
terance segmentation. 
There are several additional approaches which are 
not designed for utterance segmentation but which 
can nevertheless provide useful ideas. For example, 
Reynar et al. (1997) proposed an approach to the 
disambiguation of punctuation marks. They con-
sidered only the first word to the left and right of 
any potential sentence boundary, and claimed that 
examining wider context was not beneficial. The 
features they considered included the candidate’s 
prefix and suffix; the presence of particular charac-
ters in the prefix or suffix; whether the candidate 
was honorific (e.g. Mr., Dr.); and whether the can-
didate was a corporate designator (e.g. Corp.). The 
system was tested on the Brown Corpus, and 
achieved a precision of 98.8%. Elsewhere, Nakano 
et al. (1999) proposed a method for incrementally 
understanding user utterances whose semantic 
boundaries were unknown. The method operated 
by incrementally finding plausible sequences of 
utterances that play crucial roles in the task execu-
tion of dialogues, and by utilizing beam search to 
deal with the ambiguity of boundaries and with 
syntactic and semantic ambiguities. Though the 
method does not require utterance segmentation 
before discourse processing, it employs special 
rule tables for discontinuation of significant utter-
ance boundaries. Such rule tables are not easy to 
maintain, and experimental results have demon-
strated only that the method outperformed the 
method assuming pauses to be semantic boundaries.  
2.2 Our motivations 
Though numerous methods for utterance segmen-
tation have been proposed, many problems remain 
unsolved.  
One remaining problem relates to the language 
model. The N-gram model evaluates candidate 
sentence boundaries mainly according to their left 
context, and has achieved reasonably good results, 
but it can’t take into account the distant right con-
text to the candidate. This is the reason that N-
gram methods often wrongly divide some long 
sentences into halves or multiple segments. For 
example:小王病了一个星期. The N-gram method 
is likely to insert a boundary mark between “了” 
and “一”, which corresponds to our everyday im-
pression that, if reading from the left and not 
considering several more words to the right of the 
current word, we will probably consider “小王病
了” as a whole sentence. However, we find that, if 
we search the sentence boundaries from right to 
left, such errors can be effectively avoided. In the 
present example, we won’t consider “一个星期” 
as a whole sentence, and the search will be contin-
ued until the word “小” is encountered. Accord-
ingly, in order to avoid segmentation errors made 
by the normal N-gram method, we propose a re-
verse N-gram segmentation method (RN) which 
does seek sentence boundaries from right to left. 
Further, we simply integrate the two N-gram 
methods and propose a bi-directional N-gram 
method (BN), which takes into account both the 
left and the right context of a candidate segmenta-
tion site. Since the relative usefulness or signifi-
cance of the two N-gram methods varies 
depending on the context, we propose a method of 
weighting them appropriately, using parameters 
generated by a maximum entropy method which 
takes as its features information about words in the 
context. This is our Maximum-Entropy-Weighted 
Bi-directional N-gram-based segmentation method. 
We hope MEBN can retain the correct segments 
discovered by the usual N-gram algorithm, yet ef-
fectively skip the wrong segments. 
3 Maximum-Entropy-Weighted Bi-
directional N-gram-based Segmentation 
Method 
3.1 Normal N-gram Algorithm (NN) for Ut-
terance Segmentation 
Assuming that 
m
WWW ...
21
(where m is a natural 
number) is a word sequence, we consider it as an n 
order Markov chain, in which the word 
)1( miW
i
≤≤  is predicted by the n-1 words to its 
left. Here is the corresponding formula: 
)...|()...|(
11121 −+−−
=
iniiii
WWWPWWWWP  
From this conditional probability formula for a 
word, we can derive the probability of a word se-
quence
i
WWW ...
21
:  
)...|()...()...(
12112121 −−
×=
iiii
WWWWPWWWPWWWP  
Integrating the two formulas above, we get: 
)...|()...()...(
1112121 −+−−
×=
iniiii
WWWPWWWPWWWP  
Let us use SB to indicate a sentence boundary 
and add it to the word sequence. The value of 
)...(
121 +ii
SBWWWWP  and )...(
121 +ii
WWWWP will 
determine whether a specific word 
)1( miW
i
≤≤ is the final word of a sentence. We 
say 
i
W  is the final word of a sentence if and only 
if )...(
121 +ii
SBWWWWP > )...(
121 +ii
WWWWP .  
Taking the trigram as our example and consid-
ering the two cases where W
i-1
 is and is not the 
final word of a sentence, )...(
121 +ii
SBWWWWP  
and )...(
121 +ii
WWWWP  is computed respectively 
by the following two formulas: 
)|()...(
)|()...()...(
)|()|()...(
)|()|()...()...(
11121
121121
11121
121121
iiiii
iiiii
iiiiii
iiiiii
WWWPWWWWP
SBWWPSBWWWPWWWWP
SBWWPWWSBPWWWWP
SBWWPSBWSBPSBWWWPSBWWWWP
−+−
++
+−−
++
×+
×=
××+
××=
 
In the normal N-gram method, the above iterative 
formulas are computed to search the sentence 
boundaries from 
1
W  to 
m
W . 
3.2 Reverse N-gram Algorithm (RN) for Ut-
terance Segmentation 
In the reverse N-gram segmentation method, we 
take the word sequence 
m
WWW ...
21
 as a reverse 
Markov chain in which )1( miW
i
≤≤  is predicted 
by the n-1 words to its right. That is: 
 )...|()...|(
1111 +−++−
=
iniiimmi
WWWPWWWWP  
As in the N-gram algorithm, we compute the 
occurring probability of word sequence 
m
WWW ...
21
 using the formula: 
)...|()...()...(
11111 +−+−−
×=
immiimmimm
WWWWPWWWPWWWP  
Then the iterative computation formula is: 
)...|()...()...(
11111 +−++−−
×=
iniiimmimm
WWWPWWWPWWWP  
By adding SB to the word sequence, we say 
i
W  
is the final word of a sentence if and only if 
)...(
11 iimm
SBWWWWP
+−
> )...(
11 iimm
WWWWP
+−
. 
Similar to NN, )...(
11 iimm
SBWWWWP
+−
 and 
)...(
11 iimm
WWWWP
+−
 are computed as follows in 
the trigram: 
)|()...(
)|()...()...(
)|()|()...(
)|()|()...()...(
12121
11111
112121
111111
++++−
++−+−
+++++−
+++−+−
×+
×=
××+
××=
iiiiimm
iiimmiimm
iiiiiimm
iiiimmiimm
WWWPWWWWP
SBWWPSBWWWPWWWWP
SBWWPWWSBPWWWWP
SBWWPSBWSBPSBWWWPSBWWWWP
  
In contrast to the normal N-gram segmentation 
method, we compute the above iterative formulas 
to seek sentence boundaries from 
m
W  to 
1
W . 
3.3 Bi-directional N-gram Algorithm for Ut-
terance Segmentation 
From the iterative formulas of the normal N-gram 
algorithm and the reverse N-gram algorithm, we 
can see that the normal N-gram method recognizes 
a candidate sentence boundary location mainly 
according to its left context, while the reverse N-
gram method mainly depends on its right context. 
Theoretically at least, it is reasonable to suppose 
that, if we synthetically consider both the left and 
the right context by integrating the NN and the RN, 
the overall segmentation accuracy will be im-
proved. 
Considering the word sequence 
m
WWW ...
21
, the 
candidate sites for sentence boundaries may be 
found between 
1
W  and 
2
W , between 
2
W  and 
3
W , …, or between 
1−m
W and
m
W . The number of 
candidate sites is thus m-1. We number those m-1 
candidate sites 1, 2 … m-1 in succession, and we 
use )(iP
is
)11( −≤≤ mi  and 
)(iP
no
)11( −≤≤ mi  respectively to indicate the 
probability that the current site i really is, or is not, 
a sentence boundary. Thus, to compute the word 
sequence segmentation, we must compute )(iP
is
 
and )(iP
no
 for each of the m-1 candidate sites. In 
the bi-directional BN, we compute )(iP
is
 and 
)(iP
no
 by combining the NN results and RN re-
sults. The combination is described by the follow-
ing formulas: 
)()()(
)()()(
___
___
iPiPiP
iPiPiP
RNnoNNnoBNno
RNisNNisBNis
×=
×=
 
where )(
_
iP
NNis
, )(
_
iP
NNno
 denote the probabili-
ties calculated by NN which correspond to 
)...(
121 +ii
SBWWWWP  and )...(
121 +ii
WWWWP in 
section 3.1 respectively and )(
_
iP
RNis
, )(
_
iP
RNno
 
denote the probabilities calculated by RN which 
correspond to )...(
11 iimm
SBWWWWP
+−
 and 
)...(
11 iimm
WWWWP
+−
 in section 3.2 respectively. 
We say there exits a sentence boundary at site i 
)11( −≤≤ mi if and only if )()(
__
iPiP
BNnoBNis
> . 
3.4 Maximum Entropy Approach for Utter-
ance Segmentation 
In this section, we explain our maximum-entropy-
based model for utterance segmentation. That is, 
we estimate the joint probability distribution of the 
candidate sites and their surrounding words. Since 
we consider information concerning the lexical 
context to be useful, we define the feature func-
tions for our maximum method as follows: 


 ==
=
else
bScefixincludeif
cbf
j
j
0
)0&&)),((Pr(1
),(
10
 


 ==
=
else
bScefixincludeif
cbf
j
j
0
)1&&)),((Pr(1
),(
11
 


 ==
=
else
bScSuffixincludeif
cbf
j
j
0
)0&&)),(((1
),(
20
 


 ==
=
else
bScSuffixincludeif
cbf
j
j
0
)1&&)),(((1
),(
21
 
S
j
 denotes a sequence of one or more words 
which we can call the Matching String. (Note that 
S
j
 may contain the sentence boundary mark ‘SB’.) 
The candidate c’s state is denoted by b, where b=1 
indicates that c is a sentence boundary and b=0 
indicates that it is not a boundary. Prefix(c) de-
notes all the word sequences ending with c (that is, 
c's left context plus c) and Suffix(c) denotes all the 
word sequences beginning with c (in other words, 
c plus its right context). For example: in the utter-
ance: 去<c1>机<c2>场<c3>怎<c4>么<c5>走, 
‘场’, ‘机场’,  and ‘去机场’ are c3’s Prefix, while 
‘怎’ , ‘怎么’and ‘怎么走’ are c3’s Suffix. The 
value of function )),((Pr
j
Scefixinclude  is true 
when word sequence S
j
 is one of c’s Prefixes, and 
the value of function )),((
j
ScSuffixinclude  is 
true when S
j
 is one of c’s Suffixes. 
Corresponding to the four feature functions 
),(
10
cbf
j
, ),(
11
cbf
j
, ),(
20
cbf
j
, ),(
21
cbf
j
 are the 
four parameters 
10j
α ,
11j
α ,
20j
α ,
21j
α . Thus the 
joint probability distribution of the candidate sites 
and their surrounding contexts is given by: 
)(),(
1
),(
21
),(
20
),(
11
),(
10
21201110
∏
=
×××=
k
j
cbf
j
cbf
j
cbf
j
cbf
j
jjjj
bcP ααααπ
where k is the total number of the Matching Strings 
and π is a parameter set to make P(c,1) and P(c,0) 
sum to 1. The unknown parameters 
10j
α ,
11j
α ,
20j
α ,
21j
α  are chosen to maximize the 
likelihood of the training data using the General-
ized Iterative Scaling (Darroch and Ratcliff, 1972) 
algorithm. In the maximum entropy approach, we 
say that a candidate site is a sentence boundary if 
and only if P(c, 1) > P(c, 0). (At this point, we can 
anticipate a technical problem with the maximum 
approach to utterance segmentation. When a 
Matching String contains SB, we cannot know 
whether it belongs to the Prefixes or Suffixes of 
the candidate site until the left and right contexts of 
the candidate site have been segmented. Thus if the 
segmentation proceeds from left to right, the lexi-
cal information in the right context of the current 
candidate site will always remain uncertain. Like-
wise, if it proceeds from right to left, the informa-
tion in the left context of the current candidate site 
remains uncertain. The next subsection will de-
scribe a pragmatic solution to this problem.) 
3.5 Maximum-Entropy-Weighted Bi-
directional N-gram Algorithm for Utter-
ance Segmentation 
In the bi-directional N-gram based algorithm, we 
have considered the left-to-right N-gram algorithm 
and the right-to-left algorithm as having the same 
significance. Actually, however, they should be 
assigned differing weights, depending on the lexi-
cal contexts. The combination formulas are as fol-
lows: 
)()()()()(
)()()()()(
____
____
iPCWiPCWiP
iPCWiPCWiP
RNnoinorNNnoinonno
RNisiisrNNisiisnis
×××=
×××=
 
)(
_ iisn
CW , )(
_ inon
CW , )(
_ iisr
CW , )(
_ inor
CW  
are the functions of the context surrounding candi-
date site i which denotes the weights of 
)(
_
iP
NNis
, )(
_
iP
NNno
, )(
_
iP
RNis
 and )(
_
iP
RNno
 re-
spectively. Assuming that the weights of )(
_
iP
NNis
 
and )(
_
iP
NNno
 depend upon the context to the left 
of the candidate site, and that the weights of 
)(
_
iP
RNis
 and )(
_
iP
RNno
 depend on the context to 
the right of the candidate site, the weight functions 
can be rewritten as: 
)(
_ iisn
LeftCW , )(
_ inon
LeftCW , )(
_ iisr
RightCW ,
)(
_ inor
RightCW . It is reasonable to assume that as 
the joint probability ),( SBiLeftCP
i
=  rises, 
)(
_
iP
NNis
 will increase in significance. (The joint 
probability in question is the probability of the cur-
rent candidate’s left context, taken together with 
the probability that the candidate is a sentence 
boundary.) Therefore the value of )(
_ iisn
LeftCW  
is given by ),()(
_
SBiLeftCPLeftCW
iiisn
== . 
Similarly we can give the formulas for comput-
ing )(
_ inon
LeftCW , )(
_ iisr
RightCW , and 
)(
_ inor
RightCW  as follows: 
)!,()(
_
SBiLeftCPLeftCW
iinon
==  
),()(
_
SBiRightCPRightCW
iiisr
==  
)!,()(
_
SBiRightCPRightCW
iinor
==  
We can easily get the values of 
),( SBiLeftCP
i
= , )!,( SBiLeftCP
i
= ,
),( SBiRightCP
i
= , and )!,( SBiRightCP
i
=  
using the method described in the maximum en-
tropy approach section. For example: 
∏
=
==
k
j
if
ji
j
SBiLeftCP
1
),1(
11
11
),( απ
 
∏
=
==
k
j
if
ji
j
SBiLeftCP
1
),0(
10
10
)!,( απ
 
As mentioned in last subsection, we need seg-
mented contexts for maximum entropy approach. 
Since the maximum entropy parameters for MEBN 
algorithm are used as modifying NN and RN, we 
just estimate the joint probability of the candidate 
and its surrounding contexts based upon the seg-
ments by NN and RN. Using NLeftC
i
 indicate the 
left context to the candidate i which has been seg-
mented by NN algorithm and RRightC
i
 indicate the 
right context to i which has been segmented by RN, 
the combination probability computing formulas 
for MEBN are as follows: 
)()!,(
)()!,()(
)(),(
)(),()(
_
__
_
__
iPSBiRRightCP
iPSBiNLeftCPiP
iPSBiRRightCP
iPSBiNLeftCPiP
RNnoi
NNnoiMEBNno
RNisi
NNisiMEBNis
×=×
×==
×=×
×==
 
We evaluate site i as a sentence boundary if and 
only if )()(
__
iPiP
MEBNnoMEBNis
> . 
4 Experiment 
4.1 Model Training 
Our models are trained on both Chinese and Eng-
lish corpora, which cover the domains of hotel res-
ervation, flight booking, traffic information, 
sightseeing, daily life and so on. We replaced the 
full stops with “SB” and removed all other punc-
tuation marks in the training corpora. Since in most 
actual systems part of speech information cannot 
be accessed before determining the sentence 
boundaries, we use Chinese characters and English 
words without POS tags as the units of our N-gram 
models. Trigram and reverse trigram probabilities 
are estimated based on the processed training cor-
pus by using Modified Kneser-Ney Smoothing 
(Chen and Goodman, 1998). As to the maximum 
entropy model, the Matching Strings are chosen as 
all the word sequences occurring in the training 
corpus whose length is no more than 3 words. The 
unknown parameters corresponding to the feature 
functions are generated based on the training cor-
pus using the Generalized Iterative Scaling algo-
rithm. Table 1 gives an overview of the training 
corpus. 
Corpus SIZE SB Num-
ber 
Average Length 
of Sentence 
Chinese 4.02MB 148967 8 Chinese charac-
ters 
English 4.49MB 149311 6 words 
Table 1. Overview of the Training Corpus. 
4.2 Testing Results 
We test our methods using open corpora which are 
also limited to the domains mentioned above. All 
punctuation marks are removed from the test cor-
pora. An overview of the test corpus appears in 
table 2. 
Corpus SIZE SB 
Number 
Average Length 
of Sentence 
Chinese 412KB 12032 10 Chinese char-
acters 
English 391KB 10518 7 words 
Table 2. Overview of the Testing Corpus. 
We have implemented four segmentation algo-
rithms using NN, RN, BN and MEBN respectively. 
If we use “RightNum” to denote the number of 
right segmentations, “WrongNum” denote the 
number of wrong segmentations, and “TotalNum” 
to denote the number of segmentations in the 
original testing corpus, the precision (P) can be 
computed using the formula 
P=RightNum/(RightNum+WrongNum), the recall 
(R) is computed as R=RightNum/TotalNum, and 
the F-Score is computed as F-Score = 
RP
RP
+
××2
. 
The testing results are described in Table 3 and 
Table 4. 
Methods 
Total 
Num 
Right 
Num 
Wrong 
Num 
Preci-
sion  
Recall F-Score 
NN 12032 10167 2638 79.4% 84.5% 81.9% 
RN 12032 10396 2615 79.9% 86.4% 83.0% 
BN 12032 10528 2249 82.4% 87.5% 84.9% 
MEBN 12032 10348 1587 86.7% 86.0% 86.3% 
Table 3. Experimental Results for Chinese Utter-
ance Segmentation. 
Methods 
Total 
Num 
Right 
Num 
Wrong
Num 
Preci-
sion  
Recall F-Score 
NN 10518 8730 3164 73.4% 83.0% 77.9% 
RN 10518 9014 3351 72.9% 85.7% 78.8% 
BN 10518 9056 3019 75.0% 86.1% 80.2% 
MEBN 10518 8929 2403 78.8% 84.9% 81.7% 
Table 4. Experimental Results for English Utter-
ance Segmentation. 
From the result tables it is clear that RN, BN, and 
MEBN all outperforms the normal N-gram algo-
rithm in the F-score for both Chinese and English 
utterance segmentation. MEBN achieved the best 
performance which improves the precision by 
7.3% and the recall by 1.5% in the Chinese ex-
periment, and improves the precision by 5.4% and 
the recall by 1.9% in the English experiment. 
4.3 Result analysis 
MEBN was proposed in order to maintain the cor-
rect segments of the normal N-gram algorithm 
while skipping the wrong segments. In order to see 
whether our original intention has been realized, 
we compared the segments as determined by RN 
with those determined by NN, compare the seg-
ments found by BN with those of NN and then 
compare the segments found by MEBN with those 
of NN. For RN, BN and MEBN, suppose TN de-
notes the number of total segmentations, CON de-
notes the number of correct segmentations 
overlapping with those found by NN; SWN de-
notes the number of wrong NN segmentations 
which were skipped; WNON denotes the number 
of wrong segmentations not overlapping with those 
of NN; and CNON denotes the number of segmen-
tations which were correct but did not overlap with 
those of NN. The statistical results are listed in 
Table 5 and Table 6. 
Methods TN CON SWN WNON CNON
RN 13011 9525 1098 1077 870 
BN 12777 9906 753 355 622 
MEBN 11935 9646 1274 223 678 
Table 5. Chinese Utterance Segmentation Results 
Comparison. 
Methods TN CON SWN WNON CNON 
RN 12365 8223 1077 1271 792 
BN 12075 8565 640 488 491 
MEBN 11332 8370 1247 486 559 
Table 6. English Utterance Segmentation Results 
Comparison. 
Focusing upon the Chinese results, we can see that 
RN skips 1098 incorrect segments found by NN, 
and has 9525 correct segments in common with 
those of NN. It verifies our supposition that RN 
can effectively avoid some errors made by NN. 
But because at the same time RN brings in 1077 
new errors, RN doesn’t improve much in precision. 
BN skips 753 incorrect segments and brings in 355 
new segmentation errors; has 9906 correct seg-
ments in common with those of NN and brings in 
622 new correct segments. So by equally integrat-
ing NN and RN, BN on one hand finds more cor-
rect segments, on the other hand brings in less 
wrong segments than NN. But in skipping incor-
rect segments by NN, BN still performs worse than 
RN, showing that it only exerts the error skipping 
ability of RN to some extent. As for MEBN, it 
skips 1274 incorrect segments and at the same time 
brings in only 223 new incorrect segments. Addi-
tionally it maintains 9646 correct segments in com-
mon with those of NN and brings in 678 new 
correct segments. In recall MEBN performs a little 
worse than BN, but in precision it achieves a much 
better performance than BN, showing that modi-
fied by the maximum entropy weights, MEBN 
makes use of the error skipping ability of RN more 
effectively. Further, in skipping wrong segments 
by NN, MEBN even outperforms RN, which indi-
cates the weights we set on NN and RN not only 
act as modifying parameters, but also have direct 
beneficial affection on utterance segmentation. 
5 Conclusion 
This paper proposes a reverse N-gram algorithm, a 
bi-directional N-gram algorithm and a Maximum-
entropy-weighted Bi-directional N-gram algorithm 
for utterance segmentation. The experimental re-
sults for both Chinese and English utterance seg-
mentation show that MEBN significantly 
outperforms the usual N-gram algorithm. This is 
because MEBN takes into account both the left and 
right contexts of candidate sites: it integrates the 
left-to-right N-gram algorithm and the right-to-left 
N-gram algorithm with appropriate weights, using 
clues on the sites’ lexical context, as modeled by 
maximum entropy. 
Acknowledgements 
This work is sponsored by the Natural Sciences 
Foundation of China under grant No.60175012, as 
well as supported by the National Key Fundamen-
tal Research Program (the 973 Program) of China 
under the grant G1998030504. 
The authors are very grateful to Dr. Mark Selig-
man for his very useful suggestions and his very 
careful proofreading.  

References 
Beeferman D., A. Berger, and J. Lafferty. 1998. 
CYBERPUNC: A lightweight punctuation annotation 
system for speech. In Proceedings of the IEEE Inter-
national Conference on Acoustics, Speech and Signal 
Processing, Seattle, WA. pp. 689-692. 
Beeferman D., A. Berger, and J. Lafferty. 1999. Statisti-
cal models for text segmentation. Machine Learning 
34, pp 177-210.  
Berger A., S. Della Pietra, and V. Della Pietra. 1996. A 
Maximum Entropy Approach to Natural Language 
Processing. Computational Linguistics, 22(1), pp. 39-
71.  
Cettolo M. and D. Falavigna. 1998. Automatic 
Detection of Semantic Boundaries Based on Acoustic 
and Lexical Knowledge. ICSLP 1998, pp. 1551-1554. 
Chen S. F. and J. Goodman. 1998. An empirical study 
of smoothing techniques for language modeling. Tech-
nical Report TR-10-98, Center for Research in Com-
puting Technology, Harvard University. pp.243-255. 
Darroch J. N. and D. Ratcliff. 1972. Generalized Itera-
tive Scaling for Log-Linear Models. The Annals of 
Mathematical Statistics, 43(5), pp. 1470-1480. 
Furuse O., S. Yamada, and K. Yamamoto. 1998. Split-
ting Long or Ill-formed Input for Robust Spoken-
language Translation. COLING-ACL 1998, pp. 421-
427. 
Gotoh Y. and S. Renals. 2000. Sentence Boundary De-
tection in Broadcast Speech Transcripts. In Proc. In-
ternational Workshop on Automatic Speech 
Recognition, pp. 228-235. 
Nakano M., N. Miyazaki, J. Hirasawa, K. Dohsaka, and 
T. Kawabata. 1999. Understanding Unsegmented User 
Utterances in Real-Time Spoken Dialogue Systems. 
Proceedings of the 37th Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL-99), Col-
lege Park, MD, USA, pp. 200-207. 
Ramaswamy N. G. and J. Kleindienst. 1998. Automatic 
Identification of Command Boundaries in a Conversa-
tional Natural Language User Interface. ICSLP 1998. 
pp. 401-404. 
Reynar J. and A. Ratnaparkhi. 1997. A maximum en-
tropy approach to identifying sentence boundaries. In 
Proceedings of the 5th Conference on Applications of 
Natural Language Processing (ANLP), Washington 
DC, pp. 16-19.  
Seligman M. 2000. Nine Issues in Speech Translation. 
In Machine Translation, 15, pp. 149-185. 
Stevenson M. and R. Gaizauskas. 2000. Experiments on 
sentence boundary detection. In Proceedings of the 
Sixth Conference on Applied Natural Language Proc-
essing and the First Conference of the North American 
Chapter of the Association for Computational Linguis-
tics, pp. 24-30. 
Stolcke A. and E. Shriberg. 1996. Automatic linguistic 
segmentation of conversational speech. Proc. Intl. 
Conf. on Spoken Language Processing, Philadelphia, 
PA, vol. 2, pp. 1005-1008.  
Stolcke A., E. Shriberg, R. Bates, M. Ostendorf, D. 
Hakkani, M. Plauche, G. Tur, and Y. Lu. 1998. Auto-
matic Detection of Sentence Boundaries and Disfluen-
cies based on Recognized Words. Proc. Intl. Conf. on 
Spoken Language Processing, Sydney, Australia, vol. 
5, pp. 2247-2250.  
Zong, C. and F. Ren. 2003. Chinese Utterance Segmen-
tation in Spoken Language translation. In Proceedings 
of the 4
th
 international conference on intelligent text 
processing and Computational Linguistics (CICLing), 
Mexico, Feb 16-22. pp. 516-525. 
Zhou Y. 2001. Utterance Segmentation Based on Deci-
sion Tree. Proceedings of the 6
th
 National joint Con-
ference on Computational Linguistics, Taiyuan, China, 
pp. 246-252.  
