Edit Detection and Parsing for Transcribed Speech
Eugene Charniak and Mark Johnson
Departments of Computer Science and Cognitive and Linguistic Sciences
Brown Laboratory for Linguistic Information Processing (BLLIP)
Brown University, Providence, RI 02912
ec,mj@cs.brown.edu
 
Abstract
We present a simple architecture for parsing transcribed speech in which an edited-word detector first removes such words from the sentence string, and then a standard statistical parser trained on transcribed speech parses the remaining words. The edit detector achieves a misclassification rate on edited words of 2.2%. (The NULL-model, which marks everything as not edited, has an error rate of 5.9%.) To evaluate our parsing results we introduce a new evaluation metric, the purpose of which is to make evaluation of a parse tree relatively indifferent to the exact tree position of EDITED nodes. By this metric the parser achieves 85.3% precision and 86.5% recall.
1 Introduction
While significant effort has been expended on the parsing of written text, parsing speech has received relatively little attention. The comparative neglect of speech (or transcribed speech) is understandable, since parsing transcribed speech presents several problems absent in regular text: "um"s and "ah"s (or more formally, filled pauses), frequent use of parentheticals (e.g., "you know"), ungrammatical constructions, and speech repairs (e.g., "Why didn’t he, why didn’t she stay home?").
In this paper we present and evaluate a simple two-pass architecture for handling the problems of parsing transcribed speech. The first pass tries to identify which of the words in the string are edited ("why didn’t he," in the above example). These words are removed from the string given to the second pass, an already existing statistical parser trained on a transcribed speech corpus. (In particular, all of the research in this paper was performed on the parsed "Switchboard" corpus as provided by the Linguistic Data Consortium.)

This research was supported in part by NSF grant LIS SBR 9720368 and by NSF ITR grant 20100203.

This architecture is based upon a fundamental assumption: that the semantic and pragmatic content of an utterance is based solely on the unedited words in the word sequence. This assumption is not completely true. For example, Core and Schubert [8] point to counterexamples such as "have the engine take the oranges to Elmira, um, I mean, take them to Corning" where the antecedent of "them" is found in the EDITED words. However, we believe that the assumption is so close to true that the number of errors introduced by this assumption is small compared to the total number of errors made by the system.
In order to evaluate the parser’s output we compare it with the gold-standard parse trees. For this purpose a very simple third pass is added to the architecture: the hypothesized edited words are inserted into the parser output (see Section 3 for details). To the degree that our fundamental assumption holds, a "real" application would ignore this last step.
This architecture has several things to recommend it. First, it allows us to treat the editing problem as a pre-process, keeping the parser unchanged. Second, the major clues in detecting edited words in transcribed speech seem to be relatively shallow phenomena, such as repeated word and part-of-speech sequences. The kind of information that a parser would add, e.g., the node dominating the EDITED node, seems much less critical.
Note that of the major problems associated with transcribed speech, we choose to deal with only one of them, speech repairs, in a special fashion. Our reasoning here is based upon what one might and might not expect from a second-pass statistical parser. For example, ungrammaticality in some sense is relative, so if the training corpus contains the same kind of ungrammatical examples as the testing corpus, one would not expect ungrammaticality itself to be a show stopper. Furthermore, the best statistical parsers [3,5] do not use grammatical rules, but rather define probability distributions over all possible rules.
Similarly, parentheticals and filled pauses exist in the newspaper text these parsers currently handle, albeit at a much lower rate. Thus there is no particular reason to expect these constructions to have a major impact.¹ This leaves speech repairs as the one major phenomenon not present in written text that might pose a major problem for our parser. It is for that reason that we have chosen to handle it separately.
The organization of this paper follows the architecture just described. Section 2 describes the first pass. We present therein a boosting model for learning to detect edited nodes (Sections 2.1-2.2) and an evaluation of the model as a stand-alone edit detector (Section 2.3). Section 3 describes the parser. Since the parser is that already reported in [3], this section simply describes the parsing metrics used (Section 3.1), the details of the experimental setup (Section 3.2), and the results (Section 3.3).
2 Identifying EDITED words
The Switchboard corpus annotates disfluencies such as restarts and repairs using the terminology of Shriberg [15]. The disfluencies include repetitions and substitutions, set off by underscores in (1a) and (1b) respectively.
(1) a. _I really,_ I really like pizza.
    b. _Why didn’t he,_ why didn’t she stay home?
Restarts and repairs are indicated by disfluency tags ‘[’, ‘+’ and ‘]’ in the disfluency POS-tagged Switchboard corpus, and by EDITED nodes in the tree-tagged corpus. This section describes a procedure for automatically identifying words corrected by a restart or repair, i.e., words that are dominated by an EDITED node in the tree-tagged corpus.

¹ Indeed, [17] suggests that filled pauses tend to indicate clause boundaries, and thus may be a help in parsing.

This method treats the problem of identifying EDITED nodes as a word-token classification problem, where each word token is classified as either edited or not. The classifier applies to words only; punctuation inherits the classification of the preceding word. A linear classifier trained by a greedy boosting algorithm [16] is used to predict whether a word token is edited. Our boosting classifier is directly based on the greedy boosting algorithm described by Collins [7]. That paper contains important implementation details that are not repeated here. We chose Collins’ algorithm because it offers good performance and scales to hundreds of thousands of possible feature combinations.
2.1 Boosting estimates of linear classifiers
This section describes the kinds of linear classifiers that the boosting algorithm infers. Abstractly, we regard each word token as an event characterized by a finite tuple of random variables
\[ (Y, X_1, \ldots, X_m). \]
$Y$ is the conditioned variable and ranges over $\{-1, +1\}$, with $Y = +1$ indicating that the word is not edited. $X_1, \ldots, X_m$ are the conditioning variables; each $X_j$ ranges over a finite set $\mathcal{X}_j$. For example, $X_1$ is the orthographic form of the word and $\mathcal{X}_1$ is the set of all words observed in the training section of the corpus. Our classifiers use $m = 18$ conditioning variables. The following subsection describes the conditioning variables in more detail; they include variables indicating the POS tag of the preceding word, the tag of the following word, whether or not the word token appears in a "rough copy" as explained below, etc.
The goal of the classifier is to predict the value of $Y$ given values for $X_1, \ldots, X_m$. The classifier makes its predictions based on the occurrence of combinations of conditioning variable/value pairs called features. A feature $F$ is a set of variable-value pairs $\langle X_j, x_j \rangle$, with $x_j \in \mathcal{X}_j$. Our classifier is defined in terms of a finite number $n$ of features $F_1, \ldots, F_n$, where $n \approx 10^6$ in our classifiers.² Each feature $F_i$ defines an associated random boolean variable
\[ F_i = \prod_{\langle X_j, x_j \rangle \in F_i} (X_j = x_j), \]
where $(X = x)$ takes the value 1 if $X = x$ and 0 otherwise. That is, $F_i = 1$ iff $X_j = x_j$ for all $\langle X_j, x_j \rangle \in F_i$.

² It turns out that many pairs of features are extensionally equivalent, i.e., take the same values on each word token in our training data. We developed a method for quickly identifying such extensionally equivalent feature pairs based on hashing XORed random bitmaps, and deleted all but one of each set of extensionally equivalent features (we kept a feature with the smallest number of conditioning variables).

Our classifier estimates a feature weight $\alpha_i$ for each feature $F_i$, that is used to define the prediction variable $Z$:
\[ Z = \sum_{i=1}^{n} \alpha_i F_i. \]
The prediction made by the classifier is $\mathrm{sign}(Z) = Z/|Z|$, i.e., $-1$ or $+1$ depending on the sign of $Z$.
Intuitively, our goal is to adjust the vector of feature weights $\vec{\alpha} = (\alpha_1, \ldots, \alpha_n)$ to minimize the expected misclassification rate $E[(\mathrm{sign}(Z) \neq Y)]$. This function is difficult to minimize, so our boosting classifier minimizes the expected Boost loss $E[\exp(-YZ)]$. As Singer and Schapire [16] point out, the misclassification rate is bounded above by the Boost loss, so a low value for the Boost loss implies a low misclassification rate.
Our classifier estimates the Boost loss as $\hat{E}_t[\exp(-YZ)]$, where $\hat{E}_t[\cdot]$ is the expectation on the empirical training corpus distribution. The feature weights are adjusted iteratively; one weight is changed per iteration. The feature whose weight is to be changed is selected greedily to minimize the Boost loss using the algorithm described in [7]. Training continues for 25,000 iterations. After each iteration the misclassification rate on the development corpus $\hat{E}_d[(\mathrm{sign}(Z) \neq Y)]$ is estimated, where $\hat{E}_d[\cdot]$ is the expectation on the empirical development corpus distribution. While each iteration lowers the Boost loss on the training corpus, a graph of the misclassification rate on the development corpus versus iteration number is a noisy U-shaped curve, rising at later iterations due to overlearning. The value of $\vec{\alpha}$ returned by the estimator is the one that minimizes the misclassification rate on the development corpus; typically the minimum is obtained after about 12,000 iterations, and the feature weight vector $\vec{\alpha}$ contains around 8000 nonzero feature weights (since some weights are adjusted more than once).³
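The estimation procedure of this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [7]: the toy data, the names `active`, `best_update` and `train`, and the smoothing constant `eps` are all invented for the example. Each iteration changes the single feature weight whose update most reduces the empirical Boost loss, using the closed-form step that is available when features are binary.

```python
import math

def boost_loss(z, y):
    """Empirical Boost loss: mean of exp(-Y * Z) over the examples."""
    return sum(math.exp(-yi * zi) for zi, yi in zip(z, y)) / len(y)

def best_update(active, y, z, n_features, eps=1e-3):
    """Pick the feature and weight change that most reduce sum(exp(-y*z)).

    active[k] is the set of feature indices i with F_i = 1 on example k.
    eps is a hypothetical smoothing constant that keeps the log finite.
    """
    w_pos = [0.0] * n_features  # loss mass where F_i = 1 and y = +1
    w_neg = [0.0] * n_features  # loss mass where F_i = 1 and y = -1
    for feats, yk, zk in zip(active, y, z):
        mass = math.exp(-yk * zk)
        for i in feats:
            if yk > 0:
                w_pos[i] += mass
            else:
                w_neg[i] += mass
    best = (0.0, None, 0.0)  # (loss reduction, feature index, delta)
    for i in range(n_features):
        delta = 0.5 * math.log((w_pos[i] + eps) / (w_neg[i] + eps))
        after = w_pos[i] * math.exp(-delta) + w_neg[i] * math.exp(delta)
        gain = (w_pos[i] + w_neg[i]) - after
        if gain > best[0]:
            best = (gain, i, delta)
    return best

def train(active, y, n_features, iterations):
    """Greedy minimisation of the Boost loss, one weight per iteration."""
    alpha = [0.0] * n_features
    z = [0.0] * len(y)
    for _ in range(iterations):
        gain, i, delta = best_update(active, y, z, n_features)
        if i is None:
            break  # no single-weight change lowers the loss
        alpha[i] += delta
        for k, feats in enumerate(active):
            if i in feats:
                z[k] += delta
    return alpha, z

# Toy data: feature 0 fires mostly on edited words (y = -1), feature 1 on the rest.
active = [{0}, {0}, {1}, {1}, {0, 1}]
y = [-1, -1, +1, +1, +1]
alpha, z = train(active, y, n_features=2, iterations=20)
predictions = [1 if zk > 0 else -1 for zk in z]
```

In a fuller version one would also track the development-corpus misclassification rate after every iteration and keep the weight vector at which it is lowest, as described above.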
3
2.2 Conditioning variables and features
This subsection describes the conditioning variables used in the EDITED classifier. Many of the variables are defined in terms of what we call a rough copy. Intuitively, a rough copy identifies repeated sequences of words that might be restarts or repairs. Punctuation is ignored for the purposes of defining a rough copy, although conditioning variables indicate whether the rough copy includes punctuation. A rough copy in a tagged string of words is a substring of the form $\alpha_1 \beta \gamma \alpha_2$, where:
1. $\alpha_1$ (the source) and $\alpha_2$ (the copy) both begin with non-punctuation,
2. the strings of non-punctuation POS tags of $\alpha_1$ and $\alpha_2$ are identical,
3. $\beta$ (the free final) consists of zero or more sequences of a free final word (see below) followed by optional punctuation, and
4. $\gamma$ (the interregnum) consists of sequences of an interregnum string (see below) followed by optional punctuation.
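The four conditions above can be turned into a small search procedure. The sketch below is an illustration only: the punctuation tag set, the short `FREE_FINAL` and `INTERREGNA` lists, the POS tags in the usage example, and the greedy way the free final and interregnum are consumed are all assumptions, and the authors' actual definition is more complex. The search scans left to right, tries longer copies first, and resumes after the free final of each copy found.

```python
PUNCT_TAGS = {",", ".", ":"}                  # assumed punctuation POS tags
FREE_FINAL = {"and", "or", "actually", "so"}  # illustrative subset only
INTERREGNA = [("you", "know"), ("i", "guess"), ("i", "mean"), ("uh",)]

def is_free_final(word):
    """Partial words (ending in a hyphen) and a few listed words."""
    return word.endswith("-") or word.lower() in FREE_FINAL

def skip_interregna(words, k):
    """Advance k past zero or more interregnum strings starting at k."""
    moved = True
    while moved:
        moved = False
        for s in INTERREGNA:
            if tuple(w.lower() for w in words[k:k + len(s)]) == s:
                k += len(s)
                moved = True
                break
    return k

def find_rough_copies(tokens):
    """Return a list of (source, free_final, interregnum, copy) index lists.

    tokens is a list of (word, tag) pairs; indices refer to that list.
    """
    # Punctuation is ignored when matching, so work on the other tokens.
    keep = [i for i, (_, t) in enumerate(tokens) if t not in PUNCT_TAGS]
    words = [tokens[i][0] for i in keep]
    tags = [tokens[i][1] for i in keep]
    found, i = [], 0
    while i < len(words):
        match = None
        for n in range((len(words) - i) // 2, 0, -1):  # longer copies first
            j = i + n
            while j < len(words) and is_free_final(words[j]):
                j += 1                                  # free final
            k = skip_interregna(words, j)               # interregnum
            if tags[i:i + n] == tags[k:k + n]:
                match = (n, j, k)
                break
        if match is None:
            i += 1
            continue
        n, j, k = match
        found.append(([keep[x] for x in range(i, i + n)],
                      [keep[x] for x in range(i + n, j)],
                      [keep[x] for x in range(j, k)],
                      [keep[x] for x in range(k, k + n)]))
        i = j  # resume after the free final of this copy
    return found

tokens = [("I", "PRP"), ("thought", "VBD"), ("I", "PRP"), ("cou-", "XX"),
          (",", ","), ("I", "PRP"), ("mean", "VBP"), (",", ","),
          ("I", "PRP"), ("would", "MD"), ("finish", "VB"),
          ("the", "DT"), ("work", "NN")]
copies = find_rough_copies(tokens)
```

On this sentence the sketch finds one rough copy whose source is the second "I", whose free final is "cou-", whose interregnum is "I mean" and whose copy is the third "I"; the tokens counted as being in a rough copy are those in the source and the free final.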
The set of free-final words includes all partial words (i.e., ending in a hyphen) and a small set of conjunctions, adverbs and miscellanea, such as and, or, actually, so, etc. The set of interregnum strings consists of a small set of expressions such as uh, you know, I guess, I mean, etc. We search for rough copies in each sentence starting from left to right, searching for longer copies first. After we find a rough copy, we restart searching for additional rough copies following the free final string of the previous copy. We say that a word token is in a rough copy iff it appears in either the source or the free final.⁴ (2) is an example of a rough copy.
\[ \text{(2)}\quad \text{I thought } \underbrace{\text{I}}_{\alpha_1}\ \underbrace{\text{cou-,}}_{\beta}\ \underbrace{\text{I mean,}}_{\gamma}\ \underbrace{\text{I}}_{\alpha_2} \text{ would finish the work} \]

³ We used a smoothing parameter $\epsilon$ as described in [7], which we estimate by using a line-minimization routine to minimize the classifier’s minimum misclassification rate on the development corpus.
⁴ In fact, our definition of rough copy is more complex. For example, if a word token appears in an interregnum and the word immediately following the interregnum also appears in a (different) rough copy, then we say that the interregnum word token appears in a rough copy. This permits us to approximate the Switchboard annotation convention of annotating interregna as EDITED if they appear in iterated edits.

Table 1 lists the conditioning variables used in our classifier. In that table, subscript integers refer to the position of word tokens relative to the current word; e.g., $T_1$ is the POS tag of the following word. The subscript $f$ refers to the tag of the first word of the free final match. If a variable is not defined for a particular word it is given the special value ‘NULL’; e.g., if a word is not in a rough copy then variables such as $N_m$, $N_u$, $N_i$, $N_l$, $N_r$ and $T_f$ all take the value NULL. Flags are boolean-valued variables, while numeric-valued variables are bounded to a value between 0 and 4 (as well as NULL, if appropriate). The three variables $C_t$, $C_w$ and $T_i$ are intended to help the classifier capture very short restarts or repairs that may not involve a rough copy. The flags $C_w$ and $C_t$ indicate whether the orthographic form and/or tag of the next word (ignoring punctuation) are the same as those of the current word. $T_i$ has a non-NULL value only if the current word is followed by an interregnum string; in that case $T_i$ is the POS tag of the word following that interregnum.
As described above, the classifier’s features are sets of variable-value pairs. Given a tuple of variables, we generate a feature for each tuple of values that the variable tuple assumes in the training data. In order to keep the feature set manageable, the tuples of variables we consider are restricted in various ways. The most important of these are constraints of the form ‘if $X_j$ is included among a feature’s variables, then so is $X_k$’. For example, we require that if a feature contains $P_{i+1}$ then it also contains $P_i$ for $i \geq 0$, and we impose a similar constraint on POS tags.
2.3 Empirical evaluation
For the purposes of this research the Switchboard corpus, as distributed by the Linguistic Data Consortium, was divided into four sections (or subcorpora). The training subcorpus consists of all files in the directories 2 and 3 of the parsed/merged Switchboard corpus. Directory 4 is split into three approximately equal-size sections. (Note that the files are not consecutively numbered.) The first of these (files sw4004.mrg to sw4153.mrg) is the testing corpus. All edit detection and parsing results reported herein are from this subcorpus. The files sw4154.mrg to sw4483.mrg are reserved for future use. The files sw4519.mrg to sw4936.mrg are the development corpus. In the complete corpus three parse trees were sufficiently ill formed that our tree-reader failed to read them. These trees received trivial modifications to allow them to be read, e.g., adding the missing extra set of parentheses around the complete tree.
We trained our classifier on the parsed data files in the training and development sections, and evaluated the classifier on the test section. Section 3 evaluates the parser’s output in conjunction with this classifier; this section focuses on the classifier’s performance at the individual word token level. In our complete application, the classifier uses a bitag tagger to assign each word a POS tag. Like all such taggers, our tagger has a nonnegligible error rate, and these tagging errors could conceivably affect the performance of the classifier. To determine if this is the case, we report classifier performance when trained both on "Gold Tags" (the tags assigned by the human annotators of the Switchboard corpus) and on "Machine Tags" (the tags assigned by our bitag tagger). We compare these results to a baseline "null" classifier, which never identifies a word as EDITED. Our basic measure of performance is the word misclassification rate (see Section 2.1). However, we also report precision and recall scores for EDITED words alone.
All words are assigned one of the two possible labels, EDITED or not. However, in our evaluation we report the accuracy of only words other than punctuation and filled pauses. Our logic here is much the same as that in the statistical parsing community which ignores the location of punctuation for purposes of evaluation [3,5,6] on the grounds that its placement is entirely conventional. The same can be said for filled pauses in the Switchboard corpus.
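The three measures used in this section can be computed as in the following sketch. The function name `edit_scores`, the boolean encoding of labels, and the toy data are assumptions made for illustration; the `scored` mask stands for the exclusion of punctuation and filled pauses described above.

```python
def edit_scores(gold, predicted, scored):
    """Misclassification rate plus EDITED precision and recall.

    gold and predicted hold True for EDITED tokens; scored[k] is False for
    tokens left out of the evaluation (punctuation and filled pauses).
    """
    pairs = [(g, p) for g, p, s in zip(gold, predicted, scored) if s]
    errors = sum(g != p for g, p in pairs)
    true_pos = sum(g and p for g, p in pairs)
    n_pred = sum(p for _, p in pairs)   # tokens predicted EDITED
    n_gold = sum(g for g, _ in pairs)   # tokens EDITED in the gold standard
    rate = errors / len(pairs)
    precision = true_pos / n_pred if n_pred else None
    recall = true_pos / n_gold if n_gold else None
    return rate, precision, recall

# Toy example: six tokens, the third of which is unscored punctuation.
gold      = [True, True, False, False, False, False]
predicted = [True, False, False, False, False, False]
scored    = [True, True, False, True, True, True]
rate, precision, recall = edit_scores(gold, predicted, scored)

# A "null" classifier never predicts EDITED, so its precision is undefined.
null_scores = edit_scores(gold, [False] * len(gold), scored)
```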
Our results are given in Table 2. They show that our classifier makes only approximately 1/3 of the misclassification errors made by the null classifier (0.022 vs. 0.059), and that using the POS tags produced by the bitag tagger does not have much effect on the classifier’s performance (e.g., EDITED recall decreases from 0.678 to 0.668).

Variable(s)                   Description
W_0                           Orthographic word
P_0, P_1, P_2, P_f            Partial word flags
T_{-1}, T_0, T_1, T_2, T_f    POS tags
N_m                           Number of words in common in source and copy
N_u                           Number of words in source that do not appear in copy
N_i                           Number of words in interregnum
N_l                           Number of words to left edge of source
N_r                           Number of words to right edge of source
C_t                           Followed by identical tag flag
C_w                           Followed by identical word flag
T_i                           Post-interregnum tag flag
Table 1: Conditioning variables used in the EDITED classifier.

3 Parsing transcribed speech
We now turn to the second pass of our two-pass architecture, using an "off-the-shelf" statistical parser to parse the transcribed speech after having removed the words identified as edited by the first pass. We first define the evaluation metric we use and then describe the results of our experiments.
3.1 Parsing metrics
In this section we describe the metric we use to grade the parser output. As a first desideratum we want a metric that is a logical extension of that used to grade previous statistical parsing work. We have taken as our starting point what we call the "relaxed labeled precision/recall" metric from previous research (e.g. [3,5]). This metric is characterized as follows. For a particular test corpus let $N$ be the total number of nonterminal (and non-preterminal) constituents in the gold standard parses. Let $M$ be the number of such constituents returned by the parser, and let $C$ be the number of these that are correct (as defined below). Then precision $= C/M$ and recall $= C/N$.
A constituent $c$ is correct if there exists a constituent $d$ in the gold standard such that:
1. label($c$) = label($d$)⁵
2. begin($c$) $\equiv_r$ begin($d$)
3. end($c$) $\equiv_r$ end($d$)

⁵ For some reason, starting with [12] the labels ADVP and PRT are considered to be identical as well.

In 2 and 3 above we introduce an equivalence relation $\equiv_r$ between string positions. We define $\equiv_r$ to be the smallest equivalence relation satisfying $a \equiv_r b$ for all pairs of string positions $a$ and $b$ separated solely by punctuation symbols. The parsing literature uses $\equiv_r$ rather than $=$ because it is felt that two constituents should be considered equal if they disagree only in the placement of, say, a comma (or any other sequence of punctuation), where one constituent includes the punctuation and the other excludes it.
Our new metric, "relaxed edited labeled precision/recall" is identical to relaxed labeled precision/recall except for two modifications. First, in the gold standard all non-terminal subconstituents of an EDITED node are removed and the terminal constituents are made immediate children of a single EDITED node. Furthermore, two or more EDITED nodes with no separating non-edited material between them are merged into a single EDITED node. We call this version a "simplified gold standard parse." All precision and recall measurements are taken with respect to the simplified gold standard.
Second, we replace $\equiv_r$ with a new equivalence relation $\equiv_e$ which we define as the smallest equivalence relation containing $\equiv_r$ and satisfying begin($c$) $\equiv_e$ end($c$) for each EDITED node $c$ in the gold standard parse.⁶

⁶ We considered but ultimately rejected defining $\equiv_e$ using the EDITED nodes in the returned parse rather than the simplified gold standard. We rejected this because the $\equiv_e$ relation would then itself be dependent on the parser’s output, a state of affairs that might allow complicated schemes to improve the parser’s performance as measured by the metric.

Classifier                Null     Gold Tags   Machine Tags
Misclassification rate    0.059    0.021       0.022
EDITED precision          -        0.952       0.944
EDITED recall             0        0.678       0.668
Table 2: Performance of the "null" classifier (which never marks a word as EDITED) and boosting classifiers trained on "Gold Tags" and "Machine Tags".

String position:      1 2 3 4 5 6 7 8
Edit marks:           E E E E
Words:                the , bagel with uh , doughnut
Equivalent position:  1 2 2 4 5 2 2 8
Figure 1: Equivalent string positions as defined by $\equiv_e$.

We give a concrete example in Figure 1. The first row indicates string position (as usual in parsing work, position indicators are between words). The second row gives the words of the sentence. Words that are edited out have an "E" above them. The third row indicates the equivalence relation by labeling each string position with the smallest such position with which it is equivalent.
There are two basic ideas behind this definition. First, we do not care where the EDITED nodes appear in the tree structure produced by the parser. Second, we are not interested in the fine structure of EDITED sections of the string, just the fact that they are EDITED. That we do care which words are EDITED comes into our figure of merit in two ways. First, (non-contiguous) EDITED nodes remain, even though their substructure does not, and thus they are counted in the precision and recall numbers. Secondly (and probably more importantly), failure to decide on the correct positions of edited nodes can cause collateral damage to neighboring constituents by causing them to start or stop in the wrong place. This is particularly relevant because according to our definition, while the positions at the beginning and ending of an edit node are equivalent, the interior positions are not (unless related by the punctuation rule). See Figure 1.
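The equivalence relation can be computed with a union-find structure over string positions. The sketch below is one possible realisation, not the authors' code: the function name, the `is_punct` predicate, and in particular the choice of positions 2 to 6 as the extent of the EDITED node are assumptions made so that the example reproduces the third row of Figure 1.

```python
def equivalence_labels(tokens, is_punct, edited_spans):
    """Label each string position with the smallest equivalent position.

    Positions are 1-based and lie between words, so token k (0-based)
    spans positions k+1 and k+2. edited_spans holds (begin, end) position
    pairs, one per EDITED node of the simplified gold standard.
    """
    n_pos = len(tokens) + 1
    parent = list(range(n_pos + 1))            # index 0 is unused

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]      # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest position is the root

    for k, tok in enumerate(tokens):
        if is_punct(tok):
            union(k + 1, k + 2)                # punctuation rule
    for begin, end in edited_spans:
        union(begin, end)                      # begin(c) equivalent to end(c)
    return [find(p) for p in range(1, n_pos + 1)]

tokens = ["the", ",", "bagel", "with", "uh", ",", "doughnut"]
# Assumption: the EDITED node covers ", bagel with uh" (positions 2 to 6).
labels = equivalence_labels(tokens, lambda t: t == ",", [(2, 6)])
```

Under these assumptions `labels` is `[1, 2, 2, 4, 5, 2, 2, 8]`: positions 4 and 5, which are interior to the EDITED node, stay in classes of their own.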
3.2 Parsing experiments
The parser described in [3] was trained on the Switchboard training corpus as specified in Section 2.3. The input to the training algorithm was the gold standard parses minus all EDITED nodes and their children.
We tested on the Switchboard testing subcorpus (again as specified in Section 2.3). All parsing results reported herein are from all sentences of length less than or equal to 100 words and punctuation. When parsing the test corpus we carried out the following operations:
1. create the simplified gold standard parse by removing non-terminal children of an EDITED node and merging consecutive EDITED nodes.
2. remove from the sentence to be fed to the parser all words marked as edited by an edit detector (see below).
3. parse the resulting sentence.
4. add to the resulting parse EDITED nodes containing the terminal symbols (words) removed in step 2. The nodes are added as high as possible (though the definition of equivalence from Section 3.1 should make the placement of this node largely irrelevant).
5. evaluate the parse from step 4 against the simplified gold standard parse from step 1.
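Step 1 can be illustrated on a toy tree. The representation below (a node is a `(label, children)` pair and a preterminal has a single string child) and the function names are assumptions made for the example. As a simplification, the sketch merges only EDITED nodes that are adjacent siblings, whereas the definition in Section 3.1 merges any EDITED nodes with no non-edited material between them.

```python
def is_preterminal(node):
    """A preterminal is a POS tag dominating a single word string."""
    return len(node[1]) == 1 and isinstance(node[1][0], str)

def preterminals(node):
    """Collect the preterminal descendants of a node, left to right."""
    if is_preterminal(node):
        return [node]
    out = []
    for child in node[1]:
        out.extend(preterminals(child))
    return out

def simplify(node):
    """Flatten EDITED nodes and merge adjacent EDITED siblings."""
    label, children = node
    if is_preterminal(node):
        return node
    if label == "EDITED":
        return ("EDITED", preterminals(node))
    merged = []
    for child in (simplify(c) for c in children):
        if merged and child[0] == "EDITED" and merged[-1][0] == "EDITED":
            merged[-1] = ("EDITED", merged[-1][1] + child[1])
        else:
            merged.append(child)
    return (label, merged)

gold = ("S", [
    ("EDITED", [("S", [("NP", [("PRP", ["I"])]),
                       ("ADVP", [("RB", ["really"])])])]),
    ("EDITED", [("NP", [("PRP", ["he"])])]),
    ("NP", [("PRP", ["I"])]),
    ("VP", [("VBP", ["like"]), ("NP", [("NN", ["pizza"])])]),
])
simplified = simplify(gold)
```

After simplification the two EDITED nodes become one flat EDITED node dominating the preterminals for "I", "really" and "he", and the rest of the tree is unchanged.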
We ran the parser in three experimental situations, each using a different edit detector in step 2. In the first of the experiments (labeled "Gold Edits") the "edit detector" was simply the simplified gold standard itself. This was to see how well the parser would do if it had perfect information about the edit locations.
In the second experiment (labeled "Gold Tags"), the edit detector was the one described in Section 2 trained and tested on the part-of-speech tags as specified in the gold standard trees. Note that the parser was not given the gold standard part-of-speech tags. We were interested in contrasting the results of this experiment with that of the third experiment to gauge what improvement one could expect from using a more sophisticated tagger as input to the edit detector.
In the third experiment ("Machine Tags") we used the edit detector based upon the machine generated tags.
The results of the experiments are given in Table 3. The last line in the table indicates the performance of this parser when trained and tested on Wall Street Journal text [3]. It is the "Machine Tags" results that we consider the "true" capability of the detector/parser combination: 85.3% precision and 86.5% recall.
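Table 3 also lists an F-measure, which the text does not define. Assuming it is the usual balanced F-measure (the harmonic mean of precision and recall), the tabulated values for the two detector-based experiments can be checked from the rounded precision and recall figures, as in this short sketch; the function name is ours.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f_machine_tags = f_measure(85.3, 86.5)  # "Machine Tags" row of Table 3
f_gold_tags = f_measure(85.4, 86.6)     # "Gold Tags" row of Table 3
```

Rounded to one decimal place these give 85.9 and 86.0, matching the table.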
3.3 Discussion
The general trends of Table 3 are much as one might expect. Parsing the Switchboard data is much easier given the correct positions of the EDITED nodes than without this information. The difference between the Gold-tags and the Machine-tags parses is small, as would be expected from the relatively small difference in the performance of the edit detector reported in Section 2. This suggests that putting significant effort into a tagger for use by the edit detector is unlikely to produce much improvement. Also, as one might expect, parsing conversational speech is harder than Wall Street Journal text, even given the gold-standard EDITED nodes.
Probably the only aspect of the above numbers likely to raise any comment in the parsing community is the degree to which precision numbers are lower than recall. With the exception of the single pair reported in [3] and repeated above, no precision values in the recent statistical-parsing literature [2,3,4,5,14] have ever been lower than recall values. Even this one exception is by only 0.1% and not statistically significant.
We attribute the dominance of recall over precision primarily to the influence of edit-detector mistakes. First, note that when given the gold standard edits the difference is quite small (0.3%). When using the edit detector's edits the difference increases to 1.2%. Our best guess is that because the edit detector has high precision and lower recall, many more words are left in the sentence to be parsed. Thus one finds more nonterminal constituents in the machine parses than in the gold parses and the precision is lower than the recall.
4 Previous research
While there is a significant body of work on finding edit positions [1,9,10,13,17,18], it is difficult to make meaningful comparisons between the various research efforts as they differ in (a) the corpora used for training and testing, (b) the information available to the edit detector, and (c) the evaluation metrics used. For example, [13] uses a subsection of the ATIS corpus, takes as input the actual speech signal (and thus has access to silence duration but not to words), and uses as its evaluation metric the percentage of time the program identifies the start of the interregnum (see Section 2.2). On the other hand, [9,10] use an internally developed corpus of sentences, work from a transcript enhanced with information from the speech signal (and thus use words), but do use a metric that seems to be similar to ours. Undoubtedly the work closest to ours is that of Stolcke et al. [18], which also uses the transcribed Switchboard corpus. (However, they use information on pause length, etc., that goes beyond the transcript.) They categorize the transitions between words into more categories than we do. At first glance there might be a mapping between their six categories and our two, with three of theirs corresponding to EDITED words and three to not edited. If one accepts this mapping they achieve an error rate of 2.6%, down from their NULL rate of 4.5%, as contrasted with our error rate of 2.2% down from our NULL rate of 5.9%. The difference in NULL rates, however, raises some doubts that the numbers are truly measuring the same thing.

Experiment      Labeled Precision   Labeled Recall   F-measure
Gold Edits      87.8                88.1             88.0
Gold Tags       85.4                86.6             86.0
Machine Tags    85.3                86.5             85.9
WSJ             89.5                89.6
Table 3: Results of Switchboard parsing, sentence length ≤ 100.

There is also a small body of work on parsing disfluent sentences [8,11]. Hindle’s early work [11] does not give a formal evaluation of the parser’s accuracy. The recent work of Core and Schubert [8] does give such an evaluation, but on a different corpus (from the Rochester Trains project). Also, their parser is not statistical and returns parses on only 62% of the strings, and 32% of the strings that constitute sentences. Our statistical parser naturally parses all of our corpus. Thus it does not seem possible to make a meaningful comparison between the two systems.
5 Conclusion
We have presented a simple architecture for parsing transcribed speech in which an edited word detector is first used to remove such words from the sentence string, and then a statistical parser trained on edited speech (with the edited nodes removed) is used to parse the text. The edit detector reduces the misclassification rate on edited words from the null-model (marking everything as not edited) rate of 5.9% to 2.2%. To evaluate our parsing results we have introduced a new evaluation metric, relaxed edited labeled precision/recall. The purpose of this metric is to make evaluation of a parse tree relatively indifferent to the exact tree position of EDITED nodes, in much the same way that the previous metric, relaxed labeled precision/recall, makes it indifferent to the attachment of punctuation. By this metric the parser achieved 85.3% precision and 86.5% recall.
There is, of course, great room for improvement, both in stand-alone edit detectors, and their combination with parsers. Also of interest are models that compute the joint probabilities of the edit detection and parsing decisions, that is, do both in a single integrated statistical process.
References
1. Bear, J., Dowding, J. and Shriberg, E. Integrating multiple knowledge sources for detection and correction of repairs in human-computer dialog. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics. 56-63.
2. Charniak, E. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence. AAAI Press/MIT Press, Menlo Park, CA, 1997, 598-603.
3. Charniak, E. A maximum-entropy-inspired parser. In Proceedings of the 2000 Conference of the North American Chapter of the Association for Computational Linguistics. ACL, New Brunswick NJ, 2000.
4. Collins, M. J. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the ACL. 1996.
5. Collins, M. J. Three generative lexicalized models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL. 1997, 16-23.
6. Collins, M. J. Head-Driven Statistical Models for Natural Language Parsing. University of Pennsylvania, Ph.D. Dissertation, 1999.
7. Collins, M. J. Discriminative reranking for natural language parsing. In Proceedings of the International Conference on Machine Learning (ICML 2000). 2000.
8. Core, M. G. and Schubert, L. K. A syntactic framework for speech repairs and other disruptions. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. 1999, 413-420.
9. Heeman, P. A. and Allen, J. F. Intonational boundaries, speech repairs and discourse markers: modeling spoken dialog. In 35th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics. 1997, 254-261.
10. Heeman, P. A. and Allen, J. F. Speech repairs, intonational phrases and discourse markers: modeling speakers’ utterances in spoken dialogue. Computational Linguistics 25, 4 (1999).
11. Hindle, D. Deterministic parsing of syntactic non-fluencies. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics. 1983, 123-128.
12. Magerman, D. M. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. 1995, 276-283.
13. Nakatani, C. H. and Hirschberg, J. A corpus-based study of repair cues in spontaneous speech. Journal of the Acoustical Society of America 95, 3 (1994), 1603-1616.
14. Ratnaparkhi, A. Learning to parse natural language with maximum entropy models. Machine Learning 34, 1/2/3 (1999), 151-176.
15. Shriberg, E. E. Preliminaries to a Theory of Speech Disfluencies. PhD Dissertation, Department of Psychology, University of California-Berkeley, 1994.
16. Singer, Y. and Schapire, R. E. Improved boosting algorithms using confidence-based predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. 1998, 80-91.
17. Stolcke, A. and Shriberg, E. Automatic linguistic segmentation of conversational speech. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP-96). 1996.
18. Stolcke, A., Shriberg, E., Bates, R., Ostendorf, M., Hakkani, D., Plauche, M., Tür, G. and Lu, Y. Automatic detection of sentence boundaries and disfluencies based on recognized words. Proceedings of the International Conference on Spoken Language Processing 5 (1998), 2247-2250.
