
<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1003">
  <Title>High Performance Segmentation of Spontaneous Speech Using Part of Speech and Trigger Word Information</Title>
  <Section position="4" start_page="12" end_page="12" type="metho">
    <SectionTitle>
2 Data preparation
</SectionTitle>
    <Paragraph position="0"> For our experiments we took as data the first 1000 turns (roughly 12000 words or 12 full dialogues) of transcripts from the Switchboard corpus in a version that is already annotated for parts of speech (e.g.</Paragraph>
    <Paragraph position="1"> noun, adjective, personal pronoun, etc.).</Paragraph>
    <Paragraph position="2"> The definition of a small clause which we wanted the neural network to learn the boundaries of is as follows: Any finite clause that contains an inflected verbal form and a subject (or at least either of them, if not possible otherwise). However, common phrases such as good bye, and stuff like that, etc. are also considered small clauses.</Paragraph>
    <Paragraph position="3"> Preprocessing the data involved (i) expansion of some contracted forms (e.g. l'm -+ I am), (ii) correction of frequent tagging errors, and (iii) generation of segment boundary candidates using some simple heuristics to speed up manual editing.</Paragraph>
    <Paragraph position="4"> Thus we obtained a total of 1669 segment boundaries, which means that on average approximately after every seventh token (i.e. 14% of the text) there is a segment boundary.</Paragraph>
  </Section>
  <Section position="5" start_page="12" end_page="12" type="metho">
    <SectionTitle>
3 Features and input encoding
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
3.1 Features
</SectionTitle>
      <Paragraph position="0"> The transcripts are tagged with part of speech (POS) data from a set of 39 tags 1 and were processed to extract trigger words, i.e. words that are frequently near small clause boundaries (&lt;b&gt;). Two scores were assigned to each word w in the transcript according to the following formulae:</Paragraph>
      <Paragraph position="2"> where C is the number of times w occurred as the word (before/after) a boundary, and /5 is the Bayesian estimate for the probability that a boundary occurs (after/before) w.</Paragraph>
      <Paragraph position="3"> This score is thus high for words that are likely (based on/5) and reliable (based on C) predictors of small clause boundaries.</Paragraph>
      <Paragraph position="4"> The pre- and post-boundary trigger words were then merged and the top 30 selected to be used as features for the neural network.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
3.2 Input encoding
</SectionTitle>
      <Paragraph position="0"> The information generated for each word consisted of a data label (a unique tracking number, the actual word, and its part of speech), a vector of real values xl, ..., xc and a label ('+' or '-') indicating whether a segment boundary had preceded the word in the original segmented corpus.</Paragraph>
      <Paragraph position="1"> The real numbers xl, ..., xc are the values given as input to the first layer of the network. We tested three different encodings:  1. Boolean encoding of POS: xi (1 &lt; i &lt; c = 39) is set to 0.9 if the word's part of speech is the i th part of speech, and to 0.1 otherwise.</Paragraph>
      <Paragraph position="2"> 2. Boolean encoding of triggers: xi (1 &lt; i &lt; c = 30) is set to 0.9 if the word is the ith trigger, and to 0.1 otherwise.</Paragraph>
      <Paragraph position="3"> 3. Concatenation of boolean POS and trigger encodings (c = 39 + 30 = 69).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="12" end_page="13" type="metho">
    <SectionTitle>
4 The neural network
</SectionTitle>
    <Paragraph position="0"> We use a fully connected feed-forward three-layer (input, hidden, and output) artificial neural network and the standard backpropagation algorithm to train it (with learning rate ~/= 0.3 and momentum ~ = 0.3).</Paragraph>
    <Paragraph position="1"> Given a window size of W and c features per encoded word, the input layer is dimensioned to c x W units, that is W blocks of c units.</Paragraph>
    <Paragraph position="2"> The number of hidden units (h) ranged in our experiments from 1 to 25.</Paragraph>
    <Paragraph position="3">  POS and trigger encoding, W = 6, h = 2, 0 = 0.7) As for the output layer, in all the experiments it was fixed to a single output unit which indicates the presence or absence of a segment boundary just before the word currently at the middle of the window. The actual threshold to decide between segment boundary and no segment boundary is the parameter 0 which we varied from 0.1 to 0.9.</Paragraph>
    <Paragraph position="4"> The data was presented to the network by simulating a sliding window over the sequence of encoded words, that is by feeding the input layer with the c x W encodings of, say, words wi...wi+w-1 and then, as the next input to the network, shifting the values one block (c units) to the left, thereby admitting from the right the c values corresponding to the encoding of wi+w. Note that at the beginning of each speaker turn or utterance the first c x (w _ 1) input units need be padded with a &amp;quot;dummy&amp;quot; value, so that the first word can be placed just before the middle of the window. Symmetrically, at the end of each turn, the last c x (w _ 1) input units are also padded.</Paragraph>
  </Section>
  <Section position="7" start_page="13" end_page="14" type="metho">
    <SectionTitle>
5 Results and discussion
</SectionTitle>
    <Paragraph position="0"> We created two data sets for our experiments, all from randomly chosen turns from the original data: (i) the &amp;quot;small&amp;quot; data set (a 20:20:60(%) split between training, validation, and test sets), and (ii) the &amp;quot;large&amp;quot; data set (a 60:20:20(%) split).</Paragraph>
    <Paragraph position="1"> First, we ran 180 experiments on the &amp;quot;small&amp;quot; data set, exhaustively exploring the space defined by varying the following parameters: * encoding scheme: POS only, triggers only, POS and triggers.</Paragraph>
    <Paragraph position="2">  by the neural network divided by total number of boundaries found by the neural network), recall (number of correct boundaries found by the neural network divided by true number of boundaries 2. precision, recall in the data) and F-score (defined as precision+recall / were computed for each training, validation and test sets.</Paragraph>
    <Paragraph position="3"> To be fair, we chose to take the epoch with the maximum F-score on the validation set as the best configuration of the net, and we report results from the test set only. Figure 1 shows a typical training/learning curve of a neural network.</Paragraph>
    <Paragraph position="4"> The best performance was obtained using a net with 2 hidden units, a window size of 6 and the output unit threshold set to 0.7. The following results were achieved.</Paragraph>
    <Paragraph position="5">  Icl ssi catidegnr tdeglprdegcisidegnlrec lllFscdegrel0 8 O.845 0.860 0.852 Some general trends are observed: * As the window size gets larger, the performance increases, but it seems to peak at around size 6.  the 0.5 &lt;/9 &lt; 0.7 interval.</Paragraph>
    <Paragraph position="6"> * Varying the threshold leads to a tradeoff of precision vs. recall.</Paragraph>
    <Paragraph position="7"> To illustrate the last point, we present a graph that shows a comparison between the three encoding methods used, for a window size of 6 (Figure 2). The combined method is only slightly better than the POS method, but they both are clearly superior to the trigger-word method. Still it is interesting to note that quite a reasonable performance can be obtained just by looking at the 30 most indicative pre- and post-boundary trigger-words. Noteworthy is also the behavior of the precision-recall curves: with our method a high level of recall can be maintained even as the output threshold is increased to augment precision.</Paragraph>
    <Paragraph position="8"> In Figure 3, we plot the F-score against the threshold. Whereas for the encodings POS only and POS and triggers, the peaks are in the region between 0.5 and 0.7, for the triggers only encoding, the best F-scores are achieved between 0.3 and 0.5.</Paragraph>
    <Paragraph position="9"> We also ran another 30 experiments with the &amp;quot;large&amp;quot; data set focusing on the region defined by the parameters that achieved the best results in the preceding experiments (i.e. window size 6 or 8, threshold between 0.5 and 0.7, number of hidden units between 1 and 10). Under these constraints, F-scores vary slightly, always remaining between .85 and .88 for both validation and test sets.</Paragraph>
    <Paragraph position="10"> Within this region, therefore, several neural nets yield extremely good performance.</Paragraph>
    <Paragraph position="11"> While Lavie et al. (1996) just report an improvement in the end-to-end performance of the JANUS speech-to-speech translation system when using their segmentation method but do not give details the performance of the segmentation method itself, Stolcke and Shriberg (1996) are more explicit and provide precision and recall results. Moreover Lavie et al. (1996) deal with Spanish input whereas Stolcke and Shriberg (1996), like us, drew their data from the Switchboard corpus.</Paragraph>
    <Paragraph position="12"> Type Harmful? Reason Context false positive no trigger word false positive yes non-clausal and false negative yes speech repair false positive ? trigger word false positive yes non-clausal and false negative yes speech repair false positive no CORRECT false negative no CORRECT false negative yes embedded relative clause false positive no trigger word to work &lt;b&gt; and* when I had work off * and on &lt;b&gt; but * and they are he you know * gets to a certain if you like trip * and fall or something &lt;b&gt; we * that's been &lt;b&gt; but i think * its relevance &lt;b&gt; and she * she was into nursing homes * die very quickly  0 = 0.7). False positive indicates an instance where the net hypothesizes a boundary where there is none. False negative indicates an instance where the net fails to hypothesize a boundary where there is one. A '&lt;b&gt;' indicates a small clause boundary. A '*' indicates the location of the error. Thus here we compare our approach with that of Stolcke and Shriberg (1996). They trained on 1.4 million words and in their best system, achieved precision .69 and recall .85 (which corresponds to an F-score of .76). We trained on 2400 words (i.e. over 500 times less training data), and we achieved an F-score of .85 (i.e. a 12% improvement).</Paragraph>
  </Section>
  <Section position="8" start_page="14" end_page="14" type="metho">
    <SectionTitle>
6 Error analysis
</SectionTitle>
    <Paragraph position="0"> Table 1 shows 10 representative errors that one of the best performing neural network made on the test set. 25 randomly selected errors were used to do the error analysis, which consisted of 14 false positives and 11 false negatives. 8 of the errors were errors we considered to be harmful to the parser, 3 were errors of unknown harmfulness, and the remaining 14 were considered harmless.</Paragraph>
    <Paragraph position="1"> Of the harmful errors, three were due to the word and being used as a conjunction in a non-clausal context, two were due to a failure to detect a speech repair, and one was due to an embedded relative clause (most people that move into nursing homes * die very quickly).</Paragraph>
    <Paragraph position="2"> The network was also 'able to correctly identify some mistagged data (marked as CORRECT in Table 1).</Paragraph>
    <Paragraph position="3"> These results suggest that adding features relevant to speech repairs (such as whether words were repeated) or features relevant to detecting the use of and as a non-clausal conjunct might be useful in achieving better accuracy.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML