<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2023"> <Title>Summarizing Speech Without Text Using Hidden Markov Models</Title> <Section position="4" start_page="89" end_page="90" type="metho"> <SectionTitle> 3 Using Continuous HMM for Speech </SectionTitle> <Paragraph position="0"> Summarization We define our HMM by the following parameters: Ohm = 1..N : The state space, representing a set of states where N is the total number of states in the model; O = o1k,o2k,o3k,...oMk : The set of observation vectors, where each vector is of size k; A = {aij} : The transition probability matrix, where aij is the probability of transition from state i to state j; bj(ojk) : The observation probability density function, estimated by SMk=1cjkN(ojk,ujk,Sjk), where ojk denotes the feature vector; N(ojk,ujk,Sjk) denotes a single Gaussian density function with mean of ujk and covariance matrix Sjk for the state j, with M the number of mixture components and cjk the weight of the kth mixture component; P = pii : The initial state probability distribution. For convenience, we define the parameters for our HMM by a set l that represents A, B and P. We can use the parameter set l to evaluate P(O|l), i.e. to measure the maximum likelihood performance of the output observables O. In order to evaluate P(O|l), however, we first need to compute the probabilities in the matrices in the parameter set l The Markov assumption that state durations have a geometric distribution defined by the probability of self transitions makes it difficult to model durations in an HMM. If we introduce an explicit duration probability to replace self transition probabilities, the Markov assumption no longer holds.</Paragraph> <Paragraph position="1"> Yet, HMMs have been extended by defining state duration distributions called Hidden Semi-Markov Model (HSMM) that has been succesfully used (Tweed et. al., 2005). Similar to (Tweed et. al., 2005)'s use of HSMMs, we want to model the position of a sentence in the source document explicitly. But instead of building an HSMM, we model this positional information by building our position-sensitive HMM in the following way: We first discretize the position feature into L number of bins, where the number of sentences in each bin is proportional to the length of the document.</Paragraph> <Paragraph position="2"> We build 2 states for each bin where the second state models the probability of the sentence being included in the document's summary and the other models the exclusion probability. Hence, for L bins we have 2L states. For any bin lth where 2l and 2l [?] 1 are the corresponding states, we remove all transitions from these states to other states except 2(l+1) and 2(l+1)[?]1. This converts our ergodic L state HMM to an almost Left-to-Right HMM though l states can go back to l [?] 1. This models sentence position in that decisions at the lth state can be arrived at only after decisions at the (l [?] 1)th state have been made. For example, if we discretize sentence position in document into 10 bins, such that 10% of sentences in the document fall into each bin, then states 13 and 14, corresponding to the seventh bin (.i.e. 
<Paragraph position="3"> The topology of our HMM is shown in Figure 1.</Paragraph>
<Section position="1" start_page="89" end_page="90" type="sub_section"> <SectionTitle> 3.1 Features and Training </SectionTitle>
<Paragraph position="0"> We trained and tested our model on a portion of the TDT-2 corpus previously used in (Maskey and Hirschberg, 2005). This subset includes 216 stories from 20 CNN shows, comprising 10 hours of audio data and the corresponding manual transcripts. An annotator generated a summary for each story by extracting sentences. While we thus rely upon human-identified sentence boundaries, automatic sentence detection procedures have been found to perform with reasonable accuracy compared to human performance (Shriberg et al., 2000).</Paragraph>
<Paragraph position="1"> For these experiments, we extracted only acoustic/prosodic features from the corpus. The intuition behind using acoustic/prosodic features for speech summarization is based on research in speech prosody (Hirschberg, 2002) showing that humans use acoustic/prosodic variation (expanded pitch range, greater intensity, and timing variation) to indicate the importance of particular segments of their speech. In BN, we note that a change in pitch, amplitude or speaking rate may signal differences in the relative importance of the speech segments produced by anchors and reporters, the professional speakers in our corpus. There is also considerable evidence that topic shift is marked by changes in pitch, intensity, speaking rate and duration of pause (Shriberg et al., 2000), and new topics or stories in BN are often introduced with content-laden sentences which, in turn, are often included in story summaries.</Paragraph>
<Paragraph position="2"> Our acoustic feature set consists of 12 features, similar to those used in (Inoue et al., 2004; Christensen et al., 2004; Maskey and Hirschberg, 2005). It includes speaking rate (the ratio of voiced to total frames); F0 minimum, maximum, and mean; F0 range and slope; minimum, maximum, and mean RMS energy (minDB, maxDB, meanDB); RMS slope (slopeDB); and sentence duration (timeLen = endtime - starttime). We extract these features by automatically aligning the annotated manual transcripts with the audio source. We then employ Praat (Boersma, 2001) to extract the features from the audio and produce normalized and raw versions of each. Normalized features were produced by dividing each feature value by the average value of that feature for the speaker, where speaker identity was determined from the Dragon speaker segmentation of the TDT-2 corpus. In general, the normalized acoustic features performed better than the raw values.</Paragraph>
<Paragraph position="3"> We used 197 stories from this labeled corpus to train our HMM. We computed the transition probabilities for the matrix $A_{N \times N}$ as the relative frequencies of the transitions made from each state to the other valid states. We had to compute four transition probabilities for each state, i.e. $a_{ij}$ where $j = i, i+1, i+2, i+3$ if $i$ is odd and $j = i-1, i, i+1, i+2$ if $i$ is even. Odd states signify that a sentence should not be included in the summary, while even states signify sentence inclusion. Observation probabilities were estimated using a mixture of Gaussians with 12 mixture components. We computed a $12 \times 1$ mean vector $\mu$ and a $12 \times 12$ covariance matrix $\Sigma$ for each state.</Paragraph>
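A minimal sketch of this estimation step, assuming per-sentence 12-dimensional feature vectors and gold include/exclude state labels are already available (the helper names and the use of scikit-learn's GaussianMixture are illustrative assumptions, not the toolkit used here):

import numpy as np
from sklearn.mixture import GaussianMixture   # assumed; any GMM fitter would do

def estimate_transitions(state_sequences, n_states):
    """Relative-frequency estimates of a_ij from labeled state sequences."""
    counts = np.zeros((n_states, n_states))
    for seq in state_sequences:               # one 0-indexed state sequence per story
        for s, t in zip(seq[:-1], seq[1:]):
            counts[s, t] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

def estimate_emissions(features, states, n_states, n_mix=12):
    """Fit a 12-component full-covariance Gaussian mixture per state."""
    gmms = []
    for s in range(n_states):
        X = features[states == s]             # 12-dimensional acoustic/prosodic vectors
        gmms.append(GaussianMixture(n_components=n_mix,
                                    covariance_type='full').fit(X))
    return gmms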
<Paragraph position="4"> We then computed the maximum likelihood estimates and found the optimal sequence of states for predicting the selection of summary sentences using the Viterbi algorithm. This approach maximizes the probability of sentence inclusion incrementally at each stage.</Paragraph>
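A minimal sketch of this decoding step, assuming log-domain inputs (a generic Viterbi implementation, illustrative rather than the exact procedure used here; log_A can be the logarithm of the transition matrix estimated above, with zero-probability transitions becoming negative infinity, and log_B the per-state GMM log-likelihoods of each sentence):

import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path given log initial, transition and emission scores.

    log_B has shape (T, N): log-likelihood of each sentence's feature vector
    under each state's Gaussian mixture (e.g. GaussianMixture.score_samples).
    """
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # (N, N): from-state x to-state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Sentences decoded into inclusion states (even 1-indexed, i.e. odd 0-indexed)
# form the extractive summary:
# summary_flags = [s % 2 == 1 for s in viterbi(log_pi, log_A, log_B)]
</Section> </Section> </Paper>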