<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2021">
  <Title>Initial Study on Automatic Identification of Speaker Role in Broadcast News Speech</Title>
  <Section position="5" start_page="81" end_page="82" type="metho">
    <SectionTitle>
3 Speaker Role Identification Approaches
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
3.1 Hidden Markov Model (HMM)
</SectionTitle>
      <Paragraph position="0"> proach for speaker role labeling. This is a simple first order HMM.</Paragraph>
      <Paragraph position="1"> The HMM has been widely used in many tagging problems. Stolcke et al. (Stolcke et al., 2000) used it for dialog act classification, where each utterance (or dialog act) is used as the observation. In speaker role detection, the observation is composed of a much longer word sequence, i.e., the entire speech from one speaker. Figure 1 shows the graphical representation of the HMM for speaker role identification, in which the states are the speaker roles, and the observation associated with a state consists of the utterances from a speaker. The most likely role sequence</Paragraph>
      <Paragraph position="3"> where O is the observation sequence, in which Oi corresponds to one speaker turn. If we assume what a speaker says is only dependent on his or her role, then:</Paragraph>
      <Paragraph position="5"> From the labeled training set, we train a language model (LM), which provides the transition probabilities in the HMM, i.e., the P(R) term in Equation (1). The vocabulary in this role LM (or role grammar) consists of different role tags. All the sentences belonging to the same role are put together to train a role specific word-based N-gram LM. During testing, to obtain the observation probabilities in the HMM, P(Oi|Ri), each role specific LM is used to calculate the perplexity of those sentences corresponding to a test speaker turn.</Paragraph>
      <Paragraph position="6"> The graph in Figure 1 is a first-order HMM, in which the role state is only dependent on the previous state.</Paragraph>
      <Paragraph position="7"> In order to capture longer dependency relationship, we used a 6-gram LM for the role LM. For each role specific word-based LM, 4-gram is used with Kneser-Ney smoothing. There is a weighting factor when combining the state transitions and the observation probabilities with the best weights tuned on the development set (6 for the transition probabilities in our experiments). In addition, in stead of using Viterbi decoding, we used forward-backward decoding in order to find the most likely role tag for each segment. Finally we may use only a subset of the sentences in a speaker's turn, which are possibly more discriminative to determine the speaker's role. The LM training and testing and HMM decoding are implemented using the SRILM toolkit (Stolcke, 2002).</Paragraph>
    </Section>
    <Section position="2" start_page="81" end_page="82" type="sub_section">
      <SectionTitle>
3.2 Maximum Entropy (Maxent) Classifier
</SectionTitle>
      <Paragraph position="0"> A Maxent model estimates the conditional probability:</Paragraph>
      <Paragraph position="2"> where Zl(O) is the normalization term, functions gk(Ri,O) are indicator functions weighted by l, and k is used to indicate different 'features'. The weights (l) are obtained to maximize the conditional likelihood of the training data, or in other words, maximize the entropy while satisfying all the constraints. Gaussian smoothing (variance=1) is used to avoid overfitting. In our experiments we used an existing Maxent toolkit (available from http://homepages.inf.ed.ac.uk/s0450736/maxent toolkit.</Paragraph>
      <Paragraph position="3"> html).</Paragraph>
      <Paragraph position="4"> The following features are used in the Maxent model: * bigram and trigram of the words in the first and the last sentence of the current speaker turn * bigram and trigram of the words in the last sentence of the previous turn  * bigram and trigram of the words in the first sentence of the following turn Our hypothesis is that the first and the last sentence from a speaker's turn are more indicative of the speaker's role (e.g., self introduction and closing). Similarly the last sentence from the previous speaker segment and the first sentence of the following speaker turn also capture the speaker transition information. Even though sentences from the other speakers are included as features, the Max-ent model makes a decision for each test speaker turn individually without considering the other segments. The impact of the contextual role tags will be evaluated in our experiments.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="82" end_page="82" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="82" end_page="82" type="sub_section">
      <SectionTitle>
4.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> We used the TDT4 Mandarin broadcast news data in this study. The data set consists of about 170 hours (336 shows) of news speech from different sources. In the original transcripts provided by LDC, stories are segmented; however, speaker information (segmentation or identity) is not provided. Using the reference transcripts and the audio files, we manually labeled the data with speaker turns and the role tag for each turn.3 Speaker segmentation is generally very reliable; however, the role annotation is ambiguous in some cases. The interannotator agreement will be evaluated in our future work. In this initial study, we just treat the data as noisy data.</Paragraph>
      <Paragraph position="1"> We preprocessed the transcriptions by removing some bad codes and also did text normalization. We used punctuation (period, question mark, and exclamation) available from the transcriptions (though not very accurate) to generate sentences, and a left-to-right longest word match approach to segment sentences into words. These words/sentences are then used for feature extraction in the Maxent model, and LM training and perplexity calculation in the HMM as described in Section 3. Note that the word segmentation approach we used may not be the-state-of-art, which might have some effect on our experiments.</Paragraph>
      <Paragraph position="2"> 10-fold cross validation is used in our experiments.</Paragraph>
      <Paragraph position="3"> The entire data set is split into ten subsets. Each time one subset is used as the test set, another one is used as the development set, and the rest are used for training.</Paragraph>
      <Paragraph position="4"> The average number of segments (i.e., speaker turns) in the ten subsets is 1591, among which 50.8% are anchors.</Paragraph>
      <Paragraph position="5"> Parameters (e.g., weighting factor) are tuned based on the average performance over the ten development sets, and the same weights are applied to all the splits during testing. null  based on the annotation manual used for English at Columbia</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML