<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1029">
  <Title>Fine-Grained Hidden Markov Modeling for Broadcast- News Story Segmentation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. GENERATIVE MODEL
</SectionTitle>
    <Paragraph position="0"> We model the generation of news stories as a 251 state Hidden Markov Model, with the topology shown in Figure 1. States labeled, 1 to 250, correspond to each of the first 250 words of a story. One extra state, labeled 251, is included to model the production of all words at the end of stories exceeding 250 words in length.</Paragraph>
    <Paragraph position="1"> Several other models were considered, but this model is particularly suited to the features used, as it allows one to model features that vary with depth into the story (Section 3.1), while simultaneously, by delaying certain features. It also allows one to model features that occur in specific regions the boundaries (Section 3.3). This is possible because all states can feed into the initial state, i.e. all stories end by going into the first word of a new story.</Paragraph>
    <Paragraph position="2">  For example, the original model involved a series of beginning and then end states, with a single middle state that could be cycled through (Figure 2). This proved to be a problem because the ends of long stories were being mixed with the ends of short stories which led to problems with our spaced coherence feature (Section 3.1). Another possibility involved splitting the model into two main paths, one to model the shorter stories, and one to model the longer as there is something of a bimodal distribution in story lengths (Figure 4). However, the fine-grained nature of our model would suffer from splitting the data in this manner, and a choice about at which length to fork the model would be somewhat artificial.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="501" type="metho">
    <SectionTitle>
3. FEATURES
</SectionTitle>
    <Paragraph position="0"> Associated with the model is a set of features. For each state, the model assigns a probability distribution over all possible combinations of values the features may take on. The probability assigned to value combinations is assumed to be independent of the state/observation history, conditioned on the state. We further assume that the value of any one feature is independent of all others, once the current state is known. Features have been explicitly designed with this assumption in mind. Three categories of features have been used, which we refer to as coherence features, x-duration feature, and the trigger features.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1. Coherence
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> COHER-4 (Figures 3b, c &amp; d) correspond to similar features; for these, however, the buffer is separated by 50, 100, and 150 words, respectively, from the current word. Interestingly, the COHER-4 feature actually caused a reduction in performance, and was not used in the final evaluation.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2. X-duration
</SectionTitle>
      <Paragraph position="0"> This feature is based on indications given by the speech recognizer that it was unable to transcribe a portion of the audio signal. The existence of an untranscribable section prior to the word gives a non-zero X-DURATION value based on the extent of the section.</Paragraph>
      <Paragraph position="1"> Empirically this is an excellent predictor of boundaries in that an untranscribable event has uniform likelihood of occurring anywhere in a news story, except prior to the first word of a story, where it is extremely likely to occur.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="501" type="sub_section">
      <SectionTitle>
3.3. Triggers
</SectionTitle>
      <Paragraph position="0"> Trigger features correspond to small regions at the beginning and end of stories, and exploit the fact that some words are far more likely to occur in these positions than in other parts of a news segment. One region, for example, is restricted to the first word of the story. In ABC's World News Tonight, for example, the word &amp;quot;finally&amp;quot; is far more likely to occur in the first word of a story than would be expected by its general rate of occurrence in the training data. For a word, w, appearing in the input stream, the value of the feature is an estimate of how likely it is for w to appear in the region of interest. The estimate used is given by:</Paragraph>
      <Paragraph position="2"> n is the total number of occurrences of w; and R f is the fraction of all tokens of w that occurred in the region. This estimate can be viewed as Bayesian estimate with a beta prior. The beta prior is equivalent to a uniform prior and the observation of one occurrence of the word in the region out of</Paragraph>
      <Paragraph position="4"> occurrences. This estimate was chosen so that: 1) the prior probability would not be greatly affected for words observed only a few times in the training data; 2) it would be pushed strongly towards the empirical probability of the word appearing in the region for words that were encountered in R; 3) it has a prior probability, R f , equal to the expectation for a randomly selected word. The regions used for the submission were restricted to the one-word regions for: first word, second word, last word, and  e have used four coherence features. The COHER-1 feature, n schematically in Figure 2a, is based on a buffer of 50 ords immediately prior to the current word. If the current word es not appear in the buffer, the value of COHER-1 is 0. If it does pear in the buffer, the value is -log(s</Paragraph>
      <Paragraph position="6"> stories in which the word appears, and s is the total number of ories, in the training data. Words that did not appear in the aining data, are treated as having appeared once. In this way, re words get high feature values, and common words get low ature values. Three other features: COHER-2, COHER-3, and next-to-last word. Limited experimentation with multi-state regions, was not fruitful. For example, including the regions, {3,4,...,10} and {-10,-9,...,-3}, where -i is interpreted as i words prior to the end of the story, did not improve segmentation performance.</Paragraph>
      <Paragraph position="7"> Since, as described, the current HMM topology does not model end-of-story words (earlier versions of the topology did model these states directly), trigger features for end-of-story regions are delayed. That means that a trigger related to the last word in a story would be delayed by a one word buffer. In this way, it is linked to the first word in the next story. For example, the word &amp;quot;Jennings&amp;quot; (the name of the main anchorperson) is strongly d  correlated with the last word in news stories in the ABC World News Tonight corpus. The estimated probability of it being the last word of the story in which it appears is .235 (obtained by the aforementioned method). The trained model associates a high likelihood of seeing the value .235 at state = 1; the intuitive interpretation being, &amp;quot;a word highly likely to appear at the last word of a story, occurred 1-word ago&amp;quot;.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="501" end_page="501" type="metho">
    <SectionTitle>
4. PARAMETER ESTIMATION
</SectionTitle>
    <Paragraph position="0"> The Hidden Markov Model requires the estimation of transition and conditional observation probabilities. There are 251 transition probabilities to be estimated. Much more of a problem are the observation probabilities, there being 9 features in the model, for each of which a probability distribution over as many as 100 values must be estimated, for each of 251 states. With the goal of developing methods for robust estimation in the context of story segmentation, we have applied non-parametric kernel estimation techniques, using the LOCFIT library [Loader, '99] of the R open-source statistical analysis package, which is based on the S-plus system [Venables &amp; Ripley, `99; Chambers &amp; Hastie, `92, Becker, Chambers &amp; Wilks, `88].</Paragraph>
    <Paragraph position="1"> For the transition probabilities, it is assumed that the underlying probability distribution over story length is smooth, allowing the empirical histogram, shown at the top of Figure 4, to be transformed to the probability density estimate shown at the bottom. From this probability distribution over story lengths, the conditional transition probabilities can be estimated directly.</Paragraph>
    <Paragraph position="2"> Conditional observation probabilities are also deduced from an estimate of the joint probability distribution. First, observation values were binned. Binning limits were set in an attempt to 1) be large enough to obtain sufficient counts for the production of robust probability estimates, and yet, 2) be constrained enough so that important distinctions in the probabilities for different feature values will be reflected in the model. For each bin, the observation counts are smoothed by performing a non-parametric regression of the observation counts as a function of state. The smoothed observations counts corresponding to the regression are then normalized so as to sum to the total observation count for the bin. The result is a conditional probability distribution over states for a given binned feature value, p(State=s|Feature=fv). Once this is done for all bin values, each conditional probability is multiplied by the marginal probability, p(State=s), of being in a given state, resulting in a joint distribution, p(fv,s), over the entire space of (Feature,State) values. From this joint distribution, the necessary conditional probabilities, p(Feature=fv|State=s), can be deduced directly.</Paragraph>
    <Paragraph position="3"> Figure 5 shows the conditional probability estimates, p(fv  |s), for the feature value COHER-3=20, across all states, confirming the intuition that, while the probability of seeing a value of 20 is small for all states, the likelihood of seeing it is much higher in latter parts of a story than it is in early-story states.</Paragraph>
  </Section>
  <Section position="6" start_page="501" end_page="501" type="metho">
    <SectionTitle>
5. SEGMENTATION
</SectionTitle>
    <Paragraph position="0"> Once parameters for the HMM have been determined, segmentation is straightforward. The Viterbi algorithm [Rabiner, `89], is employed to determine the sequence of states most likely to have produced the observation sequence associated with the broadcast. A boundary is then associated with each word produced from State 1 for the maximum likelihood state sequence.</Paragraph>
    <Paragraph position="1"> The version of the Viterbi algorithm we have implemented provides for the specification of &amp;quot;state-penalty&amp;quot; parameters, which we have used for the &amp;quot;boundary state&amp;quot;, state 1. In effect, the probability for each path in consideration is multiplied by the value of this parameter (which can be less than, equal to, or greater than, 1) for each time the path passes through the boundary state. Variation of the parameter effectively controls the &amp;quot;aggressiveness&amp;quot; of segmentation, allowing for tuning system behavior in the context of the evaluation metric.</Paragraph>
  </Section>
  <Section position="7" start_page="501" end_page="501" type="metho">
    <SectionTitle>
6. RESULTS
</SectionTitle>
    <Paragraph position="0"> Preliminary test results of this approach are encouraging. After training on all but 15 of the ABC World News Tonight programs from the TDT-2 corpus [Nist, '00], a test on the remaining 15 produced a false-alarm (boundary predicted incorrectly) probability of .11, with a corresponding miss (true boundary not predicted) probability of .14, equal to the best performance reported to date, for this news source.</Paragraph>
    <Paragraph position="1"> A more intuitive appreciation for the quality of performance can be garnered from the graphs in Figure 6, which contrast the segmentation produced by the system (middle) with ground truth (the top graph), for a typical member of the ABC test set. The x-axis corresponds to time (in units of word tokens); i.e., the index of the word produced by the speech recognizer, and the y-axis  corresponds to the state of the HMM model. A path passing through the point (301, 65), for example, corresponds to a path through the network that produced the 65th word from state 301. Returns to state=1 correspond to boundaries between stories. The bottom graph shows the superposition of the two to help illustrate the agreement between the path chosen by the system and the path corresponding to perfect segmentation..</Paragraph>
  </Section>
  <Section position="8" start_page="501" end_page="501" type="metho">
    <SectionTitle>
7. VISUALIZATION
</SectionTitle>
    <Paragraph position="0"> The evolution of the segmentation algorithm was driven by analysis of the behavior of the system, which was supported by visualization routines developed using the graphing capability of the R package. Figure 7 gives an example of the kind of graphical displays that were used for analysis of the segmentation of a specific broadcast news program; in this case, analysis of the role of the X-DURATION feature. This graphical display allows for the comparison of the maximum likelihood path produced by the HMM to the path through the HMM that would be produced by a perfect system - one privy to ground-truth.</Paragraph>
    <Paragraph position="1">  he true state than from the predicted state. Strongly ne nts are a major component of the probability calculation tha sulted in the system preferring the path it chose over the ath. These points suggest potential deficiencies in the mo heir identification directs the focus of analysis so that sy rformance can be improved by correcting weaknesses of xisting model.</Paragraph>
    <Paragraph position="2">  he top graph corresponds to the bottom graph of Figure 6, owing the states traversed by the two systems. The second h shows the value of the X-DURATION feature rresponding to each word of the broadcast. So, the plotting of a nt at (301, 3) corresponds to an X-DURATION value of 3 ving been observed at time, 301. One thing that can be seen om this graph is that being at a story boundary (low-points on thicker-darker line of the top graph) is more frequent when gher values of the X-DURATION cue are observed, than when wer values are observed, as could be expected.</Paragraph>
    <Paragraph position="3"> he third graph shows, on a log scale, how many times more ely it is that the observed X-DURATION value would be erated from the true state than from the state predicted by the stem. Most points are close to 0, indicating that the X-ATION value observed was as likely to have come from the ue state as it is to have come from the state predicted by the iterbi algorithm. Of course, this is the case wherever the true ate has been correctly predicted. Negative points indicate that the -DURATION value observed is less likely to be produced from The final graph shows the cumulative sum of the values from the graph above it. (Note that the sum of the logs of the probabilitie is equivalent to the cumulative product of probabilities on a l scale.) The graphing of the cumulative sum can be very use when the system is performing poorly due to a small but consistent preference for the observations having been produc by the state sequence chosen by the system. This phenomenon made evident by a steady downward trend in the graph of t cumulative sum. This is in contrast to an overall level trend w occasional downward dips. Note, that a similar graph for the to probability (equal to the product of all the individual feature value probabilities) will always have an overall downward trend, sinc the maximum likelihood path will always have a likelihood  greater than the likelihood of any other path.</Paragraph>
    <Paragraph position="4"> Aside from supporting the detailed analysis of specific features, the productions of these graphs for each of the features, together with the corresponding graph for the total observation probability, allowed us to quickly asses which of the features was most problematic at any given stage of model development.</Paragraph>
  </Section>
class="xml-element"></Paper>