<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0304"> <Title>Text Segmentation Using Exponential Models*</Title> <Section position="4" start_page="35" end_page="38" type="metho"> <SectionTitle> 3 A Feature-Based Approach </SectionTitle> <Paragraph position="0"> Our attack on the segmentation problem is based on a statistical framework that we call feature induction for random fields and exponential models (Berger, Della Pietra, and Della Pietra, 1996; Della Pietra, Della Pietra, and Lafferty, 1997). The idea is to construct a model which assigns to each position in the data stream a probability that a boundary belongs at that position. This probability distribution arises by incrementally building a log-linear model that weighs different &quot;features&quot; of the data. For simplicity, we assume that the features are binary questions. To illustrate (and to show that our approach is in no way restricted to text), consider the task of partitioning a stream of multimedia data containing audio, text and video. In this setting, the features might include questions such as: * Does the phrase COMING UP appear in the last utterance of the decoded speech? * Is there a sharp change in the video stream in the last 20 frames? * Does the language model degrade in performance in the next two utterances? * Is there a &quot;match&quot; between the spectrum of the current image and an image near the last segment boundary? * Are there blank video frames nearby? * Is there a sharp change in the audio stream in the next utterance? The idea of using features is a natural one, and indeed other recent work on segmentation, such as (Litman and Passonneau, 1995), adopts this approach. We take a unique approach to incorporating the information inherent in various features, using the statistical framework of exponential models to choose the best features and combine them in a principled manner.</Paragraph> <Section position="1" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 3.1 A short-range model of language </SectionTitle> <Paragraph position="0"> Central to our approach to segmenting is a pair of tools: a short-range and a long-range model of language.</Paragraph> <Paragraph position="1"> Monitoring the relative behavior of these two models goes a long way towards helping our segmenter sniff out natural breaks in the text. In this section and the next, we describe these language models and explain their utility in identifying segments.</Paragraph> <Paragraph position="2"> The trigram models p_tri(w | w_{-2}, w_{-1}) we employ use the Katz backoff scheme (Katz, 1987) for smoothing. We trained trigram models on two different corpora. The Wall Street Journal corpus (WSJ) is a 38-million word corpus of articles from the newspaper. The model was constructed using a set W of the approximately 20,000 most frequently occurring words in the corpus. Another model was constructed on the Broadcast News corpus (BN), made up of approximately 150 million words (four and a half years) of transcripts of various news broadcasts, including CNN news, political roundtables, NPR broadcasts, and interviews.</Paragraph> <Paragraph position="3"> By restricting the conditioning information to the previous two words, the trigram model is making the simplifying assumption--clearly false--that the use of language one finds in television, radio, and newspaper can be modeled by a second-order Markov process.
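To make the backoff idea concrete, the following is a minimal illustrative sketch (ours, not the authors' implementation) of a Katz-style trigram lookup; the toy tables, the backoff weights alpha_tri and alpha_bi, and the floor value for unseen words are hypothetical placeholders rather than quantities estimated from data:

```python
# Minimal sketch of a Katz-style backoff trigram lookup (illustrative only).
# The probability tables and backoff weights below are hypothetical; a real
# model would hold discounted estimates and alphas computed from corpus counts.

def p_trigram(w, w1, w2, tri, bi, uni, alpha_tri, alpha_bi, floor=1e-7):
    """Return an estimate of p(w | w2 w1), backing off trigram -> bigram -> unigram."""
    if (w2, w1, w) in tri:                 # seen trigram: use its discounted estimate
        return tri[(w2, w1, w)]
    if (w1, w) in bi:                      # back off to the bigram estimate
        return alpha_tri.get((w2, w1), 1.0) * bi[(w1, w)]
    # back off once more to the unigram estimate (floor for unseen words)
    return alpha_tri.get((w2, w1), 1.0) * alpha_bi.get(w1, 1.0) * uni.get(w, floor)

# Toy usage:
tri = {("the", "stock", "market"): 0.4}
bi = {("stock", "market"): 0.3, ("stock", "price"): 0.2}
uni = {"market": 0.01, "price": 0.02}
print(p_trigram("price", "stock", "the", tri, bi, uni, {}, {}))   # backs off to the bigram
```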
Although words prior to w_{-2} certainly bear on the identity of w, higher-order models are impractical: the number of parameters in an n-gram model is O(|W|^n), and finding the resources to compute and store all these parameters becomes a hopeless task for n > 3. Usually the lexical myopia of the trigram model is a hindrance; however, we will see how a segmenter can in fact make positive use of this shortsightedness.</Paragraph> </Section> <Section position="2" start_page="36" end_page="37" type="sub_section"> <SectionTitle> 3.2 A long-range model of language </SectionTitle> <Paragraph position="0"> One of the fundamental characteristics of language, viewed as a stochastic process, is that it is highly nonstationary. Throughout a written document and during the course of spoken conversation, the topic evolves, affecting local statistics on word occurrences. A model which could adapt to its recent context would seem to offer much over a stationary model such as the trigram model. For example, an adaptive model might, for some period of time after seeing a word like HOMERUN, boost the probabilities of the words {HOMERUN, PITCHER, FIELDER, ERROR, BATTER, TRIPLE, OUT}. For an empirically-driven example, we provide an excerpt from the BN corpus. Emphasized words mark where a long-range language model might reasonably be expected to outperform (assign higher probabilities than) a short-range model: Some doctors are more skilled at doing the procedure than others so it's recommended that patients ask doctors about their track record. People at high risk of stroke include those over age 55 with a family history or high blood pressure, diabetes and smokers. We urge them to be evaluated by their family physicians and this can be done by a very simple procedure simply by having them test with a stethoscope for symptoms of blockage.</Paragraph> <Paragraph position="1"> One means of injecting long-range awareness into a language model is by retaining a cache of the most recently seen n-grams which is smoothed together (typically by linear interpolation) with the static model; see for example (Jelinek et al., 1991; Kuhn and de Mori, 1990). Another approach, using maximum entropy methods, introduces a parameter for trigger pairs of mutually informative words, so that the occurrence of certain words in recent context boosts the probability of the words that they trigger (Lau, Rosenfeld, and Roukos, 1993).</Paragraph> <Paragraph position="2"> The method we use here, described in (Beeferman, Berger, and Lafferty, 1997), employs a static trigram model as a &quot;prior,&quot; or default distribution, and adds certain features to a family of conditional exponential models to capture some of the nonstationary features of text. The features are simple trigger pairs of words chosen on the basis of mutual information. Table 1 provides a small sample of the (s, t) trigger pairs used in most of the experiments we will describe.</Paragraph> <Paragraph position="3"> To incorporate triggers into a long-range language model, we begin by constructing a standard, static backoff trigram model p_tri(w | w_{-2}, w_{-1}) as described in Section 3.1. We then build a family of conditional exponential models of the general form p_exp(w | H) = (1 / Z(H)) exp( \sum_i \lambda_i f_i(H, w) ) p_tri(w | w_{-2}, w_{-1}),</Paragraph> <Paragraph position="5"> where H = w_{-N}, w_{-N+1}, ..., w_{-1} is the word history (the N words preceding w in the text), and Z(H) is the normalization constant Z(H) = \sum_{w \in W} exp( \sum_i \lambda_i f_i(H, w) ) p_tri(w | w_{-2}, w_{-1}).</Paragraph> <Paragraph position="7"> (Table 1, caption fragment: sample (s, t) trigger pairs from the BN domain.)
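Before interpreting the trigger table, here is a small sketch (ours, not the paper's code) of how such a trigger-rescaled trigram model can be evaluated; the toy vocabulary, trigger weight, and prior below are assumptions made purely for illustration, and the normalization Z(H) is computed by brute force over the toy vocabulary:

```python
import math

# Illustrative sketch of a trigger-rescaled trigram model (toy numbers, not the paper's).
# Each trigger pair (s, t) carries a weight lam; if s occurred in the last N words of
# the history, the trigram prior for t is multiplied by exp(lam) and then renormalized.

N = 500
triggers = {("vladimir", "gennady"): math.log(19.6)}      # boost factor e^lam = 19.6

def p_exponential(w, history, p_tri, vocab):
    recent = set(history[-N:])
    def boost(t):
        return math.exp(sum(lam for (s, tt), lam in triggers.items()
                            if tt == t and s in recent))
    z = sum(boost(v) * p_tri(v, history) for v in vocab)   # the constant Z(H)
    return boost(w) * p_tri(w, history) / z

# Toy prior and three-word vocabulary, purely for demonstration:
vocab = ["gennady", "market", "price"]
p_tri = lambda w, hist: {"gennady": 0.001, "market": 0.5, "price": 0.499}[w]
print(p_exponential("gennady", ["vladimir", "said"], p_tri, vocab))
```

In practice the sum defining Z(H) ranges over the full vocabulary (roughly 20,000 words here) rather than a toy list, and p_tri would itself be the backoff trigram model of Section 3.1.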
Roughly speaking, after seeing an &quot;s&quot; word, the empirical probability of witnessing the corresponding &quot;t&quot; in the next N words is boosted by the factor in the third column. In the experiments described herein, N = 500. A separate set of (s, t) pairs was extracted from the WSJ corpus.</Paragraph> <Paragraph position="8"> The functions f_i, which depend both on the word history H and the word being predicted, are the features; each f_i is assigned a weight \lambda_i. In the models that we built, feature f_i is an indicator function, testing for the occurrence of a trigger pair (s_i, t_i):</Paragraph> <Paragraph position="10"> f_i(H, w) = 1 if s_i \in H and w = t_i, and f_i(H, w) = 0 otherwise.</Paragraph> <Paragraph position="11"> The above equations reveal that the probability of a word t involves a sum over all words s such that s \in H (s appeared in the past 500 words) and (s, t) is a trigger pair. One propitious manner of viewing this model is to imagine that, when assigning probability to a word w following a history of words H, the model &quot;consults&quot; a cache of words which appeared in H and which are the left half of some (s, t) trigger pair. In general, the cache consists of content words s which promote the probability of their mate t, and correspondingly demote the probability of other words. As described in (Beeferman, Berger, and Lafferty, 1997), for each (s, t) trigger pair there corresponds a real-valued parameter \lambda; the probability of t is boosted by a factor of e^\lambda for the N words following an occurrence of s.</Paragraph> <Paragraph position="12"> The training algorithm we use for estimating the \lambda values is the Improved Iterative Scaling algorithm of (Della Pietra, Della Pietra, and Lafferty, 1997), which is a scheme for solving the maximum likelihood problem that is &quot;dual&quot; to a corresponding maximum entropy problem. Assuming robust estimates for the \lambda parameters, the resulting model is essentially guaranteed to be superior to the trigram model.</Paragraph> <Paragraph position="13"> For a concrete example, if s_i = VLADIMIR and t_i = GENNADY, then f_i = 1 if and only if VLADIMIR appeared in the past N words and the current word w is GENNADY. Consulting Table 1, we see that in the BN corpus, the presence of VLADIMIR will boost the probability of GENNADY by a factor of 19.6 for the next N = 500 words.</Paragraph> </Section> <Section position="3" start_page="37" end_page="38" type="sub_section"> <SectionTitle> 3.3 Language model &quot;relevance&quot; features </SectionTitle> <Paragraph position="0"> A long-range language model such as that described in Section 3.2 uses selected words from the past ten, twenty or more sentences to inform its decision on the possible identity of the next word. This is likely to help if all of these sentences are in the same document as the current word, for in that case the model has presumably begun to adapt to the idiosyncrasies of the current document. In the case of the trigger model described above, the cache will be filled with &quot;relevant&quot; words. In this setting, one would expect a long-range model to outperform a trigram (or other short-range) model, which doesn't avail itself of long-range information.</Paragraph> <Paragraph position="1"> On the other hand, if the present document has just recently begun, the long-range model is wrongly conditioning its decision on information from a different--and presumably unrelated--document.
A soap commercial, for instance, doesn't help a long-range model assign probabilities to the words in the news segment that follows the commercial.</Paragraph> <Paragraph position="2"> Often a long-range model will actually be misled by such irrelevant context; in this case, the myopia of the trigram model is actually helpful.</Paragraph> <Paragraph position="3"> By monitoring the long- and short-range models, one might be more inclined towards a partition when the long-range model suddenly shows a dip in performance--a lower assigned probability to the observed words--compared to the short-range model. Conversely, when the long-range model is consistently assigning higher probabilities to the observed words, a partition is less likely.</Paragraph> <Paragraph position="4"> This motivates a quantitative measure of &quot;relevance,&quot; which we define as the logarithm of the ratio of the probability the exponential model assigns to the next word (or sentence) to that assigned by the short-range trigram model:</Paragraph> <Paragraph position="6"> R = log ( p_exp(w | H) / p_tri(w | w_{-2}, w_{-1}) ). When the exponential model outperforms the trigram model, R > 0.</Paragraph> <Paragraph position="7"> If we observe the behavior of R as a function of the position of the word within a segment, we find that on average R slowly increases from below zero to well above zero. Figure 1 gives a striking graphical illustration of this phenomenon. The figure plots the average value of R as a function of relative position in the segment, with position zero indicating the beginning of a segment. This plot shows that when a segment boundary is crossed the predictions of the adaptive model undergo a dramatic and sudden degradation, and then steadily become more accurate as relevant content words for the new segment are encountered and added to the cache. (The few very high points to the left of a segment boundary are primarily a consequence of the word CNN, which is a trigger word and often appears at the beginning and end of a broadcast news segment.) This observed behavior is consistent with our earlier intuition: the cache of the long-range model is destructive early in a document, when the new content words bear little in common with the content words from the previous article. Gradually, as the cache fills with words drawn from the current article, the long-range model gains steam and R improves.</Paragraph> <Paragraph position="8"> While Figure 1 shows that this behavior is very pronounced as a &quot;law of large numbers,&quot; our feature induction results indicate that relevance is also a very good predictor of boundaries for individual events.</Paragraph> <Paragraph position="9"> In the experiments we report in this paper, we assume that sentence boundaries are provided in the annotation, and so the questions we ask are actually about the relevance score assigned to entire sentences normalized by sentence length, a geometric mean of language model ratios.</Paragraph> </Section> <Section position="4" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 3.4 Vocabulary features </SectionTitle> <Paragraph position="0"> In addition to the estimate of &quot;topicality&quot; that relevance features provide, we included features pertaining to the identity of words before and after potential segment boundaries as candidates in our exponential model. The set of candidate word-based features we use are simple questions of the form * Does the word appear up to 1 sentence in the future? 2 sentences? 3? 5?
* Does the word appear up to 1 sentence in the past? 2 sentences? 3? 5? * Does the word appear up to 5 sentences in the past but not 5 sentences in the future? * Does the word appear up to 5 sentences in the future but not 5 sentences in the past? * Does the word appear up to 1 word in the future? 5 words? * Does the word appear up to 1 word in the past? 5 words? * Does the word begin the preceding sentence?</Paragraph> <Paragraph position="2"> (Figure 1 caption, beginning truncated: ... the adaptive, long-range language model is on average less accurate than a static trigram model. The figure plots the average value of the logarithm of the ratio of the adaptive language model to the static trigram model as a function of relative position in the segment, with position zero indicating the beginning of a segment. The statistics were collected over the roughly seven million words of mixed broadcast news and Reuters data comprising the TDT corpus (see Section 5).)</Paragraph> </Section> </Section> <Section position="5" start_page="38" end_page="39" type="metho"> <SectionTitle> 4 Feature Induction </SectionTitle> <Paragraph position="0"> To cast the problem of determining segment boundaries in statistical terms, we set as our goal the construction of a probability distribution q(b | w), where b \in {YES, NO} is a random variable describing the presence of a segment boundary in context w. We consider distributions in the linear exponential family Q(f, q_0), whose members have the form q(YES | w) = (1 / Z_\lambda(w)) e^{\lambda \cdot f(w)} q_0(YES | w),</Paragraph> <Paragraph position="2"> where q_0(b | w) is a prior or default distribution on the presence of a boundary, and \lambda \cdot f(w) is a linear combination of binary features f_i(w) \in {0, 1} with real-valued feature parameters \lambda_i: \lambda \cdot f(w) = \sum_i \lambda_i f_i(w).</Paragraph> <Paragraph position="4"> The normalization constants Z_\lambda(w) = e^{\lambda \cdot f(w)} q_0(YES | w) + q_0(NO | w) insure that this is indeed a family of conditional probability distributions. (This family of models is closely related to the class of sigmoidal belief networks (Neal, 1992).) Our judgment of the merit of a model q \in Q(f, q_0) relative to a reference distribution p during training is made in terms of the Kullback-Leibler divergence D(p || q) = \sum_w \sum_{b \in {YES, NO}} p(b | w) log ( p(b | w) / q(b | w) ). Thus, when p is chosen to be the empirical distribution of a sample of training events {(w, b)}, we are using the maximum likelihood criterion for model selection. Under certain mild regularity conditions, the maximum likelihood solution q* = argmin_{q \in Q(f, q_0)} D(p || q) exists and is unique. To find this solution, we use the iterative scaling algorithm presented in (Della Pietra, Della Pietra, and Lafferty, 1997).</Paragraph> <Paragraph position="7"> This explains how a model is chosen once we know the features f_1, ..., f_n, but how are these features to be found? The procedure that we follow is a greedy algorithm akin to growing a decision tree. Given an initial distribution q and a set of candidate features C, we consider the one-parameter family of distributions {q_{\alpha, g}}_{\alpha \in R} = Q(g, q) for each g \in C. The gain of the candidate feature g is defined to be G_q(g) = sup_\alpha ( D(p || q) - D(p || q_{\alpha, g}) ). This is the improvement to the model that would result from adding the feature g and adjusting its weight to the best value. After calculating the gain of each candidate feature, the one with the largest gain is chosen to be added to the model, and all of the model's parameters are then adjusted using iterative scaling.
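The selection loop just described can be sketched as follows; this is an illustration under simplifying assumptions rather than the authors' implementation, and the crude grid search over the new weight stands in for the exact one-dimensional optimization:

```python
import math

# Sketch of greedy feature induction for the binary boundary model (illustrative only).
# Events are (context, label) pairs with label 1 for YES and 0 for NO. A feature maps
# a context to {0, 1}. The model used here is
#     q(YES | w)  proportional to  q0(YES | w) * exp(sum_i lam_i * f_i(w)),
# and the gain of a candidate feature is the drop in negative log-likelihood obtained
# by adding it with its best single weight.

def q_yes(w, q0, feats, lams):
    score = q0(w) * math.exp(sum(l * f(w) for f, l in zip(feats, lams)))
    return score / (score + (1.0 - q0(w)))

def neg_log_lik(data, q0, feats, lams):
    return -sum(math.log(q_yes(w, q0, feats, lams) if b else
                         1.0 - q_yes(w, q0, feats, lams)) for w, b in data)

def gain(data, q0, feats, lams, g):
    base = neg_log_lik(data, q0, feats, lams)
    best = min(neg_log_lik(data, q0, feats + [g], lams + [a / 10.0])
               for a in range(-50, 51))            # crude grid search over the new weight
    return base - best

# Toy usage: one candidate feature that fires when INCORPORATED appears nearby.
data = [({"incorporated_nearby": True}, 1), ({"incorporated_nearby": False}, 0)]
q0 = lambda w: 0.05                                 # prior probability of a boundary
g = lambda w: 1 if w["incorporated_nearby"] else 0
print(gain(data, q0, [], [], g))                    # positive: the feature is informative
```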
In this manner, an exponential model is incrementally built up using the most informative features.</Paragraph> <Paragraph position="10"> Having concluded our discussion of our overall approach, we present in Figure 2 a schematic view of the steps involved in building a segmenter using this approach.</Paragraph> </Section> <Section position="6" start_page="39" end_page="40" type="metho"> <SectionTitle> 5 Feature Induction in Action </SectionTitle> <Paragraph position="0"> This section provides a peek at the construction of segmenters for two different domains. Inspecting the sequence of features selected by the induction algorithm reveals much about feature induction in general, and how it applies to the segmenting task in particular. We emphasize that the process of feature selection is completely automatic once the set of candidate features has been selected.</Paragraph> <Paragraph position="1"> The first segmenter was built on the WSJ corpus. The second was built on the Topic Detection and Tracking (TDT) corpus (Allan, to appear). The TDT corpus is a mixed collection of newswire articles and broadcast news transcripts adapted from text corpora previously released by the Linguistic Data Consortium; in particular, portions of data were extracted from the 1995 and 1996 Language Model text collections published by the LDC in support of the DARPA Continuous Speech Recognition project.</Paragraph> <Paragraph position="2"> The extracts used for TDT include material from the Reuters newswire service, and from the Primary Source Media CD-ROM publications of transcripts for news programs that appeared on the ABC, CNN, NPR and PBS broadcast networks; the size of the corpus is roughly 7.5 million words. The TDT corpus was constructed as part of a DARPA-sponsored project intended to study methods for detecting new topics or events and tracking their reappearance and evolution over time.</Paragraph> <Section position="1" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 5.1 WSJ features </SectionTitle> <Paragraph position="0"> For the WSJ experiments, which we describe first, a total of 300,000 candidate features were available to the induction program. Though the trigram prior was trained on 38 million words, the trigger parameters were only trained on a one million word subset of this data.</Paragraph> <Paragraph position="1"> Figure 3 shows the first several features that were selected by the feature induction algorithm. This shows the word or relevance score for each feature together with the value of e^\lambda for the feature after iterative scaling is complete for the final model. The interval diagrams in Figure 3 indicate features that are active over a range of sentences. Thus, the symbol labeled MR., spanning the next 1 sentence with weight 0.07, represents the feature &quot;Does the word MR. appear in the next sentence?&quot; which, if true, contributes a factor of e^\lambda = 0.07 to the exponential model. Similarly, other diagrams represent features that are active over a range of words. For example, the diagram labeled HE, spanning the next 5 words with weight 0.08, represents the question &quot;Does the word HE appear in the next five words?&quot; which is assigned a weight of 0.08.
The diagram labeled SAID, spanning the previous 5 sentences but not the next 5, with weight 2.7, stands for a feature which asks &quot;Does the word SAID appear in the previous five sentences but not in the next five sentences?&quot; and contributes a factor of 2.7 if the answer is &quot;yes.&quot; (Figure 3 caption, fragment: ... the active range of the feature, in words or sentences, relative to the current word.)</Paragraph> <Paragraph position="2"> Most of the features in Figure 3 make a good deal of sense. The first selected feature, for instance, is a strong hint that an article may have just begun; articles in the WSJ corpus often concern companies, and typically the full name of the company (ACME INCORPORATED, for instance) only appears once at the beginning of the article, and subsequently in abbreviated form (ACME). Thus the appearance of INCORPORATED is a strong indication that a new article may have recently begun.</Paragraph> <Paragraph position="3"> The second feature uses the relevance statistic.(1) If the trigger model performs poorly relative to the trigram model in the following sentence, this feature (roughly speaking) boosts the probability of a segment at this location by a factor of 5.3.</Paragraph> <Paragraph position="4"> (Footnote 1: For the WSJ experiments, we modified the language model relevance statistic by adding a weight to each word position depending only on its trigram history w_{-2}, w_{-1}. Although our results require further analysis, we do not believe that this makes a significant difference in the features chosen by the algorithm, or the quantitative performance of the resulting segmenter.)</Paragraph> <Paragraph position="5"> The fifth feature concerns the presence of the word MR. In hindsight, we can explain this feature by noting that in WSJ data the style is to introduce a person in the beginning of an article by writing, for example, WILE E. COYOTE, PRESIDENT OF ACME INCORPORATED... and then later in the article using a shortened form of the name: MR. COYOTE CITED A LACK OF EXPLOSIVES... Thus, the presence of MR. in the following sentence discounts the probability of an article boundary by 0.07, a factor of roughly 14.</Paragraph> <Paragraph position="6"> The sixth feature--which boosts the probability of a segment if the previous sentence contained the word CLOSED--is another artifact of the WSJ domain, where articles often end with a statement of a company's performance on the stock market during the day of the story of interest. Similarly, the end of an article is often made with an invitation to visit a related story; hence a sentence beginning with SEE boosts the probability of a segment boundary by a large factor of 94.8. Since a personal pronoun typically requires an antecedent, the presence of HE among the first words is a sign that the current position is not near an article boundary, and this feature therefore has a discounting factor of 0.082.</Paragraph> </Section> <Section position="2" start_page="40" end_page="41" type="sub_section"> <SectionTitle> 5.2 TDT features </SectionTitle> <Paragraph position="0"> For the TDT experiments, a larger vocabulary and roughly 800,000 candidate features were available to the induction program. Though the trigram prior was trained on approximately 150 million words, the trigger parameters were trained on a 10 million word subset of the BN corpus.</Paragraph> <Paragraph position="1"> Figure 4 reveals the first several features chosen by the induction algorithm. The letter c. appears among several of the first features.
This is because the data is tokenized for speech processing (whence c. N. N. rather than CNN), and the network identification information is often given at the end and beginning of news segments (c. N. N.'S RICHARD BLYSTONE IS HERE TO TELL US...).</Paragraph> <Paragraph position="3"> The first feature asks if the letter c. appears in the previous five words; if so, the probability of a segment boundary is boosted by a factor of 9.0. The personal pronoun I appears as the second feature; if this word appears in the following three sentences then the probability of a segment boundary is discounted. The language model relevance statistic appears for the first time in the sixth feature. The word J. that the seventh and fifteenth features ask about can be attributed to the large number of news stories in the data having to do with the O. J. Simpson trial. The nineteenth feature asks if the term FROM appears among the previous five words, and if the answer is &quot;yes&quot; raises the probability of a segment boundary by more than a factor of two.</Paragraph> <Paragraph position="5"> This feature makes sense in light of the &quot;sign-off&quot; conventions that news reporters and anchors follow (THIS IS WOLF BLITZER REPORTING LIVE FROM THE WHITE HOUSE). Similar explanations of many of the remaining features are easy to guess from a perusal of Figure 4.</Paragraph> </Section> </Section> <Section position="9" start_page="42" end_page="42" type="metho"> <SectionTitle> 6 A Probabilistic Error Metric </SectionTitle> <Paragraph position="0"> Precision and recall statistics are commonly used in natural language processing and information retrieval to assess the quality of algorithms. For the segmentation task they might be used to gauge how frequently boundaries actually occur when they are hypothesized and vice versa. Although they have snuck into the literature in this disguise, we believe they are unwelcome guests.</Paragraph> <Paragraph position="1"> A useful error metric should somehow correlate with the utility of the instrumented procedure in a real application. In almost any conceivable application, a segmenting tool that consistently comes close--off by a sentence, say--is preferable to one that places boundaries willy-nilly. Yet an algorithm that places a boundary a sentence away from the actual boundary every time actually receives worse precision and recall scores than an algorithm that hypothesizes a boundary at every position. It is natural to expect that in a segmenter, close should count for something.</Paragraph> <Paragraph position="2"> A useful metric should also be robust with respect to the scale (words, sentences, paragraphs, for instance) at which boundaries are determined. However, precision and recall are scale-dependent quantities. (Reynar, 1994) uses an error window that redefines &quot;correct&quot; to mean hypothesized within some constant window of units away from a reference boundary, but this approach still suffers from overdiscretizing error, drawing all-or-nothing lines insensitive to gradations of correctness.</Paragraph> <Paragraph position="3"> Finally, for many purposes it is useful to have a metric that is a single number.
A commonly cited flaw of the precision/recall figures is their complementary nature: hypothesizing more boundaries raises recall at the expense of precision, allowing an algorithm designer to tweak parameters to trade precision for recall. One proposed work-around is to employ dynamic time warping to come up with an explicit alignment between the segments proposed by the algorithm and the reference segments, and then to combine insertion, deletion, and substitution errors into an overall penalty. This error metric, in common use in speech recognition, can be computed by a similar Viterbi search. A string edit distance such as this is useful and reasonable for applications like speech or spelling correction partly because it measures how much work a user would have to do to correct the output of the machine. For many of the applications we envision for segmentation, however, the user will not correct the output but will rather browse the returned text to extract information.</Paragraph> <Paragraph position="4"> Our proposed metric satisfies the listed desiderata.</Paragraph> <Paragraph position="5"> It formalizes in a probabilistic manner the effect of document co-occurrence on goodness, in which it is deemed desirable for related units of information to appear in the same document and unrelated units to appear in separate documents.</Paragraph> <Section position="1" start_page="42" end_page="42" type="sub_section"> <SectionTitle> 6.1 The new metric </SectionTitle> <Paragraph position="0"> Segmentation, whether at the word or sentence level, is about identifying boundaries between successive units of information in a text corpus. Two such units are either related or unrelated by the intent of the document author. A natural way to reason about developing a segmentation algorithm is therefore to optimize the likelihood that two such units are correctly labeled as being related or being unrelated. Our error metric P_\mu is simply the probability that two sentences drawn randomly from the corpus are correctly identified as belonging to the same document or not belonging to the same document. More formally, given two segmentations ref and hyp for a corpus n sentences long, P_\mu(ref, hyp) = \sum_{1 <= i <= j <= n} D_\mu(i, j) ( \delta_ref(i, j) \overline{\oplus} \delta_hyp(i, j) ).</Paragraph> <Paragraph position="2"> Here \delta_ref is an indicator function which is 1 if the two corpus indices specified by its parameters belong in the same document, and 0 otherwise; similarly, \delta_hyp is 1 if the two indices are hypothesized to belong in the same document, and 0 otherwise. The operator \overline{\oplus} is the XNOR function (&quot;both or neither&quot;) on its two operands. The function D_\mu is a distance probability distribution over the set of possible distances between sentences chosen randomly from the corpus, and will in general depend on certain parameters \mu such as the average spacing between sentences. If D_\mu is uniform over the length of the text, then the metric represents the probability that any two sentences drawn from the corpus are correctly identified as being in the same document or not.</Paragraph> <Paragraph position="3"> Consider the implications of this for information retrieval. Suppose there is precisely one sentence in a target corpus that satisfies our information demands. For some applications it may be sufficient for the system to return only that sentence, but in general we desire that it return as many sentences directly related to the target sentence as possible, without returning too many unrelated sentences.
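A minimal sketch of how P_\mu can be computed, assuming the exponential distance distribution introduced in the following paragraph (its normalization \gamma_\mu is folded into the ratio); segmentations are represented here simply as lists of segment lengths in sentences:

```python
import math

# Illustrative computation of the probabilistic segmentation metric described above.
# A segmentation is a list of segment lengths (in sentences); delta(i, j) is 1 when
# sentences i and j fall in the same segment. The metric is the expected agreement
# (XNOR) of delta_ref and delta_hyp under a distance distribution over pairs (i, j).

def sentence_labels(seg_lengths):
    labels = []
    for seg_id, length in enumerate(seg_lengths):
        labels.extend([seg_id] * length)
    return labels

def p_metric(ref_lengths, hyp_lengths, mu):
    ref, hyp = sentence_labels(ref_lengths), sentence_labels(hyp_lengths)
    n = len(ref)
    assert n == len(hyp), "both segmentations must cover the same number of sentences"
    num = norm = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.exp(-mu * (j - i))                        # exponential D_mu, unnormalized
            agree = (ref[i] == ref[j]) == (hyp[i] == hyp[j])   # XNOR of the two deltas
            num += d * agree
            norm += d                                          # plays the role of gamma_mu
    return num / norm

# Toy example: reference segments of 3 and 3 sentences; hypothesis is off by one sentence.
print(p_metric([3, 3], [4, 2], mu=1.0 / 3))
```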
If we assume &quot;related&quot; to mean &quot;contained in the same document&quot;, then our error metric judges algorithms based on how often this happens.</Paragraph> <Paragraph position="4"> In practice letting D_\mu be the uniform distribution is unreasonable, since for large corpora most randomly drawn pairs of sentences are in different documents and are correctly identified as such by even the most naive algorithms. We instead adopt a distribution that focuses on small distances. In particular, we choose D_\mu to be an exponential distribution with mean 1/\mu, a parameter that we fix at the approximate mean document length for the domain: D_\mu(i, j) = \gamma_\mu e^{-\mu |i - j|}.</Paragraph> <Paragraph position="5"> In the above, \gamma_\mu is a normalization chosen so that D_\mu is a probability distribution over the range of distances it can accept.</Paragraph> <Paragraph position="6"> There are several sanity checks that validate the use of our metric. The measure is a probability and therefore a real number between 0 and 1. We expect 1 to represent perfection; indeed, an algorithm scores 1 with respect to some data if and only if it predicts the reference segmentation exactly. It captures the notion of nearness in a principled way, gently penalizing algorithms that hypothesize boundaries that aren't quite right, and scaling down with the algorithm's degradation. Furthermore, it is not possible to &quot;cheat&quot; and obtain a high score with this metric: spurious behaviors such as never hypothesizing boundaries and hypothesizing nothing but boundaries are penalized. We refer to Section 7 for sample results on how these trivial algorithms score.</Paragraph> <Paragraph position="7"> One weakness of the metric as we have presented it here is that there is no principled way of specifying the distance distribution D_\mu. We plan to give a more detailed analysis of this problem and present a method for choosing the parameters \mu in a future paper.</Paragraph> </Section> </Section> <Section position="10" start_page="42" end_page="45" type="metho"> <SectionTitle> 7 Experimental Results 7.1 Quantitative results </SectionTitle> <Paragraph position="0"> After feature induction was carried out (as described in Section 5), a simple decision procedure was used for actually placing boundaries: a segment boundary was placed at each position for which the model probability was above a fixed threshold \alpha, with boundaries required to be separated by a minimum number of sentences \epsilon. The threshold and minimum separation were determined on heldout data in order to maximize the probability P_\mu, and turned out to be \alpha = 0.20 and \epsilon = 2 for the WSJ model, and \alpha = 0.14 and \epsilon = 5 for the TDT models.</Paragraph> <Paragraph position="1"> The quantitative results for the WSJ and TDT models are collected in Tables 5 and 6 respectively.</Paragraph> <Paragraph position="2"> For the WSJ model, the probabilistic metric P_\mu was 0.83 when evaluated on 325K words of test data, and the precision and recall for exact matches of boundaries were 56% and 54%, for an F-measure of 55. As a simple baseline we compared this performance to that obtained by four simple default methods for assigning boundaries: choosing boundaries randomly, assigning every possible boundary, assigning no boundaries, and deterministically placing a segment boundary every 1/\mu sentences. It is instructive to compare the values of P_\mu with precision and recall for these default algorithms in order to obtain some intuition for the new error metric.</Paragraph> <Paragraph position="3"> (Table 5 caption, beginning truncated: ... and tested on a similarly sized portion of unseen text. The top 70 features were selected. The mean segment length in the training and test data was 1/\mu = 18 sentences. As a basis of comparison, the figures for several baseline models are given. The figures in the random row were calculated by randomly generating a number of segments equal to the number appearing in the test data. The all and none rows include the figures for models which hypothesize all possible segment boundaries and no boundaries, respectively. The even row shows the results of simply hypothesizing a segment boundary every 18 sentences. Column headers: model, reference segments, hypothesized segments, P_\mu, precision.) (Table 6 caption, beginning truncated: ... data from 1992-1993, not included in the TDT corpus, and the top 100 features were selected. Model B was trained on the first 2M words of the TDT corpus, which is made up of a mix of CNN transcripts and Reuters newswire, and again the top 100 features were selected. The mean document length was 1/\mu = 25 sentences.)</Paragraph> <Paragraph position="4"> Two separate models were built to segment the TDT corpus. The first, which we shall refer to simply as Model A, was trained using two million words from the BN corpus from the 1992-1993 time period. This data contains CNN transcripts, but no Reuters newswire data. Model B was trained on the first two million words of the TDT corpus. Both models were tested on the last 4.3 million words of the TDT corpus. We expect Model A to be inferior to Model B for two reasons: the lack of Reuters data in its training set and the difference of between one and two years in the dates of the stories in the training and test sets. The difference is quantified in Table 6, which shows that P_\mu = 0.82 for Model A while P_\mu = 0.88 for Model B.</Paragraph> <Section position="1" start_page="44" end_page="45" type="sub_section"> <SectionTitle> 7.2 Qualitative results </SectionTitle> <Paragraph position="0"> We now present graphical examples of the segmentation algorithm at work on previously unseen test data. Figure 7 shows the performance of the WSJ segmenter on a typical collection of test data, in blocks of 300 contiguous sentences. In these figures the reference segmentation is shown below the horizontal line as a vertical line at the position between sentences where the article boundary occurred. The decision made by the automatic segmenter is shown as a vertical line above the horizontal line at the appropriate position. The fluctuating curve is the probability assigned by the exponential model constructed using feature induction. (Figure 7 caption, fragment: The lower vertical lines indicate reference segmentations (&quot;truth&quot;). The upper vertical lines are boundaries placed by the algorithm. The fluctuating curve is the probability of a segment boundary according to the exponential model after 70 features were induced.) Notice that in this domain many of the segments are quite short, adding special difficulties for the segmentation problem. Figure 8 shows the performance of the TDT segmenter (Model B) on five randomly chosen blocks of 200 sentences from the TDT test data.</Paragraph> <Paragraph position="1"> We hasten to add that these results were obtained with no smoothing or pruning of any kind, and with no more than 100 features induced from the candidate set of several hundred thousand. (Figure 8 caption: Randomly chosen segmentations of TDT test data, in 200 sentence blocks, using Model B.)
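As a concrete aside on the decision rule of Section 7.1 (a probability threshold \alpha with a minimum separation of \epsilon sentences between boundaries), here is a small sketch; the greedy left-to-right scan is our simplification and may differ in detail from the procedure actually used:

```python
# Sketch of the boundary-placement rule of Section 7.1 (illustrative simplification).
# probs[i] is the model's boundary probability at sentence gap i; a boundary is placed
# where the probability exceeds alpha, subject to a minimum separation of epsilon gaps.

def place_boundaries(probs, alpha, epsilon):
    boundaries = []
    last = -epsilon                       # allow a boundary at the very first position
    for i, p in enumerate(probs):
        if p >= alpha and i - last >= epsilon:
            boundaries.append(i)
            last = i
    return boundaries

# Toy usage with the WSJ settings reported above (alpha = 0.20, epsilon = 2):
print(place_boundaries([0.05, 0.30, 0.25, 0.10, 0.90], alpha=0.20, epsilon=2))  # -> [1, 4]
```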
Unlike many other machine learning methods, feature induction for exponential models is quite robust to overfitting, since the features act in concert to assign probability to events rather than splitting the event space and assigning probability using relative counts. We expect that significantly better results can be obtained by simply training on much more data, and by allowing a more sophisticated set of features.</Paragraph> </Section> </Section> </Paper>