<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2914">
  <Title>Word Distributions for Thematic Segmentation in a Support Vector Machine Approach</Title>
  <Section position="7" start_page="101" end_page="103" type="metho">
    <SectionTitle>
3 Support Vector Learning Task and Thematic Segmentation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="101" end_page="103" type="sub_section">
      <SectionTitle>
Thematic Segmentation
</SectionTitle>
      <Paragraph position="0"> The theory of Vapnik and Chervonenkis (Vapnik, 1995) motivated the introduction of support vector learning. SVMs have originally been used for classification purposes and their principles have been extended to the task of regression, clustering and feature selection. (Kauchak and Chen, 2005) employed SVMs using features (derived for instance from information given by the presence of paragraphs, pronouns, numbers) that can be reliably used for topic  segmentation of narrative documents. Aside from the fact that we consider the TS task on different datasets (not only on narrative documents), our approach is different from the approach proposed by (Kauchak and Chen, 2005) mainly by the data representation we propose and by the fact that we put the emphasis on deriving the thematic structure merely from word distribution, while (Kauchak and Chen, 2005) observed that the 'block similarities provide little information about the actual segment boundaries' on their data and therefore they concentrated on exploiting other features.</Paragraph>
      <Paragraph position="1"> An excellent general introduction to SVMs and other kernel methods is given for instance in (Cristianini and Shawe-Taylor, 2000). In the section below, we give some highlights representing the main elements in using SVMs for thematic segmentation.</Paragraph>
      <Paragraph position="2"> The support vector learner L is given a training set of n examples, usually denoted by Strain= ((vectoru1, y1),...,(vectorun, yn))[?] (U x Y )n drawn independently and identically distributed according to a fixed distribution Pr(u,y) = Pr(y|u)Pr(u). Each training example consists of a high-dimensional vector vectoru describing an utterance and the class label y. The utterance representations we chose are further described in Section 4. The class label y has only two possible values: 'thematic boundary' or 'nonthematic boundary'. For notational convenience, we replace these values by +1 and -1 respectively, and thus we have y [?] {-1, 1}. Given a hypothesis space H, of functions h : U - {[?]1,+1} having the form h(vectoru) = sign(&lt; vectorw,vectoru &gt; +b), the inductive support vector learner Lind seeks a decision function hind from H, using Strain so that the expected number of erroneous predictions is minimized. Using the structural risk minimization principle (Vapnik, 1995), the support vector learner gets the optimal decision function h by minimizing the following cost function:</Paragraph>
      <Paragraph position="4"> xi [?] 0 for i = 1,2,...,n.</Paragraph>
      <Paragraph position="5"> The parameters vectorw and b follow from the optimisation problem, which is solved by applying Lagrangian theory. The so-called slack variables xi, are introduced in order to be able to handle non-separable data. The positive parameters C+ and C[?] are called regularization parameters and determine the amount up to which errors are tolerated. More exactly, training data may contain noisy or outlier data that are not representative of the underlying distribution. On the one hand, fitting exactly to the training data may lead to overfitting. On the other hand, dismissing true properties of the data as sampling bias in the training data will result in low accuracy. Therefore, the regularization parameter is used to balance the trade-off between these two competing considerations. Setting the regularization parameter too low can result in poor accuracy, while setting it too high can lead to overfitting. In the TS task, we used an automated procedure to select the regularization parameters, as further described in section 5.3.</Paragraph>
      <Paragraph position="6"> In cases where non-linear hypothesis functions should be optimised, each vectorui can be mapped into ph(vectorui) [?] F, where F is a higher dimensional space usually called feature space, in order to make linear the relation between vectorui and yi. Thus the original linear learning machine can be adopted in finding the classification solution in the feature space.</Paragraph>
      <Paragraph position="7"> When using a mapping function ph : U - F, if we have a way of computing the inner product &lt;ph(vectorui),ph(vectoruj)&gt; directly as a function of the original input point, then the so-called kernel function K(vectorui,vectoruj) = &lt;ph(vectorui),ph(vectoruj)&gt; is proved to simplify the computational complexity implied by the direct use of the mapping function ph. The choice of appropriate kernels and its specific parameters is an empirical issue. In our experiments, we used the Gaussian radial basis function (RBF) kernel:</Paragraph>
      <Paragraph position="9"> For the SVM calculations, we used the LIBSVM library (Chang and Lin, 2001).</Paragraph>
      <Paragraph position="10"> 4 Representation of the information used to determine thematic boundaries As presented in section 3, in the thematic segmentation task, an input vectorui to the support vector classifier is a vectorial representation of the utterance to  be classified and its context. Each dimension of the input vector indicates the value of a certain feature characterizing the utterance. All input features here are indicator functions for a word occurring within a fixed-size window centered on the utterance being labeled. More exactly, the input features are computed in the following steps:  1. The text has been pre-processed by tokenization, elimination of stop-words and lemmatization, using TreeTagger (Schmid, 1996). 2. We make use of the so-called bag of words ap- null proach, by mapping each utterance to a bag, i.e. a set that contains word frequencies. Therefore, word frequencies have been computed to count the number of times that each term (i.e. word lemma) is used in each utterance. Then a transformation of the raw word frequency counts is applied in order to take into account both the local (i.e. for each utterance) word frequencies as well as the overall frequencies of their occurrences in the entire text collection. More exactly, we made experiments in parallel with three such transformations, which are very commonly used in information retrieval domain (Dumais, 1991): tf.idf, tf.normal and log.entropy.</Paragraph>
      <Paragraph position="11"> 3. Each i-th utterance is represented by a vector vectorui, where a j-th element of vectorui is computed as:</Paragraph>
      <Paragraph position="13"> where winSize [?] 1 and fi,j is the weighted frequency (determined in the previous step) of the j-th word from the vocabulary in the i-th utterance. In this manner, we will have ui,j &gt; 0 if and only if at least two occurrences of the j-th term occur within (2 * winSize) utterances on opposite sides of a boundary candidate. That is, each ui,j is capturing how many word co-occurrences appear across the candidate utterance in an interval (of (2*winSize) utterances) centered in the boundary candidate utterance.</Paragraph>
      <Paragraph position="14"> 4. Each attribute value from the input data is scaled to the interval [0,1].</Paragraph>
      <Paragraph position="15"> Note that the vector space representation adopted in the previous steps will result in a sparse high dimensional input data for our system. More exactly, table 1 shows the average number of non-zero features per example corresponding to each data set (further described in section 5.1).</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="103" end_page="104" type="metho">
    <SectionTitle>
5 Experimental Setup
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="103" end_page="104" type="sub_section">
      <SectionTitle>
5.1 Data sets used
</SectionTitle>
      <Paragraph position="0"> In order to evaluate how robust our SVM approach is, we performed experiments on three English data sets of approximately the same dimension (i.e. containing about 260,000 words).</Paragraph>
      <Paragraph position="1"> The first dataset is a subset of the ICSI-MR corpus (Janin et al., 2004), where the gold standard for thematic segmentations has been provided by taking into account the agreement of at least three human annotators (Galley et al., 2003). The corpus consists of high-quality close talking microphone recordings of multi-party dialogues. Transcriptions at word level with utterance-level segmentations are also available. A test sample from this dataset consists of the transcription of an approximately one-hour long meeting and contains an average of about seven thematic episodes.</Paragraph>
      <Paragraph position="2"> The second data set contains documents randomly selected from the Topic Detection and Tracking (TDT) 2 collection, made available by (LDC, 2006).</Paragraph>
      <Paragraph position="3"> The TDT collection includes broadcast news and newswire text, which are segmented into topically cohesive stories. We use the story segmentation provided with the corpus as our gold standard labeling. A test sample from our subset contains an average of about 24 segments.</Paragraph>
      <Paragraph position="4"> The third dataset we use in this study was originally proposed in (Choi, 2000) and contains artificial thematic episodes. More precisely, the dataset is built by concatenating short pieces of texts that  have been randomly extracted from the Brown corpus. Any test sample from this dataset consists of ten segments. Each segment contains at least three sentences and no more than eleven sentences.</Paragraph>
      <Paragraph position="5"> While the focus of our paper is not on the method of evaluation, it is worth pointing out that the performance on the synthetic data set is a very poor guide to the performance on naturally occurring data (Georgescul et al., 2006). We include the synthetic data for comparison purposes.</Paragraph>
    </Section>
    <Section position="2" start_page="104" end_page="104" type="sub_section">
      <SectionTitle>
5.2 Handling unbalanced data
</SectionTitle>
      <Paragraph position="0"> We have a small percentage of positive examples relative to the total number of training examples.</Paragraph>
      <Paragraph position="1"> Therefore, in order to ensure that positive points are not considered as being noisy labels, we change the penalty of the minority (positive) class by setting the parameter C+ of this class to:</Paragraph>
      <Paragraph position="3"> where n+ is the number of positive training examples, n is the total number of training examples and l is the scaling factor. In the experiments reported here, we set the value for the scale factor l to l = 1 and we have: C+ = 7 * C[?] for the synthetic data derived from Brown corpus; C+ = 18 * C[?]for the TDT data and C+ = 62 * C[?] for the ICSI meeting data.</Paragraph>
    </Section>
    <Section position="3" start_page="104" end_page="104" type="sub_section">
      <SectionTitle>
5.3 Model selection
</SectionTitle>
      <Paragraph position="0"> We used 80% of each dataset to determine the best model settings, while the remaining 20% is used for testing purposes. Each training set (for each dataset employed) was divided into disjoint subsets and five-fold cross-validation was applied for model selection.</Paragraph>
      <Paragraph position="1"> In order to avoid too many combinations of parameter settings, model selection is done in two phases, by distinguishing two kinds of parameters. First, the parameters involved in data representation (see section 4) are addressed. We start with choosing an appropriate term weighting scheme and a good value for the winSize parameter. This choice is based on a systematic grid search over 20 different values for winSize and the three variants tf.idf, tf.normal and log.entropy for term weighting. We ran five-fold cross validation, by using the RBF kernel with its parameter g fixed to g = 1. We also set the regularization parameter C equal to C = 1.</Paragraph>
      <Paragraph position="2"> In the second phase of model selection, we take the optimal parameter values selected in the previous phase as a constant factor and search the most appropriate values for C and g parameters. The range of values we select from is: C [?] braceleftbig10[?]3,10[?]2,10[?]1,1,10,102,103bracerightbig and g [?]braceleftbig 2[?]6,2[?]5,2[?]4,...,24,26bracerightbig and for each possible value we perform five-fold cross validation. Therefore, we ran the algorithm five times for the 91 = 7 x 13 parameter settings. The most suitable model settings found are shown in Table 2. For these settings, we show the algorithm's results in section 6.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>