<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1662">
<Title>Sentence Ordering with Manifold-based Classification in Multi-Document Summarization</Title>
<Section position="5" start_page="526" end_page="527" type="metho">
<SectionTitle> 3. Sentence Network Construction </SectionTitle>
<Paragraph position="0"> Suppose S is the set of all sentences in the documents and a summary (a summary sentence may not be a document sentence); let S={s1, s2, ..., sN}, with a distance metric d(si, sj) between two sentences si and sj based on the Jensen-Shannon divergence (Lin, 1991). We construct a graph with the sentences as points by sorting the pairwise distances in ascending order and repeatedly connecting the two points with the next smallest distance until a connected graph is obtained. Then, we assign a weight wi,j, as in (1), to each edge based on the distance.</Paragraph>
<Paragraph position="1"> 1)  wi,j = exp(−d(si, sj)/σ)  The weights are symmetric, wi,i=1, and wi,j=0 for all non-neighbors (σ is set to 0.6 in this work). 2) is the one-step transition probability p(si, sj) from si to sj, obtained by normalizing wi,j over the weights of the neighbors of si.</Paragraph>
<Paragraph position="2"> Let M be the N×N matrix with Mi,j = p(si, sj); then Mt is the t-step Markov random walk matrix, whose (i, j)-th entry is the probability pt(si, sj) of a transition from si to sj after t steps. In this way, each sentence sj is associated with a vector of conditional probabilities pt(si, sj), i=1, ..., N, which forms a new manifold-based representation for sj. With such representations, sentences are close whenever they have similar distributions over the starting points. Notice that the representations depend on the step parameter t (Tishby et al., 2000). With smaller values of t, unlabeled points may not be connected with labeled ones; with larger values of t, the points may become indistinguishable. So, an appropriate t should be estimated.</Paragraph>
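<Paragraph position="3"> The construction above can be summarized in a short sketch. The Python fragment below is an illustration only (the function and variable names are ours, not the paper's; the candidate value t=8 is arbitrary, and only σ=0.6 comes from the text): it builds the edge weights of (1) over the chosen neighbor edges, row-normalizes them into the one-step transition matrix M of (2), and raises M to the power t to obtain the t-step representations.
-------------------------------------------------------
import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence between two word distributions p and q,
    # used as the sentence distance d(si, sj).
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def walk_representation(dists, neighbors, sigma=0.6, t=8):
    # dists: N×N symmetric matrix of d(si, sj);
    # neighbors: N×N boolean matrix marking the edges kept while connecting
    # points in ascending order of distance; t: number of random-walk steps.
    W = np.where(neighbors, np.exp(-dists / sigma), 0.0)   # eq. (1)
    np.fill_diagonal(W, 1.0)                                # wi,i = 1
    M = W / W.sum(axis=1, keepdims=True)                    # eq. (2): one-step p(si, sj)
    Mt = np.linalg.matrix_power(M, t)                       # t-step probabilities pt(si, sj)
    # Column j of Mt is the manifold-based representation of sentence sj.
    return Mt
------------------------------------------------------- </Paragraph>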
</Section>
<Section position="6" start_page="527" end_page="527" type="metho">
<SectionTitle> 4. Sentence Classification </SectionTitle>
<Paragraph position="0"> Suppose s1, s2, ..., sL are the summary sentences and their labels are c1, c2, ..., cL respectively. In our case, each summary sentence is assigned a unique class label ci, 1≤i≤L. This also means that for each class ci there is only one labeled example, i.e., the summary sentence si.</Paragraph>
<Paragraph position="1"> Let S={(s1, c1), (s2, c2), ..., (sL, cL), sL+1, ..., sN}; then the task of sentence classification is to infer the labels of the unlabeled sentences sL+1, ..., sN.</Paragraph>
<Paragraph position="2"> Through the classification, we can get similar sentences for each summary sentence. To do that, we assume that each sentence has a distribution p(ck|si), 1≤k≤L, 1≤i≤N, and these probabilities are to be estimated from the data.</Paragraph>
<Paragraph position="3"> Seeing a sentence as a sample from the t-step Markov random walk on the sentence graph, we have the interpretation of p(ck|si) given in 3): the probability of si belonging to ck depends on the probabilities of belonging to ck of those sentences that will transit to si after t steps, together with their transition probabilities.</Paragraph>
<Paragraph position="4"> With the conditional log-likelihood of the labeled sentences, 4), as the estimation criterion, we can use the EM algorithm to estimate p(ck|si), in which the E-step and M-step are 5) and 6) respectively. p(ci|si) is called the membership probability of si. After classification, each sentence is assigned a label according to 7), i.e., the class with the highest membership probability.</Paragraph>
<Paragraph position="5"> One key problem in this setting is to estimate the parameter t. A possible strategy is cross validation, but it needs a large amount of labeled data. Here, following Szummer et al. (2001), we use the margin between the probabilities of a sentence falling in different classes as the estimation criterion, given in 8):  8)  margin(si) = max_k p(ck|si) − max_{j≠k} p(cj|si)  By maximizing 8), we can get an appropriate value for the parameter t; a better t should make sentences belong to some classes more prominently. Notice that the classes represented by the summary sentences may not cover all the sentences occurring in the documents, so some sentences will belong to the classes without clearly different probabilities. To keep such sentences out of the estimation of t, we only use the top 40% of the sentences in each class, ranked by their membership probabilities.</Paragraph>
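<Paragraph position="6"> As a concrete illustration of this estimation, the sketch below is ours, not the authors' code; since equations 3)-8) are not reproduced above, the exact EM updates and the margin of 8) are reconstructions from the prose. It estimates p(ck|si) from the t-step walk matrix of Section 3 for several candidate values of t and keeps the t with the largest average margin over the most confident 40% of sentences per class.
-------------------------------------------------------
import numpy as np

def em_estimate(Mt, labeled_idx, labels, L, iters=50):
    # EM over per-sentence class parameters q(ck|sj); the posterior used for labeling is
    # p(ck|si) proportional to sum_j Mt[j, i] * q(ck|sj), one reading of the interpretation in 3).
    N = Mt.shape[0]
    q = np.full((N, L), 1.0 / L)
    for _ in range(iters):
        resp = np.zeros((N, L))
        # E-step: responsibility of each sentence sj for the label of each summary sentence
        for idx, c in zip(labeled_idx, labels):
            contrib = Mt[:, idx] * q[:, c]
            resp[:, c] += contrib / max(contrib.sum(), 1e-12)
        # M-step: re-estimate q(ck|sj) from the accumulated responsibilities
        q = (resp + 1e-12) / (resp + 1e-12).sum(axis=1, keepdims=True)
    post = Mt.T @ q
    return post / post.sum(axis=1, keepdims=True)        # p(ck|si), shape N×L

def pick_t(Mt_by_t, labeled_idx, labels, L, top_frac=0.4):
    # Mt_by_t: dict mapping a candidate t to its N×N walk matrix.
    best_t, best_score, best_post = None, -np.inf, None
    for t, Mt in Mt_by_t.items():
        post = em_estimate(Mt, labeled_idx, labels, L)
        top2 = np.sort(post, axis=1)[:, -2:]
        margins = top2[:, 1] - top2[:, 0]                 # margin of 8): best vs. second-best class
        assign = post.argmax(axis=1)                      # label assignment of 7)
        kept = []
        for k in range(L):                                # top 40% of each class by membership prob.
            members = np.where(assign == k)[0]
            members = members[np.argsort(-post[members, k])]
            kept.extend(members[: int(np.ceil(top_frac * len(members)))])
        score = margins[kept].mean() if kept else -np.inf
        if score > best_score:
            best_t, best_score, best_post = t, score, post
    return best_t, best_post
------------------------------------------------------- </Paragraph>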
</Section>
<Section position="7" start_page="527" end_page="529" type="metho">
<SectionTitle> 5. Sentence Ordering </SectionTitle>
<Paragraph position="0"> After sentence classification, we get a class of similar sentences for each summary sentence, which is itself a member of the class. With these sentence classes, we create a directed class graph based on the order of their member sentences in the documents. In the graph, each sentence class is a node, and there is a directed edge ei,j from a node ci to another node cj if and only if some sentence si in ci immediately precedes some sentence sj in cj in the documents (sentences not in any class are ignored). The weight of ei,j, Fi,j, is the frequency of such occurrences. We add one additional node denoting an initial class c0, which links to each class with a directed edge e0,j, whose weight F0,j is the frequency with which member sentences of that class appear at the beginning of the documents.</Paragraph>
<Paragraph position="1"> Suppose the input is the class graph G=<C, E>, where C={c1, c2, ..., cL} is the set of classes, E={ei,j | 1≤i, j≤L} is the set of directed edges, and o is the ordering of the classes. Fig. 2 gives the ordering algorithm.</Paragraph>
<Paragraph position="2"> -------------------------------------------------------
Fig. 2 The ordering algorithm (HO)
i) Select ck, the class ci in C (ci ≠ c0) with the largest weight F0,i;
ii) Append ck to the ordering o;
iii) For all ci in C, F0,i ← F0,i + Fk,i;
iv) Remove ck from C, and ek,j and ei,k from E;
v) Repeat i)-iv) while C ≠ {c0};
vi) Return the ordering o.
------------------------------------------------------- </Paragraph>
<Paragraph position="3"> In the algorithm, there are two main steps. Step i) selects the class whose member sentences occur most frequently immediately after those in c0. Step iii) updates the weights of the edges e0,i; in fact, it can be seen as a merge of the original c0 and ck, and in this sense the updated c0 represents the history of the selections.</Paragraph>
<Paragraph position="4"> In contrast to the MO algorithm, the ordering algorithm here (HO) uses immediate back-front co-occurrence, while the MO algorithm uses relative back-front locations. In addition, the selection of a class in HO depends on the previous selections, while in MO the selection of a class depends mainly on its in-out edge difference.</Paragraph>
<Paragraph position="5"> In contrast to the PO algorithm, the selection of a class in HO depends on all previous selections, while in PO the selection is related only to the most recent one.</Paragraph>
<Paragraph position="6"> As an example, Fig. 3 gives an initial class graph. The output orderings by PO and HO are [c1, c3, c4, c2] and [c1, c3, c2, c4] respectively. The difference lies in whether c4 or c2 is selected after c3: PO selects c4, since it only considers the most recent selection, while HO selects c2, because it considers all previous selections.</Paragraph>
<Paragraph position="7"> In a second example, from sentences 1)-6) we can see some regularity in the order of the classes: c2 and c3 are interchangeable, while c1 always appears behind c2 or c3. From sentence 7), we can see that c2 and c3 still co-occur, while c1 happens to occur at the beginning of the document. Thus, the appropriate ordering should be [c2, c3, c1] or [c3, c2, c1]. Fig. 5 is the graph built by MO for this example.</Paragraph>
<Paragraph position="8"> According to MO, the first node to be selected will be c1, since the difference of its in- and out-edges (+3) is bigger than that of the other two nodes (−2 and −1). After the edges associated with c1 are removed, the in-out edge differences of c2 and c3 are both 0, and either may be selected. Thus, the output ordering will be [c1, c2, c3] or [c1, c3, c2].</Paragraph>
<Paragraph position="9"> Fig. 6 is the class graph built by HO. According to HO, the first node to be selected will be c2 or c3, since F0,2 = F0,3 = 3 > F0,1 = 1. Suppose c2 is selected first; then F0,3 ← F0,3 + F2,3 = 3 + 6 = 9, while F0,1 ← F0,1 + F2,1 = 1 + 2 = 3, so c3 will be selected next. Finally, the output ordering will be [c2, c3, c1]. Similarly, if c3 is selected first, the output ordering will be [c3, c2, c1].</Paragraph>
</Section>
</Paper>
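As a final illustration of the HO procedure in Fig. 2, the sketch below is our own reimplementation, not the authors' code; the edge weights are those read from the Fig. 6 example in the text, and weights not stated in the prose are omitted.
-------------------------------------------------------
def ho_order(F, L):
    # F: dict mapping (i, k) to the weight F_{i,k}, with node 0 standing for the initial class c0.
    # L: number of classes c1..cL. Returns the ordering o as a list of class indices.
    C = set(range(1, L + 1))
    o = []
    while C:                                             # i.e., while C ≠ {c0}
        k = max(C, key=lambda i: F.get((0, i), 0))       # i) class occurring most often right after c0
        o.append(k)                                      # ii) append ck to the ordering
        for i in C:                                      # iii) F_{0,i} ← F_{0,i} + F_{k,i}
            F[(0, i)] = F.get((0, i), 0) + F.get((k, i), 0)
        C.remove(k)                                      # iv) remove ck; its edges are ignored afterwards
    return o                                             # vi) return the ordering

# Fig. 6 example as described in the text: F0,1=1, F0,2=F0,3=3, F2,3=6, F2,1=2.
F = {(0, 1): 1, (0, 2): 3, (0, 3): 3, (2, 3): 6, (2, 1): 2}
print(ho_order(F, 3))   # ties are broken arbitrarily; choosing c2 first yields [2, 3, 1], i.e. [c2, c3, c1]
-------------------------------------------------------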