<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1018">
  <Title>Chinese Text Summarization Based on Thematic Area Detection</Title>
  <Section position="4" start_page="0" end_page="4" type="metho">
    <SectionTitle>
3 The Algorithm
</SectionTitle>
    <Paragraph position="0"> In this section, the proposed method will be introduced in detail. The method consists of the following three main stages: Stage 1: Find the different thematic areas in the document through paragraph clustering and clustering analysis.</Paragraph>
    <Paragraph position="1"> Stage 2: Select the most suitable sentence from each thematic area as the representative one.</Paragraph>
    <Paragraph position="2"> Stage 3: Make the representative sentences form the final summary according to certain requirements.</Paragraph>
    <Section position="1" start_page="0" end_page="4" type="sub_section">
      <SectionTitle>
3.1 Stage 1: Thematic Area Detection
</SectionTitle>
      <Paragraph position="0"> The process of thematic area detection is displayed in  Different from the general word segmentation operation adopted in the traditional Chinese automatic summarization research, we do not take the general operation when pre-processing the original  document, but make use of the method introduced by (Liu et al., 2003) to extract terms from the document and then express its content by such metadata elements as terms.</Paragraph>
      <Paragraph position="1"> The greatest advantage of term extraction technology is that it needs no support of fixed thesaurus, only through the continuous updating and making statistics of a real corpus. We can dynamically establish and update a term bank and improve the extraction quality through continuous correcting of the parameters for extraction. Thus it is of wide practical prospects for natural language processing. In addition, the terms can represent a relative specific meaning, because most of them are phrases, which consist of multi-characters.  The advantage of the vector space model (VSM) is that it successfully makes the unstructured documents structured which makes it possible to handle the massive real documents by adopting the existing mathematical instruments. All the terms extracted from the document are considered as the features of a vector, while the values of the features are statistics of the terms. According to this, we can set up the VSM of paragraphs, that is each paragraph Pi (i:1~M,M is the number of all paragraphs in a document) is represented as the vector of weights of terms, VPi, VPi =</Paragraph>
      <Paragraph position="3"> Where N is the total number of terms, WPij denotes the weight of the j-th term in the i-th paragraph. There are many methods of calculating WPij, such as tf, tf*idf, mutual information (Patrick Pantel and Lin, 2002), etc. The method adopted here (Gong and Liu, 2001) is shown as follows:</Paragraph>
      <Paragraph position="5"> Where TF(Tij) denotes the number of occurrence of the j-th term in the i-th paragraph, M/Mj denotes the inverse paragraph frequency of term j, and Mj denotes the number of paragraphs in which term j occurs. In accordance, on the basis of defining WPij, we can further define the weight of paragraph Pi, W(P i), by the follwing formula: (2) In formula (2), n represents the total number of different terms occurring in the i-th paragraph.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.1.3 Step 3: Paragraph Clustering and
Clustering Analysis
1) Paragraph clustering
</SectionTitle>
      <Paragraph position="0"> The existing clustering algorithms can be categorized as hierarchical (e.g. agglomerative etc) and partitional (e.g. K-means, K-medoids, etc) (Pantel and Lin, 2002).</Paragraph>
      <Paragraph position="1"> The complexity of the hierarchical clustering algorithm is O(n2Log(n)) , where n is the number of elements to be clustered, which is usually greater than that of the partitional method. For example, the complexity of K-means is linear in n. So in order to achieve high efficiency of algorithm, we choose the latter to cluster paragraphs.</Paragraph>
      <Paragraph position="2"> K-means clustering algorithm is a fine choice in many circumstances, because it is simple and effective. But in the process of clustering by means of K-means, the quality of clustering is greatly affected by the elements that marginally belong to the cluster, and the centroid can't represent the real element in the cluster, So while choosing the paragraphs clustering algorithm, we adopt K-medoids (Kaufmann and Rousseeuw, 1987; Moens et al. 1999) which is less sensitive to the effect of marginal elements than K-means.</Paragraph>
      <Paragraph position="3"> Suppose that every sample point in the N-dimensional sample space respectively represent a paragraph vector, and the clustering of paragraphs can be visualized as that of the M sample points in the sample space. Here N is the number of terms in the document and M is the number of paragraphs.</Paragraph>
      <Paragraph position="4"> Table 1 shows the formal description of the paragraph clustering process based on K-medoids method.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="4" end_page="4" type="metho">
    <SectionTitle>
2) Clustering analysis
</SectionTitle>
    <Paragraph position="0"> A classical problem when adopting K-medoid clustering method and many other clustering methods is the determination of K, the number of clusters. In traditional K-medoid method, K must be offered by the user in advance. In many cases, it's impractical. As to clustering of paragraphs, customers can't predict the latent thematic number in the document, so it's impossible to offer K correctly.</Paragraph>
    <Paragraph position="1"> In view of the problem, the authors put forward a new clustering analysis method to automatically determine the value of K according to the distribution of values of the self-defined objective function. The basic idea is that if K, the number of clusters, is determined with each value of K, and Input: &lt;a, b&gt;, they respectively denote the paragraph matrix composed by all the paragraph vectors in the document and the number of clusters, k (the range of k is set to 2~M).</Paragraph>
    <Paragraph position="2"> Step 1: randomly select k paragraph vectors as the initial medoids of the clusters (here, the medoids denote the representative paragraphs of k clusters).</Paragraph>
    <Paragraph position="3"> Step 2: assign each paragraph vector to a cluster according to the medoid X closest to it.</Paragraph>
    <Paragraph position="4"> Step 3: calculate the Euclidean distance between all the paragraph vectors and their closest medoids.</Paragraph>
    <Paragraph position="5"> Step 4: randomly select a paragraph vector Y.</Paragraph>
    <Paragraph position="6"> Step 5: to all the X, if it can reduce the Euclidean distance between all the paragraph vectors and their closest medoids by interchanging X and Y, then change their positions, otherwise keep as the original.</Paragraph>
    <Paragraph position="7"> Step 6: repeat from step 2 to 5 until no changes take place.</Paragraph>
    <Paragraph position="8"> Output: &lt;A, B, C&gt;, they respectively denote the cluster id, the representative paragraph vector and all the paragraph vectors of each cluster under the k clusters.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
K-medoid method
</SectionTitle>
      <Paragraph position="0"> suitably, then the corresponding clustering results can well distinguish the different themes in the document, and correspondingly the average of the sum of the weight of the representative paragraph under each theme will tend to maximize. We call this the maximum property of the objective function.</Paragraph>
      <Paragraph position="1"> Correspondingly, we define the following objective function Objf(K) to reflect clustering quality and determine the number of clusters, K.</Paragraph>
      <Paragraph position="3"> Where W(Pj) denotes the weight of the selected representative paragraph in the j-th cluster, here the selected representative paragraph Pj can be regarded as the medoid in the j-th cluster which is determined by the final output of the presented K-medoid paragraph clustering process, and the weight of Pj is calculated by formula (2). Put the objective function in K clustering results corresponding then make good use of the maximum property of the objective function to adaptively determine the final number of clusters, K.</Paragraph>
      <Paragraph position="4"> Figure 2 shows the concrete distribution of the values of objective function obtained in the example document &amp;quot;On the Situation and Measures That Face Fishing in the Sea in Da Lian City&amp;quot; when adopting the proposed clustering analysis method. According to the maximum property of objective function, that is take the value of K when the values of the objective function take maximum as the final number of clusters. From the results in Figure 2, we can know that K equals to six, that is we find six latent thematic areas from nine paragraphs in the document with this method.</Paragraph>
      <Paragraph position="5">  Output the complete information table of each thematic area in the form of the representative paragraph and all the paragraphs and sentences covered by the thematic area.</Paragraph>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.2 Stage 2: Selection of the Thematic
Representative Sentences
</SectionTitle>
      <Paragraph position="0"> To select a most suitable representative sentence from each thematic area, the author proposes the following method. This is in contrast with a method proposed by Radev (Radev et al., 2000 ), where the centroid of a cluster is selected as the representative one.</Paragraph>
      <Paragraph position="1"> Method: select the sentence which is most similar to the thematic area semantically as representative one.</Paragraph>
      <Paragraph position="2"> Before carrying out the method in detail, there are two problems to be solved: 1) The vector representation of sentence and thematic area The vector representation of sentence and thematic area is similar to that of paragraph introduced before. We only need to change the weight calculation field of the terms from the interior of paragraph to the interior of sentence or thematic area. Accordingly, we can describe the sentence ve ctor and thematic area vector as  sentence and thematic area The calculation of semantic similarity of sentence and thematic area can be achieved by calculating the vector distance between sentence vector and thematic area vector. Here we adopt the traditional cosine method for vector distance calculation. Correspondingly, the distance between the sentence vector VSj and the thematic area vector VAk is calculated by the following  At the premise of the same number of summarization sentences selected out by different summarization methods: The higher the value of RE calculated by the covariance matrix of the summarization sentence vectors.</Paragraph>
      <Paragraph position="3"> The lower the summarization redundancy.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.3 Stage 3: The Creation of the Summary
</SectionTitle>
      <Paragraph position="0"> Ouput the selected representative sentences from each thematic area according to their postions in the original document to form the final summary.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>