<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1072">
  <Title>The Automated Acquisition of Topic Signatures for Text Summarization</Title>
  <Section position="7" start_page="497" end_page="499" type="evalu">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In order to assess the utility of topic signatures in text summarization, we follow the procedure described at the end of Section 4.1 to create a topic signature for each selected TREC topic. Documents are separated into relevant and nonrelevant sets according to their TREC relevancy judgments for each topic. We then run each document through a part-of-speech tagger and convert each word into its root form based on the WordNet lexical database.</Paragraph>
    <Paragraph position="1"> We also collect individual root word (unigram) frequency, two consecutive non-stopwords9 (bigram) frequency, and three consecutive non-stopwords (trigram) frequency to facilitate the computation of the -2 log λ value for each term. We expect high-ranking bigram and trigram signature terms to be very informative. We set the cutoff weight at 10.83 with confidence level α = 0.001 by looking up a χ2 statistical table.</Paragraph>
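As a rough illustration, the -2 log λ statistic above can be computed as a likelihood ratio between one shared binomial rate (term equally likely in both document sets) and two independent rates. The sketch below assumes simple token counts; the function and variable names are illustrative, not from the paper:

```python
import math

def loglike(k, n, p):
    # Log binomial likelihood p^k (1-p)^(n-k), dropping the constant C(n, k),
    # which cancels in the ratio below.
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def neg2_log_lambda(k1, n1, k2, n2):
    """-2 log(lambda) for a term seen k1 times among n1 tokens of the
    relevant set and k2 times among n2 tokens of the nonrelevant set,
    assuming 0 < k1 < n1 and 0 < k2 < n2."""
    p = (k1 + k2) / (n1 + n2)        # H1: one shared rate
    p1, p2 = k1 / n1, k2 / n2        # H2: independent rates
    null = loglike(k1, n1, p) + loglike(k2, n2, p)
    alt = loglike(k1, n1, p1) + loglike(k2, n2, p2)
    return -2.0 * (null - alt)

# A term much more frequent in the relevant set exceeds the chi-square
# cutoff 10.83 (alpha = 0.001, one degree of freedom).
score = neg2_log_lambda(180, 10000, 20, 10000)
```

A term distributed identically in both sets scores near zero and is filtered out by the 10.83 cutoff.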
    <Paragraph position="2"> Table 2 shows the top 10 unigram, bigram, and trigram topic signature terms for each topic10. Several conclusions can be drawn directly. Terms with high -2 log λ are indeed good indicators for their corresponding topics. The -2 log λ values decrease as the number of words in a term increases. This is reasonable, since longer terms usually occur less often than their constituents. However, bigram terms are more informative than unigram terms, as we can observe: jail/prison overcrowding for topic 151, tobacco industry for topic 257, computer security for topic 258, and solar energy/power for topic 271. These automatically generated signature terms closely resemble or equal the given short TREC topic descriptions.</Paragraph>
    <Paragraph position="3"> Although the trigram terms shown in the table, such as federal court order, philip morris inc, jet propulsion laboratory, and mobile telephone system, are also meaningful, they do not demonstrate the closer term relationship among other terms in their respective topics that is seen in the bigram cases. We expect that more training data can improve the situation.</Paragraph>
    <Paragraph position="5"> We notice that the -2 log λ values for topic 258 are higher than those of the other three topics. As indicated by (Mani et al., 1998), the majority of relevant documents for topic 258 have the query topic as their main theme, while the others mostly have the query topics as their subsidiary themes. This implies that it is too liberal to assume all the terms in relevant documents of the other three topics are relevant. We plan to apply text segmentation algorithms such as TextTiling (Hearst, 1997) to segment documents into subtopic units. We will then perform the topic signature creation procedure only on the relevant units to prevent inclusion of noise terms.</Paragraph>
    <Paragraph position="6"> 9We use the stopword list supplied with the SMART retrieval system.</Paragraph>
    <Paragraph position="7"> 10The -2 log λ values are not comparable across ngram categories, since each ngram category has its own sample space. We compare the topic signature module, baseline module, and tfidf modules with human annotated model summaries. We measure the performance using a combined measure of recall (R) and precision (P), the F-score, defined by: F = (1 + β2)PR / (β2P + R). We assume equal importance of recall and precision and set β to 1 in our experiments. The baseline (position) module scores each sentence by its position in the text. The first sentence gets the highest score, the last sentence the lowest. The baseline method is expected to be effective for the news genre.</Paragraph>
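The F-score combination above can be written directly as a small helper (names are illustrative):

```python
def f_score(precision, recall, beta=1.0):
    """Combined F-measure F = (1 + beta^2) P R / (beta^2 P + R);
    beta = 1 weighs recall and precision equally, as in the
    experiments described above."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

f_score(0.5, 0.5)   # equals 0.5 when P = R and beta = 1
```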
    <Paragraph position="8"> The tfidf module assigns a score to a term ti according to the product of its frequency within a document j (tfij) and its inverse document frequency (idfi = log(N/dfi)). N is the total number of documents in the corpus and dfi is the number of documents containing term ti.</Paragraph>
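The tf.idf weight just defined is a one-line computation; this sketch assumes raw counts are already available (names are illustrative):

```python
import math

def tfidf(term_freq_in_doc, num_docs, doc_freq):
    """tf.idf weight of term t_i in document j: tf_ij * log(N / df_i),
    following the definition above. Assumes doc_freq >= 1."""
    return term_freq_in_doc * math.log(num_docs / doc_freq)

# A term occurring 3 times in a document, in 10 of 1000 corpus documents:
weight = tfidf(3, 1000, 10)
```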
    <Paragraph position="9"> The topic signature module scans each sentence, assigning to each word that occurs in a topic signature the weight of that keyword in the topic signature. Each sentence then receives a topic signature score equal to the total of all signature word scores it contains, normalized by the highest sentence score. This score indicates the relevance of the sentence to the signature topic.</Paragraph>
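A minimal sketch of this sentence-scoring step, assuming sentences are already tokenized and the signature is a term-to-weight mapping (names are illustrative, not from the paper's implementation):

```python
def signature_scores(sentences, signature):
    """Score each tokenized sentence by summing the signature weights
    of the words it contains, then normalize by the highest raw
    sentence score, as described above."""
    raw = [sum(signature.get(w, 0.0) for w in sent) for sent in sentences]
    top = max(raw) or 1.0   # avoid division by zero if nothing matches
    return [s / top for s in raw]
```

The best-matching sentence always receives a normalized score of 1.0, and sentences containing no signature terms receive 0.0.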
    <Paragraph position="10"> SUMMARIST produced extracts of the same texts separately for each module, for a series of extracts ranging from 0% to 100% of the original text. Although many relevant documents are available for each topic, only some of them have answer key markups. The number of documents with answer keys is listed in the row labeled "# of Relevant Docs Used in Training". To ensure we utilize all the available data and conduct a sound evaluation, we perform a three-fold cross validation. We reserve one-third of the documents as the test set, use the rest as the training set, and repeat three times with non-overlapping test sets. Furthermore, we use only unigram topic signatures for evaluation.</Paragraph>
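The three-fold split with non-overlapping test sets can be sketched as follows (a simple sequential partition; the actual assignment of documents to folds is an assumption):

```python
def three_fold_splits(docs):
    """Yield three (train, test) pairs where each test set is a
    distinct one-third of the documents and the remaining two-thirds
    form the training set, mirroring the cross validation above."""
    n = len(docs)
    fold = n // 3
    bounds = [0, fold, 2 * fold, n]
    for i in range(3):
        lo, hi = bounds[i], bounds[i + 1]
        yield docs[:lo] + docs[hi:], docs[lo:hi]
```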
    <Paragraph position="11"> The results are shown in Figure 2 and Table 3. We find that the topic signature method outperforms the other two methods and that the tfidf method performs poorly. Among the 40 possible test points for four topics with 10% summary length increments (0% means select at least one sentence) shown in Table 3, the topic signature method beats the baseline method 34 times. This result is encouraging and indicates that the topic signature method is a worthy addition to the variety of text summarization methods.</Paragraph>
    <Section position="1" start_page="499" end_page="499" type="sub_section">
      <SectionTitle>
6.2 Enriching Topic Signatures Using
Existing Ontologies
</SectionTitle>
      <Paragraph position="0"> We have shown in the previous sections that topic signatures can be used to approximate topic identification at the lexical level. Although the automatically acquired signature terms for a specific topic seem to be bound by unknown relationships as shown in Table 2, it is hard to imagine how we can enrich the inherent flat structure of topic signatures as defined in Equation 1 to a construct as complex as a MUC template or script.</Paragraph>
      <Paragraph position="1"> As discussed in (Agirre et al., 2000), we propose using an existing ontology such as SENSUS (Knight and Luk, 1994) to identify signature term relations.</Paragraph>
      <Paragraph position="2"> The external hierarchical framework can be used to generalize topic signatures and suggest richer representations for them. Automated entity recognizers can be used to classify unknown entities into their appropriate SENSUS concept nodes.</Paragraph>
      <Paragraph position="3"> We are also investigating other approaches to automatically learn signature term relations. The idea mentioned in this paper is just a starting point.</Paragraph>
    </Section>
  </Section>
</Paper>