File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-0506_metho.xml
Size: 4,954 bytes
Last Modified: 2025-10-06 14:08:28
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0506"> <Title>A Study for Documents Summarization based on Personal Annotation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Annotation based summarization </SectionTitle> <Paragraph position="0"> An annotation is defined as a body of words marked in the text. It may be any word, phrase, or sentence that readers find interesting or important. When we say &quot;annotation&quot;, we mean both its position and its content, that is, where it is located and what text it contains.</Paragraph> <Paragraph position="1"> Since users may annotate only part of the important information, or the annotations may be incomplete, in addition to the annotations themselves we also need to consider &quot;context&quot; as a supplement to what users are interested in. For a particular annotation, context is defined as the text surrounding the annotation.</Paragraph> <Paragraph position="2"> Since annotations contain a set of keywords, their significance can be identified by comparison with the other words of the text. For a given document, we first extract the user's annotations and their contexts, and construct a new keyword set together with the original keywords in the text, in which annotations and contexts are given higher scores than other keywords. Then we weight sentences according to the keywords they contain and produce a summary by selecting the highest-weighted ones.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Annotations &amp; Context Extraction </SectionTitle> <Paragraph position="0"> A set of sentences is extracted from a given document.</Paragraph> <Paragraph position="1"> For each annotation, we identify the sentence in which it is located and take that sentence as the context of the annotation. Then keywords are extracted from the annotations and their contexts. 
An annotation may span several sentences, a sentence may include several annotations, and an annotation may contain several keywords. For each sentence, we simplify the keyword extraction problem to identifying the annotations it contains. Annotated sentences are defined as those that contain annotations.</Paragraph> <Paragraph position="2"> The keywords occurring in annotations are called annotated keywords (F1). Keywords occurring in annotated sentences are called context keywords (F2). The frequencies f of keywords that occur repeatedly in F1 or F2 are accumulated. Clearly, F1 is a subset of F2. Thus we get two keyword vector sets:</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Keywords Extraction </SectionTitle> <Paragraph position="0"> Content words in the document are stemmed with Porter's algorithm (Porter, 1980). Content keywords are words whose frequencies exceed a certain threshold and that do not occur in the stop-word list.</Paragraph> <Paragraph position="1"> Word frequencies are calculated by the tf*idf method (Salton and Buckley, 1988). After applying word-occurrence statistics to the full text, we get the vector</Paragraph> <Paragraph position="3"> Text keywords are those that either occur in F2 or whose frequency satisfies a given threshold a ( a></Paragraph> <Paragraph position="5"> Annotated words take priority regardless of their original frequency: since users may be interested in rare or &quot;unknown&quot; keywords in the document, such words should not be excluded from the text keywords.</Paragraph> <Paragraph position="6"> Text keywords (F0):</Paragraph> <Paragraph position="8"> Clearly, both F1 and F2 are subsets of F0.</Paragraph> <Paragraph position="9"> Next, to apply the emphasis of annotations and contexts to summarization, F0, F1 and F2 are combined. 
Since keywords in different sets may influence summarization differently, parameters are used to balance their effects.</Paragraph> <Paragraph position="10"> Final keywords (F): F = F0 + b F1 + g F2;</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="1" type="metho"> <SectionTitle> F = (w_i, f_i), i = 1, ..., n, n = size(F) </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> b is the annotation weight (b >= 0) and g is the context weight (g >= 0). b = 0 means that annotations are not considered; g = 0 means that context is not considered.</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 3.3 Sentence Extraction </SectionTitle> <Paragraph position="0"> Sentences are weighted according to the keywords they contain: |S| is the length of a sentence, that is, the number of keywords it contains. Sentences are ranked by their weights, and the top-scoring sentences are selected as important ones and composed into a summary in their original order.</Paragraph> </Section> </Section> </Paper>
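<!--
As a rough illustration of Sections 3.2 and 3.3, the combination F = F0 + b F1 + g F2 and the ranking of sentences might be sketched in Python as below. This is only a sketch under stated assumptions, not the authors' implementation. The function names (combine_keywords, sentence_weight, summarize) and the dictionary representation of the keyword vectors are hypothetical. The sentence-weight formula is not recoverable from the text, so the sketch assumes it is the sum of the weights of the keywords in a sentence divided by |S|. Sentences are assumed to be already tokenized and stemmed.

```python
from collections import Counter


def combine_keywords(text_kw, annotated_kw, context_kw,
                     annotation_weight=1.0, context_weight=1.0):
    """Combine keyword vectors as F = F0 + b*F1 + g*F2.

    Each vector is a dict mapping a keyword to its frequency (assumed
    representation). annotation_weight plays the role of b and
    context_weight the role of g.
    """
    combined = Counter()
    for vector, factor in ((text_kw, 1.0),
                           (annotated_kw, annotation_weight),
                           (context_kw, context_weight)):
        for word, freq in vector.items():
            combined[word] += factor * freq
    return dict(combined)


def sentence_weight(sentence_words, keyword_weights):
    """ASSUMED formula: sum of keyword weights divided by |S|,
    where |S| is the number of keywords found in the sentence."""
    hits = [keyword_weights[w] for w in sentence_words if w in keyword_weights]
    return sum(hits) / len(hits) if hits else 0.0


def summarize(sentences, keyword_weights, top_k):
    """Pick the top_k highest-weighted sentences, then return them
    in their original document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_weight(sentences[i], keyword_weights),
                    reverse=True)
    chosen = sorted(ranked[:top_k])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    # Toy vectors: F1 is a subset of F2, and both are subsets of F0.
    f0 = {"summary": 2.0, "annotation": 3.0, "user": 1.0}
    f1 = {"annotation": 1.0}
    f2 = {"annotation": 1.0, "user": 1.0}
    final = combine_keywords(f0, f1, f2, annotation_weight=2.0, context_weight=1.0)
    doc = [["user", "read", "text"],
           ["annotation", "mark", "summary"],
           ["weather", "nice"],
           ["annotation", "user"]]
    print(final)
    print(summarize(doc, final, top_k=2))
```

Setting annotation_weight and context_weight to 0 reduces the sketch to plain frequency-based extraction over F0, which matches the description of b = 0 and g = 0 in the text.
-->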