<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1615"> <Title>FarsiSum - A Persian text summarizer</Title> <Section position="4" start_page="2" end_page="4" type="metho"> <SectionTitle> 3 FarsiSum </SectionTitle> <Paragraph position="0"> FarsiSum is a web-based text summarizer for Persian based on SweSum. It summarizes Persian newspaper text/HTML in Unicode format.</Paragraph> <Paragraph position="1"> FarsiSum uses the same structure as SweSum (see Figure 2), with the exception of the lexicons, but some modifications have been made to SweSum to support Persian texts in Unicode format.</Paragraph> <Section position="1" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 3.1 User Interface </SectionTitle> <Paragraph position="0"> The user interface includes: * The first page of FarsiSum on the WWW, presented in Persian</Paragraph> <Paragraph position="2"> The final summary, including statistical information for the user, presented in Persian.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.2 Stop List </SectionTitle> <Paragraph position="0"> The current implementation uses a simple stop-list rather than a full-fledged Persian lexicon. 
The stop-list is an HTML file (UTF-8 encoding) containing about 200 high-frequency Persian words, including the most common verbs, pronouns, adverbs, conjunctions, prepositions and articles.</Paragraph> <Paragraph position="1"> http://www.nada.kth.se/iplab/hlt/farsisum/indexfarsi.html The stop-list was built up incrementally during the implementation phase by repeatedly running FarsiSum to find the most common words in Persian.</Paragraph> <Paragraph position="2"> The assumption is that words not included in the stop-list are nouns or adjectives (content words) and should be counted as such in the word frequency list.</Paragraph> </Section> <Section position="3" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 3.3 Tokenizer </SectionTitle> <Paragraph position="0"> The tokenizer has been modified to recognize the Persian comma, semicolon and question mark.</Paragraph> <Paragraph position="1"> * Sentence boundaries are found by searching for periods, exclamation marks and question marks, as well as &lt;BR&gt; (the HTML line break) and the Persian question mark (؟).</Paragraph> <Paragraph position="2"> * The tokenizer finds word boundaries by searching for characters such as &quot;.&quot;, &quot;,&quot;, &quot;!&quot;, &quot;?&quot;, &quot;&lt;&quot;, &quot;&gt;&quot;, &quot;:&quot;, spaces, tabs and new lines. The Persian semicolon, comma and question mark are also recognized.</Paragraph> <Paragraph position="3"> * All words in the document are converted from ASCII to UTF-8. These words are then compared with the words in the stop-list. Words not included in the stop-list are regarded as content words and are counted as keywords.</Paragraph> <Paragraph position="4"> The word order in Persian is SOV (subject-object-verb), i.e. the last word in a sentence is a verb. 
This knowledge is used to prevent verbs from being stored in the word frequency table.</Paragraph> </Section> <Section position="4" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.4 Architecture </SectionTitle> <Paragraph position="0"> FarsiSum is implemented as an HTTP client/server application, as shown in Figure 2. The summarization program is located on the server side and the client is a browser such as Internet of the document) to be summarized is attached to the request. (The original text is in Unicode format.)</Paragraph> <Paragraph position="1"> * The document is summarized in three phases: tokenizing, scoring and keyword extraction. Words in the document are converted from ASCII to UTF-8. These words are then compared with the words in the stop-list (2-5).</Paragraph> <Paragraph position="2"> * The summary is returned to the HTTP server, which returns the summarized document to the client (6).</Paragraph> <Paragraph position="3"> The browser then renders the summarized text on the screen.</Paragraph> </Section> </Section> </Paper>