A Knowledge-based Approach to Text Classification

2 Knowledge-based text classification

The principal process of knowledge-based text classification is illustrated in Figure 1.

[Figure 1: The principal process of knowledge-based text classification. Flowchart: while the text set is not empty, select a text for analysis, perform topic identification, and output the topic tagging of the text; stop when the text set is empty.]

As Figure 1 shows, the crucial technique in the text classification is the topic identification parser: the topic tagging assigned to a text is taken as its category.

2.1 Feature Dictionary

The feature dictionary stores terms that illustrate topic feature concepts; we call these terms "feature terms". Each entry of the feature dictionary consists of the word, its POS, its semantic feature, its location, and its field attribute. Since 1996 we have employed a semi-automatic method to acquire feature terms from a pre-categorized corpus, and have developed a feature dictionary of about 300,000 feature terms. About 1,500 kinds of semantic features and about 1,000 kinds of field attributes are used to tag the feature terms in this dictionary.

2.2 Topic Feature Distribution Computing Formula

According to the field attributes, frequencies, and positions of feature terms, we compute the topic feature distribution. The computing steps are as follows:

1) According to the frequency and position of a feature term ft_i, we compute its weight p(ft_i):

    p(ft_i) = (1.0 · N_title(ft_i) + 0.5 · N_begin(ft_i) + 0.5 · N_end(ft_i) + freq(ft_i)) / Σ_j freq(ft_j)    (1)

where N_title(ft_i) is the number of times the feature term ft_i occurs in the title, N_begin(ft_i) the number of times it occurs in the first sentence of a paragraph, N_end(ft_i) the number of times it occurs in the last sentence of a paragraph, freq(ft_i) its frequency in the text, and Σ_j freq(ft_j) the total frequency of all feature terms in the text.

In our experiments we found that feature terms in different positions of a text have different influence on the topic features, so we use different experience coefficients in the weight computing formula: in formula (1) the coefficient of N_title is 1.0, the coefficient of N_begin is 0.5, and the coefficient of N_end is 0.5.

2) From the dictionary entry of a feature term we acquire its field attribute; this field attribute is in fact the topic feature illustrated by the feature term. The weight of a topic feature is obtained by adding the weights of all feature terms that illustrate it: the more feature terms illustrate the same topic feature, the higher its weight. The weight of a topic feature expresses its ability to illustrate the topic of the text. The weight p(f_i) of a topic feature f_i is computed as

    p(f_i) = Σ_{ft ∈ FT(f_i)} p(ft)    (2)

where FT(f_i) is the set of feature terms illustrating the same topic feature f_i.
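To make these two steps concrete, here is a minimal Python sketch of formulas (1) and (2) as reconstructed above; the record layout, field names, and function names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FeatureTermStats:
    """Occurrence statistics for one feature term ft_i in a text.

    `field` is the term's field attribute from the feature dictionary,
    i.e. the topic feature it illustrates (Section 2.1); the remaining
    fields are the counts used in formula (1). Layout is illustrative.
    """
    field: str
    n_title: int  # occurrences in the title (coefficient 1.0)
    n_begin: int  # occurrences in paragraph-initial sentences (coefficient 0.5)
    n_end: int    # occurrences in paragraph-final sentences (coefficient 0.5)
    freq: int     # total frequency of the term in the text

def term_weight(t: FeatureTermStats, total_freq: int) -> float:
    # Formula (1): position-weighted distribution of a feature term,
    # normalized by the total frequency of all feature terms.
    return (1.0 * t.n_title + 0.5 * t.n_begin + 0.5 * t.n_end + t.freq) / total_freq

def topic_feature_weights(terms: list[FeatureTermStats]) -> dict[str, float]:
    # Formula (2): the weight of a topic feature is the sum of the
    # weights of all feature terms that share its field attribute.
    total = sum(t.freq for t in terms)
    phi: dict[str, float] = defaultdict(float)
    for t in terms:
        phi[t.field] += term_weight(t, total)
    return dict(phi)
```

The returned mapping corresponds to the topic feature set Φ used by the aggregation phase below.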
2.3 Topic Feature Aggregation Formula

The topic feature aggregation formula of a topic t_i is:

    p(t_i) = Σ_j α(f_j) · p(f_j)    (3)

where p(f_j) is the weight of the topic feature f_j and α(f_j) is the coefficient of the topic feature f_j in the formula. In the application system, we used an automatic construction technique to build a topic feature aggregation formula library, which contains 105 topic feature aggregation formulas.

2.4 FIFA algorithm

Most automatic text processing techniques use topic identification as part of a specific task. The approaches to topic identification taken in these techniques can be summarized in three groups: statistical, knowledge-based, and hybrid. The statistical approach (H. P. Luhn 1957; H. P. Edmundson 1969; Gerard Salton, James Allan, Chris Buckley, and Amit Singhal 1994) infers the topics of texts from term frequency, term location, term co-occurrence, etc., without using external knowledge bases such as machine-readable dictionaries. The knowledge-based approach (Wendy Lehnert and C. Loiselle 1989; Lin Hongfei 2000) relies on a syntactic or semantic parser and on knowledge bases such as scripts or machine-readable dictionaries, without using any corpus statistics. The hybrid approach (Elizabeth D. Liddy and Sung H. Myaeng 1992; Marti A. Hearst 1994) combines the statistical and knowledge-based approaches to take advantage of the strengths of both and thereby improve overall system performance.

This paper presents a simple and effective approach to automatic topic identification named FIFA (feature identification and feature aggregation). The core of the FIFA algorithm is based on the equation: topic identification = feature identification + feature aggregation. It runs in two phases:

1) Topic Feature Identification (FI): We use the term "topic feature" to name a sub-topic of a text. In this phase FIFA identifies feature terms in a text by dictionary-based and rule-based methods. The distribution of a topic feature is computed from the attributes, frequencies, and positions of its feature terms.

2) Topic Feature Aggregation (FA): According to the distribution of the topic features, in this phase we use topic feature aggregation formulas to compute the weights of the topics of a text; the topic of the text is then determined by the computed weights. The topic feature aggregation formulas, introduced in detail in Section 2.3, can be acquired automatically from a pre-classified training corpus by a machine-learning method, as sketched below.
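As an illustration of the aggregation phase, here is a minimal Python sketch that evaluates one aggregation formula of the form (3) on a topic feature set; the dict-based layout and all names and values are illustrative assumptions, not the authors' implementation.

```python
def aggregate(formula: dict[str, float], phi: dict[str, float]) -> float:
    """Evaluate one topic feature aggregation formula (formula (3)).

    `formula` maps a topic feature name to its coefficient α(f_j);
    `phi` maps a topic feature name to its weight p(f_j) from formula (2).
    """
    return sum(coef * phi.get(feature, 0.0) for feature, coef in formula.items())

# Hypothetical aggregation formula for a "sports" topic and a topic
# feature set Φ computed by the FI phase (all values invented):
sports = {"match": 0.6, "athlete": 0.3, "score": 0.1}
phi = {"match": 0.04, "athlete": 0.02, "economy": 0.05}
print(aggregate(sports, phi))  # 0.6*0.04 + 0.3*0.02 + 0.1*0.0 ≈ 0.03
```

Features absent from Φ simply contribute nothing, so one formula library can be applied to any text.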
The topic identification algorithm FIFA can be described as follows:

Step 1: Text segmentation and POS tagging.
Input: a raw text.
1. Preprocessing phase: recognize sentence boundaries, paragraph breaks, abbreviations, numbers, and other special tokens.
2. Segmentation phase: employ the maximal matching algorithm to segment each sentence into words, and set a word's POS set from the machine-readable dictionary as its POS tagging.
3. Disambiguation phase: employ a technique based on an ambiguous segmentation dictionary to resolve word segmentation ambiguities, and rule-based techniques to recognize unknown words such as person names, locations, company and organization nouns, etc.
4. POS tagging phase: employ a tri-gram based technique for POS tagging.
Output: a text with formats, segmentation, and POS tagging.

Step 2: Topic feature identification.
Input: a text with formats, segmentation, and POS tagging.
1. Feature-dictionary-based feature term identification and tagging. The core of this method is to use the feature dictionary to identify and tag feature terms: if a term of the text is found in the dictionary, we call it a feature term of the text, and its field attribute in the dictionary is tagged as the topic feature it illustrates.
2. Rule-based feature term identification and tagging. Because of the limitations of the feature dictionary, not all feature terms can be identified by the dictionary-based technique. To resolve the problem of unknown feature terms, we use rule-based feature term identification and tagging, in two steps:
1) We employ a statistics-based approach to acquire high-frequency terms from the text as analysis objects; such a term is composed of two or more words and must occur more than twice in the text.
2) We employ a rule-based technique to analyze the grammatical structure of these high-frequency terms; from the grammatical structure of a term and the attribute of its central word we estimate the field attribute of the term, which is tagged as its topic feature attribute.
3. According to the attributes, frequencies, and positions of the feature terms, calculate the distribution of the topic features of the text.
Output: the topic feature set Φ = {(f_1, p(f_1)), ..., (f_n, p(f_n))}.

Step 3: Topic determination.
Input: the topic feature set Φ.
1. Read a formula AF_i from the topic feature aggregation formula library, where AF_i is the aggregation formula of the topic t_i.
2. According to the parameters in the topic feature set Φ, compute the weight of the topic t_i by the formula AF_i.
3. If there are other aggregation formulas left in the library, go to 1; otherwise go to the next step.
4. Select the topic t_k whose computed weight is maximal.
Output: the topic with the maximal weight as the topic tagging of the text.

Algorithm 1: Topic identification algorithm FIFA
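To make Step 3 concrete, here is a minimal Python sketch of the determination loop, assuming the formula library maps each topic name t_i to a linear aggregation formula AF_i as in formula (3); all names and layouts are illustrative assumptions rather than the authors' implementation.

```python
def aggregate(formula: dict[str, float], phi: dict[str, float]) -> float:
    # Formula (3), as in the previous sketch: weighted sum of the
    # topic feature weights named by the formula.
    return sum(c * phi.get(f, 0.0) for f, c in formula.items())

def determine_topic(library: dict[str, dict[str, float]],
                    phi: dict[str, float]) -> str:
    """Step 3, items 1-4: evaluate the aggregation formula AF_i of
    every topic t_i in the library on the topic feature set Φ (`phi`),
    then select the topic whose computed weight is maximal."""
    weights = {topic: aggregate(af, phi) for topic, af in library.items()}
    return max(weights, key=weights.get)
```

Expressed this way, the explicit "go to 1" loop of the algorithm becomes a single pass over the formula library followed by an argmax.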
3. Experiment

To test the efficiency of FIFA-based automatic text classification, we constructed a test corpus according to ten pre-determined topics; it includes 1,000 articles downloaded from the Internet. The composition of the test corpus is as follows:

[Table: composition of the test corpus by topic (not recoverable from the extraction).]

From the classification results on this corpus we can evaluate the effect of FIFA-based automatic text classification. Figure 2 shows the results.

[Figure 2: Text classification results over the ten topics; one line shows the precision percentage, the other the recall percentage.]