<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1610"> <Title>Automatic Arabic Document Categorization Based on the Naive Bayes Algorithm</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> With the explosive growth of text documents on the web, retrieving relevant information has become a crucial task in satisfying the needs of different end users. Automatic text categorization has emerged as one way to cope with this problem.</Paragraph> <Paragraph position="1"> Automatic text (or document) categorization attempts to save the human effort required for manual categorization. It consists of labeling documents with categories from a pre-defined set, based on document contents. As such, one of its primary objectives has been to enhance and support information retrieval tasks such as information filtering and routing, clustering of related documents, and the classification of documents into pre-specified subject themes. Automatic text categorization has been used in search engines, digital library systems, and document management systems (Yang, 1999).</Paragraph> <Paragraph position="2"> Such applications have included email filtering, newsgroup classification, and survey data grouping. Barq, for instance, uses automatic categorization to provide a similar-documents feature (Rachidi et al., 2003). In this paper, Naive Bayes (NB), a statistical machine learning algorithm, is used to learn to classify non-vocalized Arabic web text documents.</Paragraph> <Paragraph position="3"> This paper is organized as follows. Section 2 briefly describes related work in the area of automatic text categorization. Section 3 describes the preprocessing that documents undergo for the purpose of categorization, in particular the preprocessing specific to the Arabic language. 
Section 4 presents Naive Bayes (NB), the learning algorithm used in this paper for document categorization. Section 5 outlines the experimental setting and the experiments carried out to evaluate the performance of the NB classifier. It also gives the numerical results with their analysis and interpretation. Section 6 summarizes the work and suggests some ideas for future work.</Paragraph> </Section> </Paper>