File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/relat/03/w03-1106_relat.xml
Size: 1,322 bytes
Last Modified: 2025-10-06 14:15:40
<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1106">
  <Title>Text Classification in Asian Languages without Word Segmentation</Title>
  <Section position="8" start_page="0" end_page="0" type="relat">
    <SectionTitle>5.5 Related Work</SectionTitle>
    <Paragraph position="0">The use of n-gram models has also been extensively investigated in information retrieval. However, unlike previous research (Cavnar and Trenkle, 1994; Damashek, 1995), where researchers have used n-grams as features for a traditional feature selection process and then deployed classifiers based on calculating feature-vector similarities, we consider all n-grams as features and determine their importance implicitly by assessing their contribution to perplexity. In this way, we avoid an error-prone feature selection step.</Paragraph>
    <Paragraph position="1">Language modeling for text classification is a relatively new area. In principle, any language model can be used to perform text categorization. However, n-gram models are extremely simple and have been found to be effective in many applications. Teahan and Harper (2001) used a PPM (prediction by partial matching) model for text categorization, in which they seek the model that obtains the best compression on a new document.</Paragraph>
  </Section>
</Paper>
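
The perplexity-based classification described in the section above can be made concrete with a small example. The following is a minimal illustrative sketch, not the authors' implementation: it trains one character-bigram language model with add-one smoothing per class (the paper's actual models are n-gram language models of higher order with different smoothing, which is an assumption simplified here) and assigns a new document to the class whose model yields the lowest per-character perplexity. The class labels and training strings are hypothetical.

    # Minimal sketch (not the authors' implementation): per-class character-bigram
    # language models with add-one smoothing; a document is assigned to the class
    # whose model gives the lowest per-character perplexity. Class labels and
    # training texts below are hypothetical.
    import math
    from collections import Counter

    def train_bigram_model(texts):
        # Count character bigrams and their left contexts over the training texts.
        bigrams, contexts, vocab = Counter(), Counter(), set()
        for text in texts:
            padded = "^" + text          # "^" marks the start of a document
            vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                bigrams[(prev, cur)] += 1
                contexts[prev] += 1
        return bigrams, contexts, vocab

    def perplexity(model, text):
        # Per-character perplexity of `text` under an add-one-smoothed bigram model;
        # the extra +1 in the denominator reserves mass for unseen characters.
        bigrams, contexts, vocab = model
        padded = "^" + text
        log_prob = 0.0
        for prev, cur in zip(padded, padded[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab) + 1)
            log_prob += math.log2(p)
        return 2.0 ** (-log_prob / (len(padded) - 1))

    def classify(models, document):
        # Choose the class whose language model assigns the document the lowest perplexity.
        return min(models, key=lambda label: perplexity(models[label], document))

    if __name__ == "__main__":
        training = {  # toy training data, purely illustrative
            "sports": ["the team won the final match", "players scored two goals"],
            "finance": ["the bank raised interest rates", "stock prices fell sharply"],
        }
        models = {label: train_bigram_model(texts) for label, texts in training.items()}
        print(classify(models, "interest rates and stock markets"))

Choosing the class with the lowest perplexity is equivalent to choosing the model under which the document has the shortest code length, which is the compression view taken by the PPM-based approach of Teahan and Harper (2001) cited above.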