File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/99/e99-1019_concl.xml
Size: 1,820 bytes
Last Modified: 2025-10-06 13:58:22
<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1019"> <Title>Exploring the Use of Linguistic Features in Domain and Genre Classification</Title> <Section position="9" start_page="146" end_page="146" type="concl"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> In this paper, we examined different linguistically motivated inputs for training text classification algorithms, focussing on domain- and genre-based tasks.</Paragraph> <Paragraph position="1"> The most clear-cut result is the influence of the training corpus on classifier performance. If we want general-purpose classifiers for large genres or collections of genres, &quot;small&quot; representative corpora such as LIMAS will in the end provide too little training material, because the emphasis is on capturing the extent of potential variation in a language, and less on providing sufficient numbers of prototypical instances for text categorisation algorithms. In addition, genre boundaries are notoriously fuzzy, and if this inherent variability is compounded by sparse data, we indeed have a problem, as Sec. 5.4 showed. Therefore, further work into genre classification should focus on well-defined genres and corpora large enough to contain a sufficient number of prototypical documents. In our opinion, further investigations into the utility of linguistic features for text categorisation tasks should best be conducted on such corpora.</Paragraph> <Paragraph position="2"> Our results neither support nor refute the hypotheses advanced in Sec. 2. However, note that in some cases, the additional non-content word information did indeed improve performance (cf. Tab. 3), so that such representations should at least be experimented with before settling on content words.</Paragraph> </Section> </Paper>