<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1909"> <Title>Mining Linguistically Interpreted Texts</Title> <Section position="7" start_page="0" end_page="0" type="concl"> <SectionTitle> 6 Conclusions </SectionTitle> <Paragraph position="0"> This paper presented a series of experiments comparing our proposed pre-processing techniques, which are based on linguistic information, with the usual pre-processing methods adopted in text mining.</Paragraph> <Paragraph position="1"> The literature offers other alternative proposals for the pre-processing phase of text mining. (Goncalves and Quaresma, 2003) use the canonical form of the word instead of stemming, for European Portuguese. (Feldman et al., 1998) propose the use of compound terms as opposed to single terms for text mining. Similarly, (Aizawa, 2001) uses morphological analysis to aid the extraction of compound terms. Our approach differs from these in that we propose selecting single terms based on different part-of-speech categories.</Paragraph> <Paragraph position="2"> The results show that term selection based solely on grammatical category produces results at least as good as those of the usual methods, in both categorization and clustering tasks, when the selection considers nouns and adjectives or nouns and proper nouns. In the categorization experiments, we obtained the lowest error rate for PD2, 18.01%, when the pre-processing phase was based on the selection of nouns and adjectives. However, the second-best categorization score, 19.77%, was achieved by the traditional methods. Because the corpus is small, further experiments are needed to verify the statistical significance of the reported gains. 
The results of the clustering experiments show precision ranging from 50.52% to 63.15%.</Paragraph> <Paragraph position="3"> As we plan to test our techniques with a larger number of documents, and consequently a larger number of terms, we are considering other machine-learning techniques, such as Support Vector Machines, that are robust enough to deal with a large number of terms. We also plan to apply more sophisticated linguistic knowledge than grammatical categories alone, for instance the use of noun phrases for term selection, since this information is provided by the parser PALAVRAS. Another direction for future work is further testing on other languages.</Paragraph> </Section> </Paper>