File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-1909_intro.xml

Size: 2,359 bytes

Last Modified: 2025-10-06 14:02:41

<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1909">
  <Title>Mining Linguistically Interpreted Texts</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Natural language texts can be viewed as resources containing uniform data, so that methods similar to those used in Data Base Knowledge Extraction can be applied to them. The adaptation of these methods to texts is known as Text Mining (Tan, 1999). Machine learning techniques are applied to document collections in order to extract patterns that may be useful for organizing the collections or retrieving information from them. Tasks in this area include text categorization, clustering, summarization, and information extraction. One of the first steps in text mining tasks is the pre-processing of the documents, as they need to be represented in a more structured way.</Paragraph>
    <Paragraph position="1"> Our work proposes a new technique for the pre-processing phase of documents, and we compare it with the usual pre-processing methods. We focus on two text mining tasks, namely text categorization and clustering. In the categorization task we associate each document with a class from a pre-defined set; in the clustering task the challenge is to identify groups of similar documents without knowledge of pre-defined classes. Usually, the pre-processing phase in these tasks is based on the approach called bag-of-words, in which only simple techniques are used to eliminate uninteresting words and to reduce various semantically related terms to the same root (stopword removal and stemming, respectively). As an alternative, we propose the use of linguistic information in the pre-processing phase, by selecting words according to their category (nouns, adjectives, proper names, verbs) and using their canonical forms. We ran a series of experiments to evaluate this proposal over Brazilian Portuguese texts.</Paragraph>
    <Paragraph position="2"> This paper is organized as follows. Section 2 presents an overview of text mining. Section 3 presents the methods used for collecting the linguistic knowledge used in the experiments. The experiments themselves are described in Section 4. Section 5 presents an analysis of the results, and Section 6 concludes the paper.</Paragraph>
  </Section>
</Paper>
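
The two pre-processing strategies contrasted in the introduction can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword list, the suffix-stripping function `crude_stem`, the category labels, and the hand-annotated English example are all hypothetical stand-ins chosen for brevity, whereas the paper works with Brazilian Portuguese texts and obtains its linguistic annotations by the methods described in its Section 3.

```python
from collections import Counter

# Hypothetical toy resources; a real system would use a full stopword list,
# a proper stemmer, and a parser or tagger to supply lemmas and categories.
STOPWORDS = {"the", "a", "of", "in", "are", "is", "and", "to"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix stripping, not a real stemmer


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(tokens):
    """Usual pre-processing: drop stopwords, then stem what remains."""
    lowered = [t.lower() for t in tokens]
    return Counter(crude_stem(t) for t in lowered if t not in STOPWORDS)


def linguistic_features(tagged_tokens,
                        keep=("noun", "adjective", "proper_name", "verb")):
    """Linguistic pre-processing: keep selected categories, count canonical forms."""
    return Counter(lemma for _token, lemma, category in tagged_tokens
                   if category in keep)


# Each entry is (token, canonical form, category); annotations are hand-written.
tagged = [
    ("The", "the", "determiner"),
    ("banks", "bank", "noun"),
    ("raised", "raise", "verb"),
    ("the", "the", "determiner"),
    ("rates", "rate", "noun"),
]

bow = bag_of_words([token for token, _lemma, _category in tagged])
nouns_only = linguistic_features(tagged, keep=("noun",))
print(bow)
print(nouns_only)
```

In this toy run the stemmed bag-of-words yields truncated roots such as `rais` and `rat`, while the category-based selection keeps readable canonical forms and lets the feature set be restricted to, for example, nouns only.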