<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0906">
  <Title>Discriminating the registers and styles in the Modern Greek language</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The identification of the language style characterising the constituent parts of a corpus is very important for several applications. For example, in information retrieval applications, where large corpora of texts need to be searched efficiently, it is useful to have information about the language style used in each text, to improve the accuracy of the search (Karlgren, 1999). In fact, the criteria regarding language style may differ for each search and therefore, due to the large number of texts, style categorisation needs to be performed in an automated manner. Such systems normally use statistical methods to evaluate the properties of given texts.</Paragraph>
    <Paragraph position="1"> The complexity of the studied properties varies.</Paragraph>
    <Paragraph position="2"> Kilgarriff (1996) employs mainly the frequency-of-occurrence of words, while Karlgren (1999) applies statistical methods primarily to structural and part-of-speech information.</Paragraph>
    <Paragraph position="3"> Baayen et al. (1996), who study the topic of author identification, apply statistical measures and methods to syntactic rewrite rules resulting from processing a given set of texts. They report that the accuracy thus obtained is higher than when applying the same statistical measures to the original text. On the other hand, Biber (1995) uses Multidimensional Analysis coupled with a large number of linguistic features to distinguish among registers. The underlying idea is that, rather than being distinguished on the basis of a set of linguistic features, registers are distinguished on the basis of combinations of weighted linguistic features, the so-called &quot;dimensions&quot;.</Paragraph>
    <Paragraph position="4"> This article reports on the discrimination of texts in written Modern Greek. The ongoing research described here has followed two distinct directions. First, we have tried to distinguish among registers of written Modern Greek. In a second phase, our research has focused on distinguishing among individual styles within one register and, more specifically, among speakers of the Greek Parliament. To achieve that, structural, morphological and part-of-speech information is employed. Initially (in section 2) emphasis is placed on distinguishing among the different registers used. In section 3, the task of author identification is tested with selected statistical methods. In both sections, we describe the set of linguistic features measured, we argue for the statistical method employed and we comment on the results. Section 4 contains a description of future plans for extending this line of research, while section 5 provides the conclusions of this article.</Paragraph>
  </Section>
</Paper>