<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0906">
  <Title>Discriminating the registers and styles in the Modern Greek language</Title>
  <Section position="3" start_page="0" end_page="36" type="metho">
    <SectionTitle>
2 Distinguishing Registers
</SectionTitle>
    <Paragraph position="0"> To distinguish among registers, we successfully exploited a particular feature of Modern Greek, namely the contrast between Katharevousa and Demotiki. These are variations of Modern Greek which correspond (if only roughly) to formal and informal speaking. Katharevousa was the official language of the Greek State until 1979, when it was replaced by Demotiki. By that time, Demotiki was the established language of literature while, at times, it had been the language of elementary education. Compared to Demotiki, Katharevousa bears an important resemblance to Ancient Greek, manifested explicitly on the morphological level and in the use of the lexicon. At a second step, we dropped the Katharevousa-Demotiki approach and relied on part-of-speech information, which is often exploited in text categorisation experiments (for instance, see Biber et al. 1998). Again, we obtained satisfactory results.</Paragraph>
    <Section position="1" start_page="0" end_page="36" type="sub_section">
      <SectionTitle>
2.1 Method of work
</SectionTitle>
      <Paragraph position="0"> The variables used to distinguish among registers may be grouped into the following categories: 1. Morphological variables: These were verbal endings quantifying the contrast Katharevousa / Demotiki. Although the morphological differences between these two variations of Greek are not limited to the verb paradigm, we focused on the latter since it better highlights the contrast under consideration (Tambouratzis et al., 2000). A total of 230 verbal endings were selected, split into 145 Demotiki and 85 Katharevousa endings (see also the Appendix). These 230 frequencies-of-occurrence were grouped into 12 variables for use in the statistical analysis. 2. Lexical variables: Certain negation particles (ovd~eiC/, otu~rere, oo~o4zo6, dryer) clearly signify a preference for Katharevousa while others (~51Xo~C/, #are, XcopiC/) are clear indicators of Demotiki. However, the most frequently used negation particles (tzt,/a/v, ~cv) are not characteristic of either of the two variations.</Paragraph>
      <Paragraph position="1"> 3. Structural macro-features: average sentence length, number of commas, dashes and brackets (total of 4 variables).</Paragraph>
      <Paragraph position="2"> 4. After the completion of the experiments with variables of type 1-3 (Tambouratzis et al., 2000), Part-of-Speech (PoS) counts were introduced. The PoS categories were adjectives, adjunctions, adverbs, articles, conjunctions, nouns, pronouns, numerals, particles, verbs and a hold-all category (for non-classifiable entries), resulting in 11 variables expressed as percentages.</Paragraph>
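The PoS percentages of item 4 can be illustrated with a minimal Python sketch. It assumes each text is already available as a flat sequence of tag strings; the category labels below are hypothetical stand-ins, since the actual labels of the ILSP tagset are not given in the text.

```python
from collections import Counter

# Hypothetical tag names standing in for the ten PoS categories listed above;
# the actual ILSP tagset labels are not given in the text.
POS_CATEGORIES = [
    "adjective", "adjunction", "adverb", "article", "conjunction",
    "noun", "pronoun", "numeral", "particle", "verb",
]


def pos_percentages(tags):
    """Return the 11 PoS percentages for one tagged text.

    Tags outside POS_CATEGORIES are pooled into the hold-all category "other".
    """
    counts = Counter(tag if tag in POS_CATEGORIES else "other" for tag in tags)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute percentages for an empty text")
    return {cat: 100.0 * counts[cat] / total for cat in POS_CATEGORIES + ["other"]}


# Toy input: a tag sequence for a five-token text.
example = pos_percentages(["article", "noun", "verb", "article", "interjection"])
```

Expressing the counts as percentages keeps texts of different lengths comparable, which matters here because the texts of the three registers differ considerably in size.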
      <Paragraph position="3"> These variables are more similar to the characteristics used by Karlgren (1999), and differ considerably from those used by Kilgarriff (1996) and Baayen et al. (1996). For the metrics of the first and third categories, a custom-built program running under Linux was used. This program calculated all structural and morphological metrics for each text in a single pass and the results were processed with the help of a spreadsheet package. The metrics of the second category were calculated using a custom-built program in the C programming language. PoS counts were obtained using the ILSP tagger (Papageorgiou et al., 2000) coupled with a number of custom-built programs to determine the actual frequencies-of-occurrence from the tagged texts. Finally, the STATGRAPHICS package was used for the statistical analysis. The dataset selected consisted of examples from three registers: (i) fiction (364 Kwords - 24 texts), (ii) texts of academic prose referring to historical issues, also referred to as the history register (361 Kwords - 32 texts) and (iii) political speeches obtained from the proceedings of the Greek parliament sessions, also referred to as the parliament register (509 Kwords - 12 texts).</Paragraph>
      <Paragraph position="4"> The texts of registers (i) and (ii) were retrieved from the ILSP corpus (Gavrilidou et al., 1998), all of them dating from the period 1991-1999. The texts of register (iii) were transcripts of the Greek Parliament sessions held during the first half of 1999.</Paragraph>
      <Paragraph position="5"> This dataset was processed using both seeded and unseeded clustering techniques with between 3 and 6 clusters. The unseeded approach confirmed the existence of distinct natural classes, which correspond to the three registers. The seeded approach confirmed the ability to accurately separate these three registers and to cluster their elements together. Initially, a &quot;short&quot; data vector containing only the 12 morphological variables quantifying the Demotiki/Katharevousa contrast was used (Tambouratzis et al. 2000), as well as a 16-element vector combining structural and morphological characteristics. The seeds for the Parliament and History registers were chosen randomly. The seeds for the Fiction register were chosen so that at least one of them would not be an &quot;outlier&quot; of the Fiction register. Representative results are shown in Table 1 for the different vectors and numbers of clusters. In each case, the classification rate quoted corresponds to the number of text elements correctly classified (according to the register of the respective seed).</Paragraph>
      <Paragraph position="6"> [Table 1: classification rate as a function of the cluster number and vector size (12-element and 16-element vectors).]</Paragraph>
      <Paragraph position="7"> The vector size was augmented with PoS information, resulting in a 27-element data vector. A new set of clustering experiments was performed using Ward's method with the squared Euclidean distance measure to cluster the data in an unseeded manner. Finally, a 15-element data vector was used with PoS and structural information but without any morphological information. The results obtained (Table 2) show that PoS information improves the clustering performance.</Paragraph>
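Ward's method with the squared Euclidean distance can be sketched in a few lines of Python. This is a naive, pure-Python illustration under the usual definition of Ward's criterion (merge the two clusters whose union gives the smallest increase in within-cluster sum of squares); it is not the STATGRAPHICS procedure used in the experiments, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_cluster(vectors, n_clusters):
    """Naive agglomerative clustering: merge the cheapest pair until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of indices into vectors.
    """
    if n_clusters not in range(1, len(vectors) + 1):
        raise ValueError("n_clusters must be between 1 and the number of vectors")
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(
                [vectors[k] for k in clusters[pair[0]]],
                [vectors[k] for k in clusters[pair[1]]],
            ),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Toy data: three well-separated groups of two-dimensional "texts".
toy = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.9, 0.0], [10.0, 0.3]]
groups = ward_cluster(toy, 3)
```

In the experiments each vector would hold the 27 (or 15) variables of one text rather than two toy coordinates; variables on very different scales would normally be standardised first, although the text does not say whether this was done.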
    </Section>
    <Section position="2" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
2.2 Comments on the Results
</SectionTitle>
      <Paragraph position="0"> Our results strongly suggest that registers of written Modern Greek can be discriminated accurately on the basis of the contrast Katharevousa / Demotiki as manifested in morphological variation. Languages with a different history may not be suited to such a categorisation method. This is evident in Biber's work (1995) for the English language, where a variety of grammatical and macro-structural linguistic features but no morphological variation features were employed. It seems then that corpora of languages which are characterised by the phenomenon of diglossia may be successfully categorisable on the basis of morphological information (or other reflexes of diglossia). Such a discrimination method may give results as satisfactory as approaches which are closer to the Biber (1995) spirit and rely on PoS and structural measures (see Tables 1 and 2).</Paragraph>
      <Paragraph position="1"> Tables 1 and 2 show that the accuracy of clustering reaches approximately 99% while the seeded clustering approach had a high degree of accuracy, reaching 100% when using 5 clusters.</Paragraph>
      <Paragraph position="2"> For the 27-element vector with both morphological and PoS information, perfect clustering has been achieved even with 4 clusters. On the other hand, a successful clustering (albeit with a lower level of accuracy) is achieved using only structural and PoS information.</Paragraph>
      <Paragraph position="3"> It should be noted that the lexical variables used, that is the negation particles, did not contribute at all (Markantonatou et al., 2000). Furthermore, the system performed almost as well with and without macro-structure features, the difference in accuracy being less than 5%.</Paragraph>
      <Paragraph position="4"> The parliament texts can be claimed to form a register whose patterns are closely positioned in the pattern space. Of the three registers, the literature one presented the highest degree of variance, with more than one sub-cluster existing as well as outlier elements. This may be explained by the fact that the parliament proceedings, contrary to literature, undergo intensive editing by a small group of specialised public servants.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="36" end_page="127" type="metho">
    <SectionTitle>
3 Distinguishing Styles within One Register
</SectionTitle>
    <Paragraph position="0"> In this section, we report on our efforts to distinguish among individual styles within one register. In particular, we intend to distinguish among speakers of the Parliament by studying the transcripts of the speeches of five parliament members over the period 1997-2000. Each of these speakers belongs to one of the five political parties that were represented in the Greek parliament over that period. To date, the experiments have been limited to the period 1999-2000.</Paragraph>
    <Section position="2" start_page="36" end_page="38" type="sub_section">
      <SectionTitle>
3.1 Method of work
</SectionTitle>
      <Paragraph position="0"> The variables (46 in total) calculated for each of the five speakers can be grouped as follows: * the use of infixes (2 variables) in the past tense forms.</Paragraph>
      <Paragraph position="1"> * the person and number of the verb form (6 variables).</Paragraph>
      <Paragraph position="2"> The last two types of variable are expressed as percentages normalised over the number of verb forms.</Paragraph>
      <Paragraph position="3"> 2. Lexical variables (6 variables): * Negation particles (623, &amp;v,/aft). * Negative words of Katharevousa (ovJeiC/, ~iveo).</Paragraph>
      <Paragraph position="4"> * Other words which also express the contrast Katharevousa / Demotiki (the anaphoric pronouns 'o~oioC/' (Kath) and 'taro' (Dem)), currently resulting in a single variable.</Paragraph>
      <Paragraph position="5"> 3. Structural macro-features: average sentence and word length, number of commas, question marks, dashes and brackets, resulting in a total of 6 variables. 4. Structural micro-features (other than lexical): * Part-of-Speech counts (10 variables). * Use of grammatical categories such as the genitive case with nouns and adjectives (2 variables).</Paragraph>
      <Paragraph position="6"> 5. The year when the speech was presented in the Parliament and the order of the speech in the daily schedule, that is whether it was the first speech of the speaker that day (hereafter denoted as &quot;protoloyia&quot;) or the second, third etc. (resulting in a total of 2 variables). 6. The identity of the speaker, denoted as the speaker signature (1 variable), which was used to determine the desired classification. As in the clustering experiments, a set of C programs was used to extract the values of the aforementioned variables automatically from the transcripts. Most of these programs rely on measuring the occurrence of di-grams, and more generally n-grams, for letters, words and tagsets, and are thus straightforward. In the case of speaker identification, Discriminant Analysis was used, as the clustering approach did not give very good results, indicating that the distinction among personal styles is weaker than that among registers. Even when only 2 speakers were used, the clusters formed involved patterns from both speaker classes.</Paragraph>
      <Paragraph position="7"> We experimented with two corpora, Corpus I and Corpus II, as described in Table 3. Corpus II is a subset of Corpus I. Each of the speeches included in Corpus II was delivered as an opening speech (&quot;protoloyia&quot;) at a parliament session when at least two of the studied speakers delivered speeches.</Paragraph>
      <Paragraph position="8"> An important issue is whether the selected variables are strongly correlated. If indeed strong correlations do exist, these might be used to reduce the dimensionality of the pattern space. For the purposes of this analysis, the 46 independent variables were used (45 in the case of Corpus II where only &quot;protoloyiai&quot; exist, since then the order variable is constantly equal to 1). The number of correlations of all variable pairs exceeding given thresholds is depicted in Figure 1, for both Corpus I and Corpus II. According to this study, in Corpus II, the percentage of variable pairs with an absolute value of correlation exceeding 0.5 is approximately 3%, indicating a low correlation between the parameters. Additionally, out of 990 pairs of Corpus II, only a single one has a correlation exceeding 0.8. The correlations for the same parameter pairs over the two corpora are similar, though as a rule the correlation for Corpus I is less than that for Corpus II, reflecting the larger variability of texts in Corpus I. The correlation study indicated that most of the parameters are not strongly correlated. Thus, a factor analysis step is not necessary and the application of the discriminant analysis directly on the original variables is justified.</Paragraph>
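The correlation check described above can be illustrated with a short Python sketch. It assumes the Pearson coefficient (the text does not name the correlation measure) and a hypothetical layout in which each variable is a list of values, one per speech; the toy data are invented for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def fraction_above(columns, threshold):
    """Fraction of variable pairs whose absolute correlation exceeds threshold."""
    pairs = list(combinations(range(len(columns)), 2))
    strong = sum(
        1 for i, j in pairs if abs(pearson(columns[i], columns[j])) > threshold
    )
    return strong / len(pairs)


# With 45 variables there are 45 * 44 / 2 = 990 distinct pairs, as in the text.
n_pairs_45 = 45 * 44 // 2

# Toy data: three variables (columns) measured on four texts.
toy_columns = [
    [1.0, 2.0, 3.0, 4.0],     # variable A
    [2.0, 4.0, 6.0, 8.0],     # variable B, perfectly correlated with A
    [1.0, -1.0, -1.0, 1.0],   # variable C, uncorrelated with A and B
]
share = fraction_above(toy_columns, 0.5)
```

In the toy data one pair out of three exceeds the 0.5 threshold; for Corpus II the same computation over 990 pairs gave approximately 3%.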
      <Paragraph position="9"> Initially, Corpus I (see Table 3) was processed. The 46 aforementioned variables were used to generate discriminant functions accurately recognising the identity of the speaker. To that end, three different approaches were used: (i) the full model: all variables were used to determine the discriminant functions;  (ii) the forward model: starting from an empty model, variables were introduced in order to create a reduced model, with a small number of variables; (iii) the backward model: starting from the full model, variables were eliminated to create a reduced model.</Paragraph>
      <Paragraph position="10"> In the cases of the forward and backward models, the values of the F parameter to both enter and delete a variable were set to 4 while the maximum number of steps to generate the model was set to 50.</Paragraph>
      <Paragraph position="11"> The performance of this model is improved if: 1. the order in which each particular speech was delivered is taken into account: the subset of &quot;protoloyiai&quot; is well-defined and presents a low variance while the speeches of second or lower order have a higher variance.</Paragraph>
      <Paragraph position="12"> 2. the corpus comprises only sessions where more than one speaker has delivered speeches. Thus, the more balanced Corpus II (Table 3) presents an improved discrimination performance.</Paragraph>
      <Paragraph position="13"> For these two corpora, the results of the discriminant analysis are shown in Table 4. The discrimination rate obtained with Corpus II is much higher than that for Corpus I. In addition, smaller models, with 8 variables, may be created that correctly classify at least 75% of Corpus II. An example of the factors generated and the manner in which they separate the pattern space is shown in the diagrams of Figure 2.</Paragraph>
    </Section>
    <Section position="3" start_page="38" end_page="127" type="sub_section">
      <SectionTitle>
3.2 Comments on the Results
</SectionTitle>
      <Paragraph position="0"> Though this research is continuing, certain facts can be reported with confidence.</Paragraph>
      <Paragraph position="1"> Within the Greek Parliament Proceedings register, individual styles cannot be classified on the basis of morphological features expressing the contrast Katharevousa/Demotiki. This may be explained by the fact that these texts undergo intensive editing towards a well-established sublanguage. This editing homogenises the morphological profile of the texts but, of course, does not go as far as homogenising the lexical preferences of the various speakers. That is why, contrary to the register-clustering experiments, lexical variables expressing the particular contrast seem to play a role in discriminating between speakers and why the use of Katharevousa-oriented negative particles, which was not important in register discrimination, seems to be of some importance in style discrimination. The observation that negative words play a role in style identification is in agreement with the observations of Labbé (1983) on French political speech.</Paragraph>
      <Paragraph position="2"> Structural features have turned out to be important: the average word length, the use of punctuation and question marks and the use of certain parts-of-speech such as articles, conjunctions, adjunctions and, especially, verbs. Furthermore, the distribution of verbs into persons and numbers seems to be important, though the exact variables selected differ depending on the exact set of speeches used (these variables are of course complementary).</Paragraph>
      <Paragraph position="3"> One of the most interesting findings of this research is that it is important whether the speaker delivers a &quot;protoloyia&quot; or not. &quot;Protoloyiai&quot; can be classified at a rate of 95% while mixed deliveries result in a lower rate, as low as 75%. This may be caused by two factors: 1. &quot;Protoloyiai&quot; represent longer stretches of text, which are more characteristic of a given speaker. 2. Speakers prepare meticulously for their &quot;protoloyiai&quot; while their other deliveries represent a more spontaneous type of speech, which tends to contain patterns shared by all the parliament members.</Paragraph>
      <Paragraph position="4"> Finally, certain additional patterns are emerging for each of the speakers. Certain speakers (e.g. speaker A) are more consistently recognised than others (e.g. speaker B) while speaker B is similar to speaker C and speaker D is similar to speaker E. This indicates that additional variables may be required to improve the classification accuracy for all speakers.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="127" end_page="127" type="metho">
    <SectionTitle>
4 Future Plans 5 Conclusions
</SectionTitle>
    <Paragraph position="0"> As a next step, the frequency of use of certain lemmata shall be introduced, since visual inspection indicates that they may provide good discriminatory features. We also plan to substitute average lengths (of both words and sentences) with the distribution of lengths.</Paragraph>
    <Paragraph position="1"> Furthermore, we intend to introduce certain structural measurements such as repetition of structures, chains of nominals and the occurrence of negation within NP phrasal constituents.</Paragraph>
    <Paragraph position="2"> Another possible extension involves the inclusion of the speech topic. As certain speakers' characteristics seem to change through time, we plan to process the entire corpus of speeches for the target period 1997-2000.</Paragraph>
    <Paragraph position="3"> Finally, an important issue is the comparison of the results obtained in our experiments to those generated by alternative techniques proposed by other researchers. This will allow the deduction of more accurate conclusions regarding the strengths and the weaknesses of the research strategies.</Paragraph>
    <Paragraph position="4"> In this article, ongoing research on register and individual style categorisation of written Modern Greek has been reported. A system has been proposed for the automatic register categorisation of corpora in Modern Greek, exploiting the highly inflectional nature of the language. The results have been obtained with a relatively constrained set of registers; however, their recognition accuracy is remarkably high, exceeding 98% with an unseeded clustering approach using between 3 and 6 clusters.</Paragraph>
    <Paragraph position="5"> On the front of individual style categorisation, a discrimination rate of over 80% was achieved for five speakers within the Greek Parliament register. Morphological variables were shown to be of less importance to this task, while lexical and structural variables seemed to take over. We are planning to introduce several new lexical and structural variables in order to achieve better discrimination rates and to determine discriminating features of the different styles.</Paragraph>
  </Section>
</Paper>