<?xml version="1.0" standalone="yes"?>
<Paper uid="J00-4001">
  <Title>Automatic Text Categorization in Terms of Genre and Author</Title>
  <Section position="3" start_page="472" end_page="474" type="metho">
    <SectionTitle>
2. Current Trends in Stylometry
</SectionTitle>
    <Paragraph position="0"> The main feature that characterizes both text genre detection and authorship attribution studies is the selection of the most appropriate measures, namely, those that reflect the style of the writing. Various sets have been proposed in the literature. In this section, we classify the most popular of the proposed style markers, taking into account the information required for their calculation rather than the task they have been applied to.</Paragraph>
    <Section position="1" start_page="472" end_page="472" type="sub_section">
      <SectionTitle>
2.1 Token-Level Measures
</SectionTitle>
      <Paragraph position="0"> The simplest approach considers the sample text as a set of tokens grouped in sentences. Typical measures of this category are word count, sentence count, character per word count, and punctuation marks count. Such features have been widely used in both text genre detection and authorship attribution research since they can be easily detected and computed. It is worth noting that the first pioneering works in authorship attribution, when no powerful computational systems were available, were based exclusively on these measures. For example, Morton (1965) used sentence length for testing the authorship of Greek prose, Brinegar (1963) adopted word length measures, and Brainerd's (1974) approach was based on distribution of syllables per word. Although such measures seemed to work in specific cases, they became subject to heavy criticism for their lack of generality (Smith 1983, 1985).</Paragraph>
    </Section>
    <Section position="2" start_page="472" end_page="473" type="sub_section">
      <SectionTitle>
2.2 Syntactic Annotation
</SectionTitle>
      <Paragraph position="0"> The use of measures related to syntactic annotation of the text is very common in text genre detection. Such measures provide very useful information for the exploration  Computational Linguistics Volume 26, Number 4 of the characteristics of style (Biber 1995). Typical paradigms are passive count, nominalization count, and counts of the frequency of various syntactic categories (e.g., part-of-speech tags). Recently, syntactic information has also been applied to authorship attribution. Specifically, Baayen, Van Halteren, and Tweedie (1996) used frequencies of occurrence of rewrite rules as they appear in a syntactically annotated corpus and proved that they perform better than word frequencies. Their calculation requires tagged or parsed text, however. Current NLP tools are not able to provide accurate calculation results for many of the previously proposed style markers. In the study of register variation conducted by Biber (1995), a subset of the measures (i.e., the simplest ones) was calculated by computational tools and the remaining were counted manually. Additionally, the automatically acquired measures were counterchecked manually. Many researchers, therefore, try to avoid the use of features related to syntactic annotation in order to avoid such problems (Kessler, Nunberg, and Sch~itze 1997). As a result, the recent advances in computational linguistics have not notably affected research in computational stylistics.</Paragraph>
    </Section>
    <Section position="3" start_page="473" end_page="473" type="sub_section">
      <SectionTitle>
2.3 Vocabulary Richness
</SectionTitle>
      <Paragraph position="0"> Various measures have been proposed for capturing the richness or the diversity of the vocabulary of a text and they have been applied mainly to authorship attribution studies. The most typical measure of this category is the type-token ratio V/N, where V is the size of the vocabulary of the sample text, and N is the number of tokens of the sample text. Similar features are the hapax legomena (i.e., words occurring once in the sample text) and the dislegomena (i.e., words occurring twice in the sample text). Since text length dramatically affects these features, many researchers have proposed functions of these features that they claim are text length independent (Honor6 1979; Yule 1944; Sichel 1975). Additionally, instead of using a single measure, some researchers have used a set of such vocabulary richness functions in combination with multivariate statistical techniques to achieve better results in authorship attribution (Holmes 1992). In general, these measures are not computationally expensive. However, according to results of recent studies, the majority of the vocabulary richness functions are highly text length dependent and quite unstable for texts shorter than 1,000 words (Tweedie and Baayen 1998).</Paragraph>
    </Section>
    <Section position="4" start_page="473" end_page="474" type="sub_section">
      <SectionTitle>
2.4 Common Word Frequencies
</SectionTitle>
      <Paragraph position="0"> Instead of using vocabulary distribution measures, some researchers have counted the frequency of occurrence of individual words in the sample text. Such counts are a reliable discriminating factor (Karlgreen and Cutting 1994; Kessler, Nunberg, and Schi~tze 1997) and have been applied to many works in text genre detection. Their calculation is simple, but nontrivial effort is required for the selection of the most appropriate words for a given problem. Morever, the words that best distinguish a given group of authors cannot be applied to a different group of authors with the same success (Holmes and Forsyth 1995). Oakman (1980) notes: &amp;quot;The lesson seems clear not only for function words but for authorship word studies in general: particular words may work for specific cases such as 'The Federalist Papers' but cannot be counted on for other analyses&amp;quot; (p. 28). Furthermore, the results of such studies are highly language dependent.</Paragraph>
      <Paragraph position="1"> Michos et al. (1996) introduce the idea of grouping certain words in categories, such as idiomatic expressions, scientific terminology, formal words, and so on. Although this solution is language independent, it requires the construction of a complicated computational mechanism for the automated detection of the categories in the sample text.</Paragraph>
      <Paragraph position="2"> Alternatively, the use of sets of common high-frequency words (typically 30 or 50 words) has been applied mainly to authorship attribution studies (Burrows 1987).</Paragraph>
      <Paragraph position="3">  Stamatatos, Fakotakis, and Kokkinakis Text Categorization NCPtoo, A ,ys,s I i measures J L. measures j Figure 1 The proposed method.</Paragraph>
      <Paragraph position="4"> The application of a principal components analysis on the frequencies of occurrence of the most frequent words achieved remarkable results in plotting the texts in the space of the first two principal components, for a wide variety of authors (Burrows 1992). This approach is language independent and computationally inexpensive. Various additional restrictions to this basic method have been proposed (e.g., separation of common homographic forms, removal of proper names from the most frequent word list, etc.), aimed at improving its performance. For a fully automated system, such restrictions require robust and accurate NLP tools.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="474" end_page="478" type="metho">
    <SectionTitle>
3. The Proposed Method
</SectionTitle>
    <Paragraph position="0"> Our method attempts to exploit already existing NLP tools for the extraction of stylistic information. To this end, we use two types of measures, as can be seen in Figure 1: * measures relevant to the actual output of the NLP tool (i.e., usually tagged or parsed text), and * measures relevant to the particular methodology by which the NLP tool analyzes the text (analysis-level measures).</Paragraph>
    <Paragraph position="1"> Thus, the set of style markers is adapted to a specific, already existing NLP tool, taking into account its particular properties. Analysis-level measures capture useful stylistic information without additional cost. The NLP tool is not considered a black box. Therefore, full access to its source code is required in order to define and measure analysis-level style markers. Moreover, tool-specific knowledge, rather than language-specific knowledge, is required for the definition of such measures. In other words, researchers using this approach can define analysis-level measures based on their deep understanding of a particular NLP tool even if they are not familiar with the natural language to which the methodology is to be applied.</Paragraph>
    <Paragraph position="2"> To illustrate the proposed method, we apply it to Modern Greek using the SCBD, an existing NLP tool able to detect sentence and chunk boundaries in unrestricted text, as described in the next section. In addition to a set of easily computable features (i.e., token-level and syntax-level measures) provided by the actual output of the SCBD,  Computational Linguistics Volume 26, Number 4 we use a set of analysis-level features, i.e., measures that represent the way in which the input text has been analyzed by the SCBD.</Paragraph>
    <Paragraph position="3"> The particular analysis-level style markers can be calculated only when this specific computational tool is utilized. However, the SCBD is a general-purpose tool and was not designed for providing stylistic information exclusively. Thus, any NLP tool (e.g., part-of-speech taggers, parsers, etc.) can provide similar measures. The appropriate analysis-level style markers have to be defined according to the methodology used by the tool in order to analyze the text. For example, some similar measures have been used in stylistic experiments in information retrieval on the basis of a robust parser built for information retrieval purposes (Strzalkowski 1994). This parser produces trees to represent the structure of the sentences that compose the text. However, it is set to &amp;quot;skip&amp;quot; or surrender attempts to parse clauses after reaching a time-out threshold. When the parser skips, it notes that in the parse tree. The measures proposed by Karlgren (1999) as indicators of clausal complexity are the average parse tree depth and the number of parser skips per sentence, which in essence are analysis-level style markers.</Paragraph>
    <Paragraph position="4"> 4. Style Markers for Modem Greek As mentioned above, the subset of style markers used for Modern Greek depends on the text analysis by the specific NLP tool, the SCBD. Thus, before describing the set of style markers we used, we briefly present the main features of the SCBD.</Paragraph>
    <Section position="1" start_page="475" end_page="476" type="sub_section">
      <SectionTitle>
4.1 Description of the SCBD
</SectionTitle>
      <Paragraph position="0"> The SCBD is a text-processing tool able to deal with unrestricted Modern Greek text.</Paragraph>
      <Paragraph position="1"> No manual preprocessing is required. It performs the following procedures: Sentence boundary detection: The following punctuation marks are considered potential sentence boundaries: period, exclamation point, question mark, and ellipsis. A set of automatically acquired disambiguation rules (Stamatatos, Fakotakis, and Kokkinakis 1999) is applied to every potential sentence boundary in order to locate the actual sentence boundaries. These rules utilize neither lexicons with specialized information nor abbreviation lists.</Paragraph>
      <Paragraph position="2"> Chunk boundary detection: Intrasentential phrase detection is achieved through multiple-pass parsing making use of an approximately 450-keyword lexicon (i.e., closed-class words such as articles and prepositions) and a 300-suffix lexicon containing the most common suffixes of Modern Greek words. Initially, using the suffix lexicon, a set of morphological descriptions is assigned to any word of the sentence not included in the keyword lexicon. If the suffix of a word does not match any of the entries of the suffix lexicon, then no morphological description is assigned to this word. It is marked as a special word and is not ignored in subsequent analysis. Then, each parsing pass (five passes are performed) analyzes a part of the sentence, based on the results of the previous passes, and the remaining part is kept for the subsequent passes. In general, the first passes try to detect simple cases that are easily recognizable, while the last passes deal with more complicated ones. Cases that are not covered by the disambiguation rules remain unanalyzed. The detected chunks are noun phrases (NPs),  prepositional phrases (PPs), verb phrases (VPs), and adverbial phrases (APs). In addition, two chunks are usually connected by a sequence of conjunctions (CONs).</Paragraph>
      <Paragraph position="3"> The SCBD is able to cope rapidly with any piece of text, even ill-formed text, and its performance is comparable to more sophisticated systems that require more complicated resources. Figure 2 gives an overview of the SCBD. An example of its output for a sample text, together with a rough English translation (included in parentheses), is given below (note that special words, those that do not match with any of the stored suffixes, are marked with an asterisk): VP\[&amp;eu 0gAco uoz pg{oo (I don't want to pour)\] NP\[A&amp;&amp; (oil)\] PP\[crrr/9~wr~&amp; (in the fire)\] CON\[of&amp;kale (but)\] VP\[rr~C/re4a; (I believe)\] CON\[drL (that)\] NP\[r/ err~fldpvu~rr/(the encumbrance)\] PP\[o-rou rrpo~rcoko7Lcr#6 (of the budget)\] PP\[ozrr6 rovC/ flov&amp;evrgC/ (by the deputies)\] VP\[&amp;u #rcopeg uce rrpoe#erpeirc~L (can not be measured)\] #6uo (merely) PP\[#e rc~ 5*&amp;C/.*6px. rcou c~uc,Spo#~n&amp;u (with the 5 bil. Dr. of the retroactive salaries)\] troy (that) NP\[rc~po~u re&amp;evrcdc~ (they took lately)\] VP\[rrponc~&amp;cburc~C/ (causing)\] NP\[rr/(Sva~op&amp;~ r~\]g ~oLu~C/ 7v,&amp;#r/~ (the discontent of the public opinion)\].</Paragraph>
      <Paragraph position="4"> It is worth noting that we did not modify the structure of the SCBD in order to calculate style markers, aside from adding simple functions for their measurement.</Paragraph>
    </Section>
    <Section position="2" start_page="476" end_page="478" type="sub_section">
      <SectionTitle>
4.2 Stylometric Levels
</SectionTitle>
      <Paragraph position="0"> Our aim during the definition of the set of style markers was to take full advantage of the analysis of the text by the SCBD. To this end, we included measures relevant to the actual output of this tool as well as measures relevant to the methodology used by the SCBD to analyze the text. Specifically, the proposed set of style markers comprises three levels: * Token Level: The sample text is considered as a set of tokens grouped in sentences. This level is based on the output of the sentence boundary  text has been analyzed by the particular methodology of the SCBD are included here:</Paragraph>
      <Paragraph position="2"> detected keywords/words special words/words assigned morphological descriptions/words chunks' morphological descriptions/total detected chunks words remaining unanalyzed after pass 1/words words remaining unanalyzed after pass 2/words words remaining unanalyzed after pass 3/words words remaining unanalyzed after pass 4/words words remaining unanalyzed after pass 5/words It is clear that the analysis level contains extremely useful stylistic information. For example, M14 and M15 are valuable markers that indicate of the percentage of high-frequency words and the percentage of unusual words included in the sample text, respectively. M16 is a useful indicator of the morphological ambiguity of the words and M17 indicates the degree to which this ambiguity has been resolved. Moreover, markers M18 to M22 indicate the syntactic complexity of the text. Since the first parsing passes analyze the most common cases, it is easy to understand that a large part of a syntactically complicated text would not be analyzed by them (e.g., high values for M18, M19, and M20 in conjunction with low values for M21 and M22). Similarly, a syntactically simple text would be characterized by low values for M18, M19, and M20.  Stamatatos, Fakotakis, and Kokkinakis Text Categorization Note that all the proposed style markers are produced as ratios of two relative measures in order for them to be stable over the text length. However, they are not standardized.</Paragraph>
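A minimal sketch of how the analysis-level markers M14 to M22 could be computed as ratios, assuming the counts below were collected during the SCBD's analysis of a text; the field names of the stats dictionary are illustrative, not part of the actual tool:

    def analysis_level_markers(stats):
        # stats: counts gathered while the tool analyzed the text (names are illustrative)
        words = max(stats["words"], 1)
        markers = {
            "M14": stats["detected_keywords"] / words,
            "M15": stats["special_words"] / words,
            "M16": stats["morphological_descriptions"] / words,
            "M17": stats["chunk_morphological_descriptions"] / max(stats["detected_chunks"], 1),
        }
        for pass_no in range(1, 6):                    # M18..M22: unanalyzed after passes 1..5
            markers["M%d" % (17 + pass_no)] = stats["unanalyzed_after_pass"][pass_no] / words
        return markers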
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML