<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2803"> <Title>Linguistic features of Italian blogs: literary language</Title> <Section position="3" start_page="11" end_page="13" type="metho"> <SectionTitle> 3 Quantitative analysis of large corpora using search engines </SectionTitle> <Paragraph position="0"> Some linguistic features of blogs can now be probed and measured using search engines. For instance, preliminary searches show that Italian blogs are orthographically much more correct than the Italian web on average, and that they are edited at least as well as online newspapers (Tavosanis in print b). Other indicators, related to the use of &quot;neo-standard&quot; Italian forms (Berruto 1987) in personal pronouns and demonstratives, suggest a kinship between blogs and newspapers (Tavosanis in print a).</Paragraph> <Paragraph position="1"> According to those searches, the main differences between blog posts and newspaper articles are not linked to writing accuracy or to different morphological choices. We can therefore assume, as a working hypothesis, that the main differences between blogs and newspapers relate to lexicon and syntax.</Paragraph> <Paragraph position="2"> The syntactic status of many blogs is probably well represented by the textual samples chosen above (widespread use of suspension points being the most conspicuous feature). However, a close survey of this level can probably be obtained only by encoding a large corpus with syntactic tagging.</Paragraph> <Paragraph position="3"> The lexical features of blogs can instead be studied through simple search engine analysis (see again Tavosanis in print a and b for details of this method). Newspaper editing in Italy, enforced by a strong tradition and dedicated staff, excludes words considered too expressive (apart from those acknowledged by the same tradition: Bonomi 2002). 
Blogs, on the other hand, can include forms taken from every level of linguistic use. We can therefore expect both literary and low forms to be used more in blogs than in newspapers.</Paragraph> <Paragraph position="4"> Two Web corpora were then selected: the web site of the newspaper La Repubblica, indexed and queried through the Google interface (= R), and all the blogs indexed in the beta version of Blogsearch.google.com (= B). Of course, no exact data are available on the size of the two collections or the number of tokens indexed. The two corpora nevertheless seem roughly equal in size: a search for a common word like questo gives 427,000 occurrences in R and 467,000 in B; a search for quello gives 209,000 (R) and 257,608 (B); a search for lui gives 118,000 (R) and 159,970 (B); and so on. Of course, since word frequency is strongly correlated with the style and topic of the texts (for the Italian situation see Bortolini 1971: XIV-XV; Voghera 1993), this assessment cannot be considered an exact estimate. It does, however, give a preliminary quantitative indication.</Paragraph> <Paragraph position="5"> The higher frequency of vulgar words in the B corpus is of course undisputed, since newspaper editing is a strong barrier against this kind of language, and it needs no particular demonstration: for example, we find 30,310 occurrences of the word cazzo in B against 278 in R.</Paragraph> <Paragraph position="6"> It is more difficult to demonstrate the higher frequency of literary language, which in the Italian tradition has a wide and varied lexicon. The abundance of synonyms and the dispersion of forms lead one to focus searches on large groups of &quot;weak&quot; words instead of a limited set of &quot;strong&quot; words.</Paragraph> <Paragraph position="7"> Next, the list of &quot;literary&quot; verbs beginning with the letters b, e and v in the De Mauro (2000) dictionary was selected for analysis. 
This gave 31 (b-), 47 (e-) and 49 (v-) verbs. Many of them also had non-literary uses and/or coincided with other Italian words; therefore the search used only verbs without homographs and for which every meaning recorded in the dictionary was marked at least as &quot;obsolete&quot; (code OB), &quot;literary&quot; (LE) or &quot;bureaucratic&quot; (BU). This left 23 (b-), 28 (e-) and 21 (v-) verbs. The two corpora were then searched for the infinitive forms of the verbs. Many of them did not appear at all: baiare, balbuzzire, ballonzare (1 occurrence in a text written in the dialect of Naples), basciare, benedicere (2 occurrences in two texts written in the dialect of Naples), biancicare, biastemiare, blasmare, bombire, botare, botarsi, bravare, buccinare, bulicare, ebere, ecclissare, educere, enfiare, enfiarsi, escomunicare, escuotere, escusare, esinanire, espedire, esseguire, esterminare, estollere, estollersi, estorre, estruere, estruare, esturbare, esurire, evellere, evenire, vagheggiarsi, vanare, vengiare, vengiarsi, verberare, verdicare, verdire, vernare, verzicare, vilificare and vincire.</Paragraph> <Paragraph position="8"> The search also revealed that one verb marked in the dictionary as &quot;literary&quot; was instead widely used in both corpora: vigilare. While the other forms occurred at most 94 times, the corpora contain 644 occurrences of vigilare, evenly balanced (332 in B, 312 in R). It therefore appears more correct to consider this verb a &quot;common&quot; word, without literary connotations, and to exclude it from further analysis.</Paragraph> <Paragraph position="9"> In a second phase, many forms were excluded from the counts because they turned out to be simple typos or broken forms of different words (e.g., many occurrences of ventare are in fact occurrences of widely used verbs like diventare or inventare, with incorrect spacing). 
Only words for which the likelihood of misspellings seemed low were therefore included in the counts.</Paragraph> <Paragraph position="10"> After this sifting, the forms represented in the corpora occurred as described in Table 2. It therefore seems that some Italian blogs do in fact have a higher proportion of a random selection of literary words than Italian newspapers. Further searches should be able to confirm or refute this finding.</Paragraph> </Section> </Paper>