<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2001">
  <Title>Large linguistically-processed Web corpora for multiple languages</Title>
  <Section position="3" start_page="0" end_page="87" type="metho">
    <SectionTitle>
2 Crawl seeding and crawling
</SectionTitle>
    <Paragraph position="0"> We would like a &amp;quot;balanced&amp;quot; resource, containing a range of types of text corresponding, to some degree, to the mix of texts we find in designed linguistic corpora (Atkins et al., 1992), though also including text types found on the Web which were not anticipated in linguists' corpus design discussions. We do not want a &amp;quot;blind&amp;quot; sample dominated by product listings, catalogues and computer scientists' bulletin boards. Our pragmatic solution is to query Google through its API service for random pairs of randomly selected content words in the target language. In preliminary experimentation, we found that single word queries yielded many inappropriate pages (dictionary definitions of the word, top pages of companies with the word in their name), whereas combining more than two words retrieved pages with lists of words, rather than collected text.</Paragraph>
    <Paragraph position="1"> Ueyama (2006) showed how queries for words sampled from traditional written sources such as newspaper text and published essays tend to yield &amp;quot;public sphere&amp;quot; pages (online newspaper, government and academic sites), whereas basic vocabulary/everyday life words tend to yield &amp;quot;personal&amp;quot; pages (blogs, bulletin boards). Since we wanted both types, we obtained seed URLs with queries  for words from both kinds of sources. For German, we sampled 2000 mid-frequency words from a corpus of the S&amp;quot;uddeutsche Zeitung newspaper and paired them randomly. Then, we found a basic vocabulary list for German learners,3 removed function words and particles and built 653 random pairs. We queried Google via its API retrieving maximally 10 pages for each pair. We then collapsed the URL list, insuring maximal sparseness by keeping only one (randomly selected) URL for each domain, leaving a list of 8626 seed URLs.</Paragraph>
    <Paragraph position="2"> They were fed to the crawler.</Paragraph>
    <Paragraph position="3"> The crawls are performed using the Heritrix crawler,4 with a multi-threaded breadth-first crawling strategy. The crawl is limited to pages whose URL does not end in one of several suffixes that cue non-html data (.pdf, .jpeg, etc.)5 For German, the crawl is limited to sites from the .de and .at domains. Heritrix default crawling options are not modified in any other respect. We let the German crawl run for ten days, retrieving gzipped archives (the Heritrix output format) of about 85GB.</Paragraph>
  </Section>
  <Section position="4" start_page="87" end_page="88" type="metho">
    <SectionTitle>
3 Filtering
</SectionTitle>
    <Paragraph position="0"> We undertake some post-processing on the basis of the Heritrix logs. We identify documents of mime type text/html and size between 5 and 200KB. As observed by Fletcher (2004) very small documents tend to contain little genuine text (5KB counts as &amp;quot;very small&amp;quot; because of the html code overhead) and very large documents tend to be lists of various sorts, such as library indices, store catalogues, etc. The logs also contain sha1 fingerprints, allowing us to identify perfect duplicates. After inspecting some of the duplicated documents (about 50 pairs), we decided for a drastic policy: if a document has at least one duplicate, we discard not only the duplicate(s) but also the document itself. We observed that, typically, such documents came from the same site and were warning messages, copyright statements and similar, of limited or no linguistic interest. While the strategy may lose some content, one of our general principles is that, given how vast the Web is, we can afford to privilege precision over recall.</Paragraph>
    <Paragraph position="1"> All the documents that passed the pre-filtering  documents in other formats, e.g., Adobe pdf.</Paragraph>
    <Paragraph position="2"> stage are run through a perl program that performs 1) boilerplate stripping 2) function word filtering 3) porn filtering.</Paragraph>
    <Paragraph position="3"> Boilerplate stripping By &amp;quot;boilerplate&amp;quot; we mean all those components of Web pages which are the same across many pages. We include stripping out HTML markup, javascript and other non-linguistic material in this phase. We aimed to identify and remove sections of a document that contain link lists, navigational information, fixed notices, and other sections poor in human-produced connected text. For purposes of corpus construction, boilerplate removal is critical as it will distort statistics collected from the corpus.6 Weadopted theheuristic used inthe Hyppia project BTE tool,7: content-rich sections of a page will have a low html tag density, whereas boilerplate is accompanied by a wealth of html (because of special formatting, newlines, links, etc.) The method is based on general properties of Web documents, so is relatively independent of language and crawling strategy.</Paragraph>
    <Paragraph position="4"> Function word and pornography filtering Connected text in sentences reliably contains a high proportion of function words (Baroni, to appear), so, if a page does not meet this criterion we reject it. The German function word list contains 124 terms. We require that a minimum of 10 types and 30 tokens appear in a page, with a ratio of function words to total words of at least one quarter. The filter also works as a simple language identifier.8 Finally, we use a stop list of words likely to occur in pornographic Web pages, not out of prudery, but because they tend to contain randomly generated text, long keyword lists and other linguistically problematic elements. We filter out documents that have at least three types or ten tokens from a list of words highly used in pornography.</Paragraph>
    <Paragraph position="5"> The list was derived from the analysis of pornographic pages harvested in a previous crawl. This isnot entirely satisfactory, since some ofthe words 6We note that this phase currently removes the links from the text, so we can no longer explore the graph structure of the dataset. In future we may retain link structure, to support research into the relation between it and linguistic characteristics. null  machine-generated text (typically produced as part of search engine ranking scams or for other shady purposes); sometimes this appears to have been generated with a bigram language model, and thus identifying it with automated techniques is far from trivial.</Paragraph>
    <Paragraph position="6">  in the list, taken in isolation, are wholly innocent (fat, girls, tongue, etc.) We shall revisit the strategy in due course.</Paragraph>
    <Paragraph position="7"> This filtering took 5 days and resulted in a version of the corpus containing 4.86M documents for a total of 20GB of uncompressed data.</Paragraph>
  </Section>
  <Section position="5" start_page="88" end_page="88" type="metho">
    <SectionTitle>
4 Near-duplicate detection
</SectionTitle>
    <Paragraph position="0"> We use a simplified version of the &amp;quot;shingling&amp;quot; algorithm (Broder et al., 1997). For each document, after removing all function words, we take fingerprints of a fixed number s of randomly selected ngrams; then, for each pair of documents, we count the number of shared n-grams, which can be seen as an unbiased estimate of the overlap between the two documents (Broder et al., 1997; Chakrabarti, 2002). We look for pairs of documents sharing more than t n-grams, and we discard one of the two.</Paragraph>
    <Paragraph position="1"> After preliminary experimentation, we chose to extract 25 5-grams from each document, and to treat as near-duplicates documents that shared at least two of these 5-grams. Near-duplicate spotting on the German corpus took about 4 days.</Paragraph>
    <Paragraph position="2"> 2,466,271 near-duplicates were removed. The corpus size decreased to 13GB. Most of the processing time was spent in extracting the n-grams and adding the corresponding fingerprints to the data-base (which could be parallelized).</Paragraph>
  </Section>
  <Section position="6" start_page="88" end_page="88" type="metho">
    <SectionTitle>
5 Part-of-speech tagging/lemmatization and post-annotation cleaning
</SectionTitle>
    <Paragraph position="0"> and post-annotation cleaning We performed German part-of-speech tagging and lemmatization with TreeTagger.9 Annotation took 5 days. The resulting corpus contains 2.13B words, or 34GB of data including annotation.</Paragraph>
    <Paragraph position="1"> After inspecting various documents from the annotated corpus, we decided to perform a further round of cleaning. There are two reasons for this: first, we can exploit the annotation to find other anomalous documents, through observing where the distribution of parts-of-speech tags is very unusual and thus not likely to contain connected text. Second, the TreeTagger was not trained on Web data, and thus its performance on texts that are heavy on Web-like usage (e.g., texts all in lowercase, colloquial forms of inflected verbs, etc.) is dismal. While a better solution to this second problem would be to re-train the tagger on Web</Paragraph>
    <Paragraph position="3"> data (ultimately, the documents displaying the second problem might be among the most interesting ones to have in the corpus!), for now we try to identify the most problematic documents through automated criteria and discard them. The cues we used included the number of words not recognised by the lemmatizer; the proportion of words with upper-case initial letters; proportion of nouns, and proportion of sentence markers.</Paragraph>
    <Paragraph position="4"> After this further processing step, the corpus contains 1,870,259 documents from 10818 different domains, and its final size is 1.71 billion tokens (26GB of data, with annotation). The final size of the Italian corpus is 1,875,337 documents and about 1.9 billion tokens.</Paragraph>
  </Section>
  <Section position="7" start_page="88" end_page="88" type="metho">
    <SectionTitle>
6 Indexing and Web user interface
</SectionTitle>
    <Paragraph position="0"> We believe that matters of efficient indexing and user friendly interfacing will be crucial to the success of our initiative, both because many linguists will lack the relevant technical skills to write their own corpus-access routines, and because we shall not publicly distribute the corpora for copyright reasons; an advanced interface that allows linguists to do actual research on the corpus (including the possibility of saving settings and results across sessions) will allow us to make the corpus widely available while keeping it on our servers.10 We are using the Sketch Engine,11 a corpus query tool which has been widely used in lexicography and which supports queries combining regular expressions and boolean operators over words, lemmas and part-of-speech tags.</Paragraph>
  </Section>
  <Section position="8" start_page="88" end_page="89" type="metho">
    <SectionTitle>
7 Comparison with other corpora
</SectionTitle>
    <Paragraph position="0"> We would like to compare the German Web corpus to an existing &amp;quot;balanced&amp;quot; corpus of German attempting to represent a broad range of genres and topics. Unfortunately, as far as we know no resource of this sort is publicly available (which is one of the reasons why we are interested in developing the German Web corpus in the first instance.) Instead, we use a corpus of newswire articles from the Austria Presse Agentur (APA, kindly provided to us by &amp;quot;OFAI) as our reference 10The legal situation is of course complex. We consider that our case is equivalent to that of other search engines, and that offering linguistically-encoded snippets of pages to researchers does not go beyond the &amp;quot;fair use&amp;quot; terms routinely invoked by search engine companies in relation to Web page caching.</Paragraph>
    <Paragraph position="1">  point. This corpus contains 28M tokens, and, despite its uniformity in terms of genre and restricted thematic range, it has been successfully employed as a general-purpose German corpus in many projects. After basic regular-expression-based normalization and filtering, the APA contains about 500K word types, the Web corpus about 7.4M. There is a large overlap among the 30 most frequent words in both corpora: 24 out of 30 words are shared. The non-overlapping words occurring in the Web top 30 only are function words: sie 'she', ich 'I', werden 'become/be', oder 'or', sind 'are', er 'he'. The words only in the APA list show a bias towards newswire-specific vocabulary (APA, Prozent 'percent', Schluss 'closure') and temporal expressions that are also typical of newswires (am 'at', um 'on the', nach 'after').</Paragraph>
    <Paragraph position="2"> Of the 232,322 hapaxes (words occurring only once) in the APA corpus, 170,328 (73%) occur in the Web corpus as well.12 89% of these APA hapaxes occur more than once in the Web corpus, suggesting how the Web data will help address data sparseness issues.</Paragraph>
    <Paragraph position="3"> Adopting the methodology of Sharoff (2006), we then extracted the 20 words most characteristics of the Web corpus vs. APA and vice versa, based on the log-likelihood ratio association measure. Results are presented in Table 1. The APA corpus has a strong bias towards newswire parlance (acronyms and named entities, temporal expressions, financial terms, toponyms), whereas the terms that come out as most typical of the Web corpus are function words that are not strongly connected with any particular topic or genre. Several of these top-ranked function words mark first and second person forms (ich, du, wir, mir).</Paragraph>
    <Paragraph position="4"> This preliminary comparison both functioned as a &amp;quot;sanity check&amp;quot;, showing that there is consider12Less than 1% of the Web corpus hapaxes are attested in the APA corpus.</Paragraph>
    <Paragraph position="5"> able overlap between our corpus and a smaller corpus used in previous research, and suggested that the Web corpus has more a higher proportion of interpersonal material.</Paragraph>
  </Section>
class="xml-element"></Paper>