<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1704">
  <Title>CUCWeb: a Catalan corpus built from the Web</Title>
  <Section position="3" start_page="19" end_page="21" type="metho">
    <SectionTitle>
2 Corpus Constitution
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="19" end_page="20" type="sub_section">
      <SectionTitle>
2.1 Data collection
</SectionTitle>
      <Paragraph position="0"> Our goal was to crawl the portion of the Web related to Spain. Initially, we crawled the set of pages with the suffix .es. However, this domain is not very popular, because it is more expensive than other domains (e.g. the cost of a .com domain is about 15% of that of an .es domain) and because its use is restricted to company names or registered trade marks.3 In a second phase we used a different heuristic: a Web site was considered to be in Spain if either its IP address was assigned to a network located in Spanish territory or its suffix was .es. We found that only 16% of the domains with pages in Spain were under .es.</Paragraph>
      <Paragraph position="1"> The final collection of the data was carried out in September and October 2004, using a commercial piece of software by Akwan (da Silva et al., 1999).4 The actual collection was started by the crawler using as a seed the list of URLs of a Spanish search engine, Buscopio, which was a commercial search engine back in 2000. That list covered the major part of the Web existing in Spain at that time.5 New URLs were extracted from the downloaded pages, and the process continued recursively as long as the pages were in Spain (see above). The crawler downloaded all pages except those with an identical URL; note that http://www.web.es/main/ and http://www.web.es/main/index.html were considered different URLs. We retrieved over 16 million Web pages (corresponding to over 300,000 web sites and 118,000 domains), and processed them to extract links and text. The uncompressed text of the pages amounts to 46 GB, and the metadata generated during the crawl to 3 GB.</Paragraph>
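      <Paragraph> As an illustration of the crawl-scope heuristic just described (a site is considered to be in Spain if its suffix is .es or its IP address belongs to a network in Spanish territory), the following Python sketch shows the decision logic. The geolocation lookup is a hypothetical stand-in for whatever IP-to-network data the crawler actually used; none of these function names come from the original system.
import socket
from urllib.parse import urlparse

def ip_country(ip):
    """Stand-in for an IP-to-country lookup (e.g. a geolocation database).
    The real crawler relied on network allocation data; this illustrative
    stub always answers 'unknown'."""
    return "unknown"

def site_in_spain(url):
    """Crawl-scope heuristic: .es suffix, or IP assigned to a Spanish network."""
    host = urlparse(url).hostname or ""
    if host.endswith(".es"):
        return True
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return False
    return ip_country(ip) == "ES"

# Example: new URLs are followed only while pages stay in Spain.
print(site_in_spain("http://www.web.es/main/"))
      </Paragraph>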
      <Paragraph position="2"> In an initial collection process, a number of difficulties in the characterisation of the Web of Spain were identified, which led to redundancy in the contents of the collection: Parameters to a program inside URL addresses.</Paragraph>
      <Paragraph position="3"> This makes it impossible to adequately separate static and dynamic pages, and may lead to repeatedly crawling pages with the same content.</Paragraph>
      <Paragraph position="4"> (Footnote on the crawling hardware, truncated in the source: ...GHz and with 1.6 GB of RAM under Red-Hat Linux. For the information storage we used a RAID of disks with 1.8 TB of total capacity, although the space used by the collection is about 50 GB.)</Paragraph>
      <Paragraph position="5"> Mirrors (geographically distributed copies of the same contents to ensure network efficiency).</Paragraph>
      <Paragraph position="6"> Normally, these replicas are entire collections with a large volume, so that there are many sites with the same contents, and these are usually large sites. The replicated information is estimated at between 20% and 40% of the total Web contents (Baeza-Yates et al., 2005).</Paragraph>
      <Paragraph position="7"> Spam on the Web (actions oriented to deceive search engines and to give some pages a higher ranking than they deserve in search results). Recognizing spam pages is an active research area, and it is estimated that over 8% of what is indexed by search engines is spam (Fetterly et al., 2004). One of the strategies that induces redundancy is to automatically generate pages in order to improve the score they obtain in link-based ranking algorithms.</Paragraph>
      <Paragraph position="8"> DNS wildcarding (domain name spamming).</Paragraph>
      <Paragraph position="9"> Some link analysis ranking functions assign less importance to links between pages in the same Web site. Unfortunately, this has motivated spammers to use several different Web sites for the same contents, usually through configuring DNS servers to assign hundreds or thousands of site names to the same IP address. Spain's Web seems to be quite populated with domain name spammers: 24 out of the 30 domains with the highest number of Web sites are configured with DNS wildcarding (Baeza-Yates et al., 2005).</Paragraph>
      <Paragraph position="10"> Most of the spam pages were under the .com top-level domain. We manually checked the domains with the largest number of sites and pages and banned a list of them, mostly sites containing pornography or collections of links without information content. This is not a perfect solution against spam, but it generates significant savings in terms of bandwidth and storage, and allows us to spend more resources on content-rich Web sites. We also restricted the crawler to download a maximum of 400 pages per site, except for the Web sites within .es, which had no pre-established limit.</Paragraph>
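      <Paragraph> A minimal sketch of the download policy described above: banned (mostly spam) domains are skipped, and sites outside .es are capped at 400 pages. The function and the ban-list data structure are illustrative, not taken from the Akwan crawler.
MAX_PAGES_NON_ES = 400                      # per-site cap for sites outside .es
BANNED_DOMAINS = {"spam-example.com"}       # manually compiled ban list (illustrative)

def may_download(site, pages_already_fetched):
    """Decide whether the crawler should fetch another page from `site`."""
    domain = ".".join(site.split(".")[-2:])
    if domain in BANNED_DOMAINS:
        return False
    if site.endswith(".es"):
        return True                          # no pre-established limit for .es
    return MAX_PAGES_NON_ES > pages_already_fetched
      </Paragraph>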
    </Section>
    <Section position="2" start_page="20" end_page="21" type="sub_section">
      <SectionTitle>
2.2 Data processing
</SectionTitle>
      <Paragraph position="0"> The processing of the data to obtain the Catalan corpus consisted of the following steps: language classification, linguistic filtering and processing, duplicate filtering and corpus indexing. This section details each of these aspects.</Paragraph>
      <Paragraph position="1"> We built a language classifier with the Naive Bayes classifier of the Bow system (McCallum, 1996). The system was trained with corpora corresponding to the 4 official languages in Spain (Spanish, Catalan, Galician and Basque), as well as to the other 6 most frequent languages on the Web (Anonymous, 2000): English, German, French, Italian, Portuguese, and Dutch.</Paragraph>
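      <Paragraph> The classifier used in CUCWeb is the Naive Bayes classifier of the Bow toolkit; the sketch below illustrates the same idea with character n-gram features and scikit-learn's MultinomialNB on toy data. It shows the technique, not the actual Bow configuration or the training corpora used in the project.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; in CUCWeb, corpora for the 4 official languages of Spain
# plus the 6 most frequent Web languages were used instead.
train_texts = ["aquest és un text en català", "este es un texto en español",
               "this is an english text", "dies ist ein deutscher text"]
train_langs = ["ca", "es", "en", "de"]

# Character n-grams are robust for short, noisy Web text.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(train_texts, train_langs)

print(model.predict(["una pàgina web escrita en català"]))  # expected: ['ca']
      </Paragraph>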
      <Paragraph position="2"> 38% of the collection could not be reliably classified, mostly because of pages without enough text, for instance pages containing only images or only lists of proper nouns. Within the classified pages, Catalan was the third most used language (8% of the collection). As expected, most of the collection was in Spanish (52%), but English also accounted for a large share (31%). The contents in Galician and Basque comprise only about 2% of the pages.</Paragraph>
      <Paragraph position="3"> We wanted to use the Catalan portion as a corpus for NLP and linguistic studies. We were not interested in full coverage of Web data, but in quality. Therefore, we filtered it using a computational dictionary and some heuristics in order to exclude documents with little linguistic relevance (e.g. address lists) or with a lot of noise (programming code, multilingual documents). In addition, we applied a simple duplicate filter: web pages with very similar content (as determined by a hash of the processed text) were considered duplicates.</Paragraph>
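      <Paragraph> A minimal sketch of such a hash-based duplicate filter, assuming that "a hash of the processed text" means hashing a lightly normalised version of the document body; the exact normalisation used in CUCWeb is not specified in the paper.
import hashlib
import re

def text_fingerprint(text):
    """Hash a normalised version of the processed text: lowercase,
    drop digits and collapse whitespace, so trivial variations still collide."""
    normalised = re.sub(r"\s+", " ", re.sub(r"\d+", "", text.lower())).strip()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

def filter_duplicates(docs):
    """Keep only the first document seen for each fingerprint."""
    seen, unique = set(), []
    for doc_id, text in docs:
        fp = text_fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique.append((doc_id, text))
    return unique
      </Paragraph>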
      <Paragraph position="4"> The sizes of the corpus (in documents and words6) after each of the processes are depicted in Table 1: we discard almost 60% of the original documents. The final corpus consists of 166 million words from 204 thousand documents. (Footnote 6: word counts do not include punctuation marks.)</Paragraph>
      <Paragraph position="5"> Its distribution in terms of top-level domains is shown in Table 2, and the 10 biggest sites in Table 3.</Paragraph>
      <Paragraph position="6"> Note that the .es domain covers almost half of the pages and .com a quarter, but .org and .net also have a quite large share of the pages.</Paragraph>
      <Paragraph position="7"> As for the biggest sites, they give an idea of the content of CUCWeb: they mainly correspond to university and institutional sites. A similar distribution can be observed for the 50 biggest sites, which will determine the kind of language found in CUCWeb.</Paragraph>
      <Paragraph position="8"> The corpus was further processed with CatCG (Àlex Alsina et al., 2002), a POS-tagger and shallow parser for Catalan built with the Connexor Constraint Grammar formalism and tools.7 CatCG provides part of speech, morphological features (gender, number, tense, etc.) and syntactic information. The syntactic information is a functional tag (e.g. subject, object, main verb) annotated at word level.</Paragraph>
      <Paragraph position="9"> We wanted the corpus not only to be an in-house resource for NLP purposes, but also to be accessible to a large number of users. To that end, we indexed it using the IMS Corpus Workbench tools8 and built a web interface to it (see Section 3.1). The CWB includes facilities for indexing and searching corpora, as well as a special module for web interfaces. However, the size of the corpus is above the advisable limit for these tools.9 Therefore, we divided it into 4 subcorpora and indexed each of them separately. The search engine for the corpus is CQP (the Corpus Query Processor, one of the modules of the CWB). (Footnote 9, truncated in the source: ...a corpus has to be split into several parts, a good rule of thumb is to split it in 100M word parts; in his words, "depending on various factors such as language, complexity of annotations and how much RAM you have, a larger or smaller size may give better overall performance.")</Paragraph>
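      <Paragraph> The paper does not say how the 4-way split was computed; the sketch below shows one simple possibility: cutting the ordered document list into 4 contiguous chunks of roughly equal word counts, so that any document ordering (for instance the PageRank ordering mentioned below) is preserved. The function and its parameters are illustrative assumptions.
def split_corpus(docs, n_parts=4):
    """Split an ordered list of (doc_id, word_count) pairs into n_parts
    contiguous chunks of roughly equal word counts, preserving the order."""
    total = sum(wc for _, wc in docs)
    parts, current, cumulative = [], [], 0
    for doc_id, wc in docs:
        current.append(doc_id)
        cumulative += wc
        boundary = (len(parts) + 1) * total / n_parts
        if cumulative >= boundary and n_parts - 1 > len(parts):
            parts.append(current)
            current = []
    parts.append(current)          # the last chunk takes the remainder
    return parts

# Example: 10 toy documents with increasing word counts, 4 subcorpora.
print(split_corpus([("doc%d" % i, 100 + 10 * i) for i in range(10)]))
      </Paragraph>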
      <Paragraph position="10"> Since CQP provides sequential access to documents, we ordered the corpus documents by PageRank, so that they are retrieved according to their popularity on the Internet.</Paragraph>
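      <Paragraph> As a reminder of what the ordering is based on, here is a small power-iteration sketch of PageRank over a link graph such as the one extracted during the crawl. The damping factor of 0.85 and the fixed number of iterations are the usual textbook defaults, not values reported in the paper.
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank.  `links` maps each page to the list of
    pages it links to; mass leaked by dangling pages is simply ignored in
    this simplified sketch."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    if target in new_rank:
                        new_rank[target] += share
        rank = new_rank
    return rank

# Toy link graph; documents would then be sorted by decreasing rank.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
      </Paragraph>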
    </Section>
  </Section>
  <Section position="4" start_page="21" end_page="22" type="metho">
    <SectionTitle>
3 Corpus Exploitation
</SectionTitle>
    <Paragraph position="0"> CUCWeb is being exploited in two ways: on the one hand, data can be accessed through a web interface (Section 3.1). On the other hand, the annotated data can be exploited by theoretical or computational linguists, lexicographers, translators, etc. (Section 3.2).</Paragraph>
    <Section position="1" start_page="21" end_page="22" type="sub_section">
      <SectionTitle>
3.1 Corpus interface
</SectionTitle>
      <Paragraph position="0"> Despite the wide use of corpora in NLP, few interfaces have been built, and still fewer are flexible enough to be of interest to linguistic researchers. As for Web data, some initiatives exist (WebCorp10, the Linguist's Search Engine11, KWiCFinder12), but they are meta-interfaces to search engines. For Catalan, there is a web interface to the CTILC corpus13, but it only allows single-word searches, of which a maximum of 50 hits can be viewed, and it is not possible to download search results.</Paragraph>
      <Paragraph position="1"> From the beginning of the project our aim was to create a corpus which could be useful for both the NLP community and for a more general audience with an interest in the Catalan language.</Paragraph>
      <Paragraph position="2"> This includes linguists, lexicographers and language teachers.</Paragraph>
      <Paragraph position="4"> We expected the latter kind of user not to be familiar with corpus searching strategies and corpus interfaces, at least not to a large extent. Therefore, we aimed at creating a user-friendly web interface which should be useful for both non-trained and experienced users.14 In addition, we wanted the interface to support not only example searches but also statistical information, such as co-occurrence frequency, which is of use in lexicographical work and potentially also in language teaching or learning. There are two web interfaces to the corpus: an example search interface and a statistics interface. Furthermore, since the flexibility and expressiveness of the searches potentially conflicts with user-friendliness, we decided to divide the example search interface into two modalities: a simple search mode and an expert search mode.</Paragraph>
      <Paragraph position="5"> The simple mode allows for searches of words, lemmata or word strings. The search can be restricted to specific parts of speech or syntactic functions. For instance, a user can search for an ambiguous word like Catalan "la" (masculine noun, feminine determiner or personal pronoun) and restrict the search to pronouns, or look for the word "traduccions" ('translations') functioning as subject. The advantage of the simple mode is that an untrained person can use the corpus almost without the need to read instructions. If new users find CUCWeb useful, we expect that the motivation to learn how to create advanced corpus queries will arise.</Paragraph>
      <Paragraph position="6"> The expert mode is somewhat more complex but very flexible. A string of up to 5 word units can be searched, where each unit may be a word form, lemma, part of speech, syntactic function or a combination of any of those. If a part of speech is specified, further morphological information is displayed, which can also be queried.</Paragraph>
      <Paragraph position="7"> Each word unit can be marked as optional or repeated, which corresponds to the regular-expression operators of optionality and repetition. Within each word unit each information field may be negated, allowing for exclusions in searches, e.g. requiring a unit not to be a noun or not to correspond to a certain lemma. This use of operators gives the expert mode an expressiveness close to regular grammars, and exploits almost all querying functionalities of CQP, the search engine.</Paragraph>
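      <Paragraph> To make the expert mode more concrete, the sketch below builds CQP query strings of the kind such an interface could send to the search engine: one bracketed token pattern per word unit, with negation, optionality and repetition. The attribute names (word, lemma, pos, func), the tag value ADJ and the helper itself are illustrative assumptions; the paper does not describe the interface internals at this level of detail.
def unit_pattern(attr, value, negate=False, optional=False, repeated=False):
    """Build one CQP token pattern over a single attribute,
    e.g. [lemma="dona"], [pos!="N.*"] or an optional/repeated variant.
    Combining several attributes in one unit would use CQP's conjunction
    operator (an ampersand), omitted here to keep the sketch small."""
    op = "!=" if negate else "="
    pattern = '[%s%s"%s"]' % (attr, op, value)
    if optional:
        pattern = pattern + "?"
    if repeated:
        pattern = pattern + "*"
    return pattern

# "Adjectives modifying dona": an adjective immediately before the noun.
query = " ".join([unit_pattern("pos", "ADJ"), unit_pattern("lemma", "dona")])
print(query)   # [pos="ADJ"] [lemma="dona"]
      </Paragraph>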
      <Paragraph position="8"> In both modes, the user can retrieve up to 1000 examples, which can be viewed online or downloaded as a text file, and with different context sizes. In addition, a link to a cache copy of the document and to its original location is provided.</Paragraph>
      <Paragraph position="9"> As for the statistics interface, it retrieves frequency information for the user's query. The frequency can be related to any of the 4 annotation levels (word, lemma, POS, function).</Paragraph>
      <Paragraph position="10"> For example, it is possible to search for a given verb lemma and get the frequencies of each verb form, or to look for adjectives modifying the word dona ('woman') and obtain the list of lemmata with their associated frequency. The results are offered as a table with absolute and relative frequency, and they can be viewed online or retrieved as a CSV file. In addition, each of the results has an associated link to the actual examples in the corpus.</Paragraph>
      <Paragraph position="11"> The interface is technically quite complex, and the corpus quite large. There are still aspects to be solved in both the implementation and the documentation of the interface. Even when restricting searches to 1000 hits, efficiency often remains a problem in the example search mode, and even more so in the statistics interface. Two partial solutions have been adopted so far: first, dividing the corpus into 4 subcorpora, as explained in Section 2.2, so that searches can be run in parallel and the search engine is less often overloaded; second, limiting the amount of memory and time available for a given query. In the statistics interface, a status bar shows the progress of the query as a percentage, together with the time left.</Paragraph>
      <Paragraph position="12"> The interface does not offer the full range of CWB/CQP functionalities, mainly because they were not demanded by our "known" users (most of them linguists and translators from the Department of Translation and Philology at Universitat Pompeu Fabra). However, we plan to progressively add new features and functionalities. So far we have not detected any incompatibility between splitting the corpus and the CWB/CQP deployment or querying functionalities.</Paragraph>
    </Section>
    <Section position="2" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
3.2 Whole dataset
</SectionTitle>
      <Paragraph position="0"> The annotated corpus can be used as a source of data for NLP purposes. A previous version of the CUCWeb corpus (obtained with the methodology described in this paper, but crawling only the .es domain, and consisting of 180 million words) has already been exploited in a lexical acquisition task aimed at classifying Catalan verbs into syntactic classes (Mayol et al., 2006).</Paragraph>
      <Paragraph position="1"> Cluster analysis was applied to a set of 200 verbs, modeled in terms of 10 linguistically defined features. The data for the clustering were first extracted from a fragment of CTILC (14 million words). Using the manual tagging of the corpus, an average 0.84 f-score was obtained. Using CatCG, the performance decreased by only 2 points (0.82 f-score). In a subsequent experiment, the data were extracted from the CUCWeb corpus. Given that it is 12 times larger than the traditional corpus, the question was whether "more data is better data" (Church and Mercer, 1993, 18-19). Banko and Brill (2001) present a case study on confusion set disambiguation that supports this slogan. Surprisingly enough, results using CUCWeb were significantly worse than those using the traditional corpus, even when the latter was processed automatically: CUCWeb led to an average 0.71 f-score, an 11-point difference. These results somewhat question the quality of the CUCWeb corpus, particularly as the authors attribute the difference to noise in CUCWeb and to difficulties in linguistic processing (see Section 4). However, 0.71 is still well beyond the 0.33 f-score baseline, so our analysis is that CUCWeb can be successfully used in lexical acquisition tasks. Improvement in both filtering and linguistic processing is still a must, though.</Paragraph>
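      <Paragraph> The sketch below reproduces the shape of such an experiment (a matrix of 200 verbs with 10 features, k-means clustering, and a majority-label mapping scored with a macro f-score) on random toy data. It is an illustration under those assumptions, not the actual setup of Mayol et al. (2006), whose clustering algorithm and evaluation protocol are not detailed here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy stand-ins: 200 verbs x 10 features, and gold syntactic classes (3 classes).
features = rng.random((200, 10))
gold = rng.integers(0, 3, size=200)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Map each cluster to its majority gold class, then compute a macro f-score.
predicted = np.empty_like(gold)
for c in np.unique(clusters):
    members = clusters == c
    majority = np.bincount(gold[members]).argmax()
    predicted[members] = majority

print("macro f-score: %.2f" % f1_score(gold, predicted, average="macro"))
      </Paragraph>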
    </Section>
  </Section>
  <Section position="5" start_page="22" end_page="24" type="metho">
    <SectionTitle>
4 Discussion of the architecture
</SectionTitle>
    <Paragraph position="0"> The initial motivation for the CUCWeb project was to obtain a large annotated corpus for Catalan.</Paragraph>
    <Paragraph position="1"> However, we set up an architecture that enables  the construction of web corpora in general, provided the language-dependent modules are available. Figure 1 shows the current architecture for CUCWeb.</Paragraph>
    <Paragraph position="2"> The language-dependent modules are the language classifier (our classifier now covers 10 languages, as explained in Section 2.2) and the linguistic processing tools. In addition, the web interface has to be adapted for each new tagset, piece of information and linguistic level. For instance, the interface currently does not support searches for chunks or phrases.</Paragraph>
    <Paragraph position="3"> Most of the problems we have encountered in processing Web documents are not new (Baroni and Ueyama, To appear), but they are much more frequent in this kind of document than in standard running text.15 (Footnote 15: By "standard text" we mean edited pieces of text, such as newspapers, novels, encyclopedias, or technical manuals.) We now review the main problems we came across: Textual layout In general, these are problems that arise from the layout of Web documents, which is very different from that of standard text. Pre-processing tools have to be adapted to deal with these elements, which include headers or footers (Last modified...), copyright statements or frame elements: the so-called boilerplate. Currently, because we process the text extracted by the crawler, no boilerplate detection is performed, which increases the amount of noise in the corpus. Moreover, the pre-processing module does not even handle e-mail addresses or phone numbers (they are not frequently found in the kind of text it was designed to process); as a result, for example, one of the most frequent determiners in the corpus is 93, the phone prefix for Barcelona. Another problem for the pre-processing module, again due to the fact that we process the text extracted from the HTML markup, is that most of the structural information is lost and many segmentation errors occur, errors that carry over to subsequent modules.</Paragraph>
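    <Paragraph> As a small illustration of the kind of pre-processing step the module currently lacks, the sketch below masks e-mail addresses and Spanish-style phone numbers before tagging, which would, for instance, keep the Barcelona prefix 93 from being analysed as a determiner. The regular expressions are rough illustrative heuristics, not rules of any existing module.
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
# Spanish-style phone numbers: 9 digits, optionally grouped, optional +34 prefix.
PHONE = re.compile(r"(?:\+34[ .-]?)?\b\d{2,3}(?:[ .-]?\d{2,3}){2,3}\b")

def mask_noise(text):
    """Replace e-mail addresses and phone numbers with placeholder tokens
    so that they do not end up in the corpus as ordinary words."""
    text = EMAIL.sub("EMAIL", text)
    return PHONE.sub("PHONE", text)

print(mask_noise("Truqueu al 93 542 20 00 o escriviu a info@upf.edu"))
    </Paragraph>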
    <Paragraph position="4"> Spelling mistakes Most of the texts published on the Web are edited only once, by their author, and are neither reviewed nor corrected, as is usually done in traditional textual collections (Baeza-Yates et al., 2005). It could be argued that this makes the language on the Web closer to the "actual language", or at least representative of varieties not covered by traditional corpora. However, this feature makes Web documents difficult to process for NLP purposes, due to the large quantity of spelling mistakes of all kinds. The HTML support itself causes some difficulties that are not exactly spelling mistakes: a particularly frequent problem we have found is that the first letter of a word gets segmented from the rest of the word, mainly due to formatting effects. Automatic spelling correction is thus a more necessary module in the case of Web data.</Paragraph>
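    <Paragraph> One conceivable way to patch the detached-first-letter artefact mentioned above is to re-join a single-letter token with the following token whenever the merged form, but not the remainder alone, is in a dictionary. The sketch below is only an illustration of such a heuristic; it is not part of the CUCWeb pipeline.
def rejoin_split_letters(tokens, vocabulary):
    """Merge patterns like ['C', 'atalunya'] into ['Catalunya'] when the
    merged form is in the vocabulary and the detached remainder is not."""
    fixed, skip = [], False
    for current, nxt in zip(tokens, tokens[1:] + [""]):
        if skip:
            skip = False
            continue
        merged = current + nxt
        if (len(current) == 1 and current.isalpha()
                and merged.lower() in vocabulary
                and nxt.lower() not in vocabulary):
            fixed.append(merged)
            skip = True
        else:
            fixed.append(current)
    return fixed

vocab = {"catalunya", "corpus", "de"}
print(rejoin_split_letters(["C", "atalunya", "corpus", "de"], vocab))
# expected: ['Catalunya', 'corpus', 'de']
    </Paragraph>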
    <Paragraph position="5"> Multilinguality Multilinguality is also not a new issue (there are indeed multilingual books or journals), but it is one that becomes much more evident when handling Web documents. Our current approach, given that we are not interested in full coverage but in quality, is to discard multilingual documents (through the language classifier and the linguistic filter). This causes two problems. On the one hand, potentially useful texts are lost if they are embedded in multilingual documents (note that the linguistic filter reduces the initial collection to almost half; see Table 1). On the other hand, many multilingual documents remain in the corpus, because the amount of text in another language does not reach the specified threshold. Due to the sociological context of Catalan, Spanish-Catalan documents are particularly frequent, and this can cause trouble in e.g. lexical acquisition tasks, because both are Romance languages and some word forms coincide. Currently, both the language classifier and the dictionary filter are document-based, not sentence-based. A better approach would be to do sentence-based language classification. However, this would increase the complexity of corpus construction and management: if we want to maintain the notion of document, pieces in other languages have to be marked but not removed. Ideally, they should also be tagged and subsequently made searchable.</Paragraph>
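    <Paragraph> A sentence-based alternative, as suggested above, could mark rather than remove the sentences that are not in Catalan, so that the notion of document is preserved and the foreign-language spans could later be tagged and made searchable. The sketch below assumes a sentence-level language identifier is available (passed in as a callable; the toy identifier is purely illustrative); it is a design sketch, not part of the current pipeline.
def mark_foreign_sentences(sentences, identify_language, target="ca"):
    """Return (sentence, language, is_target_language) triples.
    Foreign-language sentences stay in the document but are flagged."""
    marked = []
    for sentence in sentences:
        lang = identify_language(sentence)
        marked.append((sentence, lang, lang == target))
    return marked

def toy_identifier(sentence):
    """Purely illustrative stand-in for a real sentence-level classifier."""
    return "es" if "texto" in sentence else "ca"

doc = ["Aquesta és una frase en català.", "Y aquí el texto cambia de lengua."]
for sentence, lang, keep in mark_foreign_sentences(doc, toy_identifier):
    print(lang, keep, sentence)
    </Paragraph>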
    <Paragraph position="6"> Duplicates Finally, a problem which is indeed particular to the Web is redundancy. Despite all the effort put into avoiding duplicates during the crawl and detecting them in the collection (see Section 2), there are still quite a lot of duplicates or near-duplicates in the corpus. This is a problem both for NLP purposes and for corpus querying. More sophisticated algorithms, as in Broder (2000), are needed to improve duplicate detection.</Paragraph>
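    <Paragraph> Near-duplicate detection along the lines of Broder (2000) compares documents through their shingles (overlapping word n-grams) and measures resemblance as the Jaccard similarity of the shingle sets. The sketch below shows that idea directly, without the MinHash sketching that Broder uses to make it scale; the 0.8 threshold is an arbitrary illustration.
def shingles(text, size=4):
    """Set of overlapping word n-grams (shingles) of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def resemblance(text_a, text_b, size=4):
    """Jaccard similarity of the two shingle sets (Broder's resemblance)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    return len(a.intersection(b)) / len(a.union(b))

def near_duplicates(text_a, text_b, threshold=0.8):
    """Flag a pair of documents as near-duplicates above the threshold."""
    return resemblance(text_a, text_b) >= threshold
    </Paragraph>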
  </Section>
</Paper>