<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2001">
<Title>Large linguistically-processed Web corpora for multiple languages</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>
1 Introduction
</SectionTitle>
<Paragraph position="0"> The Web contains vast amounts of linguistic data for many languages (Kilgarriff and Grefenstette, 2003). One key issue for linguists and language technologists is how to access it. The drawbacks of using commercial search engines are presented in Kilgarriff (2003). An alternative is to crawl the Web ourselves.1 We have done this for two languages, German and Italian, and here we report on the pipeline of processes that gives us reasonably well-behaved, 'clean' corpora for each language.</Paragraph>
<Paragraph position="1"> alexa.com/company/index.html), who allow the user (for a modest fee) to access their cached Web directly. Using Alexa would mean one did not need to crawl; however, in our experience, crawling, given free software like Heritrix, is not the bottleneck. The point at which human input is required is the filtering out of non-linguistic material.</Paragraph>
<Paragraph position="2"> We use the German corpus (which was developed first) as our example throughout. The procedure was carried out on a server running RH Fedora Core 3 with 4 GB RAM, Dual Xeon 4.3 GHz CPUs and about 2.5 TB of hard disk space. We are making the tools we develop as part of the project freely available,2 in the hope of stimulating public sharing of resources and know-how.</Paragraph>
</Section>
</Paper>