<?xml version="1.0" standalone="yes"?> <Paper uid="E06-2001"> <Title>Large linguistically-processed Web corpora for multiple languages</Title> <Section position="9" start_page="89" end_page="89" type="concl"> <SectionTitle> 8 Conclusion </SectionTitle> <Paragraph position="0"> We have developed very large corpora from the Web for German and Italian (with other languages to follow). We have filtered and cleaned the text so that the obvious problems with using the Web as a corpus for linguistic research do not hold. Preliminary evidence suggests the 'balance' of our German corpus compares favourably with that of a newswire corpus (though of course any such claim begs a number of open research questions about corpus comparability). We have lemmatised and part-of-speech-tagged the data and loaded it into a corpus query tool supporting sophisticated linguistic queries, and made it available to all.</Paragraph> </Section> </Paper>