File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/06/w06-1704_concl.xml

Size: 3,195 bytes

Last Modified: 2025-10-06 13:55:41

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1704">
  <Title>CUCWeb: a Catalan corpus built from the Web</Title>
  <Section position="6" start_page="24" end_page="25" type="concl">
    <SectionTitle>
5 Conclusions and future work
</SectionTitle>
    <Paragraph position="0"> We have presented CUCWeb, a project aimed at obtaining a large Catalan corpus from the Web and making it available to all language users. As an existing resource, it can be enhanced and modified, e.g. with better filters, better duplicate detectors, or better NLP tools. Having an actual corpus stored and annotated also makes it possible to explore it, be it through the web interface or as a dataset.</Paragraph>
    <Paragraph position="1"> The first CUCWeb version (from data gathering to linguistic processing and web interface implementation) was developed in only 6 months, with partial dedication of a team of 6 people. Since then, many improvements have taken place, and many more remain as a challenge, but this experience confirms that creating a 166 million word annotated corpus, given the current technological state of the art, is relatively easy and cheap.</Paragraph>
    <Paragraph position="2"> Resources such as CUCWeb facilitate the technological development of non-major languages and quantitative linguistic research, particularly so if flexible web interfaces are implemented. In addition, they make it possible for NLP and Web studies to converge, opening new fields of research (e.g. sociolinguistic studies of the Web).</Paragraph>
    <Paragraph position="3"> We have argued that the developed architecture allows for the creation of Web corpora in general.</Paragraph>
    <Paragraph position="4"> In fact, in the near future we plan to build a Spanish Web corpus and integrate it into the same web interface, using the data already gathered. The Spanish corpus, however, will be much larger than the Catalan one (a conservative estimate is 600 million words), so new challenges in processing and searching it will arise.</Paragraph>
    <Paragraph position="5"> We have also reviewed some of the challenges that Web data pose to existing NLP tools, and argued that most are not new (textual layout, misspellings, multilinguality), but are more frequent on the Web. To address some of them, we plan to develop a more sophisticated pre-processing module and a sentence-based language classifier and filter. A more general challenge of Web corpora is the control over their contents. Unlike traditional corpora, where the origin of each text is clear and deliberate, in CUCWeb the strategy is to gather as much text as possible, provided it meets some quality heuristics. The notion of balance is not present anymore, although this need not be a drawback (Web corpora are at least representative of the language on the Web). However, what is arguably a drawback is the black box effect of the corpus, because the impact of text genre, topic, and so on cannot be taken into account.</Paragraph>
    <Paragraph position="6"> It would require a text classification procedure to know what the collected corpus contains, and this is again a meeting point for Web studies and NLP.</Paragraph>
  </Section>
</Paper>