File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-1705_metho.xml
Size: 5,851 bytes
Last Modified: 2025-10-06 14:10:49
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1705"> <Title>Annotated web as corpus</Title> <Section position="4" start_page="29" end_page="29" type="metho"> <SectionTitle> 3 Research hypothesis and aims </SectionTitle> <Paragraph position="0"> Our research hypothesis is that distributed computational techniques can alleviate the annotation bottleneck for processing corpus data from the web. This leads us to a number of research questions: null * How can corpus data from the web be divided into units for processing via distributed techniques? * Which corpus annotation techniques are suitable for distributed processing? * Can distributed techniques assist in corpus clean-up and conversion to allow inclusion of a wider variety of genres and to support more representative corpora? In the early stages of our proposed research, we are focussing on grammatical word-class analysis (part-of-speech tagging) of web-derived corpora of English and aspects of corpus clean-up and conversion. Clarifying copyright issues and exploring models for legal dissemination of corpora compiled from web data are key objectives of this stage of the investigation as well.</Paragraph> </Section> <Section position="5" start_page="29" end_page="30" type="metho"> <SectionTitle> 4 Methodology </SectionTitle> <Paragraph position="0"> The initial focus of the work will be to develop the framework for distributed corpus annotation.</Paragraph> <Paragraph position="1"> Since existing solutions have been centralised in nature, we first must examine the consequences that a distributed approach has for corpus annotation and identify issues to address.</Paragraph> <Paragraph position="2"> A key concern will be handling web pages within the framework, as it is essential to minimise the amount of data communicated between peers. Unlike the other distributed analytical systems mentioned above, the size of text document and analysis time is largely proportional for corpora annotation. This places limitations on work unit size and distribution strategies. In particular, three areas will be investigated: * Mechanisms for crawling/discovery of a web corpus domain - how to identify pages to include in a web corpus. Also BOINC, Berkeley Open Infrastructure for Network Computing. http://boinc.berkeley.edu.</Paragraph> <Paragraph position="3"> investigate appropriate criteria for handling pages which are created or modified dynamically.</Paragraph> <Paragraph position="4"> * Mechanisms to generate work units for distributed computation - how to split the corpus into work units and reduce the communication / computation time ratio that is crucial for such systems to be effective. null * Mechanisms to support the distribution of work units and collection of results how to handle load balancing. What data should be sent to peers and how is the processed information handled and manipulated? What mechanisms should be in place to ensure correctness of results? How can abuse be prevented and security concerns of collaborating institutions be addressed? BOINC already provides a good platform for this, and these aspects will be investigated within the project. null Analysis of existing distributed computation systems will help to inform the design of the framework and tackle some of these issues. 
Finally, the framework will also cater for common strategies for building and annotating web corpora, for instance in terms of which crawlers are used to locate web pages. First, from a computational linguistic point of view, the framework will need to take into account the granularity of the annotation unit (for example, POS tagging requires sentence units, but anaphoric annotation needs paragraphs or larger units). Secondly, we need to investigate techniques for identifying identical documents, virtually identical documents and highly repetitive documents, such as those pioneered by Fletcher (2004b) and the shingling techniques described by Chakrabarti (2002); a sketch of the shingling idea is given at the end of this section.

The second stage of our work will involve implementing the framework within a P2P environment. We have already developed a prototype of an object-oriented application environment to support P2P system development using JXTA (Sun's P2P API). We have designed this environment so that specific application functionality can be captured within plug-ins that then integrate with the environment and utilise its functionality. This system has been successfully tested with the development of plug-ins supporting instant messaging, distributed video encoding (Hughes and Walkerdine, 2005), distributed virtual worlds (Hughes et al., 2005) and digital library management (Walkerdine and Rayson, 2004). It is our intention to implement our distributed corpus annotation framework as a plug-in; an illustrative sketch of such a plug-in interface is also given at the end of this section. This will involve implementing new functionality and integrating it with our existing annotation tools (such as CLAWS). The development environment is also flexible enough to utilise the BOINC platform, and such support will be built into it.

Using the P2P Application Framework as a basis for the development brings several advantages. First, it reduces development time by allowing the developer to reuse existing functionality; second, it already supports essential aspects such as system security; and third, it has already been used successfully to deploy comparable P2P applications. A lightweight version of the application framework will be bundled with the corpus annotation plug-in, and this will then be made publicly available for download in open-source and executable formats. We envisage that our end-users will come from a variety of disciplines, such as language engineering and linguistics. For less technical users, the prototype will be packaged as a screensaver or an instant messaging client to facilitate deployment.
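As noted above, near-duplicate detection will draw on the shingling techniques described by Chakrabarti (2002). The sketch below illustrates only the underlying idea: each document is reduced to its set of overlapping w-word shingles, and two documents are treated as virtually identical when the Jaccard resemblance of those sets exceeds a threshold. The shingle width, the threshold and the class names are our own assumptions, and a production system would hash and sample shingles (for example by min-hashing) rather than compare full sets.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch of shingling for near-duplicate detection.
    public class ShingleComparer {

        private static final int SHINGLE_WIDTH = 4;   // words per shingle (assumed)
        private static final double THRESHOLD = 0.9;  // resemblance cut-off (assumed)

        // Build the set of overlapping w-word shingles for a document.
        public static Set<String> shingles(String text) {
            String[] words = text.toLowerCase().split("\\s+");
            Set<String> result = new HashSet<>();
            if (words.length < SHINGLE_WIDTH) {
                result.add(String.join(" ", words));
                return result;
            }
            for (int i = 0; i + SHINGLE_WIDTH <= words.length; i++) {
                result.add(String.join(" ",
                        Arrays.copyOfRange(words, i, i + SHINGLE_WIDTH)));
            }
            return result;
        }

        // Jaccard resemblance: |A intersect B| / |A union B|.
        public static double resemblance(Set<String> a, Set<String> b) {
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        public static boolean virtuallyIdentical(String doc1, String doc2) {
            return resemblance(shingles(doc1), shingles(doc2)) >= THRESHOLD;
        }

        public static void main(String[] args) {
            String a = "the quick brown fox jumps over the lazy dog near the river bank";
            String b = "the quick brown fox jumps over the lazy dog near the river";
            System.out.printf("resemblance = %.2f%n", resemblance(shingles(a), shingles(b)));
            System.out.println("virtually identical: " + virtuallyIdentical(a, b));
        }
    }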
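Finally, as an illustration only, a hypothetical plug-in interface is sketched below. It shows how a corpus annotation plug-in might sit alongside the other plug-ins mentioned above, with the tagger (such as CLAWS) wrapped behind a single work-unit processing method. The interface, names and signatures are illustrative assumptions and do not describe the actual API of the P2P Application Framework.

    // Hypothetical plug-in interface for a P2P application framework; the real
    // JXTA-based environment described here will differ in detail.
    interface Plugin {
        String name();

        // Process one work unit received from a peer and return the result
        // to be sent back for collection and merging.
        String processWorkUnit(String workUnit);
    }

    // A corpus annotation plug-in wrapping a part-of-speech tagger.
    class CorpusAnnotationPlugin implements Plugin {

        @Override
        public String name() {
            return "corpus-annotation";
        }

        @Override
        public String processWorkUnit(String workUnit) {
            // Placeholder: a real plug-in would invoke a tagger such as CLAWS
            // here and return its tagged output.
            return "<tagged>" + workUnit + "</tagged>";
        }
    }

    public class PluginDemo {
        public static void main(String[] args) {
            Plugin plugin = new CorpusAnnotationPlugin();
            System.out.println(plugin.name() + " -> "
                    + plugin.processWorkUnit("The web is a vast source of text ."));
        }
    }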