File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/02/p02-1030_concl.xml
Size: 2,719 bytes
Last Modified: 2025-10-06 13:53:18
<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1030">
  <Title>Scaling Context Space</Title>
  <Section position="9" start_page="0" end_page="0" type="concl">
    <SectionTitle>7 Conclusion</SectionTitle>
    <Paragraph position="0">It is a phenomenon common to many NLP tasks that the quality or accuracy of a system increases log-linearly with the size of the corpus. Banko and Brill (2001) also found this trend for the task of confusion set disambiguation on corpora of up to one billion words. They demonstrated the behaviour of different learning algorithms with very simple contexts on extremely large corpora. We have demonstrated the behaviour of a simple learning algorithm using much more complicated contextual information on very large corpora.</Paragraph>
    <Paragraph position="1">Our experiments suggest that the existing methodology of evaluating systems on small corpora, without reference to execution time and representation size, ignores important aspects of the evaluation of NLP tools.</Paragraph>
    <Paragraph position="2">These experiments show that efficiently implementing and optimising the NLP tools used for context extraction is of crucial importance, since increased corpus sizes make execution speed an important evaluation factor when deciding between learning algorithms for different tasks and corpora. These results also motivate further research into improving the asymptotic complexity of the learning algorithms used in NLP systems. In the new paradigm, it could well be that far simpler but scalable learning algorithms significantly outperform existing systems.</Paragraph>
    <Paragraph position="3">Finally, the mass availability of online text resources should be taken on board. It is important that language engineers and computational linguists continue to try to find new unsupervised or (as Banko and Brill suggest) semi-supervised methods for tasks which currently rely on annotated data. It is also important to consider how information extracted by systems such as thesaurus extractors can be incorporated into tasks which use predominantly supervised techniques, e.g. in the form of class information for smoothing.</Paragraph>
    <Paragraph position="4">We would like to extend this analysis to at least one billion words for at least the most successful methods, and to try other tools and parsers for extracting the contextual information. However, to do this we must look at methods of compressing the vector-space model and approximating the full pair-wise comparison of thesaurus terms. We would also like to investigate how this thesaurus information can be used to improve the accuracy or generality of other NLP tasks.</Paragraph>
  </Section>
</Paper>
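As a rough illustration of the log-linear trend described in the first paragraph, the sketch below fits accuracy against the logarithm of corpus size. The corpus sizes and accuracy values are invented placeholders, not figures from the paper; the point is only the form of the relationship, accuracy ≈ a + b · log10(words).

```python
import numpy as np

# Hypothetical corpus sizes (in words) and accuracies; the numbers are purely
# illustrative and are NOT results reported in the paper.
corpus_sizes = np.array([1e6, 1e7, 1e8, 1e9])
accuracies = np.array([0.55, 0.63, 0.70, 0.78])

# Fit accuracy = a + b * log10(corpus size), i.e. the log-linear trend.
b, a = np.polyfit(np.log10(corpus_sizes), accuracies, 1)
print(f"accuracy ~ {a:.3f} + {b:.3f} * log10(words)")

# Extrapolate (with the usual caveats) to a ten-billion-word corpus.
print(f"predicted accuracy at 1e10 words: {a + b * 10:.3f}")
```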
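The final paragraph mentions approximating the full pair-wise comparison of thesaurus terms. The sketch below shows why this matters: comparing every pair of context vectors is quadratic in the number of terms, whereas an inverted index over context features restricts comparison to pairs that share at least one feature. The toy vectors, the cosine measure, and the candidate-filtering heuristic are illustrative assumptions, not the method used in the paper.

```python
from collections import defaultdict
from math import sqrt

# Toy context vectors: term -> {context relation: weight}. Both the data and
# the filtering heuristic are hypothetical, for illustration only.
vectors = {
    "cup":  {"drink/obj": 2.0, "tea/nn": 1.0},
    "mug":  {"drink/obj": 1.5, "coffee/nn": 1.0},
    "idea": {"have/obj": 3.0, "good/adj": 2.0},
}

def cosine(u, v):
    # Cosine similarity between two sparse context vectors.
    dot = sum(u[f] * v[f] for f in set(u) & set(v))
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Full pair-wise comparison: O(n^2) similarity computations over all term pairs.
terms = sorted(vectors)
full = {(s, t): cosine(vectors[s], vectors[t])
        for i, s in enumerate(terms) for t in terms[i + 1:]}

# A simple approximation: an inverted index from context features to terms,
# so that only pairs sharing at least one feature are ever compared.
index = defaultdict(set)
for term, vec in vectors.items():
    for feature in vec:
        index[feature].add(term)

approx = {}
for term, vec in vectors.items():
    candidates = set().union(*(index[f] for f in vec)) - {term}
    for other in candidates:
        pair = tuple(sorted((term, other)))
        approx.setdefault(pair, cosine(vectors[term], vectors[other]))

print(full)    # includes zero-similarity pairs such as ('cup', 'idea')
print(approx)  # only the pairs that share a context feature
```

Under the cosine measure, pairs with no shared features have similarity zero, so the filtered comparison recovers exactly the non-zero entries of the full comparison while avoiding most of the quadratic work on a large vocabulary.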