<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1111">
<Title>Language Identification With Confidence Limits</Title>
<Section position="6" start_page="99" end_page="99" type="concl">
<SectionTitle>4 Conclusions</SectionTitle>
<Paragraph position="0"> We have examined a simple technique for classifying a stream of input tokens, in which confidence measures are used to determine when a correct decision can be made. The results in Table 1 show that there is a tradeoff between accuracy and the degree to which the algorithm selects a single language. Not surprisingly, the amount of training data also affects performance, with 2000 tokens being adequate for accuracy close to 100%, and convergence typically being reached within the first 10 tokens. On a less constrained problem, such as genre identification from words alone, the algorithm performs less well in both accuracy and decisiveness even with significantly more training data, and is probably not adequate except as a preprocessor to some more knowledge-intensive technique.</Paragraph>
<Paragraph position="1"> In a sense, language identification is not a very interesting problem. As we have noted, there are plenty of techniques which work well, each with its own characteristics and suitability for different application areas. What is perhaps more important is the way the statistical information has been used here. When we take a statistical or data-led approach to NLP, there are two things which can help us trust that the technique is accurate. The first is a belief that the statistical technique is an adequate model of the underlying process which &quot;generates&quot; the data, a belief informed by theoretical considerations or some external source of knowledge. The second is quantitative evaluation on test data which has been characterised by an outside source (for example, in the case of part-of-speech tagging, a corpus which has been manually annotated, or at least automatically tagged and manually corrected). The problem with quantitative evaluation is that we do not know whether its results will generalise: if we train on one data set, we have only the theoretical model to reassure us that the same model will work on a different data set. The idea I have been presenting here is to have the statistical process provide feedback on its own reliability, through confidence limits which are themselves grounded in the statistical model. In doing so, we hope to avoid presenting a result for which we lack adequate evidence.</Paragraph>
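<Paragraph position="2"> To make the decision procedure concrete, the sketch below shows one way a confidence-based sequential classifier of the kind examined here might be implemented in Python. It is a minimal sketch under stated assumptions: the smoothed unigram token models, the two-sided 95% normal-approximation bound on the leader's per-token advantage, and all function names are illustrative choices introduced here, not the exact formulation evaluated in the paper.

import math
from collections import Counter

Z_95 = 1.96  # two-sided 95% normal critical value (an illustrative threshold)

def train(tokens, smoothing=1.0):
    # Build a smoothed unigram log-probability table from training tokens,
    # reserving some probability mass for unseen tokens.
    counts = Counter(tokens)
    total = sum(counts.values()) + smoothing * (len(counts) + 1)
    logp = {t: math.log((c + smoothing) / total) for t, c in counts.items()}
    unseen = math.log(smoothing / total)
    return logp, unseen

def classify(stream, models, z=Z_95):
    # Read tokens until the current leader's mean per-token log-likelihood
    # advantage over the runner-up is positive at the chosen confidence
    # level; return (language, tokens_consumed), or (None, n) if the
    # stream ends before a confident decision is reached.
    names = list(models)
    scores = {name: [] for name in names}  # per-token log-probabilities
    n = 0
    for token in stream:
        n += 1
        for name, (logp, unseen) in models.items():
            scores[name].append(logp.get(token, unseen))
        ranked = sorted(names, key=lambda m: sum(scores[m]), reverse=True)
        best, second = ranked[0], ranked[1]
        d = [a - b for a, b in zip(scores[best], scores[second])]
        mean = sum(d) / n
        if n > 1:
            var = sum((x - mean) ** 2 for x in d) / (n - 1)
            if mean - z * math.sqrt(var / n) > 0:
                return best, n
    return None, n

# Usage with toy training data (hypothetical example):
models = {"en": train("the cat sat on the mat".split()),
          "fr": train("le chat est sur le tapis".split())}
print(classify(iter("the cat sat on the mat".split()), models))

The decisive quantity is the per-token log-likelihood ratio between the two leading candidates; recomputing its mean and variance at each step is what lets the classifier report not just a winner but whether the evidence for that winner is yet adequate.</Paragraph>
</Section>
</Paper>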