<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1111"> <Title>Language Identification With Confidence Limits</Title> <Section position="5" start_page="95" end_page="99" type="evalu"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"> To evaluate the technique, a test was run using data similar to that of Sibun and Reynar. Corpora for eighteen languages from the European Corpus Initiative CDROM 1 were extracted and split into non-overlapping files, one containing 2000 tokens 3, one containing 200 tokens, and 25 files each of 1, 5, 10 and 20 tokens. The 2000 and 200 token files were used as training data, and the remainder as test data. Wherever possible the texts were taken from newspaper corpora, and failing that from novels or literature. The identification algorithm was run on each test file and the results placed in one of four categories: The sum of the first two figures divided by the total number of tests gives a measure of accuracy; the sum of the first and last divided by the total gives a measure of decisiveness, expressed as the proportion of the time a definitive decision was made. The tests were executed using word shape tokens on the same coding scheme as Sibun and Reynar, and using the words as they appeared in the corpus. No adjustments were made for punctuation, case, etc. Various activation thresholds were tried: raising the threshold increases accuracy by requiring more information before a decision is made, but reduces decisiveness. With shapes and 2000 tokens of training data, at a threshold of 14 or more, all the 20 token files gave 100% accuracy. For words themselves, the threshold was set to 22. The results of these tests appear in table 1.
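The two measures just defined can be expressed directly in code. In this sketch, c1 to c4 are hypothetical names for the counts of the four result categories, in the order the text lists them; only their roles in the two sums are taken from the text.

```python
def accuracy_and_decisiveness(c1, c2, c3, c4):
    """Evaluation measures from the four category counts.
    c1..c4 are hypothetical names for the four categories,
    in the order they appear in the text."""
    total = c1 + c2 + c3 + c4
    accuracy = (c1 + c2) / total       # sum of the first two figures
    decisiveness = (c1 + c4) / total   # sum of the first and last
    return accuracy, decisiveness
```

For instance, 80 + 10 correct outcomes in 100 tests would give 90% accuracy regardless of how many of those tests ended in a definitive decision.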
The figures for the activation threshold were determined by experimenting on the data.</Paragraph> <Paragraph position="1"> An interesting area for further work would be to put this aspect of the procedure on a sounder theoretical basis, perhaps by using the a priori probabilities of the individual languages.</Paragraph> <Paragraph position="2"> 3Sibun and Spitz, and Sibun and Reynar, present their results in terms of lines of input, with 1-5 lines corresponding roughly to a sentence, and 10-20 lines to a paragraph. Estimating a line as 10 words, we are therefore working with significantly smaller data sets. The accuracy figures are generally similar to or better than those of Sibun and Reynar. The corresponding figures for 200 tokens of training data appear in table 2, for the token identification task only.</Paragraph> <Paragraph position="3"> One of the strengths of the algorithm is that it makes a decision as soon as one can be made reliably. Table 3 shows the average number of tokens which have to be read before a decision can be made, for the cases where the decision was correct and incorrect, and for both cases together. Again, the results are for word shape tokens, and for words alone. The figures show that convergence usually happens within about 10 words, with a long tail in the results.</Paragraph> <Paragraph position="4"> The longest time to convergence was 153 shape tokens.</Paragraph> <Paragraph position="5"> A manual inspection of one run (2000 tokens of training data, shape tokens, threshold=14) shows that errors are sometimes clustered, although quite weakly. For example, Serbian, Croatian and Slovenian show several confusions between them, as in Sibun and Reynar's results. There are two observations to be made here.
Firstly, there are about as many other errors between these languages and languages which are unrelated to them, such as Italian, German and Norwegian, and so the errors may be due to poor quality data rather than a lack of discrimination in the algorithm. For example, Croatian is incorrectly recognised as Serbian 3 times and as Slovenian once, while the languages which are misrecognised as Croatian are German and Norwegian (once each). Secondly, even where there are errors, the range of possibilities has been substantially reduced, so that a more powerful process (such as full-scale OCR followed by identification on words rather than shape tokens, or raising the threshold and adding more data) could be brought in to finish the job off. That is, the confidence limits have provided a benefit in reducing the search space.</Paragraph> <Paragraph position="6"> The confusion matrix for this case appears in an appendix.</Paragraph> <Section position="1" start_page="95" end_page="97" type="sub_section"> <SectionTitle> 3.1 Broader applicability </SectionTitle> <Paragraph position="0"> Although the algorithm was developed with language identification in mind, it is interesting to explore other classification problems with it. A simple and rather crude experiment in &quot;genre&quot; identification was carried out, using the Brown corpus. Each section of the corpus (labelled A, B, C ... R in the original) was taken as a genre, and files of similar distribution to the previous experiment were extracted. Because this is a less constrained problem, the training and test sets were about 10 times the size of those for the language identification task. A 20000 word file was used as training data, and the remaining files as test data. Accuracy and decisiveness results appear in table 4. Beyond the activation threshold of 12, there is no significant improvement in accuracy.
The technique seems to give good accuracy when there is sufficient input (100 words or more), but at the cost of very low decisiveness. Excluding a fixed list of common words such as function words might increase the decisiveness. These results should be taken with a pinch of salt, as the notion of genre is not very well-defined, and it is not clear that sections of the Brown corpus really represent coherent categories, but they may provide a starting point for further investigation.</Paragraph> </Section> <Section position="2" start_page="97" end_page="97" type="sub_section"> <SectionTitle> 3.2 On decisiveness </SectionTitle> <Paragraph position="0"> Decisiveness represents the extent to which a unique decision has been made with a high degree of confidence. In cases where no unique decision has been made, the range of possibilities will often have been reduced: a category is only still possible at any stage if its high accumulator value is greater than the low accumulator value of the best rated category. To illustrate this, the number of categories still possible once all the input had been exhausted was examined. The results appear in tables 5 and 6, for the tests of language identification from word shape tokens with an activation threshold of 14 and a training set of 2000 tokens, and for genre identification with a threshold of 12 and a training set of 20000 tokens. Results are shown for the cases of a correct decision, an incorrect one, and all cases.
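The pruning rule just stated can be sketched as follows. The dictionary representation and function name are assumptions for illustration, not the paper's implementation; high and low map each category to its high and low confidence-limit accumulator values.

```python
def possible_categories(high, low):
    """Categories still possible: those whose high accumulator value
    exceeds the low accumulator value of the best rated category.
    (Representation is an assumption; high/low map category to its
    high and low confidence-limit accumulators.)"""
    best = max(high, key=high.get)   # best rated category
    return [c for c in high if high[c] > low[best]]
```

When exactly one category survives this test, a definitive decision has been made; otherwise the surviving set is the reduced range of possibilities counted in tables 5 and 6.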
The average number of possibilities remaining is 1.3 out of 18 for the language identification test, and 9.7 out of 15 for the genre test, showing that we are generally near to convergence in the former case, but have only achieved a small reduction in the possibilities in the latter, in keeping with the generally low decisiveness.</Paragraph> </Section> <Section position="3" start_page="97" end_page="99" type="sub_section"> <SectionTitle> 3.3 A further comparison </SectionTitle> <Paragraph position="0"> The classification algorithm described above was originally developed in response to Sibun and Spitz's work. There is another approach to language identification, which has a certain amount in common with ours, described in a patent by Martino and Paulsen (1996). Their approach is to build tables of the most frequent words in each language, and assign them a normalised score, based on the frequency of occurrence of the word in one language compared to the total across all the languages. Only the most frequent words for each language are used. The algorithm works by accumulating scores until a preset number of words has been read or a minimum score has been reached. They also apply the technique to genre identification. Since there is a clear similarity, it is perhaps worth highlighting the differences. In terms of the algorithm, the most important difference is that no confidence measures are included. The complexities of splitting the data into different frequency bands for calculating probabilities are thus avoided, but no test analogous to overlapping confidence intervals can be applied. Martino and Paulsen say they obtain a high degree of confidence in the decision after about 100 words, without saying what the actual success rate is; we can compare this with around 10 words (or tokens) for convergence here.</Paragraph> </Section> </Section> </Paper>