<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1109">
<Title>Study of Some Distance Measures for Language and Encoding Identification</Title>
<Section position="13" start_page="70" end_page="70" type="concl">
<SectionTitle>
11 Conclusion
</SectionTitle>
<Paragraph position="0"> We have presented results for several distance measures that can be applied to NLP problems. We also described a method for automatically identifying the language and encoding of a text using several of these measures, including one called 'mutual cross entropy'. All these measures are applied to character-based pruned n-gram models created from the training and the test data, with one such model for each of the known language-encoding pairs. The character-based models may be augmented with word-based models, which improves performance for the weaker measures but has little effect on the better ones. Our method performs well given only a few words of test data and a few pages of training data per language-encoding pair. Of the measures considered, mutual cross entropy gave the best results, though the RE, MRE and JC measures also performed almost equally well.</Paragraph>
</Section>
</Paper>