<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1111">
  <Title>Language Identification With Confidence Limits</Title>
  <Section position="3" start_page="0" end_page="94" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Language identification is an example of a general class of problems in which we want to assign an input data stream to one of several categories as quickly and accurately as possible. It can be solved using many techniques, including knowledge-poor statistical approaches. Typically, the distribution of n-grams of characters or other objects is used to form a model. A comparison of the input against the model determines the language which matches best. Versions of this simple technique can be found in Dunning (1994) and Cavnar and Trenkle (1994), while an interesting practical implementation is described by Adams and Resnik (1997).</Paragraph>
    <Paragraph position="1"> A variant of the problem is considered by Sibun and Spitz (1994), and Sibun and Reynar (1996), who look at it from the point of view of Optical Character Recognition (OCR).</Paragraph>
    <Paragraph position="2"> Here, the language model for the OCR system cannot be selected until the language has been identified. They therefore work with so-called shape tokens, which give a very approximate encoding of the characters' shapes on the printed page without needing full-scale OCR. For example, all upper case letters are treated as being one character shape, all characters with a descender are another, and so on. Sequences of character shape codes separated by white space are assembled into word shape tokens. Sibun and Spitz then determine the language on the basis of linear discriminant analysis (LDA) over word shape tokens, while Sibun and Reynar explore the use of entropy relative to training data for character shape unigrams, bigrams and trigrams. Both techniques are capable of over 90% accuracy for most languages. However, the LDA-based technique tends to perform significantly worse for languages which are similar to one another, such as the Norse languages. Relative entropy performs better, but still has some noticeable error clusters, such as confusion between Croatian, Serbian and Slovenian.</Paragraph>
    <Paragraph position="3"> What these techniques lack is a measure of when enough information has been accumulated to distinguish one language from another reliably: they examine all of the input data and then make the decision. Here we will look at a different approach which attempts to overcome this by maintaining a measure of the total evidence accumulated for each language and how much confidence there is in the measure. To outline the approach: 1. The input is processed one (word shape) token at a time. For each language, we determine the probability that the token is in that language, expressed as a 95% confidence range.</Paragraph>
    <Paragraph position="4"> 2. The values for each word are accumulated into an overall score with a confidence range for the input to date, and compared both to an absolute threshold, and with  .</Paragraph>
    <Paragraph position="5"> each other. Thus, to select a language, we require not only that it has a high score (probability, roughly), but also that it is significantly better scoring than any other.</Paragraph>
    <Paragraph position="6"> If the process fails to make a decision on the data that is available, the subset of the languages which have exceeded the absolute threshold can be output, so that even if a final decision has not been made, the likely possibilities have been narrowed down.</Paragraph>
    <Paragraph position="7"> We look at this procedure in more detail below, with particular emphasis on how the underlying statistical model provides confidence intervals. An evaluation of the technique on data similar to that used by Sibun and Reynar follows I.</Paragraph>
  </Section>
class="xml-element"></Paper>