<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1111"> <Title>Language Identification With Confidence Limits</Title> <Section position="4" start_page="94" end_page="95" type="metho"> <SectionTitle> 2 The Identification Algorithm </SectionTitle> <Paragraph position="0"> The essential idea behind the identification algorithm is to accumulate the probability of the language given the input tokens for each language, treating each token as an independent event. To obtain the probability of a language l given a token t, p(llt), we use Bayes' rule:</Paragraph> <Paragraph position="2"> where p(t\[l) is the probability of the token if the language is known, p(t) is the a pr/or/probability of the token, and p(l) is the a priori probability of the language. We will assume that p(l) is constant (all languages are equi-probable) and drop it from the computation; in the tests, we will use the same amount of training data for each language. The other two terms are estimated from training data, using the procedure described in section 2.2.</Paragraph> <Paragraph position="3"> 2.1 The language model and the algorithm The input to the algorithm consists of a stream of tokens, such as word shape tokens (as in Sibun and Spitz, or Sibun and Reynar) or words themselves. The model for each language contains the probability of each known token given the language, expressed as three values: the basic probability, and the lower and ulSper limits ISome ideas related to the use of confidence limits can also be found in Dagan et al. (1991), applied in a different area.</Paragraph> <Paragraph position="4"> of a range containing this probability for a specific level of confidence. We will denote these by pB(tll), pL(tll), pH(tll), for base, low and high values. The probability that a token which has never been seen before is in a language is also present in the model of the language. In addition, there is a language independent model, containing the p(t) values. No confidence range is used for them, although this would be a simple extension of the technique.</Paragraph> <Paragraph position="5"> The algorithm proceeds by processing tokens, building up evidence about each language in three accumulators. The accumulators represent the overall probability of the language given the entire stream of tokens to date, again as base, low and high values, denoted as(1), aL(1), all(l). They are set to zero at the start of processing, and the logarithms of the probabilities are added to them as each token is processed. By taking logarithms of probabilities, we are in effect measuring the amount of evidence for each language, expressed as information content. From a practical point of view, using logarithms also helps keep all the values in a reasonable range and so avoids numerical underflow.</Paragraph> <Paragraph position="6"> After processing each token, two tests are applied. Firstly, we examine the base accumulator for the language which has the highest accumulated total, and test whether it is greater than a fixed threshold, called the activation threshold. If it is, then we conclude that enough information has been accumulated to try to make a decision. The low value for this language a (l) is then compared against the high value aff(l') for the next best language l', and if aL(l) exceeds aH(l') language l is output and the algorithm halts. 
<Section position="1" start_page="95" end_page="95" type="sub_section"> <SectionTitle> 2.2 Training the model </SectionTitle> <Paragraph position="0"> The model is trained using a collection of corpora for which the correct language is known.</Paragraph> <Paragraph position="1"> For a given language l and token t, let f(t,l) be the count of the token in that language and f(l) be the total count of all tokens in that language.</Paragraph> <Paragraph position="2"> f(t) is the count of the token t across all the languages, and F the count of all tokens across all languages. The probability of the token occurring in the language, p(t|l), is then calculated by assuming that the probabilities follow a binomial distribution. The idea here is that token occurrences are binary &quot;events&quot; which are either the given token t or are not. For large f(t,l), the underlying probability can be calculated by using the normal approximation to the binomial, giving the base probability pB(t|l) = f(t,l) / f(l). The standard deviation of this quantity is σ(t,l) = sqrt(f(l) pB(t|l) (1 - pB(t|l))). The low and high probabilities are found by taking a given number of standard deviations d from the base probability.</Paragraph> <Paragraph position="4"> In the evaluation below, d was set to 2, giving 95% confidence limits.</Paragraph> <Paragraph position="5"> For lower values of f(t,l), the calculation of the low and high probabilities can be made more exact by substituting them for the base probability in the calculation of the standard deviation, giving</Paragraph> <Paragraph position="6"> f(t,l) = f(l) pL(t|l) + d sqrt(f(l) pL(t|l) (1 - pL(t|l))) and f(t,l) = f(l) pH(t|l) - d sqrt(f(l) pH(t|l) (1 - pH(t|l))) </Paragraph> <Paragraph position="7"> Approximating 1 - pL(t|l) and 1 - pH(t|l) to 1 on the grounds that the probabilities are small, and solving the equations gives</Paragraph> <Paragraph position="8"> pL(t|l) = (sqrt(d^2 + 4 f(t,l)) - d)^2 / (4 f(l)) and pH(t|l) = (sqrt(d^2 + 4 f(t,l)) + d)^2 / (4 f(l)) </Paragraph> <Paragraph position="9"> The calculation requires marginally more computational effort than the first case, and in practice we use it for all but very large values of f(t,l), where the approximation of 1 - pL(t|l) and 1 - pH(t|l) to 1 would break down.</Paragraph>
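The following is a minimal sketch of the two normal-approximation cases just described, assuming d = 2 as in the evaluation. The function name and the large_count cut-off are illustrative choices rather than values from the paper, and the quadratic forms for the refined bounds are simply one reading of the substitution described above (solve f(t,l) = f(l) p ± d sqrt(f(l) p) for p). The exact binomial treatment for very small counts, described next, is omitted from the sketch.

```python
import math

def token_probability_range(count_tl, total_l, d=2.0, large_count=1000):
    """Estimate (base, low, high) probabilities for one token in one language.

    count_tl = f(t,l): count of the token in the language's training data.
    total_l  = f(l):   total token count for that language.
    The large_count cut-off is an assumed, illustrative threshold.
    """
    p_base = count_tl / total_l

    if count_tl >= large_count:
        # Plain normal approximation: d standard deviations of the
        # proportion either side of the base probability.
        sd = math.sqrt(p_base * (1.0 - p_base) / total_l)
        p_low = max(p_base - d * sd, 0.0)
        p_high = min(p_base + d * sd, 1.0)
    else:
        # Refined version: substitute the unknown low/high probability into
        # the standard deviation and, with (1 - p) approximated by 1,
        # solve the resulting quadratic for each bound.
        root = math.sqrt(d * d + 4.0 * count_tl)
        p_low = ((root - d) / 2.0) ** 2 / total_l
        p_high = ((root + d) / 2.0) ** 2 / total_l
        # Very small counts (say fewer than 10 occurrences) would instead
        # need the exact binomial treatment described in the text.

    return p_base, p_low, p_high
```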
<Paragraph position="10"> For very small values of f(t,l), say less than 10, the normal approximation is not good enough, and we calculate the probabilities by reference to the binomial equation for the probability of m (= f(t,l)) successes in n (= f(l)) trials:</Paragraph> <Paragraph position="11"> p(m) = C(n, m) p^m (1 - p)^(n - m), where C(n, m) is the binomial coefficient. </Paragraph> <Paragraph position="12"> p is the underlying probability of the distribution, and this is what we are after. By choosing values for p(m) and solving to find p we can obtain a given confidence range. To obtain a 95% interval, p(m) is set to 0.025, 0.5 and 0.975, yielding pL(t|l), pB(t|l), and pH(t|l), respectively. In fact, this is not exactly how the probability ranges for low frequency items should be calculated: instead the cumulative distribution function should be calculated and the range estimated from it. (Thanks to one of the referees for pointing this out.) For the present purposes, the low frequency items do not make much of a contribution to the overall success rate, and so the approximation is unimportant. However, if similar techniques were applied to problems with sparser data, then the procedure here would have to be revised.</Paragraph> <Paragraph position="13"> Finally, we need a probability for tokens which were not seen in the training data, called the zero probability, for which we set m = 0 in the above equation, giving</Paragraph> <Paragraph position="14"> p(0) = (1 - p)^n </Paragraph> <Paragraph position="15"> It is not clear what it means to have a confidence measure here, and so we use a single value for base, low and high probabilities, obtained by setting p(m) to 0.95.</Paragraph> <Paragraph position="16"> Similar calculations using f(t) in place of f(t,l) and F in place of f(l) give the a priori token probabilities p(t). As already noted, base, low and high values could have been calculated in this case, but as a minor simplification, we use only the base probability.</Paragraph> </Section> </Section> </Paper>