<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2151">
<Title>Handling Sparse Data by Successive Abstraction</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Sparse data is a perennial problem when applying statistical techniques to natural language processing. The fundamental problem is that there is often not enough data to estimate the required statistical parameters, i.e., the probabilities, directly from the relative frequencies. This problem is accentuated by the fact that, in the search for more accurate probabilistic language models, more and more contextual information is added, resulting in more and more complex conditionings of the corresponding conditional probabilities. This in turn means that the number of observations tends to be quite small for such contexts. Over the years, a number of techniques have been proposed to handle this problem.</Paragraph>
<Paragraph position="1"> One of the two main ideas behind these techniques is that complex contexts can be generalized, and data from more general contexts can be used to improve the probability estimates for more specific contexts. This idea is usually referred to as back-off smoothing, see (Katz 1987).</Paragraph>
<Paragraph position="2"> These techniques typically require that a separate portion of the training data be held out from the parameter-estimation phase and saved for determining appropriate back-off weights. Furthermore, determining the back-off weights usually requires resorting to a time-consuming iterative reestimation procedure. A typical example of such a technique is &quot;deleted interpolation&quot;, which is described in Section 5.1 below.</Paragraph>
<Paragraph position="3"> The other main idea is concerned with improving the estimates of low-frequency, or no-frequency, outcomes, apparently without trying to generalize the conditionings.
Instead, these techniques are based on considerations of how population frequencies in general tend to behave. Examples of this are expected likelihood estimation (ELE), see Section 5.2 below, and Good-Turing estimation, see (Good 1953).</Paragraph>
<Paragraph position="4"> We will here derive from first principles a practical method for handling sparse data that does not require separate training data for determining the back-off weights and that lends itself to direct calculation, thus avoiding time-consuming reestimation procedures.</Paragraph>
</Section>
</Paper>
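To make the second idea concrete, here is a minimal, hypothetical Python sketch of expected likelihood estimation (the ELE method mentioned above) for a unigram model: every count, including those of unseen outcomes, is incremented by 0.5 before normalizing, so zero-frequency events receive a small nonzero probability. The toy corpus, the unseen word, and the function name are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of expected likelihood estimation (ELE):
# add 0.5 to every count, including unseen vocabulary items,
# then renormalize so the probabilities sum to one.
from collections import Counter

def ele(counts, vocab, add=0.5):
    """ELE-smoothed unigram probabilities over a fixed vocabulary."""
    total = sum(counts.values()) + add * len(vocab)
    return {w: (counts.get(w, 0) + add) / total for w in vocab}

tokens = "the cat sat on the mat".split()   # toy corpus (assumed)
counts = Counter(tokens)
vocab = set(tokens) | {"dog"}               # "dog" is an unseen word
p = ele(counts, vocab)
# Unseen "dog" now has nonzero probability, unlike under raw
# relative frequencies, and the distribution still sums to 1.
```

Contrast this with back-off smoothing: ELE adjusts the counts within a single context, whereas back-off borrows probability mass from estimates conditioned on more general contexts.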