<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0709">
<Title>Overfitting Avoidance for Stochastic Modeling of Attribute-Value Grammars</Title>
<Section position="3" start_page="0" end_page="49" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">The maximum entropy technique of statistical modeling using random fields has proved to be an effective way of dealing with a variety of linguistic phenomena, in particular where modeling of attribute-value grammars (AVGs) is concerned (Abney, 1997). This is largely because its capacity for considering overlapping information sources allows the most to be made of situations where data is sparse. Nevertheless, it is important that the statistical features employed be appropriate to the job. If the information contributed by the features is too specific to the training data, overfitting becomes a problem (Chen and Rosenfeld, 1999; Osborne, 2000). In this event, model performance peaks early, and continued training yields progressive deterioration in performance. From a theoretical standpoint, overfitting indicates that the model distribution is unrepresentative of the actual probabilities. In practice, it makes the performance of the model dependent upon early stopping of training, and the point at which training must be stopped is not always reliably predictable.</Paragraph>
<Paragraph position="1">This paper describes an approach to feature selection for maximum entropy models which reduces the effects of overfitting. Candidate features are built up from basic grammatical elements found in the corpus. This "compositional" quality of the features is exploited for the purpose of overfitting reduction by means of feature merging. In this process, features which are similar to each other, save for certain elements, are merged; i.e., their disjunction is considered as a feature in itself, thus reducing the number of features in the model.</Paragraph>
<Paragraph position="2">The motivation behind this methodology is similar to that of Kohavi and John (1997), but rather than seeking a proper subset of the candidate feature set, the merging procedure attempts to compress the feature set, diminishing both noise and redundancy. The method differs from a simple feature cutoff, such as that described in Ratnaparkhi (1998), in that the feature cutoff eliminates statistical features directly, whereas the merging procedure attempts to generalize them. The method employed here also derives inspiration from the notion of Bayesian model merging introduced by Stolcke and Omohundro (1994).</Paragraph>
<Paragraph position="3">Section 2 describes parse selection and discusses the "compositional" statistical features employed in a maximum entropy approach to the task. Section 3 introduces the notion of feature merging and discusses its relationship with overfitting reduction. Sections 4 and 5 describe the experimental models built and the results of merging on their performance. Finally, section 6 sums up briefly and indicates some further directions for inquiry on the subject.</Paragraph>
</Section>
</Paper>
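Editor's note: as a rough illustration of the merging idea sketched in the introduction (features that agree everywhere except at certain elements are replaced by their disjunction), the following Python fragment groups tuple-shaped features that differ only at one element position and collapses each group into a single disjunctive feature. The tuple representation, the wildcard_position parameter, and the helper name are hypothetical and not taken from the paper; the paper's actual feature representation and merging criteria are described in its later sections.

```python
from collections import defaultdict

def merge_similar_features(features, wildcard_position):
    """Illustrative sketch: merge features that are identical except at
    one element position by replacing each such group with a single
    disjunctive feature (assumed representation, not the paper's)."""
    groups = defaultdict(list)
    for feat in features:
        # Key out the position that is allowed to vary.
        key = feat[:wildcard_position] + ("*",) + feat[wildcard_position + 1:]
        groups[key].append(feat)

    merged = []
    for members in groups.values():
        if len(members) > 1:
            # The merged feature stands for the disjunction of its members,
            # so the model has one parameter where it previously had several.
            merged.append(("OR",) + tuple(members))
        else:
            merged.append(members[0])
    return merged

features = [("VP", "V", "NP"), ("VP", "V", "PP"), ("NP", "Det", "N")]
print(merge_similar_features(features, wildcard_position=2))
# [('OR', ('VP', 'V', 'NP'), ('VP', 'V', 'PP')), ('NP', 'Det', 'N')]
```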