<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0709">
  <Title>Overfitting Avoidance for Stochastic Modeling of Attribute-Value Grammars</Title>
  <Section position="4" start_page="49" end_page="49" type="metho">
    <SectionTitle>
2 Maximum entropy-based parse selection
</SectionTitle>
    <Paragraph position="0"> selection The task of parse selection involves selecting the best possible parse for a sentence from a set of possible parses produced by an AVG. In the present approach, parses are ranked according to their goodness by a statistical model built using the maximum entropy technique, which involves building a distribution over events which is the most uniform possible, given constraints derived from training data. Events are composed of features, the fundamental statistical units whose distribution is modeled. The model is characterized by constraints upon the distributions of features, derived from the features' empirical frequencies. An untrained (thus unconstrained) max ent model is by definition characterized by the uniform distribution. The constraints which characterize the model are expressed as weights on individual features.</Paragraph>
    <Paragraph position="1"> Training the model involves deriving the best weights from the training data by means of an algorithm such as Improved Iterative Scaling (IIS) (Della Pietra et al., 1995).</Paragraph>
    <Paragraph position="2"> IIS assigns weights to features which reflect their distribution and significance. With each iteration, these weights reflect the empirical distribution of the features in the training data with increasing accuracy. In ideal circumstances, where the distribution of features in the training data accurately represents the true probability of the features, the performance of the model should increase asymptotically with each iteration of training until it eventually converges. If the training data is corrupt, or noisy, or if it contains features which are too sparsely distributed to accurately represent their probability, then overfitting arises.</Paragraph>
    <Section position="1" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
2.1 The structure of the features
</SectionTitle>
      <Paragraph position="0"> The statistical features used for parse selection should contain information pertinent to sentence structure, as it is the information encoded in these features which will be brought to bear in prefering one parse over another. Information regarding constituent heads, POS tags, and lexical information is pertinent, as is information on constituent ordering and other grammatical information present in the data. Most or all of these factors are considered in some form or another by current state-of-the-art statistical parsers such as those of Charniak (1997), Magerman (1995) and Collins (1996).</Paragraph>
      <Paragraph position="1"> In the present approach, each feature in the feature set corresponds to a depth-one tree structure in the data, i.e. a mother node and all of its daughters. Within this general structure various schemata may be used to derive actual features, where the information about each node employed in the feature is determined by which schema is used. For example, one schema might call for POS information from all nodes and lexical information only from head nodes.</Paragraph>
      <Paragraph position="2"> Another might call for lexical information only from nodes which also contain the POS tag for prepositions. The term compositional is used in this context to describe features built up according to some such schema from basic linguistic elements such as these. Thus each compositional feature is an ordered sequence of elements, where the order reflects the position in the tree of the elements. Instantiations of these schemata in the data are used as the statistical features. The first step is to run a given schema over the data, collecting a set of features. The next step is to characterize all events in the data in terms of those features.</Paragraph>
      <Paragraph position="3"> This general structure for features allows considerable versatility; models of widely varying quality may be constructed. This structure for statistical features might be compared with the Data-Oriented Parsing (DOP) of Bod (1998) in that it considers subtrees of parses as the structural units from which statistical information is taken. The present approach differs sharply from DOP in that its trees are limited to a depth of one node below the mother and, more importantly, in the fact that the maximum entropy framework allows modeling without the independence assumptions made in DOP.</Paragraph>
      <Paragraph position="4"> Since maximum entropy allows for overlapping information sources, features derived using different schemata (that is, collecting different pieces of node-specific information) may be collected from the same subtrees, and used simultaneously in a single model.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="49" end_page="50" type="metho">
    <SectionTitle>
3 Feature merging and overfitting reduction
</SectionTitle>
    <Paragraph position="0"> reduction The idea behind feature merging is to reduce overfitting through changes made directly to the model. This is done by combining highly  top two features are merged in the form of the bottom feature, where the lexical elements have been replaced by their disjunction. The merged feature represents the union of the sets of tokens described by the unmerged feature types. All instances of the original two features would now be replaced in the data by the merged feature. specific features which occur rarely to produce more general features which occur more often, resulting in fewer total features used. Even if the events are not noisy or inaccurate in actual fact, they may still contribute to overfitting if their features occur too infrequently in the data to give accurate frequencies. The merging procedure seeks to address overfitting at the level of the features themselves and remain true to the spirit of the maximum entropy approach, which seeks to represent what is unknown about the data with uniformity of the distribution, rather than by making adjustments on the model distribution itself, such as the Gaussian prior of Osborne (2000).</Paragraph>
    <Paragraph position="1"> Each feature, as described above, is made up of discrete elements, which may include such objects as lexical items, POS tags, and grammatical attribute information, depending on the schema being used. The rarity of the feature in the data is largely--although not entirely-determined by the rarity of elements within it. In the present merging scheme, a set of elements is collected whose empirical frequencies are below some predetermined cutoff point. Note that the use of the term &amp;quot;cutoff&amp;quot; here refers to the empirical frequency of elements of features rather than of features themselves, as in Ratnaparkhi (1998). All features containing elements in this set will be altered such that the cutoff element is replaced by a uniform disjunctive element, effectively merging all similarly structured features into one, with the disparate elements replaced by the disjunctive element. An example may be seen in figure 1, where the union of the two features at top of the figure is represented as the feature below them.</Paragraph>
    <Paragraph position="2"> The merged elements in this case are the lexical items offered and allow. Such a merge would take place on the condition that the empirical frequencies of both elements are below a certain cutoff point. If so, the elements are replaced by a new element representing the disjunction of the original elements, creating a single feature.</Paragraph>
    <Paragraph position="3"> This feature then replaces all instances of both of the original features. If both of the original features appear once each together in an event, then two instances of the merged feature will appear in that event in the new model.</Paragraph>
  </Section>
  <Section position="6" start_page="50" end_page="51" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> The experiments described here were conducted using the Wall Street Journal Penn Treebank corpus (Marcus et al., 1993). The grammar used was a manually written broad coverage DCG style grammar (Briscoe and Carroll, 1997). Parses of WSJ sentences produced by the grammar were ranked empirically using the treebank parse as a gold standard according to a weighted linear combination of crossing brackets, precision, and recall. If more than fifty parses were produced for a sentence, the  best fifty were used and the rest discarded. For the training data, the empirical rankings of all parses for each sentence were normalized so the total parse scores for each sentence added to a constant. The events of the training data consisted of parses and their corresponding normalized score. These scores were furthermore treated as frequencies. Thus, high ranked parses would be treated as events occurring more frequently in the training data, and low ranked parses would be treated as occurring rarely.</Paragraph>
    <Paragraph position="1"> The features of the unmerged model consisted of depth-one trees carrying node information according to the following schema: the POS tag of the mother, POS tags of all daughters ordered left to right, HEAD+ information for the head daughter, and lexical information for all daughters carrying a verbal or prepositional POS tag. The features themselves were culled using this schema on 2290 sentences from the training data. The feature set consisted of 38,056 features in total, of which 6561 were active in the model (assigned non-zero weights) after the final iteration of IIS. Two models using this feature set were trained, one on only 498 training sentences, a subset of the 2290 sentences used to collect the features, and the other on nearly ten times that number, 4600 training sentences, a superset of the same set of sentences.</Paragraph>
    <Paragraph position="2"> Several merged models were made based on each of these unmerged models, using various cutoff numbers. Cutoffs were set at empirical frequencies of 100, 500, 1000, 1250, and 1500 elements. For each model merge, all elements which occurred in the training data fewer times than the cutoff number were replaced in each feature they appeared in by the uniform disjunctive element, and the merged features then took the place of the unmerged features.</Paragraph>
    <Paragraph position="3"> Iterative scaling was performed for 150 iterations on each model. This number was chosen arbitrarily as a generous but not gratuitous number of iterations, allowing general trends to be observed.</Paragraph>
    <Paragraph position="4"> The models were tested on approximately 5,000 unseen sentences from other parts of the corpus. The performance of each model was measured at each iteration by binary best match. The model chose a single top parse and if this parse's empirical rank was the highest (or equal to the highest) of all the parses for the sentence, the model was awarded a point for the match, otherwise the model was awarded zero.</Paragraph>
    <Paragraph position="5"> The performance rating reflects the percentage of times that the model chose the best parse of all possible parses, averaged over all test sentences. null</Paragraph>
  </Section>
class="xml-element"></Paper>