<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0905"> <Title>Front Back Consistency Overgeneration</Title> <Section position="5" start_page="37" end_page="39" type="metho"> <SectionTitle> 3 Learning Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="37" end_page="38" type="sub_section"> <SectionTitle> 3.1 Background The Grammatical Inference Problem </SectionTitle> <Paragraph position="0"> Generally, the problem considered here is that of identifying a language L from a fixed finite sample D = (D+,D-), where D + C L and D- NL = 0 (D- may be empty). If D- is empty, and D + is structurally complete with regard to L, the problem is not complex, and there exist a number of reliable inference algorithms. If D + is an arbitrary strict subset of L, the problem is less clearly defined. Since any finite sample is consistent with an infinite number of languages, L cannot be identified uniquely from D +. &quot;...the best we can hope to do is to infer a grammar that will describe the strings in D + and predict other strings that in some sense are of the same nature as those contained in D +'', (Fu and Booth, 1986, p.345).</Paragraph> <Paragraph position="1"> To constrain the set of possible languages L, the inferred grammar is typically required to be as small as possible, in accordance with a more general principle of machine learning which holds that a solution should be the shortest or the most economical description consistent with all examples, as e.g. suggested in Michalski (1983). However, the problem of finding a minimal grammar consistent with a given sample D was shown to be NP-hard by Gold (1978). Li & Vazirani (1988), Kearns ~ Valiant (1989) and Pitt & Warmuth (1993) have added nonapproximability results of varying strength. In the special case where D contains ail strings of symbols over a finite alphabet I of length shorter than k, a polynomial-time algorithm can be found (Trakhtenbrot and Barzdin, 1973), but if even a small fraction of examples is missing, the problem is again NP-hard (Angluin, 1978).</Paragraph> <Paragraph position="2"> Genetic Search Given the nature of the inference problem, a search algorithm is the obvious choice. Genetic Algorithms (GAs) are particularly suitable because search spaces tend to be large, discontinuous, multimodal and highdimensional. The power of GAs as general-purpose search techniques derives partly from their ability to efficiently and quickly search large solution spaces, and from their robustness and ability to approximate good solutions even in the presence of discontinuity, multimodaiity, noise and highdimensionality in the search space. The most crucial difference to other general-purpose search and optimisation techniques is that GAs sample different areas of the search space simultaneously and are therefore able to escape local optima, and to avoid poor solution areas in the search space altogether.</Paragraph> <Paragraph position="3"> Related Research A number of results have been reported for inference of regular and context-free grammars with evolutionary techniques, e.g. by Zhou & Grefenstette (1986), Kammeyer & Belew (1996), Lucas (1994), Dupont (1994), Wyard (1989) and (1991). Results concerning the inference of stochastic grammars with genetic algorithms have been described by Schwehm & Ost (1995) and Keller Lutz (1997a) and (1997b) describe. Much of this research bases inference on both negative and positive examples, and no real linguistic data sets have been used. 
<Paragraph position="2"> The remainder of this section outlines the fitness function (corresponding to the evaluation function common to all search techniques), and describes how generalisation over the training set is achieved. Full details of the GA can be found in Belz & Eskikaya (1998).</Paragraph>
<Paragraph position="3"> Fitness Evaluation The fitness of automata is evaluated according to three fitness criteria that assess (C1) consistency of the language covered by the automaton with the data sample, (C2) smallness, and (C3) generalisation to a superset of the data sample. For the evaluation of C1, the number of strings in the data sample that a given automaton parses is counted.</Paragraph>
<Paragraph position="4"> Partial parsing of prefixes of strings is also rewarded, because the acquisition of whole strings by the automata would otherwise be a matter of chance. Size (C2) is assessed in terms of the number of states, the reward being higher the fewer states an FSA has. This criterion serves as an additional pressure on automata to have few states, although the number of states and, more explicitly, the number of transitions is already kept low by crossover and mutation. Generalisation (C3) is directly assessed only in terms of the size of the language covered by an automaton, where the reward is higher the closer the language size is to a specified target size (expressing a given degree of generalisation).</Paragraph>
<Paragraph position="5"> When the goodness of a candidate solution to a problem, or the fitness of an individual, is most naturally expressed in terms of several criteria, the question arises of how to combine these criteria into a single fitness value, or, alternatively, how to compare several individuals according to several criteria, in a way that accurately reflects the goodness of a candidate solution. In the present context, trial runs showed that the structural and functional properties of solution automata are very directly affected by each of the three fitness criteria described above.</Paragraph>
<Paragraph position="6"> Therefore, it was most natural to normalise the three criteria to make up one third of the fitness value each, but to attach weights to them which can be manipulated (increased and decreased) to affect the structural and functional characteristics of the resulting automata.</Paragraph>
<Paragraph position="7"> Raising the weight on a fitness criterion (increasing its importance relative to the other criteria) has very predictable effects, in that the criterion with the highest weight is most reliably satisfied. Lowering the weight on C3 towards 0 has the result that language size becomes unpredictable, while lowering the weight on C2 simply increases the average size of the resulting automata. The weight on C1 tends to have to be increased with increasing sample size.</Paragraph>
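As an illustration of how such a weighted combination might be computed, the sketch below normalises each criterion to [0, 1] and lets each contribute up to one third of the fitness value. The FSA representation, the finite approximation of language size, and the exact normalisations for C2 and C3 are assumptions made for the sake of a self-contained example; they are not the paper's formulas.

```python
from itertools import product

# Illustrative deterministic FSA representation (not the paper's genotype
# encoding): `transitions` maps (state, symbol) -> state; state 0 is the start.

def parsed_prefix_length(transitions, string):
    """Length of the longest prefix of `string` the FSA can follow."""
    state, length = 0, 0
    for symbol in string:
        if (state, symbol) not in transitions:
            break
        state = transitions[(state, symbol)]
        length += 1
    return length

def accepts(transitions, accepting, string):
    if parsed_prefix_length(transitions, string) < len(string):
        return False
    state = 0
    for symbol in string:
        state = transitions[(state, symbol)]
    return state in accepting

def num_states(transitions, accepting):
    states = {0} | set(accepting)
    for (source, _), target in transitions.items():
        states |= {source, target}
    return len(states)

def language_size(transitions, accepting, alphabet, max_len):
    """Number of accepted strings up to length max_len (finite approximation)."""
    return sum(accepts(transitions, accepting, w)
               for n in range(max_len + 1)
               for w in product(alphabet, repeat=n))

def fitness(transitions, accepting, sample, alphabet, target_size,
            max_len=6, weights=(1.0, 1.0, 1.0)):
    """Weighted combination of the normalised criteria C1-C3."""
    w1, w2, w3 = weights
    # C1: consistency with the sample, with partial credit for parsed prefixes
    c1 = sum(1.0 if accepts(transitions, accepting, s)
             else parsed_prefix_length(transitions, s) / max(len(s), 1)
             for s in sample) / len(sample)
    # C2: smallness -- the fewer states, the higher the reward
    c2 = 1.0 / num_states(transitions, accepting)
    # C3: generalisation -- reward is highest when language size matches the target
    size = language_size(transitions, accepting, alphabet, max_len)
    c3 = min(size, target_size) / max(size, target_size) if size else 0.0
    # each normalised, weighted criterion makes up (at most) one third of the fitness
    return (w1 * c1 + w2 * c2 + w3 * c3) / 3.0
```

With this formulation, raising or lowering one of the weights changes the relative importance of the corresponding criterion, with the effects described in the preceding paragraph.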
<Paragraph position="8"> Generalisation There are two main parameters that influence the degree of generalisation a given population achieves: the fitness criteria of size (C2) and degree of overgeneration (C3).</Paragraph>
<Paragraph position="9"> C2 encourages automata to be as small as possible, which -- in the limit -- leads to universal automata that parse all strings x ∈ I*. This is counterbalanced by C3, which limits the number of strings not in the training set which automata are permitted to overgenerate. To control the quality of generalisation, transitions that are not used by any member of the training set are eliminated, because automata would otherwise accept arbitrary strings in addition to training set members to make up the required target language size.</Paragraph>
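The elimination of unused transitions could be implemented roughly as follows, again under the illustrative FSA representation used in the fitness sketch above (a sketch only, not the paper's genotype-level operator):

```python
def prune_unused_transitions(transitions, training_set):
    """Remove transitions that no training-set string traverses."""
    used = set()
    for string in training_set:
        state = 0
        for symbol in string:
            if (state, symbol) not in transitions:
                break                      # remainder of the string is not parsed
            used.add((state, symbol))
            state = transitions[(state, symbol)]
    # keep only transitions actually exercised by the training set
    return {key: target for key, target in transitions.items() if key in used}
```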
<Paragraph position="10"> The overall effect is that a range of generalisation can be achieved over the training set, from precise training set coverage towards universal automata, while meaningless overgeneration of strings is avoided. When L(A) = training set, only symbols a ∈ I with identical distributions in the data set can be grouped together on the same transition between two states. As the required degree of generalisation increases, symbols with the most similar distributions are grouped together first, followed by less similar ones.</Paragraph>
<Paragraph position="11"> Figure 2 shows an example of what effects can be achieved in the limit. The bottom diagram is part of the best automaton discovered for the second half of the German reduced syllable set, shown in Figure 4. Here, the degree of overgeneration was set to 1 (i.e. L(A) = training set), and the size criterion C2 had a small weight. This resulted in generalisation being completely absent, i.e. the automaton generates only nasal/consonant combinations that actually occur in the data set.</Paragraph>
<Paragraph position="12"> The top diagram in Figure 2 shows the effect of having a large weight on the size criterion, and increasing the target language size. The nasals were consistently grouped together under these circumstances, because there is a higher degree of distributional similarity (in terms of the sets of phonemes that can follow) between m, n, N than between these and other phonemes.</Paragraph>
<Paragraph position="13"> This achieves the effect that strings not in the data set can be generated in a linguistically useful way, but may also have the side-effect that rarer phoneme combinations (m[p:f], n['ts], etc.) are not acquired, an effect that is described in Belz (1998).</Paragraph> </Section> </Section> </Paper>