<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3207">
  <Title>Richness of the Base and Probabilistic Unsupervised Learning in Optimality Theory</Title>
  <Section position="3" start_page="50" end_page="51" type="metho">
    <SectionTitle>
2 Learning Probabilistic OT
</SectionTitle>
    <Paragraph position="0"> While the primary task of the grammar is to map underlying forms to overt forms, the grammar's secondary role is that of a filter - ruling out ungrammatical forms no matter what underlying form is fed to the grammar. The role of the grammar as filter follows from the OT principle of Richness of the Base, according to which the set of possible underlying forms is universal (Prince and Smolensky 1993). In other words, the grammar must be restrictive and not over-generate. The requirement that grammars be restrictive complicates the learning problem - it is not sufficient to find a combination of underlying forms and constraint ranking that yields the set of observed surface forms: the constraint ranking must yield only grammatical forms irrespective of the particular lexical items selected for the language.</Paragraph>
    <Paragraph position="1"> In classic OT, constraint ranking is categorical and non-probabilistic. In recent years various stochastic versions of OT have been proposed to account for free variation (Boersma and Hayes, 2001), lexically conditioned variation (Anttila, 1997), child language acquisition (Legendre et al., 2002) and the modeling of frequencies associated with these phenomena. In addition to these advantages, probabilistic versions of OT are advantageous from the point of view of learnability. In particular, the Gradual Learning Algorithm for Stochastic OT (Boersma, 1997, 1998; Boersma and Hayes, 2001) is capable of learning in spite of noisy training data and is capable of learning variable grammars in a supervised fashion. In addition, probabilistic versions of OT and variants of OT (Goldwater and Johnson, 2003; Rosenbach and Jaeger, 2003) enable learning of OT via likelihood maximization, for which there exist many established algorithms. Furthermore, as this paper proposes, unsupervised learning of OT using likelihood maximization combined with Richness of the Base provides a natural solution to the grammar-as-filter problem due to the power of probabilistic modeling to use negative evidence implicitly.</Paragraph>
    <Paragraph position="2"> The algorithm proposed here relies on a probabilistic extension of OT in which each possible constraint ranking is assigned a probability P(r).</Paragraph>
    <Paragraph position="3"> Thus, the OT grammar is a probability distribution over constraint rankings rather than a single constraint ranking. This notion of probabilistic OT is similar to - but less restricted than - Stochastic OT, in which the distribution over possible rankings is given by the joint probability over independently normally distributed constraints with fixed, equal variance. The advantage of the present model is computational simplicity, but the proposed learning algorithm does not depend on any particular instantiation of probabilistic OT.</Paragraph>
    <Paragraph position="4"> Tables 1 and 2 illustrate the proposed probabilistic version of OT with an abstract example. Table 1 shows the violation marks assigned by three constraints, A, B and C, to five candidate outputs O1-O5 for the underlying form, or input /I/. To compute the winner of an optimization, constraints are applied to the candidate set in order according to their rank. Candidates continue to the next constraint if they have the fewest (or tie for fewest) constraint violation marks (indicated by asterisks).</Paragraph>
    <Paragraph position="5"> In this way the winning or optimal candidate, the  candidate that violates the higher-ranked constraints the least, is selected.</Paragraph>
    <Paragraph position="6"> constraints input: /I/ A B C</Paragraph>
    <Paragraph position="8"> The third column of Table 2 identifies the winner under each possible ranking of the three constraints. For example, if the ranking is A &gt;&gt; B &gt;&gt; C, constraint A eliminates all but O3 and O4, then constraint B eliminates O3, designating O4 as the winner. The remainder of Table 2 illustrates the proposed probabilistic instantiation of OT. The first column shows the probability P(r) that the grammar assigns to each ranking in this example.</Paragraph>
    <Paragraph position="9"> The probability of each ranking determines the probability with which the winner under that ranking will be selected for the given input. In other words, it defines the conditional probability Pr(Ok | I), shown in the fourth column, of the kth output candidate given the input /I/ under the ranking r.</Paragraph>
    <Paragraph position="10"> The last column shows the total conditional probability for each candidate after summing across rankings. For instance, O3 is the winner under two of the rankings, and thus its total conditional probability P(O3  |I) is found by summing over the conditional probabilities under each ranking. The total conditional probability P(O3  |I) refers to the probability that underlying form /I/ will surface as O3, and this probability depends on the grammar.</Paragraph>
    <Paragraph position="11">  In addition to the conditional probability assigned by the grammar, this model relies on a probability distribution P(I  |M) over possible underlying forms for a given morpheme M. This property of the model implements the standard linguistic proposition that each morpheme has a consistent underlying form across contexts, while the grammar drives allomorphic variation that may result in the morpheme having different surface realizations in different contexts. Rather than identifying a single underlying form for each morpheme, this model represents the underlying form as a distribution over possible underlying forms, and this distribution is constant across contexts. To determine the probability of an underlying form for a morphologically complex word, the product of the morpheme's individual distributions is taken the probability of an underlying form is taken to be independent of morphological context. For example, suppose that some morpheme Mk has two possible underlying forms, I1 and I2, and the two underlying forms are equally likely. This means that the conditional probabilities of both underlying forms are 50%: P(I1  |Mk) = P(I2  |Mk) = 50%. In sum, the probabilistic model described here consists of a grammar and lexicon, both of which are probabilistic. The task of learning involves selecting the appropriate parameter settings of both the grammar and lexicon simultaneously.</Paragraph>
  </Section>
  <Section position="4" start_page="51" end_page="54" type="metho">
    <SectionTitle>
3 Expectation Maximization and Richness of the Base in OT
</SectionTitle>
    <Paragraph position="0"> This section presents the details of the learning algorithm for probabilistic OT. First, in Section 3.1 the objective function and its properties are discussed. Next, Section 3.2 proposes the solution to the grammar-as-filter problem, which involves restricting the search space available to the learning algorithm. Finally, Section 3.3 describes the likelihood maximization algorithm - the input to the algorithm, the initial state, and the form of the solution.</Paragraph>
    <Section position="2" start_page="51" end_page="52" type="sub_section">
      <SectionTitle>
3.1 The Objective Function
</SectionTitle>
      <Paragraph position="0"> The learning algorithm relies on the following objective function:</Paragraph>
      <Paragraph position="2"> The likelihood of the data, or set of overt surface forms, PH(O  |M) depends on the parameter settings, the probability distributions over rankings and underlying forms, under the hypothesis H. It is also conditional on M, the set of observed morphemes, which are annotated in the data provided to the algorithm. M is constant, however, and does not differ between hypotheses for the same data set. Under this model each unique surface form Ok is treated independently, and the likelihood of the data is simply the product of the probability of each surface form, raised to the power corresponding to its observed frequency Fk. Each surface form Ok is composed of a set of morphemes Mk, and each of these morphemes has a set of underlying forms Ik,j. The probability of each surface form PH(Ok  |Mk) is found by summing the joint distribution PH(Ok &amp; Ik,j  |Mk) over all possible underlying forms Ik,J for morphemes Mk that compose Ok.</Paragraph>
      <Paragraph position="3"> Finally, the joint probability is simply the product of the conditional probability PH(Ok  |Ik,j) and lexical probability PH(IK,j  |Mk), both of which were defined in the previous section.</Paragraph>
      <Paragraph position="4"> The primary property of this objective function is that it is maximal only when the hypothesis generates the observed data with high probability. In other words, the grammar must map the selected lexicon onto observed surface forms without wasting probability mass on unobserved forms. Because there are two parameters in the model, this can be accomplished by adjusting the ranking distributions or by adjusting lexicon distributions.</Paragraph>
      <Paragraph position="5"> The probability model itself does not specify whether the grammar or the lexicon should be adjusted in order to maximize the objective function.</Paragraph>
      <Paragraph position="6"> In other words, the objective function is indifferent to whether the restrictions observed in the language are accounted for by having a restrictive grammar or by selecting a restrictive lexicon. As discussed in Section 2, according to Richness of the Base, only the first option is available in OT: the grammar must be restrictive and must neutralize noncontrastive distinctions in the language. The next subsection addresses the proposed solution - a restriction of the search procedure that favors maximizing probability by restricting the grammar rather than the lexicon.</Paragraph>
    </Section>
    <Section position="3" start_page="52" end_page="53" type="sub_section">
      <SectionTitle>
3.2 Richness of the Base
</SectionTitle>
      <Paragraph position="0"> Although the notion of a restrictive grammar is intuitively clear, it is difficult to implement formally. Previous work on OT learnability (Tesar, 1995; Tesar and Smolensky, 1995; Smolensky 1996; Tesar, 1998, Tesar, 1999; Tesar et al., 2003; Tesar and Prince, to appear; Hayes, 2004) has proposed the heuristic of Markedness over Faithfulness during learning to favor restrictive grammars. In OT there are two basic types of constraints, markedness constraints, which penalize dispreferred surface structures, and faithfulness constraints, which penalize nonidentical mappings from underlying to surface forms. In general, a restrictive grammar will have markedness constraints ranked high, because these constraints will restrict the type of surface forms that are allowed in a language. On the other hand, if faithfulness constraints are ranked high, all the distinctions introduced into the lexicon will surface. Thus, a heuristic preferring markedness constraints to rank high whenever possible does in general prefer restrictive grammars. However, the markedness over faithfulness heuristic does not exhaust the notion of restrictiveness. In particular, markedness over faithfulness does not favor grammar restrictiveness that follows from particular rankings between markedness constraints or between faithfulness constraints.</Paragraph>
      <Paragraph position="1"> This work aims to provide a general solution that does not require distinguishing various types of constraints - the proposed solution implements Richness of the Base explicitly in the initial state of the lexicon. Specifically, the solution involves requiring that initial distributions over the lexicon be uniform, or rich. Although the objective function alone does not prefer restrictive grammars over restrictive lexicons, a lexicon constrained to be uniform, or nonrestrictive, will in turn force the grammar to be restrictive. Another way to think about it is that a restrictive grammar is one that compresses the input distributions maximally by mapping as much of the lexicon onto observed surface forms as possible. By requiring the lexicon to be rich the proposed solution relies on the objective function's natural preference for grammars that maximally compress the lexicon. The objective function prefers restrictive grammars in this situation because restrictive grammars will allow the highest probability to be assigned to observed  forms. In contrast, if the lexicon is not rich, there is nothing for the grammar to compress, and the objective function's natural preference for compression will not be employed. The next subsection discusses the algorithm and the initialization of the parameters in more detail.</Paragraph>
    </Section>
    <Section position="4" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
3.3 Likelihood Maximization Algorithm
</SectionTitle>
      <Paragraph position="0"> As discussed above, the goal of the learning algorithm is to find the probability distributions over rankings and lexicons that maximize the probability assigned to the observed set of data according to the objective function. In addition, any regularities present in the data should be accommodated by the grammar rather than by restricting the lexicon.</Paragraph>
      <Paragraph position="1"> As in previous work on unsupervised learning of OT, the algorithm assumes knowledge of OT constraints, the possible underlying forms of overt forms, and sets of candidate outputs and their constraint violation profiles for all possible underlying forms. While the present version of the algorithm receives this information as input, recent work in computational OT (Riggle, 2004; Eisner, 2000) suggests that this information is formally derivable from the constraints and overt surface forms and can be generated automatically.</Paragraph>
      <Paragraph position="2"> In addition, the algorithm receives information about the morphological relations between observed surface forms. Specifically, output forms are segmented into morphemes, and the morphemes are indexed by a unique identifier. This information, which has also been assumed in previous work, cannot be derived directly from the constraints and observed forms but is a necessary component of a model that refers to underlying forms of morphemes. The present work assumes this information is available to the learner although Section 5 will discuss the possibility of learning these morphological relations in conjunction with the learning of phonology.</Paragraph>
      <Paragraph position="3"> The set of potential underlying forms is derived from observed surface forms, morphological relations, and the constraint set. On the one hand the set of potential underlying forms, which is initially uniformly distributed, should be rich enough to constitute a rich base for the reasons discussed earlier. On the other hand, the set should be restricted enough so that the search space is not too large and so that the grammar is not pressured to favor mapping underlying forms to completely unrelated surface forms. For this reason, potential underlying forms are derived from surface forms by considering all featural variants of surface forms for features that are evaluated by the grammar. Of these potential underlying forms, only those that can yield each of the observed surface allomorphs of the morpheme under some ranking of the constraints are included. This formulation differs substantially from previous work, which aimed to construct the lexicon via discrete steps, the first of which involved permanently setting the values for features that do not alternate. In contrast, the approach taken here aims to create a rich initial lexicon, to compel the selection of a restrictive grammar.</Paragraph>
      <Paragraph position="4"> In addition to featural variants, variants of surface forms that differ in length are included if they are supported by allomorphic alternation. In particular, featural variants of all the observed surface allomorphs of the morpheme are considered as potential underlying forms for the morpheme if each of the observed surface forms can be generated under some ranking. Including these types of underlying forms extends previous work, which did not allow segmental insertion or deletion or constraints that evaluate these unfaithful mappings, such as MAX and DEP.</Paragraph>
      <Paragraph position="5"> The algorithm initializes both the lexicon and grammar to uniform probability distributions. This means that all rankings are initially equally likely.</Paragraph>
      <Paragraph position="6"> Likewise, all potential underlying forms for a morpheme are initially equally likely. Thus, the probability distributions begin unbiased, but choosing an unbiased lexicon initially begins the search through parameter space at a position that favors restrictive grammars. The experiments in the following section suggest that this choice of initialization correctly selects a restrictive final grammar. The learning algorithm itself is based on the Expectation Maximization algorithm (Dempster et al., 1977) and alternates between an expectation stage and a maximization stage. During the expectation stage the algorithm computes the likelihood of the observed surface forms under the current hypothesis. During the maximization stage the algorithm adjusts the grammar and lexicon distributions in order to increase the likelihood of the data. The probability distribution over rankings is adjusted according to the following re-estimation formula:</Paragraph>
      <Paragraph position="8"> Intuitively, this formula re-estimates the probability of a ranking for state H+1 in proportion to the ranking's contribution to the overall probability at state H. The algorithm re-estimates the probability distribution for an underlying form according to an analogous formula:</Paragraph>
      <Paragraph position="10"> Intuitively, the re-estimate of the probability of an underlying form Ik,j for state H+1 is proportional to the contribution that underlying form makes to the total probability due to morpheme Mi at state H. The algorithm continues to alternate between the two stages until the distributions converge, or until the change between one stage and the next reaches some predetermined minimum. At this point the resulting distributions are taken to correspond to the learned grammar and lexicon.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="54" end_page="56" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> This section describes the results of experiments with three artificial language systems with different types of hidden structure. In all experiments presented here, each unique surface form is assumed to occur with frequency 1.</Paragraph>
    <Section position="1" start_page="54" end_page="55" type="sub_section">
      <SectionTitle>
4.1 Voicing Neutralization
</SectionTitle>
      <Paragraph position="0"> The first test set is an artificial language system (Tesar and Prince, to appear) exhibiting voicing neutralization. The constraint set includes five constraints: null  These five constraints can describe a number of languages, but of particular interest are languages in which voicing contrasts are neutralized in one or more positions. Such languages, three of which are shown below, test the algorithm's ability to identify correct and restrictive grammars. The partial rankings shown below correspond to the necessary rankings that must hold for these languages; each partial ranking actually corresponds to several total rankings of the constraints. Also shown below are the morphologically analyzed surface forms for each language that are provided as input to the algorithm. The subscripts in these forms indicate morpheme identities, while the hyphens segment the words into separate morphemes. For example, tat1,2 means that the surface form &amp;quot;tat&amp;quot; could be derived from either morpheme 1 or 2 in this language. null</Paragraph>
      <Paragraph position="2"> In language C, it would be possible to maximize the objective function by selecting a restrictive lexicon rather than a restrictive grammar. In particular, /tat/ could be selected as the underlying form for morphemes 1-4 in order to account for the lack of voiced obstruents in the observed surface forms. In this case, the objective function could just as well be satisfied by an identity grammar mapping underlying /tat/ to surface &amp;quot;tat&amp;quot;. However, as discussed in Section 2, such a grammar would violate the principle of Richness of the Base by putting the restriction against voiced obstruents into the lexicon rather than the grammar. Thus, this language tests not only whether the algorithm finds a maximum, but also whether the maximum corresponds to a restrictive grammar.</Paragraph>
      <Paragraph position="3"> In fact, for all three languages above, the algorithm converges on the correct, restrictive grammars and correct lexicons. Specifically, the final grammars for each of the languages above converge on probability distributions that distribute the probability mass equally among the total rankings consistent with the partial orders above. For example, for language C the algorithm converges on  a distribution that assigns equal probability to the 20 total rankings consistent with the partial order given by MAX, NOVOI &gt;&gt; IDVOI, IVV.</Paragraph>
      <Paragraph position="4"> The initial uniform lexicon for language C is shown in Table 3. Here the numbers 1-5 refer to morpheme indices, and the possible underlying forms for each morpheme are uniformly distributed. This initial lexicon favors a grammar that can map as much of the rich lexicon as possible onto surface forms with no voiced obstruents. With these constraints, this translates into ranking NOVOI above IDVOI and IVV. As the algorithm begins learning the lexicon and continues to refine its hypothesis for this language, nothing drives the algorithm to abandon the initial rich lexicon. Thus, in the final state, the lexicon for this language is identical to the initial lexicon. In general, the final lexicon will be uniformly distributed over underlying forms that differ in noncontrastive features.</Paragraph>
    </Section>
    <Section position="2" start_page="55" end_page="55" type="sub_section">
      <SectionTitle>
4.2 Grammatical and Lexical Stress
</SectionTitle>
      <Paragraph position="0"> The next set of languages from the PAKA system (Tesar et al., 2003) test the ability of the algorithm to identify grammatical stress (most restrictive), lexical stress (least restrictive), and combinations of the two. The constraint set includes:  Possible languages and their corresponding partial orders ranging from least restrictive to most restrictive are shown below. In the first two languages, the least restrictive languages, lexical distinctions in stress are realized faithfully, while grammatical stress surfaces only in forms with no underlying stress. In the final two languages stress is entirely grammatical; underlying distinctions are neutralized in favor of a regular surface stress pattern. Finally, the middle language is a combination of lexical and grammatical stress, requiring that the algorithm learn that a contrast in roots is preserved, while a contrast in suffixes is neutralized.</Paragraph>
      <Paragraph position="1"> * Full contrast: roots and suffixes contrast in stress, default left:  In all cases the algorithm learns the correct, restrictive grammars corresponding to the partial orders shown above. As before, the final lexicon assigns uniform probability to all underlying forms that differ in noncontrastive features. For example, in the case of the language with root contrast only, the final lexicon selects a unique lexical item for root morphemes and maintains a uniform probability distribution over stressed and unstressed underlying forms for suffixes.</Paragraph>
    </Section>
    <Section position="3" start_page="55" end_page="56" type="sub_section">
      <SectionTitle>
4.3 Abstract Underlying Vowels
</SectionTitle>
      <Paragraph position="0"> The final experiment tests the algorithm on an artificial language, based on Polish, with abstract underlying vowels that never surface faithfully.</Paragraph>
      <Paragraph position="1"> Although the particular phenomenon exhibited by Slavic alternating vowels is rare, the general phenomenon wherein underlying forms do not correspond to any surface allomorph is not uncommon and should be accommodated by the learning algorithm. This language presents a challenge for previous work on unsupervised learning of OT because alternations in the number of segments are observed in morpheme 3. The morphologically  annotated input to the algorithm for this language  In this language morphemes 1, 2 and 4 exhibit no alternation while morpheme 3 alternates between sater and satr depending on the context. The constraints for this language, based on Jarosz (2005), are shown below:  In the proposed analysis of this language, the abstract underlying [E], which is a [+high] version of [e], is neutralized on the surface and exhibits two repairs systematically depending on the context. It deletes in general, but if a complex coda is at stake, the vowel surfaces as [e] by violating IDENT[HIGH]. The required partial ranking for this language is shown below while the desired lexicon is shown in Table 5.</Paragraph>
      <Paragraph position="2"> {*E, {DEP-V &gt;&gt; *COMPLEXCODA }} &gt;&gt; IDENT[HIGH] &gt;&gt; MAX-V The algorithm successfully learns the correct ranking above and the lexicon in Table 5. Specifically, the final grammar assigns equal probability to all the rankings consistent with the above partial order. The final lexicon selects a single underlying form for each morpheme as shown in Table 5 because all underlying distinctions in this language are contrastive.</Paragraph>
    </Section>
    <Section position="4" start_page="56" end_page="56" type="sub_section">
      <SectionTitle>
4.4 Discussion
</SectionTitle>
      <Paragraph position="0"> In summary, the algorithm is able to find a correct grammar and lexicon combination for all of the language systems discussed. As discussed in Section 3, the objective function itself does not favor restrictive grammars, but the ability of the algorithm to learn restrictive grammars in these experiments suggests that initializing the lexicons to uniform distributions does compel the learning algorithm to select restrictive grammars rather than restrictive lexicons.</Paragraph>
      <Paragraph position="1"> While the experiments presented in this section focus on the task of learning a grammar and lexicon simultaneously, the proposed algorithm is also capable of learning grammars from structurally ambiguous forms. The same likelihood maximization procedure proposed here could be used for unsupervised learning of grammars that assign full structural description to overt forms. Future directions include testing the algorithm on language data of this sort.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML