<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0105">
  <Title>Priors in Bayesian Learning of Phonological Rules</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A Morpho-Phonological Grammar
</SectionTitle>
    <Paragraph position="0"> Since the morphology we use as input to our program is obtained directly from Linguistica, our grammar is necessarily similar to the one in that program. As discussed above, Linguistica contains three primitives types in its grammar: signatures, stems, and suffixes. We add one more primitive type to our grammar, the notion of a rule.</Paragraph>
    <Paragraph position="1"> Each rule consists of a transformation, for example ! e or y ! i, and a conditioning context. A context consists of a string of four characters XtytyfXf , where Xi 2fC;V; #g(consonant, vowel, end-of-word) and yi is in the set of characters in our text.3 The first half of the context is from the end of the stem, and the second half is from the beginning of the suffix. For example, the stem-suffix pair jump + ed has the context CpeC. All transformations are assumed to occur stem-finally, i.e. at the second context position (or after the second position, for insertions). Of course, these contexts are more detailed than necessary for certain phonological rules, and don't capture all the information required for others. In future work, we plan to allow for different types of contexts and generalization over contexts, but for the present, all contexts have the same form.</Paragraph>
    <Paragraph position="2"> Using these four primitives, we can construct a grammar in the following way: As in Goldsmith's work, we list a set of signatures, each of which contains a set of stems and suffixes. In addition, we list a set of phonological rules. In many cases, only one rule will apply in a particular context, in which case it applies to all stem-suffix pairs that meet its context. If more than one rule applies, we list the rule with the most common transformation first and assume that it applies unless a particular stem specifies otherwise. Stems can thus be listed as exceptions to rules by using a non-default *nochange* rule with the appropriate context. Note that the more exceptions a rule has, the more expensive it is to add to the grammar: each new type of transformation in a particular context must be listed, and each stem requiring a non-default transformation must specify the transformation required. Any prior preferring short grammars will therefore tend 3The knowledge of which characters are consonants and which are vowels is the only information we provide to our program, other than the text corpus and the Linguistica-produced morphology. Aside from the C/V distinction, our program is entirely knowledge-free.</Paragraph>
    <Paragraph position="4"> mation Rules to reject rules requiring many exceptions (i.e. those without a consistent application context). Grammar G2, in Figure 3, shows a sample of the kind of grammar we use. This grammar generates exactly the same wordforms as G1, but using fewer signatures due to the effects of the phonological rules. All the stem-suffix parings in this grammar undergo the default rules for their contexts except for the stem booth, which is listed as an exception to the e-insertion rule. For booth + s, the grammar therefore generates booths, not boothes.</Paragraph>
    <Paragraph position="5"> Our model generates data in much the same way as Goldsmith's: a word is generated by selecting a signature and then independently generating a stem and suffix from that signature. This means that the likelihood of the data takes the same form in our model as in Goldsmith's, namely Pr(w) = Pr( )Pr(tj )Pr(fj ), where the word w consists of a stem t and a suffix f both drawn from the same signature . Our model differs from Goldsmith's in the way that stems and suffixes are produced; because we use phonological rules a great many more stems and suffixes can belong to a single signature. We defer discussion of how we define the prior probability over grammars to Section 5, and assume for the moment that we are given prior and likelihood functions that can evaluate the utility of a grammar and training data.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Search Algorithm
</SectionTitle>
    <Paragraph position="0"> Since it is clearly infeasible to evaluate the utility of every possible grammar, we need a search algorithm to guide us toward a good solution. Our algorithm uses certain heuristics to make small changes to the initial grammar (the one provided by Linguistica), evaluating each change using our objective function, and accepting or rejecting it based on the result of evaluation. Our algorithm contains three major components: a procedure to find signatures that are similar in ways suggesting phonological change, a procedure to identify possible contexts for phonological change, and a procedure to collapse related signatures and add phonological rules to the grammar. We discuss each of these components in turn.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Identifying Similar Signatures
</SectionTitle>
      <Paragraph position="0"> An important first step in simplifying the morphological analysis of our data using phonological rules is to identify signatures that might be related via such rules. Since our algorithm considers three different types of possible phonological processes (deletion, substitution, and insertion), there are three different ways in which signatures may be related. We need to look for pairs of signatures that are similar in any of these three ways.</Paragraph>
      <Paragraph position="1"> Insertion We look for potential insertion rules by finding pairs of signatures in which all suffixes but one are common to both signatures. The distinct pair of suffixes must be such that one can be formed from the other by inserting a single character at the beginning. Example pairs found by our algorithm include h .si/h .esi and h .yi/h .lyi. In searching for these pairs (as well as deletion and substitution pairs), we consider only pairs where each signature contains at least two stems. This is partly in the interests of efficiency and partly due to the fact that signatures with only one stem are often less reliable.</Paragraph>
      <Paragraph position="2"> Deletion Signature pairs exhibiting possible deletion behavior are similar to those exhibiting insertion behavior, except that one of the suffixes not common to both signatures must be the empty suffix. Examples of possible deletion pairs include h .ed.ingi/he.ed.ingiandh .ed.ingi/hed.ing.si.</Paragraph>
      <Paragraph position="3"> Substitution In a possible substitution pair, one signature (the one potentially exhibiting stem-final substitution) contains suffixes that all begin with one of two characters: the basic stem-final character, and the substituted character. The signature hied.ier.yi from G1 is such a signature. The other signature in a possible substitution pair must contain the empty suffix, and the two signatures must be identical when the first character of each suffix in the first signature is removed. Possible substitution pairs includehied.ier.yi/h .ed.eriandhous.yi/h .usi.</Paragraph>
      <Paragraph position="4"> Using the set of similar signatures we have detected, we can propose a set of possible phonological processes in our data. Some transformations, such as e! , will be suggested by more than one pair of signatures, while others, such as y ! o, will occur with only one pair. We create a list of all the possible transformations, ranked according to the number of signature pairs attesting to them.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Identifying possible contexts
</SectionTitle>
      <Paragraph position="0"> Once we have found a set of possible transformations, we need to identify the contexts in which those transformations might apply. To see how this works, suppose we are looking at the proposed e-deletion rule and our input grammar is G1. Using one of the signature pairs attesting to this rule, such as h .ed.er.ingi/he.ed.er.ingi, we can find possible conditioning contexts by examining the set of stems and suffixes in the second signature. If we want to reanalyze the stems din and bik as dine and bike, we hypothesize that each wordform generated using the suffixes present in both signatures (ed, er, and ing) must have deleted an e. We can find the context for this deletion by looking at these suffixes together with the reanalyzed stems. The contexts for deletion that we would get fromfbike, dineg fed, inggarefCeeC, CeiCg.4 Our methods for finding possible contexts for substitution and insertion rules are similar: reanalyze the stems and suffixes in the signature hypothesized to require a phonological rule, combine them, and note the context generated. In this way, we can get contexts such as CyeC for the y!i rule (from carry + ed) and V xs# for the ;! e rule (from index + s).</Paragraph>
      <Paragraph position="1"> Just as we ranked the set of possible phonological rules according to the number of signature pairs attesting to them, we can rank the set of contexts proposed for each rule. We do this by calculating</Paragraph>
      <Paragraph position="3"> the probability of seeing a particular stem context given a particular suffix context to the prior probability of the stem context. If a stem context (such as Ce) is quite common overall but hardly ever appears before a particular suffix context (iC), this is good evidence that some phonological process has modified the stem in the context of that suffix. Low values of r are therefore better evidence of conditioning for a rule than are high values of r.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Collapsing signatures
</SectionTitle>
      <Paragraph position="0"> Given a set of similar signature pairs, the rules relating them, and the possible contexts for those rules, we need to determine which rules are actually phonologically legitimate and which are simply accidents of the data. We do this by simply considering each rule and context in turn, proceeding from the most attested to least attested rules and from most likely to least likely contexts. For each rule-context pair, we add the rule to the grammar 4The reasoning we use to finding conditioning contexts for deletion rules was also described by Goldsmith (2004a), and is similar to the much earlier work of Johnson (1984).</Paragraph>
      <Paragraph position="1">  with that context and collapse any pairs of signatures related by the rule, as long as all stem-suffix pairs contain a context at least as likely as the one under consideration. Collapsing a pair of signatures means reanalyzing all the stems and suffixes in one of the signatures, and possibly adding exceptions for any stems that don't fit the rule. We have found that exceptions are often required to handle stems that were originally misanalyzed by Linguistica. For that reason, we prune the rules added to the grammar, and for each rule, if fewer than 2% of the stems require exceptions, we assume that these are errors and de-analyze the stems, returning the word-forms they generated to theh isignature. We then evaluate the new analysis using our objective function, and accept it if it scores better than our previous analysis. Otherwise, we revert to the previous analysis and continue trying new rule-context pairs.</Paragraph>
      <Paragraph position="2"> Pseudocode for our algorithm is presented in Figure 4. We use the notation i!r j to indicate that i and j are similar with respect to rule r, with j being the more &amp;quot;basic&amp;quot; signature (i.e. adding r to the grammar would allow us to move the stems in i into j).</Paragraph>
      <Paragraph position="3"> Note that collapsing a pair of signatures does not always result in an overall reduction in the number of signatures in the grammar. To see why this is  so, consider the effect of collapsing 1 and 2 and adding r1 and r2 (the e-deletion rules) to G1. When the stem bik gets reanalyzed as bike, the algorithm recognizes that bike is already a stem in the grammar, so rather than placing the reanalyzed stem in 1, it combines the reanalyzed suffixes f , ed, er, ingg with the suffixes f , sg from 6 and creates a new signature for the stem bike -- h .ed.er.ing.si.</Paragraph>
      <Paragraph position="4"> The two stems carr and carry are also combined in this way, but in that case, the combined suffixes form a signature already present in the grammar, so no new signature is required.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> For our experiments with learning phonological rules, we used two different corpora obtained from the Penn Treebank. The larger corpus contains the words from sections 2-21 of the treebank, filtered to remove most numbers, acronyms, and words containing puctuation. This corpus consists of approximately 900,000 tokens. The smaller corpus is simply the first 100,000 words from the larger corpus.</Paragraph>
    <Paragraph position="1"> We ran each corpus through the Linguistica program to obtain an initial morphological segmentation. Statistics on the results of this segmentation are shown in the left half of Table 1. &amp;quot;Singleton signatures&amp;quot; are those containing a single stem, and &amp;quot;Non- stems&amp;quot; refers to stems in a signature other than theh isignature, i.e. those stems that combine with at least one non- suffix.</Paragraph>
    <Paragraph position="2"> The original function we used to evaluate the utility of our grammars was an MDL prior very similar to the one described by Goldsmith (2001). This prior is simply the number of bits required to describe the grammar using a fairly straightforward encoding. The encoding essentially lists all the suffixes in the grammar along with pointers to each one; then lists the phonological rules with their pointers; then lists all the signatures. Each signature is a list of stems and their pointers, and a list of pointers to suffixes. Each exceptional stem also has  prior (large corpus).</Paragraph>
    <Paragraph position="3"> a pointer to a phonological rule.5 Our algorithm considered a total of 11 possible transformations in the small corpus and 40 in the large corpus, but using this prior, only a single type of transformation appeared in any rule in the final grammar: e ! , with seven contexts in the small corpus and eight contexts in the large corpus. In analyzing why our algorithm failed to accept any other types of rules, we realized that there were several problems with the MDL prior. Consider what happens to the overall evaluation when two signatures are collapsed. In general, the likelihood of the corpus will go down, because the stem and suffix probabilities in the combined signature will not fit the true probabilities of the words as well as two separate signatures could. For large corpora like the ones we are using, this likelihood drop can be quite large. In order to counterbalance it, there must be a large gain in the prior.</Paragraph>
    <Paragraph position="4"> But now look at Table 2, which shows the effects of adding all the y ! i rules to the grammar for the large corpus under the MDL prior. The first two lines give the number of signatures and stems in each grammar. The next line shows the total length (in bits) of each grammar, and this value is then broken down into three different components: the overhead caused by listing the signatures and their suffixes, the length of the stem list (not including the length required to specify exceptions to rules), and the length of the phonological component (including both rules and exception specifications). Finally, we have the negative log likelihood under each grammar and the total MDL cost (grammar plus likelihood).</Paragraph>
    <Paragraph position="5"> As expected, the likelihood term for the grammar 5There are some additional complexities in the grammar encoding that we have not mentioned, due to the fact that stems can be recursively analyzed using shorter stems. These complexities are irrelevant to the points we wish to make here, but are described in detail in Goldsmith (2001).</Paragraph>
    <Paragraph position="6">  fied prior (large corpus).</Paragraph>
    <Paragraph position="7"> with y ! i rules has increased, indicating a drop in the probability of the corpus under this grammar. But notice that the total grammar size has also increased, leading to an overall evaluation that is worse than for the original grammar. There are two main reasons for this increase in grammar size.</Paragraph>
    <Paragraph position="8"> Initially, the more puzzling of the two is the fact that the number of bits required to list all the stems has increased, despite the fact that the number of stems has decreased due to reanalyzing some pairs of stems into single stems. It turns out that this effect is due to the encoding used for stems, which is simply a bitwise encoding of each character in the stem. This encoding means that longer stems require longer descriptions. When reanalysis requires shifting a character from a suffix onto the entire set of stems in a signature (as infcertif, empt, hurrg fied, yg!fcertify, empty, hurryg f , edg), there can be a large gain in description length simply due to the extra characters in the stems. If the number of stems eliminated through reanalysis is high enough (as it is for the e! rules), this stem length effect will be outweighed. But when only a few stems are eliminated relative to the number that get longer, the overall length of the stem list increases.</Paragraph>
    <Paragraph position="9"> However, even without the stem list, the grammar with y !i rules would still be slightly longer than the grammar without them. In this case, the reason in that under our MDL prior, it is quite efficient to encode a signature and its suffixes. Therefore the grammar reduction caused by removing a few signatures is not enough to outweigh the increase caused by adding a few phonological rules.</Paragraph>
    <Paragraph position="10"> Using these observations as a guideline, we redesigned our prior by assigning a fixed cost to each stem and increasing the overhead cost for signatures. The new overhead function is equal to the sum of the lengths of all the suffixes in the signature, times a constant factor. This function means there is more incentive to collapse two signatures that share several suffixes, such as he.ed.er.ingi/h .ed.er.ingi, than to collapse signatures sharing only a single suffix, such ashing.si/h .ingi. This behavior is exactly what we want, since these shorter pairs are more likely to be accidental. Table 3 shows the effects of adding the y ! i rules under this new prior.</Paragraph>
    <Paragraph position="11"> The starting grammar is somewhat different from the one in Table 2, because more rules have already been added by the time the y !i rules are considered. The important point, however, is that the cost of each component of the grammar changes in the direction we expect it to, and the total grammar cost is reduced enough to more than make up for the loss in likelihood.</Paragraph>
    <Paragraph position="12"> With this new prior, our algorithm was more successful, learning from the large corpus the three major transformations for English (e ! , !e, and y ! i) with a total of 22 contexts. Eight of these rules, such as !e = V xs# and y!i = CyeC, had no exceptions. Of the remaining rules, the exceptions to six of the rules were correctly analyzed stems (for example, unhappy + ly!unhappily and necessary + ly!necessarily but sly + ly!slyly), while the remaining eight rules contained misanalyzed exceptions (such as overse + er ! overseer, which was listed as an exception to the rule e! = CeeC, rather than being reanalyzed as oversee + er). In the small corpus, no y!i rules were learned due to the fact that no similar signatures attesting to these rules were found.</Paragraph>
    <Paragraph position="13"> Using these phonological rules, a total of 31 signatures in the small corpus and 57 signatures in the large corpus were collapsed, with subsequent reanalysis of 225 and 528 stems, respectively. This represents 7-10% of all the non- stems. The final grammars are summarized in the right half of Table</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> The work described here is clearly preliminary with respect to learning phonological rules and using those rules to simplify an existing morphology. Our notion of context, for example, is somewhat impoverished; our system might benefit from using contexts with variable lengths and levels of generality, such as those in Albright and Hayes (2003). We also cannot handle transformations that require rule ordering or more than one-character changes. One reason we have not yet implemented these additions is the difficulty of designing a heuristic search that can handle the additional complexity required. We are therefore working toward implementing a more general search procedure that will allow us to explore a larger grammar space, allowing greater flexibility with rules and contexts. Once some of these improvements have been implemented, we hope to explore the possibilities for learning in languages with richer morphology and phonology than English. null Our point in this paper, however, is not to present a fully general learner, but to emphasize that in a Bayesian system, the choice of prior can be crucial to the success of the learning task. Learning is a trade-off between finding an explanation that fits the current data (maximizing the likelihood) and maintaining the ability to generalize to new data (maximizing the prior). The MDL framework is a way to formalize this trade-off that is intuitively appealing and seems straightforward to implement, but we have shown that a simple MDL approach is not the best way to achieve our particular task. There are at least two reasons for this. First, the obvious encoding of stems actually penalizes the addition of certain types of phonological rules, even when adding these rules reduces the number of stems in the grammar. More importantly, the type of grammar we want to learn allows two different kinds of generalizations: the grouping of stems into signatures, and the addition of phonological rules. Simply specifying a method of encoding each type of generalization may not result in a linguistically appropriate trade-off during learning. In particular, we discovered that our MDL encoding for signatures was too efficient relative to the encoding for rules, leading the system to prefer not to add rules. Our large corpus size already puts a great deal of pressure on the system to keep signatures separate, since this leads to a better fit of the data. In order to learn most of the rules, we therefore had to significantly increase the cost of signatures.</Paragraph>
    <Paragraph position="1"> We are not the first to note that with an MDLstyle prior the choice of encoding makes a difference to the linguistic appropriateness of the resulting grammar. Chomsky himself (Chomsky, 1965) points out that the reason for using certain types of notation in grammar rules is to make clear the types of generalizations that lead to shorter grammars. However, our experience emphasizes the fact that very little is still known about how to choose appropriate encodings (or, more generally, priors).</Paragraph>
    <Paragraph position="2"> As researchers continue to attempt more sophisticated Bayesian learning tasks, they will encounter more interactions between different kinds of generalizations. As a result, the question of how to design a good prior will become increasingly important. Our primary goal for the future is therefore to investigate exactly what assumptions go into deciding whether a grammar is linguistically sound, and to determine how to specify those assumptions explicitly in a Bayesian prior.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML