<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0105">
  <Title>Priors in Bayesian Learning of Phonological Rules</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Linguistica
</SectionTitle>
    <Paragraph position="0"> Since our algorithm is designed to take as input a morphological analysis produced by Linguistica, we first briefly review what that analysis consists of and how it is arrived at. Linguistica is based on the MDL principle, which states that the optimal hypothesis to explain a set of data is the one that minimizes the total number of bits required to describe both the hypothesis and the data under that hypothesis. Information theory tells us that the description length of the data under a given hypothesis is simply the negative log likelihood of the data, so the MDL criterion is equivalent to a Bayesian prior favoring hypotheses that can be described succintly.</Paragraph>
    <Paragraph position="1"> Linguistic hypotheses (grammars) all contain some primitive types. Linguistica uses three primitive types in its grammar: stems, suffixes, and signatures. 2 Each signature is associated with a set of stems, and each stem is associated with exactly one signature representing those suffixes with which it combines freely. For example, walk and jump might be associated with the signature h .ed.ing.si (see Figure 1), while bad might be associated with h .lyi. Unanalyzed words can be thought of as belonging to the h i signature. A possible grammar under this scenario consists of a set of signatures, where each signature contains a set of stems and a set of suffixes. Rather than modeling the probability of each word in the corpus directly, the grammar assumes that each word consists of a stem and a (possibly empty) suffix, and assigns a probability to each word w according to</Paragraph>
    <Paragraph position="3"> where is the signature containing stem t. (We have adopted Goldsmith's notation here, using f to denote suffixes, t for stems, and for signatures.) Clearly, grouping words into signatures will cause their probabilities to be modeled less well than modeling each word individually. The negative log likelihood of the corpus will therefore increase, and this portion of the description length will grow. However, listing each word individually in the grammar requires as many stems as there are words. Assigning words to signatures significantly reduces the number of stems, and thus the length of the grammar. If the stems are chosen well, then the length of the grammar will decrease more than the length of the encoded corpus will increase, leading to an over-all win. Goldsmith (2001) provides a detailed description of the exact grammar encoding and search heuristics used to find the optimal set of stems, suffixes, and signatures under this type of model.</Paragraph>
    <Paragraph position="4"> Goldsmith's algorithm is not without its problems, however. We concern ourselves here with its tendency to postulate spurious signatures in cases where phonological constraints operate. For example, many English verb stems ending in e are placed in the signature he.ed.es.ingi, while stems not ending in e have the signature h .ed.ing.si. This is due to the fact that the stem-final e deletes before suffixes beginning in e or i. Similarly, words like match and index are likely to be given the signatureh .esi, whereas most nouns would beh .si. The toy grammar G1 in Figure 2 illustrates the sort of analysis produced by Linguistica.</Paragraph>
    <Paragraph position="5"> Goldsmith himself has noted the problem of spurious signatures (Goldsmith, 2004a), and recent ver2Linguistica actually can perform prefix analysis as well as suffix analysis, but in our work we used only the suffixing functions. null  voted to detecting allomorphs. Superficially, our work may seem similar to Goldsmith's, but in fact it is quite different. First of all, the allomorphic variation detected by Linguistica is suffix-based. That is, suffixes are proposed that operate to delete certain stem-final material. For example, a suffix (e)ing could be proposed in order to include both hope and walk in the signature h .(e)ing.si. This suffix is actually separate in the grammar from the ordinary ing suffix, so there is no recognition of the fact that any occurrence of ing in any signature should delete a preceding stem-final e. Moreover, this approach is not really phonological, in the sense that other suffixes beginning with i might or might not be analyzed as deleting stem-final e. While many languages do contain some affixspecific morpho-phonological processes, our goal here is to find phonological rules that apply at all stem-suffix boundaries, given certain context criteria. null A second major difference between the allomorphy detection in Linguistica and the work presented here is that a Linguistica suffix such as (e)ing is assumed to delete any stem-final e, without exception.</Paragraph>
    <Paragraph position="6"> While this assumption may be valid in this case, there are other suffixes and phonological processes that are not categorical. For example, the English plural s requires insertion of an e after certain stems, including those ending in x or s. However, there is no simple way to describe the context for this rule based solely on orthography, because of stems such as beach (+es) and stomach (+s). For this reason, and to add robustness against errors in the input morphological segmentation, we allow stems to be listed in the grammar as exceptions to phonological rules.</Paragraph>
    <Paragraph position="7"> In addition to these theoretical differences, the work presented here covers a wider range of phonological processes than does Linguistica. Linguistica is capable of detecting only stem-final deletion, whereas our algorithm can also detect insertion (as in match + s ! matches) and stem-final substitution (as in carry + ed!carried). In the following section we discuss the structure of the grammar we use to describe the words in our corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>