<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3209"> <Title>Learning Probabilistic Paradigms for Morphology in a Latent Class Model</Title> <Section position="3" start_page="69" end_page="71" type="metho"> <SectionTitle> 2 The Probabilistic Paradigm </SectionTitle> <Paragraph position="0"> We introduce the probabilistic paradigm, a probabilistic, declarative model of regular morphology.</Paragraph> <Paragraph position="1"> The probabilistic paradigm model consists of three matrices: the data matrix D, the morphological probabilities matrix M, and the lexical probabilities matrix L. Let m be the number of stems, n the number of stems, and p the number of paradigms.</Paragraph> <Paragraph position="2"> The D matrix encodes the joint distribution of lexical and morphological information in a corpus. It is of size m x n, and each cell contains the frequency of the word formed by concatenating the appropriate stem and suffix. The M matrix is of size m x p, and each column contains the conditional probabilities of each suffix given a paradigm. The L matrix is of size p x n, and contains the conditional probabilities of each paradigm given a stem. Each suffix should belong to exactly one paradigm, and the suffixes of a particular paradigm should be conditionally independent.</Paragraph> <Paragraph position="3"> Each column of the M matrix defines a canonical paradigm, a set of suffixes that attach to stems associated with that paradigm. A lexical paradigm is the full set of word forms for a particular stem, and is an instantiation of the canonical paradigm for a particular stem.</Paragraph> <Paragraph position="4"> The probabilistic paradigm is not well-developed as the usual notion of &quot;paradigm&quot; in linguistics. First, the system employs no labels such as &quot;noun&quot;, &quot;plural&quot;, &quot;past&quot;, etc. Second, probabilistic paradigms have only a top-level categorization; induced &quot;verb&quot; paradigms, for example, are not substructured into different tenses or conjuga- null tions. Third, we do not distinguish between inflectional and derivational morphology; traditional grammars place derived forms in separate lexical paradigms. Fourth, we do not handle syncretism, where one suffix belongs in multiple slots of the paradigm. Fifth, we do not yet not handle irregular and sub-regular forms. Despite these drawbacks, our paradigms have an important advantage over traditional paradigms, in being probabilistic and therefore able to model language usage.</Paragraph> <Paragraph position="5"> 3 Learning the probabilistic paradigm in a latent class model We learn the parameters of the probabilistic paradigm model by applying a dimensionality reduction algorithm to the D matrix, in order to produce the M and L matrices. This reduces the size of the representation from m*n to m*p + p*n. The main idea is to discover the latent classes (paradigms) which represent the underlying structure of the input matrix. This handles two important problems: 1) that words occur in a subset of their possible morphological forms in a corpus, and 2) that the words formed from a particular stem can belong to multiple POS categories. 
<Section position="1" start_page="70" end_page="70" type="sub_section"> <SectionTitle> 3.1 LDA model for morphology </SectionTitle> <Paragraph position="0"> The dimensionality reduction algorithm that we employ is Latent Dirichlet Allocation (LDA) (Blei et al. 2003). LDA is a generative probabilistic model for discrete data. In the application of topic discovery within a corpus of documents, a document consists of a mixture of underlying topics, and each topic consists of a probability distribution over the vocabulary. The topic proportions are drawn from a Dirichlet distribution, and each word is drawn from a multinomial distribution conditioned on its topic. Given the topics, documents and words are conditionally independent. LDA produces two non-negative parameter matrices, Gamma and Beta: Gamma is the matrix of Dirichlet posteriors, encoding the distribution of documents over topics; Beta encodes the distribution of topics over words.</Paragraph> <Paragraph position="1"> The mapping of the data structures of LDA to the probabilistic paradigm is as follows. The document-word matrix is analogous to the suffix-stem D matrix. For morphology, a &quot;document&quot; is the multiset of corpus tokens sharing a specified suffix, each token decomposing into a stem plus that suffix. Different underlying canonical paradigms (&quot;topics&quot;) can be associated with suffixes, and each canonical paradigm allows a set of stems (&quot;words&quot;). For a suffix-stem (&quot;document-word&quot;) matrix of size m x n and k latent classes, the Gamma matrix is of size m x k, and the Beta matrix is of size k x n. The Gamma matrix, normalized by column, is the M matrix, and the Beta matrix, normalized by row, is the L matrix.</Paragraph> </Section> <Section position="2" start_page="70" end_page="71" type="sub_section"> <SectionTitle> 3.2 Recursive LDA </SectionTitle> <Paragraph position="0"> One standard issue in using algorithms of this type is selecting the number of classes. To deal with this, we have formulated a recursive wrapper algorithm for LDA that performs a divisive clustering of suffixes. LDA is run (with two latent classes) at each stage to find the local Gamma and Beta matrices. To split the suffixes into two classes, we examine the Gamma matrix and assign each suffix to the class for which its probability is greater. The input matrix is then divided into two smaller matrices based on this split, and the algorithm continues with each submatrix. The result is a binary tree describing the suffix splits at each node.</Paragraph> <Paragraph position="1"> To construct a classification of suffixes into paradigms, it is necessary to make a cut in the tree.</Paragraph> <Paragraph position="2"> Assuming that suffix splits are optimal, we start at the root of the tree and go down until reaching a node where there is sufficient uncertainty about which class a suffix should belong to. A good split of suffixes is one where the vectors of probabilities of suffixes given a class are orthogonal; we can find such a split by minimizing the cosine of the two columns of the node's Gamma matrix (we call this the &quot;Gamma cosine&quot;). Thus, a node at which suffixes should not be split has a high Gamma cosine, and when such a node is encountered, a cut should be made. The suffixes below this node are grouped together as a paradigm; tree structure below the cut node is ignored. In our experiments we selected thresholds for the Gamma cosine by hand, and we do not know whether a single value would be successful cross-linguistically. After the tree has been cut, the Gamma and Beta matrices for ancestor nodes are normalized and combined to form the M and L matrices for the language.</Paragraph>
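The split-or-cut decision can be summarized in a few lines; this is a simplified sketch assuming a two-column Gamma matrix at each node (the threshold value is illustrative; for reference, the English experiment in Section 5.1 uses .0009).

```python
# Sketch of the recursive-LDA node decision: compute the "Gamma cosine"
# between the two class columns, cut if it is high, otherwise assign each
# suffix to the class with the greater probability.
import numpy as np

def gamma_cosine(gamma: np.ndarray) -> float:
    a, b = gamma[:, 0], gamma[:, 1]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_or_cut(gamma: np.ndarray, suffixes: list, threshold: float):
    if gamma_cosine(gamma) >= threshold:
        return None                       # cut: group suffixes as one paradigm
    labels = gamma.argmax(axis=1)         # greater-probability class per suffix
    left = [s for s, c in zip(suffixes, labels) if c == 0]
    right = [s for s, c in zip(suffixes, labels) if c == 1]
    return left, right                    # recurse on each half's submatrix
```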
<Paragraph position="3"> Another issue is dealing with suboptimal solutions. Random initializations of parameters lead the EM training procedure to local maxima in the solution space, and as a result LDA produces differing suffix splits across different runs. To get around this, we simply run LDA multiple times (25 in our experiments) and choose the solution that minimizes the Gamma cosine.</Paragraph> <Paragraph position="4"> We also experimented with minimizing the Beta cosine. The Beta matrix represents stem ambiguity with respect to a suffix split. Since there are inherently ambiguous stems, one should not expect the Beta cosine value to be extremely low.</Paragraph> <Paragraph position="5"> Minimizing the Beta cosine sometimes made the Beta matrix &quot;too disambiguated&quot; and forced the representation of ambiguity into the Gamma matrix, thereby inflating the Gamma cosine and causing incorrect classifications of suffixes.</Paragraph> </Section> </Section> <Section position="4" start_page="71" end_page="72" type="metho"> <SectionTitle> 4 Data </SectionTitle> <Paragraph position="0"> We conducted experiments on English and Spanish. For English, we chose the Penn Treebank (Marcus et al. 1993), which is already POS-tagged; for Spanish, we chose an equivalent-sized portion of newswire (Graff and Gallegos 1999), POS-tagged by the FreeLing morphological analyzer (Carreras et al. 2004). We restricted our data to nouns, verbs, adjectives, and adverbs. Words that did not follow canonical suffixation patterns for their POS category (irregulars, foreign words, incorrectly tagged words, etc.) were excluded. We segmented each word into stem and suffix for a specified set of suffixes. Rare suffixes were excluded, such as many English adjective-forming suffixes and the Spanish 2nd person plural forms.</Paragraph> <Paragraph position="1"> Stems were not lemmatized, with the result that there can be multiple stem variants of a particular lemma, as with the words stemm.ing$ and stem.s$. Tokens were not disambiguated for word sense. Stems that occurred with only one suffix were excluded.</Paragraph> <Paragraph position="2"> We use three different representations of suffixes in constructing the data matrices: 1) merged, labeled suffixes; 2) unmerged, labeled suffixes; 3) unmerged, unlabeled suffixes. For unmerged suffixes, allomorphs are represented in their original spelling. A merged suffix is a common representation for the multiple surface manifestations of an underlyingly identical suffix. (We abuse the standard usage of the term &quot;allomorph&quot; to include gender and conjugational variants.) Suffixes can also be unlabeled, or labeled with base POS tags. For example, the verb created would be segmented as create.d$V with an unmerged, labeled suffix, or create.d/ed$V with a merged, labeled suffix.</Paragraph>
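A hypothetical string encoding of these representations, using the paper's stem.suffix$LABEL notation; the tiny merge table covers only the worked example, and the function is our own illustration, not part of the described system.

```python
# Illustrative encoder for the suffix representations; MERGED maps an
# allomorph to its merged form for this one example only.
MERGED = {"d": "d/ed", "ed": "d/ed"}

def encode(stem: str, suffix: str, pos: str, merge: bool, label: bool) -> str:
    form = MERGED.get(suffix, suffix) if merge else suffix
    return f"{stem}.{form}${pos if label else ''}"

print(encode("create", "d", "V", merge=False, label=True))   # create.d$V
print(encode("create", "d", "V", merge=True,  label=True))   # create.d/ed$V
print(encode("create", "d", "V", merge=False, label=False))  # create.d$
```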
<Paragraph position="4"> Labels disambiguate otherwise categorically ambiguous suffixes.</Paragraph> <Paragraph position="5"> The gold standard for each language lists the suffixes that belong to the paradigm for stems of a particular POS category. We call this the &quot;input&quot; POS category; it is not indicated in the annotations, and it is the concept to be predicted. This should be differentiated from the &quot;output&quot; POS labels on the suffixes: for example, ly$R attaches to stems of the input category &quot;adjective&quot;. Each suffix is an atomic entity, so the system actually has no concept of output POS categories. All that we require is that distinct suffixes are given distinct symbols.</Paragraph> <Paragraph position="6"> In the English gold standard (Table 1), each slashed pair of suffixes denotes one merged form; the unmerged forms are the individual suffixes.</Paragraph> <Paragraph position="7"> ally$R is the suffix ly$R preceded by an epenthetic vowel, as in the word basically. In the Spanish gold standard (Table 2), each slashed group of suffixes corresponds to one merged form. For adjectives and nouns, a$ and o$ are the feminine and masculine singular forms, and as$ and os$ are the corresponding plurals. $ and s$ do not have gender; es$ is a plural allomorph.</Paragraph> <Paragraph position="8"> mente/amente$R is a derivational suffix. The first two groups of verbal suffixes are past participles, agreeing in number and gender. For the other verb forms, when three are listed they correspond to the forms for the 1st, 2nd, and 3rd conjugations. When there are two, the first is for the 1st conjugation, and the other is identical for the 2nd and 3rd. o$V has the same form across all three conjugations.</Paragraph>
Table 1: English gold standard.
Adjectives: $A, d/ed$A, r/er$A, ally/ly$R
Nouns: $N, 's$N, es/s$N
Verbs: $V, d/ed$V, es/s$V, ing$V, ing$A, ing$N, r/er$N
Table 2: Spanish gold standard.
Adjectives: a/o/$A, as/os/es/s$A, mente/amente$R
Nouns: a/o/$N, as/os/es/s$N
Verbs: ada/ida/ado/ido$V, adas/idas/ados/idos$V, ando/iendo$V, ar/er/ir$V, o$V, as/es$V, a/e$V, amos/emos/imos$V, an/en$V, aba/ia$V, abamos/iamos$V, aban/ian$V, are/ere/ire$V, ara/era/ira$V, aremos/eremos/iremos$V, aran/eran/iran$V, e/i$V, o/io$V, aron/ieron$V, aria/eria/iria$V, arian/erian/irian$V
</Section> <Section position="5" start_page="72" end_page="76" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="72" end_page="73" type="sub_section"> <SectionTitle> 5.1 Merged, labeled suffixes </SectionTitle> <Paragraph position="0"> Figure 1 shows the recursion tree for English data preprocessed with merged, labeled suffixes. To produce a classification of suffixes into paradigms, we start at the root and go down until reaching nodes with a Gamma cosine greater than or equal to the threshold. The cut for a threshold of .0009 produces three paradigms exactly matching the gold standard for verbs, adjectives, and nouns, respectively. Table 3 shows the complete M matrix, which contains the suffix probabilities for each paradigm. Table 4 shows a portion of the L matrix, which contains the probabilities of stems belonging to paradigms. We list the stems that are most ambiguous with respect to paradigm membership (note that this table does not specify the words that belong to each category, only their stems).</Paragraph> <Paragraph position="1"> [Table 4 caption fragment: merged, labeled suffixes, sorted by lowest entropy. Columns: p(paradigm|stem).]</Paragraph>
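A ranking like Table 4's could be reproduced from the L matrix as sketched here, assuming the columns of L hold p(paradigm|stem); the stems and probabilities below are invented.

```python
# Rank stems by the entropy (bits) of their paradigm distribution: higher
# entropy = more paradigm-ambiguous stem. All values here are invented.
import numpy as np

def entropy_bits(col: np.ndarray) -> float:
    nz = col[col > 0]
    return float(-(nz * np.log2(nz)).sum())

L = np.array([[0.55, 0.98, 0.34],    # p x n, columns are p(paradigm | stem)
              [0.45, 0.02, 0.66]])
stems = ["plan", "walk", "cost"]     # hypothetical stems

order = np.argsort([-entropy_bits(L[:, j]) for j in range(L.shape[1])])
print([stems[j] for j in order])     # most ambiguous first: plan, cost, walk
```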
<Paragraph position="3"> Next, we examine the morphological and lexical conditional probabilities in the M and L matrices. It is possible that even though the correct classification of suffixes into paradigms was learned, the probabilities may be inaccurate. Table 5 shows, however, that the M and L matrices are an extremely accurate approximation of the true morphological and lexical probabilities. We have included statistics for the corresponding Spanish experiment; the paradigms discovered for Spanish also match the gold standard.</Paragraph> </Section> <Section position="2" start_page="73" end_page="73" type="sub_section"> <SectionTitle> 5.2 Unmerged, labeled suffixes </SectionTitle> <Paragraph position="0"> The next experiments tested the effect of allomorphy on paradigm discovery, using data where suffixes are labeled but not merged. There are competing pressures at work in determining how allomorphs are assigned to paradigms: on the one hand, the disjointness of the stem sets of allomorphs tends to place them in separate paradigms; on the other hand, if those stem sets have other suffixes in common that belong to the same paradigm, the allomorphs might likewise be placed in that paradigm. In our experiments, we found much more variability across runs than in the merged suffix cases. In English, for example, the suffix es$N was sometimes placed in the &quot;verb&quot; paradigm, although the maximally orthogonal solution placed it in the &quot;noun&quot; paradigm. Figure 2 shows the recursion tree and paradigms for Spanish. Gold standard noun and adjective categories are fragmented into multiple paradigms in the tree. Although the nouns have a common parent node (2), the nouns of different genders are placed in separate paradigms -- this is because a noun can have only one gender. The verbs are all in a single paradigm (node 10). Node 11 contains all the first-conjugation verbs, and node 12 contains all the second/third-conjugation verbs. The reason they are not in separate paradigms is that a$V is shared by stems of all three conjugations, which leads to a split that is not quite orthogonal.</Paragraph> <Paragraph position="1"> The case of adjectives is the most interesting.</Paragraph> <Paragraph position="2"> Gendered and non-gendered adjective stems are disjoint, so adjectives appear in two separate subtrees (nodes 4, 13). In node 4, the gender-ambiguous plural es$A is in conflict with the plural s$A, but it would conflict with two plurals, as$A and os$A, if it were placed in node 13.</Paragraph> <Paragraph position="3"> amente$R appears together with the feminine adjectives in node 14 because it shares stems with them.</Paragraph> <Paragraph position="4"> amente$R also shares stems with verbs, as it is also the derivational suffix that attaches to verbal past participles in the feminine &quot;a&quot; form. This is probably why the group of adjectives at node 13 is a sister to the verb nodes. The allomorph mente$R attaches to non-gendered adjectives, and is thus in the first adjective group.</Paragraph> <Paragraph position="6"> [Figure 2 caption: recursion tree for Spanish unmerged, labeled suffixes, with Gamma cosine values. Dotted lines indicate paradigms for a Gamma cosine threshold of .0021.]</Paragraph>
</Section> <Section position="3" start_page="73" end_page="76" type="sub_section"> <SectionTitle> 5.3 Unmerged, unlabeled suffixes </SectionTitle> <Paragraph position="0"> The case of unmerged, unlabeled suffixes is not as successful. In the Gamma matrix for the root node (Table 6), there is no orthogonal division of the suffixes, as indicated by the high Gamma cosine value of .1705. Despite this, the algorithm has discovered useful information. There is a subparadigm of unambiguous suffixes {'s$,ally$}, and another of {d$,ed$,ing$,r$}. The other suffixes ($,er$,es$,ly$,s$) are ambiguous. The ambiguity of ly$ seems to be a secondary effect: since adjectives with the null suffix $ are found to be ambiguous, ly$ is likewise ambiguous.</Paragraph> <Paragraph position="1"> [Table 6 caption fragment: unmerged, unlabeled suffixes; the categorization is shown with brackets. Columns indicate p(suffix|class).]</Paragraph> <Paragraph position="2"> We compare our model with Linguistica (Goldsmith 2001; http://linguistica.uchicago.edu), a freely available program for unsupervised discovery of morphological structure. We focus our attention on Linguistica's representation of morphology, rather than the algorithm used to learn it. Linguistica takes a list of word types, proposes segmentations of words into stems and suffixes, and organizes them into signatures. A signature is a non-probabilistic data structure that groups together all stems that share a common set of suffixes. Each stem belongs to exactly one signature, and the set of suffixes for each signature is unique. For example, running Linguistica on our raw English text yields a signature {$,ful$,s$} for the stems {resource, truth, youth}, indicating the morphology of the words {resource$, truth$, youth$, resourceful$, truthful$, youthful$, resources$, truths$, youths$}. There are no POS types in the system. Thus, even for a prototypically &quot;noun&quot; signature such as {$,'s$}, it is quite possible that not all of the words that the signature represents are actually nouns. For example, the word structure$ is in this signature, but occurs both as a noun (59 times) and a verb (2 times) in our corpus.</Paragraph> <Paragraph position="3"> The signature model can be derived from the suffix-stem data matrix by first converting all positive counts to 1, and then placing in separate groups all the stems that have the same 0/1 column pattern. Another way to view the signature is as a special case of the probabilistic paradigm in which all probabilities are restricted to 0 or 1; if this were so, the only way to fit the data would be to posit a canonical paradigm for every distinct subset of suffixes that some stem appears with. In theory, the number of signatures can be exponential in the number of suffixes; in practice, Linguistica finds hundreds of signatures for English and Spanish. Although there has been work on reducing the number of signatures (Goldwater and Johnson 2004; Hu et al. 2005, who report a reduction of up to 30%), the number of remaining signatures is still two orders of magnitude greater than the number of canonical paradigms we find. The simplest explanation is that a suffix can be listed many times across different signatures, but has only one entry in the M matrix of the probabilistic paradigm.</Paragraph> <Paragraph position="4"> It is important for a natural language system to handle out-of-vocabulary words. A signature does not predict potential but unseen forms of its stems. To some extent Linguistica could accommodate this, as it identifies when one signature's suffixes are a proper subset of another's, but it does not handle cases where suffix sets partially overlap. One principal advantage of the probabilistic paradigm is that the canonical paradigm allows the instantiation of a lexical paradigm containing a complete set of predicted word forms for a stem.</Paragraph>
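The 0/1 derivation just described is mechanical enough to state directly; here is a sketch with toy counts (the grouping rule is from the text, the data is invented).

```python
# Derive signatures: binarize the suffix-stem count matrix, then group stems
# that share the same 0/1 column pattern (i.e., the same set of suffixes).
from collections import defaultdict
import numpy as np

def signatures(D, suffixes, stems):
    groups = defaultdict(list)
    B = D > 0                                        # positive counts -> 1
    for j, stem in enumerate(stems):
        key = tuple(s for i, s in enumerate(suffixes) if B[i, j])
        groups[key].append(stem)
    return dict(groups)

D = np.array([[3, 1, 0],                             # rows: suffixes
              [2, 5, 0],                             # cols: stems
              [0, 0, 4]])
print(signatures(D, ["$N", "'s$N", "d/ed$V"], ["truth", "youth", "abate"]))
# {("$N", "'s$N"): ['truth', 'youth'], ("d/ed$V",): ['abate']}
```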
<Paragraph position="5"> Since Linguistica is a system that starts from raw text, it may seem that it cannot be directly compared to our work, which assumes that segmentations and suffixes are already known. However, it is possible to run Linguistica on our data after further preprocessing. We rewrite the corpus in such a way that Linguistica can detect the correct morphological and POS information for each token. Each token is replaced by the concatenation of its stem, the dummy string 12345, and a single-character encoding of its merged suffix. For example, the token accelerate.d/ed$V is mapped to accelerate12345D, where D represents d/ed$V.</Paragraph> <Paragraph position="6"> The omnipresence of the dummy string enables Linguistica to discover all the desired stems and suffixes, but no more. By mapping the input corpus in this way, we can examine the type of grammar that Linguistica would find if it knew the information that we have assumed in the previous experiments. Linguistica found 565 signatures from the &quot;cooked&quot; English data (Figure 3). 50% of word types are represented by the first 13 signatures.</Paragraph> <Paragraph position="7"> Figure 3: signatures from the cooked, merged suffix English data. Each signature shows the suffix set, the number of stems, and several example stems. Ranking is by log(num stems) * log(num suffixes).
1. { $N, es/s$N } (1540 stems): abortion absence accent acceptance accident accolade accommodation
2. { $N, 's$N } (1168 stems): aba abbie abc academy achenbaum aclu adams addington addison adobe
3. { $N, 's$N, es/s$N } (224 stems): accountant acquisition actor administration airline airport alliance
5. { $A, ally/ly$R } (319 stems): abrupt absolute abundant accurate actual additional adequate adroit
6. { $A, $N, es/s$N } (173 stems): abrasive acid activist adhesive adult afghan african afrikaner aggregate
7. { $V, d/ed$V, es/s$V } (135 stems): abate achieve administer afflict aggravate alienate amass apologize
9. { $V, d/ed$V, ing$V, es/s$V } (73 stems): abound absorb adopt applaud assert assist attend attract avert avoid
13. { $N, $V, d/ed$V, es/s$N, es/s$V } (44 stems)</Paragraph> <Paragraph position="8"> We have formulated two metrics to evaluate the quality of a collection of signatures or paradigms. Ideally, all suffixes of a particular signature would be of the same category, and all the words of a particular category would be contained within one signature. POS fragmentation measures the extent to which the words of an input POS category are scattered across different signatures: it is the average number of bits required to encode a category's distribution of words over signatures. Signature impurity measures the extent to which the suffixes of a signature are of mixed input POS types: it is the expected number of bits required to encode a signature's distribution of suffixes over input POS categories. Table 7 shows that, according to these metrics, the signature does not organize morphological information as efficiently as probabilistic paradigms. (Our scores would not be as good had we chosen a poor Gamma cosine threshold value for classification; Linguistica's scores, however, cannot be decreased, as there is only one signature model for a fixed set of stems and suffixes.)</Paragraph>
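Both metrics are entropies over conditional distributions and can be sketched as below. Note the simplification: the paper computes fragmentation over a category's words and impurity over a signature's suffixes, while this sketch uses a single invented count table for both, just to show the arithmetic.

```python
# counts[i, j]: mass of input POS category j inside signature i (toy values).
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

counts = np.array([[120.0, 0.0],
                   [40.0, 10.0],
                   [0.0, 90.0]])

# POS fragmentation: entropy of each category's distribution over signatures,
# averaged over categories.
fragmentation = np.mean([entropy_bits(counts[:, j] / counts[:, j].sum())
                         for j in range(counts.shape[1])])

# Signature impurity: expected entropy of a signature's distribution over
# categories, weighted by signature size.
sizes = counts.sum(axis=1)
impurity = float(np.dot(sizes / sizes.sum(),
                        [entropy_bits(row / row.sum()) for row in counts]))
print(round(fragmentation, 3), round(impurity, 3))
```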
<Paragraph position="9"> Linguistica's impurity scores are reasonably low because many of the signatures with the most stems are categorically homogeneous. The fragmentation scores show that the placement of the majority of words within the top signatures offsets the scattering of a POS category's words over many signatures.</Paragraph> <Paragraph position="10"> [Table 7 caption fragment: log2(3) = 1.585 bits.]</Paragraph> <Paragraph position="11"> Finally, a morphological grammar should reflect the general, abstract morphological structure of the language from which a corpus was sampled. To test the consistency of morphological grammars across corpora, we split our cooked English data into two equal parts. Linguistica found 449 total signatures for the first half and 462 for the second; 296 signatures were common to both (in terms of the suffixes contained by the signatures). Of the 3506 stems shared by both data sets, 1831 (52.2%) occurred in the same signature. Of the top 50 signatures for each half-corpus, 45 were in common, and 1651 of 2403 shared stems (68.7%) occurred in the same signature. Recursive LDA found the same canonical paradigms for both data sets (which matched the gold standard). Differences in word counts between the corpus halves altered stem inventories and lexical probabilities, but not the structure of the canonical paradigms. Our system thus displays a robustness to corpus choice that does not hold for Linguistica.</Paragraph> </Section> </Section> <Section position="6" start_page="76" end_page="76" type="metho"> <SectionTitle> 7 Future Work </SectionTitle> <Paragraph position="0"> This section sketches some ideas for future work to increase the linguistic adequacy of the system, and to make it more unsupervised.</Paragraph> <Paragraph position="1"> 1. Bootstrapping: for fully unsupervised learning, we need to hypothesize stems and suffixes. The output of recursive LDA indicates which suffixes may be ambiguous. To bootstrap a disambiguator for the different categorial uses of these suffixes, one could use various types of distributional information, as well as knowledge of the partial paradigmatic structure of the non-ambiguous suffixes.</Paragraph> <Paragraph position="2"> 2. Automated detection of cut nodes: currently the system requires the user to select a Gamma cosine threshold for extracting paradigms from the recursion tree. We would like to automate this process, perhaps with different heuristics.</Paragraph> <Paragraph position="3"> 3. Suffix merging and formulation of generation rules: when we decide that two suffixes should be merged (based on measures of distributional similarity and word-internal context), we also need to formulate phonological (i.e., spelling) rules to determine which surface form to use when instantiating a form from the canonical paradigm.</Paragraph> <Paragraph position="4"> 4. Non-regular forms: we can take advantage of empty cells in the data matrix to identify non-regularities such as suppletives, stem variants, semi-regular subclasses, and suffix allomorphs. If the expected frequency of a word form (as derived from the M matrix and the frequency of the stem) is relatively high but the corresponding value in the D matrix is zero, this is evidence that a non-regular form may occupy this cell. Locating irregular words could use methods similar to those of Yarowsky and Wicentowski (2000), who pair irregular inflections with their roots from raw text. Stem variants and allomorphic suffixes could be detected in a similar manner, by finding sets of stems/suffixes with mutually exclusive matrix entries (a sketch of the empty-cell heuristic follows).</Paragraph>
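A sketch of the empty-cell heuristic from point 4; it assumes each stem has already been assigned its most probable paradigm, the expected-count formula (suffix probability times stem frequency) follows the text, and the threshold is an invented placeholder.

```python
# Flag cells where the model expects a frequent form but the corpus has none:
# expected count = p(suffix | stem's paradigm) * freq(stem), observed D = 0.
import numpy as np

def suspicious_cells(D, M, paradigm_of_stem, min_expected=5.0):
    flagged = []
    for j in range(D.shape[1]):
        expected = M[:, paradigm_of_stem[j]] * D[:, j].sum()
        for i in range(D.shape[0]):
            if D[i, j] == 0 and expected[i] >= min_expected:
                flagged.append((i, j, float(expected[i])))  # candidate non-regular form
    return flagged
```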
<Paragraph position="5"> 5. Multiple morphological properties per word: we currently represent all morphological and POS information with a single suffix. The learning algorithm and representation could perhaps be modified to allow for multiple morphological properties. One could perform recursive LDA on a particular morphological property, then take each of the learned paradigms and perform recursive LDA again, but for a different morphological property. This method might discover Spanish conjugational classes as subclasses within &quot;verbs&quot;.</Paragraph> </Section> </Paper>