File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/w05-0503_metho.xml
Size: 9,621 bytes
Last Modified: 2025-10-06 14:09:55
<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0503"> <Title>Using Morphology and Syntax Together in Unsupervised Learning</Title> <Section position="4" start_page="21" end_page="23" type="metho"> <SectionTitle> 3 A more abstract statement of the problem </SectionTitle>
<Paragraph position="0"> A minimum description length (MDL) analysis is especially appropriate for machine learning of linguistic analysis because it simultaneously puts a premium both on analytical simplicity and on goodness of fit between the model and the data (Rissanen 1989).</Paragraph>
<Paragraph position="1"> We first present the mathematical statement of the MDL model of the morphology, in (1), following the analysis in Goldsmith (2001); we then describe the meaning of the terms of these expressions, and finally present the modified version, in (2) and (3), which includes additional terms for part of speech (POS) information.</Paragraph>
<Paragraph position="2"> The signature-collapsing problem has another side to it as well. An initial morphological analysis of English will typically give rise to an analysis of words such as move, moves, moved, moving with a signature whose stems include mov and whose affixes are e.ed.es.ing. A successful solution to the signature-collapsing problem will collapse O.ed.ing.s with e.ed.es.ing, noting that O ~ e, ed ~ ed, es ~ s, and ing ~ ing in an obvious sense.</Paragraph>
<Paragraph position="3"> Equation (1a) states that our goal is to find the (morphological) grammar that minimizes the sum of its own length and the compressed length of the data it analyzes, while (1b) specifies the grammar length (or model length) as the sum of the lengths of the links between the major components of the morphology: the list of letters (or phonemes) comprising the morphemes, the morphemes themselves (stems and affixes), and the signatures. We use square brackets "[.]" to denote the number of tokens in the corpus containing a given morpheme or word. The first line of (1b) expresses the notion that each stem consists of a pointer to its signature and a list of pointers to the letters that comprise it; s(t) is the signature associated with stem t, and we take its probability to be [s(t)]/[W], the empirical count of the words associated with s(t) divided by the total count of words in the data. The second line expresses the idea that the morphology contains a list of affixes, each of which contains a list of pointers to the letters that comprise it. The third line of (1b) expresses the notion that a signature consists of a list of pointers to its component affixes. (1c) expresses the compressed length of each word in the data.</Paragraph>
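The equations labeled (1a)-(1c) did not survive extraction into this file. As a rough guide for the reader, the following LaTeX sketch reconstructs their general shape from the verbal description above and from Goldsmith (2001); the notation (G for the grammar, T, F, and Sigma for the sets of stems, affixes, and signatures, s(t) for the signature of stem t, W for the set of analyzed word types, [x] for the token count of x, and [sigma.f] for the count of words in signature sigma formed with affix f) is ours and may differ in detail from the paper's own formulas.

```latex
% Hedged reconstruction of the shape of (1a)-(1c); notation is ours and the
% details may differ from the paper's own formulas.
\begin{align}
  % (1a) choose the grammar minimizing model length plus compressed data length
  G^{*} &= \arg\min_{G}\,\bigl[\,\lambda(G) + L(\mathit{Data}\mid G)\,\bigr] \\
  % (1b) model length: each stem points to its signature and spells out its
  % letters; each affix spells out its letters; each signature points to its affixes
  \lambda(G) &= \sum_{t\in T}\Bigl(\log\frac{[W]}{[s(t)]}
                  + \sum_{l\in t}\log\frac{1}{p(l)}\Bigr)
              + \sum_{f\in F}\sum_{l\in f}\log\frac{1}{p(l)}
              + \sum_{\sigma\in\Sigma}\sum_{f\in\sigma}\log\frac{[\sigma]}{[\sigma.f]} \\
  % (1c) compressed length of the data: each word type w = t + f is encoded by
  % pointers to its signature, to its stem within that signature, and to its affix
  L(\mathit{Data}\mid G) &= \sum_{w=t+f\in W}\Bigl(\log\frac{[W]}{[s(t)]}
                  + \log\frac{[s(t)]}{[t]}
                  + \log\frac{[s(t)]}{[s(t).f]}\Bigr)
\end{align}
```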
<Paragraph position="4"> We now consider extending this model to include part of speech labeling, as sketched in (2). The principal innovation in (2) is the addition of part of speech tags: each affix is associated with one or more POS tags. (Footnote: We do not sum over all occurrences of a word in the corpus; we count the compressed length of each word type found in the corpus. This decision was based on the observation that the data term (the compressed length of the data) grows much faster than the length of the grammar as the corpus grows, and the loss in the model's ability to predict word frequencies overwhelms any increase in model simplicity when we count word tokens in the data term. We recognize the departure from the traditional understanding of MDL here, and assume the responsibility of explaining this in a future publication.)</Paragraph>
<Paragraph position="5"> As we have seen, a path from a particular signature s to a particular affix f constitutes what we have called a signature transform s_f, and we condition the probabilities of the POS tags in the data on the preceding signature transform. As a result, our final model takes the form in (3).</Paragraph>
<Paragraph position="6"> The differences between the two models are the added final term in (3b), which specifies the information required to predict, or specify, the part of speech given the signature transform, and the corresponding term in the corpus compression expression (3c).</Paragraph>
<Paragraph position="7"> The model in (3) implicitly assumes that the true POS tags are known; in a more complete model, the POS tags would play a direct role in assigning a higher probability to the corpus (and hence a smaller compressed size to the data). In the context of such a model, an MDL-based learning device would search for the best assignment of POS tags over all possible assignments. Instead of doing that in this paper, we use the tags produced by TreeTagger (Schmid, 1994) (see section 5 below), and make the working assumption that optimization of the description length over all signature analyses and POS tag assignments can be approximated by optimization over all signature analyses, given the POS tags provided by TreeTagger.</Paragraph> </Section>
<Section position="5" start_page="23" end_page="24" type="metho"> <SectionTitle> 4 The collapsing of signatures </SectionTitle>
<Paragraph position="0"> In this section we describe our proposed algorithm, which uses context vectors to collapse signatures together and is composed of a sequence of operations, all but the first of which may be familiar to the reader. Replacement of words by signature transforms: The input to our algorithm for collapsing signatures is a modified version of the corpus which integrates the (unsupervised) morphological analyses in the following way.</Paragraph>
<Paragraph position="1"> First of all, we leave unchanged the 200 most frequent words (word types). Next, we replace words belonging to the K most reliable signatures (K=50 in these experiments) by their associated signature transforms, and we in effect ignore all other words by replacing them with a distinguished "dummy" symbol. In what follows, we refer to the high-frequency words and the signature transforms together as elements, so that an element is any member of the transformed corpus other than the "dummy". Context vectors based on mutual information: Reading through the corpus, we populate both a left and a right context vector for each element (signature transform or high-frequency word) by observing the elements that occur adjacent to it. The feature indicating the appearance of a particular word on the left is always kept distinct from the feature indicating the appearance of the same word on the right.</Paragraph>
<Paragraph position="2"> The features in a context vector are thus associated with the members of the element vocabulary (and indeed, each member of the element vocabulary occurs as two features: one on the left and one on the right). We assign the value of each feature y in x's context vector as the pointwise mutual information of the corresponding element pair (x, y), defined as PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ].</Paragraph>
<Paragraph position="3"> Simplifying context vectors with "idf": Because of the high dimensionality of the context vectors and the fact that some features are more representative than others, we trim the original context vector: for each context vector, we sort the features by their values and keep the top N (in general, we set N to 10) by setting their values to 1 and all others to 0. However, in this simplified context vector not all features are equally good at distinguishing syntactic categories. As Wicentowski (2002) does in a similar context, we therefore assign each feature a weight in a fashion parallel to inverse document frequency (idf; see Sparck Jones), and we view these weights as the diagonal elements of a matrix M (that is, m_ii = w_i). We then measure the similarity between two simplified context vectors by computing their weighted dot product: given two simplified context vectors c and d, their similarity is defined as c^T M d. If this value is larger than a threshold th, set as a parameter, we deem the two context vectors similar. We then determine the similarity between two elements by checking whether both their left and their right simplified context vectors are similar (i.e., both weighted dot products exceed the threshold th). In the experiments we describe below, we explore four settings for this threshold: 0.8 (the most "liberal", allowing greater signature-transform collapse and hence greater signature collapse), 1.0, 1.2, and 1.5.</Paragraph>
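The construction of the PMI-valued context vectors can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the corpus is assumed to have already been converted into the sequence of elements and dummy symbols described above, and the names (build_pmi_vectors, DUMMY) are ours.

```python
from collections import Counter, defaultdict
from math import log

DUMMY = "<dummy>"  # stand-in symbol for all ignored words (the name is ours)

def build_pmi_vectors(elements):
    """Build a left and a right context vector for every element.  The value of
    feature y in x's vector is the pointwise mutual information
    log p(x, y) / (p(x) p(y)) of the adjacent pair; left-context and
    right-context occurrences of the same word are kept as distinct features
    by storing them in separate vectors."""
    unigrams = Counter(e for e in elements if e != DUMMY)
    pairs = Counter()
    for left, right in zip(elements, elements[1:]):
        if left != DUMMY and right != DUMMY:
            pairs[(left, right)] += 1

    n_pairs = sum(pairs.values())
    n_elems = sum(unigrams.values())
    left_vec = defaultdict(dict)    # left_vec[x][y]: y seen immediately to x's left
    right_vec = defaultdict(dict)   # right_vec[x][y]: y seen immediately to x's right
    for (l, r), count in pairs.items():
        pmi = log((count / n_pairs) /
                  ((unigrams[l] / n_elems) * (unigrams[r] / n_elems)))
        right_vec[l][r] = pmi
        left_vec[r][l] = pmi
    return left_vec, right_vec
```

Here elements would contain the 200 high-frequency words, the signature transforms of words belonging to the 50 most reliable signatures, and the dummy symbol for everything else.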
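The trimming, weighting, and similarity test can be sketched in the same style. Again this is a reconstruction under stated assumptions: the paper's exact idf-style weighting formula is not recoverable from the text, so standard idf over the simplified vectors is substituted, and all function names are ours.

```python
from collections import Counter
from math import log

def simplify(vector, n=10):
    """Keep the n highest-valued features of a context vector, setting them to 1
    and all others to 0; the returned set holds the features whose value is 1."""
    top = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return {feature for feature, _ in top}

def idf_weights(simplified_vectors):
    """Weight each feature in a fashion parallel to idf (assumed formula: log of
    the number of vectors over the number of vectors containing the feature)."""
    df = Counter(f for feats in simplified_vectors for f in feats)
    n = len(simplified_vectors)
    return {f: log(n / df[f]) for f in df}

def similarity(c, d, weights):
    """Weighted dot product c^T M d of two simplified 0/1 vectors, where M is
    the diagonal matrix of feature weights."""
    return sum(weights.get(f, 0.0) for f in c & d)

def elements_similar(x, y, left_simple, right_simple, w_left, w_right, th=1.0):
    """Two elements are deemed similar iff both their left and their right
    simplified context vectors exceed the similarity threshold th
    (the paper explores th = 0.8, 1.0, 1.2, and 1.5)."""
    return (similarity(left_simple[x], left_simple[y], w_left) > th and
            similarity(right_simple[x], right_simple[y], w_right) > th)
```

In this sketch, left_simple and right_simple map each element to its simplified left and right vector (simplify applied to the output of build_pmi_vectors above), and w_left and w_right are the idf-style weights computed separately over the left and right feature spaces.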
<Paragraph position="4"> Calculate signature similarity: To avoid considering many unnecessary pairs of signatures, we narrow the candidates to signature pairs in which the suffixes of one constitute a subset of the suffixes of the other, and we set a limit on the permissible difference in length between the signatures in a collapsed pair: the difference in number of affixes may not exceed 2. For each such pair, if all corresponding signature transforms are similar in the sense defined in the preceding paragraph, we deem the two signatures similar.</Paragraph>
<Paragraph position="5"> Signature graph: Finally, we construct a signature graph, in which each signature is represented as a vertex and an edge is drawn between two signatures iff they are similar, as just defined. In this graph we find a number of cliques, each of which, we believe, indicates a cluster of signatures that should be collapsed. If a signature is a member of two or more cliques, it is assigned to the largest clique (i.e., the one containing the largest number of signatures).</Paragraph> </Section> </Paper>