<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0503">
  <Title>Using Morphology and Syntax Together in Unsupervised Learning</Title>
  <Section position="3" start_page="20" end_page="21" type="intro">
    <SectionTitle>
2 Signatures and signature transforms
</SectionTitle>
    <Paragraph position="0"> We employ the approach to the unsupervised learning of morphology developed by Goldsmith (2001). Some of the discussion below depends rather heavily on material presented there, so we summarize the major points here.</Paragraph>
    <Paragraph position="1"> Two critical terms that we employ in this analysis are signature and signature transform. A signature found in a given corpus is a pair of lists: a stem-list and a suffix-list (or, in the appropriate context, a prefix-list). By the definition of a signature s, the concatenation of every stem in the stem-list of s with every suffix in the suffix-list of s is found in the corpus, and a morphological analysis of a corpus can be viewed as a set of signatures that uniquely analyze each word in the corpus. For example, a corpus of English that includes the words jump, jumps, jumped, jumping, walk, walks, walked, and walking might include the signature s whose stem-list is { jump, walk } and whose suffix-list is { O, ed, ing, s }, where O stands for the null (empty) suffix. For convenience, we label a signature with the concatenation of its suffixes, separated by periods. On such an analysis, the word jump is analyzed as belonging to the signature O.ed.ing.s, and it bears the suffix O. We say, then, that the signature transform of jump is O.ed.ing.s_O, just as the signature transform of jumping is O.ed.ing.s_ing; in general, the signature transform of a word W, when W is morphologically analyzed as stem T followed by suffix F, associated with signature s, is defined as s_F. In many of the experiments described below, we use a corpus in which all words whose frequency rank is greater than 200 have been replaced by their signature transforms. This move is motivated by the observation that high-frequency words in natural languages tend to have syntactic distributions that are poorly predicted by any feature other than their specific identity, whereas the distributional properties of lower-frequency words (which we take to be words outside the 200 most frequent) are better predicted by category membership.</Paragraph>
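    The definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the representation of a signature as a (stems, suffixes) pair, and the choice to leave unanalyzed words unchanged are all assumptions made for illustration. The symbol O for the null suffix and the rank cutoff of 200 come from the text.

```python
from collections import Counter

NULL = "O"  # the section's symbol for the null (empty) suffix


def surface_form(stem, suffix):
    """Concatenate stem and suffix, treating NULL as the empty string."""
    return stem + ("" if suffix == NULL else suffix)


def is_valid_signature(stems, suffixes, vocabulary):
    """True if every stem + suffix concatenation occurs in the vocabulary."""
    return all(
        surface_form(stem, suffix) in vocabulary
        for stem in stems
        for suffix in suffixes
    )


def signature_label(suffixes):
    """Join the suffixes with periods, e.g. 'O.ed.ing.s'."""
    return ".".join(sorted(suffixes))


def build_transforms(signatures):
    """Map each analyzed word to its signature transform 'label_suffix'."""
    transforms = {}
    for stems, suffixes in signatures:
        label = signature_label(suffixes)
        for stem in stems:
            for suffix in suffixes:
                transforms[surface_form(stem, suffix)] = label + "_" + suffix
    return transforms


def replace_rare_words(tokens, transforms, keep_top=200):
    """Keep the keep_top most frequent word types as they are and replace
    every other analyzed word by its signature transform.

    Assumption: words with no morphological analysis are left unchanged.
    """
    counts = Counter(tokens)
    frequent = {word for word, _ in counts.most_common(keep_top)}
    return [
        tok if tok in frequent else transforms.get(tok, tok)
        for tok in tokens
    ]


# Usage with the example signature from the text.
example = [(["jump", "walk"], ["O", "ed", "ing", "s"])]
transforms = build_transforms(example)
# transforms["jump"] is "O.ed.ing.s_O"; transforms["jumping"] is "O.ed.ing.s_ing"
```

    Sorting the suffixes in signature_label is a convenience that happens to reproduce the labels used in this section; any fixed ordering would serve equally well.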
    <Paragraph position="2"> In many cases, there is a natural connection between a signature transform and a lexical category. Our ultimate goal is to exploit this in the larger context of grammar induction. For example, consider the signature O.er.ly, which occurs with stems such as strong and weak. Words whose signature transform is O.er.ly_O are adjectives, those whose signature transform is O.er.ly_er are comparative adjectives, and those whose signature transform is O.er.ly_ly are adverbs.</Paragraph>
    <Paragraph position="3"> The connection is not perfect, however.</Paragraph>
    <Paragraph position="4"> Consider the signature O.ed.ing.s and its four signature transforms. While most words whose signature transform is O.ed.ing.s_s are verbs (indeed, 3rd person singular present tense verbs, as in he walks funny), many are in fact plural nouns (e.g., walks in He permitted four walks in the eighth inning). We will refer to this problem as the signature purity problem; it is essentially the reflex of the ambiguity of suffixes.</Paragraph>
    <Paragraph position="5"> In addition, many 3rd person singular present tense verbs are associated with other signature transforms, such as O.ing.s_s, O.ed.s_s, and so forth. We will refer to this as the signature-collapsing problem: all other things being equal, we would like to collapse certain signatures, such as O.ed.ing.s and O.ed.ing, since a stem that is associated with the latter signature could have appeared in the corpus with an -s suffix. Removing the O.ed.ing signature and reassigning its stems to the O.ed.ing.s signature will in general give us a better linguistic analysis of the corpus, one that can be better used in the problem of lexical category induction. This problem is the reflex of the familiar data sparsity concern. Since we ultimately want to use signatures and signature transforms to learn syntactic categories, we base the similarity measure between signatures on context.</Paragraph>
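    The reassignment step in signature collapsing can be sketched as follows. This is only an illustration under stated assumptions: the function name collapse_signature, the dictionary from signature labels to stem sets, and the requirement that the collapsed signature's suffixes form a subset of the target's are hypothetical. The sketch does not decide which signatures should be collapsed; in the text that decision rests on a context-based similarity measure, which is not modeled here.

```python
def collapse_signature(signatures, source, target):
    """Reassign the stems of signature `source` to `target` and drop `source`.

    `signatures` maps a label such as 'O.ed.ing' to a set of stems.
    Assumption (not from the paper): the merge is allowed only when every
    suffix of `source` also belongs to `target`.
    Returns a new dictionary; the input is not modified.
    """
    source_suffixes = set(source.split("."))
    target_suffixes = set(target.split("."))
    if not source_suffixes.issubset(target_suffixes):
        raise ValueError(
            f"{source} has suffixes that {target} lacks; refusing to collapse"
        )

    # Copy every signature except the one being removed.
    merged = {
        label: set(stems)
        for label, stems in signatures.items()
        if label != source
    }
    # Move the stems of the removed signature into the target signature.
    merged.setdefault(target, set()).update(signatures.get(source, set()))
    return merged


# Usage: the stem 'talk' was never observed with -s, so it starts in O.ed.ing.
before = {"O.ed.ing.s": {"jump", "walk"}, "O.ed.ing": {"talk"}}
after = collapse_signature(before, "O.ed.ing", "O.ed.ing.s")
# after maps "O.ed.ing.s" to the three stems jump, walk, and talk
```

    After the merge, a word such as talking would receive the signature transform O.ed.ing.s_ing rather than O.ed.ing_ing, so it shares a category label with jumping and walking.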
  </Section>
</Paper>