<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1302">
  <Title>On Statistical Parameter Setting</Title>
  <Section position="2" start_page="0" end_page="13" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In recent years there has been a growing amount of work focusing on the computational modeling of language processing and acquisition, implying a cognitive and theoretical relevance both of the models as such, as well as of the language properties extracted from raw linguistic data.</Paragraph>
    <Paragraph position="1">  In the computational linguistic literature several attempts to induce grammar or linguistic knowledge from such data have shown that at different levels a high amount of information can be extracted, even with no or minimal supervision. Different approaches tried to show how various puzzles of language induction could be solved. From this perspective, language acquisition is the process of segmentation of non-discrete acoustic input, mapping of segments to symbolic representations, mapping representations on higher-level representations such as phonology, morphology and syntax, and even induction of semantic properties. Due to space restrictions, we cannot discuss all these approaches in detail. We will focus on the close domain of morphology.</Paragraph>
    <Paragraph position="2"> Approaches to the induction of morphology as presented in e.g. Schone and Jurafsky (2001) or Goldsmith (2001) show that the morphological  See Batchelder (1998) for a discussion of these aspects.</Paragraph>
    <Paragraph position="3"> properties of a small subset of languages can be induced with high accuracy, most of the existing approaches are motivated by applied or engineering concerns, and thus make assumptions that are less cognitively plausible: a. Large corpora are processed all at once, though unsupervised incremental induction of grammars is rather the approach that would be relevant from a psycholinguistic perspective; b. Arbitrary decisions about selections of sets of elements are made, based on frequency or frequency profile rank,  though such decisions should rather be derived or avoided in general.</Paragraph>
    <Paragraph position="4"> However, the most important aspects missing in these approaches, however, are the link to different linguistic levels and the support of a general learning model that makes predictions about how knowledge is induced on different linguistic levels and what the dependencies between information at these levels are. Further, there is no study focusing on the type of supervision that might be necessary for the guidance of different algorithm types towards grammars that resemble theoretical and empirical facts about language acquisition, and processing and the final knowledge of language. While many theoretical models of language acquisition use innateness as a crutch to avoid outstanding difficulties, both on the general and abstract level of I-language as well as the more detailed level of E-language, (see, among others, Lightfoot (1999) and Fodor and Teller (2000), there is also significant research being done which shows that children take advantage of statistical regularities in the input for use in the language-learning task (see Batchelder (1997) and related references within).</Paragraph>
    <Paragraph position="5"> In language acquisition theories the dominant view is that knowledge of one linguistic level is bootstrapped from knowledge of one, or even several different levels. Just to mention such approaches: Grimshaw (1981), and Pinker (1984)  Just to mention some of the arbitrary decisions made in various approaches, e.g. Mintz (1996) selects a small set of all words, the most frequent words, to induce word types via clustering ; Schone and Jurafsky (2001) select words with frequency higher than 5 to induce morphological segmentation.</Paragraph>
    <Paragraph position="6">  assume that semantic properties are used to bootstrap syntactic knowledge, and Mazuka (1998) suggested that prosodic properties of language establish a bias for specific syntactic properties, e.g. headedness or branching direction of constituents. However, these approaches are based on conceptual considerations and psycholinguistc empirical grounds, the formal models and computational experiments are missing. It is unclear how the induction processes across linguistic domains might work algorithmically, and the quantitative experiments on large scale data are missing.</Paragraph>
    <Paragraph position="7"> As for algorithmic approaches to cross-level induction, the best example of an initial attempt to exploit cues from one level to induce properties of another is presented in Dejean (1998), where morphological cues are identified for induction of syntactic structure. Along these lines, we will argue for a model of statistical cue-based learning, introducing a view on bootstrapping as proposed in Elghamry (2004), and Elghamry and Cavar (2004), that relies on identification of elementary cues in the language input and incremental induction and further cue identification across all linguistic levels.</Paragraph>
    <Section position="1" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
1.1 Cue-based learning
</SectionTitle>
      <Paragraph position="0"> Presupposing input driven learning, it has been shown in the literature that initial segmenations into words (or word-like units) is possible with unsupervised methods (e.g. Brent and Cartwright (1996)), that induction of morphology is possible (e.g. Goldsmith (2001), Schone and Jurafsky (2001)) and even the induction of syntactic structures (e.g. Van Zaanen (2001)). As mentioned earlier, the main drawback of these approaches is the lack of incrementality, certain arbitrary decisions about the properties of elements taken into account, and the lack of integration into a general model of bootstrapping across linguistic levels.</Paragraph>
      <Paragraph position="1"> As proposed in Elghamry (2004), cues are elementary language units that can be identified at each linguistic level, dependent or independent of prior induction processes. That is, intrinsic properties of elements like segments, syllables, morphemes, words, phrases etc. are the ones available for induction procedures. Intrinsic properties are for example the frequency of these units, their size, and the number of other units they are build of. Extrinsic properties are taken into account as well, where extrinsic stands for distributional properties, the context, relations to other units of the same type on one, as well as across linguistic levels. In this model, extrinsic and intrinsic properties of elementary language units are the cues that are used for grammar induction only.</Paragraph>
      <Paragraph position="2"> As shown in Elghamry (2004) and Elghamry and Cavar (2004), there are efficient ways to identify a kernel set of such units in an unsupervised fashion without any arbitrary decision where to cut the set of elements and on the basis of what kind of features. They present an algorithm that selects the set of kernel cues on the lexical and syntactic level, as the smallest set of words that co-occurs with all other words. Using this set of words it is possible to cluster the lexical inventory into open and closed class words, as well as to identify the subclasses of nouns and verbs in the open class.</Paragraph>
      <Paragraph position="3"> The direction of the selectional preferences of the language is derived as an average of point-wise Mutual Information on each side of the identified cues and types, which is a self-supervision aspect that biases the search direction for a specific language. This resulting information is understood as derivation of secondary cues, which then can be used to induce selectional properties of verbs (frames), as shown in Elghamry (2004).</Paragraph>
      <Paragraph position="4"> The general claim thus is: * Cues can be identified in an unsupervised fashion in the input.</Paragraph>
      <Paragraph position="5"> * These cues can be used to induce properties of the target grammar.</Paragraph>
      <Paragraph position="6"> * These properties represent cues that can be used to induce further cues, and so on.</Paragraph>
      <Paragraph position="7"> The hypothesis is that this snowball effect can reduce the search space of the target grammar incrementally. The main research questions are now, to what extend do different algorithms provide cues for other linguistic levels and what kind of information do they require as supervision in the system, in order to gain the highest accuracy at each linguistic level, and how does the linguistic information of one level contribute to the information on another.</Paragraph>
      <Paragraph position="8"> In the following, the architectural considerations of such a computational model are discussed, resulting in an example implementation that is applied to morphology induction, where morphological properties are understood to represent cues for lexical clustering as well as syntactic structure, and vice versa, similar to the ideas formulated in Dejean (1998), among others.</Paragraph>
    </Section>
    <Section position="2" start_page="9" end_page="12" type="sub_section">
      <SectionTitle>
1.2 Incremental Induction Architecture
</SectionTitle>
      <Paragraph position="0"> The basic architectural principle we presuppose is incrementality, where incrementally utterances are processed. The basic language unit is an utterance, with clear prosodic breaks before and after. The induction algorithm consumes such utterances and breaks them into basic linguistic units, generating for each step hypotheses about  the linguistic structure of each utterance, based on the grammar built so far and statistical properties of the single linguistic units. Here we presuppose a successful segmentation into words, i.e. feeding the system utterances with unambiguous word boundaries. We implemented the following pipeline architecture: The GEN module consumes input and generates hypotheses about its structural descriptions (SD). EVAL consumes a set of SDs and selects the set of best SDs to be added to the knowledge base. The knowledge base is a component that not only stores SDs but also organizes them into optimal representations, here morphology grammars.</Paragraph>
      <Paragraph position="1"> All three modules are modular, containing a set of algorithms that are organized in a specific fashion. Our intention is to provide a general platform that can serve for the evaluation and comparison of different approaches at every level of the induction process. Thus, the system is designed to be more general, applicable to the problem of segmentation, as well as type and grammar induction.</Paragraph>
      <Paragraph position="2"> We assume for the input to consist of an alphabet: a non-empty set A of n symbols {s  In the following, the individual modules for the morphology induction task are described in detail.  For the morphology task GEN is compiled from a set of basically two algorithms. One algorithm is a variant of Alignment Based Learning (ABL), as described in Van Zaanen (2001).</Paragraph>
      <Paragraph position="3"> The basic ideas in ABL go back to concepts of substitutability and/or complementarity, as discussed in Harris (1961). The concept of substitutability generally applies to central part of the induction procedure itself, i.e. substitutable elements (e.g. substrings, words, structures) are assumed to be of the same type (represented e.g. with the same symbol).</Paragraph>
      <Paragraph position="4"> The advantage of ABL for grammar induction is its constraining characteristics with respect to the set of hypotheses about potential structural properties of a given input. While a brute-force method would generate all possible structural representations for the input in a first order explosion and subsequently filter out irrelevant hypotheses, ABL reduces the set of possible SDs from the outset to the ones that are motivated by previous experience/input or a pre-existing grammar.</Paragraph>
      <Paragraph position="5"> Such constraining characteristics make ABL attractive from a cognitive point of view, both because hopefully the computational complexity is reduced on account of the smaller set of potential hypotheses, and also because learning of new items, rules, or structural properties is related to a general learning strategy and previous experience only. The approaches that are based on a brute-force first order explosion of all possible hypotheses with subsequent filtering of relevant or irrelevant structures are both memory-intensive and require more computational effort.</Paragraph>
      <Paragraph position="6"> The algorithm is not supposed to make any assumptions about types of morphemes. There is no expectation, including use of notions like stem, prefix, or suffix. We assume only linear sequences. The properties of single morphemes, being stems or suffixes, should be a side effect of their statistical properties (including their frequency and co-occurrence patterns, as will be explained in the following), and their alignment in the corpus, or rather within words.</Paragraph>
      <Paragraph position="7"> There are no rules about language built-in, such as what a morpheme must contain or how frequent it should be. All of this knowledge is induced statistically.</Paragraph>
      <Paragraph position="8"> In the ABL Hypotheses Generation, a given word in the utterance is checked against morphemes in the grammar. If an existing morpheme LEX aligns with the input word INP, a hypothesis is generated suggesting a morphological boundary at the alignment positions: INP (speaks) + LEX (speak) = HYP [speak, s] Another design criterion for the algorithm is complete language independence. It should be able to identify morphological structures of Indo-European type of languages, as well as agglutinative languages (e.g. Japanese and Turkish) and polysynthetic languages like some Bantu dialects or American Indian languages. In order to guarantee this behavior, we extended the Alignment Based hypothesis generation with a pattern identifier that extracts patterns of character sequences of the types:  1. A -- B -- A 2. A -- B -- A -- B 3. A -- B -- A -- C  This component is realized with cascaded regular expressions that are able to identify and  return the substrings that correspond to the repeating sequences.</Paragraph>
      <Paragraph position="9">  All possible alignments for the existing grammar at the current state, are collected in a hypothesis list and sent to the EVAL component, described in the following. A hypothesis is defined as a tuple: H = &lt;w, f, g&gt;, with w the input word, f its frequency in C, and g a list of substrings that represent a linear list of morphemes in w, g = [</Paragraph>
      <Paragraph position="11"> EVAL is a voting based algorithm that subsumes a set of independent algorithms that judge the list of SDs from the GEN component, using statistical and information theoretic criteria. The specific algorithms are grouped into memory and usability oriented constraints.</Paragraph>
      <Paragraph position="12"> Taken as a whole, the system assumes two (often competing) cognitive considerations. The first of these forms a class of what we term &amp;quot;time-based&amp;quot; constraints on learning. These constraints are concerned with the processing time required of a system to make sense of items in an input stream, whereby &amp;quot;time&amp;quot; is understood to mean the number of steps required to generate or parse SDs rather than the actual temporal duration of the process.</Paragraph>
      <Paragraph position="13"> To that end, they seek to minimize the amount of structure assigned to an utterance, which is to say they prefer to deal with as few rules as possible.</Paragraph>
      <Paragraph position="14"> The second of these cognitive considerations forms a class of &amp;quot;memory-based&amp;quot; constraints. Here, we are talking about constraints that seek to minimize the amount of memory space required to store an utterance by maximizing the efficiency of the storage process. In the specific case of our model, which deals with morphological structure, this means that the memory-based constraints search the input string for regularities (in the form of repeated substrings) that then need only be stored once (as a pointer) rather than each time they are found. In the extreme case, the time-based constraints prefer storing the input &amp;quot;as is&amp;quot;, without any processing at all, where the memory-based constraints prefer a rule for every character, as this would assign maximum structure to the input.</Paragraph>
      <Paragraph position="15"> Parsable information falls out of the tension between these two conflicting constraints, which can then be applied to organize the input into potential syntactic categories. These can then be  This addition might be understood to be a sort of supervision in the system. However, as shown in recent research on human cognitive abilities, and especially on the ability to identify patterns in the speech signal by very young infants (Marcus et al, 1999) shows that we can assume such an ability to be part of the cognitive abilities, maybe not even language specific used to set the parameters for the internal adult parsing system.</Paragraph>
      <Paragraph position="16"> Each algorithm is weighted. In the current implementation these weights are set manually. In future studies we hope to use the weighting for self-supervision.</Paragraph>
      <Paragraph position="17">  Each algorithm assigns a numerical rank to each hypothesis multiplied with the corresponding weight, a real number between 0 and 1.</Paragraph>
      <Paragraph position="18"> On the one hand, our main interest lies in the comparison of the different algorithms and a possible interaction or dependency between them.</Paragraph>
      <Paragraph position="19"> Also, we expect the different algorithms to be of varying importance for different types of languages.</Paragraph>
      <Paragraph position="20"> Mutual Information (MI) For the purpose of this experiment we use a variant of standard Mutual Information (MI), see e.g. MacKay (2003). Information theory tells us that the presence of a given morpheme restricts the possibilities of the occurrence of morphemes to the left and right, thus lowering the amount of bits needed to store its neighbors. Thus we should be able to calculate the amount of bits needed by a morpheme to predict its right and left neighbors respectively. To calculate this, we have designed a variant of mutual information that is concerned with a single direction of information.</Paragraph>
      <Paragraph position="21"> This is calculated in the following way. For every morpheme y that occurs to the right of x we sum the point-wise MI between x and y, but we relativize the point-wise MI by the probability that y follows x, given that x occurs. This then gives us the expectation of the amount of information that x tells us about which morpheme will be to its right. Note that p(&lt;xy&gt;) is the probability of the bigram &lt;xy&gt; occurring and is not equal to p(&lt;yx&gt;) which is the probability of the bigram &lt;yx&gt; occurring.</Paragraph>
      <Paragraph position="22"> We calculate the MI on the right side of x[?]G by:</Paragraph>
      <Paragraph position="24"> One way we use this as a metric, is by summing up the left and right MI for each morpheme in a  One possible way to self-supervise the weights in this architecture is by taking into account the revisions subsequent components make when they optimize the grammar. If rules or hypotheses have to be removed from the grammar due to general optimization constraints on the grammars as such, the weight of the responsible algorithm can be lowered, decreasing its general value in the system on the long run. The relevant evaluations with this approach are not yet finished.</Paragraph>
      <Paragraph position="25">  hypothesis. We then look for the hypothesis that results in the maximal value of this sum. The tendency for this to favor hypotheses with many morphemes is countered by our criterion of favoring hypotheses that have fewer morphemes, discussed later.</Paragraph>
      <Paragraph position="26"> Another way to use the left and right MI is in judging the quality of morpheme boundaries. In a good boundary, the morpheme on the left side should have high right MI and the morpheme on the right should have high left MI. Unfortunately, MI is not reliable in the beginning because of the low frequency of morphemes. However, as the lexicon is extended during the induction procedure, reliable frequencies are bootstrapping this segmentation evaluation.</Paragraph>
    </Section>
    <Section position="3" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
Minimum Description Length (DL)
</SectionTitle>
      <Paragraph position="0"> The principle of Minimum Description Length (MDL), as used in recent work on grammar induction and unsupervised language acquisition, e.g. Goldsmith (2001) and De Marcken (1996), explains the grammar induction process as an iterative minimization procedure of the grammar size, where the smaller grammar corresponds to the best grammar for the given data/corpus.</Paragraph>
      <Paragraph position="1"> The description length metric, as we use it here, tells us how many bits of information would be required to store a word given a hypothesis of the morpheme boundaries, using the so far generated grammar. For each morpheme in the hypothesis that doesn't occur in the grammar we need to store the string representing the morpheme. For morphemes that do occur in our grammar we just need to store a pointer to that morphemes entry in the grammar. We use a simplified calculation, taken from Goldsmith (2001), of the cost of storing a string that takes the number of bits of information required to store a letter of the alphabet and multiply it by the length of the string. lg(len(alphabet))* len(morpheme) We have two different methods of calculating the cost of the pointer. The first assigns a variable the cost based on the frequency of the morpheme that it is pointing to. So first we calculate the frequency rank of the morpheme being pointed to, (e.g. the most frequent has rank 1, the second rank 2, etc.). We then calculate: floor(lg( freq_ rank)[?]1) to get a number of bits similar to the way Morse code assigns lengths to various letters.</Paragraph>
      <Paragraph position="2"> The second is simpler and only calculates the entropy of the grammar of morphemes and uses this as the cost of all pointers to the grammar. The entropy equation is as follows:</Paragraph>
      <Paragraph position="4"> The second equation doesn't give variable pointer lengths, but it is preferred since it doesn't carry the heavy computational burden of calculating the frequency rank.</Paragraph>
      <Paragraph position="5"> We calculate the description length for each GEN hypothesis only,  by summing up the cost of each morpheme in the hypothesis. Those with low description lengths are favored.</Paragraph>
      <Paragraph position="6"> Relative Entropy (RE) We are using RE as a measure for the cost of adding a hypothesis to the existing grammar. We look for hypotheses that when added to the grammar will result in a low divergence from the original grammar.</Paragraph>
      <Paragraph position="7"> We calculate RE as a variant of the Kullback- null takes this into account by calculating the costs for such a new element x to be the point-wise entropy of this element in P(X), summing up over all new elements:</Paragraph>
      <Paragraph position="9"> These two sums then form the RE between the original grammar and the new grammar with the addition of the hypothesis. Hypotheses with low RE are favored.</Paragraph>
      <Paragraph position="10"> This metric behaves similarly to description length, that is discussed above, in that both are calculating the distance between our original grammar and the grammar with the inclusion of the new hypothesis. The primary difference is RE also takes into account how the pmf differs in the two grammars and that our variation punishes new morphemes based upon their frequency relative to the frequency of other morphemes. Our implementation of MDL does not consider frequency in this way, which is why we are including RE as an independent metric.</Paragraph>
    </Section>
    <Section position="4" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
Further Metrics
</SectionTitle>
      <Paragraph position="0"> In addition to the mentioned metric, we take into account the following criteria: a. Frequency of  We do not calculate the sizes of the grammars with and without the given hypothesis, just the amount each given hypothesis would add to the grammar, favoring the least increase of total grammar size.</Paragraph>
      <Paragraph position="1">  morpheme boundaries; b. Number of morpheme boundaries; c. Length of morphemes.</Paragraph>
      <Paragraph position="2"> The frequency of morpheme boundaries is given by the number of hypotheses that contain this boundary. The basic intuition is that the higher this number is, i.e. the more alignments are found at a certain position within a word, the more likely this position represents a morpheme boundary. We favor hypotheses with high values for this criterion.</Paragraph>
      <Paragraph position="3"> The number of morpheme boundaries indicates how many morphemes the word was split into. To prevent the algorithm from degenerating into the state where each letter is identified as a morpheme, we favor hypotheses with low number of morpheme boundaries.</Paragraph>
      <Paragraph position="4"> The length of the morphemes is also taken into account. We favor hypotheses with long morphemes to prevent the same degenerate state as the above criterion.</Paragraph>
      <Paragraph position="5">  The acquired lexicon is stored in a hypothesis space which keeps track of the words from the input and the corresponding hypotheses. The hypothesis space is defined as a list of hypotheses:  Further, each morpheme that occurred in the SDs of words in the hypothesis space is kept with its frequency information, as well as bigrams that consist of morpheme pairs in the SDs and their frequency.</Paragraph>
      <Paragraph position="6">  Similar to the specification of signatures in Goldsmith (2001), we list every morpheme with the set of morphemes it co-occurs. Signatures are lists of morphemes. Grammar construction is performed by replacement of morphemes with a symbol, if they have equal signatures.</Paragraph>
      <Paragraph position="7"> The hypothesis space is virtually divided into two sections, long term and short term storage. Long term storage is not revised further, in the current version of the algorithm. The short term storage is cyclically cleaned up by eliminating the signatures with a low likelihood, given the long term storage.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>