<?xml version="1.0" standalone="yes"?>
<Paper uid="W94-0108">
  <Title>Formalizing triggers: A learning model for finite</Title>
  <Section position="2" start_page="0" end_page="67" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We report on the development of a robust parsing device which aims to provide a partial explanation for child language acquisition and help in the construction of better natural language processing systems. The backbone of the new approach is the synthesis of statistical and symbolic approaches to natural language.</Paragraph>
    <Paragraph position="1"> Motivation We report on the progress we have made towards developing a robust 'self-constructing' parsing device that uses indirect negative evidence (Kapur, 1992) to set its parameters. Generally, by parameter, we mean any point of variation at which two languages may differ. Thus, the relative placement of all object with respect to the verb, a determiner with respect to a noun, the difference between prepositional and postpositional languages, and the presence of long distance anaphors like Japanese &amp;quot;zibun&amp;quot; and Icelandic &amp;quot;sig&amp;quot; are all parameters. The device would be exposed to an input text consisting of simple unpreprocessed sentences. Oil the basis of this text, the device would induce indirect negative evidence in support of some one parsing device located in the parameter space.</Paragraph>
    <Paragraph position="2"> The development of a self-constructing parsing system would have a number of practical and theoretical benefits. First, such a parsing device would reduce the development costs of new parsers. At the moment, grammars must be developed by hand, a technique which requires a significant investment in money and man-hours.</Paragraph>
    <Paragraph position="3"> If a b~mic parser could be developed automatically, costs would be reduced significantly, even if the parser requires some fine-tuning after the initial automatic learning procedure. Second, a parser capable of self-modification is potentially more robust when confronted with novel or semigrammatical input. This type of parser would haw~ applications in information retrieval as well as language instruction and grammar correction.</Paragraph>
    <Paragraph position="4"> Finally, the development of a parser capable of self-modification would give us considerable insight into the formal properties of complex systems as well as the twin problems of language learnability and language acquisition.</Paragraph>
    <Paragraph position="5"> Given a linguistic parameter space, the problem of locating a target language somewhere in the space on the basis of a text consisting of only grammatical sentences is far from trivial. Clark (1990, 1992) has shown that the complexity of the problem is potentially exponential because the relationship between the points of variation and the actual data can he quite indirect and tangled. Since, given n parameters, there are 2 n possible parsing devices, enumerative search through the space is clearly impossible. Because each datum may be successfully parsed by a number of different parsing devices within the space and because the surface properties of grammatical strings underdetermine the properties of the parsing device which must be fixed by the learning algorithm, standard deductive machine learning techniques are as complex as a brute enumerative search (Clark, 1992, 1994). In order to solve this problem, robust techniques which can rapidly eliminate inferior hypotheses must be developed.</Paragraph>
    <Paragraph position="6"> We propose a learning procedure which unites symbolic computation with statistical tools. Historically, symbolic techniques have proven to be a versatile tool in natural language processing.</Paragraph>
    <Paragraph position="7"> These techniques have the disadvantage of being both brittle (easily broken by new input or by user error) and costly (as grammars are extended to handle new constructions, development becomes more difficult due to the complexity of rule interactions within the grammar). Statistical techniques have the advantage of robustness, although the resulting grammars may lack the intuitive clarity found in symbolic systems. We propose to fuse the symbolic and the statistical techniques, a development which we view as inevitable; the resulting system will use statistical  learning techniques to output a symbolic parsing device. We view this development to provide a nice middle ground between the problems of over-training versus undertraining. That is, statistical approaches to learning often tend to overfit the training set of data. Symbolic approaches, on the other hand, tend to behave as though they were undertrained (breaking down on novel input) since the grammar tends to be compact. Combining statistical techniques with symbolic parsing would give the advantage of obtaining relatively compact descriptions (symbolic processing) with robustness (statistical learning) that is not overtuned to the training set.</Paragraph>
    <Section position="1" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
Preliminaries
</SectionTitle>
      <Paragraph position="0"> Naturally, a necessary preliminary for our work is to specify a set of parameters which will serve as a testing ground for the learning algorithm.</Paragraph>
      <Paragraph position="1"> This set of parameters must be embedded in a parsing system so that the learning algorithm can be tested against data sets that approximate the kind of input that parsing devices are likely to encounter in real world applications.In this section, we first list some parameters that gives some idea of the kinds of variations between languages that our system is hoped to be capable of handling.</Paragraph>
      <Paragraph position="2"> We then illustrate why parameter setting is difficult by standard methods. This provides some additional explanation for the failure so far in developing a truly universal parameterized parser.</Paragraph>
    </Section>
    <Section position="2" start_page="60" end_page="62" type="sub_section">
      <SectionTitle>
Linguistic Parameters
</SectionTitle>
      <Paragraph position="0"> Our goal will be to first develop a prototype. We do not require that the prototype accept any arbitrarily selected language nor that the coverage of the prototype parser be complete in any given language. Instead, we will develop a prototype with coverage that extends to some basic structures that any language learning device must account for, plus some structures that have proven difficult for various learning theories. In particular, given an already existing parser, we will extend its coverage by parameterizing it, as described below.</Paragraph>
      <Paragraph position="1"> Our initial set of parameters will include the following other points of variation:  1. Relative order of specifiers and heads:  This parameter covers the placement of determiners relative to nouns, relative position of the subject and the placement of certain VPmodifying adverbs.</Paragraph>
      <Paragraph position="2"> Relative order of heads and complements: This parameter deals with the position of objects relative to the verb (VO or OV orders), placement of nominal and adjectival complements as well as the choice between prepositions and postpositions.</Paragraph>
      <Paragraph position="3"> .</Paragraph>
      <Paragraph position="4">  3. Scrambling: Some language.~ allow (rein.</Paragraph>
      <Paragraph position="5"> tively) free word order. For examph', Germall  has rules for displacing definite N Ps and clan.yes out of their canonical positions. Japanese allows relatively free ordering of NPs and post-positional phrases so long as the verbal ~'omplex remains clause final. Other languages allow even freer word orders. We will focus on German and Japanese scrambling, bearing ill mind that the model should be extendible to other types of scrambling.</Paragraph>
      <Paragraph position="6"> 4. Relative placement of negative markers and verbs: Languages vary as to where they place negative markers like English not. English places its negative marker after the first tensed auxiliary, thus forcing do insertion when there is no other auxiliary, while Italian places negation after the tensed verb. French uses discontinuous elements like ne...pas.., or ne...plus.., which are wrapped around the tensed verb or which occur as continuous elements in infinitivals. Italian differs from both English and French in placing its negative marker before the first verb, whether tensed or infinitive. The proper treatment of negation will require several parameters, given the range of variation.</Paragraph>
      <Paragraph position="7"> 5. Root word order changes: In general, languages allow for certain word order changes in root clauses but not in embedded clauses. An example of a root word order change is subject-auxiliary inversion in English which occurs ill root questions (Did John leave? vs. *I wonder did John leave?). Another example Would be inversion of the subject clitic with the tensed verb in French ( Quelle pomme a-t-il mangle \[&amp;quot;which apple did he eat?&amp;quot;\]) and process of subject postposition and PP preposition in English ( A man walked into the room vs. Into the room walked a man).</Paragraph>
      <Paragraph position="8"> 6. Rightward dislocation: This includes extraposition structures in English ( That John is late amazes me. vs. It amazes me that John is late.), presentational there structures (A man was in the park. vs. There was a man in the park.), and stylistic inversion in French (Quelle piste Marie a-t-elle choisie? \[&amp;quot;What path has Marie chosen?&amp;quot;\]). Each of these constructions present unique problems so that the entire data set is best handled by a system of interacting parameters.</Paragraph>
      <Paragraph position="9"> 7. Wh-movement versus wh-in situ: Languages vary in the way they encode whquestions. English obligatorily places one and only one wh-phrase (for example, who or which picture) in first position. In French the wh-phrase may remain in place (in silu) although it may also form wh-questions as in English.</Paragraph>
      <Paragraph position="10"> Polish allows wh-phrases to be stacked at the beginning of the question.</Paragraph>
      <Paragraph position="11"> 8. Exceptional Case Marking, Structural Case Marking: These parameters have little obvious effect on word order, but involve the treatment of infinitival complements. Thus, exceptional case marking and structural case marking allow for the generation of the order V\[+t~.,d NP VPl-ten,e\], where &amp;quot;V\[+tense\]&amp;quot; is a tensed verb and &amp;quot;VPl-tense\]&amp;quot; is a VP headed by a verb in the infinitive. Both parameters involve the semantic relations between the NP and the infinitival VP as well as the treatment of case marking. These relations are reflected in constituent structure rather than word order and thus pose an interesting problem for the learning algorithm.</Paragraph>
      <Paragraph position="12"> 9. Raising and control: In the case of raising verbs and control verbs, the learner must correctly categorize verbs which occur in the same syntactic frame into two distinct groups based on scmantic relations as reflected in the distribution of elements (for example, idiom chunks) around the verbs.</Paragraph>
      <Paragraph position="13"> 10. Long and short distance anaphora: Short distance anaphors, like &amp;quot;himself&amp;quot; in English must be related to a coreferential NP within a constrained local domain. Long distance anaphors (Japanese &amp;quot;zibun&amp;quot;, Korean &amp;quot;caki&amp;quot;) must also be related to a coreferential NP, but tiffs N P need not be contained within the same type of local domain as in the short distance case.</Paragraph>
      <Paragraph position="14"> The above sampling of parameters has the virtue of being both small (and, therefore, possible to implement relatively quickly) and posing interesting learnability problems which will appropriately test our learning algorithm. Although the above list can be described succinctly, the set of possible targets will be large and a simple enumerative search through the possible targets will not be efficient.</Paragraph>
      <Paragraph position="15"> Complexities of Parameter Setting Theories based on the principles and parameters (POP) paradigm hypothesize that languages share a central core of universal properties and that language variation can be accounted for by appeal to a finite number of points of variation, the so-called parameters. The parameters themselves may take on only a finite number of possibh, values, prespecified by Universal Grammar. A fully spooled I'~:~i' theory would account for I;mguagc acquisition by hypothesizing that the h'aruer sets parameters to the appropriate values by monitoring the input stream for &amp;quot;triggering data&amp;quot;; triggers are sentences which cause the  learner to set a particular parameter to a particular value. For example, the imperative in (1) is a trigger for the order &amp;quot;V(erb) O(bject)&amp;quot;:  (1) Kiss grandma.</Paragraph>
      <Paragraph position="16">  under the hypothesis that the learner analyzes grandma as the patient of kissing and is predisposed to treat patients as structural objects. Notice that trigger-based parameter setting presupposes that, for each parameter p and each value v, the learner can identify the appropriate trigger in the input stream. This is the problem of trigger detection. That is, given a particular input item, the learner must be able to recognize whether or not it is a trigger and, if so, what parameter and value it is a trigger for. Similarly, the learner must be able to recognize that a particular input datum is not a trigger for a certain parameter even though it may share many properties with a trigger. In order to make the discussion more concrete, consider the following example: (2) a. John: thinks that Mary likes ~aim i.</Paragraph>
      <Paragraph position="17"> b. *John thinks that Maryj likes herj.</Paragraph>
      <Paragraph position="18"> English allows pronouns to be coreferent with a c-commanding nominal just in case that nominal is not contained within the same local syntactic domain as the pronoun; this is a universal prop-erty of pronouns and would seem to present little problem to the learner.</Paragraph>
      <Paragraph position="19"> Notice, however, that some languages, including Chinese, Icelandic, Japanese andKorean, allow for long distance anaphors. These are elements which are obligatorily coreferent with another nominal in the sentence, but which may be separated from that nominal by several clause boundaries. Thus, the following example from Icelandic is grammatical even though the anaphor sig is separated from its antecedent JSn by a  Thus, UG includes a parameter which allows some languages to have long distance anaphors and which, perhaps, fixes certain other properties of this class of anaphora.</Paragraph>
      <Paragraph position="20"> Notice that the example in (3) is of the same structure as the pronominal example in (2a). A learner whose target is English must not take examples like (2a) as a trigger for the long distance anaphor parameter; what prevents the learner from being deceived? Why doesn't the learner conclude that English him is comparable to Icelandic sig? We would argue that the learner is sensitive to distributional evidence. For example, the learner is aware of examples like (4): (4) John i likes himj.</Paragraph>
      <Paragraph position="21"> where the pronoun is not coreferential with anything else in the sentence. The existence of (4) implies that him cannot be a pure anaphor, long distance or otherwise. Once the learner is aware of this distributional property of him, he or she can correctly rule out (2a) as a potential trigger for the long distance anaphor parameter.</Paragraph>
      <Paragraph position="22"> Distributional evidence, then, is crucial for parameter setting; no theory of parameter setting can avoid statistical properties of the input text. How far can we push the statistical component of parameter setting? In this paper, we suggest that statistically-based algorithms can be exploited to set parameters involving phenomena as diverse as word order, particularly verb second constructions, and cliticization, the difference between free pronouns and proclitics. The work reported here can be viewed as providing the basis for a theory of trigger detection; it seeks to establish a theory of the connection between the raw input text and the process of parameter setting.</Paragraph>
    </Section>
    <Section position="3" start_page="62" end_page="63" type="sub_section">
      <SectionTitle>
Parameter Setting Proposal
</SectionTitle>
      <Paragraph position="0"> Let us suppose that there are n binary parameters each of which can take one of two values ('+' or '-') in a particular natural language. The core of a natural language is uniquely defined once all the n parameters have been assigned a value) Consider a random division of the parameters into some m groups. Let us call these groups P1, P~,..., Pro. The Parameter Setting Machine first goes about setting all the parameters within the first group Px concurrently as sketched below.</Paragraph>
      <Paragraph position="1"> After these parameters have been fixed, the machine next tries to set the parameters in group P2 in a similar fashion, and so on.</Paragraph>
      <Paragraph position="2"> a Parameters can be looked at as fixed points of variation among languages, From a computational point of view, two different values of a parameter may simply correspond to two different bits of code in the parser. We are not committed to any particular scheme for the translation from a tuple of parameter values to the corresponding language. However, the sorts of parameters we consider have been listed in the previous section.</Paragraph>
      <Paragraph position="3">  1. All parameters are unset initially, i.t,., l.h,,r,, arc no preset values. The parser' is organized to only obey all the universal principles. At. this stage, utterances from any possible natural language are accommodated with equal ea.s,~, but no sophisticated structure can be built.</Paragraph>
      <Paragraph position="4"> 2. Both the values of each of the parameters pl E P1 are 'competing' to establish themselves.</Paragraph>
      <Paragraph position="5"> 3. Corresponding to Pi, a pair of hypotheses are generated, say H~. and Hi_.</Paragraph>
      <Paragraph position="6"> 4. Next, these hypotheses are tested on the basis of input evidence.</Paragraph>
      <Paragraph position="7"> 5. If H~. fails or H~. succeeds, set Pi'S value to '+'.  Otherwise, set pi's value to '-'.</Paragraph>
      <Paragraph position="8"> Formal Analysis of the Parameter Setting Machine We next consider a particular instantiation of the hypotheses and their testing. The way wc hart, in mind involves constructing suitable window-sizes during which the algorithm is sensitive to occurrence as well as non-occurrence of specific phenomena. Regular failure of a particular phenomenon to occur in a suitable window is one natural, robust kind of indirect negative evidence. For example, the pair of hypotheses may be  1. Hypothesis H~: Expect not to observe phenomena from a fixed set Oi of phenomena which support the parameter value '-'.</Paragraph>
      <Paragraph position="9"> 2. Hypothesis H~_: Expect not to observe phenomena from a fixed set O~. of phenomena which support the parameter value '+'.</Paragraph>
      <Paragraph position="10"> Let wi and ki be two small numbers. Testing the hypothesis H~ involves the following procedure: null 1. A window of size wi sentences is constructed and a record is maintained whether or not a phenomenon from within the set O~_ occurred among those wi sentences.</Paragraph>
      <Paragraph position="11"> 2. This construction of the window is repeated ki different times and a tally ci is made of the fraction of times the phenomena occurred at least once in the duration of the window.</Paragraph>
      <Paragraph position="12"> 3. The hypothesis H+ succeeds if and only if the  ratio of ci to kl is less than 0.5.</Paragraph>
      <Paragraph position="13"> Note that the phenomena under scrutiny are assumed to be such that the parser is always capable of analyzing (to whatever extent necessary) the input. This is because in our view the parser consists of a fixed, core program whose behavior can be modified by selecting from among a finite set of 'flags' (the parameters). Therefore, even if not all of the flags have been set to the correct values, the parser is such that it can at least partially represent the input. Thus, the parser is * always capable of analyzing the input. Also, there is no need to explicitly store any input evidence. Saitable window-sizes can be constructed during which the algorithm is sensitive to occurrence as well as non-occurrence of specific phenomena. By using windows, just the relevant bit of information from the input is extracted and maintained. (For detailed argumentation that this is a reasonable theoretica! argument, see Kaput (1992, 1993).) Notice also that we have only sketched and analyzed a particular, simple version of our algorithm. In general, a whole range of window-sizes may be used and this may be governed by the degree to which the different hypotheses have earned corroboration. (For some ideas along this direction in a more general setting, see Kaput (199l, 1992).) Order in which parameters get set Notice that in our approach certain parameters get set quicker than others. These are the ones that are expressed very frequently. It is possible that these parameters also make the information extraction more efficient quicker, for exampie, by enabling structure building so that other parameters can be set. If our proposal is right, then, for example, the word order parameters which are presumably the very first ones to be set must be set based on a very primitive parser capable of handling any natural language. At this early stage, it may be that word and utterance boundaries cannot be reliably recognized and the lexicon is quite rudimentary. Furthermore, the only accessible property in the input stream may be the linear word order. Another particular difficulty with setting word-order parameters is that the surface order of constituents in the input does not necessarily reflect the underlying word-order. For example, even though Dutch and German are SOV languages, there is a preponderance of SVO forms in the input due to the V2 (verb-second) phenomenon. The finite verb in root clauses moves to the second position and then the first position can be occupied by the subject, objects (direct or indirect), adverbials or prepositional phrases. As we shall see, it is import;rot to note that if the subject is not in the first position in a V2 language, it is most likely in the first position to the right of the verb. Finally, it has been shown by Gibson and Wexler (1992) that the parameter space created by the head-direction parameters along with the V2 parameter has local maxima, thai. is, incorrect parameter settings front which the learner can never escape.</Paragraph>
    </Section>
    <Section position="4" start_page="63" end_page="67" type="sub_section">
      <SectionTitle>
Computational Analysis of the
Parameter Setting Machine
</SectionTitle>
      <Paragraph position="0"> V2 parameter In this section, we summarize results we have obtained which show that word or- null der parameters can plausibly be set in our model. 2 The key concept we use is that of entropy, an information-theoretic statistical measure of randomness of a random variable. The entropy H(X) of a random variable X, measured in bits, is - ~x p(z)logp(z). To give a concrete example, the outcome of a fair coin has an entropy of -(.5 * log(.5) + .5 * log(.5)) = 1 bit. If the coin is not fair and has .9 chance of heads and. 1 chance of tails, then the entropy is around .5 bits. There is less uncertainty with the unfair coin--it is most likely going to turn up heads. Entropy can also be thought of as the number of bits on the average required to describe a random variable. Entropy of one variable, say X, conditioned on another, say Y, denoted as H(X\]Y) is a measure of how much better the first variable can be predicted when the value of the other variable is known.</Paragraph>
      <Paragraph position="1"> Descriptively, verb second (V2) languages place the tensed verb in a position that immediately follows the first constituent of the sentence. For example, German is V2 in root clauses, as shown in (refex:v2-root), but not in embedded clauses,  as shown in (telex:embedding): 3 (5) a. Hans hat Maria H. has M. getroffen.</Paragraph>
      <Paragraph position="2"> met &amp;quot;Hans has met Maria.&amp;quot; b. Hans wird Maria H. will M. getroffen haben.</Paragraph>
      <Paragraph position="3"> met has &amp;quot;Hans will have met Maria.&amp;quot; (o) a. well Hans Maria  because H. M. getroffen, hat.</Paragraph>
      <Paragraph position="4"> met has &amp;quot;Hans has met Maria.&amp;quot; b. well Hans Maria because H. M. getroffen haben wird.</Paragraph>
      <Paragraph position="5"> met has will &amp;quot;because Hans will have met Maria.&amp;quot; In the examples in (5), a constituent, XP, has 2Preliminary results obtained with Eric Brill were presented at the 1993 Georgetown Roundtable on Language and Linguistics: Pre-session on Corpus-based Linguistics.</Paragraph>
      <Paragraph position="6"> 3See the papers collected in Haider &amp; Prinzhorn (1985) for a genera\] discussion of V2 constructions. been moved into the Specifier position of CP, triggering movement of the finite verb to C o . This results in the structure shown in (7). Notice that the constituent X P can be of any category, may be extracted from an embedded clause ormay be an adverbial; thus, the XP need not be related to the finite verb via selectional restrictions or subcategorization: null (7) \[CP XPi \[C O Vj\] ... ti... tj\] where Vj is a finite verb.</Paragraph>
      <Paragraph position="7"> The V2 parameter (or set of parameters) would regulate the movement of a constituent to the Specifier of CP, forcing movement of the finite verb to C O as well as determining whether the V2 structures are restricted to the root clause or may occur in embedded clauses.</Paragraph>
      <Paragraph position="8"> We considered the possibility that by investigating the behavior of the entropy of positions in the neighborhood of verbs in a language, word order characteristics of that language may be discovered. 4 For a V2 language, we expect that there will be more entropy to the left of the verb than to its right, i.e., the position to the \[eft will be less predictable than the one to the right. This is because the first position need not be related to the verb in any systematic way while the position following the verb will be drawn from a more restricted class of elements (it will either be the subject or an element internal to the VP); hence, there is more uncertainty (higher entropy) about the first position than about the position following the verb. We first show that using a simple distributional analysis technique based on the five verbs the algorithm is assumed to know, another fifteen words most of which turn out to be verbs can readily be obtained.</Paragraph>
      <Paragraph position="9"> Consider text as generating tuples of the form (v,d,w), where v is one of the top twenty words (most of which are verbs), d is either the position to the left of the verb or to the right, and w is the word at that position. ~ V, D and W are the corresponding random variables.</Paragraph>
      <Paragraph position="10"> The procedure for setting the V2 parameter is 4In the competition model for language acquisition (MacWhinney, 1987), the child considers cues to determine properties of the language but while these cues are reinforced in a statistical sense, the cues themselves axe not information-theoretic in the way that ours are. In some redent discussion of triggering, Niyogi and Berwick (1993) formalize parameter setting as a Maxkov process. Crucially, there again the statistical assumption, on the input is merely used to ensure that convergence is likely, and triggers are simple sentences.</Paragraph>
      <Paragraph position="11"> SWe thank Steve Abney for suggesting this formulation to us.</Paragraph>
      <Paragraph position="12">  On each of the 9 languages on which it has been possible to test our algorithm, the correct result was obtained. (Only the last three languages in the table are V2 languages.) Furthermore, in almost all cases, it was also shown to be statistically significant. The amount (only 3000 utterances) and the quality of the input (unstructured unannotated input caretaker speech subcorpus from the CHILDES database (MacWhinney, 1991)), and the computational resources needed for parameter setting to succeed are psychologically plausible. Further tests were successfully conducted in order to establish both the robustness and the simplicity of this learning algorittun. It is also clear that once the value of the V2 parameter has been correctly set, the input is far more revealing with regard to other word order parameters and they too can be set using similar techniques.</Paragraph>
      <Paragraph position="13"> In order to make clear how this procedure lits into our general parameter setting proposal, we spell out what the hypotheses are. In the case of the V2 parameter, the two hypotheses are not separately necessary since one hypothesis is the exact complement of the other. So the hypothesis H+ may be as shown.</Paragraph>
      <Paragraph position="14"> Hypothesis H+: Expect not to observe that the entropy to the left of the verbs is lower than that to the right.</Paragraph>
      <Paragraph position="15"> The window size that may be used could be around 300 utterances and the nmnber of repetitions need to be around 10. Our previous results provide empirical support that this should suflh:e. By assuming that besides knowing a fcw verbs, as before, the algorithm also recognizes some of the first and second person pronouns of the language, we can not only detcrmine aspects uf thu pronoun system (see below) but also get information about the V2 parameter. The first step of learning is same as above; that is, the learner acquires additional verbs based on distributional analysis. We expect that in the V2 languages (Dutch and German), the pronouns will appear more often immediately to the right of the verb than to the left. For French, English and Italian exactly the reverse is predicted. Our results (2 to 1 or better ratio in the predicted direction) confirm these predictions: Clitic pronouns We now show that our techniques can lead to straightforward identification and classification of clitic pronouns7 Briefly, clitic pronouns are phonologically reduced elements which obligatorily attach to another ele,,,ent. Syntactic clitics have a number of syntactic consequences including special word order propcrties and an inability to participate in conjunct.ions and disjunctions. For example, in French,, fldl direct objects occur after the lexical verb but accusative clitics appear before the verb: (s) a. Jean a vu les J. has seen the filles.</Paragraph>
      <Paragraph position="16"> girls &amp;quot;Jean saw the girls.&amp;quot; b. Jean les a rues.</Paragraph>
      <Paragraph position="17"> J. clitic has seen &amp;quot;Jean saw them.&amp;quot; Restricting our attention, for the moment to French, we should note that clitic pronouns may occur in sequences, in which case there are a number of restrictions on their relative order. Thus, nominative clitics (eg., &amp;quot;je&amp;quot;, &amp;quot;tu&amp;quot;, &amp;quot;il&amp;quot;, etc.) occur first, followed by the negative element &amp;quot;ne&amp;quot;, fi)llowed by accusative clitics (eg., &amp;quot;la&amp;quot;, &amp;quot;me&amp;quot;, &amp;quot;re&amp;quot;) and dative clitics (&amp;quot;lui&amp;quot;), followed, at last, I)y the first element of the verbal sequence (an auxiliary or the main verb). There are further ordering constraints within the accusative and dative elites based on the person of the clitic; see Perlmutter (1971) for an exhaustive description of clitic pronouns in French.</Paragraph>
      <Paragraph position="18"> In order to correctly set the parameters governing the syntax of pronominals, the learner must distinguish clitic pronouns from free and weak pronouns as well as sort all pronoun systems according to their proper case system (e.g., nominatiw' pronouns, accusal.iw, pronouns). Furtherr Wc also vcrilicd that tile object clitics in French were not primarily responsible for the correct result. 7preliminary results were presented at the Berne workshop on L|- and \[,2-acquisition of clause-internal rules: scrambling and cliticization in January, 1994. more, the learner must have some reliable method for identifying the presence of clitic pronouns in the input stream. The above considerations suggest that free pronouns occur in a wider range of syntactic environments than clitic pronouns and, so, should carry less information about the syntactic nature of the positions that surround them. Clitic pronouns, on the other hand, occur in a limited number of environments and, hence, carry more information about the surrounding positions. Furthermore, since there are systematic constraints on the relative ordering of clitics, we would expect them to fall into distribution classes depending on the information they carry about the positions that surround them. The algorithm we report, which is also based on the observation of entropies of positions in the neighborhood of pronouns, not only distinguishes accurately between clitic and free-standing pronouns, but also successfully sorts clitic pronouns into linguistically natural classes.</Paragraph>
      <Paragraph position="19"> It is assumed that the learner knows a set of first and second person pronouns. The learning algorithm computes the entropy profile for three positions to the left and right of the pronouns (H(W\]P = p) for the six different positions), where ps are the individual pronouns. These profiles are then compared and those pronouns which have similar profiles are clustered together. Interestingly, it turns out that the clusters are syntactically appropriate categories.</Paragraph>
      <Paragraph position="20"> In French, for example, based on the Pearson correlation coefficients we could deduce that the object clitics &amp;quot;me&amp;quot; and &amp;quot;te&amp;quot;, the subject clitics &amp;quot;je&amp;quot; and &amp;quot;tu&amp;quot;, the non-clitics &amp;quot;moi&amp;quot; and &amp;quot;toi&amp;quot;, and the ambiguous pronouns &amp;quot;nous&amp;quot; and &amp;quot;vons&amp;quot; are most closely related only to the other element in their own class.</Paragraph>
      <Paragraph position="21">  In fact, the entropy signature for the ambiguous pronouns can be analyzed as a mathematical combination of the signatures for the conflated forms. To distinguish clitics from non-clitics, we use the measure of stickiness (proportion of times they are sticking to the verbs compared to the times they are two or three positions away). These resuits are quite good. The stickiness is as high as 54-55% for the subject clitics; non-clitics have stickiness no more than 17%.</Paragraph>
      <Paragraph position="22"> The Dutch clitic system is far more complicated than the French pronoun system. (See for example, Zwart (1993).) Even so, our entropy calculations made some headway towards classifying the pronouns. We are able to distinguish the weak and strong subject pronouns. Since even the strong subject pronouns in Dutch tend to stick to their verbs very closely and two clitics can come next to each other, the raw stickiness measure seems to be inappropriate. Although the Dutch case is problematic due to the effects of V2 and scrambling, we are in the process of treating these phenomena and anticipate that the pronoun calculations in Dutch will sort out properly once the influence of these other word order processes are factored in appropriately.</Paragraph>
      <Paragraph position="23"> Conclusions It needs to be emphasized that in our statistical procedure there is a mechanism available to the learning mechanism by which it can determine when it has seen enough input to reliably determine the value of a certain parameter. (Such means are non-existent in any trigger-based error-driven learning theory.) In principle at least, the learning mechanism can determine the variance in the quantity of interest as a function of the text size and then know when enough text has been seen to be sure that a certain parameter has to be set in a particular way.</Paragraph>
      <Paragraph position="24"> We are currently extending the results we have obtained to other parameters and other languages. We are convinced that the word order parameters (for example, those in (1-2) in the section Preliminaries) should be fairly easy to set and amenable to an information-theoretic analysis along the lines sketched earlier. Scrambling also provides a case where calculations of entropy should provide an immediate solution to the parameter-setting problem. Notice however that both scrambling and V2 interact in an interesting way with the basic word order parameters; a learner may be potentially misled by both scrambling and V2 into mis-setting the basic word order parameters since both parameters can alter the relationship between heads, their complements and their specifiers.</Paragraph>
      <Paragraph position="25"> Parameters involving adverb placement, extraposition and wh-movement should be relatively more challenging to the learning algorithm given the relatively low frequency with which adverbs are found in adult speech to children. These cases provide good examples which motivate the use of multiple trials by the learner. The interaction between adverb placement and head move- null meat, then, will pose an interesting problem for the learner since the two parameters are interdependent; what the learner assumes about adverb placement is contingent on what it assumes about head placement and vice versa.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>