<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0502">
  <Title>Simulating Language Change in the Presence of Non-Idealized Syntax</Title>
  <Section position="3" start_page="11" end_page="12" type="metho">
    <SectionTitle>
2 Linguistic specifics of the simulation
</SectionTitle>
    <Paragraph position="0"> The change of interest is the loss of V2 in Middle English and Old French, in particular why V2 was unstable in these languages but not in others. Therefore, the idealized grammars allowed in this simulation will be limited to four: All have underlying subject-verb-object word order, and allow sentence-initial adjuncts. The options are V2 or not, and pro-drop or not. Thus, a grammar is specified by a pair of binary parameter values. For simplicity, the pro-drop parameter as in Old French is used rather than trying to model the clitic status of Middle English subject pronouns.</Paragraph>
    <Paragraph position="1"> Sentences are limited to a few basic types of declarative statements, following the degree-0 learning hypothesis (Lightfoot, 1999): The sentence may or may not begin with an adjunct, the subject may be either a full noun phrase or a pronoun, and the verb may optionally require an object or a subject.</Paragraph>
    <Paragraph position="2"> A verb, such as rain, that does not require a subject is given an expletive pronoun subject if the grammar is not pro-drop. Additionally, either the adjunct, the subject, or the object may be topicalized. For a V2 grammar, the topicalized constituent appears just before the verb; otherwise it is indicated only by spoken emphasis.</Paragraph>
    <Paragraph position="3"> A fuzzy grammar consists of a pair of beta distributions with parameters a and b, following the convention from (Gelman et al., 2004) that the density of Beta(a,b) is proportional to x^(a−1) (1 − x)^(b−1) for 0 ≤ x ≤ 1.</Paragraph>
    <Paragraph position="5"> Each beta distribution controls one parameter in the idealized grammar. The special case of Beta(1,1) is the uniform distribution, and two such distributions are used as the initial state for the agent's fuzzy grammar. The density for Beta(1 + m,1 + n) is a bump with its peak at m/(m + n) that grows sharper for larger values of m and n. Thus, it incorporates a natural critical period, as each additional data point changes the mean less and less, while still allowing for the variation in adult grammars seen in manuscripts.</Paragraph>
    <Paragraph position="6"> To produce a sentence, an agent with fuzzy grammar (Beta(a1,b1),Beta(a2,b2)) constructs an idealized grammar from a pair of random parameter settings, each 0 or 1, selected as follows. The agent picks a random number Qj ∼ Beta(aj,bj), then sets parameter j to 1 with probability Qj and 0 with probability 1 − Qj. An equivalent and faster operation is to set parameter j to 1 with probability uj and 0 with probability 1 − uj, where uj = aj/(aj + bj) is the mean of Beta(aj,bj).</Paragraph>
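The sampling procedure just described can be sketched in Python (an illustrative sketch, not the paper's code; the fuzzy grammar is represented as a list of (a, b) pairs, and the helper name is hypothetical):

```python
import random

def sample_grammar(fuzzy, rng=random):
    """Draw an idealized grammar, a tuple of 0/1 parameter settings, from a
    fuzzy grammar given as [(a1, b1), (a2, b2)].  Uses the shortcut from the
    text: parameter j is set to 1 with probability u_j = a_j / (a_j + b_j),
    the mean of Beta(a_j, b_j)."""
    return tuple(1 if rng.random() < a / (a + b) else 0 for a, b in fuzzy)

# A newborn agent has both parameters uniform, so each setting is a coin flip.
newborn = [(1, 1), (1, 1)]
grammar = sample_grammar(newborn)  # e.g. (1, 0): V2, not pro-drop
```

Drawing Qj explicitly and then flipping a Qj-weighted coin yields the same distribution over settings as the mean shortcut used here.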
    <Paragraph position="7"> To learn from a sentence, an agent first constructs a random idealized grammar as before. If the grammar can parse the sentence, then some of the agent's beta distributions are adjusted to increase the probability that the successful grammar is selected again. If the grammar cannot parse the sentence, then no adjustment is made. To adjust Beta(a,b) to favor 1, the agent increments the first parameter, yielding Beta(a + 1,b). To adjust it to favor 0, the agent increments the second parameter, yielding Beta(a,b + 1).</Paragraph>
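The adjustment step can be sketched as follows (a minimal sketch with a hypothetical helper name; the fuzzy grammar is a mutable list of (a, b) pairs):

```python
def reinforce(fuzzy, j, setting):
    """Adjust Beta(a_j, b_j) to favor the given setting:
    increment a to favor 1, or increment b to favor 0."""
    a, b = fuzzy[j]
    fuzzy[j] = (a + 1, b) if setting == 1 else (a, b + 1)

fuzzy = [(1, 1), (1, 1)]
reinforce(fuzzy, 0, 1)  # Beta(1,1) -> Beta(2,1): mean rises from 1/2 to 2/3
reinforce(fuzzy, 1, 0)  # Beta(1,1) -> Beta(1,2): mean falls from 1/2 to 1/3
# Note the natural critical period: the same update applied to Beta(10,10)
# would move the mean only from 1/2 to 11/21.
```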
    <Paragraph position="8"> Within this general framework, many variations are possible. For example, the initial state of an agent, the choice of which beta distributions to update for particular sentences, and the social structure (who speaks to whom) may all be varied.</Paragraph>
    <Paragraph position="9">  The simulation in (Briscoe, 2002) also makes use of Bayesian learning, but within an algorithm in which learners switch abruptly from one idealized grammar to another as estimated probabilities cross certain thresholds. The smoother algorithm used here is preferable because children do not switch abruptly between grammars (Yang, 2002). Furthermore, this algorithm allows simulations to include children's highly variable speech. Children learning from each other is thought to be an important force in certain language changes; for example, a recent change in the Icelandic case system, known as dative sickness, is thought to be spreading through this mechanism.</Paragraph>
  </Section>
  <Section position="4" start_page="12" end_page="13" type="metho">
    <SectionTitle>
3 Adaptation for Markov chain analysis
</SectionTitle>
    <Paragraph position="0"> To the learning model outlined so far, we add the following restrictions. The social structure is fixed in a loop: There are n agents, each of which converses with its two neighbors. The parameters aj and bj are restricted to lie between 1 and N. Thus, the population can be in one of N^(4n) possible states, which is large but finite.</Paragraph>
    <Paragraph position="1"> Time is discrete, with each time increment representing a single sentence spoken by some agent to a neighbor. The population is represented by a sequence of states (Xt)t∈Z, updated by a transition function Xt+1 = φ(Xt,Ut) that is fed the current population state plus a tuple of random numbers Ut. One agent is selected uniformly at random to be the hearer. With probability pr, that agent dies and is replaced by a baby in the initial state (Beta(1,1),Beta(1,1)). With probability 1 − pr, the agent survives and hears a sentence spoken by a randomly selected neighbor.</Paragraph>
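One transition of this chain can be sketched as follows (an illustrative sketch; `hear` is a toy stand-in for whichever learning algorithm is in use, not one of the paper's two algorithms):

```python
import random

def newborn():
    # Initial state: both parameters uniform, Beta(1,1).
    return [(1, 1), (1, 1)]

def mean(a, b):
    return a / (a + b)

def hear(hearer, speaker, rng):
    # Toy stand-in for a learning algorithm: the hearer reinforces one
    # randomly chosen parameter toward the setting the speaker sampled.
    j = rng.randrange(2)
    a_s, b_s = speaker[j]
    setting = 1 if rng.random() < mean(a_s, b_s) else 0
    a, b = hearer[j]
    hearer[j] = (a + 1, b) if setting else (a, b + 1)

def step(population, p_r, rng):
    """One transition X_{t+1} = phi(X_t, U_t)."""
    n = len(population)
    i = rng.randrange(n)                    # hearer chosen uniformly
    if rng.random() < p_r:
        population[i] = newborn()           # death and rebirth
    else:
        speaker = population[(i + rng.choice((-1, 1))) % n]  # loop neighbor
        hear(population[i], speaker, rng)
    return population

rng = random.Random(1)
pop = [newborn() for _ in range(10)]
for _ in range(1000):
    step(pop, 0.001, rng)
```

Each step touches at most one agent, which matters later for the monotonicity argument.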
    <Paragraph position="2"> Two variations of the learning process are explored here. The first, called LEARN-ALWAYS, serves as a baseline: The hearer picks an idealized grammar according to its fuzzy grammar, and tries to parse the sentence. If it succeeds, it updates any one beta distribution selected at random in favor of the parameter that led to a successful parse. If the parse fails, no update is made. This algorithm is similar to Naive Parameter Learning with Batch (Yang, 2002, p. 24), but adapted to learn a fuzzy grammar rather than an idealized grammar, and to update the agent's knowledge of only one syntactic parameter at a time.</Paragraph>
    <Paragraph position="3"> The second, called PARAMETER-CRUCIAL, is the same except that the parameter is only updated if it is crucial to the parse: The agent tries to parse the sentence with that parameter in the other setting. If the second parse succeeds, then the parameter is not considered crucial and is left unchanged; if it fails, then the parameter is crucial and the original setting is reinforced. This algorithm builds on LEARN-ALWAYS by restricting learning to sentences that are more or less unambiguous cues for the speaker's setting of one of the syntactic parameters. The theory of cue-based learning assumes that children incorporate particular features into their grammar upon hearing specific sentences that unambiguously require them. This process is thought to be a significant factor in language change (Lightfoot, 1999) because it provides a feedback mechanism: Once a parameter setting begins to decline, cues for it become less frequent in the population, resulting in further decline in the next generation. A difficulty with the theory of cue-based learning is that it is unclear what exactly "unambiguous" should mean, because realistic language models generally have cases where no single sentence type is unique to a particular grammar or parameter setting (Yang, 2002, pp. 34, 39). The definition of a crucial parameter preserves the spirit of cue-based learning while avoiding the potential difficulties inherent in the concept of "unambiguous."
These modifications result in a finite-state Markov chain with several useful properties. It is irreducible, which means that there is a strictly positive probability of eventually getting from any initial state to any other target state.
To see this, observe that there is a tiny but strictly positive probability that in the next several transitions, all the agents will die and the following sentence exchanges will happen just right to bring the population to the target state. This Markov chain is also aperiodic, which means that at any time t far enough into the future, there is a strictly positive probability that the chain will have returned to its original state. Aperiodicity is a consequence of irreducibility and the fact that there is a strictly positive probability that the chain does not change states from one time step to the next, as happens, for example, when a hearer fails to parse a sentence. An irreducible aperiodic Markov chain always has a stationary distribution. This is a probability distribution on its states, normally denoted π, such that the probability that Xt = x converges to π(x) as t → ∞ no matter what the initial state X0 is. Furthermore, the transition function preserves π, which means that if X is distributed according to π, then so is φ(X,U). The stationary distribution represents the long-term behavior of the Markov chain.</Paragraph>
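The crucial-parameter test described above can be sketched as follows (an illustrative sketch with hypothetical names; `parses` is an abstract predicate, instantiated here by a toy "language" given as the set of grammars that can parse the sentence):

```python
def crucial_update(fuzzy, grammar, sentence, parses, j):
    """PARAMETER-CRUCIAL step for parameter j: after a successful parse,
    reinforce the current setting of j only if flipping j would make the
    parse fail (i.e., only if j was crucial to the parse)."""
    if not parses(grammar, sentence):
        return False                      # no parse, no update
    flipped = tuple(1 - s if k == j else s for k, s in enumerate(grammar))
    if parses(flipped, sentence):
        return False                      # parameter not crucial
    a, b = fuzzy[j]
    fuzzy[j] = (a + 1, b) if grammar[j] == 1 else (a, b + 1)
    return True

# Toy instantiation: a "sentence" is the set of grammars that can parse it.
parses = lambda grammar, sentence: grammar in sentence
v2_cue = {(1, 0), (1, 1)}  # parseable only with V2 = 1, either pro-drop value
```

With `v2_cue`, the V2 parameter is crucial (flipping it breaks the parse) while the pro-drop parameter is not, so only the V2 distribution is updated.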
    <Paragraph position="4"> Agents have a natural partial ordering ⪰ defined componentwise: (Beta(a1,b1),Beta(a2,b2)) ⪰ (Beta(a′1,b′1),Beta(a′2,b′2)) if and only if a1 ≥ a′1, b1 ≤ b′1, a2 ≥ a′2, and b2 ≤ b′2.</Paragraph>
    <Paragraph position="6"> This ordering means that the left-hand agent is slanted more toward 1 in both parameters. Not all pairs of agent states are comparable, but under this partial ordering there are unique maximum and minimum agent states, Amax = (Beta(N,1),Beta(N,1)) and Amin = (Beta(1,N),Beta(1,N)),</Paragraph>
    <Paragraph position="8"> such that all agent states A satisfy Amax ⪰ A ⪰ Amin. Let us consider two population states X and Y, and denote the agents in X by Aj and the agents in Y by Bj, where 1 ≤ j ≤ n. Population states may also be partially ordered, as we can define X ⪰ Y to mean that all corresponding agents satisfy Aj ⪰ Bj. There are also maximum and minimum population states Xmax and Xmin, defined by setting all agent states to Amax and Amin, respectively.</Paragraph>
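The comparisons can be sketched as follows (a sketch assuming the componentwise order a ≥ a′ and b ≤ b′ for each parameter, which makes Beta(N,1) maximal and Beta(1,N) minimal; agents are lists of (a, b) pairs):

```python
def agent_geq(A, B):
    """A >= B in the partial order: A is slanted at least as far toward 1
    in every parameter."""
    return all(a >= a2 and b <= b2 for (a, b), (a2, b2) in zip(A, B))

N = 5000
A_MAX = [(N, 1), (N, 1)]   # maximal agent state
A_MIN = [(1, N), (1, N)]   # minimal agent state

def pop_geq(X, Y):
    """X >= Y: every agent in X dominates the corresponding agent in Y."""
    return all(agent_geq(A, B) for A, B in zip(X, Y))
```

Note that two agents can each be slanted toward 1 in a different parameter, in which case neither dominates the other; the order is genuinely partial.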
    <Paragraph position="9"> A Markov chain is monotonic if the set of states has a partial ordering with maximum and minimum elements and a transition function that respects that ordering. There is a perfect sampling algorithm called monotonic coupling from the past (MCFTP) that generates samples from the stationary distribution π of a monotonic Markov chain without requiring certain properties of it that are difficult to compute (Propp and Wilson, 1996). The partial ordering ⪰ on population states was constructed so that this algorithm could be used. The transition function φ mostly respects this partial ordering; that is, if X ⪰ Y , then with high probability φ(X,U) ⪰ φ(Y,U). This monotonicity property is why φ was defined to change only one agent per time step, and why the learning algorithms change that agent's knowledge of at most one parameter per time step. However, φ does not quite respect ⪰, because one can construct X, Y , and U such that X ⪰ Y but φ(X,U) and φ(Y,U) are not comparable. So, MCFTP does not necessarily produce correctly distributed samples. However, it turns out to be a reasonable heuristic, and until further theory can be developed and applied to this problem, it is the best that can be done.</Paragraph>
    <Paragraph position="10"> The MCFTP algorithm works as follows. We suppose that (Ut)t∈Z is a sequence of tuples of random numbers, and that (Xt)t∈Z is a sequence of random states such that each Xt is distributed according to π and Xt+1 = φ(Xt,Ut). We will determine X0 and return it as the random sample from the distribution π. To determine X0, we start at time T &lt; 0 with a list of all possible states, and compute their futures using φ and the sequence of Ut. If φ has been chosen properly, many of these paths will converge, and with any luck, at time 0 they will all be in the same state. If this happens, then we have found a time T such that no matter what XT is, there is only one possible value for X0, and that random state is distributed according to π as desired. Otherwise, we continue, starting twice as far back at time 2T, and so on. This procedure is generally impractical if the number of possible states is large. However, if the Markov chain is monotonic, we can take the shortcut of only tracking the two paths starting at Xmax and Xmin at time T. If these agree at time 0, then all other paths are squeezed in between and must agree as well.</Paragraph>
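The procedure can be sketched for a generic monotonic chain (an illustrative toy chain, not the simulation itself: an upward-biased walk on {0,...,10} whose update function is monotone in the state; the randomness for earlier times is prepended so that already-used random numbers are reused, as MCFTP requires):

```python
import random

def mcftp(phi, x_min, x_max, rng):
    """Monotone coupling from the past: run the chain from the minimum and
    maximum states using the SAME random numbers from time -T onward; if the
    two paths coalesce by time 0, every other path is squeezed between them,
    and the common value is an exact sample from the stationary distribution."""
    T = 1
    us = []                              # random numbers for times -T..-1
    while True:
        while len(us) < T:
            us.insert(0, rng.random())   # prepend randomness for earlier times
        lo, hi = x_min, x_max
        for u in us:
            lo, hi = phi(lo, u), phi(hi, u)
        if lo == hi:
            return lo                    # coalesced: X_0 is determined
        T *= 2                           # otherwise start twice as far back

K = 10
def phi(x, u):
    # Upward-biased random walk on {0,...,K}; monotone in x.
    return min(K, x + 1) if u < 0.7 else max(0, x - 1)

rng = random.Random(42)
samples = [mcftp(phi, 0, K, rng) for _ in range(2000)]
```

Because the walk drifts upward (up-probability 0.7), the stationary distribution concentrates near K, and the samples reflect that.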
  </Section>
  <Section position="5" start_page="13" end_page="15" type="metho">
    <SectionTitle>
4 Tweaking
</SectionTitle>
    <Paragraph position="0"> Since this simulation is intended to be used to study the loss of V2, certain long-term behavior is desirable. Of the four idealized grammars available in this simulation, three ought to be fairly stable, since there are languages of these types that have retained these properties for a long time: SVO (French, English), SVO+V2 (Icelandic), and SVO+pro-drop (Spanish). The fourth, SVO+V2+pro-drop, ought to be unstable and give way to SVO+pro-drop, since it approximates Old French before it changed. In any case, the population ought to spend most of its time in states where most of the agents use one of the four grammars predominantly, and neighboring agents should have similar fuzzy grammars.</Paragraph>
    <Paragraph position="1"> In preliminary experiments, the set of possible sentences did not contain expletive subject pronouns, sentence-initial adverbs, or any indication of spoken stress. Thus, the simulated SVO language was a subset of all the others, and SVO+pro-drop was a subset of SVO+V2+pro-drop. Consequently, the PARAMETER-CRUCIAL learning algorithm was unable to learn either of these languages because the non-V2 setting was never crucial: Any sentence that could be parsed without V2 could also be parsed with it. In later experiments, the sentences and grammars were modified to include expletive pronouns, thereby ensuring that SVO is not a subset of SVO+pro-drop or SVO+V2+pro-drop. In addition, marks were added to sentences to indicate spoken stress on the topic. In the simulated V2 languages, topics are always fronted, so such stress can only appear on the initial constituent, but in the simulated non-V2 languages it can appear on any constituent.</Paragraph>
    <Paragraph position="2"> This modification ensures that no language within the simulation is a subset of any of the others.</Paragraph>
    <Paragraph position="3"> The addition of spoken stress is theoretically plausible for several reasons. First, the acquisition of word order and case marking requires children to infer the subject and object of sample sentences, meaning that such thematic information is available from context. It is therefore reasonable to assume that the thematic context also allows for inference of the topic. Second, Chinese allows topics to be dropped where permitted by discourse, a feature also observed in the speech of children learning English.</Paragraph>
    <Paragraph position="4"> These considerations, along with the fact that the simulation works much better with topic markings than without, suggest that spoken emphasis on the topic provides positive evidence that children use to determine that a language is not V2.</Paragraph>
    <Paragraph position="5"> It turns out that the maximum value N allowed for aj and bj must be rather large. If it is too small, the population tends to converge to a saturated state where all the agents are approximately ^A = (Beta(N,N),Beta(N,N)). This state represents an even mixture of all four grammars and is clearly unrealistic. To see why this happens, imagine a fixed linguistic environment and an isolated agent learning from this environment with no birth-and-death process. This process is a Markov chain with a single absorbing state ^A, meaning that once the learner reaches state ^A it cannot change to any other state: Every learning step requires increasing one of the numerical parameters in the agent's state, and if they are all maximal, then no further change can take place. Starting from any initial state, the agent will eventually reach the absorbing state. The number of states for an agent must be finite for practical and theoretical reasons, but by making N very large, the time it takes for an agent to reach ^A becomes far greater than its life span under the birth-and-death process, thereby avoiding the saturation problem. With pr = 0.001, it turns out that 5000 is an appropriate value for N, and effectively no agents come close to saturation.</Paragraph>
    <Paragraph position="6"> After some preliminary runs, the LEARN-ALWAYS algorithm seemed to produce extremely incoherent populations with no global or local consensus on a dominant grammar. Furthermore, MCFTP was taking an extremely long time under the PARAMETER-CRUCIAL algorithm. An additional modification was put in place to encourage agents toward using predominantly one grammar. The best results were obtained by modifying the speaking algorithm so that agents prefer to speak more toward an extreme than the linguistic data would indicate. For example, if the data suggests that they should use V2 with a high probability of 0.7, then they use V2 with some higher probability, say, 0.8. If the data suggests a low value, say 0.3, then they use an even lower value, say 0.2. The original algorithm used the mean uj of beta distribution Beta(aj,bj) as the probability of using 1 for parameter j. The biased speech algorithm uses f(uj) instead, where f is a sigmoid function</Paragraph>
    <Paragraph position="8"> that satisfies f(1/2) = 1/2 and f′(1/2) = k. The numerical parameter k can be varied to exaggerate the effect. This modification leads to some increase in coherence with the LEARN-ALWAYS algorithm; it has minimal effect on the samples obtained with the PARAMETER-CRUCIAL algorithm, but MCFTP becomes significantly faster.</Paragraph>
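One concrete function satisfying both constraints, used here purely as an illustration (an assumed form, not necessarily the paper's), is f(u) = u^k / (u^k + (1 − u)^k):

```python
def biased(u, k):
    """A sigmoid with f(1/2) = 1/2 and f'(1/2) = k (assumed illustrative
    form).  For k > 1, probabilities above 1/2 are pushed toward 1 and
    probabilities below 1/2 are pushed toward 0."""
    return u**k / (u**k + (1 - u)**k)

# With k = 3: a moderately high mean of 0.7 is spoken as roughly 0.93,
# and by symmetry 0.3 is spoken as roughly 0.07.
```

An agent would then use biased(uj, k) in place of the mean uj when choosing each parameter setting for speaking.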
    <Paragraph position="9"> The biased speech algorithm can be viewed as a smoother form of the thresholding operation used in (Briscoe, 2002), discussed earlier. An alternative interpretation is that the acquisition process may involve biased estimates of the usage frequencies of syntactic constructions. Language acquisition requires children to impose regularity on sample data, leading to creoles and regularization of vocabulary, for instance (Bickerton, 1981; Kirby, 2001). This addition to the simulation is therefore psychologically plausible.</Paragraph>
  </Section>
</Paper>