<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1053"> <Title>Integrating Syntactic Priming into an Incremental Probabilistic Parser, with an Application to Psycholinguistic Modeling</Title>
<Section position="4" start_page="417" end_page="417" type="metho"> <SectionTitle> 2 Priming Models </SectionTitle> <Paragraph position="0"> We propose three models designed to capture the different theories of structural repetition discussed above. To keep our models as simple as possible, each formulation is based on an unlexicalized probabilistic context-free grammar (PCFG). In this section, we introduce the models and discuss the novel techniques used to model structural similarity. We also discuss the design of the probabilistic parser used to evaluate the models.</Paragraph> </Section>
<Section position="5" start_page="417" end_page="419" type="metho"> <SectionTitle> 2.1 Baseline Model </SectionTitle> <Paragraph position="0"> The unmodified PCFG model serves as the Baseline. A PCFG assigns probabilities to trees by treating each rule expansion as conditionally independent given the parent node. The probability of a rule LHS -> RHS is estimated by relative frequency as:</Paragraph> <Paragraph position="2"> P(LHS -> RHS) = c(LHS -> RHS) / c(LHS), where c(.) denotes the number of occurrences in the training corpus.</Paragraph>
<Section position="1" start_page="417" end_page="417" type="sub_section"> <SectionTitle> 2.2 Copy Model </SectionTitle> <Paragraph position="0"> The first model we introduce is a probabilistic variant of Frazier and Clifton's (2001) copying mechanism: it models parallelism in coordination and nothing else. This is achieved by assuming that the default operation upon observing a coordinator (assumed to be anything with a CC tag, e.g., 'and') is to copy the full subtree of the preceding coordinate sister. Copying affects how the parser works (see Section 2.5), and in a probabilistic setting, it also changes the probability of trees with parallel coordinated structures. If coordination is present, the structure of the second item is either identical to the first, or it is not. (Footnote 1: The model only considers two-item coordination or the last two sisters of multiple-item coordination.)</Paragraph> <Paragraph position="1"> Let us call the probability of having a copied tree p_ident.</Paragraph> <Paragraph position="2"> This value may be estimated directly from a corpus using the formula p_ident = c_ident / c_total. Here, c_ident is the number of coordinate structures in which the two conjuncts have the same internal structure, and c_total is the total number of coordinate structures. Note that we assume there is only one parameter p_ident, applicable everywhere (i.e., it has the same value for all rules).</Paragraph>
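<Paragraph> The corpus estimates just described are easy to compute. The sketch below is our own illustration, not the authors' code: it assumes NLTK's bundled Penn Treebank sample, approximates a 'coordinate structure' as any pair of sister subtrees immediately surrounding a CC node (ignoring the multiple-item refinement of footnote 1), and all function names are ours.

# Sketch: relative-frequency PCFG estimation and the p_ident estimate for the Copy model.
# Assumes NLTK and its Penn Treebank sample; names and details are ours, not the paper's.
from collections import Counter
from nltk import Tree
from nltk.corpus import treebank

def rule_probabilities(trees):
    # P(LHS -> RHS) = c(LHS -> RHS) / c(LHS)
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in trees:
        for prod in tree.productions():
            rule_counts[prod] += 1
            lhs_counts[prod.lhs()] += 1
    return {prod: n / lhs_counts[prod.lhs()] for prod, n in rule_counts.items()}

def skeleton(t):
    # Internal structure of a subtree: node labels only, with the words stripped.
    if isinstance(t, Tree):
        return (t.label(),) + tuple(skeleton(child) for child in t)
    return ()

def estimate_p_ident(trees):
    # p_ident = c_ident / c_total, counting the conjuncts on either side of a CC.
    c_ident = c_total = 0
    for tree in trees:
        for node in tree.subtrees():
            kids = list(node)
            for i in range(1, len(kids) - 1):
                left, cc, right = kids[i - 1], kids[i], kids[i + 1]
                if isinstance(cc, Tree) and cc.label() == 'CC' \
                        and isinstance(left, Tree) and isinstance(right, Tree):
                    c_total += 1
                    c_ident += skeleton(left) == skeleton(right)
    return c_ident / c_total if c_total else 0.0

if __name__ == '__main__':
    trees = list(treebank.parsed_sents())
    print(len(rule_probabilities(trees)), 'rules;', 'p_ident =', estimate_p_ident(trees))
</Paragraph>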
<Paragraph position="3"> How is this used in a PCFG parser? Let t1 and t2 represent, respectively, the first and second coordinate sisters, and let P_PCFG(t) be the PCFG probability of an arbitrary subtree t.</Paragraph> <Paragraph position="4"> Because of the independence assumptions of the PCFG, we know that p_ident >> P_PCFG(t). One way to proceed would be to assign a probability of p_ident when the structures match, and (1 - p_ident) * P_PCFG(t2) when the structures do not match. However, some probability mass is lost this way: there is a nonzero PCFG probability (namely, P_PCFG(t1)) that the structures match.</Paragraph> <Paragraph position="5"> In other words, we may have identical subtrees in two different ways: either due to a copy operation, or due to a PCFG derivation. If p_copy is the probability of a copy operation, we can write this fact more formally as: p_ident = P_PCFG(t1) + p_copy.</Paragraph> <Paragraph position="6"> Thus, if the structures do match, we assign the second sister a probability of: p_copy + P_PCFG(t1). If they do not match, we assign the second conjunct the remaining probability mass, distributed in proportion to the PCFG: (1 - p_copy - P_PCFG(t1)) * P_PCFG(t2) / (1 - P_PCFG(t1)). This accounts for both a copy mismatch and a PCFG derivation mismatch, and ensures that the probabilities still sum to one. These probabilities for parallel and non-parallel coordinate sisters therefore give us the basis of the Copy model.</Paragraph> <Paragraph position="9"> This leaves us with the problem of finding an estimate for p_copy. This value is approximated as:</Paragraph> <Paragraph position="11"> p_copy = (c_ident - sum_{t in T2} P_PCFG(t)) / c_total. In this equation, T2 is the set of all second conjuncts.</Paragraph> </Section>
<Section position="2" start_page="417" end_page="418" type="sub_section"> <SectionTitle> 2.3 Between Model </SectionTitle> <Paragraph position="0"> While the Copy model limits itself to parallelism in coordination, the next two models simulate structural priming in general. Both are similar in design, and are based on a simple insight: we may condition a PCFG rule expansion on whether the rule occurred in some previous context. If Prime is a binary-valued random variable denoting whether a rule occurred in the context, then we define: P(LHS -> RHS | LHS, Prime), the rule expansion probability conditioned on Prime, estimated by relative frequency in primed and unprimed contexts, respectively.</Paragraph> <Paragraph position="2"> This is essentially an instantiation of Church's (2000) adaptation probability, albeit with PCFG rules instead of words. For our first model, this context is the previous sentence. Thus, the model can be said to capture the degree to which rule use is primed between sentences. We henceforth refer to this as the Between model. Following the convention in the psycholinguistic literature, we refer to a rule use in the previous sentence as a 'prime', and a rule use in the current sentence as the 'target'. Each rule acts once as a target (i.e., the event of interest) and once as a prime. We may classify such adapted probabilities into 'positive adaptation', i.e., the probability of a rule given that the rule occurred in the preceding sentence, and 'negative adaptation', i.e., the probability of a rule given that the rule did not occur in the preceding sentence.</Paragraph> </Section>
<Section position="3" start_page="418" end_page="418" type="sub_section"> <SectionTitle> 2.4 Within Model </SectionTitle> <Paragraph position="0"> Just as the Between model conditions on rules from the previous sentence, the Within model conditions on rules from earlier in the current sentence. Each rule acts once as a target, and possibly several times as a prime (for each subsequent rule in the sentence). A rule is considered 'used' once the parser passes the word on the left-most corner of the rule. Because the Within model is finer-grained than the Between model, it can be used to capture the parallelism effect in coordination. In other words, this model could explain parallelism in coordination as an instance of a more general priming effect.</Paragraph> </Section>
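<Paragraph> To make the conditioning concrete, the sketch below (our own illustration, with hypothetical names; it assumes NLTK's Penn Treebank sample and does no smoothing, which a real implementation would need) estimates positive and negative adaptation for the Between model by treating the rule set of the previous sentence as the primes. The Within model would instead fill the prime set with the rules already used earlier in the current sentence.

# Sketch: positive vs. negative adaptation for the Between model, i.e. a rule expansion
# probability conditioned on whether the same rule occurred in the previous sentence.
# Our own unsmoothed illustration; assumes NLTK's Penn Treebank sample.
from collections import Counter
from nltk.corpus import treebank

def between_adaptation(trees):
    rule_counts = Counter()   # (production, primed?) -> count
    lhs_counts = Counter()    # (lhs, primed?) -> count
    prev_rules = set()        # the 'primes': rules of the preceding sentence
    for tree in trees:
        rules = tree.productions()
        for prod in rules:
            primed = prod in prev_rules
            rule_counts[(prod, primed)] += 1
            lhs_counts[(prod.lhs(), primed)] += 1
        prev_rules = set(rules)
        # For the Within model, the prime set would instead be reset per sentence and
        # grown rule by rule as the parser passes each rule's left-corner word.
    def adapted_prob(prod, primed):
        denom = lhs_counts[(prod.lhs(), primed)]
        return rule_counts[(prod, primed)] / denom if denom else 0.0
    return adapted_prob

if __name__ == '__main__':
    trees = list(treebank.parsed_sents())
    adapted_prob = between_adaptation(trees)
    rule = trees[0].productions()[0]
    print(rule, adapted_prob(rule, True), adapted_prob(rule, False))
</Paragraph>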
<Section position="4" start_page="418" end_page="419" type="sub_section"> <SectionTitle> 2.5 Parser </SectionTitle> <Paragraph position="0"> As our main purpose is to build a psycholinguistic model of structural repetition, the most important feature of the parsing model is that it builds structures incrementally. (Footnote 2: In addition to incremental parsing, a characteristic of some psycholinguistic models of sentence comprehension is that they parse deterministically. While we can compute the best incremental analysis at any point, our models do not parse deterministically. However, following the principles of rational analysis (Anderson, 1991), our goal is not to mimic the human parsing mechanism, but rather to create a model of human parsing behavior.) Reading time experiments, including the parallelism studies of Frazier et al. (2000), make word-by-word measurements of the time taken to read sentences. Slower reading times are known to be correlated with processing difficulty, and faster reading times (as is the case with parallel structures) are correlated with processing ease. A probabilistic parser may be considered to be a sentence processing model via a 'linking hypothesis', which links the parser's word-by-word behavior to human reading behavior. We discuss this topic in more detail in Section 3. At this point, it suffices to say that we require a parser which has the prefix property, i.e., which parses incrementally, from left to right.</Paragraph>
<Paragraph position="1"> Figure 1: The copy model copies the most likely first conjunct (illustrated with the coordination "a novel and a book" under "wrote"; tree diagrams omitted).</Paragraph>
<Paragraph position="3"> Therefore, we use an Earley-style probabilistic parser, which outputs Viterbi parses (Stolcke, 1995). We have two versions of the parser: one which parses exhaustively, and a second which uses a variable-width beam, pruning any edges whose merit is less than 1/12000 of that of the best edge. The merit of an edge is its inside probability times a prior P(LHS) times a lookahead probability (Roark and Johnson, 1999). To speed up parsing, we right-binarize the grammar and remove empty nodes, coindexation, and grammatical functions. As our goal is to create the simplest possible model which can nonetheless model experimental data, we do not make any of the tree modifications designed to improve accuracy (as, e.g., Klein and Manning 2003).</Paragraph>
<Paragraph position="4"> The approach used to implement the Copy model is to have the parser copy the subtree of the first conjunct whenever it comes across a CC tag.</Paragraph> <Paragraph position="5"> Before copying, though, the parser looks ahead to check if the part-of-speech tags after the CC are equivalent to those inside the first conjunct. The copying model is visualized in Figure 1: the top panel depicts a partially completed edge upon seeing a CC tag, and the second panel shows the completed copying operation. It should be clear that the copy operation gives the most probable subtree in a given span. To illustrate this, consider Figure 1. If the most likely NP between spans 2 and 7 does not involve copying (i.e., only standard PCFG rule derivations), the parser will find it using normal rule derivations. If it does involve copying, for this particular rule, it must involve the most likely NP subtree from spans 2 to 3. As we parse incrementally, we are guaranteed to have found this edge, and can use it to construct the copied conjunct over spans 5 to 7 and therefore the whole coordinated NP from spans 2 to 7.</Paragraph>
<Paragraph position="6"> To simplify the implementation of the copying operation, we turn off right binarization so that the constituents before and after a coordinator are part of the same rule, and therefore accessible from the same edge. This makes it simple to calculate the new probability, construct the copied subtree, and decide where to place the resulting edge on the chart.</Paragraph>
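<Paragraph> The copy step itself can be pictured with a small sketch (again our own simplification, not the authors' parser code): on reading a CC, compare the upcoming POS tags with the tag yield of the first conjunct, and if they match, propose a copied second conjunct scored with p_copy + P_PCFG(t1).

# Sketch of the copy operation: on a CC, look ahead; if the POS tags match those of the
# first conjunct, propose a copied second conjunct with probability p_copy + P_PCFG(t1).
# Illustrative only; p_copy and the PCFG probability of t1 are assumed to be given.
import copy
from nltk import Tree

def pos_yield(t):
    # POS-tag yield of a subtree (the parser's input is a POS-tag sequence).
    return [st.label() for st in t.subtrees() if len(st) == 1 and isinstance(st[0], str)]

def propose_copied_conjunct(t1, p_pcfg_t1, p_copy, lookahead_tags):
    # Return (copied subtree, probability) if the lookahead licenses a copy, else None.
    tags = pos_yield(t1)
    if lookahead_tags[:len(tags)] == tags:
        return copy.deepcopy(t1), p_copy + p_pcfg_t1
    return None

if __name__ == '__main__':
    # Toy first conjunct for "a tall man"; the lookahead "DT JJ NN ..." triggers copying.
    t1 = Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['tall']), Tree('NN', ['man'])])
    print(propose_copied_conjunct(t1, 1e-4, 0.2, ['DT', 'JJ', 'NN', 'WRB']))
</Paragraph>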
<Paragraph position="7"> The Between and Within models require a cache of recently used rules. This raises two dilemmas. First, in the Within model, keeping track of the full contextual history is incompatible with chart parsing. Second, whenever a parsing error occurs, the accuracy of the contextual history is compromised. As we are using a simple unlexicalized parser, such parsing errors are probably quite frequent.</Paragraph> <Paragraph position="8"> We handle the first problem by using one single parse as an approximation of the history. The most realistic choice for this single parse is the best parse so far according to the parser. Indeed, this is the approach we use for our main results in Section 3. However, because of the second problem noted above, in Section 4 we simulated the context by filling the cache with rules from the correct tree. In the Between model, these are the rules of the correct parse of the previous tree; in the Within model, these are the rules used in the correct parse at points up to (but not including) the current word.</Paragraph> </Section> </Section>
<Section position="6" start_page="419" end_page="420" type="metho"> <SectionTitle> 3 Human Reading Time Experiment </SectionTitle> <Paragraph position="0"> In this section, we test our models by applying them to experimental reading time data. Frazier et al. (2000) reported a series of experiments that examined the parallelism preference in reading. In one of their experiments, they monitored subjects' eye-movements while they read sentences like (1): (1) a. Hilda noticed a strange man and a tall woman when she entered the house.</Paragraph> <Paragraph position="1"> b. Hilda noticed a man and a tall woman when she entered the house.</Paragraph> <Paragraph position="2"> They found that total reading times were faster on the phrase 'tall woman' in (1a), where the coordinated noun phrases are parallel in structure, compared with (1b), where they are not.</Paragraph> <Paragraph position="3"> There are various ways of modeling processing difficulty within a probabilistic approach. One possibility is to use an incremental parser with a beam search or an n-best approach. Processing difficulty is predicted at points in the input string where the current best parse is replaced by an alternative derivation (Jurafsky, 1996; Crocker and Brants, 2000). An alternative is to keep track of all derivations, and predict difficulty at points where there is a large change in the shape of the probability distribution across adjacent parsing states (Hale, 2001). A third approach is to calculate the forward probability (Stolcke, 1995) of the sentence using a PCFG. Low probabilities are then predicted to correspond to high processing difficulty. A variant of this third approach is to assume that processing difficulty is correlated with the (log) probability of the best parse (Keller, 2003). This final formulation is the one used for the experiments presented in this paper.</Paragraph>
<Section position="1" start_page="419" end_page="420" type="sub_section"> <SectionTitle> 3.1 Method </SectionTitle> <Paragraph position="0"> The item set was adapted from that of Frazier et al. (2000).</Paragraph>
<Paragraph position="1"> The original two relevant conditions of their experiment (1a,b) differ in length. This results in a confound in the PCFG framework, because longer sentences tend to result in lower probabilities (as their parses tend to involve more rules). To control for such length differences, we adapted the materials by adding two extra conditions in which the relation between syntactic parallelism and length was reversed. This resulted in the following four conditions: (2) a. DT JJ NN and DT JJ NN (parallel): Hilda noticed a tall man and a strange woman when she entered the house.</Paragraph> <Paragraph position="3"> b. DT NN and DT JJ NN (non-parallel): Hilda noticed a man and a strange woman when she entered the house.</Paragraph> <Paragraph position="4"> c. DT JJ NN and DT NN (non-parallel): Hilda noticed a tall man and a woman when she entered the house.</Paragraph> <Paragraph position="5"> d. DT NN and DT NN (parallel): Hilda noticed a man and a woman when she entered the house.</Paragraph> <Paragraph position="6"> In order to account for Frazier et al.'s parallelism effect, a probabilistic model should predict a greater difference in probability between (2a) and (2b) than between (2c) and (2d), i.e., (2a)-(2b) > (2c)-(2d). This effect will not be confounded with length, because the relation between length and parallelism is reversed between (2a,b) and (2c,d). We added 8 items to the original Frazier et al. materials, resulting in a new set of 24 items similar to (2).</Paragraph> <Paragraph position="7"> We tested three of our PCFG-based models on all 24 sets of four conditions. The models were the Baseline, the Within, and the Copy models, trained exactly as described above. The Between model was not tested, as the experimental stimuli were presented without context. Each experimental sentence was input as a sequence of correct POS tags, and the log probability estimate of the best parse was recorded.</Paragraph> </Section>
<Section position="2" start_page="420" end_page="420" type="sub_section"> <SectionTitle> 3.2 Results and Discussion </SectionTitle> <Paragraph position="0"> Table 1 shows the mean log probabilities estimated by the models for the four conditions, along with the relevant differences between parallel and non-parallel conditions.</Paragraph> <Paragraph position="1"> Both the Within and the Copy models show a parallelism advantage, with the effect being much more pronounced for the Copy model than for the Within model. To evaluate statistical significance, the two differences for each item were compared using a Wilcoxon signed ranks test. Significant results were obtained both for the Within model (N = 24, Z = 1.67, p < .05, one-tailed) and for the Copy model (N = 24, Z = 4.27, p < .001, one-tailed). However, the effect was much larger for the Copy model, a conclusion which is confirmed by comparing the differences of differences between the two models (N = 24, Z = 4.27, p < .001, one-tailed). The Baseline model was not evaluated statistically, because by definition it predicts a constant value for (2a)-(2b) and (2c)-(2d) across all items. This is simply a consequence of the PCFG independence assumption, coupled with the fact that the four conditions of each experimental item differ only in the occurrences of two NP rules.</Paragraph>
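<Paragraph> As a concrete illustration of this evaluation, the sketch below (our own, with made-up log probabilities standing in for a model's best-parse estimates) computes the per-item differences (2a)-(2b) and (2c)-(2d) and compares them with a Wilcoxon signed-rank test via scipy.

# Sketch of the per-item parallelism test: compare (2a)-(2b) against (2c)-(2d) over 24 items
# with a Wilcoxon signed-rank test. The log probabilities below are placeholders; in the
# experiment they are a model's best-parse log probability estimates for each condition.
import random
from scipy.stats import wilcoxon

random.seed(1)
logprob = {(item, cond): -30.0 + random.gauss(0.0, 1.0)
           for item in range(24) for cond in 'abcd'}

diff_parallel = [logprob[(i, 'a')] - logprob[(i, 'b')] for i in range(24)]      # (2a)-(2b)
diff_nonparallel = [logprob[(i, 'c')] - logprob[(i, 'd')] for i in range(24)]   # (2c)-(2d)

# A parallelism advantage means (2a)-(2b) is reliably greater than (2c)-(2d).
statistic, p_value = wilcoxon(diff_parallel, diff_nonparallel, alternative='greater')
print(statistic, p_value)
</Paragraph>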
<Paragraph position="2"> The results show that the approach taken here can be successfully applied to the modeling of experimental data. In particular, both the Within and the Copy models show statistically reliable parallelism effects. It is not surprising that the Copy model shows a large parallelism effect for the Frazier et al. (2000) items, as it was explicitly designed to prefer structurally parallel conjuncts.</Paragraph> <Paragraph position="3"> The more interesting result is the parallelism effect found for the Within model, which shows that such an effect can arise from a more general probabilistic priming mechanism.</Paragraph> </Section> </Section>
<Section position="7" start_page="420" end_page="421" type="metho"> <SectionTitle> 4 Parsing Experiment </SectionTitle> <Paragraph position="0"> In the previous section, we were able to show that the Copy and Within models are able to account for human reading-time performance for parallel coordinate structures. While this result alone is sufficient to claim success as a psycholinguistic model, it has been argued that more realistic psycholinguistic models ought to also exhibit high accuracy and broad coverage, both crucial properties of the human parsing mechanism (e.g., Crocker and Brants, 2000).</Paragraph> <Paragraph position="1"> This should not be difficult: our starting point was a PCFG, which already has broad-coverage behavior (albeit with only moderate accuracy).</Paragraph> <Paragraph position="2"> However, in this section we explore what effects our modifications have on overall coverage and, perhaps more interestingly, on parsing accuracy.</Paragraph>
<Section position="1" start_page="420" end_page="420" type="sub_section"> <SectionTitle> 4.1 Method </SectionTitle> <Paragraph position="0"> The models used here were the ones introduced in Section 2 (which also contains a detailed description of the parser that we used to apply the models). The corpus used for both training and evaluation is the Wall Street Journal part of the Penn Treebank. We use sections 1-22 for training, section 0 for development, and section 23 for testing. Because the Copy model posits coordinated structures whenever POS tags match, parsing efficiency decreases if POS tags are not predetermined. Therefore, we assume POS tags as input, using the gold-standard tags from the treebank (following, e.g., Roark and Johnson 1999).</Paragraph> </Section>
<Section position="2" start_page="420" end_page="421" type="sub_section"> <SectionTitle> 4.2 Results and Discussion </SectionTitle> <Paragraph position="0"> Table 2 lists the results in terms of F-score on the test set. (Footnote 4: Based on a chi-square test on precision and recall, all results are statistically different from each other. The Copy model actually performs slightly better than the Baseline in the exhaustive case.) Using exhaustive search, the Baseline model achieves an F-score of 73.3, which is comparable to results reported for unlexicalized incremental parsers in the literature (e.g., the RB1 model of Roark and Johnson, 1999). All models exhibit a small decline in performance when beam search is used. For the Within model we observe a slight improvement in performance over the baseline, both for the exhaustive search and the beam search, while the Between model resulted in a decrease in performance.</Paragraph> <Paragraph position="2"> We also find that the Copy model performs at the baseline level. Recall that in order to simplify the implementation of the copying, we had to disable binarization for coordinate constituents. This means that quaternary rules were used for coordination (X -> X1 CC X2 X0), while normal binary rules (X -> Y X0) were used everywhere else.</Paragraph>
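<Paragraph> The contrast between the two rule shapes can be seen in a small grammar-transformation sketch (ours; the dummy-category naming is arbitrary): ordinary long rules are right-factored into binary rules, while any rule containing a CC is left flat so that both conjuncts remain accessible from a single edge.

# Sketch: right binarization with the coordination exception described above.
# Rules containing a CC are left n-ary; all other long rules are right-factored.
def binarize_rule(lhs, rhs):
    rhs = list(rhs)
    if 'CC' in rhs or len(rhs) <= 2:
        yield (lhs, tuple(rhs))
        return
    remainder = rhs[1:]
    new_lhs = lhs + '|' + '-'.join(remainder)   # dummy category for the factored tail
    yield (lhs, (rhs[0], new_lhs))
    yield from binarize_rule(new_lhs, remainder)

if __name__ == '__main__':
    print(list(binarize_rule('NP', ['DT', 'JJ', 'JJ', 'NN'])))        # right-factored
    print(list(binarize_rule('NP', ['NP', 'CC', 'NP'])))              # kept flat
    print(list(binarize_rule('NP', ['NP', ',', 'NP', 'CC', 'NP'])))   # kept flat
</Paragraph>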
<Paragraph> It is conceivable that this difference in binarization explains the difference in performance between the Between and Within models and the Copy model when beam search was used. We therefore also state the performance for the Between and Within models with binarization limited to non-coordinate structures in the column labeled 'Beam + Coord' in Table 2. The pattern of results, however, remains the same.</Paragraph> <Paragraph position="3"> The fact that coverage differs between models poses a problem in that it makes it difficult to compare the F-scores directly. We therefore compute separate F-scores for just those sentences that were covered by all four models. The results are reported in the 'Fixed Coverage' column of Table 2. Again, we observe that the Copy model performs at baseline level, while the Within model slightly outperforms the baseline, and the Between model performs worse than the baseline. In Section 5 below we present an error analysis that tries to investigate why the adaptation models do not perform as well as expected.</Paragraph> <Paragraph position="4"> Overall, we find that the modifications we introduced to model the parallelism effect in humans have a positive, but small, effect on parsing accuracy. Nonetheless, the results also indicate the success of both the Copy and Within approaches to parallelism as psycholinguistic models: a modification primarily useful for modeling human behavior has no negative effects on computational measures of coverage or accuracy.</Paragraph> </Section> </Section>
<Section position="8" start_page="421" end_page="423" type="metho"> <SectionTitle> 5 Distance Between Rule Uses </SectionTitle> <Paragraph position="0"> Although both the Within and Copy models succeed at the main task of modeling the parallelism effect, the parsing experiments in Section 4 showed mixed results with respect to F-scores: a slight increase in F-score was observed for the Within model, but the Between model performed below the baseline. We therefore turn to an error analysis, focusing on these two models.</Paragraph> <Paragraph position="1"> Recall that the Within and Between models estimate two probabilities for each rule, which we have been calling the positive adaptation (the probability of the rule when it is also in the history) and the negative adaptation (the probability of the rule when it is not in the history). While the effect is not always strong, we expect positive adaptation to be higher than negative adaptation (Dubey et al., 2005). However, this is not always the case.</Paragraph> <Paragraph position="2"> In the Within model, for example, the rule NP -> DT JJ NN has a higher negative than positive adaptation (we will refer to such rules as 'negatively adapted'). The more common rule NP -> DT NN has a higher positive adaptation ('positively adapted'). Since the latter is three times more common, this raises a concern: what if adaptation is an artifact of frequency? This 'frequency' hypothesis posits that a rule recurring in a sentence is simply an artifact of its higher frequency.</Paragraph> <Paragraph position="3"> The frequency hypothesis could explain an interesting fact: while the majority of rule tokens have positive adaptation, the majority of rule types have negative adaptation.
An important corollary of the frequency hypothesis is that we would not expect to find a bias towards local rule re-uses.</Paragraph>
<Paragraph position="4"> Figure 2: The treebank randomization algorithm.
Iterate through the treebank:
  Remember how many words each constituent spans.
Iterate through the treebank:
  Iterate through each tree:
    Upon finding a constituent spanning 1-4 words:
      Swap it with a randomly chosen constituent of 1-4 words.
      Update the remembered size of the swapped constituents and their subtrees.
Iterate through the treebank 4 more times:
  Swap constituents of size 5-9, 10-19, 20-35 and 35+ words, respectively.
</Paragraph>
<Paragraph position="5"> Nevertheless, the NP -> DT JJ NN rule is an exception: most negatively adapted rules have very low frequencies. This raises the possibility that sparse data is the cause of the negatively adapted rules. This makes intuitive sense: we need many rule occurrences to accurately estimate positive or negative adaptation.</Paragraph> <Paragraph position="6"> We measure the distribution of rule use to explore whether negatively adapted rules owe more to frequency effects or to sparse data. This distributional analysis also serves to measure 'decay' effects in structural repetition. The decay effect in priming has been observed elsewhere (Szmrecsanyi, 2005), and suggests that positive adaptation is higher the closer together two rules are.</Paragraph>
<Section position="1" start_page="422" end_page="422" type="sub_section"> <SectionTitle> 5.1 Method </SectionTitle> <Paragraph position="0"> We investigate the dispersion of rules by plotting histograms of the distance between subsequent rule uses. The basic premise is to look for evidence of an early peak or skew, which suggests rule re-use. To ensure that the histogram itself is not sensitive to sparse data problems, we group all rules into two categories: those which are positively adapted, and those which are negatively adapted.</Paragraph> <Paragraph position="1"> If adaptation is not due to frequency alone, we would expect the histograms for both positively and negatively adapted rules to be skewed towards local rule repetition. Detecting a skew requires a baseline without repetition. We propose the concept of 'randomizing' the treebank to create such a baseline. The randomization algorithm is described in Figure 2. The algorithm entails swapping subtrees, taking care that small subtrees are swapped first (otherwise large chunks would be swapped at once, preserving a great deal of context). This removes local effects, giving a distribution due to frequency alone.</Paragraph> <Paragraph position="2"> After applying the randomization algorithm to the treebank, we may construct the distance histogram for both the non-randomized and randomized treebanks. The distance between two occurrences of a rule is calculated as the number of words between the first word on the left corner of each rule. A special case occurs if a rule expansion invokes another use of the same rule. When this happens, we do not count the distance between the first and second expansion. However, the second expansion is still remembered as the most recent.</Paragraph> <Paragraph position="3"> We group rules into those that have a higher positive adaptation and those that have a higher negative adaptation. We then plot a histogram of rule re-occurrence distance for both groups, in both the non-randomized and randomized corpora.</Paragraph> </Section>
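<Paragraph> The distance measure of Section 5.1 can be sketched as follows (our own code, ignoring the self-embedding special case and the split into positively and negatively adapted rules; it assumes NLTK's Penn Treebank sample): each phrase-structure rule use is located at its left-corner word, and the histogram collects the word distances between subsequent uses of the same rule. Running the same code on the randomized treebank of Figure 2 would give the baseline histogram.

# Sketch: histogram of distances (in words) between subsequent uses of the same rule,
# with each use located at its left-corner word. Simplified: POS-tag rules are skipped
# and the self-embedding special case from the text is ignored.
from collections import Counter
from nltk import Tree
from nltk.corpus import treebank

def rule_positions(tree):
    # Yield (rule, index of the rule's left-corner word) for each phrase-structure node.
    out = []
    def walk(node, start):
        if not isinstance(node, Tree):
            return start + 1                       # a word occupies one position
        if len(node) == 1 and not isinstance(node[0], Tree):
            return start + 1                       # preterminal: skip POS-tag rules
        rhs = tuple(c.label() if isinstance(c, Tree) else c for c in node)
        out.append(((node.label(), rhs), start))
        pos = start
        for child in node:
            pos = walk(child, pos)
        return pos
    walk(tree, 0)
    return out

def reuse_distance_histogram(trees):
    histogram, last_use, offset = Counter(), {}, 0
    for tree in trees:
        for rule, index in rule_positions(tree):
            position = offset + index
            if rule in last_use:
                histogram[position - last_use[rule]] += 1
            last_use[rule] = position
        offset += len(tree.leaves())
    return histogram

if __name__ == '__main__':
    hist = reuse_distance_histogram(treebank.parsed_sents())
    print(sorted(hist.items())[:10])
</Paragraph>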
<Section position="2" start_page="422" end_page="423" type="sub_section"> <SectionTitle> 5.2 Results and Discussion </SectionTitle> <Paragraph position="0"> The resulting plot for the Within model is shown in Figure 3. For both the positively and the negatively adapted rules, we find that randomization results in a lower, less skewed peak, and a longer tail.</Paragraph> <Paragraph position="1"> We conclude that rules tend to be repeated close to one another more often than we would expect by chance, even for negatively adapted rules. This is evidence against the frequency hypothesis, and in favor of the sparse data hypothesis. This means that the small size of the increase in F-score we found in Section 4 is not due to adaptation being just an artifact of rule frequency. Rather, it can probably be attributed to data sparseness.</Paragraph> <Paragraph position="2"> Note also that the shape of the histogram provides a decay curve. Speculatively, we suggest that this shape could be used to parameterize the decay effect and therefore provide an estimate for adaptation which is more robust to sparse data. However, we leave the development of such a smoothing function to future research.</Paragraph> </Section> </Section> </Paper>