<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1020">
  <Title>Trigger-Pair Predictors in Parsing and Tagging</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this article, we apply to natural language parsing and tagging the device of trigger-pair predictors, previously employed exclusively within the field of language modelling for speech recognition. Given the task of predicting the correct rule to associate with a parse-tree node, or the correct tag to associate with a word of text, and assuming a particular class of parsing or tagging model, we quantify the information gain realized by taking account of rule or tag trigger-pair predictors, i.e.</Paragraph>
    <Paragraph position="1"> pairs consisting of a &amp;quot;triggering&amp;quot; rule or tag which has already occurred in the document being processed, together with a specific &amp;quot;triggered&amp;quot; rule or tag whose probability of occurrence within the current sentence we wish to estimate. This information gain is shown to be substantial. Further, by utilizing trigger pairs taken from the same general sort of document as is being processed (e.g. same subject matter or same discourse type)--as opposed to predictors derived from a comprehensive general set of English texts--we can significantly increase this information gain.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="131" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Ifa person or device wished to predict which words or grammatical constructions were about to occur in some document, intuitively one of the most helpful things to know would seem to be which words and constructions occurred within the last half-dozen or dozen sentences of the document. Other things being equal, a text that has so far been larded with, say, mountaineering terms, is a good bet to continue featuring them. An author with the habit of ending sentences with adverbial clauses of confirmation, e.g.</Paragraph>
    <Paragraph position="1"> &amp;quot;as we all know&amp;quot;, will probably keep up that habit as the discourse progresses.</Paragraph>
    <Paragraph position="2"> Within the field of language modelling for speech recognition, maintaining a cache of words that have occurred so far within a document, and using this information to alter probabilities of occurrence of particular choices for the word being predicted, has proved a winning strategy (Kuhn et al., 1990). Models using trigger pairs of words, i.e. pairs consisting of a &amp;quot;triggering&amp;quot; word which has already occurred in the document being processed, plus a specific &amp;quot;triggered&amp;quot; word whose probability of occurrence as the next word of the document needs to be estimated, have yielded perplexity 1 reductions of 29-38% over the baseline trigram model, for a 5-million-word Wall Street Journal training corpus (Rosenfeld, 1996).</Paragraph>
    <Paragraph position="3"> This paper introduces the idea of using trigger-pair techniques to assist in the prediction of rule and tag occurrences, within the context of natural-language parsing and tagging. Given the task of predicting the correct rule to associate with a parse-tree node, or the correct tag to associate with a word of text, and assuming a particular class of parsing or tagging model, we quantify the information gain realized by taking account of rule or tag trigger-pair predictors, i.e. pairs consisting of a &amp;quot;triggering&amp;quot; rule or tag which has already occurred in the document being processed, plus a specific &amp;quot;triggered&amp;quot; rule or tag whose probability of occurrence within the current sentence we wish to estimate.</Paragraph>
    <Paragraph position="4"> In what follows, Section 2 provides a basic overview of trigger-pair models. Section 3 describes the experiments we have performed, which to a large extent parallel successful modelling experiments within the field of language modelling for speech recognition. In the first experiment, we investigate the use of trigger pairs to predict both rules and tags over our full corpus of around a million words. The subsequent experiments investigate the \]See Section 2.</Paragraph>
    <Paragraph position="5">  additional information gains accruing from trigger-pair modelling when we know what sort of document is being parsed or tagged. We present our experimental results in Section 4, and discuss them in Section 5. In Section 6, we present some example trigger pairs; and we conclude, with a glance at projected future research, in Section 7.</Paragraph>
  </Section>
  <Section position="4" start_page="131" end_page="131" type="metho">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> Trigger-pair modelling research has been pursued within the field of language modelling for speech recognition over the last decade (Beeferman et al., 1997; Della Pietra et al., 1992; Kupiec, 1989; Lau, 1994; Lau et al., 1993; Rosenfeld, 1996).</Paragraph>
    <Paragraph position="1"> Fundamentally, the idea is a simple one: if you have recently seen a word in a document, then it is more likely to occur again, or, more generally, the prior occurrence of a word in a document affects the probability of occurrence of itself and other words.</Paragraph>
    <Paragraph position="2"> More formally, from an information-theoretic viewpoint, we can interpret the process as the relationship between two dependent random variables.</Paragraph>
    <Paragraph position="3"> Let the outcome (from the alphabet of outcomes Ay) of a random variable Y be observed and used to predict a random variable X (with alphabet .Ax).</Paragraph>
    <Paragraph position="4"> The probability distribution of X, in our case, is dependent on the outcome of Y.</Paragraph>
    <Paragraph position="5"> The average amount of information necessary to specify an outcome of X (measured in bits) is called its entropy H(X) and can also be viewed as a measure of the average ambiguity of its outcome: 2</Paragraph>
    <Paragraph position="7"> The mutual information between X and Y is a measure of entropy (ambiguity) reduction of X from the observation of the outcome of Y. This is the entropy of X minus its a posteriori entropy, having observed the outcome of Y.</Paragraph>
    <Paragraph position="9"> The dependency information between a word and its history may be captured by the trigger pair. 3 A trigger pair is an ordered pair of words t and w. Knowledge that the trigger word t has occurred within some window of words in the history, changes  elling, see (Rosenfeld, 1996).</Paragraph>
    <Paragraph position="10"> the probability estimate that word w will occur subsequently. null Selection of these triggers can be performed by calculating the average mutual information between word pairs over a training corpus. In this case, the alphabet Ax = {w,~}, the presence or absence of word w; similarly, Ay = {t,t}, the presence or absence of the triggering word in the history.</Paragraph>
    <Paragraph position="11"> This is a measure of the effect that the knowledge of the occurrence of the triggering word t has on the occurence of word w, in terms of the entropy (and therefore perplexity) reduction it will provide. Clearly, in the absence of other context (i.e. in the case of the a priori distribition of X), this information will be additional. However, once ~elated contextual information is included (for example by building a trigram model, or, using other triggers for the same word), this is no longer strictly true.</Paragraph>
    <Paragraph position="12"> Once the trigger pairs are chosen, they may be used to form constraint functions to be used in a maximum-entropy model, alongside other constraints. Models of this form are extremely versatile, allowing the combination of short- and long-range information. To construct such a model, one transforms the trigger pairs into constraint functions f(t, w):</Paragraph>
    <Paragraph position="14"> The expected values of these functions are then used to constrain the model, usually in combination of with other constraints such as similar functions embodying uni-, bi- and trigram probability estimates. null (Beeferman et al., 1997) models more accurately the effect of distance between triggering and triggered word, showing that for non-self-triggers, 4 the triggering effect decays exponentially with distance. For self-triggers, 5 the effect is the same except that the triggering effect is lessened within a short range of the word. Using a model of these distance effects, they are able to improve the performance of a trigger model.</Paragraph>
    <Paragraph position="15"> We are unaware of any work on the use of trigger pairs in parsing or tagging. In fact, we have not found any previous research in which extrasentential data of any sort are applied to the problem of parsing or tagging.</Paragraph>
  </Section>
  <Section position="5" start_page="131" end_page="132" type="metho">
    <SectionTitle>
3 The Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="131" end_page="132" type="sub_section">
      <SectionTitle>
3.1 Experimental Design
</SectionTitle>
      <Paragraph position="0"> In order to investigate the utility of using long-range trigger information in tagging and parsing 4i.e. words which trigger words other than themselves 5i.e. words which trigger themselves  tasks, we adopt the simple mutual-information approach used in (Rosenfeld, 1996). We carry over into the domain of tags and rules an experiment from Rosenfeld's paper the details of which we outline below. null The idea is to measure the information contributed (in bits, or, equivalently in terms of perplexity reduction) by using the triggers. Using this technique requires special care to ensure that information &amp;quot;added&amp;quot; by the triggers is indeed additional information.</Paragraph>
      <Paragraph position="1"> For this reason, in all our experiments we use the unigram model as our base model and we allow only one trigger for each tag (or rule) token. 6 We derive these unigram probabilities from the training corpus and then calculate the total mutual information gained by using the trigger pairs, again with respect to the training corpus.</Paragraph>
      <Paragraph position="2"> When using trigger pairs, one usually restricts the trigger to occur within a certain window defined by its distance to the triggered token. In our experiments, the window starts at the sentence prior to that containing the token and extends back W (the window size) sentences. The choice to use sentences as the unit of distance is motivated by our intention to incorporate triggers of this form into a probabilistie treebank-based parser and tagger, such as (Black et al., 1998; Black et al., 1997; Brill, 1994; Collins, 1996; Jelinek et al., 1994; Magerman, 1995; Ratnaparkhi, 1997). All such parsers and taggers of which we are aware use only intrasentential information in predicting parses or tags, and we wish to remove this information, as far as possible, from our results 7 The window was not allowed to cross a document boundary. The perplexity of the task before taking the trigger-pair information into account for tags was 224.0 and for rules was 57.0.</Paragraph>
      <Paragraph position="3"> The characteristics of the training corpus we employ are given in Table 1. The corpus, a subset s of the ATR/Lancaster General-English Treebank (Black et al., 1996), consists of a sequence of sentences which have been tagged and parsed by human experts in terms of the ATR English Grammar; a broad-coverage grammar of English with a high level of analytic detail (Black et al., 1996; Black et al., 1997). For instance, the tagset is both seman-C/By rule assignment, we mean the task of assigning a rule-name to a node in a parse tree, given that the constituent boundaries have already been defined.</Paragraph>
      <Paragraph position="4"> 7This is not completely possible, since correlations, even if slight, will exist between intra- and extrasentential information Sspecifically, a roughly-900,000-word subset of the full ATR/Lancaster General-English Treebank (about 1.05 million words), from which all 150,000 words were excluded that were treebanked by the two least accurate ATR/Lancaster treebankers (expected hand-parsing error rate 32%, versus less than 10% overall for the three</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="132" end_page="133" type="metho">
    <SectionTitle>
[Table 1 caption fragment: ... ATR/Lancaster General-English Treebank]
</SectionTitle>
    <Paragraph position="0"> tic and syntactic, and includes around 2000 different tags, which classify nouns, verbs, adjectives and adverbs via over 100 semantic categories. As examples of the level of syntactic detail, exhaustive syntactic and semantic analysis is performed on all nominal compounds; and the full range of attachment sites is available within the Grammar for sentential and phrasal modifiers, and are used precisely in the Treebank. The Treebank actually consists of a set of documents, from a variety of sources. Crucially for our experiments (see below), the idea 9 informing the selection of (the roughly 2000) documents for inclusion in the Treebank was to pack into it the maximum degree of document variation along many different scales--document length, subject area, style, point of view, etc.--but without establishing a single, pre-determined classification of the included documents.</Paragraph>
    <Paragraph position="1"> In the first experiment, we examine the effectiveness of using trigger pairs over the entire training corpus. At the same time we investigate the effect of varying the window size. In additional experiments, we observe the effect of partitioning our training dataset into a few relatively homogeneous subsets, on the hypothesis that this will decrease perplexity. It seems reasonable that in different text varieties, different sets of trigger pairs will be useful, and that tokens which do not have effective triggers within one text variety may have them in another) deg To investigate the utility of partitioning the dataset, we construct a separate set of trigger pairs for each class. These triggers are only active for their respective class and are independent of each other.</Paragraph>
    <Paragraph position="2"> Their total mutual information is compared to that derived in exactly the same way from a random partition of our corpus into the same number of classes, each comprised of the same number of documents.</Paragraph>
    <Paragraph position="3"> Our training data partitions naturally into four subsets, shown in Table 2 as Partitioning 1 (&amp;quot;Source&amp;quot;). Partitioning 2, &amp;quot;List Structure&amp;quot;, puts all documents which contain at least some HTMLlike &amp;quot;List&amp;quot; markup (e.g. LI (=List Item)) 11 in one  subset, and all other documents in the other subset. By merging Partitionings 1 and 2 we obtain Partitioning 3, &amp;quot;Source Plus List Structure&amp;quot;. Partitioning 4 is &amp;quot;Source Plus Document Type&amp;quot;, and contains 9 subsets, e.g. &amp;quot;Letters; diaries&amp;quot; (subset 8) and &amp;quot;Novels; stories; fables&amp;quot; (subset 7). With 13 subsets, ~e Partitioning 5, &amp;quot;Source Plus Domain&amp;quot; includes e.g. ' .~ &amp;quot;Social Sciences&amp;quot; (subset 9) and Recreation (subset 1). Partitionings 4 and 5 were effected by actual inspection of each document, or at least of its title and/or summary, by one of the authors. The reason Pwe included Source within most partitionings was to determine the extent to which information gains were additive, a2</Paragraph>
  </Section>
  <Section position="7" start_page="133" end_page="133" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="133" end_page="133" type="sub_section">
      <SectionTitle>
4.1 Window Size
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows the effect of varying the window size from 1 to 500 for both rule and tag tokens. The optimal window size for tags was approximately 12 sentences (about 135 words) and for rules it was approximately 6 sentences (about 68 words). These values were used for all subsequent experiments. It is interesting to note that the curves are of similar shape for both rules and tags and that the optimal value is not the largest window size. Related effects for words are reported in (Lau, 1994; Beeferman et al., 1997). In the latter paper, an exponential model of distance is used to penalize large distances between triggering word and triggered word. The variable window used here can be seen as a simple alternative to this.</Paragraph>
      <Paragraph position="1"> One explanation for this effect in our data is, in the case of tags, that topic changes occur in documents. In the case of rules, the effect would seem to indicate a short span of relatively intense stylistic carryover in text. For instance, it may be much more important, in predicting rules typical of list structure, to know that similar rules occurred a few sentences ago, than to know that they occurred dozens of sentences back in the document.</Paragraph>
    </Section>
    <Section position="2" start_page="133" end_page="133" type="sub_section">
      <SectionTitle>
4.2 Class-Specific Triggers
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the improvement in perplexity over the base (unigram) tag and rule models for both the randomly-split and the hand-partitioned training sets. In every case, the meaningful split yielded significantly more information than the random split.</Paragraph>
      <Paragraph position="1"> (Of course, the results for randomly-split training sets are roughly the same as for the unpartitioned training set (Figure 1)).</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="133" end_page="135" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> The main result of this paper is to show that analogous to the case of words in language modelling, a significant amount of extrasentential information can be extracted from the long-range history of a document, using trigger pairs for tags and rules. Although some redundancy of information is inevitable, we have taken care to exclude as much information as possible that is already available to (intrasentential-data-based, i.e. all known) parsers and taggers.</Paragraph>
    <Paragraph position="1"> Quantitatively, the studies of (Rosenfeld, 1996) yielded a total mutual information gain of 0.38 bits, using Wall Street Journal data, with one trigger per word. In a parallel experiment, using the same technique, but on the ATR/Lancaster corpus, the total mutual information of the triggers for lags was 0.41 bits. This figure increases to 0.52 bits when tags further away than 135 tags (the approximate equivalent in words to the optimal window size in sentences) are excluded from the history. For the remainder of our experiments, we do not use as part of the history the tags/rules from the sentence containing the token to be predicted. This is motivated by our wish to exclude the intrasentential information which is already available to parsers and taggers.</Paragraph>
    <Paragraph position="2"> In the case of tags, using the optimal window size, the gain was 0.31 bits, and for rules the information gain was 0.12 bits. Although these figures are not as large as for the case where intrasentential information is incorporated, they are sufficiently close to encourage us to exploit this information in our models. null For the case of words, the evidence shows that triggers derived in the same manner as the triggers in our experiments, can provide a substantial amount of new information when used in combination with sophisticated language models. For example, (Rosenfeld, 1996) used a maximum-entropy  3: Science, Techn. 4018 4: Humanities 2224 5: Daily Living 896 6: Health, Education 1649 7: Government, Polit. 1768 8: Travel 2667 9: Social Sciences 3617 10: Idiom examp, sents 666 11: Canadian Hansards 5002 12: Assoc. Press, WSJ 8851 13: Travel dialgs 43341  Arising From Partitioned vs. Unpartitioned Training Sets model trained on 5 million words, with only trigger, uni-, hi- and trigram constraints, to measure the test-set perpexity reduction with respect to a &amp;quot;compact&amp;quot; backoff trigram model, a well-respected model in the language-modelling field. When the top six triggers for each word were used, test-set perplexity was reduced by 25%. Furthermore, when a more sophisticated version of this model 13 was applied in conjunction with the SPHINX II speech recognition system (Huang et al., 1993), a 10-14% reduction in word error rate resulted (Rosenfeld, 1996). We see no reason why this effect should not carry over to tag and rule tokens, and are optimistic that long-range trigger information can be used in both parsing and tagging to improve performance.</Paragraph>
    <Paragraph position="3"> For words (Rosenfeld, 1996), self-triggers--words which triggered themselves--were the most frequent kind of triggers (68% of all word triggers were selftriggers). This is also the case for tags and rules. For tags, 76.8% were self-triggers, and for rules, 96.5% were self-triggers. As in the case of words, the set of self-triggers provides the most useful predictive information.</Paragraph>
  </Section>
  <Section position="9" start_page="135" end_page="136" type="metho">
    <SectionTitle>
6 Some Examples
</SectionTitle>
    <Paragraph position="0"> We will now explicate a few of the example trigger pairs in Tables 4-6. Table 4 Item 5, for instance, captures the common practice of using a sequence of points, e.g ........... , to separate each item of a (price) list and the price of that item. Items 6 and 7 are similar cases (e.g. &amp;quot;contact~call (someone) at:&amp;quot; + phone number; &amp;quot;available from:&amp;quot; + source, typically including address, hence zipcode). These correlations typically occur within listings, and, crucially a3trained on 38 million words, and also employing distance-2 N-gram constraints, a unigram cache and a conditional bigram cache (this model reduced perplexity over the baseline trigram model by 32%) for their usefulness as triggers, typically occur many at a time.</Paragraph>
    <Paragraph position="1"> When triggers are drawn from a relatively homogeneous set of documents, correlations emerge which seem to reflect the character of the text type involved. So in Table 6 Item 5, the proverbial equation of time and money emerges as more central to Business and Commerce texts than the different but equally sensible linkup, within our overall training set, between business corporations and money.</Paragraph>
    <Paragraph position="2"> Turning to rule triggers, Table 5 Item 1 is more or less a syntactic analog of the tag examples Table 4 Items 5-7, just discussed. What seems to be captured is that a particular style of listing things, e.g. * + listed item, characterizes a document as a whole (if it contains lists); further, listed items are not always of the same phrasal type, but are prone to vary syntactically. The same document that contains the list item &amp;quot;* DIG. AM/FM TUNER&amp;quot;, for instance, which is based on a Noun Phrase, soon afterwards includes ':* WEATHER PROOF&amp;quot; and &amp;quot;* ULTRA COMPACT&amp;quot;, which are based on Adjective Phrases.</Paragraph>
    <Paragraph position="3"> Finally, as in the case of the tag trigger examples of Table 6, text-type-particular correlations emerge when rule triggers are drawn from a relatively homogeneous set of documents. A trigger pair of constructions specific to Class 1 of the Source partitioning, which contains only Associated Press newswire and Wall Street Journal articles, is the following: A sentence containing both a quoted remark and an attribution of that remark to a particular source, triggers a sentence containing simply a quoted remark, without attribution. (E.g. &amp;quot;The King was in trouble,&amp;quot; Wall wrote, triggers &amp;quot;This increased the King's bitterness.&amp;quot;.) This correlation is essentially absent in other text types.</Paragraph>
  </Section>
class="xml-element"></Paper>