File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-1211_metho.xml
Size: 23,189 bytes
Last Modified: 2025-10-06 14:15:13
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1211"> <Title>Linguistic Theory in Statistical Language Learning</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Word N-gram Models </SectionTitle> <Paragraph position="0"> Let us return to the simple bigram word model, where the probability of each next word is determined from the current one. We already noted that this model relies on the notion of a word, the notion of an utterance, and the notion that an utterance is a sequence of words.</Paragraph> <Paragraph position="1"> The way this model is best visualized, and as it happens, best implemented, is as a finite-state automaton (FSA), with arcs and states both labelled with words, and transition probabilities associated with each arc. For example, there will be one state labelled The with one arc to each other state, for example to the state Cat, and this arc will be labelled cat. The reason for labelling both arcs and states with words is that the states constitute the only memory device available to an FSA. To remember that the most recent word was &quot;cat&quot;, all arcs labelled cat must fall into the same state Cat.</Paragraph> <Paragraph position="2"> The transition probability from the state The along the unique arc labelled cat to the state Cat will be the probability of the word &quot;cat&quot; following the word &quot;the&quot;, P(cat | the).</Paragraph> <Paragraph position="3"> More generally, we enumerate the words {w_1, ..., w_N} and associate a state S_i with each word w_i. Now the automaton has the states {S_1, ..., S_N} and from each state S_i there is an arc labelled w_j to state S_j with transition probability P(w_j | w_i), the word bigram probability. To establish the probabilities of each word starting or finishing off the utterance, we introduce the special state S_0 and special word w_0 that marks the end of the utterance, and associate the arc from S_0 to S_i with the probability of w_i starting an utterance, and the arc from S_i to S_0 with the probability of an utterance ending with word w_i.</Paragraph> <Paragraph position="4"> If we want to calculate the probability of a word sequence w_{i_1} ... w_{i_n}, we simply multiply the bigram probabilities: P(w_{i_1} ... w_{i_n}) = P(w_{i_1} | w_0) · P(w_{i_2} | w_{i_1}) · ... · P(w_{i_n} | w_{i_{n-1}}) · P(w_0 | w_{i_n}).</Paragraph> <Paragraph position="6"> We now recall something from formal language theory about the equivalence between finite-state automata and regular languages. What does the equivalent regular language look like? Let's just first rename S_0 as S and, by stretching it just a little, let the end-of-utterance marker w_0 be ε, the empty string. The automaton then corresponds to the regular grammar with rules S → w_i S_i with probability P(w_i | w_0), S_i → w_j S_j with probability P(w_j | w_i), and S_i → ε with probability P(w_0 | w_i).</Paragraph> <Paragraph position="8"> Does this give us any new insight? Yes, it does! Let's define a string rewrite in the usual way: αAγ ⇒ αβγ if the rule A → β is in the grammar. We can then derive the string w_{i_1} ... w_{i_n} from the top symbol S in n+1 steps: S ⇒ w_{i_1} S_{i_1} ⇒ w_{i_1} w_{i_2} S_{i_2} ⇒ ... ⇒ w_{i_1} ... w_{i_n} S_{i_n} ⇒ w_{i_1} ... w_{i_n}.</Paragraph> <Paragraph position="10"> Now comes the clever bit: if we define the derivation probability as the product of the rewrite probabilities, and identify the rewrite and the rule probabilities, we realize that the string probability is simply the derivation probability. This illustrates one of the most central aspects of probabilistic parsing: String probabilities are defined in terms of derivation probabilities.</Paragraph>
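As a minimal sketch of this computation, the following fragment multiplies the bigram probabilities along the (unique) derivation of a word sequence; the probabilities and the &quot;&lt;s&gt;&quot; symbol standing in for the end-of-utterance marker w_0 are invented for illustration.

```python
# A minimal sketch, with invented probabilities, of the bigram string
# probability as a product of transition probabilities along the derivation.
from functools import reduce

# P(next | previous); "<s>" plays the role of the special word w_0.
bigram = {
    ("<s>", "the"): 0.6, ("the", "cat"): 0.3,
    ("cat", "sleeps"): 0.4, ("sleeps", "<s>"): 0.8,
}

def string_probability(words):
    """Multiply the bigram probabilities along the derivation
    S => w1 S1 => w1 w2 S2 => ... => w1 ... wn."""
    path = ["<s>"] + list(words) + ["<s>"]
    return reduce(lambda p, pair: p * bigram.get(pair, 0.0),
                  zip(path, path[1:]), 1.0)

print(string_probability(["the", "cat", "sleeps"]))  # 0.6 * 0.3 * 0.4 * 0.8
```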
<Paragraph position="11"> So the simple word bigram model not only employs highly useful notions from linguistic theory, it implicitly employs the machinery of rewrite rules and derivations from formal language theory, and it also assigns string probabilities in terms of derivation probabilities, just like most probabilistic parsing schemes around. However, the heritage from finite-state automata results in simplistic models of interword dependencies.</Paragraph> <Paragraph position="12"> General word N-gram models, of which word bigram models are a special case with &quot;N&quot; equal to two, can be accommodated in very much the same way by introducing states that remember not only the previous word, but the N-1 previous words. This generalization is purely technical and adds little or no linguistic fuel to the model from a theoretical point of view. From a practical point of view, the gain in predictive power from using more conditioning in the probability distributions is very quickly overcome by the difficulty of estimating these probability distributions accurately from available training data; the perennial sparse-data problem.</Paragraph> <Paragraph position="13"> So why does this model look like it does? We conjecture the following explanations: Firstly, it is directly applicable to the representation used by an acoustic speech recognizer, and this can be done efficiently as it essentially involves intersecting two finite-state automata. Secondly, the model parameters -- the word bigram probabilities -- can be estimated directly from electronically readable texts, and there is a lot of that available.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Tag N-gram Models </SectionTitle> <Paragraph position="0"> Let us now move on to a somewhat more linguistically sophisticated language model, the tag N-gram model. Here, the interaction between words is mediated by part-of-speech (PoS) tags, which constitute linguistically motivated labels that we assign to each word in an utterance. For example, we might look at the basic word classes adjectives, adverbs, articles, conjunctions, nouns, numbers, prepositions, pronouns and verbs, essentially introduced already by the ancient Greek Dionysius Thrax. We immediately realise that this gives us the opportunity to incorporate a vast amount of linguistic knowledge into our model by selecting the set of PoS tags appropriately; consequently, this is a much debated and controversial issue.</Paragraph> <Paragraph position="1"> Such a representation can be used for disambiguation, as in the case of the well-known, highly ambiguous example sentence &quot;Time flies like an arrow&quot;. We can for example prescribe that &quot;Time&quot; is a noun, &quot;flies&quot; is a verb, &quot;like&quot; is a preposition (or adverb, according to your taste), &quot;an&quot; is an article, and that &quot;arrow&quot; is a noun. In effect, a label, i.e., a part-of-speech tag, has been assigned to each word. We realise that words may be assigned different labels in different contexts, or in different readings; for example, if we instead prescribe that &quot;flies&quot; is a noun and &quot;like&quot; is a verb, we get another reading of the sentence.</Paragraph>
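To make the ambiguity concrete, here is an illustrative sketch with an invented toy lexicon (the tag sets are assumptions for the example, not the paper's): every way of choosing one tag per word is a distinct reading that a tagging model must choose between.

```python
# Enumerate the readings of "Time flies like an arrow" under a toy lexicon.
from itertools import product

lexicon = {              # hypothetical tag sets, for illustration only
    "Time":  ["noun", "verb"],
    "flies": ["verb", "noun"],
    "like":  ["prep", "verb"],
    "an":    ["article"],
    "arrow": ["noun"],
}
sentence = ["Time", "flies", "like", "an", "arrow"]

for tags in product(*(lexicon[w] for w in sentence)):
    print(list(zip(sentence, tags)))   # one tag assignment = one reading
```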
<Paragraph position="2"> So what does the tag N-gram model look like in more detail? We can actually recast it in virtually the same terms as the word bigram model, the only difference being that we interpret each state S_i as a PoS tag (in the bigram case, and as a tag sequence in the general N-gram case): the rules are now of the form S_i → w_k S_j, with probabilities P(w_k, S_j | S_i).</Paragraph> <Paragraph position="4"> Note that we have now separated the words w_k from the states S_i and that thus, in principle, any state can generate any word. This is actually a slightly more powerful formalism than the standard hidden Markov model (HMM) used for N-gram PoS tagging (5). We recast it as follows: S_i → T_j S_j and T_j → w_k.</Paragraph> <Paragraph position="6"> Here we have the rules of the form S_i → T_j S_j, with the corresponding probabilities P(S_j | S_i), encoding the tag N-gram statistics. This is the probability that the tag T_j will follow the tag T_i (in the bigram case, or the sequence encoded by S_i in the general N-gram case). The rules T_i → w_k come with probabilities P(w_k | T_i), the probability of tag T_i being realised as word w_k.</Paragraph> <Paragraph position="9"> The latter probabilities seem a bit backward, as we would rather think in terms of the converse probability P(T_i | w_k) of a particular word w_k being assigned some PoS tag T_i, but one is easily recoverable from the other using Bayesian inversion: P(w_k | T_i) = P(T_i | w_k) · P(w_k) / P(T_i).</Paragraph> <Paragraph position="11"> We now connect the second formulation with the first one by unfolding each rule T_j → w_k into each rule S_i → T_j S_j. This lays bare the independence assumptions: each unfolded rule S_i → w_k S_j is assigned the probability P(S_j | S_i) · P(w_k | T_j), rather than a free parameter P(w_k, S_j | S_i).</Paragraph> <Paragraph position="13"> As should be clear from this correspondence, the HMM-based PoS-tagging model can be formulated as a (deterministic) FSA, thus allowing very fast processing, linear in string length.</Paragraph> <Paragraph position="14"> The word string w_{k_1} ... w_{k_n} can be derived from the top symbol S in 2n+1 steps: S ⇒ T_{i_1} S_{i_1} ⇒ w_{k_1} S_{i_1} ⇒ w_{k_1} T_{i_2} S_{i_2} ⇒ w_{k_1} w_{k_2} S_{i_2} ⇒ ... ⇒ w_{k_1} ... w_{k_n} S_{i_n} ⇒ w_{k_1} ... w_{k_n}.</Paragraph> <Paragraph position="16"> The interpretation of this is that we start off in the initial state S, select a PoS tag T_{i_1} at random, according to the probability distribution in state S, then generate the word w_{k_1} at random according to the lexical distribution associated with tag T_{i_1}, then draw a next PoS tag T_{i_2} at random according to the transition probabilities associated with state S_{i_1}, hop to the corresponding state S_{i_2}, generate the word w_{k_2} at random according to the lexical distribution associated with tag T_{i_2}, et cetera.</Paragraph> <Paragraph position="17"> Another general lesson can be learned from this: If we wish to calculate the probability of a word string, rather than of a word string with a particular tag associated with each word, as the model does as it stands, it would be natural to sum over the set of possible ways of assigning PoS tags to the words of the string. This means that: The probability of a word string is the sum of its derivation probabilities.</Paragraph> <Paragraph position="18"> The model parameters P(S_j | S_i) and P(w_k | T_j) can be estimated essentially in two different ways.</Paragraph> <Paragraph position="19"> The first employs manually annotated training data and the other uses unannotated data and some reestimation technique such as Baum-Welch reestimation (1). In both cases, an optimal set of parameters is sought, which will maximize the probability of the training data, supplemented with a portion of the black art of smoothing.
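As a minimal sketch of the annotated-data route, the fragment below derives relative-frequency estimates of P(T_j | T_i) and P(w_k | T_j) from a tiny invented tagged corpus; the add-k constant is only a crude stand-in for the smoothing alluded to above, not the paper's method.

```python
# Relative-frequency (maximum-likelihood) estimates from a toy tagged corpus,
# with a crude add-k smoothing stand-in. All data and names are illustrative.
from collections import defaultdict

corpus = [  # one (word, tag) sequence per utterance; invented example data
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("time", "NOUN"), ("flies", "VERB")],
]

tag_bigrams = defaultdict(lambda: defaultdict(int))   # counts for P(T_j | T_i)
emissions = defaultdict(lambda: defaultdict(int))     # counts for P(w_k | T_j)
for sentence in corpus:
    prev = "<s>"                       # special start/end tag, like S_0 above
    for word, tag in sentence:
        tag_bigrams[prev][tag] += 1
        emissions[tag][word] += 1
        prev = tag
    tag_bigrams[prev]["<s>"] += 1

def p_tag(tag, prev, k=1.0):
    """Smoothed estimate of P(T_j | T_i); k is an arbitrary add-k constant."""
    tags = set(t for d in tag_bigrams.values() for t in d) | {"<s>"}
    total = sum(tag_bigrams[prev].values())
    return (tag_bigrams[prev][tag] + k) / (total + k * len(tags))

print(p_tag("VERB", "NOUN"))                                   # (2+1)/(2+4) = 0.5
print(emissions["VERB"]["flies"] / sum(emissions["VERB"].values()))  # P(flies | VERB)
```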
In the former case, we are faced with two major problems: a shortage of training data, and a relatively high noise level in existing data, in terms of annotation inconsistencies.</Paragraph> <Paragraph position="20"> In the latter case, the problems are the instability of the resulting parameters as a function of the initial lexical bias required, and the fact that the chances of finding a global optimum using any computationally feasible technique rapidly approach zero as the size of the model (in terms of the number of tags, and N) increases. Experience has shown that, despite the noise level, annotated training data yields better models.</Paragraph> <Paragraph position="21"> Let us take a step back and see what we have got: We have the notion of a word, the notion of an utterance, the notion that an utterance is a sequence of words, the machinery of rewrite rules and derivations, and string probabilities are defined as the sum of the derivation probabilities. In addition to this, we have the possibility of incorporating a lot of linguistic knowledge into the model by selecting an appropriate set of PoS tags. We also need to somehow specify the model parameters P(S_j | S_i) and P(w_k | T_j). Once this is done, the model is completely determined.</Paragraph> <Paragraph position="22"> In particular, the only way that syntactic relations are modelled is by the probability of one PoS tag given the previous tag (or in the general N-gram case, given the previous N-1 tags). And just as in the case of word N-grams, the sparse-data problem sets severe bounds on N, effectively limiting it to about three.</Paragraph> <Paragraph position="23"> We conjecture that the explanation of why this model looks like it does is that it was imported wholesale from the field of speech recognition, and proved to allow fast, robust processing at accuracy levels that until recently were superior to, or on par with, those of hand-crafted rule-based approaches.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Stochastic Grammar Models </SectionTitle> <Paragraph position="0"> To gain more control over the syntactic relationships between the words, we turn to stochastic context-free grammars (SCFGs), originally proposed by Booth and Thompson (4). This is the framework in which we have already discussed the N-gram models, and it has been the starting point for many excursions into probabilistic-parsing land. A stochastic context-free grammar is really just a context-free grammar where each grammar rule has been assigned a probability. If we keep the left-hand-side (LHS) symbol of the rule fixed, and sum these probabilities over the different RHSs, we get one, since the probabilities are conditioned on the LHS symbol.</Paragraph> <Paragraph position="1"> The probability of a particular parse tree is the probability of its derivation, which in turn is the product of the probability of each derivation step.</Paragraph> <Paragraph position="2"> The probability of a derivation step is the probability of rewriting a given symbol using some grammar rule, and equals the rule probability. Thus, the parse-tree probability is the product of the rule probabilities.
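The following toy SCFG (rule probabilities invented for illustration) shows both properties just described: the expansions of each LHS form a probability distribution, and a parse tree's probability is the product of the probabilities of the rules used in its derivation.

```python
# A toy SCFG with invented probabilities; per LHS the rule probabilities sum
# to one, and a parse tree's probability is the product of its rules' probs.
from math import prod

rules = {  # (LHS, RHS) -> probability
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "Noun")): 0.7,
    ("NP", ("Noun",)): 0.3,
    ("VP", ("Verb", "NP")): 0.6,
    ("VP", ("Verb",)): 0.4,
}

lhs_totals = {}            # check: each LHS's expansions sum to one
for (lhs, _), p in rules.items():
    lhs_totals[lhs] = lhs_totals.get(lhs, 0.0) + p
assert all(abs(total - 1.0) < 1e-9 for total in lhs_totals.values())

def tree_probability(tree):
    """tree = (label, children); a node with no children is treated as a
    leaf/preterminal here (lexical rules are left out for brevity)."""
    label, children = tree
    if not children:
        return 1.0
    rhs = tuple(child[0] for child in children)
    return rules[(label, rhs)] * prod(tree_probability(c) for c in children)

tree = ("S", [("NP", [("Noun", [])]),
              ("VP", [("Verb", []), ("NP", [("Det", []), ("Noun", [])])])])
print(tree_probability(tree))  # 1.0 * 0.3 * 0.6 * 0.7
```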
Since the same parse tree can be derived in different ways by first rewriting some symbol and then another, or vice versa, we need to specify the order in which the nonterminal symbols of a sentential form are rewritten. We require that in each derivation step, the leftmost nonterminal is always rewritten, which yields us the leftmost derivation. This establishes a one-to-one correspondence between parse trees and derivations.</Paragraph> <Paragraph position="3"> We now have plenty of opportunity to incorporate linguistic theory into our model through the choice of syntactic categories and the selection of grammar rules. The probabilistic limitations of the model mirror the expressive power of context-free grammars, as the independence assumptions exactly match the compositionality assumptions. For this reason, there is an efficient algorithm for finding the most probable parse tree, or calculating the string probability under an SCFG. The algorithm is a variant of the Cocke-Kasami-Younger (CKY) algorithm (17), but can also be seen as an incarnation of a more general dynamic-programming scheme, and it is cubic in string length and grammar size. We conjecture that exactly the properties of SCFGs discussed in this paragraph explain why the model looks like it does.</Paragraph> <Paragraph position="4"> We again have the choice between training the model parameters, the rule probabilities, on annotated data, or using unannotated data and some reestimation method like the inside-outside algorithm, which is the natural generalization of the Baum-Welch method of the previous section. If the chances of finding a global optimum were slim using the Baum-Welch algorithm, they're virtually zero using the inside-outside algorithm. There is also considerable instability in terms of what set of rule probabilities one arrives at as a function of the initial assignment of rule probabilities in the reestimation process. The other option, training on annotated data, is also problematic, as there is precious little of it available, and what exists is quite noisy. A corpus of CFG-analysed sentences is known as a tree bank, and tree banks will be the topic of the next section.</Paragraph> <Paragraph position="5"> As we have been stressing, the key idea is to assign probabilities to derivation steps. If we instead look at the rightmost derivation in reverse, as constructed by an LR parser, we can take as the derivation probability the probability of the action sequence, i.e., the product of the probabilities of each shift and reduce action in it. This isn't exactly the same thing as an SCFG, since the probabilities are typically not conditioned on the LHS symbol of some grammar rule, but on the current internal state and the current lookahead symbol. As observed by Fernando Pereira (12), this gives us the possibility to throw in a few psycho-linguistic features such as right association and minimal attachment by preferring shift actions to reductions, and longer reductions to shorter ones, respectively. So if these features are present in language, they should show up in our training data, and thus in our language model. Whether these features are introduced or incidental is debatable.</Paragraph>
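A sketch of this probabilistic LR idea is given below; the parser states, lookahead symbols, actions and probabilities are all invented for illustration and do not come from any particular parser. The point is only that the derivation probability is the product of action probabilities conditioned on the internal state and the lookahead symbol.

```python
# Derivation probability as a product of shift/reduce action probabilities,
# each conditioned on (parser state, lookahead). All values are illustrative.
from math import prod

action_probs = {   # P(action | state, lookahead)
    (0, "Det"):  {"shift": 1.0},
    (1, "Noun"): {"shift": 0.9, ("reduce", "NP -> Det"): 0.1},
    (2, "Verb"): {("reduce", "NP -> Det Noun"): 0.8, "shift": 0.2},
}

def action_sequence_probability(steps):
    """steps: list of (state, lookahead, action) taken by the parser."""
    return prod(action_probs[(state, la)][action] for state, la, action in steps)

steps = [(0, "Det", "shift"),
         (1, "Noun", "shift"),
         (2, "Verb", ("reduce", "NP -> Det Noun"))]
print(action_sequence_probability(steps))  # 1.0 * 0.9 * 0.8
```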
<Paragraph position="6"> We can take the idea of derivational stochastic grammars one step further and claim that a parse tree constructed by any sequence of derivation actions, regardless of what the derivation actions are, should be assigned the product of the probabilities of each derivation step, appropriately conditioned.</Paragraph> <Paragraph position="7"> This idea will be crucial for the various extensions to SCFGs discussed in the next section.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Models Using Tree Banks </SectionTitle> <Paragraph position="0"> As previously mentioned, a tree bank is a corpus of CFG-annotated sentences, i.e., a collection of parse trees. The mere existence of a tree bank actually inspired a statistical language model, namely the data-oriented parsing (DOP) model (3) advocated by Remko Scha and Rens Bod. This model parses not only with the entire tree bank as its grammar, but with a grammar consisting of each subtree of each tree in the tree bank. One interesting consequence of this is that there will in general be many different leftmost derivations of any given parse tree. This can most easily be seen by noting that there is one leftmost derivation for each way of cutting up a parse tree into subtrees. Therefore, the parse probability is defined as the sum of the derivation probabilities, which is the source of the NP-hardness of finding the most probable parse tree for a given input sentence under this model, as demonstrated by Khalil Sima'an (15).</Paragraph> <Paragraph position="1"> There aren't really that many tree banks around, and by far the most popular one for experimenting with probabilistic parsing is the Penn Treebank (11). This leads us to the final source of influence on the linguistic theory employed in statistical language learning: the available training and testing data.</Paragraph> <Paragraph position="2"> The annotators of the Penn Treebank may have overrated the minimal-attachment principle, resulting in very flat rules with a minimum of recursion, and thus in very many rules. In fact, the Wall Street Journal portion of it consists of about a million words analysed using literally tens of thousands of distinct grammar rules. For example, there is one rule of the form NP → Det Noun (, Noun)^n Conj Noun for each value of n seen in the corpus. There is not even close to enough data to accurately estimate the probabilities of most rules seen in the training data, let alone to achieve any type of robustness for unseen rules. This inspired David Magerman and subsequently Michael Collins to instead generate the RHS dynamically during parsing.</Paragraph> <Paragraph position="3"> Magerman (10) grounded this in the idea that a parse tree is constructed by a sequence of generalized derivation actions and the derivation probability is the parse probability, a framework that is sometimes referred to as history-based parsing (2), at least when decision trees are employed to determine the probability of each derivation action taken. More specifically, to allow us to assemble the RHSs as we go along, any previously constructed syntactic constituent is assigned the role of the leftmost, rightmost, middle or single daughter of some other constituent with some probability. It may or may not also be the syntactic head of the other constituent, and here we have another piece of highly useful linguistic theory incorporated into a statistical language model: the grammatical notion of a syntactic head. The idea here is to propagate up the lexical head to use (amongst other things) lexical collocation statistics on the dependency level to determine the constituent boundaries and attachment preferences.</Paragraph>
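The following is only a schematic sketch of this history-based idea, not Magerman's actual model: the action names, the conditioning (here just the bare action history) and the probabilities are invented. In a real system the probability of each action would be estimated, for example, by decision trees over rich features of the partial parse.

```python
# History-based scoring, schematically: each derivation action (here, giving a
# constituent a daughter role) gets a probability conditioned on the history,
# and the parse probability is the product of these action probabilities.
from math import prod

def p_action(action, history):
    # Hypothetical stand-in for P(action | history); invented values.
    toy = {"leftmost-daughter-of-VP": 0.6, "rightmost-daughter-of-NP": 0.5,
           "single-daughter-of-NP": 0.2}
    return toy.get(action, 0.05)

def parse_probability(actions):
    history, probs = [], []
    for action in actions:
        probs.append(p_action(action, history))
        history.append(action)
    return prod(probs)

print(parse_probability(["leftmost-daughter-of-VP",
                         "rightmost-daughter-of-NP"]))  # 0.6 * 0.5
```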
<Paragraph position="4"> Collins (6; 7) followed up on these ideas and added further elegance to the scheme by instead generating the head daughter first, and then the rest of the daughters as two zero-order Markov processes, one going left and one going right from it. He also managed to adapt essentially the standard SCFG parsing scheme to his model, thus allowing polynomial processing time. It is interesting to note that although the conditioning of the probabilities is top-down, parsing is performed bottom-up, just as is the case with SCFGs. This allows him to condition his probabilities on the word string dominated by the constituent, which he does in terms of a distance between the head constituent and the current one being generated. This in turn makes it possible to let phrase-boundary indicators such as punctuation marks influence the probabilities, and gives the model the chance to infer preferences for, e.g., right association.</Paragraph> <Paragraph position="5"> In addition to this, Collins incorporated the notion of lexical complements and wh-movement à la Generalized Phrase-Structure Grammar (GPSG) (8) into his probabilistic language model. The former is done by knocking off complements from a hypothesised complement list as the Markov chain of the siblings of the head constituent is generated. The latter is achieved by adding hypothesised NP gaps to these lists, requiring that they be either matched against an NP on the complement list, or passed on to one of the sibling constituents or the head constituent itself, thus mimicking the behavior of the &quot;slash feature&quot; used in GPSG. The model learns the probabilities for these rather sophisticated derivation actions under various conditionings. Not bad for something that started out as a simple SCFG!</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 A Non-Derivational Model </SectionTitle> <Paragraph position="0"> The Constraint Grammar framework (9) introduced by Fred Karlsson and championed by Atro Voutilainen is a grammar formalism without derivations. It's not even constructive, but actually rather destructive. In fact, most of it is concerned with destroying hypotheses. Of course, you first have to have some hypotheses if you are going to destroy them, so there are a few components whose task it is to generate hypotheses. The first one is a lexicon, which assigns a set of possible morphological read-</Paragraph> </Section> </Paper>