<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1038">
  <Title>A State-Transition Grammar for Data-Oriented Parsing</Title>
  <Section position="1" start_page="0" end_page="276" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper presents a grammar formalism designed for use in data-oriented approaches to language processing. It goes on to investigate ways in which a corpus pre-parsed with this formalism may be processed to provide a probabilistic language model for use in the parsing of fresh texts.</Paragraph>
    <Paragraph position="1"> Introduction Recent years have seen a resurgence of interest in probabilistic techniques for automatic language analysis. In particular, there has arisen a distinct paradigm of processing on the basis of pre-analyzed data which has taken the name Data-Oriented Parsing.</Paragraph>
    <Paragraph position="2"> &amp;quot;Data Oriented Parsing (DOP) is a model where no abstract rules, but language experiences in the form of an analyzed corpus, constitute the basis for language processing.&amp;quot; 1 There is not space here to present full justification 1. for adopting such an approach or to detail the advantages that it offers. The main claim it makes is that effective language processing requires a consideration of both the structural and statistical as- 2. pects of language, whereas traditional competence grammars rely only on the former, and standard statistical techniques such as n-gram models only on the latter. DOP attempts to combine these two traditions and produce &amp;quot;performance grammars&amp;quot;, which: &amp;quot;... should not only contain information on the structural possibilities of the general 3.</Paragraph>
    <Paragraph position="3"> language system, but also on details of actual language use in a language community... ''2 *This research was funded by a research studentship from the ESRC. My thanks also for discussion and comments to Matt Crocker, Chris Brew, David Milward and Anna Babarczy.</Paragraph>
    <Paragraph position="4"> 1Bod, 1992. 4.</Paragraph>
    <Paragraph position="5"> 2ibid.</Paragraph>
    <Paragraph position="6"> This approach entails however that a corpus has first to be pre-analyzed (ie. hand-parsed), and the question immediately arises as to the formalism to be used for this. There is no lack of competing competence grammars available, but also no reason to expect that such grammars should be suited to a DOP approach, designed as they were to characterize the nature of linguistic competence rather than performance.</Paragraph>
    <Paragraph position="7"> The next section sets out some of the properties that we might require from such a &amp;quot;performance grammar&amp;quot; and offers a formalism which attempts to satisfy these requirements.</Paragraph>
    <Paragraph position="8"> A Formalism for DOP Given that we are attempting to construct a formalism that will do justice to both the statistical and structural aspects of language, the features that we would wish to maximize will include the following: The formalism should be easy to use with probabilistic processing techniques, ideally having a close correspondence to a simple probabilistic model such as a Markov process.</Paragraph>
    <Paragraph position="9"> The formalism should be fine-grained, ie. responsive to the behaviour of individual words (as n-gram models are). This suggests a radically lexiealist approach (cf. Karttunen, 1990) in which all rules are encoded in the lexicon, there being no phrase structure rules which do not introduce lexical items.</Paragraph>
    <Paragraph position="10"> It should be capable of capturing fully the linguistic intuitions of language users. In other words, using the formalism one should be able to characterize the structural regularities of language with at least the sophistication of modern competence grammars.</Paragraph>
    <Paragraph position="11"> As it is to be used with real data, the formalism should be able to characterize the wide range  of syntactic structures found in actual language use, including those normally excluded by competence grammars as belonging to the &amp;quot;periphery&amp;quot; of the language or as being &amp;quot;ungrammatical&amp;quot;. Ideally every interpretable utterance should have one and only one analysis for any interpretation of it.</Paragraph>
    <Paragraph position="12"> Considering the first of these points, namely a close relation to a simple probabilistic model, a good place to start the search might be with a right-branching finlte-state grammar. In this class of grammars every rule has the form A -4 a B (A,B E {non-terminals}, a E {terminals}) and all trees have the simple structure : a A A-- B-- C-- D-- b B</Paragraph>
    <Paragraph position="14"> a b c d d D (with an equivalent vertical alignment, henceforth to be used in this paper, on the right) In probabilistic terms, a finite-state grammar corresponds to a first-order Markov process, where given a sequence of states Si, Sj,... drawn from a finite set of possible states {So,..,S=} the probability of a particular state occurring depends solely on the identity of the previous state. In the finite-state grammar each word is associated with a transition between two categories, in the tree above 'a' with the transition A -4 B and so on. To calculate the probability that a string of words xl, x2, x3,.., xn has the parse represented by the string of category-states 81, $2, S3,...S=, we simply take the product of the probability of each transition: ie. l-h~l P(xi : si -4 si+l).</Paragraph>
    <Paragraph position="15"> In addition to satisfying our first criterion, a finite-state grammar also fulfills the requirement that the formalism be radically lexicalist, as by definition every rule introduces a lexical item.</Paragraph>
    <Paragraph position="16"> Accounting for Linguistic Structure If a finite-state grammar is chosen however, the third criterion, that of linguistic adequacy, seems to present an insurmountable stumbling block.</Paragraph>
    <Paragraph position="17"> How can such a simple formalism, in which syntax is reduced to a string of category-states, hope to capture even the basic hierarchical structure, the familiar &amp;quot;tree structure&amp;quot;, of linguistic expressions? Indeed, if the non-terminals are viewed as atomic categories then there is no way this can be done. If however, in line with most current theories, categories are taken to be bundles of features and crucially if one of these featflres has the value of a stack of categories, then this hierarchical structure can indeed be represented.</Paragraph>
    <Paragraph position="18"> Using the notation A \[B\] to represent a state of basic category A carrying a category B on its stack, the hierarchical structure of the sentence:  Intuitively, syntactic links between non-adjacent words, impossible in a standard finite-state grammar, are here established by passing categories along on the stack &amp;quot;through&amp;quot; the state of intervening words. That such a formalism can fully capture basic linguistic structures is confirmed by the proof in Aho (1968) that an indexed grammar (ie. one where categories are supplemented with a stack of unbounded length, as above), if restricted to right linear trees (also as above), is equivalent to a context-free grammar.</Paragraph>
    <Paragraph position="19"> A perusal of the state transitions associated with individual words in (la) reveals an obvious relationship to the &amp;quot;types&amp;quot; of categorial grammar. Using a to represent a list of categories (possibly null), we arrive at the following transitions (with their corresponding categorial types alongside).</Paragraph>
    <Paragraph position="20"> The ditransitive verb 'gave' is  The common nouns are all: N \[a\] -4 a N In fact as no intermediate constituents are formed in the analysis, an even closer parallel is to a dependency syntax where only rightward pointing arrows are allowed, of which the formalism as presented above is a notational variant. This lack of  intermediate constituents has the added benefit that no &amp;quot;spurious ambiguities&amp;quot; can arise. Knowing now that the addition of a stack-valued feature suffices to capture the basic hierarchical structure of language, additional features can be used to deal with other syntactic relations.</Paragraph>
    <Paragraph position="21"> For example, following the example of GPSG, unbounded dependencies can be captured using &amp;quot;slashed&amp;quot; categories. If we represent a &amp;quot;slashed&amp;quot; category X with the lower case x, and use the notation A(b) for a category A carrying a feature b, then the topicalized sentence: (2) This bone the man gave the puppy.</Paragraph>
    <Paragraph position="22"> will have the analysis:  Although there is no space in this paper to go into greater detail, further constructions involving unbounded dependency and complement control phenomena can be captured in similar ways.</Paragraph>
    <Paragraph position="23"> Coverage The criterion that remains to be satisfied is that of width of coverage: can the formalism cope with the many &amp;quot;peripheral&amp;quot; structures found in real written and spoken texts? As it stands the formalism is weakly equivalent to a context-free grammar and as such will have problems dealing with phenomena like discontinuous constituents, non-constituent coordination and gapping. Fortunately if extensions are made to the formalism, necessarily taking it outside weak equivalence to a context-free grammar, natural and general analyses present themselves for such constructions. Two of these will now be sketched.</Paragraph>
    <Paragraph position="24">  Consider the pair of sentences (3) and (4), identical in interpretation, but the latter containing a discontinuous noun phrase and the former not:</Paragraph>
    <Paragraph position="26"> The only transition in (4a) that differs from that of the corresponding word in the 'core' variant (3a) is that of 'dog' which has the respective transitions: null</Paragraph>
    <Paragraph position="28"> Both nouns introduce a relative clause modifier S(rel), the difference being that in the discontinuous variant a category has been taken off the stack at the same time as the modifier has been placed on the stack. It has been assumed so far that we are using a right-linear indexed grammar, but such a rule is expressly disallowed in an indexed grammar and so allowing transitions of this kind ends the formalism's weak equivalence to the context-free grammars.</Paragraph>
    <Paragraph position="29"> Of course, having allowed such crossed dependencies, there is nothing in the formalism itself that will disallow a similar analysis for a discontinuity unacceptable in English such as: (5) I saw a yesterday dog.</Paragraph>
    <Paragraph position="30"> This does not present a problem, however, as in DOP it is information in the parsed corpus which determines the structures that are possible. There is no need to explicitly rule out (5), as the transition NP \[hi --+ a \[N\] will be vanishingly rare in any corpus of even the most garbled speech, while the transition N \[hi --+ a \[S(rel)\] is commonly met with in both written and spoken English.</Paragraph>
    <Paragraph position="31">  Instead of a typical transition for 'gnawed' of VP -+ NP, we have a transition introducing a coordinated VP: VP -4 NP \[VP(+)\] In general for any transition X -4 Y , where X is a category and Y a list of categories (possibly empty), there will be a transition introducing coordination: X -4 Y IX(+)\] Non-constituent coordinations such as (7) present serious problems for phrase-structure approaches: (7) Fido had a bone yesterday and biscuit today.</Paragraph>
    <Paragraph position="32"> However if we generalize the schema already obtained for standard coordination by allowing X to be not only a single category, but a list of categories ~, it is found to suffice for non-constituent coordination as well.</Paragraph>
    <Paragraph position="33">  In this analysis instead of a regular transition for 'bone' of: N \[NP(t)\] -4 NP(t) \[\] there is instead a transition introducing coordination: N \[NP(t)\] -4 NP(t) \[N(+) \[NP(t)\]\] Allowing categories on the stack to themselves have non-empty stacks moves the formalism one step further from being an indexed grammar. This is the final incarnation of the formalism, being the State-Transition Grammar of the title 6.</Paragraph>
    <Paragraph position="34"> Similar schemas are being investigated to characterize gapping constructions.</Paragraph>
    <Paragraph position="35"> Centre-Embedding It should be noted that an indefinite amount of centre-embedding can be described, but only  list, eg. &amp;quot;I gave Fido a biscuit yesterday in the house and Rover a bone today in his kennel.&amp;quot; fiMilward (1990) introduces a formalism essentially identical to the one presented here, although viewed from a very different perspective. Milward (1994) shows how it handles a wide range of non-constituent co-ordinations. at the expense of unlimited' growth in the length of states:  As the model is to be trained from real data, transitions involving long states as in (8) will have an ever smaller and eventually effectively nil probability. Therefore, when tuned to any particular language corpus the resulting grammar will be effectively finite-state r.</Paragraph>
    <Section position="1" start_page="273" end_page="275" type="sub_section">
      <SectionTitle>
Parsing
</SectionTitle>
      <Paragraph position="0"> Assuming that we now have a corpus parsed with the state-transition grammar, how can this information be used to parse fresh text? Firstly, for each word type in the corpus we can collect the transitions with which it occurs and calculate its probability distribution over all possible transitions (an infinite number of which will be zero). To make this concrete, there are five tokens of the word 'dog' in the examples thus far, and so 'dog' will have the transition probability distribution:</Paragraph>
      <Paragraph position="2"> rThis may be compared to the claim in Krauwer &amp; Des Tombes (1981) that finite-state automata offer a more satisfactory characterization of language than context-free grammars.</Paragraph>
      <Paragraph position="4"> To find the most probable parse for a sentence, we simply find the path from word to word which maximizes the product of the state transitions (as we have a first order Markov process).</Paragraph>
      <Paragraph position="5"> However this simple-minded approach, although easy to implement, in other ways leaves much to be desired. The probability distributions are far too &amp;quot;gappy&amp;quot; and even if a huge amount of data were collected, the chances that they would provide the desired path for a sentence of any reasonable length are slim. The process of generalizing or smoothing the transition probabilities is therefore seen to be indispensable.</Paragraph>
    </Section>
    <Section position="2" start_page="275" end_page="275" type="sub_section">
      <SectionTitle>
Smoothing Probability Distributions
</SectionTitle>
      <Paragraph position="0"> Although far from exhausting the possible methods for smoothing, the following three are those used in the implementation described at the end of the paper.</Paragraph>
      <Paragraph position="1"> 1. Factor out elements on the stack which are merely carried over from state to state (which was done earlier in looking at the correspondence of state transitions to categorial types). The previous transitions for 'dog' then become:</Paragraph>
      <Paragraph position="3"> 2. Factor out other features which are merely passed from state to state. For instance in the example sentences, 'the' has the generalized transitions: null s \[~\] ~ N \[VP,~\] S(np) \[a\] --4 N \[VP(np),a\] which can be further generalized to the single transition: S(fl) \[a\] -~ N \[VP(j3),a\] /3 = set of features Words hitherto unknown to the system can be treated as being extreme examples of words lacking sufficient transition data and they might then be given a transition distribution blended from the open class word paradigms.</Paragraph>
    </Section>
    <Section position="3" start_page="275" end_page="276" type="sub_section">
      <SectionTitle>
Problems Arising from Smoothing
</SectionTitle>
      <Paragraph position="0"> Although essential for effective processing, the smoothing operations may give rise to new problems. For example, factoring out items on the stack, as in (1), removes from the model the disinclination for long states inherent in the original corpus. To recapture this discarded aspect of the language, it would be sufficient to introduce into the model a probabilistic penalty based on state length. This penalty may easily be calculated according to the lengths of states in the parsed corpus. null Not only would this allow the modelling of the restriction on centre-embedding, but it would also allow many other &amp;quot;processing&amp;quot; phenomena to be accurately characterized. Taking as an example &amp;quot;heavy-NP shift&amp;quot;, suppose that the corpus contained two distinct transitions for the word 'threw', with the particle 'out' both before and after the object.</Paragraph>
      <Paragraph position="1"> threw VP ~ NP, X(out) prob: pl VP --+ X(out), NP prob: p2 Even if pl were considerably greater than p2, the cumulative negative effect of the longer states in (10) would eventually lead to the model giving the sentence with the shifted NP (11) a higher probability.</Paragraph>
      <Paragraph position="2"> 3. Establish word paradigms, ie. classes of words which occur with similar transitions. The prob- I ability distribution for individual words can then threw be smoothed by suitably blending in the paradig- out matic distribution. These paradigms will corre- the spond to a great extent to the word classes of (11) bacon rule-based grammars. The advantage would be re- that tained however that the system is still fine-grained Fido enough to reflect the idiosyncratic patterns of in- had dividual words and could override this paradig- chewed matic information if sufficient data were available.</Paragraph>
    </Section>
    <Section position="4" start_page="276" end_page="276" type="sub_section">
      <SectionTitle>
Capturing Lexical Preferences
</SectionTitle>
      <Paragraph position="0"> One strength of n-gram models is that they can capture a certain amount of lexical preference information. For example, in a bigram model trained on sufficient data the probability of the bigram 'dog barked' could be expected to be significantly higher than 'cat barked', and this slice of &amp;quot;world knowledge&amp;quot; is something our model lacks. It would not be difficult to make a small extension to the present model to capture such information, namely by introducing an additional feature containing the &amp;quot;lexical value&amp;quot; of the head of a phrase. Abandoning the shorthand 'VP' and representing a subject explicitly as a &amp;quot;slashed&amp;quot; NP, a sentence with added lexical head features would appear as:  In contrast to n-grams, where this sentence would cloud somewhat the &amp;quot;world knowledge&amp;quot;, containing as it does the bigram 'cat barked', the added structure of our model allows the lexical preference to be captured no matter how far the head noun is from the head verb. From (12) the world knowledge of the system would be reinforced by the two stereotypical transitions:</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="5" start_page="276" end_page="276" type="sub_section">
      <SectionTitle>
Present Implementation
</SectionTitle>
      <Paragraph position="0"> 16,000+ running words from section N of the Brown corpus (texts N01-N08) were hand-parsed using the state-transition grammar. The actual formalism used was much fuller than the rather schematic one given above, including many additional features such as case, tense, person and number. Transition probabilities were generalized in the ways discussed in the previous section.</Paragraph>
    </Section>
    <Section position="6" start_page="276" end_page="276" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> 100 sentences of less than 15 words were chosen randomly from other texts in section N of the Brown corpus (N09-N14) and fed to the parser without alteration. Unknown words in the input, of which there were obviously many, were assigned to one of seven orthographic classes and given appropriate transitions calculated from the corpus.</Paragraph>
      <Paragraph position="1"> * 27 were parsed correctly, ie. exactly the same as the hand parse or differing in only relatively insignificant ways which the model could not hope to know s .</Paragraph>
      <Paragraph position="2"> * 23 were parsed wrongly, ie. the analysis differed from the hand parse in some non-trivial way.</Paragraph>
      <Paragraph position="3"> * 50 were not parsed at all, ie. one or more of the transitions necessary to find a parse path was lacking, even after generalizing the transitions.</Paragraph>
    </Section>
    <Section position="7" start_page="276" end_page="276" type="sub_section">
      <SectionTitle>
Future Development
</SectionTitle>
      <Paragraph position="0"> Although the results at present are extremely modest, it should be borne in mind both that the amount of data the system has to work on is very small and that the smoothing of transition probabilities is still far from optimal. The present target is to achieve such a level of performance that the corpus can be extended by hand-correction of the parser output, rather than hand-parsing from scratch. Not only will this hopefully save a certain amount of drudgery, it should also help to minimize .errors and maintain consistency.</Paragraph>
      <Paragraph position="1"> A more distant goal is to ascertain whether the performance of the model can improve after parsing new texts and processing the data therein even without hand-correction of the parses, and if so what the limits are to such &amp;quot;self-improvement&amp;quot;.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>