<?xml version="1.0" standalone="yes"?> <Paper uid="E85-1024"> <Title>A PROBABILISTIC PARSER</Title> <Section position="3" start_page="166" end_page="166" type="metho"> <SectionTitle> INPUT TO THE ANALYSIS SYSTEM </SectionTitle> <Paragraph position="0"> The input to the analysis system is essentially the output from the tagging system described above. An example of this is given in figure 1.</Paragraph> <Paragraph position="1"> Each line of the tagged LOB corpus contains one word or punctuation mark, and each sentence is separated from the preceding one by the sentence initial marker, here represented by a horizontal line. Each line consists of three main fields: a reference number specifying the genre, text number, line number, and position within the line; the word or punctuation mark itself; and the correct tag. The tags are taken from a set of 134 tags, based on the Brown tagset (Greene and Rubin 1971), but modified where we felt it was desirable.</Paragraph> </Section> <Section position="4" start_page="166" end_page="166" type="metho"> <SectionTitle> OUTPUT FROM THE ANALYSIS SYSTEM </SectionTitle> <Paragraph position="0"> Typical output from the analysis system would look like figure 2.</Paragraph> <Paragraph position="1"> The field on the right is meant to represent a typical parse tree, but in a columnar form. Each constituent is represented by an upper case letter; thus S is the sentence, N is a noun phrase, and F indicates a subordinate clause. The upper case letter may be followed by one or more lower case letters, indicating features of interest in the constituent; thus Fn indicates a nominal clause. 
The boundaries of a constituent are given by open and close square brackets, so that for instance the subordinate clause indicated by Fn starts at the word &quot;that&quot; and ends at the word &quot;conference&quot;.</Paragraph> </Section> <Section position="5" start_page="166" end_page="167" type="metho"> <SectionTitle> STAGE ONE - ASSIGNMENT </SectionTitle> <Paragraph position="0"> It is clear that a tag, or a pair of consecutive tags, is partially diagnostic of the beginning, continuation or termination of a constituent. Thus, for example, the pair &quot;noun-verb&quot; tends to indicate the end of a noun phrase and the beginning of a verb phrase, and the pair &quot;noun-noun&quot; tends to indicate the continuation of a noun phrase. The first step in the syntactic analysis is therefore to deduce from the sequence of tags a tentative sequence of markings for the type and boundaries of the constituents. Since the beginnings of constituents tend to be marked, but not the ends, this sequence of markings will tend to omit many of the right-hand or closing brackets, and these are inserted at a later stage.</Paragraph> <Paragraph position="1"> The first stage of parsing is therefore to look up each (tag, tag) pair in a dictionary, and this results in one or more possible sequences of open and close brackets and constituent markings - each of these sequences is, for historical reasons, called a &quot;T-tag&quot;. A T-tag consists of a left-hand and a right-hand part. The left-hand part consists of an indication of what constituent should be current (i.e. at the top of the stack of open constituents) at this stage, perhaps followed by one or more closing brackets. The right-hand part normally consists of an indication that one or more new constituents should be opened, that some particular constituent should be continued, or, more rarely, that a new constituent of as yet unspecified type should be opened (and this will be deduced later on in the analysis process). 
Thus the tag pair &quot;noun followed by subordinating conjunction&quot; indicates two possible T-tags, either &quot;Y] [F&quot; or &quot;Y [F&quot;. The first means close the current constituent whatever it is (Y matches any constituent) and open a new subordinate clause (F) constituent, while the second means continue the current constituent and open an F constituent. The look-up procedure as described above requires a dictionary entry for each possible pair of tags, which is inefficient and difficult to relate to meaningful linguistic categories.</Paragraph> <Paragraph position="2"> Instead the 134 tags are subsumed in a set of 33 &quot;cover symbols&quot; (the term is taken from the Brown tagging system). Thus all the different forms of noun word tag are subsumed in the cover symbols N* (singular noun), *S (plural noun) and *$ (noun with genitive marker).</Paragraph> <Paragraph position="3"> The required tag-pair dictionary will therefore require only an entry for each cover-symbol pair (together with a list of exceptions, where the tag rather than the cover symbol is diagnostic of the appropriate T-tags). A further simplification is that in many cases (because of the admissibility of the &quot;wild&quot; constituent marker Y) the first tag of the pair is irrelevant and the second tag in the pair determines the set of T-tag options.</Paragraph> <Paragraph position="4"> I said that the T-tag dictionary look-up would often result in more than one possible T-tag, rather than just one. 
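The stage-one look-up can be sketched as follows. This is an illustrative reconstruction, not the authors' program: the cover-symbol mapping, the exception list, and every dictionary entry here are invented for the example (including the use of the LOB-style tags NN, NNS, VB and CS).

```python
# Hypothetical mapping from word tags to cover symbols.
COVER = {"NN": "N*", "NNS": "*S", "VB": "V*", "CS": "F*"}

# Exceptions keyed on the raw tag pair take precedence over cover symbols.
EXCEPTIONS = {}

# T-tag options per cover-symbol pair, listed in roughly decreasing
# likelihood; "Y" matches whatever constituent is currently open.
TTAG_DICT = {
    ("N*", "V*"): ["Y] [V", "Y [V"],   # noun-verb: usually close N, open V
    ("N*", "N*"): ["Y"],               # noun-noun: continue current constituent
    ("N*", "F*"): ["Y] [F", "Y [F"],   # noun-conjunction: close or continue, open F
}

def ttag_options(tag1, tag2):
    """Return the candidate T-tags for a pair of consecutive word tags."""
    if (tag1, tag2) in EXCEPTIONS:
        return EXCEPTIONS[(tag1, tag2)]
    pair = (COVER.get(tag1, tag1), COVER.get(tag2, tag2))
    return TTAG_DICT.get(pair, [])

# Hypothetical noun + subordinating conjunction: two options survive
# to later disambiguation.
print(ttag_options("NN", "CS"))
```

Note how the look-up routinely yields more than one candidate, which is exactly the ambiguity the later stages must resolve.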
Some of these options can be eliminated immediately by matching the current constituent with the putative extension, but others need to be retained for later disambiguation.</Paragraph> </Section> <Section position="6" start_page="167" end_page="167" type="metho"> <SectionTitle> CONSTRUCTING THE T-TAG DICTIONARY </SectionTitle> <Paragraph position="0"> The original version of the T-tag dictionary was generated using linguistic intuition.</Paragraph> <Paragraph position="1"> If there are several possible T-tags to an entry, they are given in approximately decreasing likelihood and rare T-tags are marked as such.</Paragraph> <Paragraph position="2"> The treebank of manually parsed sentences can now be used to extract information about what constituent types and boundaries are associated with what pairs of tags. We have therefore written a program which takes a current version of the T-tag dictionary and a set of parsed sentences, and generates: (a) information about putative exceptions to the current T-tag dictionary, in the form of cases where the effective T-tag in the parsed sentence is not among those proposed by the T-tag dictionary, and (b) where the effective T-tag is among those proposed by the T-tag dictionary, statistics as to the differential probabilities of the various T-tags associated with a particular tag-pair.</Paragraph> <Paragraph position="3"> The first set of information is used to guide the intuition of a linguist in deciding how to modify the original T-tag table. This cannot (at least at present) be done automatically, since there are various unsystematic differences between the T-tag as looked up in the dictionary and the sequence of constituent types and boundaries as they appear in the parsed sentences. 
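The two kinds of information gathered from the treebank can be sketched as below. Again this is only an illustration of the idea, not the Lancaster program; the dictionary entry and the observed T-tags are invented.

```python
from collections import Counter, defaultdict

# Hypothetical current T-tag dictionary (cover-symbol pair -> options).
TTAG_DICT = {("N*", "V*"): ["Y] [V", "Y [V"]}

def gather(observations):
    """observations: iterable of (cover_pair, effective_ttag) extracted
    from manually parsed sentences.  Returns (a) putative exceptions not
    predicted by the dictionary, and (b) frequency counts for the
    predicted options, from which differential probabilities follow."""
    exceptions = []
    freqs = defaultdict(Counter)
    for pair, ttag in observations:
        if ttag in TTAG_DICT.get(pair, []):
            freqs[pair][ttag] += 1          # (b) statistics
        else:
            exceptions.append((pair, ttag)) # (a) shown to the linguist
    return exceptions, freqs

obs = [(("N*", "V*"), "Y] [V")] * 3 + \
      [(("N*", "V*"), "Y [V"), (("N*", "V*"), "Y] ] [V")]
exc, freqs = gather(obs)
print(exc)                                  # the unpredicted T-tag
print(freqs[("N*", "V*")].most_common())    # ordering evidence for the entry
```

The exception list feeds the linguist's revision of the table, while the counts support the ordering and rarity marking of the options.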
We are thus using information from the parsed corpus texts to generate improved versions of the T-tag dictionary.</Paragraph> <Paragraph position="4"> The frequency information about the optional T-tags associated with a particular tag-pair is not at present used by the analysis system, but we feel that it may be a further factor to be taken into account when deciding on a preferred parse in the third stage of analysis. The information is of course being used to refine linguistic intuition about the ordering of possible T-tags in the dictionary and their marking for rarity.</Paragraph> </Section> <Section position="7" start_page="167" end_page="168" type="metho"> <SectionTitle> STAGE THREE - TREE-CLOSING </SectionTitle> <Paragraph position="0"> The output from the first stage consists of indications of a number of constituents and where they begin, but in many cases the ending position of a constituent is unknown, or at least is located ambiguously at one of several positions. The main task of the third stage is to insert these constituent closures. There is a further stage between T-tag assignment and tree-closure which we will return to in a later section.</Paragraph> <Paragraph position="1"> The third stage proceeds as follows. A backward search is made from the end of the sentence to find a position at which choices and/or decisions have to be made. At the first such point the alternative trees are constructed and then all unclosed constituents are completed, by means of likelihood calculations based on the database of probabilities. To effect closure, the last unclosed constituent is selected and a subtree data structure is created to represent this constituent. The parser then attempts to attach to it as daughters any items (word-classes or constituents) lying positionally below it. 
As a consequence of each successive attachment there exists a distinct mother-daughter sequence pattern, the probability of which can be extracted from the mother-daughter table derived from the treebank (the parser will not attempt to build subtrees with probabilities below a certain threshold). If a sequence of constituents is attached as daughters, then any remaining constituents lying below the last attached daughter are attached to the subtree as sisters. Thus the constituent is closed in all statistically possible ways, and the parser is once again positioned at the end of the sentence.</Paragraph> <Paragraph position="2"> The parser again selects the next unclosed constituent, this time passing over the newly closed constituent (which is now represented as a subtree), and it proceeds to close the new constituent in the manner described above.</Paragraph> <Paragraph position="3"> However, when attaching as daughter or sister the newly closed constituent from the previous selection, it attaches a set of subtrees that represents all its possible closure patterns.</Paragraph> <Paragraph position="4"> This process is repeated until the top level is reached. If the head of the sentence has been reached, then many sub-trees are discarded because at this level all other constituents must be daughters and not sisters. If more than one tree is to be completed from a choice, then this process is repeated until all the alternative trees have been closed.</Paragraph> </Section> <Section position="8" start_page="168" end_page="168" type="metho"> <SectionTitle> STATISTICS FOR THE MOTHER-DAUGHTER SEQUENCES </SectionTitle> <Paragraph position="0"> The main problem is how to store the frequency information on possible daughter sequences for each mother constituent. Originally the manually parsed sentences collected in the treebank were decomposed into a mother constituent and each of its daughter sequences in its entirety. 
So for a mother constituent N (noun phrase) a possible daughter sequence is &quot;ATI, JJ, NNS, Fr&quot; (i.e. determiner, adjective, plural noun, subordinate clause).</Paragraph> <Paragraph position="1"> The main problem with this is that, for all but the most common daughter sequences, the statistics were too dependent on exactly which sentences had occurred. This also implies that the parser has to match very specific patterns when a subtree is being investigated.</Paragraph> <Paragraph position="2"> To produce statistical tables of sufficient generality, each daughter sequence was decomposed into its individual pairs of elements (each daughter sequence in its entirety having implied opening and closing delimiters, represented by the symbols '[' and ']' respectively) and all like pairs were added together. The frequency information now consists of the mother constituent and a set of daughter pairs.</Paragraph> <Paragraph position="3"> Now, for the parser to assess the probability of any daughter sequence, this sequence has first to be decomposed into pairs, which are looked up in the mother-daughter table, and the probabilities of the pairs aggregated together to give the overall probability of the complete sequence. For the sequence described above the individual pairs would be &quot;[ ATI, ATI JJ, JJ NNS, NNS Fr, Fr ]&quot;. 
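The pair-decomposition calculation can be sketched as follows. The pair probabilities in the table are invented for illustration (the real values come from the treebank), and the threshold value is likewise an assumption; only the mechanism of bracketing, splitting into adjacent pairs, and multiplying is taken from the text.

```python
# Hypothetical mother-daughter pair table for mother constituent N,
# giving P(pair) for adjacent daughters; '[' and ']' are the implied
# opening and closing delimiters of the whole sequence.
PAIR_PROB = {
    ("[", "ATI"): 0.30, ("ATI", "JJ"): 0.20, ("JJ", "NNS"): 0.25,
    ("NNS", "Fr"): 0.05, ("Fr", "]"): 0.40,
}

THRESHOLD = 1e-5  # illustrative: subtrees below this probability are abandoned

def sequence_probability(daughters, table=PAIR_PROB):
    """Aggregate pair probabilities over a bracketed daughter sequence."""
    seq = ["["] + list(daughters) + ["]"]
    prob = 1.0
    for pair in zip(seq, seq[1:]):
        prob *= table.get(pair, 0.0)  # an unseen pair kills the sequence
    return prob

# The sequence from the text: [ ATI, ATI JJ, JJ NNS, NNS Fr, Fr ]
p = sequence_probability(["ATI", "JJ", "NNS", "Fr"])
print(p, p >= THRESHOLD)
```

One consequence visible in the sketch is the weakness discussed next: multiplying independent pair probabilities ignores dependencies that span more than two daughters.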
It seems clear that in some cases the aggregation of the probabilities of two or more pairs does not give a reasonable approximation to the original statistics, because of longer-distance dependencies. It is likely therefore that this technique will need a dictionary of pairs together with a dictionary of exceptional triples, quadruples, etc., to correct the pairs dictionary where necessary.</Paragraph> </Section> <Section position="9" start_page="168" end_page="168" type="metho"> <SectionTitle> STAGE TWO - HEURISTICS </SectionTitle> <Paragraph position="0"> The first stage of T-tag assignment introduces constituent types and boundary markings only if they can be expressed in terms of look-up in a dictionary of tag-pairs. However there are a number of cases where a more complex form of processing seems desirable, in order to produce a more suitable partial parse to be fed to the third stage. We are therefore designing a second stage, analogous to the second stage of the tagging system, which is able to look for various patterns of tags and the constituent markings already assigned by the first stage, and then add to or modify the constituent markings passed to the third stage; an area where this will be important is in coordinated structures.</Paragraph> <Paragraph position="1"> I have suggested in the above that the parsing system is constructed as three separate stages, which pass their output to the next stage.</Paragraph> <Paragraph position="2"> In fact this is mainly for expository and developmental reasons, and we envisage an interconnection between at least some of the stages, so that earlier stages may be able to take account of information provided by later stages.</Paragraph> </Section> <Section position="10" start_page="168" end_page="168" type="metho"> <SectionTitle> PROBLEMS AND CONCLUSIONS </SectionTitle> <Paragraph position="0"> I have described the basic structure of the parsing system that we are currently developing at Lancaster. 
There are of course a number of areas where the techniques described will need to be extended to take account of linguistic structures not provided for. But our technique with the tagging project was to develop basic mechanisms to cope with a large portion of the texts being processed, and then to modify them to perform more accurately in particular areas where they were deficient, and we expect to follow this procedure with the current project.</Paragraph> <Paragraph position="1"> The two main features of the technique we are using seem to be (a) the use of probabilistic methods for disambiguation of linguistic structures, and (b) the use of a corpus of unconstrained English text as a testbed for our methods, as a source of information about the statistical properties of language, and as an indicator of what are the important areas of inadequacy in each stage of the analysis system.</Paragraph> <Paragraph position="2"> Because of the success of these techniques in the tagging system, and because of the promising results already achieved in applying these techniques to the syntactic analysis of a number of simple sentences, we have every hope of being able to develop a robust and economic parsing system able to operate over unconstrained English text with a high degree of accuracy.</Paragraph> </Section> </Paper>