File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/w94-0112_metho.xml
Size: 12,563 bytes
Last Modified: 2025-10-06 14:13:55
<?xml version="1.0" standalone="yes"?> <Paper uid="W94-0112"> <Title>Bootstrapping Statistical Processing into a Rule-based Natural Language Parser</Title> <Section position="3" start_page="97" end_page="99" type="metho"> <SectionTitle> 2 The Bootstrapping Method </SectionTitle>
<Paragraph position="0"> We use a broad-coverage, rule-based, bottom-up chart parser as the basis for this work. It utilizes the Microsoft English Grammar (MEG), which is a set of augmented phrase structure grammar rules containing conditions designed to eliminate many potential, but less-preferred, parses. It seeks to produce a single approximate syntactic parse for each input, although it may also produce multiple parses or even a &quot;fitted&quot; parse in the event that a well-formed parse is not obtained. The &quot;approximate&quot; nature of a parse is exemplified by the packing of many attachment ambiguities, where phrases often default to simple right attachment and a notation is made for further processing to resolve the ambiguity at a later point in the NLP system.</Paragraph>
<Paragraph position="1"> The bootstrapping method begins by using the rule-based parser to parse a large corpus of untagged NL text. During parsing, frequencies that will be used to compute rule and part-of-speech probabilities are obtained. For rule probabilities, these frequencies in their simplest form include the number of times that each rule r creates a node n_r in a well-formed parse tree and the total number of times that r was attempted (i.e., the sequence of constituents c_1, ..., c_m that triggers r occurred in the chart and r's conditions were evaluated relative to those constituents). At the end of parsing the corpus, the former frequency is divided by the latter frequency to obtain the probability for each rule, as given in Figure 1 below. The reason for using the denominator as given rather than the number of times c_1, ..., c_m occurs below n_r in a parse tree is that it adjusts for the conditions on rules contained in MEG, which may allow many such sequences of constituents to occur in the chart, but only very few of them to occur in the final parse tree. In this case, the probability of a rule might be skewed in favor of trying it more often than it should be, unless the denominator were based on constituents in the chart vs. in the parse tree.</Paragraph>
<Paragraph position="2"> Figure 1: P(r | c_1, ..., c_m) = (# times n_r occurs in trees) / (# times c_1, ..., c_m occur in chart)
For part-of-speech probabilities, the frequencies obtained during parsing include the number of times a word w occurs having a particular part-of-speech p in a well-formed parse tree and the total number of times that w occurs. Once again, at the end of parsing, the former frequency is divided by the latter to obtain the simple probability that a word will occur with a particular part of speech, as given in Figure 2.</Paragraph>
<Paragraph position="3"> Figure 2: P(p | w) = (# times w occurs having p in trees) / (# times w occurs in trees)
Since the choice was made to use the denominator for rule probabilities given above, the part-of-speech probabilities must be normalized so that the two sets of probabilities are compatible and may be used together during the probabilistic algorithm described below. The normalization is achieved by multiplying each part-of-speech probability by the ratio of the average probability of all the rules over the average probability of all the parts of speech for all the words.
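To make the computation concrete, the following Python sketch shows how the Figure 1 and Figure 2 probabilities and the normalization factor could be derived from frequencies gathered while parsing the corpus. The count tables and function names here are hypothetical illustrations, not part of the system described in the paper.

```python
from collections import defaultdict

# Hypothetical frequency tables accumulated while parsing the training corpus.
rule_node_in_tree = defaultdict(int)   # times rule r's node n_r appears in a well-formed parse tree
rule_seq_in_chart = defaultdict(int)   # times r's triggering sequence c_1..c_m appears in the chart
word_pos_in_tree = defaultdict(int)    # times word w appears with part of speech p in a parse tree
word_in_tree = defaultdict(int)        # times word w appears in a parse tree

def rule_probability(rule):
    """Figure 1: tree count for n_r divided by chart count for c_1..c_m."""
    attempts = rule_seq_in_chart[rule]
    return rule_node_in_tree[rule] / attempts if attempts else 0.0

def pos_probability(word, pos):
    """Figure 2: count of (word, pos) in trees divided by count of word in trees."""
    occurrences = word_in_tree[word]
    return word_pos_in_tree[(word, pos)] / occurrences if occurrences else 0.0

def normalize_pos_probabilities(rules, word_pos_pairs):
    """Scale every part-of-speech probability by
    (average rule probability) / (average part-of-speech probability),
    bringing the two sets of probabilities into the same range."""
    rule_probs = [rule_probability(r) for r in rules]
    pos_probs = {pair: pos_probability(*pair) for pair in word_pos_pairs}
    scale = (sum(rule_probs) / len(rule_probs)) / (sum(pos_probs.values()) / len(pos_probs))
    return {pair: p * scale for pair, p in pos_probs.items()}
```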
This effectively lowers the part-of-speech probabilities into the same range as the rule probabilities, so that as the probabilistic algorithm proceeds, it will try lower probability parts of speech for words at a consistent point relative to the application of lower probability rules.</Paragraph>
<Paragraph position="4"> After computing and normalizing the probabilities, they are incorporated into the same rule-based parser used to compute them. The parser is guided by these probabilities, while parsing any new input, to seek the most probable path through the parse search space, instead of taking the &quot;all-paths&quot; breadth-first approach it took when parsing without the use of the probabilities. A simplified description of the chart parsing algorithm as guided by probabilities is given in Figure 3 below. The term record used in the algorithm may be likened to an edge in traditional chart parsing terminology. A part-of-speech record refers to an edge representing one (of possibly many) of the parts of speech for a given word. A list (PLIST below) of potential rule applications and part-of-speech records, sorted by probability in descending order (i.e., highest probability first), is maintained throughout the execution of the algorithm. The next most probable potential rule application or part-of-speech record is always located at the top of PLIST.</Paragraph>
<Paragraph position="5"> 1. Put all of the part-of-speech records for each word in the input into PLIST, forcing the probability of the highest probability part-of-speech record for each word to 1 (ensuring that at least one part-of-speech record for each word will be put into the chart immediately).</Paragraph>
<Paragraph position="6"> 2. Process the next most probable item in PLIST: a. If it is a potential rule application, remove it from PLIST and try the rule. If the rule succeeds, add a record representing a new sub-tree to the chart.</Paragraph>
<Paragraph position="7"> b. Otherwise, if it is a part-of-speech record, remove it from PLIST and add it directly to the chart.</Paragraph>
<Paragraph position="8"> 3. If a record was added to the chart in step 2, identify all new potential rule applications (by examining the constituent sequences in the chart), obtain their probabilities (from those that were computed and stored previously), and put them in their appropriate position in PLIST.</Paragraph>
<Paragraph position="9"> 4. Stop if a record representing a parse tree for the entire input string was generated or if PLIST is empty; otherwise, go to step 2.</Paragraph>
<Paragraph position="10"> The PLIST in this algorithm is similar to the ordered agenda used in the &quot;best first&quot; parser described by Allen (1994). However, in contrast to Allen's parser, the probabilities used by this algorithm do not take into account the probabilities of the underlying nodes in each subtree, which in the former case are multiplied together (on the basis of a pragmatically motivated independence assumption) to obtain a probability representative of the entire subtree. Therefore, this algorithm is not guaranteed to produce the most probable parse first. In practice, though, the algorithm does achieve good results and avoids having to deal with the problems that Allen admits are encountered when trying to apply a best-first strategy based on independence assumptions to a large-scale grammar. These include a rapid drop-off of the probabilities as subtrees grow deeper, causing a regression to nearly breadth-first searching.
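The Python sketch below illustrates the agenda loop of Figure 3. The record representation and the callbacks (try_rule, find_new_rule_applications, is_spanning_parse) are hypothetical stand-ins for the grammar and chart machinery, not the actual MEG parser code.

```python
import heapq
from itertools import count

def probabilistic_parse(words, pos_records, rule_probability,
                        try_rule, find_new_rule_applications, is_spanning_parse):
    """Sketch of the probability-guided chart parsing loop (Figure 3).
    pos_records maps each word to a list of (pos, probability) pairs; the
    remaining arguments are assumed callbacks for the chart machinery."""
    chart = []
    plist = []       # agenda ordered by probability, highest first
    tie = count()    # tie-breaker so heapq never compares payload objects

    def push(prob, item):
        heapq.heappush(plist, (-prob, next(tie), item))

    # Step 1: seed PLIST with part-of-speech records, forcing the most probable
    # record for each word to probability 1 so every word enters the chart immediately.
    for word in words:
        records = sorted(pos_records[word], key=lambda r: r[1], reverse=True)
        for i, (pos, prob) in enumerate(records):
            push(1.0 if i == 0 else prob, ("pos", word, pos))

    while plist:
        # Step 2: take the next most probable item off PLIST.
        _, _, item = heapq.heappop(plist)
        if item[0] == "rule":
            record = try_rule(item[1], chart)   # 2a: the rule's conditions may still reject it
        else:
            record = item                       # 2b: part-of-speech records go straight into the chart
        if record is None:
            continue
        chart.append(record)

        # Step 3: queue any newly possible rule applications at their stored probabilities.
        for application in find_new_rule_applications(record, chart):
            push(rule_probability(application), ("rule", application))

        # Step 4: stop once a record spans the entire input.
        if is_spanning_parse(record, words):
            return record

    return None  # PLIST exhausted without producing a spanning parse
```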
We desire instead to maintain parsing efficiency at the cost of potentially not generating some number of the most probable parses, while still generating a large number of those that are most probable. The results reported below appear to bear this out.</Paragraph> </Section>
<Section position="4" start_page="99" end_page="99" type="metho"> <SectionTitle> 3 Discussion </SectionTitle>
<Paragraph position="0"> One potential disadvantage of the bootstrapping method is that the parser can reinforce its own bad behavior. However, this may be controlled by parsing a large amount of data,[1] and then by using only the probabilities computed for &quot;shorter&quot; sentences (currently, those of fewer than 35 words) for which a single, well-formed parse is obtained (in contrast to those for which multiple or &quot;fitted&quot; parses are obtained). Our assessment thus far is that our parser generates straightforward structures for the large majority of such sentences, resulting in fairly accurate rule and part-of-speech probabilities. In many ways, this strategy is similar to the strategies employed by Hindle and Rooth (1993) and by Kinoshita et al. (1993) in that we rely on processing of less ambiguous data to provide information to assist the parser in processing the more difficult, ambiguous cases.
[1] We have used the 1 million word Brown corpus to compute our current set of statistics, but anticipate using larger corpora.</Paragraph>
<Paragraph position="1"> Another factor in avoiding the reinforcement of bad behavior is our linguist's skill in making sure that the most common structures parse accurately. As we evaluate the output of the probabilistic version of our parser, our linguist continues, in a principled manner, to add and change conditions on rules to correct problems with parse structures and parts of speech. We have just made changes to the parser that enable it to use one set of probabilities (along with the changes our linguist made on that base) during parsing while computing another set.</Paragraph>
<Paragraph position="2"> This will allow us to iterate during the development of the parser in a rule-based/statistics-based cycle, and to experiment with the effects of one set of methods on the other.</Paragraph>
<Paragraph position="3"> Also, the simple probabilities described in the previous section are only a starting point. Already, we have dependently conditioned the probabilities of rules on the following characteristics of the parse tree nodes generated by them:
1. l, the length (in words) of the text covered by the node, divided by 5
2. d, the distance (in words) that the text covered by the node is from the end of the sentence, divided by 5
3. m, the minimal path length (in nodes) of the node
The division of the first two conditioning factors by 5 serves to lump together the values obtained during probability computation, thereby decreasing the potential for sparse data. The third factor, the minimal path length of a node, is defined as the smallest number of contiguous nodes that covers the text between the node and the end of the sentence, where nodes are contiguous if the text strings they represent are contiguous in the sentence. The rule probability computation, including these three conditioning factors, is given in Figure 4. The term &quot;composite&quot; in the denominator means that the specific l_i, d_i, and m_i are computed as if the constituents c_1, ..., c_m were one node.</Paragraph>
<Paragraph position="4"> Figure 4: P(r | c_1, ..., c_m, l_i, d_i, m_i) = (# times n_r with l_i, d_i, and m_i occurs in trees) / (# times c_1, ..., c_m with composite l_i, d_i, and m_i occur in chart)</Paragraph>
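For illustration, a minimal Python sketch of how the conditioned counts behind Figure 4 could be keyed follows; the count tables and names are hypothetical, with the integer division by 5 mirroring the binning of l and d described above.

```python
from collections import defaultdict

# Hypothetical frequency tables for the conditioned rule probabilities of Figure 4.
node_counts = defaultdict(int)    # times n_r with (l_i, d_i, m_i) occurs in well-formed trees
chart_counts = defaultdict(int)   # times c_1..c_m with composite (l_i, d_i, m_i) occurs in the chart

def conditioning_key(rule, length_in_words, distance_from_end, minimal_path_length):
    """Bucket l and d by integer division by 5 to lump values together and
    reduce sparse data; the minimal path length m is used as is."""
    return (rule, length_in_words // 5, distance_from_end // 5, minimal_path_length)

def conditioned_rule_probability(rule, l, d, m):
    """Figure 4: tree counts divided by chart counts for the conditioned key."""
    key = conditioning_key(rule, l, d, m)
    attempts = chart_counts[key]
    return node_counts[key] / attempts if attempts else 0.0
```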
<Paragraph position="5"> Although these conditioning factors are not linguistically motivated in the theoretical sense, they have nevertheless contributed significantly to further improving the speed and accuracy of the parser. Results based on their use are provided in the next section. They were identified based on an inspection of the conditions in the MEG rules and how those rules go about building up well-formed parse tree structures (namely, right to left between certain clause and phrase boundaries). Through experimentation, it was confirmed that these three factors are all helpful in guiding the parser to explore the most probable linguistic structures in the search space in an order that is consistent with how the MEG rules tend to build these structures. Specifically, MEG tends to extend structures from right to left that are longer and span from any given word to the end of a clause, especially to the end of the sentence. The advantageous use of these conditions points to the importance of carefully considering various aspects of the existing rule set when integrating statistical processing within a rule-based parser.</Paragraph>
<Paragraph position="6"> In the future, we anticipate conditioning the probabilities further based on truly linguistic considerations, such as the rule history or head word of a given structure. This has been suggested in works such as Black et al. (1993). We also anticipate experimenting cautiously with various independence assumptions in order to decrease our parameter space as we increase the number of conditioning factors. In all of these endeavors, we will seek to determine the most beneficial interplay between the rule-based and statistics-based aspects of our system.</Paragraph> </Section> </Paper>