<?xml version="1.0" standalone="yes"?>
<Paper uid="P90-1031">
  <Title>PARSING THE LOB CORPUS</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE LOB CORPUS
</SectionTitle>
    <Paragraph position="0"> The Lancaster/Oslo-Bergen Corpus is an on-line collection of more than 1,000,000 words of English text taken from a variety of sources, broken up into sentences which are often 50 or more words long. Approximately 40,000 different words and 50,000 sentences appear in the corpus. We have used the LOB corpus in a standard way to build several statistical tables of part of speech usage. Foremost is a dictionary keying every word found in the corpus to the number of times it is used as a certain part of speech, which a/lows us to compute the probability that a word takes on a given part of speech. In addition, we recorded the number of times each part of speech occurred in the corpus, and built a digram array, listing the number of times one part of speech was followed by another.</Paragraph>
    <Paragraph position="1"> These numbers can be used to compute the probability of one category preceding another.</Paragraph>
    <Paragraph position="2"> Some disambiguation schemes require knowing the number of trigram occurrences (three specific categories in a row). Unfortunately, with a 132 category system and only one million words of tagged text, the statistical accuracy of LOB trigrams would be minima/. Indeed, even in the digram table we have built, fewer than 3100 of the 17,500 digrams occur more than 10 times. When using the digram table in statistica/schemes, we treat each of the 10,500 digrams which never occur as if they occur once.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="243" type="metho">
    <SectionTitle>
STATISTICAL DISAMBIGUATION
</SectionTitle>
    <Paragraph position="0"> Many different schemes have been proposed to disambiguate word categories before or during parsing. One common style of disambiguatots, detailed in this paper, rely on statistical cooccurance information such as that discussed in the section above. Specific statistical disambiguators are described in both DeRose 1988 and Church 1988. They can be thought of as algorithms which maximize a function over the possible selections of categories. For instance, for each word A-&amp;quot; in a sentence, the DeRose algorithm takes a set of categories {a~, a~,...} as input. It outputs a particular category a~z such  that the product of the probability that A: is the category a~, and the probability that the category a~.. occurs before the category a z+l is i.z+l maximized. Although such an algorithm might seem to be exponential in sentence length since there are an exponential number of combinations of categories, its limited leftward and rightward dependencies permit linear time dynamic programming method. Applying his algorithm to the Brown Corpus 2, DeRose claims the accuracy rate of 96%. Throughout this paper we will present accuracy figures in terms of how often words are incorrectly disambiguated. Thus, we write 96% correctness as an accuracy of 25 (words per error).</Paragraph>
    <Paragraph position="1"> We have applied the DeRose scheme and several variations to the LOB corpus in order to find an optimal disambiguation method, and display our findings below in Figure 1. First, we describe the four functions we maximize: Method A: Method A is also described in the DeRose paper. It maximizes the product of the probabilities of each category occurring before the next, or n--1 IIP (a~zis-flwd-by a'~+l )~+1 z=l Method B: Method B is the other half of the Dettose scheme, maximizing the product of the probabilities of each category occurring for its word. Method B simply selects each word's most probable category, regardless of context.</Paragraph>
    <Paragraph position="3"> can perform perfectly if it only returns one part of speech per word, because there are words and sequences of words which can be truly ambiguous in certain contexts. Method D addresses this problem by on occasion returning more than one category per word.</Paragraph>
    <Paragraph position="4"> The DeRose algorithm moves from left to right assigning to each category a~ an optimal path of categories leading from the start of the sentence to a~, and a corresponding probability.</Paragraph>
  </Section>
  <Section position="7" start_page="243" end_page="243" type="metho">
    <SectionTitle>
2 The Brown Corpus is a large, tagged text
</SectionTitle>
    <Paragraph position="0"> database quite similar to the LOB.</Paragraph>
    <Paragraph position="1"> It then extends each path with the categories of the word A -'+1 and computes new probabilities for the new paths. Call the greatest new probability P. Method D assigns to the word A z those categories {a~} which occur in those new paths which have a probability within a factor F of P. It remains a linear time algorithm.</Paragraph>
    <Paragraph position="2"> Naturally, Method D will return several categories for some words, and only one for others, depending on the particular sentence and the factor F. If F = 1, Method D will return only one category per word, but they are not necessarily the same categories as DeRose would return. A more obvious variation of DeRose, in which alternate categories are substituted into the DeRose disambiguation and accepted if they do not reduce the overall disambiguation probability significantly, would approach DeRose as F went to 1, but turns out not to perform as well as Method D. 3 Disambiguator Results: Each method was applied to the same 64,000 words of the LOB corpus. The results were compared to the LOB part of speech pre-tags, and are listed in Figure 1. 4 If a word was pre-tagged as being a proper noun, the proper noun category was included in the dictionary, but no special information such as capitalization was used to distinguish that category from others during disambiguation. For that reason, when judging accuracy, we provide two metrics: one simply comparing disambiguator output with the pretags, and another that gives the disambiguator the benefit of the doubt on proper nouns, under the assumption that an &amp;quot;oracle&amp;quot; pre-processor could distinguish proper nouns from contextual or capitalization information. Since Method D can return several categories for each word, we provide the average number of categories per word returned, and we also note the setting of the parameter F, which determines how many categories, on average, are returned.</Paragraph>
    <Paragraph position="3"> The numbers in Figure 1 show that simple statistical schemes can accurately disambiguate parts of speech in normal text, confirming DeRose and others. The extraordinary 3 To be more precise, for a given average number of parts of speech returned V, the &amp;quot;substitution&amp;quot; method is about 10% less accurate when 1 &lt; V &lt; 1.1 and is almost 50% less accurate for 1.1 &lt; V &lt; 1.2.</Paragraph>
  </Section>
  <Section position="8" start_page="243" end_page="244" type="metho">
    <SectionTitle>
4 In all figures quoted, punctuation marks
</SectionTitle>
    <Paragraph position="0"> have been counted as words, and are treated as parts of speech by the statistical disambiguators.</Paragraph>
    <Paragraph position="1">  strategies, in number of words per error. On average, the dictionary had 2.2 parts of speech listed per word.</Paragraph>
    <Paragraph position="2"> accuracy one can achieve by accepting an additional category every several words indicates that disambiguators can predict when their answers are unreliable.</Paragraph>
    <Paragraph position="3"> Readers may worry about correlation resulting from using the same corpus to both learn from and disambiguate. We have run tests by first learning from half of the LOB (600,000 words) and then disambiguating 80,000 words of random text from the other half. The ac- curacy figures varied by less than 5% from the ones we present, which, given the size of the LOB, is to be expected. We have also applied each disambiguation method to several smaller (13,000 word) sets of sentences which were selected at complete random from throughout the LOB. Accuracy varied both up and down from the figures we present, by up to 20% in terms of words per error, but relative accuracy between methods remained constant.</Paragraph>
    <Paragraph position="4"> The fact the Method D with F = 1 (with F = 1 Method D returns only one category per word) performs as well or even better on the LOB than DeKose's algorithm indicates that, with exceptions, disambiguation has very limited rightward dependence: Method D employs a one category lookahead, whereas DeRose's looks to the end of the sentence. This suggests that Church's strategy of using trigrams instead of digrams may be wasteful. Church manages to achieve results similar or slightly better than DeRose's by defining the probability that a category A appears in a sequence ABC to be the number of times the sequence ABC appears divided by the number of times the sequence BC appears. In a 100 category system, this scheme requires an enormous table of data, which must be culled from tagged text. If the rightward dependence of disambiguation is small, as the data suggests, then the extra effort may be for naught. Based on our results, it is more efficient to use digrams in genera\] and only mark special cases for trigrams, which would reduce space and learning requirements substantially.</Paragraph>
    <Paragraph position="5"> Integrating Disambiguator and Parser: As the LOB corpus is pretagged, we could ignore disambiguation problems altogether, but to guarantee that our system can be applied to arbitrary texts, we have integrated a variation of disambiguation Method D with our parser.</Paragraph>
    <Paragraph position="6"> When a sentence is parsed, the parser is initially passed all categories returned by Method D with F = .01. The disambiguator substantially reduces the time and space the parser needs for a given parse, and increases the parser's accuracy. The parser introduces syntactic constraints that perform the remaining disambiguation well.</Paragraph>
  </Section>
  <Section position="9" start_page="244" end_page="248" type="metho">
    <SectionTitle>
THE PARSER
</SectionTitle>
    <Paragraph position="0"> Introduction: The LOB corpus contains unedited English, some of which is quite complex and some of which is ungrammatical. No known parser could produce full parses of all the material, and even one powerful enough to do so would undoubtably take an impractical length of time. To facilitate the analysis of the LOB, we have implemented a simple parser which is capable of rapidly parsing simple constructs and of &amp;quot;failing gracefully&amp;quot; in more complicated situations. By trading completeness for accuracy, and by utilizing the statistical disambiguator, the parser can perform rapidly and correctly enough to usefully parse the entire LOB in a few hours. Figure 2 presents a sample parse from the LOB.</Paragraph>
    <Paragraph position="1"> The parser employs three methods to build phrases. CFG-like rules are used to recognize lengthy, less structured constructions such as NPs, names, dates, and verb systems. Neighboring phrases can connect to build the higher level binary-branching structure found in English, and single phrases can be projected into new ones. The ability of neighboring phrase pairs to initiate the CFG-like rules permits context-sensitive parsing. And, to increase the efficiency of the parser, an innovative system of deterministically discarding certain phrases is used, called &amp;quot;lowering&amp;quot;.</Paragraph>
    <Paragraph position="2"> Some Parser Details: Each word in an input sentence is tagged as starting and ending at a specific numerical location. In the sentence &amp;quot;I saw Mary.&amp;quot; the parser would insert the locations 0-4, 0 I 1 SAW 2 MARY 3  batim from the LOB corpus, printed without features. Notice that the grammar does not attach PP adjuncts.</Paragraph>
    <Paragraph position="3"> 4. A phrase consists of a category, starting and ending locations, and a collection of feature and tree information. A verb phrase extending from 1 to 3 would print as \[VP 1 3\].</Paragraph>
    <Paragraph position="4"> Rules consist of a state name and a location. If a verb phrase recognition rule was firing in location 1, it would get printed as (VP0 a* 1) where VP0 is the name of the rule state. Phrases and rules which have yet to be processed are placed on a queue. At parse initialization, phrases are created from each word and its category(ies), and placed on the queue along with an end-of-sentence marker. The parse proceeds by popping the top rule or phrase off the queue and performing actions on it. Figure 3 contains a detailed specification of the parser algorithm, along with parts of a grammar. It should be comprehensible after the following overview and parse example.</Paragraph>
    <Paragraph position="5"> When a phrase is popped off the queue, rules are checked to see if they fire on it, a table is examined to see if the phrase automatically projects to another phrase or creates a rule, and neighboring phrases are examined in case they can pair with the popped phrase to either connect into a new phrase or create a rule.</Paragraph>
    <Paragraph position="6"> Thus the grammar consists of three tables, the &amp;quot;rule-action-table&amp;quot; which specifies what action a rule in a certain state should take if it encounters a phrase with a given category and features; a &amp;quot;single-phrase-action-table&amp;quot; which specifies whether a phrase with a given category and features should project or start a rule; and a &amp;quot;paired-phrase-action-table&amp;quot; which specifies possible actions to take if two certain phrases abut each other.</Paragraph>
    <Paragraph position="7"> For a rule to fire on a phrase, the rule must be at the starting position of the phrase. Possible actions that can be taken by the rule are: accepting the phrase (shift the dot in the rule); closing, or creating a phrase from all phrases accepted so far; or both, creating a phrase and continuing the rule to recognize a larger phrase should it exist. Interestingly, when an enqueued phrase is accepted, it is &amp;quot;lowered&amp;quot; to the bottom of the queue, and when a rule closes to create a phrase, all other phrases it may have already created are lowered also.</Paragraph>
    <Paragraph position="8"> As phrases are created, a call is made to a set of transducer functions which generate more principled interpretations of the phrases, with appropriate features and tree relations.</Paragraph>
    <Paragraph position="9"> The representations they build are only for output, and do not affect the parse. An exception is made to allow the functions to project and modify features, which eases handling of sub-categorization and agreement. The transducers can be used to generate a constant output syntax as the internal grammar varies, and vice versa.</Paragraph>
    <Paragraph position="10"> New phrases and rules are placed on the queue only after all actions resulting from a given pop of the queue have been taken. The ordering of their placement has a dramatic effect on how the parse proceeds. By varying the queuing placement and the definition of when a parse is finished, the efficiency and accuracy of the parser can be radically altered.</Paragraph>
    <Paragraph position="11"> The parser orders these new rules and phrases by placing rules first, and then pushes all of them onto the stack. This means that new rules will always have precedence over newly created phrases, and hence will fire in a successive &amp;quot;rule chain&amp;quot;. If all items were eventually popped off the stack, the ordering would be irrelevant. However, since the parse is stopped at the end-of-sentence marker, all phrases which have been &amp;quot;lowered&amp;quot; past the marker are never examined. The part of speech disambiguator can pass in several categories for any one word, which are ordered on the stack by likelihood, most probable first. When any lexical phrase is lowered to the back of the queue (presumably because it was accepted by some rule) all other lexical phrases associated with the same word are also lowered. We have found that this both speeds up parsing and increases accuracy. That this speeds up parsing should be obvious. That it increases accuracy is much less so. Remember that disambiguation Method D is</Paragraph>
    <Section position="1" start_page="246" end_page="248" type="sub_section">
      <SectionTitle>
The Parser Algorithm
</SectionTitle>
      <Paragraph position="0"> To parse a sentences S of length n: Perform multivalued disambiguation of S.</Paragraph>
      <Paragraph position="1"> Create empty queue Q. Place End-of-Sentence marker on Q. Create new phrases from disambiguator output categories, and place them on Q.</Paragraph>
      <Paragraph position="2"> Until Q is elnpty, or top(Q) = End-of-Sentence marker. Let I= pop(Q). Let new-items = nil If Its phrase \[cat i 3\] Let rules = all rules at location i.</Paragraph>
      <Paragraph position="3"> Let lefts = all phrases ending at. location i.  For all rules R = (state at i) in rules, And all phrases P = \[cat+features i 3\] in phrases, If there is an action A in the rule-action-table with key (state, cat+features), If A = (accept new-state) or (aeespt-and-close new-state new-cat).</Paragraph>
      <Paragraph position="4"> Create new rule (new-state at j).</Paragraph>
      <Paragraph position="5"> If A = (close new-cat) or (aeeept-artd-close new-state new-cat).</Paragraph>
      <Paragraph position="6"> Let daughters = the set of all phrases which have been accepted in the rule chain which led to R, including the phrase P.</Paragraph>
      <Paragraph position="7"> Let l = the lef|mosl starting location of all)' phrase in daughters. Create new phrase \[new-cat l 3\] wilh daughters daughters.</Paragraph>
      <Paragraph position="8"> For all phrases p in daughters, perform lowsr (p). For all phrases p created (via accept-and-close) by the rule chair, which led to R. perform lower(p). To perform paired-phrase-actions (lefts, rights): For all phrases Pl = \[left-cat+features l if in lefts, And all phrases Pr = \[right-cat+features i r\] in rights, If there is an action A in the paired-phrase-action- null rithm, omitting implementation details. Included in table form are representative sections from a grammar.  substantially more accurate the DeRose~s algorithm only because it can return more than one category per word. One might guess that if the parser were to lower all extra categories on the queue, that nothing would have been gained.</Paragraph>
      <Paragraph position="9"> But the top-down nature of the parser is sufficient in most cases to &amp;quot;pick out&amp;quot; the correct category from the several available (see Milne 1988 for a detailed exposition of this).</Paragraph>
      <Paragraph position="10"> A Parse in Detail: Figure 4 shows a parse of the sentence &amp;quot;The pastry chef placed the pie in the oven.&amp;quot; In the figure, items to the left of the vertical line are the phrases and rules popped off the stack. To the right of each item is a list of all new items created as a result of it being popped. At the start of the parse, phrases were created from each word and their corresponding categories, which were correctly (and uniquely)determined by the disambiguator. null The first item is popped off the queue, this being the \[DET 0 1\] phrase corresponding to the word &amp;quot;the&amp;quot;. The single-phrase action table indicates that a DET0 rule should be started at location 0 and immediately fires on &amp;quot;the&amp;quot;, which is accepted and the rule (DET1 a* 1) is accordingly created and placed on the queue.</Paragraph>
      <Paragraph position="11"> This rule is then popped off the queue, and accepts the \[N 1 2\] corresponding to &amp;quot;pastry&amp;quot;, also closing and creating the phrase \[NP 0 2\].</Paragraph>
      <Paragraph position="12"> When this phrase is created, all queued phrases which contributed to it are lowered in priority, i.e., &amp;quot;pastry&amp;quot;. The rule (DET2 at 2) is created to recognize a possibly longer NP, and is popped off the queue in line 4. Here much the same thing happens as in line 3, except that the \[NP 0 2\] previously created is lowered as the phrase \[NP 0 3\] is created. In line 5, the rule chain keeps firing, but there are no phrases starting at location 3 which can be used by the rule state DET2.</Paragraph>
      <Paragraph position="13"> The next item on the queue is the newly created \[NP 0 3\], but it neither fires a rule (which would have to be in location 0), finds any action in the single-phrase table, or pairs with any neighboring phrase to fire an action in the paired-phrase table, so no new phrases or rules are created. Hence, the verb &amp;quot;placed&amp;quot; is popped and the single-phrase table indicates that it should create a rule which then immedi- ately accepts &amp;quot;placed&amp;quot;, creating a VP and placing the rule (VP4 a* 4) in location 4. The VP is popped off the stack, but not attached to \[NP 0 3\] to form a sentence, because the paired-phrase table specifies that for those two phrases to connect to become an S, the verb phrase must have the feature (expec't; nil), indi- null 0 The 1 pastry 2 chef 3 placed 4 the 5 pie 6 in ? the 8 oven 9 . I0 I. Phrase \[DET 0 I\] 2. Rule (DETO at O) 3. Rule (DETI at I) 4. Rule (DET2 at 2) 5. Rule (DET2 at 3) 6. Phrase \[NP 0 3\] 7. Phrase \[V 3 4\] 8. Rule (VP3 at 3) 9. Rule (UP4 at 4) I0. Phrase \[VP 3 4\] 11. Phrase \[DET 4 5\] 12. Phrase (DETO at 4) 13. Rule (DETI at 5) 14. Rule (DET2 at 6) 15. Phrase \[NP 4 6\] 16. Phrase \[VP 3 6\] 17. Phrase IS 0 6\] 18. Phrase \[P 6 7\] 19. Phrase \[DET 7 8\] 20. Rule (DETO at 7) 21. Rule (DETI at 8) 22. Rule (DET2 at 9) 23. Phrase \[NP 7 9\] 24. Phrase \[PP 6 9\] 25. Phrase \[*PER 9 I0\]  performed prior to the parse.</Paragraph>
      <Paragraph position="14"> cating that all of its argument positions have been filled. However when the VP was cre- ated, the VP transducer call gave it the feature (expect . NP), indicating that it is lacking an NP argument.</Paragraph>
      <Paragraph position="15"> In line 15, such an argument is popped from the stack and pairs with the VP as specified in the paired-phrase table, creating a new phrase, \[VP 3 6\]. This new VP then pairs with the subject, forming \[S 0 6\]. In line 18, the preposition &amp;quot;in&amp;quot; is popped, but it does not create any rules or phrases. Only when the NP &amp;quot;the oven&amp;quot; is popped does it pair to create \[PP 6 9\]. Although it should be attached as an argument  to the verb, the subcategorization frames (contained in the expoc'c feature of the VP) do not allow for a prepositional phrase argument. After the period is popped in line 25, the end-of-sentence marker is popped and the parse stops. At this time, 5 phrases have been lowered and remain on the queue. To choose which phrases to output, the parser picks the longest phrase starting at location 0, and then the longest phrase starting where the first ended, etc.</Paragraph>
      <Paragraph position="16"> The Reasoning behind the Details: The parser has a number of salient features to it, including the combination of top-down and bottom-up methods, the use of transducer functions to create tree structure, and the system of lowering phrases off the queue. Each was necessary to achieve sufficient flexibility and efficiency to parse the LOB corpus.</Paragraph>
      <Paragraph position="17"> As we have mentioned, it would be naive of us to believe that we could completely parse the more difficult sentences in the corpus. The next best thing is to recognize smaller phrases in these sentences. This requires some bottom-up capacity, which the parser achieves through the single-phrase and paired-phrase action tables. In order to avoid overgeneration of phrases, the rules (in conjunction with the &amp;quot;lowering&amp;quot; system and method of selecting output phrases) provide a top-down capability which can prevent some valid smaller phrases from being built. Although this can stifle some correct parses 5 we have not found it to do so often.</Paragraph>
      <Paragraph position="18"> Keaders may notice that the use of special mechanisms to project single phrases and to connect neighboring phrases is unnecessary, since rules could perform the same task. However, since projection and binary attachment are so common, the parser's efficiency is greatly improved by the additional methods.</Paragraph>
      <Paragraph position="19"> The choice of transducer functions to create tree structure has roots in our previous experiences with principle-based structures. Modern linguistic theories have shown themselves to be valuable constraint systems when applied to sentence tree-structure, but do not necessarily provide efficient means of initially generating the structure. By using transducers to map For instance, the parser always generates the longest possible phrase it can from a sequence of words, a heuristic which can in some cases fail. We have found that the only situation in which this heuristic fails regularly is in verb argument attachment; with a more restrictive subcategorization system, it would not be much of a problem.</Paragraph>
      <Paragraph position="20"> between surface structure and more principled trees, we have eliminated much of the computational cost involved in principled representations. null The mechanism of lowering phrases off the stack is also intended to reduce computational cost, by introducing determinism into the parser. The effectiveness of the method can be seen in the tables of Figure 5, which compare the parser's speed with and without lowering.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>