File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/92/c92-4212_abstr.xml
Size: 18,091 bytes
Last Modified: 2025-10-06 13:47:34
<?xml version="1.0" standalone="yes"?> <Paper uid="C92-4212"> <Title>M. Kay. Parsing in Functional Unification</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> Artificial Intelligence Laboratory General Electric - Research and Development Center Abstract </SectionTitle> <Paragraph position="0"> Collocation-based tagging and bracketing programs have attained promising results. Yet they have not reached the stage where they could be used as pre-processors for full-fledged parsing. Accuracy is still not high enough.</Paragraph> <Paragraph position="1"> To improve accuracy, it is necessary to investigate the points where statistical data is being misinterpreted, leading to incorrect results.</Paragraph> <Paragraph position="2"> In this paper we investigate the inaccuracy introduced when a pre-processor relies solely on collocations and blurs the distinction between two separate relations: thematic relations and sentential relations.</Paragraph> <Paragraph position="3"> Thematic relations are word pairs, not necessarily adjacent (e.g., adjourn a meeting), that encode information at the concept level. Sentential relations, on the other hand, concern adjacent word pairs that form a noun group.</Paragraph> <Paragraph position="4"> For example, preferred stock is a noun group that must be identified as such at the syntactic level.</Paragraph> <Paragraph position="5"> Blurring the difference between these two phenomena contributes to errors in the tagging of pairs such as expressed concerns, a verb-noun construct, as opposed to preferred stocks, an adjective-noun construct. Although both relations are manifested in the corpus as high mutual-information collocations, they possess different properties and need to be separated. In our method, we distinguish between these two cases by asking additional questions of the corpus. By definition, thematic relations take on further variations in the corpus. 
Expressed concerns (a thematic relation) takes concerns expressed, expressing concerns, express his concerns, etc. On the other hand, preferred stock (a sentential relation) does not take any such syntactic variations.</Paragraph> <Paragraph position="6"> We show how this method impacts pre-processing and parsing, and we provide empirical results based on the analysis of an 80-million-word corpus. 1 2 Pre-Processing: The Greater Picture Sentences in a typical newspaper story include idioms, ellipses, and ungrammatical constructs. Since authentic language defies textbook grammar, we must rethink our basic pars- 1This research was sponsored (in part) by the Defense Advanced Research Projects Agency (DOD) and other government agencies. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Re- Hypothetically, parsing could be performed by one huge unification mechanism [Kay, 1985; Shieber, 1986; Tomita, 1986] which would process sentences at any level of complexity. Such a mechanism would receive its tokens in the form of words, characters, or morphemes, negotiate all given constraints, and produce a full chart with all possible interpretations.</Paragraph> <Paragraph position="7"> However, when tested on a real corpus (i.e., Wall Street Journal (WSJ) news stories), this mechanism fares poorly. For one thing, a typical well-behaved 34-word sentence produces hundreds of candidate interpretations. In effect, the parsing burden is passed on to a post-processor whose task is to select the appropriate parse tree within the entire forest. 
For another, ill-behaved sentences - roughly one out of three WSJ sentences is problematic - yield no consistent interpretation whatsoever due to parsing failures.</Paragraph> <Paragraph position="8"> To alleviate problems associated with rough edges in real text, a new strategy has emerged, involving text pre-processing. A pre-processor, capitalizing on statistical data [Church et al., 1989; Zernik and Jacobs, 1990; Dagan et al., 1991], and customized to the corpus itself, could abstract idiosyncrasies, highlight regularities, and, in general, feed digested text into the unification parser.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> What is Pre-Processing Up Against? The Linguistic Phenomenon </SectionTitle> <Paragraph position="0"> Consider (Figure 1) a WSJ (August 19, 1987) paragraph processed by NLpc (NL corpus processing) [Zernik et al., 1991]. Two types of linguistic constructs must be resolved by the pre-processor: Class A preferred/AJ stock/NN *comma* and expressed/VB concern/NN about How can a program determine that preferred stock is an adjective-noun construct, while expressed concern is a verb-noun construct? The Input The scope of the pre-processing task is best illustrated by the input to the pre-processor shown in Figure 2.</Paragraph> <Paragraph position="1"> This lexical analysis of the sentence is based on the Collins on-line dictionary (about 49,000 lexical entries extracted by NLpc) plus morphology. Each word is associated with candidate parts of speech, and almost all words are ambiguous. The tagger's task is to resolve the ambiguity.</Paragraph> <Paragraph position="2"> For example, ambiguous words such as services, preferred, and expressed should be tagged as noun (nn), adjective (aj), and verb (vb), respectively. 
While some pairs (e.g., annual meeting) can be resolved easily, other pairs are more difficult, and require statistical training. Figure 2 (each word with its candidate tags): holders NN; of PP; Class AJ NN; A DT AJ; stock NN VB; failed AJ VB; elect VB; two AJ NN; to PP; the DT; board NN VB; when CC; meeting NN VB; resumed AJ VB; questions NN VB; validity NN; submitted AJ VB; group NN VB. Part-Of-Speech Resolution The program can bring to bear 3 types of clues: Local context: Consider the following 2 cases where local context dominates: 1. the preferred stock raised 2. he expressed concern about The words the and he dictate that preferred and expressed are adjective and verb, respectively. This kind of inference, due to its local nature, is captured and propagated by the pre-processor.</Paragraph> <Paragraph position="3"> Global context: Global-sentence constraints are shown by the following two examples: 1. and preferred stock sold yesterday was ...</Paragraph> <Paragraph position="4"> 2. and expressed concern about ... *period* In case 1, a main verb is found (i.e., was), and preferred is taken as an adjective; in case 2, a main verb is not found, and therefore expressed itself is taken as the main verb. This kind of ambiguity requires full-fledged unification, and it is not handled by the pre-processor. Fortunately, only a small percentage of the cases (in newspaper stories) depend on global reading.</Paragraph> <Paragraph position="5"> Corpus-based preference: Corpus analysis (WSJ, 80-million words) provides word-association preference [Beckwith et al., 1991]. Collocation (total; aj-nn %; vb-nn %): preferred stock (2314; 100; 0); expressed concern (318; 1; 99). The construct expressed concern, which appears 318 times in the corpus, is 99% a verb-noun construct; on the other hand, preferred stock, which appears in the corpus 2314 times, is 99% an adjective-noun construct. 3 Where Is The Evidence? The last item, however, is not directly available. 
Since the corpus is not tagged a priori, there is no direct evidence regarding part of speech. All we get from the corpus are numbers that indicate the mutual information score (MIS) [Church et al., 1991] of collocations (9.9 and 8.7, for preferred stock and expressed concern, respectively). It becomes necessary to infer the nature of the combination from indirect corpus-based statistics, as shown in the rest of this paper.</Paragraph> <Paragraph position="6"> 3For expository purposes we chose here two extreme, clear-cut cases; other pairs (e.g., promised money) are not totally biased towards one side or another.</Paragraph> <Paragraph position="7"> ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 1307 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 In this section we describe the method used for eliciting word-association preference from the corpus.</Paragraph> <Paragraph position="8"> The basic intuition used invariably by all existing statistical taggers is stated as follows: Significant collocations (i.e., high MIS) predict syntactic word association. Since, for example, preferred stock is a significant collocation (MIS 9.9), with all other clues assumed neutral, it will be marked as an integral noun group in the sentence.</Paragraph> <Paragraph position="9"> However, is high MIS always a good predictor? Figure 3 provides mutual information scores for preferred, expressed, and closed right collocations.</Paragraph> <Paragraph position="10"> The first column (preferred) suggests MIS is a perfect predictor. A count in the corpus confirms that, for preferred, a predictor based on collocations is always correct. A small sample of preferred collocations in context is given in Figure 4. Notice that in all cases, preferred is an adjective. While column 1 (preferred) yields good syntactic associations, column 2 (expressed) and column 3 (closed) yield different conclusions. 
It turns out (see Figure 4) that expressed collocations, even collocations with high MIS, produce a bias towards false-positive groupings. 4 If these collocations do not signify word groupings, what do they signify? An observation of expressed right collocates reveals that the words surprise, confidence, skepticism, optimism, disappointment, support, hope, doubt, worry, satisfaction, etc., are all thematic relations of express.</Paragraph> <Paragraph position="11"> 4Word associations based on the corpus do not dictate the nature of word groupings; they merely provide a predictor that is accounted for together with other local-context clues.</Paragraph> <Paragraph position="12"> Namely, a pair such as expressed disappointment denotes an action-object relation which could come in many variants. The last part of Figure 4 shows various combinations of express and its collocates.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Using Additional Evidence </SectionTitle> <Paragraph position="0"> In light of this observation, it is necessary to test in the corpus whether collocations are fixed or variable. For a collocation word1-word2, if word1 and word2 combine in multiple ways, then word1-word2 is taken as a thematic relation; otherwise it is taken as a fixed noun group.</Paragraph> <Paragraph position="1"> This test for express-word is shown in Figure 5. Each row provides the number of times each variant is found. Variants for expressed concerns, for example, are concern expressed, express concern, expresses concern, and expressing concern. Not shown here is the count for split co-occurrence [Smadja, 1991], i.e., express its concern, concern was expressed. The last column sums up the result as a ratio (variability ratio) against the original collocation. 
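The variability test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NLpc implementation: the function names, the 0.1 threshold, and all counts are hypothetical, and the ratio is taken to be the summed variant counts divided by the count of the original collocation.

```python
def variability_ratio(original_count, variant_counts):
    """Return summed variant counts divided by the original collocation count."""
    if original_count == 0:
        raise ValueError("the original collocation must occur at least once")
    return sum(variant_counts.values()) / original_count


def classify(original_count, variant_counts, threshold=0.1):
    """Label a collocation; the 0.1 threshold is a hypothetical cut-off."""
    ratio = variability_ratio(original_count, variant_counts)
    return "thematic relation" if ratio >= threshold else "fixed noun group"


# Illustrative counts only; these are NOT the figures reported in Figure 5.
expressed_concern_variants = {
    "concern expressed": 20,
    "express concern": 50,
    "expresses concern": 10,
    "expressing concern": 20,
}
preferred_stock_variants = {}  # assumed: no syntactic variants observed

print(classify(200, expressed_concern_variants))
print(classify(1000, preferred_stock_variants))
```

With these made-up counts the first pair is labeled a thematic relation and the second a fixed noun group; a real threshold would have to be tuned on corpus data.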
In conclusion, for 12 out of 15 of the checked collocations we found a reasonable degree of variability.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Making Statistics Operational </SectionTitle> <Paragraph position="0"> While the analysis in Figure 5 provides the motivation for using additional evidence, we have two steps to take to make this evidence useful within an operational tagger.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Dealing with Small Numbers </SectionTitle> <Paragraph position="0"> Although the table in Figure 5 is adequate for expository purposes, in practice the different collected figures are spread over too many rubrics, making the numbers susceptible to noise.</Paragraph> <Paragraph position="1"> To avoid this problem we short-cut the calculation above and collect all the co-occurrences of the roots of the words under analysis. Instead of asking: &quot;what are the individual variants?&quot; we ask &quot;what is the total co-occurrence of the root pair?&quot;. For expressed concerns we check the incidence of express-interest (and of interest-express).</Paragraph> <Paragraph position="2"> As a result, we get the lump sum without summing up the individual numbers.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Incorporating Statistics in Tagging </SectionTitle> <Paragraph position="0"> Co-occurrence information regarding each pair of words is integrated, as described in Section 2.3, with other local-context clues. 
Thus, the strong preference provided by statistics can always be overridden by other factors.</Paragraph> <Paragraph position="1"> they preferred stock ...</Paragraph> <Paragraph position="2"> the expressed interest by shareholders was ... In both of these cases the final call is dictated by syntactic markers in spite of strong statistical preference.</Paragraph> <Paragraph position="3"> Conclusions NLpc processes collocations by their category. In this paper, we investigated specifically the PastParticiple-Noun category (e.g., preferred-stock, expressed-concerns, etc.). Other categories (in particular ContinuousVerb-Noun as in driving cars vs. operating systems) are processed in a similar way, using slightly different evidence and thresholds.</Paragraph> <Paragraph position="4"> ... algorithm was called in 400 cases. 1631 cases were not called since they did not involve collocations (or involved trivial collocations such as expressed some fears). Out of 400 collocations the program avoided ruling in 23 cases due to insufficient data. Within the 377 tagged cases, 358 (94.9%) cases were correct, and 19 were incorrect.</Paragraph> <Paragraph position="5"> 90% Accuracy is Not Enough Existing pre-processors [Church et al., 1989; Zernik et al., 1991], which have used corpus-based collocations, have attained levels of accuracy as high as 90%. A simple calculation reveals that a 34-word sentence might contain some 1-2 errors on the average. [Figure 4: WSJ concordance lines showing preferred and expressed collocations in context, followed by variants of express with its collocates.]</Paragraph> <Paragraph position="6"> This error rate is too high. Since the pre-processor's job is to eliminate possible parse trees from consideration, if the appropriate parse is eliminated by the pre-processor at the outset, it will never be recovered by the parser. As shown in this paper, it is now necessary to investigate in depth how various linguistic phenomena are reflected by statistical data.</Paragraph> </Section> </Section> </Paper>