<?xml version="1.0" standalone="yes"?> <Paper uid="C92-1055"> <Title>Syntactic Ambiguity Resolution Using A Discrimination and Robustness Oriented Adaptive Learning Algorithm</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Overview of Uniform Probabilistic Score Function 2.1. General Definition </SectionTitle> <Paragraph position="0"> A Score Function for a given syntactic tree, say Syn_j, is defined as follows:

    Score(Syn_j) = P(Syn_j, Lex_j | w_1^n),    (1)

where w_1^n is the input word sequence, w_1^n = {w_1, w_2, ..., w_n}, and Lex_j is the corresponding lexical string, i.e., the part-of-speech sequence {c_j1, c_j2, ..., c_jn}. By applying the multiplication theorem of probability, P(Syn_j, Lex_j | w_1^n) can be restated as follows:

    P(Syn_j, Lex_j | w_1^n) = P(Syn_j | Lex_j, w_1^n) × P(Lex_j | w_1^n)
                            = S_syn(Syn_j) × S_lex(Lex_j).    (2)
</Paragraph> <Paragraph position="2"> The two components, S_syn(Syn_j) and S_lex(Lex_j), in the above formula are called the Syntactic Score Function and the Lexical Score Function, respectively. The original score function, i.e., P(Syn_j, Lex_j | w_1^n), is then called the Integrated Score Function.</Paragraph> <Paragraph position="3"> Next, we assume that the information from the word sequence w_1^n required for syntactic ambiguity resolution has percolated to the lexical interpretation Lex_j, and that only little additional information can be provided by w_1^n for disambiguating the syntactic interpretation Syn_j once the lexical interpretation Lex_j is given. Thus, the syntactic score can be approximated as shown in Eq. (3):

    S_syn(Syn_j) = P(Syn_j | Lex_j, w_1^n) ≈ P(Syn_j | Lex_j).    (3)
</Paragraph> <Paragraph position="5"> The integrated score function P(Syn_j, Lex_j | w_1^n) is then approximated as follows:

    P(Syn_j, Lex_j | w_1^n) ≈ P(Syn_j | Lex_j) × P(Lex_j | w_1^n).    (4)
</Paragraph> <Paragraph position="7"> Such a formulation allows us to use both lexical and syntactic knowledge in assigning a preference measure to a syntactic tree. In the real computation, the log operation is used to convert multiplications into additions. The following equation shows the final form used in the real application:</Paragraph> <Paragraph position="9">
    log Score(Syn_j) = log S_syn(Syn_j) + log S_lex(Lex_j).    (5)
</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2. Lexical Score Function </SectionTitle> <Paragraph position="0"> Let c_{k,1}^{n} denote the k-th sequence of lexical categories, or parts of speech, corresponding to the word sequence w_1^n. The Lexical Score Function can be expressed as follows [Chen 91, Su 92b]:

    S_lex(Lex_j) = P(c_{k,1}^{n} | w_1^n) = Π_{i=1..n} P(c_{k,i} | c_{k,1}^{i-1}, w_1^n),    (6)
</Paragraph> <Paragraph position="2"> where c_{k,i} is the lexical category of w_i. Several forms [Gars 87, Chur 88, Su 92b] for P(c_{k,i} | c_{k,1}^{i-1}, w_1^n) were proposed to simplify the computation. For example, [Chur 88] approximated P(c_{k,i} | c_{k,1}^{i-1}, w_1^n) by [P(c_{k,i} | c_{k,i-1}) × P(c_{k,i} | w_i)]. A general nonlinear smoothing form [Chen 91], described in Eq. (7), is adopted in this paper:

    g(P(c_{k,i} | c_{k,1}^{i-1}, w_1^n)) ≈ λ g(P(c_{k,i} | w_i)) + (1 − λ) g(P(c_{k,i} | c_{k,i-1})),    (7)
</Paragraph> <Paragraph position="4"> where λ is the lexical weight (λ = 0.6 is used in the current setup), and g is a transform function (log(.) is used in this paper). Hence, given both Eq. (6) and (7), the following formula is derived:

    log S_lex(Lex_j) ≈ Σ_{i=1..n} [ λ log P(c_{k,i} | w_i) + (1 − λ) log P(c_{k,i} | c_{k,i-1}) ].    (8)
</Paragraph> <Paragraph position="6"> It is noted that the above generalized form reduces to the formulation of [Chur 88] when the transform function is the log function and λ is 0.5.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3. Syntactic Score Function </SectionTitle> <Paragraph position="0"> To show the computing mechanism for the syntactic score, we take the syntax tree in Fig. 1 as an example. The syntax tree is decomposed into a number of phrase levels.</Paragraph>
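<Paragraph> To make the smoothing form of Eq. (7)-(8) above concrete, the following sketch computes the smoothed log lexical score of one category assignment in Python. The probability values are hypothetical placeholders; only the combination rule (λ = 0.6 and g = log) is taken from the text, so this is an illustrative sketch rather than the authors' implementation.

    import math

    LEX_WEIGHT = 0.6  # lambda in Eq. (7), as stated in the text

    def smoothed_term(p_cat_given_word, p_cat_given_prev_cat, lam=LEX_WEIGHT):
        # One term of Eq. (7): lam * log P(c|w) + (1 - lam) * log P(c|c_prev).
        return lam * math.log(p_cat_given_word) + (1.0 - lam) * math.log(p_cat_given_prev_cat)

    def log_lexical_score(p_cat_word_seq, p_cat_prev_seq):
        # Eq. (8): sum the smoothed terms over all word positions of a sentence.
        return sum(smoothed_term(p_cw, p_cc)
                   for p_cw, p_cc in zip(p_cat_word_seq, p_cat_prev_seq))

    # Hypothetical probabilities for a four-word sentence (not from the paper's corpus).
    print(log_lexical_score([0.6, 0.8, 0.9, 0.7], [0.3, 0.5, 0.6, 0.4]))
</Paragraph>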
<Paragraph> Each phrase level (also called a sentential form) consists of a set of symbols (terminal or nonterminal) which can derive all the terminal symbols in a sentence. Let the label t_i in Fig. 1 be the time index of each state transition of an LR parser, and L_i be the i-th phrase level. Thus, a transition from phrase level L_i to phrase level L_{i+1} is equivalent to a reduce action at time t_i.</Paragraph> <Paragraph position="1"> Figure 1. Decomposition of a syntax tree into phrase levels.</Paragraph> <Paragraph position="4"> The syntactic score of the syntax tree in Fig. 1 is then defined as

    S_syn(Syn_A) = P(L_8, L_7, ..., L_2 | L_1) ≈ Π_{i=2..8} P(L_i | L_{i-1}),    (9)
</Paragraph> <Paragraph position="6"> where Syn_A is the parse tree, and L_1 through L_8 represent the different phrase levels. Note that the product terms in the last formula correspond to the right-most derivation sequence in a general LR parser [Su 91c], with left and right contexts taken into account.</Paragraph> <Paragraph position="7"> Therefore, such a formulation is especially useful for a generalized LR parsing algorithm, in which context-sensitive processing power is desirable.</Paragraph> <Paragraph position="8"> Although the context-sensitive model in the above equation provides the ability to deal with intra-level context-sensitivity, it fails to capture inter-level correlation. In addition, the formulation of Eq. (9) gives rise to a normalization problem for ambiguous syntax trees with different numbers of nodes. An alternative that relieves this problem is to compact multiple highly correlated phrase levels into one when evaluating the syntactic scores: in the resulting formulation, Eq. (10) [Su 91c], the phrase levels produced between two consecutive shift actions are compacted, so that each shift contributes exactly one conditional probability term.</Paragraph> <Paragraph position="13"> Because the number of shifts, i.e., the number of terms in Eq. (10), is always the same for all ambiguous syntax trees, the normalization problem is then resolved. Moreover, it provides a way to consider both intra-level context-sensitivity and inter-level correlation of the underlying context-free grammar. With such a score function, the capability of context-sensitive parsing (in a probabilistic sense) can be achieved with a context-free grammar.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3. Discrimination and Robustness Oriented Adaptive Learning 3.1. Concepts of Adaptive Learning </SectionTitle> <Paragraph position="0"> The general idea of adaptive learning is to adjust the model parameters (in this paper, the lexical scores and the syntactic scores) to achieve the desired criterion (in our case, to minimize the error rate). To explain clearly how adaptive learning works, we take the sentence "I saw a man." as an example.</Paragraph> <Paragraph position="9"> Table 1. Categories for words and their log word-to-category scores.</Paragraph> <Paragraph position="10"> In Table 1, the log word-to-category score, log P(c | w), for each word is estimated from the training corpus by calculating relative frequencies. For example, in the training corpus, the word "I" is used as a pronoun 60 times and as a noun 40 times. The log word-to-category scores for "I" are then log P(pron | I) = log(60/100) = -0.22 and log P(n | I) = log(40/100) = -0.40.</Paragraph> <Paragraph position="11"> In this example, there are 2 × 2 × 2 × 1 = 8 possible different ways to assign lexical categories to the input sentence, as sketched below.</Paragraph>
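<Paragraph> The sketch below enumerates those assignments and sums the log word-to-category scores, which is how the lexical part of the example is scored in the following paragraphs. The entries for "I" (-0.22 and -0.40) are taken from the text; the remaining values are illustrative placeholders chosen only so that the totals agree with the worked example, and the bigram term of Eq. (7) is omitted here, as it is in the example itself.

    from itertools import product

    # Log word-to-category scores in the spirit of Table 1 (values for "I" from the
    # text; the others are placeholders consistent with the worked example).
    LOG_CAT_SCORE = {
        "I":   {"pron": -0.22, "n": -0.40},
        "saw": {"vt": -0.16, "vi": -1.00},
        "a":   {"art": -0.02, "prep": -0.81},
        "man": {"n": 0.00},
    }
    WORDS = ["I", "saw", "a", "man"]

    # 2 * 2 * 2 * 1 = 8 possible category assignments for the sentence.
    for cats in product(*(LOG_CAT_SCORE[w].keys() for w in WORDS)):
        log_lex = sum(LOG_CAT_SCORE[w][c] for w, c in zip(WORDS, cats))
        print(cats, round(log_lex, 2))
</Paragraph>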
<Paragraph position="1"> The lexical categories (i.e., parts of speech) and the corresponding log scores for each word are those listed in Table 1. When all eight possible lexical sequences are parsed, only four of them are accepted by our parser. They are listed as follows: 1. pron vt art n; 2. n vt art n; 3. pron vi prep n; 4. n vi prep n.</Paragraph> <Paragraph position="2"> The syntactic scores of the different parse trees are then calculated according to Eq. (10). A parse tree corresponding to the lexical sequence [pron vt art n] is taken as an example.</Paragraph> <Paragraph position="3"> The log syntactic scores for those four grammatical inputs are computed and listed in Table 2.

    Table 2. Log syntactic scores of the grammatical input lexical sequences.
    [pron vt art n]   (-0.7) + (-0.3) + (-0.3) + (-0.2) = -1.5
    [n vt art n]      (-0.2) + (-0.3) + (-0.3) + (-0.2) = -1.0
    [pron vi prep n]  (-0.7) + (-0.7) + (-0.4) + (-0.3) = -2.1
    [n vi prep n]     (-0.2) + (-0.7) + (-0.4) + (-0.3) = -1.6
</Paragraph> <Paragraph position="4"> According to Eq. (5), the total log integrated score (log S_lex + log S_syn) for each parsed sentence hypothesis is calculated. For example, the log lexical score for "I/[pron] saw/[vt] a/[art] man/[n]" is (-0.22) + (-0.16) + (-0.02) + (0) = -0.40. Finally, the log integrated scores for the above grammatical inputs are listed as follows: candidate 1, log integrated score = (-0.40 - 1.5 = -1.90): I/[pron] saw/[vt] a/[art] man/[n]; candidate 2, log integrated score = (-0.57 - 1.0 = -1.57): I/[n] saw/[vt] a/[art] man/[n]; candidate 3, log integrated score = (-2.04 - 2.1 = -4.14): I/[pron] saw/[vi] a/[prep] man/[n]; candidate 4, log integrated score = (-2.21 - 1.6 = -3.81): I/[n] saw/[vi] a/[prep] man/[n].</Paragraph> <Paragraph position="5"> Among these four candidates, candidate 1 is regarded as the desired selection by linguists. Since our decision criterion selects the candidate with the highest integrated score, i.e., the second one, I/[n] saw/[vt] a/[art] man/[n], a decision error results in this case.</Paragraph> <Paragraph position="6"> To remedy this error, the adaptive learning procedure is adopted to adjust the score values, including the lexical and syntactic scores, iteratively until the integrated score of the correct candidate (i.e., candidate 1) rises to the highest rank. In this paper, the parameters adjusted by the adaptive learning procedure are the log scores, including log P(c_{k,i} | w_i), log P(c_{k,i} | c_{k,i-1}), and log P(L_i | L_{i-1}). The amount of adjustment in each iteration depends on the misclassification distance, which is defined as the difference between the score of the correct candidate and that of the top candidate. (In the above example, distance = (score of correct candidate) - (score of top candidate) = (-1.90) - (-1.57) = -0.33.) From iteration to iteration, the parameters (both lexical and syntactic scores) are adjusted so that the integrated score of the correct candidate is increased while the integrated score of the wrong candidate is decreased. The learning procedure for a sentence is stopped when the correct candidate of that sentence is selected. To make the explanation of this adaptive learning procedure clear, we assume the lexical scores are unchanged during learning; that is, only the parameters of the syntactic scores are adjusted.</Paragraph>
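<Paragraph> The decision step of this example can be written down directly. The sketch below reproduces only the argmax selection of Eq. (5) and the misclassification distance; the four (lexical, syntactic) score pairs are the values computed above.

    # (description, log lexical score, log syntactic score) for the four candidates;
    # index 0 is the linguist-preferred candidate.
    candidates = [
        ("I/pron saw/vt a/art  man/n", -0.40, -1.5),
        ("I/n    saw/vt a/art  man/n", -0.57, -1.0),
        ("I/pron saw/vi a/prep man/n", -2.04, -2.1),
        ("I/n    saw/vi a/prep man/n", -2.21, -1.6),
    ]
    correct = 0

    # Eq. (5): log integrated score = log lexical score + log syntactic score.
    integrated = [lex + syn for _, lex, syn in candidates]
    top = max(range(len(candidates)), key=lambda i: integrated[i])

    # Misclassification distance: (score of correct candidate) - (score of top candidate).
    distance = integrated[correct] - integrated[top]
    print(top, round(distance, 2))  # selects candidate 2 (index 1); distance = -0.33
</Paragraph>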
<Paragraph> The details of adaptive learning for adjusting the syntactic scores are traced iteration by iteration (in the trace, * denotes the top candidate and Δ denotes the desired candidate). It is clear that after the second iteration, the parameters have been adjusted so that the desired candidate (i.e., candidate 1) is selected.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2. Procedure of Discrimination Learning </SectionTitle> <Paragraph position="0"> Since a correct decision depends only upon the correct rank ordering of the integrated scores of all ambiguities, not upon their actual values, a discrimination-oriented approach should directly pursue the correct rank ordering. To derive the discrimination function, the probability scores mentioned above are first jointly considered. Then a discrimination-oriented function, namely g(.), is defined as a measurement of the above-mentioned score functions, so that it can well preserve the correct rank ordering [Su 91a].</Paragraph> <Paragraph position="1"> Here, g(.) is chosen as the weighted sum of the log lexical and log syntactic scores, i.e.,

    g_k = Σ_{i} [ w_lex λ_lex(i) + w_syn λ_syn(i) ],
</Paragraph> <Paragraph position="3"> where λ_lex(i) = log P(c_{k,i} | c_{k,1}^{i-1}, w_1^n) and λ_syn(i) = log P(L_i | L_{i-1}) stand for the log lexical score and the log syntactic score of the i-th word for the k-th syntactic ambiguity, respectively. In addition, w_lex and w_syn correspond to the weights of the lexical and syntactic scores, respectively.</Paragraph> <Paragraph position="4"> If the parse tree of a sentence is misselected, the parameters (i.e., the lexical and the syntactic scores) are adjusted via the proposed adaptive learning procedure; otherwise, no parameters are adjusted. When a misselection occurs, the misclassification distance d_sc is less than zero. This misclassification distance is defined as the difference between the log integrated score of the correct candidate and that of the top candidate. A specific term of the syntactic score components of the correct candidate in the (t+1)-th iteration, say λ_syn^(t+1)(j), is adjusted as follows:

    λ_syn^(t+1)(j) = λ_syn^(t)(j) + Δλ_syn(j),  if d_sc < 0;
    λ_syn^(t+1)(j) = λ_syn^(t)(j),              otherwise.    (13)

At the same time, the corresponding term of the syntactic score components of the top candidate is adjusted according to the following formula:

    λ_syn^(t+1)(j) = λ_syn^(t)(j) - Δλ_syn(j),  if d_sc < 0;
    λ_syn^(t+1)(j) = λ_syn^(t)(j),              otherwise,    (14)

where Δλ_syn(j) is the amount of adjustment; it is given by Eq. (15) in terms of the misclassification distance, a constant d_0 which stands for a window size, and ε, the learning constant that controls the speed of convergence. The learning rule for adjusting the lexical scores can be represented in a similar manner. Notice that only the parameters of the top candidate and those of the correct candidate are adjusted when a misselection occurs; the parameters of the other wrong candidates are not adjusted in this adaptive learning procedure. From Eq. (13), (14), and (15), it is clear that the score of the correct candidate will increase and that of the wrong candidate will decrease from iteration to iteration until the correct candidate is selected. For the purpose of clarity, the detailed derivations of the above adaptive learning procedure are not given here; interested readers can contact the authors for details.</Paragraph>
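<Paragraph> A minimal sketch of the basic learning loop of Eq. (13)-(15) follows. Two simplifications are assumptions made here for illustration only: the adjustment Δλ_syn(j) is replaced by a fixed learning constant ε (the text characterizes it through ε and the window size d_0 without giving the closed form here), and the whole log syntactic score of a candidate stands in for its individual terms, as in the worked example where the lexical scores are frozen.

    def basic_learning(lex_scores, syn_scores, correct, epsilon=0.1, max_iters=100):
        # Adjust the syntactic scores until the correct candidate ranks on top.
        syn_scores = list(syn_scores)
        for _ in range(max_iters):
            total = [l + s for l, s in zip(lex_scores, syn_scores)]   # Eq. (5)
            top = max(range(len(total)), key=lambda i: total[i])
            d_sc = total[correct] - total[top]        # misclassification distance
            if d_sc >= 0:                             # correct candidate already selected
                break
            syn_scores[correct] += epsilon            # Eq. (13): raise the correct candidate
            syn_scores[top] -= epsilon                # Eq. (14): lower the current top candidate
        return syn_scores

    # Running example: with epsilon = 0.1 the correct candidate (index 0)
    # overtakes the wrongly selected one after two iterations.
    lex = [-0.40, -0.57, -2.04, -2.21]
    syn = [-1.5, -1.0, -2.1, -1.6]
    print(basic_learning(lex, syn, correct=0))
</Paragraph>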
</Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3. Robustness Issues </SectionTitle> <Paragraph position="0"> Since it is easy to improve the performance on a training set by adopting a model with a large number of parameters, the error rate measured on the training set frequently turns out to be over-optimistic.</Paragraph> <Paragraph position="1"> Moreover, the parameters estimated from the training corpus may be quite different from those encountered in real applications. These phenomena may occur because of finite sampling size, style mismatch, domain mismatch, and so on. To achieve better performance in a real application, one must deal with the possible mismatch of parameters, or statistical variation, between the training corpus and the real application. One way to achieve this goal is to enlarge the inter-class distance so as to attain maximum separation [Su 91a] between the correct candidate and the other candidates. That is, this approach provides a tolerance zone between different candidates to allow for possible data scattering in the real application.</Paragraph> <Paragraph position="2"> Traditional adaptive learning methods [Amar 67, Kata 90] stop adjusting parameters once the input pattern has been correctly classified. However, if we stop adjusting parameters as soon as the observations are correctly classified in the training corpus, the distance between the correct candidate and the other ambiguities may still be too small. The model is then vulnerable to possible modeling errors and to statistical variations between the training corpus and the real application.</Paragraph> <Paragraph position="3"> Su [Su 91a] has proposed a robust learning procedure which continues to enlarge the margin between the correct candidate and the top competing one, even if the syntax tree of the sentence has already been correctly selected. That is, the parameters stop being adjusted only after the distance between the correct candidate and the others has exceeded a given threshold. The learning rules in Eq. (13) and (14) are then modified as follows.</Paragraph> <Paragraph position="4"> If d_sc ≤ δ, where δ is a preset margin, the syntactic score terms of the correct candidate in the (t+1)-th iteration are adjusted according to the following formula:

    λ_syn^(t+1)(j) = λ_syn^(t)(j) + Δλ_syn(j),  if d_sc ≤ δ;
    λ_syn^(t+1)(j) = λ_syn^(t)(j),              otherwise.    (16)
</Paragraph> <Paragraph position="5"> And the corresponding syntactic score terms of the top competing candidate are adjusted as follows:

    λ_syn^(t+1)(j) = λ_syn^(t)(j) - Δλ_syn(j),  if d_sc ≤ δ;
    λ_syn^(t+1)(j) = λ_syn^(t)(j),              otherwise.    (17)
</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Simulations </SectionTitle> <Paragraph position="0"> The following experiments are conducted to investigate the advantage of the proposed discrimination and robustness oriented adaptive learning procedure. In the experiments, 4,000 sentences extracted from IBM technical manuals are first associated with their corresponding correct category sequences and correct parse trees by linguists. The corpus is then partitioned into a training corpus of 3,200 sentences and a test set of 800 sentences.</Paragraph> <Paragraph position="1"> Next, the lexical and syntactic probabilities are estimated from the data in the training corpus. Afterwards, the sentences in the test set are used to evaluate the performance of the proposed algorithm using the estimated lexical and syntactic probabilities. This integrated score function approach using the estimated probabilities is considered as the baseline system.</Paragraph>
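<Paragraph> The robust version evaluated in the experiments below differs from the basic rule only in its stopping condition: learning continues until the correct candidate leads its best competitor by the preset margin δ. The sketch keeps the same simplifying assumptions as the previous one (a fixed adjustment ε in place of Δλ_syn(j), whole-candidate scores in place of individual terms), and the value of δ here is arbitrary.

    def robust_learning(lex_scores, syn_scores, correct, delta=0.5, epsilon=0.1, max_iters=100):
        # Eq. (16)-(17): keep adjusting until the margin over the best competitor exceeds delta.
        syn_scores = list(syn_scores)
        for _ in range(max_iters):
            total = [l + s for l, s in zip(lex_scores, syn_scores)]
            competitor = max((i for i in range(len(total)) if i != correct),
                             key=lambda i: total[i])          # best-scoring wrong candidate
            d_sc = total[correct] - total[competitor]
            if d_sc > delta:                                  # margin reached: stop adjusting
                break
            syn_scores[correct] += epsilon                    # Eq. (16)
            syn_scores[competitor] -= epsilon                 # Eq. (17)
        return syn_scores

    # With delta = 0 this reduces to the basic rule of the previous sketch.
    lex = [-0.40, -0.57, -2.04, -2.21]
    syn = [-1.5, -1.0, -2.1, -1.6]
    print(robust_learning(lex, syn, correct=0))
</Paragraph>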
<Paragraph> The performance of discrimination-oriented adaptive learning with and without the robustness enhancement is then evaluated. The accuracy rates of syntactic ambiguity resolution for the training corpus and the test set are summarized in Table 3 and Table 4. (The correct parse tree has to be selected from all possible parses allowed by the grammars of the system; therefore, the baseline performance is evaluated under a highly ambiguous environment.) Table 3 shows that the syntax tree accuracy rate is improved from 46% to 56.88% using the basic version of the discrimination oriented adaptive learning procedure. This significant improvement shows the superiority of the adaptive learning procedure for dealing with the disambiguation task. Furthermore, when the robust version of the learning procedure is adopted, the performance is improved further (from 56.88% to 60.62%). This means that the robustness of the learning procedure is indeed enhanced by enlarging the distance between the correct candidate and the other candidates. Moreover, not only the accuracy rate of the syntax trees is improved by adaptive learning, but also that of the lexical sequences. In this paper, a lexical sequence is regarded as "correct" only if all the lexical categories in a sentence perfectly match those selected by the linguists. In other words, we are measuring the "sentence accuracy rate", in contrast to the "word accuracy rate" adopted in [Chur 88, Gars 87]. Table 4 shows that the basic version of the adaptive learning procedure improves the sentence accuracy rate of the lexical sequences by about 5% (from 77.12% to 82.38%).</Paragraph> <Paragraph position="2"> Again, with the robust version of learning, the accuracy rate of the lexical sequences is greatly enhanced. The behavior of each iteration of the adaptive learning process is shown in Figure 2. From this figure we can conclude that if the robustness issues are not considered during learning, the performance on the test set decreases as the training process goes on; this is the phenomenon of over-tuning. However, by forcing the learning procedure to continue until the separation between the correct candidate and the top one exceeds the desired margin, the performance on the test set can be further improved, and no degradation phenomenon is observed.</Paragraph> <Paragraph position="3"> Figure 2. Syntax tree accuracy rate versus iterations for the basic and robust versions of adaptive learning.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. Summary </SectionTitle> <Paragraph position="0"> Because of insufficient training data and the approximation errors introduced by the language model, traditional statistical approaches, which resolve ambiguities indirectly and implicitly with the maximum likelihood method, fail to achieve high performance in real applications. To overcome these problems, adaptive learning is proposed to pursue the goal of minimizing the discrimination error directly.</Paragraph> <Paragraph position="1"> The performance of syntactic ambiguity resolution is significantly improved using the discrimination oriented analysis. In addition, the sentence accuracy rate of the lexical sequences is also improved.
Moreover, the performance is further enhanced by using the robust version of the learning procedure, which enlarges the margin between the correct candidate and the other candidates. The final results show that, using the basic version of learning, the syntax tree selection accuracy rate is improved by about 10% (from 46% to 56.88%), and the total improvement is over 14% using the robust version of learning. Also, the sentence accuracy rate for lexical sequences is improved from 77.12% to 82.38% and 87.88% using the basic and robust versions of the learning procedure, respectively.</Paragraph> </Section> </Paper>