<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0110"> <Title>Statistical Models for Deep-structure Disambiguation</Title> <Section position="2" start_page="0" end_page="113" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> For many natural language processing tasks, e.g., machine translation, a system usually needs to apply several kinds of knowledge to analyze an input sentence and to represent the analysis as a deep structure that identifies the thematic roles (cases) of constituents and the senses of words. However, ambiguity and uncertainty exist at different levels of analysis. To resolve them, the related knowledge sources should be properly represented and integrated. Conventional approaches to case identification usually require considerable human effort to encode ad hoc rules [1,2,3]. Such a rule-based system is, in general, very expensive to construct and difficult to maintain. In contrast, a statistics-oriented corpus-based approach achieves disambiguation with a parameterized model whose parameters are estimated and tuned from a training corpus. In this way, the system can be easily scaled up and well trained on the basis of well-established theories.</Paragraph> <Paragraph position="1"> However, statistical approaches reported in the literature [4,5,6,7] usually use only surface-level information, e.g., collocations and word associations, without taking structural information, such as syntax and thematic roles, into consideration. In general, structural features that characterize long-distance dependencies can provide more relevant correlation information between words. Therefore, word association information can be trained and applied more effectively by considering the structural features. 
In many tasks, such as natural language understanding and machine translation, deep-structure information other than word sense is often required.</Paragraph> <Paragraph position="2"> Nevertheless, little research has been reported that provides both thematic role and word sense information with a statistical approach.</Paragraph> <Paragraph position="3"> Motivated by the above concerns, this paper proposes an integrated score function that encodes lexical, syntactic and semantic information in a uniform formulation. Based on the integrated score function, the lexical score function, the syntactic score function, and the semantic score function are derived. Accordingly, several models encoding structure information in the semantic score formulation are proposed for case identification and word-sense discrimination.</Paragraph> <Paragraph position="4"> To minimize the number of parameters needed to specify the deep structure, a deep-structure representation called the normal form, which adopts a &quot;predicate-argument&quot; style, is used in our system. With this normal form representation, the senses of content words and the relationships among constituents in a sentence can be well specified. 
The normal form used here is quite general and flexible; therefore, it is also applicable to other tasks.</Paragraph> <Paragraph position="5"> When the parameters of the proposed score function are estimated with the maximum likelihood estimation (MLE) method, the baseline system achieves a parsing accuracy rate of 56.3%, a case identification rate of 77.5%, and a word-sense discrimination accuracy rate of 86.2%.</Paragraph> <Paragraph position="6"> Furthermore, to reduce the estimation error resulting from the MLE, Good-Turing's smoothing method is applied; a significant improvement is obtained with this parameter smoothing method.</Paragraph> <Paragraph position="7"> Finally, a robust discriminative learning algorithm is derived in this paper to minimize the testing set error, and very promising results are obtained with this algorithm. Compared with the baseline system, error reduction rates of 17.4% for sense discrimination, 50.7% for case identification, and 47.4% for parsing accuracy are obtained. These results clearly demonstrate the superiority of the proposed models for deep-structure disambiguation.</Paragraph> </Section> <Section position="3" start_page="113" end_page="118" type="metho"> <SectionTitle> 2 The Integrated Score Function </SectionTitle> <Paragraph position="0"> The block diagram of the deep-structure disambiguation system is illustrated in Figure 1.</Paragraph> <Paragraph position="1"> As shown, the input word sequence is first tagged with the possible part-of-speech sequences. A word sequence would, in general, correspond to more than one part-of-speech sequence. The parser analyzes the part-of-speech sequences and then produces the corresponding parse trees. Afterwards, the parse trees are analyzed by the semantic interpreter, and various interpretations represented in the normal form are generated. Finally, the proposed integrated score function is adopted to select the most plausible normal form as the output. 
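</Paragraph> <Paragraph position="1"> The selection step can be sketched as follows. This is only a minimal illustration, assuming that each candidate analysis already carries a semantic, a syntactic, and a lexical score and that the integrated score is their product; the `Candidate` type, the candidate labels, and all numbers are hypothetical and are not taken from the system described here.

```python
import math
from typing import NamedTuple


class Candidate(NamedTuple):
    """One hypothetical analysis of a sentence with its three component scores."""
    normal_form: str
    parse_tree: str
    tags: str
    s_sem: float  # semantic score, assumed to be a probability in (0, 1]
    s_syn: float  # syntactic score
    s_lex: float  # lexical score


def log_score(c):
    # The sum of logs equals the log of the product of the three scores.
    return math.log(c.s_sem) + math.log(c.s_syn) + math.log(c.s_lex)


def select_best(candidates):
    # arg max over all (normal form, parse tree, tag sequence) candidates
    return max(candidates, key=log_score)


# Made-up candidates for a single input sentence.
candidates = [
    Candidate("NF-a", "tree-1", "tags-1", 0.20, 0.50, 0.60),
    Candidate("NF-b", "tree-2", "tags-1", 0.40, 0.30, 0.60),
    Candidate("NF-c", "tree-3", "tags-2", 0.10, 0.70, 0.40),
]
best = select_best(candidates)
print(best.normal_form)
```

Working with log scores turns the product into a sum, which helps to avoid numerical underflow for long sentences.</Paragraph> <Paragraph position="1">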
The formulation of the scoring mechanism is derived as follows.</Paragraph> <Section position="1" start_page="113" end_page="114" type="sub_section"> <SectionTitle> Scoring Module </SectionTitle> <Paragraph position="0"> [Figure 1. Block diagram of the deep-structure disambiguation system: Tagger, Parser, Semantic Interpreter, and Scoring Module.] For an input sentence, say $W$, of $n$ words $w_1, w_2, \ldots, w_n$, the task of deep-structure disambiguation is formulated as finding the best normal form $\hat{N}$, parse tree $\hat{L}$, and part-of-speech sequence $\hat{T}$ such that $(\hat{N}, \hat{L}, \hat{T}) = \arg\max_{N_i, L_j, T_k} P(N_i, L_j, T_k \mid W)$, where $N_i$, $L_j$, $T_k$ denote the i-th normal form, the j-th parse tree and the k-th part-of-speech sequence, respectively; $P(N_i, L_j, T_k \mid W)$ is called the integrated score function. For computation, the integrated score function is further decomposed into the following equations.</Paragraph> <Paragraph position="2"> where $S_{sem}(N_i)$, $S_{syn}(L_j)$, and $S_{lex}(T_k)$ stand for the semantic score function, syntactic score function, and lexical score function, respectively; they are defined as follows:</Paragraph> <Paragraph position="4"> The derivations of these score functions are addressed as follows.</Paragraph> </Section> <Section position="2" start_page="114" end_page="114" type="sub_section"> <SectionTitle> 2.1 The Lexical Score </SectionTitle> <Paragraph position="0"> The lexical score for the k-th lexical (part-of-speech) sequence $T_k$ associated with the input word sequence $W$ is expressed as follows: $S_{lex}(T_k) = P(T_k \mid W) = P(t_{k,1}^{k,n} \mid w_1^n) = \frac{P(w_1^n \mid t_{k,1}^{k,n}) \times P(t_{k,1}^{k,n})}{P(w_1^n)}$, where $t_{k,i}$, denoting the i-th part-of-speech in $T_k$, stands for the part-of-speech assigned to $w_i$. Since $P(w_1^n)$ is the same for all possible lexical sequences, this term can be ignored without affecting the final disambiguation results. Therefore, $S'_{lex}(T_k)$ $(= P(w_1^n \mid t_{k,1}^{k,n}) \times P(t_{k,1}^{k,n}))$ instead of $S_{lex}(T_k)$ is used in our implementation. 
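</Paragraph> <Paragraph position="0"> A lexical score of this kind can be sketched as below. The sketch assumes the usual trigram factorization, in which each word depends only on its own tag and each tag depends only on the two preceding tags; the function name `lexical_log_score`, the `emit` and `trans` tables, the `BOS` padding symbol, and all probabilities are hypothetical and serve only as an illustration.

```python
import math


def lexical_log_score(words, tags, emit, trans, pad="BOS"):
    """Log of prod_i P(w_i | t_i) * P(t_i | t_{i-2}, t_{i-1})."""
    history = (pad, pad)  # the two previous tags, padded at sentence start
    total = 0.0
    for word, tag in zip(words, tags):
        total += math.log(emit[(word, tag)])
        total += math.log(trans[(history[0], history[1], tag)])
        history = (history[1], tag)
    return total


# Toy, made-up probabilities for a two-word input.
emit = {("allow", "v"): 0.5, ("allow", "n"): 0.1, ("warm-up", "n"): 0.2}
trans = {
    ("BOS", "BOS", "v"): 0.3,
    ("BOS", "v", "n"): 0.4,
    ("BOS", "BOS", "n"): 0.5,
    ("BOS", "n", "n"): 0.2,
}
words = ["allow", "warm-up"]
score_vn = lexical_log_score(words, ["v", "n"], emit, trans)
score_nn = lexical_log_score(words, ["n", "n"], emit, trans)
print(score_vn > score_nn)
```

With these toy numbers the verb-noun reading receives the higher lexical score; a real tagger would also need smoothed estimates for unseen words and tag trigrams.</Paragraph> <Paragraph position="0">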
Like the standard trigram tagging procedures, the lexical score $S'_{lex}(T_k)$ is expressed as follows:</Paragraph> <Paragraph position="2"/> </Section> <Section position="3" start_page="114" end_page="115" type="sub_section"> <SectionTitle> 2.2 The Syntactic Score </SectionTitle> <Paragraph position="0"> The tree in Figure 2 is used as an example to explain the syntactic score function. The basic derivation of the syntactic score includes the following steps.</Paragraph> <Paragraph position="1"> * First, the tree is decomposed into a number of phrase levels, such as $L_1, L_2, \ldots$ in Fig. 2. * Secondly, the transition between phrase levels is formulated as a context-sensitive rewriting process. With this formulation, each transition probability between two phrase levels is calculated by consulting a finite-length window that comprises the symbols to be reduced and their left and right contexts.</Paragraph> <Paragraph position="2"> A</Paragraph> <Paragraph position="4"> Let the label $t_i$ in Fig. 2 be the time index for the i-th state transition, which corresponds to a reduce action, and $L_i$ be the i-th phrase level. Then the syntactic score of the tree $L_A$ in Figure 2 is defined as follows [8,9]:</Paragraph> <Paragraph position="6"> where C/ and \$ correspond to the begin-of-sentence and the end-of-sentence symbols, respectively; $l_i$ and $r_i$ stand for the left and right contextual symbols to be consulted in the i-th phrase level.</Paragraph> <Paragraph position="7"> If $M$ left contextual symbols and $N$ right contextual symbols are consulted in the computation, the model is said to operate in the $L_M R_N$ mode.</Paragraph> <Paragraph position="8"> Note that each pair of phrase levels in the above equation corresponds to a change in the LR parser's stack before and after an input word is consumed by a shift operation. 
Because the total number of shift actions, which equals the number of product terms in the above equation, is always the same for all alternative syntactic trees, the normalization problem is resolved in such a formulation.</Paragraph> <Paragraph position="9"> Moreover, the syntactic score formulation provides a way to consider both intra-level context-sensitivity and inter-level correlation of the underlying context-free grammar. With such a formulation, the capability of context-sensitive parsing (in a probabilistic sense) can be achieved with a context-free grammar.</Paragraph> </Section> <Section position="4" start_page="115" end_page="118" type="sub_section"> <SectionTitle> 2.3 The Semantic Score </SectionTitle> <Paragraph position="0"> To simplify the computation of the semantic score, the semantic interpreter first applies a structure normalization procedure that converts a parse tree into an intermediate normal form, called normal form one (NF1), which preserves all information relevant to the identification of cases and word senses. The normalization procedure consists of a syntactic normalization procedure and a semantic normalization procedure.</Paragraph> <Paragraph position="1"> In the syntactic normalization procedure, parse trees that are syntactically equivalent are normalized first. Such syntactic variants may result from a writing convention, function words, or non-discriminative syntactic information, such as punctuation markers. Excessive nodes for identifying the various bar levels in the phrase structure grammar are also deleted or compacted. Afterwards, different syntactic structures that are semantically equivalent are normalized to the desired normal form (NF) structure. In the NF representation, the tense, modal, voice and type information of a sentence are extracted as features. 
Taking the sentence &quot;To meet spectrum-analyzer specification, allow a 30-min warm-up before making any measurement.&quot; as an example, the parse tree, NF1, and the desired normal form structure are illustrated in Figure 3.</Paragraph> <Paragraph position="2"> To compute the semantic score, the normal form is first decomposed into a series of production rules in a top-down and leftmost-first manner, where each decomposed production rule corresponds to a &quot;case subtree&quot;. For instance, the normal form in Figure 3(c) is decomposed into the following case subtrees: $\Gamma_1$: PROP --> PURP VACTN GOAL TIME; $\Gamma_2$: PURP --> VSTAT GOAL; $\Gamma_3$: GOAL --> HEAD HEAD; $\Gamma_4$: GOAL --> HEAD HEAD; $\Gamma_5$: TIME --> VACTN THEME; $\Gamma_6$: GOAL --> QUAN HEAD.</Paragraph> <Paragraph position="3"> Similarly, the NF1 structure is also decomposed into another set of production rules, each of which corresponds to a Normal Form One (NF1) subtree. For example, the NF1 structure in Figure 3(b) is decomposed into the following NF1 subtrees: $\Phi_1$: S --> SS* v NP SS**; $\Phi_2$: SS* --> v NP; $\Phi_3$: NP --> n n; $\Phi_4$: NP --> n n; $\Phi_5$: SS** --> v NP; $\Phi_6$: NP --> quan n.</Paragraph> <Paragraph position="4"> In such a way, the semantic score can be defined in terms of the case subtrees and the NF1 subtrees.</Paragraph> <Paragraph position="6"/> <Paragraph position="8"> Formally, regarding the NF1 alternatives, the semantic score $S_{sem}(N_i)$ can be expressed as follows:</Paragraph> <Paragraph position="10"> where $\Phi_j$ denotes the possible NF1 structures with respect to $N_i$ and $L_j$. Theoretically, a parse tree may be normalized into more than one NF1 structure; however, this seldom happens in our case. That is, the normalization procedure can almost always be considered a one-to-one mapping, which indicates $P(\Phi_j \mid L_j, T_k, W) = 1$ in our task. 
Under this assumption, the semantic score can be simplified as: $S_{sem}(N_i) = P(N_i \mid \Phi_j, L_j, T_k, W)$. Since the normal form comprises the cases of constituents and the senses of content words, the normal form can thus be rewritten as $N_i = \{s_{i,1}^{i,n}, \Gamma_{i,1}^{i,M_i}\}$, where $s_{i,1}^{i,n}$ are the word senses corresponding to $W (= w_1^n)$ and $\Gamma_{i,1}^{i,M_i} = \{\Gamma_{i,1}, \Gamma_{i,2}, \ldots, \Gamma_{i,M_i}\}$ are the $M_i$ case subtrees which define the structure of the normal form $N_i$. In such a way, the semantic score is rewritten as follows: $S_{sem}(N_i) = P(N_i \mid \Phi_j, L_j, T_k, W)$</Paragraph> <Paragraph position="12"> where $L_{j,1}^{j,N_j} = \{L_{j,1}, L_{j,2}, \ldots, L_{j,N_j}\}$ corresponds to the $N_j$ sentential forms (phrase levels) with respect to the parse tree $L_j$, and $\Phi_{j,1}^{j,M_i} = \{\Phi_{j,1}, \Phi_{j,2}, \ldots, \Phi_{j,M_i}\}$ stands for the NF1 subtrees transformed from $L_{j,1}^{j,N_j}$; $S_{sense}(s_{i,1}^{i,n})$ and $S_{case}(\Gamma_{i,1}^{i,M_i}) = P(\Gamma_{i,1}^{i,M_i} \mid \Phi_{j,1}^{j,M_i}, L_{j,1}^{j,N_j}, t_{k,1}^{k,n}, w_1^n)$ are, respectively, the word-sense score and the case score.</Paragraph> <Paragraph position="13"> Different models for case identification and word-sense disambiguation are further derived below.</Paragraph> </Section> </Section> <Section position="4" start_page="118" end_page="120" type="metho"> <SectionTitle> * Case Identification Models </SectionTitle> <Paragraph position="0"> To derive the case identification model, it is assumed that the information required for case identification from the parse tree $L_{j,1}^{j,N_j}$, the parts of speech $t_{k,1}^{k,n}$, and the words $w_1^n$ has been represented by the NF1. Based on this assumption, the case score $S_{case}(\Gamma_{i,1}^{i,M_i})$ is thus approximated as follows:</Paragraph> <Paragraph position="2"> Again, the number of parameters required to model such a formulation is still too large to afford unless more assumptions are made.</Paragraph> <Paragraph position="3"> Since the decomposition of the normal form structures has been carried out in the top-down and leftmost-first manner, the case subtree $\Gamma_{i,m}$ depends on its previously decomposed case subtrees, which are either the siblings or the ancestors of the subtree $\Gamma_{i,m}$. 
Therefore, in addition to the NF1 representation $\Phi_{j,1}^{j,M_i}$, the determination of cases in the case subtree $\Gamma_{i,m}$ is assumed to be highly dependent on its ancestors and siblings. In computation, if $N$ ancestors and $R$ siblings of $\Gamma_{i,m}$ have been consulted, the case score function is approximated as:</Paragraph> <Paragraph position="5"> where $\Gamma_{a_i}$ and $\Gamma_{s_j}$ denote the i-th ancestor and the j-th sibling of $\Gamma_{i,m}$, respectively. A model using this case score function is hereby said to operate in an $A_N S_R$ mode. For example, when the model is operated in $A_1 S_0$ mode, the case score of the normal form in the previous example is expressed as:</Paragraph> <Paragraph position="7"> To make the word-sense score function feasible for implementation, we further assume that the senses of words depend only on the cases assigned to the words, the parts of speech, and the words themselves. Therefore, the word-sense score function is approximated as follows.</Paragraph> <Paragraph position="8"> $S_{sense}(s_{i,1}^{i,n}) = P(s_{i,1}^{i,n} \mid \Gamma_{i,1}^{i,M_i}, \Phi_{j,1}^{j,M_i}, L_{j,1}^{j,N_j}, t_{k,1}^{k,n}, w_1^n) \approx P(s_{i,1}^{i,n} \mid \Gamma_{i,1}^{i,M_i}, t_{k,1}^{k,n}, w_1^n) = P(s_{i,1}^{i,n} \mid c_{i,1}^{i,n}, t_{k,1}^{k,n}, w_1^n) = \prod_{m=1}^{n} P(s_{i,m} \mid s_{i,1}^{i,m-1}, c_{i,1}^{i,n}, t_{k,1}^{k,n}, w_1^n)$, where $c_{i,m}$ denotes the case of $w_m$, which is specified by the case subtrees $\Gamma_{i,1}^{i,M_i}$. Currently, a simplified model, called the case-dependent (CD) model, is implemented in this paper. In the case-dependent model, the sense of a word is assumed to depend on its case role, its part-of-speech and the word itself. Thus, the word-sense score in this model, denoted by $S_{sense}^{(CD)}$, is approximated as follows: $S_{sense}^{(CD)}(s_{i,1}^{i,n}) = \prod_{m=1}^{n} P(s_{i,m} \mid c_{i,m}, t_{k,m}, w_m)$.</Paragraph> </Section> <Section position="5" start_page="120" end_page="121" type="metho"> <SectionTitle> 3. 
The Baseline System </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="120" end_page="121" type="sub_section"> <SectionTitle> 3.1 Experimental Setup </SectionTitle> <Paragraph position="0"> A. Corpora: 3,000 English sentences, extracted from computer manuals and related documents, are collected and parsed by the BehaviorTran system [10], which is a commercialized English-to-Chinese machine translation system developed by Behavior Design Corporation (BDC). The correct parts of speech, parse trees and normal forms for the collected sentences are verified by linguistic experts. To eliminate possible systematic biases, the corpus is then randomly partitioned into a training set of 2,200 sentences and a testing set of the remaining 800 sentences. The average number of words per sentence is 13.9 for the training set and 13.8 for the testing set. On average, there are 34.2 alternative parse trees per sentence for the training set, and 31.2 for the testing set.</Paragraph> <Paragraph position="1"> B. Lexicon: The lexicon contains 4,522 distinct words extracted from the corpus. Different sense definitions of these words are extracted from the Longman English-Chinese Dictionary of Contemporary English. For words that are not included in the Longman dictionary, the senses are defined according to the system dictionary of the BehaviorTran system. In total, there are 12,627 distinct senses for those 4,522 words.</Paragraph> <Paragraph position="2"> C. Phrase Structure Rules: The grammar is composed of 1,088 phrase structure rules, expressed in terms of 35 terminal symbols (parts of speech) and 95 nonterminal symbols.</Paragraph> <Paragraph position="3"> D. Case Set: In the current system, the case set includes a total of 50 cases, which are designed for the next generation BehaviorTran MT system. 
Please refer to [11] for the details of the case set.</Paragraph> <Paragraph position="4"> To evaluate the performance of the proposed case identification models, the recall rate and the precision rate of case assignment, defined in the following equations, are used.</Paragraph> <Paragraph position="5"> $\text{recall} = \frac{\text{No. of matched case trees specified by the model}}{\text{Total no. of case trees specified by the linguistic experts}}$, $\text{precision} = \frac{\text{No. of matched case trees specified by the model}}{\text{Total no. of case trees specified by the model}}$, where a case tree specified by the model is said to match the correct one if the corresponding cases of the case tree are fully identical to those of the correct case tree.</Paragraph> </Section> <Section position="2" start_page="121" end_page="121" type="sub_section"> <SectionTitle> 3.2 Results and Discussions </SectionTitle> <Paragraph position="0"> In the baseline system, the parameters are estimated by using the maximum likelihood estimation (MLE) method. The results of the deep-structure disambiguation system with the $A_1 S_0$+CD model are summarized in Table 1. For comparison, the performance of the parser without being combined with the semantic interpreter is also listed in this table. As expected, the accuracy of parse tree selection is improved when the semantic interpreter is integrated.</Paragraph> <Paragraph position="1"> When the errors of the baseline system were examined, we found that many of them occur because many events are assigned zero probability. To eliminate this kind of estimation error, a parameter smoothing method, Good-Turing's formula [12], is adopted to improve the baseline system. 
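</Paragraph> <Paragraph position="1"> The idea behind this kind of smoothing can be sketched as follows. The code is a minimal illustration of the basic Good-Turing count adjustment, $r^* = (r+1) N_{r+1} / N_r$, and is not the implementation used in the experiments; the function name `good_turing`, the toy counts, and the fallback to the raw count when $N_{r+1}$ is zero are assumptions made only for this example.

```python
from collections import Counter


def good_turing(counts):
    """Return (adjusted_counts, unseen_mass) using the basic Good-Turing formula.

    r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of event types seen
    exactly r times. When N_{r+1} is zero the raw count r is kept; that
    fallback is an assumption of this sketch, not part of the formula.
    """
    total = sum(counts.values())
    freq_of_freq = Counter(counts.values())  # maps r to N_r
    adjusted = {}
    for event, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        adjusted[event] = (r + 1) * n_r1 / n_r if n_r1 else float(r)
    # Probability mass reserved for events never seen in training: N_1 / N
    unseen_mass = freq_of_freq.get(1, 0) / total
    return adjusted, unseen_mass


# Toy, made-up event counts.
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
adjusted, unseen_mass = good_turing(counts)
print(adjusted["a"], adjusted["d"], unseen_mass)
```

The adjusted counts lower the estimates of rarely seen events and reserve some probability mass for unseen ones, so that no event is assigned zero probability; a practical system would additionally smooth the $N_r$ values and renormalize the resulting probabilities.</Paragraph> <Paragraph position="1">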
The corresponding results are listed in the third column of Table 1, which shows that parameter smoothing improves the performance significantly.</Paragraph> <Paragraph position="2"> In addition, a robust learning algorithm, which has been shown to perform well in our previous work [9], is also applied to the system to minimize the error rate of the testing set. The basic idea of the robust learning algorithm is to adjust the parameters until the score differences between the correct candidate and its competitors exceed a preset margin. The parameters trained in this way therefore provide a tolerance zone for the mismatch between the training and the testing sets. Readers who are interested in the details of the learning algorithm are referred to [11]. When the robust learning algorithm is applied, very encouraging results are obtained. Compared with the baseline system, the error reduction rate is 50.7% for case identification, 17.4% for sense discrimination, and 47.4% for parsing accuracy. Relative to the parser before it is coupled with the semantic interpreter, the parsing accuracy improves from 50.1% to 77.0%, which corresponds to a 53.9% error reduction.</Paragraph> </Section> </Section> <Section position="6" start_page="121" end_page="122" type="metho"> <SectionTitle> 4 Error Analysis </SectionTitle> <Paragraph position="0"> To explore the areas for further improving the deep-structure disambiguation system, the errors for 200 sentences extracted randomly from the training corpus have been examined. It is found that a very large portion of the errors comes from syntactic ambiguity. More precisely, most syntactic errors result from attachment problems, including prepositional phrase attachment and the modification scope of adverbial phrases, adjective phrases and relative clauses. Less than 10% of the errors arise from incorrect parts of speech. 
Since the normal form cannot be correctly constructed without selecting the correct parse tree, errors of this type deteriorate system performance most seriously.</Paragraph> <Paragraph position="1"> In addition, errors in case identification are one of the problems that prevent the deep-structure disambiguation system from achieving a high normal form accuracy rate. Excluding the effect of syntactic ambiguity, we checked the errors of the semantic interpreter and found that 44.9% of normal form errors occur in identifying cases. When these errors are examined, it is found that more than 30% of the incorrect normal forms have only one erroneous case. Among them, many errors occur in assigning the case for the first noun of a compound noun. Taking the compound noun &quot;shipping materials&quot; as an example, the corresponding cases for the words &quot;shipping&quot; and &quot;materials&quot; are both annotated as the &quot;HEAD&quot; case in the corpus, as shown in Figure. However, they are assigned the cases &quot;MODIFIER&quot; and &quot;HEAD&quot;, respectively. Errors of this kind are usually tolerable for most applications.</Paragraph> <Paragraph position="2"> Another important type of case error arises in determining the class of a verb. A constituent with an action verb tends to prefer a case frame of the form [VACTN AGENT (INSTR ....), THEME], where AGENT, INSTR, and THEME are the arguments of the action verb, which is assigned the VACTN case. In contrast, a constituent with a stative verb would have a case frame of the form [VSTAT THEME GOAL]. Therefore, once the class of a verb is recognized incorrectly, the cases for the verb's arguments and adjuncts will not be identified correctly. Errors of this kind would thus have more serious effects on the case recall rate and the precision rate than on the case structure accuracy rate.</Paragraph> </Section> </Paper>