<?xml version="1.0" standalone="yes"?> <Paper uid="E95-1031"> <Title>A Robust Parser Based on Syntactic Information</Title> <Section position="4" start_page="223" end_page="223" type="metho"> <SectionTitle> 2 Algorithm and Heuristics </SectionTitle> <Section position="1" start_page="223" end_page="223" type="sub_section"> <SectionTitle> 2.1 General algorithm for least-errors recognition </SectionTitle> <Paragraph position="0"> The general algorithm for least-errors recognition (Lyon, 1974), which is based on Earley's algorithm, assumes that sentences may have insertion, deletion, and mutation errors of terminal symbols. The objective of the algorithm is to parse the input string with the least number of errors.</Paragraph> <Paragraph position="1"> A state used in this algorithm is a quadruple (p, j, f, e), where p is a production number in the grammar, j marks a position in RHS(p), f is the start position of the state in the input string, and e is an error value. A final state (p, p̄+1, f, e) denotes the recognition of a phrase RHS(p) with e errors, where p̄ is the number of components in rule p. A stateset S(i), where i is a position in the input, is an ordered set of states. States within a stateset are ordered by ascending value of j, within p, within f; f takes descending value.</Paragraph> <Paragraph position="2"> When adding to statesets, if state (p, j, f, e) is a candidate for admission to a stateset which already has a similar member (p, j, f, e') and e' ≤ e, then (p, j, f, e) is rejected. However, if e' > e, then (p, j, f, e') is replaced by (p, j, f, e).</Paragraph> <Paragraph position="3"> The algorithm works as follows: a procedure SCAN is carried out for each state in S(i). SCAN checks various correspondences of the input token t(i) against terminal symbols in the RHS of rules. Once SCAN is done, COMPLETER substitutes all final states of S(i) into all other analyses which can use them as components.</Paragraph> </Section> </Section> <Section position="5" start_page="223" end_page="225" type="metho"> <SectionTitle> SCAN </SectionTitle> <Paragraph position="0"> SCAN handles the states of S(i), checking each input terminal against the requirements of states in S(i) under various error hypotheses. Figure 1 illustrates how SCAN operates. Let c(p,j) be the j-th component of RHS(p) and t(i) be the i-th word of the input string.</Paragraph> <Paragraph position="1"> * perfect match: if c(p,j) = t(i), then add (p, j+1, f, e) to S(i+1) if possible.</Paragraph> <Paragraph position="2"> * insertion-error hypothesis: add (p, j, f, e+α_insertion) to S(i+1) if possible. α_insertion is the cost of an insertion error for a terminal symbol.</Paragraph> <Paragraph position="3"> * deletion-error hypothesis: if c(p,j) is a terminal, then add (p, j+1, f, e+α_deletion) to S(i) if possible. α_deletion is the cost of a deletion error for a terminal symbol.</Paragraph> <Paragraph position="4"> * mutation-error hypothesis: if c(p,j) is a terminal but not equal to t(i), then add (p, j+1, f, e+α_mutation) to S(i+1) if possible. α_mutation is the cost of a mutation error for a terminal symbol. (In Lyon's original paper, α_insertion, α_deletion, and α_mutation are all strictly 1.)</Paragraph>
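<Paragraph position="5"> As a concrete illustration, the following is a minimal Python sketch of the stateset admission test and of SCAN under one possible encoding; the tuple representation, the lowercase-terminal convention, and the rhs() accessor are our assumptions for illustration, not the paper's implementation.</Paragraph> <Paragraph position="6">
def add_state(stateset, state):
    """Admit (p, j, f, e) unless a similar member (p, j, f, e') with
    e' <= e already exists; replace that member if e' > e."""
    p, j, f, e = state
    for idx, (p2, j2, f2, e2) in enumerate(stateset):
        if (p2, j2, f2) == (p, j, f):
            if e2 <= e:
                return             # reject: existing analysis is no worse
            stateset[idx] = state  # replace: new analysis has fewer errors
            return
    stateset.append(state)

def scan(S, i, token, rhs, a_ins=1, a_del=1, a_mut=1):
    """Apply the four SCAN hypotheses to every state in S[i].
    S is a list of statesets (lists); rhs(p) returns the RHS of
    production p as a list; terminals are assumed lowercase strings."""
    k = 0
    while k < len(S[i]):          # S(i) grows as deletion hypotheses are added
        p, j, f, e = S[i][k]
        k += 1
        comps = rhs(p)
        c = comps[j] if j < len(comps) else None
        is_terminal = c is not None and c.islower()
        # insertion-error hypothesis: treat t(i) as spurious, stay at j
        add_state(S[i + 1], (p, j, f, e + a_ins))
        if is_terminal:
            # deletion-error hypothesis: c(p,j) is missing from the input
            add_state(S[i], (p, j + 1, f, e + a_del))
            if c == token:
                add_state(S[i + 1], (p, j + 1, f, e))          # perfect match
            else:
                add_state(S[i + 1], (p, j + 1, f, e + a_mut))  # mutation
</Paragraph>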
<Paragraph position="7"> COMPLETER handles the substitution of final states in S(i) as in the original Earley's algorithm. Each final state means the recognition of a nonterminal. (Figures 1 and 2, which illustrate SCAN and extended-COMPLETER on example sentences, are not recoverable from this extraction.)</Paragraph> <Section position="1" start_page="224" end_page="224" type="sub_section"> <SectionTitle> 2.2 Extension of the least-errors recognition algorithm </SectionTitle> <Paragraph position="0"> The algorithm in section 2.1 can analyze any input string with the least number of errors, but it can handle only errors of terminal symbols, because it does not consider errors on nonterminal nodes. In real text, however, the insertion, deletion, or inversion of a phrase - namely, a nonterminal node - occurs more frequently. So we extend the original algorithm to handle errors of nonterminal symbols as well.</Paragraph> <Paragraph position="1"> In our extended algorithm, the same SCAN as in the original algorithm is used, while COMPLETER is modified and extended. Figure 2 shows the processing of extended-COMPLETER. In figure 2, [NP] denotes a final state whose rule has NP as its LHS; in other words, it means the recognition of a noun phrase.</Paragraph> <Paragraph position="2"> extended-COMPLETER: if there is a final state s' = (p', p̄'+1, k, e') in S(i), then:</Paragraph> <Paragraph position="3"> * phrase perfect match: if there exists a state s'' = (p, j, x, e) in S(k) and c(p,j) = LHS(p'), then add s = (p, j+1, x, e+e') to S(i).</Paragraph> <Paragraph position="4"> * phrase insertion-error hypothesis: if there exists a state s'' = (p, j, x, e) in S(k), then add s = (p, j, x, e+β_insertion) to S(i) if possible. β_insertion is the cost of an insertion error for a nonterminal symbol. (In fact, there are cases where an inserted phrase cannot be constructed to form a nonterminal node. In the phrase insertion-error hypothesis of figure 2, the original sentence is "Other countries, including West Germany, may have ...", where the inserted phrase VP is surrounded by commas, so the substring (comma VP comma) should be dealt with as a constituent in extended-COMPLETER. We implemented the algorithm to allow substring insertions as well as insertions of nonterminal nodes.)</Paragraph> <Paragraph position="5"> * phrase deletion-error hypothesis: if there exists a state s'' = (p, j, x, e) in S(k) and c(p,j) is a nonterminal, then add s = (p, j+1, x, e+β_deletion) to S(k) if possible. β_deletion is the cost of a deletion error for a nonterminal symbol.</Paragraph> <Paragraph position="6"> * phrase mutation-error hypothesis: if there exists a state s'' = (p, j, x, e) in S(k) and c(p,j) is a nonterminal but not equal to LHS(p'), then add s = (p, j+1, x, e+β_mutation) to S(i) if possible. β_mutation is the cost of a mutation error for a nonterminal symbol. (We found no example of a phrase mutation error in the corpus, so this hypothesis is not meaningful in real text and we did not implement it.)</Paragraph> <Paragraph position="7"> The extended least-errors recognition algorithm can thus handle not only terminal errors but also nonterminal errors.</Paragraph>
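<Paragraph position="8"> The following Python sketch shows the phrase-level hypotheses under the same assumed encoding as the SCAN sketch above (it reuses add_state); lhs(p) and rhs(p) are assumed grammar accessors, uppercase strings stand for nonterminals, and the β costs are illustrative defaults, not the paper's values.</Paragraph> <Paragraph position="9">
def extended_completer(S, i, lhs, rhs, b_ins=1, b_del=1, b_mut=1):
    """Apply phrase perfect match and the phrase error hypotheses for
    every final state in S[i]. Phrase mutation is included for
    completeness although the paper did not implement it."""
    n = 0
    while n < len(S[i]):             # S(i) grows as completions cascade
        p2, j2, k, e2 = S[i][n]
        n += 1
        if j2 != len(rhs(p2)):       # only final states trigger completion
            continue
        completed = lhs(p2)          # the recognized nonterminal, e.g. 'NP'
        for (p, j, x, e) in list(S[k]):
            comps = rhs(p)
            c = comps[j] if j < len(comps) else None
            is_nonterm = c is not None and c.isupper()
            if c == completed:
                # phrase perfect match
                add_state(S[i], (p, j + 1, x, e + e2))
            # phrase insertion-error hypothesis: skip the completed phrase
            add_state(S[i], (p, j, x, e + b_ins))
            if is_nonterm:
                # phrase deletion-error hypothesis: advance over c(p,j)
                # without consuming input, so the new state stays in S(k)
                add_state(S[k], (p, j + 1, x, e + b_del))
                if c != completed:
                    # phrase mutation-error hypothesis
                    add_state(S[i], (p, j + 1, x, e + b_mut))
</Paragraph>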
</Section> <Section position="2" start_page="224" end_page="225" type="sub_section"> <SectionTitle> 2.3 Heuristics </SectionTitle> <Paragraph position="0"> The robust parser using the extended least-errors recognition algorithm overgenerates many error-hypothesis edges during parsing. To cope with this problem, we adjust error values according to the following heuristics. Edges with larger error values are regarded as less important, so those edges are processed later than edges with smaller error values.</Paragraph> <Paragraph position="1"> * Heuristics 1: error types. An analysis of 3,538 sentences of the Penn treebank corpus (WSJ) shows that there are 498 sentences with phrase deletions and 224 sentences with phrase insertions. So we assign a smaller error value to deletion-error hypothesis edges than to insertion- and mutation-error edges, for both α, the error cost of a terminal symbol, and β, the error cost of a nonterminal symbol.</Paragraph> <Paragraph position="2"> * Heuristics 2: fiducial nonterminals. People often make mistakes in writing English. These mistakes usually take place between small constituents, such as a verb phrase, an adverbial phrase, and a noun phrase, rather than within the small constituents themselves. The possibility of an error occurring within a noun phrase is lower than between a noun phrase and a verb phrase, a prepositional phrase, or an adverbial phrase. So we take some phrases, for example noun phrases, as fiducial nonterminals, that is, error-free nonterminals. The robust parser assigns a larger error value (+δ1) to error-hypothesis edges occurring within a fiducial nonterminal.</Paragraph> <Paragraph position="3"> * Heuristics 3: kinds of terminal symbols. Some terminal symbols, such as punctuation symbols, conjunctions, and particles, are often misused. So the robust parser assigns smaller error values (−δ2) to error-hypothesis edges on these symbols than to those on other terminal symbols.</Paragraph> <Paragraph position="4"> * Heuristics 4: inserted phrases between commas or parentheses. Most inserted phrases are surrounded by commas or parentheses. For example: a. They're active, generally, at night or on damp, cloudy days. b. All refrigerators, whether they are defrosted manually or not, need to be cleaned. c. I was a last-minute (read interloping) attendee at a French journalism convention. We assign smaller error values (−δ3) to insertion-error hypothesis edges of nonterminals which are surrounded by commas or parentheses.</Paragraph> <Paragraph position="5"> δ1 and δ2 are weights for errors of terminal nodes, and δ3 is a weight for errors of nonterminal nodes. The error value e of an edge is calculated as follows. All error values are additive. The error value e for a rule X → a1 A1 a2 ... ai Aj, where a is a terminal node and A is a nonterminal node, is e = Σα + Σβ + Σe_child, where α ∈ {α_insertion, α_deletion, α_mutation}, β ∈ {β_insertion, β_deletion}, and e_child is the error value of a child edge.</Paragraph>
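<Paragraph position="6"> As an illustration, the sketch below combines the additive formula with the heuristic weight adjustments; the δ values and the flag-based edge description are hypothetical, since the paper does not report its tuned parameters.</Paragraph> <Paragraph position="7">
# Hypothetical weights; the paper tunes these empirically on WSJ.
DELTA1 = 2.0  # penalty: error inside a fiducial nonterminal (heuristics 2)
DELTA2 = 0.5  # discount: error on punctuation/conjunction/particle (heuristics 3)
DELTA3 = 0.5  # discount: inserted phrase between commas/parens (heuristics 4)

def edge_error(alpha_costs, beta_costs, child_errors,
               in_fiducial=False, minor_terminal=False,
               comma_bounded_insertion=False):
    """e = sum(alpha) + sum(beta) + sum(e_child), adjusted by heuristics 2-4."""
    e = sum(alpha_costs) + sum(beta_costs) + sum(child_errors)
    if in_fiducial:
        e += DELTA1  # errors inside e.g. an NP are penalized
    if minor_terminal:
        e -= DELTA2  # often-misused terminals are cheaper to hypothesize
    if comma_bounded_insertion:
        e -= DELTA3  # ", VP ," style insertions are cheaper
    return e

# Example: one terminal deletion (cost 1), one phrase insertion (cost 1)
# surrounded by commas, and one child edge carrying error 0.5:
print(edge_error([1.0], [1.0], [0.5], comma_bounded_insertion=True))  # 2.0
</Paragraph>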
<Paragraph position="8"> By these heuristics, our robust parser can process only the plausible edges first, instead of processing all generated edges at the same time; this enhances the performance of the robust parser and results in a great reduction in the number of resultant trees.</Paragraph> </Section> </Section> <Section position="6" start_page="225" end_page="225" type="metho"> <SectionTitle> 3 Implementation and Evaluation </SectionTitle> <Section position="1" start_page="225" end_page="225" type="sub_section"> <SectionTitle> 3.1 The robust parser </SectionTitle> <Paragraph position="0"> Our robust parsing system is composed of two modules. One module is a normal parser, a bottom-up chart parser. The other is a robust parser with the error recovery mechanism proposed here. An input sentence is first processed by the normal parser. If the sentence is within the grammatical coverage of the system, the normal parser succeeds in analyzing it. Otherwise, the normal parser fails, and the robust parser then executes, starting from the edges generated by the normal parser. The result of the robust parser is parse trees which are within the grammatical coverage of the system. An overview of the system is shown in figure 3. (Figure 3, a block diagram of the system, is not recoverable from this extraction.)</Paragraph> </Section>
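<Paragraph position="1"> A minimal sketch of this two-stage control flow is given below; the function names parse_normal and parse_robust and their return values are assumptions for illustration, not the paper's actual interfaces.</Paragraph> <Paragraph position="2">
def analyze(sentence, grammar):
    # Stage 1: ordinary bottom-up chart parsing.
    ok, trees, chart = parse_normal(sentence, grammar)
    if ok:
        return trees  # sentence is within grammatical coverage
    # Stage 2: least-errors recovery, seeded with the chart edges
    # already built by the normal parser.
    return parse_robust(sentence, grammar, seed_edges=chart)
</Paragraph>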
<Section position="2" start_page="225" end_page="225" type="sub_section"> <SectionTitle> 3.2 Experimental results </SectionTitle> <Paragraph position="0"> To show the usefulness of the robust parser proposed in this paper, we carried out several experiments.</Paragraph> </Section> </Section> <Section position="7" start_page="225" end_page="226" type="metho"> <SectionTitle> * Rule </SectionTitle> <Paragraph position="0"> We derived 4,958 rules and their frequencies from 14,137 sentences in the Penn treebank tree-tagged corpus, the Wall Street Journal. The average frequency of each rule is 48 occurrences in the corpus. Of these rules, we removed those which occur fewer times than the average frequency, leaving only 192 rules. The removed rules are mostly for peculiar sentences, and the remaining rules are very general. We show that our robust parser can compensate for this lack of rules, using only the 192 rules together with the recovery mechanism and heuristics.</Paragraph> <Paragraph position="1"> * Test set. First, 1,000 sentences were selected randomly from the WSJ corpus, which we referred to in developing the robust parser. Of these sentences, 410 fail normal parsing and are processed again by the robust parser. To show the validity of the heuristics, we compare the results of the robust parser with heuristics against those without heuristics. Second, to show the adaptability of our robust parser, the same experiments were carried out on 1,000 sentences from the ATIS corpus in the Penn treebank, which we had not referred to in developing the robust parser. Among the 1,000 ATIS sentences, 465 are processed by the robust parser after failing normal parsing.</Paragraph> </Section> <Section position="8" start_page="226" end_page="226" type="metho"> <SectionTitle> * Parameter adjustment </SectionTitle> <Paragraph position="0"> We chose the best parameters for the heuristics through several experiments. Accuracy is measured as the percentage of constituents in the test sentences which do not cross any Penn treebank constituents (Black, 1991). Table 1 shows the results of the robust parser on the WSJ. In table 1, the 5th, 6th, and 7th rows give the percentage of sentences with no crossing constituents, less than one crossing, and less than two crossings, respectively. With heuristics, our robust parser improves processing time and reduces the number of edges. The accuracy is also improved, from 72.8% to 77.1%, even though the heuristics differentiate edges and prefer some over others. This shows that the proposed heuristics are valid for parsing real sentences. The experiment shows that our robust parser with heuristics can recover perfectly about 23 of every 100 sentences that fail in normal parsing, as the percentage of no-crossing sentences is about 23.28%.</Paragraph> <Paragraph position="1"> Table 2 gives the results of the robust parser on ATIS, which we did not refer to before. The accuracy on ATIS is lower than on the WSJ because the parameters of the heuristics were adjusted on the WSJ, not on ATIS itself. However, the percentage of sentences with constituents crossing less than 2 is higher than for the WSJ, as ATIS sentences are comparatively simple.</Paragraph> <Paragraph position="2"> The experimental results of our robust parser show high accuracy in recovery even though 96% of the total rules were removed. It is impossible to construct complete grammar rules for a real parsing system so that it succeeds in analyzing every real sentence. So parsing systems are bound to meet extragrammatical sentences which they cannot analyze. Our robust parser can recover these extragrammatical sentences with 68-77% accuracy.</Paragraph> <Paragraph position="3"> It is very interesting that the parameters of the heuristics reflect the characteristics of the test corpus. For example, if people tend to write sentences with inserted phrases, then the parameter β_insertion must increase. Therefore we can get better results if the parameters are fitted to the characteristics of the corpus.</Paragraph> </Section> </Paper>