<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1032"> <Title>A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* N-Best Search Algorithm</Title>
<Section position="3" start_page="0" end_page="201" type="metho">
<SectionTitle> 2 Tagging Model </SectionTitle>
<SectionTitle> 2.1 Tri-POS Model and Relative Frequency Training </SectionTitle>
<Paragraph> We used the tri-POS (or triclass, tri-tag, tri-gram, etc.) model as the tagging model for Japanese. Consider a word segmentation of the input sentence W = w_1 w_2 ... w_n and a sequence of tags T = t_1 t_2 ... t_n of the same length. The morphological analysis task can be formally defined as finding the set of word segmentation and parts of speech assignment that maximizes the joint probability of the word sequence and tag sequence, P(W, T). In the tri-POS model, the joint probability is approximated by the product of parts of speech trigram probabilities P(t_i | t_{i-2}, t_{i-1}) and word output probabilities for a given part of speech P(w_i | t_i):

    P(W, T) \approx \prod_{i=1}^{n} P(t_i | t_{i-2}, t_{i-1}) P(w_i | t_i)    (1)

In practice, we consider sentence boundaries as special symbols, as follows:

    P(W, T) \approx P(t_1 | #, #) P(w_1 | t_1) P(t_2 | #, t_1) P(w_2 | t_2) \left[ \prod_{i=3}^{n} P(t_i | t_{i-2}, t_{i-1}) P(w_i | t_i) \right] P(# | t_{n-1}, t_n)    (2)

where "#" indicates the sentence boundary marker. If we have some tagged text available, we can estimate the probabilities P(t_i | t_{i-2}, t_{i-1}) and P(w_i | t_i) by computing the relative frequencies of the corresponding events on this data:

    P(t_i | t_{i-2}, t_{i-1}) = f(t_i | t_{i-2}, t_{i-1}) = N(t_{i-2}, t_{i-1}, t_i) / N(t_{i-2}, t_{i-1})    (3)

    P(w_i | t_i) = f(w_i | t_i) = N(w_i, t_i) / N(t_i)    (4)

where f indicates the relative frequency, N(w, t) is the number of times a given word w appears with tag t, and N(t_{i-2}, t_{i-1}, t_i) is the number of times the sequence t_{i-2} t_{i-1} t_i appears in the text. It is inevitable to suffer from the sparse-data problem in the part of speech trigram probability^1. To handle open text, the trigram probability is smoothed by interpolated estimation, which simply interpolates trigram, bigram, unigram, and zerogram relative frequencies [8]:

    P(t_i | t_{i-2}, t_{i-1}) = q_3 f(t_i | t_{i-2}, t_{i-1}) + q_2 f(t_i | t_{i-1}) + q_1 f(t_i) + q_0 V    (5)

where f indicates the relative frequency and V is a uniform probability that each tag will occur. The non-negative weights q_i satisfy q_3 + q_2 + q_1 + q_0 = 1, and they are adjusted so as to make the observed data most probable after the adjustment, using the EM algorithm^2. </Paragraph>
<Paragraph> Footnote 1: We used 120 part of speech tags. In the ATR Corpus, 26 parts of speech, 13 conjugation types, and 7 conjugation forms are defined. Out of the 26, 5 parts of speech have conjugation. Since we used a list of part of speech, conjugation type, and conjugation form as a tag, there are 119 tags in the ATR Corpus. We added the sentence boundary marker to them. </Paragraph>
<Paragraph> Footnote 2: To handle open text, the word output probability P(w_i | t_i) must also be smoothed. This problem is discussed in a later section as the unknown word problem. </Paragraph>
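As an illustration of Equation (5), the following minimal Python sketch computes the interpolated trigram probability from raw tag counts. The function name, the on-the-fly counting, and the caller-supplied weights are assumptions made for the example; in the paper the weights q_i are re-estimated with the EM algorithm rather than fixed by hand.

    from collections import Counter

    def smoothed_trigram_prob(tags, weights, num_tags, t2, t1, t0):
        """Interpolated tag trigram probability of Equation (5):
        q3*f(t0|t2,t1) + q2*f(t0|t1) + q1*f(t0) + q0*V."""
        uni = Counter(tags)
        bi = Counter(zip(tags, tags[1:]))
        tri = Counter(zip(tags, tags[1:], tags[2:]))
        f3 = tri[(t2, t1, t0)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
        f2 = bi[(t1, t0)] / uni[t1] if uni[t1] else 0.0
        f1 = uni[t0] / len(tags)
        v = 1.0 / num_tags                      # uniform "zerogram" probability V
        q3, q2, q1, q0 = weights
        return q3 * f3 + q2 * f2 + q1 * f1 + q0 * v

A caller would pass the training tag sequence (with boundary markers), the four weights, and the tag inventory size; rebuilding the counters on every call is obviously wasteful and is done here only to keep the sketch self-contained.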
<Section position="1" start_page="201" end_page="201" type="sub_section">
<SectionTitle> 2.2 Order Reduction and Recursive Tracing </SectionTitle>
<Paragraph> In order to understand the search algorithm described in the next section, we introduce the second order HMM and the extended Viterbi algorithm [6]. Considering the combined state sequence U = u_1 u_2 ... u_n, where u_1 = t_1 and u_i = t_{i-1} t_i, we have

    P(t_i | t_{i-2}, t_{i-1}) = P(u_i | u_{i-1})    (6)

Substituting Equation (6) into Equation (1), we have

    P(W, T) \approx \prod_{i=1}^{n} P(u_i | u_{i-1}) P(w_i | t_i)    (7)

Equation (7) has the same form as the first order model. Consider the partial word sequence W_i = w_1 ... w_i and the partial tag sequence T_i = t_1 ... t_i; then

    max P(W_i, T_i) = max_{u_{i-1}} [ max P(W_{i-1}, T_{i-1}) P(u_i | u_{i-1}) P(w_i | t_i) ]    (8)

Equation (8) suggests that, to find the maximum P(W_i, T_i) for each u_i, we need only to: remember the maximum P(W_{i-1}, T_{i-1}) for each u_{i-1}, extend each of these probabilities to every u_i by computing Equation (8), and select the maximum P(W_i, T_i) for each u_i. Thus, by increasing i from 1 to n, selecting the u_n that maximizes P(W_n, T_n), and backtracing the sequence leading to the maximum probability, we can get the optimal tag sequence. </Paragraph>
</Section> </Section>

<Section position="4" start_page="201" end_page="203" type="metho">
<SectionTitle> 3 Search Strategy </SectionTitle>
<Paragraph> The search algorithm consists of a forward dynamic programming search and a backward A* search. First, a linear-time dynamic programming pass is used for recording the scores of all partial paths in a table^3. A backward A*-based tree search is then used to extend the partial paths. Partial paths extended in the backward tree search are ranked by their corresponding full path scores, which are computed by adding the backward partial path scores to the corresponding best possible scores of the remaining paths, which are prerecorded in the forward search. Since the score of the incomplete portion of a path is exactly known, the backward search is admissible; that is, the top-N candidates are exact. </Paragraph>
<Paragraph> Footnote 3: In fact, we use two tables, parse-list and path-map. The reason is described later. </Paragraph>
<Section position="1" start_page="201" end_page="202" type="sub_section">
<SectionTitle> 3.1 The Forward DP Search </SectionTitle>
<Paragraph> Table 1 shows the two data structures used in our algorithm. The structure parse stores the information of a word and the best partial path up to the word. Parse.start and parse.end are the indices of the start and end positions of the word in the sentence. Parse.pos is the part of speech tag, which in our system for Japanese is a list of part of speech, conjugation type, and conjugation form. Parse.nth-order-state is a list of the last two parts of speech tags, including that of the current word; this slot corresponds to the combined state in the second order HMM. Parse.prob-so-far is the score of the best partial path from the beginning of the sentence to the word. Parse.previous is the pointer to the (best) previous parse structure, as in conventional Viterbi decoding; it is not necessary if we use the backward N-best search. The structure word represents the word information in the dictionary, including its lexical form, part of speech tag, and word output probability given the part of speech. </Paragraph>
<Paragraph> Table 1: Data structures.

    parse structure
        start            the beginning position of the word
        end              the end position of the word
        pos              part of speech tag of the word
        nth-order-state  a list of the last two parts of speech
        prob-so-far      the best partial path score from the start
        previous         a pointer to the previous parse structure

    word structure
        form             lexical form of the word
        pos              part of speech tag of the word
        prob             word output probability </Paragraph>
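A rough Python rendering of the two structures in Table 1 is given below. The field names follow the slots described above; this is an illustrative sketch, not the authors' implementation.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Word:
        form: str          # lexical form of the word
        pos: str           # part of speech tag (POS, conjugation type, conjugation form)
        prob: float        # word output probability P(w|t)

    @dataclass
    class Parse:
        start: int                         # beginning position of the word in the sentence
        end: int                           # end position of the word
        pos: str                           # part of speech tag of the word
        nth_order_state: Tuple[str, str]   # last two POS tags (combined state of the 2nd-order HMM)
        prob_so_far: float                 # best partial path score from the sentence start
        previous: Optional["Parse"] = None # back pointer; unnecessary with the backward N-best search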
<Paragraph> Before explaining the forward search, we define some functions and tables used in the algorithm. In the forward search, we use a table called parse-list, whose key is the end position of a parse structure and whose value is a list of parse structures that have the best partial path scores for each combined state at that end position. Function register-to-parse-list registers a parse structure in the parse-list and maintains the best partial parses. Function get-parse-list returns a list of parse structures at the specified position. We also use the function leftmost-substrings, which returns a list of word structures in the dictionary whose lexical form matches a substring starting at the specified position in the input sentence. </Paragraph>
<Paragraph> Figure 1: The forward DP search algorithm.

    function forward-parse (string) begin
        initial-step();    ; pads special symbols at both ends
        for i = 1 to length(string) do
            foreach parse in get-parse-list(i) do
                foreach word in leftmost-substrings(string, i) do
                    ...
</Paragraph>
<Paragraph> Figure 1 shows the central part of the forward dynamic programming search algorithm. It starts from the beginning of the input sentence and proceeds character by character. At each point in the sentence, it looks up the combinations of the best partial parses ending at that point and the word hypotheses starting at that point. If the connection of a partial parse and a word hypothesis is allowed by the tagging model, a new continuation parse is made and registered in the parse-list. The partial path score for the new continuation parse is the product of the best partial path score up to the point, the trigram probability of the last three parts of speech tags, and the word output probability for the part of speech^4. </Paragraph>
<Paragraph> Footnote 4: In Figure 1, the function transprob returns the probability of a given trigram. The functions initial-step and final-step treat the transitions at sentence boundaries. </Paragraph>
</Section>
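The following Python sketch outlines the forward pass just described, reusing the Parse and Word structures sketched earlier. The interfaces parse_list, leftmost_substrings, and transprob are assumed names mirroring the functions in the text; boundary handling (initial-step/final-step) and the check that a transition is allowed are not shown, so this is a sketch of the control flow rather than the authors' code.

    def forward_parse(sentence, parse_list, leftmost_substrings, transprob):
        """One forward DP pass over an already-initialized parse_list.

        parse_list: dict mapping an end position to {combined_state: best Parse};
            position 0 is assumed to hold the boundary parse created by initial-step.
        leftmost_substrings(sentence, i): dictionary Word entries starting at i.
        transprob(t2, t1, t0): smoothed POS trigram probability (zero if disallowed).
        """
        for i in range(len(sentence)):
            for parse in list(parse_list.get(i, {}).values()):
                for word in leftmost_substrings(sentence, i):
                    t2, t1 = parse.nth_order_state
                    score = parse.prob_so_far * transprob(t2, t1, word.pos) * word.prob
                    new = Parse(start=i, end=i + len(word.form), pos=word.pos,
                                nth_order_state=(t1, word.pos),
                                prob_so_far=score, previous=parse)
                    # keep only the best partial parse per (end position, combined state)
                    slot = parse_list.setdefault(new.end, {})
                    best = slot.get(new.nth_order_state)
                    if best is None or score > best.prob_so_far:
                        slot[new.nth_order_state] = new
        return parse_list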
<Section position="2" start_page="202" end_page="203" type="sub_section">
<SectionTitle> 3.2 The Backward A* Search </SectionTitle>
<Paragraph> The backward search uses a table called path-map, whose key is the end position of a parse structure and whose value is a list of parse structures that have the best partial path scores for each distinct combination of start position and combined state. The difference between parse-list and path-map is that path-map is classified by the start position of the last word in addition to the combined state. This distinction is crucial for the proposed N-best algorithm. For the forward search to find a parse that maximizes Equation (1), it is the parts of speech sequence that matters. For the backward N-best search, however, we want the N most likely word segmentations and part of speech sequences. Parse-list may shadow less probable candidates that have the same part of speech sequence as the best scoring candidate but differ in the segmentation of the last word. As shown in Figure 1, path-map is built during the forward search by the function register-parse-to-path-map, which registers a parse structure in path-map and maintains the best partial parses under the table's criteria. </Paragraph>
<Paragraph> We now describe the central part of the backward A* search algorithm. We assume that the reader knows the A* algorithm and explain only how we applied it to this problem. We consider a parse structure as a state in the A* search. Two states are equal if their parse structures have the same start position, end position, and combined state. The backward search starts at the end of the input sentence and backtracks to the beginning of the sentence using the path-map. Initial states are obtained by looking up the entries of the path-map at the sentence end position. Successor states are obtained by first looking up the entries of the path-map at the start position of the current parse, then checking whether they satisfy the constraint of the combined state transition in the second order HMM and whether the transition is allowed by the tagging model. The combined state transition constraint means that the part of speech sequence in the parse.nth-order-state of the current parse, ignoring the last element, equals that of the previous parse, ignoring the first element. </Paragraph>
<Paragraph> The state transition cost of the backward search is the product of the part of speech trigram probability and the word output probability. The score estimate of the remaining portion of a path is obtained from the parse.prob-so-far slot in the parse structure. The backward search generates the N best hypotheses sequentially, and there is no need to preset N. The complexity of the backward search is significantly less than that of the forward search. </Paragraph>
</Section> </Section>

<Section position="5" start_page="203" end_page="204" type="metho">
<SectionTitle> 4 Word Model </SectionTitle>
<Paragraph> To handle open text, we have to cope with unknown words. Since Japanese does not put spaces between words, we have to identify unknown words first. To do this, we can look at the spelling (character sequence) that may constitute a word, or look at the context to identify words that are acceptable in that context. Once word hypotheses for unknown words are generated, the proposed N-best algorithm will find the most likely word segmentation and part of speech assignment taking the entire sentence into account. Therefore, we can formalize the unknown word problem as determining the span of an unknown word, assigning its part of speech, and estimating its probability given its part of speech. </Paragraph>
<Paragraph> Let us call a computational model that determines the probability of any word hypothesis given its lexical form and its part of speech the "word model". The word model must account for morphology and word formation to estimate the part of speech and the probability of a word hypothesis. As a first approximation, we used the character trigram of each part of speech as the word model. Let C = c_1 c_2 ... c_n denote the sequence of n characters that constitute a word w whose part of speech is t. We approximate the probability of the word given the part of speech, P(w | t), by the trigram probabilities:

    P(w | t) \approx \prod_{i=1}^{n+1} P(c_i | c_{i-2}, c_{i-1}, t)

where the special symbol "#" indicates the word boundary marker (c_{-1} = c_0 = c_{n+1} = #). Character trigram probabilities are estimated from the training corpus by computing the relative frequency of the character bigrams and trigrams that appear in words tagged as t:

    P(c_i | c_{i-2}, c_{i-1}, t) = N_t(c_{i-2}, c_{i-1}, c_i) / N_t(c_{i-2}, c_{i-1})

where N_t(c_{i-2}, c_{i-1}, c_i) is the total number of times the character trigram c_{i-2} c_{i-1} c_i appears in words tagged as t in the training corpus. </Paragraph>
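A minimal Python sketch of the character-trigram word model follows. The count tables tri_counts and bi_counts are assumed to have been collected from the training words of a single category t, and no smoothing is applied here (the paper interpolates these probabilities, as described below), so unseen contexts simply yield zero.

    def word_model_prob(word, tri_counts, bi_counts):
        """P(w|t) approximated by character trigrams, with '#' as the boundary marker.

        tri_counts[(c2, c1, c0)] and bi_counts[(c2, c1)] are counts from training
        words of one part of speech t (assumed to be prepared elsewhere).
        """
        chars = ["#", "#"] + list(word) + ["#"]
        prob = 1.0
        for c2, c1, c0 in zip(chars, chars[1:], chars[2:]):
            denom = bi_counts.get((c2, c1), 0)
            if denom == 0:
                return 0.0   # unseen context; the real model smooths this case
            prob *= tri_counts.get((c2, c1, c0), 0) / denom
        return prob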
<Paragraph> Note that the character trigram probabilities reflect the frequency of word tokens in the training corpus. Since there are more than 3,000 characters in Japanese, the trigram probabilities are smoothed by interpolated estimation to cope with the sparse-data problem. It would be ideal to build this character trigram model for all open class categories. However, the amount of training data is too small for low-frequency categories if we divide it by part of speech tags. Therefore, we built trigram models only for the 4 most frequent parts of speech that are open categories and have no conjugation: common noun, proper noun, sahen noun^5, and numeral. </Paragraph>
<Paragraph> Footnote 5: A noun that can be used as a verb when it is followed by the formal verb "suru". </Paragraph>
<Paragraph> [Figure: examples of probability estimation for unknown words.] Each trigram model returns a probability if the input string is a word belonging to the category. In both examples, the correct category has the largest probability. </Paragraph>
<Paragraph> Word hypotheses for unknown words are generated by using the character trigram models. A word hypothesis is a list of word boundary, part of speech assignment, and word probability that matches a leftmost substring starting at a given position in the input sentence. In the forward search, to handle unknown words, word hypotheses are generated at every position in addition to the ones produced by the function leftmost-substrings, which are the words found in the dictionary. However, in our system, we limited the number of word hypotheses generated at each position to 10, for efficiency reasons. </Paragraph>
</Section>

<Section position="6" start_page="204" end_page="204" type="metho">
<SectionTitle> 5 Evaluation Measures </SectionTitle>
<Paragraph> We applied the performance measures for English parsers [1] to Japanese morphological analyzers. The basic idea is that the morphological analysis of a sentence can be thought of as a set of labeled brackets, where a bracket corresponds to a word segmentation and its label corresponds to a part of speech. We then compare the brackets contained in the system's output to the brackets contained in the standard analysis. For the N-best candidates, we take the union of the brackets contained in each candidate and compare them to the brackets in the standard. </Paragraph>
<Paragraph> For comparison, we count the number of brackets in the standard data (Std), the number of brackets in the system output (Sys), and the number of matching brackets (M). We then calculate the measures of recall (= M/Std) and precision (= M/Sys). We also count the number of crossings, which is the number of cases where a bracketed sequence from the standard data overlaps a bracketed sequence from the system output, but neither sequence is completely contained in the other. </Paragraph>
<Paragraph> We defined two equality criteria on brackets for counting the number of matching brackets. Two brackets are unlabeled-bracket-equal if the boundaries of the two brackets are the same. Two brackets are labeled-bracket-equal if, in addition to being unlabeled-bracket-equal, the labels of the brackets are the same. In comparing the consistency of the word segmentations of two bracketings, which we call structure-consistency, we count the measures (recall, precision, crossings) by unlabeled-bracket-equal. In comparing the consistency of part of speech assignment in addition to word segmentation, which we call label-consistency, we count them by labeled-bracket-equal. </Paragraph>
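The sketch below scores one bracketing against the standard using the counts defined above; representing a bracket as a (start, end, tag) triple is an encoding chosen purely for illustration. For the top-N case, the union of the candidates' brackets would be passed as the system set; the worked example that follows uses the same quantities.

    def bracket_scores(standard, system, labeled=True):
        """Recall (M/Std), precision (M/Sys), and crossings for one bracketing.

        standard, system: collections of (start, end, tag) triples; with
        labeled=False only the boundaries are compared (unlabeled-bracket-equal).
        """
        strip = (lambda b: b) if labeled else (lambda b: (b[0], b[1]))
        std = {strip(b) for b in standard}
        sys_ = {strip(b) for b in system}
        matched = len(std & sys_)
        crossings = sum(
            1
            for (s1, e1, *_) in system
            for (s2, e2, *_) in standard
            # overlap where neither bracket contains the other
            if (s1 < s2 < e1 < e2) or (s2 < s1 < e2 < e1)
        )
        return matched / len(std), matched / len(sys_), crossings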
<Paragraph> For example, Figure 4 shows a sample of N-best analysis hypotheses, where the first candidate is the correct analysis. For the second candidate, since there are 9 brackets in the correct data (Std = 9), 11 brackets in the second candidate (Sys = 11), and 8 matching brackets (M = 8), the recall and precision with respect to label consistency are 8/9 and 8/11, respectively. For the top two candidates, since there are 12 distinct brackets in the system's output and 9 matching brackets, the recall and precision with respect to label consistency are 9/9 and 9/12, respectively. For the third candidate, since the correct data and the third candidate differ in just one part of speech tag, the recall and precision with respect to structure consistency are 9/9 and 9/9, respectively. </Paragraph>
</Section> </Paper>