<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0120"> <Title>A Self-Organizing Japanese Word Segmenter using Heuristic Word Identification and Re-estimation</Title> <Section position="4" start_page="206" end_page="207" type="metho"> <SectionTitle> 3 Initial Word Frequency Estimation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="206" end_page="206" type="sub_section"> <SectionTitle> 3.1 Longest Match </SectionTitle> <Paragraph position="0"> We can get a set of initial estimates of the word frequencies by segmenting the training corpus with a heuristic (non-stochastic) dictionary-based word segmenter. In both Japanese and Chinese, one of the most popular non-stochastic dictionary-based approaches is the longest match method. (Although [Sproat et al., 1996] call it &quot;maximum matching&quot;, we call this method &quot;longest match&quot;, following a review of Chinese word segmentation [Wu and Tseng, 1993] and the literal translation of the Japanese name of the method.) There are many variations of the longest match method, possibly augmented with further heuristics. We used a simple greedy algorithm described in [Sproat et al., 1996]. It starts at the beginning of the sentence, finds the longest word starting at that point, and then repeats the process starting at the next character until the end of the sentence is reached. We chose the greedy algorithm because it is easy to implement and guaranteed to produce only one segmentation.</Paragraph> </Section> <Section position="2" start_page="206" end_page="206" type="sub_section"> <SectionTitle> 3.2 String Frequency </SectionTitle> <Paragraph position="0"> [Sproat et al., 1996] also proposed another method to estimate a set of initial word frequencies without segmenting the corpus. It derives the initial estimates from the frequencies in the corpus of the strings of characters making up each word in the dictionary, whether or not each string is actually an instance of the word in question. The total number of words in the corpus is derived simply by summing the string frequency of each word in the dictionary. Finding (and counting) all instances of a string W in a large text T can be efficiently accomplished by building a data structure known as a suffix array, which is basically a sorted list of all the suffixes of T [Manber and Myers, 1993].</Paragraph> </Section> <Section position="3" start_page="206" end_page="207" type="sub_section"> <SectionTitle> 3.3 Longest Match String Frequency </SectionTitle> <Paragraph position="0"> The word frequency estimates produced by the above string frequency method tend to be greatly inflated, especially for short words, because of double counting. We devised a slightly improved version, which we term the &quot;longest match string frequency&quot; method. It counts the instances of a string W1 in text T, unless the instance is also a substring of an instance of another string W2 in dictionary D.</Paragraph> <Paragraph position="1"> This method can be implemented by building two suffix arrays, ST and SD, for text T and dictionary D. Using ST, we first make the list LW of all occurrences of string W in the text. Using SD, we then look up all strings W' in the dictionary that include W as a substring, and make the list LW' of all their occurrences in the text by using ST. The longest match string frequency of word W in text T with respect to dictionary D is obtained by counting the number of elements in the set difference LW - LW'. For example, suppose the input sentence is a phrase meaning &quot;talk about the Association of English and the Association of Linguistics&quot; and the dictionary contains the words for &quot;linguistics&quot;, &quot;language&quot;, &quot;language study&quot;, &quot;association&quot;, and &quot;talk&quot;. Figure 3 shows the differences among the three methods.</Paragraph>
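To make the three estimators concrete, here is a minimal Python sketch of the greedy longest match segmenter and the two string-frequency counts. It is an illustration under stated assumptions, not the paper's implementation: the dictionary is a plain set of strings, unmatched characters fall back to single-character words, and naive substring scanning stands in for the suffix arrays used in the paper; all function names and the `max_len` cap are ours.

```python
# Sketch of the three initial word-frequency estimators: greedy longest
# match (lm), string frequency (sf), and longest match string frequency
# (lsf). Naive scanning is used where the paper builds suffix arrays.
from collections import Counter


def longest_match_segment(sentence, dictionary, max_len=8):
    """Greedy longest match: at each position take the longest dictionary
    word starting there; fall back to a single character if none matches."""
    words, i = [], 0
    while i < len(sentence):
        match = sentence[i]  # single-character fallback
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary:
                match = sentence[i:j]
                break
        words.append(match)
        i += len(match)
    return words


def string_frequency(text, dictionary):
    """Count every occurrence of every dictionary word as a substring of
    the text, whether or not it is really an instance of that word."""
    freq = Counter()
    for w in dictionary:
        start = text.find(w)
        while start != -1:
            freq[w] += 1
            start = text.find(w, start + 1)
    return freq


def longest_match_string_frequency(text, dictionary):
    """Count an occurrence of word w only if it is not covered by an
    occurrence of a longer dictionary word containing w as a substring."""
    freq = Counter()
    for w in dictionary:
        longer = [v for v in dictionary if w in v and v != w]
        start = text.find(w)
        while start != -1:
            covered = any(
                text[s:s + len(v)] == v
                for v in longer
                for s in range(max(0, start - len(v) + len(w)), start + 1)
            )
            if not covered:
                freq[w] += 1
            start = text.find(w, start + 1)
    return freq
```

The quadratic scans are adequate for an illustration; at corpus scale the occurrence lists would instead be obtained from the suffix arrays ST and SD, as described above.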
<Paragraph position="2"> The longest match string frequency (lsf) method considers all possible longest matches in the text, while the greedy longest match (lm) algorithm considers only one possibility. It is obvious that the longest match string frequency method remedies the problem that the string frequency (sf) method consistently and inappropriately favors short words.</Paragraph> <Paragraph position="3"> The problem with the longest match string frequency method is that if a word W1 is a substring of another word W2 and W1 always appears as a substring of W2 in the training text, as happens in the above example, the frequency estimate of W1 becomes 0. Although this rarely happens for a large training text, we have to smooth the word frequencies.</Paragraph> </Section> </Section> <Section position="5" start_page="207" end_page="208" type="metho"> <SectionTitle> 4 Initial Word Identification Method </SectionTitle> <Paragraph position="0"> To a first approximation, a point in the text where the character type changes is likely to be a word boundary. This is a popular heuristic in Japanese word segmentation. To help readers understand this heuristic, we give a brief introduction to the Japanese writing system.</Paragraph> <Paragraph position="1"> In contemporary Japanese, there are at least five different types of characters other than punctuation marks: kanji, hiragana, katakana, Roman alphabet, and Arabic numerals. Kanji, which means 'Chinese character', is used for both Chinese-origin words and Japanese words semantically equivalent to Chinese characters. There are two syllabaries, hiragana and katakana. The former is used primarily for grammatical function words, such as particles and inflectional endings, while the latter is used primarily to transliterate Western-origin words. The Roman alphabet is also used for Western-origin words and acronyms. Arabic numerals are used for numbers.</Paragraph> <Paragraph position="2"> By using just this character type heuristic, a non-stochastic, non-dictionary word segmenter can be made. In fact, using the estimated word frequencies obtained by this heuristic alone results in poor segmentation accuracy: the word segmentation accuracy of the character type based method was less than 60%, while the other estimation methods achieve around 70-80%, as we show in the next section. We found, however, that it is very effective to use the character type based word segmenter as a lexical acquisition tool to augment the initial word list.</Paragraph> <Paragraph position="3"> The initial word identification procedure is as follows. First, we segment the training corpus with the character type based word segmenter and make a list of words with frequencies. We then filter out hiragana strings because they are likely to be function words. We add the extracted word list to the original dictionary with associated frequencies, whether or not each string is actually a word.</Paragraph> <Paragraph position="5"> Although there are a lot of erroneous words in the augmented word list, most of them are filtered out by the re-estimation. This method works surprisingly well, as shown in the experiment.</Paragraph>
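As an illustration of the character type heuristic, here is a minimal Python sketch that classifies characters by Unicode range, inserts a boundary wherever the type changes, and filters pure-hiragana strings when collecting word hypotheses. It is a simplified reading of the procedure above, not the authors' code; symbol handling and rarer character classes are glossed over, and the function names are ours.

```python
# Character-type-based segmentation: a word boundary is hypothesized at
# every point where the character type changes; pure-hiragana strings are
# dropped from the extracted word list as likely function words.
from collections import Counter


def char_type(ch):
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "latin"
    return "other"  # punctuation, symbols, whitespace


def segment_by_char_type(sentence):
    """Split the sentence wherever the character type changes."""
    words, current = [], ""
    for ch in sentence:
        if current and char_type(ch) != char_type(current[-1]):
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words


def collect_word_hypotheses(sentences):
    """Build a frequency list of word hypotheses, filtering hiragana strings."""
    freq = Counter()
    for sentence in sentences:
        for w in segment_by_char_type(sentence):
            if all(char_type(c) == "hiragana" for c in w):
                continue
            freq[w] += 1
    return freq
```

The resulting list, noisy as it is, would then be merged into the initial dictionary with its counts, leaving the later re-estimation to prune unreliable entries.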
</Section> <Section position="6" start_page="208" end_page="3984" type="metho"> <SectionTitle> 5 Experiment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="208" end_page="3984" type="sub_section"> <SectionTitle> 5.1 Language Data </SectionTitle> <Paragraph position="0"> We used the EDR Japanese Corpus Version 1.0 [EDR, 1995] to train and test the word segmenter.</Paragraph> <Paragraph position="1"> It is a corpus of 5.1 million words (208 thousand sentences). It contains a variety of Japanese sentences taken from newspapers, magazines, dictionaries, encyclopedias, textbooks, etc. It has a variety of annotations, including word segmentation, pronunciation, and part-of-speech tags.</Paragraph> <Paragraph position="2"> In this experiment, we randomly selected two sets of training sentences, each consisting of 100 thousand sentences. The first training set (training-0) is used to make initial word lists of various sizes. The second training set (training-1) is used to train the various word segmenters. From the remaining 8 thousand sentences, we randomly selected 100 test sentences to evaluate the accuracy of the word segmenters. Table 1 shows the number of sentences, words, and characters in the training and test sets. Training-1 was used only as plain text taken from the same information source as training-0; its word segmentation annotation was never used, to ensure that training was unsupervised.</Paragraph> <Paragraph position="3"> Based on the frequencies in the manually segmented corpus training-0, we made 7 different initial word lists (D1-D200) whose frequency thresholds were 1, 2, 5, 10, 50, 100, and 200, respectively. The sizes of the resulting word lists and their out-of-vocabulary rates (OOV rate) on the test sentences are shown in the second and third columns of Table 2. For example, D200 consists of words appearing more than 200 times in training-0. Although D200 consists of only 826 words, it covers 76.6% (OOV rate 23.4%) of the test sentences. This is an example of Zipf's law.</Paragraph> </Section> <Section position="2" start_page="3984" end_page="3984" type="sub_section"> <SectionTitle> 5.2 Evaluation Measures </SectionTitle> <Paragraph position="0"> Word segmentation accuracy is expressed in terms of recall and precision, as is done for the bracketing of partial parses [Nagata, 1994, Sproat et al., 1996]. Let the number of words in the manually segmented corpus be Std, the number of words in the output of the word segmenter be Sys, and the number of matched words be M. Recall is defined as M/Std, and precision is defined as M/Sys.</Paragraph> <Paragraph position="1"> Since it is inconvenient to use both recall and precision all the time, we also use the F-measure to indicate the overall performance. The F-measure was originally developed by the information retrieval community. It is calculated by

F = ((beta^2 + 1) x P x R) / (beta^2 x P + R)     (7)

where P is precision, R is recall, and beta is the relative importance given to recall over precision. We set beta = 1.0 throughout this experiment. That is, we put equal importance on recall and precision.</Paragraph>
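The following short Python sketch shows one way to compute these measures, assuming that a system word counts as matched when it spans exactly the same characters as a word in the manual segmentation; the helper names and the zero-match guard are ours, not the paper's.

```python
# Recall = M/Std, precision = M/Sys, and F as in equation (7), with a word
# treated as matched when its character span agrees with the reference.

def spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    result, pos = set(), 0
    for w in words:
        result.add((pos, pos + len(w)))
        pos += len(w)
    return result


def evaluate(system_words, reference_words, beta=1.0):
    sys_spans = spans(system_words)
    ref_spans = spans(reference_words)
    matched = len(sys_spans & ref_spans)        # M
    if matched == 0:
        return 0.0, 0.0, 0.0
    recall = matched / len(ref_spans)           # M / Std
    precision = matched / len(sys_spans)        # M / Sys
    f = ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
    return recall, precision, f
```

With beta = 1.0 this reduces to the familiar harmonic mean of precision and recall, which is the setting used throughout the experiment.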
</Section> <Section position="3" start_page="3984" end_page="3984" type="sub_section"> <SectionTitle> 5.3 Comparison of Various Word Frequency Estimation Methods </SectionTitle> <Paragraph position="0"> We first compared the three frequency estimation methods described in the previous section: the greedy longest match method (lm), the string frequency method (sf), and the longest match string frequency method (lsf). The sixth, seventh, and eighth columns of Table 2 show the word segmentation accuracy (F-measure) of each estimation method using different sets of initial words (D1-D200).</Paragraph> <Paragraph position="1"> For comparison, the word segmentation accuracy using the real word frequencies (wf), computed from the manual segmentation of training-1 (not training-0!), is shown in the fifth column of Table 2.</Paragraph> <Paragraph position="2"> The results are also diagrammed in Figure 4.</Paragraph> <Paragraph position="3"> The accuracy obtained using the real word frequencies outperformed that of every frequency estimation method. Among the word frequency estimates, the longest match string frequency method (lsf) consistently outperformed the string frequency method (sf). The (longest match) string frequency methods (sf and lsf) outperformed the greedy longest match method (lm) by about 2-5% when the initial word list size was under 20K (from D5 to D200). For all estimation methods, the word segmentation accuracy of D1 is worse than that of D2, while D1 is slightly better than D2 when the real word frequencies are used.</Paragraph> </Section> <Section position="4" start_page="3984" end_page="3984" type="sub_section"> <SectionTitle> 5.4 Effect of Augmenting Initial Dictionary </SectionTitle> <Paragraph position="0"> We then compared the three frequency estimation methods (lm, sf, and lsf) with the initial dictionary augmented by the character type based word identification method (ct) described in the previous section. The word identification method collected a list of 108975 word hypotheses from training-1. The ninth, tenth, and eleventh columns of Table 2 show the word segmentation accuracies.</Paragraph> <Paragraph position="1"> Augmenting the dictionary yields a significant improvement in word segmentation accuracy. Although the difference between the underlying word frequency estimation methods is small, the longest match string frequency method generally performs best.</Paragraph> <Paragraph position="3"> Surprisingly, the best word segmentation accuracy is achieved when the very small initial word list of 1719 words (D100) is augmented by the heuristic word identification method, where the recall and precision are 86.3% and 82.5% (F-measure 0.843).</Paragraph> </Section> <Section position="5" start_page="3984" end_page="3984" type="sub_section"> <SectionTitle> 5.5 Effect of Re-estimation </SectionTitle> <Paragraph position="0"> To investigate the effect of re-estimation, we tested the combinations of three initial word lists (D1, D2, and D100) and two initial word frequency estimation methods: the string frequency method (sf) and the longest match string frequency method augmented with the word identification method (lsf+ct).</Paragraph> <Paragraph position="1"> We applied the Viterbi re-estimation procedure three times; further re-estimation seems to bring no significant change. At each stage of re-estimation, we measured the word segmentation accuracy on the test sentences (not the training texts!). Figure 5 shows the word segmentation accuracy, the number of word tokens in the training texts, and the number of word types in the dictionary at each stage of re-estimation.</Paragraph>
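Below is a minimal Python sketch of a Viterbi re-estimation loop under a word unigram model, to make the procedure concrete. It is an illustration rather than the paper's implementation: smoothing is reduced to a crude per-character unknown-word penalty (`unk_cost` is our own knob), `max_len` caps candidate word length, and the function names are ours.

```python
# Viterbi re-estimation: segment the unsegmented training text with the
# current unigram model, re-count word frequencies from the 1-best
# segmentation, and repeat for a few iterations.
import math
from collections import Counter


def viterbi_segment(sentence, freq, total, unk_cost=20.0, max_len=8):
    """Most probable segmentation under a word unigram model (min neg-log-prob)."""
    n = len(sentence)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = sentence[j:i]
            cost = (-math.log(freq[w] / total) if freq[w] > 0
                    else unk_cost * len(w))      # crude unknown-word penalty
            if best[j] + cost < best[i]:
                best[i], back[i] = best[j] + cost, j
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))


def reestimate(sentences, freq, iterations=3):
    """Re-derive word counts from the 1-best segmentation a few times."""
    for _ in range(iterations):
        total = sum(freq.values())
        new_freq = Counter()
        for sentence in sentences:
            for w in viterbi_segment(sentence, freq, total):
                new_freq[w] += 1
        freq = new_freq
    return freq
```

Because only words that actually appear in the 1-best segmentations are re-counted, inflated counts shrink and word hypotheses that are never used fall out of the dictionary, which is the behaviour discussed below.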
<Paragraph position="2"> In general, re-estimation has little impact on word segmentation accuracy. It gradually improves the accuracy when the initial word list is relatively large (D1 and D2), while it worsens the accuracy a little when the initial word list is relatively small (D100). This might correspond to results on unsupervised learning for English part-of-speech tagging. Although [Kupiec, 1992] presented a very sophisticated method of unsupervised learning, [Elworthy, 1994] reported that re-estimation is not always helpful. We think, however, that our results arise because we used a word unigram model; it is too early to conclude that re-estimation is useless for word segmentation, as discussed in the next section.</Paragraph> <Paragraph position="3"> It seems the virtue of re-estimation lies in its ability to adjust word frequencies and to remove unreliable word hypotheses that were added by the heuristic word identification. The abrupt drop in the number of word tokens at the first re-estimation step indicates that the inflated initial estimates of the word frequencies are adjusted to more reasonable values. The drop in the number of word types indicates the removal of infrequent words and unreliable word hypotheses from the dictionary.</Paragraph> </Section> </Section> </Paper>