<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0120"> <Title>A Self-Organizing Japanese Word Segmenter using Heuristic Word Identification and Re-estimation</Title> <Section position="8" start_page="3984" end_page="3984" type="concl"> <SectionTitle> 7 Conclusion and Future Work </SectionTitle> <Paragraph position="0"> We have presented a self-organized method that builds a stochastic Japanese word segmenter from a small word list and a large unsegmented text. We found that it is very effective to augment the initial word list with words extracted automatically using character type heuristics. Re-estimation helps in adjusting word frequencies and removing inappropriate word hypotheses, although it has little impact on word segmentation accuracy when the word unigram model is used.</Paragraph> <Paragraph position="1"> The major drawbacks of the current word segmenter are its tendency to break up unknown words whose substrings coincide with other words in the dictionary, and its erroneous longest match on sequences of function words written in hiragana. The first drawback results from the character unigram based word model, which prefers short words, while the second results from the nature of the word unigram model, which prefers the segmentation with the fewest words.</Paragraph> <Paragraph position="2"> One may argue that we could use the word bigram model. However, we do not know how to estimate the initial word bigram frequencies from scratch. One may also argue that we could use the character bigram in the word model. However, the character bigram for the word model must be computed from segmented texts. Both points suggest that we need a word segmenter in order to build a more sophisticated word segmenter. 
Therefore, as the next step of our research, we plan to use the proposed unigram based word segmenter to obtain initial estimates of the word bigrams and the word-based character bigrams, which will then be refined by a re-estimation procedure.</Paragraph> </Section> </Paper>