<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1108"> <Title>Use of Mutual Information Based Character Clusters in Dictionary-less Morphological Analysis of Japanese Hideki Kashioka, Yasuhiro Kawata, Yumiko Kinjo,</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Recent papers have reported successful part-of-speech tagging with statistical language modeling techniques (Church 1988; Cutting et al. 1992; Charniak et al. 1993; Brill 1994; Nagata 1994; Yamamoto 1996).</Paragraph> <Paragraph position="1"> Morphological analysis of Japanese, however, is more complex because, unlike in European languages, no spaces are inserted between words. In fact, even native Japanese speakers place word boundaries inconsistently. Consequently, individual researchers have adopted different word boundaries and tag sets based on their own theory-internal justifications.</Paragraph> <Paragraph position="2"> For a practical system to use different word boundaries and tag sets according to the demands of an application, the dictionary, the tag sets, and numerous other parameters must be coordinated. Unfortunately, such a task is costly. Furthermore, it is difficult to maintain the accuracy needed to regulate the word boundaries. Also, depending on the purpose, new technical terminology may have to be collected and the dictionary coordinated accordingly, and even then the problem of unknown words would remain.</Paragraph> <Paragraph position="3"> These problems will arise as long as a dictionary continues to play a principal role. In analyzing Japanese, a Decision-Tree approach with no need for a dictionary (Kashioka et al.
1997) has led us to employ, among other parameters, mutual information (MI) bits of individual characters derived from large hierarchically clustered sets of characters in the corpus.</Paragraph> <Paragraph position="4"> This paper therefore proposes a type of Decision-Tree morphological analysis that uses the MI of characters and needs no dictionary. Next, the paper describes the use of information on character sorts in morphological analysis of Japanese, how knowing the sort of each character helps when tokenizing a string of characters into a string of words and when assigning parts of speech to them, and our method of clustering characters based on MI bits. It then proposes a type of Decision-Tree analysis that incorporates the notion of MI-based character and word clustering. Finally, it presents an experimental report and discussion.</Paragraph> </Section> </Paper>