<?xml version="1.0" standalone="yes"?> <Paper uid="J97-4004"> <Title>Critical Tokenization and its Properties</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Words, and tokens in general, are the primary building blocks in almost all linguistic theories (e.g., Gazdar, Klein, Pullum, and Sag 1985; Hudson 1984) and language processing systems (e.g., Allen 1995; Grosz, Jones, and Webber 1986). Sentence, or string, tokenization, the process of mapping sentences from character strings to strings of words, is the initial step in natural language processing (Webster and Kit 1992).</Paragraph> <Paragraph position="1"> Since written Chinese has no explicit word delimiter (equivalent to the blank space in written English), the problem of Chinese sentence tokenization has been the focus of considerable research effort, and significant advances have been made (e.g., Bai 1995; Zhang et al. 1994; Chen and Liu 1992; Chiang et al. 1992; Fan and Tsai 1988; Gan 1995; Gan, Palmer, and Lua 1996; Guo 1993; He, Xu, and Sun 1991; Huang 1989; Huang and Xia 1996; Jie 1989; Jie, Liu, and Liang 1991a, 1991b; Jin and Chen 1995; Lai et al. 1992; Li et al. 1995; Liang 1986, 1987, 1990; Liu 1986a, 1986b; Liu, Tan, and Shen 1994; Lua 1990, 1994, 1995; Ma 1996; Nie, Jin, and Hannan 1994; Sproat and Shih 1990; Sproat et al. 1996; Sun and T'sou 1995; Sun and Huang 1996; Tung and Lee 1994; Wang, Su, and Mo 1990; Wang 1989; Wang, Wang, and Bai 1991; Wong et al. 1995; Wong et al. 1994; Wu et al. 1994; Wu and Su 1993; Yao, Zhang, and Wu 1990; Yeh and Lee 1991; Zhang, Chen, and Chen 1991).</Paragraph> <Paragraph position="2"> The tokenization problem exists in almost all natural languages, including Japanese (Yosiyuki, Takenobu, and Hozumi 1992), Korean (Yun, Lee, and Rim 1995), German (Pachunke et al. 1992), and English (Garside, Leech, and Sampson 1987); in various media, such as continuous speech and cursive handwriting; and in numerous applications, such as translation, recognition, indexing, and proofreading.</Paragraph> <Paragraph position="3"> For Chinese, sentence tokenization remains an unsolved problem, partly because of its overall complexity, but also because of the lack of a good mathematical description and understanding of the problem. The aim of this paper is therefore to develop such a mathematical description.</Paragraph> <Paragraph position="4"> In particular, this paper focuses on critical tokenization, a distinctive type of tokenization following the maximum principle. What this paper establishes is the notion of critical tokenization itself, together with its precise description and formally proved properties.</Paragraph> <Paragraph position="5"> We will prove that critical points are all and only the unambiguous token boundaries for any character string on a complete dictionary. We will show that any critically tokenized word string is a minimal element in the partially ordered set of all tokenized word strings under the word string cover relation. We will also show that any tokenized word string can be reproduced from a critically tokenized word string, but not vice versa. In other words, critical tokenization is the most compact representation of tokenization.</Paragraph>
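To make the notion of critical points concrete, the following minimal Python sketch (ours, not the paper's algorithm; the toy dictionary and string are hypothetical) enumerates all dictionary tokenizations of a short string and computes the positions that no dictionary word spans across; under the definitions above, those positions are the unambiguous token boundaries, i.e., the critical points. It is a brute-force illustration only, not the optimal algorithm discussed later in this section.

```python
# Illustrative sketch only: the toy dictionary and string are hypothetical,
# and this brute-force approach is not the optimal algorithm of Guo (1997).

DICT = {"a", "b", "c", "d", "ab", "cd"}
MAX_WORD_LEN = max(len(w) for w in DICT)

def all_tokenizations(s):
    """Enumerate every way of segmenting s into dictionary words."""
    if not s:
        return [[]]
    results = []
    for i in range(1, min(len(s), MAX_WORD_LEN) + 1):
        head = s[:i]
        if head in DICT:
            results.extend([head] + rest for rest in all_tokenizations(s[i:]))
    return results

def critical_points(s):
    """Positions 0..len(s) that no dictionary word strictly spans.
    These are exactly the unambiguous token boundaries."""
    spanned = set()
    for start in range(len(s)):
        for end in range(start + 1, min(len(s), start + MAX_WORD_LEN) + 1):
            if s[start:end] in DICT:
                spanned.update(range(start + 1, end))  # interior positions crossed by this word
    return [p for p in range(len(s) + 1) if p not in spanned]

if __name__ == "__main__":
    s = "abcd"
    print(all_tokenizations(s))
    # [['a', 'b', 'c', 'd'], ['a', 'b', 'cd'], ['ab', 'c', 'd'], ['ab', 'cd']]
    print(critical_points(s))
    # [0, 2, 4] -- splitting at these points gives the critical fragments "ab" and "cd"
```

Splitting the string at its critical points yields the critical fragments; in this toy example each fragment happens to be a single dictionary word, so no further disambiguation is needed.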
<Paragraph position="6"> In addition, we will show that critical tokenization forms a sound mathematical foundation for categorizing critical ambiguity and hidden ambiguity in tokenizations, which provides a precise mathematical understanding of conventional concepts such as combinational and overlapping ambiguity. Moreover, we will confirm that some important maximum tokenization variations, such as forward and backward maximum matching and shortest tokenization, are all subclasses of critical tokenization.</Paragraph> <Paragraph position="7"> Building on this mathematical understanding of tokenization, we reported a series of interesting findings in Guo (1997). For instance, there exists an optimal algorithm that identifies all and only the critical points, and thus all unambiguous token boundaries, in time proportional to the length of the input character string and independent of the size of the tokenization dictionary. On a representative test corpus, about 98% of the generated critical fragments are themselves the desired tokens. In other words, about 98% closed-dictionary tokenization accuracy can be achieved efficiently without disambiguation.</Paragraph> <Paragraph position="8"> Another interesting finding is that, for critical fragments with critical ambiguities, replacing the conventionally adopted meaning-preservation criterion with the critical tokenization criterion makes disagreements among (human) judges on the acceptability of a tokenization essentially disappear. Consequently, an objective (human) analysis and annotation of all (critical) tokenizations in a corpus becomes achievable, which in turn leads to some important observations. For instance, in a Chinese corpus of four million morphemes we observed a very strong tendency toward one tokenization per source. Naturally, this observation suggests tokenization disambiguation strategies notably different from the mainstream best-path-finding strategy. For instance, the simple strategy of tokenization by memorization alone can achieve critical ambiguity resolution accuracy of no less than 90%, notably higher than what has been reported in the literature. Moreover, critical tokenization has been observed to provide helpful guidance in identifying hidden ambiguities and in determining unregistered (unknown) tokens (Guo 1997). While these findings are preliminary, they are nevertheless promising, and they motivate us to rigorously formalize the tokenization problem and to carefully explore its logical consequences.</Paragraph>
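As a concrete illustration of one of the maximum tokenization variants mentioned above, here is a minimal forward maximum matching sketch (ours, not the paper's implementation; the toy dictionary is hypothetical). It greedily takes the longest dictionary word at each position from left to right; backward maximum matching is the mirror image, scanning from the right.

```python
# Minimal forward maximum matching (FMM) sketch; the toy dictionary is
# hypothetical and the code is illustrative, not the paper's implementation.

DICT = {"a", "b", "c", "d", "ab", "abc", "cd"}
MAX_WORD_LEN = max(len(w) for w in DICT)

def forward_maximum_matching(s):
    """Greedily take the longest dictionary word at each position, left to right."""
    tokens, i = [], 0
    while i < len(s):
        for length in range(min(MAX_WORD_LEN, len(s) - i), 0, -1):
            word = s[i:i + length]
            if word in DICT:
                tokens.append(word)
                i += length
                break
        else:
            # No dictionary word starts here: emit the single character as an
            # unregistered (unknown) token and move on.
            tokens.append(s[i])
            i += 1
    return tokens

if __name__ == "__main__":
    print(forward_maximum_matching("abcd"))  # ['abc', 'd']
```

On this toy dictionary, backward maximum matching would instead return ['ab', 'cd'] for the same string. In both results no adjacent tokens can be merged into a single dictionary word, which is the intuition behind treating forward and backward maximum matching as subclasses of critical tokenization.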
<Paragraph position="9"> The rest of the paper is organized as follows. In Section 2, we formally define the string generation and tokenization operations that form the basis of our framework. In Section 3, we study tokenization ambiguities and explore the concepts of critical points and critical fragments. In Section 4, we define the word string cover relation and prove it to be a partial order, define critical tokenization as the set of minimal elements of the tokenization partially ordered set, and illustrate the relationship between critical tokenization and string tokenization. Section 5 discusses the relationship between critical tokenization and various types of tokenization ambiguities, while Section 6 addresses the relationship between critical tokenization and various types of maximum tokenizations. Finally, in Sections 7 and 8, after discussing some helpful implications of critical tokenization for effective tokenization disambiguation and for efficient tokenization implementation, we suggest areas for future research and draw some conclusions.</Paragraph> </Section> </Paper>