<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-4004">
  <Title>Critical Tokenization and its Properties</Title>
  <Section position="4" start_page="573" end_page="577" type="metho">
    <SectionTitle>
3. Critical Point and Fragment
</SectionTitle>
    <Paragraph position="0"> After clarifying both sentence generation and tokenization operations, we undertake next to further clarify sentence tokenization ambiguities. Among all the concepts to be introduced, critical points and critical fragments are probably two of the most important. We will prove that, for any character string on a complete tokenization dictionary, its critical points are all and only unambiguous token boundaries, and its critical fragments are the longest substrings with all inner positions ambiguous.</Paragraph>
    <Section position="1" start_page="573" end_page="574" type="sub_section">
      <SectionTitle>
3.1 Ambiguity
</SectionTitle>
      <Paragraph position="0"> Let G be an alphabet, D a dictionary, and S a character string over the alphabet.</Paragraph>
    </Section>
    <Section position="2" start_page="574" end_page="574" type="sub_section">
      <SectionTitle>
Guo Critical Tokenization
</SectionTitle>
      <Paragraph position="0"> Definition 5 The character string S from the alphabet G has tokenization ambiguity on dictionary D, if \]TD(S)\] &gt; 1. S has no tokenization ambiguity, if \]TD(S)\] = 1. S is ill-formed on dictionary D, if ITD(S)\] = 0. A tokenization W C To(S) has tokenization ambiguity, if there exists another tokenization W' E To(S), W' ~ W.</Paragraph>
      <Paragraph position="1"> Example 2 (cont.) Since TD(fundsand) = {&amp;quot;funds and&amp;quot;, &amp;quot;fund sand&amp;quot;}, i.e., \]TD(fundsand)l = 2 &gt; 1, the character string fundsand has tokenization ambiguity. In other words, it is ambiguous in tokenization. Moreover, the tokenization &amp;quot;funds and&amp;quot; has tokenization ambiguity since there exists another possible tokenization &amp;quot;fund sand&amp;quot; for the same character string.</Paragraph>
      <Paragraph position="2"> This definition is quite intuitive. If a character string could be tokenized in multiple ways, it would be ambiguous in tokenization. If a character string could only be tokenized in a unique way, it would have no tokenization ambiguity. If a character string could not be tokenized at all, it would be ill-formed. In this latter case, the dictionary is incomplete.</Paragraph>
      <Paragraph position="3"> Intuitively, a position in a character string is ambiguous in toke~zation or is an ambiguous token boundary if it is a token boundary in one tokenization but not in another. Formally, let S = cl... cn be a character string over an alphabet G and let D be a dictionary over the alphabet.</Paragraph>
      <Paragraph position="4"> Definition 6 Position p has tokenization ambiguity or is an ambiguous token boundary, if there are two tokenizations X = xl... Xs and Y = yl... yt in TD(S), such that G(xl... Xu) = Cl... cp and G(xu+l... Xs) = Cp+l...Cn for some index u, and for any index v, there is neither G(yl... yv) = cl... Cp nor G(yv+l... yt ) = Cp+l... Cn. Otherwise, the position has no tokenization ambiguity, or is an unambiguous token boundary.</Paragraph>
      <Paragraph position="5"> Example 1 (cont.) Given a typical English dictionary and the character string S = thisishisbook, all three positions after character s are unambiguous in tokenization or are unambiguous token boundaries, since all possible tokenizations must take these positions as token boundaries.</Paragraph>
      <Paragraph position="6"> Example 2 (cont.) Given a typical English dictionary and the character string S --fundsand, the position after the middle character s is ambiguous in tokenization or is an ambiguous token boundary since it is a token boundary in tokenization &amp;quot;funds and&amp;quot; but not in another tokenization &amp;quot;fund sand&amp;quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="574" end_page="575" type="sub_section">
      <SectionTitle>
3.2 Complete Dictionary
</SectionTitle>
      <Paragraph position="0"> To avoid ill-formedness in sentence tokenization, we now introduce the concept of a complete tokenization dictionary.</Paragraph>
      <Paragraph position="1">  Computational Linguistics Volume 23, Number 4 That is, for any character string S = c 1 . .. C n from the alphabet, there exists at least one word string W = wl ... Wm with S as its generated character string, G(W) = S. Theorem 1 A dictionary D over an alphabet G is complete if and only if all the characters in the alphabet are single-character words in the dictionary.</Paragraph>
      <Paragraph position="2"> Proof On the one hand, every single character is also a character string (of length 1). To ensure that such a single-character string is being tokenized, the single character must be a word in the dictionary. On the other hand, if all the characters are words in the dictionary, any character string can at least be tokenized as a string of single-character words. \[\] Theorem I spells out a simple way of making any dictionary complete, which calls for adding all the characters of an alphabet into a dictionary as single-character words. This is referred to as the dictionary completion process. If not specified otherwise, in this paper, when referring to a complete dictionary or tokenization dictionary, we mean the dictionary after the completion process.</Paragraph>
    </Section>
    <Section position="4" start_page="575" end_page="576" type="sub_section">
      <SectionTitle>
3.3 Critical Point and Fragment
</SectionTitle>
      <Paragraph position="0"> Let S = cl...cn be a character string over the alphabet ~ and let D be a dictionary over the alphabet. In addition, let To(S) be the tokenization set of S on D.</Paragraph>
      <Paragraph position="1"> Definition 8 Position p in character string S = cl... Cn is a critical point, if for any word string W = wl... wm in To(S), there exists an index k, 0 &lt; k &lt; m, such that G(wl... Wk) =cl... Cp and G(Wk+l... Win) = Cp+l... Cn. In particular, the starting position 0 and the ending position n are the two ordinary critical points. Substring cp+l ... cq is a critical fragment of S on D, if both p and q are critical points and any other position r in between them, p &lt; r &lt; q, is not a critical point.</Paragraph>
      <Paragraph position="2"> Example 1 (cont.) Given a typical English dictionary, there are five critical points in the character string S = thisishisbook. They are 0, 4, 6, 9, and 13. The corresponding four critical fragments are this, is, his, and book.</Paragraph>
      <Paragraph position="3"> Example 2 (cont.) Given a typical English dictionary, there is no extraordinary critical point in the character string S = fundsand. It is by itself the only critical fragment of this character string.</Paragraph>
      <Paragraph position="4"> Given a complete tokenization dictionary, it is obvious that all single-character critical fragments or, more generally, single-character strings, possess unique tokenization. That is, they possess neither ambiguity nor ill-formedness in tokenization. However, the truth of the statement below (Lemma 1) is less obvious.</Paragraph>
      <Paragraph position="5"> Lemma 1 For a complete tokenization dictionary, all multicharacter critical fragments and all of their inner positions are ambiguous in tokenization.</Paragraph>
    </Section>
    <Section position="5" start_page="576" end_page="576" type="sub_section">
      <SectionTitle>
Proof
</SectionTitle>
      <Paragraph position="0"> Let S = cl ... Cn, n &gt; 1, be a multicharacter critical fragment. Because the tokenization dictionary is complete, the critical fragment can at least be tokenized as a string of single-character words. On the other hand, because it is a critical fragment, for any position p, 1 _&lt; p &lt; n - 1, there must exist a tokenization W = Wt...Wm in TD(S) such that for any index k, 0 G k G m, there is neither G(wl...Wk) = cl...cp nor G(wk+l... win) = cp+l...Cn. As this tokenization differs from the above-mentioned tokenization of the string of single-character words, the critical fragment has at least two different tokenizations and thus has tokenization ambiguity. \[\]</Paragraph>
    </Section>
    <Section position="6" start_page="576" end_page="577" type="sub_section">
      <SectionTitle>
3.4 Discussion
</SectionTitle>
      <Paragraph position="0"> In this section, we have described sentence tokenization ambiguity from three different angles: character strings, tokenizations, and individual string positions. The basic idea is conceptually simple: ambiguity exists when there are different means to the same end. For instance, as long as a character string has multiple tokenizations, it is ambiguous.</Paragraph>
      <Paragraph position="1"> This description of ambiguity is complete. Given a character string and a dictionary, it is always possible to answer deterministically whether or not a string is ambiguous in tokenization. Conceptually, for any character string, by checking every one of its possible substrings in a dictionary, and then by enumerating all valid word concatenations, all word strings with the character string as their generated character string can be produced. Just counting the number of such word strings will provide the answer to whether or not the character string is ambiguous.</Paragraph>
      <Paragraph position="2"> Some researchers question the validity of the complete dictionary assumption.</Paragraph>
      <Paragraph position="3"> Here we argue that, even in the strictest linguistic sense, there exists no single character that cannot be used as a single-character word in sentences. In any case, any natural language must allow us to directly refer to single characters. For instance, you could say &amp;quot;character x has many written forms&amp;quot; or &amp;quot;the character x in this word can be omitted&amp;quot; for any character x. 3 3 Even so, some researchers might still insist that the character x here is just for temporary use and cannot be regarded as a regular word with the many linguistic properties generally associated with words. Understanding the importance of such a distinction, we will use the more generic term token, rather than the loaded term word, when we need to highlight the distinction. It must be added, however, that the two are largely used interchangeably in this paper.</Paragraph>
      <Paragraph position="4">  Computational Linguistics Volume 23, Number 4 The validity of the complete dictionary assumption can also be justified from an engineering perspective. To ensure a so-called soft landing, any practical application system must be designed so that every input character string can always be tokenized.</Paragraph>
      <Paragraph position="5"> In other words, a complete dictionary is an operational must. Moreover, without such a complete dictionary, it would not be possible to avoid ill-formedness in sentence tokenization nor to make the generation-tokenization system for character and words closed and complete. Without such definitions of well-formedness, any rigorous formal study would be impossible.</Paragraph>
      <Paragraph position="6"> The concepts of critical point and critical fragment are fundamental to our sentence tokenization theory. By adopting the complete dictionary assumption, it has been proven that critical points are all and only unambiguous token boundaries while critical fragments are the longest substrings with all inner positions ambiguous.</Paragraph>
      <Paragraph position="7"> This is a very strong and significant statement. It provides us with a precise understanding of what and where tokenization ambiguities are. Although the proof itself is easy to follow, the result has nonetheless been a surprise. As demonstrated in Guo (1997), many researchers have tried but failed to answer the question in such a precise and complete way. Consequently, while they proposed many sophisticated algorithms for the discovery of ambiguity (and certainty), they never were able to arrive at such a concise and complete solution.</Paragraph>
      <Paragraph position="8"> As critical points are all and only unambiguous token boundaries, an identification of all of them would allow for a long character string to be broken down into several short but fully ambiguous critical fragments. As shown in Guo (1997), critical points can be completely identified in linear time. Moreover, in practice, most critical fragments are dictionary tokens by themselves, and the remaining nondictionary fragments are generally very short. In short, the understanding of critical points and fragments will significantly assist us in both efficient tokenization implementation and tokenization ambiguity resolution.</Paragraph>
      <Paragraph position="9"> The concepts of critical point and critical fragment are similar to those of segment point and character segment in Wang (1989, 37), which were defined on a sentence word graph for the purpose of analyzing the computational complexity of his new tokenization algorithm. However, Wang (1989) neither noticed their connection with tokenization ambiguities nor realized the importance of the complete dictionary assumption, and hence failed to demonstrate their crucial role in sentence tokenization.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="577" end_page="584" type="metho">
    <SectionTitle>
4. Critical Tokenization
</SectionTitle>
    <Paragraph position="0"> This section seeks to disclose an important structure of the set of different tokenizations. We will see that different tokenizations can be linked by the cover relationship to form a partially ordered set. Based on that, we will establish the notion of critical tokenization and prove that every tokenization is a subtokenization of a critical tokenization, but no critical tokenization has true supertokenization.</Paragraph>
    <Section position="1" start_page="577" end_page="578" type="sub_section">
      <SectionTitle>
4.1 Cover Relationship
</SectionTitle>
      <Paragraph position="0"> Definition 9 Let X and Y be word strings. X covers Y, or X has a cover relation to Y, denoted X &lt; Y, if for any substring Xs of X, there exists substring Ys of Y, such that IXsl ( IYsl and G(Xs) = G(Ys). If X G Y, then X is called a covering word string of Y, and Y a covered word string of X.</Paragraph>
      <Paragraph position="1"> Intuitively, X ~ Y implies \]X\] &lt; \]YI. In other words, shorter word strings cover longer word strings. However, an absence of X &lt; Y does not imply the existence of  Guo Critical Tokenization Y &lt;__ X. Some word strings do not cover each other. In other words, shorter word strings do not always cover longer word strings.</Paragraph>
      <Paragraph position="2"> Example 1 (cont.) The word string &amp;quot;this is his book&amp;quot; covers the word string &amp;quot;th is is his book&amp;quot;, but not vice versa.</Paragraph>
      <Paragraph position="3"> Example 2 (cont.) The word strings &amp;quot;funds and&amp;quot; and &amp;quot;fund sand&amp;quot; do not cover each other. Definition 9 I Let A and B be sets of word strings. A covers B, or A has a cover relation to B, denoted A ~ B, if for any Y c B, there is X E A, such that X ~ Y. If A ~ B, A is called a covering  word string set of B, and B a covered word string set of A. Example 3 Given the alphabet G = {a, b, c, d}, dictionary D = {a, b, c, d, ab, be, cd, abe, bed}, and character string S = abed from the alphabet, there is TD(S) = {a/b/c/d, a/b/cd, a/bc/d, a/bcd, ab/c/d, ab/cd, abe~d}. Among them, there are {abe~d} &lt; {ab/c/d, a/bc/d}, {ab/cd} &lt;_ {ab/e/d, a/b/ed}, {a/bed} &lt; {a/be/d, a/b/ed} and {ab/c/d, a/be/d, a/b/ed} &lt; {a/b/c/d}. Moreover, there is {abe~d, ab/cd, a/bcd} ~_ TD(S).</Paragraph>
    </Section>
    <Section position="2" start_page="578" end_page="579" type="sub_section">
      <SectionTitle>
4.2 Partially Ordered Set
</SectionTitle>
      <Paragraph position="0"> Lemma 2 The cover relation is transitive, reflexive, and antisymmetric. That is, the cover relation is a (reflexive) partial order.</Paragraph>
      <Paragraph position="1"> Lemma 2, proved in Guo (1997), reveals that the cover relation is a partial order-a well-defined mathematical structure with good mathematical properties. Consequently, from any textbook on discrete mathematics (Kolman and Busby \[1987\], for example), it is known that the tokenization set TD(S), together with the word string cover relation &lt;, forms a partially ordered set, or simply a poser. We shall denote this poset by (TD(S), _&lt;). In case there is no confusion, we may refer to the poset simply as TD(S).</Paragraph>
      <Paragraph position="2"> In the literature, usually a poset is graphically presented in a Hasse diagram, which is a digraph with vertices representing poset elements and arcs representing direct partial order relations between poset elements. In a Hasse diagram, all connections implied by the partial order's transitive property are eliminated. That is, if X ~ Y and Y &lt; Z, there should be no arc from X to Z.</Paragraph>
      <Paragraph position="3"> Example 3 (cont.) The poset TD(abcd) = {a/b/c/d, a/b/cd, a/bc/d, a/bcd, ab/c/d, ab/cd, abe~d} can be graphically presented in the Hasse diagram in Figure 1.</Paragraph>
      <Paragraph position="4"> Certain elements in a poset are of special importance for many of the properties and applications of posets. In this paper, we are particularly interested in the minimal elements and least elements. In standard textbooks, they are defined in the following manner: Let (A, &lt;) be a poset. An element a E A is called a minimal element of A if there is no element c E A, c ~ a, such that c &lt; a. An element a E A is called a least  element of A if a &lt; x for all x E A. (Kolman and Busby 1987, 195-196).</Paragraph>
      <Paragraph position="5">  Computational Linguistics Volume 23, Number 4 }iiiiiii)i}iiiiiiS}i}iiiiiii i iiiiiii. !:-!:-!:-? ~ ~ ~ ~!:.::?~!s~ !ii! i ~!~ :! :~!~ :-:-2: i iiiii iii ii;iiiiii55iii \[::.:!5~:: :~ :::: !~ :~):~ !::: !~:~:. :~!:::::i ~:}~!:~: :~,:/!:!:!:!~: :::!~:~!~!~i\]iiil}iiiiiiiii;iiiiiiiiiiiii25ililiiiiiiiiii! Figure 1 The Hasse diagram for the poset TD(abcd) = {a/b/c/d, a/b/cd, a/bc/d, a/bcd, ab/c/d, ab/cd, abc/d}.</Paragraph>
      <Paragraph position="6"> Example 1 (cont.) The word string &amp;quot;this is his book&amp;quot; is both the minimal element and the least element of both TD( thisishisbook ) = {&amp;quot;this is his book&amp;quot;} and TD, ( thisishisbook ) = {&amp;quot;th is is his book&amp;quot;, &amp;quot;this is his book&amp;quot;}.</Paragraph>
      <Paragraph position="7"> Example 2 (cont.)  The poset TD(fundsand) = {&amp;quot;funds and&amp;quot;, &amp;quot;fund sand&amp;quot;} has both &amp;quot;funds and&amp;quot; and &amp;quot;fund sand&amp;quot; as its minimal elements, but has no least element.</Paragraph>
      <Paragraph position="8"> Example 3 (cont.) The poset TD(abcd) = {a/b/c/d, a/b/cd, a/bc/d, a/bcd, ab/c/d, ab/cd, abc/d} has three minimal elements: abc/d, ab/cd, a/bcd. It has no least element.</Paragraph>
      <Paragraph position="9"> Note that any finite nonempty poset has at least one minimal element. Any poset has at most one least element (Kolman and Busby 1987, 195-198).</Paragraph>
    </Section>
    <Section position="3" start_page="579" end_page="580" type="sub_section">
      <SectionTitle>
4.3 Critical Tokenization
</SectionTitle>
      <Paragraph position="0"> This section deals with the most important concept---critical tokenization. Let ~ be an alphabet, D a dictionary over the alphabet, and S a character string over the alphabet.</Paragraph>
      <Paragraph position="1"> In this case, (TD(S), &lt;_) is the poset.</Paragraph>
      <Paragraph position="2"> Definition 10 The character string critical tokenization operation CD is a mapping CD: ~* --~  2 D&amp;quot; defined as: for any S in ~*, CD(S) = {W I W is a minimal elementoftheposet (TD(S), G)}. Any word string W in CD(S) is a critically tokenized word string, or simply a critical tokenization, or CT tokenization for short, of the character string S. And CD(S) is the set of critical tokenizations.</Paragraph>
    </Section>
    <Section position="4" start_page="580" end_page="580" type="sub_section">
      <SectionTitle>
Guo Critical Tokenization
</SectionTitle>
      <Paragraph position="0"> In other words, the critical tokenization operation maps any character string to its set of critical tokenizations. A word string is critical if any other word string does not cover it.</Paragraph>
      <Paragraph position="1"> Example 1 (cont.) Given the English alphabet, the tiny Dictionary D = {th, this, is, his, book}, and the character string S = thisishisbook, there is Co(S) = {&amp;quot;this is his book&amp;quot;}. This critical tokenization set contains the unique critical tokenization &amp;quot;this is his book&amp;quot;. Note that the only difference between &amp;quot;this is his book&amp;quot; and &amp;quot;th is is his book&amp;quot; is that the word this in the former is split into two words th and is in the latter.</Paragraph>
      <Paragraph position="2"> Example 2 (cont.) Given the English alphabet, the tiny Dictionary D = {fund, funds, and, sand}, and the character string S = fundsand, there is CD( S) = {&amp;quot;funds and&amp;quot;, &amp;quot;fund sand&amp;quot;}. Example 3 (cont.) Let E = {a,b,c,d} and D = {a,b,c,d, ab, bc, cd, abc, bcd}. There is CD(abcd) = {abc/d, ab/cd, a/bcd}. If D' --- {a, b, c, d, ab, bc, cd}, then Co, (abcd) = {a/bc/d, ab/cd}. Example 4 Given the English alphabet, the tiny Dictionary D = {the, blue, print, blueprint}, and the character string S = theblueprint, there are To(S) = {&amp;quot;the blueprint&amp;quot;, &amp;quot;the blue print&amp;quot;} and Co(S) = {&amp;quot;the blueprint&amp;quot;}. Note that the tokenization &amp;quot;the blue print&amp;quot; is not critical (not a critical tokenization).</Paragraph>
    </Section>
    <Section position="5" start_page="580" end_page="581" type="sub_section">
      <SectionTitle>
4.4 Super- and SubTokenization
</SectionTitle>
      <Paragraph position="0"> Intuitively, a tokenization is a subtokenization of another tokenization if further tokenizing words in the latter can produce the former. Formally, let S be a character string over an alphabet E and let D be a dictionary over the alphabet. In addition, let  Computational Linguistics Volume 23, Number 4 single word x. As any single word in a word string is also its single-word substring, it can be concluded that for any word x in X, there exists a substring Ys of Y, such that x = G(Ys).</Paragraph>
      <Paragraph position="1"> On the other hand, if Y is a subtokenization of X, by definition, for any word x in X, there exists a substring Ys of Y such that x = G(Ys). Thus, given any substring Xs of X, Xs = Xl... x,, for any k, 1 &lt; k &lt; n, there exists a substring Yk of Y such that Xk = G(Yk). Denote Ys --- Y1. . . Ym, there is IXsl &lt; IYsl and G(Xs) = G(Ys). By definition, there is X &lt; Y. \[\] Lemma 3 reveals that a word string is covered by another word string if and only if every word in the latter is realized in the former as a word string. In other words, a covering word string is in a more compact form than its covered word string. Theorem 3 Every tokenization has a critical tokenization as its supertokenization, but critical tokenization has no true supertokenization.</Paragraph>
      <Paragraph position="2"> That is, for any tokenization Y, Y E To(S), there exists critical tokenization X, X E Co(S), such that X is a supertokenization of Y. Moreover, if Y is a critical tokenization and X is its supertokenization, there is X = Y.</Paragraph>
      <Paragraph position="3">  By definition, for any tokenization Y, Y E To(S), there is a critical tokenization X, X E Co(S), such that X _ Y. By Lemma 3, it would be the same as saying that X is a supertokenization of Y. The second part of the theorem is from the definition of critical tokenization. \[\] Theorem 3 states that no critical tokenization can be produced by further tokenizing words in other tokenizations. However, all other tokenizations can be produced from at least one critical tokenization by further tokenizing words in it. Example 3 (cont.) Given TD(S) = {a/b/c/d, a/b/cd, a/bc/d, a/bcd, ab/c/d, ab/cd, abc/d}, there is Co(S) = {abc/d, ab/cd, a/bcd} ~ To(S). By splitting the word abc in abc/d E Co(S) into a/b/c, ab/c or a/bc, we can make another three tokenizations in To(S): a/b/c/d, ab/c/d and a/bc/d. Similarly, from ab/cd, we can bring back a/b/c/d, ab/c/d and a/b/cd; and from abc/d, we can recover a/b/c/d, ab/c/d and a/bc/d. By merging all word strings produced together with word strings in Co(S) = {abc/d, ab/cd, a/bcd}, the entire tokenization set To(S) is reclaimed.</Paragraph>
    </Section>
    <Section position="6" start_page="581" end_page="582" type="sub_section">
      <SectionTitle>
4.6 Discussion
</SectionTitle>
      <Paragraph position="0"> Since the theory of partially ordered sets is well established, we can use it to enhance our understanding of the mathematical structure of string tokenization. One of the obvious and immediate results is the concept of critical tokenization, which is simply another name for the minimal element set of a poset. The least element is another important concept. Although it may seem trivial to the string tokenization problem, the critical tokenization is, in fact, absolutely crucial. For instance, Theorem 3 states that, from critical tokenization, any tokenization can be produced (enumerated). As the number of critical tokenizations is normally considerably less than the total amount of all possible tokenizations, this theorem leads us to focus on the study of a few critical  Guo Critical Tokenization ones. In the next few sections, we shall further investigate certain important aspects of critical tokenizations.</Paragraph>
      <Paragraph position="1"> 5. Critical and Hidden Ambiguities  character string fundsand has critical ambiguity in tokenization. Moreover, the tokenization &amp;quot;funds and&amp;quot; has critical ambiguity in tokenization since there exists another possible tokenization &amp;quot;fund sand&amp;quot; such that both &amp;quot;funds and&amp;quot; &lt;_ &amp;quot;fund sand&amp;quot; and &amp;quot;fund sand&amp;quot; &lt;_ &amp;quot;funds and&amp;quot; do not hold.</Paragraph>
      <Paragraph position="2"> Example 4 (cont.) Since CD(theblueprint) = {&amp;quot;the blueprint&amp;quot;}, the character string theblueprint does not have critical ambiguity in tokenization.</Paragraph>
      <Paragraph position="3"> It helps to clarify that the only difference between the definition of tokenization ambiguity and that of critical ambiguity in tokenization lies in the tokenization set: While tokenization ambiguity is defined on the entire tokenization set TD(S), critical ambiguity in tokenization is defined only on the critical tokenization set CD(S), which is a subset of To(S).</Paragraph>
      <Paragraph position="4"> As all critical tokenizations are minimal elements on the word string cover relationship, the existence of critical ambiguity in tokenization implies that the &amp;quot;most powerful and commonly used&amp;quot; (Chen and Liu 1992, 104) principle of maximum tokenization would not be effective in resolving critical ambiguity in tokenization and implies that other means such as statistical inferencing or grammatical reasoning have to be introduced. In other words, critical ambiguity in tokenization is unquestionably critical.</Paragraph>
      <Paragraph position="5"> Critical ambiguity in tokenization is the precise mathematical description of conventional concepts such as disjunctive ambiguity (Webster and Kit \[1992, 1108\], for example) and overlapping ambiguity (Sun and T'sou \[1995, 121\], for example). We will return to this topic in Section 5.4.</Paragraph>
    </Section>
    <Section position="7" start_page="582" end_page="582" type="sub_section">
      <SectionTitle>
5.2 Hidden Ambiguity in Tokenization
</SectionTitle>
      <Paragraph position="0"> blueprint&amp;quot;}. Since To(S) ~ Co(S), the character sting theblueprint has hidden ambiguity in tokenization. Since &amp;quot;the blueprint&amp;quot; &lt;_ &amp;quot;the blue print&amp;quot;, the character string &amp;quot;the blueprint&amp;quot; has hidden ambiguity in tokenization.</Paragraph>
      <Paragraph position="1"> Intuitively, a tokenization has hidden ambiquity in tokenization, if some words in it can be further decomposed into word strings, such as &amp;quot;blueprint&amp;quot; to &amp;quot;blue print&amp;quot;. They are called hidden or invisible because others cover them. The resolution of hidden ambiguity in tokenization is the aim of the principle of maximum tokenization (Jie 1989; Jie and Liang 1991). Under this principle, only covering tokenizations win and all covered tokenizations are discarded.</Paragraph>
      <Paragraph position="2"> Hidden ambiguity in tokenization is the precise mathematical description of conventional concepts such as conjunctive ambiguity (Webster and Kit \[1992, 1108\], for example), combinational ambiguity (Liang \[1987\], for example) and categorical ambiguity (Sun and T'sou \[1995, 121\], for example). We will return to this topic in Section 5.4.</Paragraph>
    </Section>
    <Section position="8" start_page="582" end_page="582" type="sub_section">
      <SectionTitle>
5.3 Ambiguity = Critical + Hidden
</SectionTitle>
      <Paragraph position="0"> Let E be an alphabet, D a dictionary, and S a character string over the alphabet.</Paragraph>
      <Paragraph position="1"> Theorem 4 A character string S over an alphabet ~ has tokenization ambiguity on a tokenization dictionary D if and only if S has either critical ambiguity in tokenization or hidden ambiguity in tokenization.</Paragraph>
      <Paragraph position="2">  If S has critical ambiguity in tokenization, by definition, there is ICD(S)I &gt; 1. If S has hidden ambiguity in tokenization, by definition, there is TD(S) ~ CD(S). In both cases, since CD(S) c_C_ TD(S), there must be ITD(S)\[ &gt; 1. By definition, S has tokenization ambiguity.</Paragraph>
      <Paragraph position="3"> If S has tokenization ambiguity, by definition, there is ITo(S)I &gt; 1. Since any finite nonempty poset has at least one minimal element, there is \[Co(S)I &gt; 0. Since Co(S) c To(S), there is To(S) # Co(S) if ICo(S)I = 1. In this case, by definition, S has hidden ambiguity in tokenization. If ICo(S)\[ &gt; 1, by definition, S has critical ambiguity in tokenization. \[\] Theorem 4 explicitly and precisely states that tokenization ambiguity is the union of critical ambiguity in tokenization and hidden ambiguity in tokenization. This result helps us in the understanding of character string tokenization ambiguity.</Paragraph>
    </Section>
    <Section position="9" start_page="582" end_page="584" type="sub_section">
      <SectionTitle>
5.4 Discussion
</SectionTitle>
      <Paragraph position="0"> By freezing the problem of token identity determination, tokenization ambiguity identification and resolution are all that is required in sentence tokenization. Consequently, it must be crucial and beneficial to pursue an explicit and accurate understanding of various types of character string tokenization ambiguities and their relationships.</Paragraph>
      <Paragraph position="1"> In the literature, however, the general practice is not to formally define and classify ambiguities but to apply various terms to them, such as overlapping ambiguity and combinational ambiguity in their intuitive and normally fuzzy senses. Nevertheless, efforts do exist to rigorously assign them precise, formal meanings. As a representa- null Guo Critical Tokenization tive example, in Webster and Kit (1992, 1108), both conjunctive (combinational) and disjunctive (overlapping) ambiguities are defined in the manner given below.</Paragraph>
      <Paragraph position="2"> 1. TYPE h In a sequence of Chinese 4 characters S = al... aibl ... by, if al... ai, bl... bj, and S are each a word, then there is conjunctive ambiguity in S. The segment S, which is itself a word, contains other words. This is also known as multi-combinational ambiguity.</Paragraph>
      <Paragraph position="3"> 2. TYPE II: In a sequence of Chinese characters S = al ... aibl ... bjCl ... Ck, if  al... aibl.., bj and bl... bjCl... Ck are each a word, then S is an overlapping ambiguous segment, or in other words, the segment S displays disjunctive ambiguity. The segment bl... bj is known as an overlap, which is usually one character long.</Paragraph>
      <Paragraph position="4"> The definitions above contain nothing improper. In fact, conjunctive (combinational) ambiguity as defined above is a special case of hidden ambiguity in tokenization, since &amp;quot;al ... aibl.., bj&amp;quot; &lt;_ &amp;quot;al... ai/bl.., bj&amp;quot;. Moreover, disjunctive (overlapping) ambiguity is a special case of critical ambiguity in tokenization, since for the character string S = al . . . aibl . . . bjc, . . . Ck, both &amp;quot;al... aibl . . . bj /Cl . . . Ck&amp;quot; and &amp;quot;a,... ai/bl . . . bye1</Paragraph>
      <Paragraph position="6"> The definitions above, however, are neither complete nor critical. In our opinion, a definition is complete only if any phenomenon in the problem domain can be properly described (defined). With regard to the character string tokenization problem proper, this completeness requirement can be translated as: given an alphabet, a dictionary, and a character string, the definition should be sufficient to answer the following two questions: (1) does this character string have tokenization ambiguity? (2) if yes, what type of ambiguity does it have? The definitions above cannot fulfill this completeness requirement. For instance, if al ... ai, bl * :. by, Cl ... Ck, and al ... aibl ... bjCl ... Ck are all words in a dictionary, the character string S = al ... aibl ... bjCl ... Ck, while intuitively in Type I (conjunctive ambiguity), is, in fact, captured neither by Type I nor by Type II.</Paragraph>
      <Paragraph position="7"> We agree that, although to do so would not be trivial, it is nevertheless possible to make the definitions above complete by carefully listing and including all possible cases. However, criticality, which is what is being explored in this paper, would most probably still not be captured in such a carefully generalized ambiguity definition.</Paragraph>
      <Paragraph position="8"> What we believe to be crucial is the association between tokenization ambiguity and the maximization or minimization property of the partially ordered set on the cover relation. As will be illustrated later in this paper, such an association is exceptionally important in attempting to understand ambiguities and in developing disambiguation strategies.</Paragraph>
      <Paragraph position="9"> In short, both the cover relation and critical tokenization have given us a clear picture of character string tokenization ambiguity as expressed in Theorem 4.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="584" end_page="589" type="metho">
    <SectionTitle>
6. Maximum Tokenization
</SectionTitle>
    <Paragraph position="0"> This section clarifies the relationship between critical tokenization (CT) and three other representative implementations of the principle of maximum tokenization, i.e., forward maximum tokenization (FT), backward maximum tokenization (BT) and shortest to-</Paragraph>
    <Paragraph position="2"> For any j', j &lt; j' &lt; n, there is ci... cj, ~ D.</Paragraph>
    <Paragraph position="3"> The forward maximum tokenization operation, or FT operation for short, is a mapping FD: ~* ~ 2 D* defined as: for any S E ~*, FD(S) = {W I W is a FT tokenization of S over G and D}.</Paragraph>
    <Paragraph position="4"> This definition is in fact a descriptive interpretation of the widely recommended conventional constructive forward maximum tokenization procedure (Liu 1986a, 1986b; Liang 1986, 1987; Chen and Liu 1992; Webster and Kit 1992).</Paragraph>
    <Paragraph position="5">  have many possible tokenizations. For example, given the alphabet G = {a, b, c, d} and the dictionary D = {a, abc, bcd}, there is TD(abcd) = {a/bcd}. But the single tokenization does not fulfill condition (3) in the definition above for k = 1, because the longer word abc exists in the dictionary.</Paragraph>
    <Paragraph position="6"> 5 Note, as a widely adopted convention, in case k ~ 1, Wl * Wk_ 1 represents the empty word string v and Cl... Ck-1 represents the empty character string e.</Paragraph>
    <Section position="1" start_page="586" end_page="586" type="sub_section">
      <SectionTitle>
Guo Critical Tokenization
</SectionTitle>
      <Paragraph position="0"> Assume both X = Xl... Xm and Y = yl... ym' are FT tokenizations, X ~ Y. Then, there must exist k, 1 &lt; k &lt; rain(m, m'), such that Xk, = Yk', for all k', 1 &lt; k' &lt; k, but Xk # yk. Since G(X) = G(Y), there must be IXkl # lYkl. Consequently, either X or Y is unable to fulfill condition (3) of definition 14. By contradiction, there must be X = Y.</Paragraph>
      <Paragraph position="1"> In other words, any character string at most has single FT tokenization.</Paragraph>
      <Paragraph position="2"> Assume the FT tokenization X = xl ... Xm is not a CT tokenization. By Theorem 3, there must exist a CT tokenization Y = yl... ym' such that X # Y and Y &lt; X. Thus, by the cover relation definition, for any substring Ys of Y, there exists substring Xs of X, such that IYsl &lt; IXsl and G(Xs) = G(Ys). Since X # Y, there must exist k, 1 &lt; k &lt; min(m,m'), such that Xk, = yk', for all k', 1 &lt;_ k' &lt; k, but IXkl &lt;_ lYkl. This leads to a conflict with condition (3) in the definition. In other words, X cannot be an FT tokenization if it is not a CT tokenization. \[\]</Paragraph>
    </Section>
    <Section position="2" start_page="586" end_page="587" type="sub_section">
      <SectionTitle>
6.2 Backward Maximum Tokenization
</SectionTitle>
      <Paragraph position="0"> Let G be an alphabet, D a dictionary on the alphabet, and S a character strings over the alphabet.</Paragraph>
      <Paragraph position="1"> Definition 15 A tokenization W = Wl-..Wm C To(S) is a backward maximum tokenization of S over G and D, or BT tokenization for short, if for any k, 1 &lt; k &lt; m, there exist i and j, 1 &lt; i G j &lt; n, such that  1. G(Wk+l .. . Win) = Cj+l .. . Cn, 2. Wk = Ci . . . Cj, and 3. For any i', 1 &lt; i' &lt; i, there is ci .... cj ~ D.  The backward maximum tokenization operation is a mapping BD: ~* -~ 2 D&amp;quot; defined as: for any S E ~*, Bo(S) = {W I W is a BT tokenization of S over ~ and D}. This definition is in fact a descriptive interpretation of the widely recommended conventional constructive backward maximum tokenization procedure (Liu 1986a, 1986b; Liang 1986, 1987; Chen and Liu 1992; Webster and Kit 1992). Example 3 (cont.) For the character string S = abcd, the word string a/bcd is the only BT tokenization in To(S) = {a/b/c/d, a/b/cd, a/bc/d, a/bcd, ab/c/d, ab/cd, abc/d}. That is, Bo(S) = {a/bcd}.</Paragraph>
      <Paragraph position="2"> Example 2 (cont.) For the character string S = fundsand, there is Bo(fundsand) = {&amp;quot;fund sand&amp;quot;}. That is, the word string &amp;quot;fund sand&amp;quot; is the only BT tokenization. Example 4 (cont.) For the character string S = theblueprint, there is BD(S) = {&amp;quot;the blueprint&amp;quot;}. That is, the word string &amp;quot;the blueprint&amp;quot; is the only BT tokenization. Lemma 5 For all S E ~*, there are IBD(S)I ~ 1 and BD(S) C_ CD(S).  Computational Linguistics Volume 23, Number 4 That is, any character string has at most one BT tokenization. Moreover, if the BT tokenization exists, it is a CT tokenization.</Paragraph>
      <Paragraph position="3"> Proof Parallel to the proof for Lemma 4. \[\]</Paragraph>
    </Section>
    <Section position="3" start_page="587" end_page="587" type="sub_section">
      <SectionTitle>
6.3 Shortest Tokenization
Definition 16
</SectionTitle>
      <Paragraph position="0"> The shortest tokenization operation SD is a mapping SD: ~* --* 2 D defined as: for any S in ~*, SD(S) = {W I IW\[ = minW'ETD(s)IW'I}&amp;quot; Every tokenization W in SD(S) is a shortest tokenization, or ST tokenization for short, of the character string S.</Paragraph>
      <Paragraph position="1"> In other words, a tokenization W of a character string S is a shortest tokenization if and only if the word string has the minimum word string length among all possible tokenizations.</Paragraph>
      <Paragraph position="2"> This definition is in fact a descriptive interpretation of the constructive shortest path finding tokenization procedure proposed by Wang (1989) and Wang, Wang, and Bai (1991).</Paragraph>
      <Paragraph position="3"> Example 3 (cont.) Given the character string S = abcd. For the dictionary D = {a, b, c, d, ab, bc, cd, abc, bcd}, both abc/d and a/bcd are ST tokenizations in TD(S) = {a/b/c/d, a/b/cd, a/bc/d, a/bcd, ab/c/d, ab/cd, abc/d}. That is, SD(S) = {abc/d, a/bcd}. For D' = {a, b, c, d, ab, bc, cd}, however, there is SD, (S) = {ab/cd}. Note, in this case, the CT tokenization a/bc/d is not in So,(S).</Paragraph>
      <Paragraph position="4"> Example 2 (cont.) For the character string S = fundsand, there is SD(fundsand) = {&amp;quot;funds and&amp;quot;, &amp;quot;fund sand&amp;quot;}. That is, both &amp;quot;funds and&amp;quot; and &amp;quot;fund sand&amp;quot; are ST tokenizations. Example 4 (cont.) For the character string S = theblueprint, there is SD(S) = {&amp;quot;the blueprint&amp;quot;}. That is, the word string &amp;quot;the blueprint&amp;quot; is the only ST tokenization.</Paragraph>
      <Paragraph position="6"> Let X be an ST tokenization, X E SD(S). Assume X is not a CT tokenization, X ~ CD(S).</Paragraph>
      <Paragraph position="7"> Then, by Theorem 3, there exists a CT tokenization Y ~ CD(S), Y ~ X, such that Y &lt; X.</Paragraph>
      <Paragraph position="8"> By the definition of the cover relation, there is IYI &lt; IXI. In fact, as X ~ Y, there must be IY\[ &lt; IXI. This is in conflict with the fact that X is an ST tokenization. Hence, the lemma is proven by contradiction. \[\]</Paragraph>
    </Section>
    <Section position="4" start_page="587" end_page="588" type="sub_section">
      <SectionTitle>
6.4 Theorem
</SectionTitle>
      <Paragraph position="0"> Theorem 5 FD(S)UBD(S ) C CD(S) and SD(S) C_ Co(S) for all S E G*. Moreover, there exists S E E*, such that FD(S) t_; BD(S) Co(S) or SD(S) # CD(S). That is, the forward maximum tokenization, the backward maximum tokenization, and the shortest tokenization are all true subclasses of critical tokenization.</Paragraph>
    </Section>
    <Section position="5" start_page="588" end_page="589" type="sub_section">
      <SectionTitle>
6.5 Principle of Maximum Tokenization
</SectionTitle>
      <Paragraph position="0"> The three tokenization definitions in this section are essentially descriptive restatements of the corresponding constructive tokenization procedures, which in turn are realizations of the widely followed principle of maximum tokenization (e.g., Liu 1986; Liang 1986a, 1986b; Wang 1989; Jie 1989; Wang, Su, and Mo 1990; Jie, Liu, and Liang 1991a, b; Yeh and Lee 1991; Webster and Kit 1992; Chen and Liu 1992; Guo 1993; Wu and Su 1993; Nie, Jin, and Hannah 1994; Sproat et al. 1996; Wu et al. 1994; Li et al. 1995; Sun and T'sou 1995; Wong et al. 1995; Bai 1995; Sun and Huang 1996).</Paragraph>
      <Paragraph position="1"> The first work closest to this principle, according to Liu (1986, 1988), was the 5-4-3-2-1 tokenization algorithm proposed by a Russian MT practitioner in 1956. This algorithm is a special version of the greedy-type implementation of the forward maximum tokenization and is still in active use. For instance, Yun, Lee, and Rim (1995) recently applied it to Korean compound tokenization.</Paragraph>
      <Paragraph position="2"> It is understood that forward maximum tokenization, backward maximum tokenization and shortest tokenization are the three most representative and widely quoted works following the general principle of maximum tokenization. However, the principle itself is not crystal-clear in the literature. Rather, it only serves as a general guideline, as different researchers make different interpretations. As Chen and Liu (1992, 104) noted, &amp;quot;there are a few variations of the sense of maximal matching.&amp;quot; Hence, many variations have been derived after decades of fine-tuning and modification. As Webster and Kit (1992, 1108) acknowledged, different realizations of the principle &amp;quot;were invented one after another and seemed inexhaustible.&amp;quot; While researchers generally agree that a dictionary word should be tokenized as itself, they usually have different opinions on how a non-dictionary word (critical) fragment should be tokenized. While they all agree that a certain form of extremes must be attained, they nevertheless have their own understanding of what the form should be.</Paragraph>
      <Paragraph position="3"> Consequently, it should come as no surprise to see various kinds of theoretical generalization or summarization work in the literature. A good representative work is by Kit and his colleagues (Jie 1989; Jie, Liu, and Liang 1991a, b; Webster and Kit 1992), who proposed a three-dimensional structural tokenization model. This model, called ASM for Automatic Segmentation Model, is capable of characterizing up to eight classes of different maximum or minimum tokenization procedures. Among the eight procedures, based on both analytical inferences and experimental studies, both forward maximum tokenization and backward maximum tokenization are recommended as good solutions. Unfortunately, in Webster and Kit (1992, 1108), they unnecessarily made the following overly strong claim: It is believed that all elemental methods are included in this model.</Paragraph>
      <Paragraph position="4"> Furthermore, it can be viewed as the ultimate model for methods of string matching of any elements, including methods for finding English idioms.</Paragraph>
      <Paragraph position="5"> The shortest tokenization proposed by Wang (1989) provides an obvious counterexample. As Wang (1989) exemplified 6, for the alphabet G = {a, b, c, d, e} and the  Computational Linguistics Volume 23, Number 4 dictionary D = {a, b, c, d, e, ab, bc, cd, de}, the character string S = abcde has FT set FD(S) = {ab/cd/e}, BT set BD(S) = {a/bc/de} and ST set SD(S) = {ab/cd/e, a/bc/de, ab/c/de}. Clearly, the ST tokenization ab/c/de, which fulfills the principle of maximum tokenization and is the desired tokenization in some cases, is neither FT nor BT tokenization. Moreover, careful checking showed that the missed ST tokenization is not in any of the eight tokenization solutions covered by the ASM model. In short, the ASM model is not a complete interpretation of the principle of maximum tokenization.</Paragraph>
      <Paragraph position="6"> Furthermore, the shortest tokenization still does not capture all the essences of the principle. &amp;quot;For instance, given the alphabet G = {a, b, c, d} and the dictionary D = {a, b, c, d, ab, bc, cd}, the character string S = abcd has the same tokenization set FD(S) = BD(S) = SD(S) = {ab/cd} for FT, BT and ST, but a different CT tokenization set CD(S) = {ab/cd, a/bc/d}. In other words, the CT tokenization a/bc/d is left out in all the other three sets. As the tokenization a/bc/d is not a subtokenization of any other possible tokenizations, it fulfills the principle of maximum tokenization.</Paragraph>
      <Paragraph position="7"> It is now clear that, while the principle of maximum tokenization is very useful in sentence tokenization, it lacks precise understanding in the literature. Consequently, no solution proposed in the literature is complete with regards to realizing the principle. Recall that, in the previous sections, the character string tokenization operation was modeled as the inverse of the generation operation. Under the tokenization operation, every character string can be tokenized into a set of different tokenizations. The cover relationship between tokenizations was recognized and the set of tokenizations was proven to be a poset (partially ordered set) on the cover relationship. The set of critical tokenizations was defined as the set of minimum elements in the poset. In addition, it was proven that every tokenization has at least one critical tokenization as its supertokenization and only critical tokenization has no true supertokenization.</Paragraph>
      <Paragraph position="8"> Consequently, a noncritical tokenization would conflict with the principle of maximum tokenization, since it is a true subtokenization of others. As compared with its true supertokenization, it requires the extra effort of subtokenization. On the other hand, a critical tokenization would fully realize the principle of maximum tokenization, since it has already attained an extreme form and cannot be simplified or compressed further. As compared with all other tokenizations, no effort can be saved.</Paragraph>
      <Paragraph position="9"> Based on this understanding, it is now apparent why forward maximum tokenization, backward maximum tokenization, and shortest tokenization are all special cases of critical tokenization, but not vice versa. In addition, it has been proven, in Guo (1997), that critical tokenization also covers other types of maximum tokenization implementations such as profile tokenization and shortest tokenization.</Paragraph>
      <Paragraph position="10"> We believe that critical tokenization is the only type of tokenization completely fulfilling the principle of maximum tokenization. In other words, critical tokenization is the precise mathematical description of the commonly adopted principle of maximum tokenization.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="589" end_page="592" type="metho">
    <SectionTitle>
7. Further Discussion
</SectionTitle>
    <Paragraph position="0"> This section explores some helpful implications of critical tokenization in effective tokenization disambiguation and in efficient tokenization implementation.</Paragraph>
    <Paragraph position="1"> desired tokenization, in many contexts, is &amp;quot; ~ ~ / J~ / ~r~ -~ -</Paragraph>
    <Section position="1" start_page="590" end_page="590" type="sub_section">
      <SectionTitle>
7.1 String Generation and Tokenization versus Language Derivation and Parsing
</SectionTitle>
      <Paragraph position="0"> The relationship between the operations of sentence derivation and sentence parsing in the theory of parsing, translation, and compiling (Aho and Ullman 1972) is an obvious analogue with the relationship between the operations of character string generation and character string tokenization that are defined in this paper. As the former pair of operations is well established, and has great influence in the literature of sentence tokenization, many researchers have, either consciously or unconsciously, been trying to transplant it to the latter. We believe this worthy of reexamination.</Paragraph>
      <Paragraph position="1"> Normally, sentence derivation and parsing are governed by complex grammars.</Paragraph>
      <Paragraph position="2"> Consequently, the bulk of the work has been in developing, representing, and processing grammar. Although it is a well known fact that some sentences may have several derivations or parses, the focus has always been either on (1) grammar enhancement, such as introducing semantic categories and consistency checking rules (selectional restrictions), not to mention those great works on grammar formalisms, or on (2) ambiguity resolution, such as introducing various heuristics and tricks including leftmost parsing and operator preferences (Aho and Ullman 1972; Aho, Sethi, and Ullman 1986; Alien 1995; Grosz, Jones, and Webber 1986).</Paragraph>
      <Paragraph position="3"> Following this line, we observed two tendencies in tokenization research. One is the tendency to bring every possible knowledge source into the character string generation operation. For example, Gan (1995) titled his Ph.D. dissertation Integrating Word Boundary Disambiguation with Sentence Understanding. Here, in addition to traditional devices such as syntax and semantics, he even employed principles of psychology and chemistry, such as crystallization. Another is the tendency of enumerating almost blindly every heuristic and trick possible in ambiguity resolution. As Webster and Kit (1992, 1108) noted, &amp;quot;segmentation methods were invented one after another and seemed inexhaustible.&amp;quot; For example, Chen and Liu (1992) acknowledged that the heuristic of maximum matching alone has &amp;quot;many variations&amp;quot; and tested six different implementations.</Paragraph>
      <Paragraph position="4"> We are not convinced of the effectiveness and necessity of both of the schools of tokenization research. The principle argument is, while research is by nature trial-and-error and different knowledge sources contribute to different facets of the solution, it is nonetheless more crucial and productive to understand where the core of the problem really lies.</Paragraph>
      <Paragraph position="5"> As depicted in this paper, unlike general sentence derivation for complex natural languages, the character string generation process can be very simple and straightforward. Many seemingly important factors such as natural language syntax and semantics do not assume fundamental roles in the process. They are definitely helpful, but only at a later stage. Moreover, as emphasized in this paper, the tokenization set has some very good mathematical properties. By taking advantage of these properties, the tokenization problem can be greatly simplified. For example, among the huge number of possible tokenizations, we can first concentrate on the much smaller.</Paragraph>
      <Paragraph position="6"> critical tokenization set, since the former can be completely reproduced from the latter. Furthermore, by contrasting critical tokenizations, we can easily identify a few critically ambiguous positions, which allows us to avoid wasting energy at useless positions.</Paragraph>
    </Section>
    <Section position="2" start_page="590" end_page="591" type="sub_section">
      <SectionTitle>
7.2 Critical Tokenization and the Syntactic Graph
</SectionTitle>
      <Paragraph position="0"> It is worth noting that similar ideas do exist in natural language derivation and parsing.</Paragraph>
      <Paragraph position="1"> For example, Seo and Simmons (1989) introduced the concept of the syntactic graph, which is, in essence, a union of all possible parse trees. With this graph representation, &amp;quot;it is fairly easy to focus on the syntactically ambiguous points&amp;quot; (p. 19, italics added).  Computational Linguistics Volume 23, Number 4 These syntactically ambiguous points are critical in at least two senses. First, they are the only problems requiring knowledge and heuristics beyond the existing syntax. In other words, any syntactic or semantics development should be guided by ambiguity resolution at these points. If a semantic enhancement does not interact with any of these points, the enhancement is considered ineffective. If a grammar revision in turn leads to additional syntactically ambiguous points, such a revision would be in the wrong direction.</Paragraph>
      <Paragraph position="2"> Second, these syntactically ambiguous points are critical in efficiently resolving ambiguity. After all, these points are the only places where disambiguation decisions must be made. Ideally, we should invest no energy in investigating anything that is irrelevant to these points. However, unless all parse trees are merged together to form the syntactic graph, the only thing feasible is to check every possible position in every parse tree by applying all available knowledge and every possible heuristic, since we are unaware of the effectiveness of any checking that occurs beforehand.</Paragraph>
      <Paragraph position="3"> The critical tokenization introduced in this paper has a similar role in string tokenization to that of the syntactic graph in sentence parsing. By Theorem 3, critical tokenization is, in essence, the union of the whole tokenization set and thus the compact representation of it. As long as the principle of maximum tokenization is accepted, the resolution of critical ambiguity in tokenization is the only problem requiring knowledge and heuristics beyond the existing dictionary. In other words, any introduction of &amp;quot;high-level&amp;quot; knowledge must at least be effective in resolving some critical ambiguities in tokenization. This should be a fundamental guideline in tokenization research.</Paragraph>
      <Paragraph position="4"> Even if the principle of maximum tokenization is not accepted, critical ambiguity in tokenization must nevertheless be resolved. Therefore, any investment, as mentioned above, will not be a waste in any sense. What needs to be undertaken now is to substitute something more precise for the principle of maximum tokenization. It is only at this stage that we touch on the problem of identifying and resolving hidden ambiguity in tokenization. That is one of the reasons why this type of ambiguity is called hidden.</Paragraph>
    </Section>
    <Section position="3" start_page="591" end_page="592" type="sub_section">
      <SectionTitle>
7.3 Critical Tokenization and Best-Path Finding
</SectionTitle>
      <Paragraph position="0"> The theme in this paper is to study the problem of sentence tokenization in the framework of formal languages, a direction that has recently attracted some attention. For instance, in Ma (1996), words in a tokenization dictionary are represented as production rules and character strings are modeled as derivatives of these rules under a string concatenation operation. Although not stated explicitly in his thesis, this is obviously a finite-state model, as evidenced from his employment of (finite-) state diagrams for representing both the tokenization dictionary and character strings. The weighted finite-state transducer model developed by Sproat et al. (1996) is another excellent representative example.</Paragraph>
      <Paragraph position="1"> They both stop at merely representing possible tokenizations as a single large finite-state diagram (word graph). The focus is then shifted to the problem of defining scores for evaluating each possible tokenization and to the associated problem of searching for the best-path in the word graph. To emphasize this point, Ma (1996) explicitly called his approach &amp;quot;evaluation-based.&amp;quot; In comparison, we have continued within the framework and established the critical tokenization together with its interesting properties. We believe the additional step is worthwhile. While tokenization evaluation is important, it would be more effective if employed at a later stage.</Paragraph>
      <Paragraph position="2"> On the one hand, critical tokenization can help greatly in developing tokenization knowledge and heuristics, especially those tokenization specific understandings, such</Paragraph>
    </Section>
    <Section position="4" start_page="592" end_page="592" type="sub_section">
      <SectionTitle>
Guo Critical Tokenization
</SectionTitle>
      <Paragraph position="0"> as the observation of &amp;quot;one tokenization per source&amp;quot; and the trick of highlighting hidden ambiguities by contrasting competing critical tokenizations (Guo 1997).</Paragraph>
      <Paragraph position="1"> While it may not be totally impossible to fully incorporate such knowledge and heuristics into the general framework of path evaluation and searching, they are apparently employed neither in Sproat et al. (1996) nor in Ma (1996). Further, what has been implemented in the two systems is basically a token unigram function, which has been shown to be practically irrelevant to hidden ambiguity resolution and not to be much better than some simple maximum tokenization approaches such as shortest tokenization (Guo 1997).</Paragraph>
      <Paragraph position="2"> On the other hand, critical tokenization can help significantly in boosting tokenization efficiency. As has been observed, the tokenization of about 98% of the text can be completed in the first parse of critical point identification, which can be done in linear time. Moreover, as practically all acceptable tokenizations are critical tokenizations and ambiguous critical fragments are generally very short, the remaining 2% of the text with tokenization ambiguities can also be settled efficiently through critical tokenization generation and disambiguation (Guo 1997).</Paragraph>
      <Paragraph position="3"> In comparison, if the best path is to be searched on the token graph of a complete sentence, while a simple evaluation function such as token unigram cannot be very effective in ambiguity resolution, a sophisticated evaluation function incorporating multiple knowledge sources, such as language experiences, statistics, syntax, semantics, and discourse as suggested in Ma (1996), can only be computationally prohibitive, as Ma himself acknowledged.</Paragraph>
      <Paragraph position="4"> In summary, the critical tokenization is crucial both in knowledge development for effective tokenization disambiguation and in system implementation for complete and efficient tokenization. Further discussions and examples can be found in Guo (1997).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>