<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1076"> <Title>One Tokenization per Source</Title> <Section position="4" start_page="458" end_page="459" type="metho"> <SectionTitle> 3 One Tokenization per Source </SectionTitle> <Paragraph position="0"> Noticing that all the fragments studied in the preceding section are critical fragments (Guo 1997) from the same source, it becomes reasonable to accept the following hypothesis.</Paragraph> <Paragraph position="1"> One tokenization per source: For any critical fragment from a given source, if one of its tokenizations is correct in one occurrence, the same tokenization is also correct in all its other occurrences.</Paragraph> <Paragraph position="2"> The linguistic object here is a critical fragment, i.e., the fragment between two adjacent critical points or unambiguous token boundaries (Guo 1997), not an arbitrary sentence segment. The hypothesis says nothing about the tokenization of a non-critical fragment. Moreover, the hypothesis does not apply if a fragment is critical in some other sentences from the same source but is not critical in the sentence in question.</Paragraph> <Paragraph position="3"> The hypothesis does not imply context independence in tokenization. While the correct tokenization correlates decisively with its source, this does not mean that the correct tokenization has no association with its local sentential context. Rather, the tokenization of any fragment has to be realized in its local and sentential context.</Paragraph> <Paragraph position="4"> It might be argued that the PH corpus of 4 million morphemes is not big enough for many of the critical fragments to realize their different readings in diverse sentential contexts. To answer this question, 10 colleagues were asked to tokenize, without seeing any context, the 123 most frequent non-dictionary-entry critical fragments extracted from the PH corpus. Several of these fragments 2 were thus marked "context dependent", since they have "obvious" different readings in different contexts. Figure 1 shows three examples.</Paragraph> <Paragraph position="6"> [Figure 1: three example fragments with their different readings; the preceding numbers are their occurrence counts in the PH corpus.]</Paragraph> <Paragraph position="7"> 1 For instance, the Chinese fragment dp dx (secondary primary school) is taken as "[secondary (and) primary] school" by one school of thought, but as "[secondary (school)] (and) [primary school]" by another. Neither school, however, would ever agree that the fragment must be analyzed differently in different contexts.</Paragraph> <Paragraph position="8"> 2 While all the fragments are lexically ambiguous in tokenization, many of them received consistent unique tokenizations, as these fragments are, to the human judges, self-sufficient for comfortable ambiguity resolution.</Paragraph> <Paragraph position="9"> We looked all these questionable fragments up in a larger corpus of about 60 million morphemes of news articles, collected from the same source as the PH corpus over a longer time span, from 1989 to 1993.
It turns out that each of these fragments always takes one and the same tokenization, without exception.</Paragraph> <Paragraph position="10"> While we have not been able to specify the notion of source used in the hypothesis with the same clarity as that of critical fragment and critical tokenization in (Guo 1997), the above empirical test makes us comfortable in believing that the scope of the source can be sufficiently large to cover any single domain of practical interest.</Paragraph> </Section> <Section position="5" start_page="459" end_page="461" type="metho"> <SectionTitle> 4 Application in Tokenization </SectionTitle> <Paragraph position="0"> The hypothesis of one tokenization per source can be applied in many ways in sentence tokenization.</Paragraph> <Paragraph position="1"> For tokenization ambiguity resolution, let us examine the following strategy. Tokenization by memorization: If the correct tokenization of a critical fragment is known in one context, remember that tokenization. If the same critical fragment is seen again, retrieve its stored tokenization. Otherwise, if a critical fragment encountered has no stored tokenization, randomly select one of its critical tokenizations.</Paragraph> <Paragraph position="2"> This is a pure and straightforward implementation of the hypothesis of one tokenization per source, as it does not exploit any constraints other than the tokenization dictionary.</Paragraph> <Paragraph position="3"> While it sounds trivial, this strategy performs surprisingly well. Although the strategy is universally applicable to any tokenization ambiguity resolution, here we examine its performance only on the resolution of critical ambiguities (Guo 1997), for ease of direct comparison with work in the literature.</Paragraph> <Paragraph position="4"> As above, we have manually tokenized 3 all non-dictionary-entry critical fragments in the PH corpus; i.e., we know the correct tokenizations for all of these fragments. Therefore, if any of these fragments appears somewhere else, its tokenization can be readily retrieved from what we have done manually. If the hypothesis held perfectly, we could not make any error.</Paragraph> <Paragraph position="5"> The only weakness of this strategy is its apparent inadequacy in dealing with the sparse data problem. That is, for unseen critical fragments, only the simplest tokenization by random selection is taken. Fortunately, we have seen on the PH corpus that, on average, each non-dictionary-entry critical fragment has just two (100,398 over 49,308, or 2.04 to be exact) critical tokenizations to choose from. Hence, a tokenization accuracy of about 50% can be expected for unknown non-dictionary-entry critical fragments.</Paragraph> <Paragraph position="6"> The question then becomes: what is the chance of encountering a non-dictionary-entry critical fragment that has not been seen before in the PH corpus and thus has no known correct tokenization? A satisfactory answer can be readily derived from the Good-Turing Theorem. 4 Among the different non-dictionary-entry critical fragments and their 49,308 occurrences in the PH corpus, 9,587 different fragments each occur exactly once. By the Good-Turing Theorem, the chance of encountering an arbitrary non-dictionary-entry critical fragment that is not in the PH corpus is about 9,587 over 49,308, or slightly less than 20%.</Paragraph>
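<Paragraph> For concreteness, the strategy can be sketched in a few lines of Python. This is an illustration only, not the implementation used in this study: the helper segmentations enumerates every dictionary segmentation of a fragment, whereas the strategy proper works with critical fragments and their critical tokenizations (Guo 1997); the names memory and dictionary are likewise illustrative.</Paragraph> <Paragraph>
import random

def segmentations(fragment, dictionary):
    """All ways to split a fragment into dictionary tokens (a simplification;
    the paper restricts candidates to critical tokenizations)."""
    if fragment == "":
        return [[]]
    results = []
    for i in range(1, len(fragment) + 1):
        head = fragment[:i]
        if head in dictionary:
            for rest in segmentations(fragment[i:], dictionary):
                results.append([head] + rest)
    return results

def tokenize_by_memorization(fragment, memory, dictionary):
    """Reuse the stored tokenization of a seen critical fragment;
    otherwise pick one candidate at random and remember it."""
    if fragment not in memory:
        memory[fragment] = random.choice(segmentations(fragment, dictionary))
    return memory[fragment]
</Paragraph> <Paragraph> Once memory records the tokenization observed at a fragment's first occurrence, every later occurrence from the same source receives exactly the same tokenization, which is what the hypothesis licenses; only fragments never seen before fall back to the random choice discussed above.</Paragraph>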
<Paragraph position="7"> In summary, if applied to non-dictionary-entry critical fragment tokenization, the simple strategy of tokenization by memorization delivers virtually 100% tokenization accuracy for slightly over 80% of the fragments and about 50% accuracy for the remaining 20%, and hence has an overall tokenization accuracy of better than 90% (= 80% x 100% + 20% x 50%).</Paragraph> <Paragraph position="8"> 4 The theorem states that, when two independent marginally binomial samples B_1 and B_2 are drawn, the expected frequency r* in sample B_2 of types occurring r times in B_1 is r* = (r+1) E(N_{r+1}) / E(N_r), where E(N_r) is the expectation of the number of types whose frequency in a sample is r. What we are looking for here is the quantity r* E(N_r) for r = 0, that is, E(N_1), which can be closely approximated by the number of non-dictionary-entry fragments that occurred exactly once in the PH corpus.</Paragraph> <Paragraph position="9"> This strategy rivals all proposals with directly comparable performance reports in the literature, including 5 the representative one by Sun and T'sou (1995), which has a tokenization accuracy of 85.9%. Notice that what Sun and T'sou proposed is not a trivial solution: they developed an advanced four-step decision procedure that combines mutual information and t-score indicators in a sophisticated way for sensible decision making.</Paragraph> <Paragraph position="10"> Since the memorization strategy complements most other existing tokenization strategies, certain types of hybrid solutions are viable. For instance, if the strategy of tokenization by memorization is applied to known critical fragments and the Sun and T'sou algorithm is applied to unknown critical fragments, the overall accuracy of critical ambiguity resolution can be better than 97% (= 80% + 20% x 85.9%).</Paragraph> <Paragraph position="11"> The above analyses, together with some other more or less comparable results in the literature, are summarized in Table 6 below. It is interesting to note that the best accuracy registered in China's national 863-Project evaluation in 1995 was only 78%. In conclusion, the hypothesis of one tokenization per source is unquestionably helpful in sentence tokenization.</Paragraph> <Paragraph position="12"> 5 The task there is the resolution of overlapping ambiguities, which, while not exactly the same, is comparable with the resolution of critical ambiguities. The tokenization dictionary they used has about 50,000 entries, comparable to the Beihang dictionary used in this study. The corpus they used has about 20 million words, larger than the PH corpus. More importantly, in terms of content, both the dictionary and the corpus are believed to be comparable to what we used in this study. Therefore, the two results should be more or less comparable.</Paragraph>
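<Paragraph> The accuracy figures in this section follow from two quantities reported above: the Good-Turing estimate of the chance of an unseen critical fragment (9,587 singletons over 49,308 occurrences) and the per-case accuracies of the memorization strategy and of Sun and T'sou (1995). The short Python sketch below merely reproduces this arithmetic; the function name is illustrative.</Paragraph> <Paragraph>
def unseen_rate(singletons, occurrences):
    """Good-Turing estimate of the chance that a critical fragment
    was not seen in the training corpus (E(N_1) / N, footnote 4)."""
    return singletons / occurrences

p_unseen = unseen_rate(9587, 49308)                  # about 0.194
p_seen = 1.0 - p_unseen                              # about 0.806

# Memorization alone: perfect on seen fragments, a near coin flip on
# unseen ones (about two critical tokenizations on average).
acc_memorization = p_seen * 1.00 + p_unseen * 0.50   # about 0.90

# Hybrid: memorization on seen fragments, Sun and T'sou (85.9%) on the rest.
acc_hybrid = p_seen * 1.00 + p_unseen * 0.859        # about 0.97

print(round(p_unseen, 3), round(acc_memorization, 3), round(acc_hybrid, 3))
</Paragraph>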
</Section> <Section position="7" start_page="461" end_page="461" type="metho"> <SectionTitle> 5 The Notion of Tokens </SectionTitle> <Paragraph position="0"> Upon accepting the validity of the hypothesis of one tokenization per source, and after experiencing its striking utility in sentence tokenization, we find a new paradigm compelling. Parallel to what Dalton did to separate physical mixtures from chemical compounds (Kuhn 1970, pages 130-135), we now suggest regarding the hypothesis as a law of language and taking it as the proposition of what a word/token must be.</Paragraph> <Paragraph position="1"> The Notion of Tokens: A stretch of characters is a legitimate token to be put in the tokenization dictionary if and only if it does not violate the law of one tokenization per source. Opponents may reject this notion instantly because it apparently makes the law of one tokenization per source a tautology; this was once one of our own objections. We recommend that these readers reexamine some of Kuhn's (1970) arguments.</Paragraph> <Paragraph position="2"> Apparently, the issue at hand is not merely a matter of the definition of words/tokens. The merit of the notion, we believe, lies in its far-reaching implications for natural language processing in general and for sentence tokenization in particular.</Paragraph> <Paragraph position="3"> For instance, it makes the separation between words and non-words operational in Chinese, yet maintains the cohesiveness of words/tokens as a relatively independent layer of linguistic entities open to rigorous scrutiny. In contrast, the paradigm of "mutual affinity", represented by measurements such as mutual information and t-score, has repeatedly proved inadequate for the very large number of intermediate cases, while the paradigm of "linguistic words", represented by terms like syntactic-words, phonological-words and semantic-words, in essence rejects the notion of Chinese words/tokens altogether, since compounding, phrase forming and even sentence formation in Chinese are governed by more or less the same set of regularities, and since the whole is always larger than the simple sum of its parts. We shall leave further discussion to another place.</Paragraph> </Section> <Section position="8" start_page="461" end_page="462" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> Like most discoveries in the literature, when we first captured the regularity several years ago, we simply could not believe it. Then, after careful experimental validation on large representative corpora, we accepted it but still could not imagine any use for it. Finally, after working out ways that unquestionably demonstrated its usefulness, we realized that much supporting evidence had already been presented in the literature. Further, while never consciously stated in an explicit form, the hypothesis has actually already been widely employed.</Paragraph> <Paragraph position="1"> For example, Zheng and Liu (1997) recently studied a newswire corpus of about 1.8 million Chinese characters and reported that, among all the 4,646 different chain-length-1 two-character-overlapping-type 6 ambiguous fragments, which cumulatively occur 14,581 times in the corpus, only 8 fragments each take different tokenizations in different contexts, and there is no such fragment among all the 3,409 different chain-length-2 two-character-overlapping-type 7 ambiguous fragments.</Paragraph> <Paragraph position="2"> Unfortunately, due to the lack of a proper representation framework comparable to the critical tokenization theory employed here, their observation is neither complete nor explanatory. It is not complete, since the two ambiguous types apparently do not cover all possible ambiguities.
It is not explanatory, since neither type of ambiguous fragment is guaranteed to be a critical fragment, and thus each may involve other types of ambiguities.</Paragraph> <Paragraph position="3"> Consequently, Zheng and Liu (1997) themselves merely took the apparent regularity as a special case and focused on the development of local-context-oriented disambiguation rules. Moreover, while they constructed, for tokenization disambiguation, an annotated "phrase base" of all the ambiguous fragments in the large corpus, they still concluded that good results cannot come solely from the corpus but have to rely on syntactic, semantic, pragmatic and other information.</Paragraph> <Paragraph position="4"> 6 Roughly, a three-character fragment abc where a, b, c, ab, and bc are all tokens in the tokenization dictionary.</Paragraph> <Paragraph position="5"> 7 Roughly, a four-character fragment abcd where a, b, c, d, ab, bc, and cd are all tokens in the tokenization dictionary.</Paragraph> <Paragraph position="6"> The actual implementation of the weighted finite-state transducer by Sproat et al. (1996) can be taken as evidence that the hypothesis of one tokenization per source is already in practical use. While the primary strength of such a transducer is its effectiveness in representing and utilizing local and sentential constraints, what Sproat et al. (1996) implemented was simply a token unigram scoring function. Under this setting, no critical fragment can realize different tokenizations in different local sentential contexts, since no local constraints other than the identity of a token, together with its associated token score, can be utilized. That is, the requirement of one tokenization per source has actually been obeyed implicitly.</Paragraph> <Paragraph position="7"> We admit that, while we had been aware of the fact for a long time, only after the dissemination of the closely related hypotheses of one sense per discourse (Gale, Church and Yarowsky 1992) and one sense per collocation (Yarowsky 1993) were we able to articulate the hypothesis of one tokenization per source.</Paragraph> <Paragraph position="8"> The point here is that one tokenization per source is unlikely to be an isolated phenomenon. Rather, there must exist a general law that covers all the related linguistic phenomena. Let us speculate that, for a proper linguistic expression in a proper scope, there always exists the regularity of one realization per expression. That is, only one of the multiple values on one aspect of a linguistic expression can be realized in the specified scope. In this way, one tokenization per source becomes a particular articulation of one realization per expression.</Paragraph> <Paragraph position="9"> The two essential terms here are the proper linguistic expression and the proper scope of the claim. A quick example is helpful: part-of-speech tagging for the English sentence "Can you can the can?" If the linguistic expressions are taken to be ordinary English words, they are highly ambiguous; e.g., the English word can realizes three different parts of speech in the sentence. However, if "the can", "can the" and the like are taken as the underlying linguistic expressions, they are apparently unambiguous: "the can/NN", "can/VB the" and, for the rest, "can/MD".</Paragraph>
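<Paragraph> A toy sketch makes this concrete. If two-word collocations rather than single words are treated as the expressions, each expression admits exactly one realization of the tag of can; the small collocation table below is illustrative only and is not drawn from the cited work.</Paragraph> <Paragraph>
COLLOCATION_TAGS = {
    ("the", "can"): "NN",   # "the can": can is a noun
    ("can", "the"): "VB",   # "can the": can is a verb
}
DEFAULT_TAG = {"can": "MD"}  # elsewhere, can is a modal

def tag_can(words):
    """Tag every occurrence of 'can' in a word list using collocations."""
    tags = []
    for i, w in enumerate(words):
        if w != "can":
            tags.append(None)
            continue
        prev_pair = (words[i - 1], w) if i > 0 else None
        next_pair = (w, words[i + 1]) if i + 1 != len(words) else None
        tag = (COLLOCATION_TAGS.get(prev_pair)
               or COLLOCATION_TAGS.get(next_pair)
               or DEFAULT_TAG[w])
        tags.append(tag)
    return tags

print(tag_can(["can", "you", "can", "the", "can"]))  # ['MD', None, 'VB', None, 'NN']
</Paragraph>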
<Paragraph position="10"> This fact can largely be predicted by the hypothesis of one sense per collocation, and it partially explains the great success of Brill's transformation-based part-of-speech tagging (Brill 1993).</Paragraph> <Paragraph position="11"> As to the hypothesis of one tokenization per source, it is now clear that the theory of critical tokenization provides a suitable means for capturing the proper linguistic expression.</Paragraph> </Section> </Paper>