<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-4004">
  <Title>Critical Tokenization and its Properties</Title>
  <Section position="8" start_page="592" end_page="593" type="concl">
    <SectionTitle>
8. Summary
</SectionTitle>
    <Paragraph position="0"> The objective in this paper has been to lay down a mathematical foundation for sentence tokenization. As the basis of the overall mathematical model, we have introduced both sentence generation and sentence tokenization operations. What is unique here is our attempt to model sentence tokenization as the inverse problem of sentence generation.</Paragraph>
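The generation/tokenization duality described above can be sketched in a few lines of Python. This is not code from the paper; the dictionary and sentence are invented for illustration, with generation taken to be plain concatenation of dictionary words and tokenization as its inverse image:

```python
def generate(tokens):
    """Sentence generation: concatenate a sequence of dictionary words."""
    return "".join(tokens)

def tokenize(sentence, dictionary):
    """Sentence tokenization as the inverse problem: all token
    sequences whose generation equals the observed sentence."""
    if not sentence:
        return [[]]
    return [[sentence[:i]] + rest
            for i in range(1, len(sentence) + 1)
            if sentence[:i] in dictionary
            for rest in tokenize(sentence[i:], dictionary)]

# every solution of the inverse problem generates the sentence back
dictionary = {"a", "b", "ab"}
for tokens in tokenize("ab", dictionary):
    assert generate(tokens) == "ab"
```

Because generation is many-to-one, the inverse is set-valued: here "ab" has two preimages, which is exactly the ambiguity the paper's machinery goes on to characterize.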
    <Paragraph position="1"> On that basis, the notions of critical point and critical fragment constitute our first group of findings. We have proven that, under a complete-dictionary assumption, the critical points of a sentence are all and only its unambiguous token boundaries.</Paragraph>
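A brute-force reading of this result can be checked computationally: a critical point is a position that every tokenization marks as a token boundary. The following sketch (not from the paper; the toy dictionary is invented) enumerates all tokenizations and intersects their boundary sets:

```python
def tokenizations(sentence, dictionary):
    """Enumerate every way to split `sentence` into dictionary words."""
    if not sentence:
        return [[]]
    results = []
    for i in range(1, len(sentence) + 1):
        if sentence[:i] in dictionary:
            for rest in tokenizations(sentence[i:], dictionary):
                results.append([sentence[:i]] + rest)
    return results

def critical_points(sentence, dictionary):
    """Positions that are token boundaries in *every* tokenization."""
    common = None
    for tokens in tokenizations(sentence, dictionary):
        bounds, pos = {0}, 0
        for word in tokens:
            pos += len(word)
            bounds.add(pos)
        common = bounds if common is None else common & bounds
    return sorted(common or [])

# "abc" splits as a|b|c, a|bc, or ab|c: only positions 0 and 3
# separate tokens in every split, so the interior positions are ambiguous
dictionary = {"a", "b", "c", "ab", "bc"}
assert critical_points("abc", dictionary) == [0, 3]
```

The surviving positions are exactly the unambiguous boundaries, matching the characterization proven in the paper for the complete-dictionary case.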
    <Paragraph position="2"> Critical tokenization is the most important concept among the second group of findings. We have proven that every tokenization has a critical tokenization as its supertokenization. That is, any tokenization can be reproduced from a critical tokenization. Critical ambiguity and hidden ambiguity in tokenization constitute our third group of findings. We have proven that tokenization ambiguity can be categorized as either critical type or hidden type. Moreover, it has been shown that critical tokenization provides a sound basis for precisely describing various types of tokenization ambiguities. In short, we have presented a complete and precise understanding of ambiguity in sentence tokenization. While the existence of tokenization ambiguities is jointly described by critical points and critical fragments, the characteristics of tokenization ambiguities are jointly specified by critical ambiguities and hidden ambiguities. Moreover, we have proven that the three widely employed tokenization algorithms, namely forward maximum matching, backward maximum matching, and shortest length matching, are all subclasses of critical tokenization, and that critical tokenization is the precise mathematical description of the principle of maximum tokenization. In this paper, we have also discussed some important implications of the notion of critical tokenization for character string tokenization research and development. Our primary claim here is that critical tokenization is an excellent intermediate representation that offers much assistance both in the development of effective tokenization knowledge and heuristics and in the improvement and implementation of efficient tokenization algorithms.</Paragraph>
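The subclass relation between maximum matching and critical tokenization can likewise be illustrated concretely. In the sketch below (not from the paper; the dictionary is invented), a critical tokenization is computed as one whose boundary set contains no other tokenization's boundary set as a proper subset, i.e. one that cannot be further merged, and forward maximum matching is checked to land inside that class:

```python
def all_tokenizations(sentence, dictionary):
    """Enumerate every dictionary tokenization of `sentence`."""
    if not sentence:
        return [[]]
    return [[sentence[:i]] + rest
            for i in range(1, len(sentence) + 1)
            if sentence[:i] in dictionary
            for rest in all_tokenizations(sentence[i:], dictionary)]

def boundary_set(tokens):
    """Character positions at which `tokens` places a boundary."""
    bounds, pos = {0}, 0
    for word in tokens:
        pos += len(word)
        bounds.add(pos)
    return frozenset(bounds)

def critical_tokenizations(sentence, dictionary):
    """Tokenizations with no proper supertokenization: no other
    tokenization's boundary set is a proper subset of theirs."""
    toks = all_tokenizations(sentence, dictionary)
    sets = [boundary_set(t) for t in toks]
    return [t for t, s in zip(toks, sets) if not any(o < s for o in sets)]

def fmm(sentence, dictionary):
    """Forward maximum matching: greedily take the longest word each step."""
    longest = max(map(len, dictionary))
    tokens, pos = [], 0
    while pos < len(sentence):
        for length in range(min(longest, len(sentence) - pos), 0, -1):
            if sentence[pos:pos + length] in dictionary:
                tokens.append(sentence[pos:pos + length])
                pos += length
                break
        else:
            raise ValueError("untokenizable at position %d" % pos)
    return tokens

dictionary = {"a", "b", "c", "ab", "bc"}
# FMM picks ab|c, which is critical; the refinable split a|b|c is not
assert fmm("abc", dictionary) in critical_tokenizations("abc", dictionary)
```

Backward maximum matching would pick a|bc here, the other critical tokenization of "abc", consistent with both greedy algorithms being subclasses of critical tokenization.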
    <Paragraph position="3"> Besides providing a framework to better understand previous work, as has been attempted here, a good formalization should also lead to new questions and insights. While some of the findings and observations achieved so far (Guo 1997) have been mentioned here, much more work remains to be done.</Paragraph>
  </Section>
</Paper>