<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2110">
  <Title>Identifying the Coding System and Language of On-line Documents on the Internet</Title>
  <Section position="3" start_page="0" end_page="653" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recent advances in information infrastructure have made an enormous number of on-line documents accessible. A notable example is the explosive growth of the World-Wide Web (WWW), which involves more than 10 million documents.</Paragraph>
    <Paragraph position="1"> Accessing and using such a huge number of on-line documents requires intelligent document processing such as content-based search, categorization, information extraction, and machine translation. Most of these processes assume that the documents are correctly decoded and that the language is known. For documents on the WWW, however, these assumptions do not hold. This is because each language community uses its own coding system, which is optimal for internal communication but is not appropriate for exchanging information with those outside that community.</Paragraph>
    <Paragraph position="2"> A fundamental solution is to develop international standards for an internationalized coding system and language representation. In fact, there is active discussion on international coding standards (Yergeau et al., 1995) (Nicol, 1995) (Unicode, 1994) and on language representation on the WWW (Unicode, 1994). However, it will take several years before most documents are encoded in a unique, well-defined coding system. A realistic approach, which complements the above 'fundamental' approach, is to develop a more intelligent module that can estimate which coding system and language are used for each on-line document on the current WWW.</Paragraph>
    <Paragraph position="3"> Automatic identification of the coding system has been achieved in communities where a limited number of coding systems are used. For example, (Lunde, 1993) presents an algorithm for selecting one of three coding systems for Japanese texts (UJIS, SJIS, and JIS). The algorithm, however, is not applicable to the international domain, where many other coding systems are potential candidates.</Paragraph>
    <Paragraph position="4"> Automatic language identification has been discussed in the field of document processing. Several statistical models have been tried, including the n-gram of characters (Cavnar, 1994), diacritics and special characters (Beesley, 1988), and the word unigram with heuristics (Henrich, 1989). Among these methods, the result by (Cavnar, 1994) shows the best accuracy, over 95%.</Paragraph>
    <Paragraph position="5"> (Giguet, 1995) achieved over 99% accuracy by using a rule-based (i.e., non-statistical) method.</Paragraph>
    <Paragraph position="6"> These methods, however, cannot handle East-Asian languages, because they presuppose that input texts are easily segmented into words, which does not hold true in these languages. Another problem is that they presuppose that the input document is correctly decoded.</Paragraph>
    <Paragraph position="7"> Sibun and Penelope (Penelope and Sibun, 1994) proposed a method of determining the language of a text image. The problem they tackled is similar to ours, in the sense that the input is not a unique character string but a string that potentially corresponds to several different character strings. Their method, however, cannot be directly applied to our problem.</Paragraph>
    <Paragraph position="8"> This paper presents an algorithm that identifies the coding system and the language of a given text. The algorithm is an application of automatic language identification using statistical language models. It covers 11 coding systems and 13 languages used in East-Asian countries as well as Western-European countries.</Paragraph>
    <Paragraph position="9"> The next section describes the problem. Section 3 introduces our algorithm. Section 4 and Section 5 describe an example and the experimental results, respectively.</Paragraph>
    <Section position="1" start_page="652" end_page="652" type="sub_section">
      <SectionTitle>
2 Problems in Decoding and
Identifying Languages on the
WWW
2.1 Brief Explanation of Character
Coding Systems
</SectionTitle>
      <Paragraph position="0"> In a communications network, characters are represented as numbers, or a sequence of numbers.</Paragraph>
      <Paragraph position="1"> A character coding system specifies the mapping between characters and numbers. A coding system consists of a character set and an encoding scheme.</Paragraph>
      <Paragraph position="2"> A character set is a set of characters collected to represent texts in a certain language community. For example, JIS-X0208-1983 (referred to as &quot;JIS character set&quot; in this paper) is a character set for encoding texts (mainly) in Japanese. Each character in a character set has a unique identification number. It should be noted that the same number may appear in different character sets.</Paragraph>
      <Paragraph position="3"> An encoding scheme maps each element of a character set into a (sequence of) number(s) that is used in communication networks. The simplest encoding scheme directly uses the identification number of a character set for communication.</Paragraph>
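      <Paragraph> As a minimal illustration of the difference between a character set and an encoding scheme, the following Python sketch takes one identification number and encodes it with a scheme that sets the high bit of each byte, which is how Japanese EUC represents characters of the JIS character set. The example character (U+65E5), the function name to_euc, and the use of Python's built-in euc_jp codec are assumptions made for illustration and are not part of the method presented in this paper.

```python
# Identification number of the character U+65E5 in the JIS character set,
# written as two 7-bit values (chosen here only as an example).
jis_number = bytes([0x46, 0x7C])


def to_euc(number):
    """Encode a JIS identification number in Japanese EUC by setting the high bit of each byte."""
    return bytes(b | 0x80 for b in number)


euc_bytes = to_euc(jis_number)

# The simplest scheme would transmit jis_number as it is; EUC transmits euc_bytes.
print(jis_number.hex(), euc_bytes.hex())

# Cross-check with Python's built-in codec (an assumption, not from the paper).
print(euc_bytes.decode("euc_jp") == "\u65e5")
```

The same identification number therefore corresponds to different byte values depending on the encoding scheme in use.</Paragraph>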
      <Paragraph position="4"> Some encoding schemes are designed to encode texts that contain characters from two or more character sets. For example, the encoding scheme for the JIS coding system (ISO-2022-JP) uses escape sequences to indicate changes of character set in the code string in the following way:  1. &quot;ESC $ B&quot; shows the beginning of the JIS character set.</Paragraph>
      <Paragraph position="5"> 2. &quot;ESC ( B&quot; shows the beginning of the ASCII character set.</Paragraph>
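      <Paragraph> The two escape sequences listed above can be sketched in Python as follows. This is only an illustrative sketch: the function name split_by_charset, the sample string, and the use of Python's built-in iso2022_jp codec are assumptions introduced here, and only the two escape sequences named above are handled.

```python
import re

ESC_JIS = b"\x1b$B"    # ESC $ B: beginning of the JIS character set
ESC_ASCII = b"\x1b(B"  # ESC ( B: beginning of the ASCII character set


def split_by_charset(data):
    """Split a byte string into (character set, bytes) runs at the two escape sequences."""
    runs = []
    charset = "ASCII"  # assumption: the text is ASCII until an escape sequence is seen
    # The capturing group keeps the escape sequences in the split result.
    for part in re.split(rb"(\x1b\$B|\x1b\(B)", data):
        if part == ESC_JIS:
            charset = "JIS"
        elif part == ESC_ASCII:
            charset = "ASCII"
        elif part:
            runs.append((charset, part))
    return runs


# "abc" followed by two Japanese characters (U+65E5, U+672C), encoded with
# Python's built-in iso2022_jp codec.
encoded = "abc\u65e5\u672c".encode("iso2022_jp")
for charset, chunk in split_by_charset(encoded):
    print(charset, chunk)
```

Because the escape sequences are explicit in the code string, the character set of every run can be recovered without any knowledge of the text itself.</Paragraph>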
    </Section>
    <Section position="2" start_page="652" end_page="652" type="sub_section">
      <SectionTitle>
2.2 Ambiguity in Determining Character
Coding System
</SectionTitle>
      <Paragraph position="0"> For historical reasons, documents on the current WWW are encoded in various coding systems. For example, servers in Western Europe normally use ISO-8859-1 (ISO-LATIN-1), whereas most UNIX servers in Japan encode text using Japanese EUC (Extended Unix Code). The problem is that different coding systems are applicable to the same code string.</Paragraph>
      <Paragraph position="1"> The fundamental solution is to have everyone use a unique coding system that can handle all the characters in the world. ISO-2022 is one such coding system. This system assigns a unique identifier to every registered character set in the world and specifies escape sequences for switching from one character set to another. Although most of the local coding systems in the world are 'compatible' with ISO-2022, many of them lack escape sequences, which are not necessary for choosing the correct character set in the local domain but are necessary in the international domain.</Paragraph>
      <Paragraph position="2"> Therefore, either the sender should state, or the receiver should infer, the coding system with which the received code sequence is encoded.</Paragraph>
      <Paragraph position="3"> One approach is to transfer the name of the coding system with the upper level protocols. For example, the Internet mail protocol can transfer the coding system with which the message is encoded. However, active discussion still continues as to how to deliver the coding system on the WWW.</Paragraph>
      <Paragraph position="4"> Another approach is to uncover the coding system from the received byte code string. If the potential candidates for the coding system are limited, the correct coding system may be inferred by using simple pattern matching. For example, if the byte code contains the pattern &quot;ESC $ B&quot;, then it must be encoded with ISO-2022 and include Japanese characters. However, in the international domain, it is difficult or impossible to specify the coding system with simple pattern matching. For example, Japanese EUC and Korean EUC cannot be discriminated by this simple method, because most of their code values overlap¹.</Paragraph>
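      <Paragraph> The limits of simple pattern matching can be illustrated with a short Python sketch. It is not the identification algorithm of this paper; the helper names has_jis_escape and decodable_as, the sample string, and the use of Python's built-in iso2022_jp, euc_jp, and euc_kr codecs are assumptions made for illustration.

```python
ESC_JIS = b"\x1b$B"  # the pattern "ESC $ B"


def has_jis_escape(data):
    """Simple pattern matching: report whether ESC $ B occurs in the byte string."""
    return ESC_JIS in data


def decodable_as(data, codec_names):
    """Return the names of the codecs under which the byte string is a valid sequence."""
    valid = []
    for name in codec_names:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        valid.append(name)
    return valid


text = "\u65e5\u672c\u8a9e"  # a short Japanese string, used only as sample input
iso_bytes = text.encode("iso2022_jp")
euc_bytes = text.encode("euc_jp")

print(has_jis_escape(iso_bytes))  # the escape sequence identifies the coding system
print(has_jis_escape(euc_bytes))  # Japanese EUC carries no such marker
print(decodable_as(euc_bytes, ["euc_jp", "euc_kr"]))
```

With this sample, the EUC byte string is a valid sequence under both the Japanese and the Korean codec, although it decodes to different characters in each. Checking validity alone therefore cannot decide between the two candidates.</Paragraph>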
      <Paragraph position="5"> In summary, a more sophisticated method is required to identify the coding system from the content of the code string.</Paragraph>
    </Section>
    <Section position="3" start_page="652" end_page="653" type="sub_section">
      <SectionTitle>
2.3 Ambiguity in Determining Language
</SectionTitle>
      <Paragraph position="0"> Most text processing systems have language-dependent components such as rules or dictionaries. Thus, it is crucial to know in what language the target document is written in order to choose the appropriate system or language-specific rules and dictionaries.</Paragraph>
      <Paragraph position="1"> If we restrict ourselves to HTML documents, then explicit language tagging, which represents the language of the text body, will be introduced in a future version of the HTML specifications. This, however, does not handle multiple languages in one document. Moreover, there are many non-HTML documents on the WWW.</Paragraph>
      <Paragraph position="2"> If the character set used in the text is known, it might be a good clue for identifying the language, because some character sets strongly suggest which language(s) were used. For example, if a document consists of characters in the JIS character set, the document must be written in Japanese. However, this is not always the case.</Paragraph>
      <Paragraph position="3"> One reason for this is that the character set of a text is sometimes ambiguous due to the decoding problem described above. Another reason is that some character sets are designed to cover multiple languages (e.g., ISO-8859-1 for several Western-European languages). Sometimes a character set is used in a language that is not the primary candidate suggested by the character set.</Paragraph>
      <Paragraph position="4"> For example, a document containing only US-ASCII characters, which suggests the document is in English, may be a German document in ASCII format (e.g., ö is written as oe). ¹ For details, see (Lunde, 1993).</Paragraph>
      <Paragraph position="5"> For this reason, it is necessary to identify the language from the content of the given character string.</Paragraph>
    </Section>
  </Section>
</Paper>