<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2110"> <Title>Identifying the Coding System and Language of On-line Documents on the Internet</Title> <Section position="4" start_page="653" end_page="654" type="metho"> <SectionTitle> 3 Our Algorithm </SectionTitle> <Paragraph position="0"> Our basic idea is to use statistical language models to select the correctly decoded string as well as to determine the language. The idea comes from the observation that a human can tell whether or not a text is written in a language s/he can read. If the text is judged not to be written in a language the person can read, then it is either written in another language or decoded with an incorrect coding system.</Paragraph> <Paragraph position="1"> In our algorithm, this human intuition about a familiar language is realized by a statistics-based module that calculates how likely a text is to be from a specific language.</Paragraph> <Section position="1" start_page="653" end_page="653" type="sub_section"> <SectionTitle> 3.1 The Scope of our Algorithm </SectionTitle> <Paragraph position="0"> The current version of our algorithm can handle the following 11 coding systems and 9 languages.</Paragraph> </Section> <Section position="2" start_page="653" end_page="653" type="sub_section"> <SectionTitle> 3.2 Outline of the Algorithm </SectionTitle> <Paragraph position="0"> Our algorithm consists of the following two major steps.</Paragraph> <Paragraph position="1"> Step 1 This step divides the given code string into an East-Asian part (i.e., sub-strings consisting of East-Asian characters) and the remaining (i.e., European) part.</Paragraph> <Paragraph position="2"> Step 2 This step decodes each part and identifies its language(s).</Paragraph> <Paragraph position="3"> The following two subsections describe these two steps in detail.</Paragraph> </Section> <Section position="3" start_page="653" end_page="654" type="sub_section"> <SectionTitle> 3.3 Dividing Code-String </SectionTitle> 
<Paragraph position="0"> If the given code string contains escape code sequences defined in ISO-2022 or its variants, East-Asian character strings are easily extracted because East-Asian characters are explicitly marked by escape sequences in the string.</Paragraph> <Paragraph position="1"> If the given code string does not contain such escape sequences, the East-Asian part is identified by the procedure shown in Figure 1. The procedure consists of the following two loops.</Paragraph> <Paragraph position="2"> * Loop 1 Once a coding system is presupposed, it is easy to extract East-Asian characters. For every coding system that can handle East-Asian characters, the first loop tries to extract East-Asian characters by using that coding system.</Paragraph> <Paragraph position="3"> The function ea_string takes a code string and (the name of) a coding system. It extracts East-Asian character strings, presupposing that the given code string has been encoded with the given coding system (csys).</Paragraph> <Paragraph position="4"> This function can be realized by simple pattern matching. 
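A minimal Python sketch of such pattern matching is given here. It is an illustration under stated assumptions, not the system's actual implementation: the EA_PATTERNS table is hypothetical and holds a single entry, which uses the two-byte range A1 to FE that this section gives for EUC-JIS, and the function returns a list of byte runs rather than a single string.

```python
import re

# Hypothetical pattern table (an assumption for illustration).
# The single entry treats every pair of bytes in the range A1..FE
# as one East-Asian character under EUC-JIS.
EA_PATTERNS = {
    "EUC-JIS": re.compile(rb"(?:[\xA1-\xFE]{2})+"),
}


def ea_string(code_string: bytes, csys: str) -> list:
    """Return the byte runs that look East-Asian if csys is presupposed."""
    # The group is non-capturing, so findall returns the whole matches.
    return EA_PATTERNS[csys].findall(code_string)


if __name__ == "__main__":
    # Illustrative input: four high bytes followed by ASCII text.
    sample = b"\xb8\xc0\xb8\xec Identifying the Language"
    print(ea_string(sample, "EUC-JIS"))
```

An empty list plays the role of the empty string here: nothing would be registered for that coding system.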
For example, if we presuppose that the given code string is encoded with EUC-JIS, then any two adjacent bytes that match [A1H-FEH]{2}, a two-byte sequence whose values range from A1 to FE (in hexadecimal representation), correspond to a Japanese (or JIS) character.</Paragraph> <Paragraph position="5"> Table 1 shows examples of regular expression patterns in our system.</Paragraph> <Paragraph position="7"> If a non-empty string is returned by ea_string, it is decoded with the presupposed coding system and registered in ea_string_list.</Paragraph> </Section> </Section> <Section position="5" start_page="654" end_page="654" type="metho"> <SectionTitle> * Loop 2 </SectionTitle> <Paragraph position="0"> The second loop tries to identify the language of each East-Asian character string in ea_string_list.</Paragraph> <Paragraph position="1"> Each East-Asian string is passed to the language identification routine in descending order of length. The language identification routine, described in Section 3.4, takes a character string and returns the most likely language and its likelihood score.</Paragraph> <Paragraph position="2"> If the score is larger than a predetermined threshold, the loop terminates and returns the language and the score.</Paragraph> <Paragraph position="3"> If no East-Asian string has a score that exceeds the threshold, the loop returns nil, which indicates that no East-Asian characters are involved in the code string.</Paragraph> <Paragraph position="4"> After the East-Asian part is identified, the remainder is classified as the European part.</Paragraph> <Section position="1" start_page="654" end_page="654" type="sub_section"> <SectionTitle> 3.4 Identifying the Language </SectionTitle> <Paragraph position="0"> The language of a text is identified by the following three steps.</Paragraph> <Paragraph position="1"> 1. 
Selecting possible languages for the given coding system. The coding system (or the character set(s)) of a text is loosely related to the language of the text. For example, a document encoded with US-ASCII is not written in Korean. We made heuristic rules to map a coding system to possible languages.</Paragraph> <Paragraph position="2"> 2. Calculating the 'likelihood' of the decoded string for each language.</Paragraph> <Paragraph position="3"> For each language, this step calculates how likely the decoded string is to be from that language by comparing the string with the statistical model of the language.</Paragraph> <Paragraph position="4"> 3. Selecting the language with the highest likelihood. This step compares the likelihood scores and returns the language with the highest score.</Paragraph> <Paragraph position="5"> The second step is the most important. Our system uses a unigram model for both Western-European languages and East-Asian languages, but the models for the two groups use different unigram units.</Paragraph> <Paragraph position="6"> Western-European languages. In order to distinguish Western-European languages, we applied a method proposed by Cavnar (Cavnar, 1994). We assign a class name to each word. The class name of a word longer than n characters is the concatenation of &quot;X-&quot; and the last n characters of the word. If the word is not longer than n characters, the class name is the word itself. 
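The class-name rule can be sketched in Python as follows. Only the rule itself comes from the text; the function name class_name, the default value of n, and the sample words are assumptions chosen for illustration.

```python
def class_name(word: str, n: int = 4) -> str:
    """Map a word to its class name.

    A word longer than n characters becomes "X-" plus its last n
    characters; any other word is its own class name.
    """
    if len(word) > n:
        return "X-" + word[-n:]
    return word


if __name__ == "__main__":
    # Illustrative words (hypothetical, not from the paper's corpora).
    for w in ("language", "of"):
        print(w, class_name(w))
```

Grouping long words by their endings keeps the number of classes small while retaining suffixes, which tend to differ between languages.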
For example, if n = 4, then the class names of &quot;beautiful&quot; and &quot;the&quot; are &quot;X-iful&quot; and &quot;the&quot;, respectively.</Paragraph> <Paragraph position="7"> Let TEXT be the set of words in a text. The likelihood of TEXT with regard to language $l$ is then given as follows: $P(\mathit{TEXT}, l) = \prod_{w \in \mathit{TEXT}} P(C_w, l)$</Paragraph> <Paragraph position="9"> where $P(C_w, l)$ is the unigram probability of $C_w$ in language $l$, and $C_w$ is the class name of the word $w$.</Paragraph> <Paragraph position="10"> $P(C_w, l)$ is estimated from text corpora in language $l$.</Paragraph> </Section> <Section position="2" start_page="654" end_page="654" type="sub_section"> <SectionTitle> 3.4.2 Likelihood Score for East-Asian languages </SectionTitle> <Paragraph position="0"> Compared with Western-European languages, East-Asian languages have the following properties: 1. A large number of characters. East-Asian languages use over 3,000 ideographic or combined characters. A character is normally encoded with two (or more) bytes.</Paragraph> <Paragraph position="1"> 2. No explicit word boundaries. In East-Asian languages, there are no explicit word delimiters (corresponding to spaces in English) in a sentence, so we cannot use word-based language models.</Paragraph> <Paragraph position="2"> For East-Asian languages, we use a character unigram, instead of a word unigram, to model a language. Formally, $P(\mathit{TEXT}, l) = \prod_{\mathit{char} \in \mathit{TEXT}} P(\mathit{char}, l)$ where $P(\mathit{char}, l)$ is the unigram probability of $\mathit{char}$ in language $l$.</Paragraph> </Section> </Section> <Section position="6" start_page="654" end_page="655" type="metho"> <SectionTitle> 4 Example </SectionTitle> <Paragraph position="0"> Suppose the following code sequence is given to the algorithm.</Paragraph> <Paragraph position="1"> The string is first divided into Asian and European parts. 
Since the string contains no escape sequence (which would begin with &quot;1b&quot;), the procedure in Section</Paragraph> <Section position="1" start_page="655" end_page="655" type="sub_section"> <SectionTitle> 3.3 is applied. </SectionTitle> <Paragraph position="0"> The division procedure first tries to extract East-Asian characters from the given string.</Paragraph> <Paragraph position="1"> The first 14 bytes, from b8 to a1, match every pattern in Table 1. This means that all four coding systems are potential candidates and that they extract the same Asian character part.</Paragraph> <Paragraph position="2"> This part is decoded with each coding system.</Paragraph> <Paragraph position="3"> The resulting strings are shown in Figure 2.</Paragraph> <Paragraph position="4"> Next, the statistics-based language identification is applied to each decoded string. Table 2 shows the score (=probability) of each character with regard to the language that produced the highest likelihood (i.e., average score). For example, the second column shows the score of each character with regard to Chinese (zho) when the input code string is decoded with EUC-GB.</Paragraph> <Paragraph position="5"> This implies that Chinese (zho) is the most likely language if we presuppose that the original string is encoded with EUC-GB.</Paragraph> <Paragraph position="6"> zho=Chinese, kor=Korean, jpn=Japanese. The bottom row shows that the highest average score is obtained when the input is decoded with EUC-JIS and the language is Japanese (jpn). Since this score exceeds the threshold (-10), the East-Asian part is confirmed to be a Japanese string encoded with EUC-JIS.</Paragraph> <Paragraph position="7"> The remaining part is decoded into &quot;Identifying the Language&quot; and sent to the European language identifier. 
Table 3 gives the scores of tokens with regard to three languages.</Paragraph> <Paragraph position="8"> eng=English, deu=German, ita=Italian. In this table, English (eng) is the most plausible language for the European part, with a sufficient score. The final result is easily obtained by combining the results of the Asian and the European parts.</Paragraph> </Section> </Section> </Paper>