<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1102"> <Title>Detecting Transliterated Orthographic Variants via Two Similarity Metrics</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Detecting method </SectionTitle> <Paragraph position="0"> We propose a detection method for katakana variants. The method consists of two components: string similarity and contextual similarity. The string similarity part measures similarity based on the edit distance between katakana words. More precisely, there are two string-similarity metrics: one measures the similarity between the romanized strings of two words, while the other measures the similarity between their raw strings. We cannot use the romanization system as a perfect substitute for katakana, because romanization causes side effects. For example, both Japanese words a36</Paragraph> <Paragraph position="2"> erated into panti by our romanization system.</Paragraph> <Paragraph position="3"> Thus, we use both string similarities.</Paragraph> <Paragraph position="4"> Contextual similarity is defined as the distance between context vectors. The context vector we employ is derived from the dependency structure of a sentence: it is constructed by gathering information surrounding the target katakana word, such as cooccurring nouns, the predicate expression depended upon by the katakana word, the particle marking the katakana word, and so on.</Paragraph> <Paragraph position="5"> The detection procedure is as follows: 1. Extract katakana words and context vectors from the dependency-analyzed result of the target corpus.</Paragraph> <Paragraph position="6"> 2. Choose a katakana word as the input word from the extracted katakana words.</Paragraph> <Paragraph position="7"> 3. Retrieve candidates of katakana variants from the extracted katakana words. 
Each candidate should share at least one character with the input word.</Paragraph> <Paragraph position="8"> 4. Calculate the similarity simed, which is based on the ordinary edit distance, between the input Str1 and each candidate Str2. The similarity simed is defined as follows:</Paragraph> <Paragraph position="10"> where ED(Str1, Str2) denotes the ordinary edit distance. If the input and a candidate word share suffix or prefix morphemes, the shared morphemes are excluded from the compared strings.</Paragraph> <Paragraph position="11"> 5. Calculate the string similarity sims between the input and each candidate. If the input and a candidate word share suffix or prefix morphemes, the shared morphemes are excluded in the same way as above.</Paragraph> <Paragraph position="12"> 6. Calculate the contextual similarity simc between the input and each candidate.</Paragraph> <Paragraph position="13"> 7. Decide whether the candidate is a variant by means of a deciding module. The deciding module uses the Daijiten as its dictionary; it has almost 8,000 katakana words, and we slightly modified it. We explain the details of the string similarity part and the contextual similarity part in the following subsections.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 String similarity for romanized words </SectionTitle> <Paragraph position="0"> There are recognizable patterns in Japanese transliterated orthographic variants; thus, several rule-based methods for detecting such variants have been developed. We use a kind of weighted edit distance to recognize transliterated orthographic variants. The weighting rules are very similar to the rules used in conventional rule-based methods. The ordinary edit distance between two strings is defined as the number of edit operations (insertion, deletion, and substitution, although substitution is sometimes not permitted) required to transform one string into the other. 
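For reference, the ordinary edit distance ED(Str1, Str2) used in step 4 can be computed with the standard dynamic program below. The simed formula itself is not reproduced in this copy, so the `sim_ed` function uses division by the longer string's length as an assumed, purely illustrative normalization; `edit_distance` is the textbook Levenshtein distance.

```python
def edit_distance(s1, s2):
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deleting i characters
    for j in range(n + 1):
        d[0][j] = j  # inserting j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def sim_ed(s1, s2):
    """Similarity from edit distance; this normalization is an assumption,
    not the paper's exact formula."""
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))
```

For the romanized pair ripooto / repooto, the ordinary distance is 1 (one vowel substitution), so the normalized similarity is high.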
Thus, the ordinary edit distance is a non-negative integer. We defined operations with small weights for specific situations in order to capture the recognizable variant patterns. Figure 2 shows an example of the difference between the ordinary edit distance and the weighted edit distance, using a rule that allows a vowel change after the same consonant ('r').</Paragraph> <Paragraph position="1"> (Figure 2: r i p o o t o vs. r e p o o t o) More precisely, the string similarity based on the weighted edit distance of romanizations is defined as follows:</Paragraph> <Paragraph position="3"> (2) where rom(x) denotes the romanized string of x, and EDk(x, y) denotes a weighted edit distance between x and y that is specialized for katakana. EDk(x, y) is characterized by a distance function that relaxes the distance based on local substrings. Here, EDk(x, y) is defined as Formula (3),</Paragraph> <Paragraph position="5"> where, for two strings S1 and S2, D(i, j) is defined to be the specialized edit distance of S1[1..i] and S2[1..j].</Paragraph> <Paragraph position="6"> D(i, j) is given by the recurrence relation D(i, j) = min{D(i-1, j) + id(i, j), D(i, j-1) + id(i, j), D(i-1, j-1) + t(i, j)}, where id(i, j) defines the insertion and deletion operation distance: it has the penalty value Pid if S1(i) or S2(j) denotes a consonant, and the value 1 in all other cases. In addition, t(i, j) defines the substitution operation distance: it has the value 0 if S1(i) = S2(j); in all other cases, t(i, j) is looked up in a predefined table and returns a value that depends on S1[i-3..i+3] and S2[j-3..j+3].</Paragraph> <Paragraph position="8"> Table 2 shows an example of part of the t(i, j) table, which has 29 entries. 
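The D(i, j) recurrence can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Pid value 2.5 is taken from the parameter settings reported in Section 4, the default substitution cost of 2.0 is an assumption, and the T_TABLE entries are hypothetical pair-keyed stand-ins for the paper's 29-entry table, whose real entries depend on the seven-character windows S1[i-3..i+3] and S2[j-3..j+3].

```python
VOWELS = set("aeiou")
P_ID = 2.5  # consonant insertion/deletion penalty Pid (value from Section 4)

# Toy substitution table standing in for the paper's 29-entry t(i, j) table.
# These pair-keyed values are hypothetical illustrations only.
T_TABLE = {("i", "e"): 0.5, ("e", "i"): 0.5}

def id_cost(ch):
    """id(i, j): Pid for a consonant, 1 otherwise."""
    return 1.0 if ch in VOWELS else P_ID

def t_cost(c1, c2):
    """t(i, j): 0 on a match, otherwise a table lookup
    (the 2.0 default is an assumed value)."""
    if c1 == c2:
        return 0.0
    return T_TABLE.get((c1, c2), 2.0)

def weighted_ed(s1, s2):
    """Weighted edit distance via the D(i, j) recurrence."""
    m, n = len(s1), len(s2)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + id_cost(s1[i - 1])
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + id_cost(s2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + id_cost(s1[i - 1]),            # deletion
                          D[i][j - 1] + id_cost(s2[j - 1]),            # insertion
                          D[i - 1][j - 1] + t_cost(s1[i - 1], s2[j - 1]))  # substitution
    return D[m][n]
```

With this toy table, weighted_ed("ripooto", "repooto") comes to 0.5, lower than the ordinary edit distance of 1, mirroring the vowel-change rule illustrated in Figure 2.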
In Table 2, several t(i, j) values are negative: in such situations the compared strings have already incurred, or will incur, a positive distance elsewhere, so t(i, j) takes a negative value to compensate.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Contextual similarity </SectionTitle> <Paragraph position="0"> In order to use the contextual information surrounding a katakana word, we employ a vector space model. We use a dependency analyzer to achieve more precise similarity, and the contextual information is extracted from the dependency-analyzed result of the text. Figure 3 shows an example of extracting a vector from an analyzed result; the vector has three kinds of elements: N for a cooccurring noun, P for the predicate expression depended upon by the word, and PP for particle and predicate expression pairs.</Paragraph> <Paragraph position="1"> (Figure 3 example sentence: syanpen o gurasu de kudasai.) The vectors are calculated by the following procedure. 1. Analyze the dependency structure of all sentences in the target corpus. We employed CaboCha as the dependency analyzer.</Paragraph> <Paragraph position="2"> 2. Extract vectors for all katakana words included in the corpus. Each vector corresponds to a katakana word and consists of the following elements: Nouns that cooccur with the katakana word.</Paragraph> <Paragraph position="4"> The predicate that is depended upon by the katakana word.</Paragraph> <Paragraph position="5"> Particle and predicate pairs: the particle that follows the katakana word and the predicate that is depended upon by the katakana word.</Paragraph> <Paragraph position="6"> Each element is extracted from the dependency-analyzed result of a sentence, and the frequency of each element is counted. 3. Assign a tf-idf-like weight to each element of the vector. 
The weight is calculated by the following formula.</Paragraph> <Paragraph position="8"> Here, kwi is a katakana word, ei is an element of the vector corresponding to kwi, f(kwi, ei) denotes the frequency of the element ei for kwi, sf(kwi) denotes the number of sentences that include kwi, and N denotes the number of katakana words in the corpus.</Paragraph> <Paragraph position="9"> The contextual similarity is defined by the following formula,</Paragraph> <Paragraph position="11"> where vec(kw) denotes the vector corresponding to the katakana word kw.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We used the ATR Basic Travel Expression Corpus (BTEC) (Takezawa et al., 2002) as a text resource. BTEC is a multilingual corpus, developed mainly for English and Japanese. The Japanese part of BTEC contains not only ordinary katakana variants but also katakana strings mis-transliterated by non-native speakers of Japanese, which serve as our targets for detection. The BTEC we used consists of almost 200,000 sentences.</Paragraph> <Paragraph position="1"> We used almost 160,000 sentences for the development of the t(i, j) table and the other rules used in our method, and for parameter estimation. We manually tuned the parameters to achieve the highest F-measure on the development sentences, obtaining the following values: Pid = 2.5, THlen = 5, THst1 = 9.4, THfreq = 3, THcos1 = 0.12, THcos2 = 0.02, THed = 0.65, and THst2 = 0.89.</Paragraph> <Paragraph position="2"> The development corpus includes almost 6,000 types of katakana words. We carried out a closed test on the development corpus with these parameter settings. There are two choices for the detection method: One is the use of a dictionary to judge whether the input and candidate words are known to be different words. The other is the use of contextual similarity. 
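The contextual-similarity component of Section 3.2 can be sketched as follows. The tf-idf-like weight used here, f * log(N / df), is an assumed stand-in (the paper's exact weighting formula is not reproduced in this copy), and element names such as "N:gurasu" are hypothetical labels for the N/P/PP element types; `cosine` computes the usual cosine similarity between sparse vectors.

```python
import math
from collections import Counter

def weight_vectors(raw_vectors, n_words):
    """Apply a tf-idf-like weight to raw context-element counts.

    raw_vectors maps a katakana word to a Counter of context elements
    (e.g. "N:gurasu", "P:kudasai", "PP:o+kudasai"); n_words plays the
    role of N. The weight f * log(N / df) is an assumed stand-in for
    the paper's formula."""
    df = Counter()  # document frequency of each element across words
    for vec in raw_vectors.values():
        for elem in vec:
            df[elem] += 1
    return {kw: {e: f * math.log(n_words / df[e]) for e, f in vec.items()}
            for kw, vec in raw_vectors.items()}

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v2.get(e, 0.0) for e, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Two words whose vectors share only some elements (for instance, only the predicate) still receive a nonzero cosine, which is why contextual similarity alone can conflate context-sharing non-variants.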
Actually, in the detection method, contextual similarity plays a supportive role because of the data sparseness problem. Therefore, we carried out experiments under four conditions. Table 3 shows the recall, precision, and F-measure for these four conditions. The remaining 40,000 sentences were used as a test set, with which we carried out an open test.</Paragraph> <Paragraph position="3"> Table 4 shows the result of the open test.</Paragraph> <Paragraph position="4"> There is an obvious tendency for the detection of short words to be very difficult. We compared the F-measure for each word-length class; Figure 4 shows the results of the open tests with and without the dictionary, both conducted without contextual similarity.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The experimental results showed that it is very difficult to detect short variants. Thus, it is reasonable to use a dictionary for words that we already know.</Paragraph> <Paragraph position="1"> Figure 4 shows the impact of the dictionary.</Paragraph> <Paragraph position="2"> Most of the detection errors are related to proper nouns, because there are many proper nouns that are difficult to recognize as different words, such as a38 a0a39a3 (marii, Mary) and a38 a0a39a40 (maria, Maria). Furthermore, it is also hard to differentiate these words by their context vectors, because the use of proper nouns is largely independent of context. If we can precisely detect these proper nouns written in katakana, we will be able to avoid such mis-detections. 
In practical situations, an enormous dictionary of proper nouns such as ENAMDICT would be useful for this problem.</Paragraph> <Paragraph position="3"> The detection method successfully detected several mistyped words, such as a41a33a23 a37 (buran) for a12 (furemonto hoteru, the Fremont Hotel). Most of the detected mistypes involved vowels, because the string similarity is designed to be tolerant of mistyped vowels; such mistypes were detected successfully. However, detecting mistypes involving consonants seems to be a difficult task because, in ordinary situations, most mistypes related to consonants look like completely different words.</Paragraph> <Paragraph position="4"> In addition, there are several variants that are difficult to detect with this method. A typical example was shown in the Introduction: a9a49a10a49a12a50a14 (uirusu) or a16 a3 a12a17a14 (biirusu) for the English word virus. This type of variant involves drastic changes; for instance, the ordinary edit distance between a9a24a10a51a12a50a14 (uirusu) and a16 a3 a12a50a14 (biirusu) is four, and the similarity derived from this distance is too small (0.5) to identify the two as the same word. Moreover, there is another type of orthographic variant, one that has changed over time. BTEC includes such an example: for the English word milkshake, both a52a53a12a55a54a13a56 a3a58a57 (mirukuseeki) and a52a59a12a60a54a49a35a33a61 a3 a54 (mirukusyeeku) exist. We have to be careful of this problem when we process a corpus that has been developed over a long period and that includes both very old texts and new ones.</Paragraph> <Paragraph position="5"> A well-known problem arises here: data sparseness. Orthographic variants appear less frequently than their standard forms, and we cannot expect to have much contextual information for orthographic variants. 
Therefore, we always have to cope with this problem, even when we process a very large corpus, because the appearance of variants does not depend on the size of the corpus. The basic idea of the context vector seems very reasonable for words that appear frequently in a target corpus. However, the experimental results showed that contextual similarity did not work as expected because of this data sparseness. Consequently, to achieve reliable contextual similarity, we have to use sentences in which a candidate variant actually occurs. On-line WWW search seems to be a good resource for variant detection, because WWW texts include many variants.</Paragraph> <Paragraph position="6"> On the other hand, there was a pair of words with very high string similarity and contextual similarity that are nevertheless not variants: a35a63a62 a37a2a64a65a37 (syanpen, champagne) and a66a50a10 a37a2a64a65a37 (sainpen, sign pen / felt-tip pen). Example sentences that include each word are as follows: syanpen o gurasu de kudasai. (Give me a glass of champagne, please.) sainpen o ippon kudasai. (Give me a felt-tip pen, please.) Both words are arguments of the same verb kudasai, so the vectors derived from the analyzed results of these sentences are very similar.</Paragraph> <Paragraph position="7"> In practice, these words are identified as different words by using a dictionary. However, with only contextual similarity, these words would be judged as variants.</Paragraph> <Paragraph position="8"> It is not easy to detect all of the variants by applying the proposed method. Indeed, the method employs contextual information to achieve good performance, but the contextual information used also includes variants, and these variants cause mismatches. In addition, not only katakana variants but also other orthographic variants, such as kanji variants and cross-script orthographic variants (e.g., kanji vs. hiragana, hiragana vs. 
katakana, and so on), should be detected to achieve high precision and recall.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Related works </SectionTitle> <Paragraph position="0"> To date, several studies have been conducted on the detection of transliterated orthographic variants (e.g., (Kubota et al., 1994; Shishibori et al., 1994)). Most of these, however, targeted a relatively clean and well-organized corpus, or they assumed artificial situations. As a practical matter, not only predictable orthographic variants but also misspelled words should be detected. The detection of transliterated orthographic variants and spelling correction have been studied separately, and there is no study that is directly related to our work.</Paragraph> <Paragraph position="1"> There are several studies on transliteration (e.g., (Knight and Graehl, 1997)); they tell us that machine transliteration between language pairs that employ very different alphabets and sound systems is extremely difficult, and that the technology is still too immature for use in practical processing.</Paragraph> </Section> </Paper>