<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2209"> <Title>JMdict: a Japanese-Multilingual Dictionary</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Project Status </SectionTitle> <Paragraph position="0"> The JMdict file was first released in 1999, and updated versions are released 3-4 times each year along with versions of the EDICT file, which is generated at the same time from the same data files. The file now has over 99,300 entries, which is about the size of a medium-large printed dictionary, and the growth in the number of entries is now relatively slow, with most updates dealing with corrections and expansion of existing entries.</Paragraph> <Paragraph position="1"> The file is available under a liberal licence that allows its use for almost any purpose without fee. The only requirements are that its use be fully acknowledged and that any files developed from it continue under the same licence conditions.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Structure </SectionTitle> <Paragraph position="0"> The JMdict XML structure contains one element type: &lt;entry&gt;, which in turn contains sequence number, kanji word, kana word, information and translation elements.</Paragraph> <Paragraph position="1"> The sequence number is used for maintenance and identification.</Paragraph> <Paragraph position="2"> The kanji word and kana word elements contain the two forms of the Japanese headwords; the former is used for representations containing at least one kanji character, while the latter is for representations in kana alone. The kana word is effectively the pronunciation, but is also an important key for indexing the dictionary file, as Japanese dictionaries are usually ordered by kana words. The minimum content of these fields is a single word in the kana word element.
In addition, each entry may contain information about the words (unusual orthographical variant, archaic kanji, etc.) and frequency-of-use information. The latter needs to be associated with the actual words rather than the entry as a whole because some combinations of kanji and kana words are used more frequently than others. (For example, 合気道 and 合氣道 are orthographical variants of the one word (aikido), but the former is more common.) The kana used in the elements follows modern Japanese orthography, i.e. hiragana is used for native Japanese words, and katakana for loan words, onomatopoeic words, etc.</Paragraph> <Paragraph position="3"> In most cases an entry has just one kanji and one kana word (approx. 75%), or one kana word alone (15%). In about 10% of entries there are multiple words in one of the elements. In some cases a kana reading can only be associated with a subset of the kanji words in the entry. For example, そよかぜ (soyokaze: breeze) can be written either 微風 or そよ風 (the latter is more common as そよ is a non-standard reading of the 微 kanji). However, 微風 can also be pronounced びふう (bifuu) with the same meaning, but clearly this pronunciation cannot be associated with the そよ風 form, as the kana portion is read &quot;soyo&quot;. XML does not provide an elegant method for indicating a restricted mapping between portions of two elements, so when such a restriction is required, additional tags are attached to each affected kana word, naming the kanji word with which it may validly be associated.</Paragraph> <Paragraph position="4"> The information element contains general information about the Japanese word or the entry as a whole. The contents allow for ISO-639 source language codes (for loan words), dialect codes, etymology, bibliographic information and update details.</Paragraph> <Paragraph position="5"> The translation area consists of one or more sense elements that contain at a minimum a single gloss.
Associated with each sense is a set of elements containing part-of-speech, cross-reference, synonym/antonym, usage and similar information. Also associated with the sense may be restriction codes tying the sense to a subset of the Japanese words. For example, 水気 can be pronounced すいき (suiki) and みずけ (mizuke); both mean &quot;moisture&quot;, but the former alone can also mean &quot;dropsy&quot;.</Paragraph> <Paragraph position="6"> The gloss element has an attribute stating the target language of the translation. In its absence it is assumed the gloss is in English. There is also an attribute stating the gender if, for example, the part of speech is a noun and the gloss is in a language with gendered nouns. Figure 1 shows a slightly simplified example of an entry. The &lt;ke_pri&gt; and &lt;re_pri&gt; elements indicate the word is a member of a particular set of common words.</Paragraph> <Paragraph position="7"> The potential to have multiple kanji and kana words within an entry brings attention to the issues of homonymy, homography and polysemy, and the policies for handling these, in particular the criteria for combining kanji and kana words into a single entry. As Japanese has a comparatively limited set of phonemes, there are a large number of homophonous words. For example, over twenty different words have the kana representation こうじょう (kojo). If we regard homography as only applying to words written wholly or partly with kanji, there are relatively few cases of it; however, they do exist, e.g. 川柳 when read せんりゅう (senryu) means a comic poem, but when read かわやなぎ (kawayanagi) means a variety of willow tree.</Paragraph> <Paragraph position="8"> The combining rule that has been applied in the compilation of the JMdict file is as follows: a. treat each basic entry as a triplet consisting of: kanji representation, matching kana representation, senses; b. 
if, for any basic entries, two or more members of the triplet are the same, combine them into one entry; i. if the entries differ in kanji or kana representation, include these as alternative forms; ii. if the entries differ in sense, treat as a case of polysemy; c. in other cases leave the entries separate.</Paragraph> <Paragraph position="9"> This rule has been applied successfully in a majority of cases. The main problems arise where the meanings are similar or related, as in the case of the entries: (放す, はなす, to separate; to set free; to turn loose) and (離す, はなす, to part; to divide; to separate), where the kana words are the same and the meanings overlap. Japanese dictionaries are divided on 放す and 離す: some keep them as separate entries, and others treat them as one entry with two main senses. (The two words derive from a common source.)</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Parts of Speech and Related Issues </SectionTitle> <Paragraph position="0"> As languages differ in their parts of speech (POS), the recording of those details in bilingual dictionaries can be a problem (Al-Kasimi, 1977). Traditionally, bilingual dictionaries involving Japanese have avoided recording any POS information, leaving it to the user to deduce that information from the translation and examples (if any). In the early stages of the EDICT project, POS information was deliberately kept to a minimum, e.g. indicating where a verb was transitive or intransitive when this was not apparent from the translation, mainly to conserve storage space. As there are a number of advantages in having POS information marked in an electronic dictionary file, a POS element was included in the JMdict structure, and publicly available POS classifications were used to populate much of the file.
About 30% of entries remain to be classified, mostly nouns or short noun phrases.</Paragraph> <Paragraph position="1"> In the interests of saving space, an early decision had been made to avoid listing derived forms of words. For example, the Japanese adjective 高い (takai) meaning &quot;high, tall, expensive&quot; has derived forms of 高さ (takasa) &quot;height&quot; and 高く (takaku) &quot;highly&quot;. As this process is very regular, many Japanese dictionaries do not carry entries for the derived forms, and some bilingual dictionaries follow suit. Another such example is the common verb form, sometimes called a &quot;verbal noun&quot;, which is created by adding the verb する (suru) &quot;to do&quot; to appropriate nouns. The verb &quot;to study&quot; is 勉強する (benkyosuru) where 勉強 is a noun meaning &quot;study&quot; in this context. Again, Japanese dictionaries often do not include these forms as headwords, preferring to indicate in the body of an entry that the formation is possible.</Paragraph> <Paragraph position="2"> The omission of such derived forms means that care needs to be taken when constructing the translations so that the user is readily able to identify the appropriate translation of one of the derived forms.</Paragraph> <Paragraph position="3"> In a multilingual context, the omission of derived forms can cause other problems.
The recording of suru verbs only in their noun base form has been reported to lead to some discomfort among German users, as German orthographic convention capitalizes the first letter of nouns but not of verbs (the WaDokuJT file has suru verbs as separate entries for this reason).</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Inclusion and Maintenance of Multiple Languages </SectionTitle> <Paragraph position="0"> As mentioned above, part of the interest in having entries with translations in a range of languages came from the compilation of a number of dictionary files based on or similar to the EDICT file. There are a number of issues associated with the inclusion of material from other dictionary files, in particular those relating to the compilation policies: coverage, handling of inflected forms, etc. (Breen, 2002). There is also the major issue of the editing and maintenance of the material, which has the potential to become more complex as each language is incorporated.</Paragraph> <Paragraph position="1"> The approach taken with JMdict has been to: a. maintain a core Japanese-English file with a well-documented structure and set of inclusion and editing policies; b. encourage the development and maintenance of equivalent files in other languages paired with Japanese, which can draw on the JMdict/EDICT material as required; c. 
periodically build the complete multilingual JMdict from the different components.</Paragraph> <Paragraph position="2"> This approach has proved successful in that it has separated the compilation of the file from the ongoing editing of the components, and has left the latter in the hands of those with the skills and motivation to perform the task.</Paragraph> <Paragraph position="3"> At the time of writing, the JMdict file has over 99,300 entries (Japanese and English), of which 83,500 have German translations, 58,000 have French translations, 4,800 have Russian translations and 530 have Dutch translations. A set of approximately 4,500 Spanish translations is being prepared, with the prospect that some 20,000 will be available shortly.</Paragraph> <Paragraph position="4"> The major sources of these additional translations are: a. French translations from two projects: i. approximately 17,500 entries have come from the Dictionnaire français-japonais Project (Desperrier, 2002), a project to translate the most common Japanese words from the EDICT file into French; ii. a further 40,500 entries drawn from the 仏語補完計画 (French-Japanese Complementation Project) at http://francais.sourceforge.jp/ (this project is also based on the EDICT file); b. German translations from the WaDokuJT Project (Apel, 2002). This is a large file of over 300,000 entries; however, unlike JMdict it includes many phrases, proper nouns and inflected forms of verbs, etc. The overlap of coverage with JMdict is quite high, which accounts for the large number of German translations included in the JMdict file.</Paragraph> <Paragraph position="5"> One of the issues that can lead to problems when incorporating translations from other project files is that of aligning the translations when an entry has several senses.
In the case of the French translations, the project coordinator has marked the translations of polysemous entries with a sense code, thus enabling the translations to be inserted correctly when compiling the final file. For other languages, the translations are being appended to the set of English translations. The appropriate handling of multiple senses is an item of future work.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Examples of Word Usage </SectionTitle> <Paragraph position="0"> When the project was begun and the DTD designed, it was intended that sets of bilingual examples of usage of the entry words would be included. For this reason an &lt;example&gt; element was associated with each sense to allow for such example phrases, sentences, etc., to be included.</Paragraph> <Paragraph position="1"> In practice, a quite different approach has been taken. With the availability since 2001 of a large corpus of parallel Japanese/English sentences (Tanaka, 2001), it was decided to keep the corpus intact, and instead provide for the association of selected sentences from the corpus with dictionary entries via dictionary application software (Breen, 2003b). This strategy, which required the corpus to be parsed to extract a set of index words for each sentence, has proved effective at the application level. It also has the advantage of decoupling the maintenance of the dictionary file from that of the example corpus.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Related Projects </SectionTitle> <Paragraph position="0"> Apart from a few small word lists involving several European languages, the only other major current project attempting to compile a comprehensive multilingual database is the Papillon project (e.g. Boitet et al., 2002). See http://www.papillon-dictionary.org/ for a full list of publications.
The Papillon design involves linkages based on word-senses as proposed by Serasset (1994), with the finer lexical structure based on Meaning-Text Theory (MTT) (Mel'cuk, 1984-1996). At the time of writing the Papillon database is still in the process of being populated with lexical information.</Paragraph> <Paragraph position="1"> Closely related to the JMdict project is the Japanese-Multilingual Named Entity Dictionary (JMnedict) project. This is a database of some 400,000 Japanese place and person names, and non-Japanese names in their Japanese orthographical form, along with a romanized transcription of the Japanese (Breen, 2004b). Some geographical names have English descriptions (cape, island, etc.), which are in the process of being extended to other languages. The JMnedict file is in an XML format with a similar structure to JMdict.</Paragraph> <Paragraph position="2"> Another multilingual lexical database is KANJIDIC2 (Breen, 2004c), which contains a wide range of information about the 13,039 kanji in the JIS X 0208, JIS X 0212 and JIS X 0213 character standards. The information for each kanji includes the set of readings in Japanese, Chinese and Korean, and the broad meanings of each kanji in English, German and Spanish. A set of Portuguese meanings is being prepared. The database is in an XML format.</Paragraph> </Section> <Section position="10" start_page="0" end_page="0" type="metho"> <SectionTitle> 9 Applications </SectionTitle> <Paragraph position="0"> While there are a number of experimental systems using the JMdict file, the only application system using the full multilingual file at present is the Papillon project server. Figure 2 shows the display from that server when looking up the word 川柳. The author's WWWJDIC server (Breen, 2003a) uses the Japanese-English components of the file.
Figure 3 is an extract from the WWWJDIC display for the word 小人, which is an example of an entry with multiple kana words, and senses restricted by reading. (The (P) markers indicate the more common readings.) The EDICT Japanese-English dictionary file, which is generated from the same database as the JMdict file, continues to be a major non-commercial Japanese-English lexical resource, and is used in a large number of applications and servers, as well as in a number of research projects.</Paragraph> </Section> </Paper>
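<!-- The combining rule of Section 4 (steps a to c) can be sketched as follows. This is a minimal illustration in Python, not the project's actual compilation code. The names Entry, shared_members, add_once and combine are hypothetical, each sense is reduced to a single gloss string, and the single greedy pass is an assumption. The sample triplets restate the aikido, senryu/kawayanagi and suiki/mizuke examples from the text.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One combined entry: headword forms plus an ordered list of senses."""
    kanji: list[str]
    kana: list[str]
    senses: list[str]


def shared_members(entry: Entry, kanji: str, kana: str, sense: str) -> int:
    """Count how many members of a basic triplet already occur in an entry."""
    return (kanji in entry.kanji) + (kana in entry.kana) + (sense in entry.senses)


def add_once(items: list[str], item: str) -> None:
    """Append item unless it is already present, preserving order."""
    if item not in items:
        items.append(item)


def combine(triplets: list[tuple[str, str, str]]) -> list[Entry]:
    """Merge each triplet into the first entry sharing two or more members."""
    entries: list[Entry] = []
    for kanji, kana, sense in triplets:
        for entry in entries:
            if shared_members(entry, kanji, kana, sense) >= 2:
                add_once(entry.kanji, kanji)   # rule b.i: alternative form
                add_once(entry.kana, kana)     # rule b.i: alternative form
                add_once(entry.senses, sense)  # rule b.ii: polysemy
                break
        else:
            # rule c: no existing entry shares two members, keep it separate
            entries.append(Entry([kanji], [kana], [sense]))
    return entries


# Step a: basic entries as (kanji, kana, sense) triplets.
basic = [
    ("合気道", "あいきどう", "aikido"),
    ("合氣道", "あいきどう", "aikido"),
    ("川柳", "せんりゅう", "comic poem"),
    ("川柳", "かわやなぎ", "variety of willow tree"),
    ("水気", "すいき", "moisture"),
    ("水気", "みずけ", "moisture"),
    ("水気", "すいき", "dropsy"),
]
result = combine(basic)
for entry in result:
    print(entry)
```

With these inputs the two aikido spellings merge into one entry, the two readings of 川柳 stay separate because they share only the kanji form, and 水気 becomes one entry with two readings and two senses. The sketch does not keep the tie between a sense and its reading ("dropsy" applies to the すいき reading only), which the real file records with restriction codes, and its result can depend on input order. -->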