<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1076"> <Title>Good Spelling of Vietnamese Texts, one aspect of computational linguistics in Vietnam</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> PHAN Huy Khanh </SectionTitle> <Paragraph position="0"/> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> There are many challenging problems for Vietnamese language processing. It will be a long time before these challenges are met. Even some apparently simple problems such as spelling correction are quite difficult and have not been approached systematically yet. In this paper, we will discuss one aspect of this type of work: designing the so-called Vietools to detect and correct spelling of Vietnamese texts by using a spelling database based on TELEX code. Vietools is also extended to serve many purposes in Vietnamese language processing.</Paragraph> <Paragraph position="1"> Introduction For the past two decades computational linguistics (CL) has progressed substantially in Vietnam, mainly in these basic aspects: data acquisition from the keyboard, encoding, and restitution through an output device for Vietnamese diacritic characters, updates on the fonts in Microsoft DOS/Windows, standardization for Vietnamese (James Do, Ngo Thanh Nhan), automatic translation of English documents into Vietnamese and vice versa (Phan Thi Tuoi, Dinh Dien), recognition of handwriting (Hoang Kiem, Nguyen Van Khuong), speech processing (Nguyen Thanh Phuc, Quach Tuan Ngoc), building bilingual dictionaries such as English-Vietnamese and V-E, French-Vietnamese and V-F dictionaries (Lac Viet), archives of old Sino-Vietnamese documents (Ngo Trung Viet, Cong Tam), etc.</Paragraph> <Paragraph position="2"> Some of these works have been presented in Informatics and IT workshops organized in Vietnam. These efforts are modest and do not yet show our full potential. 
There are many reasons for this weakness. The major reason is that the different efforts are quite isolated and there is not enough coordination. Some coordinated workshops held from time to time would be very helpful.</Paragraph> <Paragraph position="3"> At the IT Dept., DaNang University, we are building a lexical database based on TELEX code for accomplishing the following tasks: - Converting Vietnamese texts from any font to any other font.</Paragraph> <Paragraph position="4"> - Putting texts in alphabetical order independently of the font in use.</Paragraph> <Paragraph position="5"> - Looking up words in monolingual and/or multilingual dictionaries.</Paragraph> <Paragraph position="6"> - Building specialized monolingual dictionaries.</Paragraph> <Paragraph position="7"> At present, we are taking part in the FEV project (GETA, CLIPS, IMAG, France) for a multilingual dictionary: French-Vietnamese via English.</Paragraph> <Paragraph position="8"> In fact, inputting Vietnamese texts still encounters many problems that have not yet been solved properly. The most common mistakes to be detected and corrected are: - wrong tone or misspelling, - not following spelling standards, not using syllables consistently within the same text, etc.</Paragraph> <Paragraph position="9"> Winword, a commercial text processor, is not able to detect and correct Vietnamese spelling mistakes. The program designed by Ngo Thanh Nhan (without an associated spelling dictionary) and other software packages for Vietnamese still do not offer adequate solutions.</Paragraph> <Paragraph position="10"> We propose here a general solution for building the so-called Vietools for detecting and correcting spelling errors. Vietools is designed for office applications such as Winword, Excel, Access, PowerPoint, etc. in Microsoft Windows. 
Vietools has also been extended to convert and reorder Vietnamese words in dictionaries and to consult Vietnamese dictionaries, including multilingual dictionaries.</Paragraph> <Paragraph position="11"> 1 Building spelling database In the spelling dictionary by Hoang Phe (1995), there are 6760 syllables in the writing system (6616 syllables in the phonology system) to compose single words or complex words. Each syllable has two parts: an initial consonant (optional) and a rhyme pattern (including rhyme and tone). Altogether, there are 27 initial consonants and 1160 rhyme patterns (including 6 tones).</Paragraph> <Paragraph position="12"> Based on Vietnamese syllable structure, the spelling database is built in a tabular form. Each element of the table helps to check the correctness of a syllable based on the column position of the initial consonant and the row position of the rhyme pattern. For example, the syllable lamf (work), in TELEX form, is composed of the initial consonant l and the rhyme pattern am with the low falling tone (or grave accent) f. Each element of the table can be understood as: - syllables used in Vietnamese.</Paragraph> <Paragraph position="13"> - elements with alternative tone sign positions (on o: oja or on a: oaj), pronunciation or dialect variants in spelling (z is equivalent to d or gi, y is equivalent to i...) 
and borrowings such as karaoke, photocopy, fax...</Paragraph> <Paragraph position="14"> - Sino-Vietnamese words: coongj (addition) - congj, quoocs (country) - nuwowcs...</Paragraph> <Paragraph position="15"> - combinations that cannot form syllables: quts, quoon, coan, cuee...</Paragraph> <Paragraph position="16"> Techniques have been developed to recognize compound words of two syllables, such as baor damr or damr baor (guarantee), chung chung (vague), etc., of three syllables, such as howpj tacs xax (cooperative), etc., and of four syllables, such as coong awn vieecj lamf (work, job), etc.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Designing Vietools </SectionTitle> <Paragraph position="0"> The error-detecting program reads one syllable at a time from the text. The syllable is divided into an initial consonant and a rhyme pattern, with special attention to the following initial consonants: gi contains the vowel i; qu contains the vowel u, but it is easy to separate from the rest of the syllable because q does not occur as an initial consonant on its own; the other compound initial consonants have a length of 2 or 3. The error-correcting unit checks the conformity of the initial consonant (if present) and the rhyme pattern.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Code converting </SectionTitle> <Paragraph position="0"> At present, there are many Vietnamese fonts built on different codes (differing in the number of bytes used: 1 byte or 2 bytes, the order of tones, letter arrangements, etc.). Because there has not been a unified code for Vietnamese text, we selected TELEX code as the pivot code. There are many codes to convert from, such as IBM-CP01129, Microsoft-CP1258, VISCII, VietKey, VietWare, VNI, TCVN3, Unicode, etc.</Paragraph> <Paragraph position="1"> Vietools works on syllables converted to TELEX. 
Vietools analyses syllables to detect initial consonants and rhyme patterns in TELEX code.</Paragraph> <Paragraph position="2"> Conclusion The main advantage of our method is that the tool operates independently of the Vietnamese font used. The design of Vietools is open: one can add new functions such as text or data conversion. The structure of the spelling database helps in building multi-functional dictionaries, which are essential for natural language processing.</Paragraph> </Section> </Paper>