<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1207"> <Title>A State of the Art of Thai Language Resources and Thai Language Behavior Analysis and Modeling Asanee Kawtrakul, Mukda Suktarachan, Patcharee Varasai,</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. A State of the Art of Thai Language Resource </SectionTitle> <Paragraph position="0"> This section surveys the state of the art of Thai language resources: corpora, lexicons, and tools. We present only the resources that are open for public access.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Corpus </SectionTitle> <Paragraph position="0"> Existing Thai corpora fall into two types, speech corpora and text corpora, and have been developed by many Thai universities. The Thai Language Audio Resource Center of Thammasart University (ThaiARC) (http://thaiarc.ac.th) developed a speech corpus that aims to provide digitized audio information for dissemination via the Internet. The project pioneers the production and collection of various types of audio information and various styles of Thai speech, such as royal speeches, academic lectures, and oral literature.</Paragraph> <Paragraph position="1"> Text corpora were originally collected only for use inside individual laboratories. This remained the case until 1996, when the National Electronics and Computer Technology Center (NECTEC) and the Communications Research Laboratory (CRL) carried out a collaborative project to prepare a Thai language corpus from technical proceedings for language study and application research. It was named the ORCHID corpus (NECTEC, 1997).
The NAiST Corpus was begun in 1996 with the primary aim of collecting documents from magazines for training and testing programs in Written Production Assistance (Asanee, 1995).</Paragraph> <Paragraph position="2"> The existing corpora are summarized in Table 1.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Lexicon </SectionTitle> <Paragraph position="0"> A number of Thai lexicons have been developed, as shown in Table 2.</Paragraph> <Paragraph position="1"> Of the lexicons in Table 2, only Lexitron (from NECTEC) and NAiST Lexibase (from Kasetsart University) have been applied to NLP. NAiST Lexibase was developed on a relational model so that it can be managed and maintained easily in the future. It contains a list of 15,000 words together with their syntactic information and their semantic concept information, the latter encoded in a concept code.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Corpus and Language Analysis Tools </SectionTitle> <Paragraph position="0"> A corpus is not only a resource of linguistic knowledge; it is also used for training, improving, and evaluating NLP systems. Tools for corpus manipulation and knowledge acquisition are therefore necessary.</Paragraph> <Paragraph position="1"> NAiST Lab has developed a toolkit that is shared via the Internet. It is designed for collecting, annotating, maintaining, and analyzing corpora. It is also designed as an engine that end users can run on their own data (see the service at http://naist.cpe.ku.ac.th).</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. Thai Language Behavior Analysis </SectionTitle> <Paragraph position="0"> To build a good language model for creating cost-effective solutions to practical problems in application development, language behavior must be observed.
The following analysis of Thai language behavior is based on the NAiST corpus and covers lexicon growth, Thai word formation, and phrase construction.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Lexicon Growth </SectionTitle> <Paragraph position="0"> Lexicon growth is studied by using a word list extraction tool to extract word lists from a large-scale corpus and mapping them to the Royal Institute Dictionary (RID). Two types of lexical item can be distinguished: common words and unknown words. Common words are words in the RID that occur in almost every document and are used in daily life. They are primitive</Paragraph> <Paragraph position="2"> words that are neither proper names nor colloquial words. Unknown, or new, words occur frequently in real documents and include proper names, colloquial words, abbreviations, and foreign words.</Paragraph> <Paragraph position="3"> Lexicon growth was observed at three corpus sizes, 400,000, 2,154,700, and 60,511,974 words, drawn from newspaper, magazine, and agricultural text. The number of common words increased from 111,954 to 839,522 and then to 49,136,408 as the corpus grew, while the number of unknown words increased from 288,046 to 1,315,178 and then to 11,375,566, as shown in Table 3.</Paragraph> <Paragraph position="4"> The largest corpus in Table 3 is composed of 35,127,012 words from newspapers, 18,359,724 words from magazines, and 7,025,238 words from agricultural text.</Paragraph> <Paragraph position="5"> Table 4 shows how unknown words of each category occur across the corpus genres.</Paragraph> <Paragraph position="6"> Tables 3 and 4 show that common words increase as well as unknown words, and that the main categories of growing unknown words are proper names and foreign words.
Consequently, computational models have been developed for unknown word extraction, named entity identification, and new word construction.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 New Word Formation and Core Noun </SectionTitle> <Paragraph position="0"> Given the growth of common words shown in Table 3, we studied where new words come from.</Paragraph> <Paragraph position="1"> Thai words are multi-syllabic, and stringing them together can form a new word. Since Thai has no inflection and no word delimiters, Thai morphological processing consists mainly of recognizing word boundaries rather than recovering a lexical form from a surface form, as in English. Let C be a sequence of characters</Paragraph> <Paragraph position="3"> Since a Thai sentence is a sequence of words written as a stream of characters c1c2c3...cn, mostly without explicit delimiters, the word boundaries in a pattern such as &quot;c1c2c3c4c5&quot; can be ambiguous between two segmentations. One is &quot;c1c2&quot; and &quot;c3c4c5&quot;; the other is &quot;c1c2c3&quot; and &quot;c4c5&quot;. When the characters are grouped differently, the meaning of the words differs too. For example, &quot;k`d`k k`d`k(fold one's arms across the chest)&quot; and clump of flower)&quot;. In our corpus, we found a sentence of 45 characters that has 30 possible word sequences.</Paragraph> <Paragraph position="4"> Almost all new Thai words are formed by means of compounding and nominalization, using a set of prefixes.</Paragraph> <Paragraph position="5"> 3.2.2.1 Nominalization. Nominalization is a process by which a noun is formed from a word by using prefixes. Nouns formed with the prefixes &quot;kaar(ka:n)&quot; and &quot;khwaam(k h wa:m)&quot; are nouns that signal action.
Words formed with the prefixes &quot;chaaw(t h a:w)&quot; and &quot;nak(nak)&quot; are nouns that signal a human or a profession.</Paragraph> <Paragraph position="6"> )&quot; are used with a verb or verb phrase and sometimes with a noun (nominalization). When kaar(ka:n) co-occurs with a noun, it expresses the duty or function of the noun it relates to. When kaar(ka:n) co-occurs with a verb, the verb is always an action verb.</Paragraph> <Paragraph position="8"> u:) and nak(nak) co-occur with verb phrases. nak(nak) can sometimes occur with nouns from a few fields, such as sport and music. To handle this, we initially kept words constructed from the prefix &quot;nak (nak)&quot; plus a noun in the lexicon. The prefix &quot;chaaw(t h a:w)&quot; can co-occur with nouns only.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="4" type="metho"> <SectionTitle> 3.2.2.2 Compounding </SectionTitle> <Paragraph position="0"> New Thai words can also be formed by combining words into compound nouns, which are invented almost daily. They normally have at least two parts. The first part denotes the object or person referred to, such as khn(man), hm`(pot), haang(tail), phuuech(plant). The second part identifies what kind of object or person it is, or what its purpose is, such as khabrth(drive a car), hungkhaaw(cook rice), esuue`(tiger), nM aa(water). Table 5 shows examples of Thai compound nouns.</Paragraph> <Paragraph position="2"> Table 6 shows that some compound nouns retain part of the meaning of their primitive words, while others take on a new meaning. In this paper, we are interested only in compound nouns formed from primitive words whose meaning has become more abstract but that still retain part of the meanings of those primitive words, e.g.
&quot;khnrth(driver), khnkhraw(cook), etc.&quot; The word &quot;khn&quot; keeps its meaning, the concept of a human, but when it is compounded with &quot;rth(car)&quot; and &quot;khraw(kitchen)&quot;, the compounds come to denote occupations through a word relation at the equivalent level. A compound noun whose meaning changes as a whole, such as &quot;luukesuue` (a boy scout)&quot;, is kept in the lexicon.</Paragraph> <Paragraph position="3"> sentences. Since Thai is flexible and has no word derivation, elements such as the preposition in a compound noun can be omitted. As a result, a compound noun can have the same pattern as a sentence, which makes Thai NP analysis in an IR system more difficult than in English (see Figure 2). In Figure 2, the compound noun &quot;otakinkhaaw&quot; (a dining table) omits the preposition &quot;sM aahrab (for)&quot;, which expresses the relation pointing to the purpose of the first noun, &quot;ota(table)&quot;. The sentence given for comparison is nkkinphlaim (birds eat fruit).</Paragraph> <Paragraph position="4"> The Compound Noun Boundary Ambiguity. After extracting noun phrases to enhance the IR system, we have to segment each noun phrase into sub noun phrases or compound nouns in order to specify the core noun as the index and its modifier as the sub-index. For example, consider a compound noun with the structure &quot;noun + noun + verb&quot;: edk(child/N)phm(hair/N)yaaw (long/V). In this case, the second noun and the verb have to be grouped first, since together they behave like a modifier in which the relative pronoun that represents its purpose, i.e., &quot;who has&quot;, is omitted. Another case of compound noun boundary ambiguity is word combination.
Consider, as an example, an NP composed of the following four words:</Paragraph> <Paragraph position="6"> There are 8 possible word combinations for the compound noun, as shown in Figure 3.</Paragraph> <Paragraph position="7"> As Figure 3 shows, the word string has to be grouped correctly to obtain the correct meaning.</Paragraph> <Paragraph position="8"> Noun phrase boundary ambiguity also directly affects the efficiency of text retrieval.</Paragraph> <Paragraph position="9"> Core Noun Detection. Information retrieval requires detecting the head, or core, of a noun phrase. In this paper, the core noun is the most important and specific word, one that information retrieval and extraction can retrieve or extract directly without over-generating candidate words. Our observations show, however, that the core of a noun phrase need not be the initial word. Some cores are in final position, and some noun phrases have a word relation at the equivalent level (as shown in Table 7).</Paragraph> <Paragraph position="10"> [Table 7: example glosses &quot;fruit papaya green&quot;; core noun positions listed: W1 at the initial position, W2 at the final position, W2 at the second position.] As mentioned above, models of New Word Generation and Noun Phrase Recognition are therefore among the interesting problems in Thai processing.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.3 Phrase and Sentence Construction </SectionTitle> <Paragraph position="0"> Next, we describe the main problems that affect MT, IE, and IR systems: constituent movement, zero anaphora, and iterative relative clauses.</Paragraph> <Paragraph position="1"> Constituency is the relationship between lexical units that are parts of a larger unit. It is usually shown by a tree diagram or by square brackets: Ex.
[[kaarprachumkhnakrrmkaar] [`yaangraabruuen]] [[meeting committee] [very smoothly]].</Paragraph> <Paragraph position="2"> A constituent acts as a chunk whose parts move together, and such movement occurs often in Thai (see Fig. 4). Constituents can be moved to the front, the middle, or the end of the sentence.</Paragraph> <Paragraph position="4"> Ex.: (1) t`nechaa chaawpramng ``keruue` haaplaa (In the morning, the fisherman goes to catch the fish.) (2) chaawpramng ``keruue` haaplaa t`nechaa (The fisherman goes to catch the fish in the morning.) (3) chaawpramng ``keruue` t`nechaa haaplaa (The fisherman goes to, in the morning, catch the fish.) (4) t`nechaa haaplaa chaawpramng ``keruue` (In the morning, catch the fish, the fisherman goes to.) Noun, adverb, and prepositional phrases often move, while verb phrases are.</Paragraph> <Paragraph position="5"> To create cohesion in discourse, anaphora is used as a reference that &quot;points back&quot; to some entity, called the referent or antecedent, given in the preceding discourse. Halliday, M.A.K. and Hasan, Ruqaiya (1976) divided cohesion in English into 5 categories, as shown in Table 8:</Paragraph> <Paragraph position="7"> In the news, magazine, and agricultural text of the corpus, we observed 4 types of anaphora. Ellipsis, or zero anaphora, was found most frequently in Thai documents; the other types occur as shown in Table 9.</Paragraph> <Paragraph position="8"> Zero anaphora is the use of a gap in a phrase or clause that has an anaphoric function similar to that of a pro-form.
It is often described as &quot;referring back&quot; to an expression that supplies the information necessary for interpreting the gap. The following sentence illustrates zero anaphora: miithnns`ngsaaythiit`ngaip trngaetaekhb aelakwaangaetkhdekhiiyw (There are two roads to eternity, straight but narrow, and broad but crooked.)</Paragraph> <Paragraph position="9"> In this sentence, the gaps in &quot;straight but narrow [gap]&quot; and &quot;broad but crooked [gap]&quot; have a zero anaphoric relationship to &quot;two roads to eternity&quot;.</Paragraph> <Paragraph position="10"> Notably, zero anaphora in subject position occurs with high frequency (49.88%). This shows that in Thai the subject is the position most commonly replaced by a gap.</Paragraph> </Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.3.3 Iterative Relative Clause </SectionTitle> <Paragraph position="0"> The Thai relative pronouns &quot;thii&quot; (thi), &quot;chueng(sung)&quot;, and &quot;`an(un)&quot; relate to groups of nouns or to other pronouns (The student &quot;thii&quot; (thi) studies hardest usually does the best.). The word &quot;thii&quot; (thi) connects or relates the subject, student, to the verb within the dependent clause (studies). Generally, we use &quot;thii&quot; (thi) and &quot;chueng(sung)&quot; to introduce clauses that are parenthetical in nature (i.e., that can be removed from the sentence without changing the essential meaning of the sentence).
The pronouns &quot;thii&quot; (thi) and &quot;chueng(sung)&quot; refer to things and people, while &quot;`an(un)&quot; usually refers to things but can also refer to events in general.</Paragraph> <Paragraph position="1"> The relative pronoun is sometimes omitted because omitting it makes the sentence more concise and elegant.</Paragraph> <Paragraph position="2"> * hnangsuue` thii/chueng khun sangchuue` cchaak raannan maathuengaelwemuue` 2 wank`n The book that you ordered from that shop arrived two days later.</Paragraph> <Paragraph position="3"> Sometimes relative pronouns occur repeatedly within a phrase, each introducing a further relative clause.</Paragraph> <Paragraph position="4"> Figure 5: The structure of relative clause. Ex. [ph`khraw]N [(thii) chnakaaraekhngkhanthM aa`aahaar]Rel Cl. [The chef] [who won the cooking competition] [(chueng) cchadkhuenthiipraethsfrangess] Rel Cl. [(thii) chancchaangmaa] Rel Cl. [which compete at France] [that I employ] Although a sentence that contains several clauses is grammatical, it is not good writing style, and it always causes problems for parsers and for noun phrase recognition.</Paragraph> </Section> </Section> <Section position="6" start_page="4" end_page="4" type="metho"> <SectionTitle> 4. The Computational Model </SectionTitle> <Paragraph position="0"> Computational models at the word and phrase levels were developed according to the phenomena described in Section 3.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.1 Unknown Word Extraction </SectionTitle> <Paragraph position="0"> The unknown word extraction model is composed of 2 sub-modules: unknown word recognition and named entity identification.</Paragraph> <Paragraph position="1"> A hybrid approach is used for unknown word recognition. It combines a statistical model with a set of context-based rules. The statistical model is used to identify the boundary of an unknown word.
The set of context-based rules is then used to extract the unknown word's semantic concept. If the unknown word has no context, a set of candidate unknown word information, defined through corpus analysis, is generated, and the best candidate is selected as its semantic concept by the semantic tagging model. The unknown word recognition process is shown in Figure 6.</Paragraph> <Paragraph position="2"> After unknown words have been extracted, Named Entity (NE) identification determines the category of each unknown word. The model is based on heuristic rules and mutual information. Mutual information, a statistical analysis of word collocation, is used to resolve boundary ambiguity when a name is composed of known and unknown words. We use knowledge bases, such as lists of known names (e.g., country names) and clue word lists (e.g., personal titles), to support the heuristic rules. A clue word or common noun that precedes the name can determine the NE category. NE categories can also be defined on the basis of case grammar. Moreover, lists of names from a predefined NE ontology can be used to predict the category. An overview of our system is shown in Figure 7. For more detail, see (Chanlekha, H. et al, 2002).</Paragraph> </Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.2 New Word Generation </SectionTitle> <Paragraph position="0"> Word formation is proposed to reduce the lexicon size by constructing new words, or compound nouns, from existing words. Based on word formation rules and a common dictionary, the shallow parser extracts a set of candidate compound nouns. A probabilistic approach based on syntactic structure and statistical data is then used to address over- and under-generation in new word construction and to prune erroneous compound nouns from the candidate set. The process of new word construction is shown in Figure 8. See more detail in (Pengphon, N.
et al, 2002). Figure 8: New Word Construction process</Paragraph> </Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.3 Noun Phrase Recognition </SectionTitle> <Paragraph position="0"> Entities or concepts are usually described by noun phrases. This indicates that text chunks such as noun phrases play an important role in human language processing. To analyze NPs, both statistical and linguistic data are used. The model of the NP analysis system is shown in Figure 9. For more detail, see (Pengphon, N. et al, 2002). The first step is morphological analysis for word segmentation and POS tagging. In the second step, each compound word is grouped into one word by the word formation module (see 4.2). In the third step, a statistical technique, together with NP rules, is used to identify phrase boundaries. The next step is noun phrase segmentation. The condition for noun phrase segmentation is shown in Figure 10.</Paragraph> <Paragraph position="1"> After a noun phrase has been correctly detected, the relations within it are extracted. There are 2 types of relation: head-head noun phrase and head-modifier noun phrase. The process is based on statistical techniques that consider the frequency (f</Paragraph> <Paragraph position="3"> ) in the document (see Figure 11).</Paragraph> </Section> </Section> </Paper>