File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/94/c94-1091_intro.xml

Size: 11,901 bytes

Last Modified: 2025-10-06 14:05:34

<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1091">
  <Title>CLASSIFIER ASSIGNMENT BY CORPUS-BASED APPROACH</Title>
  <Section position="3" start_page="0" end_page="559" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Classifiers play a significant role in Thai, where they combine with a noun or verb to express quantity, determination, pronoun, etc. By far the most common use of classifiers, however, is in enumerations, where the classifiers follow numerals and precede demonstratives (Noss, 1964). Not all types of classifier have a relationship with a noun or verb as a unit classifier does.</Paragraph>
    <Paragraph position="1"> A unit classifier is any classifier which has a special relationship with one or more concrete nouns. For example, to enumerate members of the class of /rya/ 'boats', the unit classifier /lam/ is selected, as in the phrase below: /rya nung lam/ boat one &lt;boat&gt; 'one boat'.</Paragraph>
    <Paragraph position="2"> Other than the unit classifier, there are the collective classifier, metric classifier, frequency classifier and verbal classifier.</Paragraph>
    <Paragraph position="3"> A collective classifier is any classifier which denotes a general group or set of mass nouns, e.g. /nok soong fung/ 'two flocks of birds'. A metric classifier is any classifier which occurs in enumerations that modify predicates as well as nouns, e.g. /nam saam kaew/ 'three glasses of water'. A frequency classifier is any classifier which is used to express the frequency of an event, e.g. /bin sii roob/ 'fly four rounds'. A verbal classifier is any classifier which is derived from a verb and is usually used in construction with mass nouns, e.g. /kradaad haa muan/ 'five rolls of paper'.</Paragraph>
    <Paragraph position="4"> The unit classifier has a special relationship with concrete nouns: the set of unit classifiers available to each noun is closed. Most of the unit classifiers are used with a great many concrete nouns of very different meanings, but few are restricted to a single noun. Except for the unit classifier, the set of classifiers available to a noun or predicate is open.</Paragraph>
    <Paragraph position="5"> The metric classifiers are especially numerous: there are many classifiers for numeral expressions of distance, size, weight, container and value.</Paragraph>
    <Paragraph position="6"> The use of classifiers in Thai is not limited to numeral expressions but extends to other expressions such as ordinal, determination, relative pronoun, pronoun, etc. The details of each classifier phrase are described in the next section.</Paragraph>
    <Paragraph position="7"> In many existing natural language processing systems, the list of available classifiers for each noun is attached to a lexicon base. Rules for selecting a classifier from the list can provide a default value but do not guarantee its appropriateness. However, the problems of classifier phrase construction still remain unsolved.</Paragraph>
    <Paragraph position="8"> To overcome the problems of using classifiers, we propose a method of extracting classifier phrases from a large corpus. As a result, a set of Noun-Classifier Associations (described in Section 3) is statistically created to define the relationship between a noun and a classifier in a classifier phrase. With the frequency of occurrence of a classifier in a classifier phrase, we can propose the most appropriate use of a classifier. Furthermore, we introduce a hierarchy of semantic classes for the induction of a classifier class when classifiers are used in construction with nouns which belong to the same class of meaning. Section 3 and Section 4 describe the generation and the implementation of the NCA, respectively.</Paragraph>
    <Paragraph position="9"> 2. The roles of classifier in Thai language. In Thai language, we use classifiers in various situations. The classifier plays an important role in construction with a noun to express, for instance, ordinal or pronoun. The classifier phrase is syntactically generated according to a specific pattern. Fig. 2.1 shows the position of a classifier in each pattern, where N stands for noun, NCNM stands for cardinal number, CL stands for classifier, DET stands for determiner, VATF stands for attributive verb, REL_M stands for relative marker, ITR_M stands for interrogative marker, DONM stands for ordinal numeral, and DDAC stands for definite demonstrative. From a study of the use of classifiers in each expression mentioned above, we can conclude that the types of classifier are not restricted to any kind of expression. Considering the semantic representation of each expression, the unit classifier is not regarded as a conceptual unit in any expression except pattern 6, but the other types are (see examples in a. and b.).  We encountered the problem of generating the appropriate classifier for a noun or verb in a semantic representation. The classifier assignment for non-conceptual representation and the classifier selection for one-to-many conceptual representation are beyond what the rule-based approach can handle. Classifier assignment using the corpus-based method, which we propose here, is another approach. Based on the collocation of noun and classifier in each pattern shown in Fig. 2.1, we decided to construct the Noun Classifier Association table (see Section 3). A stochastic method combined with the concept hierarchy is proposed as a strategy for building the NCA table. The table consists of information about noun-classifier collocation, statistical occurrences and the representative classifier for each semantic class in the concept hierarchy.</Paragraph>
    <Paragraph position="10"> 3. Extraction of Noun-Classifier Collocation. In this section, we describe the algorithm used for extraction of Noun Classifier Associations (NCA) from a large corpus. We used a 40 megabyte Thai corpus collected from various areas to create the table. The algorithm is as follows: Step 1: Word segmentation.</Paragraph>
    <Paragraph position="11"> Input: A corpus.</Paragraph>
    <Paragraph position="12"> Output: The word-segmented corpus.</Paragraph>
    <Paragraph position="13"> In text processing, we often need word boundary information for several purposes. Because Thai has no explicit marker to separate words from one another, we have to preprocess the corpus with a word segmentation program. We used the program developed by Sornlertlamvanich (1993), with post-editing to correct faulty segmentation. The program employs heuristic rules of longest matching and least word count, incorporated with character combining rules for Thai words. Though the accuracy of the word segmentation does not reach 100%, it is high enough (more than 95%) to reduce the post-editing time.</Paragraph>
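The longest-matching heuristic mentioned above can be illustrated with a minimal sketch. This is not the segmentation program cited in the text: it omits the least-word-count rule and the character combining rules, and the function name, the `max_len` parameter and the romanized toy lexicon are assumptions made only for illustration (real input would be Thai script).

```python
def longest_match_segment(text, lexicon, max_len=10):
    """Greedy longest-matching segmentation over an unspaced string."""
    words = []
    i = 0
    while i != len(text):
        # Try the longest candidate first and shrink until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit one character as unknown.
            words.append(text[i])
            i += 1
    return words


# Hypothetical romanized toy lexicon, for illustration only.
TOY_LEXICON = {"rya", "nung", "lam", "nok", "soong"}
segmented = longest_match_segment("ryanunglam", TOY_LEXICON)
print(segmented)
```

A purely greedy matcher of this kind can pick a wrong boundary when a long lexicon entry overlaps two shorter words, which is presumably one reason the cited program adds further heuristics and why post-editing was still needed.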
    <Paragraph position="14"> Step 2: Tagging.</Paragraph>
    <Paragraph position="15"> Input: Output of step 1.</Paragraph>
    <Paragraph position="16"> Output: The corpus in which each word is tagged with a part of speech and a semantic class.</Paragraph>
    <Paragraph position="17"> The word-segmented corpus is then processed with a stochastic part-of-speech tagger. Each word w together with its part of speech is then used to retrieve the semantic class of the word from a dictionary. The result yields a data structure of (w,p,s), where p denotes the part of speech of w and s denotes the semantic class of w. For example, the data structure of the word /nakrian/ 'student' is (nakrian, NCMN, person), where NCMN stands for common noun and person indicates that /nakrian/ belongs to the class of person. Step 3: Producing concordances.</Paragraph>
    <Paragraph position="18"> Input: Output of step 2, a given classifier cl.</Paragraph>
    <Paragraph position="19"> Output: All the fragments containing cl.</Paragraph>
    <Paragraph position="21"> Instead of picking up the data sentence by sentence, we extracted a fragment of data around the cl, because there is no explicit marker to indicate sentence boundaries. We used the range of -10 to +2 words around the cl in our experiments, which appeared to cover most of the co-occurrence patterns.</Paragraph>
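The window extraction of Step 3 can be sketched as follows, under stated assumptions: the corpus is a flat list of (w, p, s) triples as produced by Step 2, classifier occurrences carry the tag "CL", and the function name and toy data (including the semantic class label) are hypothetical.

```python
def concordance_fragments(tagged, cl, left=10, right=2):
    """Collect the token window from -left to +right around each hit of cl."""
    fragments = []
    for i, (word, pos, _sem) in enumerate(tagged):
        # Assumption: classifier uses of the word carry the tag "CL".
        if word == cl and pos == "CL":
            start = max(0, i - left)  # clip at the start of the corpus
            fragments.append(tagged[start:i + right + 1])
    return fragments


# Toy tagged corpus of (w, p, s) triples; the semantic class is invented.
TOY_TAGGED = [
    ("rya", "NCMN", "vehicle"),
    ("nung", "NCNM", None),
    ("lam", "CL", None),
]
fragments = concordance_fragments(TOY_TAGGED, "lam")
print(fragments)
```

The asymmetric default window mirrors the -10 to +2 range reported in the text; both bounds are parameters so that other ranges can be tried.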
    <Paragraph position="22"> Step 4: Pattern matching. Input: Output of step 3.</Paragraph>
    <Paragraph position="23"> Output: A list of noun-classifier pairs with frequency information of co-occurrences.</Paragraph>
    <Paragraph position="24"> In this step, the tagged corpus is matched with each pattern of classifier occurrences shown below, where N denotes noun, CL denotes classifier, NCNM denotes cardinal number, DET denotes determiner, VATF denotes attributive verb, /tu/, /sung/ and /nai/ are specific Thai words, A-B denotes a consecutive pair of A and B, and A--B denotes a possibly separated pair. In principle, A--B can be separated by several arbitrary words, but in our experiments we considered only separations by a relative pronoun phrase of no more than 5 words. This limits the search space of general cases to a manageable size, at the cost of some generality.</Paragraph>
    <Paragraph position="25"> The pattern matching process was carried out one by one with each pattern. For each pattern of the form A--B-C, the matching of the B-C pair was simple and was performed first. Next, the matching of the pair A--B was done by: 1. searching for the nearest A from B. If found, mark A1.</Paragraph>
    <Paragraph position="26"> 2. from B within a span of five, searching for the nearest relative pronoun. If found, mark p1 then go to 3. Otherwise, match A1.</Paragraph>
    <Paragraph position="27"> 3. further searching for the nearest A from p1.  If found, mark A2. If A2 is farther from B than A1, match A2. Otherwise, match A1.</Paragraph>
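The three search steps above can be sketched as follows. The sketch assumes that all searches run leftwards from B (the noun precedes the classifier phrase), that "a span of five" means the five tokens immediately left of B, and that relative pronouns carry a tag named "REL"; the function names and tag names are hypothetical.

```python
def nearest_left(tags, start, wanted, limit=None):
    """Index of the nearest tag in `wanted` strictly left of `start`, or None."""
    stop = -1 if limit is None else max(-1, start - limit - 1)
    for i in range(start - 1, stop, -1):
        if tags[i] in wanted:
            return i
    return None


def match_separated_pair(tags, b, a_tags=("N",), rel_tags=("REL",), span=5):
    """Return the index of the A matched with the B at index b, or None."""
    a1 = nearest_left(tags, b, a_tags)            # step 1: nearest A from B
    if a1 is None:
        return None
    p1 = nearest_left(tags, b, rel_tags, span)    # step 2: relative pronoun
    if p1 is None:
        return a1
    a2 = nearest_left(tags, p1, a_tags)           # step 3: nearest A from p1
    if a2 is not None and b - a2 > b - a1:
        return a2                                 # A2 is farther from B than A1
    return a1


# Toy tag sequence: head noun, relative clause containing a noun, numeral, CL.
toy_tags = ["N", "REL", "V", "N", "NCNM", "CL"]
matched = match_separated_pair(toy_tags, 4)
print(matched)
```

In the toy sequence, the noun nearest to the numeral sits inside the relative clause, so the procedure skips it and matches the head noun in front of the relative pronoun.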
    <Paragraph position="28"> At the end of these steps, we obtained, for each matching pattern, a list of nouns along with their frequencies of co-occurrence with classifiers in the corpus (see Fig. 3.1 for sample outputs). Each entry is of the form (W_N1, CL_N2, Freq), where W denotes a noun, N1 denotes a number representing the semantic class of W, CL denotes the associated classifier, N2 is a number indicating whether CL is a unit or collective classifier (1 for unit, 2 for collective) and Freq denotes the frequency of co-occurrence between W and CL. The semantic classes are shown in Fig. 3.2.</Paragraph>
    <Paragraph position="29"> Step 5: Determine representative classifier. Input: A list of noun-classifier pairs with frequency information of co-occurrence.</Paragraph>
    <Paragraph position="30"> Output: Representative classifier of each noun and each semantic class of nouns.</Paragraph>
    <Paragraph position="31"> As can be observed in Fig. 3.1, each noun may be used with several possible classifiers. In the language generation process, however, we have to select only one of them. For each noun, we select the classifier with the greatest co-occurrence frequency as the representative, separately for the representative unit classifier and the representative collective classifier. For the noun in Fig. 3.1, for example, the most frequent classifier marked _1 is the representative unit classifier and the most frequent classifier marked _2 is the representative collective one. Collective classifiers are used instead of unit classifiers when the notion of 'group' is required.</Paragraph>
    <Paragraph position="32"> We also find the representative classifier for each semantic class of nouns in the same manner. For each semantic class of nouns (grouped by the semantic class attached to each noun), the classifier with the greatest co-occurrence frequency is selected as the representative. This classifier is used to assign a classifier to a noun which does not occur in the training corpus. For example, the representative unit classifiers for each semantic class extracted by the pattern (N--NCNM-CL) are shown in Fig. 3.3.</Paragraph>
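Step 5 can be sketched as follows. The sketch assumes that each entry has been unpacked into the tuple (W, N1, CL, N2, Freq), and that the frequency for a semantic class is the sum over all nouns of that class; the text does not say whether frequencies are summed or compared entry by entry. All names and counts below are invented placeholders.

```python
from collections import Counter, defaultdict


def representatives(entries, cl_type=1):
    """Pick the most frequent classifier per noun and per semantic class.

    cl_type follows N2 in the text: 1 for unit, 2 for collective.
    """
    by_noun = defaultdict(Counter)
    by_class = defaultdict(Counter)
    for noun, sem_class, cl, kind, freq in entries:
        if kind == cl_type:
            by_noun[noun][cl] += freq
            # Assumption: class-level frequency is summed over its nouns.
            by_class[sem_class][cl] += freq
    noun_rep = {n: c.most_common(1)[0][0] for n, c in by_noun.items()}
    class_rep = {s: c.most_common(1)[0][0] for s, c in by_class.items()}
    return noun_rep, class_rep


def assign_classifier(noun, sem_class, noun_rep, class_rep):
    """Use the noun's own representative, else fall back to its class."""
    if noun in noun_rep:
        return noun_rep[noun]
    return class_rep.get(sem_class)


# Invented placeholder entries (W, N1, CL, N2, Freq), for illustration only.
TOY_ENTRIES = [
    ("noun_a", 111, "cl_x", 1, 12),
    ("noun_a", 111, "cl_y", 1, 3),
    ("noun_a", 111, "cl_g", 2, 5),
    ("noun_b", 111, "cl_y", 1, 4),
]
unit_noun_rep, unit_class_rep = representatives(TOY_ENTRIES, cl_type=1)
print(assign_classifier("noun_c", 111, unit_noun_rep, unit_class_rep))
```

The fallback in `assign_classifier` corresponds to the case described above, where a noun is absent from the corpus and the representative classifier of its semantic class is used instead.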
  </Section>
</Paper>