<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2100">
  <Title>Learning and Natural Language Processing: A</Title>
  <Section position="3" start_page="779" end_page="780" type="metho">
    <SectionTitle>
2 Challenges of POS Tagging in Hindi
</SectionTitle>
    <Paragraph position="0"> The inter-POS ambiguity surfaces when a word or a morpheme displays an ambiguity across POS categories. Such a word has multiple entries in the lexicon (one for each category). After stemming, the word would be assigned all possible POS tags based on the number of entries it has in the lexicon. The complexity of the task can be understood by looking at the following English sentence, where the word 'back' falls into three different POS categories: I get back to the back seat to give rest to my back.</Paragraph>
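The lexicon lookup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the lexicon contents and tag names are assumptions chosen to match the 'back' example.

```python
# Illustrative lexicon: a word has one entry per POS category it can take,
# so lookup returns every candidate tag (inter-POS ambiguity).
LEXICON = {
    "back": ["Adverb", "Adjective", "Noun"],  # "get back" / "back seat" / "my back"
    "seat": ["Noun"],
    "rest": ["Noun", "Verb"],
}

def candidate_tags(word):
    """Return all POS tags the lexicon licenses for a word."""
    return LEXICON.get(word.lower(), ["Unknown"])
```

After stemming, each token would carry this full tag set until a later disambiguation stage picks one.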
    <Paragraph position="1"> The complexity further increases when it comes to tagging a free-word-order language like Hindi, where almost all permutations of words in a clause are possible (Shrivastava et al., 2005). This phenomenon makes the task of a stochastic tagger difficult.</Paragraph>
    <Paragraph position="2"> Intra-POS ambiguity arises when a word has one POS with different feature values, e.g., the word 'laDke' (boy/boys) in Hindi is a noun but can be analyzed in two ways in terms of its feature values:  1. POS: Noun, Number: Sg, Case: Oblique. maine laDke ko ek aam diyaa.</Paragraph>
    <Paragraph position="3"> I-erg boy to one mango gave.</Paragraph>
    <Paragraph position="4"> I gave a mango to the boy.</Paragraph>
    <Paragraph position="5"> 2. POS: Noun, Number: Pl, Case: Direct</Paragraph>
    <Paragraph position="7"> laDke aam khaate hain.</Paragraph>
    <Paragraph position="8"> Boys mangoes eat.</Paragraph>
    <Paragraph position="9"> Boys eat mangoes.</Paragraph>
    <Paragraph position="10">  One of the dif cult tasks here is to choose the appropriate tag based on the morphology of the word and the context used. Also, new words appear all the time in the texts. Thus, a method for determining the tag of a new word is needed when it is not present in the lexicon. This is done using context information and the information coded in the af xes, as af xes in Hindi (especially in nouns and verbs) are strong indicators of a word's POS category. For example, it is possible to determine that the word 'a37 a20a22a17a4a38a39a20 ' {jaaegaa} (will go) is a verb, based on the environment in which it appears and the knowledge that it carries the in ectional suf x -a17a4a38a39a20 {egaa} that attaches to the base verb 'a37 a20 ' {jaa}.</Paragraph>
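The affix-based guess for unknown words can be sketched as below. The suffix table is an illustrative stand-in (only '-egaa' comes from the paper's example); the real rules are far richer.

```python
# Hedged sketch: tag an out-of-lexicon word as a Verb if it ends in a known
# verbal suffix and the remainder is a known verb base, as with
# 'jaaegaa' = 'jaa' + '-egaa'. The suffix list is an assumption.
VERB_SUFFIXES = ["egaa", "egii", "naa", "taa", "tii"]

def guess_pos(word, known_verb_roots):
    for suf in VERB_SUFFIXES:
        if word.endswith(suf):
            root = word[: -len(suf)]
            if root in known_verb_roots:  # suffix attaches to a known base verb
                return ("Verb", root, suf)
    return ("Unknown", word, "")
```

In the full system, context information would also be consulted before committing to the guess.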
    <Section position="1" start_page="780" end_page="780" type="sub_section">
      <SectionTitle>
2.1 Ambiguity Schemes
</SectionTitle>
      <Paragraph position="0"> The criterion for deciding whether the tag of a word is a Noun or a Verb is entirely different from the one for deciding whether a word is an Adjective or an Adverb.</Paragraph>
      <Paragraph position="1"> For example, the word 'a40a42a41' can occur as a conjunction, a post-position or a noun (as shown previously); hence it falls in the Ambiguity Scheme 'Conjunction-Noun-Postposition'. We grouped all the ambiguous words into sets according to the Ambiguity Schemes that are possible in Hindi, e.g., Adjective-Noun, Adjective-Adverb, Noun-Verb, etc. This idea was first proposed by Orphanos et al. (1999) for Modern Greek POS tagging.</Paragraph>
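The scheme-grouping step can be sketched as follows; deriving a canonical scheme name by sorting the tag set matches the 'Conjunction-Noun-Postposition' naming above. The example lexicon entries are hypothetical.

```python
from collections import defaultdict

def ambiguity_scheme(tags):
    """Canonical Ambiguity Scheme name: sorted, hyphen-joined tag set."""
    return "-".join(sorted(set(tags)))

def group_by_scheme(lexicon):
    """Group ambiguous words (more than one POS) under their scheme."""
    groups = defaultdict(list)
    for word, tags in lexicon.items():
        if len(set(tags)) > 1:  # unambiguous words form no scheme
            groups[ambiguity_scheme(tags)].append(word)
    return dict(groups)
```

Each scheme later gets its own rule-set, so disambiguation rules for, say, Adjective-Noun words never compete with Noun-Verb rules.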
    </Section>
  </Section>
  <Section position="4" start_page="780" end_page="781" type="metho">
    <SectionTitle>
3 Morphological Structure Of Hindi
</SectionTitle>
    <Paragraph position="0"> In Hindi, Nouns inflect for number and case.</Paragraph>
    <Paragraph position="1"> To capture their morphological variations, they can be categorized into various paradigms (Narayana, 1994) based on their vowel ending, gender, number and case information. We have a list of around 29,000 Hindi nouns that are categorized into such paradigms. Looking at the morphological patterns of the words in a paradigm, suffix-replacement rules have been developed.</Paragraph>
    <Paragraph position="2"> These rules help in separating out a valid suffix from an inflected word to output the correct stem and, consequently, get the correct root.</Paragraph>
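A suffix-replacement rule can be sketched as a pair (inflected ending, root ending): strip the first, append the second, and accept the result if it is a lexicon root. The two rules below are illustrative stand-ins, not the paper's actual rule set.

```python
# Hedged sketch of suffix-replacement stemming. The rule ("e", "aa")
# recovers 'laDkaa' from 'laDke'; both rules here are assumptions.
SRR = [("oM", "aa"), ("e", "aa")]  # (inflected ending, root ending)

def stem(word, lexicon_roots):
    """Return the root of an inflected word, or the word itself if it is
    already a root, or None if no rule yields a known root."""
    for infl, root_end in SRR:
        if word.endswith(infl):
            root = word[: -len(infl)] + root_end
            if root in lexicon_roots:
                return root
    return word if word in lexicon_roots else None
```

Validating the candidate root against the lexicon is what makes blind suffix stripping safe.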
    <Paragraph position="3"> Hindi Adjectives may be inflected or uninflected, e.g., 'chamkiilaa' (shiny), 'acchaa' (nice) and 'lambaa' (long) inflect based on the number and case values of their head nouns, while 'sundar' (beautiful) remains uninflected.</Paragraph>
    <Paragraph position="5"> Hindi Verbs inflect for the following grammatical properties (GNPTAM):  1. Gender: Masculine, Feminine, Non-specific 2. Number: Singular, Plural, Non-specific 3. Person: 1st, 2nd and 3rd 4. Tense: Past, Present, Future 5. Aspect: Perfective, Completive, Frequentative, Habitual, Durative, Inceptive, Stative 6. Modality: Imperative, Probabilitive, Subjunctive, Conditional, Deontic, Abilitive.  The morphemes attached to a verb, along with their corresponding analyses, help identify values for the GNPTAM features of a given verb form. Division of Information Load in Hindi Verb Groups: A Verb Group (VG) primarily comprises a main verb and auxiliaries. Constituents like particles, negation markers, conjunctions, etc. can also occur within a VG. It is important to know how much of the GNPTAM feature information is stored in VG constituents individually and what the load division is in the absence or presence of auxiliaries. In a Hindi VG, when there is no auxiliary present, the complete information load falls on the main verb, which carries information for the GNPTAM features. In the presence of auxiliaries, the load gets shared between the main verb and the auxiliaries, and is represented in the form of different morphemes (inflected or uninflected), e.g., in the sentence: main bol paa rahaa hoon (I am able to speak) 1. The main verb 'bol' is uninflected and does not carry any information for any of the GNPTAM features.</Paragraph>
    <Paragraph position="6"> 2. 'paa' is uninflected and gives modality information, i.e., Abilitive.</Paragraph>
    <Paragraph position="7"> In Hindi, the attachment of verbal suffixes like 'naa' and 'ne' to a verb root results either in a gerund like 'tairnaa' (swimming) or in an infinitival verb form like 'tairnaa' (to swim). We observed that it is easy to detect a gerund if it is followed by a case-marker or by any other infinitival verb form.</Paragraph>
  </Section>
  <Section position="5" start_page="781" end_page="781" type="metho">
    <SectionTitle>
4 Design of Hindi POS Tagger
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="781" end_page="781" type="sub_section">
      <SectionTitle>
4.1 Morphology Driven Tagger
</SectionTitle>
      <Paragraph position="0"> The morphology-driven tagger makes use of the affix information stored in a word and assigns a POS tag using no contextual information. Although it does take into account the previous and the next word in a VG to correctly identify the main verb and the auxiliaries, other POS categories are identified through lexicon lookup of the root form. The current lexicon has around 42,000 entries belonging to the major categories mentioned in Figure 3. The format of each entry is &lt;word&gt;,&lt;paradigm&gt;,&lt;category&gt;.</Paragraph>
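Loading entries in the stated &lt;word&gt;,&lt;paradigm&gt;,&lt;category&gt; format can be sketched as below; the sample lines and paradigm labels are hypothetical, but the one-line-per-category layout follows from the description of multiple entries per ambiguous word.

```python
# Sketch of lexicon loading: each line is "word,paradigm,category";
# an ambiguous word contributes one line per POS category.
def load_lexicon(lines):
    lex = {}
    for line in lines:
        word, paradigm, category = line.strip().split(",")
        lex.setdefault(word, []).append((paradigm, category))
    return lex
```

A word's list length then directly gives the number of candidate tags it will receive.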
      <Paragraph position="1"> The process does not involve learning or disambiguation of any sort and is completely driven by hand-crafted morphology rules. The architecture of the tagger is shown in Figure 1. (The lexicon was created using the wordlist from Hindi Wordnet (http://www.cfilt.iitb.ac.in/wordnet/webhwn/) and a partial noun list from Anusaraka; it is being enhanced by adding new words from the corpus and removing inconsistencies.) The work progresses at two levels:</Paragraph>
      <Paragraph position="2"> 1. At Word Level: A stemmer is used in conjunction with the lexicon and Suffix Replacement Rules (SRRs) to output all possible root-suffix pairs along with a POS category label for a word. There is a possibility that the input word is not found in the lexicon and does not carry any inflectional suffix. In such a case, derivational morphology rules are applied.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="781" end_page="783" type="metho">
    <SectionTitle>
2. At Group Level
</SectionTitle>
    <Paragraph position="0"> At this level, a Morphological Analyzer (MA) uses the information encoded in the extracted suffix to add morphological information to the word. For nouns, the information provided by the suffixes is restricted to 'Number'; 'Case' can be inferred later by looking at the neighbouring words.</Paragraph>
    <Paragraph position="1"> For verbs, GNP values are found at the word level, while TAM values are identified during the VG Identification phase, described later. The analysis of the suffix is done in a discrete manner, i.e., each component of the suffix is analyzed separately. A morpheme analysis table comprising individual morphemes with their paradigm information and analyses is used for this purpose. MA's output for the word 'khaaoongii' (will eat) looks like - The structure of a Hindi VG is relatively rigid and can be captured well using simple syntactic rules. In Hindi, certain auxiliaries like 'rah', 'paa', 'sak' or 'paD' can also occur as main verbs in some contexts.</Paragraph>
    <Paragraph position="2"> VG identification deals with identifying the main verb and the auxiliaries of a VG while discounting particles, conjunctions and negation markers. The VG identification goes left to right, marking the first constituent as the main verb or copula verb and making every other verb construct an auxiliary till a non-VG constituent is encountered. The main verb and copula verb can take the head position of a VG and can occur with or without auxiliary verbs. Auxiliary verbs, on the other hand, always come along with a main verb or a copula verb. This results in a very high accuracy of 99.5% for verb auxiliaries. Ambiguity between a main verb and a copula verb remains unresolved at this level and calls for disambiguation rules.</Paragraph>
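The left-to-right pass described above can be sketched as follows. Representing tokens as (word, is_verb) pairs is an assumption for illustration; the paper's constituents carry richer analyses.

```python
# Hedged sketch of VG identification: the first verb constituent is marked
# MAIN, every following verb construct AUX, until a non-VG token closes
# the group (and a later verb starts a new VG).
def identify_vg(tokens):
    labels, in_vg = [], False
    for word, is_verb in tokens:
        if is_verb:
            labels.append((word, "MAIN" if not in_vg else "AUX"))
            in_vg = True
        else:
            labels.append((word, "OTHER"))
            in_vg = False  # non-VG constituent ends the verb group
    return labels
```

On the earlier example 'bol paa rahaa hoon', this marks 'bol' as the main verb and the rest as auxiliaries; the main-verb vs. copula distinction would need the separate disambiguation rules mentioned above.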
    <Section position="1" start_page="782" end_page="783" type="sub_section">
      <SectionTitle>
4.2 Need for Disambiguation
</SectionTitle>
      <Paragraph position="0"> The accuracy obtained by the simple lexicon-lookup-based approach (LLB) comes out to be 61.19%.</Paragraph>
      <Paragraph position="1"> The morphology-driven tagger, on the other hand, performs better than plain lexicon lookup but still leaves considerable ambiguity. These results are significant as they present a strong case in favor of using detailed morphological analysis. A similar observation has been presented by Uchimoto et al. (2001) for Japanese.</Paragraph>
      <Paragraph position="2"> According to the tagging performed by the SRRs and the lexicon, a word receives n tags if it belongs to n POSs. If we consider multiple tags for a word as an error of the tagger (even when the options contain the correct tag for the word), then the accuracy of the tagger comes to 73.62% (as shown in Table 1). The goal is to keep the contextually appropriate tag and eliminate the others, which can be achieved by devising a disambiguation technique. The disambiguation task can be naively addressed by choosing the most frequent tag for a word. This approach is also known as baseline (BL) tagging. The baseline accuracy turns out to be 82.63%, which is still higher than that of the morphology-driven tagger. The drawback of baseline tagging is that its accuracy cannot be further improved. On the other hand, there is ample room for improving the accuracy of the morphology-driven (MD) tagger. It is quite evident that though the MD tagger works well for VGs and many closed categories, around 30% of the words are either ambiguous or unknown. Hence, a disambiguation stage is needed to boost the accuracy.</Paragraph>
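The baseline (BL) tagger described above reduces to counting tags per word in training data and always emitting the most frequent one, a floor any learned disambiguator must beat. A minimal sketch:

```python
from collections import Counter, defaultdict

# Sketch of baseline (most-frequent-tag) tagging: train by counting
# (word, tag) pairs, tag by picking each word's majority tag.
def train_baseline(tagged_corpus):
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}
```

Its accuracy is capped because it can never pick a minority tag even when context demands it, which is the drawback noted above.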
      <Paragraph position="3"> The common choice for disambiguation-rule learning in the POS tagging task is machine learning techniques, mainly focusing on decision-tree-based algorithms (Orphanos and Christodoulakis, 1999), neural networks (Schmid, 1994), etc. Among the various decision-tree-based algorithms like ID3, AQR, ASSISTANT and CN2, CN2 is known to perform better than the rest (Clark and Niblett, 1989). Since no such machine learning technique had been used for Hindi, we chose CN2 as it performs well on noisy data.</Paragraph>
      <Paragraph position="4">  We set up a corpus by collecting sentences from the BBC news site and let the morphology-driven tagger assign morphosyntactic tags to all the words. For an ambiguous word, the contextually appropriate POS tag is manually chosen. Unknown words are assigned a correct tag based on their context and usage.</Paragraph>
      <Paragraph position="5">  Out of the completely manually corrected corpora of 15,562 tokens, we created training instances for each Ambiguity Scheme and for unknown words. These training instances take into account the POS categories of the neighbouring words and not the feature values. The experiments were carried out for different context window sizes ranging from 2 to 20 to find the best configuration.</Paragraph>
      <Paragraph position="6">  The rules are generated from the training corpora by extracting the ambiguity scheme (AS) of each word. If the word is not present in the lexicon, then its AS is set as 'unknown'. Once the AS is identified, a training instance is formed. This training instance contains the correct POS categories of the neighbours as attributes. The number of neighbours included in the training instance is the window size for CN2. After all the ambiguous words are processed and training instances for all seen ASs are created, the CN2 algorithm is applied over the training instances to generate actual rule-sets for each AS. The CN2 algorithm gives one set of If-Then rules (either ordered or unordered) for each AS, including 'unknown'. The AS of every ambiguous word is formed while tagging. The corresponding rule-set for that AS is then identified and traversed to get the contextually appropriate rule. The resultant category output by this rule is then assigned to the ambiguous word. The traversal differs for the ordered and unordered implementations. The POS of an unknown word is guessed by traversing the rule-set for unknown words and assigning it the resultant tag.</Paragraph>
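Building one training instance for an ambiguous word can be sketched as below: the attributes are the correct POS tags of its neighbours inside the window (window size 4 means two neighbours on either side), with padding at sentence boundaries. The padding symbol and tag names are assumptions.

```python
# Sketch of CN2 training-instance construction: for the word at position i,
# collect the tags of its window-size neighbours (excluding the word itself),
# padding where the window runs off the sentence.
def training_instance(tags, i, window=4):
    half = window // 2
    ctx = []
    for j in range(i - half, i + half + 1):
        if j == i:
            continue
        ctx.append(tags[j] if 0 <= j < len(tags) else "PAD")
    return ctx
```

One such instance per occurrence of each ambiguous word, grouped by AS, forms the input on which CN2 induces its If-Then rule-sets.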
    </Section>
  </Section>
  <Section position="7" start_page="783" end_page="783" type="metho">
    <SectionTitle>
5 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> The experimentation involved, first, identifying the best parameter values for the CN2 algorithm and, second, evaluating the performance of the disambiguation rules generated by CN2 for the POS tagging task.</Paragraph>
    <Section position="1" start_page="783" end_page="783" type="sub_section">
      <SectionTitle>
5.1 CN2 Parameters
</SectionTitle>
      <Paragraph position="0"> The various parameters of the CN2 algorithm are: rule type (ordered or unordered), star size, significance threshold and size of the training instances (window size). The best results are empirically achieved with ordered rules, a star size of 1, a significance threshold of 10 and a window size of 4, i.e., two neighbours on either side are used to generate the training instances.</Paragraph>
    </Section>
    <Section position="2" start_page="783" end_page="783" type="sub_section">
      <SectionTitle>
5.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> The tests are performed on contiguous partitions of the corpora (15,562 words): 75% training set and 25% testing set.</Paragraph>
      <Paragraph position="1"> Accuracy = (no. of single correct tags) / (total no. of tokens). The results are obtained by performing a 4-fold cross-validation over the corpora. Figure 2 gives the learning curve of the disambiguation module for varying corpus sizes, starting from 1000 tokens up to the complete training corpus size. The accuracy for known and unknown words is also measured separately.</Paragraph>
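The accuracy measure and the contiguous 4-fold split can be sketched as follows. Representing a prediction as a list of candidate tags (so only single, correct tags count) matches the definition above; the data shapes are otherwise assumptions.

```python
# Sketch: accuracy counts only tokens that received exactly one tag and
# that tag matches the gold standard.
def accuracy(predicted, gold):
    correct = sum(1 for p, g in zip(predicted, gold)
                  if len(p) == 1 and p[0] == g)
    return correct / len(gold)

def four_folds(tokens):
    """Contiguous 4-fold partition; each fold serves once as the 25% test slice."""
    n = len(tokens)
    return [tokens[i * n // 4:(i + 1) * n // 4] for i in range(4)]
```

A token left with multiple candidate tags is counted as wrong even if the correct tag is among them, which is why the pre-disambiguation figure of 73.62% is so low.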
    </Section>
  </Section>
</Paper>