<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1202"> <Title>Rajeev Sangal, Durgesh D Rao, LERIL : Collaborative Effort for Creating Lexical Resources, In Proc. of Workshop on Language Resources in</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> DEFAULT CONVENTIONS are left </SectionTitle> <Paragraph position="0"> unmarked. For example, the adjective 'nIlI' ('blue') of 'kitAba' ('book') has been left unmarked in the above example, since noun modifiers normally precede the noun they modify (adjectives precede nouns).</Paragraph> <Paragraph position="1"> Such DEFAULT CONVENTIONS save unnecessary typing effort.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. GRAMMATICAL MODEL </SectionTitle> <Paragraph position="0"> It was quite natural to use the Paninian grammatical model for sentence analysis (hence the tag names) because: 1) The Paninian grammatical model is based on the analysis of an Indian language (Sanskrit), so it can deal better with the types of constructions Indian languages have.</Paragraph> <Paragraph position="1"> 2) The model not only offers a mechanism for SYNTACTIC analysis but also incorporates SEMANTIC information (nowadays called dependency analysis), thus making the relationships more transparent.</Paragraph> <Paragraph position="2"> (For details, refer to Bharati (1995).) The following tags (most of which are based on the Paninian grammatical model) have</Paragraph> <Paragraph position="4"> Obviously the task is not an easy one.</Paragraph> <Paragraph position="5"> Standardization of these tags will take some time. Many issues arise while deciding the tags. Some examples are given below to show the kinds of structures that the linear tagging scheme will have to deal with.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1. 
Multiple Verb Sentences </SectionTitle> <Paragraph position="0"> Marking the noun-verb relations with the above tags in single-verb sentences is a simple task. However, consider the following sentence with two verbs: 1: rAma ne khAnA khAkara 'Ram' 'postp' 'food' 'having_eaten' pAnI pIyA 'water' 'drank' `Ram drank water after eating the food.` Sentence 1 has two verbs: one non-finite (khAkara) and one finite (piyA). The finite verb is the main verb. The noun 'khAnA' is the object of the verb 'khAkara', whereas the noun 'pAnI' is the object of the verb 'piyA'. 'k2' is the tag for the object relation in our tagging scheme. Co-indexing is the obvious solution for such multiple relations. Since there are two verbs, the tagging scheme allows them to be named 'i' and 'j'. By default, 'i' refers to the main verb and any successive verb is named by another character ('j' in the present case):</Paragraph> <Paragraph position="2"> This provides the facility to mark every noun-verb relationship.</Paragraph> <Paragraph position="3"> rAma_ne/k1>i khAnA/k2>j khAkara::vkr:j pAnI/k2>i piyA::v:i Fortunately, there is no need to mark it so &quot;heavily&quot;. A number of notations can be left out, and the DEFAULT rules tell us how to interpret such &quot;abbreviated&quot; annotation. Thus, for the above sentence, the following annotation is sufficient and is completely equivalent to the above: rAma_ne/k1 khAnA/k2 khAkara::vkr:j pAnI/k2 piyA::v Even though there are two verbs, there is no need to name the verbs and refer to them. Two default rules help us achieve such brevity (without any ambiguity): (1) The karta or k1 kaaraka always attaches to the last verb in a sentence (thus 'rAma_ne/k1' attaches to the verb at the end).</Paragraph> <Paragraph position="4"> (2) All other kaarakas, except k1, attach to the nearest verb on the right. Thus 'khAnA/k2' attaches to 'khAkara' and 'pAnI/k2' attaches to 'piyA', their respective nearest verbs on the right. 4.2. 
Compound Units Sometimes two words combine to form a unit which has its own demands and modifiers, not derivable from its parts. For example, a noun and a verb join together to operate as a single unit, namely as a verb. In the sentence 'rAma (Rama) ne (postp) snAna (bath) kiyA (did)', 'snAna' and 'kiyA' together stand for the verb 'snAna+kiyA' (bathed). Such verbal compounds, like any other verb, have their own kaarakas. This sentence would be marked as follows: rAma_ne/k1 snAna::v+ kiyA::v- 'Ram_postp' 'bath+' 'did-' `Ram took a bath` A 'v+' or a 'v-' indicates that the word 'snAna' or 'kiyA' is part of a whole (a verb in this case). Taken together they function as a single verb unit. Such a device, which may appear to be more powerful, was needed to mark the 'single unitness' of parts which may appear separately in a sentence. Thus, the above notation allows even distant words to be treated as a single compound. Such occurrences are fairly common in all Indian languages, as illustrated in the following example from Hindi:</Paragraph> <Paragraph position="6"> explicitly. (A more detailed description of the notation is given in 5.1.)</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3. Embedded Sentence </SectionTitle> <Paragraph position="0"> Tags are also designed to mark the relations within a complex sentence.</Paragraph> <Paragraph position="1"> Consider the example below, where a complete sentence (having the verb 'piyA' (drank)) is a kaaraka of the main verb 'kaHA' (said).</Paragraph> <Paragraph position="2"> moHana ne kaHA ki {rAma 'Mohan' 'postp' 'said' 'that' {'Rama' ne pAnI khAnA khAkara 'postp' 'water' 'food' 'having eaten' piyA}.</Paragraph> <Paragraph position="3"> 'drank'} (Mohan said that Ram drank water after having eaten the food) The embedded sentence can first be marked as follows: {rAma_ne/k1 pAnI/k2>j khAnA/k2 khAkara::vkr piyA::v:j}::s. 
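The way explicit co-indexing combines with the two default rules of 4.1 in this annotation can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not part of the tagging scheme itself: the token pattern, the function names parse and resolve_heads, and the decision to treat every node tag beginning with 'v' as a verb are all hypothetical, and the enclosing braces and the '::s' tag are assumed to have been stripped before the string is passed in.

```python
import re

# Hypothetical token pattern (an assumption, not defined by the scheme):
# word, optional /relation, optional >index, optional ::node, optional :name.
TOKEN = re.compile(r"^([^/:>]+)(?:/(\w+))?(?:>(\w+))?(?:::([\w+-]+))?(?::(\w+))?$")


def parse(annotation):
    """Split an annotated sentence into one dict per token."""
    tokens = []
    for raw in annotation.split():
        match = TOKEN.match(raw)
        if match is None:
            raise ValueError("unrecognised token: " + raw)
        word, rel, index, node, name = match.groups()
        tokens.append({"word": word, "rel": rel, "index": index,
                       "node": node, "name": name})
    return tokens


def resolve_heads(annotation):
    """Return (dependent, relation, head) triples for every kaaraka."""
    tokens = parse(annotation)
    # Assumption: any node tag starting with 'v' (v, vkr, v+, v-) is a verb.
    verbs = [i for i, t in enumerate(tokens)
             if t["node"] and t["node"].startswith("v")]
    named = {t["name"]: t["word"] for t in tokens if t["name"]}
    triples = []
    for pos, tok in enumerate(tokens):
        if tok["rel"] is None:
            continue
        if tok["index"] is not None:
            # Explicit co-indexing: '>j' looks for the element named ':j'.
            head = named[tok["index"]]
        elif tok["rel"] == "k1":
            # Default rule (1): k1 attaches to the last verb.
            head = tokens[verbs[-1]]["word"]
        else:
            # Default rule (2): nearest verb on the right.
            head = next(tokens[v]["word"] for v in verbs if v > pos)
        triples.append((tok["word"], tok["rel"], head))
    return triples
```

Under these assumptions, calling resolve_heads on 'rAma_ne/k1 pAnI/k2>j khAnA/k2 khAkara::vkr piyA::v:j' attaches 'rAma_ne' and 'pAnI' to 'piyA' and 'khAnA' to 'khAkara'. The fully marked and the abbreviated annotations of sentence 1 in 4.1 resolve to the same triples. The sketch does no error recovery: a kaaraka with no verb to its right, or an index with no matching name, simply raises an exception.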
The whole embedded sentence is the 'karma' (object) or k2 of 'kaHA' (said). The relation of the embedded sentence as the object of the main verb is co-indexed in the following way : null Thus the device of naming the elements and co-indexing them with their respective arguments can be used most effectively.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. TAGGING SCHEME </SectionTitle> <Paragraph position="0"> The tagging scheme contains: notations, defaults, and tagsets.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5.1. NOTATION </SectionTitle> <Paragraph position="0"> Certain special symbols such as the double colon, underscore, parentheses, etc. are introduced first. Two sets of tags have been provided (to mark the crucial ARC and node information). However, apart from these symbols and tags, some special notation is required to explicitly mark certain disjointed, scattered and missing elements in a sentence. The following notation is adopted for marking these elements: 5.1.1. X+ ... X- : disjointed elements As shown above (4.2), when a single lexical unit composed of more than one element is separated by other intervening lexical units, its 'oneness' is expressed by using '+' on the first element in the linear order and '-' on the second element. '+' `Bathe I did, but got dirty again.' 'snAna_karanA' is one verb unit in Hindi, but its two components 'snAna' and 'karanA' can occur separately. The notation 'X+....X-' can capture the 'oneness' of these two elements. So 'snAna_karanA' ('bathe') in the above sentence would be marked as follows: snAna::v+ to mEMne 'bath' 'emph' 'I' kiyA_thA::v- para phira gaMdA 'did' 'but' 'again' 'dirty'</Paragraph> <Paragraph position="2"> Another example of 'scattered elements' is the 'agara .... 
to' construction of Hindi.</Paragraph> <Paragraph position="3"> agara tuma kaHate to mEM 'if' 'you' 'said' 'then' 'I'</Paragraph> <Paragraph position="5"> `Had you asked I would have come' 'agara' and 'to' together give the sense of 'conditionality'. Though they never occur linearly together, they have a 'oneness' of meaning. Their dependency on each other can also be expressed through the 'X+....X-' notation.</Paragraph> <Paragraph position="6"> agara::yo+ tuma kaHate to::yo- mEM A_jAtA (the tag 'yo' is for conjuncts) 5.1.2. >i ....:i : explicitly marked dependency (:i is the head) (a) Example -- Sentence 1a below has the dependency structure given in T-2. 1a. phala rAma ne 'fruit' 'Rama' 'Ergpostp' naHA_kara khAyA 'having_bathed' 'ate' 'Rama ate the fruit after taking a bath'</Paragraph> <Paragraph position="8"> Default (5.2.5) states that all kaarakas attach themselves to the nearest available verb on the right. In (1a) above, the nearest verb available to 'phala' (fruit) is 'naHA_kara'. However, 'phala' (fruit) is not the 'k2' of 'naHA_kara'. It is the 'k2' of the main verb 'khA'. Therefore, an explicit marking is required to show this relationship, and the notation '>i...:i' makes it explicit: phala/k2>i rAma_ne naHA_kara khAyA::v:i Here 'khAyA' is the 'head', thus marked ':i', and 'phala' is the dependent element, thus marked '>i'. An element marked '>i' always looks for another element marked ':i'. To show their attachment to 'Ora' (and), the three elements 'rAma', 'moHana', 'shyAma' have to be marked (as in 2b.) 
in the following way in our linear tagging scheme.</Paragraph> <Paragraph position="9"> rAma>i, moHana>i Ora::yo:i shyAma>i The justification for treating 'Ora' as the head and showing the 'wholeness' of all the elements joined by '>i' to ':i' is made explicit by the following example: rAma, Ora Haz, moHana Ora 'Rama' 'and' 'yeah', 'Mohana' 'and' shyAma Ae_the 'Shyama' 'had_come' In this case there is an intervening element 'Ora HAz' ('and_yeah') between 'rAma' and 'moHana' etc. So parentheses alone will not resolve the issue of grouping the constituents of a whole. (By parenthesising, elements which are not part of the whole will also be included.) To avoid this, 'Ora' (and) has to be treated as a head.</Paragraph> <Paragraph position="10"> 5.1.3. 0 : explicit marking of an ellipted (missing) element. Example: rAma bAjZAra gayA, moHana 'Rama' 'market' 'went' 'Mohana' ghara Ora Hari skUla 'home' 'and' 'Hari' 'school' 'Rama went to the market, Mohana home and Hari to the school.' The sentence above has two ellipted elements: the second and third occurrences of the verb 'gayA' ('went'). To draw a complete tree, the information about the missing elements is crucial here.</Paragraph> <Paragraph position="11"> The arguments 'moHana', 'ghara', 'Hari', and 'skUla' are left without a head, and their dependency cannot be shown unless we mark the 'ellipted' element.</Paragraph> <Paragraph position="12"> rAma bAjZAra gayA, moHana 'Rama' 'market' 'went', 'Mohana' ghara 0 Ora Hari skUla 0 'home' 'and' 'Hari' 'school' In cases where this information can be retrieved from some other source (DEFAULT), it need not be marked. In the above case it need not be marked.</Paragraph> <Paragraph position="13"> However, there may be cases where marking of the missing element is crucial to show various relationships. In such cases it has to be marked. Look at the have grown older and do not listen to anybody.' 
The above sentence does not have any explicit 'yojaka' (conjunct) between the two sentences, a) bacce baDZe Ho gaye HEM `kids have grown older' and b) kisI kI bAta naHIM mAnate `do not listen to anybody' Both these sentences together form the 'vAkyakarma' (sentential object) of the verb 'kaHate HEM' ('say').</Paragraph> <Paragraph position="14"> So the analysis would be - null It appears to be a neatly tagged sentence. However, some crucial information is missing from this analysis: the relationship between the two sentences within the larger sentential object is not expressed. The problem now is how to express it. The '>i...:i' notation can help. However, it needs the ':i' information, and since there is no explicit 'yojaka' (conjunct) element between the two sentences, it will not be possible to mark it. The information about the presence of a 'yojaka' (conjunct), which is the head of a co-ordinate structure, is CRUCIAL here. Without it the dependency tree cannot be drawn. The notation '0' can be of help in such situations. '0' can be marked in the appropriate place. This will allow the tagging of the dependent elements. Therefore, the revised tagging Here the information about the missing conjunct has been marked by a '0'.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 5.2. DEFAULTS </SectionTitle> <Paragraph position="0"> Apart from tagsets and special notations, the scheme also relies on certain defaults.</Paragraph> <Paragraph position="1"> Defaults have been specified to save typing by the human annotator. For example, no sentence has to be marked by a sentence tag unless it is crucial for the dependency analysis. For example: rAma ne yaHa socA ki 'Rama' 'postp' 'this' 'thought' 'that' moHana AegA 'Mohana' 'would_come' `Rama thought that Mohana would come' This is a complex sentence where the subordinate sentence is the object complement of the verb 'socA' ('thought'). 
To indicate the relation of the subordinate clause with the main verb, it has to be marked.</Paragraph> <Paragraph position="2"> Similarly, within square brackets, the rightmost element is the head, so there is no need to mark it. A postposition's attachment to the previous noun is also covered by a default rule. There are other defaults which take care of modifier-modified relationships. In short, the general rules have been accounted for by defaults and only the specific relations have to be marked. Elements preceding the head within the brackets are to be accepted as modifiers of the head.</Paragraph> <Paragraph position="3"> However, in case the number of elements within the brackets is more than two (head plus two) and one or more of them do not modify the head, this should be marked.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Example - [HalkI nIlI kitAba], </SectionTitle> <Paragraph position="0"> 'light' 'blue' 'book' Here, 'halkI' ('light') can qualify both 'nIlI' ('blue') and 'kitAba' ('book'). In case it is modifying 'kitAba' ('book'), say, in terms of light weight, then it should be left unmarked. But if it modifies 'nIlI' ('blue'), in terms of light shade, then it SHOULD be marked by adding '>' on the right of the modifying element.</Paragraph> <Paragraph position="1"> 'halkI' [HalkI> nIlI kitAba].</Paragraph> <Paragraph position="2"> 'light' ['light'> 'blue' 'book'] Let us look at another case where the dependency has to be explicitly marked.</Paragraph> <Paragraph position="3"> The participle form 'tA_HuA' in Hindi can modify either a noun or a verb. For example, take the Hindi sentence - null The tagsets used here have been divided into two categories: 1) TAGSET-1 - Tags which express relationships are marked by a preceding '/'. 
For example, kaarakas are grammatical relationships; thus they are marked '/k1', '/k2', '/k3', etc.</Paragraph> <Paragraph position="4"> 2) TAGSET-2 - Tags expressing nodes are marked by a preceding '::'. Verbs etc. are nodes, so they will be marked '::v'. Certain conventions regarding the naming of the tags are: k = kaaraka -- all the kaaraka tags begin with k, therefore k1, k2, k3, etc.</Paragraph> <Paragraph position="5"> n = noun v = verb -- e.g. v, vkr, etc.</Paragraph> </Section> </Section> </Paper>