<?xml version="1.0" standalone="yes"?> <Paper uid="P93-1015"> <Title>Parsing Free Word Order Languages in the Paninian Framework</Title> <Section position="4" start_page="105" end_page="106" type="metho"> <SectionTitle> TAM LABEL TRANSFORMED VIBHAKTI FOR KARTA </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> Fig. 3 gives some transformation rules for the default mapping for Hindi. It explains the vibhakti in sentences B.1 to B.4, where Ram is the karta in each but carries a different vibhakti: φ (no marker), ne, ko, and se, respectively.</Paragraph> <Paragraph position="3"> In each of the sentences, if we transform the karaka chart of Fig. 2 by the transformation rules of Fig. 3, we get the desired vibhakti for the karta Ram.</Paragraph> <Paragraph position="4"> B.1 rAma Pala ko KAtA hE. (Ram fruit -ko eats is) (Ram eats the fruit.)</Paragraph> <Paragraph position="5"> B.2 rAma ne Pala KAyA. (Ram -ne fruit ate) (Ram ate the fruit.)</Paragraph> <Paragraph position="6"> B.3 rAma ko Pala KAnA padA. (Ram -ko fruit eat had-to) (Ram had to eat the fruit.)</Paragraph> <Paragraph position="7"> B.4 rAma se Pala nahI KAyA gayA. (Ram -se fruit not eat could) (Ram could not eat the fruit.) In general, the transformations affect not only the vibhakti of the karta but also that of other karakas. They also 'delete' karaka roles at times; that is, the 'deleted' karaka roles must not occur in the sentence. The Paninian framework is similar to the broad class of case-based grammars. What distinguishes it is the use of karaka relations rather than theta roles, and the neat dependence of the karaka-vibhakti mapping on TAMs and the transformation rules, in the case of Indian languages. The same principle also solves the problem of karaka assignment for complex sentences (discussed later in Sec. 3).</Paragraph> </Section> <Section position="5" start_page="106" end_page="107" type="metho"> <SectionTitle> 2 Constraint Based Parsing </SectionTitle> <Paragraph position="0"> The Paninian theory outlined above can be used for building a parser. The first stage of the parser takes care of morphology. For each word in the input sentence, a dictionary or lexicon is looked up, and the associated grammatical information is retrieved. In the next stage, local word grouping takes place, in which, based on local information, certain words are grouped together, yielding noun groups and verb groups. These are the word groups at the vibhakti level (i.e., typically each word group is a noun or verb with its vibhakti, TAM label, etc.). This involves grouping post-positional markers with nouns, auxiliaries with main verbs, etc. Rules for local word grouping are given by finite state machines. Finally, the karaka relations among the elements are identified in the last stage, called the core parser.</Paragraph> <Paragraph position="1"> The morphological analyzer and the local word grouper have been described elsewhere (Bharati et al., 1991).</Paragraph> <Paragraph position="2"> Here we discuss the core parser. Given the local word groups in a sentence, the task of the core parser is two-fold: 1. to identify karaka relations among word groups, and 2. to identify the senses of words.</Paragraph> <Paragraph position="3"> The first task requires karaka charts and transformation rules. The second task requires lakshan charts for nouns and verbs (explained at the end of the section).</Paragraph> <Paragraph position="4"> A data structure corresponding to the karaka chart stores information about the karaka-vibhakti mapping, including the optionality of karakas.
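As a concrete illustration of such a data structure and of the TAM-driven transformations of Fig. 3, here is a minimal sketch in Python. It is not taken from the paper: the names (KarakaRestriction, DEFAULT_CHART, TAM_KARTA_VIBHAKTI, transform_chart), the TAM label spellings, and the exact rule encodings are illustrative assumptions.

    from dataclasses import dataclass, replace

    @dataclass
    class KarakaRestriction:
        # One row of a karaka chart: which karaka it is, the vibhakti that
        # realizes it, and whether the karaka must appear in the sentence.
        karaka: str      # e.g. "karta", "karma", "karana"
        vibhakti: str    # e.g. "0" (no marker), "ne", "ko", "se"
        mandatory: bool

    # Default chart in the spirit of Fig. 2: karta and karma mandatory,
    # karana (instrumental) optional.  The entries are illustrative.
    DEFAULT_CHART = [
        KarakaRestriction("karta", "0", True),
        KarakaRestriction("karma", "0/ko", True),
        KarakaRestriction("karana", "se", False),
    ]

    # Fig. 3-style rules, encoded as: TAM label -> vibhakti taken by the
    # karta under that label (labels here are made-up mnemonic spellings).
    TAM_KARTA_VIBHAKTI = {
        "tA_hE": "0",      # B.1: rAma ... KAtA hE
        "yA": "ne",        # B.2: rAma ne ... KAyA
        "nA_padA": "ko",   # B.3: rAma ko ... KAnA padA
        "yA_gayA": "se",   # B.4: rAma se ... KAyA gayA
    }

    def transform_chart(chart, tam_label):
        """Return a copy of the chart with the karta's vibhakti rewritten
        according to the TAM label (a sketch of the Fig. 3 transformations)."""
        new_vibhakti = TAM_KARTA_VIBHAKTI.get(tam_label)
        return [replace(row, vibhakti=new_vibhakti)
                if row.karaka == "karta" and new_vibhakti is not None else row
                for row in chart]

    for tam in ("tA_hE", "yA", "nA_padA", "yA_gayA"):
        chart = transform_chart(DEFAULT_CHART, tam)
        print(tam, [(r.karaka, r.vibhakti) for r in chart])

Running the sketch prints one transformed chart per TAM label, reproducing the karta vibhaktis of sentences B.1 to B.4.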
Initially, the default karaka chart for a given verb group in the sentence is loaded into this structure, and transformations are then performed based on the TAM label. There is a separate data structure holding the karaka chart for each verb group in the sentence being processed. Each row of a karaka chart is called a karaka restriction. For a given sentence, after the word groups have been formed, karaka charts for the verb groups are created, and each of the noun groups is tested against the karaka restrictions in each karaka chart. When testing a noun group against a karaka restriction of a verb group, the vibhakti information is checked, and if found satisfactory, the noun group becomes a candidate for that karaka of the verb group.</Paragraph> <Paragraph position="5"> The above can be shown in the form of a constraint graph. Nodes of the graph are the word groups, and there is an arc labeled by a karaka from a verb group to a noun group if the noun group satisfies the karaka restriction in the karaka chart of the verb group. (There is an arc from one verb group to another if the karaka chart of the former shows that it takes a sentential or verbal karaka.) The verb groups are called demand groups as they make demands about their karakas, and the noun groups are called source groups because they satisfy demands.</Paragraph> <Paragraph position="6"> As an example, consider a sentence containing the verb KA (eat): baccA hATa se kelA KAtA hE.</Paragraph> <Paragraph position="7"> (child hand -se banana eats) (The child eats the banana with his hand.) Its word groups are marked, and KA (eat) has the same karaka chart as in Fig. 2. Its constraint graph is shown in Fig. 4.</Paragraph> <Paragraph position="8"> A parse is a sub-graph of the constraint graph satisfying the following conditions: 1. For each of the mandatory karakas in the karaka chart of each demand group, there should be exactly one outgoing edge from the demand group labeled by that karaka.</Paragraph> <Paragraph position="9"> 2. For each of the optional karakas in the karaka chart of each demand group, there should be at most one outgoing edge from the demand group labeled by that karaka.</Paragraph> <Paragraph position="10"> 3. There should be exactly one incoming arc into each source group.</Paragraph> <Paragraph position="11"> If several sub-graphs of a constraint graph satisfy the above conditions, there are multiple parses and the sentence is ambiguous. If no sub-graph satisfies the above constraints, the sentence does not have a parse and is probably ill-formed. There are similarities with dependency grammars here, because such constraint graphs are also produced by dependency grammars (Covington, 1990; Kashket, 1986).</Paragraph> <Paragraph position="12"> The present framework differs from them in two ways. First, the Paninian framework uses the linguistic insight regarding karaka relations to identify relations between constituents in a sentence. Second, the constraints are sufficiently restricted that they reduce to well-known bipartite graph matching problems for which efficient solutions are known. We discuss the latter aspect next.</Paragraph>
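The parse conditions above can be checked directly on a constraint graph. The following brute-force sketch does this for the example sentence; it is not the paper's algorithm, and the edge list (a guess at Fig. 4), the group names, and the mandatory/optional split are illustrative assumptions.

    from itertools import product

    # Constraint-graph edges for "baccA hATa se kelA KAtA hE":
    # (demand group, karaka label, source group).
    EDGES = [
        ("KAtA_hE", "karta",  "baccA"),
        ("KAtA_hE", "karma",  "kelA"),
        ("KAtA_hE", "karana", "hATa_se"),
    ]
    MANDATORY = {("KAtA_hE", "karta"), ("KAtA_hE", "karma")}
    OPTIONAL  = {("KAtA_hE", "karana")}
    SOURCES   = {"baccA", "hATa_se", "kelA"}

    def parses(edges):
        """Enumerate sub-graphs satisfying parse conditions 1-3."""
        slots = sorted(MANDATORY | OPTIONAL)
        choices = []
        for slot in slots:
            cands = [e for e in edges if (e[0], e[1]) == slot]
            # Condition 1: a mandatory karaka gets exactly one edge;
            # Condition 2: an optional karaka gets at most one (None = absent).
            choices.append(cands if slot in MANDATORY else cands + [None])
        for combo in product(*choices):
            chosen = [e for e in combo if e is not None]
            # Condition 3: every source group has exactly one incoming arc.
            if sorted(e[2] for e in chosen) == sorted(SOURCES):
                yield chosen

    for p in parses(EDGES):
        print(p)

For this sentence the sketch yields a single satisfying sub-graph, i.e., a unique parse; the reduction described next recasts the same search as bipartite matching.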
<Paragraph position="13"> If karaka charts contain only mandatory karakas, the constraint solver can be reduced to finding a matching in a bipartite graph. Here is what needs to be done for a given sentence (Perraju, 1992). For every source word group, create a node belonging to a set U; for every karaka in the karaka chart of every verb group, create a node belonging to a set V; and for every edge in the constraint graph, create an edge in E from a node in V to a node in U as follows: if there is an edge labeled with karaka k in the constraint graph from a demand node d to a source node s, create an edge in E in the bipartite graph from the node corresponding to (d, k) in V to the node corresponding to s in U. The original problem of finding a parse in the constraint graph now reduces to finding a complete matching in the bipartite graph {U, V, E} that covers all the nodes in U and V. (A matching is a subset of E such that no two edges are adjacent; a complete matching is also a largest maximal matching (Deo, 1974).) This problem has several known efficient algorithms. The time complexity of the augmenting path algorithm is O(min(|V|, |U|) · |E|), which in the worst case is O(n^3), where n is the number of word groups in the sentence being parsed. (See Papadimitriou et al. (1982), Ahuja et al. (1993).) The fastest known algorithm has asymptotic complexity O(|V|^(1/2) · |E|) and is based on the max-flow problem (Hopcroft and Karp, 1973).</Paragraph> <Paragraph position="16"> If we permit optional karakas, the problem still has an efficient solution. It now reduces to finding a matching of maximal weight in a weighted bipartite graph. To perform the reduction, we need to form a weighted bipartite graph: we first form a bipartite graph exactly as before.</Paragraph> <Paragraph position="17"> Next, the edges are weighted: an edge is assigned weight 1 if it leaves a node in V representing a mandatory karaka, and weight 0 if it leaves a node representing an optional karaka. The problem now is to find the largest maximal matching (or assignment) that has the maximum weight (the maximum bipartite matching or assignment problem). The resulting matching represents a valid parse if it covers all nodes in U and covers those nodes in V that stand for mandatory karakas. (The maximal-weight condition ensures that all edges from nodes in V representing mandatory karakas are selected first, if possible.) This problem has a known solution by the Hungarian method with time complexity of O(n^3) arithmetic operations (Kuhn, 1955).</Paragraph>
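The weighted reduction can be exercised with an off-the-shelf assignment solver. The sketch below uses SciPy's linear_sum_assignment (a Hungarian-style algorithm) on a small hypothetical instance; the instance itself, and the use of SciPy rather than the authors' implementation, are assumptions for illustration.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # V: one node per (demand group, karaka); U: one node per source group.
    V = [("KAtA_hE", "karta"), ("KAtA_hE", "karma"), ("KAtA_hE", "karana")]
    U = ["baccA", "kelA", "hATa_se"]
    MANDATORY = {("KAtA_hE", "karta"), ("KAtA_hE", "karma")}
    EDGES = {          # admissible pairs, taken from the constraint graph
        (("KAtA_hE", "karta"),  "baccA"),
        (("KAtA_hE", "karma"),  "kelA"),
        (("KAtA_hE", "karana"), "hATa_se"),
    }

    NEG = -10**6       # effectively forbids pairs that are not edges
    weights = np.full((len(V), len(U)), NEG, dtype=float)
    for i, v in enumerate(V):
        for j, u in enumerate(U):
            if (v, u) in EDGES:
                weights[i, j] = 1.0 if v in MANDATORY else 0.0

    rows, cols = linear_sum_assignment(weights, maximize=True)
    matching = [(V[i], U[j]) for i, j in zip(rows, cols) if weights[i, j] > NEG]

    # A valid parse must cover every source group and every mandatory karaka.
    covered_sources = {u for _, u in matching}
    covered_mandatory = {v for v, _ in matching if v in MANDATORY}
    valid = covered_sources == set(U) and covered_mandatory == MANDATORY
    print(matching, valid)

Only edges of the constraint graph receive non-negative weights, so the final check corresponds to the validity condition stated above.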
<Paragraph position="19"> Note that in the above theory we have made the following assumptions: (a) each word group is uniquely identifiable before the core parser executes, (b) each demand word has only one karaka chart, and (c) there are no ambiguities between source words and demand words. Empirical data for Indian languages shows that conditions (a) and (b) hold. Condition (c), however, does not always hold for certain Indian languages, as shown by corpus data. Even though there are exceptions to this condition, they produce only a small number of such ambiguities or clashes. Therefore, for each possible demand group and source group clash, a new constraint graph can be produced and solved, leaving the polynomial time complexity unchanged.</Paragraph> <Paragraph position="20"> The core parser also disambiguates word senses.</Paragraph> <Paragraph position="21"> This requires the preparation of lakshan charts (or discrimination nets) for nouns and verbs. A lakshan chart for a verb allows us to identify the sense of the verb in a sentence given its parse. Lakshan charts make use of the karakas of the verb in the sentence for determining the verb sense; the same holds for nouns. It should be noted (without discussion) that (a) disambiguation of senses is done only after karaka assignment is over, and (b) only those senses are disambiguated which are necessary for translation. The key point here is that, because sense disambiguation is done separately after karaka assignment is over, the system remains efficient. If this were not done, the parsing problem would be NP-complete: as shown by Barton et al. (1987), if agreement and sense ambiguity interact, they make the problem NP-complete.</Paragraph> </Section> <Section position="6" start_page="107" end_page="108" type="metho"> <SectionTitle> 3 Active-Passives and Complex Sentences </SectionTitle> <Paragraph position="0"> This theory captures the linguistic intuition that in free word order languages, vibhakti (case endings or post-positions, etc.) plays a key role in determining karaka roles. To show that the above, though neat, is not just an ad hoc mechanism that explains the isolated phenomenon of semantic roles mapping to vibhaktis, we discuss two other phenomena: active-passive and control.</Paragraph> <Paragraph position="1"> No separate theory is needed to explain active-passives. Active and passive turn out to be special cases of certain TAM labels, namely those used to mark active and passive. Again consider, for example, the following in Hindi.</Paragraph> <Paragraph position="2"> Consequently, the vibhakti 'dvArA' for the karta (Ram) follows from the transformation already given earlier in Fig. 3.</Paragraph> <Paragraph position="3"> A major support for the theory comes from complex sentences, that is, sentences containing more than one verb group. We first introduce the problem and then describe how the theory provides an answer. Consider the Hindi sentences G.1, G.2 and G.3.</Paragraph> <Paragraph position="4"> In G.1, Ram is the karta of both the verbs: KA (eat) and bulA (call). However, it occurs only once. The problem is to identify which verb will control its vibhakti. In G.2, the karta Ram and the karma Pala (fruit) are both shared by the two verbs kAta (cut) and KA (eat). In G.3, the karta 'usa' (he) is shared between the two verbs, and 'cAkU' (knife), the karma karaka of 'le' (take), is the karana (instrumental) karaka of 'kAta' (cut).</Paragraph> <Paragraph position="5"> G.1 rAma Pala KAkara mohana ko bulAtA hE. (Ram fruit having-eaten Mohan -ko calls) (Having eaten fruit, Ram calls Mohan.)</Paragraph> <Paragraph position="6"> G.2 rAma ne Pala kAtakara KAyA. (Ram -ne fruit having-cut ate) (Ram ate, having cut the fruit.)</Paragraph> <Paragraph position="7"> G.3 Pala kAtane ke liye usane cAkU liyA. (fruit to-cut for he-ne knife took) (To cut fruit, he took a knife.)</Paragraph> <Paragraph position="8"> The observation that the matrix verb, i.e., the main verb rather than the intermediate verb, controls the vibhakti of the shared nominal holds in the above sentences, as explained below. The theory we outline to elaborate on this theme has two parts. The first part gives the karaka-to-vibhakti mapping as usual; the second part identifies shared karakas.</Paragraph> <Paragraph position="9"> The first part is in terms of the karaka-vibhakti mapping described earlier. Because the intermediate verbs have their own TAM labels, they are handled by exactly the same mechanism.
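The two-part treatment of complex sentences can be pictured with a small sketch. It uses the kara transformation and sharing rule S1 that are spelled out below; the dictionary layout and all names are illustrative assumptions, not the paper's implementation.

    def transform_for_tam(chart, tam_label):
        """Part 1: each verb group's chart is transformed by its own TAM label.
        Only the 'kara' rule is sketched: karta must not be present,
        karma becomes optional."""
        chart = dict(chart)
        if tam_label == "kara":
            chart.pop("karta", None)
            if "karma" in chart:
                chart["karma"] = {**chart["karma"], "mandatory": False}
        return chart

    def apply_sharing_rule_s1(parse):
        """Part 2 (rule S1, given below): the karta of a verb with TAM label
        'kara' is the same as the karta of the verb it modifies."""
        verbs = parse["verbs"]
        for verb in verbs:
            if verb["tam"] == "kara" and "karta" not in verb["karakas"]:
                verb["karakas"]["karta"] = verbs[verb["modifies"]]["karakas"]["karta"]
        return parse

    # Part 1 on a toy default chart for the intermediate verb with TAM 'kara'.
    default_chart = {"karta": {"mandatory": True}, "karma": {"mandatory": True}}
    print(transform_for_tam(default_chart, "kara"))
    # -> {'karma': {'mandatory': False}}

    # Part 2 on G.1: rAma Pala KAkara mohana ko bulAtA hE
    parse = {
        "verbs": [
            {"word": "bulA", "tam": "tA_hE", "modifies": None,
             "karakas": {"karta": "rAma", "karma": "mohana"}},
            {"word": "KA", "tam": "kara", "modifies": 0,   # modifies bulA
             "karakas": {"karma": "Pala"}},
        ]
    }
    print(apply_sharing_rule_s1(parse)["verbs"][1]["karakas"])
    # -> {'karma': 'Pala', 'karta': 'rAma'}

For G.1 this assigns rAma as the shared karta of KA (eat), matching the dotted line of Fig. 6.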
For example, kara is the TAM label of the intermediate verb groups in G.1 and G.2 (KA (eat) in G.1 and kAta (cut) in G.2), and nA is the TAM label of the intermediate verb (kAta (cut)) in G.3. (The kara TAM label roughly means 'having completed the activity', and nA is the verbal noun; but note that TAM labels are purely syntactic, hence the meaning is not required by the system.)</Paragraph> </Section> <Section position="7" start_page="108" end_page="108" type="metho"> <SectionTitle> TAM LABEL TRANSFORMATION </SectionTitle> <Paragraph position="0"> kara: Karta must not be present. Karma is optional.</Paragraph> <Paragraph position="1"> nA: Karta and karma are optional. tA_huA: Karta must not be present. Karma is ...</Paragraph> <Paragraph position="2"> As usual, these TAM labels have transformation rules that operate on and modify the default karaka chart. In particular, the transformation rules for the two TAM labels (kara and nA) are given in Fig. 5. The transformation rule for kara in Fig. 5 says that the karta of a verb with TAM label kara must not be present in the sentence and that the karma is optionally present. By these rules, the intermediate verbs KA (eat) in G.1 and kAta (cut) in G.2 do not have an (independent) karta karaka present in the sentence. Ram is the karta of the main verb. Pala (fruit) is the karma of the intermediate verb in G.1 (KA) but not in G.2 (kAta); in the latter, Pala is the karma of the main verb. All of this is accommodated by the above transformation rule for 'kara'. The tree structures produced are shown in Fig. 6 (ignore the dotted lines for now), where a child node of a parent expresses a karaka relation or a verb-verb relation.</Paragraph> <Paragraph position="3"> In the second part, there are rules for obtaining the shared karakas. The karta of the intermediate verb KA in G.1 can be obtained by a sharing rule of the kind given by S1.</Paragraph> <Paragraph position="4"> Rule S1: The karta of a verb with TAM label 'kara' is the same as the karta of the verb it modifies. The sharing rule(s) are applied after the tentative karaka assignment (using the karaka-vibhakti mapping) is over. The shared karakas are shown by dotted lines in Fig. 6.</Paragraph> </Section> </Paper>