<?xml version="1.0" standalone="yes"?> <Paper uid="P94-1027"> <Title>OPTIMIZING THE COMPUTATIONAL LEXICALIZATION OF LARGE GRAMMARS</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> OPTIMIZING THE COMPUTATIONAL LEXICALIZATION OF LARGE GRAMMARS Christian JACQUEMIN </SectionTitle> <Paragraph position="0"> Institut de Recherche en Informatique de Nantes (IR/N) IUT de Nantes - 3, rue du MarEchal Joffre F-441M1 NANTES Cedex 01 - FRANCE a--mail : jaequemin@ irin.iut-nantas.univ-nantas.fr</Paragraph> </Section> <Section position="2" start_page="0" end_page="202" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> The computational lexicalization of a grammar is the optimization of the links between lexicalized rules and lexical items in order to improve the quality of the bottom-up filtering during parsing. This problem is N P-complete and untractable on large grammars. An approximation algorithm is presented. The quality of the suboptimal solution is evaluated on real-world grammars as well as on randomly generated ones.</Paragraph> <Paragraph position="1"> Introduction Lexicalized grammar formalisms and more specifically Lexicalized Tree Adjoining Grammars (LTAGs) give a lexical account of phenomena which cannot be considered as purely syntactic (Schabes et al, 1990). A formalism is said to be lexicalized if it is composed of structures or rules associated with each lexical item and operations to derive new structures from these elementary ones. The choice of the lexical anchor of a rule is supposed to be determined on purely linguistic grounds. This is the linguistic side of lexicalization which links to each lexical head a set of minimal and complete structures. But lexicalization also has a computational aspect because parsing algorithms for lexicalized grammars can take advantage of lexical links through a two-step strategy (Schabes and Joshi, 1990). The first step is the selection of the set of rules or elementary structures associated with the lexical items in the input sentence ~. In the second step, the parser uses the rules filtered by the first step.</Paragraph> <Paragraph position="2"> The two kinds of anchors corresponding to these two aspects of lexicalization can be considered separately : * The linguistic anchors are used to access the grammar, update the data, gather together items with similar structures, organize the grammar into a hierarchy...</Paragraph> <Paragraph position="3"> * The computational anchors are used to select the relevant rules during the first step of parsing and to improve computational and conceptual tractability of the parsing algorithm.</Paragraph> <Paragraph position="4"> Unlike linguistic lexicalization, computational anchoring concerns any of the lexical items found in a rule and is only motivated by the quality of the induced filtering. For example, the systematic linguistic anchoring of the rules describing &quot;Nmetal alloy&quot; to their head noun &quot;alloy&quot; should be avoided and replaced by a more distributed lexicalization. Then, only a few rules &quot;Nmetal alloy&quot; will be activated when encountering the word &quot;alloy&quot; in the input. In this paper, we investigate the problem of the optimization of computational lexicalization. 
We study how to choose the computational anchors of a lexicalized grammar so that the distribution of the rules onto the lexical items is as uniform as possible with respect to rule weights.</Paragraph> <Paragraph position="5"> Although introduced with reference to LTAGs, this optimization concerns any grammar formalism in which rules include one or more potential lexical anchors, such as Head-Driven Phrase Structure Grammar (Pollard and Sag, 1987) or Lexicalized Context-Free Grammar (Schabes and Waters, 1993).</Paragraph> <Paragraph position="6"> This algorithm is currently used to good effect in FASTR, a unification-based parser for terminology extraction from large corpora (Jacquemin, 1994). In this framework, terms are represented by rules in a lexicalized constraint-based formalism. Due to the large size of the grammar, the quality of the lexicalization is a determining factor for the computational tractability of the application. FASTR is applied to automatic indexing on industrial data and lays a strong emphasis on the handling of term variations (Jacquemin and Royauté, 1994).</Paragraph> <Paragraph position="7"> The remainder of this paper is organized as follows. In the following part, we prove that the problem of the Lexicalization of a Grammar is NP-complete, and hence that no better algorithm is known to solve it than an exponential exhaustive search. As this solution is intractable on large data, an approximation algorithm is presented whose time complexity is proportional to the cube of the size of the grammar. In the last part, an evaluation of this algorithm on real-world grammars of 6,622 and 71,623 rules, as well as on randomly generated ones, confirms its computational tractability and the quality of the lexicalization.</Paragraph> <Paragraph position="8"> The Problem of the Lexicalization of a Grammar Given a lexicalized grammar, this part describes the problem of the optimization of the computational lexicalization. The solution to this problem is a lexicalization function (henceforth a lexicalization) which associates to each grammar rule one of the lexical items it includes (its lexical anchor). A lexicalization is optimized in our sense if it induces an optimal preprocessing of the grammar. Preprocessing is intended to activate the rules whose lexical anchors are in the input and to perform as much filtering of these rules as possible before parsing proper. Mainly, preprocessing discards the selected rules that include at least one lexical item which is not found in the input.</Paragraph>
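For concreteness, here is a minimal sketch of this preprocessing, assuming that rules are simply indexed by their computational anchor and list the lexical items they require; the rule names, the toy data and the function are invented for illustration and are not the FASTR implementation:

```python
# Hypothetical sketch of the lexical filtering performed before parsing.
from collections import defaultdict

# rule name -> (computational anchor, set of all lexical items in the rule)
RULES = {
    "high_grade":        ("grade", {"high", "grade"}),
    "high_grade_steel":  ("steel", {"high", "grade", "steel"}),
    "from_time_to_time": ("time",  {"from", "time", "to"}),
}

def select_rules(rules, sentence_words):
    """First step of the two-step strategy: activate the rules whose anchor
    occurs in the input, then discard those requiring an item absent from
    the input; only the surviving rules are handed to the parser."""
    words = set(sentence_words)
    by_anchor = defaultdict(list)
    for name, (anchor, items) in rules.items():
        by_anchor[anchor].append((name, items))
    selected = [name
                for word in words
                for name, items in by_anchor.get(word, [])
                if items <= words]          # every required item is present
    return sorted(selected)

print(select_rules(RULES, "a high grade steel was produced".split()))
# ['high_grade', 'high_grade_steel']
```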
<Paragraph position="9"> The first step of the optimization of the lexicalization is to assign a weight to each rule. The weight is assumed to represent the cost of the corresponding rule during the preprocessing. For a given lexicalization, the weight of a lexical item is the sum of the weights of the rules linked to it. The weights are chosen so that a uniform distribution of the rules onto the lexical items ensures an optimal preprocessing. Thus, the problem is to find an anchoring which achieves such a uniform distribution.</Paragraph> <Paragraph position="10"> The weights depend on the physical constraints of the system. For example, the weight is the number of nodes if the memory size is the critical point. In this case, a uniform distribution ensures that the rules linked to an item will not require more than a given memory space. The weight is the number of terminal or non-terminal nodes if the computational cost has to be minimized.</Paragraph> <Paragraph position="11"> Experimental measures can be performed on a test set of rules in order to determine the most accurate weight assignment.</Paragraph> <Paragraph position="12"> Two simplifying assumptions are made: * The weight of a rule does not depend on the lexical item to which it is anchored.</Paragraph> <Paragraph position="13"> * The weight of a rule does not depend on the other rules simultaneously activated.</Paragraph> <Paragraph position="14"> The second assumption is essential for setting a tractable problem. The first assumption can be avoided at the cost of a more complex representation. In this case, instead of having a unique weight, a rule must have as many weights as potential lexical anchors. Apart from this modification, the algorithm that will be presented in the next part remains much the same as in the case of a single weight. If the first assumption is removed, data about the frequency of the items in corpora can be accounted for. Assigning smaller weights to rules when they are anchored to rare items will make the algorithm favor the anchoring to these items. Thus, due to their rarity, the corresponding rules will be rarely selected.</Paragraph> <Paragraph position="15"> Illustration Terms, compounds and more generally idioms require a lexicalized syntactic representation such as LTAGs to account for the syntax of these lexical entries (Abeillé and Schabes, 1989). The grammars chosen to illustrate the problem of the optimization of the lexicalization and to evaluate the algorithm consist of idiom rules such as the ones built from the set of idioms I: I = {from time to time, high time, high grade, high grade steel} Each rule is represented by a pair (w_i, A_i) where w_i is the weight and A_i the set of potential anchors. If we choose the total number of words in an idiom as its weight and its non-empty words as its potential anchors, I is represented by the following grammar: G1 = { a = (4, {time}), b = (2, {high, time}), c = (2, {grade, high}), d = (3, {grade, high, steel}) } We call vocabulary the union V of all the sets of potential anchors A_i. Here, V = {grade, high, steel, time}. A lexicalization is a function λ associating a lexical anchor to each rule.</Paragraph> <Paragraph position="16"> Given a threshold θ, the membership problem called the Lexicalization of a Grammar (LG) is to find a lexicalization so that the weight of any lexical item in V is less than or equal to θ. If θ ≥ 4 in the preceding example, LG has a solution λ, for instance λ(a) = time, λ(b) = high, λ(c) = grade, λ(d) = steel. If θ ≤ 3, LG has no solution.</Paragraph>
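The grammar G1 and this membership test can be made concrete with the following sketch (illustrative Python; the dictionary encoding and the function name are assumptions, not the paper's implementation):

```python
# Grammar G1 from the example: each rule is a (weight, potential anchors) pair.
G1 = {
    "a": (4, {"time"}),
    "b": (2, {"high", "time"}),
    "c": (2, {"grade", "high"}),
    "d": (3, {"grade", "high", "steel"}),
}

def is_solution(grammar, lexicalization, theta):
    """Check the relation R(V, G, theta, lambda): the lexicalization must map
    every rule to one of its potential anchors, and no item may carry a total
    weight greater than theta."""
    weights = {}
    for rule, anchor in lexicalization.items():
        w, anchors = grammar[rule]
        if anchor not in anchors:               # not a valid anchoring
            return False
        weights[anchor] = weights.get(anchor, 0) + w
    return (set(lexicalization) == set(grammar)  # total function over the rules
            and all(w <= theta for w in weights.values()))

# One lexicalization that works for theta = 4 ...
print(is_solution(G1, {"a": "time", "b": "high", "c": "grade", "d": "steel"}, 4))  # True
# ... while no lexicalization can reach theta = 3, since rule "a" alone
# forces the weight of "time" up to 4.
print(is_solution(G1, {"a": "time", "b": "high", "c": "grade", "d": "steel"}, 3))  # False
```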
<Paragraph position="19"> Definition of the LG Problem</Paragraph> <Paragraph position="20"> (1) LG = { (V, G, θ, λ) | G is a grammar whose rules (w, A) have a weight w and a set of potential anchors A ⊆ V, θ ∈ Q+, and λ : G → V is a total function anchoring the rules so that λ((w, A)) ∈ A and (∀v ∈ V) Σ_{(w,A) ∈ G, λ((w,A)) = v} w ≤ θ }</Paragraph> <Paragraph position="23"> The associated optimization problem is to determine the lowest value θ_opt of the threshold θ so that there exists a solution (V, G, θ_opt, λ) to LG. The solution of the optimization problem for the preceding example is θ_opt = 4.</Paragraph> <Paragraph position="24"> Lemma LG is in NP.</Paragraph> <Paragraph position="25"> It is evident that checking whether a given lexicalization is indeed a solution to LG can be done in polynomial time. The relation R defined by (2) is polynomially decidable: (2) R(V, G, θ, λ) = [ if λ : G → V and (∀v ∈ V) Σ_{(w,A) ∈ G, λ((w,A)) = v} w ≤ θ then true else false ] The weights of the items can be computed through matrix products: a matrix for the grammar and a matrix for the lexicalization. The size of any lexicalization λ is linear in the size of the grammar. As (V, G, θ, λ) ∈ LG if and only if R(V, G, θ, λ) is true, LG is in NP. ∎ Theorem LG is NP-complete.</Paragraph> <Paragraph position="26"> Bin Packing (BP), which is NP-complete, is polynomial-time Karp reducible to LG. BP (Baase, 1986) is the problem defined by (3): (3) BP = { (R, {R_1, ..., R_k}) | R = {r_1, ..., r_n} is a set of rational numbers less than or equal to 1 and {R_1, ..., R_k} is a partition of R (the k bins in which the r_j are packed) such that (∀i ∈ {1, ..., k}) Σ_{r ∈ R_i} r ≤ 1 } First, any instance of BP can be represented as an instance of LG. Let (R, {R_1, ..., R_k}) be an instance of BP; it is transformed into the instance (V, G, θ, λ) of LG as follows: (4) V = {v_1, ..., v_k}, G = { (r_j, V) | j ∈ {1, ..., n} }, θ = 1, and (∀i ∈ {1, ..., k}) (∀j ∈ {1, ..., n}) λ((r_j, V)) = v_i ⟺ r_j ∈ R_i For all i ∈ {1, ..., k} and j ∈ {1, ..., n}, we consider the assignment of r_j to the bin R_i of BP as the anchoring of the rule (r_j, V) to the item v_i of LG. If (R, {R_1, ..., R_k}) ∈ BP then: (5) (∀i ∈ {1, ..., k}) W(v_i) = Σ_{r_j ∈ R_i} r_j ≤ 1 Thus (V, G, 1, λ) ∈ LG. Conversely, given a solution (V, G, 1, λ) of LG, let R_i = { r_j ∈ R | λ((r_j, V)) = v_i } for all i ∈ {1, ..., k}. Clearly {R_1, ..., R_k} is a partition of R because the lexicalization is a total function, and the preceding formula ensures that each bin is correctly loaded. Thus (R, {R_1, ..., R_k}) ∈ BP. It is also simple to verify that the transformation from BP to LG can be performed in polynomial time. ∎ The optimization version of an NP-complete problem is NP-complete (Sommerhalder and van Westrhenen, 1988), hence the optimization version of LG is NP-complete.</Paragraph> <Section position="1" start_page="198" end_page="200" type="sub_section"> <SectionTitle> An Approximation Algorithm for LG </SectionTitle> <Paragraph position="0"> This part presents and evaluates an n³-time approximation algorithm for the LG problem which yields a suboptimal solution close to the optimal one. The first step is the 'easy' anchoring of rules including at least one rare lexical item to one of these items. The second step handles the 'hard' lexicalization of the remaining rules, which include only common items found in several other rules and for which the decision is not straightforward. The discrimination between these two kinds of items is made on the basis of their global weight GW, defined by (6), which is the sum of the weights of the rules which are not yet anchored and which have this lemma as a potential anchor. V_λ and G_λ are the subsets of V and G containing the items and the rules not yet anchored. The weights w and the threshold θ are assumed to be integers, multiplying them by their lowest common denominator if necessary.</Paragraph> <Paragraph position="1"> (6) (∀v ∈ V_λ) GW(v) = Σ_{(w,A) ∈ G_λ, v ∈ A} w</Paragraph> <Paragraph position="2"> Step 1: 'Easy' Lexicalization of Rare Items This first step of the optimization algorithm is also the first step of the exhaustive search. The value of the minimal threshold θ_min given by (7) is computed by dividing the sum of the rule weights by the number of lemmas (⌈x⌉ stands for the smallest integer greater than or equal to x and |V_λ| stands for the size of the set V_λ): (7) θ_min = ⌈ (Σ_{(w,A) ∈ G_λ} w) / |V_λ| ⌉, where |V_λ| ≠ 0 All the rules which include a lemma with a global weight less than or equal to θ_min are anchored to this lemma. When this linking is achieved in a non-deterministic manner, θ_min is recomputed. The algorithm loops on this lexicalization, starting it from scratch every time, until θ_min remains unchanged or until all the rules are anchored. The output value of θ_min is a lower bound on the thresholds for which LG can have a solution, and is therefore less than or equal to θ_opt. After Step 1, either each rule is anchored or all the remaining items in V_λ have a global weight strictly greater than θ_min. The algorithm is shown in Figure 1.</Paragraph>
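A possible rendering of Step 1 in the same rule representation as above (a sketch loosely following the description in the text rather than Figure 1 itself; the function name is invented, and computing the global weights once over the whole grammar is a simplification of the non-deterministic linking):

```python
import math

def step1(grammar):
    """'Easy' lexicalization sketch.  grammar maps a rule name to a
    (weight, set_of_potential_anchors) pair.  Rules containing a rare item
    (global weight <= theta_min) are anchored to their rarest item; theta_min
    is then recomputed over the remaining rules and items and the pass is
    redone from scratch, until theta_min stops changing or no rule is left."""
    vocab = set().union(*(anchors for _, anchors in grammar.values()))
    # global weight of every item: total weight of the rules containing it
    gw = {v: sum(w for w, anchors in grammar.values() if v in anchors)
          for v in vocab}
    theta_min = math.ceil(sum(w for w, _ in grammar.values()) / len(vocab))

    while True:
        # anchoring pass, restarted from scratch with the current theta_min
        choice = {r: min((v for v in anchors if gw[v] <= theta_min),
                         key=gw.get, default=None)
                  for r, (_, anchors) in grammar.items()}
        anchoring = {r: v for r, v in choice.items() if v is not None}
        remaining = {r: ga for r, ga in grammar.items() if r not in anchoring}
        if not remaining:
            break
        rem_vocab = set().union(*(anchors for _, anchors in remaining.values()))
        new_theta = math.ceil(sum(w for w, _ in remaining.values()) / len(rem_vocab))
        if new_theta <= theta_min:   # unchanged (or smaller): stop, Step 2 takes over
            break
        theta_min = new_theta
    return anchoring, remaining, theta_min

# On the grammar G1 above, step1 anchors rule d to "steel" (global weight 3)
# and returns theta_min = 3; rules a, b and c are left to Step 2.
```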
<Paragraph position="3"> Step 2: 'Hard' Lexicalization of Common Items During this step, the algorithm repeatedly removes an item from the remaining vocabulary and yields the anchoring of this item. The item with the lowest global weight is handled first, because it has the smallest combination of anchorings and hence the probability of making a wrong choice for the lexicalization is low. Given an item, the candidate rules with this item as a potential anchor are ranked by decreasing priority: the rules anchored first are the ones whose potential anchors have the highest global weights (items found in several other non-anchored rules).</Paragraph> <Paragraph position="4"> The algorithm is shown in Figure 2. The output of Step 2 is the suboptimal computational lexicalization λ of the whole grammar and the associated threshold θ_subopt. Both steps can be optimized. Useless computation is avoided by watching the capital of weight C defined by (8), with θ = θ_min during Step 1 and θ = θ_subopt during Step 2:</Paragraph> <Paragraph position="5"> (8) C = θ × |V_λ| − Σ_{(w,A) ∈ G_λ} w</Paragraph> <Paragraph position="6"> C corresponds to the weight which can be lost by giving an item a weight W(m) which is strictly less than the current threshold θ. Every time an anchoring to an item m is completed, C is reduced by θ − W(m). If C becomes negative in either of the two steps, the algorithm will fail to produce the lexicalization of the grammar and must be started again from Step 1 with a higher value for θ.</Paragraph> <Paragraph position="8"> [Figure 2 pseudo-code fragment, not reproduced here: anchoring the rules with only m as free potential anchor, m being the item of V_λ with the lowest global weight.]</Paragraph> <Paragraph position="10"> Example The algorithm has been applied to a test grammar G2 obtained from 41 terms with 11 potential anchors. The algorithm fails to produce a lexicalization of G2 with the minimal threshold θ_min = 12, but achieves it with θ_subopt = 13. This value of θ_subopt can be compared with the optimal one by running the exhaustive search. There are 2^32 (≈ 4·10^9) possible lexicalizations, among which 35,336 are optimal ones with a threshold of 13. This result shows that the approximation algorithm brings forth one of the optimal solutions, which represent a proportion of only 8·10^-6 of the possible lexicalizations. In this case the optimal and the suboptimal thresholds coincide. (The exhaustive grammar and more details about this example and the computations of the following section are given in (Jacquemin, 1991).)</Paragraph>
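Step 2 can be sketched along the same lines (again an illustrative approximation of Figure 2, not the paper's exact procedure; the restart from Step 1 with a higher threshold when the capital of weight is exhausted is left to the caller, and the function name is invented):

```python
def step2(grammar, anchoring, theta):
    """'Hard' lexicalization sketch: handle the remaining items by increasing
    global weight; for each item m, anchor to it first the candidate rules
    whose other potential anchors carry the highest global weights, as long as
    the threshold theta is respected.  Returns the completed anchoring, or
    None when theta is too small."""
    anchoring = dict(anchoring)
    remaining = {r: ga for r, ga in grammar.items() if r not in anchoring}
    load = {}                                   # item -> weight anchored to it
    for rule, anchor in anchoring.items():
        load[anchor] = load.get(anchor, 0) + grammar[rule][0]
    handled = set()                             # items removed from V_lambda

    while remaining:
        gw = {}                                 # global weight of the free items
        for w, anchors in remaining.values():
            for v in anchors - handled:
                gw[v] = gw.get(v, 0) + w
        if not gw:
            return None                         # some rule has no free anchor left
        m = min(gw, key=gw.get)                 # lowest global weight first
        candidates = sorted(
            (r for r in remaining if m in remaining[r][1]),
            key=lambda r: max((gw.get(v, 0) for v in remaining[r][1] if v != m),
                              default=0),
            reverse=True)                       # most loaded other anchors first
        for rule in candidates:
            w, anchors = remaining[rule]
            if load.get(m, 0) + w <= theta:
                anchoring[rule] = m
                load[m] = load.get(m, 0) + w
                del remaining[rule]
            elif not (anchors - handled - {m}):
                return None                     # m was the rule's last possible anchor
        handled.add(m)
    return anchoring

# On the example grammar G1, step2(G1, {"d": "steel"}, 3) fails (rule a alone
# forces the weight of "time" up to 4), while step2(G1, {"d": "steel"}, 4)
# succeeds with the optimal threshold 4.
```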
<Paragraph position="11"> Time-Complexity of the Approximation Algorithm A grammar G on a vocabulary V can be represented by a |G| × |V| matrix of Boolean values for the sets of potential anchors and a 1 × |G| matrix for the weights. In order to evaluate the complexity of the algorithm as a function of the size of the grammar, we assume that |V| and |G| are of the same order of magnitude n. Step 1 of the algorithm corresponds to products and sums on the preceding matrices and takes O(n³) time. The worst-case time complexity of Step 2 is also O(n³) when using a naive O(n²) algorithm to sort the items and the rules by decreasing priority. In all, the time required by the approximation algorithm is proportional to the cubic size of the grammar.</Paragraph> <Paragraph position="12"> This order of magnitude ensures that the algorithm can be applied to large real-world grammars such as terminological grammars.</Paragraph> <Paragraph position="13"> On a Sparc 2, the lexicalization of a terminological grammar composed of 6,622 rules and 3,256 words requires 3 seconds (real time), and the lexicalization of a very large terminological grammar of 71,623 rules and 38,536 single words takes 196 seconds. The two grammars used for these experiments were generated from two lists of terms provided by the documentation center INIST/CNRS.</Paragraph> </Section> <Section position="2" start_page="200" end_page="202" type="sub_section"> <SectionTitle> Evaluation of the Approximation Algorithm </SectionTitle> <Paragraph position="0"> Benchmarks on Artificial Grammars In order to check the quality of the lexicalization on different kinds of grammars, the algorithm has been tested on eight randomly generated grammars of 4,000 rules having from 2 to 10 potential anchors (Table 1). The lexicon of the first four grammars is 40 times smaller than the grammar, while the lexicon of the last four is 4 times smaller than the grammar (this proportion is close to that of the real-world grammar studied in the next subsection). The eight grammars differ in their distribution of the items onto the rules. The uniform distribution corresponds to a uniform random choice of the items which build the sets of potential anchors, while the Gaussian one corresponds to a choice in which some items are picked more frequently than others. The higher the parameter s, the flatter the Gaussian distribution.</Paragraph> <Paragraph position="1"> The last two columns of Table 1 give the minimal threshold θ_min after Step 1 and the suboptimal threshold θ_subopt found by the approximation algorithm. As mentioned when presenting Step 1, the optimal threshold θ_opt is necessarily greater than or equal to θ_min after Step 1. Table 1 reports that the suboptimal threshold θ_subopt is never more than 2 units greater than θ_min after Step 1. The suboptimal threshold yielded by the approximation algorithm on these examples is therefore of high quality, because it is at worst 2 units greater than the optimal one.</Paragraph> <Paragraph position="2"> A Comparison with Linguistic Lexicalization on a Real-World Grammar This evaluation consists in applying the algorithm to a natural language grammar composed of 6,622 rules (terms from the domain of metallurgy provided by INIST/CNRS) and a lexicon of 3,256 items. Figure 3 depicts the distribution of the weights with the natural linguistic lexicalization. The frequent head words such as alloy are heavily loaded because of the numerous terms N-alloy, with N a name of metal. Conversely, in Figure 4 the distribution of the weights produced by the approximation algorithm is much more uniform. The maximal weight of an item is 241 with the linguistic lexicalization, while it is only 34 with the optimized lexicalization.
Since the threshold after Step 1 is already 34, the suboptimal threshold yielded by the approximation algorithm is in this case equal to the optimal one.</Paragraph> <Paragraph position="3"> [Table 1 (column headings only; the table body is not reproduced here): Lexicon size; Distribution of the items onto the rules; θ_min before Step 1; θ_min after Step 1; θ_subopt (suboptimal threshold).]</Paragraph> </Section> </Section> </Paper>