<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2169"> <Title>Thesaurus-based Efficient Example Retrieval</Title> <Section position="3" start_page="0" end_page="1044" type="metho"> <SectionTitle> 2 Similarity of Surface Case Structures </SectionTitle> <Paragraph position="0"> As a similarity measure of surface case structures, we basically use the similarity measure of Kurohashi and Nagao (1993). Since their similarity measure is intended for calculating the similarity between the input surface case structure and a case frame with example nouns, we adapt it to the similarity between two surface case structures. The following describes the data structure of surface case structures and the thesaurus, and gives the definition of the similarity measure.</Paragraph> <Section position="1" start_page="0" end_page="1044" type="sub_section"> <SectionTitle> 2.1 Data Structure </SectionTitle> <Paragraph position="0"> In general, the surface case structure of a Japanese sentence can be represented in feature-structure-like notation as below: $[\,pred: V,\ p_1: [pred: N_1,\ sem: Sem_1],\ \ldots,\ p_n: [pred: N_n,\ sem: Sem_n]\,]$ In this notation, $V$ is the verb, $p_1, \ldots, p_n$ are the Japanese surface case markers, $N_1, \ldots, N_n$ are case element nouns, and $Sem_1, \ldots, Sem_n$ are the semantic categories of each case element in a thesaurus.</Paragraph> <Paragraph position="1"> In our task of retrieving example surface case structures, the input and the examples to be retrieved must have the same verb. Besides, the similarity value between the input and the example depends only on each semantic category. Thus, in this paper, we define the surface case structure $e$ of a sentence as the set of pairs $(p, Sem)$, where $p$ is a surface case marker and $Sem$ is the leaf semantic category of the case element noun: $e = \{(p_1, Sem_1), \ldots, (p_n, Sem_n)\}$</Paragraph> <Paragraph position="3"> A thesaurus of nouns is regarded as a tree in which each node represents a semantic category. 
We define a thesaurus of nouns as a rooted directed tree $(SC, E_t)$, where $SC$ is the set of semantic categories and $E_t \subseteq SC \times SC$ is the set of edges.</Paragraph> <Paragraph position="5"> A noun has one (or possibly more) leaf semantic category in the thesaurus. At present we use an on-line thesaurus called Bunrui Goi Hyou (BGH) (NLRI, 1964).</Paragraph> <Paragraph position="6"> BGH has a six-layered abstraction hierarchy, and more than 60,000 Japanese words are assigned to the leaves.</Paragraph> </Section> <Section position="2" start_page="1044" end_page="1044" type="sub_section"> <SectionTitle> 2.2 Similarity Measure 2.2.1 Similarity of Semantic Categories </SectionTitle> <Paragraph position="0"> Before defining the similarity of surface case structures, we first define the similarity of semantic categories in the thesaurus. We define the similarity $sim_{sc}(Sem_1, Sem_2)$ of two semantic categories $Sem_1$ and $Sem_2$ as a monotonically increasing function of the most specific common layer $mscl(Sem_1, Sem_2)$ of $Sem_1$ and $Sem_2$ as below:</Paragraph> <Paragraph position="2"> First, we assume that the similarity measure of surface case structures has to satisfy the following requirements. The similarity should become greater if 1) the number of corresponding cases becomes greater, or 2) the corresponding case element nouns become more similar, or 3) the ratio of the number of corresponding cases to the number of cases in each surface case structure becomes greater.</Paragraph> <Paragraph position="3"> When calculating the similarity of two surface case structures $e_1$ and $e_2$, first $e_1$ and $e_2$ are matched and the set of pairs of the corresponding cases, $M(e_1, e_2)$, is constructed. A case $(p_{1i}, Sem_{1i})$ of $e_1$ corresponds to a case $(p_{2j}, Sem_{2j})$ of $e_2$ only when the surface case markers $p_{1i}$ and $p_{2j}$ are the same and $sim_{sc}(Sem_{1i}, Sem_{2j})$ is defined. 
(Footnote 2: In Japanese, there exist several topic-marking postpositional particles such as &quot;は (wa)&quot; and &quot;も (mo)&quot;, and cases marked by those topic-marking particles could correspond to cases marked by case-marking postpositional particles such as &quot;が (ga-NOM)&quot; and &quot;を (wo-ACC)&quot;. Although this paper considers case-marking postpositional particles only, the implemented system can appropriately calculate the similarity of surface case structures in which topic-marking postpositional particles appear.) Let $sim_p(m)$ be the similarity of a pair $m$ of corresponding cases; then the similarity $sim_s(e_1, e_2)$ of the two surface case structures $e_1$ and $e_2$ is defined as below: $sim_s(e_1, e_2) = \sqrt{|M|} \times \frac{\sum_{m \in M} sim_p(m)}{|M|} \times \sqrt{\frac{|M|}{|e_1|}} \times \sqrt{\frac{|M|}{|e_2|}}$ where $|M|$ is the number of corresponding cases, and $|e_1|$ and $|e_2|$ are the numbers of cases in $e_1$ and $e_2$ respectively. The first factor satisfies requirement 1), and the second satisfies 2). The third and the fourth satisfy 3).</Paragraph> <Paragraph position="4"> For example, the similarity of the surface case structures $e_1$ and $e_2$ of Examples 1 and 2 is calculated as follows.</Paragraph> <Paragraph position="5"> Example 1: kare-ga hon-wo kau (he-NOM book-ACC buy; &quot;He buys a book.&quot;) Example 2: kare-ga musuko-ni nooto-wo kau (he-NOM son-DAT notebook-ACC buy; &quot;He buys his son a notebook.&quot;) First, the set of pairs of the corresponding cases, $M(e_1, e_2)$, is constructed (nb = notebook).</Paragraph> <Paragraph position="7"> In the case of semantic categories in BGH, the results of the similarity calculation are $sim_{sc}(Sem_{he}, Sem_{he}) = 11$ and $sim_{sc}(Sem_{book}, Sem_{nb}) = 9$. 
Since $|M|$, $|e_1|$, and $|e_2|$ are 2, 2, and 3 respectively, $sim_s(e_1, e_2)$ is calculated as follows: $sim_s(e_1, e_2) = \sqrt{2} \times \frac{11 + 9}{2} \times \sqrt{\frac{2}{2}} \times \sqrt{\frac{2}{3}} \approx 11.5$</Paragraph> <Paragraph position="9"/> </Section> </Section> <Section position="4" start_page="1044" end_page="1047" type="metho"> <SectionTitle> 3 Query Generation Retrieval </SectionTitle> <Paragraph position="0"> Query generation retrieval has the following three features: 1) generation of retrieval queries from similarities, 2) efficient example retrieval through the tree structure of a thesaurus, and 3) binary search along the subsumption ordering of retrieval queries. Fig. 1 describes the framework of query generation retrieval.</Paragraph> <Paragraph position="1"> In query generation retrieval, first, given an input surface case structure, a retrieval query is generated for a certain similarity, and then example surface case structures which satisfy the similarity are retrieved from the example database. In order to generate a retrieval query which satisfies the given similarity requirement, it is necessary to enumerate all the possible patterns of surface case structures which satisfy that requirement. We define a similarity template, which enumerates all the possible patterns of calculating the similarity between two surface case structures, and collect the templates in a similarity table. The similarity table is referred to when generating retrieval queries from the input surface case structure and a certain similarity.</Paragraph> <Paragraph position="2"> A retrieval query consists of the number of cases of the example to be retrieved, the cases which the example to be retrieved should have, and semantic restrictions on case element nouns. In order to quickly retrieve examples which satisfy a retrieval query, for each surface case marker we build a sub-structure of the whole thesaurus of nouns, which we call a sub-thesaurus. 
Examples which satisfy the requirements in a retrieval query are quickly retrieved through the tree structure of those sub-thesauri.</Paragraph> <Paragraph position="3"> In our query generation retrieval, it is necessary to control the retrieval process effectively by providing similarities in a certain order and to retrieve the most similar examples as fast as possible. In this paper, we use binary search along the subsumption ordering of retrieval queries. It is possible to define a subsumption relation between two retrieval queries. Such a subsumption relation between retrieval queries results in a subsumption relation between the sets of retrieved examples. This means that a set of retrieved examples subsumes another set if the retrieval query of the former set subsumes, or in other words, is more general than, that of the latter set. With those subsumption relations of retrieval queries and of the sets of retrieved examples, it becomes possible to efficiently binary-search for the set of examples retrieved by the most specific retrieval query.</Paragraph> <Paragraph position="4"> Sections 3.1 to 3.3 describe those three features, and section 3.4 evaluates the framework.</Paragraph> <Section position="1" start_page="1045" end_page="1046" type="sub_section"> <SectionTitle> 3.1 Retrieval Query Generation from Similarities </SectionTitle> <Paragraph position="0"> A retrieval query $q$ is defined as a pair $(l_{db}, csp)$, where $l_{db}$ is the number of cases of the example to be retrieved, and $csp$ is the requirement on cases and semantic restrictions of case element nouns, which we call a case structure pattern. A case structure pattern is represented as a set of pairs $(p, Sem)$ of a surface case marker $p$ and a semantic category $Sem$, where $Sem$ is not necessarily a leaf semantic category. 
It requires that for each element $(p, Sem)$ in $csp$, the example to be retrieved has a case marked by $p$ and the case element noun satisfies the semantic restriction of the semantic category $Sem$.</Paragraph> <Paragraph position="1"> For example, for the verb &quot;買う (buy)&quot;, the following $q_1$ requires that the example to be retrieved should have three cases, the case element noun of the &quot;が (ga-NOM)&quot; case should be &quot;彼 (he)&quot;, and that of the &quot;を (wo-ACC)&quot; case should belong to the semantic category &quot;st&quot; (= &quot;stationery&quot;). $q_1$ retrieves examples like &quot;彼が息子にノートを買う&quot; (He buys his son a notebook.) and &quot;彼が娘に鉛筆を買う&quot; (He buys his daughter pencils.).</Paragraph> <Paragraph position="3"> We introduce the notion of a similarity template in order to enumerate all the possible patterns of calculating the similarity between two surface case structures. In the case of the similarity measure defined in section 2.2.2, a similarity template is represented as a 3-tuple $(l_{in}, l_{db}, CS)$, where $l_{in}$ and $l_{db}$ correspond to the number of cases of the input and of the example respectively, and they are supposed to be less than or equal to the predetermined maximum number $l_{max}$. $CS$ is the multiset of the similarities between corresponding case element nouns. 
For example, in the case of Examples 1 and 2 in section 2.2.2, the result of the similarity calculation is represented as the similarity template $(2, 3, \{11, 9\})$ (supposing that the former example is the input and the latter is from the example database).</Paragraph> <Paragraph position="4"> All the possible combinations of $l_{in}$, $l_{db}$, and $CS$ can be enumerated beforehand without any inputs or examples, provided that the maximum case number $l_{max}$ is given.</Paragraph> <Paragraph position="5"> If $l_{max}$ is 3, the number of possible combinations of $l_{in}$, $l_{db}$, and $CS$ is 203.</Paragraph> <Paragraph position="6"> Similarity templates are collected in the similarity table and referred to when generating retrieval queries from the input and a certain similarity. The following shows how to generate a retrieval query from an input $e_{in}$ and a similarity template $t = (|e_{in}|, l_{db}, CS)$.</Paragraph> <Paragraph position="7"> The retrieval query to be generated is denoted as $q = (l_{db}, csp)$, where $l_{db}$ corresponds to the number of cases in the example to be retrieved and is the same as $l_{db}$ in $t$. $CS$ in $t$ is the multiset of the similarities between corresponding case element nouns. When constructing the case structure pattern $csp$ from $e_{in}$ and $CS$, we use an injection to map each similarity $sim$ in $CS$ to a case $(p, Sem_{in})$ in $e_{in}$. 
For each $(p, Sem_{in})$ to which a similarity $sim$ is mapped, a case $(p, Sem)$ is collected into $csp$, where the semantic category $Sem$ satisfies $sim_{sc}(Sem_{in}, Sem) = sim$.</Paragraph> <Paragraph position="8"> For example, let the input $e_{in}$ be the surface case structure of Example 1 and the similarity template $t$ be $(2, 3, \{11, 9\})$; then there exist two possible injections from $CS$ into $e_{in}$, and two retrieval queries are generated as below ($sim_{sc}(Sem_N, Sem_{N,x}) = x$):</Paragraph> <Paragraph position="10"/> </Section> <Section position="2" start_page="1046" end_page="1046" type="sub_section"> <SectionTitle> 3.2 Efficient Example Retrieval with Sub-Thesaurus </SectionTitle> <Paragraph position="0"> Each element $(p, Sem)$ in a case structure pattern $csp$ requires that the example surface case structure has a case marked by $p$ and that the case element noun satisfies the semantic restriction of the semantic category $Sem$. Given the example database, it is possible to collect beforehand the examples which satisfy the requirement $(p, Sem)$. For each case marker $p$, we collect all those sets of examples. Since all the semantic categories form the whole thesaurus of nouns, the non-empty sets of those collected examples also form a sub-structure of the whole thesaurus of nouns. We call it a sub-thesaurus for the case marker $p$.</Paragraph> <Paragraph position="1"> Fig. 2 shows an example of the sub-thesaurus for the &quot;を (wo-ACC)&quot; case, supposing that the example database contains Examples 1 and 2 in section 2.2.2. The most specific common layer of $Sem_{book}$ and $Sem_{nb}$ is layer 5, and the example set is $\{Eg_1\}$ or $\{Eg_2\}$ at layers 6 and 7 (leaf), and $\{Eg_1, Eg_2\}$ above layer 6.</Paragraph> <Paragraph position="2"> Given a requirement $(p, Sem)$ and a sub-thesaurus for $p$, examples which satisfy the requirement are quickly retrieved through the tree structure of the sub-thesaurus in constant time. 
Examples which satisfy all the requirements in $csp$ are obtained as the intersection of all those sets of retrieved examples.</Paragraph> <Paragraph position="3"> The Size of a Sub-Thesaurus: We estimate the size of all the sub-thesauri by the total number of elements in the nodes of those sub-thesauri.</Paragraph> <Paragraph position="4"> Let $N$ be the total number of examples in the example database, $d$ be the depth of the whole thesaurus of nouns, and $l_{max}$ be the maximum case number. A case element noun in an example appears in a leaf node and in all of the ancestor nodes of that leaf in a sub-thesaurus, and thus appears $d$ times. Since the number of case element nouns in an example is at most $l_{max}$, the number of case element nouns in the example database is at most $N \times l_{max}$, and the order of the size of all the sub-thesauri is at most $N \times l_{max} \times d$, i.e., $O(N)$ ($l_{max} \times d$ is constant).</Paragraph> </Section> <Section position="3" start_page="1046" end_page="1047" type="sub_section"> <SectionTitle> 3.3 Binary Search along Subsumption Ordering of Retrieval Queries 3.3.1 Subsumption Relation </SectionTitle> <Paragraph position="0"> A subsumption relation can be defined between two retrieval queries, and it results in a subsumption relation between the sets of retrieved examples. For example, in the case of the following two retrieval queries $q_1$ and $q_2$, $q_2$ has a requirement on the &quot;に (ni-DAT)&quot; case while $q_1$ does not, and the requirement on the &quot;が (ga-NOM)&quot; case is more specific in $q_2$ than in $q_1$. 
Thus, $q_2$ is more specific than $q_1$, or in other words, $q_1$ subsumes $q_2$.</Paragraph> <Paragraph position="2"> Furthermore, $q_1$ and $q_2$ are generated from the similarity templates $t_1 = (2, 3, \{5, 9\})$ and $t_2 = (2, 3, \{11, 9, 5\})$ respectively, and a subsumption relation of similarity templates holds between $t_1$ and $t_2$ as well.</Paragraph> <Paragraph position="3"> With the subsumption relations of retrieval queries and of the sets of retrieved examples, it is possible to efficiently binary-search for the set of examples retrieved by the most specific retrieval query. In the following, we describe how to organize the set of all the similarity templates as a similarity table and how to realize the binary search for the most similar examples.</Paragraph> <Paragraph position="4"> First, the set of all the similarity templates is divided into a sequence $T_1, \ldots, T_n$ which is totally ordered by the following subsumption relation. Let $e_{in}$ be the input, let $T_i, T_j$ ($i &lt; j$) be sets of similarity templates in the sequence $T_1, \ldots, T_n$, and let $EG_i, EG_j$ be the sets of examples retrieved by all the similarity templates in $T_i$ and $T_j$ respectively. Then, $T_i$ subsumes $T_j$ if and only if 1) $EG_i$ subsumes $EG_j$, and 2) the sets of retrieved examples are totally ordered by similarity, i.e., $\forall e_i \in EG_i\ \forall e_j \in EG_j,\ sim_s(e_{in}, e_i) &lt; sim_s(e_{in}, e_j)$. In the case of the similarity measure defined in section 2.2.2, if $l_{max}$ is 3, the length of the sequence $T_1, \ldots, T_n$ in the similarity table is 7 when $l_{in}$ is 1, 9 when $l_{in}$ is 2, and 11 when $l_{in}$ is 3.</Paragraph> <Paragraph position="5"> With this subsumption ordering, the most similar examples are obtained by finding the most specific $T_i$ with non-empty $EG_i$ and then finding the most similar examples in $EG_i$. Since $EG_i = \emptyset$ means $EG_j = \emptyset$ for any $j > i$, this search process can be efficiently realized by binary-searching the sequence $T_1, \ldots, T_n$. 
This sequence $T_1, \ldots, T_n$ can be regarded as a table of similarity templates and is called the similarity table.</Paragraph> <Paragraph position="6"> Fig. 3 shows the binary search of the similarity table. The example space is partitioned by the subsumption relation. The most similar examples are found in the innermost non-empty set. This binary search method makes efficient retrieval possible whether the example database contains similar examples or not.</Paragraph> </Section> <Section position="4" start_page="1047" end_page="1047" type="sub_section"> <SectionTitle> 3.4 Evaluation </SectionTitle> <Paragraph position="0"> The framework of query generation retrieval consists of three major components, i.e., the example database, the similarity table, and the set of sub-thesauri. Let $N$ be the size of the example database. The size of the similarity table is independent of $N$, and that of the set of sub-thesauri is $O(N)$. Thus, the total size of the system is $O(N)$.</Paragraph> <Paragraph position="1"> In order to evaluate the computational cost, we plot the computation time (in CPU time) while increasing the number of examples $N$, and compare the result with a full retrieval program. The example database contains example surface case structures of the Japanese verb &quot;買う (buy)&quot;, and both programs retrieve the most similar examples from the example database, given an input surface case structure of the same verb &quot;買う (buy)&quot;. For the query generation retrieval program, the maximum number $l_{max}$ of cases was 3. The full retrieval program calculates the similarity between the input and the example for all the examples in the example database, and retrieves the examples with the greatest similarity. Both programs are implemented in SICStus Prolog 2.1 on a SPARCstation 10. Fig. 4 illustrates the results. 
The computation time of the full retrieval program is proportional to $N$, while that of the query generation retrieval program is nearly constant. Thus, our query generation retrieval program achieves a drastic reduction in computational cost compared with the full retrieval program.</Paragraph> </Section> </Section> </Paper>