File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/w01-1014_metho.xml
Size: 17,561 bytes
Last Modified: 2025-10-06 14:07:44
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1014"> <Title>Automatic Augmentation of Translation Dictionary with Database Terminologies in Multilingual Query Interpretation</Title> <Section position="4" start_page="3" end_page="3" type="metho"> <SectionTitle> 3 Translation with CCG </SectionTitle> <Paragraph position="0"> In this section, we discuss translation with and without an intermediate language. CCG-based translation can derive expressions and queries in target database languages such as SQL, TSQL, and QUBE, as well as expressions in intermediate representation languages. We show translation into both kinds of language with examples (Nelken and Francez, 2000).</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.1 Indirect vs. Direct Translation </SectionTitle> <Paragraph position="0"> Most NLDBs use an intermediate representation whose expressions do not correspond directly to real database objects.</Paragraph> <Paragraph position="1"> The intermediate representations are usually written as logical expressions, such as a quasi-logical form (Klein et al., 1998) or a customized language (Androutsopoulos et al., 1998; Nelken and Francez, 2000). These representations provide a way to translate indirectly to the target database languages.</Paragraph> <Paragraph position="2"> For example, query 6a is translated into the intermediate expression 6b (Nelken and Francez, 1999; Toman, 1996), and into 6c with SQL/Temporal expressions (Nelken and Francez, 2000).</Paragraph> <Paragraph position="3"> (6) (a) During which years did Mary work in marketing? 
(b) year(I) ∧ ∃J (work(mary, marketing, J) ∧ J ⊆ past ∧ J ⊆ I) (c) NONSEQUENCED VALIDTIME SELECT DISTINCT a0.c1 AS c1 FROM work' AS a1,year' AS a0 WHERE VALIDTIME(a0) contains VALIDTIME(a1) AND a1.c1 = 'mary' AND a1.c2 = 'marketing' AND PERIOD(TIMESTAMP 'beginning', TIMESTAMP 'now') contains VALIDTIME(a1) Translation via an intermediate representation has several advantages, including (a) the availability of an independent linguistic front end, (b) the separation of domain-dependent knowledge from the system engine, and (c) the relative ease of augmenting the system with an extra inference module for disambiguation (cf. Androutsopoulos et al., 1995). Points (a) and (b) reflect the separation of domain-dependent resources such as the lexicon, database mapping information, and other knowledge bases. Point (c) arises from the modularity of the translation process.</Paragraph> <Paragraph position="5"> When we use an intermediate language, we do not need to concern ourselves with the syntactic details of the target query language during the mapping process, so we can pay more attention to the differences in syntax between the two source languages (i.e. English and Korean), which makes the resulting interpretation more reliable.</Paragraph> <Paragraph position="6"> In addition, the use of an intermediate language yields a more flexible query interpretation system, as queries can be translated into multiple target query languages without further processing at the source query interpretation stage. 
However, using the same intermediate language for source languages such as English and Korean, which are known to have very different linguistic characteristics, makes it difficult to capture subtle differences between queries in the different source languages unless the intermediate language is quite expressive.</Paragraph> <Paragraph position="7"> Moreover, much of the expressiveness that the intermediate language needs for translating queries in one language may not be needed for translating queries in the other.</Paragraph> <Paragraph position="8"> Translation without an intermediate representation is a simpler and more straightforward process. It also avoids the extra effort of developing a formal intermediate representation, for which full coverage of linguistic expressiveness and the soundness of the formalism are difficult to ensure. Nevertheless, the three advantages (a)-(c) mentioned above are thought to be difficult to achieve in this approach. Points (a) and (b), however, can equally be achieved by separating domain-dependent elements from the query processing module using lexicalized grammars such as CCG. In this case, the construction of a domain-dependent lexicon can be a problem, but it can be resolved to some extent with an automatic construction method. Point (c) is difficult to address, since translation without an intermediate representation is usually done in a single module. The missing inference module, however, can be compensated for by disambiguation using co-occurrence information (Park and Cho, 2000) and by disambiguation of domain-dependent word senses that takes into account context-dependent information such as information structure (Steedman, 2000). Nelken and Francez (2000) use an intermediate representation because it makes the compositional construction of formulae during parsing easier. 
However, we show that database queries can be interpreted compositionally during parsing without such an intermediate representation through direct translation.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.2 Translation to an Intermediate Representation </SectionTitle> <Paragraph position="0"> While our approach does not make use of an intermediate representation, the CCG framework itself allows queries to be interpreted into an intermediate representation. Figure 1 shows the translation process from the query 6a to the form 6b. Since we are only showing the possibility of translation, we use an example from (Nelken and Francez, 2000). In Figure 1, we slightly modified the semantics in (Nelken and Francez, 2000; Nelken and Francez, 1999) for the convenience of translation. For the same reason, we devised the operator σ(x, I), where x is an argument and I represents a time interval variable.</Paragraph> </Section> <Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.3 Translation to a Target Language </SectionTitle> <Paragraph position="0"> Figure 2 shows the translation process from the query 6a to SQL/Temporal expression 6c, also indicating the need for post-processing.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> Language </SectionTitle> <Paragraph position="0"> For instance, in Figure 2, multiply occurring constants and the uninstantiated variable '_' must be discarded. 
Additionally, '&' in the result of Figure 2 must be mapped to 'AND', and additional information such as 'NONSEQUENCED VALIDTIME' and 'DISTINCT' must be added to generate complete target results as in 6c.</Paragraph> </Section> </Section> <Section position="5" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Multilingual Translation </SectionTitle> <Paragraph position="0"> Source word disambiguation is an important problem in both of the approaches mentioned in the previous section because the problem of lexical selection arises equally in each. We propose a method to translate and disambiguate the source queries to the appropriate target database information in a direct translation approach.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.1 Word Sense Disambiguation and Target Mapping </SectionTitle> <Paragraph position="0"> Our method to disambiguate the source queries is based on the semantic features of the lexical items. In lexical selection methods using semantic features and their syntactic relations (Palmer et al., 1999; Copestake and Sanfilippo, 1993), the lexicon is designed with semantic type-features constructed from the semantic classifications of a language for the collocated verb-object and modifier-modifiee relations. We also consider these two syntactic relations, but we do not adopt general semantic classifications, which are hard to construct automatically. Adopting them would also require additional information for mapping to the domain databases. We therefore designed a method in which database information plays the role of semantic classifications in the restricted database domain.</Paragraph> <Paragraph position="1"> In query 1b, the meaning of 'wears' is 'to put on the body', but in 1c, its meaning is 'to put on the foot'. The meaning of 'old' in 1c is 'not new', but that in the phrase 'the oldest man' is 'not young'. 
Table 3 shows the word senses of 'wears' and 'old' and their candidate target words (Lee et al., 1999). We can disambiguate the senses of 'wears' with information in the database, like the sample database shown in Table 1, annotated in the lexical entries. But 'old' in 1c cannot be disambiguated with the database information alone, because the values for 'old' can occur in the same table attributes, as shown in the sample database (Table 1). For this problem, we can think of two disambiguation methods.</Paragraph> <Paragraph position="2"> • Use of additional semantic type-features based on the semantic classifications • Use of co-occurrence information between the collocated words In the first method, the source queries are disambiguated during parsing, but this method requires the semantic classification information. Moreover, the semantic features from the classifications generate many lexical entries, since all the senses of a given lexical item have to be accounted for. As a result, we can expect the increase in the number of lexical entries to increase both the space and the processing time the system requires.</Paragraph> <Paragraph position="3"> The second method needs co-occurrence information, but no additional lexical entries. This method also requires an additional disambiguation process after parsing that draws on information about the collocated words. However, since co-occurrence information between words can be automatically extracted from a general-purpose corpus, the construction of this information is thought to be relatively straightforward compared to the construction of the semantic classifications. (Park and Cho, 2000; Lee et al., 1999) proposed using co-occurrence information during parsing and lexical selection.</Paragraph> <Paragraph position="4"> For example, in 1c, 'wears' is disambiguated into 'sin-ta' based on the semantics of 'shoes', and the collocated words 'old' and 'shoes' are extracted during parsing. 
Then the disambiguation module selects the preferred sense of 'old' by computing similarity over the co-occurrence information. As a result, 'old' is correctly disambiguated into the target 'nalk-ta'.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Representation of Lexical Entry </SectionTitle> <Paragraph position="0"> In a CCG framework, all the levels of information, such as syntax, semantics, and discourse, are integrated into the categorial lexicon as lexical entries. The following shows example lexical entries of a CCG for English.</Paragraph> <Paragraph position="1"> (7) (a) lex(coat,np:[_,'ip-ta',body='oy-twu'];_). (b) lex(coat,np:[_,'sa-ta',clothes='oy-twu'];_). (c) lex(wears,(s:[A,B,C];wear@B;D;E\np:[A,_,_];D)/np:[_,B,C];E). (d) lex(old,np:[A,sin-ta,status='nalk-ta'&C];old~C;E/np:[A,B,C];E). A lexical entry consists of a lexical item and its CCG category. The CCG category is a pair of syntactic and semantic information, interwoven in the following way. Elementary CCG (syntactic) categories include np and s, and CCG categories are recursively defined as either X/Y or X\Y, where X and Y are also CCG categories, including elementary categories. Each elementary CCG (syntactic) category X is augmented with appropriate semantic information Y and word disambiguation information Z, so that the resulting form X:Y;Z is a CCG category (Steedman, 1996). In our proposal, the semantic information is replaced with a suitable fragment of SQL, with slots corresponding to the SELECT, FROM, and WHERE clauses in SQL, bracketed by '[' and ']'. [Figure 3 panels: 'Who wears a brown coat?'; 'Portion of the Translation Dictionary'] 
For example, in entry 7a, 'coat' is assigned the syntactic category 'np' and semantic information encoding the fact that the database attribute 'body' has the value 'oy-twu' (meaning 'coat') in the table 'ip-ta' (meaning 'to put on the body'). 'ip-ta' goes in the FROM clause of the SQL and 'body=oy-twu' in the WHERE clause. Entry 7b shows another 'coat' instance, in the database table 'sa-ta' (meaning 'to buy'). In entries 7c and 7d, the verb 'wears' and the adjective 'old' add information in the forms X@Y and X~Y, respectively, for the disambiguation of their senses. X~Y provides a template for co-occurrence information.</Paragraph> </Section> <Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.3 Translation Process </SectionTitle> <Paragraph position="0"> Figure 3 shows a derivation of the query 1c and a relevant portion of the translation dictionary.</Paragraph> <Paragraph position="1"> This derivation does not show the binding with SQL syntax. In the final step of the derivation, the syntactic information is combined by backward application with the category sql:[SELECT A,FROM B,WHERE C]\s:[A,B,C];_. The exhibited portion of the translation dictionary lists pairs of a source word and its target word. Using this information, after the derivation in Figure 3, semantic checking is performed with the tagged information, that is, 'wear@ipta'. This tag is compared with the translation dictionary for correct sense disambiguation. Through this process, results that have a matching pair in the translation dictionary are confirmed as desired results, and the others are discarded. Because the result in Figure 3 has the correct pair in Table 3, it is selected as the right result. 
The resulting SQL statement is shown below:</Paragraph> </Section> </Section> <Section position="6" start_page="3" end_page="3" type="metho"> <SectionTitle> (8) SELECT person FROM ip-ta </SectionTitle> <Paragraph position="0"> WHERE color=kal-saik and body=oy-twu In response to the SQL statement 8, the answer 'Mary' is produced from Table 1.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.4 Construction of the Lexicon from Available Resources </SectionTitle> <Paragraph position="0"> We construct an English lexicon for multilingual queries from several linguistic resources such as an English lexicon with only POS information, a Korean lexicon for the mapping information, and an English-Korean translation dictionary. In our system, the English-Korean translation dictionary is needed in two processes. The first is adding word sense information to the lexical items in English, and the second is checking the senses of the given source word. The Korean lexicon is used for the mapping into the database, and the English lexicon with POS tags is used for extracting syntactic categories and syntactic relations between words. Figure 4 shows the information resources needed for the English and Korean lexicons. The Korean lexicon is constructed semi-automatically by a tool (Lee and Park, 2001). The tool constructs the Korean lexicon using information from a general-purpose corpus and domain-specific database information.</Paragraph> </Section> </Section> <Section position="7" start_page="3" end_page="3" type="metho"> <SectionTitle> 5 Implementation </SectionTitle> <Paragraph position="0"> Figure 5 shows the structure of the engine that processes multilingual queries. The database covers the home appliance domain in e-commerce. It contains objects for appliance information such as category, name, maker, price, size, and other features. 
We have populated the database with information from Korean shopping mall websites. Two queries are shown below: (9) (a) Who makes a flat-screen TV set? (b) SELECT maker FROM product WHERE name='B7B9FLAKBBF7ANAYAEAYF4' and category='TV' (10) (a) BMC4APB0FBAQAYF4FRG2GFFRFXGQAJAZFLELBQC3FLAQAYFLCDCFFLG2GIANAYCDCFANB0FQFRG2GFC0C4, AEAYAEB9F4FRG2GFAPB8FBAKAYAPB3F7AEAYCRCV? I want to buy a refrigerator of the smallest capacity, but what is its price? (b) SELECT price FROM product WHERE size IN</Paragraph> <Paragraph position="2"> The query processing engine is implemented on UNIX using SICStus Prolog. The word translation checking module performs disambiguation using the English-Korean dictionary (cf. Figure 3) and co-occurrence information.</Paragraph> <Paragraph position="3"> The Korean lexicon contains about a million lexical entries, but the English lexicon is much smaller and still under construction.</Paragraph> <Paragraph position="4"> The system can process diverse linguistic expressions in English, such as coordination, unbounded dependencies, and gapping. The system can also process diverse expressions in Korean, including subject ellipsis, noun phrases, numerical expressions, coordination, and subordination; the performance of the system on Korean queries is reported in (Lee and Park, 2001).</Paragraph> </Section> </Paper>