<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1032"> <Title>Extracting Key Semantic Terms from Chinese Speech Query for Web Searches</Title>
<Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Query Model (QM) </SectionTitle>
<Paragraph position="0"> The query model (QM) is used to analyze the query and extract the core semantic string (CSS), which carries the main semantics of the query. A query model has two main components. The first is the query component dictionary, a set of phrases with specific semantic functions, such as polite remarks, prepositions, and time expressions. The second is the set of query structures, each of which defines a sequence of acceptable semantically tagged tokens, such as &quot;Begin, Core Semantic String, Question Phrase, End&quot;. Each query structure also includes its probability of occurrence within the query corpus. Table 2 gives some examples of query structures.</Paragraph>
<Section position="1" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 3.1 Query Model Generation </SectionTitle>
<Paragraph position="0"> In order to come up with a set of generalized query structures, we use a query log of typical queries posed by users. The query log consists of 557 queries, collected from twenty-eight human subjects at the Shanghai Jiao Tong University (Ying 2002).</Paragraph>
<Paragraph position="1"> Each subject was asked to pose 20 separate queries to retrieve general information from the Web.</Paragraph>
<Paragraph position="2"> After analyzing the queries, we derive a query model comprising 51 query structures and a set of query components. For each query structure, we compute its probability of occurrence, which is used to determine the most likely structure containing the CSS in case multiple CSSs are found.</Paragraph>
<Paragraph position="3"> As part of the analysis of the query log, we classify the query components into ten classes, as listed in Table 1. These ten classes are called semantic tags.</Paragraph>
<Paragraph position="4"> They can be further divided into two main categories: the closed classes and the open class. Closed classes are those that have relatively fixed word lists.</Paragraph>
<Paragraph position="5"> These include question phrases, quantifiers, polite remarks, prepositions, time expressions, and commonly used verb and subject-verb phrases. We collect all the phrases belonging to closed classes from the query log and store them in the query component dictionary. The open class is the CSS, which we do not know in advance. A CSS typically includes persons' names, events, countries' names, etc.</Paragraph>
<Paragraph position="7"> An example query is &quot;Give me some information about Bin Laden&quot;.</Paragraph>
<Paragraph position="8"> Given the set of sample queries, a heuristic rule-based approach is used to analyze the queries and break them into basic components with assigned semantic tags by matching the words listed in Table 1. Any sequence of words or phrases not found in the closed classes is tagged as CSS (with semantic tag 9). We can thus derive the query structures of the form given in Table 2. A small sketch of this analysis step is given below.</Paragraph> </Section>
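The following is a minimal sketch, not the authors' code, of how tagged sample queries could yield the query-structure inventory and its occurrence probabilities described in Section 3.1. Greedy longest-match lookup in the component dictionary is our assumption about the unspecified heuristic rules; the dictionary entries and tag numbers are illustrative placeholders.

```python
from collections import Counter

CSS_TAG = 9  # semantic tag for the open class (CSS), as in the paper

def tag_query(query, component_dict):
    """Return the tag sequence (query structure) of one query string.

    component_dict maps a closed-class phrase to its semantic tag (Table 1).
    Spans not covered by any closed-class phrase are tagged as CSS.
    """
    max_len = max(len(p) for p in component_dict)
    tags, i, unknown = [], 0, ""
    while i < len(query):
        # greedy longest match of a closed-class phrase starting at position i
        match = next((query[i:i + l]
                      for l in range(min(max_len, len(query) - i), 0, -1)
                      if query[i:i + l] in component_dict), None)
        if match:
            if unknown:                      # flush a pending unknown span as CSS
                tags.append(CSS_TAG)
                unknown = ""
            tags.append(component_dict[match])
            i += len(match)
        else:
            unknown += query[i]
            i += 1
    if unknown:
        tags.append(CSS_TAG)
    return tags

def build_query_model(queries, component_dict):
    """Map each observed query structure to its probability of occurrence."""
    counts = Counter(tuple(tag_query(q, component_dict)) for q in queries)
    total = sum(counts.values())
    return {structure: n / total for structure, n in counts.items()}
```

Applied to the 557-query log, such a procedure would produce structure signatures like [6, 9, 4] together with their relative frequencies, which the later selection step (Eqn 1) consumes.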
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Modeling of Query Structure as FSA </SectionTitle>
<Paragraph position="0"> Due to speech recognition errors, we do not expect the query components, and hence the query structure, to be recognized correctly. Instead, we parse the query structure in order to isolate and extract the CSS. To facilitate this, we employ a Finite State Automaton (FSA) to model the query structure. The FSA models the expected sequences of tokens in typical queries and annotates them with semantic tags, including CSS. An FSA is defined for each of the 51 query structures. An example of an FSA is given in Figure 2.</Paragraph>
<Paragraph position="1"> Because CSS is an open set, we do not know its content in advance. Instead, we use the following two rules to determine the candidates for CSS: (a) it is an unknown string not present in the query component dictionary; and (b) its length is not less than two, as the average length of concepts in Chinese is greater than one (Wang 1992).</Paragraph>
<Paragraph position="2"> At each stage of parsing the query using the FSA (Hobbs et al 1997), we need to decide which state to proceed to and how to handle unexpected tokens in the query. Thus at each stage, the FSA performs three functions: a) Goto function: it maps a pair consisting of a state and an input symbol into a new state or the fail state. We use g(N,X) = N' to define the goto function from State N to State N', given the occurrence of token X.</Paragraph>
<Paragraph position="3"> b) Fail function: it is consulted whenever the goto function reports a failure on encountering an unexpected token. We use f(N) = N' to represent the fail function.</Paragraph>
<Paragraph position="4"> c) Output function: certain states of the FSA are designated as output states, which indicate that a sequence of tokens has been found and is tagged with the appropriate semantic tag.</Paragraph>
<Paragraph position="5"> To construct the goto function, we begin with a graph consisting of one vertex, which represents State 0. We then enter each token X into the graph by adding a directed path that begins at the start state. New vertices and edges are added to the graph so that there will be, starting at the start state, a path in the graph that spells out the token X. The token X is added to the output function of the state at which the path terminates. For example, suppose that our query component dictionary consists of the seven phrases glossed &quot;please help me&quot;, &quot;some&quot;, &quot;about&quot;, &quot;news&quot;, &quot;collect&quot;, &quot;tell me&quot;, and &quot;what do you have&quot;. Adding these tokens into the graph results in the FSA shown in Figure 2. The path from State 0 to State 3 spells out the phrase &quot;please help me&quot;, and on completion of this path, we associate its output with semantic tag 6. Similarly, the output of &quot;some&quot; is associated with State 5 and semantic tag 4, and so on. A sketch of this construction and of the parsing loop follows.</Paragraph>
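The sketch below is our simplified rendering of the goto/output machinery just described, not the authors' implementation: each closed-class phrase is entered into a trie whose states are numbered from 0, the state where a phrase ends becomes an output state carrying the phrase and its semantic tag, and the fail function is reduced here to restarting from State 0. Unknown spans of length at least two are emitted as CSS candidates, following rules (a) and (b). Phrases and tag numbers are illustrative placeholders.

```python
def build_fsa(phrases):
    """phrases: dict mapping phrase -> semantic tag. Returns (goto, output)."""
    goto = {0: {}}                       # goto[state][symbol] -> next state
    output = {}                          # output[state] -> (phrase, semantic tag)
    next_state = 1
    for phrase, tag in phrases.items():
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto[state][ch] = next_state
                goto[next_state] = {}
                next_state += 1
            state = goto[state][ch]
        output[state] = (phrase, tag)    # e.g. "please help me" -> tag 6
    return goto, output

def parse(query, goto, output, css_tag=9):
    """Scan a (possibly mis-recognized) query and return (token, tag) pairs."""
    tokens, unknown = [], ""
    state, start, i = 0, 0, 0

    def flush():
        nonlocal unknown
        if len(unknown) >= 2:            # rule (b): CSS candidates have length >= 2
            tokens.append((unknown, css_tag))
        unknown = ""

    while i < len(query):
        ch = query[i]
        if ch in goto[state]:
            state = goto[state][ch]
            if state in output:          # output state: a closed-class phrase found
                flush()
                tokens.append(output[state])
                state, start = 0, i + 1
            i += 1
        else:
            # goto failure: move one character into the unknown span and
            # restart matching from the following position
            unknown += query[start]
            start += 1
            i = start
            state = 0
    unknown += query[start:]             # any unfinished partial match is unknown
    flush()
    return tokens
```

With a toy dictionary such as {"please help me": 6, "some": 4}, any unrecognized span between two matched phrases is emitted with tag 9, which mirrors the walk-through that follows.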
<Paragraph position="6"> We now use an example to illustrate the process of parsing a query. Suppose the user issues a speech query meaning &quot;please help me to collect some information about Bin Laden&quot;. The speech recognition result, however, contains errors.</Paragraph>
<Paragraph position="8"> The FSA begins at State 0. When the system encounters the sequence of characters glossed &quot;please&quot;, &quot;help&quot;, &quot;me&quot;, the state changes from 0 to 1, 2 and eventually to 3. At State 3, the system recognizes a polite remark phrase and outputs a token with semantic tag 6.</Paragraph>
<Paragraph position="9"> Next, the system meets the character glossed &quot;receive&quot; and transits to State 10, because g(0, &quot;receive&quot;) = 10. When the system sees the next character, glossed &quot;send&quot;, which does not have a corresponding transition rule, the goto function reports a failure. Because the length of the string is two and the string is not in the query component dictionary, semantic tag 9 is assigned to the token glossed &quot;receive send&quot;, according to the definition of CSS.</Paragraph>
<Paragraph position="10"> By repeating the above process, we obtain a fully tagged query, where the semantic tags are as defined in Table 1.</Paragraph>
<Paragraph position="13"> It is noted that, because of speech recognition errors, the system detects two CSSs, and both of them contain speech recognition errors.</Paragraph> </Section>
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 CSS Extraction by Query Model </SectionTitle>
<Paragraph position="0"> Given that we may find multiple CSSs, the next stage is to analyze the CSSs found, along with their surrounding context, in order to determine the most probable CSS. The approach is based on the premise that choosing the best sense for an input vector amounts to choosing the most probable sense given that vector. The input vector i has three components: the left context (Li), the CSS itself (CSSi), and the right context (Ri). The likelihood si of such a structure occurring in the query model is computed as follows:</Paragraph>
<Paragraph position="1"> si = Σ_{j=1..n} Cij · pj    (1)</Paragraph>
<Paragraph position="2"> where Cij is set to 1 if the input vector i (Li, Ri) matches the left and right CSS context of query structure j, and 0 otherwise; pj is the probability of occurrence of the jth query structure; and n is the total number of structures in the query model. Note that Equation (1) gives a detected CSS a higher weight if it matches more query structures with higher occurrence probabilities. We simply select the best CSSi such that si = max_k sk, with sk computed according to Eqn (1).</Paragraph>
<Paragraph position="5"> For illustration, let us consider the above example with two detected CSSs. The two CSS vectors are [6, 9, 4] and [7, 9, 3]. From the query model, we know that the probability of occurrence, pj, of structure [6, 9, 4] is 0, and that of structure [7, 9, 3] is 0.03, with the latter matching only one structure. Hence the si values are 0 and 0.03 respectively. Thus the most probable core semantic structure is [7, 9, 3], and the CSS glossed &quot;half pull light&quot; (a homophone mis-recognition of &quot;Bin Laden&quot;) is extracted. A sketch of this selection step is given below.</Paragraph> </Section> </Section>
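The following is a small sketch of the selection rule of Eqn (1); the way query structures are represented here (full tag sequences with tag 9 marking the CSS slot) and the second model entry are our illustrative assumptions, while the example scores reproduce the 0 and 0.03 values quoted above.

```python
def css_score(left_tag, right_tag, structures):
    """Eqn (1): sum p_j over structures whose CSS context matches (left, right).

    structures: list of (tag_sequence, probability) entries from the query model.
    """
    score = 0.0
    for tags, p_j in structures:
        for k, t in enumerate(tags):
            # a CSS slot (tag 9) with both a left and a right neighbor
            if t == 9 and 0 < k < len(tags) - 1:
                if tags[k - 1] == left_tag and tags[k + 1] == right_tag:
                    score += p_j          # C_ij = 1 for this structure
    return score

# Example from the text: vectors [6, 9, 4] and [7, 9, 3]; only the second
# matches a structure in the (hypothetical) model, with probability 0.03.
structures = [([7, 9, 3], 0.03), ([6, 3, 9, 1], 0.02)]
candidates = [("receive send", 6, 4), ("half pull light", 7, 3)]
best = max(candidates, key=lambda c: css_score(c[1], c[2], structures))
# best -> ("half pull light", 7, 3), i.e. the CSS of structure [7, 9, 3]
```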
<Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Query Terms Generation </SectionTitle>
<Paragraph position="0"> Because of speech recognition errors, the CSS obtained is likely to contain errors or, in the worst case, to miss the main semantics of the query altogether. We now discuss how we alleviate the errors in the CSS for the former case. We first break the CSS into one or more basic semantic parts, and then apply the multi-tier method to map the query components to known phrases.</Paragraph>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Breaking CSS into Basic Components </SectionTitle>
<Paragraph position="0"> In many cases, the CSS obtained may be made up of several semantic components equivalent to base noun phrases. Here we employ a technique based on Chinese cut marks (Wang 1992) to perform the segmentation. The Chinese cut marks are tokens that can separate a Chinese sentence into several semantic parts. Zhou (1997) used such a technique to detect new Chinese words, and reported good results with precision and recall of 92% and 70% respectively. By separating the CSS into basic key components, we can limit the propagation of errors.</Paragraph> </Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 Multi-tier query term mapping </SectionTitle>
<Paragraph position="0"> In order to further eliminate the speech recognition errors, we propose a multi-tier approach that maps the basic components of the CSS to known phrases using a combination of matching techniques. To do this, we build a phrase dictionary containing typical concepts used in general and specific domains. Most basic CSS components should map to one of these phrases. Thus, even if a basic component contains errors, as long as we can find a sufficiently similar phrase in the phrase dictionary, we can use it in place of the erroneous CSS component, thereby eliminating the errors.</Paragraph>
<Paragraph position="1"> We collected a phrase dictionary containing about 32,842 phrases, covering mostly base noun phrases and named entities. The phrases are derived from two sources. We first derived a set of common phrases from a digital dictionary and the logs of the search engine used at the Shanghai Jiao Tong University. We also derived a set of domain-specific phrases by extracting the base noun phrases and named entities from on-line news articles obtained during the same period. This approach is reasonable, as in practice we can use recent web or news articles to extract concepts and update the phrase dictionary.</Paragraph>
<Paragraph position="2"> Given the phrase dictionary, the next problem is to map the basic CSS components to the nearest phrases in the dictionary. As the basic components may contain errors, we cannot match them exactly at the character level alone. We thus propose to match each basic component with the known phrases in the dictionary at three levels: (a) the character level; (b) the syllable string level; and (c) the confusion syllable string level. The purpose of matching at levels (b) and (c) is to overcome the homophone problem in the CSS. For example, the phrase glossed &quot;Laden&quot; is wrongly recognized by the speech recognition engine as a homophone string glossed &quot;pull lamp&quot;. Such errors cannot be resolved at the character matching level, but they can probably be matched at the syllable string level. The confusion matrix is used to further reduce the effect of speech recognition errors due to similar-sounding characters.</Paragraph>
<Paragraph position="3"> To account for possible errors in the CSS components, we perform similarity matching, instead of exact matching, at the three levels. Given a basic CSS component qi and a phrase cj in the dictionary, we compute their similarity as: sim(qi, cj) = SM / max(|qi|, |cj|), with SM = Σ_{k=1..LCS(qi,cj)} Mk (2), where LCS(qi, cj) gives the number of characters/syllables matched between qi and cj in the order of their appearance, using the longest common subsequence (LCS) algorithm (Cormen et al 1990). Mk is introduced to account for the similarity between the two matching units, and depends on the level of matching. If the matching is performed at the character or syllable string level, the basic matching unit is one character or one syllable and the similarity between the two matching units is 1. If the matching is done at the confusion syllable string level, Mk is the corresponding coefficient in the confusion matrix. Hence LCS(qi, cj), normalized by the maximum length of qi or cj, gives the degree of match between qi and cj, and SM gives the accumulated similarity of the units being matched. A sketch of this measure follows.</Paragraph>
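The sketch below renders Eqn (2) under our assumption that SM accumulates the per-unit similarities Mk along the longest common subsequence, i.e. a weighted LCS; with exact unit matching it reduces to the plain LCS count normalized by the longer string, as described above. Function names are ours.

```python
def similarity(q, c, unit_sim):
    """Eqn (2): q, c are sequences of characters or syllables; returns sim(q, c)."""
    n, m = len(q), len(c)
    # dp[i][j] = best accumulated SM between q[:i] and c[:j] (weighted LCS)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + unit_sim(q[i - 1], c[j - 1]))
    return dp[n][m] / max(n, m)          # normalized by the longer of the two

def exact_unit(a, b):
    """Character- or syllable-level matching: units either match (M_k = 1) or not."""
    return 1.0 if a == b else 0.0

def confusion_unit(a, b, confusion):
    """Confusion-syllable level: M_k is the coefficient for the similar-sounding pair."""
    return 1.0 if a == b else confusion.get((a, b), 0.0)
```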
<Paragraph position="4"> The three levels of matching range from more exact at the character level to less exact at the confusion syllable level. Thus, if we can find a relevant phrase with sim(qi, cj) above a preset threshold at the higher character level, we do not perform further matching at the lower levels. Otherwise, we relax the constraint and perform the matching at successively lower levels, probably at the expense of precision. The details of the algorithm are as follows: Input: basic CSS component qi. a. Match qi with the phrases in the dictionary at the character level using Eqn (2).</Paragraph>
<Paragraph position="5"> b. If we cannot find a match, then match qi with the phrases at the syllable level using Eqn (2).</Paragraph>
<Paragraph position="6"> c. If we still cannot find a match, match qi with the phrases at the confusion syllable level using Eqn (2).</Paragraph>
<Paragraph position="7"> d. If a match is found, set q'i = cj; otherwise set q'i = qi.</Paragraph>
<Paragraph position="8"> For example, consider a query meaning &quot;please tell me some news about Iraq&quot;. Even if the query is wrongly recognized by the speech recognition engine, as long as we can correctly extract the CSS &quot;Iraq&quot; from the mis-recognized query, we can ignore the speech recognition errors in the other parts of the query. Even if there are errors in the extracted CSS, such as a string glossed &quot;chen waterside&quot; instead of &quot;Chen Shui-bian&quot;, we can apply syllable string level matching to correct the homophone errors. For CSS errors such as a string glossed &quot;corrupt usually&quot; instead of the correct CSS &quot;Taliban&quot;, which cannot be corrected at the syllable string matching level, we can apply confusion syllable string matching to overcome the error. A sketch of the overall mapping procedure is given below.</Paragraph> </Section> </Section> </Paper>
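A sketch of steps a-d, reusing similarity, exact_unit and confusion_unit from the sketch after Eqn (2). The threshold value, the toy pinyin table and the phrase set are illustrative assumptions only, not values from the paper.

```python
THRESHOLD = 0.75                                    # hypothetical acceptance threshold

PINYIN = {"拉": "la", "登": "deng", "灯": "deng"}    # toy character-to-pinyin table

def to_syllables(s):
    """Convert a string to its syllable sequence; a real system needs a full table."""
    return [PINYIN.get(ch, ch) for ch in s]

def map_component(q, phrase_dict, confusion):
    """Map a basic CSS component q to the closest known phrase, else keep q."""
    def best(view, unit_sim):
        score, phrase = max((similarity(view(q), view(c), unit_sim), c)
                            for c in phrase_dict)
        return phrase if score > THRESHOLD else None

    hit = best(lambda s: s, exact_unit)                            # a. character level
    if hit is None:                                                # b. syllable level
        hit = best(to_syllables, exact_unit)
    if hit is None:                                                # c. confusion syllable level
        hit = best(to_syllables, lambda a, b: confusion_unit(a, b, confusion))
    return hit if hit is not None else q                           # d. q'_i = c_j or q_i

# e.g. map_component("拉灯", {"拉登"}, {}) -> "拉登": the homophone error that
# character matching cannot fix is recovered at the syllable level (step b).
```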