<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1120">
  <Title>A New Chinese Natural Language Understanding Architecture Based on Multilayer Search Mechanism</Title>
  <Section position="3" start_page="0" end_page="7" type="metho">
    <SectionTitle>
2 Multilayer Search Mechanism
</SectionTitle>
    <Paragraph position="0"> The novel Multilayer Search Mechanism (MSM) integrates and quantifles NLU components into a uniform multilayer treelike platform, such as Word-Seg, POS Tagging, Parsing and so on.</Paragraph>
    <Paragraph position="1"> These components afiect each other by computing the flnal score and then get better results.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Background
</SectionTitle>
      <Paragraph position="0"> Considering a Chinese sentence, the sentence analysis task can be formally deflned as flnding a set of word segmentation sequence (W), a POS tagging sequence (POS), a syntax dependency parsing tree (DP) and so on which maximize their joint probability P(W;POS;DP;C/C/C/). In this paper, we assume that there are only three layers W, POS and DP in MSM. It is relatively straightforward, however, to extend the method to the case for which there are more than three layers. Therefore, the sentence analysis task can be described as flnding a triple &lt; W;POS;DP &gt; that maximize the joint probability P(W;POS;DP).</Paragraph>
      <Paragraph position="2"> The joint probability distribution P(W;POS;DP) can be written in the following form using the chain rule of probability:</Paragraph>
      <Paragraph position="4"> Where P(W) is considered as the probability of the word segmentation layer, P(POSjW) is the conditional probability of POS Tagging with a given word segmentation result, P(DPjW;POS) is the conditional probability of a dependency parsing tree with a given word segmentation and POS Tagging result similarly.</Paragraph>
      <Paragraph position="5"> So the form of &lt; W;POS;DP &gt; can be transformed into:</Paragraph>
      <Paragraph position="7"> We consider that each inversion of probability's logarithm at the last step of the above equation is a score given by a component (Such as Word-Seg, POS Tagging and so on). So at last, we flnd an n-tuple &lt; W;POS;DP;C/C/C/ &gt; that minimizes the last score Sn of a sentence analysis result with n layers. Sn is deflned as:</Paragraph>
      <Paragraph position="9"> si denotes the score of the ith layer component. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="7" type="sub_section">
      <SectionTitle>
2.2 The Architecture of Multilayer
Search Mechanism
</SectionTitle>
      <Paragraph position="0"> Because there are lots of analysis results at each layer, it's a combinatorial explosion problem to flnd the optimal result. Assuming that each component produces m results for an input on average and there are n layers in a NLU system, the flnal search space is mn. With the increasing of n, it's impossible for a system to flnd the optimal result in the huge search space.</Paragraph>
      <Paragraph position="1"> The classical cascade mechanism uses a greedy algorithm to solve the problem. It only keeps the optimal result at each layer. But if it's a fault analysis result for the optimal result at a layer, it's impossible for this mechanism to flnd the flnal correct analysis result.</Paragraph>
      <Paragraph position="2"> To overcome the di-culty, we build a new Multilayer Search Mechanism (MSM). Difierent from the cascade mechanism, MSM maintains a number of results at each component, so that the correct analysis should be included in these results with high probability. Then MSM tries to use the information of all layer components to flnd out the correct analysis result. Difierent from the feedback mechanism, the acceptance of an analysis is not based on a higher layer components alone. The lower layer components provide some information to help to flnd the correct analysis result as well.</Paragraph>
      <Paragraph position="3"> According to the above idea, we design the architecture of MSM with multilayer treelike structure. The original input is root and the several analysis results of the input become branches. Iterating this progress, we get a bigger analysis tree. Figure 1 gives an analysis example of a Chinese sentence \  b&amp;quot; (He likes beautiful owers). For the input sentence, there are several Word-Seg results withscores(thelowerthebetter). Thenforeach of Word-Seg results, there are several POS Tagging results, too. And for each of POS Tagging result, the same thing happens. So we get a big tree structure and the correct analysis result is a path in the tree from the root to the leaf except for there is no correct analysis result in some analysis components.</Paragraph>
      <Paragraph position="4"> A search algorithm can be used to flnd out the correct analysis result among the lowest score in the tree. But because each layer cannot give the exact score in Equation 1 as the standard score and the ability of analysis are difierent with difierent layers, we should weight every score.</Paragraph>
      <Paragraph position="5"> Then the last score is the linear weighted sum (Equation 2).</Paragraph>
      <Paragraph position="7"> si denotes the score of the ith layer component which we will introduce in Section 3; wi denotes the weight of the ith layer components which we will introduce in the next section.</Paragraph>
      <Paragraph position="8"> In order to get the optimal result, all kinds of tree search algorithms can be used. Here the BEST-FIRST SEARCH Algorithm (Russell and Norvig, 1995) is used. Figure 2 shows the main algorithm steps.</Paragraph>
      <Paragraph position="9">  1. Add the initial node (starting point) to the queue.</Paragraph>
      <Paragraph position="10"> 2. Compare the front node to the goal state. If they match then the solution is found.</Paragraph>
      <Paragraph position="11"> 3. If they do not match then expand the front node by adding all the nodes from its links. 4. If all nodes in the queue are expanded then the goal state is not found (e.g.there is no solution). Stop.</Paragraph>
      <Paragraph position="12"> 5. According to Equation 2 evaluate the score of expanded nodes and reorder the nodes in the queue.</Paragraph>
      <Paragraph position="13"> 6. Go to step 2.</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
2.3 Layer Weight
</SectionTitle>
      <Paragraph position="0"> We should flnd out a group of appropriate w1;w2;C/C/C/;wn in Equation 2 to maximize the number of the optimal paths in MSM which can get the correct results. They are expressed by W/.</Paragraph>
      <Paragraph position="2"> Here W/ is named as Whole Layer Weight.</Paragraph>
      <Paragraph position="3"> ObjFun(/) denotes a function to value the resultthatagroupofW canget. Herewecanconsider that the performance of each layer is proportional to the last performance of the whole system in MSM. So it maybe the F-Score of Word-Seg, precision of POS Tagging and so on.</Paragraph>
      <Paragraph position="4"> minSn returns the optimal analysis results with the lowest score.</Paragraph>
      <Paragraph position="5"> Here, the F-Score of Word-Seg can be deflned as the harmonic mean of recall and precision of Word-Seg. That is to say:</Paragraph>
      <Paragraph position="7"> Finding out the most suitable group of W is an optimization problem. Genetic Algorithms (GAs) (Mitchell, 1996) is just an adaptive heuristic search algorithm based on the evolutionary ideas of natural selection and genetics to solve optimization problems. It exploits historical information to direct the search into the region of better performance within the search space.</Paragraph>
      <Paragraph position="8"> To use GAs to solve optimization problems (Wall, 1996) the following three questions should be answered:  1. How to describ genome? 2. What is the objective function? 3. Which controlling parameters to be selected? null  A solution to a problem is represented as a genome. The genetic algorithm then creates a population of solutions and applies genetic operators such as mutation and crossover to evolve the solutions in order to flnd the best one(s) after several generations. The numbers of population and generation are given by controlling parameters. The objective function decides which solution is better than others.</Paragraph>
      <Paragraph position="9"> In MSM, the genome is just the group of W which can be denoted by real numbers between 0 and 1. Because the result is a linear weighted sum, weshouldnormalizetheweightstoletw1+ w2+C/C/C/+wn = 1. The objective function is just ObjFun(/) in Equation 3. Here the F-Score of Word-Seg is used to describe it. We set the genetic generations as 10 and the populations in one generation as 30. The Whole Layer Weight shows in the row of WLW in Table 4. The F-Score of Word-Seg shows as Table 3.</Paragraph>
      <Paragraph position="10"> We can see that the Word-Seg layer gets an obviously large weight. So the flnal result is inclined to the result of Word-Seg.</Paragraph>
    </Section>
    <Section position="4" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
2.4 Self Confidence
</SectionTitle>
      <Paragraph position="0"> Our analysis indicates that the method of weighting a whole layer uniformly cannot reect the individual information of each sentence to some component. So the F-Score of Word-Seg drops somewhat comparing with using Only Word-Seg. For example, the most sentences which have ambiguities in Word-Seg component are still weighted high with Word-Seg layer weight. Then the flnal result may still be the same as the result of Word-Seg component. It is ambiguous, too. So we must use a parameter to decrease the weight of a component with ambiguity. It is used to describe the analysis ability of a component for an input. We name it as Self Confldent (SC) of a component.</Paragraph>
      <Paragraph position="1"> It is described by the difierence between the flrst and the second score of a component. Then the bigger SC of a component, the larger weight of it.</Paragraph>
      <Paragraph position="2"> There are lots of methods to value the difierence between two numbers. So there are many kinds of deflnitions of SC. We use A and B to  denotetheflrstandthesecondscoreofacomponent respectively. Then the SC can be deflned as B!A, BA and so on. We must select the better one to represent SC. The better means that a method which gets a lower Error Rate with a threshold t/ which gets the Minimal Error Rate.</Paragraph>
      <Paragraph position="4"> ErrRate(t) denotes the Error Rate with the threshold t. An error has two deflnitions: + SC is higher than t but the flrst result is fault + SC is lower than t but the flrst result is right Then the Error Rate is the ratio between the error number and the total number of sentences. Table 2 is the comparison list between difierent deflnitions of SC and their Minimal Error Rate of Word-Seg. By this table we select B!A as the last SC because it gets the minimal Minimal Error Rate within the difierent deflnitions of SC.</Paragraph>
      <Paragraph position="5"> SC is added into Equation 2 to describe the individual information of each sentence intensively. Equation 4 shows the new score method of a path.</Paragraph>
      <Paragraph position="7"> sci denotes the SC of a component in the ith layer.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="7" end_page="7" type="metho">
    <SectionTitle>
3 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.1 Score of Components
</SectionTitle>
      <Paragraph position="0"> We build a practical system CUP (Chinese Understanding Platform) based on MSM with three layers { Word-Seg, POS Tagging and Parsing. Each component not only provides the n-best analysis result, but also the score of each result.</Paragraph>
      <Paragraph position="1"> In the Word-Seg component, we use the uni-gram model (Liu et al., 1998) to value difierent results of Word-Seg. So the score of a result is:</Paragraph>
      <Paragraph position="3"> wi denotes the ith word in the Word-Seg result of a sentence.</Paragraph>
      <Paragraph position="4"> In the POS Tagging component the classical Markov Model (Manning and Sch~utze, 1999) is used to select the n-best POS results of each Word-Seg result. So the score of a result is:</Paragraph>
      <Paragraph position="6"> ti denotes the POS of the ith word in a Word-Seg result of a sentence.</Paragraph>
      <Paragraph position="7"> In the Parsing component, we use a Chinese Dependency Parser System developed by HIT-IRLab1. The score of a result is:</Paragraph>
      <Paragraph position="9"> lij denotes a link between the ith and jth word in a Word-Seg and POS Tagging result of a sentence.</Paragraph>
      <Paragraph position="10"> Table 1 gives the one and flve-best results of each component with a correct input. The test data comes from Beijing Univ. and Fujitsu Chinese corpus (Huiming et al., 2000). The F-Score is used to value the performance of the Word-Seg, Precision to POS Tagging and the correct rate of links to Parsing.</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.2 Self Confidence Selection
</SectionTitle>
      <Paragraph position="0"> In order to select a better SC, we test all kinds of definition forms and calculate their Minimal Error Rates, for example B - A, B/A, and so on. A and B denote the first and the second scores of a component, respectively. Table 2 shows the relationship between the definition forms of SC and their Minimal Error Rates. Here, we experimented with the first and the second Word-Seg results of more than 7,100 Chinese sentences.</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.3 F-Score of Word-Seg
</SectionTitle>
      <Paragraph position="0"> The result of Word-Seg is used to test our system's performance, which means that the ObjFun(/) returns the F-Score of Word-Seg.</Paragraph>
      <Paragraph position="1"> There are 1,500 sentences as training data and 500 sentences as test data. Among these data about 10% sentences have ambiguities and the others come from Beijing Univ. and Fujitsu  Chinese corpus (Huiming et al., 2000). In CUP the flve-best results of each component are selected. Table 3 lists the F-Score of Word-Seg. They use Only Word-Seg (OWS), Whole Layer Weight (WLW), SC (SC) and FeedBack mechanism (FB) separately. Using the feedback mechanism means that the last analysis result of a sentence is decided by the Parsing. We select the result which has the lowest score of Parsing. Table 4 shows the weight distributions in WLW and SC weighting methods.</Paragraph>
    </Section>
    <Section position="4" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.4 The Efficiency of CUP
</SectionTitle>
      <Paragraph position="0"> The e-ciency test of CUP was done with 7112 sentences with 20 Chinese characters averagely.</Paragraph>
      <Paragraph position="1"> It costs 58.97 seconds on a PC with PIV 2.0 CPU and 512M memory. The average cost of a sentence is 0.0083 second.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="7" end_page="7" type="metho">
    <SectionTitle>
4 Discussions
</SectionTitle>
    <Paragraph position="0"> According to Table 1, we can see that the performance of each component improved with the increasing of the number of results. But at the same time, the processing time must increase.</Paragraph>
    <Paragraph position="1"> So we should balance the e-ciency and efiectiveness with an appropriate number of results. Thus, it's more possible for CUP to flnd out the correct analysis than the original cascade mechanism if we can invent an appropriate method.</Paragraph>
    <Paragraph position="2"> We deflne SC as B !A which gets the mini-</Paragraph>
    <Paragraph position="4"> It's just the difierence between logarithms of difierent word results' probability of the flrst and the second result of Word Segmentation.</Paragraph>
    <Paragraph position="5"> Table 3 shows that MSM using SC gets a better performance than other methods. For a Chinese sentence \0/b&amp;quot;+b&amp;quot;. (There are some drinks under the table). The CUP gets the correct analysis { \0/n//ndb/v &amp;quot;/u+/m/q/nb/w&amp;quot;. But the cascade and feedback mechanism's result is \0/n/ b/v&amp;quot;/u+/m/q/nb/w&amp;quot;.</Paragraph>
    <Paragraph position="6"> The cascade mechanism uses the Only Word-Seg result. In this method P(/b) is more than P(/) / P(b). At the same time, the wrong analysis is a grammatical sentence and is accepted by Parsing. These create that these two mechanisms cannot get the correct result.</Paragraph>
    <Paragraph position="7"> But the MSM synthesizes all the information of Word-Seg, POS Tagging and Parsing. Finally it gets the correct analysis result.</Paragraph>
    <Paragraph position="8"> Now, CUP integrates three layers and its e-ciency is high enough for practical applications.</Paragraph>
  </Section>
class="xml-element"></Paper>