File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/n03-3010_metho.xml
Size: 16,402 bytes
Last Modified: 2025-10-06 14:08:14
<?xml version="1.0" standalone="yes"?> <Paper uid="N03-3010"> <Title>Cooperative Model Based Language Understanding in Dialogue</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Semantic Representation </SectionTitle> <Paragraph position="0"> The goal of automated natural language understanding is to parse natural language strings, extract meaningful information, and store it for future processing. For our training-environment application it is not feasible to parse sentences syntactically, so we directly produce nested information frames as output. The topmost level of the information frame is defined as follows.</Paragraph> <Paragraph position="2"> In this definition, <semantic-object> is one of three types: question, action, and proposition. Here, question refers to requests for information, action refers to orders and suggestions other than requests, and everything else falls into the category of proposition.</Paragraph> <Paragraph position="3"> Each of these types can be further decomposed, as shown in Figures 2 and 3.</Paragraph> <Paragraph position="4"> These information frames can be further extended and nested as necessary. In our application, most of the information frames obtained contain at most three levels. In Figure 4, we give an example of the information frame for the English sentence &quot;who is not critically hurt?&quot;. All the target information frames in our domain take this form. Since the information frames are nested, the statistical learning model discussed later ideally needs both the semantic information and the structural information to be represented correctly. Therefore we use prefix strings to represent the nesting level of each slot-value pair. The case frame in Figure 4 can thus be re-represented as shown in Figure 5. Here we assume that the slots in the information frame are independent of each other. Conversely, the set of meaning items can be restored to the normal nested information frame.</Paragraph> <Paragraph position="5"> We introduce the cooperative model in the following section to extract meaningful information frames for all the English sentences in our domain.</Paragraph> </Section>
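As a point of reference for the models that follow, here is a minimal sketch (in Python, not from the paper) of the prefix-string encoding described above: a nested frame is flattened into level-prefixed slot-value pairs and can be restored from them. The slot names and the example frame are hypothetical stand-ins, loosely modeled on the "who is not critically hurt?" example, since Figures 4 and 5 are not reproduced here.

```python
def flatten(frame, level=0):
    """Flatten a nested information frame into level-prefixed meaning items."""
    items = []
    prefix = "*" * level                      # one '*' per nesting level (arbitrary convention)
    for slot, value in frame.items():
        if isinstance(value, dict):           # slot that opens a sub-frame
            items.append(prefix + slot)
            items.extend(flatten(value, level + 1))
        else:                                 # leaf slot-value pair
            items.append(prefix + slot + "=" + str(value))
    return items


def restore(items):
    """Rebuild the nested frame from the flat meaning items (inverse of flatten)."""
    root = {}
    parents = {0: root}                       # most recently opened frame at each level
    for item in items:
        level = len(item) - len(item.lstrip("*"))
        body = item.lstrip("*")
        frame = parents[level]
        if "=" in body:                       # leaf slot-value pair
            slot, value = body.split("=", 1)
            frame[slot] = value
        else:                                 # slot that opens a sub-frame
            frame[body] = {}
            parents[level + 1] = frame[body]
    return root


if __name__ == "__main__":
    # Hypothetical frame loosely modeled on "who is not critically hurt?".
    frame = {
        "semantic-object": "question",
        "question": {
            "q-type": "who",
            "proposition": {"predicate": "hurt", "degree": "critical", "negation": "yes"},
        },
    }
    items = flatten(frame)
    print("\n".join(items))
    assert restore(items) == frame            # the encoding round-trips
```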
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Cooperative Model </SectionTitle> <Paragraph position="0"> The Cooperative Model (CM) combines two commonly used methods in natural language processing, the Finite State Model (FSM) and the Statistical Learning Model (SLM). We discuss them in Sections 3.1 and 3.2, respectively.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Finite State Model </SectionTitle> <Paragraph position="0"> The main idea of the finite state model is to place all the possible input word sequences and their related output information on the arcs of a network.</Paragraph> <Paragraph position="1"> For our application, the input is a string composed of a sequence of words, and the output should be a correctly structured information frame. We apply two FSM strategies. The Series Model builds a series of finite state machines, each corresponding to a single slot. The Single Model builds one complex finite state machine that incorporates all the sentence patterns and slot-value pairs.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1.1 Series Model of Finite State Machine </SectionTitle> <Paragraph position="0"> For this strategy, we analyze our domain to obtain a list of all possible slots. From a linguistic perspective, a slot can be viewed as being characterized by some specific words, that is, a set of feature words. We can therefore build a separate semantic filter for each slot.</Paragraph> <Paragraph position="1"> Each sentence passes through the series of filters, and as soon as we find the &quot;feature&quot; words we extract their corresponding slot-value pairs. All the slot-value pairs extracted produce the final nested case frame. This is how the series model of the finite state machine maps a sentence to an information frame; for example, three slot-value pairs are extracted from the word &quot;who&quot;.</Paragraph> <Paragraph position="2"> In practice, we identified 27 contexts and built 27 finite state machines as semantic filters, each associated with a set of feature words. The number of arcs in each finite state machine ranges from 4 to 70, and the size of the feature word set varies from 10 to 50.</Paragraph> <Paragraph position="3"> This strategy extracts semantic information based on the mapping between words and slots. It is relatively easy to design the finite state machine networks and implement the parsing algorithm. For every input sentence it provides all possible information using the predefined mappings, and even if the sentence contains no feature words, the system ends gracefully with an empty frame. However, this method does not take the patterns of word sequences into account. A single word may have different meanings in different situations.</Paragraph> <Paragraph position="4"> In most cases it is also difficult to put one word into a single class; sometimes a word can even belong to the feature word sets of different slots that contradict each other. In addition, the result produced may miss some important slot-value pairs, and the number of slots is fixed.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1.2 Single Model of Finite State Machine </SectionTitle> <Paragraph position="0"> In this strategy we build one large finite state network.</Paragraph> <Paragraph position="1"> When a new sentence goes into this FSM parser, it starts from the &quot;START&quot; state, and each successful match of a prespecified pattern or word moves the parser forward to another state. Any matching procedure that reaches the &quot;END&quot; state constitutes a successful parse of the whole sentence, and all the outputs on the arcs along the path compose the final parsing result. If no pattern or word can be matched at some point, the parser stops and returns failure.</Paragraph> <Paragraph position="2"> This strategy requires that all the patterns to be processed by this finite state model be available before the network is designed. The target sentence set includes 65 sentence patterns and 23 classes of words, which we combined manually into a complex finite state network. Figure 7 gives some examples of the collected sentence patterns and word classes.</Paragraph> <Paragraph position="3"> To process these sentences, we designed a finite state network consisting of 128 states. This network covers more than 20k commonly used sentences in our domain. It returns the exact parsing result without missing any important information. If all of the input sentences in the application belong to the target sentence set of this domain, this approach produces all of the correct results.
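To make the traversal concrete, here is a minimal sketch of this kind of single-network parse. The states, word classes, arc patterns, and outputs are hypothetical toy values, not the 128-state network described above.

```python
# Toy single finite state network: each arc is
# (source state, word or word class, target state, output slot-value pairs).
WORD_CLASSES = {"who": "WH-PERSON", "hurt": "INJURY-VERB", "injured": "INJURY-VERB"}

ARCS = [
    ("START", "WH-PERSON",   "S1",  ["semantic-object=question", "q-type=who"]),
    ("S1",    "is",          "S2",  []),
    ("S2",    "not",         "S3",  ["negation=yes"]),
    ("S2",    "critically",  "S4",  ["degree=critical"]),
    ("S3",    "critically",  "S4",  ["degree=critical"]),
    ("S4",    "INJURY-VERB", "END", ["predicate=hurt"]),
]


def parse(sentence):
    """Walk the network from START; return the outputs collected along the path,
    or None if no arc matches at some point or END is never reached."""
    state, outputs = "START", []
    for word in sentence.lower().rstrip("?").split():
        token = WORD_CLASSES.get(word, word)   # map the word to its class if it has one
        for src, label, dst, out in ARCS:
            if src == state and label == token:
                state, outputs = dst, outputs + out
                break
        else:                                  # no arc matched: the parse fails
            return None
    return outputs if state == "END" else None


print(parse("who is not critically hurt?"))
print(parse("where is the bridge?"))           # not in the network -> None
```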
However, such a network is designed entirely by hand, which is tedious and time-consuming. The system is not very flexible or robust, and it is difficult to add new sentences to the network without a thorough investigation of the whole finite state network. It is not convenient or efficient to extend and maintain.</Paragraph> <Paragraph position="4"> Finite state models cannot process sentences with new sentence patterns. In reality, however, most systems require more flexibility, robustness, and more powerful handling of unexpected sentences. The statistical machine learning model sheds some light on this. We discuss learning models in Section 3.2.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Statistical Learning Model </SectionTitle> <Paragraph position="0"> Naive Bayes learning has been widely used in natural language processing with good results, for example in statistical syntactic parsing (Collins, 1997; Charniak, 1997) and hidden language understanding (Miller et al., 1994).</Paragraph> <Paragraph position="1"> We represent the mappings between words and their potential associated meanings (meaning items, including level information and slot-value pairs) with P(M|W), where W refers to words and M refers to meaning items. With Bayes' theorem, we have Formula 3.1.</Paragraph> <Paragraph position="3"> P(M|W) = P(W|M) P(M) / P(W)    (3.1)</Paragraph> <Paragraph position="4"> In our domain, we can view P(W) as a constant and transform Formula 3.1 into Formula 3.2 as follows:</Paragraph> <Paragraph position="5"> P(M|W) ∝ P(W|M) P(M)    (3.2)</Paragraph> <Paragraph position="6"> We created the training sentences and case frames by running the full range of variation on the finite state machine described in Section 3.1.2. This gives a set of 20,677 sentences. After removing ungrammatical sentences, 16,469 remain. We randomly take 7/8 of these as the training set and 1/8 as the testing set.</Paragraph> <Paragraph position="7"> The meaning model P(M) refers to the probability of meanings. In our application, meanings are represented by meaning items, and we assume at this point that the meaning items are independent of each other. In the meaning model, a meaning item includes not only a slot-value pair but also level information. Let C(m_i) be the number of times the meaning item m_i appears in the training set; we obtain P(M) as follows:</Paragraph> <Paragraph position="12"> P(m_i) = C(m_i) / Σ_j C(m_j)</Paragraph> <Paragraph position="13"> This can be easily obtained by counting all the meaning items of all the information frames in the training set.</Paragraph> <Paragraph position="14"> In the naive Bayes learning approach, P(W|M) stands for the probability of words appearing under given meanings. From the linguistic perspective, the patterns of word sequences can carry strong information about meanings, so we introduce a language model based on a Hidden Markov Model (HMM). The word model can be described as P(w_i | w_{i-2}, w_{i-1}, m), P(w_i | w_{i-1}, m), and P(w_i | m) for the trigram, bigram, and unigram models respectively. These probabilities can be calculated from the corresponding n-gram counts in the training set.</Paragraph> <Paragraph position="15"> We parse each sentence based on the naive Bayes learning Formula 3.2. Each word in the sentence can be associated with a set of candidate meaning items. We then normalize each candidate set of meaning items and use a voting scheme to obtain the final result set, with a probability for each meaning item.</Paragraph> <Paragraph position="16"> However, this inevitably produces noisy results. Sometimes the meaning items obtained even contradict other useful meaning items.
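The following sketch illustrates this per-word scoring and voting step under Formula 3.2, using unigram word probabilities only; the counts and meaning items are hypothetical toy values, and the cutoff strategies discussed next are omitted.

```python
from collections import defaultdict

# Hypothetical counts from a training set of sentences and their meaning items.
# C_M[m]       : occurrences of meaning item m in the training frames  -> estimates P(M)
# C_WM[(w, m)] : co-occurrences of word w with meaning item m          -> estimates P(W|M)
C_M = {"q-type=who": 40, "semantic-object=question": 55, "degree=critical": 12}
C_WM = {("who", "q-type=who"): 38, ("who", "semantic-object=question"): 30,
        ("critically", "degree=critical"): 11}
TOTAL_M = sum(C_M.values())


def candidates(word):
    """Score candidate meaning items for one word with P(w|m) * P(m), i.e. Formula 3.2,
    using crude relative-frequency estimates, then normalize over the candidates."""
    scores = {}
    for m, c_m in C_M.items():
        c_wm = C_WM.get((word, m), 0)
        if c_wm:
            scores[m] = (c_wm / c_m) * (c_m / TOTAL_M)   # P(w|m) * P(m)
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()} if total else {}


def parse(sentence):
    """Let every word vote for its candidate meaning items; return items ranked by votes."""
    votes = defaultdict(float)
    for word in sentence.lower().rstrip("?").split():
        for m, p in candidates(word).items():
            votes[m] += p
    return sorted(votes.items(), key=lambda kv: -kv[1])


print(parse("who is not critically hurt?"))
```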
We employ two cutoff strategies to eliminate such noise. The first is to cut off unsatisfactory meaning items based on a gap in probability; the size of the gap is defined by an arbitrary threshold value. The second is to group all the slot-value pairs with the same slot name and take the top one as the result.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Cooperative Mechanism </SectionTitle> <Paragraph position="0"> In the previous two sections, we discussed two approaches used in our natural language understanding system. However, neither is completely satisfactory.</Paragraph> <Paragraph position="1"> The Cooperative Model combines the three approaches from these two models. The main idea is to run the three parsing models together whenever a new sentence comes into the system. With the statistical learning model, we obtain a set of information frames.</Paragraph> <Paragraph position="2"> For the result we get from the single model of the finite state machine, if an information frame exists, the sentence is stored in the finite state network, and we assign it a score of 1.0; this result should be no worse than any information frame we get from the statistical learning model. Otherwise, the sentence is not stored in our finite state network and we ignore this result. In the end, we combine this information frame with the frame set from the statistical learning model and rank them according to their confidence scores. Generally, we take the frame with the highest confidence score as our parsing result. The cooperative model thus combines the advantages of the three methods while suppressing their disadvantages. The series model of the finite state machine has the advantage of mapping between word classes and contexts and encodes real semantic knowledge, though it may sometimes lose information. The statistical learning model can produce a set of information frames based on word patterns, and its noise can be removed using the result of the series model of the finite state machine. For the single finite state machine model, if it can parse a sentence successfully, its result will always be the best one. Therefore, through the cooperation of the three methods, the system can either produce the exact result for sentences stored in the finite state network or return the most probable result through the statistical machine learning method when no sentence match occurs, with the noise reduced by the other finite state machine model. The cooperative model is robust and has the ability to learn in our target domain.</Paragraph>
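A minimal sketch of this preference rule follows, assuming each component parser returns a mapping from meaning items to confidence scores; the three parser functions, the slot-support heuristic, and the 0.5 threshold are hypothetical illustrations rather than the paper's exact procedure.

```python
def cooperative_parse(sentence, single_fsm_parse, series_fsm_parse, slm_parse):
    """Combine the three parsers with a simple preference rule.

    Each parser is assumed to return a dict {meaning_item: confidence score},
    or None/{} on failure; the parser functions themselves are hypothetical.
    """
    exact = single_fsm_parse(sentence)
    if exact:                                  # sentence is covered by the big network:
        return {m: 1.0 for m in exact}         # its exact frame wins with score 1.0

    candidates = slm_parse(sentence) or {}     # frame set from the learning model
    filters = series_fsm_parse(sentence) or {} # word-to-slot filter output

    def slot(item):                            # "degree=critical" -> "degree"
        return item.split("=", 1)[0]

    # Suppress noisy meaning items: keep an item if the series-model filters also
    # support its slot, or if the learner is confident on its own (0.5 is arbitrary).
    supported = {slot(m) for m in filters}
    combined = {m: p for m, p in candidates.items()
                if slot(m) in supported or p >= 0.5}
    return dict(sorted(combined.items(), key=lambda kv: -kv[1]))


# Toy usage with stub parsers (purely illustrative).
print(cooperative_parse(
    "who is not critically hurt?",
    single_fsm_parse=lambda s: None,                            # not in the network
    series_fsm_parse=lambda s: {"q-type=who": 1.0},
    slm_parse=lambda s: {"q-type=who": 0.8, "degree=critical": 0.3},
))
```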
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> The cooperative model can process a sentence whether or not it belongs to the original sentence set. However, we currently use only a simple preference rule for the cooperation and have not yet measured the overall performance. In this section, we compare the performance of the different models to demonstrate the cooperative model's potential.</Paragraph> <Paragraph position="1"> Based on our target sentence patterns and word classes, we built a blind set of 159 completely new sentences. Although all the words used belong to this domain, these sentences appear in neither the training set nor the testing set. To evaluate performance, we compare the results of the three approaches in Table 1. As we can see from this table, the finite state method is better in relative processing speed and at processing existing patterns, while the statistical model is better at processing new sentence patterns, which makes the system very robust.</Paragraph> <Paragraph position="2"> On the other hand, we investigate the performance of the statistical model in more detail on the blind test.</Paragraph> <Paragraph position="3"> Given the whole blind testing set, the statistical learning model produced 159 partially correct information frames. We corrected them manually, one by one, which took 97 minutes in total. To measure this efficiency, we also built all the true information frames for the blind test set manually, one by one; this took 366 minutes for all 159 information frames. Processing a completely new sentence set is therefore much more efficient with the statistical learning model.</Paragraph> <Paragraph position="4"> We next investigate the precision and recall of this statistical learning model. Taking the frames we built manually as the true answers, we define precision as the number of correct slot-value pairs produced by the learning model divided by the total number of slot-value pairs produced by the learning model, recall as the number of correct slot-value pairs produced by the learning model divided by the number of slot-value pairs in the manually built frames, and F-score as the harmonic mean of precision and recall.</Paragraph> <Paragraph position="5"> Our testing strategy is to randomly select some portion of the new blind set and add it to the training set, and then test the system on the sentences in the rest of the blind set. As more and more new sentences are added to the training set (1/4, 1/3, 1/2, etc.), we can see the performance change accordingly. We investigate three models: P(M|W), P(W|M), and P(W|M)*P(M).</Paragraph> <Paragraph position="6"> All of them are tested with the same testing strategy.</Paragraph> <Paragraph position="7"> From the three tables, we can see that the performance improves as new sentences are added to the training set. Comparing Tables 2, 3 and 4, the poor performance of P(W|M)*P(M) is partially due to imbalance in the training set: the higher frequency of some specific meaning items increases P(M) and affects the result during voting.</Paragraph> </Section> </Paper>