<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3002">
  <Title>Hybrid Statistical and Structural Semantic Modeling for Thai Multi-Stage Spoken Language Understanding</Title>
  <Section position="5" start_page="0" end_page="75" type="evalu">
    <SectionTitle>
5 Evaluation and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Corpora
</SectionTitle>
      <Paragraph position="0"> Collecting and annotating a corpus is an especially serious problem for language like Thai, where only few databases are available. To shorten the collection time, we created a specific web page simulating our expected conversational dialogues, and asked Thai native users to answer the dialogue questions by typing. As we asked the users to try answering the questions using spoken language, we could obtain a fairly good corpus for training the SLU.</Paragraph>
      <Paragraph position="1"> Currently, 5,869 typed-in utterances from 150 users have been completely annotated. To reduce the effort of manual annotation, we conducted a semi-automatic annotation method. The prototype rule-based SLU was used to roughly tag each utterance with a goal and concepts, which were then manually corrected. Words or phases that were relevant to the concept were marked automatically based on their frequencies and information mutual to the concept.</Paragraph>
      <Paragraph position="2"> Finally the tags were manually checked and the key-words within each concept were additionally marked by the defined label symbols.</Paragraph>
      <Paragraph position="3"> All 5,869 utterances described above were used as a training set (TR) for the SLU system. We also collected a set of speech utterances during an evaluation of our prototype dialogue system. It contained 1,101 speech utterances from 96 dialogues. By balancing the</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Reg models
</SectionTitle>
      <Paragraph position="0"> occurrence of goals, we reserved 500 utterances for a development set (DS), which was used for tuning parameters. The remaining 601 utterances were used for an evaluation set (ES). Table 3 shows the characteristics of each data set. From the TR set, 75 types of concepts and 42 types of goals were defined. The out-ofgoal and out-of-concept denote goals and concepts that are not defined in the TR set, and thus cannot be recognized by the trained SLU. Since concepts that contain no value are not counted for concept-value evaluation, Table 3 also shows the number of concepts that contain values in the line &amp;quot;# Concept-values&amp;quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Evaluation measures
</SectionTitle>
      <Paragraph position="0"> Four measures were used for evaluation:  1. Word accuracy (WAcc) - the standard measure for evaluating the ASR, 2. Concept F-measure (ConF) - the F-measure of detected concepts, 3. Goal accuracy (GAcc) - the number of utterances with correctly identified goals, divided by the total number of test utterances, 4. Concept-value accuracy (CAcc) - the number  of concepts, whose values are correctly matched to their references, divided by the total number of concepts that contain values.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="58" type="sub_section">
      <SectionTitle>
5.3 The use of logical n-gram modeling
</SectionTitle>
      <Paragraph position="0"> The first experiment was to inspect improvement gained after conducting the statistical approaches for concept extraction and concept-value recognition.</Paragraph>
      <Paragraph position="1"> Only the 1-best word hypothesis from the ASR was experimented in this section. The AT&amp;T generalized FSM library (Mohri et al., 1997) was used to construct and operate all WFSTs, and the SNNS toolkit (Zell et al., 1994) was used to create the ANN classifiers for the goal identification task.</Paragraph>
      <Paragraph position="2"> The baseline system utilized the Reg model for concept extraction and concept-value recognition, and the multi-layer perceptron ANN for goal identification. 75 WFSTs corresponding to the number of defined concepts were created from the TR set. The ANN consisted of a 75-node input layer, a 100-node hidden layer (Wutiwiwatchai and Furui, 2003b), and a 42node output layer equal to the number of goals to be identified.</Paragraph>
      <Paragraph position="3">  an oracle test for the DS set.</Paragraph>
      <Paragraph position="4">  Reg, Ngram, and LNgram models.</Paragraph>
      <Paragraph position="5"> Another WFST was constructed for the n-gram semantic parser (n = 2 in our experiment), which was used for the Ngram model and the first pass of the LNgram model. Two parameters, M and l, in the LNgram approach need to be adjusted. To determine an appropriate value of M, we plotted in an oracle mode the CAcc of the DS set with respect to M, as shown in Figure 4. According to the graph, an M of 80 was considered optimum and set for the rest of the experiments. Figure 5 then shows the CAcc obtained for rescored M-best hypotheses when the weight l as defined in Eq. 2 is varied. Here, the larger value of l means to assign a higher score to the hypothesis that contains longer valid word-and-label syntax. Hence, we concluded by Fig. 5 that reordering the hypotheses, which contain longer valid syntaxes, could improve the CAcc significantly. Since the CAcc results become steady when the value of l is greater than 0.7, a l of 1.0 is used henceforth to ensure the best performance. The overall evaluation results on the ES set are shown in Table 4, where M and l in the LNgram model are set to 80 and 1.0 respectively. 'Recognition' denotes the experiments on automatic speech-recognized utterances (at 79% WAcc), whereas 'Orthography' means their exact manual transcriptions. It is noted that the LNgram approach utilizes the same process of Ngram in its first pass, where the concepts are determined. Therefore, the ConF and GAcc results of both approaches are the same.</Paragraph>
      <Paragraph position="6"> According to the results, the Ngram tagger worked well for the concept extraction task as it increased the ConF by over 10%. The improvement mainly came from reduction of redundant concepts often accepted by the Reg model. The better extraction of concepts could give better goal identification accuracy reasonably. However, as we expected, the conventional Ngram model itself had no syntactic information and thus often produced a confusing label sequence, especially for ill-formed utterances. A typical error occurred for words that could be tagged with one of several semantic labels, such as the word 'MNT' (referring to the name of the month), which could be identified as 'check-in month' or 'check-out month'.</Paragraph>
      <Paragraph position="7"> These two alternatives could only be clarified by a context word, which sometimes located far from the word 'MNT'. This problem could be solved by using the Reg model. The Reg model, however, could not provide a label sequence to any out-of-syntax sentence. The LNgram as an integration of both models thus obviously outperformed the others.</Paragraph>
      <Paragraph position="8"> In conclusion, the LNgram model could improve the ConF, GAcc, and CAcc by 15.8%, 6.4%, and 3.2% relative to the baseline Reg model. Moreover, if we considered the orthography result an upperbound of the underlying model, the GAcc and CAcc results produced by the LNgram model are relatively closer to their upperbounds compared to the Reg model. This verifies robustness improvement of the proposed model against speech-recognition errors.</Paragraph>
    </Section>
    <Section position="5" start_page="58" end_page="75" type="sub_section">
      <SectionTitle>
5.4 The use of ASR N-best hypotheses
</SectionTitle>
      <Paragraph position="0"> To incorporate N-best hypotheses from the ASR to the LNgram model, we need to firstly determine an appropriate value of N. An oracle test that measures WAcc and ConF for the DS set with variation of N is shown in Fig. 6. Although we can select a proper value of N by considering only the WAcc, we also examine the ConF to ensure that the selected N provides possibility to improve the understanding performance as well. According to Fig. 6, the ConF highly correlates to the WAcc, and an N of 50 is considered optimum for our task. At this operating point, we plot another curve of ConF for the DS set with a variation of th, the interpolation weight in Eq. 3, as shown in Fig. 7. The appropriate value of th is 0.6, as the highest ConF is obtained at this point. The last parameter we need to adjust is the value of M. Although we have tuned the value of M for the case of 1-best word hypothesis, the appropriate value of M may change when the N-best hypotheses are used instead. However, in our trial, we found that the optimum value of M is again in the same range as that operated for the 1-best case. A probable reason is that rescoring the N-best word hypotheses by the Ngram model can reorder the good hypotheses to a certain upper portion of the N-best list, and thus rescoring in the second pass of the LNgram is independent to the value of N. Consequently, an M of 80 as that selected for the 1-best hypothesis is also used for the N-best case.</Paragraph>
      <Paragraph position="1">  set when N is set to 50.</Paragraph>
      <Paragraph position="2"> Given all tuned parameters, an evaluation on the ES set is carried out as shown in Fig. 8. With the Reg model as a baseline system, the use of N-best hypotheses further improves the the ConF, GAcc, and CAcc by 0.9%, 0.6%, and 3.9% from the only 1-best, and hence reduces the gap between the speech-recognized test set and the orthography test set by 25%, 5.3%, and 26% respectively.</Paragraph>
      <Paragraph position="3"> Finally, we would like to note that the proposed LNgram approach provided the significant advantage of a much smaller computational time compared to the original Reg approach. While the Reg model requires C times (C denotes the number of defined concepts) of WFST operations to determine concepts, the LNgram needs only D+1 times (D &lt;&lt; C), where D is the number of concepts appearing in the top hypothesis produced by the n-gram semantic model. Moreover, under the framework of WFST, incorporating ASR N-best hypotheses required only a small increment of additional processing time compared to the use of 1-best.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>