<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1078"> <Title>discriminative named entity recognition of speech data</Title> <Section position="4" start_page="617" end_page="617" type="metho"> <SectionTitle> 2 SVM-based NER </SectionTitle> <Paragraph position="0"> NER is a kind of chunking problem that can be solved by classifying words into NE classes that consist of name categories and such chunking states as PERSON-BEGIN (the beginning of a person's name) and LOCATION-MIDDLE (the middle of a location's name). Many discriminative methods have been applied to NER, such as decision trees (Sekine et al., 1998), ME models (Borthwick, 1999; Chieu and Ng, 2003), and CRFs (McCallum and Li, 2003). In this paper, we employ an SVM-based NER method in the following way that showed good NER performance in Japanese (Isozaki and Kazawa, 2002).</Paragraph> <Paragraph position="1"> We define three features for each word: the word itself, its part-of-speech tag, and its character type. We also use those features for the two preceding and succeeding words for context dependence and use 15 features when classifying a word. Each feature is represented by a binary value (1 or 0), for example, &quot;whether the previous word is Japan,&quot; and each word is classified based on a long binary vector where only 15 elements are 1.</Paragraph> <Paragraph position="2"> We have two problems when solving NER using SVMs. One, SVMs can solve only a two-class problem. We reduce multi-class problems of NER to a group of two-class problems using the one-against-all approach, where each SVM is trained to distinguish members of a class (e.g., PERSON-BEGIN) from non-members (PERSON-MIDDLE,MONEY-BEGIN, ... ). In this approach, two or more classes may be assigned to a word or no class may be assigned to a word. To avoid these situations, we choose class c that has the largest SVM output score gc(x) among all others. null The other is that the NE label sequence must be consistent; for example, ARTIFACT-END ARTIFACT-MIDDLE. We use a Viterbi search to obtain the best and consistent NE label sequence after classifying all words in a sentence, based on probability-like values obtained by applying sigmoid function sn(x) = 1/(1+exp([?]bnx)) to SVM output score gc(x).</Paragraph> </Section> <Section position="5" start_page="617" end_page="619" type="metho"> <SectionTitle> 3 Proposed method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="617" end_page="617" type="sub_section"> <SectionTitle> 3.1 Incorporating ASR confidence into NER </SectionTitle> <Paragraph position="0"> In the NER of ASR results, ASR errors cause NEs to be missed and erroneous NEs to be recognized.</Paragraph> <Paragraph position="1"> If one or more words constituting an NE are misrecognized, we cannot recognize the correct NE.</Paragraph> <Paragraph position="2"> Even if all words constituting an NE are correctly recognized, we may not recognize the correct NE due to ASR errors on context words. To avoid this problem, we model ASR errors using additional features that indicate whether each word is correctly recognized. Our NER model is trained using ASR results with a feature, where feature values are obtained through alignment to the corresponding transcriptions. In testing, we estimate feature values using ASR confidence scores. 
</Section>
<Section position="2" start_page="617" end_page="618" type="sub_section"> <SectionTitle> 3.2 Training NER model </SectionTitle>
<Paragraph position="0"> Figure 1 illustrates the procedure for preparing training data from speech data. First, the speech data are manually transcribed and automatically recognized by the ASR system. Second, we label NEs in the transcriptions and set the ASR confidence feature values to 1, because the words in the transcriptions are regarded as correctly recognized. Finally, we align the ASR results to the transcriptions to identify ASR errors, which determine the ASR confidence feature values, and to label the correctly recognized NEs in the ASR results. Note that we label only the NEs in the ASR results that appear in the same positions as in the transcriptions. If a part of an NE is misrecognized, the NE is ignored, and all words of the NE are labeled as non-NE words (OTHER). Examples of text-based and ASR-based training data are shown in Tables 1 and 2. Since the name Murayama Tomiichi in Table 1 is misrecognized by ASR, the correctly recognized word Murayama is also labeled OTHER in Table 2. Another approach can be considered, where misrecognized words are replaced by word error symbols such as those shown in Table 3. In this case, those words are rejected, and their part-of-speech and character type features are not used in NER.</Paragraph>

Table 1: An example of text-based training data.
Word      Confidence  NE label
Murayama  1           PERSON-BEGIN
Tomiichi  1           PERSON-END
shusho    1           OTHER
wa        1           OTHER
nento     1           DATE-SINGLE

Table 2: An example of ASR-based training data.
Word      Confidence  NE label
Murayama  1           OTHER
shi       0           OTHER
ni        0           OTHER
ichi      0           OTHER
shiyo     0           OTHER
wa        1           OTHER
nento     1           DATE-SINGLE

Table 3: An example of ASR-based training data with word error symbols.
Word      Confidence  NE label
Murayama  1           OTHER
(error)   0           OTHER
(error)   0           OTHER
(error)   0           OTHER
(error)   0           OTHER
wa        1           OTHER
nento     1           DATE-SINGLE

</Section>
<Section position="3" start_page="618" end_page="619" type="sub_section"> <SectionTitle> 3.3 ASR confidence scoring for using the proposed NER model </SectionTitle>
<Paragraph position="0"> ASR confidence scoring is an important technique in many ASR applications, and many methods have been proposed, including using word posterior probabilities on word graphs (Wessel et al., 2001), integrating several confidence measures using neural networks (Schaaf and Kemp, 1997), using linear discriminant analysis (Kamppari and Hazen, 2000), and using SVMs (Zhang and Rudnicky, 2001).</Paragraph>
<Paragraph position="1"> Word posterior probability is a commonly used and effective ASR confidence measure. The word posterior probability p([w;t,t']|X) of word w in time interval [t,t'] for speech signal X is calculated as follows (Wessel et al., 2001):</Paragraph>
<Paragraph position="2"> p([w;t,t']|X) = Σ_{W ∈ W[w;t,t']} (p(X|W)(p(W))^b)^a / p(X), </Paragraph>
<Paragraph position="3"> where W is a sentence hypothesis, W[w;t,t'] is the set of sentence hypotheses that include w in [t,t'], p(X|W) is an acoustic model score, p(W) is a language model score, a is a scaling parameter (a<1), and b is a language model weight.</Paragraph>
<Paragraph position="4"> a is used for scaling the large dynamic range of p(X|W)(p(W))^b, to avoid a few of the top hypotheses dominating the posterior probabilities. p(X) is approximated by the sum over all sentence hypotheses, p(X) ≈ Σ_W (p(X|W)(p(W))^b)^a, and p([w;t,t']|X) can be efficiently calculated using a forward-backward algorithm.</Paragraph>
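As a rough illustration of the word posterior computation above, the sketch below approximates p([w;t,t']|X) from an N-best list rather than a word graph (the paper computes it on word graphs with a forward-backward pass). The hypothesis data structure, the time-overlap test, and the toy scores are assumptions made for the example; a and b follow the values used later in the paper.

# Example (Python, illustrative):
import math
from typing import List, Tuple

# Each hypothesis: (acoustic log-score, LM log-score, [(word, start, end), ...])
Hyp = Tuple[float, float, List[Tuple[str, float, float]]]

def word_posterior(hyps: List[Hyp], w: str, t_s: float, t_e: float,
                   a: float = 0.01, b: float = 15.0) -> float:
    """Approximate p([w; t_s, t_e] | X) over an N-best list.

    Each hypothesis W contributes (p(X|W) * p(W)^b)^a, i.e. in the log domain
    a * (log p(X|W) + b * log p(W)).  The posterior is the mass of hypotheses
    containing w in the interval, divided by the total mass approximating p(X).
    """
    log_masses = [a * (ac + b * lm) for ac, lm, _ in hyps]
    m = max(log_masses)                                  # log-sum-exp shift
    total = sum(math.exp(x - m) for x in log_masses)
    num = 0.0
    for (ac, lm, words), lmass in zip(hyps, log_masses):
        if any(word == w and not (end <= t_s or start >= t_e)  # time overlap
               for word, start, end in words):
            num += math.exp(lmass - m)
    return num / total

# toy two-hypothesis example
hyps = [
    (-1200.0, -40.0, [("Murayama", 0.0, 0.5), ("shusho", 0.5, 1.0)]),
    (-1205.0, -42.0, [("Murakami", 0.0, 0.5), ("shi", 0.5, 0.7)]),
]
print(word_posterior(hyps, "Murayama", 0.0, 0.5))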
<Paragraph position="8"> In this paper, we use SVMs for ASR confidence scoring to achieve better performance than using word posterior probabilities alone as ASR confidence scores. The SVMs are trained using ASR results whose errors are known through alignment to the reference transcriptions. The following features are used for confidence scoring: the word itself, its part-of-speech tag, and its word posterior probability; those of the two preceding and two succeeding words are also used. The word itself and its part-of-speech tag are represented by sets of binary values, as in the SVM-based NER. Since all other features are binary, we reduce the real-valued word posterior probability p to ten binary features for simplicity (whether 0 < p <= 0.1, whether 0.1 < p <= 0.2, ..., and whether 0.9 < p <= 1.0). To normalize the SVMs' output scores for ASR confidence, we use a sigmoid function sw(x) = 1/(1 + exp(-bw x)). We use these normalized scores as ASR confidence scores. Although a large variety of features have been proposed in previous studies, we use only these simple features and reserve the others for further study.</Paragraph>
<Paragraph position="9"> Using the ASR confidence scores, we estimate whether each word is correctly recognized. If the ASR confidence score of a word is greater than a threshold tw, the word is estimated as correct, and we set the ASR confidence feature value to 1; otherwise we set it to 0.</Paragraph> </Section>
<Section position="4" start_page="619" end_page="619" type="sub_section"> <SectionTitle> 3.4 Rejection at the NER level </SectionTitle>
<Paragraph position="0"> We use the ASR confidence feature to suppress problems caused by ASR errors; however, even text-based NER sometimes makes errors. NER performance is a trade-off between missing correct NEs and accepting erroneous NEs, and the requirements differ by task. Although we could tune the parameters in training the SVMs to control this trade-off, it seems very hard to find appropriate values for all the SVMs. Instead, we use a simple NER-level rejection that modifies the SVM output scores for the non-NE class (OTHER): we add a constant offset value to to each SVM output score for OTHER. With a large to, OTHER becomes more likely than the NE classes and many words are classified as non-NE words, and vice versa. Therefore, to works as a parameter for NER-level rejection. This approach can also be applied to text-based NER.</Paragraph>
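The sketch below illustrates how the two knobs just described could act at decoding time: the confidence threshold tw sets the binary ASR confidence feature (Section 3.3), and the offset to is added to the OTHER score before the class decision (Section 3.4). It is an illustration under simplifying assumptions, not the authors' code; in particular, the paper feeds sigmoid-normalized scores into a Viterbi search, whereas this example uses a per-word argmax.

# Example (Python, illustrative):
from typing import Dict

def asr_confidence_feature(confidence_score: float, t_w: float = 0.4) -> int:
    """Binary ASR confidence feature: 1 if the word is estimated as
    correctly recognized (confidence score above threshold t_w), else 0."""
    return 1 if confidence_score > t_w else 0

def select_class(svm_scores: Dict[str, float], t_o: float = 0.0) -> str:
    """Pick the NE class with the largest SVM output score after adding the
    rejection offset t_o to the non-NE class OTHER.  A large t_o makes OTHER
    win more often (more rejection); a negative t_o does the opposite."""
    adjusted = dict(svm_scores)
    adjusted["OTHER"] = adjusted.get("OTHER", 0.0) + t_o
    return max(adjusted, key=adjusted.get)

scores = {"PERSON-BEGIN": 0.31, "OTHER": 0.25, "LOCATION-SINGLE": -0.40}
print(select_class(scores, t_o=0.0))   # PERSON-BEGIN
print(select_class(scores, t_o=0.2))   # OTHER wins after the rejection offset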
</Section> </Section>
<Section position="6" start_page="619" end_page="622" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> We conducted the following experiments on the NER of speech data to investigate the performance of the proposed method.</Paragraph>
<Section position="1" start_page="619" end_page="620" type="sub_section"> <SectionTitle> 4.1 Setup </SectionTitle>
<Paragraph position="0"> In the experiments, we simulated the procedure shown in Figure 1 using speech data based on an NE-labeled text corpus. We used the training data of the Information Retrieval and Extraction Exercise (IREX) workshop (Sekine and Eriguchi, 2000) as the text corpus, which consisted of 1,174 Japanese newspaper articles (10,718 sentences) and 18,200 NEs in eight categories (artifact, organization, location, person, date, time, money, and percent). The sentences were read aloud by 106 speakers (about 100 sentences per speaker), and the recorded speech data were used for the experiments. The experiments were conducted with 5-fold cross-validation, using 80% of the 1,174 articles and the ASR results of the corresponding speech data for training the SVMs (both for ASR confidence scoring and for NER) and the rest for testing.</Paragraph>
<Paragraph position="1"> We tokenized the sentences into words and tagged the part-of-speech information using the Japanese morphological analyzer ChaSen 2.3.3 and then labeled the NEs. Unreadable tokens such as parentheses were removed in tokenization. After tokenization, the text corpus had 264,388 words of 60 part-of-speech types. Since three different kinds of characters are used in Japanese, the character types used as features were: single-kanji (words written in a single Chinese character), all-kanji (longer words written in Chinese characters), hiragana (words written in hiragana Japanese phonograms), katakana (words written in katakana Japanese phonograms), number, single-capital (words consisting of a single capitalized letter), all-capital, capitalized (only the first letter is capitalized), roman (other roman-character words), and others (all other words). We used all the features that appeared in each training set (no feature selection was performed). The chunking states included in the NE classes were: BEGIN (beginning of an NE), MIDDLE (middle of an NE), END (end of an NE), and SINGLE (a single-word NE). There were 33 NE classes (eight categories * four chunking states + OTHER), and we therefore trained 33 SVMs, each distinguishing words of one class from words of all other classes. For NER, we used the SVM-based chunk annotator YamCha 0.33 with a quadratic kernel (1 + x·y)^2 and a soft-margin parameter C=0.1, and applied the sigmoid function sn(x) with bn=1.0 and a Viterbi search to the SVMs' outputs. These parameters were chosen experimentally using the test set.</Paragraph>
<Paragraph position="2"> We used an ASR engine (Hori et al., 2004) with a speaker-independent acoustic model. The language model was a word 3-gram model, trained on other Japanese newspaper articles (about 340 M words) that were also tokenized using ChaSen. The vocabulary size of the word 3-gram model was 426,023. The test-set perplexity over the text corpus was 76.928. The number of out-of-vocabulary words was 1,551 (0.587%); 223 NEs (1.23%) in the text corpus contained such out-of-vocabulary words, so those NEs could not be correctly recognized by ASR. The scaling parameter a was set to 0.01, which gave the best ASR error estimation results using word posterior probabilities on the test set in terms of receiver operating characteristic (ROC) curves. The language model weight b was set to 15, a commonly used value in our ASR system. The word accuracy of our ASR engine over the whole dataset was 79.45%. In the ASR results, 82.00% of the NEs in the text corpus remained. Figure 2 shows the ROC curves of ASR error estimation over the five cross-validation test sets, using SVM-based ASR confidence scoring and word posterior probabilities as ASR confidence scores, where True positive rate = (# correctly recognized words estimated as correct) / (# correctly recognized words) and False positive rate = (# misrecognized words estimated as correct) / (# misrecognized words). In SVM-based ASR confidence scoring, we used the quadratic kernel and C=0.01. The parameter bw of the sigmoid function sw(x) was set to 1.0. These parameters were also chosen experimentally. SVM-based ASR confidence scoring showed better performance in ASR error estimation than simple word posterior probabilities by integrating multiple features. Five values of the ASR confidence threshold tw were tested in the following experiments: 0.2, 0.3, 0.4, 0.5, and 0.6 (shown by black dots in Figure 2).</Paragraph>
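The following small sketch shows how ROC points like those in Figure 2 can be computed directly from the definitions above. The data are hypothetical, and this is not the authors' evaluation code; only the TPR/FPR definitions and the set of tested thresholds come from the text.

# Example (Python, illustrative):
from typing import List, Tuple

def roc_point(scores: List[float], correct: List[bool],
              threshold: float) -> Tuple[float, float]:
    """True/false positive rates for ASR error estimation at one threshold.

    A word is *estimated as correct* when its confidence score exceeds the
    threshold.  TPR = estimated-correct among truly correct words;
    FPR = estimated-correct among misrecognized words."""
    tp = sum(1 for s, c in zip(scores, correct) if c and s > threshold)
    fp = sum(1 for s, c in zip(scores, correct) if not c and s > threshold)
    n_correct = sum(correct)
    n_wrong = len(correct) - n_correct
    return tp / n_correct, fp / n_wrong

# toy data: confidence scores and whether each word was actually correct
scores  = [0.9, 0.8, 0.7, 0.55, 0.45, 0.3, 0.2]
correct = [True, True, False, True, False, True, False]
for t in (0.2, 0.3, 0.4, 0.5, 0.6):  # thresholds t_w tested in the paper
    tpr, fpr = roc_point(scores, correct, t)
    print(f"t_w={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")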
</Section>
<Section position="2" start_page="620" end_page="620" type="sub_section"> <SectionTitle> 4.2 Evaluation metrics </SectionTitle>
<Paragraph position="0"> Evaluation was based on an averaged NER F-measure, which is the harmonic mean of NER precision and recall, F = 2PR/(P + R).</Paragraph>
[Figure 2: ROC curves of ASR error estimation. SVM-based ASR confidence scoring outperforms word posterior probability for ASR error estimation.]
<Paragraph position="2"> A recognized NE was accepted as correct if and only if it appeared in the same position as its reference NE through alignment, in addition to having the correct NE surface and category, because the same NE might appear more than once. Comparisons of NE surfaces did not take differences in word segmentation into account because of the segmentation ambiguity in Japanese. Note that NER recall on ASR results could not exceed the rate of NEs remaining after ASR (about 82%) because NEs containing ASR errors were always lost.</Paragraph>
<Paragraph position="3"> In addition, we also evaluated NER precision and recall with NER-level rejection, using the procedure in Section 3.4 and modifying the non-NE class scores with the offset value to.</Paragraph> </Section>
<Section position="3" start_page="620" end_page="621" type="sub_section"> <SectionTitle> 4.3 Compared methods </SectionTitle>
<Paragraph position="0"> We compared several combinations of features and training conditions to evaluate the effect of incorporating the ASR confidence feature and to investigate differences among training data: text-based, ASR-based, and both.</Paragraph>
<Paragraph position="1"> Baseline does not use the ASR confidence feature and is trained using text-based training data only.</Paragraph>
<Paragraph position="2"> NoConf-A does not use the ASR confidence feature and is trained using ASR-based training data only.</Paragraph>
[Table 4: NER results without NER-level rejection (to = 0). ASR word accuracy was 79.45%, and 82.00% of NEs remained in the ASR results. (+: unconfident words were rejected and replaced by word error symbols; *: tw = 0.4; **: ASR errors were known.)]
<Paragraph position="3"> NoConf-TA does not use the ASR confidence feature and is trained using both text-based and ASR-based training data.</Paragraph>
<Paragraph position="4"> Conf-A uses the ASR confidence feature and is trained using ASR-based training data only.</Paragraph>
<Paragraph position="5"> Proposed uses the ASR confidence feature and is trained using both text-based and ASR-based training data.</Paragraph>
<Paragraph position="6"> Conf-Reject is almost the same as Proposed, but misrecognized words are rejected and replaced with word error symbols, as described at the end of Section 3.2.</Paragraph>
<Paragraph position="7"> The following two methods are for reference.</Paragraph>
<Paragraph position="8"> Conf-UB assumes perfect ASR confidence scoring, so the ASR errors in the test set are known. Its NER model is identical to that of Proposed, and it is regarded as the upper bound of Proposed. Transcription applies the same model as Baseline to the reference transcriptions, i.e., it assumes a word accuracy of 100%.</Paragraph>
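Before turning to the results, here is a small sketch of the position-aware scoring described in Section 4.2. Representing each NE as a (surface, category, aligned position) tuple is an assumption made for illustration, and the sketch ignores the word-segmentation normalization mentioned in the text; it is not the authors' scoring script.

# Example (Python, illustrative):
from typing import List, Tuple

# An NE is represented here as (surface, category, aligned_position).
NE = Tuple[str, str, int]

def ner_prf(recognized: List[NE], reference: List[NE]) -> Tuple[float, float, float]:
    """NER precision, recall, and F-measure.  A recognized NE counts as
    correct only if surface, category, and aligned position all match a
    reference NE."""
    correct = len(set(recognized) & set(reference))
    p = correct / len(recognized) if recognized else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

ref = [("Murayama Tomiichi", "PERSON", 0), ("nento", "DATE", 4)]
hyp = [("nento", "DATE", 4), ("Murayama", "PERSON", 0)]
print(ner_prf(hyp, ref))  # (0.5, 0.5, 0.5)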
</Section>
<Section position="4" start_page="621" end_page="622" type="sub_section"> <SectionTitle> 4.4 NER Results </SectionTitle>
<Paragraph position="0"> In the NER experiments, Proposed achieved the best results among the above methods. Table 4 shows the NER results obtained by the methods without NER-level rejection (i.e., to = 0), using the threshold tw = 0.4 for Conf-A, Proposed, and Conf-Reject, which resulted in the best NER F-measures (see Table 5). Proposed showed the best F-measure, 69.02%. It outperformed Baseline by 2.0%, with a 7.5% improvement in precision at the cost of a 1.9% decrease in recall. Conf-Reject showed slightly worse results than Proposed, and Conf-A resulted in a 1.3% worse F-measure than Proposed. NoConf-A and NoConf-TA achieved 7-8% higher precision than Baseline; however, their F-measures were worse than Baseline's because of a large drop in recall. The upper-bound result of the proposed method (Conf-UB) was an F-measure of 73.14%, which was 4% higher than Proposed.</Paragraph>
[Table 5: NER results by ASR confidence score threshold tw for Conf-A, Proposed, and Conf-Reject. (F: F-measure, P: precision, R: recall)]
<Paragraph position="1"> Figure 3 shows NER precision and recall with NER-level rejection by to for Baseline, NoConf-TA, Proposed, Conf-UB, and Transcription. In the figure, black dots represent the results with to = 0, as shown in Table 4. With all five methods, we obtained higher precision with to > 0. Proposed achieved more than 5% higher precision than Baseline over most of the recall range and showed higher precision than NoConf-TA at recall values above about 35%.</Paragraph> </Section> </Section>
<Section position="7" start_page="622" end_page="622" type="metho"> <SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"> The proposed method effectively improves NER performance, as shown by the difference between Proposed and Baseline in Tables 4 and 5. The improvement comes from two factors: using both text-based and ASR-based training data, and incorporating the ASR confidence feature. As shown by the difference between Baseline and the methods using ASR-based training data (NoConf-A, NoConf-TA, Conf-A, Proposed, Conf-Reject), ASR-based training data increase precision and decrease recall. In ASR-based training data, all words constituting NEs that contain ASR errors are regarded as non-NE words, so those NE examples are lost in training, which emphasizes NER precision. When text-based training data are also available, they compensate for the loss of NE examples and recover NER recall, as shown by the difference between the methods without text-based training data (NoConf-A, Conf-A) and those with it (NoConf-TA, Proposed). The ASR confidence feature also increases NER recall, as shown by the difference between the methods without it (NoConf-A, NoConf-TA) and those with it (Conf-A, Proposed). This suggests that the ASR confidence feature helps distinguish whether an ASR error influences NER and suppresses excessive rejection of NEs around ASR errors.</Paragraph>
<Paragraph position="1"> With respect to the ASR confidence feature, the small difference between Conf-Reject and Proposed suggests that, for misrecognized words, ASR confidence is a more dominant feature than the other features: the word itself, its part-of-speech tag, and its character type.
In addition, the difference between Conf-UB and Proposed indicates that there is room to improve NER performance with better ASR confidence scoring.</Paragraph>
<Paragraph position="2"> NER-level rejection also increased precision, as shown in Figure 3. We can control the trade-off between precision and recall with to according to the task requirements, even in text-based NER. In the NER of speech data, we can obtain much higher precision by using both ASR-based training data and NER-level rejection than by using either one alone.</Paragraph> </Section>
<Section position="8" start_page="622" end_page="622" type="metho"> <SectionTitle> 6 Related work </SectionTitle>
<Paragraph position="0"> Recent studies on the NER of speech data consider more than the 1-best ASR results, in the form of N-best lists and word lattices. Using many ASR hypotheses helps recover the ASR errors of NE words in the 1-best ASR results and improves NER accuracy. Our method can also be extended to multiple ASR hypotheses.</Paragraph>
<Paragraph position="1"> Generative NER models have been used in multi-pass ASR and NER searches over word lattices (Horlock and King, 2003b; Béchet et al., 2004; Favre et al., 2005). Horlock and King (2003a) also proposed discriminative training of their NER models. These studies showed the advantage of using multiple ASR hypotheses, but they do not use overlapping features.</Paragraph>
<Paragraph position="2"> Discriminative NER models have also been applied to multiple ASR hypotheses. Zhai et al. (2004) applied text-based NER to N-best ASR results and merged the N-best NER results by weighted voting based on several sentence-level scores, such as ASR and NER scores. The ASR confidence feature does not depend on SVMs and can be used with their method and with other discriminative models.</Paragraph> </Section> </Paper>