<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1039"> <Title>Information Extraction From Voicemail</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Experimental Results </SectionTitle> <Paragraph position="0"> To evaluate the performance of the different systems, we use the conventional precision, recall and F-measure. Significantly, we insist on exact matches for an answer to be counted as correct.</Paragraph> <Paragraph position="1"> The reason for this is that any error is liable to render the information useless or detrimental. For example, an incorrect phone number can result in unwanted phone charges and unpleasant conversations. This is different from typical named entity evaluation, where partial matches are given partial credit. Therefore, it should be understood that the precision and recall rates computed with this strict criterion cannot be compared to those from named entity detection tasks.</Paragraph> <Paragraph position="2"> A summary of our results is presented in Tables [...]tion transcripts are used. On the heading line, P refers to precision, R to recall, F to F-measure, C to caller identity, and N to phone number. Thus P/C denotes &quot;precision on caller identity&quot;. In these tables, the maximum entropy model is referred to as ME. ME1-U uses unigram lexical features only; ME1-B uses bigram lexical features only. ME1-B performs somewhat better than ME1-U, but uses more than twice as many features.</Paragraph> <Paragraph position="3"> ME2-U-f1 uses unigram lexical features and number dictionary features. It improves the recall of phone numbers by [...] over ME1-U. ME2-U-f12 adds the trigger phrase dictionary features to ME2-U-f1; it improves the recall of both caller identity and phone numbers but degrades the precision of both. Overall, it improves the F-measures slightly. ME2-B-f12 uses bigram lexical features, number dictionary features and trigger phrase dictionary features. 
It has the best recall of caller identity, again with more than twice as many features as ME2-U-f12.</Paragraph> <Paragraph position="4"> The above variants of ME features are chosen using a simple count cutoff method. When incremental feature selection is used, ME2-U-f12-I reduces the number of features from [...] to [...] with minor performance loss; ME2-B-f12-I reduces the number of features from [...] to [...] with minor performance loss. This shows that the main power of the maxent model comes from a very small subset of the possible features. Thus, if memory and speed are a concern, incremental feature selection is highly recommended. There are several observations that can be made from these results. First, the maximum entropy approach systematically beats the baseline in terms of precision; second, it is better on recall of the caller's identity. We believe this is because the baseline has an imperfect set of rules for determining the end of a &quot;caller identity&quot; description. On the other hand, the baseline system has higher recall for phone numbers. The results of structure induction are worse than those of the other two methods; however, as this is a novel approach in a developmental stage, we expect its performance to improve in the future.</Paragraph> <Paragraph position="5"> Another important point is that there is a significant difference in performance between manual and decoded transcriptions. As expected, the precision and recall numbers are worse in the presence of transcription errors (the recognizer had a word error rate of about 35%). The degradation due to transcription errors could be caused by either (i) corruption of words in the context surrounding the names and numbers, or (ii) corruption of the information itself. 
To investigate this, we performed the following experiment: we replaced the regions of decoded text that correspond to the correct caller identity and phone number with the correct manual transcription, and reran the test.</Paragraph> <Paragraph position="6"> The results are shown in Table 5. Compared to the results on the manual transcription, the recall numbers for the maximum-entropy tagger are just slightly ([...]) worse, and precision is still high. This indicates that the corruption of the information content due to transcription errors is much more important than the corruption of the context.</Paragraph> <Paragraph position="7"> If measured by the string error rate, none of our systems can be used to extract exact caller and phone number information directly from decoded voicemail. However, they can be used to locate the information in the message and highlight those positions. To evaluate the effectiveness of this approach, we computed precision and recall numbers in terms of the temporal overlap of the identified and true information-bearing segments. Table 6 shows that the temporal location of phone numbers can be reliably determined, with an F-measure of 80%.</Paragraph> </Section> </Paper>