<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1044"> <Title>Named Entity Extraction from Noisy Input: Speech and OCR</Title> <Section position="11" start_page="322" end_page="322" type="concl"> <SectionTitle> 9 Conclusions </SectionTitle>
<Paragraph position="0"> First and foremost, the hidden Markov model is quite robust in the face of errorful input.</Paragraph>
<Paragraph position="1"> Performance on both speech and OCR input degrades linearly as a function of word error.</Paragraph>
<Paragraph position="2"> Even without case information or punctuation in the input, performance on the broadcast news task is above 90%, with only a 3.4-point degradation due to the missing textual clues. Even with 15% word error, performance degrades by only about 8 points of F for both the OCR and ASR systems.</Paragraph>
<Paragraph position="3"> Second, because annotation can be performed quickly and inexpensively by non-experts, training-based systems like IdentiFinder offer a powerful advantage in moving to new languages and new domains. In our experience, annotation of English typically proceeds at 5k words per hour or more, which means that interesting performance can be achieved with as little as 20 hours of student annotation (i.e., at least 100k words). Increasing the amount of training data continually improves performance, generally as the logarithm of the training-set size. On transcribed speech, performance is already good (89.3 at 0% WER) with only 100 hours, or 643k words, of training data.</Paragraph>
<Paragraph position="4"> Third, although errors due to words outside the vocabulary of the speech recognizer are a problem, they represent only about 15% of the errors made by the combined speech recognition and named entity system.</Paragraph>
<Paragraph position="5"> Fourth, we used exactly the same training data, modeling, and search algorithm for errorful input as we did for error-free input.
For OCR, we trained on correct newswire only once, for both correct text input (0% WER) and a variety of errorful text input conditions. For speech, we simply transformed the text training data into SNOR format and retrained. With this approach, the only cost of handling errorful input from OCR or ASR was a small amount of computing time: there were no rules to rewrite, no lists to change, and no vocabulary adjustments. Even so, the degradation in performance on errorful input is no worse than the word error rate of the OCR/ASR system.</Paragraph> </Section> </Paper>