<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2905">
<Title>Using Soundex Codes for Indexing Names in ASR documents</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 2 Past Work </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 Stemming </SectionTitle>
<Paragraph position="0"> Stemming (Porter, 1980; Krovetz, 1993) is a method in which the corpus is processed so that semantically and morphologically related words are reduced to a common stem. Thus, race, racing, and racer are all reduced to a single root: race. Stemming has been found to be effective for Information Retrieval, TDT, and other related tasks. Current stemming algorithms work only for regular English words, not for names. In this paper we address the problem of grouping together and normalizing proper names in the same way that stemming groups together regular English words.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Approximate String Matching </SectionTitle>
<Paragraph position="0"> Some past work (French et al., 1997; Zobel and Dart, 1996) has addressed the problem that proper names can have different spellings. Each of those works, however, addresses only the question of how effectively one can match a name to its spelling variants.</Paragraph>
<Paragraph position="1"> They measure their performance in terms of the precision and recall with which they are able to retrieve other names that are variants of a given query name. Essentially, the primary motivation of those works was to find good approximate string matching techniques. Those techniques are directly applicable only in applications that retrieve matching records from a database.</Paragraph>
<Paragraph position="2"> However, there is no work that evaluates the effectiveness of approximate string matching techniques for names in an information retrieval or related task. We know of no work that attempts to detect names automatically and then index together the names that belong together, in the same way that words of the same stem class are indexed by one common term.</Paragraph>
</Section>
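To make the idea of phonetic name matching concrete, the following is a minimal sketch, in Python, of the classic Soundex encoding named in the paper's title. It is an editorial illustration, not code from this paper or from the systems cited above, and the name variants in the example are hypothetical spellings of the kind discussed in Section 2.3 below. Names that receive the same code can be conflated under a single index term, much as stemming conflates words of the same stem class.

    # Illustrative Soundex sketch (not from the paper): map a name to a
    # letter-plus-three-digits code; names with equal codes can share one index term.
    def soundex(name):
        groups = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
        codes = {c: str(d) for d, letters in enumerate(groups, start=1) for c in letters}
        name = name.lower()
        first, digits = name[0].upper(), []
        prev = codes.get(name[0])          # code of the first letter (kept as a letter, not re-emitted)
        for ch in name[1:]:
            if ch in "hw":                 # 'h' and 'w' are skipped and do not break a run
                continue
            code = codes.get(ch)           # vowels map to None and reset the run
            if code is not None and code != prev:
                digits.append(code)
            prev = code
        return (first + "".join(digits) + "000")[:4]

    # Hypothetical spelling variants; all five receive the same code, L520.
    for variant in ["Lewinsky", "Lewinskey", "Lewenskey", "Linski", "Lansky"]:
        print(variant, soundex(variant))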
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.3 The TREC SDR and the TDT Link Detection tasks </SectionTitle>
<Paragraph position="0"> A single news source may spell all mentions of a given name identically. However, this consistency is lost when there are multiple sources of news, with sources spanning languages and modes (broadcast and print). The TDT3 corpus (LDC, 2003) is representative of such real-life data. The corpus consists of English, Arabic, and Mandarin print and broadcast news. ASR output is used for the broadcast sources, and for non-English stories machine-translated output is used when comparing stories. For both ASR systems and machine translation systems, proper names are often out-of-vocabulary (OOV). A typical speech recognizer has a lexicon of about 60K words, and for a lexicon of this size about 10% of person names are OOV. The OOV problem is usually addressed by transliteration and other such techniques. A breakdown of the OOV rates for names at different lexicon sizes is given in (Miller et al., 2000). We believe the problem of spelling errors is important when one wants to index and retrieve ASR documents. For example, Monica Lewinsky is mentioned frequently in the TDT3 corpus. The corpus contains closed-caption transcripts for TV broadcasts, and closed captions suffer from typing errors: the name Lewinsky is often misspelt as Lewinskey in the closed-caption text. In the ASR text, some of the variants that appear are Lewenskey, Linski, Lansky, and Lewinsky. This example is typical: the errors in the closed-caption text highlight how humans themselves vary in their spelling of a name, and the errors in the ASR output demonstrate how a single ASR system can produce different spellings of the same name.</Paragraph>
<Paragraph position="1"> The ASR errors arise largely because ASR systems rely on phonemes for OOV words, and each of the different spellings of the same name is probably the result of different pronunciations and other such factors. The output of an ASR system therefore contains several different spelling variants of each name. It is easy to see why it would help considerably to group names that refer to the same entity and index them as one entity. We can exploit the fact that these different spelling variants of a given name exhibit strong similarity, using approximate string matching techniques. We propose that in domains where variant spellings of proper names are a dominant issue, using approximate string matching to determine which names refer to the same entity will improve the accuracy with which we can detect links between stories. Figure 1 shows a snippet of closed-caption text and its ASR counterpart. The names Lewinskey and Tripp are misspelt in the ASR text, yet the two documents have high similarity because of the other words that the ASR system gets right. Allan (2002) showed that ASR errors can cause misses in TDT tasks but can sometimes also be beneficial, resulting in a minimal average impact on TDT performance. For Spoken Document Retrieval, too, it was found that a few ASR errors per document did not make a large difference to performance, as long as a reasonable percentage of the words is recognized correctly (Garofolo et al., 2000). Of course, factors such as the length of the two pieces of text being compared make a difference: Barnett et al. (1997) showed that short queries were affected considerably by word error rate. ASR errors may not cause a significant drop in performance on any of the Topic Detection and Tracking tasks. But consider a system in which retrieving all documents mentioning Lewinskey and Tripp is critical; it is not unrealistic to assume that systems with such needs exist, and the ASR document in the example above would be missed. We therefore believe that the problem we address in this paper is an important one. The preliminary experiments in this paper, which are on the TDT corpus, only highlight how our approach can help.</Paragraph>
</Section>
</Section>
</Paper>