File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-2905_metho.xml
Size: 11,504 bytes
Last Modified: 2025-10-06 14:09:31
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2905"> <Title>Using Soundex Codes for Indexing Names in ASR documents</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Story Link Detection </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Task Definition </SectionTitle>
<Paragraph position="0"> The Story Link Detection Task is key to all the other tasks in TDT. The system is handed a set of story pairs, and for each pair it is asked to judge whether the two stories discuss the same topic or different topics. In addition to a YES/NO decision, the system is also expected to output a confidence score, where a low confidence score indicates that the system favors the NO decision.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Our Approach </SectionTitle>
<Paragraph position="0"> Simply stated, our approach to the SLD task is to use approximate string matching techniques to compare entities between two pieces of text. The two pieces of text may be a query and a document, or two documents, depending on the task. We first need to identify entities in the two documents. There exist several techniques to automatically identify names. For properly punctuated text, heuristics like capitalization work sufficiently well. However, ASR text often has no sentence boundaries or even punctuation. Hence we rely on a Hidden Markov Model based named entity recognizer (Bikel et al., 1999) for our task.</Paragraph>
<Paragraph position="1"> A simple strategy that incorporates approximate string matching is to first preprocess the corpus and normalize all mentions of a named entity to a canonical form, where the canonical form is independent of the mentions of other entities in the two documents being compared. Soundex, Phonix, and other such codes offer a means of normalizing a word to its phonetic form. The Soundex code is a combination of the first letter of the word and a three-digit code that represents its phonetic sound. Hence, similar sounding names like &quot;Lewinskey&quot; and &quot;Lewinsky&quot; are both reduced to the same Soundex code &quot;l520&quot;. We can preprocess the corpus so that all the named entities are replaced by their Soundex codes. We then compute the similarity between documents in the new corpus, rather than the old one, using conventional similarity metrics like cosine or TF-IDF.</Paragraph> </Section> </Section>
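As a concrete illustration of the normalization step described in Section 3.2, here is a minimal Python sketch of the classic Soundex encoding. It is an illustrative reimplementation, not the code used in the experiments, and minor details such as the treatment of H and W vary across Soundex variants.

def soundex(name):
    """Encode a name as its Soundex code: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    first, rest = name[0], name[1:]
    encoded = [codes.get(first, "")]
    for ch in rest:
        digit = codes.get(ch, "")
        if digit and digit != encoded[-1]:
            encoded.append(digit)           # new consonant code
        elif not digit and ch not in "HW":
            encoded.append("")              # vowels and Y break runs of equal codes
    digits = "".join(d for d in encoded[1:] if d)
    return (first + (digits + "000")[:3]).lower()   # lower-cased, as in the codes shown above

# Both spellings of the name collapse to the same code.
print(soundex("Lewinskey"), soundex("Lewinsky"))    # l520 l520
print(soundex("Katherine"), soundex("Catherine"))   # k365 c365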
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Setup </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Data </SectionTitle>
<Paragraph position="0"> The corpus (ldc, 2003) has 67111 documents from multiple sources of news in multiple languages (English, Chinese, and Arabic) and media (broadcast news and newswire). The English sources include the Associated Press, the New York Times, PRI, and the Voice of America. For the broadcast news sources we have ASR output, and for TV we have both ASR output and closed-caption data.</Paragraph>
<Paragraph position="1"> Additionally, we have the following Mandarin newswire, web, and broadcast sources: Xinhua News, Zaobao, and Voice of America (Mandarin). For all the Mandarin documents we have the original documents in the native language as well as the English output of Systran, a machine translation system. The data was collected by the LDC by sampling from the above-mentioned sources in the period from October to December 1998.</Paragraph>
<Paragraph position="2"> The LDC has annotated 60 topics in the TDT3 corpus.</Paragraph>
<Paragraph position="3"> A topic is determined by an event. For example, topic 30001 is the Cambodian Government Coalition. Each topic has key entities associated with it and a description of the topic. A subset of the documents is annotated as on-topic or off-topic according to a well-defined strategy specified by the LDC.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Story Link Detection </SectionTitle>
<Paragraph position="0"> To compute the similarity of two documents, which is thresholded to make the YES/NO decision, we used the traditional cosine similarity metric. To give some leverage to documents that were very similar even before named entity normalization, we average the similarity scores between documents before and after the named entities have been normalized by their Soundex codes as follows: sim(D1, D2) = 1/2 [cos(D1, D2) + cos(D1', D2')], where D1' and D2' are the documents after the names have been normalized.</Paragraph> </Section>
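As an illustration of the scoring scheme in Section 4.2, the following Python sketch averages the cosine similarity computed before and after name normalization. The toy documents and the assumption that tagged names have already been replaced by their Soundex codes are ours, not the authors'.

from collections import Counter
from math import sqrt

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (raw term frequencies)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def link_score(d1, d2, d1_norm, d2_norm):
    """Average of the cosine scores before and after name normalization.

    d1_norm / d2_norm are the same stories with each tagged name replaced
    by its Soundex code (e.g. "lewinsky" becomes "l520").
    """
    return 0.5 * (cosine(d1, d2) + cosine(d1_norm, d2_norm))

# Toy story pair: the variant spellings of the name only match after normalization.
d1 = "president met lewinskey in washington".split()
d2 = "lewinsky spoke to reporters in washington".split()
d1n = "president met l520 in w252".split()
d2n = "l520 spoke to reporters in w252".split()
print(link_score(d1, d2, d1n, d2n))   # higher than cosine(d1, d2) alone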
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Evaluation </SectionTitle>
<Paragraph position="0"> An ROC curve is plotted by making a parameter sweep of the YES/NO decision thresholds and plotting the Misses and False Alarms at each point. At each point the cost is computed using the following empirically determined formula (Fiscus et al., 1998):</Paragraph>
<Paragraph position="1"> C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target), with C_Miss = 1, C_FA = 0.1, and the prior target probability P_target = 0.02.</Paragraph>
<Paragraph position="2"> This cost function is standard across all tasks. The point of minimum cost serves as the basis for comparison between systems.</Paragraph> </Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> We tested our idea on the TDT3 corpus for the Story Link Detection Task, using the cosine similarity metric, and found that performance actually degraded. On investigation we found that the named entity recognizer performs poorly on machine-translated and ASR source data. Our named entity recognizer relies considerably on sentence structure to make its predictions. Machine-translated output often lacks grammatical structure, and ASR output does not have punctuation, which results in many named entity tagging errors.</Paragraph>
<Paragraph position="1"> We therefore decided to test our idea on newswire text.</Paragraph>
<Paragraph position="2"> We created our own test set of 4752 pairs of stories from newswire sources. This test set was created by randomly picking on-topic and off-topic stories for each topic using the same policy as employed by the LDC (Fiscus, 2003). On these pairs, we obtained about a 10% improvement (Figure 2), suggesting that there is merit in Soundex normalization of names. However, the problem of poor named entity recognition is a bottleneck for ASR. We discuss alternative strategies for dealing with this problem, and other ways of using approximate string matching, in the next section.</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Alternative strategies </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 To not use an entity recognizer </SectionTitle>
<Paragraph position="0"> We were not able to benefit from our approach on the ASR documents because of the poor performance of the named entity recognizer on that type of document. An example of a randomly picked named-entity-tagged ASR document is given below. The tagging errors are underlined.</Paragraph>
<Paragraph position="1"> &lt;DOC&gt; &lt;DOCNO&gt; CNN19981001.0130.0000 &lt;/DOCNO&gt; &lt;TEXT&gt; &lt;ENAMEX TYPE=&quot;ORGANIZATION&quot;&gt; BUDGET SURPLUS &lt;/ENAMEX&gt; AND FIGHTING OVER WHETHER IT'S GOING DOOR POCKETS WILL TELL YOU THE &lt;ENAMEX TYPE=&quot;ORGANIZATION&quot;&gt; VEHICLES CLIMBED DATES THEREAFTER &lt;/ENAMEX&gt; AND IF YOU'RE REQUIRED TO PAY CHILD SUPPORT INFORMATION THAT YOUR JOB AND COME AND ADDRESS NOW PART HAVE &lt;ENAMEX TYPE=&quot;ORGANIZATION&quot;&gt; A NATIONAL REGISTRY THE HEADLINE &lt;/ENAMEX&gt; NEWS I'M &lt;ENAMEX TYPE=&quot;PERSON&quot;&gt; KIMBERLY KENNEDY &lt;/ENAMEX&gt; THOSE STORIES IN A MOMENT BUT FIRST &lt;/TEXT&gt; &lt;/DOC&gt;</Paragraph>
<Paragraph position="2"> We need a better-performing recognizer, but that may be hard. Instead we might be able to use other information from the speech recognizer to overcome this problem. We did not have confidence scores for the words in the ASR output. If we had had that information, or if we were able to obtain information about which words were out of vocabulary (OOV), we could index all words with low confidence scores, or all OOV words, by their Soundex codes.</Paragraph>
<Paragraph position="3"> Alternatively, one could normalize all words in the ASR output that are not part of the regular English vocabulary by their Soundex codes.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Other ways of grouping entities </SectionTitle>
<Paragraph position="0"> Another direction of research to pursue is the way in which approximate string matching is used to compare documents. The way we used approximate string matching in this paper was fairly simple. However, it loses out on some names that ought to go together, particularly when two names differ in their first letter, for example Katherine and Catherine. Their Soundex codes are k365 and c365 respectively. This is a consequence of the way the Soundex code of a word is constructed.</Paragraph>
<Paragraph position="1"> There are other approximate string matching measures, such as the Levenshtein or edit distance, which is the number of string edit operations required to convert one string into the other. The words Katherine and Catherine have an edit distance of 1. Given two documents D1 and D2, we can compute the distance between them by computing the distance between all pairs of names that occur in the two documents, using the distances to group entities, and finally finding the similarity of the two documents. However, this means that each entity in D1 has to be compared to all entities in D1 and D2. Besides, this method brings with it the question of how to use the distances between the names so as to group together similar names. This method is probably a good direction for future research, because the Levenshtein distance could be a better string matching technique. Another plausible strategy would be to use the edit distance between the Soundex codes of the names when comparing documents; Katherine and Catherine would have a distance of 1 in this case too.</Paragraph>
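To make the edit-distance alternative concrete, here is an illustrative Python sketch of the Levenshtein distance, together with a hypothetical closest_pairs helper that pairs names across two documents when their distance, computed over the raw strings or equally over their Soundex codes, falls within a small threshold. The helper and the max_dist cutoff are our own illustration, not a method from the paper.

def edit_distance(a, b):
    """Levenshtein distance: number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def closest_pairs(names1, names2, max_dist=1):
    """Greedily pair each name in names1 with its nearest name in names2."""
    pairs = []
    for n1 in names1:
        best = min(names2, key=lambda n2: edit_distance(n1, n2))
        if edit_distance(n1, best) > max_dist:
            continue                              # no sufficiently close match
        pairs.append((n1, best))
    return pairs

print(edit_distance("katherine", "catherine"))    # 1
print(closest_pairs(["katherine", "lewinskey"], ["catherine", "lewinsky"]))
# [('katherine', 'catherine'), ('lewinskey', 'lewinsky')]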
<Paragraph position="2"> Using cross-document coreference resolution techniques to find equivalence classes of entities would be yet another alternative approach. In cross-document coreference, two mentions of the same name may or may not be included in the same group, depending on whether the contexts of the two mentions are the same or different.</Paragraph> </Section> </Section> </Paper>