<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1137"> <Title>Identification of Confusable Drug Names: A New Approach and Evaluation Methodology</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> Drug-name matching refers to the use of string matching to rank drug names by similarity.</Paragraph> <Paragraph position="1"> There are two classes of string matching: orthographic and phonetic. For each of these, there are two methods of matching: distance and similarity.</Paragraph> <Paragraph position="2"> If two drug names are confusable, their distance should be small and their similarity should be large.</Paragraph> <Paragraph position="3"> Some examples of orthographic and phonetic algorithms for both distance- and similarity-based approaches are shown in Table 1.</Paragraph> <Paragraph position="4"> In the remainder of this section, we describe a number of measures that have been applied to the problem of identifying confusable drug names. Specific examples of values obtained by the measures are provided in Table 2.</Paragraph> <Paragraph position="5"> String-edit distance (Wagner and Fischer, 1974) (EDIT), also known as Levenshtein distance, counts the number of edit operations needed to transform one string into another, where the cost of substitution is the same as the cost of insertion or deletion. A normalized edit distance (NED) is calculated by dividing the total edit cost by the length of the longer string.</Paragraph> <Paragraph position="6"> The longest common subsequence ratio (Melamed, 1999) (LCSR) is computed by dividing the length of the longest common subsequence by the length of the longer string.</Paragraph> <Paragraph position="7"> LCSR is closely related to normalized edit distance. 
If the cost of substitution is at least twice the cost of insertion/deletion and the strings are of equal length, LCSR is equivalent to the normalized edit distance: the normalized edit distance is then a decreasing linear function of LCSR, so the two measures rank name pairs identically. In n-gram measures, the number of n-grams that are shared by two strings is doubled and then divided by the total number of n-grams in the two strings: $\mathrm{Dice}(x, y) = \frac{2\,|n\text{-grams}(x) \cap n\text{-grams}(y)|}{|n\text{-grams}(x)| + |n\text{-grams}(y)|}$, where n-grams(x) is a multi-set of letter n-grams in x. This formula is often referred to as the Dice coefficient. A slight variation of this measure is obtained by adding extra symbols, such as spaces, before and/or after each string (Lambert et al., 1999). The modification is designed to increase sensitivity to the beginnings and endings of words. For example, TRIGRAM-2B is calculated by applying the Dice formula with n = 3 after adding two spaces before each string. In this paper, we consider two specific variants: BIGRAM, which is the most basic formulation, and TRIGRAM-2B, which Lambert et al. (1999) identified as particularly effective for identifying confusable drug name pairs.</Paragraph> <Paragraph position="8"> SOUNDEX (Hall and Dowling, 1980) is an approximation to phonetic name matching. It transforms all but the first letter into numeric codes (see Table 3) and, after removing zeroes, truncates the resulting string to 4 characters. For the purposes of comparison, we implemented a SOUNDEX-based similarity measure that returns the edit distance between the corresponding codes.</Paragraph> <Paragraph position="9"> EDITEX (Zobel and Dart, 1996) is another quasi-phonetic measure that combines edit distance with a letter-grouping scheme similar to SOUNDEX (Table 3). As in SOUNDEX, the codes are designed to identify letters that have similar pronunciations, but the corresponding sets of letters are not disjoint.</Paragraph> <Paragraph position="10"> The edit distance between letters that belong to the same group is smaller than the edit distance between other letters. 
Additional rules are aimed at eliminating silent and reduplicated letters.</Paragraph> </Section> </Paper>
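
The orthographic measures described in this section (EDIT, NED, LCSR, and the n-gram Dice coefficient with its BIGRAM and TRIGRAM-2B variants) can be sketched in Python as follows. This is a minimal illustration written from the prose definitions, not the authors' implementation. The function names, the handling of empty strings, and the choice to leave case folding to the caller are assumptions, and the name pair in the usage lines is only an illustrative input.

```python
from collections import Counter


def edit_distance(x: str, y: str) -> int:
    """EDIT: unit-cost edit distance (substitution costs the same as insertion/deletion)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def ned(x: str, y: str) -> float:
    """NED: edit cost divided by the length of the longer string."""
    longer = max(len(x), len(y))
    # Assumption: two empty strings are at distance 0.
    return edit_distance(x, y) / longer if longer else 0.0


def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(x: str, y: str) -> float:
    """LCSR: LCS length divided by the length of the longer string."""
    longer = max(len(x), len(y))
    # Assumption: two empty strings are maximally similar.
    return lcs_length(x, y) / longer if longer else 1.0


def ngrams(s: str, n: int, pad_before: int = 0) -> Counter:
    """Multi-set of letter n-grams of s, after prepending pad_before spaces."""
    s = " " * pad_before + s
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def dice(x: str, y: str, n: int, pad_before: int = 0) -> float:
    """Dice coefficient: twice the shared n-grams over the total n-gram count."""
    gx, gy = ngrams(x, n, pad_before), ngrams(y, n, pad_before)
    shared = sum((gx & gy).values())  # multi-set intersection
    total = sum(gx.values()) + sum(gy.values())
    return 2 * shared / total if total else 0.0


def bigram(x: str, y: str) -> float:
    """BIGRAM: Dice with n = 2 and no padding."""
    return dice(x, y, n=2)


def trigram_2b(x: str, y: str) -> float:
    """TRIGRAM-2B: Dice with n = 3 after adding two spaces before each string."""
    return dice(x, y, n=3, pad_before=2)


if __name__ == "__main__":
    a, b = "zantac", "xanax"  # illustrative pair, lowercased by the caller
    print(edit_distance(a, b))  # 3
    print(ned(a, b))            # 0.5
    print(lcsr(a, b))           # 0.5
    print(trigram_2b(a, b))     # 0.0
```

Distances (EDIT, NED) are small for confusable pairs, while similarities (LCSR, BIGRAM, TRIGRAM-2B) are large, so the two kinds of score must be sorted in opposite directions when ranking candidate names.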