<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1137"> <Title>Identification of Confusable Drug Names: A New Approach and Evaluation Methodology</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Phonetic Similarity: ALINE </SectionTitle> <Paragraph position="0"> The ALINE cognate matching algorithm (Kondrak, 2000) assigns a similarity score to pairs of phonetically transcribed words on the basis of the decomposition of phonemes into elementary phonetic features. The algorithm was initially designed to identify and align cognates in vocabularies of related languages (e.g. colour and couleur). Nevertheless, thanks to its grounding in universal phonetic principles, the algorithm can be used for estimating the similarity of any pair of words, including drug names. Furthermore, unlike SOUNDEX and EDITEX, ALINE is completely language-independent.</Paragraph> <Paragraph position="1"> The principal component of ALINE is a function that calculates the similarity of two phonemes that are expressed in terms of about a dozen binary or multi-valued phonetic features (Place, Manner, Voice, etc.). Feature values are encoded as floating-point numbers in the range [0, 1]. For example, the feature Manner can take any of the following seven values: stop = 1.0, affricate = 0.9, fricative = 0.8, approximant = 0.6, high vowel = 0.4, mid vowel = 0.2, and low vowel = 0.0. The numerical values reflect the distances between vocal organs during speech production. The phonetic features are assigned salience weights that express their relative importance.</Paragraph> <Paragraph position="2"> The overall similarity score of two words, computed together with their optimal alignment by a dynamic programming algorithm (Wagner and Fischer, 1974), is the sum of the individual similarity scores between pairs of aligned phonemes. A constant insertion/deletion penalty is applied for each unaligned phoneme. 
Another constant penalty is set to reduce the relative importance of vowel, as opposed to consonant, phoneme matches. The similarity value is normalized by the length of the longer word.</Paragraph> <Paragraph position="3"> ALINE's behavior is controlled by a number of parameters: the maximum phonemic score, the insertion/deletion penalty, the vowel penalty, and the feature salience weights. The parameters have default settings for the cognate matching task, but these settings may not be appropriate for drug-name matching. The settings can be optimized (tuned) on a training set that includes positive and negative examples of confusable name pairs.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Orthographic Similarity: BI-SIM </SectionTitle> <Paragraph position="0"> An analysis of the reasons behind the unsatisfactory performance of commonly used measures led us to propose a new measure of orthographic similarity: BI-SIM.3 Below, we describe the inherent strengths and weaknesses of n-gram and subsequence-based approaches. Next, we present a new, generalized framework that characterizes a number of commonly used similarity measures. Following this, we describe the parametric settings for BI-SIM, a specific instantiation of this generalized framework.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Problems with Commonly Used Measures </SectionTitle> <Paragraph position="0"> The Dice coefficient computed for bigrams (BIGRAM) is an example of a measure that is demonstrably inappropriate for estimating word similarity. Because it is based exclusively on complete bigrams, it often fails to discover any similarity between words that look very much alike.</Paragraph> <Paragraph position="1"> For example, it returns zero on the pair Verelan/Virilon. 
In addition, it violates a desirable requirement of any similarity measure: the maximum similarity of 1 should result only when comparing identical words. In particular, non-identical pairs4 like Xanex/Nexan, in which all bigrams are shared, are assigned a similarity value of 1. Moreover, it sometimes associates bigrams that occur in radically different word positions, as in the pair Voltaren/Tramadol. Finally, the initial segment, which is arguably the most important in determining drug-name confusability,5 is actually given a lower weight than other segments because it participates in only one bigram. It is therefore surprising that BIGRAM has been such a popular choice of measure for computing word similarity.</Paragraph> <Paragraph position="2"> LCSR is more appropriate for identifying potential drug-name confusability because it does not rely on (frequently imprecise) bigram matching. However, LCSR is weak in its tendency to posit nonintuitive links, such as the ones between segments in Benadryl/Cardura. The fact that it returns the same value for both Amaryl/Amikin and Amaryl/Altoce can be attributed to its lack of context sensitivity.</Paragraph> <Paragraph position="3"> 3 BI-SIM was developed before we conducted the experiments described in Section 6.</Paragraph> <Paragraph position="4"> 4 This observation is due to Ukkonen (1992).</Paragraph> <Paragraph position="5"> 5 74.2% of the confusable pairs in the pharmacopeial gold standard (Section 6) have identical initial segments.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 A Generalized n-gram Measure </SectionTitle> <Paragraph position="0"> Although it may not be immediately apparent, LCSR can be viewed as a variant of the n-gram approach. If n is set to 1, the Dice coefficient formula returns the number of shared letters divided by the average length of two strings. Let us call this measure UNIGRAM. 
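As an illustration, the n-gram Dice coefficient discussed so far can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the experiments: the function names ngrams and dice are hypothetical, names are lower-cased, n-grams are counted as multisets, and no segments are appended to the strings.

```python
from collections import Counter


def ngrams(word: str, n: int) -> Counter:
    """Multiset of the contiguous n-grams of a lower-cased word."""
    w = word.lower()
    return Counter(w[i:i + n] for i in range(len(w) - n + 1))


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient: 2 * shared n-grams / total n-grams in both words."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    # Multiset intersection: each n-gram counts as often as it occurs in both.
    shared = sum(min(count, gb[gram]) for gram, count in ga.items())
    return 2 * shared / total


print(dice("Verelan", "Virilon", 2))  # BIGRAM: no bigram is shared
print(dice("Xanex", "Nexan", 2))      # BIGRAM: every bigram is shared
print(dice("Verelan", "Virilon", 1))  # UNIGRAM: the letters v, r, l, n are shared
```

Under these assumptions, the sketch gives 0 for Verelan/Virilon and 1 for Xanex/Nexan when n = 2, and with n = 1 it reduces to the UNIGRAM measure just defined.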
The main difference between LCSR and UNIGRAM is that the former obeys the no-crossing-links constraint, which stipulates that the matched unigrams must form a subsequence of both of the compared strings, whereas the latter disregards the order of unigrams. For example, for pat/tap, LCSR returns 0.33 because the length of the longest common subsequence is 1, while UNIGRAM returns 1.0 because all letters are shared. The other, minor difference is that the denominator of LCSR is the length of the longer string, as opposed to the average length of two strings in UNIGRAM. (In fact, LCSR is sometimes defined with the average length in the denominator.) We define a generalized measure based on n-grams with the following parameters: 1. The value of n.</Paragraph> <Paragraph position="1"> 2. The presence or absence of the no-crossing-links constraint.</Paragraph> <Paragraph position="2"> 3. The number of segments appended to the beginning and the end of the strings.</Paragraph> <Paragraph position="3"> 4. The length normalization factor: either the maximum or the average length of the strings.</Paragraph> <Paragraph position="4"> A number of commonly used similarity measures can be expressed in the above framework. The combination of n = 1 with the no-crossing-links constraint produces LCSR. By selecting n = 2 and the average normalization factor, we obtain the BIGRAM measure. Thirteen of the twenty-two measures tested by Lambert et al. (1999) are variants that combine either n = 2 or n = 3 with various lengths of appended segments.</Paragraph> <Paragraph position="5"> So far, we have assumed that there are only two possible values of n-gram similarity: identical or non-identical. This need not be the case. Obviously, some non-identical n-grams are more similar than others. 
We can define a similarity scale for two n-grams as the number of identical segments in the corresponding positions divided by n: $$s(x_1 \ldots x_n,\; y_1 \ldots y_n) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{id}(x_i, y_i)$$</Paragraph> <Paragraph position="7"> where id(a, b) returns 1 if a and b are identical, and 0 otherwise. The scale distinguishes n + 1 levels of similarity, including 1 for identical bigrams, and 0 for completely distinct bigrams.6 The notion of similarity scale between n-grams requires clarification in the case of n-grams partially composed of segments appended to the beginning or end of strings. Normally, extra affixes are composed of one or more copies of a unique special symbol, such as space, that does not belong to the string alphabet. Instead, we define an alphabet of special symbols that contains a unique symbol for each of the symbols in the original string alphabet. The extra affixes are assumed to contain copies of the special symbol that corresponds to the initial symbol of the string.</Paragraph> <Paragraph position="8"> In this way, the similarity between pairs of n-grams in which one or both of the n-grams overlap with an extra affix is guaranteed to be either 0 or 1.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 BI-SIM </SectionTitle> <Paragraph position="0"> We propose a new measure of orthographic similarity, called BI-SIM, that aims at combining the advantages of the context inherent in bigrams, the precision of unigrams, and the strength of the no-crossing-links constraint. BI-SIM belongs to the class of n-gram measures defined above. Its parameters are: n = 2, the no-crossing-links constraint enforced, a single segment appended to the beginning of the string, normalization by the length of the longer string, and multi-valued n-gram similarity.</Paragraph> <Paragraph position="1"> The rationale behind the specific settings is as follows. 
n = 2 is the minimum value that provides context for matching segments within a string. The no-crossing-links constraint guarantees the sequentiality of segment matches. The segment added to the beginning increases the importance of a match on the initial segment. The normalization method favors associations between words of similar length. Finally, the refined n-gram similarity scale increases the resolution of the measure.</Paragraph> <Paragraph position="2"> BI-SIM is defined by the following recurrence: $$f(i, j) = \max \{\, f(i-1, j),\; f(i, j-1),\; f(i-1, j-1) + s(x_{i-1} x_i,\, y_{j-1} y_j) \,\}$$</Paragraph> <Paragraph position="4"> 6 The scale could be further refined to include more levels of similarity. For example, bigrams that are frequently confused because of their typographic or cursive shape, such as en/im, could be assigned a similarity value that corresponds to the frequency of their confusions.</Paragraph> <Paragraph position="5"> where s refers to the n-gram similarity scale defined in Section 4.2, and x_0 and y_0 are the appended segments. Furthermore, f(i, j) is defined to be 0 if i = 0 or j = 0.</Paragraph> <Paragraph position="7"> The recurrence bears a strong similarity to the relation for computing the longest common subsequence, except that the subsequence is composed of bigrams rather than unigrams, and the bigrams are weighted according to their similarity. Assuming that the segments appended to the beginning of each string are chosen according to the rule specified in Section 4.2, the returned value of BI-SIM always falls in the interval [0, 1]. In particular, it returns 1 if and only if the strings are identical, and 0 if and only if the strings have no segments in common.</Paragraph> <Paragraph position="8"> BI-SIM can be seen as a generalization of LCSR: the setting of n = 1 reduces BI-SIM to LCSR (which could also be called UNI-SIM). On the other hand, the setting of n = 3 yields TRI-SIM. 
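The BI-SIM recurrence can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names bigram_sim and bi_sim are hypothetical, names are lower-cased, and the appended special symbol is modelled as a tuple so that it can only match another special symbol derived from the same initial letter.

```python
def bigram_sim(a, b):
    """Fraction of the two positions at which the bigrams hold identical segments."""
    return sum(p == q for p, q in zip(a, b)) / 2


def bi_sim(x: str, y: str) -> float:
    """BI-SIM sketch: LCS-style dynamic program over similarity-weighted bigrams."""
    x, y = x.lower(), y.lower()
    if not x or not y:
        return 0.0
    # Prepend one special symbol per string, tied to its initial letter
    # (assumed encoding: a tuple, which never equals an ordinary letter).
    xs = [("#", x[0])] + list(x)
    ys = [("#", y[0])] + list(y)
    n, m = len(x), len(y)
    # f[i][j] is the best score over the first i and j segments;
    # row 0 and column 0 stay at 0, the base case of the recurrence.
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = bigram_sim(xs[i - 1:i + 1], ys[j - 1:j + 1])
            f[i][j] = max(f[i - 1][j], f[i][j - 1], f[i - 1][j - 1] + s)
    # Normalize by the length of the longer string.
    return f[n][m] / max(n, m)


for name in ("Tramadol", "Tobradex", "Torecan", "Taxol"):
    print(name, round(bi_sim("Toradol", name), 4))
```

Under these assumptions, the sketch returns 0.6875 for Toradol/Tramadol, 0.625 for Toradol/Tobradex, and 0.5 for Toradol/Taxol, in agreement with the BI-SIM scores listed for Toradol in Table 4.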
TRI-SIM requires two extra symbols at the beginning of the string.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Evaluation Methodology </SectionTitle> <Paragraph position="0"> We designed a new method for evaluating the accuracy of a measure. For each drug name, we sort all the other drug names in the test set in order of decreasing similarity. We calculate the recall by dividing the number of true positives among the top k names by the total number of true positives for this particular drug name, i.e., the fraction of the confusable names that are discovered by taking the top k similar names. At the end, we apply an information-retrieval technique called macro-averaging (Salton, 1971), which averages the recall values across all drug names in the test set.7 Because there is a trade-off between recall and the k threshold, it is important to measure the recall at different values of k. Table 4 shows the top 8 names that are most similar to Toradol according to the BI-SIM similarity measure. A '+'/'-' mark indicates whether the pair is a true confusion pair. The pairs are listed in rank order, according to the score assigned by the indicated algorithm. Names that return the same similarity value are listed in reverse lexicographic order. Since the test set contains four drug names that have been identified as confusable with Toradol (Tramadol, Torecan, Tegretol, and Inderal), the recall values in Table 4 are, for example, 1/4 = 0.25 for k = 1 and 3/4 = 0.75 for k = 8.</Paragraph> <Paragraph position="1"> 7 We could have also chosen to micro-average the recall values by dividing the total number of true positives discovered among the top k candidates by the total number of true positives in the test set. 
The choice of macro-averaging over micro-averaging does not affect the relative ordering of similarity measures implied by our results.</Paragraph> <Paragraph position="2"> Table 4: The top 8 names that are most similar to Toradol according to the BI-SIM similarity measure, and the corresponding recall values.
    Name       Score   +/-  Recall
 1. Tramadol   0.6875  +    0.25
 2. Tobradex   0.6250  -    0.25
 3. Torecan    0.5714  +    0.50
 4. Stadol     0.5714  -    0.50
 5. Torsemide  0.5000  -    0.50
 6. Theraflu   0.5000  -    0.50
 7. Tegretol   0.5000  +    0.75
 8. Taxol      0.5000  -    0.75
</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Experiments and Results </SectionTitle> <Paragraph position="0"> We conducted two experiments with the goal of evaluating the relative accuracy of several measures of similarity in identifying confusable drug names. The first experiment was performed against an online gold standard: the United States Pharmacopeial Convention Quality Review, 2001 (henceforth the USP set).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> </SectionTitle> <Paragraph position="0"> The USP set contains both look-alike and sound-alike confusion pairs. We used 582 unique drug names from this source to combinatorially induce 169,071 possible pairs. 
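The number of candidate pairs follows directly from the number of names, since every unordered pair of distinct names is considered:

```latex
\binom{582}{2} = \frac{582 \times 581}{2} = 169{,}071
```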
Out of these, 399 were true confusion pairs in the gold standard.</Paragraph> <Paragraph position="1"> The maximum number of true positives for a single name was 6, but for the majority of names (436 out of 582), only one confusable name is identified in the gold standard.</Paragraph> <Paragraph position="2"> On average, the task was to identify 1.37 true positives among 581 candidate names.</Paragraph> <Paragraph position="3"> We computed the similarity of each name pair using the following similarity measures: BIGRAM, TRIGRAM-2B, LCSR, EDIT, NED, SOUNDEX, EDITEX, BI-SIM, TRI-SIM, ALINE and PREFIX.</Paragraph> <Paragraph position="4"> PREFIX is a baseline-type similarity measure that returns the length of the common prefix divided by the length of the longer string. In addition, we calculated the COMBINED measure by taking the simple average of the values returned by PREFIX, EDIT, BI-SIM, and ALINE.</Paragraph> <Paragraph position="5"> In order to apply ALINE to the USP set, all drug names were transcribed into phonetic symbols.</Paragraph> <Paragraph position="6"> This transcription was approximated by applying a simple set of about thirty regular expression rules.</Paragraph> <Paragraph position="7"> (It is likely that a more sophisticated transcription method would improve ALINE's performance.) In the first experiment, the parameters of ALINE were not optimized; rather, they were set according to the values used for the distinct task of cross-language cognate identification.</Paragraph> <Paragraph position="8"> In Figure 1, the macro-averaged recall values achieved by several measures on the USP set are plotted against the cutoff k. Some measures have been left out in order to preserve the clarity of the plot. 
Table 5 contains detailed results for two values of the cutoff k for all measures.</Paragraph> <Paragraph position="9"> [Table 5: results for the USP and the sound-alike test sets.]</Paragraph> <Paragraph position="10"> Since the USP set contains both look-alike and sound-alike name pairs, we conducted a second experiment to compare the performance of various measures on sound-alike pairs only. We used a proprietary list of 276 drug names identified by experts as &quot;names of concern&quot; for 83 &quot;consult&quot; names. None of the &quot;consult&quot; names and only about 25% of the &quot;names of concern&quot; are in the USP set, i.e., there are no true positive pairs shared between the two sets. The maximum number of true positives was 11, while the average for all names was 3.33.</Paragraph> <Paragraph position="11"> The measures were applied to calculate the similarity between each of the 83 &quot;consult&quot; names and a list of 2596 drug names. The results are shown in Figure 2. Since the task, which involved identifying, on average, 3.33 true positives among 2596 candidates, was more challenging, the recall values are lower than in Figure 1. All drug names were first converted into a phonetic notation by means of a set of regular expression rules. [Figure 2: recall on the sound-alike test set.]</Paragraph> <Paragraph position="12"> (We found that phonetic transcription led to a slight improvement in the recall values achieved by the orthographic measures.) The parameters of ALINE used in this experiment were optimized beforehand on the USP set.</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> The results described in Section 6 clearly indicate that BI-SIM and TRI-SIM, the newly proposed measures of similarity, outperform several currently used measures on the USP test set regardless of the choice of the cutoff parameter k. 
However, a simple combination of several measures achieves even higher accuracy. On the sound-alike confusion set, EDITEX and ALINE are the most effective. The accuracy achieved by the best measures is impressive: for the combined measure, the average recall on the USP set exceeds 90% with only the top 15 candidates considered.</Paragraph> <Paragraph position="1"> The USP test set has its limitations. The set includes pairs that are considered confusable for reasons other than phonetic or orthographic similarity, including illegible handwriting, incomplete knowledge of drug names, newly available products, similar packaging or labeling, and incorrect selection of a similar name from a computerized product list. In many cases, the names do not sound or look alike, but when handwritten or communicated verbally, these names have caused or could cause a mix-up. On the other hand, many clearly confusable name pairs are not identified as such (e.g. Erythromycin/Erythrocin, Neosar/Neoral, Lorazepam/Flurazepam, Erex/Eurax/Urex, etc.).</Paragraph> <Paragraph position="2"> All similarity measures have their own strengths and weaknesses. n-GRAM is effective at recognizing pairs such as Chlorpromazine/Prochlorperazine, where a shorter name closely matches parts of the longer name. However, this advantage is offset by its poor performance on similar-sounding names with few shared bigrams (Nasarel/Nizoral). LCSR is able to identify pairs where common subsequences are interleaved with dissimilar segments, such as Asparaginase/Pegaspargase, but fails on similar-sounding names where the overlap of identical segments is minimal (Luride/Lortab). ALINE detects phonetic similarity even when it is obscured by the orthography (e.g. 
Xanax/Zantac), but phonetic transcription is required beforehand.</Paragraph> <Paragraph position="3"> The idiosyncrasies of individual measures are attenuated when the measures are combined, which may explain the excellent performance of the combined measure. Each measure is focused on a particular facet of string similarity: initial segments in PREFIX, phonetic sound-alike quality in ALINE, common clusters in bigram-based measures, overall transformability in EDIT, etc. For this reason, a blend of several measures achieves higher accuracy than any of its components.</Paragraph> <Paragraph position="4"> Our experiments confirm that orthographic approaches are superior to their phonetic counterparts in tasks involving string matching (Zobel and Dart, 1995). Nevertheless, phonetic approaches identify many sound-alike names that are beyond the reach of orthographic approaches. In applications where the gap between spelling and pronunciation plays an important role, it is advisable to employ phonetic approaches as well. The two most effective ones are EDITEX and ALINE, but whereas ALINE is language-independent, EDITEX incorporates English-specific letter groups and rules.</Paragraph> </Section> </Paper>