File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/p02-1021_intro.xml
Size: 6,583 bytes
Last Modified: 2025-10-06 14:01:28
<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1021"> <Title>Semi-Supervised Maximum Entropy Based Approach to Acronym and Abbreviation Normalization in Medical Texts</Title> <Section position="5" start_page="7" end_page="7" type="intro"> <SectionTitle> ABBR EXPANSIONS FOUND IN DATA </SectionTitle> <Paragraph position="0"> NR normal range; no radiation; no recurrence; no refill; nurse; nerve root; no response; no report;</Paragraph> <Paragraph position="2"> parenteral nutrition; positional nystagmus; periarteritis nodosa BD band; twice a day; bundle INF Infection; infected; infusion; interferon; inferior; infant; infective RA Rheumatoid arthritis; renal artery; radioactive; right arm; right atrium; refractory anemia; rheumatic arthritis; right atrial data and their abbreviations found in UMLS.</Paragraph> <Paragraph position="3"> The raw text of clinical notes is input and filtered through a dynamic sliding-window buffer whose maximum window size is set to the maximum length of any abbreviation expansion in the UMLS. When a match to an expansion is found, the expansion and its context are recorded in a training file as if the expansion were an actual abbreviation. The file is then fed to the ME modeling software. In this particular implementation, the context of 7 words to the left and 7 words to the right of the found expansion as well as the section label in which the expansion occurs are recorded; however, not all of this context ended up being used in this study.</Paragraph> <Paragraph position="4"> This methodology makes the reasonable assumption that, given an abbreviation and one of its expansions, the two are likely to have similar distributions. For example, if we encounter a phrase like &quot;rheumatoid arthritis&quot;, it is likely that the context surrounding the use of the expanded phrase &quot;rheumatoid arthritis&quot; is similar to the context surrounding the use of the abbreviation &quot;RA&quot; when it is used to refer to rheumatoid arthritis.
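As a rough illustration of the expansion-harvesting step described above, the following Python sketch scans tokenized note text for known expansions and records each match with its context. The names EXPANSIONS, CONTEXT and harvest, the tiny lookup table, and the longest-match-first strategy are assumptions made for this example; they are not taken from the study's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the UMLS table: expansion tokens -> abbreviation.
EXPANSIONS: Dict[Tuple[str, ...], str] = {
    ("rheumatoid", "arthritis"): "RA",
    ("right", "atrium"): "RA",
    ("degenerative", "joint", "disease"): "DJD",
}

CONTEXT = 7  # words kept on each side of a match, as stated in the text


def harvest(tokens: List[str], section: str) -> List[dict]:
    """Emit one training event per expansion found in the token list."""
    # The window never needs to be longer than the longest known expansion.
    max_len = max(len(expansion) for expansion in EXPANSIONS)
    events = []
    i = 0
    while len(tokens) > i:
        matched = False
        # Assumption: try the longest window first so longer expansions win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in EXPANSIONS:
                events.append({
                    "abbr": EXPANSIONS[candidate],   # abbreviation being simulated
                    "label": " ".join(candidate),    # sense to be predicted
                    "left": tokens[max(0, i - CONTEXT):i],
                    "right": tokens[i + n:i + n + CONTEXT],
                    "section": section,
                })
                i += n  # skip past the matched expansion
                matched = True
                break
        if not matched:
            i += 1
    return events
```

Calling harvest on the tokens of one note section returns a list of dictionaries, each of which could be serialized as one line of a training file.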
The following subsection provides additional motivation for using expansions to simulate abbreviations.</Paragraph> <Paragraph position="5"> to the distribution of their expansions To get an idea of how similar the contexts of abbreviations and their expansions are, I conducted the following limited experiment. I processed a corpus of all available rheumatology notes (171,000) and recorded immediate contexts composed of words in positions {w</Paragraph> <Paragraph position="7"> for one unambiguous abbreviation, DJD (degenerative joint disease). Here w</Paragraph> <Paragraph position="9"> either the abbreviation DJD or its multiword expansion &quot;degenerative joint disease.&quot; Since this abbreviation has only one possible expansion, we can rely entirely on finding the strings &quot;DJD&quot; and &quot;degenerative joint disease&quot; in the corpus without having to disambiguate the abbreviation by hand in each instance. For each instance of the strings &quot;DJD&quot; and &quot;degenerative joint disease&quot;, I recorded the frequency with which words (tokens) in positions w</Paragraph> <Paragraph position="11"> occur with that string as well as the number of unique strings (types) in these positions.</Paragraph> <Paragraph position="12"> It turns out that &quot;DJD&quot; occurs 2906 times and &quot;degenerative joint disease&quot; occurs 2517 times. Of the 2906 occurrences of DJD, there were 204 types that occurred immediately prior to mention of DJD (w</Paragraph> <Paragraph position="14"> position). Of the 2517 occurrences of &quot;degenerative joint disease&quot;, there were 207 types that occurred immediately prior to mention of the disease&quot; distribution comparison.
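The context counts described above could be collected and compared along the following lines. This Python sketch is an assumption about one plausible way to measure the similarity, not the procedure used in the study: left_neighbors and overlap_stats are hypothetical helpers, and type overlap is taken here to mean the share of one string's context types that also occur with the other string.

```python
from collections import Counter
from typing import List, Tuple


def left_neighbors(tokens: List[str], target: Tuple[str, ...]) -> Counter:
    """Count the words that immediately precede each occurrence of target."""
    n = len(target)
    counts: Counter = Counter()
    # Start at 1: an occurrence at position 0 has no left neighbor.
    for i in range(1, len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == target:
            counts[tokens[i - 1]] += 1
    return counts


def overlap_stats(a: Counter, b: Counter) -> Tuple[float, float]:
    """Return (share of a's types also seen in b, share of a's tokens they cover)."""
    if not a:
        return 0.0, 0.0
    shared = set(a).intersection(b)
    type_overlap = len(shared) / len(a)
    token_coverage = sum(a[w] for w in shared) / sum(a.values())
    return type_overlap, token_coverage
```

Under this definition, a modest type overlap can still cover most occurrences when the shared context words are the frequent ones.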
On average, the overlap between the contexts in which DJD and &quot;degenerative joint disease&quot; occur is around 50%, which is considerable because this overlap covers on average 91% of all occurrences in</Paragraph> <Paragraph position="16"> as well as w and w positions. One of the questions that arose during implementation is whether it would be better to build a large set of small ME models, each trained on a sub-corpus containing contexts for a single abbreviation of interest, or to train one model on a single corpus with contexts for multiple abbreviations.</Paragraph> <Paragraph position="17"> This was motivated by the idea that ME models trained on corpora focused on a single abbreviation may perform more accurately, although such an approach may be computationally expensive.</Paragraph> <Paragraph position="18"> expansions for 6 abbreviations and the expansions actually found in the training data.</Paragraph> <Paragraph position="19"> For this study, I generated two sets of data. The first set (Set A) is composed of training and testing data for 6 abbreviations (NR, PA, PN, BD, INF, RA), where each training/testing subset contains only one abbreviation per corpus, resulting in six subsets. Table 1 shows the potential expansions for these abbreviations that were actually found in the training corpora. Not all of the possible expansions found in the UMLS for a given abbreviation will be found in the text of the clinical notes. Table 3 shows the number of expansions actually found in the rheumatology training data for each of the 6 abbreviations listed in Table 1 as well as the expansions found for a given abbreviation in the UMLS database.</Paragraph> <Paragraph position="20"> The UMLS database has on average 3 times more variability in possible expansions than was actually found in the given set of training data.
This is not surprising because the training data was derived from a relatively small subset of 10,000 notes.</Paragraph> <Paragraph position="21"> The other set (Set B) is similar to the first corpus of training events; however, it is not limited to just one abbreviation sample per corpus. Instead, it is compiled from training samples containing expansions of 69 abbreviations. The abbreviations to include in the training/testing were selected based on the following criteria: a. has at least two expansions; b. has 100-1000 training data samples. The data compiled for each set and subset was split at random, 80/20, into training and testing data. The two types of ME models (LCM and CM) were trained for each subset with 100 iterations through the data and no cutoff (all training samples used in training).</Paragraph> </Section> </Paper>