<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1009"> <Title>Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text</Title> <Section position="8" start_page="68" end_page="71" type="evalu"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> Local context contributes differently to each of the four deidentification systems. Our SVM-based approach uses only local context. The heuristic, rule-based system relies heavily on dictionaries. IdentiFinder uses a simplified representation of local context and adds information about the global context, represented as transition probabilities between entities in the sentence. SNoW also uses local context, but it additionally tries to benefit from relations between entities. Given the differences in the strengths of these systems, we compared their performance on both the reidentified and authentic corpora (see Section 3). We hypothesized that, given the nature of medical discharge summaries, IdentiFinder would not be able to find enough global context and SNoW would not be able to make use of relations (because many sentences in this corpus contain only one entity). We further hypothesized that when the data contain words ambiguous between PHI and non-PHI, or when the PHI cannot be found in dictionaries, the heuristic, rule-based approach would perform poorly. In all of these cases, SVMs trained with local context information would be sufficient for proper deidentification.</Paragraph> <Paragraph position="1"> To compare the SVM approach with IdentiFinder, we evaluated both on PHI consisting of names of people (i.e., patient and doctor names), locations (i.e., geographic locations), and organizations (i.e., hospitals), as well as PHI consisting of dates and contact information (i.e., phone numbers, pagers). We omitted PHI representing ID numbers from this experiment in order to be fair to IdentiFinder, which was not trained on this category.
To compare the SVM approach with SNoW, we trained both systems with only PHI consisting of names of people, locations, and organizations, i.e., the entities that SNoW was designed to recognize.</Paragraph> <Section position="1" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 6.1 Deidentifying Reidentified and Authentic Discharge Summaries </SectionTitle> <Paragraph position="0"> We first deidentified two corpora: (1) previously deidentified discharge summaries into which we inserted invented but realistic surrogates for PHI, without deliberately introducing ambiguous words or words not found in dictionaries, and (2) authentic discharge summaries with real PHI. Our experiments showed that SVMs with local context outperformed all other approaches. On the reidentified corpus, SVMs gave an F-measure of 97.2% for PHI. In comparison, IdentiFinder, having been trained on the news corpus, gave an F-measure of 67.4% and was outperformed by the heuristic+dictionary approach (see Table 2).</Paragraph> <Paragraph position="1"> Note that in deidentification, recall is much more important than precision. Low recall indicates that many PHI remain in the documents and that there is a high risk to patient privacy. Low precision means that words that do not correspond to PHI have also been removed. This hurts the integrity of the data but does not present a risk to privacy.</Paragraph> <Paragraph position="2"> We evaluated SNoW only on the three kinds of entities it is designed to recognize. We cross-validated it on our corpora and found that its performance in recognizing people, locations, and organizations was 96.2% in terms of F-measure (see Table 3).
In comparison, our SVM-based system, when retrained to consider only people, locations, and organizations so as to be directly comparable to SNoW, had an F-measure of 98%.</Paragraph> <Paragraph position="3"> Table 3: People, locations, and organizations found in reidentified discharge summaries.</Paragraph> <Paragraph position="4"> Similarly, on the authentic discharge summaries, the SVM approach outperformed all other approaches in recognizing PHI (see Tables 4 and 5).</Paragraph> </Section> <Section position="2" start_page="69" end_page="70" type="sub_section"> <SectionTitle> 6.2 Deidentifying Data with Ambiguous PHI </SectionTitle> <Paragraph position="0"> In discharge summaries, the same words can appear both as PHI and as non-PHI. For example, in the same corpus, the word &quot;Swan&quot; can appear both as part of the name of a medical device (i.e., &quot;Swan Catheter&quot;) and as the name of a person. Ideally, we would like to deidentify data even when many words in the corpus are ambiguous between PHI and non-PHI.</Paragraph> <Paragraph position="1"> The best performances are marked in bold in all of the tables in this paper. For all of the corpora presented in this paper, a performance difference of 1% or more is statistically significant at α = 0.05.</Paragraph> <Paragraph position="2"> We hypothesize that, given ambiguities in the data, context will play an important role in determining whether a particular instance of a word is PHI, and that, given the many fragmented sentences in our corpus, local context will be particularly useful. To test these hypotheses, we generated a corpus by reidentifying the previously deidentified corpus with words that were ambiguous between PHI and non-PHI, making sure to use each ambiguous word both as PHI and as non-PHI, and also making sure to cover all acceptable formats of all PHI (see Section 3). The resulting distribution of PHI is shown in Table 6.
Our results showed that, on this corpus, the SVM-based system accurately recognized 91.9% of all PHI; its performance, measured in terms of F-measure, was also significantly better than that of all other approaches, both on the complete corpus containing ambiguous entries (see Tables 7 and 8) and on only the ambiguous words in this corpus (see Table 9).</Paragraph> </Section> <Section position="3" start_page="70" end_page="71" type="sub_section"> <SectionTitle> 6.3 Deidentifying PHI Not Found in Dictionaries </SectionTitle> <Paragraph position="0"> Some medical documents contain foreign or misspelled names that need to be effectively removed.</Paragraph> <Paragraph position="1"> To evaluate the different deidentification approaches under such circumstances, we generated a corpus in which the names of people, locations, and hospitals were all random permutations of letters. The resulting words were not found in any dictionaries but followed the general format of the entity name category to which they belonged. The distribution of PHI in this third corpus is shown in Table 10.</Paragraph> <Paragraph position="2"> Table 10: The words associated with names are randomly generated so as not to be found in dictionaries.</Paragraph> <Paragraph position="3"> On this data set, dictionaries cannot contribute to deidentification because none of the PHI appear in them. Under these conditions, proper deidentification relies completely on context. Our results showed that the SVM approach outperformed all other approaches on this corpus as well (Tables 11 and 12).</Paragraph> <Paragraph position="5"> Considering only the PHI not found in dictionaries, the SVM approach accurately identified 95.5%.
In comparison, the heuristic+dictionary approach accurately identified these PHI only 11.1% of the time, IdentiFinder recognized them 76.7% of the time, and SNoW gave an accuracy of 79% (see Table 13).</Paragraph> </Section> <Section position="4" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 6.4 Feature Importance </SectionTitle> <Paragraph position="0"> As hypothesized, in all experiments the SVM-based approach outperformed all other approaches.</Paragraph> <Paragraph position="1"> The SVM's feature set included a total of 26 features, 12 of which were dictionary-related (excluding MeSH). Information gain showed that the most informative features for deidentification were the TW itself, the bigram before the TW, the bigram after the TW, the word before the TW, and the word after the TW.</Paragraph> <Paragraph position="2"> Note that the TW itself is important for classification; many non-PHI correspond to common words that appear frequently in the corpus, and the SVM learns that some words, e.g., &quot;the&quot; and &quot;admit&quot;, are never PHI. In addition, the context of the TW (captured in the form of unigrams and bigrams of the words and part-of-speech tags surrounding the TW) contributes significantly to deidentification.</Paragraph> <Paragraph position="3"> There are many ways of automatically capturing context. In our data, unigrams and bigrams of words and their part-of-speech tags seem to be sufficient for a statistical representation of local context. The global context, as represented within IdentiFinder and SNoW, could not contribute much to deidentification on this corpus because of the fragmented nature of the language of these documents, because most sentences in this corpus contain only one entity, and because many sentences do not include explicit relations between entities.
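As an illustration only, a local-context representation of this kind (the target word plus the surrounding unigrams and bigrams of words and part-of-speech tags) can be sketched as follows; the feature names, the BOS/EOS sentinel tokens, and the exact window are our assumptions, not the authors' implementation:

```python
def local_context_features(tokens, pos_tags, i):
    """Sketch: local-context features for the target word (TW) tokens[i].

    Returns the TW, the words immediately before and after it, the
    bigrams before and after it, and the neighboring POS tags.
    Feature names and padding are illustrative assumptions.
    """
    # Pad with two sentinels on each side so edge tokens have full context.
    pt = ["BOS", "BOS"] + list(tokens) + ["EOS", "EOS"]
    pp = ["BOS", "BOS"] + list(pos_tags) + ["EOS", "EOS"]
    j = i + 2  # index of TW in the padded sequences
    return {
        "tw": pt[j],
        "word_before": pt[j - 1],
        "word_after": pt[j + 1],
        "bigram_before": pt[j - 2] + " " + pt[j - 1],
        "bigram_after": pt[j + 1] + " " + pt[j + 2],
        "pos_before": pp[j - 1],
        "pos_after": pp[j + 1],
    }
```

Features such as these would then be fed, one vector per token, to the classifier.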
However, there is enough structure in these data for local context to capture; the lack of relations between entities and the inability to capture global context do not prevent almost perfect deidentification.</Paragraph> </Section> </Section> </Paper>