<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1211">
<Title>Creating a Test Corpus of Clinical Notes Manually Tagged for Part-of-Speech Information</Title>
<Section position="4" start_page="62" end_page="64" type="metho">
<SectionTitle> 3 Annotator agreement </SectionTitle>
<Paragraph position="0"> In order to establish the reliability of the data, we need to ensure internal as well as external consistency of the annotation. First of all, we need to make sure that the annotators agree amongst themselves (internal consistency) on how they mark up text for part-of-speech information.</Paragraph>
<Paragraph position="1"> Second, we need to find out how closely the annotators generating data for this study agree with the annotators of an established project such as the Penn Treebank (external consistency). If both tests show relatively high levels of agreement, then we can safely assume that the annotators in this study are able to generate part-of-speech tags for biomedical data that will be consistent with a widely recognized standard and can work independently of each other, thus tripling the amount of manually annotated data.</Paragraph>
<Section position="1" start_page="62" end_page="62" type="sub_section">
<SectionTitle> 3.1 Methods </SectionTitle>
<Paragraph position="0"> Two types of measures of consistency were computed - absolute agreement and the Kappa coefficient. The absolute agreement (Abs Agr) was calculated by dividing the total number of times all annotators agreed on a tag by the total number of tags.</Paragraph>
<Paragraph position="1"> The Kappa coefficient is computed as K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the proportion of times the annotators actually agree and P(E) is the proportion of times the annotators are expected to agree due to chance.</Paragraph>
<Paragraph position="2"> The Absolute Agreement is most informative when computed over several sets of labels where one of the sets represents the &quot;authoritative&quot; set. In this case, the ratio of matches among all the sets, including the &quot;authoritative&quot; set, to the total number of labels shows how close the other sets are to the &quot;authoritative&quot; one. The Kappa statistic is useful in measuring how consistent the annotators are compared to each other, as opposed to an authority standard.</Paragraph>
</Section>
<Section position="2" start_page="62" end_page="63" type="sub_section">
<SectionTitle> 3.2 Annotator consistency </SectionTitle>
<Paragraph position="0"> In order to test for internal consistency, we analyzed inter-annotator agreement where the three annotators tagged the same small corpus of clinical dictations.</Paragraph>
<Paragraph position="1"> A very detailed explanation of the terms used in the formula for Kappa computation, together with concrete examples of how it is computed, is provided in Poesio and Vieira (1998). The results were compared and the Kappa statistic was used to calculate the inter-annotator agreement. The results of this experiment are summarized in Table 1. For the absolute agreement, we computed the ratio of how many times all three annotators agreed on a tag for a given token to the total number of tags.</Paragraph>
<Paragraph position="2"> Based on the small pilot sample of 5 clinical notes (2,686 words), the Kappa test showed a very high agreement coefficient - 0.93. An acceptable agreement for most NLP classification tasks lies between 0.7 and 0.8 (Carletta 1996, Poesio and Vieira 1998). Absolute agreement numbers are consistent with the high Kappa, as they show that an average of 90% of all tags in the test documents were assigned exactly the same way by all three annotators.</Paragraph>
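To make the two agreement measures concrete, the following is a minimal Python sketch, not the code used in this study: absolute agreement is the share of tokens on which all three annotators chose an identical tag, and the chance-corrected coefficient is computed here with Fleiss' multi-rater generalization of Kappa, since the specific Kappa variant is not spelled out above. The function names and toy tag rows are hypothetical.

```python
from collections import Counter

def absolute_agreement(tag_rows):
    """Share of tokens on which ALL annotators chose the same tag."""
    return sum(len(set(row)) == 1 for row in tag_rows) / len(tag_rows)

def fleiss_kappa(tag_rows):
    """Fleiss' kappa for a fixed number of annotators per token."""
    n = len(tag_rows[0])                                    # annotators per token
    categories = sorted({t for row in tag_rows for t in row})
    # counts[i][j] = how many annotators assigned category j to token i
    counts = [[Counter(row)[c] for c in categories] for row in tag_rows]
    # P(A): mean per-token agreement; P(E): chance agreement from category proportions
    per_token = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_a = sum(per_token) / len(per_token)
    total = n * len(tag_rows)
    p_e = sum((sum(row[j] for row in counts) / total) ** 2 for j in range(len(categories)))
    return (p_a - p_e) / (1 - p_e)

# toy data: one (annotator1, annotator2, annotator3) tag triple per token
rows = [("NN", "NN", "NN"), ("DT", "DT", "DT"), ("JJ", "NN", "JJ"),
        ("VBD", "VBD", "VBD"), ("IN", "IN", "IN")]
print(absolute_agreement(rows))          # 0.8 -> 80% of tokens tagged identically by all three
print(round(fleiss_kappa(rows), 2))      # roughly 0.83 on this toy sample
```

On the pilot sample reported above, the corresponding figures were 90% absolute agreement and a Kappa of 0.93.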
<Paragraph position="3"> The external consistency with the Penn Treebank annotation was computed using a small random sample of 939 words from the Penn Treebank Corpus annotated for POS information.</Paragraph>
<Paragraph position="4"> This measure compares the labels produced by the annotators of the clinical notes with an &quot;authority&quot; label set. The results in Table 2 show that the three annotators are on average 88% consistent with the annotators of the Penn Treebank corpus.</Paragraph>
</Section>
<Section position="3" start_page="63" end_page="64" type="sub_section">
<SectionTitle> 3.3 Descriptive statistics for the corpus of clinical notes </SectionTitle>
<Paragraph position="0"> The annotation process resulted in a corpus of 273 clinical notes annotated with POS tags. The corpus contains 100,650 tokens of 8,702 types distributed across 7,299 sentences. Table 3 displays frequency counts for the most frequent syntactic categories.</Paragraph>
<Paragraph position="1"> Table 3 Frequency counts for the most frequent syntactic categories in the corpus of clinical notes.</Paragraph>
<Paragraph position="2"> The distribution of syntactic categories suggests a predominance of nominal categories, which is consistent with the nature of clinical notes, which report on various patient characteristics such as disorders, signs, and symptoms.</Paragraph>
<Paragraph position="3"> Another important descriptive characteristic of this corpus is its average sentence length of 13.79 tokens, which is relatively short compared to the Treebank corpus, where the average sentence length is 24.16 tokens per sentence. This supports our informal observation that the clinical notes data contain multiple sentence fragments and short diagnostic statements. Shorter sentence length implies a greater number of inter-sentential transitions and is therefore likely to present a challenge for a stochastic process.</Paragraph>
<Paragraph position="4"> 4 Training a POS tagger on medical data
In order to test some of our assumptions regarding how the differences between general English and the language of clinical notes may affect POS tagging, we trained the HMM-based TnT tagger (Brants, 2000) with default parameters at the trigram level both on the Penn Treebank and on the clinical notes data. We should also note that the tagger relies on a sophisticated &quot;unknown&quot; word guessing algorithm, which computes the likelihood of a tag based on the last N letters of the word; this is meant to leverage the word's morphology in a purely statistical manner.</Paragraph>
<Paragraph position="5"> The clinical notes data were split at random 10 times in an 80/20 fashion, where 80% of the sentences were used for training and 20% for testing. This technique is a variation on classic 10-fold cross-validation and appears to be more suitable for smaller amounts of data.</Paragraph>
<Paragraph position="6"> We conducted two experiments. First, we computed the correctness of the Treebank model on each fold of the clinical notes data. We tested the Treebank model on the 10 folds rather than on the whole corpus of clinical notes in order to produce correctness results on exactly the same test data as would be used for validation tests of models built from the clinical notes data. Then, we computed the correctness of each of the 10 models trained on each training fold of the clinical notes data, using the corresponding testing fold of the same data for testing.</Paragraph>
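The evaluation protocol just described (10 random 80/20 splits at the sentence level, with correctness measured as hits over total test tokens) can be summarized in the following hypothetical sketch. A trivial most-frequent-tag baseline stands in for a trainable tagger such as TnT; the helper names and toy corpus are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Toy stand-in for a trainable tagger: most frequent tag per known word."""
    per_word = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
    default_tag = Counter(t for s in tagged_sentences for _, t in s).most_common(1)[0][0]
    return {w: c.most_common(1)[0][0] for w, c in per_word.items()}, default_tag

def correctness(model, tagged_sentences):
    """Hits divided by the total number of tokens in the test set."""
    lexicon, default_tag = model
    hits = total = 0
    for sentence in tagged_sentences:
        for word, gold in sentence:
            total += 1
            hits += lexicon.get(word, default_tag) == gold
    return hits / total

def repeated_80_20(tagged_sentences, runs=10, seed=0):
    """Average correctness over repeated random 80/20 splits at the sentence level."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        sentences = tagged_sentences[:]
        rng.shuffle(sentences)
        cut = int(0.8 * len(sentences))
        model = train_baseline(sentences[:cut])
        scores.append(correctness(model, sentences[cut:]))
    return sum(scores) / len(scores)

# toy corpus of (word, tag) sentences standing in for the clinical notes data
corpus = [[("the", "DT"), ("lung", "NN"), ("is", "VBZ"), ("clear", "JJ")],
          [("no", "DT"), ("acute", "JJ"), ("distress", "NN")]] * 20
print(repeated_80_20(corpus))
```

Testing a fixed, Treebank-trained model on each of the ten 20% folds, as in the first experiment, would amount to calling correctness with that fixed model on each test split instead of retraining.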
<Paragraph position="7"> Table 4 Correctness results for the Treebank model.</Paragraph>
<Paragraph position="8"> Correctness was computed simply as the ratio of correct tag assignments of the POS tagger (hits) to the total number of tokens in the test set. Table 4 summarizes the results of testing the Treebank model, while Table 5 summarizes the testing results for the models trained on the clinical notes.</Paragraph>
<Paragraph position="9"> The average correctness of the Treebank model tested on clinical notes is ~88%, which is considerably lower than the state-of-the-art performance of the TnT tagger - ~96%. Training the tagger on a relatively small amount of clinical notes data brings the performance much closer to the state of the art - ~95%.</Paragraph>
<Paragraph position="10"> Table 5 Correctness results for the clinical notes model.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="64" end_page="64" type="metho">
<SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"> The results of this pilot project are encouraging.</Paragraph>
<Paragraph position="1"> It is clear that with appropriate supervision, people who are familiar with medical content can be reliably trained to carry out some of the tasks traditionally done by trained linguists.</Paragraph>
<Paragraph position="2"> This study also indicates that an automatic POS tagger trained on data that does not include clinical documents may not perform as well as a tagger trained on data from the same domain. A comparison between the Treebank and the clinical notes data shows that the clinical notes corpus contains 3,239 lexical items that are not found in the Treebank. The Treebank corpus contains over 40,000 lexical items that are not found in the corpus of clinical notes, and 5,463 lexical items are found in both corpora. In addition to this 37% out-of-vocabulary rate (3,239 of the 8,702 clinical note types are not found in the Treebank corpus), the picture is further complicated by the differences between the n-gram tag transitions within the two corpora. For example, the likelihood of a DT NN bigram is 1 in the Treebank and 0.75 in the clinical notes corpus.</Paragraph>
<Paragraph position="3"> On the other hand, the likelihood of a JJ NN transition in the clinical notes is 1, but in the Treebank corpus it is 0.73. This illustrates that not only may the &quot;unknown&quot; out-of-vocabulary items be responsible for the decreased accuracy of POS taggers trained on the general English domain and tested on the clinical notes domain, but the actual n-gram statistics may also be a major contributing factor.</Paragraph>
</Section>
</Paper>
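As a postscript, the two quantities the discussion relies on can be illustrated with a small hypothetical sketch: the type-level out-of-vocabulary rate (with the counts quoted above, 3,239 / 8,702, or roughly 37%) and maximum-likelihood tag-bigram transition estimates. The toy corpora and function names below are invented for illustration and are not the actual Treebank or clinical notes data.

```python
from collections import Counter

def transition_probs(tagged_sentences):
    """Maximum-likelihood estimate of P(next_tag given tag) from tag bigrams."""
    bigrams, starts = Counter(), Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        starts.update(tags[:-1])
        bigrams.update(zip(tags, tags[1:]))
    return {(t1, t2): n / starts[t1] for (t1, t2), n in bigrams.items()}

def oov_rate(reference_sentences, target_sentences):
    """Share of target word types that never occur in the reference corpus."""
    ref_vocab = {w for s in reference_sentences for w, _ in s}
    tgt_vocab = {w for s in target_sentences for w, _ in s}
    return len(tgt_vocab - ref_vocab) / len(tgt_vocab)

# invented toy corpora standing in for the Treebank and the clinical notes
treebank = [[("the", "DT"), ("report", "NN"), ("was", "VBD"), ("late", "JJ")]]
clinical = [[("acute", "JJ"), ("distress", "NN")], [("no", "DT"), ("edema", "NN")]]

print(transition_probs(treebank).get(("DT", "NN")))   # 1.0 in this toy sample
print(transition_probs(clinical).get(("JJ", "NN")))   # 1.0 in this toy sample
print(oov_rate(treebank, clinical))                   # 1.0: every clinical word type is unseen here
```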