<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2007">
  <Title>A Preliminary Look into the Use of Named Entity Information for Bioscience Text Tokenization</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Methodology
</SectionTitle>
    <Paragraph position="0"> From a machine learning perspective, one way to look at a tokenization task, including normalization, is as a classification problem. As stated before, the problem of tokenization is that of ambiguous punctuation - one must be able to tell whether or not a piece of punctuation should be included in a token. A document can be tokenized by classifying each piece of punctuation in the document as part of a token or as a token boundary. Removing the pieces of punctuation classified as part of the token will normalize the token.</Paragraph>
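The classify-and-strip view above can be sketched as follows. The hyphen rule here plays the role of a learned classifier and is purely illustrative; it is not BAccHANT's actual model.

```python
# Sketch of tokenization-as-classification: each punctuation character is
# labeled 'break' (token boundary) or 'remove' (normalize it away).
import string

def classify(text, i):
    """Toy stand-in for a trained classifier: a hyphen flanked by
    letters is normalized away; any other punctuation breaks."""
    left, right = text[i-1:i], text[i+1:i+2]   # empty at string edges
    if text[i] == "-" and left.isalpha() and right.isalpha():
        return "remove"
    return "break"

def tokenize(text):
    tokens, current = [], []
    for i, ch in enumerate(text):
        if ch in string.punctuation or ch.isspace():
            if classify(text, i) == "remove":
                continue            # drop the character, keep the token going
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("NF-kappa B activates IL-2 genes."))
```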
    <Paragraph position="1"> Possible features for classifying punctuation include: the piece of punctuation itself; the character or characters to the left or right of the punctuation; the type of those characters (e.g., uppercase, lowercase, number); and properties of the sentence or article the term occurs in, such as its length or type.</Paragraph>
    <Paragraph position="3"> The system presented here, BAccHANT (Bioscience And Health Article Normalizing Tokenizer), was created to normalize MEDLINE text for the TREC Genomics track, as presented earlier. It classifies pieces of punctuation in bioscience text based on the surrounding characters, determining whether the punctuation is a token boundary or needs to be removed for normalization.</Paragraph>
    <Paragraph position="4"> The features chosen for BAccHANT were the following: piece of punctuation being classified (Punc), character to the left of the punctuation (CL), type of character to the left of the punctuation (TL), character to the right of the punctuation (CR), type of character to the right of the punctuation (TR), and whether the punctuation should be removed for normalization, or break the token (Class). These features were chosen by the author. Feature selection using information gain ratio indicated that all five should be used.</Paragraph>
    <Paragraph position="5"> Values for the character-type features (TL, TR) are as follows:
* lower: Character is lowercase
* cap: Character is a capital letter
* num: Character is a number
* space: Character is whitespace (space, tab, etc.)
* other: Character is none of the above
Values for Class are as follows:
* remove: The punctuation should be removed
* break: The punctuation should break the token
The 'remove' class is of chief importance for the normalization task, since classifying a piece of punctuation as 'remove' means the punctuation will be removed for normalization.</Paragraph>
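A minimal sketch of an extractor producing the (Punc, CL, TL, CR, TR) feature vector described above. The function names and the space-padding at string edges are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical feature extractor mirroring the BAccHANT feature set.
def char_type(ch):
    """Map a character to the TL/TR type values used in the paper."""
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "cap"
    if ch.isdigit():
        return "num"
    if ch.isspace():
        return "space"
    return "other"

def features(text, i):
    """Feature vector for the punctuation at position i."""
    cl = text[i-1:i] or " "   # pad with whitespace at string edges
    cr = text[i+1:i+2] or " "
    return {"Punc": text[i], "CL": cl, "TL": char_type(cl),
            "CR": cr, "TR": char_type(cr)}

print(features("IL-2", 2))
```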
    <Paragraph position="6"> The training data consisted of 67 MEDLINE abstracts tokenized by the author using the tokenization scheme presented in the appendix. A domain expert1 was available for determining difficult tokenizations. The 67 abstracts yielded 17253 pieces of punctuation.</Paragraph>
    <Paragraph position="7"> Distributions follow. The feature vectors created from the set were used to create a decision tree, implemented using the Weka tool set (Witten and Frank). The tree used reduced error pruning to increase accuracy.</Paragraph>
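The original system used a Weka decision tree with reduced-error pruning. As an illustrative stand-in, the sketch below uses scikit-learn, with cost-complexity pruning (ccp_alpha) approximating reduced-error pruning; the toy feature vectors are invented, not drawn from the training set.

```python
# Sketch of the training step, with scikit-learn standing in for Weka.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy vectors in the paper's (Punc, CL, TL, CR, TR) -> Class shape.
X = [{"Punc": "-", "CL": "F", "TL": "cap", "CR": "k", "TR": "lower"},
     {"Punc": ".", "CL": "s", "TL": "lower", "CR": " ", "TR": "space"},
     {"Punc": "-", "CL": "a", "TL": "lower", "CR": "b", "TR": "lower"},
     {"Punc": ",", "CL": "s", "TL": "lower", "CR": " ", "TR": "space"}]
y = ["remove", "break", "remove", "break"]

model = make_pipeline(DictVectorizer(sparse=False),
                      DecisionTreeClassifier(ccp_alpha=0.01, random_state=0))
model.fit(X, y)
print(model.score(X, y))   # training accuracy on the toy data
```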
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> The baseline used for evaluation was simply to break on every instance of punctuation; that is, to assume no punctuation needs to be removed. This achieves an accuracy of 92.73%, where accuracy is the percentage of correctly classified punctuation. This baseline was chosen for its high accuracy; however, it is a simple majority-class baseline that always predicts 'break', giving it a precision of 1, a recall of 0, and an f-measure of 0 for the 'remove' class.</Paragraph>
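To make the baseline's metrics concrete: a predictor that always says 'break' produces no true or false positives for 'remove', so recall and f-measure collapse to zero (the precision-of-1 convention follows the paper). The counts below are derived from the reported totals, not from the paper's actual confusion matrix.

```python
# Why a 92.73%-accurate baseline scores zero on the class that matters.
def prf(tp, fp, fn):
    """Precision, recall, f-measure for one class."""
    p = tp / (tp + fp) if tp + fp else 1.0   # precision-of-1 convention
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

n = 17253                    # punctuation instances in the training set
n_break = round(0.9273 * n)  # majority ('break') class
n_remove = n - n_break
# Baseline predicts 'break' everywhere: for 'remove', tp=0, fp=0, fn=n_remove.
print(prf(0, 0, n_remove))
```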
    <Paragraph position="1"> BAccHANT was trained and tested using 10-fold cross-validation. It achieved an accuracy of 96.60%, which was a statistically significant improvement over the baseline (all significance testing was done using a two-tailed t-test with a p-value of 0.05). More detailed results follow.</Paragraph>
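The evaluation protocol can be sketched in pure Python (the real experiments used Weka's cross-validation). The fold assignment and the majority-class learner below are placeholders to make the example runnable, not the paper's setup.

```python
# Sketch of k-fold cross-validation over punctuation instances.
def kfold_accuracy(data, labels, train_and_eval, k=10):
    """Average accuracy over k folds; fold f tests every k-th instance."""
    n = len(data)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        pairs = list(zip(data, labels))
        train = [p for i, p in enumerate(pairs) if i not in test_idx]
        test = [p for i, p in enumerate(pairs) if i in test_idx]
        scores.append(train_and_eval(train, test))
    return sum(scores) / k

def majority_eval(train, test):
    """Placeholder learner: predict the training majority class."""
    labels = [l for _, l in train]
    pred = max(set(labels), key=labels.count)
    return sum(l == pred for _, l in test) / len(test)

data = list(range(20))
labels = ["break"] * 15 + ["remove"] * 5
print(kfold_accuracy(data, labels, majority_eval))
```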
    <Paragraph position="2"> The 'break' classification reached high precision and recall. This is unsurprising, as 96.7% of all &lt;space&gt; punctuation was classified as 'break', and &lt;space&gt; punctuation made up 83.9% of all punctuation. Commas and periods were similarly easy to classify as 'break'. Of more interest is the 'remove' classification, as this class indicates punctuation to be normalized. The recall was not as good as was hoped, with BAccHANT discovering roughly 2 out of every 3 instances present, though it correctly classified roughly 5 out of 6 of the instances it found. We suspected that punctuation was being used differently inside named entities than outside them. To investigate this suspicion, we tested BAccHANT on NE data from the GENIA corpus. The testing set created from GENIA consisted wholly of character data from inside NEs. The set contained 5798 instances of punctuation. Punctuation distribution for the GENIA corpus test set follows.</Paragraph>
    <Paragraph position="3"> Results follow for BAccHANT tested on all text vs. inside NEs.</Paragraph>
    <Paragraph position="4"> Precision, recall, and f-measure are given for the 'remove' class. Further testing revealed that accuracy outside NEs was near 99%. The statistically significant degradation in BAccHANT's performance inside NEs, compared to its performance on text from both inside and outside NEs, indicates that data inside named entities is more difficult to normalize than data outside named entities.</Paragraph>
    <Paragraph position="6"> These results seem to indicate that a normalization system trained solely on data inside NEs could perform better than a system trained on both named and non-named data when normalizing NEs. A new normalization system trained on NE data, BAccHANT-N, was built to test this.</Paragraph>
    <Paragraph position="7"> The new system was trained and tested using the GENIA corpus test set. BAccHANT-N was created similarly to BAccHANT, with identical features, and implemented as a decision tree using reduced error pruning. It was trained and tested using 10-fold cross-validation and achieved an accuracy of 96.5%. More detailed results follow.</Paragraph>
    <Paragraph position="8"> Results follow for BAccHANT-N tested on named entity data.</Paragraph>
    <Paragraph position="9"> Below is a results summary table, giving accuracy for both classes, and precision, recall, and f-measure for the 'remove' class across all systems presented.</Paragraph>
    <Paragraph position="10"> BAccHANT-N showed statistically significant improvement over BAccHANT when normalizing named entity data. This confirms that a system trained solely on data from inside NEs outperforms one trained on data from both inside and outside NEs when normalizing named entities.</Paragraph>
    <Paragraph position="11">  Precision, recall, and f-measure are given for the 'remove' class.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"> Currently, BAccHANT looks only at one character to either side of the piece of punctuation to be classified.</Paragraph>
    <Paragraph position="1"> Expanding the number of characters examined from one to several on each side (a window) should increase accuracy. Since the BAccHANT decision tree learns from context, greater context may allow for better learning, and a window of characters expands that context.</Paragraph>
    <Paragraph position="2"> A window of characters would also introduce new features to learn from. Since a decision tree's features determine how it learns from context, adding better features may help the tree learn better. Examples of new features include:
* Mixed case - does the window include both uppercase and lowercase characters?
* Mixed type - does the window include a mix of letters, numbers, and other character types?
* Boundary size - is there a definite token boundary within the character window, and if so, how far into the window is the boundary?
Error analysis of BAccHANT on named entity tagged data led to the creation of a normalization system trained on data from inside NEs, which performed better than BAccHANT and hence would be a better choice for normalizing inside NEs. However, this normalizer would need to be run on named entity tagged data, as it has not been trained to deal with text outside of NEs. A system that simultaneously tags named entities and normalizes would therefore be desirable. This could be accomplished via hierarchical hidden Markov models (Fine et al., 1998), which involve &amp;quot;tiering&amp;quot; hidden Markov models within each other. Such a model could statistically compute the most likely name for a section of text and then normalize appropriately in one pass. As hidden Markov models have been used both for name finding (Bikel et al., 1997) and tokenization (Cutting et al., 1992), this seems a promising research direction.</Paragraph>
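A hypothetical extractor for the proposed window features. The feature names, the window width, and the whitespace-as-definite-boundary assumption are illustrative choices, not part of the implemented system.

```python
# Sketch of the mixed-case, mixed-type, and boundary features proposed
# above, computed over a window of w characters on each side of the
# punctuation at position i.
def window_features(text, i, w=3):
    left = text[max(0, i - w):i]
    right = text[i + 1:i + 1 + w]
    window = left + right
    return {
        # Both uppercase and lowercase present in the window?
        "mixed_case": any(c.islower() for c in window)
                      and any(c.isupper() for c in window),
        # Both letters and digits present in the window?
        "mixed_type": any(c.isalpha() for c in window)
                      and any(c.isdigit() for c in window),
        # Distance to the nearest definite boundary (whitespace) to the
        # right, or -1 if none falls inside the window.
        "boundary_dist": next((j for j, c in enumerate(right)
                               if c.isspace()), -1),
    }

print(window_features("NF-kappaB p50", 2))
```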
  </Section>
</Paper>