<?xml version="1.0" standalone="yes"?> <Paper uid="E06-2021"> <Title>Automatic Acronym Recognition</Title> <Section position="3" start_page="0" end_page="167" type="intro"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> The task of automatically extracting acronym-definition pairs from biomedical literature has been studied, almost exclusively for English, over the past few decades using technologies from Natural Language Processing (NLP). This section presents a few approaches and techniques that have been applied to the acronym identification task. Taghva and Gilbreth (1999) present the Acronyms Finding Program (AFP), which is based on pattern matching. Their program looks for acronym candidates that appear as upper-case words. It calculates a heuristic score for each competing definition by classifying words into: (1) stop words (&quot;the&quot;, &quot;of&quot;, &quot;and&quot;), (2) hyphenated words, (3) normal words (words that don't fall into any of the above categories) and (4) the acronyms themselves (since an acronym can sometimes be part of the definition). The AFP uses the Longest Common Subsequence (LCS) algorithm (Hunt and Szymanski, 1977) to find all possible alignments of the acronym to the text, and then applies simple scoring rules based on the matches. They report a recall of 86% at a precision of 98%.</Paragraph> <Paragraph position="1"> An alternative approach to the AFP was presented by Yeates (1999). His program, Three Letters Acronyms (TLA), uses more complex methods and general heuristics to match characters of the acronym candidate with letters in the definition string. Yeates reported an f-score of 77.8%.</Paragraph> <Paragraph position="2"> Another approach recognizes that the alignment between an acronym and its definition often follows a set of patterns (Park and Byrd, 2001; Larkey et al., 2000). 
Pattern-based methods use strong constraints to limit the number of acronyms and definitions recognized and thereby ensure reasonable precision.</Paragraph> <Paragraph position="3"> Nadeau and Turney (2005) present a machine learning approach that uses weak constraints to reduce the search space of acronym candidates and definition candidates. They reached a recall of 89% at a precision of 88%.</Paragraph> <Paragraph position="4"> Schwartz and Hearst (2003) present a simple algorithm for extracting abbreviations from biomedical text. The algorithm extracts acronym candidates by assuming that either the acronym or the definition occurs between parentheses, and by imposing restrictions on the definition candidate, such as its length and an initial capital letter. When an acronym candidate is found, the algorithm scans the words to the right and left of the acronym and tries to find the shortest definition that matches the letters in the acronym. Their approach is based on previous work (Pustejovsky et al., 2001). They achieved a recall of 82% at a precision of 96%.</Paragraph> <Paragraph position="5"> It should be emphasized that the common characteristic of the approaches surveyed above is the use of parentheses as an indicator of acronym-definition pairs (see Nadeau and Turney (2005), Table 1). This restriction is a significant drawback: it excludes acronym-definition candidates that do not occur within parentheses, and therefore does not cover all the ways in which acronyms are introduced.</Paragraph> </Section> </Paper>