<?xml version="1.0" standalone="yes"?> <Paper uid="E06-2021"> <Title>Automatic Acronym Recognition</Title> <Section position="4" start_page="167" end_page="168" type="metho"> <SectionTitle> 3 Methods and implementation </SectionTitle> <Paragraph position="0"> The method presented in this section is based on a similar algorithm described by Schwartz and Hearst (2003). However, it has the advantage of recognizing acronym-definition pairs that are not indicated by parentheses.</Paragraph> <Section position="1" start_page="167" end_page="167" type="sub_section"> <SectionTitle> 3.1 Finding Acronym-Definition Candidates </SectionTitle> <Paragraph position="0"> A valid acronym candidate is a string of alphabetic, numeric and special characters such as '-' and '/'. A string is accepted as a candidate if it satisfies conditions (i) and (ii) and either (iii) or (iv): (i) The string contains at least two characters. (ii) The string is not in the list of rejected words. (iii) The string contains at least one capital letter. (iv) The string's first or last character is a lower case letter or a digit.</Paragraph> <Paragraph position="1"> When an acronym candidate is found, the algorithm searches the words surrounding it for a definition candidate string that satisfies all of the following conditions: (i) At least one letter of the words in the string matches a letter in the acronym. (ii) The string doesn't contain a colon, semicolon, question mark or exclamation mark. (iii) The maximum length of the string is min(|A|+5, |A|*2), where |A| is the acronym length (Park and Byrd, 2001).</Paragraph> <Paragraph position="2"> (iv) The string doesn't consist solely of upper case letters.
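The candidate conditions of Section 3.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names REJECTED_WORDS, ALLOWED_CHARS, is_acronym_candidate, max_definition_length and is_definition_candidate are hypothetical, the rejected-word list is a small stand-in because the original list is not given here, and three readings are assumptions: the set of special characters is limited to '-' and '/', letters are compared case-insensitively, and the length limit of condition (iii) is counted in words.

```python
import re

# Hypothetical stand-in for the list of rejected words (not given in the text).
REJECTED_WORDS = {"the", "and", "of", "in"}

# Assumption: only '-' and '/' are allowed besides letters and digits.
ALLOWED_CHARS = re.compile(r"[A-Za-z0-9/-]+")


def is_acronym_candidate(token, rejected=REJECTED_WORDS):
    """Acronym conditions (i) and (ii), and either (iii) or (iv)."""
    if not ALLOWED_CHARS.fullmatch(token):
        return False
    long_enough = len(token) >= 2                    # (i)
    not_rejected = token.lower() not in rejected     # (ii), case-insensitive by assumption
    has_capital = any(ch.isupper() for ch in token)  # (iii)
    # (iv) first or last character is a lower case letter or a digit
    edge_ok = any(ch.islower() or ch.isdigit() for ch in (token[0], token[-1]))
    return long_enough and not_rejected and (has_capital or edge_ok)


def max_definition_length(acronym):
    """Length limit min(|A|+5, |A|*2), with |A| the acronym length."""
    return min(len(acronym) + 5, len(acronym) * 2)


def is_definition_candidate(definition, acronym):
    """Definition conditions (i) to (iv); all four must hold."""
    letters = [ch for ch in definition if ch.isalpha()]
    acronym_letters = {ch.lower() for ch in acronym if ch.isalpha()}
    shares_letter = any(ch.lower() in acronym_letters for ch in letters)  # (i)
    no_break = not any(mark in definition for mark in ":;?!")             # (ii)
    # (iii) assumption: the limit is measured in words
    short_enough = max_definition_length(acronym) >= len(definition.split())
    not_all_upper = not all(ch.isupper() for ch in letters)               # (iv)
    return shares_letter and no_break and short_enough and not_all_upper
```

Under these assumptions, is_acronym_candidate("vCJD") and is_definition_candidate("variant CJD", "vCJD") both hold, while a definition written entirely in upper case is rejected by condition (iv).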
</Paragraph> </Section> <Section position="2" start_page="167" end_page="168" type="sub_section"> <SectionTitle> 3.2 Matching Acronyms with Definitions </SectionTitle> <Paragraph position="0"> The process of extracting acronym-definition pairs from raw text, according to the constraints described in Section 3.1, is divided into two steps: 1. Parentheses matching. In practice, most acronym-definition pairs involve parentheses (Schwartz and Hearst, 2003) and follow one of two patterns: (i) definition (acronym) (ii) acronym (definition). The algorithm extracts the acronym-definition candidates that correspond to one of these two patterns.</Paragraph> <Paragraph position="1"> 2. Non-parentheses matching. The algorithm searches for acronym candidates that satisfy the constraints described in Section 3.1 and are not enclosed in parentheses. Once an acronym candidate is found, it scans the preceding and following context in which the acronym was found for a definition candidate. The search space for the definition candidate string is limited to four words multiplied by the number of letters in the acronym candidate.</Paragraph> <Paragraph position="2"> The next step is to choose the correct substring of the definition candidate for the acronym candidate.
This is done by reducing the definition candidate string as follows: the algorithm searches for identical characters between the acronym and the definition, starting from the end of both strings, and succeeds in finding a correct substring for the acronym candidate if the substring satisfies the following conditions: (i) at least one character in the acronym string matches a character in the substring of the definition; (ii) the first character in the acronym string matches the first character of the leftmost word in the definition substring, ignoring case.</Paragraph> </Section> <Section position="3" start_page="168" end_page="168" type="sub_section"> <SectionTitle> 3.3 Machine Learning Approach </SectionTitle> <Paragraph position="0"> To test and compare different supervised learning algorithms, the Tilburg Memory-Based Learner (TiMBL) was used. In memory-based learning, the training set is stored as examples for later evaluation. Feature vectors were calculated to describe the acronym-definition pairs. The following ten (numeric) features were chosen: (1) the acronym or the definition is between parentheses (0-false, 1-true), (2) the definition appears before the acronym (0-false, 1-true), (3) the distance in words between the acronym and the definition, (4) the number of characters in the acronym, (5) the number of characters in the definition, (6) the number of lower case letters in the acronym, (7) the number of lower case letters in the definition, (8) the number of upper case letters in the acronym, (9) the number of upper case letters in the definition and (10) the number of words in the definition. The 11th element of the vector is the class to predict: true candidate (+) or false candidate (-). An example of the acronym-definition pair &lt;&quot;vCJD&quot;,&quot;variant CJD&quot;&gt; represented as a feature vector is: 0,1,1,4,11,1,7,3,3,2,+.</Paragraph> </Section> </Section> </Paper>