File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/01/w01-0516_abstr.xml

Size: 4,353 bytes

Last Modified: 2025-10-06 13:42:04

<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0516">
  <Title>Hybrid text mining for finding abbreviations and their definitions</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present a hybrid text mining method for finding abbreviations and their definitions in free-format text. The method employs pattern-based abbreviation rules in addition to text markers and cue words. The pattern-based rules describe how abbreviations are formed from definitions. Rules can be generated automatically and/or manually and can be augmented when the system processes new documents. The proposed method has the advantages of high accuracy, high flexibility, wide coverage, and fast recognition.</Paragraph>
    <Paragraph position="1"> Introduction Many organizations have a large number of on-line documents -- such as manuals, technical reports, transcriptions of customer service calls or telephone conferences, and electronic mail -- which contain information of great potential value. To utilize the knowledge these data contain, we need to be able to create common glossaries of domain-specific names and terms. While working on automatic glossary extraction, we noticed that technical documents contain many abbreviated terms, which carry important knowledge about the domains. We concluded that the correct recognition of abbreviations and their definitions is very important for understanding the documents and for extracting information from them [1, 6, 9, 11].</Paragraph>
    <Paragraph position="2"> An abbreviation is usually formed by a simple method: taking zero or more letters from each word of its definition. However, the tendency to make unique, interesting abbreviations is growing. As a result, it is easy to find new kinds of abbreviations that hard-coded, heuristics-based algorithms [1, 6, 7, 13, 14] cannot process, because they are formed in ways not anticipated when the algorithms were devised.</Paragraph>
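The simple formation method described above can be sketched as a subsequence test. This is only an illustration, not the algorithm of the paper: it assumes letters are taken in left-to-right order and compared case-insensitively, and the function name `is_letter_subsequence` is hypothetical.

```python
def is_letter_subsequence(abbreviation: str, definition: str) -> bool:
    """Check whether the abbreviation can be formed by taking zero or more
    letters, in order, from each word of the definition.

    Assumption: because every word may contribute zero or more letters,
    this reduces to testing whether the abbreviation is a subsequence of
    the definition's alphanumeric characters.
    """
    # A single shared iterator: each successful lookup consumes the
    # definition up to the matched character, which enforces the ordering.
    letters = iter(ch for ch in definition.lower() if ch.isalnum())
    return all(ch in letters for ch in abbreviation.lower() if ch.isalnum())


if __name__ == "__main__":
    print(is_letter_subsequence("XML", "Extensible Markup Language"))
    print(is_letter_subsequence("NLP", "Processing Language Natural"))
```

An abbreviation that reorders letters, or uses a character absent from its definition, fails this test, which is the kind of case that a fixed heuristic cannot anticipate.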
    <Paragraph position="3"> We propose a hybrid text mining approach to deal with these problems. We use three kinds of knowledge: pattern-based abbreviation rules, text markers, and linguistic cue words. An abbreviation rule consists of an abbreviation pattern, a definition pattern and a formation rule. The formation rule describes how an abbreviation is formed from a definition. There may exist multiple formation rules for a given pair of abbreviation and definition patterns. Abbreviation rules are described in Section 3. Text markers are special symbols frequently used to imply the abbreviation-definition relationship in texts. They include characters such as '(...)', '[...]', and '='. Cue words are particular words occurring in the local contexts of abbreviations and their definitions, which strongly imply the abbreviation-definition relationship. They include words such as &quot;or&quot;, &quot;short&quot;, &quot;acronym&quot; and &quot;stand&quot;. Text markers and cue words are discussed in Section 2.4.</Paragraph>
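As a rough illustration of how text markers and cue words might be detected, the sketch below scans for candidates enclosed in '(...)' or '[...]' and checks a context string for the cue words listed above. The regular expression, the eight-word context window, and the function names are all assumptions made for this example; the '=' marker is not handled, and the bracket types are not required to match.

```python
import re
import string

# Assumed candidate shape: an uppercase letter followed by 1 to 9
# alphanumeric characters, enclosed in round or square brackets.
TEXT_MARKER = re.compile(r"[(\[]\s*([A-Z][A-Za-z0-9]{1,9})\s*[)\]]")

# Cue words named in the text above.
CUE_WORDS = {"or", "short", "acronym", "stand"}


def bracketed_candidates(text: str) -> list[tuple[str, str]]:
    """Return (candidate, preceding context) pairs for bracketed tokens."""
    pairs = []
    for match in TEXT_MARKER.finditer(text):
        # Hypothetical search window: up to eight words before the marker.
        context_words = text[:match.start()].split()[-8:]
        pairs.append((match.group(1), " ".join(context_words)))
    return pairs


def has_cue_word(context: str) -> bool:
    """Report whether any cue word occurs as a whole word in the context."""
    words = {w.strip(string.punctuation).lower() for w in context.split()}
    return not words.isdisjoint(CUE_WORDS)


if __name__ == "__main__":
    print(bracketed_candidates("We parse Extensible Markup Language (XML) files."))
    print(has_cue_word("XML is short for Extensible Markup Language"))
```

The exact-word comparison means inflected forms such as "stands" are not matched here; a fuller treatment would need the rules given in Section 2.4.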
    <Paragraph position="4"> This system has five components -- abbreviation recognizer, definition finder, rule applier, abbreviation matcher and best match selector -- as shown in Figure 1. The abbreviation recognizer seeks candidate abbreviations in a text and generates their patterns (Section 1). When an abbreviation candidate is found, the system determines the contexts within which to look for a definition. When it finds a candidate definition, it generates a pattern for it also (Section 2).</Paragraph>
    <Paragraph position="5"> Having generated the abbreviation pattern and the definition pattern, the system first searches the rulebase for a rule which would generate the abbreviation from the definition. The rules for the given candidates are applied in the order of rule priorities (Section 4.1). If the rulebase is empty or if no existing rule matches the candidate abbreviation with the candidate definition, the system runs the abbreviation matcher and generates a new abbreviation rule. The abbreviation matcher consists of five layered matching algorithms (Section 4.2). If the matcher succeeds, new rules may be added to the rulebase, allowing it to grow as the system processes new documents.</Paragraph>
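The control flow described above (try existing rules by priority, fall back to the matcher, then grow the rulebase) could be sketched as follows. Everything here is hypothetical scaffolding: the `Rule` fields use opaque strings because the pattern notation is defined in Section 3, higher `priority` values are assumed to be tried first, and `applies` and `matcher` stand in for the rule applier and the layered abbreviation matcher.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    abbr_pattern: str   # placeholder for the abbreviation pattern
    defn_pattern: str   # placeholder for the definition pattern
    formation: str      # placeholder for the formation rule
    priority: int = 0   # assumption: larger values are tried first


def resolve(
    abbr_pattern: str,
    defn_pattern: str,
    rulebase: list[Rule],
    applies: Callable[[Rule], bool],
    matcher: Callable[[str, str], Optional[Rule]],
) -> Optional[Rule]:
    """Return the rule that links the candidates, or None if none is found."""
    # Step 1: collect existing rules for this pattern pair.
    candidates = [
        rule for rule in rulebase
        if rule.abbr_pattern == abbr_pattern and rule.defn_pattern == defn_pattern
    ]
    # Step 2: apply them in priority order and stop at the first success.
    for rule in sorted(candidates, key=lambda r: r.priority, reverse=True):
        if applies(rule):
            return rule
    # Step 3: no rule worked (or the rulebase was empty), so run the matcher.
    new_rule = matcher(abbr_pattern, defn_pattern)
    if new_rule is not None:
        rulebase.append(new_rule)  # the rulebase grows with new documents
    return new_rule
```

Passing the applier and matcher as callables keeps the sketch independent of how those components are actually implemented.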
  </Section>
</Paper>