<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0111">
  <Title>Automatic Suggestion of Significant Terms for a Predefined Topic</Title>
  <Section position="2" start_page="0" end_page="131" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> As the amount of on-line text grows, the use of text analysis techniques to access information from electronic sources has become more popular and, at the same time, more difficult. Currently, the effectiveness of such techniques is evaluated not only on how easily they can be applied to text sources to extract information and represent it in a systematic format (Walker 1983), but also on whether they can be applied to large text corpora of several tens of thousands of words.</Paragraph>
    <Paragraph position="1"> One application of text analysis is to identify and extract significant terminology from running text. Choueka (1988), for example, describes an experiment for locating interesting collocational expressions in large textual databases. Collocational expressions, as Choueka defines them, are &quot;sequences of words whose unambiguous meaning cannot be derived from that of their components&quot;. Other representative collocation research can be found in Church and Hanks (1990) and Smadja (1993). Though all three approaches are statistically based, their definitions of collocation differ. Unlike Choueka (1988), Church and Hanks (1990) identify as collocations both interrupted and uninterrupted sequences of words. Unlike Church and Hanks (1990), Smadja (1993) goes beyond the &quot;two-word&quot; limitation and deals with &quot;collocations of arbitrary length&quot;.</Paragraph>
    <Paragraph position="2"> The primary goal of collocation research is to build a comprehensive lexicographic toolkit, or to assist automatic language generation applications. Therefore, the focus is on the extraction of all interesting word patterns without distinction of domain specificity. Identifying domain-specific terminology is a separate research effort. Gierl and Frost (1992) describe their approach to extracting terminological knowledge from medical texts. Following Church and Hanks (1990), they use mutual information to select significant two-word patterns, but they also incorporate a lexical inductive process which, as they claim, can improve the collection of domain-specific terms. Justeson and Katz (1993) introduce an algorithm by which technical terms in running text can be identified. Prior to the development of their algorithm, they performed a thorough study of the linguistic properties of technical terminology. They report that, structurally, technical terms make heavy use of noun compounds. In technical terminology, word constituents are limited to adjectives, nouns and occasionally prepositions. Verbs, adverbs, and conjunctions are extremely rare. At the discourse level, technical terms tend to be repeated. With these observations in mind, they developed an algorithm which has proved to be effective and domain independent.</Paragraph>
    <Paragraph position="3"> This paper presents a preliminary experiment in automatically suggesting significant terms for a predefined topic. The general method is to compare a topic-focused sample, built around the predefined topic, with a larger and more general base sample. A set of statistical measures is used to identify significant word units in both samples. Identification of single-word terms is based on the notion of word intervals. Two-word terms are identified through the computation of mutual information, and an extension of mutual information assists in capturing multi-word terms. Once significant terms of all three types are identified, a comparison algorithm is applied to differentiate terms across the two samples. If significant changes in the values of certain statistical variables are detected, the associated terms are selected from the focused sample as being topic-oriented and included in a suggested list.</Paragraph>
    <Paragraph position="4"> To check the quality of the suggested terms, we compare them against terms manually determined by a domain expert. Though the number of matches varies, we find that our automatic suggestion process provides more terms useful for describing the predefined topic than the manual process does.</Paragraph>
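    <Paragraph position="5"> As an illustration only, the two-word step outlined in this introduction might be sketched as follows. The score is the mutual information measure of Church and Hanks (1990), I(x, y) = log2( P(x, y) / (P(x) P(y)) ), estimated here from raw counts of adjacent word pairs. The function names pmi_scores and suggest_pairs, the parameters min_count and min_gain, and the selection rule itself are assumptions made for this sketch; this section does not specify the estimation details or the comparison criteria actually used.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Mutual information for adjacent word pairs in one sample.

    I(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw counts. Pairs seen fewer than min_count times
    are skipped (an assumed cutoff, since rare pairs give unstable scores).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count >= min_count:
            p_xy = count / n_bi
            p_x = unigrams[x] / n_uni
            p_y = unigrams[y] / n_uni
            scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def suggest_pairs(focused_tokens, base_tokens, min_gain=1.0, min_count=2):
    """Hypothetical comparison step (not the paper's algorithm).

    Keeps a pair from the focused sample when it is absent from the
    base sample's scored pairs, or when its score in the focused sample
    exceeds its score in the base sample by at least min_gain bits.
    """
    focused = pmi_scores(focused_tokens, min_count)
    base = pmi_scores(base_tokens, min_count)
    suggested = []
    for pair, score in focused.items():
        if pair not in base or score - base[pair] >= min_gain:
            suggested.append((pair, score))
    # Highest-scoring pairs first.
    return sorted(suggested, key=lambda item: item[1], reverse=True)
```

Calling suggest_pairs with a tokenized focused sample and a tokenized base sample returns a list of (pair, score) tuples. Comparing a sample with itself returns an empty list, since no score changes.</Paragraph>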
  </Section>
</Paper>