<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-2001">
  <Title>Determining the Specificity of Terms using Compositional and Contextual Information</Title>
  <Section position="4" start_page="2" end_page="12" type="metho">
    <SectionTitle>
2. Information for Term Specificity
</SectionTitle>
    <Paragraph position="0"> In this section, we describe compositional information and contextual information.</Paragraph>
    <Section position="1" start_page="2" end_page="12" type="sub_section">
      <SectionTitle>
2.1. Compositional Information
</SectionTitle>
      <Paragraph position="0"> By compositionality, the meaning of whole term can be strictly predicted from the meaning of the individual words (Manning, 1999). Many terms are created by appending modifiers to existing terms. In this mechanism, features of modifiers are added to features of existing terms to make new concepts. Word frequency and tf.idf value are used to quantify features of unit words. Internal modifier-head structure of terms is used to measure specificity incrementally.</Paragraph>
      <Paragraph position="1"> We assume that terms composed of low frequency words have large quantity of domain information. Because low frequency words appear only in limited number of terms, the words can clearly discriminate the terms to other terms.</Paragraph>
      <Paragraph position="2"> tf.idf, multiplied value of term frequency (tf) and inverse document frequency (idf), is widely used term weighting scheme in information retrieval (Manning, 1999). Words with high term frequency and low document frequency get large tf.idf value. Because a document usually discusses one topic, and words of large tf.idf values are good index terms for the document, the words are considered to have topic specific information.</Paragraph>
      <Paragraph position="3"> Therefore, if a term includes words of large tf.idf value, the term is assumed to have topic or domain specific information.</Paragraph>
      <Paragraph position="4"> If the modifier-head structure of a term is known, the specificity of the term is calculated incrementally starting from head noun. In this manner, specificity value of a term is always larger than that of the base (head) term. This result answers to the assumption that more specific term has larger specificity value. However, it is very difficult to analyze modifier-head structure of compound noun. We use simple nesting relations between terms to analyze structure of terms. A term X is nested to term Y, when X is substring of Y (Frantzi, 2000) as follows: Definition 1 If two terms X and Y are terms in same category and X is nested in Y as W  are modifiers of X.</Paragraph>
      <Paragraph position="5"> For example two terms, &amp;quot;diabetes mellitus&amp;quot; and &amp;quot;insulin dependent diabetes mellitus&amp;quot;, are all disease names, and the former is nested in the latter. In this case, &amp;quot;diabetes mellitus&amp;quot; is base term and &amp;quot;insulin dependent&amp;quot; is modifier of &amp;quot;insulin dependent diabetes mellitus&amp;quot; by definition 1. If multiple terms are nested in a term, the longest term is selected as head term. Specificity of Y is measured as equation (2).</Paragraph>
      <Paragraph position="6">  respectively.</Paragraph>
      <Paragraph position="7"> a and b , real numbers between 0 and 1, are weighting schemes for specificity of modifiers. They are obtained experimentally.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
2.2. Contextual Information
</SectionTitle>
      <Paragraph position="0"> There are some problems that are hard to address using compositional information alone. Firstly, although features of &amp;quot;wolfram syndrome&amp;quot; share many common features with features of &amp;quot;insulindependent diabetes mellitus&amp;quot; in semantic level, they don't share any common words in lexical level. In this case, it is unreasonable to compare two specificity values measured based on compositional information alone. Secondly, when several words are combined to a term, there are additional semantic components that are not predicted by unit words. For example, &amp;quot;wolfram syndrome&amp;quot; is a kind of &amp;quot;diabetes mellitus&amp;quot;. We can not predict &amp;quot;diabetes mellitus&amp;quot; from two separate words &amp;quot;wolfram&amp;quot; and &amp;quot;syndrome&amp;quot;. Finally, modifier-head structure of some terms is ambiguous. For instance, &amp;quot;vampire slayer&amp;quot; might be a slayer who is vampire or a slayer of vampires. Therefore contextual is used to complement these problems.</Paragraph>
      <Paragraph position="1"> Contextual information is distribution of surrounding words of target terms. For example, the distribution of co-occurrence words of the terms, the distribution of predicates which have the terms as arguments, and the distribution of modifiers of the terms are contextual information.</Paragraph>
      <Paragraph position="2"> General terms usually tend to be modified by other words. Contrary, domain specific terms don't tend to be modified by other words, because they have sufficient information in themselves (Caraballo, 1999B). Under this assumption, we use probabilistic distribution of modifiers as contextual information. Because domain specific terms, unlike general words, are rarely modified in corpus, it is important to collect statistically sufficient modifiers from given corpus. Therefore accurate text processing, such as syntactic parsing, is needed to extract modifiers. As Caraballo's work was for general words, they extracted only rightmost prenominals as context information. We use Conexor functional dependency parser (Conexor, 2004) to analyze the structure of sentences. Among many dependency functions defined in Conexor parser, &amp;quot;attr&amp;quot; and &amp;quot;mod&amp;quot; functions are used to extract modifiers from analyzed structures. If a term or modifiers of the term do not occur in corpus, specificity of the term can not be measured using contextual information</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="12" end_page="12" type="metho">
    <SectionTitle>
3. Specificity Measuring Methods
</SectionTitle>
    <Paragraph position="0"> In this section, we describe information theory like methods using compositional and contextual information. Here, we call information theory like methods, because some probability values used in these methods are not real probability, rather they are relative weight of terms or words.</Paragraph>
    <Paragraph position="1"> Because information theory is well known formalism describing information, we adopt the mechanism to measure information quantity of terms.</Paragraph>
    <Paragraph position="2"> In information theory, when a message with low probability occurs on channel output, the amount of surprise is large, and the length of bits to represent this message becomes long. Therefore the large quantity of information is gained by this message (Haykin, 1994). If we consider the terms in a corpus as messages of a channel output, the information quantity of the terms can be measured using various statistics acquired from the corpus. A set of terms is defined as equation (3) for further explanation.</Paragraph>
    <Paragraph position="4"> is a term and n is total number of terms.</Paragraph>
    <Paragraph position="5"> In next step, a discrete random variable X is defined as equation (4).</Paragraph>
    <Paragraph position="7"> ) is the probability of event x</Paragraph>
    <Paragraph position="9"> In this section, we describe a method using compositional information introduced in section 2.1.</Paragraph>
    <Paragraph position="10"> This method is divided into two steps: In the first step, specificity values of all words are measured independently. In the second step, the specificity values of words are summed up. For detail description, we assume that a term t</Paragraph>
    <Paragraph position="12"> . In next step, a discrete random variable Y is defined as equation (7).</Paragraph>
    <Paragraph position="14"> ), in equation (5) is redefined as equation (8) based on previous assumption.</Paragraph>
    <Paragraph position="16"> ) is average information quantity of all words in t k . Two information sources, word frequency, tf.idf are used to estimate p(y</Paragraph>
    <Paragraph position="18"> ). In this mechanism, p(y i ) for informative words should be smaller than that of non informative words. When word frequency is used to quantify features of words, p(y i ) in equation (8) is estimated as equation (9).</Paragraph>
    <Paragraph position="20"> ), and j is index of all words in corpus. In this equation, as low frequency words are infor-</Paragraph>
    <Paragraph position="22"> ) for the words becomes small.</Paragraph>
    <Paragraph position="23"> When tf.idf is used to quantify features of words, p(y</Paragraph>
    <Paragraph position="25"> where tf*idf(w) is tf.idf value of word w. In this equation, as words of large tf.idf values are informative, p(y</Paragraph>
    <Paragraph position="27"> ) of the words becomes small.</Paragraph>
    <Paragraph position="28"> 3.2. Contextual Information based Method</Paragraph>
    <Paragraph position="30"> In this section, we describe a method using contextual information introduced in section 2.2.</Paragraph>
    <Paragraph position="31"> Entropy of probabilistic distribution of modifiers for a term is defined as equation (11).</Paragraph>
    <Paragraph position="33"> ) is the probability of mod</Paragraph>
    <Paragraph position="35"> in corpus, j is index of all modifiers of t k in corpus. The entropy calculated by equation (11) is the average information quantity of all (mod</Paragraph>
    <Paragraph position="37"> ) pairs. Specific terms have low entropy, because their modifier distributions are simple. Therefore inversed entropy is assigned to I(x k ) in equation (5) to make specific terms get large quantity of information as equation (13).  where the first term of approximation is the maximum value among modifier entropies of all terms.</Paragraph>
    <Section position="1" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
3.3. Hybrid Method (Method 3)
</SectionTitle>
      <Paragraph position="0"> In this section, we describe a hybrid method to overcome shortcomings of previous two methods.</Paragraph>
      <Paragraph position="1"> This method measures term specificity as equation (14).</Paragraph>
      <Paragraph position="3"> values between 0 and 1, which are measured by compositional and contextual information based methods respectively.</Paragraph>
      <Paragraph position="4">  (0 1)g g[?] [?] is weight of two values. If 0.5g = , the equation is harmonic mean of two values. Therefore I(x k ) becomes large when two values are equally large. 4. Experiment and Evaluation  In this section, we describe the experiments and evaluate proposed methods. For convenience, we simply call compositional information based method, contextual information based method, hybrid method as method 1, method 2, method 3 respectively.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
4.1. Evaluation
</SectionTitle>
      <Paragraph position="0"> A sub-tree of MeSH thesaurus is selected for experiment. &amp;quot;metabolic diseases(C18.452)&amp;quot; node is root of the subtree, and the subtree consists of 436 disease names which are target terms of specificity measuring. A set of journal abstracts was extracted from MEDLINE  database using the disease names as quires. Therefore, all the abstracts are related to some of the disease names. The set consists of about 170,000 abstracts (20,000,000 words). The abstracts are analyzed using Conexor parser, and various statistics are extracted: 1) frequency, tf.idf of the disease names, 2) distribution of modifiers of the disease names, 3) frequency, tf.idf of unit words of the disease names.</Paragraph>
      <Paragraph position="1"> The system was evaluated by two criteria, coverage and precision. Coverage is the fraction  MEDLINE is a database of biomedical articles serviced by National Library of Medicine, USA. (http://www.nlm.nih.gov) of the terms which have specificity values by given measuring method as equation (15).</Paragraph>
      <Paragraph position="2">  Method 2 gets relatively lower coverage than method 1, because method 2 can measure specificity when both the terms and their modifiers appear in corpus. Contrary, method 1 can measure specificity of the terms, when parts of unit words appear in corpus. Precision is the fraction of relations with correct specificity values as equation (16).</Paragraph>
      <Paragraph position="4"> of R p c with correct specificity p of all R p c = (16) where R(p,c) is a parent-child relation in MeSH thesaurus, and this relation is valid only when specificity of two terms are measured by given method. If child term c has larger specificity value than that of parent term p, then the relation is said to have correct specificity values. We divided parent-child relations into two types. Relations where parent term is nested in child term are categorized as type I. Other relations are categorized as type II. There are 43 relations in type I and 393 relations in type II. The relations in type I always have correct specificity values provided structural information method described section 2.1 is applied.</Paragraph>
      <Paragraph position="5"> We tested prior experiment for 10 human subjects to find out the upper bound of precision. The subjects are all medical doctors of internal medicine, which is closely related division to &amp;quot;metabolic diseases&amp;quot;. They were asked to identify parent-child relation of given two terms. The average precisions of type I and type II were 96.6% and 86.4% respectively. We set these values as upper bound of precision for suggested methods.</Paragraph>
      <Paragraph position="6"> Specificity values of terms were measured with method 1, method 2, and method 3 as Table 2. In method 1, word frequency based method, word tf.idf based method, and structure information added methods were separately experimented. Two additional methods, based on term frequency and term tf.idf, were experimented to compare compositionality based method and whole term based method. Two methods which showed the best performance in method 1 and method 2 were combined into method 3.</Paragraph>
      <Paragraph position="7"> Word frequency and tf.idf based method showed better performance than term based methods. This result indicates that the information of terms is divided into unit words rather than into whole terms. This result also illustrate basic assumption of this paper that specific concepts are created by adding information to existing concepts, and new concepts are expressed as new terms by adding modifiers to existing terms. Word tf.idf based method showed better precision than word frequency based method. This result illustrate that tf.idf of words is more informative than frequency of words.</Paragraph>
      <Paragraph position="8"> Method 2 showed the best performance, precision 70.0% and coverage 70.2%, when we counted modifiers which modify the target terms two or more times. However, method 2 showed worse performance than word tf.idf and structure based method. It is assumed that sufficient contextual information for terms was not collected from corpus, because domain specific terms are rarely modified by other words.</Paragraph>
      <Paragraph position="9"> Method 3, hybrid method of method 1 (tf.idf of words, structure information) and method 2, showed the best precision of 82.0% of all, because the two methods interacted complementary.  The coverage of this method was 70.2% which equals to the coverage of method 2, because the specificity value is measured only when the specificity of method 2 is valid. In hybrid method, the weight value 0.8g = indicates that compositional information is more informatives than contextual information when measuring the specificity of domain-specific terms. The precision of 82.0% is good performance compared to upper bound of 87.4%.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="12" end_page="12" type="metho">
    <SectionTitle>
4.2. Error Analysis
</SectionTitle>
    <Paragraph position="0"> One reason of the errors is that the names of some internal nodes in MeSH thesaurus are category names rather disease names. For example, as &amp;quot;acid-base imbalance (C18.452.076)&amp;quot; is name of disease category, it doesn't occur as frequently as other real disease names.</Paragraph>
    <Paragraph position="1"> Other predictable reason is that we didn't consider various surface forms of same term. For example, although &amp;quot;NIDDM&amp;quot; is acronym of &amp;quot;non insulin dependent diabetes mellitus&amp;quot;, the system counted two terms independently. Therefore the extracted statistics can't properly reflect semantic level information.</Paragraph>
    <Paragraph position="2"> If we analyze morphological structure of terms, some errors can be reduced by internal structure method described in section 2.1. For example, &amp;quot;nephrocalcinosis&amp;quot; have modifier-head structure in morpheme level; &amp;quot;nephro&amp;quot; is modifier and &amp;quot;calcinosis&amp;quot; is head. Because word formation rules are heavily dependent on the domain specific morphemes, additional information is needed to apply this approach to other domains.</Paragraph>
  </Section>
class="xml-element"></Paper>