Determining the Specificity of Terms based on Information Theoretic Measures

2 Specificity Measuring Methods

In this section, we describe information-theory-like methods for measuring the specificity of terms. We call them information-theory-like because some of the probability values used in these methods are not true probabilities but rather relative weights of terms or words.

In information theory, when a low-probability message occurs on a channel output, the quantity of surprise is large and the number of bits needed to represent the message is long; thus a large quantity of information is gained from the message (Haykin, 1994). If we regard the terms in a corpus as messages on a channel output, the information quantity of the terms can be measured using information theory. For further explanation, a set of target terms is defined as in equation (2),

    T = \{t_1, t_2, \dots, t_n\},    (2)

where t_k is a term. Next, a discrete random variable X is defined as in equation (3),

    X = \{x_1, x_2, \dots, x_n\},    (3)

where x_k corresponds to term t_k and p(x_k) is the probability of x_k. The information quantity I(x_k), gained after observing x_k, is used as the specificity of t_k, as in equation (4):

    I(x_k) = -\log p(x_k).    (4)

By equation (4), we can measure the specificity of t_k by estimating p(x_k). We describe three estimation methods for p(x_k) in the following sections.

2.1 Compositional Information based Method (Method 1)

By compositionality, the meaning of a term can be strictly predicted from the meaning of its individual words (Manning, 1999). This method is divided into two steps: in the first step, the specificity of each word is measured independently; in the second step, the specificities of the component words are summed up. For a detailed description, we assume that t_k consists of one or more words, as in equation (5),

    t_k = w_1 w_2 \cdots w_m,    (5)

where w_i is a word. Next, a discrete random variable Y is defined as in equation (6),

    Y = \{y_1, y_2, \dots, y_m\},    (6)

where y_i corresponds to word w_i and p(y_i) is the probability of y_i. I(x_k) in equation (4) is then redefined as equation (7) based on this assumption:

    I(x_k) = \frac{1}{m} \sum_{i=1}^{m} -\log p(y_i).    (7)

I(x_k) is the average information quantity of all words in t_k, so p(y_i) of informative words should be smaller than that of non-informative words. Two information sources, word frequency and tf.idf, are used to estimate p(y_i) independently.

We assume that if a term is composed of low-frequency words, the term carries a large quantity of domain information: because low-frequency words appear in a limited number of terms, they have high discriminating ability. On this assumption, p(y_i) is estimated as the relative frequency of w_i in the corpus. In this estimation, p(y_i) for low-frequency words becomes small.

tf.idf is a widely used term-weighting scheme in information retrieval (Manning, 1999). We assume that if a term is composed of high-tf.idf words, the term carries domain-specific information. On this assumption, p(y_i) in equation (7) is estimated as equation (8),

    p(y_i) \propto \frac{1}{tf{\cdot}idf(w_i)},    (8)

so that p(y_i) of high-tf.idf words becomes small.
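To make the compositional estimate concrete, here is a minimal Python sketch of equations (5)-(8). It assumes whitespace-tokenized terms, a corpus given as a flat token list, and a precomputed tf.idf score per word; the function names, the skipping of unseen words, and the normalization of the inverse tf.idf weights are illustrative assumptions rather than details fixed by the text.

```python
import math
from collections import Counter

def word_freq_probs(corpus_tokens):
    """Relative frequency of each word in the corpus, used as p(y_i)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def tfidf_probs(tfidf_scores):
    """Equation (8) read as p(y_i) proportional to 1 / tf.idf(w_i), normalized."""
    inv = {w: 1.0 / s for w, s in tfidf_scores.items() if s > 0}
    z = sum(inv.values())
    return {w: v / z for w, v in inv.items()}

def compositional_specificity(term, p_word):
    """Equation (7): average information quantity -log p(y_i) over the term's words.

    Words without an estimate are skipped; how to smooth them is an assumption
    not specified in the text.
    """
    infos = [-math.log(p_word[w]) for w in term.split() if w in p_word]
    return sum(infos) / len(infos) if infos else 0.0

# Toy usage with word-frequency estimates: the longer, more specific term is
# built from rarer words and therefore receives a higher specificity value.
corpus = "insulin dependent diabetes mellitus diabetes mellitus wolfram syndrome".split()
p_word = word_freq_probs(corpus)
print(compositional_specificity("insulin dependent diabetes mellitus", p_word))
print(compositional_specificity("diabetes mellitus", p_word))
```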
If the modifier-head structure of a term is known, the specificity of the term is calculated incrementally, starting from the head noun. In this manner, the specificity of a term is always larger than that of its head term, which agrees with the assumption that a more specific term has higher specificity. We use simple nesting relations between terms to analyze the modifier-head structure, as follows (Frantzi, 2000):

Definition 1. If two terms X and Y are terms in the same semantic category and X is nested in Y, then X is the head term of Y and the remaining words of Y are modifiers of X.

For example, because "diabetes mellitus" is nested in "insulin dependent diabetes mellitus" and the two terms are both disease names, "diabetes mellitus" is the head term and "insulin dependent" is the modifier. The specificity of Y is then measured as equation (9), which adds the weighted specificity of the modifiers to the specificity of the head term X; the weighting parameters for the specificity of the modifiers are found experimentally.

2.2 Contextual Information based Method (Method 2)

There are some problems that are hard to address using compositional information alone. First, although the two disease names "wolfram syndrome" and "insulin-dependent diabetes mellitus" share many common features at the semantic level, they do not share any common words at the lexical level. In this case, it is unreasonable to compare their specificity values based on compositional information. Second, when several words are combined into one term, there are additional semantic components that cannot be predicted from the unit words. For example, "wolfram syndrome" is a kind of "diabetes mellitus", but we cannot predict the meaning of "diabetes mellitus" from the two separate words "wolfram" and "syndrome". Thus we use contextual information to address these problems.

General terms are frequently modified by other words in a corpus. Because domain-specific terms have sufficient information in themselves, they are rarely modified by other words (Caraballo, 1999). Under this assumption, we use the probability distribution of modifiers as contextual information.

Collecting sufficient modifiers from the given corpus is very important in this method. To this end, we use the Conexor functional dependency parser (Conexor, 2004) to analyze the structure of sentences. Among the many dependency functions defined by the parser, the "attr" and "mod" functions are used to extract modifiers from the analyzed structures. This method can only be applied to terms that are modified by other words in the corpus.

The entropy of the modifiers of a term t_k is defined as equation (10),

    H(t_k) = -\sum_{i} p(mod_i) \log p(mod_i),    (10)

where p(mod_i) is the probability of modifier mod_i, estimated as the relative frequency of mod_i among all modifiers of t_k. The entropy calculated by equation (10) is the average information quantity of all (mod_i, t_k) pairs.

Because domain-specific terms have simple modifier distributions, their entropy is low. Therefore the inverted entropy is assigned to I(x_k) in equation (4), so that specific terms receive a large quantity of information:

    I(x_k) \approx \max_{j} H(t_j) - H(t_k),    (11)

where the first term of the approximation is the maximum modifier entropy over all terms.
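The following Python sketch illustrates equations (10) and (11), assuming the modifiers of each term have already been extracted (for example from the parser's "attr" and "mod" relations) into a list per term; the dictionary layout and the treatment of terms without modifiers are illustrative assumptions.

```python
import math
from collections import Counter

def modifier_entropy(modifiers):
    """Equation (10): entropy of one term's modifier distribution.

    p(mod_i) is estimated as the relative frequency of mod_i among
    all modifiers observed for the term.
    """
    counts = Counter(modifiers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def contextual_specificity(modifiers_by_term):
    """Equation (11): inverted entropy, so that terms with simple modifier
    distributions (domain-specific terms) receive large values."""
    entropies = {t: modifier_entropy(mods) for t, mods in modifiers_by_term.items()}
    h_max = max(entropies.values())
    return {t: h_max - h for t, h in entropies.items()}

# Toy usage: a general term modified by many different words gets low
# specificity; a specific term with a narrow modifier distribution gets high.
mods = {
    "diabetes mellitus": ["insulin", "dependent", "juvenile", "gestational", "severe"],
    "wolfram syndrome": ["rare", "rare"],
}
print(contextual_specificity(mods))
```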
2.3 Hybrid Method (Method 3)

In this section, we describe a hybrid method that overcomes the shortcomings of the previous two methods. In this method, the specificity is measured as equation (12),

    I(x_k) = \frac{1}{\gamma \frac{1}{I_1(x_k)} + (1-\gamma) \frac{1}{I_2(x_k)}},    (12)

where I_1(x_k) and I_2(x_k) are the information quantities measured by method 1 and method 2, respectively, both normalized to values between 0 and 1, and \gamma (0 \le \gamma \le 1) is the weight of the two values. If \gamma = 0.5, the equation is the harmonic mean of the two values; therefore I(x_k) becomes large only when the two values are both large.
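Finally, a short Python sketch of the combination in equation (12). It assumes the two component scores have already been normalized into (0, 1]; the handling of zero scores is an added assumption.

```python
def hybrid_specificity(i1, i2, gamma=0.5):
    """Equation (12): weighted harmonic mean of the two normalized scores.

    With gamma = 0.5 this is the plain harmonic mean, so the result is
    large only when both component scores are large.
    """
    if i1 <= 0.0 or i2 <= 0.0:
        return 0.0  # assumption: degenerate scores are mapped to zero specificity
    return 1.0 / (gamma / i1 + (1.0 - gamma) / i2)

# Toy usage: one low score pulls the combined specificity down sharply.
print(hybrid_specificity(0.9, 0.8))   # both high -> high
print(hybrid_specificity(0.9, 0.1))   # one low   -> much lower
```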