File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-3301_metho.xml

Size: 11,339 bytes

Last Modified: 2025-10-06 14:11:00

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3301">
  <Title>The Semantics of a Definiendum Constrains both the Lexical Semantics and the Lexicosyntactic Patterns in the Definiens</Title>
  <Section position="4" start_page="1" end_page="2" type="metho">
    <SectionTitle>
2 Unified Medical Language System
</SectionTitle>
    <Paragraph position="0"> The Unified Medical Language System (UMLS) is the largest biomedical knowledge source maintained by the National Library of Medicine. It provides standardized biomedical concept relations and synonyms (Humphreys et al. 1998). The UMLS has been widely used in natural language processing tasks, including information retrieval (Eichmann et al. 1998), information extraction (Rindflesch et al. 2000), and text summarization (Elhadad et al. 2004; Fiszman et al. 2004).</Paragraph>
    <Paragraph position="1"> The UMLS includes the Metathesaurus (MT), which contains over one million biomedical concepts, and the Semantic Network (SN), which represents a high-level abstraction of the Metathesaurus. The SN consists of 134 semantic types and 54 types of semantic relations (e.g., is-a or part-of) that relate the semantic types to each other. It provides broad, general world knowledge related to human health. Each UMLS concept is assigned one or more semantic types.</Paragraph>
    <Paragraph position="2"> The National Library of Medicine also makes available MMTx, a programming implementation of MetaMap (Aronson 2001), which maps free text to UMLS concepts and their associated semantic types. MMTx first splits text into sentences, then chunks the sentences into noun phrases. Each noun phrase is then mapped to a set of possible UMLS concepts, taking into account spelling and morphological variations; each concept is weighted, with the highest weight indicating the most likely mapping. One recent study (Yu and Sable 2005) found MMTx to be 79% accurate in mapping terms to their semantic type(s) in a small set of medical questions. Another study (Lacson and Barzilay 2005) measured a recall of 74.3% for MMTx in capturing the semantic types in another set of medical texts.</Paragraph>
    <Paragraph position="3"> In this study, we applied MMTx to identify the semantic types of the terms that appear in the definitions. For each candidate term, MMTx returns a list of UMLS concepts ranked by confidence. We selected the UMLS concept to which MMTx assigned the highest confidence and used it to obtain the corresponding semantic types.</Paragraph>
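The selection step described above can be sketched as follows. This is a minimal illustration, not the MMTx API: the `Candidate` record, the function name, and the scores are hypothetical stand-ins for whatever ranked output the mapper produces.

```python
from typing import NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    """One candidate mapping for a term (hypothetical record, not an MMTx type)."""
    concept: str                      # UMLS concept name
    score: float                      # mapping confidence; higher is better
    semantic_types: Tuple[str, ...]   # semantic types attached to the concept


def best_semantic_types(candidates: Sequence[Candidate]) -> Tuple[str, ...]:
    """Return the semantic types of the highest-scoring candidate, or () if none."""
    if not candidates:
        return ()
    top = max(candidates, key=lambda c: c.score)
    return top.semantic_types


# Illustrative scores only; they are not real MMTx output.
candidates = [
    Candidate("Heart", 861.0, ("Body Part, Organ, or Organ Component",)),
    Candidate("Heart Diseases", 694.0, ("Disease or Syndrome",)),
]
print(best_semantic_types(candidates))
```

Taking only the top candidate keeps one concept per term, at the cost of discarding lower-ranked but possibly correct mappings.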
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Data Collection
</SectionTitle>
    <Paragraph position="0"> We collected a large number of online definitions for our study. Specifically, we used more than 1 million UMLS concepts as candidate definitional terms and searched the World Wide Web for their definitions using the Google:Definition service. This yielded 226,089 definitions corresponding to 36,535 UMLS concepts (about 3.7% of the 1 million UMLS concepts).</Paragraph>
    <Paragraph position="1"> We removed the defined terms from their definitions; this step is necessary for the statistical analyses described in the following sections. We then applied MMTx to the definitions to obtain the corresponding semantic types.</Paragraph>
  </Section>
  <Section position="6" start_page="2" end_page="3" type="metho">
    <SectionTitle>
4 Statistically Correlated Semantic Types
</SectionTitle>
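    <Paragraph position="0"> We then identified statistically correlated semantic types between SDT (the semantic types of defined terms) and SDef (the semantic types of the terms that appear in definitions) using the bivariate tabular chi-square test (Fleiss 1981).</Paragraph>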
    <Paragraph position="1"> Specifically, given a semantic type $STY_i$, $i = 1, 2, \dots, 134$, of any defined term, the observed numbers of definitions that were and were not assigned $STY_i$ are $O(Def_i)$ and $O(\overline{Def_i})$. All denotes the full set of 226,089 definitions. The observed numbers of definitions in which the semantic type $STY_i$ did and did not appear are $O(All_i)$ and $O(\overline{All_i})$. The number 134 is the total number of UMLS semantic types. We applied formulas (1) and (2) to calculate the expected frequencies and then the chi-square value (with one degree of freedom). A high chi-square value indicates the importance of the semantic type that appears in the definition. We removed the defined terms from their definitions before the semantic-type statistical analysis in order to remove the bias introduced by the defined terms (i.e., defined terms frequently appear in their own definitions).</Paragraph>
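Formulas (1) and (2) are not reproduced in this text, so the following is only a sketch of the textbook Pearson chi-square test on a 2 x 2 table, which is the usual form of a bivariate tabular test with one degree of freedom. The function name and the cell layout are assumptions, and how the paper fills the four cells from $O(Def_i)$ and $O(All_i)$ may differ.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom) for the table [[a, b], [c, d]].

    Assumed layout (hypothetical): a, b are the counts of definitions in one
    group in which a semantic type does / does not appear; c, d are the same
    counts in the comparison group. All row and column totals must be nonzero.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for r in range(2):
        for k in range(2):
            # Expected count under independence: row total * column total / n.
            expected = row_totals[r] * col_totals[k] / n
            chi2 += (observed[r][k] - expected) ** 2 / expected
    # For one degree of freedom, the upper-tail probability of the
    # chi-square distribution is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p = chi_square_2x2(20, 80, 50, 50)
print(round(chi2, 2), p)
```

The closed-form p-value avoids a dependency on a statistics library; it is valid only for the one-degree-of-freedom case used here.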
    <Paragraph position="3"> To determine whether a chi-square value is large enough for statistical significance, we calculated its p-value. Typically, 0.05 is the cutoff for significance, i.e., significance is accepted if the corresponding p-value is less than 0.05. This criterion limits the probability of a false significance (one detected by chance alone) to 0.05 for a single SDT-SDef pair. However, since there are 134*134 = 17,956 possible SDT-SDef pairs, the chance of obtaining at least one false significance could be very high. For a more conservative inference, we employed a Bonferroni-type correction procedure (Hochberg 1988).</Paragraph>
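To see why an uncorrected cutoff is a problem at this scale, the arithmetic can be sketched as below. The calculation assumes the tests are independent, which the text does not claim; it is meant only to convey the order of magnitude. The function name is hypothetical.

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) over m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


m = 134 * 134                      # number of SDT-SDef pairs
print(m)                           # 17956
print(familywise_error_rate(10))   # roughly 0.40 already at ten tests
print(familywise_error_rate(m))    # indistinguishable from 1.0 in floating point
```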
    <Paragraph position="4"> Specifically, let $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$ be the ordered raw p-values, where m is the total number of SDT-SDef pairs. An SDef is significantly associated with an SDT if its corresponding p-value p satisfies $p \le p_{(i)} \le \alpha/(m - i + 1)$ for some i. This correction procedure ensures that the probability of at least one false significance among the m pairs is less than $\alpha$ (= 0.05).</Paragraph>
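A minimal sketch of this step-up rule, written from the description above rather than from the authors' code: find the largest rank i whose ordered p-value is at most alpha / (m - i + 1), then reject every hypothesis whose p-value is no larger than that one. The function name is hypothetical.

```python
def hochberg_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda j: p_values[j])
    cutoff_rank = 0
    # Step up from the largest p-value (rank m) towards the smallest (rank 1).
    for rank in range(m, 0, -1):
        if alpha / (m - rank + 1) >= p_values[order[rank - 1]]:
            cutoff_rank = rank
            break
    rejected = [False] * m
    for rank in range(cutoff_rank):
        rejected[order[rank]] = True
    return rejected


print(hochberg_reject([0.04, 0.03, 0.02]))
print(hochberg_reject([0.01, 0.04, 0.03, 0.2]))
```

In the first call the largest p-value already passes its threshold (0.04 against 0.05), so all three hypotheses are rejected, whereas a plain Bonferroni cutoff of 0.05 / 3 would reject none of them.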
    <Paragraph position="5"> The number of definitions for each SDT ranges from  As the power of a statistical test depends on the sample size, some correlated semantic types might go undetected when the number of available definitions is small. It is therefore worthwhile to know the sample size needed to have a reasonable chance of detecting a difference statistically.</Paragraph>
    <Paragraph position="6">  For this task, we assume P0 and P1 are true probabilities that a STY will appear in NDef and NAll. Based upon that, we calculated the minimal required number of sentences n such that the probability of statistical significance will be larger than or equal to 0.8. This sample size is determined based on the following two assumptions: 1) the observed frequencies are approximately normally distributed, and 2) we use chi-square significance to test the hypothesis P0 = P1 at significance level</Paragraph>
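The exact formula used for the sample size is not shown in this text. As a rough illustration only, the sketch below uses the standard normal-approximation formula for comparing two proportions with equal group sizes. The function name, the two-sided test, and the default significance level of 0.05 are assumptions; the power of 0.8 comes from the paragraph above.

```python
import math
from statistics import NormalDist


def min_sample_size(p0, p1, alpha=0.05, power=0.8):
    """Smallest n per group to detect p0 versus p1 with the given power.

    Textbook two-proportion formula under the normal approximation;
    alpha is an assumed two-sided significance level.
    """
    if p0 == p1:
        raise ValueError("p0 and p1 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value of the test
    z_beta = NormalDist().inv_cdf(power)            # quantile for the target power
    p_bar = (p0 + p1) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    ) ** 2
    return math.ceil(numerator / (p0 - p1) ** 2)


print(min_sample_size(0.5, 0.6))
print(min_sample_size(0.5, 0.7))
```

The required n grows quickly as the two probabilities approach each other, which is why semantic types with few definitions may show no significant correlations.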
  </Section>
  <Section position="7" start_page="3" end_page="3" type="metho">
    <SectionTitle>
5 Semantic Type Distribution
</SectionTitle>
    <Paragraph position="0"> Our null hypothesis is that, given any pair {SDT(X), SDT(Y)}, $X \neq Y$, where X and Y are two different semantic types among the 134 semantic types, there are no statistical differences in the distributions of the semantic types of the terms that appear in the definitions.</Paragraph>
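    <Paragraph position="1"> We applied the bivariate tabular chi-square test to compare the semantic type distributions. Following notation similar to that of Section 4, we write $O_{Xi}$ and $O_{Yi}$ for the frequencies with which $STY_i$ is observed in SDef(X) and SDef(Y), and $\overline{O}_{Xi}$ and $\overline{O}_{Yi}$ for the corresponding frequencies of not being observed in SDef(X) and SDef(Y).</Paragraph>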
    <Paragraph position="2"> For each semantic type STY, we calculate the expected frequencies of being observed and not being observed in SDef(X) and SDef(Y), respectively, and their corresponding chi-square value according to formulas (3) and (4):</Paragraph>
    <Paragraph position="4"> where $N_X$ and $N_Y$ are the numbers of sentences in SDef(X) and SDef(Y), respectively, and in both (3) and (4), $i = 1, 2, \dots, 134$ and $X, Y = 1, 2, \dots, 134$ with $X \neq Y$. The test has one degree of freedom. The chi-square value measures whether the occurrences of $STY_i$ are equivalent between SDef(X) and SDef(Y). The same multiple-testing correction procedure is used to determine the significance of the chi-square value. Note that if at least one $STY_i$ is found to be statistically significant after multiple-testing correction, we consider the distributions of the semantic types to differ between SDef(X) and SDef(Y).</Paragraph>
  </Section>
  <Section position="8" start_page="3" end_page="4" type="metho">
    <SectionTitle>
6 Automatically Identifying Semantic-Type-Dependent Lexicosyntactic Patterns
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Most current definitional question answering systems generate lexicosyntactic patterns either manually or semi-automatically. In this study, we automatically generated large sets of lexicosyntactic patterns from our collection of online definitions by applying the information extraction system AutoSlog-TS (Riloff and Phillips 2004). We then identified the statistical correlations between the semantic types of defined terms and the lexicosyntactic patterns in their definitions.</Paragraph>
      <Paragraph position="1"> AutoSlog-TS is an information extraction system that is built upon AutoSlog (Riloff 1996).</Paragraph>
      <Paragraph position="2"> AutoSlog-TS automatically identifies extraction patterns for noun phrases by learning from two sets of unannotated texts, one relevant and one non-relevant.</Paragraph>
      <Paragraph position="3"> AutoSlog-TS first generates every possible lexicosyntactic pattern needed to extract every noun phrase in both collections of text. It then computes statistics based on how often each pattern appears in the relevant text versus the background text, and outputs a ranked list of extraction patterns together with statistics indicating how strongly each pattern is associated with the relevant and non-relevant texts.</Paragraph>
      <Paragraph position="4"> We grouped definitions based on the semantic types of the defined terms. For each semantic type, the relevant text incorporated the definitions, and the non-relevant text incorporated an equal number of sentences that were randomly selected from the MEDLINE collection. For each semantic type, we applied AutoSlog-TS to its associated relevant and non-relevant sentence collections to generate lexicosyntactic patterns; this resulted in a total of 134 sets of lexicosyntactic patterns that corresponded to different semantic types of defined terms. Additionally, we identified the common lexicosyntactic patterns across the semantic types and ranked the lexicosyntactic patterns based on their frequencies across semantic types.</Paragraph>
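The construction of the per-type input collections can be sketched as follows. This is an illustration under assumptions: the function name, the input format, and the fixed random seed are hypothetical, and the sketch stops before the AutoSlog-TS run itself, whose interface is not described here.

```python
import random
from collections import defaultdict


def build_collections(definitions, background_sentences, seed=0):
    """Pair each semantic type's definitions with an equal-sized background sample.

    definitions: iterable of (semantic_type, definition_text) pairs.
    background_sentences: list of sentences standing in for the MEDLINE sample.
    Returns {semantic_type: (relevant_texts, non_relevant_texts)}.
    """
    by_type = defaultdict(list)
    for semantic_type, text in definitions:
        by_type[semantic_type].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    collections = {}
    for semantic_type, relevant in by_type.items():
        if len(relevant) > len(background_sentences):
            raise ValueError("not enough background sentences to sample from")
        # Sample without replacement, matching the number of definitions.
        non_relevant = rng.sample(background_sentences, len(relevant))
        collections[semantic_type] = (relevant, non_relevant)
    return collections


demo = build_collections(
    [("Disease or Syndrome", "d1"), ("Disease or Syndrome", "d2"), ("Enzyme", "d3")],
    ["s1", "s2", "s3", "s4"],
)
print({k: (len(v[0]), len(v[1])) for k, v in demo.items()})
```

Matching the sizes of the two collections keeps the pattern statistics from being dominated by whichever collection happens to be larger.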
      <Paragraph position="5"> We also identified statistical correlations between SDT and the lexicosyntactic patterns in definitions, using the chi-square statistics described in the previous two sections. In formulas (1) to (4), we replaced each STY with a lexicosyntactic pattern. Our null hypothesis is that, given any SDT, there are no statistical differences in the distributions of the lexicosyntactic patterns that appear in the definitions.</Paragraph>
      <Paragraph position="6">  fined terms with the top five statistically correlated semantic types (P&lt;&lt;0.0001) that appear in their definitions.</Paragraph>
    </Section>
  </Section>
</Paper>