<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1022">
  <Title>Exogeneous and Endogeneous Approaches to Semantic Categorization of Unknown Technical Terms</Title>
  <Section position="5" start_page="146" end_page="147" type="metho">
    <SectionTitle>
4 Exogeneous Categorization
</SectionTitle>
    <Paragraph position="0"> We tested several classiticatiol~ models. Our tirst ex1)erim(mts were carried out with Examl)le-1)ased classitiers. We used our own implementation of K-nearest neighbors algorithm (kNN), mid then (;lie TiMBL learner (Daelemans et al., 1999), which provides several extensions to kNN, well-suited tbr NLP 1)rol)lems. Neverl;heless, in the current st~tte of our work, better l'esull;s were ol)tained with a 1)rol)al)ilistic classifier similar to  gest that exogeneous and cndogencous al)l)roaches air(,' complementary.</Paragraph>
    <Paragraph position="1"> the one used l)y C\['okllnaga et al.: 1997) tbr thesnm'us exl;ension, l)ue to lack of sl)ace , only this method will l)e (les(:ribed in this 1)aper. We use as contextual cues the open-class words (nouns, verbs, adjectives, adverbs) theft co-occur in the corpus with the technical terms. More precisely, the cues are open-class words surrounding the occllrr(;lices of t;he term in some window of t)i'edetined size. Each new term to be (:ntegorizeA is rel)resented by the overall set of (:oni;exi;ual (:ues that have 1)een (~xl;ra(:l;ed fl:onl :t \])arl; ()f the (:orlms (I;est (:orlms).</Paragraph>
    <Section position="1" start_page="146" end_page="147" type="sub_section">
      <SectionTitle>
4.1 Probability Model
</SectionTitle>
      <Paragraph position="0"> Lel; us consider a tel'ill 5/' for which the contcxl;ll;tl (;ll(;S {lt;i}~_1 ha,ve been collected in the test corl)tlS. Tlm c~l;egorization of this t;erm alllOUlli;S l;o lind the cal;(;gory C* that maximizes</Paragraph>
      <Paragraph position="2"> According to the exogeneous api)roa(:h , l;he probalfility that a term ~/' belongs to category C del)ellds (m the contexl;ual cues of ~F:</Paragraph>
      <Paragraph position="4"> The l)rob~dfilities of the eqmd;ion 3 are estimated from trailfing data: * /)(wile) is the prol)ability that a word wi co-oe(:urs with a term t)elonging to (:ategory C. It is estimated in the fi)llowing way:</Paragraph>
      <Paragraph position="6"> co-occurs with a term belonging to c~tegory 6'.</Paragraph>
      <Paragraph position="7"> This probability accounts tbr the weight of (;ue &amp;quot;w i in cai;egory C.</Paragraph>
      <Paragraph position="9"> the corpus belongs to the category C:</Paragraph>
      <Paragraph position="11"> where Nt(C) is the occurrence number in training data of terms t)elonging to C. This probability accounts for the weight of category C in the corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="147" end_page="147" type="sub_section">
      <SectionTitle>
4.2 Training and Test
</SectionTitle>
      <Paragraph position="0"> qJhe exogeneous classifier starts with the selection of test documents in the cort)us. Technical terms found in these documents will form the test set. The remaining documents represent the training corl)us. %'aining and test stages are the following: * POS tagging. The test and training corpora are tagged with MultAna, a tagger designed as an extension of the Multex morphological analyzer (Petitpierre and Russell, 1995). Occurrences of the technical terms are identified during this stage and the terms to be categorized are those which are identified in the test corpus.</Paragraph>
      <Paragraph position="1"> * Extraction of contextual cues. For each term occurrence in training and test data, the contextual cues are collected.</Paragraph>
      <Paragraph position="2"> Only the lemmas of open-class words are used and cues may correspond to multi-word terms. Each test term is then represented by the set of cues which have been collected in test data.</Paragraph>
      <Paragraph position="3"> '~Note that the categorization process could be simplified by eliminating P(wi), since this quantity is constaifl; for all categories.</Paragraph>
      <Paragraph position="4">  plored to compute the frequencies (occurrences and co-occurrences) of cues, terms and categories. As mentioned earlier (section 3), tile cue occun'ences which have been collected around the test terms are ignored during this step. Tile probabilities required for the categorization operation are then computed.</Paragraph>
      <Paragraph position="5"> * Categorization of the test terms. The most probable categories are assigned to each test term (see section 4.1). Figure 1 gives; some examples of exogencous categorization 3.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="147" end_page="149" type="metho">
    <SectionTitle>
5 Endogeneous Categorization
</SectionTitle>
    <Paragraph position="0"> Our approach to endogeneous categorization is simpler. It is exclusively based oll a quantitative analysis of the lexical composition of technical terms. Henceforth, the open-class words used to compose technical terms will be called terminolwical components. The endogeneous approach relies on a much more restricted source of data than the exogeneous approach, since the comI)one:nt set of a terminological database is quantitatively limited compared with the set of contextual cues extracted from corpora. Nevertheless, we make the assumption that this quantitative limitation is partly compensated by the aSome category labels are described in table 1.</Paragraph>
    <Paragraph position="1">  strong discrimination pow(;r of tel&amp;quot;minologi(:al COml/onen(;s.</Paragraph>
    <Paragraph position="2"> Th(: training t)hase assigns to each category a set of ret)resenl;ative (:onq)onents with resl)e(:t to some association score. The categorization t)hase determines tlm most t)lausil)lc cntegories of a term a(:cording to its (:omt)onents.</Paragraph>
    <Section position="1" start_page="148" end_page="148" type="sub_section">
      <SectionTitle>
5.1 Association Score
</SectionTitle>
      <Paragraph position="0"> To estimate the (lel)end(m(:y lml,ween (:()ml)On(:nts and cat(:gori(:s, w(: eXl)erilnent(:d s(:v(:r~l association criteria. The choi(:(; of l:hes(; (:riteria has l)(;ell intluen(:ed by the (:omt)aral;iv(: study described in (Yang and P(:rdersen, 1997) on f(:atam; select;ion (:riteria for text categorization.</Paragraph>
      <Paragraph position="1"> W(: test(:(t s(:vcral measures, including COml)On(:nt fl:(:quency, information gain and mutual information. Ore: best results were a(:hiev(:(t with mutual information whi(:h is (:stimat(:(t using:</Paragraph>
      <Paragraph position="3"> in category C.</Paragraph>
      <Paragraph position="4"> * Nw is the total number of component occurren(:cs. null * N,,,(G) is tim total nunfl/er of comt)onent oe(:urrences in category C. This factor reduces the etlbct of the coml)oncnts weakly represented in category C, compared with the other ColnI)ollell|;8 oJ' C.</Paragraph>
      <Paragraph position="5"> * N,,,(w) is the frequency of conH)onent 'w in tile terminological database. This factor reduces tile efl'ect of the comi)onents that (h:not(; basi(: concel)ts spread all over (;11(: datzd)ase. \],br examl)le, the coral)orients ,weed, altitude, press'urc, hay(' high frequen&lt;:ies in &lt;:ateg&lt;)ry FLP (I'~lig'ht: IS~rnmct(:rs), but, as basic con&lt;:el&gt;ts, they also at)l)ear fix&gt; (luently in many categories.</Paragraph>
      <Paragraph position="6"> '\]'able 2 gives for two (:ategories the ten most rel)rc.sel~l;;d;ive (:Olnl)onen(;s a(:(:or(ling (;o this S(;or(;.</Paragraph>
      <Paragraph position="7"> The asso(:iation s(:ore l)etween a t;erm T (with</Paragraph>
      <Paragraph position="9"> according to the. colnt)onents of '2':</Paragraph>
      <Paragraph position="11"> Nt(C) is the nmnber of terms 1)el&amp;quot;raining t() (:at.eg()ry C and Nt is the. total nmnber of N,(c) terms. The factor ~ favors lm'ger cal;egories.</Paragraph>
      <Paragraph position="12"> Th(: (:atcgorization task determines 1;t1(: category C* that maxinfizes the association score:</Paragraph>
      <Paragraph position="14"/>
    </Section>
    <Section position="2" start_page="148" end_page="149" type="sub_section">
      <SectionTitle>
5.2 Training and Test
</SectionTitle>
      <Paragraph position="0"> Only multi-word terms can be categorized with this method since our endogeneous at)l)roach is 1)y nature not relevant for simt)le words. A test set of conq)ound terms is extracted from lille terminological database. The remaining terms are us(:d tbr training. The training terms ar(: analyzed in order to assign to each category its terminological comt)onent;s. Then, component fiequen(:ies and asso(:iation scores arc computed.</Paragraph>
      <Paragraph position="1">  Ten experiments have been run fbr a total test set of 2320 terms, During the test phase, each test term is annotated with the most plausible categories according to its coxnl)onents.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>