<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1025">
  <Title>Statistical Sense Disambiguation with Relatively Small Corpora Using Dictionary Definitions</Title>
  <Section position="3" start_page="0" end_page="182" type="intro">
    <SectionTitle>
2 Statistical Sense Disambiguation Using
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="181" type="sub_section">
      <SectionTitle>
Dictionary Definitions
</SectionTitle>
      <Paragraph position="0"> It is well known that some words tend to co-occur with some words more often than with others.</Paragraph>
      <Paragraph position="1"> Similarly, looking at the meaning of the words, one should find that some concepts co-occur more often with some concepts than with others. For example, the concept crime is found to co-occur frequently with the concept punishment. This kind of conceptual relationship is not always reflected at the lexical level. (Statistical data is also domain dependent: data extracted from a corpus of one particular domain is usually not very useful for processing text of another domain.)</Paragraph>
      <Paragraph position="2"> For instance, in legal reports, the concept crime will usually be expressed by words like offence or felony, etc., and punishment will be expressed by words such as sentence, fine or penalty, etc. The large number of different words of similar meaning is the major cause of the data sparseness problem.</Paragraph>
      <Paragraph position="3"> The meaning or underlying concepts of a word are very difficult to capture accurately, but dictionary definitions provide a reasonable representation and are readily available. For instance, the LDOCE definitions of both offence and felony contain the word crime, and all of the definitions of sentence, fine and penalty contain the word punishment. To disambiguate a polysemous word, a system can select the sense whose dictionary definition contains defining concepts that co-occur most frequently with the defining concepts in the definitions of the other words in the context. In the current experiment, this conceptual co-occurrence data is collected from the Brown corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="181" end_page="181" type="sub_section">
      <SectionTitle>
2.1 Collecting Conceptual Co-occurrence Data
</SectionTitle>
      <Paragraph position="0"> Our system constructs a two-dimensional table which records the frequency of co-occurrence of each pair of defining concepts. The controlled vocabulary provided by Longman is a list of all the words used in the definitions but, in its crude form, it does not suit our purpose. From the controlled vocabulary, we manually constructed a list of 1792 defining concepts. To minimise the size of the table and the processing time, all the closed class words and words which are rarely used in definitions (e.g., the days of the week, the months) are excluded from the list. To strengthen the signals, words which have the same semantic root are combined as one element in the list (e.g., habit and habitual are combined as {habit, habitual}).</Paragraph>
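      <Paragraph> The normalisation just described can be sketched as follows. This is a minimal illustration only: the root groupings and the closed-class stop list below are toy stand-ins, not the actual 1792-concept list derived from the Longman controlled vocabulary.

```python
# Sketch: map definition words onto a list of defining concepts by
# dropping closed-class words and collapsing same-root variants.
# Groupings and stop list are illustrative stand-ins.
CONCEPT_GROUPS = [
    {"habit", "habitual"},
    {"crime", "criminal"},
    {"punish", "punishment"},
]
CLOSED_CLASS = {"the", "a", "of", "in", "and", "or", "which", "by"}

# Map every surface form in a group to one canonical concept label.
CANONICAL = {}
for group in CONCEPT_GROUPS:
    label = min(group)          # a stable representative, e.g. "habit"
    for word in group:
        CANONICAL[word] = label

def to_defining_concepts(definition_words):
    """Drop closed-class words and collapse same-root variants."""
    concepts = set()
    for w in definition_words:
        w = w.lower()
        if w in CLOSED_CLASS:
            continue
        concepts.add(CANONICAL.get(w, w))
    return concepts

print(sorted(to_defining_concepts(["a", "habitual", "crime"])))
# ['crime', 'habit']
```

In the real system the canonical list was built by hand; this sketch only shows how combined entries such as {habit, habitual} strengthen the signal by sharing one table row.</Paragraph>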
      <Paragraph position="1"> The whole of LDOCE is pre-processed first. For each entry in LDOCE, we construct its corresponding conceptual expansion. The conceptual expansion of an entry whose headword is not a defining concept is a set of conceptual sets. Each conceptual set corresponds to a sense in the entry and contains all the defining concepts which occur in the definition of that sense. (Manually constructed semantic frames could be more useful computationally, but building semantic frames for a huge lexicon is an extremely expensive exercise.)</Paragraph>
      <Paragraph position="2"> The entry for the noun sentence and its corresponding conceptual expansion are shown in Figure 1. If the headword of an entry is a defining concept DC, the conceptual expansion is given as {{DC}}.</Paragraph>
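      <Paragraph> The construction of a conceptual expansion can be sketched as follows; the entry, sense definitions and defining-concept set below are toy stand-ins, not actual LDOCE data.

```python
# Sketch: a conceptual expansion is one conceptual set per sense;
# if the headword is itself a defining concept DC, it is {{DC}}.
def conceptual_expansion(headword, sense_definitions, defining_concepts):
    if headword in defining_concepts:
        return [{headword}]                      # the {{DC}} case
    return [{w for w in words if w in defining_concepts}
            for words in sense_definitions]

DCS = {"order", "judge", "punish", "crime", "criminal", "find",
       "guilt", "court", "group", "word", "statement"}
senses = [["order", "judge", "punish", "criminal", "guilt", "court"],
          ["group", "word", "statement"]]

exp = conceptual_expansion("sentence", senses, DCS)
print(sorted(exp[1]))   # ['group', 'statement', 'word']
```

Since "sentence" is not itself a defining concept here, its expansion is a set of two conceptual sets, one per sense, mirroring Figure 1.</Paragraph>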
      <Paragraph position="3"> The corpus is pre-segmented into sentences but not pre-processed in any other way (it is neither sense-tagged nor part-of-speech-tagged). The context of a word is defined to be the current sentence. The system processes the corpus sentence by sentence and collects conceptual co-occurrence data for each defining concept which occurs in the sentence. This allows the whole table to be constructed in a single run through the corpus.</Paragraph>
      <Paragraph position="4"> Since the training data is not sense-tagged, the data collected will contain noise due to spurious senses of polysemous words. Like the thesaurus-based approach of Yarowsky (1992), our approach relies on the dilution of this noise by its distribution across all 1792 defining concepts. Different words in the corpus have different numbers of senses, and different senses have definitions of varying lengths. The principle adopted in collecting co-occurrence data is that every pair of content words which co-occur in a sentence should contribute equally to the conceptual co-occurrence data, regardless of the number of definitions (senses) of the words and the lengths of the definitions. In addition, the contribution of a word should be evenly distributed among all the senses of the word, and the contribution of a sense should be evenly distributed among all the concepts in the sense. The algorithm for conceptual co-occurrence data collection is shown in Figure 2.</Paragraph>
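      <Paragraph> The collection step can be sketched as below, under one natural reading of the equal-contribution principle: each word carries a unit of mass per co-occurring word, split evenly over its senses and then over the defining concepts within each sense. The expansions in the example are illustrative, not derived from LDOCE.

```python
from collections import defaultdict

# Sketch of the single-pass co-occurrence collection. The weight of a
# defining concept is 1 / (number of senses * size of its sense's set),
# so every word contributes a total of 1 per co-occurring word.
def collect(sentence_expansions):
    table = defaultdict(float)   # (DC, DC') -> weighted co-occurrence
    exps = sentence_expansions
    for i in range(len(exps)):
        for j in range(i + 1, len(exps)):
            for cs_i in exps[i]:
                w_i = 1.0 / (len(exps[i]) * len(cs_i))
                for dc_i in cs_i:
                    for cs_j in exps[j]:
                        w_j = 1.0 / (len(exps[j]) * len(cs_j))
                        for dc_j in cs_j:
                            table[(dc_i, dc_j)] += w_i * w_j
                            table[(dc_j, dc_i)] += w_i * w_j
    return table

# Two words in one sentence: one monosemous, one with two senses.
e1 = [{"crime"}]
e2 = [{"punish", "court"}, {"statement"}]
t = collect([e1, e2])
print(t[("crime", "punish")])   # 0.25 = 1 * (1 / (2 senses * 2 concepts))
```

Note the whole table is filled in one pass over the sentences, matching the single-run property stated above.</Paragraph>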
    </Section>
    <Section position="3" start_page="181" end_page="182" type="sub_section">
      <SectionTitle>
2.2 Using the Conceptual Co-occurrence Data
for Sense Disambiguation
</SectionTitle>
      <Paragraph position="0"> To disambiguate a polysemous word W in a context C, which is taken to be the sentence containing W, the system scores each sense S of W, as defined in LDOCE, with respect to C using the following equations.</Paragraph>
      <Paragraph position="1"> score(S, C) = score(CS, C') - score(CS, GlobalCS)   [1]
where CS is the corresponding conceptual set of S, C' is the set of conceptual expansions of all content words (which are defined in LDOCE) in C, and GlobalCS is the conceptual set containing all the 1792 defining concepts.</Paragraph>
      <Paragraph position="2"> Figure 1. The entry for the noun sentence in LDOCE and its conceptual expansion.
Entry in LDOCE:
1. (an order given by a judge which fixes) a punishment for a criminal found guilty in court
2. a group of words that forms a statement, command, exclamation, or question, usu. contains a subject and a verb, and (in writing) begins with a capital letter and ends with one of the marks . ! ?
Conceptual expansion:
{ {order, judge, punish, crime, criminal, find, guilt, court},
  {group, word, form, statement, command, question, contain, subject, verb, write, begin, capital, letter, end, ...} }</Paragraph>
      <Paragraph position="4"> Figure 2. The algorithm for collecting conceptual co-occurrence data.
1. Initialise the Conceptual Co-occurrence Data Table (CCDT) with an initial value of 0 in every cell.
2. For each sentence S in the corpus, do
   a. Construct S', the set of conceptual expansions of all content words (which are defined in LDOCE) in S.</Paragraph>
      <Paragraph position="5">    b. For each unique pair of conceptual expansions (CE_i, CE_j) in S', do
      For each defining concept DC_imp in each conceptual set CS_im in CE_i, do
        For each defining concept DC_jnq in each conceptual set CS_jn in CE_j, do
          increase the values of the cells CCDT(DC_imp, DC_jnq) and CCDT(DC_jnq, DC_imp) by the product of w(DC_imp) and w(DC_jnq), where the weight w(DC_xyz) is 1 / (|CE_x| * |CS_xy|), so that each word's total contribution is the same regardless of its number of senses and the lengths of its definitions.</Paragraph>
      <Paragraph position="7"> score(CS, C') = Σ_{CE'∈C'} score(CS, CE') / |C'|   for any conceptual set CS and conceptual expansion set C'   [2]
score(CS, CE') = max_{CS'∈CE'} score(CS, CS')   for any conceptual set CS and conceptual expansion CE'   [3]
score(CS, CS') = Σ_{DC'∈CS'} score(CS, DC') / |CS'|   for any conceptual sets CS and CS'   [4]
score(CS, DC') = Σ_{DC∈CS} score(DC, DC') / |CS|   for any conceptual set CS and defining concept DC'   [5]
score(DC, DC') = max(0, I(DC, DC'))   for any defining concepts DC and DC'   [6]
I(DC, DC') is the mutual information (Fano, 1961) between the two defining concepts DC and DC', given by:
I(x, y) = log2 [ P(x, y) / (P(x) · P(y)) ] ≈ log2 [ f(x, y) · N / (f(x) · f(y)) ]
(using the Maximum Likelihood Estimator). f(x, y) is looked up directly from the conceptual co-occurrence data table; f(x) and f(y) are looked up from a pre-constructed list of f(DC) values, one for each defining concept DC:</Paragraph>
      <Paragraph position="9"/>
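      <Paragraph> The scoring scheme of equations [1]-[6] can be sketched as follows: a sense's conceptual set is scored against each context word's expansion (taking the maximum over that word's conceptual sets), averaged over the context words, using clipped mutual information between defining concepts, with the score against the global conceptual set subtracted as a baseline. The frequency counts in the usage example are illustrative, not Brown-corpus data.

```python
import math

# Clipped pointwise mutual information between two defining concepts,
# estimated from co-occurrence counts by maximum likelihood.
def mi(dc1, dc2, pair_freq, freq, n):
    f_xy = pair_freq.get((dc1, dc2), 0)
    if f_xy == 0:
        return 0.0
    return math.log2(f_xy * n / (freq[dc1] * freq[dc2]))

def score_dc_pair(dc, dc2, pair_freq, freq, n):               # eq. [6]
    return max(0.0, mi(dc, dc2, pair_freq, freq, n))

def score_cs_dc(cs, dc2, pair_freq, freq, n):                 # eq. [5]
    return sum(score_dc_pair(dc, dc2, pair_freq, freq, n)
               for dc in cs) / len(cs)

def score_cs_cs(cs, cs2, pair_freq, freq, n):                 # eq. [4]
    return sum(score_cs_dc(cs, dc2, pair_freq, freq, n)
               for dc2 in cs2) / len(cs2)

def score_cs_ce(cs, ce, pair_freq, freq, n):                  # eq. [3]
    return max(score_cs_cs(cs, cs2, pair_freq, freq, n) for cs2 in ce)

def score_cs_context(cs, context_exps, pair_freq, freq, n):   # eq. [2]
    return sum(score_cs_ce(cs, ce, pair_freq, freq, n)
               for ce in context_exps) / len(context_exps)

def score_sense(cs, context_exps, global_cs, pair_freq, freq, n):  # eq. [1]
    return (score_cs_context(cs, context_exps, pair_freq, freq, n)
            - score_cs_cs(cs, global_cs, pair_freq, freq, n))

pair_freq = {("crime", "punish"): 4, ("punish", "crime"): 4}
freq = {"crime": 4, "punish": 4, "law": 2}
context = [[{"punish"}]]          # one context word with one sense
print(round(score_sense({"crime"}, context, {"crime", "punish", "law"},
                        pair_freq, freq, 16), 4))   # 1.3333
```

The sense of the polysemous word whose conceptual set receives the highest score under equation [1] is then selected.</Paragraph>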
    </Section>
  </Section>
</Paper>