<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0105">
  <Title>Selective Sampling of Effective Example Sentence Sets for Word Sense Disambiguation</Title>
  <Section position="3" start_page="0" end_page="57" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Word sense disambiguation is a crucial task in many NLP applications, such as machine translation \[1\], parsing \[14, 16\] and text retrieval \[10, 23\]. Given the growing utilization of machine readable texts, word sense disambiguation techniques have been variously used in corpus-based approaches \[1, 3, 5, 12, 18, 20, 21, 24\]. Unlike rule-based approaches, corpus-based approaches release us from the task of generalizing observed phenomena in order to disambiguate word senses. Our system is based on such an approach, or more precisely it is based on an example-based approach \[5\]. Since this approach requires a certain number of examples of disambiguated verbs, we have to carry out this task manually, that is, we disambiguate verbs appearing in a corpus prior to their use by the system. A preliminary experiment on ten Japanese verbs showed that the system needed on average about one hundred examples for each verb in order to achieve 82% of accuracy in disambiguating verb senses. In order to build an operational system, the following problems have to be taken into account:  1. Since there are about one thousand basic verbs in Japanese, a considerable overhead is associated with manual word sense disambiguation.</Paragraph>
    <Paragraph position="1"> 2. Given human resource limitations, it is not reasonable to manually analyze large corpora as they can provide virtually infinite input.</Paragraph>
    <Paragraph position="2"> 3. Given the fact that example-based natural language systems, including our system, search  the example-database (database, hereafter) for the most similar examples with regard to the input, the computational cost becomes prohibitive if one works with a very large database size \[11\].</Paragraph>
    <Paragraph position="3">  All these problems suggest a different approach, namely to select a small number of optimally informative examples from a given corpora. Hereafter we will call these examples &amp;quot;samples.&amp;quot; Our method, based on the utility maximization principle, decides on which examples should be included in the database. This decision procedure is usually called selective sampling. Selective sampling directly addresses the first two problems mentioned above. The overall control flow of systems based on selective sampling can be depicted as in figure 1, where &amp;quot;system&amp;quot; refers to dedicated NLP applications. The sampling process basically cycles between the execution and the training phases. During the execution phase, the system generates an interpretation for each example, in terms of parts-of-speech, text categories or word senses. During the training phase, the system selects samples for training from the previously produced outputs. During this phase, a human expert provides the correct interpretation of the samples so that the system can then be trained for the execution of the remaining data. Several researchers have proposed such an approach.</Paragraph>
    <Paragraph position="4">  ... ...... training phase ..</Paragraph>
    <Paragraph position="5"> I * correct interpretation , \[human I &amp;quot;, ..I. ....... .......... /~'1 ij I y ....... ~ I \[ ......... , .~-------'r'-.-... , , I (f_-l-..~.....~ ..</Paragraph>
    <Paragraph position="6"> &amp;quot;&amp;quot;&amp;quot;-.. ..... outputs.\] .... ..-'&amp;quot;&amp;quot; ......... execution phase ...........</Paragraph>
    <Paragraph position="7">  Lewis et al. proposed an example sampling method for statistics-based text classification \[13\]. In this method, the system always selects samples which are not certain with respect to the correctness of the answer. Dagan et al. proposed a committee-based sampling method, which is currently applied to HMM training for part-of-speech tagging \[2\]. This method selects samples based on the training utility factor of the examples, i.e. the informativity of the data with respect to future training. However, as all these methods are implemented for statistics-based models, there is a need to explore how to formalize and map these concepts into the example-based approach.</Paragraph>
    <Paragraph position="8"> With respect to problem 3, a possible solution would be the generalization of redundant examples \[8, 19\]. However, such an approach implies a significant overhead for the manual training of each example prior to the generalization. This shortcoming is precisely what our approach allows to avoid: reducing both the overhead as well as the size of the database. Section 2 briefly describes our method for a verb sense disambiguation system. The next Section 3 elaborates on the example sampling method, while section 4 reports on the results of our experiment. Before concluding in section 6, discussion is added in section 5.</Paragraph>
  </Section>
class="xml-element"></Paper>