<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0112">
  <Title>Knowledge Acquisition: Classification of Terms in a Thesaurus from a Corpus</Title>
  <Section position="2" start_page="101" end_page="106" type="abstr">
    <SectionTitle>
I.
</SectionTitle>
    <Paragraph position="0"> The most thoroughly studied application is the information retrieval (IR). Here, the term provides a means for accessing information through its standardising effect on the query and on the text to be found. The term can also be a variable that is used in statistical classification or clustering processes of documents (\[BLO 92\] and \[STA 95a\]), or in selective dissemination of information, in which it is used to bring together a document to be disseminated and its target \[STA 93\].</Paragraph>
    <Paragraph position="1"> Textual information is becoming more and more accessible in electronic form. This accessibility is certainly one of the prerequisites for the massive use of natural language processing (NLP) techniques. These techniques applied on particular domains, often use terminological resources that supplement the lexical resources. The lexical resources (general language dictionaries) are fairly stable, whereas terminologies evolve dynamically with the fields they describe. In particular, the disciplines of information processing (computers, etc.) and biology or genetics are characterised today by an extraordinary terminological activity.</Paragraph>
    <Paragraph position="2"> Unfortunately, the abundance of electronic corpora and the relative maturity of natural language processing techniques have induced a shortage of updated terminological data. The various efforts in automatic acquisition of terminologies from a corpus stem from this observation, and try to answer the following question: &amp;quot;How can candidate terms be extracted from a corpus?&amp;quot; Another inKoortant question is how to position a term in an existing thesaurus. That question can itself be subdivided into several questions that concern the role of the standard relationships in a thesaurus: synonymy, hyperonymy, etc. The question studied in this experiment concerns the positioning or classification of a term in a subject field or semantic field of a thesaurus. This is the first step in a precise positioning using the standard relationships of a thesaurus. This problem is very difficult for a human being to resolve when he is not an expert in the field to which the term belongs and one can hope that an automated classification process would be of great help.</Paragraph>
    <Paragraph position="3"> To classify a term in a subject field can be considered similar to word sense disambiguation (WSD) which consists in classifying a word in a conceptual class (one of its senses). The difference is that, in a corpus, a term is generally monosemous and a word is polysemous. Word sense disambiguation uses a single context (generally a window of a few words around the word to be disambiguated) as input to predict its sense among a few possible senses (generally less than ten). Term subject field discrimination uses a representation of the term calculated on the whole corpus in order to classify it into about 330 subject fields in this experiment.</Paragraph>
    <Paragraph position="4"> The experiment described here was used to evaluate different methods for classifying terms from a corpus in the subject fields of a thesaurus. After a brief description of the corpus and the thesaurus, automatic indexing and terminology extraction are described.</Paragraph>
    <Paragraph position="5"> Linguistic and statistical techniques are used to extract a candidate term from a corpus or to recognise a term in a document. This preparatory processing allows the document to be represented as a set  of terms (candidate terms and key words A classification method is then implemented to classify a subset of 1,000 terms in the 49 themes and 330 semantic fields that make up the thesaurus. The 1,000 terms thus classified comprise the test sample that is used to evaluate three models for representing terms.</Paragraph>
    <Paragraph position="6"> IX. Data Preparation IX.i. Description of the Corpus The corpus studied is a set of I0,000 scientific and technical doc~unents in French (4,150,000 words). Each document consists of one or two pages of text. This corpus describes research carried out by the research division of EDF, the French electricity company. Many diverse subjects are dealt with: nuclear energy, thermal energy, home automation, sociology, artificial intelligence, etc. Each document describes the objectives and stages of a research project on a particular subject. These documents are used to plan EDF research activity.</Paragraph>
    <Paragraph position="7"> Thus, the vocabulary used is either very technical, with subject field terms and candidate terms, or very general, with stylistic expressions, etc.</Paragraph>
    <Paragraph position="9"> Construction de thesaurus~Etatd'avancement ~--~ genera/ .... , _ ~ ~'-qphase d~ndus~/a/isafion expressions ~an a avancemen~ : - 2~-~ La phase d' industrialisation /... ~ Iconstrucfion de the~z , . . . &amp;quot; \[ ................ auru_ L indexatlon automatlque ........ -imqxexauon automauoue k terms A document with terms and general expressions II.2. Description of the Thesaurus The EDF thesaurus consists of 20,000 terms (including 6,000 synonyms) that cover a wide variety of fields (statistics, nuclear power plants, information retrieval, etc.). This reference system was created manually from corporate documents, and was validated with the help of many experts. Currently, updates are handled by a group of documentalists who regularly examine and insert new terms. One of the sources of new terms is the corpora. A linguistic and statistical extractor proposes candidate terms for validation by the documentalists. After validation, the documentalists must position the selected terms in the thesaurus. It's a difficult exercise because of the wide variety of fields.</Paragraph>
    <Paragraph position="10">  The thesaurus is composed of 330 semantic (or subject) fields included in 49 themes such as mathematics, sociology, etc.</Paragraph>
    <Paragraph position="11"> 'i th6ode des erreurs \[ I analyse discriminante I \]statistique \[ , s .ttmation I l analyse statistique \[:modUle statistique lineaire ! l analyse de la variance ~ Generic Relationship i'statistique n'on param6trique \]</Paragraph>
    <Section position="1" start_page="103" end_page="104" type="sub_section">
      <SectionTitle>
See Also Relationship
</SectionTitle>
      <Paragraph position="0"> Extract from the &amp;quot;statistics&amp;quot; sm-~tic field from the EDF thesaurus This example gives an overview of the various relations between terms. Each term belongs to a single semantic field. Each term is linked to other terms through a generic relation (arrow) or a neighbourhood relation (line). Other relations (e.g., synonym, translated by, etc.) exist, but are not shown in this example.</Paragraph>
      <Paragraph position="1"> IIdeg3. Document Indexing As a first step, the set of documents in the corpus is indexed.</Paragraph>
      <Paragraph position="2"> This consists of producing two types of indexes: candidate terms, and descriptors. The candidate terms are expressions that may become terms, and are submitted to an expert for validation. Descriptors are terms from the EDF thesaurus that are automatically recognised in the documents.</Paragraph>
      <Paragraph position="3"> II.3.1. Terminological Filtering In this experiment, terminological filtering is used for each document to produce terms that do not belong to the thesaurus, but which nonetheless might be useful to describe the documents. Moreover, these expressions are candidate terms that are submitted to experts or documentalists for validation.</Paragraph>
      <Paragraph position="4"> Linguistic and statistical terminological filtering are used. The method chosen for this experiment combines an initial linguistic extraction with statistical filtering \[STA 95b\].</Paragraph>
    </Section>
    <Section position="2" start_page="104" end_page="104" type="sub_section">
      <SectionTitle>
Linguistic Extraction
</SectionTitle>
      <Paragraph position="0"> Generally, it appears that the syntactical structure of a term in French language is the noun phrase. For example, in the EDF thesaurus, the syntactic structures of terms are distributed as follows:  Thus, term extraction is initially syntactical. It consists of applying seven recursive syntactic patterns to the corpus \[OGO 94\].</Paragraph>
      <Paragraph position="2"> The seven syntactic patterns for terminology extraction</Paragraph>
    </Section>
    <Section position="3" start_page="104" end_page="105" type="sub_section">
      <SectionTitle>
Statistical Filtering
</SectionTitle>
      <Paragraph position="0"> Linguistic extraction, however, is not enough. In fact, many expressions with a noun phrase structure are not terms. This includes general expressions, stylistic effects, etc. Statistical methods can thus be used, in a second step, to discriminate terms from nonterminological expressions. Three indicators are used here: Frequency: This is based on the fact that the more often an expression is found in the corpus, the more likely it is to be a term. This statement must be kept in proportion, however. Indeed, it seems that a small number of words (usually, very general uniterms) are very frequent, but are not terms.</Paragraph>
      <Paragraph position="1">  - Variance: This is based on the idea that the more the occurrences in a document of an expression are scattered, the more likely it is to be a term. This is the most effective indicator. Its drawback is that it also highlights large noun phrases in which the terms are  included.</Paragraph>
      <Paragraph position="2"> Local density \[STA 95b\]: This is based on the idea that the closer together the documents are that contain the expression, the more likely it is to be a term. The local density of an expression is the  mean of the cosines between documents which contain the given expression. A document is a vector in the Document Vector Space where a dimension is a term. This indicator highlights a certain number of terms that are not transverse to the corpus, but rather concentrated in documents that are close to each other. Nonetheless, this is not a very effective indicator for terms that are transverse to the corpus. For example, terms from computer science, which are found in a lot of documents, are not highlighted by this indicator. Results of the Tez~inological Extraction During this experiment, the terminological extraction ultimately produced 3,000 new terms that did not belong to the thesaurus. These new' terms are used in the various representation models described below. The initial linguistic extracting produced about 50,000 expressions.</Paragraph>
      <Paragraph position="3"> II.3.2. Controlled Indexing A supplementary way of characterising a document's contents is by recognising controlled terms in the document that belong to a thesaurus. To do this, an NLP technique is used \[BLO 92\]. Each sentence is processed on three levels: morphologically, syntactically, and semantically. These steps use a grammar and a general language dictionary.</Paragraph>
      <Paragraph position="4"> The method consists of breaking down the text fragment being processed by a series of successive transformations that may be syntactical (nominalisation, de-coordination, etc.), semantic (e.g., nuclear and atomic), or pragmatic (the thesaurus&amp;quot; synonym relationships are scanned to transform a synonym by its main form). At the end of these transformations, the decomposed text is compared to the list of documented terms of the thesaurus in order to supply the descriptors.</Paragraph>
    </Section>
    <Section position="4" start_page="105" end_page="106" type="sub_section">
      <SectionTitle>
Results of the Controlled Indexing
</SectionTitle>
      <Paragraph position="0"> Controlled indexing of the corpus supplied 4,000 terms (of 20,000 in the thesaurus). Each document was indexed by 20 to 30 terms. These documented terms, like the candidate terms, are used in the representation models described below. The quality of the indexing process is estimated at 70 percents (number of right terms divided by number of terms). The wrong terms are essentially due to problems of polysemy. Indeed some terms (generally uniterms) have multiple senses (for example &amp;quot;BASE&amp;quot;) and produce a great part of the noise.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML