<?xml version="1.0" standalone="yes"?>
<Paper uid="E93-1011">
  <Title>An Endogeneous Corpus-Based Method for Structural Noun Phrase Disambiguation</Title>
  <Section position="1" start_page="0" end_page="81" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper, we describe a method for structural noun phrase disambiguation which relies mainly on the examination of the text corpus under analysis and does not need to integrate any domain-dependent lexico-semantic or syntactico-semantic information. This method is implemented in the Terminology Extraction Software LEXTER. We first explain why the integration of LEXTER in the LEXTER-K project, which aims at building a tool for knowledge extraction from large technical text corpora, requires improving the quality of the terminology extracted by LEXTER.</Paragraph>
    <Paragraph position="1"> Then we briefly describe the way LEXTER works and show what kind of disambiguation it has to perform when parsing &amp;quot;maximal-length&amp;quot; noun phrases. We introduce a disambiguation method which relies on a very simple idea: whenever LEXTER has to choose among several competing noun sub-groups in order to disambiguate a maximal-length noun phrase, it checks whether each of these sub-groups occurs anywhere else in the corpus in a non-ambiguous situation, and then makes a choice. The analysis of a half-a-million-word corpus resulted in an efficient disambiguation strategy. The average rates are:
knowledge extraction from large technical text corpora
LEXTER is a Terminology Extraction Software (Bourigault, 1992a, 1992b). A corpus of French-language texts on any (technical) subject is fed in. LEXTER performs a grammatical analysis of this corpus and yields a list of noun phrases which are likely to be terminological units, representing the concepts of the subject field. This list, together with the corpus it has been extracted from, is then passed on to an expert for validation by means of a terminological hypertext web. LEXTER has been developed in an industrial context, in the Research and Development Division of Electricité de France. It was previously designed to deal with the problem of creating or updating thesauri used by an Automatic Indexing System.</Paragraph>
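The endogenous check described above can be sketched as follows. This is a minimal illustration of the idea, not LEXTER's actual implementation; the function name and the flat-list corpus representation are assumptions made for the example.

```python
from typing import List

def choose_subgroup(candidates: List[str],
                    corpus_noun_phrases: List[str]) -> List[str]:
    """Endogenous disambiguation sketch: among competing noun sub-groups,
    keep those that also occur elsewhere in the corpus as complete,
    hence non-ambiguous, noun phrases."""
    attested = [c for c in candidates if c in corpus_noun_phrases]
    # Fall back to all candidates when the corpus gives no evidence.
    return attested if attested else candidates

# Hypothetical example: two competing sub-groups inside a
# maximal-length noun phrase; only one is attested elsewhere.
corpus = ["measurement station", "water level", "level sensor"]
print(choose_subgroup(["water level", "measurement level"], corpus))
# → ['water level']
```

The fallback branch reflects the fact that the corpus itself is the only knowledge source: when it offers no evidence, the ambiguity is simply left unresolved.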
    <Paragraph position="2"> We are integrating LEXTER into a text analysis tool to aid knowledge acquisition in the framework of Knowledge-Based System construction. This tool (LEXTER-K) will propose a structured list of candidate terms, rather than a flat list, which can be considered as a first coarse-grained model of the information conveyed by the texts under analysis.</Paragraph>
    <Paragraph position="3"> Structuring of the terminology will be performed in two ways: on the one hand, by a structural analysis of the terminological noun phrases extracted by LEXTER; on the other hand, by an analysis of the sentences in which the candidate terms occur. This analysis will focus on the most relevant terms, determined by statistical processing based on the assumption that the most frequent terms are probably the most relevant.</Paragraph>
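The frequency assumption above amounts to ranking candidate terms by occurrence count. A minimal sketch (the function name and data are assumptions, not part of LEXTER-K):

```python
from collections import Counter
from typing import List

def most_relevant_terms(occurrences: List[str], top_n: int = 3) -> List[str]:
    """Rank candidate terms by corpus frequency, on the assumption that
    the most frequent terms are probably the most relevant."""
    return [term for term, _ in Counter(occurrences).most_common(top_n)]

# Hypothetical occurrence list collected from a corpus.
occurrences = ["water level", "pump", "water level",
               "valve", "water level", "pump"]
print(most_relevant_terms(occurrences, top_n=2))
# → ['water level', 'pump']
```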
    <Paragraph position="4"> We plan a two-stage architecture for LEXTER-K: (1) the extraction of the terminology of the subject field, by a robust grammatical analysis (LEXTER); (2) the syntactic analysis of the sentences by a parser using this terminology. The syntactic structures of the sentences in a text and the syntactic structures of the terminological units have to be placed on two different organisational levels. As the terminological unit is a semantic unit, it should be treated as such on the syntactic level as well. Dissociating these two analyses, with each taking advantage of the results given by the other, will make the parser more efficient, in particular by limiting the combinatorial explosion of structural ambiguities.</Paragraph>
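The two-stage architecture can be sketched as a simple pipeline. Both stages are toy stand-ins written for this illustration (the real LEXTER performs a grammatical analysis, not adjacent-word pairing); every name and the stopword list are assumptions.

```python
from typing import List, Set

STOPWORDS = {"the", "a", "an", "of"}

def extract_terminology(sentences: List[str]) -> Set[str]:
    """Stage 1 stand-in: a robust, superficial pass proposing multi-word
    candidate terms (here: naive adjacent content-word pairs)."""
    terms = set()
    for sentence in sentences:
        words = sentence.lower().split()
        for w1, w2 in zip(words, words[1:]):
            if w1 not in STOPWORDS and w2 not in STOPWORDS:
                terms.add(f"{w1} {w2}")
    return terms

def parse_with_terminology(sentence: str, terminology: Set[str]) -> List[str]:
    """Stage 2 stand-in: tokenise a sentence, grouping each known
    terminological unit into a single token, so the parser sees one
    syntactic unit per semantic unit and faces fewer ambiguities."""
    words = sentence.lower().split()
    tokens, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in terminology:
            tokens.append(pair)  # one semantic unit, one syntactic unit
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

corpus = ["the water level rises", "check the water level"]
terminology = extract_terminology(corpus)
print(parse_with_terminology("the water level rises", terminology))
# → ['the', 'water level', 'rises']
```

The point of the sketch is the division of labour: stage 2 consumes stage 1's output, so the parser never has to re-derive the internal structure of a terminological unit.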
    <Paragraph position="5"> It is well known that Natural Language systems usually require considerable knowledge acquisition, especially the building of a field-specific vocabulary for systems which have to analyse technical texts. We think that this two-stage analysis (extracting the terminology of the domain with a robust superficial analysis, then analysing the texts with a more in-depth parser using this terminology) may lighten the expensive burden of hand-coding a specialised language and may lead to more generic and domain-independent Natural Language systems.</Paragraph>
    <Paragraph position="6"> As long as the terminology extracted during the first stage directly feeds the syntactic analyser without having been validated by an expert, as is currently the case, this two-stage architecture requires a higher quality of the terminology extracted by LEXTER. This is why we tried to improve the precision rate in the detection of terminological noun phrases by implementing an efficient strategy for structural noun phrase disambiguation. This strategy is described in the following sections.</Paragraph>
  </Section>
</Paper>