<?xml version="1.0" standalone="yes"?>
<Paper uid="E93-1011">
  <Title>An Endogeneous Corpus-Based Method for Structural Noun Phrase Disambiguation</Title>
  <Section position="2" start_page="81" end_page="81" type="metho">
    <SectionTitle>
2 The issue of parsing the
ambiguous &amp;quot;Maximal-Length&amp;quot;
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
Noun Phrases
</SectionTitle>
      <Paragraph position="0"> In this section, we briefly describe the type of grammatical analysis performed by LEXTER to extract likely terminological units and show what kind of disambiguation LEXTER has to perform.</Paragraph>
      <Paragraph position="1"> As we already pointed out, LEXTER has been achieved in an industrial context; from the beginning of the project, we had decided to focus upon a strongly restrictive criterium : applying and testing the system over a wide range of texts. The texts to be analysed are unrestricted texts gathered in large corpora. We had then to choose a fast and well-proved method. Moreover, we argued that, given the restricted grammatical structures of complex terminological units, it was not necessary to go into a complete syntactic analysis of the sentences to extract the terminology from a corpus (Bourigault, 1992b).</Paragraph>
      <Paragraph position="2"> First, a morphological analyser tags the texts, using a large lexical database and rules of lexical disambiguation. LEXTER treats texts in which each word is tagged with a grammatical category (noun, verb, adjective, etc.). LEXTER works in two main phases : (1) splitting and (2) parsing.</Paragraph>
      <Paragraph position="3">  (1) At the splitting stage, LEXTER takes advantage of &amp;quot;negative&amp;quot; knowledge about the form of  terminological units, by identifying those string level patterns which never go to make up these units and which can thus be considered as potential terminological limits. Such patterns are made up by, say, conjugated verbs, pronouns, conjonctions, certain strings of preposition + determiner. The splitting module is thus set up with a base of about 60 rules for identifying frontier markers, which it uses to split the texts. The splitting phase produces a series of text sequences, most often noun phrases. These noun phrases may well be likely terminological units themselves, but more often than not, they contain sub-groups which are also likely units. That is why it is preferable at the splitting stage to refer to the noun phrases identified as &amp;quot;maximal-length noun phrases&amp;quot;. Here is an example of a real maximal-length noun phrase : MESURE DU DEBIT DU VENTILATEUR</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="81" end_page="81" type="metho">
    <SectionTitle>
D'EXTRACTION AVEC TRAPPE EN POSITION
</SectionTitle>
    <Paragraph position="0"> FERMEE (noun prep det noun prep det noun prep noun prep noun prep noun adj).</Paragraph>
    <Paragraph position="1"> (2) At the parsing stage, LEXTER parses the maximal-length noun phrases (henceforth MLNP) in order to generate sub-groups, in addition to the MLNP, which are likely terminological units by virtue of their grammatical structure and their position in the MLNP. The LEXTER parsing module is made up of parsing rules which indicate which sub-groups to extract from a MLNP on the basis of grammatical structure.</Paragraph>
    <Paragraph position="2"> Some of the MLNP structures are non-ambiguous : given such a structure it can be stated with a very high rate of certainty (Bourigault 1992a) that only one parsing is valid. The corresponding parsing rules are called non-ambiguous rules. For example, structure (1) is non-ambiguous, and parsing rule \[a\] is a non-ambiguous rule.</Paragraph>
  </Section>
  <Section position="4" start_page="81" end_page="82" type="metho">
    <SectionTitle>
FUSIBLE THERMIQUE DE FERMETURE
FUSIBLE THERMIQUE
</SectionTitle>
    <Paragraph position="0"> Some of the MLNP structures are ambiguous, that is, given such a structure it cannot be stated with a sufficient rate of certainty that only one parsing is valid. Several sub-structures compete. The corresponding ambiguous parsing rules generate several competing sub-groups. For example, when information about gender or number agreement are not available or of no help, structure (2) is ambiguous, that is, either  the adjective attaches the head noun1 of the noun sub-group noun1 prep noun2, or it attaches noun2, constituting the noun sub-group (noun2 adj); the competing noun sub-groups (noun1 prep noun2) and (noun2 adj) will be generated by the ambiguous parsing rule \[b\]. Structure (3) and parsing rule \[c\] are other examples of ambiguous structure and rule.</Paragraph>
    <Paragraph position="1"> (2) noun1 prep noun2 adj parsing rule \[bl  The issue is how to disambiguate in cases of MLNP with ambiguous structures, that means, whenever an ambiguous rule applies, how to choose among the competing generated sub-groups. The strategy of disambiguation is described in the next section.</Paragraph>
  </Section>
  <Section position="5" start_page="82" end_page="83" type="metho">
    <SectionTitle>
3 Strategy of disambiguation: looking at non-ambiguous situations elsewhere in the corpus
</SectionTitle>
    <Paragraph position="0"> looking at non ambiguous situations anywhere else in the corpus The strategy of disambiguation relies on a very simple idea, that of looking for non-ambiguous situations elsewhere in the corpus. Whenever an ambiguous rule applying to a MLNP with an ambiguous structure generates competing sub-groups, LEXTER (1) checks each of them to ascertain if it has been detected in a non-ambiguous situation (i.e. generated by a non-ambiguous rule) somewhere else in the corpus, and (2) chooses among the competing sub-groups using a set of disambiguation rules.</Paragraph>
    <Paragraph position="1"> There is one specific set of disambiguation rules for each ambiguous structure, which covers all the possibles situtations, that is, all, some, only one, none of the competing sub-group non-ambiguously detected.</Paragraph>
    <Section position="1" start_page="82" end_page="83" type="sub_section">
      <SectionTitle>
3.1 Situations where none of the competing sub-groups has been non-ambiguously detected
</SectionTitle>
      <Paragraph position="0"> sub-groups has been non-ambiguously detected Given an ambiguous Maximal-Length Noun Phrase, if none of the competing sub-groups has been detected in a non-ambiguous situation, LEXTER proposes only this MLNP, without any sub-group. No disambiguation is performed. On our half-a-million words test corpus, the total number of non-ambiguous MLNPs is 13,591, the total number of ambiguous MLNPs is 3,230, among which 880 are not disambiguated. The average rate of no-disambiguation is 27%. Rates of no-disambiguation for the ten most frequent ambiguous structures are shown in Table 1.</Paragraph>
      <Paragraph position="2"> noun prep noun adj noun prep noun prep det noun noun adj noun noun prep noun noun noun prep noun prep noun noun noun adj noun noun noun noun noun prep noun noun prep noun prep noun adj noun prep noun adj adj (2)  (1) ambiguous Maximal-Length Noun Phrase (MLNP) structure (2) total number of MLNP with this structure on the half-a-million words test corpus (3) number of cases where none of the competing subgroups has been detected in a non-ambiguous situation (4) rate of no-disambiguation  We are investigating rules that could perform a correct disambiguation in some of these cases. Choosing the right sub-group can be done by checking for each competing sub-group if it has been generated from the analysis of other ambiguous MLNP. For example, RE JET D'AIR FROID and CIRCUIT D'AIR FROID are two ambiguous MLNP extracted from the test corpus and parsed by parsing rule \[lo\] above :</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="83" end_page="84" type="metho">
    <SectionTitle>
REJET D'AIR FROID
# REJET D'AIR
# AIR FROID
CIRCUIT D'AIR FROID
# CIRCUIT D'AIR
# AIR FROID
</SectionTitle>
    <Paragraph position="0"> Since none of the sub-groups RE JET D'AIR and AIR FROID on the one hand, and CIRCUIT D'AIR and AIR FROID on the other hand, have been detected in non-ambiguous situations, the MLNPs RE JET D'AIR FROID and CIRCUIT D'AIR FROID have not been disambiguated by LEXTER. But comparing the parsings of these ambiguous MLNPs (AfR FROID generated in both cases) can lead to the hypothesis that extracting AIR FROID is the correct way of disambiguating them. This hypothesis is reinforced by the fact that the pattern AIR + adj. is very productive in the corpus (AIR EXTERIEUR, AIR FRAIS, AIR NEUF, AIR AMBIANT, AIR RECYCLE, etc.).</Paragraph>
    <Paragraph position="1"> Our experiments show that such situations (sub-groups never non-ambiguously detected but generated from different ambiguous MLNPs) are very rare and this explains why we have no specific treatment for them yet.</Paragraph>
    <Section position="1" start_page="84" end_page="84" type="sub_section">
      <SectionTitle>
3.2 Situations where at least one competing sub-group has been non-ambiguously detected
</SectionTitle>
      <Paragraph position="0"> sub-group has been non-ambiguously detected Given an ambiguous MLNP, systematically keeping (all) the competing sub-group(s) detected elsewhere in the corpus in a non-ambiguous situation is not a satisfying principle of disambiguation. We need more precise rules of disambiguation.</Paragraph>
      <Paragraph position="1"> For example, for each of the parsing rules \[b\] and \[c\], in more than 20 % of the cases on our test corpus (see top left cells of Table 2 and Table 3), both competing sub-groups have been non-ambiguously detected. That means that both sub-groups are attested valid noun phrases as such. However, only one of them corresponds to a correct parsing of the MLNP they have been extracted from. In these cases keeping both sub-groups would alter the precision rate since one of them is not grammatically valid. On the contrary, generating none of them would alter the recall rate. We chose to build a set of disambiguation rules for each of the ambiguous parsing rules.</Paragraph>
      <Paragraph position="2"> To work out the disambiguation rules, we adopted an empirical approach based on large-scale corpus experimentation. For each of the ambiguous structures, we examined all the different situations of disambiguation (only one, more than one, all the competing sub-groups non-ambiguously detected) and for each of them, we parsed by hand a significative number of ambiguous noun phrases extracted from a reference test corpus. Applied to the ambiguous parsing rules \[b\] and \[c\], this approach led us to the following set of disambiguation rules (see Table 2 and Table 3) : where both competing sub-groups have been non-ambiguously detected, we checked that most often (125 cases/141 for rule \[b\], 50 cases/52 for rule \[c\]) the correct parsing isolates the second sub-group,noun2 adj for rule \[b\], noun2 prep noun3 for rule \[c\] (see the top left cells of Table 2 and Table 3).</Paragraph>
      <Paragraph position="3"> where only the second sub-group has been non-ambiguously detected, it always corresponds to the correct parsing and so it is systematically kept (see the top right cells of Table 2 and Table 3).</Paragraph>
      <Paragraph position="4"> parsing rules \[b\] and \[c\] differs for the situations where only the first sub-group (nounl prep noun2 for both rules) has been non-ambiguously detected.</Paragraph>
      <Paragraph position="5"> For rule \[b\], this sub-group is kept since it most often corresponds to a correct parsing of the MLNP (175 cases/190, see the bottom left cell of Table 2). On the contrary, for rule \[c\], no systematic rule can be stated since the correct parsing sometimes isolates this non-ambiguously detected sub-group, but often isolates the second one (noun2 prep noun3), altough it appears nowhere else in the corpus in a non-ambiguous situation (see the bottom left cell of Table 3). This mainly happens in cases of &amp;quot;elliptical denominations&amp;quot;, that is, a concept is first designated in a text by a &amp;quot;complete&amp;quot; term (for example,</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="84" end_page="84" type="metho">
    <SectionTitle>
CIRCUIT D'ASPERSION D'ENCEINTE), and then
</SectionTitle>
    <Paragraph position="0"> is systematically refered to with an &amp;quot;elliptical&amp;quot; term (for example, CIRCUIT D'ASPERSION).</Paragraph>
    <Paragraph position="1"> The results we obtained with such sets of desambiguation rules (see Table 4) are satisfactory and show that the strategy described in this paper is efficient. This is partly due to the fact that terminological noun phrases are fixed never disconnected sequences of words with constrained grammatical structures. Our strategy was not designed to deal with adjective and prepositional phrase attachment in unrestricted noun phrases.</Paragraph>
  </Section>
class="xml-element"></Paper>