<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0707">
  <Title>Probabilistic Models for PP-attachment Resolution and NP Analysis</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Nuclei and sequences of nuclei
</SectionTitle>
    <Paragraph position="0"> We first take the general view that our problem can be formulated as one of finding dependency relations between nuclei. Without loss of generality, we define a nucleus to be a unit that contains both the syntactic and semantic head and that exhibits only unambiguous internal syntactic structure. For example, the base NP &quot;the white horse&quot; is a nucleus, since the attachments of both the determiner and the adjective to the noun are straightforward. The segmentation into nuclei relies on a manually built chunker, similar to the one described in (Ait-Mokhtar and Chanod, 1997), and resembles the one proposed in (Samuelsson, 2000). The motivation for this assumption is twofold. First, the amount of grammatical information carried by individual words varies greatly across language families.</Paragraph>
    <Paragraph position="1"> Grammatical information carried by function words in non-agglutinative languages, for instance, is realized morphologically in agglutinative languages. A model manipulating dependencies at the word level only would be constrained to the specific amount of grammatical and lexical information associated with words in a given language. Nuclei, on the other hand, tend to correspond to phrases of the same type across languages, so that relying on the notion of nucleus makes the approach more portable. A second motivation for considering nuclei as elementary units is that their internal structure is by definition unambiguous, so that there is no point in applying any algorithm whatsoever to disambiguate them.</Paragraph>
    <Paragraph position="2"> We view each nucleus as being composed of several linguistic layers carrying different information: a semantic layer comprising the possible semantic classes for the word under consideration; a syntactic layer made of the POS category of the word and its gender and number information; and a lexical layer consisting of the word itself (referred to below as the lexeme) and, for prepositional phrases, the preposition. For nuclei comprising more than two non-empty words (such as &quot;the white horse&quot;), we retain only one lexeme, the one associated with the last word, which is considered to be the head word in the sequence. Except for the semantic information, all the necessary information is present in the output of the chunker. The semantic lexicon we used was encoded as a finite-state transducer, which was consulted to inject semantic classes into each nucleus. When no semantic information is available for a given word, we use its part-of-speech category as its semantic class [footnote 1: ... have not made use of this hierarchy in our experiments.]. For example, starting with the sentence:</Paragraph>
    <Paragraph position="3"> Il doit rencontrer le président de la fédération française. (He has to meet the president of the French federation.) we obtain the following sequence of nuclei:  As we see in this example, the semantic resource we use is incomplete and partly questionable. The attribute HUMAN for federation can be understood if one views a federation as a collection of human beings, which we believe is the rationale behind this annotation. However, a federation also is an institution, a sense which is missing in the resource we use.</Paragraph>
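The layered view of a nucleus described above can be sketched as a small record type. This is a minimal illustration in Python, not the authors' implementation: the `Nucleus` class, the `make_nucleus` helper, the POS and agreement labels, and the dictionary standing in for the finite-state semantic lexicon are all assumptions introduced here. The only lexicon entry, HUMAN for fédération, is the annotation discussed in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-in for the finite-state semantic lexicon; the single
# entry below is the annotation discussed in the text.
SEMANTIC_LEXICON = {"fédération": ("HUMAN",)}


@dataclass(frozen=True)
class Nucleus:
    # Lexical layer: head lexeme, plus the preposition for prepositional phrases.
    lexeme: str
    preposition: Optional[str]
    # Syntactic layer: POS category with gender and number (labels are assumed).
    pos: str
    gender: Optional[str]
    number: Optional[str]
    # Semantic layer: possible semantic classes of the head word.
    semantic_classes: Tuple[str, ...]


def make_nucleus(lexeme, pos, gender=None, number=None, preposition=None,
                 lexicon=SEMANTIC_LEXICON):
    # Fall back to the POS category when the lexicon has no entry for the word.
    classes = lexicon.get(lexeme, (pos,))
    return Nucleus(lexeme, preposition, pos, gender, number, classes)


president = make_nucleus("président", "NOUN", gender="m", number="sg")
federation = make_nucleus("fédération", "NOUN", gender="f", number="sg",
                          preposition="de")
print(president.semantic_classes)
print(federation.semantic_classes)
```

Because président has no entry in the toy lexicon, its semantic layer falls back to its POS label, while fédération keeps the class found in the lexicon.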
    <Paragraph position="4"> In the preceding example, the preposition de can be attached to the verb rencontrer or to the noun président. It cannot be attached to the pronoun il.</Paragraph>
    <Paragraph position="5"> Insofar as terminology extraction is our final objective, président de la fédération française can be deemed a good candidate term. However, to identify this unit accurately, we need high confidence that the preposition de attaches to the noun président. To reduce the combinatorics of attachment ambiguities, sentences can be conveniently segmented into smaller self-contained units according to some heuristics. We define safe chains as sequences of nuclei in which all the items but the first are attached to other nuclei within the chain itself. In the preceding example, for instance, only the nucleus associated with rencontrer is not attached to a nucleus within the chain rencontrer ... française. This chain is thus a safe chain. To keep the number of alternative (combinations of) attachments as low as possible, we are interested in isolating safe chains that are as short as possible given the information available at this point, i.e. words and their parts-of-speech (the knowledge of semantic classes is of little help in this task).</Paragraph>
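The definition of a safe chain can be checked mechanically once attachments are known. The sketch below is only an illustration of the definition: the function name `is_safe_chain`, the index-based encoding of attachments, and the attachments chosen for the example sentence are assumptions made here (the paper does not give a full analysis of the sentence, and in the unsupervised setting these attachments are precisely what is unknown).

```python
def is_safe_chain(heads, start, end):
    """Return True if nuclei start..end-1 form a safe chain.

    heads[i] is the index of the nucleus that nucleus i attaches to,
    or None when nucleus i attaches to nothing.
    """
    span = list(range(start, end))
    inside = set(span)
    # Every item but the first must attach to another nucleus in the chain.
    return all(heads[i] in inside and heads[i] != i for i in span[1:])


# Nuclei: 0 il, 1 rencontrer, 2 président, 3 de la fédération, 4 française.
# Illustrative attachments (assumed for this sketch, not taken from the paper).
heads = [1, None, 1, 2, 3]
print(is_safe_chain(heads, 1, 5))  # chain starting at rencontrer
print(is_safe_chain(heads, 0, 5))  # chain starting at il
```

Under these assumed attachments, the span starting at rencontrer satisfies the definition, whereas the span starting at il does not, because rencontrer would then be a non-initial item with no attachment inside the chain.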
    <Paragraph position="6"> In French, except for a few cases involving embedded clauses and coordination, the following heuristic can be used to identify &quot;minimal&quot; safe chains: extract the longest sequences that begin with a nominal, verbal, prepositional or adjectival nucleus and contain only nominal, prepositional, adjectival, adverbial or verbal nuclei in indefinite moods.</Paragraph>
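The heuristic above can be read as a single left-to-right scan over the nuclei. The following is a minimal sketch under stated assumptions: the category labels, the `minimal_safe_chains` function, and the toy input (a simplified rendering of the example sentence) are introduced here for illustration. In particular, the reading that a finite verbal nucleus may open a chain but not continue one is an interpretation of the heuristic, and the exceptions for embedded clauses and coordination are not handled.

```python
# Categories allowed to open a chain and to continue one (labels are assumed).
START = frozenset({"NOM", "VERB_FINITE", "VERB_NONFINITE", "PREP", "ADJ"})
CONTINUE = frozenset({"NOM", "PREP", "ADJ", "ADV", "VERB_NONFINITE"})


def minimal_safe_chains(nuclei):
    """Greedily split (lexeme, category) pairs into maximal candidate chains."""
    chains = []
    current = []
    for nucleus in nuclei:
        category = nucleus[1]
        if current and category in CONTINUE:
            # The open chain absorbs every admissible nucleus.
            current.append(nucleus)
            continue
        if current:
            # The open chain ends at the first inadmissible nucleus.
            chains.append(current)
            current = []
        if category in START:
            current = [nucleus]
    if current:
        chains.append(current)
    return chains


# Simplified, assumed rendering of the example sentence as nuclei.
sentence = [
    ("il", "PRON"),
    ("rencontrer", "VERB_NONFINITE"),
    ("président", "NOM"),
    ("fédération", "PREP"),
    ("française", "ADJ"),
]
for chain in minimal_safe_chains(sentence):
    print([lexeme for lexeme, _ in chain])
```

On this toy input the pronoun is skipped and a single chain running from rencontrer to française is returned, which matches the chain discussed in the text.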
    <Paragraph position="7"> There is a tension in parameter estimation of probabilistic models between relying on accurate information and relying on enough data. In an unsupervised approach to PP-attachment resolution and NP analysis, accurate information in the form of dependency relations between words is not directly accessible. However, specific configurations can be identified from which accurate information can be extracted. Safe chains provide such configurations. Indeed, if there is only one possible attachment site to the left of a nucleus, then its attachment is unambiguous. Because of the ambiguities the French language displays (e.g. a preposition can be attached to a noun, a verb or an adjective), only the first two nuclei of a safe chain provide reliable information (we skip adverbs, the attachment of which obeys specific and simple rules). From the preceding example, for instance, we can infer a direct relation between rencontrer and président, but this is the only attachment we can be sure of. The use of less reliable information sources for model parameters whose estimation would otherwise require manual supervision is the object of an experiment described in Section 6.</Paragraph>
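Collecting the reliable attachments described above amounts to taking the first two non-adverbial nuclei of each safe chain. The sketch below illustrates this under assumptions made here: chains are lists of (lexeme, category) pairs, the category labels and the function names `reliable_pair` and `count_reliable_pairs` are hypothetical, and the counts are only one plausible way such pairs might feed parameter estimation.

```python
from collections import Counter


def reliable_pair(chain):
    """Return (head lexeme, dependent lexeme) for a safe chain, or None."""
    # Adverbs are skipped, since their attachment follows separate rules.
    content = [lexeme for lexeme, category in chain if category != "ADV"]
    first_two = content[:2]
    if len(first_two) == 2:
        return (first_two[0], first_two[1])
    return None


def count_reliable_pairs(chains):
    """Count the unambiguous head-dependent pairs over a list of safe chains."""
    counts = Counter()
    for chain in chains:
        pair = reliable_pair(chain)
        if pair is not None:
            counts[pair] += 1
    return counts


# Safe chain from the example sentence (category labels are assumed).
example_chain = [
    ("rencontrer", "VERB"),
    ("président", "NOM"),
    ("fédération", "PREP"),
    ("française", "ADJ"),
]
print(reliable_pair(example_chain))
```

For the example chain, the only pair extracted is the relation between rencontrer and président; the later nuclei are left out because their attachment sites remain ambiguous.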
  </Section>
</Paper>