<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0905">
  <Title>Using Co-Composition for Acquiring Syntactic and Semantic Subcategorisation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), Philadelphia
</SectionTitle>
    <Paragraph position="0"> Whatever set of semantic tags is used to annotate the training corpus, it is not obvious that the available tags are the most appropriate for extracting domain-specific semantic restrictions. If the tags were created specifically to capture corpus-dependent restrictions, there could be serious problems concerning portability to a new specific domain.</Paragraph>
    <Paragraph position="1"> By contrast, unsupervised strategies for acquiring selection restrictions do not require a training corpus to be semantically annotated using pre-existing lexical hierarchies (Sekine et al., 1992; Dagan et al., 1998; Grishman and Sterling, 1994). They require only a minimum of linguistic knowledge in order to identify &amp;quot;meaningful&amp;quot; syntactic dependencies. In Grefenstette's terminology, they can be classified as &amp;quot;knowledge-poor approaches&amp;quot; (Grefenstette, 1994). Semantic preferences are induced merely from co-occurrence data, i.e., by using a similarity measure to identify words which occur in the same dependencies. It is assumed that two words are semantically similar if they appear in the same contexts and syntactic dependencies. Consider, for instance, that the verb ratify frequently appears with the noun organisation in the subject position. Moreover, suppose that this noun turns out to be similar, in a particular corpus, to other nouns: e.g., secretary and council. It follows that ratify selects not only for organisation, but also for its similar words. This seems right. However, suppose that organisation also appears in expressions like the organisation of society began to be disturbed in the last decade, or they are involved in the actual organisation of things, with a significantly different meaning. In this case, the noun denotes a particular kind of process. It seems obvious that its similar words, secretary and council, cannot appear in such subcategorisation contexts, since they are related to the other sense of the word. Soft clusters, in which words can be members of different clusters to different degrees, might solve this problem to a certain extent (Pereira et al., 1993). We claim, however, that class membership should be modeled by boolean decisions. 
Since subcategorisation contexts require words in boolean terms (i.e., words are either required or not required), words are either members or not members of specific subcategorisation classes.</Paragraph>
    <Paragraph position="2"> Hence, we propose a clustering method in which a word may be gathered into different boolean clusters, each cluster representing the semantic restrictions imposed by a class of subcategorisation contexts. This paper describes an unsupervised method for acquiring information on syntactic and semantic subcategorisation from partially parsed text corpora.</Paragraph>
    <Paragraph position="3"> The main assumptions underlying our proposal will be introduced in the following section. Then, section 3 will present the two steps of our learning method: extraction of candidate subcategorisation restrictions and conceptual clustering. In section 4, we will show how the dictionary entries are provided with the learned information. The accuracy and coverage of this information will be measured in a particular application: attachment resolution.</Paragraph>
    <Paragraph position="4"> The experiments presented in this paper were performed on 1.5 million words belonging to the P.G.R. (Portuguese General Attorney Opinions) corpus, a domain-specific Portuguese corpus containing case-law documents.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Underlying Assumptions
</SectionTitle>
    <Paragraph position="0"> Our acquisition method is based on two theoretical assumptions. First, we assume a very general notion of linguistic subcategorisation. More precisely, we consider that in a &amp;quot;head-complement&amp;quot; dependency, not only does the head impose constraints on the complement, but the complement also imposes linguistic requirements on the head. Following Pustejovsky's terminology, we call this phenomenon &amp;quot;co-composition&amp;quot; (Pustejovsky, 1995). So, for a particular word, we attempt to learn both what kind of complements and what kind of heads it subcategorises.</Paragraph>
    <Paragraph position="1"> For instance, consider the compositional behavior of the noun republic in a domain-specific corpus. On the one hand, this word appears in the head position within dependencies such as republic of Ireland, republic of Portugal, and so on. On the other hand, it appears in the complement position in dependencies like president of the republic, government of the republic, etc. Given that there are interesting semantic regularities among the words co-occurring with republic in such linguistic contexts, we attempt to implement an algorithm letting us learn two different subcategorisation contexts: ⟨N↓; de; republic↑⟩, where the preposition de introduces a binary relation between the word republic in the role of &amp;quot;head&amp;quot; (a role noted by the arrow &amp;quot;↑&amp;quot;) and those words that can be its &amp;quot;complements&amp;quot; (the complement role is noted by the arrow &amp;quot;↓&amp;quot;). This subcategorisation context semantically requires complements referring to particular nations or states (indeed, only nations or states can be republics).</Paragraph>
    <Paragraph position="2"> ⟨N↑; de; republic↓⟩: this represents a subcategorisation context that must be filled by those heads denoting specific parts of the republic: e.g., institutions, organisations, functions, and so on.</Paragraph>
    <Paragraph position="3"> Note that the notion of subcategorisation restriction we use in this paper embraces both syntactic and semantic preferences.</Paragraph>
    <Paragraph position="4"> The second assumption concerns the procedure for building classes of similar subcategorisation contexts. We assume, in particular, that different subcategorisation contexts are considered to be semantically similar if they have the same word distribution. Let's take, for instance, the following contexts:</Paragraph>
    <Paragraph position="6"> All of them seem to share the same semantic preferences. As these contexts require words denoting the same semantic class, they tend to possess the same word distribution. Moreover, we also assume that the set of words required by these similar subcategorisation contexts represents the extensional description of their semantic preferences. Indeed, since the words minister, president, assembly, . . .</Paragraph>
    <Paragraph position="7"> have similar distribution on those contexts, they may be used to build the extensional class of nouns that actually fill the semantic requirements of the contexts. Such words are, then, semantically subcategorised by them. Unlike most unsupervised methods to selection restrictions acquisition, we do not use the well-known strategy for measuring word similarity based on distributional hypothesis. According to this assumption, words cooccurring in similar subcategorisation contexts are semantically similar.</Paragraph>
    <Paragraph position="8"> Yet, as has been said in the Introduction, such a notion of word similarity is not sensitive to word polysemy. By contrast, the aim of our method is to measure semantic similarity between subcategorisation contexts. This allows us to assign a polysemic word to different contextual classes of subcategorisation.</Paragraph>
    <Paragraph position="9"> This strategy is also used in the Asium system (Faure and N'edellec, 1998; Faure, 2000).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Subcategorisation Acquisition
</SectionTitle>
    <Paragraph position="0"> To evaluate the hypotheses presented above, a software package was developed to support the automatic acquisition of syntactic and semantic subcategorisation information. The learning strategy consists mainly of two sequential procedures.</Paragraph>
    <Paragraph position="1"> The first one aims to extract subcategorisation candidates, while the second one both identifies correct subcategorisation candidates and gathers them into semantic classes of subcategorisation. The two procedures are described in detail in the remainder of the section.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Extraction of Candidates
</SectionTitle>
      <Paragraph position="0"> We have developed the following procedure for extracting those syntactic patterns that could later become true subcategorisation contexts. Raw text is tagged (Marques, 2000) and then analyzed using some capabilities of the shallow parser introduced in (Rocio et al., 2001). The parser yields a single partial syntactic description of sentences, which are analyzed as sequences of basic chunks (NP, PP, VP, . . . ). Then, attachment is temporarily resolved by a simple heuristic based on right association (a chunk tends to attach to another chunk immediately to its right). Following our first assumption in section 2, we consider that the word heads of two attached chunks form a binary dependency that is likely to be split into two subcategorisation contexts. It can easily be seen that syntactic errors may appear, since the attachment heuristic does not take distant dependencies into account.1 Because of such attachment errors, the identified subcategorisation contexts are treated here as mere hypotheses; hence they are mere subcategorisation candidates. Finally, the set of words appearing in each subcategorisation context is viewed as a candidate semantic class.</Paragraph>
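The extraction step can be sketched as follows. This is a minimal sketch under our own simplifying assumptions: chunks are given as (category, head) pairs, a PP carries its preposition, and the slot markers COMP_SLOT/HEAD_SLOT are our notation for the two co-composition directions:

```python
# Toy chunk sequence for "emanou de facto da lei".
chunks = [("VP", "emanar"), ("PP", ("de", "facto")), ("PP", ("de", "lei"))]

def extract_candidates(chunks):
    """Right-association heuristic: attach each chunk to the chunk on its
    right, then split every binary dependency into two subcategorisation
    contexts, one per co-composition direction."""
    candidates = []
    for left, right in zip(chunks, chunks[1:]):
        head = left[1][1] if isinstance(left[1], tuple) else left[1]
        prep, comp = right[1] if isinstance(right[1], tuple) else (None, right[1])
        # Head-side context: the head plus an open complement slot.
        candidates.append((head, prep, "COMP_SLOT"))
        # Complement-side context: the complement plus an open head slot.
        candidates.append(("HEAD_SLOT", prep, comp))
    return candidates

for c in extract_candidates(chunks):
    print(c)
```

On the toy sequence this produces the two attachments of the example phrase and, from them, the four subcategorisation candidates mentioned in the text.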
      <Paragraph position="1"> For example, the phrase emanou de facto da lei ([it] emanated in fact from the law) 1The errors are caused not only by this restrictive attachment heuristic, but also by further problems, e.g., words missing from the dictionary, words incorrectly tagged, other sorts of parser limitations, etc.</Paragraph>
      <Paragraph position="2"> would produce the following two attachments: ⟨emanar↑; de; facto↓⟩ and ⟨facto↑; de; lei↓⟩, from which the following 4 subcategorisation candidates are generated:</Paragraph>
      <Paragraph position="4"> Since the prepositional complement de facto is an adverbial locution interpolated between the verb and its real complement da lei, the two proposed attachments are odd. Hence, the four subcategorisation contexts should not be acquired. We will see how our algorithm allows us to learn subcategorisation information that is later used to invalidate such odd attachments and propose new ones. The algorithm basically works by comparing the similarity between the word sets associated with each subcategorisation candidate.</Paragraph>
      <Paragraph position="5"> Note, finally, that unlike in many learning approaches, information on co-composition is available for the characterization of syntactic subcategorisation contexts. In (Gamallo et al., 2001b), a strategy for measuring word similarity based on the co-composition hypothesis was compared to Grefenstette's strategy (Grefenstette, 1994). Experimental tests demonstrated that co-composition allows a finer-grained characterization of &amp;quot;meaningful&amp;quot; syntactic contexts.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Clustering Similar Contexts
</SectionTitle>
      <Paragraph position="0"> According to the second assumption introduced above (section 2), two subcategorisation contexts with a similar word distribution should have the same extensional definition and, therefore, the same selection restrictions. This way, the word sets associated with two similar contexts are merged into a more general set, which represents their extensional semantic preferences. Consider the following two subcategorisation contexts and the words that appear in them:</Paragraph>
      <Paragraph position="2"> Since both contexts have a similar word distribution, it can be argued that they share the same selection restrictions. Furthermore, it can be inferred that the words associated with them are all co-hyponyms belonging to the same context-dependent semantic class. In our corpus, the context glossed as (to infringe) is considered similar not only to the context glossed as (infringement of), but also to other contexts such as those glossed as (to respect) and (to apply). In this section, we will specify the procedure for learning context-dependent semantic classes by comparing the similarity between the previously extracted contextual word sets. This will be done in two steps: filtering and clustering.</Paragraph>
      <Paragraph position="3">  As has been said in the introduction, the co-operative system Asium also extracts similar subcategorisation contexts (Faure and N'edellec, 1998; Faure, 2000). This system requires the interactive participation of a language specialist in order for the contextual word sets to be filtered and cleaned before they are taken as input to the clustering strategy. Such a co-operative method requires the manual removal from the sets of those words that have been incorrectly tagged or analyzed. Our strategy, by contrast, attempts to remove incorrect words from the contextual sets automatically. Automatic filtering requires the following subtasks. First, each word set is associated with a list of its most similar contextual sets. Intuitively, two sets are considered similar if they share a significant number of words. Various similarity coefficients were tested to create the lists of similar sets.</Paragraph>
      <Paragraph position="4"> The best results were achieved using a particular weighted version of the Jaccard coefficient, where words are weighted considering both their dispersion and their relative frequency for each context (Gamallo et al., 2001a).</Paragraph>
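One plausible form of such a weighted Jaccard coefficient is sketched below. The exact weighting scheme of the paper (a function of dispersion and relative frequency) follows (Gamallo et al., 2001a) and is not reproduced here; in this sketch the weights are simply given as a dictionary of scores:

```python
def weighted_jaccard(set_a, set_b, weight):
    """Weighted Jaccard: sum of weights over the intersection divided by
    the sum of weights over the union. `weight` maps a word to a positive
    score; words absent from the map count as zero."""
    inter = set_a.intersection(set_b)
    union = set_a.union(set_b)
    num = sum(weight.get(w, 0.0) for w in inter)
    den = sum(weight.get(w, 0.0) for w in union)
    return num / den if den else 0.0
```

For example, with weights {"lei": 2.0, "norma": 1.5, "vez": 0.2}, the similarity of {lei, norma, vez} and {lei, norma} is 3.5 / 3.7, so the low-weight noise word vez barely penalizes the score.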
      <Paragraph position="5"> Then, once each contextual set has been compared to the other sets, we select the words shared by each pair of similar sets, i.e., we take the intersection of each pair of sets considered similar. Since words that are not shared by two similar sets could be incorrect, we remove them.</Paragraph>
      <Paragraph position="6"> Intersection allows us to filter out words that are not semantically homogeneous. Thus, the intersection of two similar sets represents a class of co-hyponyms, which we call a basic class. Let's take an example.</Paragraph>
      <Paragraph position="7"> In our corpus, the set most similar to the one extracted from the context glossed as (infringement of) is the set extracted from the context glossed as (to infringe). Both sets share the following words: sigilo, princípios, preceito, plano, norma, lei, estatuto, disposto, disposição, direito (secret, principles, precept, plan, norm, law, statute, disposition, disposition, right). This basic class does not contain incorrect words such as vez, flagrantemente, obrigação, interesse (time, notoriously, obligation, interest), which were oddly associated with the context (infringement of) but do not appear in the context (to infringe). This class seems to be semantically homogeneous because it contains only co-hyponym words referring to legal documents. Once basic classes have been created, they are used by the conceptual clustering algorithm to build more general classes.</Paragraph>
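The filtering step above amounts to intersecting the word sets of each pair of contexts already judged similar. A minimal sketch, with function and variable names of our own choosing:

```python
def basic_classes(context_sets, similar_pairs):
    """Filtering step: for each pair of contexts judged similar, keep only
    the words they share. Each non-empty intersection is a candidate class
    of co-hyponyms (a 'basic class')."""
    classes = []
    for ctx_a, ctx_b in similar_pairs:
        shared = context_sets[ctx_a].intersection(context_sets[ctx_b])
        if shared:
            classes.append(((ctx_a, ctx_b), shared))
    return classes

# Toy data: noise words like "vez" appear in only one of the two sets
# and are therefore dropped by the intersection.
sets = {
    "infringement_of": {"lei", "norma", "preceito", "vez"},
    "to_infringe": {"lei", "norma", "preceito", "plano"},
}
print(basic_classes(sets, [("infringement_of", "to_infringe")]))
```

Here the basic class {lei, norma, preceito} survives while the noise words associated with only one context are discarded, mirroring the example in the text.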
      <Paragraph position="8">  We use agglomerative (bottom-up) clustering to successively aggregate the previously created basic classes. Unlike most research on conceptual clustering, aggregation does not rely on a statistical distance between classes, but on empirically set conditions and constraints (Talavera and B'ejar, 1999). These conditions are discussed in (Gamallo et al., 2001a). Figure 1 shows two basic classes associated with two pairs of similar subcategorisation contexts. ⟨Clust_1⟩ represents a pair of similar subcategorisation contexts sharing the words preceito, lei, norma (precept, law, norm), while ⟨Clust_2⟩ represents another pair of similar contexts sharing the words preceito, lei, direito (precept, law, right). Both basic classes are obtained from the filtering process described in the previous section. The figure illustrates more precisely how the basic classes are aggregated into more general clusters. If two classes satisfy the clustering conditions, they can be merged into a new class. The two basic classes of the example are clustered into the more general class constituted by preceito, lei, norma, direito. At the same time, the two pairs of contexts ⟨Clust_1⟩ and ⟨Clust_2⟩ are merged into the cluster ⟨Clust_12⟩. Such a generalization leads us to induce syntactic data that do not appear in the corpus. 
Indeed, we induce both that the word norma may appear in the syntactic contexts represented by ⟨Clust_2⟩, and that the word direito may be attached to the syntactic contexts represented by ⟨Clust_1⟩. Table 1 lists three clusters containing the word trabalho:
Cluster 1: contrato, execução, exercício, prazo, processo, procedimento, trabalho (agreement, execution, practice, term/time, process, procedure, work)
Cluster 2: contrato, exercício, prestação, recurso, serviço, trabalho (agreement, practice, installment, appeal, service, work)
Cluster 3: actividade, atribuição, cargo, exercício, função, lugar, trabalho (activity, attribution, post, practice, function, post, work/job)
Polysemic words are placed in different clusters. For instance, consider the word trabalho (work/job). Table 1 situates this word as a member of at least three different contextual classes. Cluster 1 aggregates words referring to temporal objects.</Paragraph>
      <Paragraph position="9"> Indeed, they are co-hyponyms because they appear in subcategorisation contexts sharing the same selection restrictions: e.g., the contexts glossed as (interruption of) and (in course). Cluster 2 represents the result of an action. Such a meaning becomes salient in contexts like, for instance, the one glossed as (to receive in payment for). Indeed, the cause of receiving money is not the action of working, but the object done or the state achieved by working. Finally, Cluster 3 illustrates the most typical meaning of trabalho: it is a job, function or task, which can be carried out by professionals. This is why these co-hyponyms can appear in subcategorisation contexts such as those glossed as (of the inspector) and (to accomplish).</Paragraph>
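A greedy bottom-up aggregation of basic classes can be sketched as follows. The merge predicate `may_merge` stands in for the paper's empirically set conditions, which are not reproduced here; any boolean predicate over two word sets will do in this sketch:

```python
def agglomerate(classes, may_merge):
    """Greedy agglomerative clustering: repeatedly merge the first pair of
    clusters whose word sets satisfy the merge condition, until no pair
    qualifies. Each cluster is (set of context labels, set of words)."""
    clusters = [({label}, set(words)) for label, words in classes]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if may_merge(clusters[i][1], clusters[j][1]):
                    labels = clusters[i][0].union(clusters[j][0])
                    words = clusters[i][1].union(clusters[j][1])
                    del clusters[j]  # delete the later index first
                    del clusters[i]
                    clusters.append((labels, words))
                    merged = True
                    break
            if merged:
                break
    return clusters
```

On the Figure 1 example, merging clusters that share at least two words aggregates {preceito, lei, norma} and {preceito, lei, direito} into the more general class {preceito, lei, norma, direito}, and the two context pairs into one cluster.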
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Application and Evaluation
</SectionTitle>
    <Paragraph position="0"> The acquired classes are used in the following way.</Paragraph>
    <Paragraph position="1"> First, the lexicon is provided with subcategorisation information, and then a second parsing cycle is performed in order for syntactic attachments to be corrected.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Lexicon Update
</SectionTitle>
      <Paragraph position="0"> Table 2 shows how the acquired classes are used to provide lexical entries with syntactic and semantic subcategorisation information. Each entry contains both the list of subcategorisation contexts and the list of word sets required by the syntactic contexts.</Paragraph>
      <Paragraph position="1"> As we have said before, such word sets are viewed as the extensional definition of the semantic preferences required by the subcategorisation contexts.</Paragraph>
      <Paragraph position="2"> Consider the information our system learnt for the verb emanar (see table 2). It syntactically subcategorises two kinds of &amp;quot;de-complements&amp;quot;: one semantically requires words referring to legal documents (emana da lei - emanate from the law; law prescribes), the other selects words referring to institutions (emana da autoridade - emanate from the authority; authority proposes). The semantic restrictions enable us to correct the odd attachments proposed by our syntactic heuristics for the phrase emanou de facto da lei (emanated in fact from the law). As the word facto does not belong to the semantic class required by the verb in the &amp;quot;de-complement&amp;quot; position, we test the next &amp;quot;de-complement&amp;quot;. As lei does belong, a new, correct attachment is proposed.</Paragraph>
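The correction step just described can be sketched as a simple search through the verb's learned semantic classes. This is a minimal sketch with a toy lexicon; the class contents and all names are illustrative, not the actual acquired classes:

```python
# Toy lexicon entry: the verb's "de-complement" slot admits two classes.
lexicon = {
    ("emanar", "de"): [
        {"lei", "norma", "preceito"},   # legal documents
        {"autoridade", "governo"},      # institutions
    ],
}

def correct_attachment(verb, prep, candidates, lexicon):
    """Try each candidate complement in surface order; accept the first one
    that belongs to some semantic class required by the verb's slot."""
    classes = lexicon.get((verb, prep), [])
    for cand in candidates:
        if any(cand in cls for cls in classes):
            return (verb, prep, cand)
    return None

# "facto" is rejected, so the next candidate "lei" is tried and accepted.
print(correct_attachment("emanar", "de", ["facto", "lei"], lexicon))
```

The returned triple corresponds to the new, corrected attachment of da lei to emanou.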
      <Paragraph position="3"> Consider now the nouns abono (loan) and presidente (president). They subcategorise not only complements, but also different kinds of heads.</Paragraph>
      <Paragraph position="4"> For instance, the noun abono selects for &amp;quot;de-head nouns&amp;quot; like fixação (fixação do abono - fixing the loan), as well as for verbs like fixar taking it in the direct object position: fixar o abono (to fix the loan).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Attachment Resolution Algorithm
</SectionTitle>
      <Paragraph position="0"> The syntactic and semantic subcategorisation information provided by the lexical entries is used to check whether the subcategorisation candidates previously extracted by the parser are true attachments.</Paragraph>
      <Paragraph position="1"> The degree of efficiency in such a task may serve as a reliable evaluation for measuring the soundness of our learning strategy.</Paragraph>
      <Paragraph position="2"> We assume the use of both a traditional chart parser (Kay, 1980) and a set of simple heuristics for identifying attachment candidates. Then, in order to improve the analysis, a &amp;quot;diagnosis parser&amp;quot; (Rocio et al., 2001) receives as input the sequences of chunks proposed as attachment candidates, checks them, and raises correction procedures. Consider, for instance, the expression editou o artigo (edited the article). The diagnoser reads the sequence of chunks VP(editar) and NP(artigo), and then proposes the attachment ⟨editar↑; obj; artigo↓⟩ to be corrected by the system. Correction is performed by accepting or rejecting the proposed attachment. This is done by looking up the subcategorisation information contained in the lexicon, information which has been acquired by the clustering method described above. Four tasks are performed to check the attachment heuristics: Task 1a - Syntactic checking of artigo: check the word artigo in the lexicon. Look for the syntactic restriction ⟨N↑; obj; artigo↓⟩. If artigo has this syntactic restriction, then pass to the semantic checking. Otherwise, pass to task 2a.</Paragraph>
      <Paragraph position="3"> Task 1b - Semantic checking of artigo: check the semantic restriction associated with ⟨N↑; obj; artigo↓⟩. If the word editar belongs to that restricted class, then we can infer that ⟨editar↑; obj; artigo↓⟩ is a binary relation. The attachment is then confirmed. Otherwise, pass to task 2a.</Paragraph>
      <Paragraph position="4"> Task 2a - Syntactic checking of editar: check the word editar in the lexicon. Look for the syntactic restriction ⟨editar↑; obj; N↓⟩. If editar has this syntactic restriction, then pass to the semantic checking. Otherwise, the attachment cannot be confirmed. Task 2b - Semantic checking of editar: check the semantic restriction associated with ⟨editar↑; obj; N↓⟩. If the word artigo belongs to that restricted class, then we can infer that ⟨editar↑; obj; artigo↓⟩ is a binary relation. The attachment is then confirmed. Otherwise, the attachment cannot be confirmed.</Paragraph>
      <Paragraph position="5"> Semantic checking is based on the co-specification hypothesis stated above. According to this hypothesis, two chunks are syntactically attached only if one of these two conditions is verified: either the complement is semantically required by the head, or the head is semantically required by the complement.</Paragraph>
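The four tasks and the co-specification condition can be sketched as a single check. This is a minimal sketch assuming two toy lexicon tables of our own design, one indexed by complement and one by head; the paper's actual lexicon format may differ:

```python
def check_attachment(head, rel, comp, head_lex, comp_lex):
    """Confirm the attachment if either direction of co-specification holds:
    the complement's entry semantically admits the head (tasks 1a/1b), or
    the head's entry semantically admits the complement (tasks 2a/2b)."""
    # Tasks 1a/1b: does comp subcategorise heads via rel, and is head in the class?
    cls = comp_lex.get((comp, rel))
    if cls is not None and head in cls:
        return True
    # Tasks 2a/2b: does head subcategorise complements via rel, and is comp in the class?
    cls = head_lex.get((head, rel))
    if cls is not None and comp in cls:
        return True
    return False

# Toy entry: artigo admits editar and publicar as direct-object heads.
comp_lex = {("artigo", "obj"): {"editar", "publicar"}}
print(check_attachment("editar", "obj", "artigo", {}, comp_lex))  # True
```

Returning False corresponds to the unconfirmed case, in which the diagnoser rejects the proposed attachment.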
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Evaluating Performance of Attachment
Resolution
</SectionTitle>
      <Paragraph position="0"> Table 3 shows some results of the corrections proposed by the diagnosis parser. Accuracy and coverage were evaluated on three types of attachment candidates: NP-PP, VP-NP, and VP-PP. We call accuracy the proportion of proposed corrections that actually correspond to true dependencies and, therefore, to correct attachments. Coverage indicates the proportion of candidate dependencies that were actually corrected.</Paragraph>
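These two measures reduce to simple ratios over three counts. A minimal sketch, with invented counts purely for illustration:

```python
def accuracy_and_coverage(n_candidates, n_corrected, n_correct):
    """Accuracy: fraction of proposed corrections that are true dependencies.
    Coverage: fraction of candidate dependencies actually corrected."""
    accuracy = n_correct / n_corrected if n_corrected else 0.0
    coverage = n_corrected / n_candidates if n_candidates else 0.0
    return accuracy, coverage

# E.g., 100 candidate attachments, 30 corrections proposed, 27 of them true.
print(accuracy_and_coverage(100, 30, 27))  # (0.9, 0.3)
```

High accuracy with low coverage, as in the results below, means the proposed corrections are mostly right but are made for only a minority of the candidates.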
      <Paragraph position="1"> Coverage evaluation was performed by randomly selecting as test data three sets of about 100-150 occurrences of candidate attachments from the parsed corpus. Each test set contained only one type of candidate attachment. Because of low coverage, accuracy was evaluated by using larger sets of test candidates. A brief description of the evaluation results is given in Table 3.</Paragraph>
      <Paragraph position="2">  Even though accuracy reaches a very promising value, coverage remains low. There are two main reasons for the low coverage: on the one hand, the learning method needs words to have significant frequencies throughout the corpus; on the other hand, words are sparse, i.e., most words of a corpus have few occurrences. However, the significant difference between the coverage for NP-PP attachments and that for verbal attachments (i.e., VP-NP and VP-PP) leads us to believe that coverage should increase as corpus size grows. Indeed, given that verbs are less frequent than nouns, verb occurrences are still very low in a corpus containing 1.5 million word occurrences. We need larger annotated corpora to improve the learning task, in particular concerning verb subcategorisation.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"> As we do not propose long-distance attachments, our method cannot be compared with other standard corpus-based approaches to attachment resolution (Hindle and Rooth, 1993; Brill and Resnik, 1994; Li and Abe, 1998). Long-distance attachments will only be considered after the corrections for immediate dependencies have been achieved in the first cycle of syntactic analysis. We are currently working on the specification of new analysis cycles so that long-distance attachments can be resolved. Consider again the phrase emanou de facto da lei. At the second cycle, the diagnoser proposed that the first PP de facto is not correctly attached to emanou.</Paragraph>
    <Paragraph position="1"> At the third cycle, the system will check whether the second PP da lei may be attached to the verb. We will perform n cycles of attachment propositions, until no candidates are available. At the end of the process, we will be able to measure more accurately the degree of robustness the parser may achieve.</Paragraph>
  </Section>
</Paper>