File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/e06-2028_evalu.xml
Size: 3,577 bytes
Last Modified: 2025-10-06 13:59:33
<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2028"> <Title>Bayesian Network, a model for NLP?</Title>
<Section position="4" start_page="196" end_page="197" type="evalu">
<SectionTitle> 4 Experiments and Discussion </SectionTitle>
<Paragraph position="0"> Medline is a database specialized in genomic research articles. We extracted from it 11966 abstracts with the keywords bacillus subtilis, transcription factors, Human, blood cells, gene and fusion. Among these abstracts, we isolated 3347 occurrences of the pronoun it, and two human annotators tagged each occurrence as either anaphoric or non-anaphoric. After discussion, the two annotators reached total agreement.</Paragraph>
<Paragraph position="1"> We implemented the HC rules, the LC rules and the surface clues using finite transducers, and extracted the pronoun's syntactic role from the Link Parser analysis of the corpus (Aubin, 2005). As a working approximation, we automatically generated the verb, adjective and noun classes from the training corpus: among all occurrences of it tagged as non-anaphoric, we selected the verbs, adjectives and nouns occurring between the delimiter and the pronoun. We used a third of the corpus for training and the remainder for testing.</Paragraph>
<Paragraph position="2"> Our experiment was performed using 20-fold cross-validation. Table 1 summarizes the average results achieved by the state-of-the-art methods described above.</Paragraph>
<Paragraph position="3"> The BN system achieved better classification than the other methods.</Paragraph>
<Paragraph position="4"> In order to neutralize and comparatively quantify the contribution of the dependency relationships between the factors to the decision, we implemented a Naive Bayesian Classifier (NBC) that exploits the same pieces of knowledge and the same parameters as the BN but does not benefit from the reinforcement mechanism, which leads to a rise in the number of false positives.</Paragraph>
<Paragraph position="5"> Our BN, which has good precision, nevertheless tags as non-anaphoric some occurrences that are not. The most recurrent error concerns sequences ending with the delimiter to that are recognized by some LC rules. Although no HC rule matches the sequence, its minimal length and the fact that it contains particular adjectives or verbs such as assumed or shown make this configuration characteristic enough to tag the pronoun as non-anaphoric. When the delimiter is that, this classification is correct, but it is always incorrect when the delimiter is to. For the delimiter to, the rules must be more carefully designed.</Paragraph>
<Paragraph position="6"> Three different factors explain the false negative cases. Firstly, some sequences were ignored because the delimiter remained implicit. Secondly, the presence of apposition clauses increases the sequence length and decreases the confidence.</Paragraph>
<Paragraph position="7"> Dedicated algorithms taking advantage of a deeper syntactic analysis could resolve these cases. The last cause is the non-exhaustiveness of the verb, adjective and noun classes. It should be possible to enrich them automatically. In our experiments we noticed that if an LC rule matches a sequence in the first clause of the first sentence of an abstract, then the pronoun is non-anaphoric. We could automatically extract from Medline a large number of such sentences and extend our classes by selecting the verbs, adjectives and nouns occurring between the pronoun and the delimiter in these sentences.</Paragraph>
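The class-generation step above (and the proposed extension over additional Medline sentences) can be pictured with a minimal Python sketch. The record layout used below, with tokens paired with coarse part-of-speech tags and explicit pronoun and delimiter positions, is a hypothetical stand-in for the output of the transducers and the Link Parser, not the authors' implementation.

# Minimal sketch (not the authors' code): derive verb/adjective/noun classes
# from training occurrences of "it" tagged as non-anaphoric, by collecting
# the words found between the pronoun and the delimiter.
from collections import defaultdict

def build_word_classes(training_occurrences):
    # training_occurrences: hypothetical records with
    #   tokens      - list of (word, pos) pairs, coarse tags such as VERB, ADJ, NOUN
    #   it_index    - position of the pronoun "it" in the token list
    #   delim_index - position of the delimiter ("that", "to", ...)
    #   anaphoric   - gold label from the annotators
    classes = defaultdict(set)  # pos tag -> set of trigger words
    for occ in training_occurrences:
        if occ["anaphoric"]:
            continue  # only non-anaphoric occurrences contribute
        lo = min(occ["it_index"], occ["delim_index"]) + 1
        hi = max(occ["it_index"], occ["delim_index"])
        for word, pos in occ["tokens"][lo:hi]:
            if pos in ("VERB", "ADJ", "NOUN"):
                classes[pos].add(word.lower())
    return classes

# Toy example: "it is assumed that the gene ..."
toy = [{"tokens": [("it", "PRON"), ("is", "VERB"), ("assumed", "VERB"),
                   ("that", "DELIM"), ("the", "DET"), ("gene", "NOUN")],
        "it_index": 0, "delim_index": 3, "anaphoric": False}]
print(build_word_classes(toy))  # VERB class now contains 'is' and 'assumed'

Running the same collection over the Medline sentences selected by the LC-rule heuristic described above would extend the classes in the way the paragraph suggests.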
</Section> </Paper>