<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2028">
<Title>Bayesian Network, a model for NLP?</Title>
<Section position="3" start_page="0" end_page="196" type="metho">
<SectionTitle> 2 Identification of Non-anaphoric it occurrences </SectionTitle>
<Paragraph position="0"> The decisions made by NLP systems depend on the available knowledge. However, this information is often weakly reliable and leads to erroneous or incomplete results.</Paragraph>
<Paragraph position="1"> One of the first pronoun classifier systems was presented by (Paice, 1987). It relies on a set of logical first-order rules to distinguish the non-anaphoric occurrences of the pronoun it. Non-anaphoric sequences share remarkable forms (they start with an it and end with a delimiter like to, that, whether...). The rules express constraints which vary according to the delimiter. They concern the left context of the pronoun (it should not be immediately preceded by certain words like before, from, to), the distance between the pronoun and the delimiter (the sequence must be shorter than 25 words), and finally the lexical items occurring between the pronoun and the delimiter (the sequence must or must not contain certain words belonging to specific sets, such as words expressing modality over the sentence content, e.g. certain, known, unclear...).</Paragraph>
<Paragraph position="2"> Tests performed by Paice show good results, with 91.4% Accuracy [1] on a technical corpus. However, performance degrades when the rules are applied to corpora of a different nature: the number of false positives increases.</Paragraph>
<Paragraph position="3"> In order to avoid this pitfall, (Lappin, 1994) proposes more constrained rules in the form of finite state automata. Based on linguistic knowledge, the automata recognize specific sequences like It is not/may be <Modaladj>; It is <Cogv>ed that <Subject>, where <Modaladj> and <Cogv> are classes of modal adjectives and cognitive verbs known to introduce a non-anaphoric it (e.g. necessary, possible and recommend, think). This system has a good precision (few false positive cases) but a low recall (many false negative cases). Any sequence with a variation is ignored by the automata, and it is difficult to build exhaustive adjective and verb semantic classes [2]. In the next paragraphs we refer to Lappin's rules as Highly Constrained rules (HC rules) and to Paice's rules as Lightly Constrained rules (LC rules).</Paragraph>
<Paragraph position="5"> (Evans, 2001) gives up the constraints brought into play by these rules and proposes a machine learning approach based on surface clues. The training determines the relative weight of the various corpus clues. Evans considers 35 syntactic and contextual surface clues (e.g. the position of the pronoun in the sentence, the lemma of the following verb) on a manually annotated sample. The system classifies new it occurrences with the k-nearest neighbor method. The first tests achieve a satisfactory score: 71.31% Acc on a general language corpus. (Clement, 2004) carries out a similar test in the genomic domain. He reduces the number of Evans's surface clues to the 21 most relevant ones and classifies new instances with a Support Vector Machine (SVM). He obtains 92.71% Acc, to be compared with a 90.78% Acc score for the LC rules on the same corpus.</Paragraph>
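<Paragraph> To make the delimiter-based constraints concrete, the following minimal sketch (Python) implements a Paice-style LC rule. It is an illustration only: the delimiter set, the forbidden left-context words and the modality lexicon below are hypothetical placeholders, not Paice's published resources.</Paragraph>

    # Hypothetical sketch of a Paice-style (LC) rule for non-anaphoric "it".
    # All word lists are illustrative placeholders, not Paice's actual lexicons.
    DELIMITERS = {"to", "that", "whether"}            # sequence must end with a delimiter
    FORBIDDEN_LEFT = {"before", "from"}               # "it" must not be preceded by these
    MODALITY_WORDS = {"certain", "known", "unclear"}  # modality clues inside the sequence
    MAX_DISTANCE = 25                                 # pronoun-to-delimiter bound, in words

    def lc_rule_matches(tokens, it_index):
        """Return True if the 'it' at it_index looks non-anaphoric under the rule."""
        if it_index > 0 and tokens[it_index - 1].lower() in FORBIDDEN_LEFT:
            return False
        window = tokens[it_index + 1 : it_index + 1 + MAX_DISTANCE]
        for offset, word in enumerate(window):
            if word.lower() in DELIMITERS:
                between = {w.lower() for w in window[:offset]}
                return bool(between & MODALITY_WORDS)  # require a modality word
        return False

    tokens = "it is unclear whether the gene is expressed".split()
    print(lc_rule_matches(tokens, 0))  # True: delimiter "whether", modality word "unclear"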
<Paragraph position="6"> The difficulty, however, comes from the fact that the information on which the systems are built is often diverse and heterogeneous. This last system is based on atomic surface clues only and does not make use of the linguistic knowledge or the relational information that the constraints of the previous systems encode. We argue that these three types of knowledge, namely the HC rules, the LC rules, and the surface clues, are all relevant and complementary for the task, and that they must be unified in a single representation.</Paragraph>
<Paragraph position="7"> Neither the rules nor the surface clues are reliable indicators of the pronoun status on their own. They encode heterogeneous pieces of information and consequently produce different false negative and false positive cases. The HC rules have a good precision but tag only a few pronouns. Conversely, the LC rules, which have a good recall, are not precise enough to be exploited as such, and the additional surface clues must be checked. Our model combines these clues and takes their respective reliability into account. It obtains better results than those obtained from each clue exploited separately.</Paragraph>
<Paragraph position="8"> The BN is a model designed to deal with dubious pieces of information. It is based on a qualitative description of their dependency relationships, a directed acyclic graph, and a set of conditional probabilities, each node being represented as a Random Variable (RV). Parametrizing the BN associates an a priori probability distribution with the graph. Exploiting the BN (inference stage) consists in propagating new pieces of information through the network edges and updating them according to the observations (a posteriori probabilities).</Paragraph>
<Paragraph position="9"> We integrated all the clues exploited by the previous methods within the same BN. We use dependency relationships to express the fact that two clues are combined. The BN is manually designed (choice of the RV values and of the graph structure). In Figure 1, the nodes associated with the HC rules method are marked in grey, white is for the LC rules method and black for Clement's method [3]. The Pronoun node estimates the probability for a given it occurrence to be non-anaphoric.</Paragraph>
<Paragraph position="10"> The parameterising stage establishes the a priori probability values of all the RVs by simple frequency counts in a training corpus. They express the weight of each piece of information in the decision, i.e. its a priori reliability in the classification decision [4]. The inference stage exploits the relationships to propagate the information, and the BN operates by information reinforcement to label a pronoun.</Paragraph>
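<Paragraph> To illustrate the two stages, here is a deliberately simplified sketch (Python) in which every clue node is treated as conditionally independent given the Pronoun node, i.e. a naive-Bayes reduction of the actual, manually structured network; the clue names and training counts in the usage example are invented.</Paragraph>

    # Simplified sketch of the parameterising and inference stages, under a
    # naive-Bayes reduction of the paper's BN (each clue node depends only on
    # the Pronoun node). Clue names and counts are invented for illustration.
    from collections import defaultdict

    class TinyPronounBN:
        def __init__(self):
            self.prior = {}               # a priori P(Pronoun)
            self.cpt = defaultdict(dict)  # P(clue = value | Pronoun), per label

        def parameterise(self, examples):
            """A priori probabilities by simple frequency counts over a training corpus."""
            label_counts = defaultdict(int)
            value_counts = defaultdict(lambda: defaultdict(int))
            for clues, label in examples:
                label_counts[label] += 1
                for clue_value in clues.items():
                    value_counts[label][clue_value] += 1
            total = sum(label_counts.values())
            for label, n in label_counts.items():
                self.prior[label] = n / total
                for clue_value, c in value_counts[label].items():
                    self.cpt[label][clue_value] = c / n

        def infer(self, observations):
            """A posteriori P(Pronoun | observations): propagate evidence, renormalize."""
            scores = {}
            for label, p in self.prior.items():
                for clue_value in observations.items():
                    p *= self.cpt[label].get(clue_value, 1e-6)  # smoothing floor
                scores[label] = p
            z = sum(scores.values())
            return {label: s / z for label, s in scores.items()}

    bn = TinyPronounBN()
    bn.parameterise([({"hc_rule": "match"}, "non-anaphoric"),
                     ({"hc_rule": "no-match"}, "non-anaphoric"),
                     ({"hc_rule": "no-match"}, "anaphoric")])
    print(bn.infer({"hc_rule": "match"}))  # mass shifts strongly to "non-anaphoric"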
<Paragraph position="11"> We applied all the preceding rules, checked the surface clues on the sequence containing the it occurrence, and set the observation values of the corresponding RVs. A new probability is then computed for the variable of the Pronoun node: if it is greater than or equal to 50%, the pronoun is labeled non-anaphoric, and anaphoric otherwise.</Paragraph>
<Paragraph position="12"> Let us consider a sentence extracted from our corpus: It had previously been thought that ZEBRA's capacity to disrupt EBV latency.... No HC rule recognizes the sequence, even when tolerating 3 unknown words [5], but an LC rule matches it, with 4 words between the pronoun and the delimiter that [6]. Among the surface clues, we checked that the sequence is at the beginning of the sentence (1) but that the sentence is not the first of the abstract (2). The sentence also contains the adverb previously (3) and the verb think (4), both of which belong to our semantic classes [7]. The a priori probability for the pronoun to be non-anaphoric is 36.2%. After modifying the probabilities of the nodes of the BN according to the corpus observations, the a posteriori probability computed for this occurrence is 99.9% and the system classifies it as non-anaphoric.</Paragraph>
<Paragraph position="13"> [1] Accuracy (Acc) is a classification measure: Acc = (P+N) / (P+N+p+n), where p is the number of anaphoric pronoun occurrences tagged as non-anaphoric, which we call the false positive cases, and n is the number of non-anaphoric pronoun occurrences tagged as anaphoric, the false negative cases. P and N are the numbers of correctly tagged non-anaphoric and anaphoric pronoun occurrences, the true positive and true negative cases respectively.</Paragraph>
<Paragraph position="14"> [2] For instance, in the sentences It is well documented that treatment of serum-grown... and It is generally accepted that Bcl-2 exerts..., the it occurrences are not classified as non-anaphoric because documented does not belong to the original verb class <Cogv> and generally does not appear in the previous automaton.</Paragraph>
<Paragraph position="15"> [4] In our corpus (see next section), the HC rules recognized 649 of the 727 non-anaphoric pronouns and erroneously recognized 17 pronouns as non-anaphoric, so we set the HCR-rules node probabilities to P(HCRrules=Match|Pronoun=Non-Anaphoric) = 89.2% and P(HCRrules=Match|Pronoun=Anaphoric) = 1.3%, which express the expected values for the false negative cases and the false positive cases produced by the HC rules respectively.</Paragraph>
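<Paragraph> To make the reinforcement mechanism tangible, the short sketch below performs a single Bayes update of the Pronoun node using the prior of 36.2% and the HCR-rules likelihoods quoted in footnote [4]. It is illustrative only: it considers one clue in isolation (in the example sentence above the HC rules did not match, and the full network combines all the clues to reach 99.9%).</Paragraph>

    # Single-clue Bayes update of the Pronoun node, using the prior and the
    # HCR-rules likelihoods quoted in the text; all other clue nodes omitted.
    prior_non_ana = 0.362          # a priori P(Pronoun = Non-Anaphoric)
    p_match_non_ana = 0.892        # P(HCRrules = Match | Non-Anaphoric)
    p_match_ana = 0.013            # P(HCRrules = Match | Anaphoric)

    # Bayes' rule after observing HCRrules = Match.
    joint_non_ana = prior_non_ana * p_match_non_ana
    joint_ana = (1 - prior_non_ana) * p_match_ana
    posterior = joint_non_ana / (joint_non_ana + joint_ana)

    print(f"P(Non-Anaphoric | HC rule match) = {posterior:.3f}")  # ~0.975
    print("non-anaphoric" if posterior >= 0.5 else "anaphoric")   # 50% threshold

</Section>
</Paper>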