<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1204"> <Title>Discovering patterns to extract protein-protein interactions from full texts</Title> <Section position="3" start_page="22" end_page="24" type="intro"> <SectionTitle> 2 Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 2.1 Alignment algorithm </SectionTitle> <Paragraph position="0"> Suppose we have two sequences</Paragraph> <Paragraph position="2"> X = x_1x_2...x_n and Y = y_1y_2...y_m. Each x_i or y_j is called a character, and '-' denotes a white-space or a gap. We want to assign a score that measures how similar X and Y are. Define F(i,j) as the score of the optimal alignment between the initial segments x_1...x_i and y_1...y_j.</Paragraph> <Paragraph position="4"> F(i,j) is computed with the boundary condition F(i,0) = F(0,j) = 0 (1a) and the recurrence F(i,j) = max{0, F(i-1,j-1)+s(x_i,y_j), F(i-1,j)+s(x_i,'-'), F(i,j-1)+s('-',y_j)} (1b), where the substitution score is the log-odds ratio s(a,b) = log(p(a,b)/(p(a)p(b))) (2). Here, p(a) denotes the appearance probability of character a, and p(a,b) denotes the probability that a and b appear at the same position in two aligned sequences. The probabilities p(a) and p(a,b) can easily be estimated by counting appearance frequencies for each pair in pre-aligned training data.</Paragraph> <Paragraph position="5"> Note that scores involving a gap must be calculated differently. When a or b in formula (2) is a gap, the score cannot be estimated directly, for two reasons: 1) a gap never aligns with another gap in the alignment algorithm, since such an alignment is never optimal, so what s('-','-') means is unclear; 2) a gap penalty should be negative, but it is unclear what p('-') should be. In DNA sequence alignment, gap penalties are simply assigned negative constants. Similarly, we tune the gap penalty for each character to some fixed negative value.
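The frequency-based estimation of the log-odds score s(a,b) described above can be sketched as follows. This is a minimal illustration assuming simple relative-frequency estimates and hand-tuned per-tag gap penalties; all function and variable names are illustrative, not from the paper, and unseen pairs would need smoothing in practice.

```python
import math
from collections import Counter
from itertools import chain

def make_scorer(aligned_pairs, gap_penalties):
    """Build the log-odds substitution score s(a, b) = log(p(a,b) / (p(a) p(b)))
    from pre-aligned training data.

    aligned_pairs: (a, b) tag pairs read column by column from pre-aligned
    sequences. Columns containing a gap are scored with the fixed, hand-tuned
    per-tag penalties in gap_penalties instead of the log-odds formula.
    """
    pair_counts = Counter(aligned_pairs)
    char_counts = Counter(chain.from_iterable(aligned_pairs))
    n_pairs = sum(pair_counts.values())
    n_chars = sum(char_counts.values())

    def s(a, b):
        if a == '-' or b == '-':
            # Gap penalty: a tuned negative constant per character (cf. Table 1).
            return gap_penalties[a if b == '-' else b]
        p_ab = pair_counts[(a, b)] / n_pairs   # joint appearance probability
        p_a = char_counts[a] / n_chars         # marginal probabilities
        p_b = char_counts[b] / n_chars
        return math.log(p_ab / (p_a * p_b))    # log-odds ratio, formula (2)

    return s
```

Tags that frequently align together thus receive positive scores, while rarely co-occurring tags receive negative scores.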
Then a linear gap model is used.</Paragraph> <Paragraph position="6"> Given a sequence of gaps of length n which aligns to a sequence of</Paragraph> <Paragraph position="8"> characters a_1a_2...a_n with no gaps, the linear penalty is as follows: penalty = s('-',a_1) + s('-',a_2) + ... + s('-',a_n) (3).</Paragraph> <Paragraph position="10"> For a sequence X of length n and a sequence Y of length m, in total (n+1)*(m+1) scores are calculated by applying equations (1a-b) recursively.</Paragraph> <Paragraph position="11"> Store the scores in a matrix F = (F(i,j)). By</Paragraph> <Paragraph position="13"> back-tracing in F, the optimal local alignment can be found.</Paragraph> <Paragraph position="14"> In our method, the alphabet consists of three kinds of tags: 1) part-of-speech tags, as used by Brill's tagger (Brill et al., 1995); 2) the tag PTN for protein names; 3) the tag GAP for a gap or white-space. Gap penalties for the main tags are shown in Table 1.</Paragraph> <Paragraph position="16"/> </Section> <Section position="2" start_page="22" end_page="23" type="sub_section"> <SectionTitle> 2.2 Pattern generating algorithm </SectionTitle> <Paragraph position="0"> For our problem, a data structure called the sequence structure is used instead of a flat sequence. A sequence structure consists of a sequence of tags (including PTN and GAP) together with, for each tag, its word indices in the original sentence (for the tags PTN and GAP, the word indices are set to -1). Through this structure, we can trace which words align together.</Paragraph> <Paragraph position="1"> Similarly, we use another data structure called the pattern structure, which is made up of three parts: a sequence of tags; an array of word-index lists, one per tag, where each list defines the set of words that can appear at the corresponding position of the pattern; and a count of how many times the pattern has been extracted from the training corpus. With this structure, the pattern generating algorithm is shown in Figure 1.
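The (n+1)*(m+1) dynamic-programming matrix and back-trace of Section 2.1 can be sketched as below. This is a minimal local-alignment sketch; the toy score function used in the usage example is illustrative, not the paper's tuned scores.

```python
def local_align(x, y, s):
    """Fill the (len(x)+1) * (len(y)+1) score matrix F with the local-alignment
    recurrence and recover the optimal local alignment by back-tracing.
    s(a, b) is a substitution score; s(a, '-') and s('-', b) are gap penalties.
    Returns (best score, (aligned slice of x, aligned slice of y))."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]   # back-trace pointers
    best, best_cell = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [
                (0.0, None),                                # restart: local alignment
                (F[i-1][j-1] + s(x[i-1], y[j-1]), 'D'),     # substitution
                (F[i-1][j] + s(x[i-1], '-'), 'U'),          # gap in y
                (F[i][j-1] + s('-', y[j-1]), 'L'),          # gap in x
            ]
            F[i][j], ptr[i][j] = max(cands, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    ax, ay = [], []                                         # back-trace from the best cell
    i, j = best_cell
    while ptr[i][j] is not None:
        d = ptr[i][j]
        if d == 'D':
            ax.append(x[i-1]); ay.append(y[j-1]); i, j = i - 1, j - 1
        elif d == 'U':
            ax.append(x[i-1]); ay.append('-'); i -= 1
        else:
            ax.append('-'); ay.append(y[j-1]); j -= 1
    return best, (ax[::-1], ay[::-1])
```

For example, aligning the tag sequences ['NN','IN','PTN','VB','PTN'] and ['PTN','VB','PTN'] with a match score of 2 and a mismatch/gap score of -1 recovers the common subsequence PTN VB PTN.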
The filtering rules are listed in Table 2.</Paragraph> <Paragraph position="2"> Note that a threshold d is used in the algorithm. If a pattern appears fewer than d times in the corpus, it is discarded; otherwise such infrequent patterns would cause many matching errors. By adjusting this parameter, the generality and usability of the patterns can be controlled. The larger the threshold, the more general and accurate the resulting patterns.</Paragraph> <Paragraph position="3"> Tags like JJ (adjective) and RB (adverb) are so common that they can appear at almost any position in a sentence; if patterns include such tags, they lose their generalization power. Other tags, such as DT (determiner), play only a functional role in a sentence and are useless for pattern generation. Therefore, as the first step of the algorithm shown in Figure 1, we directly remove the useless tags JJ, JJS (superlative adjective), JJR (comparative adjective), RB, RBS (superlative adverb), RBR (comparative adverb) and DT from the sequences. Furthermore, to control the form of a pattern, the filtering rules shown in Table 2 are adopted. Verb and noun tags express the interactions between proteins, so they are indispensable for a pattern, as the first rule states. The second rule guarantees the integrity of a pattern, because tags like IN and TO must be followed by an object. The last rule requires symmetry between the left and right neighborhoods of a CC tag.
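The tag-removal step, the three filtering rules of Table 2, and the threshold d can be sketched as simple predicates. This is a hedged illustration: the standard Penn Treebank tag sets below stand in for the paper's exact lists.

```python
# Illustrative tag sets (Penn Treebank style), standing in for the paper's lists.
USELESS = {'JJ', 'JJS', 'JJR', 'RB', 'RBS', 'RBR', 'DT'}
VERB_NOUN = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ',
             'NN', 'NNS', 'NNP', 'NNPS'}

def strip_useless(tags):
    """Step 1 of the algorithm: drop adjectives, adverbs, and determiners."""
    return [t for t in tags if t not in USELESS]

def is_legal(pattern):
    """Filtering rules in the spirit of Table 2:
    1. a pattern must contain a verb or noun tag;
    2. a pattern must not end with IN or TO;
    3. the tag left of every CC must equal the tag right of it."""
    if not any(t in VERB_NOUN for t in pattern):
        return False
    if pattern[-1] in ('IN', 'TO'):
        return False
    for i, t in enumerate(pattern):
        if t == 'CC':
            if i == 0 or i == len(pattern) - 1 or pattern[i-1] != pattern[i+1]:
                return False
    return True

def prune(pattern_counts, d):
    """Discard patterns extracted fewer than d times from the corpus."""
    return {p: c for p, c in pattern_counts.items() if c >= d}
```

For instance, ('PTN','VB','IN') is rejected because it ends with IN, while ('PTN','CC','PTN','VB') passes all three rules.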
More rigid or looser filtering rules than those shown in Table 2 can be applied to meet special demands; this will affect the forms of the resulting patterns.</Paragraph> </Section> <Section position="3" start_page="23" end_page="24" type="sub_section"> <SectionTitle> 2.3 Pattern matching algorithm </SectionTitle> <Paragraph position="0"> Because one pattern may match a sentence at several different positions, we need an algorithm that can find multiple matches.</Paragraph> <Paragraph position="1"/> <Paragraph position="2"> Here, if we regard a pattern as a motif and a sentence as a protein sequence, our task is similar to finding all occurrences of a motif in the sequence.</Paragraph> <Paragraph position="3"> Suppose that</Paragraph> <Paragraph position="5"> x = x_1x_2...x_n is the sequence of tags for a sentence in which we look for multiple matches, and</Paragraph> <Paragraph position="7"> y = y_1y_2...y_m is a pattern. We still use a score matrix F, but the recurrence, defined by formulas (4a-b), differs from that of the pattern generating algorithm: F(i,0) = max{F(i-1,0), max_j (F(i-1,j) - T)} (4a), F(i,j) = max{F(i,0), F(i-1,j-1)+s(x_i,y_j), F(i-1,j)+s(x_i,'-'), F(i,j-1)+s('-',y_j)} (4b). Because the threshold T is subtracted whenever a match is closed, formula (4a) only allows matches to end when they have a score of at least T.</Paragraph> <Paragraph position="9"> The total score of all matches is obtained by adding an extra cell to the matrix, F(n+1,0), using (4a). By tracing back from cell (n+1,0) to (0,0), the individual match alignments can be obtained.</Paragraph> <Paragraph position="10"> Threshold T should not be identical for different patterns.
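The multiple-match recurrence resembles the classic repeated-matches dynamic program for finding all motif occurrences in a sequence. Below is a minimal sketch under that reading, assuming formulas (4a-b) follow the standard repeated-match scheme; the toy score function in the usage example is illustrative.

```python
def total_match_score(x, y, s, T):
    """Repeated-match DP in the spirit of formulas (4a-b): column 0 of F
    accumulates the scores of completed matches, and subtracting the
    threshold T means a match only improves the running total when its own
    score is at least T. The extra cell F(n+1, 0) holds the total score of
    all matches; tracing back from it would recover the individual matches."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 2)]    # rows 0..n+1
    for i in range(1, n + 2):
        # (4a): either carry the running total, or close a match ending at i-1
        F[i][0] = max([F[i-1][0]] + [F[i-1][j] - T for j in range(1, m + 1)])
        if i == n + 1:
            break                                   # cell (n+1, 0): total score
        for j in range(1, m + 1):
            # (4b): extend a match, open a gap, or start afresh from F(i, 0)
            F[i][j] = max(F[i][0],
                          F[i-1][j-1] + s(x[i-1], y[j-1]),
                          F[i-1][j] + s(x[i-1], '-'),
                          F[i][j-1] + s('-', y[j-1]))
    return F[n + 1][0]
```

With a match score of 2 and T = 1, the pattern ['A','B'] matches ['A','B','C','A','B'] twice; each match scores 4 and pays the threshold once, giving a total of 6.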
Threshold T is calculated as follows:</Paragraph> <Paragraph position="12"> T = e * (s(y_1,y_1) + s(y_2,y_2) + ... + s(y_m,y_m)) (5), where y_1...y_m is the pattern and e is a factor; in our method we take e = 0.5.</Paragraph> <Paragraph position="13"> The sum on the right-hand side of formula (5) is the maximum score, obtained when the pattern matches a sentence perfectly.</Paragraph> <Paragraph position="14"> A match is accepted only when three conditions are satisfied: 1) the pattern has a locally optimal match with the sentence; 2) the words in the matching part of the sentence can be found in the word sets of the pattern; 3) the decision rules are satisfied.</Paragraph> <Paragraph position="15"> 1. If a pattern has neither a verb tag nor a noun tag, reject it.</Paragraph> <Paragraph position="16"> 2. If the last tag of a pattern is IN or TO, reject it.</Paragraph> <Paragraph position="17"> 3. If the left neighborhood of a CC tag in a pattern is not equal to its right neighborhood, reject the pattern.</Paragraph> <Paragraph position="18"> Input: an integer d [...] as pattern p. Add the corresponding word indices to the pattern structure; c) judge whether p is legal, using the filtering rules; if it is illegal, go to step 2; d) if p exists in P, increase the count of p by 1; if not, add p to P with a count of 1. 3. For every p in P: if the count of p is less than d, discard p. 4. Output P.</Paragraph> <Paragraph position="20"> To show in detail how well a pattern matches a sentence, a measurement data structure, formalized as a vector and referred to as the mVector, is defined: mVector = (cLen, cMatch, cPtn, cVb),</Paragraph> <Paragraph position="22"> where cLen is the length of the pattern; cMatch is the number of matched tags; cPtn is the number of protein-name tags (PTN) skipped by the alignment in the sentence; and cVb is the number of skipped verbs. Based on this structure, the decision rules shown in Table 3 are used in pattern matching.</Paragraph> <Paragraph position="23"> Two parameters, P and V, are used in the decision rules; they can be adjusted according to the performance of the experiments.
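The mVector measurement can be sketched as below. Since Table 3 itself is not reproduced in this excerpt, the accept() rule shown is only a hypothetical stand-in using the two parameters P and V; the paper's actual decision rules may differ.

```python
from dataclasses import dataclass

@dataclass
class MVector:
    c_len: int    # length of the pattern
    c_match: int  # number of matched tags
    c_ptn: int    # number of PTN tags skipped by the alignment in the sentence
    c_vb: int     # number of skipped verbs

def measure(pattern, aligned_sentence, aligned_pattern):
    """Build the mVector for one match. aligned_sentence and aligned_pattern
    are the two rows of the match alignment, padded with '-' for gaps."""
    cols = list(zip(aligned_sentence, aligned_pattern))
    c_match = sum(1 for a, b in cols if a == b and a != '-')
    c_ptn = sum(1 for a, b in cols if b == '-' and a == 'PTN')
    c_vb = sum(1 for a, b in cols if b == '-' and a.startswith('VB'))
    return MVector(len(pattern), c_match, c_ptn, c_vb)

def accept(mv, P, V):
    # Hypothetical decision rule standing in for Table 3 (not reproduced in
    # this excerpt): tolerate at most P skipped protein names and V skipped verbs.
    return mv.c_ptn <= P and mv.c_vb <= V
```

For example, aligning the sentence tags ['PTN','PTN','VB','PTN'] against the pattern ['PTN','VB','PTN'] skips one PTN, giving mVector = (3, 3, 1, 0).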
Here we take</Paragraph> </Section> </Section> </Paper>