<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2204"> <Title>Transductive Pattern Learning for Information Extraction</Title> <Section position="4" start_page="25" end_page="28" type="metho"> <SectionTitle> 3 The TPLEX algorithm </SectionTitle> <Paragraph position="0"> The goal of the algorithm is to identify the members of the target fields within unlabelled texts by generalizing from seed examples in labelled training texts. We achieve this by generalizing boundary detecting patterns and scoring them with a recursive scoring function. As shown in Fig. 1, TPLEX bootstraps the learning process from a seed set of labelled examples. The examples are used to populate initial pattern sets for each target field, with patterns that match the start and end positions of the seed fragments. Each pattern is then generalised to produce more patterns, which are in turn applied to the corpus in order to identify more base patterns. This process iterates until no more patterns can be learned.</Paragraph> <Paragraph position="1"> (Caption of Fig. 1: at the core of the algorithm is the following recursive scoring method: pattern scores are a function of the scores of the positions they extract, and position scores are a function of the scores of the patterns that extract them.)</Paragraph> <Paragraph position="2"> TPLEX employs a recursive scoring metric in which good patterns reinforce good positions, and good positions reinforce good patterns. Specifically, we calculate confidence scores for positions and patterns. Our scoring mechanism calculates the score of a pattern as a function of the scores of the positions that it matches, and the score of a position as a function of the scores of the patterns that extract it.</Paragraph> <Paragraph position="3"> TPLEX is a multi-field extraction algorithm in that it extracts multiple fields simultaneously. By doing this, information learned for one field can be used to constrain patterns learned for others. 
Specifically, our scoring mechanism ensures that if a pattern scores highly for one field, its score for all other fields is reduced. In the remainder of this section, we describe the algorithm by formalizing the space of learned patterns, and then describing TPLEX's scoring mechanism.</Paragraph> <Section position="1" start_page="25" end_page="26" type="sub_section"> <SectionTitle> 3.1 Boundary detection patterns </SectionTitle> <Paragraph position="0"> TPLEX extracts fragments of text by identifying probable fragment start and end positions, which are then assembled into complete fragments. TPLEX's patterns are therefore boundary detectors which identify one end of a fragment or the other. TPLEX learns patterns to identify the start and end of target occurrences independently of each other. This strategy has previously been employed successfully (Freitag and Kushmerick, 2000; Ciravegna, 2001; Yangarber et al., 2002; Finn and Kushmerick, 2004).</Paragraph> <Paragraph position="1"> TPLEX's boundary detectors are similar to those learned by BWI (Freitag and Kushmerick, 2000). A boundary detector has two parts: a left pattern and a right pattern. Each of these patterns is a sequence of tokens, where each token is either a literal or a generalized token. For example, the boundary detector [will be <punc>][<caps> <fname>] would correctly find the start of a name in an utterance such as &quot;will be: Dr Robert Boyle&quot; and &quot;will be, Sandra Frederick&quot;, but it will fail to identify the start of the name in &quot;will be Dr. Robert Boyle&quot;. 
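As an illustration of how such a boundary detector fires, here is a minimal hypothetical sketch (our own helper, not the authors' implementation; the token classes are crude approximations, and <fname> is backed by a toy name list rather than the US Census data the paper uses):

```python
import re

# Simplified matchers for the generalized tokens used in the example.
GENERALIZED = {
    "<punc>": lambda t: bool(re.fullmatch(r"[^\w\s]+", t)),
    "<caps>": lambda t: t[:1].isupper(),
    "<fname>": lambda t: t in {"Robert", "Sandra"},  # toy name list
}

def token_matches(pattern_tok, tok):
    """A pattern token is either a literal or a generalized class."""
    if pattern_tok in GENERALIZED:
        return GENERALIZED[pattern_tok](tok)
    return pattern_tok == tok

def detector_matches(left, right, tokens, boundary):
    """True if the detector fires at `boundary` (a gap between tokens):
    `left` must match the tokens before the gap, `right` those after."""
    before = tokens[boundary - len(left):boundary]
    after = tokens[boundary:boundary + len(right)]
    if len(before) < len(left) or len(after) < len(right):
        return False  # detector would run off the edge of the document
    return (all(token_matches(p, t) for p, t in zip(left, before)) and
            all(token_matches(p, t) for p, t in zip(right, after)))

# The detector [will be <punc>][<caps> <fname>] from the text:
left, right = ["will", "be", "<punc>"], ["<caps>", "<fname>"]
utterance = ["will", "be", ":", "Dr", "Robert", "Boyle"]
print(detector_matches(left, right, utterance, 3))  # True
```

A boundary here is a position between tokens, which matches the paper's treatment of starts and ends as positions rather than tokens.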
The boundary detectors that find the beginnings of fragments are called the pre-patterns, and the detectors that find the ends of fragments are called the post-patterns.</Paragraph> </Section> <Section position="2" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 3.2 Pattern generation </SectionTitle> <Paragraph position="0"> As input, TPLEX requires a set of tagged seed documents for training, and an untagged corpus for learning. The seed documents are used to initialize the pre-pattern and post-pattern sets for each of the target fields. Within the seed documents, each occurrence of a fragment belonging to any of the target categories is surrounded by a set of special tags that denote the field to which it belongs.</Paragraph> <Paragraph position="1"> The algorithm parses the seed documents and identifies the tagged fragments in each document. It then generates patterns for the start and end positions of each fragment based on the surrounding tokens.</Paragraph> <Paragraph position="2"> Each pattern varies in length from 2 to n tokens. For a given pattern length ℓ, the patterns can then overlap the position by zero to ℓ tokens. For example, a pre-pattern of length four with an overlap of one will match the three tokens immediately preceding the start position of a fragment, and the one token immediately following that position. In this way, we generate Σ_{ℓ=2}^{n} (ℓ+1) initial patterns for each seed position.</Paragraph> <Paragraph position="4"> TPLEX then grows these initial sets for each field by generalizing the initial patterns generated for each seed position. We employ eight different generalization tokens when generalizing the literal tokens of the initial patterns. The wildcard token <*> matches every literal. The second type of generalization is <punc>, which matches punctuation such as commas and periods. 
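The initial-pattern enumeration described above (lengths 2 to n, each overlapping the boundary by 0 to ℓ tokens) can be sketched as follows; the helper and its argument names are ours, not the authors':

```python
def initial_patterns(tokens, boundary, max_len):
    """Enumerate maximally-specialized patterns for one seed boundary.

    For each pattern length l in 2..max_len and each overlap k in 0..l,
    take the l-k tokens before the boundary and the k tokens after it,
    giving sum over l of (l+1) candidate patterns per position.
    """
    patterns = []
    for l in range(2, max_len + 1):
        for k in range(0, l + 1):
            start = boundary - (l - k)
            if start < 0 or boundary + k > len(tokens):
                continue  # pattern would run off the document
            patterns.append(tuple(tokens[start:boundary + k]))
    return patterns

toks = ["will", "be", ":", "Dr", "Robert", "Boyle"]
for p in initial_patterns(toks, 3, 3):
    print(p)
```

For this boundary (before "Dr") and n = 3, the enumeration yields 3 + 4 = 7 patterns, matching the Σ(ℓ+1) count.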
Similarly, the token <caps> matches literals with an initial capital letter, <num> matches a sequence of digits, <alpha num> matches a literal consisting of letters followed by digits, and <num alpha> matches a literal consisting of digits followed by letters. The final two generalisations are <fname> and <lname>, which match literals that appear in a list of first and last names (respectively) taken from US Census data.</Paragraph> <Paragraph position="5"> All patterns are then applied to the entire corpus, including the seed documents. When a pattern matches a new position, the tokens at that position are converted into a maximally-specialized pattern, which is added to the pattern set. Patterns are repeatedly generalized until only one literal token remains. This whole process iterates until no new patterns are discovered. We do not generalize the new maximally-specialized patterns discovered in the unlabelled data. This ensures that all patterns are closely related to the seed data. (We experimented with generalizing patterns from the unlabelled data, but this rapidly leads to overgeneralization.) The locations in the corpus where the patterns match are regarded as potential target positions. Pre-patterns indicate potential start positions for target fragments while post-patterns indicate end positions. When all of the patterns have been matched against the corpus, each field will have a corresponding set of potential start and end positions.</Paragraph> </Section> <Section position="3" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 3.3 Notation & problem statement </SectionTitle> <Paragraph position="0"> Positions are denoted by r, and patterns are denoted by p. Formally, a pattern is equivalent to the set of positions that it extracts. The notation p → r indicates that pattern p matches position r. 
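The eight generalization tokens described in Section 3.2 can be approximated with simple predicates, as in this hypothetical sketch (the regular expressions are our guesses at the classes; underscores stand in for the spaces in <alpha num> and <num alpha>, and the real <fname>/<lname> lists come from US Census data):

```python
import re

def generalizations(tok, fnames=frozenset(), lnames=frozenset()):
    """Return the generalized tokens that match the literal `tok`."""
    gens = ["<*>"]  # the wildcard matches every literal
    if re.fullmatch(r"[^\w\s]+", tok):
        gens.append("<punc>")
    if tok[:1].isupper():
        gens.append("<caps>")
    if tok.isdigit():
        gens.append("<num>")
    if re.fullmatch(r"[A-Za-z]+[0-9]+", tok):
        gens.append("<alpha_num>")  # letters followed by digits
    if re.fullmatch(r"[0-9]+[A-Za-z]+", tok):
        gens.append("<num_alpha>")  # digits followed by letters
    if tok in fnames:
        gens.append("<fname>")
    if tok in lnames:
        gens.append("<lname>")
    return gens

print(generalizations("Robert", fnames={"Robert"}))
# ['<*>', '<caps>', '<fname>']
```

Repeatedly replacing one literal of a maximally-specialized pattern with one of these classes, until a single literal remains, produces the generalized pattern set.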
Fields are denoted by f, and F is the set of all fields.</Paragraph> <Paragraph position="1"> The labelled training data consists of a set of positions R = {...,r,...}, and a labelling function T : R → F ∪ {X}.</Paragraph> <Paragraph position="3"> T(r) = f indicates that position r is labelled with field f in the training data. T(r) = X means that r is not labelled in the training data (i.e. r is a negative example for all fields).</Paragraph> <Paragraph position="4"> The unlabelled test data consists of an additional set of positions U.</Paragraph> <Paragraph position="5"> Given this notation, the learning task can be stated concisely as follows: extend the domain of T to U, i.e. generalize from T(r) for r ∈ R, to T(r) for r ∈ U.</Paragraph> </Section> <Section position="4" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 3.4 Pattern and position scoring </SectionTitle> <Paragraph position="0"> When the patterns and positions for the fields have been identified, we must score them. Below we describe in detail the recursive manner in which we define score_f(r) in terms of score_f(p), and vice versa. Given that definition, we want to find fixed-point values for score_f(p) and score_f(r). To achieve this, we initialize the scores, and then iterate through the scoring process (i.e. calculate scores at step t+1 from scores at step t).</Paragraph> <Paragraph position="1"> This process repeats until convergence.</Paragraph> <Paragraph position="2"> Initialization. As the scores of the patterns and positions of a field are recursively dependent, we must assign initial scores to one or the other. Initially, the only elements that we can classify with certainty are the seed fragments. We initialise the scoring function by assigning scores to the positions for each of the fields. 
In this way it is then possible to score the patterns based on these initial scores.</Paragraph> <Paragraph position="3"> From the labelled training data, we derive the prior probability pi(f) that a randomly selected position belongs to field f ∈ F: pi(f) = |{r ∈ R : T(r) = f}| / |R|.</Paragraph> <Paragraph position="5"> Note that 1 − Σ_f pi(f) is simply the prior probability that a randomly selected position should not be extracted at all; typically this value is close to 1.</Paragraph> <Paragraph position="6"> Given the priors pi(f), we score each potential position r as follows: score^0_f(r) = pi(f) if r ∈ U; score^0_f(r) = 1 if r ∈ R and T(r) = f; and score^0_f(r) = 0 if r ∈ R and T(r) ≠ f.</Paragraph> <Paragraph position="8"> The first case handles positions in the unlabelled documents; at this point we don't know anything about them and so fall back to the prior probabilities. The second and third cases handle positions in the seed documents, for which we have complete information.</Paragraph> <Paragraph position="9"> Iteration. After initializing the scores of the positions, we begin the iterative process of scoring the patterns and the positions. To compute the score of a pattern p for field f we compute a positive score, pos_f(p); a negative score, neg_f(p); and an unknown score, unk(p). pos_f(p) can be thought of as a measure of the benefit of p to f, while neg_f(p) measures the harm of p to f, and unk(p) measures the uncertainty about the field with which p is associated.</Paragraph> <Paragraph position="10"> These quantities are defined as follows: pos_f(p) is the average score for field f of positions extracted by p. 
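The initialization step just described, priors from the labelled positions followed by per-position starting scores, might be rendered as follows (a hypothetical sketch; the data layout is our own assumption):

```python
def priors(labels, fields):
    """pi(f): fraction of labelled positions tagged with field f.
    `labels` maps each labelled position to its field, or to None
    when the position is unlabelled in the seed data (T(r) = X)."""
    n = len(labels)
    return {f: sum(1 for v in labels.values() if v == f) / n for f in fields}

def init_scores(labels, unlabelled, fields):
    """score^0_f(r): the prior for unlabelled r; 1/0 for seed positions."""
    pi = priors(labels, fields)
    score = {}
    for r in unlabelled:                 # fall back to the priors
        score[r] = dict(pi)
    for r, field in labels.items():      # seed positions are known exactly
        score[r] = {f: 1.0 if f == field else 0.0 for f in fields}
    return score

labels = {0: "speaker", 1: None, 2: None, 3: "location"}
score = init_scores(labels, unlabelled=[10, 11], fields=["speaker", "location"])
print(score[10])   # priors: {'speaker': 0.25, 'location': 0.25}
print(score[0])    # seed:   {'speaker': 1.0, 'location': 0.0}
```

Note that positions labelled None still count in the denominator, which is why 1 minus the sum of the priors is the probability that a position should not be extracted at all.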
We first compute pos_f(p) = (1/Z_p) Σ_{p→r} score^t_f(r),</Paragraph> <Paragraph position="12"> where Z_p = Σ_f Σ_{p→r} score^t_f(r) is a normalizing constant to ensure that Σ_f pos_f(p) = 1.</Paragraph> <Paragraph position="13"> For each field f and pattern p, neg_f(p) is the extent to which p extracts positions whose field is not f: neg_f(p) = 1 − pos_f(p).</Paragraph> <Paragraph position="14"> Finally, unk(p) measures the degree to which p extracts positions whose field is unknown: it is the average of unk(r) over the positions r extracted by p,</Paragraph> <Paragraph position="16"> where unk(r) measures the degree to which position r is unknown. To be completely ignorant of a position's field is to fall back on the prior field probabilities pi(f).</Paragraph> <Paragraph position="17"> Therefore, we calculate unk(r) from the sum of squared differences between score^t_f(r) and pi(f): unk(r) = 1 − (1/Z) Σ_f (score^t_f(r) − pi(f))^2.</Paragraph> <Paragraph position="19"> The normalization constant Z ensures that unk(r) = 0 for the position r whose scores are the most different from the priors; i.e., r is the &quot;least unknown&quot; position. For each field f and pattern p, score^t_f(p) is then defined in terms of pos_f(p), neg_f(p) and unk(p).</Paragraph> <Paragraph position="21"> This definition penalizes patterns that are either inaccurate or have low coverage.</Paragraph> <Paragraph position="22"> Finally, we complete the iterative step by calculating a revised score for each position: score^{t+1}_f(r) = score^0_f(r) if r ∈ R, and score^{t+1}_f(r) = (Σ_{p→r} score^t_f(p) − min) / (max − min) if r ∈ U,</Paragraph> <Paragraph position="24"> where min = min_{f,r} Σ_{p→r} score^t_f(p) and max = max_{f,r} Σ_{p→r} score^t_f(p) are used to normalize the scores to ensure that the scores of unlabelled positions never exceed the scores of labelled positions. The first case in the definition of score^{t+1}_f(r) handles positive and negative seeds (i.e. positions in labelled texts); the second case is for unlabelled positions.</Paragraph> <Paragraph position="25"> We iterate this procedure until the scores of the patterns and positions converge. 
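One half of the iteration, scoring patterns from position scores, can be sketched as follows (a hypothetical rendering of pos_f(p) and neg_f(p) only; representing a pattern by the set of positions it extracts follows Section 3.3):

```python
def pattern_pos_neg(extracts, score, fields):
    """pos_f(p): normalized sum, over positions r extracted by p, of
    score^t_f(r); neg_f(p) = 1 - pos_f(p).
    `extracts` maps each pattern to the positions it matches."""
    pos, neg = {}, {}
    for p, positions in extracts.items():
        raw = {f: sum(score[r][f] for r in positions) for f in fields}
        z = sum(raw.values()) or 1.0   # Z_p: ensures sum_f pos_f(p) = 1
        pos[p] = {f: raw[f] / z for f in fields}
        neg[p] = {f: 1.0 - pos[p][f] for f in fields}
    return pos, neg

# One seed position and one unlabelled position extracted by pattern p1:
score = {0: {"speaker": 1.0, "location": 0.0},
         1: {"speaker": 0.25, "location": 0.25}}
extracts = {"p1": [0, 1]}
pos, neg = pattern_pos_neg(extracts, score, ["speaker", "location"])
print(pos["p1"])
```

The other half of the iteration then feeds these pattern scores back into revised position scores, and the two updates alternate until convergence.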
Specifically, we stop when Σ_f Σ_r |score^{t+1}_f(r) − score^t_f(r)| < th.</Paragraph> <Paragraph position="27"> In our experiments, we fixed th = 1.</Paragraph> </Section> <Section position="5" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 3.5 Position filtering & fragment identification </SectionTitle> <Paragraph position="0"> Due to the nature of the pattern generation strategy, many more candidate positions will be identified than there are targets in the corpus. Before we can proceed with matching start and end positions to form fragments, we filter the positions to remove the weaker candidates. We rank all of the positions for each field according to their score. We then select positions with a score above a threshold b as potential positions. In this way we reduce the number of candidate positions from tens of thousands to a few hundred.</Paragraph> <Paragraph position="1"> The next step in the process is to identify complete fragments within the corpus by matching pre-positions with post-positions. To do this we compute the length probabilities for the fragments of field f based on the lengths of the seed fragments of f. Suppose that position r1 has been identified as a possible start for field f, and position r2 has been identified as a possible end for field f, and let P_f(ℓ) be the fraction of field f seed fragments with length ℓ. Then the fragment e = (r1,r2) is assigned a score score_f(e) = score_f(r1) · score_f(r2) · P_f(r2 − r1 + 1).</Paragraph> <Paragraph position="2"> Despite these measures, overlapping fragments still occur. Since the correct fragments cannot overlap, we know that if two extracted fragments overlap, at least one must be wrong. We resolve overlapping fragments by calculating the set of non-overlapping fragments that maximises the total score while also accounting for the expected rate of occurrence of fragments from each field in a given document.</Paragraph> <Paragraph position="3"> In more detail, let E be the set of all fragments extracted from some particular document D. 
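The fragment scoring just described, combining two boundary scores with the seed-derived length distribution P_f, can be sketched with these hypothetical helpers:

```python
from collections import Counter

def length_probs(seed_lengths):
    """P_f(l): fraction of field f's seed fragments that have length l."""
    counts = Counter(seed_lengths)
    total = len(seed_lengths)
    return {l: c / total for l, c in counts.items()}

def fragment_score(start_score, end_score, r1, r2, p_len):
    """score_f(e) = score_f(r1) * score_f(r2) * P_f(r2 - r1 + 1)."""
    return start_score * end_score * p_len.get(r2 - r1 + 1, 0.0)

p_len = length_probs([2, 2, 3])   # seed fragments of lengths 2, 2, 3
print(fragment_score(0.9, 0.8, r1=5, r2=6, p_len=p_len))
```

A candidate pairing whose length never occurs among the seeds gets probability zero, which prunes implausibly long or short fragments before overlap resolution.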
We are interested in the score of some subset G ⊆ E of D's fragments. Let score(G) be the chance that G is the correct set of fragments for D. Assuming that the correctness of the fragments can be determined independently, given that the correct number of fragments has been identified for each field, score(G) is defined to be zero if there exist (s1,r1),(s2,r2) ∈ G such that (s1,r1) overlaps (s2,r2), and score(G) = Π_f score(G_f) otherwise, where G_f ⊆ G is the set of fragments in G for field f. The score of G_f = {e1,e2,...} is defined as the product of the scores of its fragments,</Paragraph> <Paragraph position="5"> weighted by the fraction of training documents that have |G_f| instances of field f.</Paragraph> <Paragraph position="6"> It is infeasible to enumerate all subsets G ⊆ E, so we perform a heuristic search. The states in the search space are pairs of the form (G,P), where G is a list of good fragments (i.e. fragments that have been accepted), and P is a list of pending fragments (i.e. fragments that haven't yet been accepted or rejected).</Paragraph> <Paragraph position="7"> The search starts in the state ({},E), and states of the form (G,{}) are terminal states. The children of state (G,P) are all ways to move a single fragment from P to G. When forming a child's pending set, the moved fragment along with all fragments that it overlaps are removed (meaning that the moved fragment is selected and all fragments with which it overlaps are rejected). More precisely, the children of state (G,P) are the states (G ∪ {e}, P − ({e} ∪ {e' ∈ P : e' overlaps e})) for each e ∈ P. The search proceeds as follows. We maintain a set S of the best K non-terminal states that are to be expanded, and the best terminal state B encountered so far. Initially, S contains just the initial state. Then, the children of each state in S are generated. If there are no such children, the search halts and B is returned as the best set of fragments. Otherwise, B is updated if appropriate, and S is set to the best K of the new children. Note that K = ∞ corresponds to an exhaustive search. 
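The beam search over (G,P) states can be sketched as follows; this is a simplified hypothetical version that ranks a fragment set by the sum of its fragment scores, ignoring the per-field document-frequency weighting:

```python
def overlaps(e1, e2):
    """Fragments are (start, end) spans; inclusive overlap test."""
    return e1[0] <= e2[1] and e2[0] <= e1[1]

def beam_search(fragments, scores, k=5):
    """Pick a non-overlapping subset with (approximately) maximal total
    score. States are (accepted, pending) pairs; each expansion moves one
    pending fragment into the accepted set and rejects everything it
    overlaps, and only the k best states survive each round."""
    best, states = [], [((), tuple(fragments))]
    while states:
        children = []
        for accepted, pending in states:
            for e in pending:
                g = accepted + (e,)
                p = tuple(x for x in pending if x != e and not overlaps(x, e))
                if p:
                    children.append((g, p))   # non-terminal state
                elif sum(scores[x] for x in g) > sum(scores[x] for x in best):
                    best = list(g)            # terminal state: empty pending
        children.sort(key=lambda s: sum(scores[x] for x in s[0]), reverse=True)
        states = children[:k]                 # keep the k best states
    return sorted(best)

frags = [(0, 2), (1, 3), (4, 5)]
scores = {(0, 2): 0.9, (1, 3): 0.5, (4, 5): 0.7}
print(beam_search(frags, scores))   # (0,2) and (1,3) overlap; (1,3) loses
```

With k set to the number of possible states this degenerates to exhaustive search, mirroring the paper's remark about K = ∞.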
In our experiments, we used K = 5.</Paragraph> </Section> </Section> </Paper>