File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-2208_metho.xml
Size: 11,222 bytes
Last Modified: 2025-10-06 14:10:54
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2208"> <Title>Expanding the Recall of Relation Extraction by Bootstrapping</Title> <Section position="4" start_page="56" end_page="57" type="metho"> <SectionTitle> 2 String Pattern Learning (SPL) </SectionTitle> <Paragraph position="0"> Both SPL and LRPL start with seed tuples that were extracted by the baseline KnowItAll system, with extraction frequency at or above a threshold (set to 2 in these experiments). In these experiments, we downloaded a set of sentences from the Web that contained an occurrence of at least one relation label and used this as our reservoir of unlabeled training and test sentences. We created a set of positive training sentences from those sentences that contained both argument values of a seed tuple.</Paragraph> <Paragraph position="1"> SPL employs a method similar to that of (Downey et al., 2004). It generates candidate extraction rules with a prefix context, a middle context, and a right context. The prefix is zero to L_side tokens immediately to the left of the extracted argument1, the middle context is all tokens between argument1 and argument2, and the right context is zero to L_side tokens immediately to the right of argument2. It discards patterns with more than L_mid intervening tokens or without a relation label. SPL tabulates the occurrence of such patterns in the set of positive training sentences (all sentences from the reservoir that contain both argument values from a seed tuple in either order), and also tabulates their occurrence in negative training sentences. The negative training sentences are those that have one argument value from a seed tuple and a nearest simple NP in place of the other argument value. This idea is based on that of (Ravichandran and Hovy, 2002) for a QA system. SPL learns a possibly large set of strict extraction rules that have alternating context strings and extraction slots, with no gaps or wildcards in the rules.</Paragraph> <Paragraph position="2"> SPL selects the best patterns as follows: 1. Groups the context strings that have the exact same middle string.</Paragraph> <Paragraph position="3"> 2. Selects the best pattern, having the largest pattern score PS, for each group of context strings having the same middle string.</Paragraph> <Paragraph position="5"> S_positive(p) is the set of sentences that match pattern p and include both argument values of a seed tuple. S_negative(p) is the set of sentences that match p and include just one argument value of a seed tuple (e.g. just a company or a person for CeoOf). λ is a constant for smoothing.</Paragraph>
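The pattern score formula itself is missing from this extract. The sketch below shows one plausible precision-style score together with the group-then-select step described above; the function names, the exact form of PS, and the default smoothing value are assumptions rather than the authors' code.

```python
from collections import defaultdict


def pattern_score(n_positive, n_negative, smoothing=1.0):
    """Precision-style score: matches on positive training sentences,
    penalized by matches on negative sentences plus a smoothing constant
    (assumed form, consistent with the definitions above)."""
    return n_positive / (n_positive + n_negative + smoothing)


def select_best_patterns(patterns):
    """patterns: iterable of (prefix, middle, right, n_positive, n_negative).
    Step 1: group candidate rules by their middle string.
    Step 2: keep the highest-scoring pattern in each group."""
    groups = defaultdict(list)
    for prefix, middle, right, n_pos, n_neg in patterns:
        groups[middle].append((prefix, middle, right, n_pos, n_neg))
    return [max(group, key=lambda g: pattern_score(g[3], g[4]))
            for group in groups.values()]
```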
</Section> <Section position="5" start_page="57" end_page="58" type="metho"> <SectionTitle> 3 Less Restrictive Pattern Learning (LRPL) </SectionTitle> <Paragraph position="0"> LRPL uses a more flexible rule representation than SPL. As before, the rules are based on a window of tokens to the left of the first argument, a window of middle tokens, and a window of tokens to the right of the second argument. Rather than using exact string match on a simple sequence of tokens, LRPL uses a combination of bag of words and immediately adjacent token. The left context is based on a window of L_side tokens immediately to the left of argument1. It has two sets of tokens: the token immediately to the left and a bag of words for the remaining tokens. Each of these sets may have zero or more tokens. The middle and right contexts are similarly defined. We call this representation extended bag of words.</Paragraph> <Paragraph position="1"> Here is an example of how LRPL represents the context of a training sentence with window size set to 4: &quot;Yesterday , <Arg2>Steve Ballmer</Arg2>, the Chief Executive Officer of <Arg1>Microsoft</Arg1> said that he is ...&quot;
order: arg2_arg1
values: Steve Ballmer, Microsoft
L: {yesterday} {,}
M: {,} {chief executive officer the} {of}
R: {said} {he is that}
Some of the tokens in these bags of words may be dropped when merging this context with patterns from other training sentences. Each rule also has a confidence score, learned by EM estimation.</Paragraph> <Paragraph position="2"> We experimented with simply using three bags of words as in SnowBall, but found that precision increased when we distinguished the tokens immediately adjacent to argument values from the other tokens in the left, middle, and right bags of words.</Paragraph>
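The following sketch shows how such an extended bag of words context could be assembled from the tokens around the two arguments, reproducing the example above; the function name and data layout are illustrative assumptions, not code from the paper.

```python
def extended_bow_context(left_tokens, middle_tokens, right_tokens, window=4):
    """Build an extended bag-of-words context around the two argument values.
    Each of the L / M / R sub-contexts separates the token(s) immediately
    adjacent to an argument ('left_adj' / 'right_adj') from a bag of the
    remaining lowercased tokens ('other')."""
    left = [t.lower() for t in left_tokens[-window:]]    # window left of 1st argument
    middle = [t.lower() for t in middle_tokens]          # all tokens between the arguments
    right = [t.lower() for t in right_tokens[:window]]   # window right of 2nd argument
    return {
        "L": {"left_adj": set(left[-1:]), "right_adj": set(),
              "other": set(left[:-1])},
        "M": {"left_adj": set(middle[-1:]), "right_adj": set(middle[:1]),
              "other": set(middle[1:-1])},
        "R": {"left_adj": set(), "right_adj": set(right[:1]),
              "other": set(right[1:])},
    }


# Reproduces the example above (order arg2_arg1, values Steve Ballmer, Microsoft).
ctx = extended_bow_context(
    ["Yesterday", ","],                                   # left of <Arg2>
    [",", "the", "Chief", "Executive", "Officer", "of"],  # between the arguments
    ["said", "that", "he", "is"],                         # right of <Arg1>
)
# ctx["M"] -> {'left_adj': {'of'}, 'right_adj': {','},
#              'other': {'the', 'chief', 'executive', 'officer'}}
```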
<Paragraph position="3"> Less restrictive patterns require a Named Entity Recognizer (NER), because the patterns cannot extract candidate entities by themselves.1 LRPL trains a supervised NER during bootstrapping to extract candidate entities.</Paragraph> <Paragraph position="4"> Figure 1 overviews LRPL. It consists of two bootstrapping modules: Relation NER and Relation Assessor. LRPL trains the Relation NER from seed tuples provided by the baseline KnowItAll system and unlabeled sentences in the reservoir. Then it does NE tagging on the sentences to learn the less restrictive rules and to extract candidate tuples. The learning and extraction steps of Relation Assessor are similar to those of SnowBall; it generates a set of rules and uses EM estimation to compute a confidence in each rule. When these rules are applied, the system computes a probability for each tuple based on the rule confidence, the degree of match between a sentence and the rule, and the extraction frequency.</Paragraph> <Section position="1" start_page="57" end_page="58" type="sub_section"> <SectionTitle> 3.1 Relation dependent Named Entity Recognizer </SectionTitle> <Paragraph position="0"> Relation NER leverages an off-the-shelf supervised NER, based on Conditional Random Fields (CRF). In Figure 1, TrainSentenceGenerator automatically generates training sentences from seeds and unlabeled sentences in the reservoir. TrainEntityRecognizer trains a CRF on the training sentences, and then EntityRecognizer applies the trained CRF to all the unlabeled sentences, creating entity annotated sentences.</Paragraph> <Paragraph position="1"> It can extract entities whose type matches an argument type of a particular relation. The type is not explicitly specified by a user, but is automatically determined according to the seed tuples. For example, it can extract 'City' and 'Mayor' type entities for the MayorOf(City, Mayor) relation. We describe CRF in brief, and then how to train it by bootstrapping.</Paragraph> <Paragraph position="2"> 1Although using all noun phrases in a sentence may be possible, it apparently results in low precision.</Paragraph> <Paragraph position="3"> Several state-of-the-art supervised NERs are based on a feature-rich probabilistic conditional classifier such as Conditional Random Fields (CRF) for sequential learning tasks (Lafferty et al., 2001; Rosenfeld et al., 2005). The input of the CRF is a feature sequence X of features x. In the applying phase, given X, it outputs a tag sequence T by using the learned model M_crf. In the case of NE tagging, given a sequence of tokens, it automatically generates a sequence of feature sets; each set corresponds to a token. It can incorporate into the model any properties that can be represented as a binary feature, such as words, capitalization patterns, part-of-speech tags, and the existence of the word in a dictionary. It works quite well on NE tagging tasks (McCallum and Li, 2003).</Paragraph> <Paragraph position="4"> Bootstrapping We use bootstrapping to train the CRF for relation-specific NE tagging as follows: 1) select the sentences that include all the entity values of a seed tuple, 2) automatically mark the argument values in each sentence, and 3) train the CRF on the seed marked sentences. An example of a seed marked sentence is the following:
seed tuple: <Microsoft, Steve Ballmer>
seed marked sentence: &quot;Yesterday, <Arg2>Steve Ballmer</Arg2>, CEO of <Arg1>Microsoft</Arg1> announced that ...&quot;</Paragraph> <Paragraph position="5"> Because of redundancy, we can expect to generate a fairly large number of seed marked sentences by using a few highly frequent seed tuples. To avoid overfitting on terms from these seed tuples, we substitute the actual argument values with random characters for each training sentence, preserving capitalization patterns and number of characters in each token.</Paragraph> </Section> <Section position="2" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 3.2 Relation Assessor </SectionTitle> <Paragraph position="0"> Relation Assessor employs several SnowBall-like techniques, including making rules by clustering and EM estimation of the confidence of the rules and tuples.</Paragraph> <Paragraph position="1"> In Figure 1, ContextRepresentationGenerator generates extended bag of words contexts from the entity annotated sentences and classifies the contexts into two classes: training contexts and test contexts. The training contexts are clustered based on the match score between contexts, and a rule is generated from each cluster that has average vectors over the contexts belonging to the cluster. The confidence of the rules and of the extracted tuples is then estimated by using an EM algorithm. Relation Assessor also estimates the confidence of the tuples extracted by the baseline system, and outputs the merged result tuples with confidence.</Paragraph> <Paragraph position="2"> We describe the match score calculation method, the EM algorithm, and the merging method in the following subsections.</Paragraph> <Paragraph position="3"> The match score (or similarity) M of two extended bag of words contexts c_1 and c_2 is a weighted sum of the similarities between their corresponding context vectors, where s is the index of the left, middle, or right context, t is the index of the left adjacent, right adjacent, or other tokens, and w_st is the weight corresponding to the context vector indexed by s and t.</Paragraph>
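The match score formula is not reproduced in this extract. A SnowBall-style weighted sum over the sub-context similarities, consistent with the definitions of s, t, and w_st above, might look like the sketch below; the cosine-over-token-sets similarity and the weight values are assumptions, and the context layout follows the earlier sketch in this section.

```python
from math import sqrt

# Assumed weights w_st over sub-contexts s (L / M / R) and token sets t
# (left-adjacent, right-adjacent, other); the actual values are not given here.
WEIGHTS = {
    ("L", "left_adj"): 0.2, ("L", "right_adj"): 0.0, ("L", "other"): 0.1,
    ("M", "left_adj"): 0.3, ("M", "right_adj"): 0.3, ("M", "other"): 0.2,
    ("R", "left_adj"): 0.0, ("R", "right_adj"): 0.2, ("R", "other"): 0.1,
}


def cosine(a, b):
    """Cosine similarity between two token sets, viewed as binary term vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def match_score(ctx1, ctx2, weights=WEIGHTS):
    """M(c1, c2) = sum over s and t of w_st times the similarity of the
    corresponding token sets of the two contexts."""
    return sum(weights[(s, t)] * cosine(ctx1[s][t], ctx2[s][t])
               for s in ("L", "M", "R")
               for t in ("left_adj", "right_adj", "other"))
```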
<Paragraph position="4"> To achieve high precision, Relation Assessor uses only the entity annotated sentences that have just one entity for each argument (two entities in total) and where those entities co-occur within the window of L_side left, L_mid middle, and L_side right tokens. It discards patterns without a relation label.</Paragraph> <Paragraph position="5"> Rule confidence Several rules generated from only positive evidence result in low precision (e.g. the rule &quot;of&quot; for the MayorOf relation, generated from &quot;Rudolph Giuliani of New York&quot;). This problem can be alleviated by estimating the rule confidence with an EM algorithm. This algorithm assigns a high confidence to the rules that frequently co-occur with only highly confident tuples. It also assigns a high confidence to the tuples that frequently co-occur with contexts that match high-confidence rules.</Paragraph> <Paragraph position="6"> When it merges the tuples extracted by the baseline system, the algorithm uses a fixed constant confidence value for any context that matches a baseline pattern. With this calculation, the confidence of any tuple extracted by a baseline pattern is always greater than or equal to that of any tuple that is extracted by the learned rules and has the same frequency.</Paragraph> </Section> </Section> </Paper>