File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-1102_intro.xml
Size: 4,440 bytes
Last Modified: 2025-10-06 14:03:35
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1102"> <Title>Names and Similarities on the Web: Fact Extraction in the Fast Lane</Title> <Section position="3" start_page="0" end_page="809" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Background </SectionTitle> <Paragraph position="0"> The potential impact of structured fact repositories containing billions of relations among named entities on Web search is enormous. They enable the pursuit of new search paradigms, the processing of database-like queries, and alternative methods of presenting search results. The preparation of exhaustive lists of hand-written extraction rules is impractical given the need for domain-independent extraction of many types of facts from unstructured text. In contrast, the idea of bootstrapping for relation and information extraction was first proposed in (Riloff and Jones, 1999), and successfully applied to the construction of semantic lexicons (Thelen and Riloff, 2002), named entity recognition (Collins and Singer, 1999), extraction of binary relations (Agichtein and Gravano, 2000), and acquisition of structured data for tasks such as Question Answering (Lita and Carbonell, 2004; Fleischman et al., 2003). In the context of fact extraction, the resulting iterative acquisition [?]Work done during internships at Google Inc.</Paragraph> <Paragraph position="1"> framework starts from a small set of seed facts, finds contextual patterns that extract the seed facts from the underlying text collection, identifies a larger set of candidate facts that are extracted by the patterns, and adds the best candidate facts to the previous seed set.</Paragraph> </Section> <Section position="2" start_page="0" end_page="809" type="sub_section"> <SectionTitle> 1.2 Contributions </SectionTitle> <Paragraph position="0"> Figure 1 describes an architecture geared towards large-scale fact extraction. The architecture is similar to other instances of bootstrapping for information extraction. The main processing stages are the acquisition of contextual extraction patterns given the seed facts, acquisition of candidate facts given the extraction patterns, scoring and ranking of the patterns, and scoring and ranking of the candidate facts, a subset of which is added to the seed set of the next round.</Paragraph> <Paragraph position="1"> Within the existing iterative acquisition framework, our first contribution is a method for automatically generating generalized contextual extraction patterns, based on dynamically-computed classes of similar words. Traditionally, the acquisition of contextual extraction patterns requires hundreds or thousands of consecutive iterations over the entire text collection (Lita and Carbonell, 2004), often using relatively expensive or restrictive tools such as shallow syntactic parsers (Riloff and Jones, 1999; Thelen and Riloff, 2002) or named entity recognizers (Agichtein and Gravano, 2000). Comparatively, generalized extraction patterns achieve exponentially higher coverage in early iterations. The extraction of large sets of candidate facts opens the possibility of fast-growth iterative extraction, as opposed to the de-facto strategy of conservatively growing the seed set by as few as five items (Thelen and Riloff, 2002) after each iteration.</Paragraph> <Paragraph position="2"> The second contribution of the paper is a method for domain-independent validation and ranking of candidate facts, based on a similarity measure of each candidate fact relative to the set of seed facts. Whereas previous studies assume clean text collections such as news corpora (Thelen and Riloff, 2002; Agichtein and Gravano, 2000; Hasegawa et al., 2004), the validation is essential for low-quality sets of candidate facts collected from noisy Web documents. Without it, the addition of spurious candidate facts to the seed set would result in a quick divergence of the iterative acquisition towards irrelevant information (Agichtein and Gravano, 2000). Furthermore, the finer-grained ranking induced by similarities is necessary in fast-growth iterative acquisition, whereas previously proposed ranking criteria (Thelen and Riloff, 2002; Lita and Carbonell, 2004) are implicitly designed for slow growth of the seed set.</Paragraph> </Section> </Section> class="xml-element"></Paper>