File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/a00-1038_intro.xml
Size: 3,435 bytes
Last Modified: 2025-10-06 14:00:41
<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1038"> <Title>Large-scale Controlled Vocabulary Indexing for Named Entities</Title> <Section position="2" start_page="0" end_page="276" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The goal of the Entity Indexing R&D program at LEXIS-NEXIS is to add controlled vocabulary indexing for named entities to searchable fields in appropriate news documents across thousands of news publications, where documents include both incoming news articles and news articles already in the LEXIS-NEXIS archives. A controlled vocabulary term (CVT) is a consistently specified topic indicator that users can incorporate into their queries in order to retrieve documents about the corresponding topic. When a CVT is added to an appropriate field in a document, it can be included in a Boolean query using the form field-name(controlled vocabulary term). The initial Entity Indexing release focused on companies as topics. For company indexing, the primary CVT is a standard form of the company name. When we add a company CVT to a document, we often also want to add secondary CVTs to the document that specify attributes of that company. Attributes may include the ticker symbol, SIC product codes, industry codes and company headquarters information. Secondary CVTs allow customers to easily search on groups of companies that have one or more attributes in common, such as searching for documents about banks in our set of companies that are headquartered in Utah.</Paragraph> <Paragraph position="1"> It is generally easy to get high recall with Boolean queries when searching for documents about named entities. Typically the query will only need a short form of the entity's name. For example, the query American will retrieve virtually every document that mentions American Airlines. Of course, this query results in poor precision due to the ambiguity of American.
The problem we wanted to address with controlled vocabulary indexing was to help online customers limit their search results to only those documents that contain a major reference to the topic, that is, to documents that are substantially about the topic.</Paragraph> <Paragraph position="2"> Because of the volume of news data we have, it is necessary that we fully automate the document categorization and indexing step in our data preparation process. LEXIS-NEXIS adds 100,000 news articles daily to its collection of over 2 billion documents.</Paragraph> <Paragraph position="3"> For marketing and product positioning reasons, we want to provide indexing for tens of thousands of companies, where companies are targeted based on their presence on the New York, American and NASDAQ exchanges or on revenue-based criteria.</Paragraph> <Paragraph position="4"> Although such selection criteria help us explain the product feature to customers, they do not ensure that the targeted companies actually appear all that often in the news. In fact, for many targeted companies there is little training and test data available. Our company indexing system should address the following business product requirements: * Assign primary and corresponding secondary CVTs to appropriate documents taining topic definitions Also, we target 90% recall and 95% precision when using the CVTs to retrieve major reference documents about the corresponding companies.</Paragraph> </Section> </Paper>
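
The contrast drawn in the introduction between a free-text query such as American and a field-restricted CVT query can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four toy documents, the `company` field name, the `AMERICAN AIRLINES` term and the `major_reference` gold labels are hypothetical and do not come from the LEXIS-NEXIS system. It only shows how precision and recall, the two measures behind the 90% recall and 95% precision targets, would be computed for each kind of query.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    # Hypothetical document field holding controlled vocabulary terms (CVTs).
    company: set = field(default_factory=set)
    # Hypothetical gold label: is the document substantially about the company?
    major_reference: bool = False


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document indices."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection, invented for this example.
DOCS = [
    Doc("American Airlines reported quarterly earnings.", {"AMERICAN AIRLINES"}, True),
    Doc("American Airlines adds routes to Europe.", {"AMERICAN AIRLINES"}, True),
    Doc("The American economy grew last quarter.", set(), False),
    Doc("A passenger flew American to Dallas for a conference.", set(), False),
]

# Documents that are major references to the company.
relevant = {i for i, d in enumerate(DOCS) if d.major_reference}

# Free-text Boolean query: any document containing the word "American".
free_text = {i for i, d in enumerate(DOCS) if "american" in d.text.lower()}

# Field-restricted query, in the spirit of field-name(controlled vocabulary term).
cvt_query = {i for i, d in enumerate(DOCS) if "AMERICAN AIRLINES" in d.company}

free_text_pr = precision_recall(free_text, relevant)
cvt_pr = precision_recall(cvt_query, relevant)
print("free text (precision, recall):", free_text_pr)
print("CVT field (precision, recall):", cvt_pr)
```

In this toy setup the free-text query retrieves every document, so it keeps full recall but loses precision, while the field query retrieves only the documents that carry the CVT. Real indexing quality depends on how accurately the CVTs are assigned, which this sketch takes as given.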