<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0313"> <Title>A Corpus-Based Approach for Building Semantic Lexicons</Title> <Section position="3" start_page="0" end_page="118" type="intro"> <SectionTitle> 2 Generating a Semantic Lexicon </SectionTitle> <Paragraph position="0"> Our work is based on the observation that category members are often surrounded by other category members in text, for example in conjunctions (lions and tigers and bears), lists (lions, tigers, bears...), appositives (the stallion, a white Arabian), and nominal compounds (Arabian stallion; tuna fish). Given a few category members, we wondered whether it would be possible to collect surrounding contexts and use statistics to identify other words that also belong to the category. Our approach was motivated by Yarowsky's word sense disambiguation algorithm (Yarowsky, 1992) and the notion of statistical salience, although our system uses somewhat different statistical measures and techniques.</Paragraph> <Paragraph position="1"> We begin with a small set of seed words for a category. We experimented with different numbers of seed words, but were surprised to find that only 5 seed words per category worked quite well. As an example, the seed word lists used in our experiments are shown below.</Paragraph> <Paragraph position="2"> Energy: fuel gas gasoline oil power Financial: bank banking currency dollar money Military: army commander infantry soldier troop Vehicle: airplane car jeep plane truck Weapon: bomb dynamite explosives gun rifle The input to our system is a text corpus and an initial set of seed words for each category. Ideally, the text corpus should contain many references to the category. Our approach is designed for domain-specific text processing, so the text corpus should be a representative sample of texts for the domain and the categories should be semantic classes associated with the domain. 
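The seed lists above are small enough to write down directly as data; a minimal sketch of the system's input format in Python (the `SEED_WORDS` name is ours, not the paper's):

```python
# Seed word lists from the paper: five seed nouns per semantic category.
SEED_WORDS = {
    "Energy":    {"fuel", "gas", "gasoline", "oil", "power"},
    "Financial": {"bank", "banking", "currency", "dollar", "money"},
    "Military":  {"army", "commander", "infantry", "soldier", "troop"},
    "Vehicle":   {"airplane", "car", "jeep", "plane", "truck"},
    "Weapon":    {"bomb", "dynamite", "explosives", "gun", "rifle"},
}

# Every category starts from exactly five seed words.
assert all(len(seeds) == 5 for seeds in SEED_WORDS.values())
```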
Given a text corpus and an initial seed word list for a category C, the algorithm for building a semantic lexicon is as follows: 1. We identify all sentences in the text corpus that contain one of the seed words. Each sentence is given to our parser, which segments the sentence into simple noun phrases, verb phrases, and prepositional phrases. For our purposes, we do not need any higher-level parse structures.</Paragraph> <Paragraph position="3"> 2. We collect small context windows surrounding each occurrence of a seed word as a head noun in the corpus. Restricting the seed words to be head nouns ensures that the seed word is the main concept of the noun phrase. Also, this reduces the chance of finding different word senses of the seed word (though multiple noun word senses may still be a problem). We use a very narrow context window consisting of only two words: the first noun to the word's right and the first noun to its left. We collected only nouns under the assumption that most, if not all, true category members would be nouns.3</Paragraph> <Paragraph position="4"> The context windows do not cut across sentence boundaries. Note that our context window is much narrower than those used by other researchers (Yarowsky, 1992). We experimented with larger window sizes and found that the narrow windows more consistently included words related to the target category.</Paragraph> <Paragraph position="5"> Given the context windows for a category, we compute a category score for each word, which is essentially the conditional probability that the word appears in a category context. The category score of a word W for category C is defined as: score(W, C) = freq. of W in C's context windows / freq. of W in corpus</Paragraph> <Paragraph position="7"> Note that this is not exactly a conditional probability because a single word occurrence can belong to more than one context window. 
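The window collection and scoring just described can be sketched as follows, assuming each sentence has already been reduced to its sequence of nouns (so the "first noun to either side" of a seed is simply the adjacent list element). The function name is ours; the real system works from parser output:

```python
from collections import Counter

def category_scores(noun_sentences, seeds):
    """score(W, C) = freq. of W in C's context windows / freq. of W in corpus.

    noun_sentences: one list of nouns per sentence, so context windows
    never cross a sentence boundary.
    """
    corpus_freq = Counter()
    window_freq = Counter()
    for nouns in noun_sentences:
        corpus_freq.update(nouns)
        for i, noun in enumerate(nouns):
            if noun in seeds:
                if i > 0:                       # first noun to the left
                    window_freq[nouns[i - 1]] += 1
                if i + 1 < len(nouns):          # first noun to the right
                    window_freq[nouns[i + 1]] += 1
    # One occurrence can fall inside two windows, so a score may exceed 1.
    return {w: window_freq[w] / corpus_freq[w] for w in window_freq}
```

On the paper's example sentence reduced to its nouns, `category_scores([["AK-47", "gun", "M-16", "rifle"]], {"gun", "rifle"})` gives M-16 a score of 2.0: its single occurrence falls in the windows of both gun and rifle.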
For example, consider the sentence: I bought an AK-47 gun and an M-16 rifle. The word M-16 would be in the context windows for both gun and rifle even though there was just one occurrence of it in the sentence. Consequently, the category score for a word can be greater than 1. Next, we remove stopwords, numbers, and any words with a corpus frequency < 5. We used a stopword list containing about 30 general nouns, mostly pronouns (e.g., I, he, she, they) and determiners (e.g., this, that, those). The stopwords and numbers are not specific to any category and are common across many domains, so we felt it was safe to remove them. The remaining nouns are sorted by category score and ranked so that the nouns most strongly associated with the category appear at the top. The top five nouns that are not already seed words are added to the seed word list dynamically. We then go back to Step 1 and repeat the process. This bootstrapping mechanism dynamically grows the seed word list so that each iteration produces a larger category context. In our experiments, the top five nouns were added automatically without any human intervention, but this sometimes allows non-category words to dilute the growing seed word list. A few inappropriate words are not likely to have much impact, but many inappropriate words or a few highly frequent words can weaken the feedback process. One could have a person verify that each word belongs to the target category before adding it to the seed word list, but this would require human interaction at each iteration of the feedback cycle. We decided to see how well the technique could work without this additional human interaction, but the potential benefits of human feedback still need to be investigated. After several iterations, the seed word list typically contains many relevant category words. But more importantly, the ranked list contains many additional category words, especially near the top. 
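The full bootstrapping loop can be sketched end to end. This is a self-contained, simplified version under our own names and defaults: the real system parses raw text, uses a stopword list, and drops words with corpus frequency below 5, whereas here the noun sequences are given and `min_freq` defaults to 1 so that toy corpora work:

```python
from collections import Counter

def bootstrap_lexicon(noun_sentences, seeds, stopwords=frozenset(),
                      iterations=8, per_round=5, min_freq=1):
    """Grow the seed list for several iterations; return the final ranked list."""
    seeds = set(seeds)
    ranked = []
    for _ in range(iterations):
        corpus_freq, window_freq = Counter(), Counter()
        for nouns in noun_sentences:
            corpus_freq.update(nouns)
            for i, noun in enumerate(nouns):
                if noun in seeds:               # window: one noun to each side
                    if i > 0:
                        window_freq[nouns[i - 1]] += 1
                    if i + 1 < len(nouns):
                        window_freq[nouns[i + 1]] += 1
        ranked = sorted(
            (w for w in window_freq
             if w not in stopwords and corpus_freq[w] >= min_freq),
            key=lambda w: window_freq[w] / corpus_freq[w],
            reverse=True)
        # Promote the top five nouns that are not already seeds -- with no
        # human review, which is exactly the dilution risk discussed above.
        seeds.update([w for w in ranked if w not in seeds][:per_round])
    return ranked   # the seed list itself is discarded
```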
The number of iterations can make a big difference in the quality of the ranked list. Since new seed words are generated dynamically without manual review, the quality of the ranked list can deteriorate rapidly when too many non-category words become seed words. In our experiments, we found that about eight iterations usually worked well.</Paragraph> <Paragraph position="8"> The output of the system is the ranked list of nouns after the final iteration. The seed word list is thrown away. Note that the original seed words were already known to be category members, and the new seed words are already in the ranked list because that is how they were selected.2 Finally, a user must review the ranked list and identify the words that are true category members. How one defines a &quot;true&quot; category member is subjective and may depend on the specific application, so we leave this exercise to a person. Typically, the words near the top of the ranked list are highly associated with the category, but the density of category words decreases as one proceeds down the list. The user may scan down the list until a sufficient number of category words is found, or as long as time permits. The words selected by the user are added to a permanent semantic lexicon with the appropriate category label.</Paragraph> <Paragraph position="9"> Our goal is to allow a user to build a semantic lexicon for one or more categories using only a small set of known category members as seed words and a text corpus. The output is a ranked list of potential category words that a user can review to create a semantic lexicon quickly. The success of this approach depends on the quality of the ranked list, especially the density of category members near the top. 
In the next section, we describe experiments to evaluate our system.</Paragraph> <Paragraph position="10"> 2 It is possible that a word may be near the top of the ranked list during one iteration (and subsequently become a seed word) but become buried at the bottom of the ranked list during later iterations. However, we have not observed this to be a problem so far.</Paragraph> </Section></Paper>