File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/98/w98-1106_intro.xml
Size: 9,006 bytes
Last Modified: 2025-10-06 14:06:43
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1106"> <Title>References</Title> <Section position="3" start_page="51" end_page="52" type="intro"> <SectionTitle> 3 Generating conceptual case frames from extraction patterns </SectionTitle> <Paragraph position="0"> The algorithm for building conceptual case frames begins with extraction patterns and a semantic lexicon for the domain. The semantic lexicon is a dictionary of words that belong to relevant semantic categories. We used AutoSlog-TS to generate the extraction patterns and a corpus-based algorithm to generate the semantic lexicon. ~ The corpus-based algorithm that we used to build the semantic lexicon (Riloff and Shepherd, 1997) requires five &quot;seed words&quot; as input for each semantic category, and produces a ranked list of words that are statistically associated with each category. First, the algorithm looks for all sentences in which a seed word is used as the head noun of a noun phrase.</Paragraph> <Paragraph position="1"> For each such occurrence of a seed word, the algorithm collects a small context window around the seed word. The context window consists of the closest noun to the left of the seed word, and the closest noun to its right. The context windows for all seed words that belong to the same category are then combined, and each word is assigned a category score. The category score is (essentially) the conditional probability that the word appears in a category context. The words are ranked by this score and the top five are added to the seed word list. This bootstrapping process dynamically grows the seed word list so that each iteration produces a larger category context. After several iterations, the final list of ranked words usually contains many words that belong to the category, especially near the top. The ranked list is presented to a user, who scans down the list and removes any words that do not belong to the category. 
For more details of this algorithm, see Riloff and Shepherd (1997).</Paragraph> <Paragraph position="2"> A flowchart for the case frame generation process appears in Figure 2. AutoSlog-TS produces a ranked list of extraction patterns and our semantic lexicon generator produces a ranked list of words for each category. Generating these lists is fully automatic, but a human must review them to decide which extraction patterns and category words to keep. This is the only part of the process that involves human interaction.</Paragraph> <Paragraph position="3"> ~Other methods could be used to generate these items, including the use of existing knowledge bases such as WordNet (Miller, 1990) or Cyc (Lenat et al., 1986) if they have adequate coverage for the domain.</Paragraph> <Paragraph position="4"> [Figure 2: flowchart of the case frame generation process; surviving labels: seed words, ranked extraction patterns, ranked category words] Next, the extraction patterns are applied to the texts to generate a semantic profile for each pattern. The semantic profile shows the semantic categories that were extracted by each pattern, based on the head noun of each extraction. Figure 3 shows the semantic profile for the pattern &quot;attack on <nounphrase>&quot;. PFreq is the number of times that the extraction pattern fired, SFreq is the number of times that the pattern extracted the given semantic category, and Prob is the estimated probability of the pattern extracting the given semantic category (SFreq/PFreq). Note that many extractions will not be labeled with any semantic category if the head noun is unknown (i.e., not in the semantic lexicon).</Paragraph> <Paragraph position="5"> Figure 3 shows that attacks are often carried out on buildings, civilians, dates, government officials, locations, military people, and vehicles. It seems obvious that attacks will occur on people and on physical targets, but a person might not realize that attacks will also occur on dates (e.g., Monday) and on locations (e.g., a neighborhood). 
This example shows how the corpus-based approach can identify semantic preferences that a person might not anticipate. Also, note that the semantic profile shows no instances of attacks on terrorists or weapons, which makes sense in this domain.</Paragraph> <Paragraph position="6"> [Figure 3: semantic profile for &quot;attack on <nounphrase>&quot;] The semantic profile is used to select semantic preferences that are strong enough to become selectional restrictions. We use the following formula to identify strong semantic preferences: (SFreq > F1) or ((SFreq ≥ F2) and (Prob > P)) The first test selects semantic categories that are extracted with high frequency, under the assumption that this reflects a real association with the category. The second test selects semantic categories that represent a relatively high percentage of the extractions even though the frequency might be low (e.g., 2 out of 4 extractions). In our experiments, we chose F1=3, F2=2, and P=0.1. We used fairly lenient criteria because (a) patterns can often extract several types of objects that belong to different semantic categories, and (b) many extractions contain unknown words. Also, remember that the semantic lexicon is reliable because it was reviewed by a person, so it is usually meaningful when a pattern extracts a semantic category even once. The thresholds are needed only to eliminate noise, which can be caused by misparsed sentences or polysemous words.</Paragraph> <Paragraph position="7"> The semantic preferences are used to assign conceptual roles to each extraction pattern. At this point, one additional piece of input is needed: a list of conceptual roles and associated semantic categories for the domain. The conceptual roles identify the types of information that need to be recognized.</Paragraph> <Paragraph position="8"> Figure 4 shows the conceptual roles used for the terrorism domain.</Paragraph> <Paragraph position="9"> Each extraction pattern is expanded to include a set of conceptual roles based on its semantic preferences. 
These conceptual roles are assigned automatically based on a pattern's semantic profile. This process eliminates the need for a human to assign roles to the extraction patterns by hand, as had been necessary when using AutoSlog or AutoSlog-TS by themselves.</Paragraph> <Paragraph position="10"> For example, the pattern &quot;machinegunned <direct-obj>&quot; had strong semantic preferences for BUILDING, CIVILIAN, LOCATION, and VEHICLE, so it was expanded to have three conceptual roles with four selectional restrictions. The expanded extraction pattern for &quot;machinegunned <direct-obj>&quot; is: Only semantic categories that were associated with a pattern are included as selectional restrictions. For example, the GOVOFFICIAL category also represents possible terrorism victims, but it was not strongly associated with the pattern. Our rationale is that an individual pattern may have a strong preference for only a subset of the categories that can be associated with a role. For example, the pattern &quot;<subject> was ambushed&quot; showed a preference for VEHICLE extractions but not BUILDING extractions, which makes sense because it is hard to imagine ambushing a building. Including only VEHICLE as its selectional restriction for targets might help eliminate incorrect building extractions. One could argue that this pattern is not likely to find building extractions anyway so the selectional restriction will not matter, but the selectional restriction might help filter out incorrect extractions due to misparses or metaphor (e.g., &quot;The White House was ambushed by reporters.&quot;). Ultimately, it is an empirical question whether it is better to include all of the semantic categories associated with a conceptual role or not.</Paragraph> <Paragraph position="11"> Finally, we merge the expanded extraction patterns into multi-slot case frames. 
All extraction patterns that share the same trigger word and compatible syntactic constraints are merged into a single structure. For example, we would merge all patterns triggered by a specific verb in its passive voice: the patterns &quot;<subject> was kidnapped&quot;, &quot;was kidnapped by <noun-phrase>&quot;, and &quot;was kidnapped in <noun-phrase>&quot; would be merged into a single case frame. Similarly, we would merge all patterns triggered by a specific verb in its active voice. For instance, we would merge patterns for the active form of &quot;destroyed&quot; that extract the subject of &quot;destroyed&quot;, its direct object, and any prepositional phrases that are associated with it. We also merge syntactically compatible patterns that are triggered by the same noun (e.g., &quot;assassination&quot;) or by the same infinitive verb structure (e.g., &quot;to kill&quot;). When we merge extraction patterns into a case frame, all of the slots are simply unioned together.</Paragraph> </Section></Paper>