File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-0505_metho.xml
Size: 12,674 bytes
Last Modified: 2025-10-06 14:07:21
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0505"> <Title>Towards Translingual Information Access using Portable Information Extraction</Title> <Section position="4" start_page="0" end_page="32" type="metho"> <SectionTitle> 2 Analyst Scenario </SectionTitle> <Paragraph position="0"> Figure 1 illustrates how an intelligence analyst might use the proposed system: * The analyst selects one or more Korean documents in which to search for information (this step is not shown).</Paragraph> <Paragraph position="1"> # Current affiliation: Konan Technology, Inc., Korea, nari@konantech.co.kr * Current affiliation: AT&amp;T Labs-Research, Florham Park, NJ, USA, rambow@research.att.com The two ministers agreed that any further launching of a missile by North Korean would undermine the security of Northeast Asia and the Korea, the United States and Japan should take joint steps against the North Korean missile threat.</Paragraph> <Paragraph position="2"> Hong requested that Komura work to normalize Japan's relations with North Korea, rather than cutting channels of dialogue between the two countries.</Paragraph> <Paragraph position="3"> Komura said that if North Korea continues its missile testing, the Japanese government will definitely stop making contributions to KEDO.</Paragraph> <Paragraph position="4"> The two ministers also tentatively agreed that Japanese prime minister Keizo Obuchi should make a state visit to Korea on or around March 20.</Paragraph> <Paragraph position="6"> * The analyst selects one or more scenario templates to activate in the query. Each scenario template corresponds to a specific type of event. Available scenario templates might include troop movements, acts of violence, meetings and negotiations, protests, etc. 
In Figure 1, the selected event is of type meeting (understood broadly).</Paragraph> <Paragraph position="7"> * The analyst fills in the available slots of the selected scenario template in order to restrict the search to the information considered relevant. In Figure 1, the values specified in the scenario template indicate that the information to find concerns meetings with South Korea as the location and North Korea and missiles as the issue. The analyst also specifies what information s/he wants reported when information matching the query is found. In Figure 1, the selected boxes under the Report column indicate that all information found satisfying the query should be reported except for the meeting participants.[1] * Once the analyst submits the query for evaluation, the system searches the input documents for information matching the query. As a result, a hypertext document is generated describing the information matching the query as well as the source of this information. Note that the query contains English keywords that are automatically translated into Korean prior to matching. The extracted information is presented in English after being translated from Korean. In Figure 1, the generated hypertext response indicates two documents in the input set that matched the query totally or in part. Each summary in the response includes just the translations of the extracted information that the analyst requested to be reported.</Paragraph> <Paragraph position="8"> * For each document extract matching the analyst query, the analyst can obtain a complete machine translation of the Korean document where the match was found, and where the matched information is highlighted. 
Working with a human translator, the analyst can also verify the accuracy of the reported information by accessing the documents in their original language.</Paragraph> </Section> <Section position="5" start_page="32" end_page="32" type="metho"> <SectionTitle> 3 System Design </SectionTitle> <Paragraph position="0"> Figure 2 shows the high-level design of the system. It consists of the following components: * The User Interface. The browser-based interface is for entering queries and displaying the resulting presentations.</Paragraph> <Paragraph position="1"> [Footnote 1: While in this example the exclusion of participant information in the resulting report is rather artificial, in general a scenario template may contain many different types of information, not all of which are likely to interest an analyst at once.] * The Portable Information Extractor (PIE) component.</Paragraph> <Section position="1" start_page="32" end_page="32" type="sub_section"> <SectionTitle> Extraction Pattern Library </SectionTitle> <Paragraph position="0"> The PIE component uses the Extraction Pattern Library -- which contains the set of extraction patterns learned in the lab, one set per scenario template -- to extract specific types of information from the input Korean documents, once parsed.</Paragraph> <Paragraph position="1"> * The Ranker component. This component ranks the extracted information returned by the PIE component according to how well it matches the keyword restrictions in the query. The MT component's English-to-Korean Transfer Lexicon is used to map the English keywords to corresponding Korean ones. When the match falls below a user-configurable threshold, the extracted information is filtered out.</Paragraph> <Paragraph position="2"> * The MT component. The MT component (cf. 
Lavoie et al., 2000) translates the extracted Korean phrases or sentences into corresponding English ones.</Paragraph> </Section> </Section> <Section position="6" start_page="32" end_page="34" type="metho"> <SectionTitle> * The Presentation Generator component. </SectionTitle> <Paragraph position="0"> This component generates well-organized, easy-to-read hypertext presentations by organizing and formatting the ranked extracted information. It uses existing NLG components, including the Exemplars text planning framework (White and Caldwell, 1998) and the RealPro syntactic realizer (Lavoie and Rambow, 1997).</Paragraph> <Paragraph position="1"> In our feasibility study, the majority of the effort went towards developing the PIE component, described in the next section. This component was implemented in a general way, i.e. in a way that we would expect to work beyond the specific training/test corpus described below. In contrast, we only implemented initial versions of the User Interface, Ranker and Presentation Generator components, in order to demonstrate the system concept; that is, these initial versions were only intended to work with our training/test corpus, and will require considerable further development prior to reaching operational status. For the MT component, we used an early version of the lexical transfer-based system currently under development in an ongoing SBIR Phase II project (cf. Nasr et al., 1997; Palmer et al., 1998; Lavoie et al., 2000), though with a limited lexicon specifically for translating the slot fillers in our training/test corpus.</Paragraph> <Section position="1" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 4.1 Scenario Template and Training/Test Corpus </SectionTitle> <Paragraph position="0"> For our Phase I feasibility demonstration, we chose a minimal scenario template for meeting and negotiation events consisting of one or more participant slots plus optional date and location slots. 
[Footnote 2: In the end, we did not use the 'issue' slot shown in Figure 1, as it contained more complex fillers than those that typically have been handled in IE systems.] We then gathered a small corpus of thirty articles by searching for articles containing &quot;North Korea&quot; and one or more of about 15 keywords. The first two sentences of each article (with a few exceptions) were then annotated with the slots to be extracted, leading to a total of 51 sentences containing 47 scenario templates and 89 total correct slots.</Paragraph> <Paragraph position="1"> Note that in a couple of cases more than one template was given for a single long sentence.</Paragraph> <Paragraph position="2"> When compared to the MUC scenario template task, our extraction task was considerably simpler, for the following reasons: * The answer keys only contained information that could be found within a single sentence, i.e. the answer keys did not require merging information across sentences.</Paragraph> <Paragraph position="3"> * The answer keys did not require anaphoric references to be resolved, and we did not deal with conjuncts separately.</Paragraph> <Paragraph position="4"> * We did not attempt to normalize dates or remove appositives from NPs.</Paragraph> </Section> <Section position="2" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 4.2 Extraction Pattern Learning </SectionTitle> <Paragraph position="0"> For our feasibility study, we chose to follow the AutoSlog (Lehnert et al., 1992; Riloff, 1993) approach to extraction pattern acquisition. In this approach, extraction patterns are acquired via a one-shot general-to-specific learning algorithm designed specifically for the information extraction task.[3] The learning algorithm is straightforward and depends only on the existence of a (partial) parser and a small set of general linguistic patterns that direct the creation of specific patterns. 
As a training corpus, it requires a set of texts with noun phrases annotated with the slot type to be extracted.</Paragraph> <Paragraph position="1"> To adapt the AutoSlog approach to Korean, we first devised Korean equivalents of the English patterns, two of which are shown in Figure 3. It turned out that for our corpus, we could collapse some of these patterns, though some new ones were also needed. In the end we used just nine generic patterns.</Paragraph> <Paragraph position="2"> Important issues that arose in adapting the approach were (1) greater flexibility in word order and heavier reliance on morphological cues in Korean, and (2) the predominance of light verbs (verbs with little semantic content of their own) and aspectual verbs in the chosen domain. We discuss these issues in the next two sections.</Paragraph> </Section> <Section position="3" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 4.3 Korean Parser </SectionTitle> <Paragraph position="0"> We used Yoon's hybrid statistical Korean parser (Yoon et al., 1997, 1999; Yoon, 1999) to process the input sentences prior to extraction. The parser incorporates a POS tagger and morphological analyzer and yields a dependency representation as its output.[4]</Paragraph> </Section> </Section> <Section position="7" start_page="34" end_page="35" type="metho"> <SectionTitle> Footnote 3 </SectionTitle> <Paragraph position="0"> For TIDES, we plan to use more sophisticated learning algorithms, as well as active learning techniques, such as those described in Thompson et al. (1999).</Paragraph> <Paragraph position="1"> The use of a dependency representation enabled us to handle the greater flexibility in word order in Korean. To facilitate pattern matching, we wrote a simple program to convert the parser's output to XML form. 
During the XML conversion, two simple heuristics were applied, one to recover implicit subjects, and another to correct a recurring misanalysis of noun compounds.</Paragraph> <Section position="1" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 4.4 Trigger Word Filtering and Generalization </SectionTitle> <Paragraph position="0"> In the newswire corpus we looked at, meeting events were rarely described with the verb 'mannata' ('to meet'). Instead, they were usually described with a noun that stands for 'meeting' and a light or aspectual verb, for example, 'hoyuy-lul kacta' ('to have a meeting') or 'hoyuy-lul machita' ('to finish a meeting').</Paragraph> <Paragraph position="1"> In order to acquire extraction patterns that made appropriate use of such collocations, we decided to go beyond the AutoSlog approach and explicitly group trigger words (such as 'hoyuy') into classes, and to likewise group any collocations, such as those involving light verbs or aspectual verbs. To find collocations for the trigger words, we reviewed a Korean lexical co-occurrence base which was constructed from a corpus of 40 million words (Yoon et al., 1997).</Paragraph> <Paragraph position="2"> We then used the resulting specification to filter the learned patterns to just those containing the trigger words or trigger word collocations, as well as to generalize the patterns to the word class level. Because the number of trigger words is small, this specification can be done quickly, and soon pays off in terms of time saved in manually filtering the learned patterns.</Paragraph> <Paragraph position="3"> [Footnote 4: Overall dependency precision is reported to be 89.4% (Yoon, 1999).]</Paragraph> </Section> </Section> </Paper>