<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1910"> <Title>Experiments Adapting an Open-Domain Question Answering System to the Geographical Domain Using Scope-Based Resources</Title> <Section position="4" start_page="69" end_page="69" type="metho"> <SectionTitle> 3 System Description </SectionTitle> <Paragraph position="0"> GeoTALP-QA has been developed within the framework of the ALIADO project. The system architecture follows a common schema with three phases that are performed sequentially without feedback: Question Processing (QP), Passage Retrieval (PR) and Answer Extraction (AE). More details about this architecture can be found in (Ferrés et al., 2005) and (Ferrés et al., 2004). Before describing these subsystems, we introduce some additional knowledge sources that have been added to our system for dealing with the geographical domain, and some language-dependent NLP tools for English and Spanish. Our aim is to develop a language-independent system (at least one able to work with English and Spanish). Language-dependent components are only included in the Question Pre-processing and Passage Pre-processing components, and can easily be substituted by components for other languages.</Paragraph> <Section position="1" start_page="69" end_page="69" type="sub_section"> <SectionTitle> 3.1 Additional Knowledge Sources </SectionTitle> <Paragraph position="0"> One of the most important tasks in GDQA is to detect and classify NEs with their correct geographical subclass (see the classes in Section 3.3). We use geographical scope-based Knowledge Bases (KBs) to solve this problem.</Paragraph> <Paragraph position="1"> These KBs can be built using the following resources: * GEOnet Names Server (GNS). A worldwide gazetteer, excluding the USA and Antarctica, with 5.3 million entries.</Paragraph> </Section> </Section> <Section position="5" start_page="69" end_page="71" type="metho"> <SectionTitle> * Geographic Names Information System (GNIS) </SectionTitle> <Paragraph position="0"> A gazetteer with 2.0 million entries about geographic features of the USA.</Paragraph> <Paragraph position="1"> * Grammars for creating NE aliases. Geographic NEs tend to occur in a great variety of forms, and it is important to take this into account to avoid losing occurrences. A set of expansion patterns has been created (e.g. &lt;toponym&gt; Mountains, &lt;toponym&gt; Range, &lt;toponym&gt; Chain).</Paragraph> <Paragraph position="2"> * Trigger Words Lexicon. A lexicon containing trigger words (including multi-word terms) is used to allow local disambiguation of ambiguous NEs, both in the questions and in the retrieved passages.</Paragraph> <Paragraph position="3"> Working with geographical scopes avoids many ambiguity problems, but even within a scope such problems occur: * Referent ambiguity problem. This problem occurs when the same name is used for several locations (of the same or different classes). In a question it is sometimes impossible to resolve this ambiguity, in which case we have to accept as correct all of the possible interpretations (or a superclass of them). Otherwise, a trigger phrase pattern can be used to resolve the ambiguity (e.g. &quot;Madrid&quot; is an ambiguous NE, but in the phrase &quot;comunidad de Madrid&quot; (State of Madrid) the ambiguity is resolved). Given a scope, we automatically obtain the most common trigger phrase patterns of the scope from the GNS gazetteer.</Paragraph> <Paragraph position="4"> * Reference ambiguity problem. This problem occurs when the same location can have more than one name (frequent in Spanish texts, as many place names occur in languages other than Spanish, such as Basque, Catalan or Galician). Our approach to this problem is to group together all the geographical names that refer to the same location; all occurrences of geographical NEs, in both questions and passages, are then substituted by the identifier of the group they belong to.</Paragraph> <Paragraph position="5"> We used the geographical knowledge available in the GNS gazetteer to obtain these groups of geographical NEs. First, for each place name in the scope-based GNS gazetteer we obtained all the NEs that share the same feature designation code, latitude and longitude. For each group, we then selected an identifier by choosing one of its NEs with the following heuristics: the GNS field &quot;native&quot; tells whether a place name is native, conventional, a variant, or not verified, so names are prioritized in that order: native, conventional, variant, unverified. If more than one place name in the group has the same name type, greater length gives higher priority to be the cluster representative. Establishing these priorities among the place names of a group is necessary because some retrieval engines (e.g. web search engines) do not allow long queries.</Paragraph>
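<Paragraph position="6"> This grouping and representative-selection heuristic can be summarized in a short sketch. The following Python fragment is for illustration only: the record field names and the encoding of the GNS &quot;native&quot; field are assumptions, not the actual GNS schema.

```python
# Sketch of the NE grouping heuristic described above. Field names
# (feature_code, lat, lon, name_type, name) are assumed for clarity.
from itertools import groupby

# Priority order from the text: native > conventional > variant > unverified.
NAME_TYPE_PRIORITY = {"native": 0, "conventional": 1, "variant": 2, "unverified": 3}

def group_key(entry):
    # Names of the same location share feature designation code,
    # latitude and longitude in the GNS gazetteer.
    return (entry["feature_code"], entry["lat"], entry["lon"])

def representative(group):
    # Lower name-type rank wins; ties are broken by preferring the
    # longer name, as stated in the text.
    return min(group, key=lambda e: (NAME_TYPE_PRIORITY[e["name_type"]],
                                     -len(e["name"])))

def build_name_groups(gazetteer):
    """Map each place name to the identifier (representative) of its group."""
    entries = sorted(gazetteer, key=group_key)  # groupby needs sorted input
    name_to_group_id = {}
    for _, grp in groupby(entries, key=group_key):
        grp = list(grp)
        rep = representative(grp)["name"]
        for e in grp:
            name_to_group_id[e["name"]] = rep
    return name_to_group_id
```
</Paragraph>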
<Section position="1" start_page="70" end_page="70" type="sub_section"> <SectionTitle> 3.2 Language-Dependent Processing Tools </SectionTitle> <Paragraph position="0"> A set of general-purpose NLP tools is used for Spanish and English. The same tools are used for the linguistic processing of both the questions and the passages (see (Ferrés et al., 2005) and (Ferrés et al., 2004) for a more detailed description of these tools). The tools used for Spanish are: * FreeLing, which performs tokenization, morphological analysis, POS tagging, lemmatization, and partial parsing.</Paragraph> <Paragraph position="1"> * ABIONET, a NE Recognizer and Classifier (NERC) over basic categories.</Paragraph> <Paragraph position="2"> * EuroWordNet, used to obtain a list of synsets, a list of hypernyms of each synset, and the Top Concept Ontology class.</Paragraph> <Paragraph position="3"> The following tools are used to process English: * TnT, a statistical POS tagger.</Paragraph> <Paragraph position="4"> * WordNet lemmatizer 2.0.</Paragraph> <Paragraph position="5"> * ABIONET.</Paragraph> <Paragraph position="6"> * WordNet 1.5.</Paragraph> <Paragraph position="7"> * A modified version of the Collins parser.</Paragraph> <Paragraph position="8"> * Alembic, a NERC with MUC classes.</Paragraph> </Section> <Section position="2" start_page="70" end_page="71" type="sub_section"> <SectionTitle> 3.3 Question Processing </SectionTitle> <Paragraph position="0"> The main goal of this subsystem is to detect the Question Type (QT), the Expected Answer Type (EAT), the question logic predicates, and the question analysis. This information is needed by the other subsystems, and we use a language-independent formalism to represent it.</Paragraph> <Paragraph position="1"> We apply the processes described above to the question and the passages to obtain the following information: * Lexical and semantic information for each word: form, lemma, POS tag (Eagles or PTB tag set), semantic class and subclass of NE, and a list of EWN synsets.</Paragraph> <Paragraph position="2"> * Syntactic information: the syntactic constituent structure of the sentence, together with the dependencies and other relations between these constituents.</Paragraph>
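<Paragraph position="3"> As an illustration of the information attached to each word, the following Python sketch shows a possible per-token record; the field names and the example tag and synset identifier are assumptions made for clarity, not the system's actual formalism.

```python
# Hypothetical per-word annotation record built by the pre-processing step.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TokenAnnotation:
    form: str                          # surface form
    lemma: str
    pos: str                           # Eagles (Spanish) or PTB (English) tag
    ne_class: Optional[str] = None     # semantic class of the NE, if any
    ne_subclass: Optional[str] = None  # geographical subclass, e.g. "city"
    synsets: List[str] = field(default_factory=list)  # EWN synsets

# Example (illustrative values): the token "Barcelona" in a Spanish question.
barcelona = TokenAnnotation(form="Barcelona", lemma="barcelona",
                            pos="NP00000", ne_class="location",
                            ne_subclass="city", synsets=["ewn-00012345-n"])
```
</Paragraph>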
<Paragraph position="4"> Once this information is obtained, we can extract the information relevant to the following tasks: * Environment Building. The semantic process starts with the extraction of the semantic relations that hold between the different components identified in the question text. These relations are organized into an ontology of about 100 semantic classes and 25 relations (mostly binary) between them. Both classes and relations are related by taxonomic links. The ontology tries to reflect what is needed for an appropriate representation of the semantic environment of the question (and of the expected answer). A set of about 150 rules was built to perform this task. The ontology has been extended for the GD (see below the classes related to this domain).</Paragraph> <Paragraph position="5"> In order to determine the QT, our system uses a Prolog DCG parser. This parser uses the following features: word form, word position in the question, lemma, and part-of-speech (POS). A set of DCG rules was manually written to ensure sufficient coverage. The parser also uses external information: geographical NE subclasses, trigger words for each geographical subclass (e.g. &quot;poblado&quot; (village)), semantically related words for each subclass (e.g. &quot;water&quot;, related to sea and river), and introductory phrases for each Question Type (e.g. &quot;which extension&quot; is a phrase of the QT What area).</Paragraph> </Section> </Section> <Section position="6" start_page="71" end_page="73" type="metho"> <SectionTitle> * Semantic Constraints Extraction </SectionTitle> <Paragraph position="0"> Depending on the QT, a subset of useful items of the environment has to be selected in order to extract the answer. Accordingly, we define the set of relations (the semantic constraints) that are expected to be found in the answer. These relations are classified as mandatory (MC) (i.e. they have to be satisfied in the passage) or optional (OC) (if satisfied, the score of the answer is higher). In order to build the semantic constraints for each question, a set of 88 rules has been manually built. An example of the constraints extracted from an environment is shown in Table 2. This example shows the predicted question type, the initial predicates extracted from the question, the Environment predicates, the MCs and the OCs. The MCs are entity(4) and i_en_city(6): the first predicate refers to token number 4 (&quot;autonomia&quot; (state)) and the second to token number 6 (&quot;Barcelona&quot;).</Paragraph>
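<Paragraph position="1"> A minimal sketch of how MCs and OCs could be checked against the predicates extracted from a candidate sentence is given below; the set-based matching is a simplification (the real system matches environments through extraction and relaxation rules), and the representation is an assumption.

```python
# Simplified constraint check: a predicate such as entity(4) or
# i_en_city(6) is modeled as a (name, token_id) pair.
from typing import NamedTuple, Set, Tuple

Predicate = Tuple[str, int]

class Constraints(NamedTuple):
    mandatory: Set[Predicate]  # must all be satisfied in the passage
    optional: Set[Predicate]   # each satisfied OC raises the answer score

def candidate_score(sentence_preds: Set[Predicate], c: Constraints) -> float:
    """Reject the sentence if any MC fails; otherwise count satisfied OCs."""
    if not c.mandatory <= sentence_preds:
        return float("-inf")
    return float(len(c.optional & sentence_preds))

# The MCs from the example in the text: entity(4) and i_en_city(6).
example = Constraints(mandatory={("entity", 4), ("i_en_city", 6)},
                      optional=set())
```
</Paragraph>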
<Section position="1" start_page="72" end_page="72" type="sub_section"> <SectionTitle> 3.4 Passage Retrieval </SectionTitle> <Paragraph position="0"> We use two different approaches for Passage Retrieval: the first uses a pre-processed corpus as the document collection; the second uses the web.</Paragraph> <Paragraph position="1"> The first approach uses a pre-processed and indexed corpus with scope-related geographical information as the document collection. The processed information was used for indexing the documents; storing this information allows us to avoid the pre-processing step after retrieval. The Passage Retrieval algorithm is the same as in our ODQA system: a data-driven query relaxation technique with dynamic passages, implemented using the Lucene IR engine API (see (Ferrés et al., 2005) for more details).</Paragraph> <Paragraph position="2"> The other approach uses a search engine to get snippets with relevant information, with the expectation of high recall from few snippets. In our experiments we chose Google as the search engine, using a boolean retrieval schema that takes advantage of its phrase search option and of the Geographical KB to create queries that can retrieve highly relevant snippets. We try to maximize the number of relevant sentences with only one query per question.</Paragraph> <Paragraph position="3"> The algorithm used to build the queries is simple. First, the expansion methods described below may be applied to the keywords. Then, stop-words (including normal stop-words and some trigger words) are removed. Finally, only the nouns and verbs are extracted from the keyword list. The expansion methods used are: * Trigger Words Joining (TWJ). Uses the trigger words list and the trigger phrase pattern list (automatically generated from GNS) to join trigger phrases (e.g. &quot;isla Conejera&quot; or &quot;Sierra de los Pirineos&quot;).</Paragraph> <Paragraph position="4"> * Trigger Words Expansion (TWE). This expansion is applied to the NEs that were not detected as part of a trigger phrase. It uses the location subclass of the NE to create a keyword with the pattern TRIGGER + NE (e.g. &quot;Conejera&quot; is expanded to &quot;isla Conejera&quot;).</Paragraph> <Paragraph position="5"> * Question-based expansion. This method appends keywords or expands the query depending on the question type. For example, for a question classified as What length, trigger words and units associated with the question class, such as &quot;longitud&quot; (length) and &quot;kilómetros&quot; (kilometers), are appended to the query.</Paragraph>
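<Paragraph position="6"> The query construction described above can be sketched as follows; the tiny lexicons stand in for the GNS-derived trigger phrase list and the trigger words lexicon, and their names and contents are assumptions.

```python
# Sketch of the one-query-per-question construction (TWJ, TWE and
# stop-word removal). Lexicon contents are illustrative stand-ins.
TRIGGER_PHRASES = {("isla", "conejera")}   # TWJ: known trigger phrases
SUBCLASS_TRIGGER = {"island": "isla"}      # TWE: TRIGGER + NE pattern
STOPWORDS = {"de", "la", "el", "que"}

def build_query(tokens, ne_subclass):
    """tokens: question keywords; ne_subclass: {token: subclass} for NEs."""
    terms, i = [], 0
    while i < len(tokens):
        # TWJ: join a known two-word trigger phrase into one quoted term.
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in TRIGGER_PHRASES:
            terms.append('"%s %s"' % (tokens[i], tokens[i + 1]))
            i += 2
            continue
        tok = tokens[i]
        if tok in ne_subclass:
            # TWE: expand an NE not covered by a trigger phrase.
            trigger = SUBCLASS_TRIGGER.get(ne_subclass[tok])
            terms.append('"%s %s"' % (trigger, tok) if trigger else tok)
        elif tok.lower() not in STOPWORDS:
            terms.append(tok)
        i += 1
    return " ".join(terms)

# "Conejera" alone is expanded to the phrase query "isla Conejera".
print(build_query(["Conejera"], {"Conejera": "island"}))
```
</Paragraph>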
</Section> <Section position="2" start_page="72" end_page="73" type="sub_section"> <SectionTitle> 3.5 Answer Extraction </SectionTitle> <Paragraph position="0"> We used two systems for Answer Extraction: our ODQA system (adapted to the GD) and a frequency-based system.</Paragraph> <Paragraph position="1"> In the ODQA system, the linguistic process of analyzing the passages is similar to the one carried out on questions, and leads to the construction of the environment of each sentence. Then, a set of extraction rules is applied following an iterative approach: in the first iteration, all the MCs have to be satisfied by at least one of the candidate sentences; the iterations then proceed, relaxing the MCs, until a threshold is reached. The relaxation of the set of semantic constraints is performed by means of structural or semantic relaxation rules, using the semantic ontology. The extraction process consists of applying, to the sentences that satisfy the MCs, a set of extraction rules, each owning a credibility score. Each QT has its own subset of extraction rules, which lead to the selection of the answer.</Paragraph> <Paragraph position="2"> In order to select the answer from the set of candidates, the following scores are computed and accumulated for each candidate sentence: i) the rule score (which uses factors such as the confidence of the rule used, the relevance of the OCs satisfied in the matching, and the similarity between the NEs occurring in the candidate sentence and in the question); ii) the passage score; iii) a semantic score (see (Ferrés et al., 2005)); iv) the extraction rule relaxation level score. The answer to the question is the candidate with the best global score.</Paragraph> <Paragraph position="3"> The frequency-based extraction algorithm is quite simple. First, all snippets are pre-processed. Then, we build a ranked list of all the tokens satisfying the expected answer type of the question. The score of each token in the snippets is computed using the following formula:</Paragraph> <Paragraph position="4"> Finally, the top-ranked token is extracted.</Paragraph>
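<Paragraph position="5"> The scoring formula itself is not reproduced above; purely as an illustration of the shape of such a frequency-based extractor, the sketch below assumes a rank-weighted frequency count, which is an assumption rather than the system's actual formula.

```python
# Hypothetical frequency-based answer extractor: tokens of the expected
# answer type are counted, with earlier (higher-ranked) snippets
# weighted more. The 1/rank weighting is an assumption.
from collections import Counter

def extract_answer(snippets, matches_eat):
    """snippets: rank-ordered token lists; matches_eat: token -> bool."""
    scores = Counter()
    for rank, tokens in enumerate(snippets, start=1):
        for tok in tokens:
            if matches_eat(tok):
                scores[tok] += 1.0 / rank
    # The top-ranked token is extracted as the answer.
    return scores.most_common(1)[0][0] if scores else None
```
</Paragraph> </Section> </Section> </Paper>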