<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1410">
  <Title>Question Terminology and Representation for Question Type Classification</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In Information Retrieval (IR), text categorization and clustering, documents are usually indexed and represented by domain terminology: terms which are particular to the domain/topic of a document. However, when documents must be retrieved or categorized according to criteria which do not correspond to the domains, such as genre (text style) (Kessler et al., 1997; Finn et al., 2002) or subjectivity (e.g. opinion vs. factual description) (Wiebe, 2000), we must use different, domain-independent features to index and represent documents. In those tasks, selection of the features is in fact one of the most critical factors which affect the performance of a system.</Paragraph>
    <Paragraph position="1"> Question type classification is one such task, where categories are question types (e.g.</Paragraph>
    <Paragraph position="2"> 'how-to', 'why' and 'where'). In recent years, question type has been successfully used in many Question-Answering (Q&amp;A) systems for determining the kind of entity or concept being asked and extracting an appropriate answer (Voorhees, 2000; Harabagiu et al., 2000; Hovy et al., 2001). Just like genre, question types cut across domains; for instance, we can ask 'how-to' questions in the cooking domain, the legal domain, etc. However, features that constitute question types are different from those used for genre classification (typically part-of-speech or meta-linguistic features) in that they are strongly lexical due to the large amount of idiosyncrasy (keywords, idioms or syntactic constructions) that is frequently observed in question sentences. For example, we can easily think of question patterns such as "What is the best way to .." and "What do I have to do to ..". In this regard, terms which identify question type are considered to form a terminology of their own, which we define as question terminology.</Paragraph>
    <Paragraph position="3"> Terms in question terminology have two notable characteristics. First, they are mostly domain-independent, non-content words. Second, they include many closed-class words (such as interrogatives, modals and pronouns), and some open-class words (e.g. the noun "way" and the verb "do"). In a way, question terminology is a complement of domain terminology.</Paragraph>
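To make the contrast between question terminology and domain terminology concrete, the following is a minimal sketch, not the method used in this work. It assumes a small hand-picked word list (`QUESTION_TERMS`) and a helper named `split_terms`, both hypothetical and chosen only for illustration; the paper extracts its terminology automatically from a corpus.

```python
import re

# Hypothetical, hand-picked question terms for illustration only.
# This is NOT the terminology extracted in the paper.
QUESTION_TERMS = {
    "how", "what", "why", "where",      # interrogatives
    "can", "do", "is", "have",          # auxiliaries and modals
    "i", "we",                          # pronouns
    "the", "to", "in", "for", "about",  # other closed-class words
    "way", "best", "go",                # a few open-class words
}


def split_terms(question):
    """Return (question_terms, content_terms), preserving word order."""
    tokens = re.findall(r"[a-z]+", question.lower())
    question_terms = [t for t in tokens if t in QUESTION_TERMS]
    content_terms = [t for t in tokens if t not in QUESTION_TERMS]
    return question_terms, content_terms


if __name__ == "__main__":
    print(split_terms("What is the best way to clean teapots?"))
```

Under these assumptions, the words that signal the question type end up in the first list, while the domain-specific words ("clean", "teapots") end up in the second.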
    <Paragraph position="4"> Automatic extraction of question terminology is a rather difficult task, since question terms are mixed in with content terms. Another complicating factor is paraphrasing: there are many ways to ask the same question. For example: - "How can I clean teapots?" - "In what way can we clean teapots?" - "What is the best way to clean teapots?" - "What method is used for cleaning teapots?" - "How do I go about cleaning teapots?" In this paper, we present the results of our investigation into how to automatically extract question terminology from a corpus of questions and represent it for the purpose of classifying by question type. It is an extension of our previous work (Tomuro and Lytinen, 2001), where we compared automatic and manual techniques to select features from questions, but only (stemmed) words were considered for features. The focus of the current work is to investigate the kind(s) of features, rather than selection techniques, which are best suited for representing questions for classification. Specifically, from a large dataset of questions, we automatically extracted two sets of features: one set consisting of terms (i.e., lexical features) only, and another set consisting of a mixture of terms and semantic concepts (i.e., semantic features). Our particular interest is to see whether or not semantic concepts can enhance the representation of the strongly lexical nature of question sentences. To this end, we apply two machine learning algorithms (C5.0 (Quinlan, 1994) and PEBLS (Cost and Salzberg, 1993)), and compare the classification accuracy produced for the two feature sets. The results show that there is no significant increase in accuracy for either algorithm from the addition of semantic features.</Paragraph>
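The difference between the two feature sets can be sketched as follows. This is an illustrative toy under stated assumptions, not the extraction procedure of this paper: the word-to-concept map `CONCEPTS`, the label `C_METHOD`, and the function names are all hypothetical, and the source does not say here which semantic resource supplies the concepts.

```python
import re

# Hypothetical word-to-concept map, for illustration only.
CONCEPTS = {"way": "C_METHOD", "method": "C_METHOD"}


def lexical_features(question):
    """Feature set made of the lowercased words of the question."""
    return set(re.findall(r"[a-z]+", question.lower()))


def mixed_features(question):
    """Lexical features plus the concept label of any word found in CONCEPTS."""
    words = lexical_features(question)
    concepts = {CONCEPTS[w] for w in words if w in CONCEPTS}
    return words.union(concepts)


if __name__ == "__main__":
    q1 = "What is the best way to clean teapots?"
    q2 = "What method is used for cleaning teapots?"
    # Features shared by the two paraphrases under each representation.
    print(sorted(lexical_features(q1).intersection(lexical_features(q2))))
    print(sorted(mixed_features(q1).intersection(mixed_features(q2))))
```

In this toy setting the two paraphrases share one extra feature once the concept label is added, which is the kind of generalization semantic features might offer. Whether that helps a classifier is an empirical question, and the results reported above found no significant gain.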
    <Paragraph position="5"> The original motivation behind our work on question terminology was to improve the retrieval accuracy of our system called FAQFinder (Burke et al., 1997; Lytinen and Tomuro, 2002).</Paragraph>
    <Paragraph position="6"> FAQFinder is a web-based, natural language Q&amp;A system which uses Usenet Frequently Asked Questions (FAQ) files to answer users' questions. Figures 1 and 2 show an example session with FAQFinder. First, the user enters a question in natural language. The system then searches the FAQ files for questions that are similar to the user's. Based on the results of the search, FAQFinder displays a maximum of</Paragraph>
  </Section>
</Paper>