<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1083"> <Title>Browsing Help for Faster Document Retrieval</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Navigation Features </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Conceptualization 3.1.1 Description </SectionTitle>
<Paragraph position="0"> The &quot;concepts part&quot; of the interface shows several links represented by short noun phrases.</Paragraph>
<Paragraph position="1"> When the user clicks on one of these links, a new query is submitted to the engine. The documents retrieved by the first query are then filtered, and only the ones that contain the selected noun phrase are kept. This is a very convenient way to select relevant topics: the user can select the concept corresponding to his/her expectations in order to reduce the search space. For instance, the concepts retrieved for the query 'ouragan &quot;Amerique Centrale&quot; 1998' (hurricane &quot;Central America&quot; 1998) are listed in Figure 1; the numbers in brackets give the number of documents in which each concept occurs. Because concepts are extracted from the top of the list of relevant documents (according to the relevance score), they can be seen as a summary mined across them. The list contains different types of concepts, from noun groups to proper nouns. At the top of the list comes the answer to the current question (Q1056): ouragan Mitch, Mitch and cyclone Mitch (hurricane Mitch, Mitch and cyclone Mitch). A click on one of these links leads directly to the documents containing the text string, and thus to the relevant documents.</Paragraph>
<Paragraph position="2"> This way of browsing is even more useful when the engine is not able to resolve an ambiguity. In a perfect world, a query divides the document space into two parts: the relevant and the non-relevant documents. However, what is relevant with regard to a query might not be relevant according to the user. Everybody knows that a search engine often returns non-relevant documents. This is due both to the complexity of language and to the difficulty of expressing an information need in a few words. Because an engine may not correctly fit the needs of the user, the proposed way of browsing within the retrieved documents is very handy: the user can select the relevant concepts. Of course, it is also possible to select several concepts, eliminate several others, and then resubmit a query.</Paragraph>
<Paragraph position="3"> As the search engine indexes the documents, several linguistic analyses are applied to each of them in order to detect all possible concepts.</Paragraph>
<Paragraph position="4"> Morpho-syntactic analysis is needed for concept detection because most of the patterns are based on Part-of-Speech sequences. The concept detection itself is based on Finite State Automata. The automata were built by linguists in order to capture syntactic relations such as the ones cited above. For each document, the potential concepts are stored in the engine database.</Paragraph>
<Paragraph position="5"> For the purpose of concept selection, only the first 1000 documents retrieved by the engine (or fewer if the relevance scores are too low) are used. The frequencies of concept occurrences in this sub-corpus are then compared with their frequencies in the entire corpus. The selected concepts should be the best compromise between minimal ambiguity and maximal frequency. A specificity score is computed for each concept and is used to sort all the occurring noun phrases. Only the top ones are displayed; they should represent the most important concepts of the documents.</Paragraph>
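The specificity score itself is not spelled out here, so the following Python fragment is only a minimal sketch of the idea under stated assumptions: a relative-frequency ratio between the retrieved sub-corpus and the whole corpus, weighted by sub-corpus frequency so that rare, spurious phrases do not dominate. All names are illustrative, not the authors' implementation.

    from collections import Counter
    from math import log

    def specificity_scores(sub_concepts, corpus_freq, sub_size, corpus_size):
        """Rank candidate concepts by how over-represented they are in the
        retrieved sub-corpus (at most 1000 documents) compared with the
        whole corpus. sub_concepts is the list of concept occurrences
        mined from the sub-corpus; corpus_freq maps each concept to its
        corpus-wide occurrence count."""
        sub_freq = Counter(sub_concepts)
        scored = {}
        for concept, f_sub in sub_freq.items():
            f_all = corpus_freq.get(concept, f_sub)
            # Over-representation of the concept in the sub-corpus.
            ratio = (f_sub / sub_size) / (f_all / corpus_size)
            # Weight by sub-corpus frequency: the compromise between
            # minimal ambiguity and maximal frequency mentioned above.
            scored[concept] = f_sub * log(ratio)
        return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

Only the top-scoring noun phrases of such a ranking would then be displayed as links.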
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Named entities </SectionTitle>
<Paragraph position="0"> The last area of the interface shows several named entities: locations, people and organizations (see Section 4.1 for a description of the named entity recognition procedure). As with meta-data, entities can be used to restrict the search space. We can filter the documents retrieved by the original query and keep only those that contain Managua.</Paragraph>
Figure 1: Concepts for query 'ouragan &quot;Amerique Centrale&quot; 1998'
<Paragraph position="2"> Named entities become very useful when computing statistics on a corpus. For a given query, the distribution of each entity type can be computed and sorted according to a scoring function.</Paragraph>
<Paragraph position="3"> Document frequency (DF) is usually a good way to sort the result, but the relevance information provided by the search engine for the query is also very useful. The scoring function used by Intuition combines, for a given category v, the document score θ and the document rank j (1 < j < N) of each retrieved document: the parameter a modifies the importance given to the document score, and the parameter b modifies the importance given to the document ranking. Figure 2 presents the entities for locations, persons and organizations for the query 'ouragan &quot;Amerique Centrale&quot; 1998'. Numbers in parentheses represent the entity score.</Paragraph>
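Since only the ingredients of the scoring function are given here, the sketch below illustrates one plausible shape for such a function; the linear mixture a*θ + b*(N-j+1)/N and every name in it are assumptions made for illustration, not Intuition's actual formula.

    def entity_scores(docs, a=1.0, b=1.0):
        """Score the entity values of one category over a ranked result
        list. docs is a list of (theta, entities) pairs ordered by rank
        j = 1..N, where theta is the document's relevance score and
        entities holds the values of the category found in that document."""
        n = len(docs)
        scores = {}
        for j, (theta, entities) in enumerate(docs, start=1):
            # a weights the document score, b weights the rank discount.
            weight = a * theta + b * (n - j + 1) / n
            for entity in entities:
                scores[entity] = scores.get(entity, 0.0) + weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

With b = 0 this degenerates into a purely score-based sum, and with a = 0 into a purely rank-based one, which matches the role the two parameters play in the text.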
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Named Entities visualization </SectionTitle>
<Paragraph position="0"> Sometimes, additional information is insufficient or not present at all in the documents. In order to increase the browsing possibilities, specific information can be automatically extracted from the texts. For this purpose, we use a document analysis process based on transducers to detect named entities. This system was previously developed for participation in the question answering task of the TREC evaluation campaign (Voorhees, 2001). The commonly established notion of named entities has been extended to include more types: more than 50 different types of entities are recognized in French and English.</Paragraph>
<Paragraph position="1"> The document analysis system can be decomposed into two main tasks. First, a morpho-syntactic analysis is performed on the documents. Every word is reduced to its basic form, and a Part-of-Speech tag is proposed. In addition to the classical POS tags, the lexicon includes semantic information. For example, first names have a specific tag (&quot;PRENOM&quot;). These semantic tags are used in the next phase for entity recognition.</Paragraph>
<Paragraph position="2"> The transducers are then applied in cascade: every entity recognized by one transducer can be used by the next one. The analysis results in a list of entity types, values, positions and lengths in the original document.</Paragraph>
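To make the cascade concrete, here is a toy Python sketch under the assumption that each pass is a pattern substitution over the previous pass's output; the patterns, tags and two-name lexicon are invented for illustration and are far simpler than the actual transducers.

    import re

    def tag_first_names(text):
        """First pass: mark words carrying the lexicon's PRENOM tag."""
        first_names = {"Jean", "Marie"}  # toy lexicon
        for name in first_names:
            text = re.sub(rf"\b{name}\b", rf"<PRENOM>{name}</PRENOM>", text)
        return text

    def tag_persons(text):
        """Second pass: a tagged first name followed by a capitalized
        word becomes a PERSON entity, reusing the first pass's output."""
        return re.sub(r"<PRENOM>(\w+)</PRENOM> ([A-Z]\w+)",
                      r"<PERSON>\1 \2</PERSON>", text)

    print(tag_persons(tag_first_names("Jean Dupont a rencontre Marie.")))
    # -> <PERSON>Jean Dupont</PERSON> a rencontre <PRENOM>Marie</PRENOM>.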
<Paragraph position="3"> Entity recognition and extraction opens up new perspectives for browsing within documents. The most trivial use is to display entities in color according to their type. Users can then quickly filter documents mentioning the right persons or places; they can also immediately find interesting passages. Figure 3 shows a document with highlighted entities.</Paragraph>
<Paragraph position="4"> This clearly allows easier quick reading, because the most representative parts of the documents are highlighted.</Paragraph>
<Paragraph position="5"> Moreover, it is very easy to find the entities in the current document. In Fig. 4, one can immediately see which locations are mentioned (e.g. Amerique Centrale, Salvador, Honduras, Nicaragua, Managua, etc.).</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Task description </SectionTitle>
<Paragraph position="0"> The evaluation includes six interfaces, most of them with different features. They were designed in order to evaluate whether the navigation facilities proposed to users improve their ability to find relevant documents. The six interfaces query the same document base: a 775,000-article collection extracted from the French newspaper Le Monde (years 1989 to 2002).</Paragraph>
<Paragraph position="1"> The features used for each interface are listed below. Interface1: No additional navigation facilities are provided to users. A simple query box is supplied in order to query the Intuition search engine (see Section 2). A summary of 10 documents per page is presented to the user; it gives the article title, the relevance score and an abstract consisting of the first 250 bytes of the document.</Paragraph>
<Paragraph position="2"> Interface2: Equivalent to Interface1, with in addition a list of concepts in the summary presentation. Concepts are extracted according to the user query (see Section 3.1).</Paragraph>
<Paragraph position="3"> Interface3: Equivalent to Interface1, but it also displays four lists of named entities related to the documents returned by the engine. The left-hand column lists the most representative persons, cities, countries and companies (see Section 3.2).</Paragraph>
<Paragraph position="4"> Interface4: Like Interface1, except that named entities (persons, dates, cities, countries and companies) are highlighted when users open the articles (see Section 3.3).</Paragraph>
<Paragraph position="5"> Interface5: Same as Interface1, but when opening a document the user can navigate to one of the 3 similar documents proposed in an additional frame.</Paragraph>
<Paragraph position="6"> Interface6: A compilation of the additional features used in all the other interfaces. All user actions are stored in the search engine log file, so that we can evaluate how many users employ the additional features. On each visited article, users were asked, through buttons, to indicate whether the document was relevant (VALIDATION button) or not (ANNULATION button). Information such as time and user id was stored in the log file as well.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiment </SectionTitle>
<Paragraph position="0"> In order to evaluate the six interfaces, a set of queries had to be built according to the number of subjects available for the experiment. Furthermore, a specific framework was set up for each user.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Material </SectionTitle>
<Paragraph position="0"> Two sets of queries were used for this evaluation. The first is composed of 12 task description queries originating from the TREC-6 ad-hoc campaign (Voorhees and Harman, 1997).</Paragraph>
<Paragraph position="1"> Twelve descriptions were selected among the fifty proposed for the task according to their applicability to a French newspaper corpus. We deliberately selected the description part in order to have a more precise idea of which documents should be considered as relevant. Moreover, supplying a short description (2-3 words) would have led to equivalent queries at the first stage: users would probably have copied the proposed keywords in order to compose their queries. The descriptions were then translated into French by an external person (not involved in the evaluation process). The second set is composed of 6 factual questions inspired by previous TREC Question Answering evaluation campaigns (Voorhees, 2003), also translated. The subjects were asked to retrieve documents containing the answer.</Paragraph> </Section> </Section>
<Section position="7" start_page="0" end_page="1" type="metho"> <SectionTitle> ID Queries </SectionTitle>
<Paragraph position="0"> 301 Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.</Paragraph>
<Paragraph position="1"> 304 Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.</Paragraph>
<Paragraph position="2"> 305 Which are the most crashworthy, and least crashworthy, passenger vehicles?
310 Evidence that radio waves from radio towers or car phones affect brain cancer occurrence.</Paragraph>
<Paragraph position="3"> 311 Document will discuss the theft of trade secrets along with the sources of information: trade journals, business meetings, data from Patent Offices, trade shows, or analysis of a competitor's products.</Paragraph>
<Paragraph position="4"> 322 Isolate instances of fraud or embezzlement in the international art trade.</Paragraph>
<Paragraph position="5"> 326 Any report of a ferry sinking where 100 or more people lost their lives.</Paragraph>
<Paragraph position="6"> 327 Identify a country or a city where there is evidence of human slavery being practiced in the eighties or nineties.</Paragraph>
<Paragraph position="7"> 331 What criticisms have been made of World Bank policies, activities or personnel?
338 What adverse effects have people experienced while taking aspirin repeatedly?
339 What drugs are being used in the treatment of Alzheimer's Disease and how successful are they?
342 The end of the Cold War seems to have intensified economic competition and has started to generate serious friction between nations as attempts are made by diplomatic personnel to acquire sensitive trade and technology information or to obtain information on highly classified industrial projects. Identify instances where attempts have been made by personnel with diplomatic status to obtain information of this nature.</Paragraph>
<Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 5.2 Evaluation framework </SectionTitle>
<Paragraph position="0"> The definition of the framework was constrained by the number of subjects available for this evaluation. Because it was an internal experiment, only six persons tested the interfaces. The group was composed of 3 linguists and 3 computer scientists (2 females and 4 males) with different levels of proficiency with search engines. Each subject was given 3 queries (2 descriptive queries and 1 question) per interface, starting with Interface1 and finishing with Interface6. A cross-evaluation was used so that no two subjects employed the same interface with the same question. In the end, the 18 queries were each evaluated with every interface.</Paragraph>
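This rotation can be pictured as a Latin-square-style assignment; the paper does not give the actual schedule, so the cyclic shift below is an assumed construction that merely satisfies the stated constraints (3 queries per subject and interface, no repeated interface/query pair across subjects, all 18 queries covered by every interface).

    def assignment(n_subjects=6, n_interfaces=6, queries_per_cell=3):
        """Return {(subject, interface): [query ids]} such that no two
        subjects get the same (interface, query) pair and each of the
        18 queries is evaluated once with every interface. The split
        into 2 descriptive queries and 1 question is not modelled."""
        plan = {}
        for s in range(n_subjects):
            for i in range(n_interfaces):
                # Cyclic shift: subject s starts interface i on a
                # different block of 3 queries than every other subject.
                start = ((i + s) % n_interfaces) * queries_per_cell
                plan[(s, i)] = [start + k for k in range(queries_per_cell)]
        return plan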
<Paragraph position="1"> Because of the nature of the corpus (newspaper articles), subjects need a certain amount of time to read an article in order to judge whether it is relevant. The time available for each query was limited to 10 minutes, during which the subject was asked to retrieve a maximum of relevant documents. This is twice the time devoted to a similar task presented in (Bruza et al., 2000). We consider that the time needed to find relevant documents in a newspaper collection is greater than on the Internet for several reasons: first, redundancy is much higher on the Internet; second, a newspaper collection mostly contains long narrative articles, whereas web documents seem more structured (section titles, colors, bold and italic phrases, tables, figures, etc.), which enables quicker reading.</Paragraph>
<Paragraph position="2"> Bruza et al. compared three different kinds of interactive Internet search: the first was based on the Google search engine, the second was a directory-based search via Yahoo, and the last was a phrase-based query reformulation assisted search via the Hyperindex Browser.</Paragraph> </Section> </Section> </Paper>