File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1319_metho.xml
Size: 10,748 bytes
Last Modified: 2025-10-06 14:07:28
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1319"> <Title>A Real-time Integration of Concept-based Search and Summarization on Chinese Websites</Title> <Section position="3" start_page="0" end_page="153" type="metho"> <SectionTitle> 2. System Configuration </SectionTitle> <Paragraph position="0"> Figure 1 and Figure 2 present the overall system configuration and data flow of the integrated system. The system consists of four main components: a concept net, a query reformulation model, a standard search engine, and a summarizer. There is also an optional component: if the user chooses, she can launch a text-to-speech (TTS) engine to read out the automatically generated summary.</Paragraph> <Paragraph position="1"> The concept net is a network of conceptual terms. It is normally constructed for a specific domain with a certain amount of human intervention. A link connects each pair of related concepts in the network, specifying the semantic relationship between them. For the current economic news domain, the main types of relationships include, but are not limited to, canonical form of, synonym of, hyponym of, hypernym of, part of, product of, and member of.</Paragraph> <Paragraph position="2"> The system accepts users' queries expressed in natural-language Chinese. The query may contain terms that are already stored in the concept network. Unlike professional searchers, most Chinese web users have little training in information retrieval and no prior knowledge of or experience with it. They may not even know where or how to search.</Paragraph> <Paragraph position="3"> To perform an initial search, they tend to use one or more very general or vague terms. Under such circumstances, system guidance and navigation are extremely important. 
One unique function of this integrated system is to intuitively lead a casual or novice user from a more general search to a more specific search until the user is satisfied with the returned information.</Paragraph> <Paragraph position="4"> For each search conducted, the query reformulation model searches the concept network for more specific terms that are relevant to the more general terms in the earlier query. For example, if a general term is a company's name, then its subsidiaries, its products, its stock symbol, its industrial code, etc. are considered to be specific information about the company.</Paragraph> <Paragraph position="5"> The query reformulation model either replaces or expands the original query with these related terms and formulates them into a standardized format. Search operators, such as AND, OR, NOT, and NEAR, are used to connect the selected terms. Each term is assigned its own weight so as to shape a new search emphasis.</Paragraph> <Paragraph position="6"> The standard search engine runs the reformulated query against the targeted database and returns N relevant documents in order of their relevance to the query (N is defined by the user and is 10 by default). At the same time, the concept terms in the original query and the corresponding specific terms extracted from the concept network are also displayed. The web interface is designed so that each of the specific terms is searchable. The user can select any of these specific terms to form a new query. The search engine takes the new query to perform the next round of search, which is a more specific search based on the user's intention. 
The iteration continues: the more specific the terms used in each search, the closer the user gets to the desired information, though he can stop after any search iteration to examine the retrieved documents.</Paragraph> <Paragraph position="7"> A text summarizer automatically generates summaries for the documents that each search iteration returns (Liu and Zhou, 2000). Together with the output summary, a selection panel is displayed that includes the key-word list (3 to 6 words on average), the headline, and the leading text (usually the first 100 characters) of the document. The selection panel lets the user examine the retrieved information more efficiently. By glancing over the key words, the user should be able to grasp the main idea of the document. He can then decide whether to skip the document or continue to learn more about it. If he chooses the latter, he can move up to the headline or the leading text, which offer more about the document content. He can, of course, move further to look at the summary, which is intended to be a mini-document of the original. If the user decides to read the entire document, he clicks on the hyperlinked title or headline of the document.</Paragraph> <Paragraph position="8"> The integrated system will go directly online, usually to a specific website, to fetch the document for the user.</Paragraph> <Paragraph position="9"> Associated with the summarizer is an optional text-to-speech (TTS) engine. The user can choose either to read the document himself or to relax and simply listen to the system's voice output.</Paragraph> <Paragraph position="10"> In this section we use an example to illustrate how the integrated system works.</Paragraph> <Paragraph position="11"> Suppose that there is a Chinese user who has no experience in searching the Internet and no idea where or how to do it. 
He has just heard some news about Legend (one of the biggest computer manufacturing companies in China) and wants to find out more on the Internet. When we make our integrated search and summarization system available to him, the only word in his mind is the name of the company, Legend. So he enters this fairly general term as his first search query and presses the Go button (see Figure 3 (a)).</Paragraph> <Paragraph position="12"> Figure 3 (b) shows what the integrated system returns to the user in response to his initial search. In addition to the top 10 most relevant documents, two more specific terms are extracted from the pre-constructed concept network, i.e., Legend Corporation and Legend computer. In the concept network these two terms are found to be closely associated with the initial search term, Legend. When the user examines the top relevant document returned, he finds that a selection panel associated with this (or each) returned document is displayed on the interface. Reading up from the last row, he notices three key words that are extracted from the document, i.e., Legend Corporation, electronic company, and internationally well-known brand. By putting these three key words together, the user should be able to grasp the main idea expressed in the document (something like &quot;Legend Corporation is an electronic company that has some internationally well-known brand names&quot;). If the user wants to know more about the document, he can move up to click the automatic summarization button or read the leading text (in this case the first 50 characters of the document are extracted). If the user wants to refer to the entire document, he can do so by clicking on the headline. The integrated system will go live to the Internet to retrieve the entire original text.</Paragraph> <Paragraph position="13"> But what if the user is not satisfied with the current answer set? 
Then he has the option of starting another round of search. To do that, he may want to narrow down his search by choosing between Legend Corporation and Legend computer, the two relatively more specific terms associated with his initial query, Legend. Let us suppose that the user decides to select Legend Corporation to conduct a new but more specific search. The search can be activated by simply clicking on this selected term. Figure 3 (e) shows the results of this search iteration. Again, in addition to the top 10 most relevant documents returned, more than 20 related terms that are considered to be conceptually relevant to the search term Legend Corporation are extracted from the concept network in this round. These terms represent even more specific concepts compared to the two specific terms returned in the first round. A detailed examination of these terms reveals that they represent the following conceptual categories: first, more specific news or information about the company, such as the head of Chinese IT, No. 1 of the top 100 most powerful electronic companies in China, and Joint venture of Legend [...] associated with each retrieved document to see which document contains the information he wants. If so, his search iteration will terminate. If not, he can reformulate his next search by selecting any one or more terms from the specific term list. In this particular case, the user may go for the first choice, since the top document retrieved is identical in the two search iterations conducted so far. The document entitled &quot;Reorganization of Legend Corporation to welcome the joining of WTO&quot; receives a relevance rate of 100% in both searches. 
This is probably the news about Legend Corporation the user is looking for.</Paragraph> <Paragraph position="14"> As mentioned above, even if the user has obtained the information he desires, he can still reformulate or expand his next search for more specific news about Legend Corporation.</Paragraph> <Paragraph position="15"> Also, the user can launch a completely new search for another company. Recall that Haier Corporation, another of the most powerful companies in China, appears in the specific term list. By clicking on this term, the user will be able to locate the latest news or other information about this new company.</Paragraph> <Paragraph position="16"> Figure 3 (a) User enters initial query. Screen title: Chinese Web News Summarizer. Callout: Enter a general term, say, &quot;Legend&quot;, the name of a large computer company in China, to begin the search.</Paragraph> <Paragraph position="17"> Figure 3 (b) Results of the first search iteration. Callout: Two more specific terms associated with Legend, i.e., Legend Corp. and Legend computer, are extracted from the concept net and displayed.</Paragraph> <Paragraph position="18"> They can be selected by the user to conduct another search.</Paragraph> </Section></Paper>