<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2111"> <Title>Concordances of Snippets</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Access to the web is ubiquitous; the web is a self-renewing language resource, and its size and variety exceed those of all previous corpora. The public web was estimated to be the counterpart of up to 28 million books already in 2002, which can be compared with the largest number of volumes held by Harvard University, about 15 million (O'Neill, Lavoie, Bennett, 2003). Excellent concordances are produced by tools mounted on regular web search engines, but these tools are not suitable for quick lookups on the web because it takes time to collect ad-hoc corpora with occurrences of a queried word or phrase. Is it possible to get a web concordance in an instant? There is actually one currently being developed and engines are used to produce web concordances.</Paragraph> <Paragraph position="1"> Search engines improve constantly. For instance, it is no longer true that &quot;Some search engines, including Google, FAST and Lycos, do not support wildcards at all&quot; (Kehoe and Renouf, 2002). In Google, wildcards are available for words, and in AltaVista wildcards were available for both words and characters until its unfortunate recent death (1st April 2004). Google covers 4.28 billion web-pages, it has snapshots of the majority of them in its cache, and its result lists include snippets - short text excerpts from matching web-pages showing a search term in its closest context. Google is not immaculate. A search term cannot be longer than 10 words and it is not always included in a snippet. 
It does not support case-sensitive search or wildcards for characters.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 Tools for concordancing the web </SectionTitle> <Paragraph position="0"> Concordance tools mounted on web search engines enable users to compile their own corpora from web-pages for a chosen search term and to produce a concordance of the gathered text material.</Paragraph> <Paragraph position="1"> 2.1.1 Concordances collected in batch mode KWiCFinder seems to have been the first tool to provide linguists with KWIC concordances from the web. KWiCFinder is intended to be used in batch mode. It helps the user formulate a query; the query is submitted to AltaVista, documents are retrieved, and a KWIC concordance is produced at a rate of 5-15 online documents per minute. KWiCFinder runs as its own client application, which needs to be downloaded.</Paragraph> <Paragraph position="2"> WebConc is mounted on Google. It takes a search term from the user, accesses each web-page obtained from Google, collects all contexts of the search term and presents them as a concordance. It is perspicuous and easy to use. The maximum number of web-pages is limited to 50 in order to keep the retrieval time down, but even with the minimum of 10 web-pages it is too slow for interactive use. It is possible to limit retrieval to some chosen URL in WebConc, which probably is the best way to use the tool. [...] tested for English (http://lse.umiacs.umd.edu:8080/).</Paragraph> <Paragraph position="3"> 2.1.3 Concordances by e-mail WebCorp (Kehoe and Renouf, 2002) accesses each of the web-pages retrieved by a chosen search engine; fortunately, one does not have to wait for the results to appear on the screen because it is possible to order a concordance to be sent by e-mail. Various types of reports are available, e.g. collocates of the search term can be presented summarized in a table. 
A frequency-ordered or alphabetically ordered list of all the words on any source page is available upon clicking on a URL link. Regular expressions can be used to express form alterations. WebCorp is an excellent example of how useful search engines can be made for linguists when their power is enhanced with natural language processing.</Paragraph> </Section> </Section> </Paper>