<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2111">
  <Title>Concordances of Snippets</Title>
  <Section position="3" start_page="1" end_page="9" type="metho">
    <SectionTitle>
3 Instant web concordances
</SectionTitle>
    <Paragraph position="0"> The web is not a true corpus: it is neither representative of anything nor balanced. Nonetheless, there is no better place to look up examples of current language use, and this is possibly also the most suitable use of the web as a language resource. Interactive use, however, which is expected of concordances in general, requires short retrieval times.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Why web concordances are slow
</SectionTitle>
      <Paragraph position="0"> WebCorp is said to be slow because &quot;the current version of WebCorp is for demonstration purposes and the speed at which results are returned will increase as the tool is developed further&quot;. But is the speed really in the hands of the system's developers? The decisive factor is the time it takes to access each web-page, which depends on the capacity of the data transmission channel and on the server that hosts the page, and which cannot even be predicted. A connection to a fast server can always be ensured by using the Google cache instead of the original URLs, but the time taken by data transmission remains a problem. One possible solution is to rely on snippets for concordance lines, which saves the time needed to collect and transfer ad-hoc web corpora.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Web concordances and online dictionaries
</SectionTitle>
      <Paragraph position="0"> The ability to access current language use on the web in an instant is crucial as a complement to online dictionaries. Dictionaries face the problem of handling two incompatible tasks simultaneously. One is to supply correct definitions and thereby preserve the usefulness of words. The other is to report on current trends in language usage, even when this means effacing meaningful differences between words. The role of an online dictionary complemented with concordances from the web would be to consider whether some popular usage may be based on confusion.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Lexware Culler
</SectionTitle>
      <Paragraph position="0"> Lexware Culler builds concordances of Google snippets, and looking up words and phrases in Lexware Culler takes the same time as it takes Google to deliver results. Language processing is applied not only to search terms but also to the snippets from Google.</Paragraph>
      <Paragraph position="1"> In addition to the variable that can be used for any word in general (*), it is possible to select words of a particular part of speech, in which case part-of-speech variables are used. Function word variables trigger expansion of a search term into alternative queries, while variables for open parts of speech are used to filter out non-matching snippets obtained for a search term. This post-filtering is available for English and Swedish and is being developed for Polish.</Paragraph>
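      <Paragraph> The expansion of function word variables into alternative queries can be sketched as follows. This is a minimal illustration, not Culler's implementation: the variable names PREP and DET, their word lists and the function expand_query are assumptions made for the example.

```python
from itertools import product

# Hypothetical variable table: names and word lists are illustrative only.
FUNCTION_WORDS = {
    "PREP": ["in", "on", "at"],
    "DET": ["a", "the"],
}


def expand_query(term):
    """Replace each function word variable by every one of its alternatives."""
    # A token that is not a known variable is kept as its own single option.
    slots = [FUNCTION_WORDS.get(token, [token]) for token in term.split()]
    return [" ".join(choice) for choice in product(*slots)]


# 3 prepositions x 2 determiners = 6 alternative queries
queries = expand_query("grunting PREP DET *")
for query in queries:
    print(query)
```

Each alternative query would then be sent to the search engine separately, which is why a single search term may give rise to many queries.</Paragraph>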
      <Paragraph position="2"> A table summarizing the results is always supplied in Culler along with the concordance lines, which often proves handy, e.g. when investigating collocations. The tool can be tested at http://82.182.103.45/lexware/concord/culler.html</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="9" type="sub_section">
      <SectionTitle>
3.4 Examples of use
</SectionTitle>
      <Paragraph position="0"> Examples provided below are representative of the uses of Lexware Culler tested so far.</Paragraph>
      <Paragraph position="1"> It is not obvious how to find a word or a phrase in a dictionary if all one has to go on is its context; this may be difficult even in a corpus unless the corpus is very large. In order to find the word for the stick used in conducting an orchestra, we made some futile checks in online dictionaries (finally we found an example with baton in Cambridge Dictionaries Online). A simple query for &quot;conductor's *&quot; in Culler yields baton directly, with several examples like ...not the first soloist to feel the lure of the conductor's baton... . A new adverbial use of the word fett (fat) has become very popular in the language of young Swedes: it has the role of a general magnifier. This use cannot be found in Swedish dictionaries or in the corpus of the Bank of Swedish. A typical context entered in Culler as a search term, &quot;det är fett *&quot; (it is fat *), gives 188 hits, of which very few are examples of the basic uses of the word; the majority of excerpts exemplify the new adverbial use.</Paragraph>
      <Paragraph position="3"> A search term can be formulated as a typical defining context, for instance &quot;Moomin is a *&quot; and &quot;Moomins are *&quot;. If the search is not limited to a specific country, the excerpts thus obtained are hard to find side by side elsewhere: Many people in Japan think that Moomin is a hippopotamus, however, it is actually a forest fairy or Moomin is a  &quot;entrepreneurs&quot;. Has the correct spelling of the Polish adverb z powrotem (back) lost out to the new one-word spelling spowrotem yet? Not yet: the latter is used on 22 400 web-pages, while the correct one is used on 83 500 web-pages. Besides such simple checks, Culler can be used to ferret out popular misinterpretations, such as those of the Swedish idiom med berått mod (in cold blood). Table 1 is a result summary for the search term &quot;med berått *&quot; and &quot;-mod&quot; (any phrase beginning with med berått, excluding those with mod).</Paragraph>
      <Paragraph position="4">  The incorrect alternative versions thus extracted are mord (murder), lugn (calm) and våld (violence). The number of web-pages returned by Google is shown in the right column; the left one contains the number of  One can learn from a dictionary which creatures typically produce grunting sounds. A further check in a large corpus yields yet more examples. The range of grunting creatures that appears in snippets for the search term &quot;grunting like a *&quot; and &quot;-pig&quot; is truly amazing: from more or less predictable ones, like a walrus, a deranged gorilla, a wrestler, a lumberjack, to rather unexpected ones, like a constipated weasel, a freakin caveman, a eunuch impersonating Billy Idol, plus many fresh associations like an orc, the beasts back on Mordor, etc.</Paragraph>
      <Paragraph position="5">  The word filter in Google is applied to the whole web-page. In order to check whether and how expressions for reaching a given age depend on the age itself, the following queries were entered: &quot;going on NUM&quot;, &quot;gonna be NUM&quot;, &quot;become NUM&quot;, &quot;turn NUM&quot;, &quot;push NUM&quot;, &quot;reach NUM&quot;, &quot;make it to NUM&quot;, &quot;hit NUM&quot;, where NUM stands for numerals. A large amount of material was obtained for all ages, some of it surprising. For instance, &quot;make it to NUM&quot; had most hits for lower ages, where the lowest numbers referred mostly to the age of relations, while middle numbers referred to young people suffering from some incurable illness; hitting 50 and more proved to be rare, probably because of its low news value.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="9" end_page="9" type="metho">
    <SectionTitle>
4 Snippets as concordance lines
</SectionTitle>
    <Paragraph position="0"> Whether snippets are sufficient as concordance lines is a question that can only be settled empirically. Culler has been used extensively for the past three months, in uses for which the examples provided above are representative.</Paragraph>
    <Section position="1" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
4.1 Google selections
</SectionTitle>
      <Paragraph position="0"> An average snippet is about 20 words long, which in most cases is sufficient as disambiguating context. For each query Google retrieves at most 100 URLs, and Culler may generate up to 300 queries for a search term (when it is expanded with inflectional forms and/or function words).</Paragraph>
      <Paragraph position="1"> Google selects web-pages according to a complex ranking, the main ingredient of which is the popularity of a web-page, measured among other things by the number of links from other web-pages. Snippets are thus representative of prevalent language use. So are the numbers of matches reported by search engines, because these are counts of web-pages with at least one match. The fact that each snippet comes from a different web-page contributes to the diversity of excerpts.</Paragraph>
      <Paragraph position="2"> The quality of excerpts can differ tremendously depending on the search term. Generally, the longer the search term, the higher the chance of good excerpts. Some terms get snippets with proper-name readings only, in which case it is better to limit the source of snippets to some large</Paragraph>
    </Section>
    <Section position="2" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
4.2 Culler selections
</SectionTitle>
      <Paragraph position="0"> Post-filtering of snippets is triggered either by variables of open parts of speech in a search term or by noise in excerpts. Three types of noisy excerpts are filtered out: repetitive, non-textual and non-phrasal. Each of the filters can be turned off. The average percentage of noise in snippets is about 20%. The three types of noise amount to an average of 21.7% of snippets discarded in the examples cited above (T is the number of snippets obtained from Google, D is the number of discarded snippets). (The URL is entered in Culler's slot for word filters.)</Paragraph>
      <Paragraph position="1">  Google does not return web-pages that are exact copies, but it does return snippets that are the same or almost the same. Famous lyrics, dramas, stories, sermons, speeches and important news appear in an enormous number of copies. For instance, the search term &quot;children at your feet&quot; has about 50% repetitions, all of which involve web-pages with the lyrics of the song &quot;Lady Madonna&quot;. The average level of repetitive snippets is about 5% in the cited examples. Repetitive snippets discarded by Lexware Culler are the ones which differ only in:  Several types of more or less formulaic elements that are common on the web appear in snippets. None of these is usually desirable as a concordance line: boilerplate information, mathematical formulae, navigation tips, hyperlinks, e-mail addresses, postal addresses, data on updates, headers, footers, copyright statements, logs, fragments of lists of items. On average, 7% of snippets are discarded by this type of filtering in the cited examples.</Paragraph>
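      <Paragraph> Filtering of repetitive snippets can be approximated by comparing normalized forms of the snippets. The sketch below assumes that two snippets count as repetitions when they differ only in letter case, punctuation and spacing; this criterion and the function names normalize and drop_repetitions are assumptions for illustration, not Culler's actual rules.

```python
import re


def normalize(snippet):
    """Reduce a snippet to lowercase word tokens joined by single spaces."""
    # Assumed notion of "almost the same": case, punctuation and spacing are ignored.
    return " ".join(re.findall(r"\w+", snippet.lower()))


def drop_repetitions(snippets):
    """Keep the first snippet of every group sharing a normalized form."""
    seen = set()
    kept = []
    for snippet in snippets:
        key = normalize(snippet)
        if key not in seen:
            seen.add(key)
            kept.append(snippet)
    return kept


snippets = [
    "Lady Madonna, children at your feet",
    "lady madonna children at your feet!",
    "See how they run",
]
unique = drop_repetitions(snippets)
print(len(snippets) - len(unique), "repetition(s) discarded")
```

Keeping the first occurrence preserves the order in which the search engine returned the snippets.</Paragraph>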
      <Paragraph position="2">  Punctuation is ignored by Google, while Culler starts from the assumption that a phrasal context is normally requested; hence only snippets without interrupting punctuation within the search term are selected. Adding marginal wildcards to a search term is interpreted in Culler as a request for an unbroken phrasal context that includes the words matched by the wildcards. Snippets in which the search term is interrupted by commas, full stops, colons, semicolons, question marks or exclamation marks are discarded by this filtering. The impact of this filtering differs very much from case to case: from half of the excerpts to none at all.</Paragraph>
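      <Paragraph> The phrasal filter can be illustrated with a regular expression that requires the words of a search term to be separated by whitespace only. This is a hedged sketch: phrase_pattern and is_phrasal are hypothetical names, the wildcard is assumed to match exactly one word, and the pattern rejects any intervening non-space character, which is slightly stricter than the list of punctuation marks given above. Culler's own matching may differ.

```python
import re


def phrase_pattern(term):
    """Compile a search term into a regex that tolerates only whitespace between words."""
    parts = []
    for token in term.split():
        if token == "*":
            # Assumption: a wildcard stands for one word, possibly with apostrophe or hyphen.
            parts.append(r"[\w'-]+")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def is_phrasal(snippet, term):
    """Return True if the snippet contains the term as an unbroken phrase."""
    return phrase_pattern(term).search(snippet) is not None


term = "conductor's *"
kept = is_phrasal("not the first soloist to feel the lure of the conductor's baton", term)
dropped = is_phrasal("He took the score from the conductor's. Baton in hand, he left", term)
print(kept, dropped)
```

Because the wildcard is part of the pattern, a punctuation mark between the search words and the wildcard word also causes the snippet to be rejected, in line with the treatment of marginal wildcards described above.</Paragraph>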
    </Section>
  </Section>
</Paper>