<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0806"> <Title>Web Access to Corpora: the W3Corpora Project*</Title> <Section position="6" start_page="0" end_page="0" type="concl"> <SectionTitle> 4 Existing Work, Comparisons </SectionTitle> <Paragraph position="0"> There are a large number of tools and systems that offer something similar to what the W3Corpora site seeks to provide. They range from simple Unix command-line programs such as ptx to sophisticated GUI interfaces. For local installation, on a Macintosh one has Conc 1.7 and ParaConc; for DOS/Windows, one has ICECUP (from ICE), LEXA (from ICAME), Micro-OCP, Multiconcord, LDB (from Nijmegen), Wordsmith Tools, TACT, and Sara (for the BNC); for Unix, there are many standard utilities, as well as ptx and Xkwic (from Stuttgart).</Paragraph> <Paragraph position="1"> As regards Web-accessible resources, the following should be mentioned: [Footnote, continued: four 2-hour sessions, two of which are descriptive, two practical; in the latter two the students use the W3Corpora search engine, under supervision. A practical corpus investigation, using tools such as the W3Corpora search engine, is one of the options for course assessment.]</Paragraph> <Paragraph position="2"> BNC: The BNC site provides access to a subset of the British National Corpus on a trial basis. This permits simple on-line searches, but with a limited number of hits and limited information about each hit. Registration for a trial account (20 days) is required. Full access requires downloading a Windows client program (available for Windows 95 and Windows 3.x only) and payment of an annual registration fee. It is restricted to users within the EC.</Paragraph> <Paragraph position="3"> Canadian Hansard: This site permits access to the proceedings of the Canadian Parliament in English and French. These are parallel corpora (English and French); searches may be mono- or bi-lingual (in either case, the results returned are bi-lingual -- i.e.
the user sees both the context where the search term appears and its translation). In the mono-lingual case one can see how an expression is used and translated. The bi-lingual case allows one to see, e.g., where English commitment is translated as French attachement.</Paragraph> <Paragraph position="4"> In addition to verbatim (case-independent) searches, it is also possible to perform a dictionary search (e.g. the query pull+ the plug will match pull the plug, pulling the plug, pulls the plug, etc.), and to search for words that do not appear contiguously (e.g. make ... arrangements). No frequency information is provided.</Paragraph> <Paragraph position="5"> Cobuild: This site gives limited access to the Cobuild Corpora: the &quot;Bank of English&quot; (over 50 million words), giving an idea of the kinds of search possible with the full system.</Paragraph> <Paragraph position="6"> It is possible to search for regular expressions (including a special character which matches inflectional endings), combinations of words, and part-of-speech tags. Only 40 lines of concordance are returned, and no information about frequency or wider context is accessible. It is also possible to search for collocates of words, based on either of two statistical scores (mutual information and T-score). The site does not provide much in the way of help pages, and there is no tutorial.</Paragraph> <Paragraph position="7"> TACTWeb: A pilot version of the TACTWeb software can be used on the Bergen Corpus of London Teenager Language (TACTWeb is intended to make a TACT-style text database accessible over the WWW). This is close in intention to the present project. At the time of writing, it is still under development.</Paragraph> <Paragraph position="8"> LDC/Brown Corpus: Text corpora and speech corpora are accessible via the Linguistic Data Consortium.
After registration, it is possible to access the Brown Corpus.</Paragraph> <Paragraph position="9"> Individuals who are not (affiliated to) members of the LDC can register as a guest and access corpora with the password that is sent to the user by email.</Paragraph> <Paragraph position="10"> Frequency information is available, a wide variety of searches is supported, concordances can be generated, and collocational information can be retrieved. Access to the TIMIT Speech Corpus is similar.</Paragraph> <Paragraph position="11"> It is obvious that some of these sites provide functionality that is not available at the W3Corpora site -- notably (i) multi-lingual searching and searching over parallel corpora, (ii) collocational information, and (iii) 'dictionary style searching' -- and several provide access to far more extensive corpus resources.</Paragraph> <Paragraph position="12"> On the other hand, none of these sites duplicates what is available at the W3Corpora site. In particular, none of them provides the balance of easy (immediate) access to usable quantities of corpus material, with easy, customizable functionality, and extensive user support and tutorial facilities. So far as I know, in no case is the source code freely available. Where they do provide semi-introductory access (e.g. by means of free registration and/or a guest account), there is generally very little in the way of tutorial material. [3] 5 Conclusion: some Problems, By far the most serious problem that the project faced was the difficulty of getting corpus resources that could be made freely available (i.e. without registration) over the Web.</Paragraph> <Paragraph position="13"> The whole system took about two years (three person-years) to complete.
This is a considerable effort, and one that is only worthwhile for a relatively stable area like corpus linguistics, where one can reasonably expect several years of use for a resource.</Paragraph> <Paragraph position="14"> The finished system is very large: the search engine and interface involve over 12,000 lines of code, much of it very straightforward (Perl commands to generate the HTML forms that provide the interface). It is hard to resist the sense that there should be an easier way to do this.</Paragraph> <Paragraph position="15"> Using HTML forms brings some problems. In particular, the lack of any kind of 'interactive' forms means that the interface is more complicated than it might otherwise need be (a form must be completed in toto and then submitted -- it cannot be partially completed and updated on the fly).</Paragraph> <Paragraph position="16"> The Perl-cgi-bin combination is powerful and excellent for small applications, but there is a severe lack of good debugging tools.</Paragraph> <Paragraph position="17"> It had originally been hoped to make the resource both 'future proof' and 'past proof'. The former is not too problematic -- the technology involved is likely to be supported for many years to come. But the latter -- the intention to make the resource usable with essentially any kind of browser -- quickly proved impossible, because of the need to use frames in serving the search engine interface.</Paragraph> <Paragraph position="18"> The resource is now fully operational and available.
While it has been evaluated by a number of different kinds of users in a number of contexts, there are still many open questions about how it can or should best be used.</Paragraph> <Paragraph position="19"> [3] See (Arnold et al., 1999) for a fuller discussion of alternatives.</Paragraph> <Paragraph position="20"> In designing the resources, we had in mind a casual, novice user: either an individual student or researcher with an interest in, but no strong commitment to, Corpus Linguistics, or a student on a course where Corpus Linguistics has a minor place (in the order of, say, three two-hour sessions). See (Arnold and Berglund, 1998) for a little more discussion of this. (We took the view that committed users would invest the effort in installing corpora and corpus searching tools locally, and would find the overheads of WWW access unacceptable.) Similarly, the resource was intended to be 'stand-alone', so as to make it as generally usable as possible. This means it does not form part of a larger suite of materials, and there are open questions about how it should best be integrated into schemes of study, and about what sorts of teaching method are appropriate.</Paragraph> <Paragraph position="21"> At one extreme, a teacher may simply note the resources as one among many resources available for further investigation; at the other, one could imagine entire classes trying to access the resources at the same time, with similar queries, under the direct supervision of a teacher. Apart from obvious remarks about the machine and network loading implications of the latter, I have nothing to offer here. But these are important issues and, since this range of possibilities exists in principle for any Web-based resource, quite general ones.</Paragraph> <Paragraph position="22"> The resources and tools were designed for WWW-based access. But many of the advantages (and a few other benefits) can be gained by a local area network (LAN) installation.
The cost is that the tools and corpora must be installed and maintained locally; the advantage is that one eliminates the WWW network overhead and no longer has to rely on a remote site to provide the resource. [4] Again, this is a general question for WWW-based teaching, but one on which it is hard to say anything general. From a user's point of view, the key questions are obviously the reliability of the remote site, compared to the reliability of local systems, and the inconvenience of the network overhead. These are matters which will vary greatly from one place to another, and will depend on the resources being provided -- in the case of the W3Corpora, there is still insufficient experience in practice to do much more than raise the questions. [4] Of course, once one has decided to go for a local installation, there are many alternatives to the W3Corpora resources, and one is not tied to a Web-browser-style interface.</Paragraph> </Section> </Paper>