<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0806"> <Title>Web Access to Corpora: the W3Corpora Project*</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Motivation, Design Criteria </SectionTitle> <Paragraph position="0"> The motivation for the project was the observation that the up-take of corpus linguistics is not what it should be -- in this day and age some corpus linguistics should be part of every course to do with language. The problem is that learning about corpus linguistics involves doing it, and that the overheads involved in getting started in doing or teaching corpus linguistics are considerable. Corpora in many languages are now easily available, but to use them requires a significant investment in hardware (e.g. disk space), software (tools need to be downloaded and installed), and time and effort (the tools have to be understood and techniques mastered). All this is hard enough for the individual researcher. In a teaching context, all these problems will typically be magnified by the need to deal with differences in the environment available to students (architecture, operating system, software: if something can differ, it will differ; if a difference can matter, it will matter). Corpus linguistics should be a part of every scheme of study, and it may have a role in almost every piece of research. But it need not be a central theme, certainly not central enough to justify the effort involved. It would be nice to be able to have (say) three sessions on corpus linguistics in a course with a wider focus, and in that time give students a real feeling for what can be gained, and what the limitations are. 
It would be nice for a researcher to be able to find out whether corpora can provide any useful data about some phenomenon without having to actually become a corpus linguist.</Paragraph> <Paragraph position="1"> There are surely many areas of linguistics, even computational linguistics, that are like this: as subjects develop, it becomes impossible for students or researchers to master the full range of ideas and techniques, and there is an increasing need for the provision of knowledge of subject areas at a 'contextual' rather than specialist level.</Paragraph> <Paragraph position="2"> It is important to be able to convey something about a wide range of areas very briefly, but (hopefully) without trivialization.</Paragraph> <Paragraph position="3"> So, the goal of the project was to provide instant, and instantly usable, access to corpora, including tools to manipulate them, as well as general information and tutorial information about how and why one might manipulate them.</Paragraph> <Paragraph position="4"> Of course, the World Wide Web is excellent for this. In principle, all the user needs is a Web connection and a browser. Beyond this, no investment of money, and little investment of effort, should be needed: there should be no need to obtain and install corpora, or download and install software, and the interface to the corpus manipulating tools should already be familiar (since it would be based closely on their web browser).</Paragraph> <Paragraph position="5"> Moreover, from a teaching perspective, problems of different architecture (etc.) are minimized -- all that is necessary is the browser and the Web connection. 
Given these aims and motivation, a number of design decisions are rather natural: * The system should be immediately usable by anyone with WWW access and a Web browser, for example: - it should be usable without the need to install or download any programs; - it should be usable with essentially any generally available browser; - it should be usable without the need to register and get authorization.</Paragraph> <Paragraph position="6"> * The interface should be as 'friendly' and easy to use as possible; it should be supported by extensive on-line help, and backed up by detailed information about corpus linguistics in general, and how to 'do' corpus linguistics in a practical way, using a tool such as this.</Paragraph> <Paragraph position="7"> * It is typical of novice users that they make mistakes with queries; thus, there should be some method for users to correct and 'refine' their queries very easily (this led to the idea of an editable 'search history').</Paragraph> <Paragraph position="8"> * It should be possible for a user to search their own corpora -- in this way a user can explore not only what is possible in general, but what is possible in relation to the kinds of material they are interested in or have to deal with.</Paragraph> <Paragraph position="9"> A major problem with Web delivery is the network overhead. Thus the source code should be freely available (in GNU 'Copyleft' style), which should allow the system to be installed and run locally over the Web at other sites.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Implementation, Overview of the W3Corpora Web-Site </SectionTitle> <Paragraph position="0"> This section gives an overview of the W3Corpora web-site. See Table 1 for URLs.</Paragraph> <Paragraph position="1"> The site is divided into three main parts: General Information, where the user can learn about corpus linguistics in general (e.g. 
general discussions: 'What is a corpus?', issues of corpus design and annotation, research areas, bibliography, etc.). This is the kind and level of information one might expect in an introductory textbook, e.g. (Barnbrook, 1996) or (Kennedy, 1998).</Paragraph> <Paragraph position="2"> Tutorial, where the user can find out how to use the tools provided, and where some areas are described where corpus techniques are useful. A variety of tasks are described in some detail with practical examples (e.g. how to investigate the meaning of a word, compare two similar words, see how a word is used in different contexts, investigate spelling, and choose a preposition in a context like an explanation of/for something). Here and elsewhere, the emphasis is on classical corpus linguistics, neglecting e.g. statistical techniques that can be built on top.</Paragraph> <Paragraph position="3"> Here the key aim is to answer, as quickly and easily as possible, the two questions: 'How can I use this thing?' and 'What can I use it for?' It does not pretend to be a complete, stand-alone tutorial in Corpus Linguistics; it does not go to the length of (say) (Aston and Burnard, 1997), nor does it go into the same level of detail. 
The primary aim is to take the user to the point where they can answer the question 'Is Corpus Linguistics useful in my study and research?', and in case of an affirmative answer, give a basis for proceeding (perhaps, in fact most likely, with other resources and tools, installed locally to avoid network overheads).</Paragraph> <Paragraph position="4"> Search Engine, where the user can carry out searches. Apart from the Search Engine, the implementation is rather straightforward: the text is marked up as HTML, and there is extensive use of frames so that users are able to maintain an overview of documents as well as pursuing detail.</Paragraph> <Paragraph position="5"> The implementation of the Search Engine merits more discussion, but it is also based on rather standard techniques, using cgi-bin scripts written in Perl, and fairly standard indexing techniques to speed up searching.</Paragraph> <Paragraph position="6"> When a user arrives at the top-level search page, she is invited to select a corpus from a menu, and to specify a search string and search type (e.g. regular expression, exact match, whole words, etc.). Confirming these selections generates a 'session file' which records the selections. Also generated is a file recording various default values for options dictating, inter alia, what sort of results should be displayed first (frequency, or Key Word In Context -- KWIC) and, for KWIC results, how many results should be displayed at one time, how much context should be displayed, etc. The user can modify these options interactively via a form, which is generated in response to clicking the 'Options' button at the top of the screen. Currently, some 19,000,000 words (321 texts) from the Gutenberg Project corpora can be searched. 
(1 million words of the LOB corpus, tagged and untagged, can also be searched after a user has registered and received a password.) A flavour of the interface to the Search Engine can be gained from Figure 1, which shows the results of searching for the regular expression /[Nn]ice/ over a subset of the Gutenberg texts, and clicking on one of the results to view the wider context. An early stage in the project defined a list of properties that a corpus searching interface should have (Brines-Moya and Hartill, 1998). This interface satisfies almost all of them.</Paragraph> <Paragraph position="7"> A large amount of on-line help is available (via the 'Help' button; the information supplied is somewhat sensitive to the particular screen being viewed).</Paragraph> <Paragraph position="8"> 'Frequency' and 'Display' buttons generate different views of the search results:</Paragraph> <Paragraph position="9"> * The 'Frequency' button generates frequency information for a search (total number of hits, hits per subcorpus, and lexical information -- e.g. how many of the hits for /[Nn]ice/ arise from the word nice, how many from nicer, nicest, Venice, cornice, etc.).</Paragraph> <Paragraph position="10"> * The 'Display' button generates a KWIC display of search results (see Figure 1). KWIC results are editable -- the user can delete certain results, and can also call up wider context by clicking on a key word.</Paragraph> <Paragraph position="11"> The 'Search' button allows the user to (i) carry out a totally new search, (ii) 'refine' the existing search, or (iii) view and modify the search history. Refining a search returns a subset of the current search: the user supplies a regular expression which potential hits must satisfy in addition to the original pattern. Thus, one might refine /[Nn]ice/ with /^[Nn]/ to eliminate Venice and cornice as hits; a further refinement with /e$/ would eliminate nicer and nicest. 
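The refinement steps just described can be illustrated with a minimal sketch. It is written in Python rather than the Perl of the actual cgi-bin scripts, and everything in it is an assumption made for illustration: the toy text, the function names search, refine and frequency, and the choice to test each refinement pattern against the whole word containing the hit (which is what the Venice and cornice example suggests, but which the site's real implementation may do differently).

```python
import re
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration, not W3Corpora data).
TEXT = ("It was nice, very nice, of the nicer folk in Venice to mend "
        "the cornice; the nicest part was a Nice holiday.")


def search(pattern, text):
    """Return each word of the text that contains a match for the pattern."""
    words = re.findall(r"[A-Za-z]+", text)
    return [word for word in words if re.search(pattern, word)]


def refine(pattern, hits):
    """Keep only those existing hits that also satisfy a further pattern."""
    return [word for word in hits if re.search(pattern, word)]


def frequency(hits):
    """Count how often each word form occurs among the hits."""
    return Counter(hits)


# Each stage of the search history pairs a pattern with its surviving hits.
history = [(r"[Nn]ice", search(r"[Nn]ice", TEXT))]
for pattern in (r"^[Nn]", r"e$"):
    history.append((pattern, refine(pattern, history[-1][1])))

# Deleting the middle stage leaves just the initial and the final stage.
trimmed = [history[0], history[-1]]

for pattern, hits in history:
    print(pattern, dict(frequency(hits)))
```

Storing the hits alongside each pattern means a stage can be revisited or deleted without re-running the earlier searches; a real system working over an index would more likely store references to hit positions than the words themselves.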
A sequence of refinements constitutes a search history: users can view, and edit a search history -- moving backwards and forwards through the different stages of a search. The user can also delete stages (e.g. to leave just an initial and a final stage).</Paragraph> <Paragraph position="12"> An aspect of the system that may be particularly useful to teachers is the ability to up-load corpora for searching. When a user up-loads a (plain text) corpus to the Web-site (by anonymous ftp), it becomes selectable for searching. When so selected, it is indexed and prepared for searching in the normal way. This may be particularly useful to teachers who want students to carry out exercises on particular corpora that are not already provided at the site.</Paragraph> <Paragraph position="13"> The site has been used and positively evaluated by 'expert users' (i.e. with a background in corpus linguistics), and by students at Essex and elsewhere, but there are many open questions about how it can or should best be used in the context of different courses and learning situations (see Section 5). [Figure 1: KWIC results for /[Nn]ice/, one hit per line, each with a DELETE button.]</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> CONTEXT FRAME </SectionTitle> <Paragraph position="0"> Harry would be one-and-thirty next July, he declared. Proper age to get married with a nice, sensible girl that could appreciate a good home. He was a very high-spirited boy. High-spirited husbands were the easiest to manage. These mean, soft chaps, that you would think butter wouldn't melt in their mouths, were the ones to make a woman thoroughly miserable. And there was nothing like a home -- a fireside -- a good roof: no turning out of your warm bed in all sorts of weather. &quot;Eh, my dear?&quot; [Figure 1 caption, beginning not recovered:] ... deleted and which indicate which sub-corpus each hit comes from. At the bottom of the page the wider context of one of the hits is displayed (the user has clicked on one of the individual hits to obtain this).</Paragraph> </Section> </Paper>