<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0108">
  <Title>Using GATE as an Environment for Teaching NLP</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 GATE from a Teaching Perspective
</SectionTitle>
    <Paragraph position="0"> GATE (Cunningham et al., 2002a) is an architecture, a framework and a development environment for human language technology modules and applications. It comes with a set of reusable modules that perform basic language processing tasks such as POS tagging and semantic tagging. These spare students from re-implementing algorithms and modules that are prerequisites for completing their assignments. For example, Marin Dimitrov from Sofia University successfully completed his master's degree by implementing a lightweight approach to pronominal coreference resolution for named entities, which uses GATE's reusable modules for the earlier processing and builds upon their results (see Section 4).</Paragraph>
    <Paragraph position="1"> For courses where the emphasis is more on linguistic annotation and corpus work, GATE can be used as a corpus annotation environment (see http://gate.ac.uk/talks/tutorial3/). The annotation can be done completely manually or it can be bootstrapped by running some of GATE's processing resources over the corpus and then correcting/adding new annotations manually. These facilities can also be used in courses and assignments where the students need to learn how to create data for quantitative evaluation of NLP systems.</Paragraph>
    <Paragraph position="2"> If evaluated against the requirements for teaching environments discussed in (Loper and Bird, 2002), GATE covers them all quite well.</Paragraph>
    <Paragraph position="3"> The graphical development environment and the JAPE language facilitate otherwise difficult tasks. Inter-module consistency is achieved by using the annotation model to hold language data, while extensibility and modularity are the very reason why GATE has been successfully used in many research projects (Maynard et al., 2000). GATE also offers robustness and scalability, which allow students to experiment with big corpora, such as the British National Corpus (approx. 4GB). In the following subsections we provide further detail about these aspects of GATE.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 GATE's Graphical Development
Environment
</SectionTitle>
      <Paragraph position="0"> GATE comes with a graphical development environment (the GATE GUI) that helps students inspect language processing results and debug modules. The environment has facilities to view documents, corpora, ontologies (including the popular Protégé editor (Noy et al., 2001)), and linguistic data (expressed as annotations, see below). For example, Figure 1 shows the document viewer with some annotations highlighted. It also shows the resource panel on the left with all loaded applications, language resources, and processing resources (i.e., modules). There are also viewers/editors for complex linguistic data like coreference chains (Figure 2) and syntax trees (Figure 3). New graphical components can be integrated easily, thus allowing lecturers to customise the environment as necessary. The GATE team is also developing new visualisation modules, especially a visual JAPE rule development tool.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 GATE API and Data Model
</SectionTitle>
      <Paragraph position="0"> The central concept that needs to be learned by the students before they start using GATE is the annotation data model, which encodes all linguistic data and is used as input and output for all modules. GATE uses a single unified model of annotation - a modified form of the TIPSTER format (Grishman, 1997) which has been made largely compatible with the Atlas format (Bird and Liberman, 1999). Annotations are characterised by a type and a set of features represented as attribute-value pairs. The annotations are stored in structures called annotation sets, which constitute independent layers of annotation over the text content. The annotation format is independent of any particular linguistic formalism, in order to enable the use of modules based on different linguistic theories. This generality enables the representation of a wide variety of linguistic information, ranging from very simple (e.g., tokeniser results) to very complex (e.g., parse trees and discourse representation: examples in (Saggion et al., 2002)). In addition, the annotation format allows the representation of incomplete linguistic structures, e.g., partial-parsing results. GATE's tree viewing component has been written especially to be able to display such disconnected and incomplete trees.</Paragraph>
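      <Paragraph> The annotation model described above can be illustrated with a minimal Java sketch. The class names SketchAnnotation and SketchAnnotationSet, the "orth" feature and the sample text are assumptions made for illustration only; they are not GATE's own classes or API, and the real model carries more information than is shown here.
<![CDATA[
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an annotation: a typed span with features. Not GATE's API. */
final class SketchAnnotation {
    final String type;                  // e.g. "Token" or "NamedEntity"
    final int start;                    // character offset where the span begins
    final int end;                      // character offset just past the span
    final Map<String, String> features; // attribute-value pairs

    SketchAnnotation(String type, int start, int end, Map<String, String> features) {
        this.type = type;
        this.start = start;
        this.end = end;
        this.features = Collections.unmodifiableMap(new LinkedHashMap<>(features));
    }
}

/** Hypothetical annotation set: one independent layer over the same text. */
final class SketchAnnotationSet {
    private final List<SketchAnnotation> annotations = new ArrayList<>();

    void add(SketchAnnotation a) {
        annotations.add(a);
    }

    /** Returns the annotations of the given type, in insertion order. */
    List<SketchAnnotation> ofType(String type) {
        List<SketchAnnotation> result = new ArrayList<>();
        for (SketchAnnotation a : annotations) {
            if (a.type.equals(type)) {
                result.add(a);
            }
        }
        return result;
    }
}

final class AnnotationModelDemo {
    /** Builds a two-layer example over the text "Acme Ltd. hired staff". */
    static Map<String, SketchAnnotationSet> build() {
        Map<String, String> tokenFeatures = new LinkedHashMap<>();
        tokenFeatures.put("orth", "upperInitial"); // assumed feature name
        Map<String, String> entityFeatures = new LinkedHashMap<>();
        entityFeatures.put("kind", "company");

        SketchAnnotationSet tokens = new SketchAnnotationSet();
        tokens.add(new SketchAnnotation("Token", 0, 4, tokenFeatures));
        SketchAnnotationSet entities = new SketchAnnotationSet();
        entities.add(new SketchAnnotation("NamedEntity", 0, 9, entityFeatures));

        Map<String, SketchAnnotationSet> layers = new LinkedHashMap<>();
        layers.put("tokens", tokens);
        layers.put("entities", entities);
        return layers;
    }

    public static void main(String[] args) {
        String text = "Acme Ltd. hired staff";
        for (SketchAnnotation a : build().get("entities").ofType("NamedEntity")) {
            System.out.println(text.substring(a.start, a.end) + " " + a.features);
        }
    }
}
```
]]>
 Keeping each layer in its own set means a module can add or replace one layer without touching the others, which is the property the text relies on for inter-module consistency.</Paragraph>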
      <Paragraph position="1"> GATE is implemented in Java, which makes it easier for students to use, because they are typically already familiar with this language from their programming courses. The GATE API (Application Programming Interface) is fully documented in Javadoc, and examples are given in the comprehensive User Guide (Cunningham et al., 2002b). However, students generally do not need to familiarise themselves with Java and the API at all, because the majority of the modules are based on GATE's JAPE language, so customising existing modules and developing new ones requires only knowledge of JAPE and the annotation model described above.</Paragraph>
      <Paragraph position="2"> JAPE is a version of CPSL (Common Pattern Specification Language) (Appelt, 1996) and is used to describe patterns to match and annotations to be created as a result (for further details see (Cunningham et al., 2002b)). Once familiar with GATE's data model, students would not find it difficult to write the JAPE pattern-based rules, because they are effectively regular expressions, which is a concept familiar to most CS students.</Paragraph>
      <Paragraph position="3"> An example rule from an existing named entity recognition grammar is:</Paragraph>
      <Paragraph position="5"> The rule matches a pattern consisting of any kind of word that starts with an upper-case letter (recognised by the tokeniser), followed by one of the entries in the gazetteer list for company designators (words which typically indicate companies, such as 'Ltd.' and 'GmbH'). It then annotates this pattern with the entity type "NamedEntity", and gives it a feature "kind" with value "company" and another feature "rule" with value "Company1". The rule feature is used purely for debugging, so that it is clear which particular rule has fired to create the annotation. The grammars (which are sets of rules) do not need to be compiled by the students, because they are automatically analysed and executed by the JAPE Transducer module, which is a finite-state transducer over the annotations in the document. Since the grammars are stored in plain text files, they can be edited in any text editor such as Notepad or Vi. The students develop rules using GATE's visual environment (see Figure 1) to execute the grammars and visualise the results. The process is a cycle: the students write one or more rules, re-initialise the transducer in the GATE GUI by right-clicking on it, run it on the test data, check the results, and go back to improving the rules. The evaluation part of this cycle is performed using GATE's visual evaluation tools, which also produce precision, recall, and f-measure automatically (see Figure 4).</Paragraph>
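      <Paragraph> The behaviour of the rule described above can be approximated in plain Java. This is a hedged sketch, not JAPE and not the grammar shipped with GATE: the class CompanyRuleSketch, the two-entry designator set and the whitespace tokenisation are all assumptions standing in for the tokeniser and gazetteer.
<![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CompanyRuleSketch {
    // Assumed stand-in for the gazetteer list of company designators.
    static final Set<String> DESIGNATORS = new HashSet<>(Arrays.asList("Ltd.", "GmbH"));

    /** Returns each "CapitalisedWord Designator" pair found in whitespace-separated text. */
    static List<String> findCompanies(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> matches = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            // Pattern part: a word with an upper-case initial, then a designator.
            boolean upperInitial = Character.isUpperCase(tokens[i].charAt(0));
            if (upperInitial && DESIGNATORS.contains(tokens[i + 1])) {
                // Action part: record the matched span as a company.
                matches.add(tokens[i] + " " + tokens[i + 1]);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findCompanies("They sold shares in Acme Ltd. and Foo GmbH today"));
    }
}
```
]]>
 In JAPE the same logic is stated declaratively over annotations rather than over raw strings, which is why no compilation step or Java knowledge is needed for the common case.</Paragraph>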
      <Paragraph position="6"> The advantage of using JAPE for student assignments is that, once learned, it enables students to experiment with a variety of NLP tasks, from tokenisation and sentence splitting, to chunking, to template-based information extraction. Because it does not need to be compiled and supports incremental development, JAPE is ideal for rapid prototyping, so students can experiment with alternative ideas.</Paragraph>
      <Paragraph position="7"> Students who are doing bigger projects, e.g., a final year project, might want to develop GATE modules which are not based on the finite-state machinery and JAPE. Or the assignment might require the development of more complex grammars in JAPE, in which case they might have to use Java code on the right-hand side of the rule. Since such GATE modules typically only access and manipulate annotations, even then the students would need to learn only that part of GATE's API (i.e., no more than 5 classes).</Paragraph>
      <Paragraph position="8"> Our experience with two MSc students - Partha Lal and Marin Dimitrov - has shown that they do not have significant problems with using that part of the API either.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Some useful modules
</SectionTitle>
      <Paragraph position="0"> The tokeniser splits text into simple tokens, such as numbers, punctuation, symbols, and words of different types (e.g. with an initial capital, all upper case, etc.). The tokeniser does not generally need to be modified for different applications or text types. It currently recognises many types of words, whitespace patterns, numbers, symbols and punctuation and should handle any language from the Indo-European group without modifications. Since it is available as open source, one student assignment could be to modify its rules to cope with other languages or specific problems in a given language. The tokeniser is based on finite-state technology, so the rules are independent from the algorithm that executes them.</Paragraph>
      <Paragraph position="1"> The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the tagger. Both the splitter and tagger are domain- and application-independent. Again, the splitter grammars can be modified as part of a student project, e.g., to deal with specifically formatted texts.</Paragraph>
      <Paragraph position="2"> The tagger is a modified version of the Brill tagger, which assigns a part-of-speech tag to each word or symbol. To modify the tagger's behaviour, students will have to re-train it on relevant annotated texts.</Paragraph>
      <Paragraph position="3"> The gazetteer consists of lists such as cities, organisations, days of the week, etc. It contains not only entity names but also useful indicators, such as typical company designators (e.g. 'Ltd.'), titles, etc. The gazetteer lists are compiled into finite state machines, which annotate the occurrences of the list items in the given document. Students can easily extend the existing lists and add new ones by double-clicking on the Gazetteer processing resource, which brings up the gazetteer editor if it has been installed, or by using GATE's Unicode editor.</Paragraph>
      <Paragraph position="4"> The JAPE transducer is the module that runs JAPE grammars, which could be doing tasks like chunking, named entity recognition, etc. By default, GATE is supplied with an NE transducer which performs named entity recognition for English and a VP Chunker which shows how chunking can be done using JAPE.</Paragraph>
      <Paragraph position="5"> An even simpler (in terms of grammar rules complexity) and somewhat incomplete NP chunker can be obtained by request from the first author.</Paragraph>
      <Paragraph position="6"> The orthomatcher is a module whose primary objective is to perform co-reference, or entity tracking, by recognising relations between entities based on orthographically matching their names. It also has a secondary role in improving named entity recognition by assigning annotations to previously unclassified names, based on relations with existing entities.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Support for languages other than
English
</SectionTitle>
      <Paragraph position="0"> GATE uses Unicode (Unicode Consortium, 1996) throughout, and has been tested on a variety of Slavic, Germanic, Romance, and Indic languages. The ability to handle Unicode data, along with the separation between data and algorithms, makes it easy for students to run even small-scale experiments in porting NLP components to new languages. The graphical development environment fully supports the creation, editing, and visualisation of linguistic data, documents, and corpora in Unicode-supported languages (see Tablan et al., 2002). In order to make it easier for foreign students to use the GUI, we are planning to localise its menus, error messages, and buttons, which are currently only in English.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5 Installation and Programming
Languages Support
</SectionTitle>
      <Paragraph position="0"> Since GATE is 100% Java, it can run on any platform that has Java support. To make it easier to install and maintain, GATE comes with installation wizards for all major platforms. It also allows the creation and use of a site-wide GATE configuration file, so settings need only be specified once and all copies run by the students will have the same configuration and modules available. In addition, GATE allows students to have their own configuration settings, e.g., to specify modules which are available only to them. The personal settings override those from GATE's default and site-wide configurations. Students can also easily install GATE on their home computers using the installation program. GATE also allows applications to be saved and moved between computers and platforms, so students can easily work both at home and in the lab and transfer their data and applications between the two.</Paragraph>
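      <Paragraph> The override order described above (personal settings over site-wide settings over defaults) can be sketched with the standard java.util.Properties defaults chain. This is only an illustration of the layering idea: the class LayeredConfigSketch, the keys and the paths are hypothetical, and GATE's actual configuration file format and loading code are not shown.
<![CDATA[
```java
import java.util.Properties;

final class LayeredConfigSketch {
    /** Chains three layers so that user settings win over site settings, which win over defaults. */
    static Properties layer(Properties defaults, Properties site, Properties user) {
        Properties siteLayer = new Properties(defaults); // falls back to defaults
        siteLayer.putAll(site);
        Properties userLayer = new Properties(siteLayer); // falls back to site, then defaults
        userLayer.putAll(user);
        return userLayer;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("plugins.dir", "default-plugins"); // hypothetical key
        defaults.setProperty("save.session", "true");           // hypothetical key

        Properties site = new Properties();
        site.setProperty("plugins.dir", "/shared/course-plugins");

        Properties user = new Properties();
        user.setProperty("plugins.dir", "/home/student/my-plugins");

        Properties effective = layer(defaults, site, user);
        System.out.println(effective.getProperty("plugins.dir"));  // user value
        System.out.println(effective.getProperty("save.session")); // inherited default
    }
}
```
]]>
 A lookup with getProperty consults the most specific layer first and only falls back to the broader layers when a key is absent, so a lecturer can set course-wide values once while individual students still override them.</Paragraph>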
      <Paragraph position="1"> GATE's graphical environment comes configured by default to save its own state on exit, so students will get their applications, modules, and data restored automatically the next time they load GATE.</Paragraph>
      <Paragraph position="2"> Although GATE is Java-based, modules written in other languages can also be integrated and used. For example, Prolog modules are easily executable using the Jasper Java-Prolog linking library. Other programming languages can be used if they support the Java Native Interface (JNI).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Existing Uses of GATE for Teaching
</SectionTitle>
    <Paragraph position="0"> Postgraduates in locations as diverse as Bulgaria, Copenhagen and Surrey are using the system in order to avoid having to write simple things like sentence splitters from scratch, and to enable visualisation and management of data. For example, Partha Lal at Imperial College is developing a summarisation system based on GATE and ANNIE as a final-year project for an MEng Degree in Computing (http://www.doc.ic.ac.uk/~pl98/). His site includes the URL of his components, and once given this URL, GATE loads his software over the network. Another student project is discussed in more detail in Section 4.</Paragraph>
    <Paragraph position="1"> Our colleagues at the University of Edinburgh, UMIST in Manchester, and the University of Sussex (amongst others) have reported using previous versions of the system for teaching, and the University of Stuttgart produced a tutorial in German for the same purpose. Educational users of early versions of GATE 2 include Exeter University, Imperial College, Stuttgart University, the University of Edinburgh and others. In order to facilitate the use of GATE as a teaching tool, we have provided a number of tutorials, online demonstrations, and exhaustive documentation on GATE's Web site (http://gate.ac.uk).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 An Example MSc Project
</SectionTitle>
    <Paragraph position="0"> The goal of this work was to develop a coreference resolution module to be integrated within the named entity recognition system provided with GATE. This required a number of tasks to be performed by the student: (i) corpus analysis; (ii) implementation and integration; (iii) testing and quantitative evaluation.</Paragraph>
    <Paragraph position="1"> The student developed a lightweight approach to resolving pronominal coreference for named entities, which was implemented as a GATE module and run after the existing NE modules provided with the framework. This enabled him also to use an existing annotated corpus from an Information Extraction evaluation competition and the GATE evaluation tools to establish how his module compared with results reported in the literature. Finally, the testing process was made simple, thanks to GATE's visualisation facilities, which are already capable of displaying coreference chains in documents.</Paragraph>
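    <Paragraph> The quantitative evaluation mentioned above rests on precision, recall and F-measure. The following Java sketch shows how these figures follow from counts of correct, spurious and missing annotations. It is a simplified illustration under stated assumptions: the class EvaluationSketch and its method names are hypothetical, partially correct matches are ignored, and it is not the code of GATE's evaluation tools.
<![CDATA[
```java
final class EvaluationSketch {
    /** Fraction of proposed annotations that are correct. */
    static double precision(int correct, int spurious) {
        int proposed = correct + spurious;
        return proposed == 0 ? 0.0 : (double) correct / proposed;
    }

    /** Fraction of gold-standard annotations that were found. */
    static double recall(int correct, int missing) {
        int expected = correct + missing;
        return expected == 0 ? 0.0 : (double) correct / expected;
    }

    /** Balanced F-measure: the harmonic mean of precision and recall. */
    static double fMeasure(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        double p = precision(8, 2); // 8 correct, 2 spurious
        double r = recall(8, 8);    // 8 correct, 8 missed
        System.out.printf("P=%.2f R=%.2f F=%.2f%n", p, r, fMeasure(p, r));
    }
}
```
]]>
 Comparing a module against an annotated corpus then reduces to counting the three categories per annotation type and feeding the counts into these functions.</Paragraph>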
    <Paragraph position="2"> GATE not only allowed the student to achieve verifiable results quickly, but it also did not incur substantial integration overheads, because it comes with a bootstrap tool which automates the creation of GATE-compliant NLP modules.</Paragraph>
    <Paragraph position="3"> The steps that need to be followed are: * use the bootstrap tool to create an empty Java module, then add the implementation to it. A Java development environment such as JBuilder or VisualCafe can be used for this and the next stages, if the students are familiar with it; * compile the class, and any others that it uses, into a Java Archive (JAR) file (GATE</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Example Topics
</SectionTitle>
    <Paragraph position="0"> Since GATE has been used for a wide range of tasks, it can be used for the teaching of a number of topics. Topics that can be covered in (part of) a course, based on GATE are: * Language Processing, Language Engineering, and Computational Linguistics: differences, methodologies, problems.</Paragraph>
    <Paragraph position="1">  * Architectures, portability, robustness, corpora, and the Web.</Paragraph>
    <Paragraph position="2"> * Corpora, annotation, and evaluation: tools and methodologies.</Paragraph>
    <Paragraph position="3"> * Basic modules: tokenisation, sentence splitting, gazetteer lookup.</Paragraph>
    <Paragraph position="4"> * Part-of-speech tagging.</Paragraph>
    <Paragraph position="5"> * Information Extraction: issues, tasks, representing linguistic data in the TIPSTER annotation format, MUC, results achieved.
 - Named Entity Recognition
 - Coreference Resolution
 - Template Elements and Relations
 - Scenario Templates</Paragraph>
    <Paragraph position="6"> * Parsing and chunking
 * Document summarisation
 * Ontologies and discourse interpretation
 * Language generation
 While language generation, parsing, summarisation, and discourse interpretation modules are not currently distributed with GATE, they can be obtained by contacting the authors. Modules for text classification and learning algorithms in general are to be developed in the near future. A lecturer willing to contribute any such modules to GATE will be very welcome to do so and will be offered integration support.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Example Assignments
</SectionTitle>
    <Paragraph position="0"> The availability of example modules for a variety of NLP tasks allows students to use them as a basis for the development of an entire NLP application, consisting of separate modules built during their course. For example, let us consider two problems: recognising chemical formulae in texts and making an IE system that extracts information from dialogues. Both tasks require students to make changes in a number of existing components and also write some new grammars.</Paragraph>
    <Paragraph position="1"> Some example assignments for the chemical formulae recognition follow: * tokeniser: while it will probably work well for the dialogues, the first assignment would be to make modifications to its regular expression grammar to tokenise formulae like H4ClO2 and Al-Li-Ti in a more suitable way.</Paragraph>
    <Paragraph position="2"> * gazetteer: create new lists containing new useful clues and types of data, e.g., all chemical elements and their abbreviations.</Paragraph>
    <Paragraph position="3"> * named entity recognition: write a new grammar to be executed by a new JAPE transducer module for the recognition of the chemical formulae.</Paragraph>
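    <Paragraph> One possible reading of "a more suitable way" for the tokeniser assignment above is to split a formula into element symbols with their optional counts. The Java sketch below shows that idea with a regular expression. It is an assumption about what the assignment might aim for, it is written in Java rather than in the tokeniser's own rule syntax, and the class FormulaTokeniserSketch is hypothetical. It does not check symbols against the periodic table, which is where the gazetteer assignment would come in.
<![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FormulaTokeniserSketch {
    // One element symbol (upper-case letter, optional lower-case letter) plus an optional count.
    private static final Pattern ELEMENT = Pattern.compile("[A-Z][a-z]?\\d*");

    /** Splits a formula such as "H4ClO2" or "Al-Li-Ti" into element tokens. */
    static List<String> split(String formula) {
        List<String> parts = new ArrayList<>();
        Matcher m = ELEMENT.matcher(formula);
        while (m.find()) {
            parts.add(m.group()); // separators such as '-' are skipped
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("H4ClO2"));
        System.out.println(split("Al-Li-Ti"));
    }
}
```
]]>
</Paragraph>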
    <Paragraph position="4"> Some assignments for the dialogue application are: * sentence splitter: modify it so that it correctly splits dialogue texts by taking into account the speaker information (because dialogues often do not have punctuation). For example:
 A: Thank you, can I have your full name?
 C: Err John Smith
 A: Can you also confirm your postcode and telephone number for security?
 C: Erm it's 111 111 11 11
 A: Postcode?
 C: AB11 1CD
 * corpus annotation and evaluation: use the default named entity recogniser to bootstrap the manual annotation of the test data for the dialogue application; evaluate the performance of the default NE grammars on the dialogue texts; suggest possible improvements on the basis of the information about missed and incorrect annotations provided by the corpus benchmark tool.</Paragraph>
    <Paragraph position="5"> * named entity recognition: implement the improvements proposed at the previous step, by changing the default NE grammar rules and/or by introducing rules specific to your dialogue domain.</Paragraph>
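    <Paragraph> The speaker-based splitting asked for in the dialogue assignments above can be prototyped outside GATE before it is written as splitter grammar rules. The Java sketch below assumes that every turn starts with a single upper-case letter followed by a colon, as in the sample dialogue; the class DialogueSplitterSketch is hypothetical and real transcripts may need a richer label pattern.
<![CDATA[
```java
import java.util.Arrays;
import java.util.List;

final class DialogueSplitterSketch {
    /**
     * Splits a dialogue into turns at whitespace that is followed by a
     * speaker label: one upper-case letter, a colon, then whitespace.
     */
    static List<String> splitTurns(String dialogue) {
        // The lookahead keeps the speaker label attached to the turn it introduces.
        return Arrays.asList(dialogue.trim().split("\\s+(?=[A-Z]:\\s)"));
    }

    public static void main(String[] args) {
        String dialogue = "A: Thank you, can I have your full name? C: Err John Smith "
                + "A: Postcode? C: AB11 1CD";
        for (String turn : splitTurns(dialogue)) {
            System.out.println(turn);
        }
    }
}
```
]]>
 Because the split depends only on speaker labels, unpunctuated turns such as "C: Err John Smith" are still separated from the turn that follows them.</Paragraph>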
    <Paragraph position="6"> Finally, some assignments which are not connected to any particular domain or application: * chunking: implement an NP chunker using JAPE. Look at the VP chunker grammars for examples.</Paragraph>
    <Paragraph position="7"> * template-based IE: experiment with extracting information from the dialogues using templates and JAPE (an example implementation will be provided soon).</Paragraph>
    <Paragraph position="8"> * (for a group of students) building NLP-enabled Web applications: embed one of the IE applications developed so far into a Web application, which takes a Web page and returns it annotated with the entities. Use http://gate.ac.uk/annie/index.jsp as an example. In the near future it will also be possible to have assignments on summarisation and generation, but these modules are still under development. It will be possible to demonstrate parsing and discourse interpretation, but because these modules are implemented in Prolog and somewhat difficult to modify, assignments based on them are not recommended. However, other such modules, e.g., those from NLTK (Loper and Bird, 2002), can be used for such assignments.</Paragraph>
  </Section>
</Paper>