<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1043">
  <Title>Class Annotation Type of AttributedObject Properties</Title>
  <Section position="4" start_page="253" end_page="253" type="metho">
    <SectionTitle>
Abstract Class ObjectReference
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="5" start_page="253" end_page="255" type="metho">
    <SectionTitle>
3 For implementation in languages which cannot determine the type of the argument at run time, such as C, this operation requires two
</SectionTitle>
    <Paragraph position="0"> arguments. The additional argument (the first of the two arguments) is of enumerated type &amp;quot;'one of {string, sequence, CollectionReference, DocumentReference, AnnotationReference, AttributeReference}&amp;quot; and specifies the type of the second argument, which is the value itself.</Paragraph>
    <Paragraph position="1">  ObjectReferences are references to (names of) persistent collections, documents, etc., and not to the object instances created by opening a collection, etc. It is therefore possible to have ObjectReferences to documents in collections which are not currently open; it is even possible to have references to documents which have been deleted from a collection. Because of the variety of objects which can be referenced, the Architecture does not provide a single dereferencing operator. Dereferencing must be done explicitly by the Application using the property accessors -opening the collection, accessing the document, accessing the annotation in the document, etc.</Paragraph>
    <Paragraph position="2"> An abstract class for objects which have attributes is defined as:</Paragraph>
  </Section>
  <Section position="6" start_page="255" end_page="256" type="metho">
    <SectionTitle>
Abstract Class AttributedObject
</SectionTitle>
    <Paragraph position="0"> assign value as the current value of attribute name of object, overwriting any prior assignment of a value to that attribute GetAttribute (AttributedObject, name: string): AttributeValue OR nil if attribute name of object has been assigned a value by a prior PutAttribute operation, return that value, else return nil RemoveAttribute(AttributedObject, name: string) if AttributedObject has an Attribute whose Name property is name, remove that attribute from AttributedObject (otherwise do nothing)</Paragraph>
    <Section position="1" start_page="256" end_page="256" type="sub_section">
      <SectionTitle>
3.2 Persistent Objects
</SectionTitle>
      <Paragraph position="0"> The TIPSTER Architecture assumes a name space of persistent objects; each persistent object is assigned a unique name (a string). If the Architecture is operating in a networked environment, this name will presumably consist of a host name and a unique name on that host.</Paragraph>
      <Paragraph position="1"> The (abstract) class Persistent Object is introduced, which is a superclass of any class of persistent objects.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="256" end_page="258" type="metho">
    <SectionTitle>
Abstract Class PersistentObject
</SectionTitle>
    <Paragraph position="0"> creates a new object of a specified class, and returns that object (it is an error if name is the name of an existing persistent object) Open.PersistentObject (name: string): PersistentObject name should be the name of an object of class PersistentObject, created by a prior Create.PersistentObject operation; the object with that name is returned Close (object: PersistentObject) saves any changes made to object in persistent storage and frees any local memory associated with this object the Architecture assumes that all Persistent Objects will be automatically closed on system termination Sync (object: PersistentObject) saves any changes made to object in persistent storage Destroy (name: string) erases the persistent instance of the object (it is an error if name is not the name of a persistent object) The architecture does not require us to identify persistent object names with file names, but this may be the simplest way to manage initial implementations. In the present architecture DocumentCollectionlndexes and QueryCollectionlndexes are persistent; Collections are optionally persistent (Documents are not persistent objects themselves but have persistence as a part of a Collection).</Paragraph>
    <Section position="1" start_page="256" end_page="257" type="sub_section">
      <SectionTitle>
3.3 Byte Sequences
</SectionTitle>
      <Paragraph position="0"> The decision about the representation of a sequence of bytes, which constitutes the contents of a document, should be hidden from most applications. To do so, the class ByteSequence is introduced. The minimal requirement for an  implementation of the Architecture is to be able to obtain the length of a ByteSequence, and to convert between a ByteSequence and a string:</Paragraph>
    </Section>
    <Section position="2" start_page="257" end_page="258" type="sub_section">
      <SectionTitle>
Class ByteSequence
Operations
</SectionTitle>
      <Paragraph position="0"> Length (ByteSequence): integer returns the number of bytes in ByteSequence ConvertToString (ByteSequence): string CreateByteSequence (string): ByteSequence (In fact, the simplest implementation of a ByteSequence will probably be as a string, so the conversion will be an identity operation.) Implementations may choose to supplement these with additional operations for creating and accessing ByteSequences, for two reasons:  1. For applications involving large documents, the implementation may wish to provide the ability to directly access portions of the document. This may be done through operations which retrieve substrings of a ByteSequence, or through operations which allow a ByteSequence to be opened to a stream (for subsequent read and write operations).</Paragraph>
      <Paragraph position="1"> 2. A collection of documents needs to be converted into a TIPSTER Collection prior to processing within the Architecture. For large collections which are already in place on some data store, such as a file system or a data base, it may be highly desirable to create the TIPSTER Collection without copying the document text. A TIPSTER implementation can support this capability by allowing a ByteSequence to be created as a  reference to a portion of this data store. For example, the implementation could define a &amp;quot;file segment&amp;quot; as a portion of a file (with start and end positions), and support operations for creating a ByteSequence from a file segment. Alternatively, an application based on a data base could define an operation for creating a ByteSequence from a data base field.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="258" end_page="259" type="metho">
    <SectionTitle>
4.0 DOCUMENTS AND COLLECTIONS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="258" end_page="258" type="sub_section">
      <SectionTitle>
4.1 Documents
</SectionTitle>
      <Paragraph position="0"> The document is the central object class in the TIPSTER architecture. As a unit of information, it serves several basic functions within the architecture: * it is the repository of information about a text, in the form of attributes and annotations (although annotations will in general refer to portions of documents) * it is the atomic unit in building collections * it is the atomic unit of retrieval in detection operations Each Document is part of one or more Collections (see Section 4.2). A Document has persistence by virtue of being a member of a Collection, and can be accessed only as a member of a Collection. Each document is given a unique identity by its ld property, which is copied by the CopyDocument and CopyBareDocument operations, and is also copied when a new collection is created by document retrieval operations.</Paragraph>
      <Paragraph position="1">  an internal document identifier, assigned automatically when a new Document is created, which is unique within an entire TIPSTER system (to insure uniqueness in a distributed system, an implementation may choose to include a host name as part of the Id) Externalld: string (R, W) a document identifier assigned by the application RawData: ByteSequence the contents of the document prior to any TIPSTER processing. The byte-sequence may include subsequences representing text in multiple languages, as well as non-text material such as pictures, audio, and tables Annotations: AnnotationSet information about portions of the document (information about the document as a whole is stored in Attributes; a Document inherits an Attributes property by virtue of being a type of Attributed Object)</Paragraph>
    </Section>
    <Section position="2" start_page="258" end_page="259" type="sub_section">
      <SectionTitle>
Operations
</SectionTitle>
      <Paragraph position="0"> CreateDocument (Parent: Collection, Externalld: string, RawData: ByteSequence, annotations: AnnotationSet, attributes: sequence of Attribute): Document creates a new document within the Collection Parent and assigns the document a new unique Id CopyBareDocument (NewParent: Collection, Document): Document makes a copy of Document, including only its internal Id, Externalld, and RawData, and places the copy in collection NewParent. The attributes and annotations of the original document are not copied by this operation.</Paragraph>
      <Paragraph position="1">  CopyDocument (NewParent: Collection, Document): Document makes a copy of Document, including its internal Id, Externalld, RawData, attributes, and annotations, and places the copy in collection NewParent.</Paragraph>
      <Paragraph position="2"> Annotate (Document, AnnotatorName: string) invokes annotation procedure AnnotatorName on the Document; see Section 5.6. WriteSGML (Document, AnnotationSet, AnnotationPrecedence: sequence of string): string Converts a document together with a set of Annotations into SGML format. AnnotationPrecedence, which is a list of annotation types, is used to resolve conflicts when two annotations cover the same span: the tag corresponding to the annotation type which appears first in the list is written out first. The resulting document is in a &amp;quot;normalized&amp;quot; SGML, with all attributes and end tags explicit. 4 ReadSGML (string, Parent: Collection, Externalld: string): Document Reads a string marked up with &amp;quot;normalized&amp;quot; SGML, with all attributes and end tags explicit, and generates a Document with the specified Externalld, no attributes, and an AnnotationSet containing one annotation for each SGML text element marked in the input text. If the input violates these constraints (e.g., unmatched start tags) or violates SGML syntax (e.g., unmatched quotation marks within tags), an error will be signaled. 5 As noted earlier, new sources of data will need to be converted by the application into Collections of Documents before they can be processed within the TIPSTER Architecture. The functions which perform these conversions will necessarily be specific to the type of data source, and hence a TIPSTER application will be required to provide these conversion operations when a new type of data source is to be used.</Paragraph>
    </Section>
    <Section position="3" start_page="259" end_page="259" type="sub_section">
      <SectionTitle>
4.2 Collections
</SectionTitle>
      <Paragraph position="0"> Documents are gathered into Collections, which may have attributes on the collection level as well as on the individual documents. Collections provide a permanent repository for documents within the TIPSTER Architecture.</Paragraph>
      <Paragraph position="1"> Collections in general are persistent and hence have names; however, the Architecture also provides for volatile,  CreateCollection (name: string, attributes: sequence of Attribute): Collection creates a named, persistent collection CreateVolatileCollection (attributes: sequence of Attribute): Collection creates an unnamed, volatile collection</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="259" end_page="259" type="metho">
    <SectionTitle>
4 The specification of this operation is subject to revision based on the experience of implementors in using these SGML representations in
</SectionTitle>
    <Paragraph position="0"> applications.</Paragraph>
  </Section>
  <Section position="10" start_page="259" end_page="261" type="metho">
    <SectionTitle>
5 The specification of this operation is subject to revision based on the experience of implementors in using these SGML representations in
</SectionTitle>
    <Paragraph position="0"> applications.</Paragraph>
    <Paragraph position="1">  returns the &amp;quot;first' document within a collection and initializes data structures internal to the collection so that NextDocument can be used to iterate through the documents in a collection. Returns nil if no documents are found in the collection.</Paragraph>
    <Paragraph position="2"> NextDocument (Collection): Document OR nil returns the &amp;quot;next' document within a collection. Normally used to iterate through all documents in a collection. Returns nil if no more documents are found in the collection. FirstDocument and NextDocument must be well behaved in the presence of calls to CreateDocument and RemoveDocument. This means that a loop using FirstDocument and NextDocument must visit all documents which were in the collection when FirstDocument was called if and only if the documents are not deleted before the loop reaches them. Documents added after FirstDocument is called may or may not be encountered during the loop.</Paragraph>
    <Paragraph position="3"> GetByExternalld (Collection, Externalld: string): Document OR nil returns the Document in the Collection with the given Externalld; if several Documents have the same Externalld, returns one of them; if none have this Externalld, returns nil.</Paragraph>
    <Paragraph position="4"> AnnotateCollection (which: Collection, destination: Collection, AnnotatorName: string) invokes annotation procedure AnnotatorName on a subset of Collection destination; see Section 5.6 for further information.</Paragraph>
  </Section>
  <Section position="11" start_page="261" end_page="268" type="metho">
    <SectionTitle>
5.0 DOCUMENT ANNOTATIONS: GENERAL STRUCTURE
</SectionTitle>
    <Paragraph position="0"> Annotations, along with attributes, provide the primary means by which information about a document is recorded and transmitted from module to module within a system. This chapter elaborates the general structure of annotations, noting some of the issues which arise at each stage.</Paragraph>
    <Section position="1" start_page="261" end_page="263" type="sub_section">
      <SectionTitle>
5.1 What Is Annotated?
</SectionTitle>
      <Paragraph position="0"> An annotation provides information about a portion of the document (including, possibly, the entire document). The portion of the document is specified by a set of spans. Each span consists in turn of a pair of integers specifying the starting and ending byte positions in the RawData of the document (with the first byte of the document counting as byte 0).</Paragraph>
      <Paragraph position="1">  CreateSpan (start: integer, end: integer): Span The current span design is intended for character-based text documents which may contain additional types of information such as graphical images, audio, or video, which needs to be retained and displayed, but which would not be further processed by components of the TIPSTER Architecture. For documents which do not contain text in the form of a sequence of characters, the meaning of a span will not necessarily be compatible with this start byte/end byte convention. For instance, in compressed video, the information contained in a sequence of frames cannot be located using starting and ending byte. Similarly, in a graphical image of a document, such as a fax, the most natural definition of a primitive subimage is likely to be a rectangle. Note that the data in a fax is not even byte aligned. All of these considerations indicate that eventually an opaque type for spans with a subclass being TextSpan will be needed Most annotations will be associated with a single contiguous portion of the text, and hence with a single span. However, a set of spans is provided for in order to be able to refer to non-contiguous portions of the text. For example, an event might be described at the beginning of an article and again later in the article, but not in the intervening text; using a set of spans allows us to have an annotation for the event refer to these two passages. It would also allow for discontinuous linguistic elements, such as verb plus particle pairs (&amp;quot;I gave my gun up.&amp;quot;).  Positions in the RawData are represented internally in terms of byte offsets, rather than characters. This is necessary because the RawData may contain non-text data, such as graphics or sounds, for which character addressing would not be meaningful. However, once a text has been segmented into text and non-text portions, and the text portion into segments involving different character codes, it should be possible to provide operations at the character level (i.e., operations which are sensitive to the different sizes of characters in different codes). This segmentation into regions using different character code sets is to be recorded in the TIPSTER Architecture as Annotations on the document (see Section 6.1). By accessing these Annotations, an application can determine the code set employed at a specific position in a document, and hence the size of the character at that position. This information can be used to implement operations to extract a single character or advance to the next character position.</Paragraph>
      <Paragraph position="2"> More work is required on the multi-lingual design of the Architecture before such operations can be incorporated into the Architecture itself.</Paragraph>
      <Paragraph position="3">  To allow annotations to modify the text (and, in particular, to insert characters) in such a way that subsequent accesses to the text see the modified text in place of the original text, it is necessary to require a representation of positions in a document which allows for insertions (e.g., by using integers above the length of the original string to refer to inserted elements). The current architecture does not allow for such changes; corrections to the text must be recorded as attributes on text elements which are explicitly accessed by subsequent processes. Alternatively, the application can create a new Document with a new RawText property which incorporates these modifications.</Paragraph>
    </Section>
    <Section position="2" start_page="263" end_page="263" type="sub_section">
      <SectionTitle>
5.2 Information Associated With an Annotation
</SectionTitle>
      <Paragraph position="0"> An annotation associates a type with a span of the document. Examples of possible types are token, sentence, paragraph, and dateline. In addition, one or more attributes may be assigned to each annotation.</Paragraph>
    </Section>
    <Section position="3" start_page="263" end_page="263" type="sub_section">
      <SectionTitle>
Class Annotation
Type of AttributedObject
Properties
</SectionTitle>
      <Paragraph position="0"> Id: string the identifier of an Annotation, which is nil when the Annotation is created and which is set when the Annotation is added to a Document; the value assigned is unique among the Annotations on that Document.</Paragraph>
      <Paragraph position="1"> Type: string Spans: sequence of Span</Paragraph>
    </Section>
    <Section position="4" start_page="263" end_page="263" type="sub_section">
      <SectionTitle>
Operations
</SectionTitle>
      <Paragraph position="0"> CreateAnnotation (Type: string, Spans: sequence of Span, attributes: sequence of Attribute): Annotation Examples of simple attributes on annotations (attributes whose values are single strings) include a type-of-name attribute on name annotations, which might take on such values as &amp;quot;person, country&amp;quot;, &amp;quot;company&amp;quot;, etc.; a pos (part of speech) attribute on token annotations, which might take on the Penn Tree Bank values, such as &amp;quot;NNS&amp;quot; and &amp;quot;VBG&amp;quot;, and a root attribute on token annotations, which would record the root (uninflected) form of a token.</Paragraph>
      <Paragraph position="1"> An example of an attribute whose value is another annotation would be a coreference pointer. An even more complex attribute value would be a template object, which may in turn contain pointers to several other annotations (for the text elements filling various slots in the template object).</Paragraph>
    </Section>
    <Section position="5" start_page="263" end_page="265" type="sub_section">
      <SectionTitle>
5.3 Accessing Annotations
</SectionTitle>
      <Paragraph position="0"> Because annotations are central to the TIPSTER architecture, it is expected that applications will have frequent need to access, search, and select annotations on a document. To meet this need, the Architecture defines a class AnnotationSet and a number of operations operating on such sets of annotations. In particular, operations are provided to support the sequential scanning of a document (AnnotationsAt, NextAnnotations) and to support thc extraction of annotations meeting certain criteria (SclcctAnnotations).</Paragraph>
      <Paragraph position="1"> Although AnnotationSets are logically just sets of annotations, and could be implemented like other sets (e.g., as lists), a special class is provided in the expectation that implementations may wish to choose a more elaborate implementation (such as a sorted list or tree with one or more indexes) in order to implement the operations more efficiently.</Paragraph>
      <Paragraph position="2"> Each Document includes as one property an AnnotationSet, holding the annotations on that Document. Most of the operations on AnnotationSets can also be applied to Documents, and in that case apply the same operation to the AnnotationSet property of the Document.</Paragraph>
      <Paragraph position="3">  adds an annotation to a document. If the Id slot of Annotation is nil, this operation creates a new annotation Id (unique for this document) and assigns it to the id field of Annotation. If the ld field of Annotation is filled (not nil), and there is an existing annotation on the document with the same Id, the new annotation replaces the existing annotation. The Id field of the annotation is returned.</Paragraph>
      <Paragraph position="4"> RemoveAnnotation (Document OR AnnotationSet, Id: string) removes the annotation with the specified Id from the Document or AnnotationSet. It is an error if the document does not have an annotation with that Id.</Paragraph>
      <Paragraph position="5"> GetAnnotation (Document OR AnnotationSct, Id: string): Annotation returns the annotation whose id slot is equal to the desired value. It is an error if no annotation has the specified identifier.</Paragraph>
      <Paragraph position="6"> Length (AnnotationSet): integer returns a count of the number of annotations in AnnotationSet Nth (AnnotationSet, n: integer): Annotation returns the Nth annotation in AnnotationSct, where the first annotation has index 0.</Paragraph>
      <Paragraph position="7"> SclectAnnotations (Document OR AnnotationSet, type: swing OR nil, constraint: sequence of Attribute): AnnotationSet returns the (possibly empty) set of annotations from the Document or AnnotationSet which are of type type and which satisfy constraint, constraint is a sequence of attributes, where the ith attribute has name a i and value vi. An annotation satisfies the constraint if (for each i), attribute ai of the annotation has value v i. If constraint is the empty sequence, no constraint is placed on the attributes: all annotations of the given type are selected. If type is nil, annotations of all types satisfying the attribute constraints are included.</Paragraph>
      <Paragraph position="8"> DeleteAnnotations (Document OR AnnotationSct, type: string OR nil, constraint: sequence of Attribute) removes from the Document or AnnotationSet all annotations which are of type type and which satisfy constraint. These arguments have the same significance as for SelectAnnotations, above. AnnotationsAt (Document OR AnnotationSet, Position: integer): AnnotationSet returns the set of annotations from Document or AnnotationSet which start at the specified position. NextAnnotations (Document OR AnnotationSet, Position: integer): AnnotationSct Returns the set of annotations from Document or AnnotationSet which have the smallest starting point that is greater than or equal to Position.</Paragraph>
      <Paragraph position="9"> MergeAnnotations (AnnotationSet, AnnotationSet): AnnotationSet returns the union of the Annotations in the two AnnotationSets.</Paragraph>
    </Section>
    <Section position="6" start_page="265" end_page="266" type="sub_section">
      <SectionTitle>
5.4 Annotation Type Declarations
5.4.1 Introduction
</SectionTitle>
      <Paragraph position="0"> A central goal in creating the TIPSTER architecture is for different modules to be able to share information about a document through the use of annotations. Such information sharing will be workable only if there are precise, formal descriptions of the structure of these annotations, and if the modules which create annotations adhere to these descriptions.</Paragraph>
      <Paragraph position="1"> Therefore, annotation type declarations are introduce here which serve to document the information associated with different types of annotations. In the present architecture these declarations only serve as documentation; future generations of the architecture may seek to do type checking based on these declarations (see Appendix A. 1).</Paragraph>
      <Paragraph position="2"> Type declarations are organized into packages. A package will typically include a set of related annotation types. For example, a package may declare all the types of annotations used to represent the document structure for one message format (header, dateline, author, etc.). Another package, associated with an extraction system, would represent the annotation types corresponding to the template objects created by that system.</Paragraph>
      <Paragraph position="3"> The declaration of a package of annotation types would consist of a package name declaration followed by one or more annotation type declarations. The package name declaration has the form type package identifier An annotation type declaration defines an annotation type; it specifies the attributes which such annotations may have and the type of value of each attribute. The declaration has the form annotation type identifier ( attribute-spec l attribute-spec2 .... }; where each attribute specification, attribute-spec,., has the form attribute-name: type-spec The type-spec specifies the type of allowable values of the attribute. The type spec may specify a basic type:  annotation (a reference to an annotation of any type) it may specify an enumerated type by giving its alternative values: ( value 1, value2 ...) it may specify a union of types by listing the alternative types: ( type I or type 2 or ...) to indicate that the value may be of any one of the types listed; it may specify a compound type, either  sequence of type which allows for a sequence of zero or more instances of type type, or optional type whose value may be either of type or be nil. Finally, type-spec may be a previously defined annotation type, specifying a reference to an annotation of that type.</Paragraph>
      <Paragraph position="4">  One or more white-space characters (blanks, tabs, or newlines) are required between successive identifiers and alphabetic names; zero or more white-space characters are allowed before and after the separator characters &amp;quot;: ; ()&amp;quot;. Any text between a left bracket and a right bracket (\[...\]) is considered a comment. Here is a simple example based on the mini-MUC organization template (more elaborate template examples are given in Section 8): type package organizations; annotation type organization { org_name: org_aliases: org_type: org_location: annotation type typed_location {location: type:</Paragraph>
    </Section>
    <Section position="7" start_page="266" end_page="268" type="sub_section">
      <SectionTitle>
5.5 Examples of Annotations
</SectionTitle>
      <Paragraph position="0"> string, sequence of string, { government, company, other }, sequence of typed_location }: string; {country city landregion province waterregion address oth-unk } } ;  This section shows some simple examples of annotated documents. Each example is shown in the form of a table, At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte offset) of each character. Underneath this appear the annotations, one annotation per line. For each annotation is shown its Id, Type, Span, and Attributes. Integers are used as the annotation Ids. Also, for simplicity only a single Span for each Annotation is shown. The attributes are shown in the form name = value. At the end of this section the type declaration packages which would be used to describe these annotations is shown. The first example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single attribute, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single attribute, indicating the type of name: person, company, etc.</Paragraph>
      <Paragraph position="2"> Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in our second example, which is an elaboration of our first example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.</Paragraph>
      <Paragraph position="4"> In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents attribute whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents attribute points to the constituent tokens. A reference to another annotation is represented in the table as &amp;quot;\[Annotation hi\]&amp;quot;; for example, &amp;quot;\[3\]&amp;quot; represents a reference to annotation 3. Where the value of an attribute is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents.</Paragraph>
      <Paragraph position="5"> At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in our third example6: 6 lncounting characters, count one character for the newline between lines</Paragraph>
    </Section>
  </Section>
  <Section position="12" start_page="268" end_page="271" type="metho">
    <SectionTitle>
1 Addressee
2 Source
3 Date
4 Subject
5 Priority
6 Body
7 Sentence
8 Sentence
</SectionTitle>
    <Paragraph position="0"> ddmmyy=101194 If the Addressee, Source .... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields.</Paragraph>
    <Paragraph position="1"> Our final example involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction attribute on token annotations and possibly on name annotations:  The sample annotations shown here would rec type package basic; annotation type token: annotation type name: annotation type sentence: type package parse; annotation type parse: type package message; annotation type addressee; annotation type source; annotation type date: {ddmmyy: string}; annotation type subject; annotation type priority; annotation type body; uire the following type declarations: {pos: string, correction: optional string};  { name_type: { person, organization, other } } ; { constituents: optional sequence of token }; { symbol: string, constituents: sequence of (parse or token or name)};</Paragraph>
    <Section position="1" start_page="269" end_page="270" type="sub_section">
      <SectionTitle>
5.6 Invoking Annotators
</SectionTitle>
      <Paragraph position="0"> Each TIPSTER system will be provided with a number of &amp;quot;annotators&amp;quot; procedures for generating annotations. There will be annotators for different types of annotations; for example, for tokenization, for sentence segmentation, for name recognition, etc. In addition, there may be multiple annotators of a single type; e.g., multiple tokenizers.</Paragraph>
      <Paragraph position="1"> Each annotator is assigned a name (a string). It is invoked by Annotate (Document, AnnotatorName: string) or AnnotateCollection (which: Collection, destination: Collection, AnnotatorName: swing) The first form annotates a single Document. The second form annotates a Collection or a subset thereof. This uses Collection which to determine which documents to process, and Collection destination to record the annotations. For each document in collection which, if the same document (a document with the same Id) appears in destination, annotate that document in collection destination. This calling sequence allows us to selectively apply annotators to subsets of a collection, but to keep all the annotations together in the &amp;quot;original&amp;quot; collection. If which and destination are identical, the entire collection is annotated.</Paragraph>
      <Paragraph position="2"> Note: Future versions of the architecture will include operations for managing the set of annotators: for adding an annotator to the set of annotators, for recording the types of annotations produced by an annotator, and for searching the set of annotators.</Paragraph>
    </Section>
    <Section position="2" start_page="270" end_page="270" type="sub_section">
      <SectionTitle>
5.7 External Representation of Annotations
</SectionTitle>
      <Paragraph position="0"> The TIPSTER architecture provides an external, character-based representation of annotated documents, so that such documents can be interchanged among modules (possibly as part of different TIPSTER systems on different machines) without regard to the internal representation used on particular machines. A representation based on SGML has been selected in order to be able to make use of the large number of existing applications which can operate on SGML documents.</Paragraph>
      <Paragraph position="1"> In this representation, if the document consists of the text &amp;quot;aaaa bbbb cccc&amp;quot;, and the span corresponding to &amp;quot;bbbb&amp;quot; has been assigned an annotation of type atype with id ident, and this annotation has attributes attribute1, attribute2, ... with values value1, value2 .... then the external representation of the annotated document will be aaaa &lt;atype id=ident attributel=valuel attribute2=value2... &gt;bbbb&lt;/atype&gt; cccc This representation is produced by the WriteSGML operation, which takes as arguments a document, an AnnotationSet, and a precedence list among annotation types. This precedence list is used to determine the nesting of SGML tags if two annotations involve the same span. A complementary operation, ReadSGML, reads a SGML document which conforms to this format (with all attributes and end tags explicit) and creates a document with annotations.</Paragraph>
      <Paragraph position="2"> The specification of these operations is subject to revision based on the experience of implementors in using these SGML representations in applications.</Paragraph>
      <Paragraph position="3"> It may be desirable to have a second external representation which much more closely parallels the internal property structure of the annotations, particularly if annotations are to be exchanged over a network.</Paragraph>
    </Section>
    <Section position="3" start_page="270" end_page="271" type="sub_section">
      <SectionTitle>
5.8 Annotation Schemata and Style Sheets
</SectionTitle>
      <Paragraph position="0"> Different groups of annotations normally exist in some fixed structural relationships to one another. Thus, a text body may consist of paragraphs, a paragraph of sentences, a sentence of tokens, etc. For an SGML document, these relationships are provided by a DTD. At present, the Architecture includes a very limited amount of such information in the form of the PrecedenceList argument to WriteSGML; it may be desirable to include in later versions of the architecture an AnnotationSchema more analogous to a DTD.</Paragraph>
      <Paragraph position="1"> When an SGML form is generated from an annotated document, rules must be applied to realize each type of annotation as a sequence of characters. In the present version, these rules are assumed to be built in to the WriteSGML operation, but in later versions it may be desirable to provide these rules explicitly as a StyleSheet. A TIPSTER System would have a default StyleSheet, but it may be necessary to extend the WriteSGML operation to use a different, explicitly specified style sheet.</Paragraph>
    </Section>
  </Section>
  <Section position="13" start_page="271" end_page="273" type="metho">
    <SectionTitle>
6.0 TYPES OF DOCUMENT ANNOTATIONS
</SectionTitle>
    <Paragraph position="0"> The TIPSTER Architecture defines a number of standard annotations; these are divided into structural and linguistic annotations. If these particular annotation type names are used, they must be used for the purpose designated.</Paragraph>
    <Paragraph position="1"> However, a TIPSTER system is free to create and use any other annotation types that it wishes.</Paragraph>
    <Paragraph position="2"> These annotations all have to be described in further detail.</Paragraph>
    <Section position="1" start_page="271" end_page="271" type="sub_section">
      <SectionTitle>
6.1 Structural Annotations
</SectionTitle>
      <Paragraph position="0"> 1. The raw document may contain several types of information, including text, tables, and graphics. The TIPSTER Architecture needs to preserve all this information in the document, but for the present will only process the text information (at a subsequent stage other structures with embedded text information, such as tables, may also be processed).</Paragraph>
      <Paragraph position="1"> To delimit these different types of information, the TIPSTER Architecture will use annotations of type TextSegment, each subsuming a maximal contiguous sequence of text (and possibly other annotations, such as GraphicsSegment, which would be ignored in subsequent processing).</Paragraph>
      <Paragraph position="2">  2. A text segment may consist of text in one or more languages and character codes. This information would be recorded by annotations of type MonolingualTextSegment which each have Language and CharacterSet attributes.</Paragraph>
      <Paragraph position="3"> 3. A document may be divided into a header and a body. The body would be annotated with a body annotation. The header may include a document identifier (to be annotated with a docid annotation) and such other properties as a title or headline, a dateline, etc.</Paragraph>
      <Paragraph position="4"> 4. A body may be divided into paragraphs; the p annotation type will be used to identify paragraphs. 5. A paragraph may be divided into sentences; the s annotation type will be used to identify sentences. 6. A sentence may be divided into tokens. The rules for tokenization for English will follow those used by the  Penn Tree Bank. Tokens will be denoted by the token annotation.</Paragraph>
      <Paragraph position="5"> The capability to annotate sentences and tokens will be obligatory for a TIPSTER System, since so many other properties may be expected to assume their existence. Other levels of annotation will be optional.</Paragraph>
    </Section>
    <Section position="2" start_page="271" end_page="273" type="sub_section">
      <SectionTitle>
6.2 Linguistic Annotations
</SectionTitle>
      <Paragraph position="0"> 1. Names, as defined for MUC-6. This includes company names, people's names, locations, currencies, and dates.</Paragraph>
      <Paragraph position="1"> 2. Part of speech labels, using the Penn TreeBank set as a standard for English. 3, Coreference tagging, as is being defined for MUC-6. Standards for other linguistic annotations, such as  phrase structure, word senses, and predicate-argument structure, may be added as more progress is made in defining these annotations for MUC evaluation.</Paragraph>
      <Paragraph position="2"> All of these linguistic annotations would be optional: the architecture would be used to establish standards whereby people who want to generate or use these annotations could communicate, but (except possibly for name recognition) this would not obligate anyone to produce these annotations.</Paragraph>
    </Section>
  </Section>
  <Section position="14" start_page="273" end_page="275" type="metho">
    <SectionTitle>
7.0 DETECTION
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="273" end_page="275" type="sub_section">
      <SectionTitle>
7.1 Object Classes
7.1.1 Detection Needs and Queries
</SectionTitle>
      <Paragraph position="0"> The user's request for documents is initially prepared in the form of a DetectionNeed: a document with a variety of SGML-delimited fields. A DetectionNeed is a type of Document, and so partakes of all the operations which can be applied to Documents. As a specialization of Collections, the Architecture includes DetectionNeedCollections; these are required primarily for routing operations, which typically involve sets of DetectionNeeds.</Paragraph>
      <Paragraph position="1"> The DetectionNeed is transformed in two stages: it is first transformed into a DetectionQuery, and thence into either a RetrievalQuery or a RoutingQuery. DetectionNeeds are independent of the specific retrieval engine employed, while DetectionQueries, RetrievalQueries, and RoutingQueries are specific to a particular retrieval engine. The DetectionQuery is specific to the retrieval engine but independent of the collection over which retrieval is to be performed, and the operation (retrieval or routing) to be performed; the RetrievalQuery and RoutingQuery are specific to the retrieval engine, to the operation, and to a collection (they may incorporate, for example, term weights based on the Inverse Document Frequencies in a collection). The transformation process is divided into these two stages because a retrieval system may provide specialized tools for modifying the DetectionQuery.</Paragraph>
      <Paragraph position="2">  A system-independent description of the contents of the documents that the user would like to retrieve.</Paragraph>
      <Paragraph position="3"> The description may be in natural language, expressed with query language operators (described below), or a combination of natural language and query language operators.</Paragraph>
      <Paragraph position="4">  Generate a system-specific DetectionQuery from an analysis of the DetectionNeed; the DetectionQuery has the same Externalld as the DetectionNeed.</Paragraph>
      <Paragraph position="5"> Query language operators are represented within the DetectionNeed using SGML-style tags. Each query language operator has the following syntax.</Paragraph>
      <Paragraph position="7"> That is, an operator consists of an operator field marker (e.g. &lt;OPERATOR&gt;), one or more arguments, and an ending field marker (e.g. &lt;/OPERATOR&gt;). Operators may be nested arbitrarily. Operator characteristics can be altered as shown. When alternatives are given (e.g. EXACT or FUZZY), the first one listed is the default. The default value for numeric arguments is 1.</Paragraph>
      <Paragraph position="8"> It is not necessary for a system to implement each operator exactly as described below. A compliant system is one that can translate any valid DetectionNeed into its own query language, and that documents how each operator is handled. A system may ignore operators that it does not implement, or it may map them to the nearest reasonable alternative in that system's query language.</Paragraph>
      <Paragraph position="9"> Any text not explicitly encapsulated in a query language operator is assumed to be implicitly encapsulated by the &lt;SUM&gt; operator (described below).</Paragraph>
      <Paragraph position="10"> When it is necessary to distinguish among two or more DetectionNeeds, for example when they are stored in an ASCII file, the &lt;DETECTION-NEED&gt; SGML tag indicates the beginning of a DetectionNeed, and the  &lt;/DETECTION-NEED&gt; SGML tag indicates the end of the DetectionNeed. Text that is not enclosed between these tags is handled in a system-dependent (i.e. not defined by the TIPSTER architecture) manner.</Paragraph>
      <Paragraph position="11"> The operators are listed below, in alphabetic order.</Paragraph>
    </Section>
  </Section>
  <Section position="15" start_page="275" end_page="283" type="metho">
    <SectionTitle>
&lt;AND MATCH=\[EXACT | FUZZY\]&gt;
</SectionTitle>
    <Paragraph position="0"> Document should contain all arguments. EXACT match means that each document must contain all of the arguments. FUZZY match means that a document may be returned if it lacks one or more arguments, but the document is presumably ranked lower than documents that match all arguments.</Paragraph>
    <Paragraph position="1">  The arguments are to be matched against that portion of the document annotated with the annotation of type &amp;quot;name&amp;quot;. Note that annotations may denote document structure, so that this operator may be used to restrict the match to within a single phrase, sentence, paragraph, section, etc.</Paragraph>
    <Paragraph position="2"> &lt;DOC-ATTRIBUTE=name&gt; The arguments are to be matched against the value of attribute &amp;quot;name&amp;quot;.</Paragraph>
    <Paragraph position="3"> &lt;NL&gt; The arguments are a natural language description of part of the information need. No other query operator can occur in the &lt;NL&gt; description of the information need. (Any operators encountered are to be treated as text.) &lt;/NL&gt; ends the field, unless it is escaped (see below).</Paragraph>
    <Paragraph position="4"> &lt;ESCAPE&gt; All tokens until &lt;/ESCAPE&gt; are query terms, not operators. If the next token is &lt;/ESCAPE&gt; then it is a query term, and not the end of the &lt;ESCAPE&gt;.</Paragraph>
    <Paragraph position="5">  Functionally, this operator is like an &lt;OR&gt; operator: The document must contain one or more arguments. However, the user may assume that documents that match more arguments are generally ranked higher than documents that match fewer arguments. (Typically used with vector-space or probabilistic systems.)  FormRetrievalQuery (DetectionQuery, sequence of DocumentCollectionlndex): RetrievalQuery translate the DetectionQuery into an RetrievalQuery by using the information (e.g., document frequencies of terms, similarities between terms) in the set of DocumentCollectionlndexes; the RetrievalQuery has the same Externalld as the DetectionQuery FormRoutingQuery (DetectionQuery, sequence of DocumentCollectionlndex): RoutingQuery translate the DetectionQuery into a RoutingQuery by using the information (e.g., document frequencies of terms, similarities between terms) in the set of DocumentCollectionlndex; the RetrievalQuery has the same Externalld as the DetectionQuery EditQuery (DetectionQuery) optional: this system-specific operation allows the user to modify the query, providing information which cannot be provided through the (system-independent) DetectionNeed  assign to each Document in Collection an attribute relevance whose value indicates the relevance of the document to the query UpdateUsingRelevanceFeedback (RetrievalQuery, relevant_docs: Collection, sequence of DocumentCollectionlndex): RetrievalQuery this operation updates the RetrievalQuery using relevance feedback, and returns the updated (or new) Query. The relevance feedback is provided through the relevant_docs argument. Each document in this collection should have an Attribute relevant with the value &amp;quot;true&amp;quot; or &amp;quot;false&amp;quot;. Furthermore, if that value is &amp;quot;true&amp;quot;, the entry may also have one or more Annotations of type relevant-section whose Spans indicate the relevant sections of the document.</Paragraph>
    <Paragraph position="6"> RetrievalQueryFromRelevanceJudgements (relevant_docs: Collection, sequence of DocumentCollectionlndex, DetectionNeed): RetrievalQuery this operation is similar to Update UsingRelevanceFeedback, but creates a new RetrievalQuery from scratch based on the relevance judgments recorded in relevant_docs. The DetectionNeed parameter is required since each query must point to the original DetectionNeed; this DetectionNeed may contain a narrative characterization of the query being created, but no information from the DetectionNeed is used in creating the query  The TIPSTER Architecture provides for two types of document detection operations: retrieval and routing. In essence, retrieval involves the comparison of a single query against a large number of documents, while routing involves the comparison of a single document against a large number of queries (or &amp;quot;user profiles&amp;quot;). As a preliminary step for retrieval, generally, the set of documents must be pre-processed. Typically, this involves the creation of a term index, but it may also involve the gathering of various statistics about the set of documents (such as term document frequencies, term co-occurrence frequencies, and even term similarities based on cooccurrence). The result of all this preprocessing is a DocumentCollectionlndex. Retrieval is then performed by sending a query (in the form of an RetrievalQuery) to the DocumentCollectionlndex; the DocumentCollectionlndex returns a list of relevant documents.</Paragraph>
    <Section position="1" start_page="278" end_page="278" type="sub_section">
      <SectionTitle>
Class DocumentCollectionIndex
Type of PersistentObject
Description
</SectionTitle>
      <Paragraph position="0"> a form of a Collection which is capable of responding to DetectionQuery. For most systems, this involves the annotation of the documents in the collection with approach-specific annotations, and then the creation of an inverted index involving these annotations. For some systems, however, an &amp;quot;index&amp;quot; might just be a normalized copy of the original text in a form which can be scanned by high speed search software.</Paragraph>
    </Section>
    <Section position="2" start_page="278" end_page="279" type="sub_section">
      <SectionTitle>
Operations
Augment (DocumentCollectionIndex, Collection)
</SectionTitle>
      <Paragraph position="0"> adds all the documents in Collection to the DocumentCollectionIndex RetrieveDocuments (sequence of DocumentCollectionlndex, RetrievalQuery, NumberToRetrieve: integer, Monitor or nil): Collection returns a collection of Documents (of maximal length NumberToRetrieve) which are most closely related to the DetectionNeed from which the Retrieval Query is derived. The DocumentCollectionIndex will provide progress updates as requested by the Monitor. A nil argument means that no progress monitoring is required. A retrieval operation canceled by the Monitor object's MonitorProgress operation returns a Collection of accumulated documents In routing, a set of queries or user profiles (in the form of RoutingQueries) are pre-processed to create a QueryCollectionIndex. Routing is then performed by sending a Document to a QueryCollectionIndex; what is returned is a set of relevant profiles (in the form of a DetectionNeedcollection).</Paragraph>
      <Paragraph position="1">  adds a single query (in the form of an RoutingQuery) to a QueryCollectionlndex; if an existing query in the QueryCollectionlndex is based on the same DetectionNeed as RoutingQuery, the existing query is replaced by RoutingQuery RemoveQuery (QueryCollectionlndex, RoutingQuery) if QueryCollectionlndex includes a query based on the same DetectionNeed as RoutingQuery, that query is removed from the Index RetrieveQueries (sequence of QueryCollectionlndex, Document, NumberToRetrieve: integer): DetectionNeedCollection returns the collection of DetectionNeeds (of maximal length NumberToRetrieve) which are most closely related to Document  The Monitor object is intended as an advisory object in the Architecture. If no Monitor object is provided, no monitoring or interruption of the RetrieveDocuments operation is possible. The RetrieveDocuments operation will not fail due solely to the absence of a nil Monitor argument.</Paragraph>
      <Paragraph position="2">  StatusType is the type of report requested. If the type is not supported a reasonable default shall be provided with the type indicated. IntervalType indicates the desires type of interval which may differ from StatusType. Interval indicates the frequency of status information. The Interval value behaves according to the IntervalType. If IntervalType is Percent then Interval = 5 means provide status when each 5% of the documents are processed. ClientData is optional user data for the MonitorProgress operation MonitorProgress (Monitor, DCIName: string, Status: integer, MaxStatus: integer, Type: one of{NumDocs, Time, Percent}): Boolean DCIName is the name of the DocumentCollectionlndex which is being monitored. Status is the current status consistent with type. MaxStatus indicates the maximum value Status may have for DCIName. Type is the type of progress update provided to the function Returns FALSE to terminate the search, returns TRUE to continue the search</Paragraph>
    </Section>
    <Section position="3" start_page="279" end_page="283" type="sub_section">
      <SectionTitle>
7.2 Functional Model
</SectionTitle>
      <Paragraph position="0"> The following functional model diagrams are based on the notation used by Rumbaugh et al. Ovals represent processes (operations); boxes with only a top and bottom represent &amp;quot;data stores&amp;quot; -- persistent repositories of data; fully enclosed boxes represent &amp;quot;actors&amp;quot; -- active sources of data.</Paragraph>
      <Paragraph position="1">  The system begins by converting the DocumentCollection(s) into DocumentCollectionlndex(es), as shown on the left side. To retrieve information from this collection, the User produces a DetectionNeed. This DetectionNeed is converted in two stages, first to a DetectionQuery and then to an RetrievalQuery, as shown in the right column (the latter step may use information, for example, on term weights, from the DocumentCollectionIndex). Finally, the  Routing requires a DocumentCollectionIndex which is used to determine weights for the translation of a DetectionQuery into an RoutingQuery. Typically an application will be able to use a pre-existing index (for a Collection of content comparable to the documents to be routed).</Paragraph>
      <Paragraph position="2"> Each DetectionNeed (user profile) in the DetectionNeedCollection is translated in two stages: first to a DetectionQuery, and then into a RoutingQuery. These RoutingQueries are then stored and indexed in a QueryCollectionlndex. Finally, this QueryCollectionlndex can be run against a Document to produce a set of relevant queries (profiles), in the form of a DetectionNeedCollection.</Paragraph>
      <Paragraph position="3">  Relevance feedback begins with an initial RetrievalQuery, which is used to retrieve a set of documents. This operation is shown as &amp;quot;Retrieve Documents \[1\]&amp;quot; in the figure below (the DocumentCollectionlndex input is not shown), and produces a Collection. A human judge (or possibly an alternative source of relevance judgments, such as an extraction system) then reviews the retrieved documents and records relevance judgments on the Collection using the relevant Attribute. This is done using a Relevance Recorder, which is not part of the Architecture but would be part of any application system which wished to support relevance feedback. The Collection is then fed, along with the original query, to an UpdateUsingRelevanceFeedback operation, producing an updated query. Finally, the updated query can be used to retrieve a new set of documents (shown as &amp;quot;Retrieve Documents \[2\]&amp;quot; at the bottom of the figure).</Paragraph>
    </Section>
  </Section>
  <Section position="16" start_page="283" end_page="287" type="metho">
    <SectionTitle>
8.0 EXTRACTION
</SectionTitle>
    <Paragraph position="0"> Information extraction the extraction from a document of information concerning particular classes of events is a form of document annotation. An extraction engine adds annotations describing the events and their participants.</Paragraph>
    <Paragraph position="1"> Extraction therefore does not require any operations and classes beyond those already presented. However, because extraction will be a major component of many systems built using the Architecture, this section describes how extraction fits into the current Architecture.</Paragraph>
    <Paragraph position="2"> At present the development of extraction engines from a description of a class of events (a &amp;quot;scenario&amp;quot;) is a black art practiced by a cadre of information extraction specialists. It is expected that in the future it will be possible for experienced users to customize extraction systems to new scenarios; this would be an interactive process which would draw upon a library of predefined template objects. Appendix A.2 presents the additional object classes which would be needed to support such customization.</Paragraph>
    <Section position="1" start_page="283" end_page="283" type="sub_section">
      <SectionTitle>
8.1 Representing Templates as Annotations
</SectionTitle>
      <Paragraph position="0"> In the terminology developed by the Message Understanding Conferences, the information extracted from a document is stored in a (filled) template, which in turn consists of a set of template objects. A template object may contain information about a real-world object (such as a person, product, or organization), a relationship, or an event.</Paragraph>
      <Paragraph position="1"> Each such template object provides information about a portion of a document and is therefore represented in the TIPSTER Architecture by an annotation. A particular extraction task will involve several kinds of template objects, for events, people, organizations, etc. Each kind of template object corresponds to a type of annotation. Thus the formal specification of a set of template objects corresponds to a set of annotation type declarations. This formal specification is supplemented by a large amount of narrative (the &amp;quot;fill rules&amp;quot;) describing the circumstances under which a template object is to be created and the information to be placed in each slot.</Paragraph>
      <Paragraph position="2"> Each slot/value pair in the template object is represented as an attribute/value pair on the annotation. Note that the values of attributes can be lists (thus allowing for slots with multiple values) and can be references to other annotations (thus allowing for a hierarchy of filled objects, and allowing for references to other annotations, such as names which have been identified by a prior annotation process). Furthermore, each annotation has a span which can link the object to the text from which it has been derived.</Paragraph>
      <Paragraph position="3"> Some applications may want to link an individual slot in the template object to text in the document. This can be done by introducing additional annotations. Instead of having the value of the attribute corresponding to that slot be a string, it would be a reference to an annotation of type string-annotation. That annotation would (like all annotations) have a set of spans referencing the text; it would also have a value attribute holding the value of the template slot (the &amp;quot;slot filler&amp;quot;). This has been done for one of the slots in the example below, the role slot of personnel, but could have done it for others.</Paragraph>
      <Paragraph position="4"> If an application system involves extractions for multiple scenarios (multiple classes of events),it will be necessary to distinguish the annotations corresponding to different extraction scenarios (so that, for example, one can display all the annotations for one scenario). This can be done by adding a scenario attribute to each annotation. In similar fashion, in an application environment integrating annotation modules from different suppliers, it would be desirable to record the source of particular annotations using an annotator attribute. These additional attributes are not shown in the example below.</Paragraph>
    </Section>
    <Section position="2" start_page="283" end_page="287" type="sub_section">
      <SectionTitle>
8.2 An Example
</SectionTitle>
      <Paragraph position="0"> As an illustration of this approach, consider the result of annotating a document consisting of the sentence The KGB kidnapped ARPA program manager Umay B. Funded.</Paragraph>
      <Paragraph position="1">  with an information extraction system covering terrorist events.  PER_NAME: &amp;quot;Umay B. Funded&amp;quot; These might be encoded as a set of annotations as follows: The MUC-style template for such an event might 7 The templates shown here are loosely based on those for the MUC-6 information extraction task.  The type declaration package for these annotations is as follows: type package terrorist_event; annotation type event: annotation type org: annotation type personnel: annotation type person: annotation type name: annotation type string-annotation:  { event_type: { kidnapping, murder .... }, perp: org, target: personnel }; { org_name: name, org_nationality: string}; { person: person, organization: org, role: string-annotation } ; { per_name: name }; {name_type: {person, organization, other} }; { value: string };</Paragraph>
    </Section>
  </Section>
  <Section position="17" start_page="287" end_page="289" type="metho">
    <SectionTitle>
APPENDIX A POSSIBLE EXTENSIONS TO THE ARCHITECTURE
A.1 Enforcing Type Declarations
</SectionTitle>
    <Paragraph position="0"> In the current Architecture, annotation type declarations serve only as documentation; they are not processed by any component of the Architecture. It may be desirable in future versions of the Architecture to perform type checking based on such declarations. This could involve:  1. creation of a new class of document, TypeDeclarationDocument, containing a package of type declarations 2. associating a set of declaration packages with a Collection 3. requiring that any annotation added to a document in a collection conform to the associated type declaration  A number of issues would need to be resolved to implement such a scheme, including the name scoping of annotation types, and the implications of modifying a type declaration after annotations of that type have been created. The overall type checking mechanism would be fairly complex and so has not been included in the current Architecture.</Paragraph>
    <Section position="1" start_page="287" end_page="289" type="sub_section">
      <SectionTitle>
A.2 Customizable Extraction Systems
</SectionTitle>
      <Paragraph position="0"> The present Architecture treats extraction engines as modules which have been hand-coded for specific tasks (extraction scenarios). In the future, it is expected that there will be more general extraction engines which can be customized by users to specific needs. This section considers the additional object classes and data flow which would be entailed,  The user would prepare an ExtractionNeed, using a combination of formal specification and narrative description comparable to the &amp;quot;fill rules&amp;quot; for MUC-5. This would then be &amp;quot;translated&amp;quot; to produce a CustomizedExtractionSystem. This translation would be performed by a component which will guide an analyst in producing a CustomizedExtractionSystem; this interactive translation component is labeled Customize below. Once a CustomizedExtractionSystem is created, it can be applied to documents in a collection (like other, pre-existing annotators) and will produce templates for the documents.</Paragraph>
      <Paragraph position="1"> The Extraction Need would include annotation type declarations for the annotations to be produced. These type definitions will be supplemented by fill rules in the form of comments. As the process of translating ExtractionNeeds becomes more formalized, the fill rules will accordingly also become more formalized. For example, the specifications may include the semantic class of particular slot fills. For the present, however, an ExtractionNeed is a  the operation which generates templates from documents. Extraction is a special type of annotation, and accordingly the Extract operation is a variant of the Annotate operation (Section 5.6). For each document in collection which, if the same document (a document with the same Id) appears in destination, annotate that document in collection destination with the information extraction templates generated for that document.</Paragraph>
    </Section>
    <Section position="2" start_page="289" end_page="289" type="sub_section">
      <SectionTitle>
Class Template Object Library
Description
</SectionTitle>
      <Paragraph position="0"> a set of system-specific rules for extracting various classes of objects, such as persons or organizations; this library could bc used in customizing an extraction system to a particular task</Paragraph>
    </Section>
  </Section>
  <Section position="18" start_page="289" end_page="290" type="metho">
    <SectionTitle>
A.2.2 Functional Model
</SectionTitle>
    <Paragraph position="0"> The analyst begins by preparing an ExtractionNeed. The ExtractionNeed would serve as the starting point for customization, which would be performed by the analyst using an interactive customization tool and drawing upon the Template Object Library. The result of this process would be a CustomizedExtractionSystem.</Paragraph>
    <Paragraph position="1"> Once a CustomizedExtractionSystem has been created, it can bc given a Collection specifying the documents to be annotated (the &amp;quot;which&amp;quot; argument) and a Collection where the annotations shall be placed (the &amp;quot;destination&amp;quot; argument); it will add to each document of the destination Collection the appropriate templates (in the form of annotations).</Paragraph>
  </Section>
</Paper>