<?xml version="1.0" standalone="yes"?>
<Paper uid="C65-1006">
  <Title>AUTOMATIC LINGUISTIC CLASSIFICATION</Title>
  <Section position="4" start_page="6" end_page="6" type="metho">
    <SectionTitle>
BASIC PROGRAMS
</SectionTitle>
    <Paragraph position="0"> The Automatic Classification System (ACS), a Fortran IV programming package (7) based on the classification theories of Needham and Parker-Rhodes (8) has been developed by the Linguistics Research Center of The University of Texas (under support of the National Science Foundation and the U. S. Army Electronics Laboratories), and has been made generally available for classification research. The version of the system used in our own facility has been augmented with list-processing routines and other specialized programming which greatly increased its efficiency and data-capacity.</Paragraph>
    <Paragraph position="1"> ACS is a generalized classification system which can be applied to non-linguistic as well as linguistic problems. Its basic inputs are data describing the incidence (or the frequency of incidence) of particular properties upon particular objects. These incidence data may be transposed, so that either the properties or the objects can be classified. Various measures of the similarity between pairs of the objects (or properties) are available, permitting the incidence data to be used in computations of the connections between object (or property) pairs. Using these connection data, other routines group together &amp;quot;clumps&amp;quot; of objects with similar properties (or of properties occurring similarly in the objects). Various kinds of the clumps can be discovered.</Paragraph>
    <Paragraph position="2"> ACS has a section which controls the selection of similarity measures and clumping methods in classification experiments (6). To formalize the concepts of distributional classification, e.g. those investigated by Hockett (8) and by Harris (9) we have extended the general classification theory to binary as well as singulary relations. In linguistic classification these can be interpreted as constitutive relations (e.g.</Paragraph>
    <Paragraph position="3"> Pendergraft, Dale 2-2 concatenation). This interpretation, more exactly, assumes that the incidence data describe pairs of objects standing in that particular relation. Clumps of similar objects are then found in both the domain and counterdomain of the relation. Finally, individual clumps in the domain are paired with individual clumps in the counterdomain according as the connections between members of the two clumps are dense (in a precise sense) relative to the entire set of connections. Pairs of clumps may also be found by using a measure of relative sparseness in the connections. These capabilities have been added to ACS, and programming is being done to prepare incidence data mechanically from the results of automatic analysis in LRS (i0).</Paragraph>
    <Paragraph position="4"> The automatic analysis algorithms in LRS are linguistically generalized, i.e. they will recognize the expressions of any object-language according to the exact specifications described in particular meta-languages. These language data, furthermore, are operationally generalized; they will be given relationally (solely in terms of relations and not as a process) so that synthesis as well as analysis algorithms may refer to the same descriptions. Object-language descriptions are conveyed by a hierarchy oPS meta-languages rather than by a single meta-language (5). The complete data hierarchy will be given by lexical, syntactical, semantical and pragmatical meta-languages; the first three are currently available in LRS. Lexical, syntactic, semantic and pragmatic analysis (or synthesis) algorithms will be oriented to the corresponding levels of monolingual data. Analysis will affect a transfer to the next higher level of processing; synthesis to the next lower level. Automated lexical and syntactic analysis (as well as synthesis and translation) are operational in LRS, and the semantic algorithms will be later this year. All of the algorithms are parallel, stochastic, heuristic and machineindependent; that is to say, they have the following design features which we believe to be important in automatic linguistic  classification experiments: Pcndergraft, Dale 2-3 (a) They carry forward a search for all possible linguistic alternatives in parallel, instead of following to completion one sequence of alternatives before beginning another. As a result, all of the available linguistic evidence is represented in the analysis output.</Paragraph>
    <Paragraph position="5"> (b) They compute a probability for each linguistic alternative being processed. The probability will be the same in analysis or synthesis; it represents the likelihood of occurrence in tile language rather than in the process.</Paragraph>
    <Paragraph position="6"> (c) They {may or may not, as a matter of choice) use  the probabilities as heuristic criteria to limit the analysis output to the most likely alternatives.</Paragraph>
    <Paragraph position="7"> (d) They are oriented entirely to particular metasyntactical and meta-semantical relations, not to the components of a particular computer. All processing decisions and results will be the same on every computer large enough to do the linguistic processing.</Paragraph>
    <Paragraph position="8"> Pendergraft, Dale 3-1</Paragraph>
  </Section>
  <Section position="5" start_page="6" end_page="6" type="metho">
    <SectionTitle>
MORPHOLOGICAL CLASSIFICATION
</SectionTitle>
    <Paragraph position="0"> Constitutive relations may, of course, be used as the basis of classification of individual objects (i.e. for singulary as opposed to binary classification). A formal distinction can also be made between the classes of objects which are identical and those which are to some degree equivalent in distribution. The former, to be specific, have identical clump membership and are thus indistinguishable with respect to the particular constitutive relation described in the incidence data. The latter have common membership in a particular set of clumps and, as a consequence, share certain distributional properties which are represented by those clumps.</Paragraph>
    <Paragraph position="1"> Morphological classification, therefore, will involve the following basic operations within our theories.</Paragraph>
    <Paragraph position="2"> (These will also be pertinent to our remarks below about semological classification.) For the given constitutive relation, the classification algorithm will have to:  (a) Recognize, among the objects potentially made available by segmentation, those which are to be classified. (b) Perform singulary classification of the recognized objects to determine which subsets of them have identical  distribution relative to that constitutive relation.</Paragraph>
    <Paragraph position="3"> We assume, in the morphological problem, that the objects to be classified are lexical units (whether phonetic or orthographic) and that concatenation is the constitutive relation. Our working hypothesis is that the morphological objects are those which maximize the connection entropy.</Paragraph>
    <Paragraph position="4"> This seems intuitively reasonable, since more or less homogeneous connections would be anticipated among objects Pendergraft, Dale 3-2 having elementary status. Conversely, a relatively strong connection between two objects would be evidence that they were parts of a single construct.</Paragraph>
    <Paragraph position="5"> Accordingly, a routine has been added to ACS to normalize the connection data and,nfOr the normalized connections PI' P2' &amp;quot;''' P2 (i * Pi; r Pi = i), to compute</Paragraph>
    <Paragraph position="7"> H(PI' P2' &amp;quot;''' Pn ) = -r pj log pj j=l for a convenient logarithmic base \[11\]. The second operation, that of determining what objects have identical distributional properties, will be handled by a routine which finds the sets in the intersections defined by the collection of all clumps.</Paragraph>
    <Paragraph position="8"> (We will say that the members of the identit Z classes form &amp;quot;component sets&amp;quot; of the universe of objects being classified, because the sets partition that universe.) Some of the strategies which may be used in morphological classification have been compared by Hockett \[12\] who concluded that different classification methods could succeed in establishing the same relation between morphemes and phonemes. A strategy chosen for automation must, above all, be computationally tractable. The two methods which Hockett calls the &amp;quot;morph approach&amp;quot; and the &amp;quot;morphophonemic approach&amp;quot; would have inherent advantages or disadvantages computationally.</Paragraph>
    <Paragraph position="9"> The &amp;quot;morph approach,&amp;quot; according to Hockett, sup= poses that morphemes are represented by morphs and that morphs are composed by phonemes. In consequence, the constitutive Pendergraft, Dale 3-3 relation (concatenation) must obtain between constructions of phonemes. By our hypothesis, a set of morphs would be any set of the constructions maximizing (perhaps locally) the connection entropy. And members of each identity class of the morphs would be the allomorphs of a particular morpheme. Computationally, then, we might carry out the following morphological classification algorithm:  (a) Perform lexical analysis, using the current set of phoneme constructs as the lexical data (lexicon).</Paragraph>
    <Paragraph position="10"> (b) Prepare incidence data describing the constitutive relation between the contructs.</Paragraph>
    <Paragraph position="11"> (c) Compute the connections between pairs of the constructs and the connection entropy, comparing the result with the entropy of the preceding cycle.</Paragraph>
    <Paragraph position="12"> (d) If the entropy has increased, combine the (one or more, depending upon the rate of increase) strongly connected pairs into single constructs; return to (a).</Paragraph>
    <Paragraph position="13"> (e) If the entropy has not increased, perform singulary classification and find the component sets of morphs which will represent morphemes.</Paragraph>
    <Paragraph position="14">  llockett's &amp;quot;morphophonemic approach,&amp;quot; in contrast, would take morphemes to be composed of morphophonemes and morphophonemes to be represented by phonemes. One interpretation of these relations in terms of the classification theories (among several) would involve the supposition that the members of each identity class of the phonemes represent Pendergraft, Dale 3-4 a particular morphophoneme. Consequently, any phoneme representing the morphophoneme M would be distinguished in the incidence data only as an M. A phoneme construct, similarly, would be distinguished only as the construct of represented morphophonemes. Any set of the morphophoneme constructs maximizing (maybe locally) the connection entropy would be recognized as morphemes.</Paragraph>
    <Paragraph position="15"> That these relations would call for a different computational strategy should be evident. Because of the higher level of abstraction, one might anticipate that (a) there would be fewer morphophoneme than phoneme constructs, and (b) the latter would occur more frequently than the former in the outputs of lexical analysis. But our aim is not to prejudge the computational advantages of one approach above another; the schemes which are feasible within our analysis and classification capabilities will be tested.</Paragraph>
    <Paragraph position="16"> Pendergraft, Dale 4-1</Paragraph>
  </Section>
  <Section position="6" start_page="6" end_page="6" type="metho">
    <SectionTitle>
SYNTACTICAL CLASSIFICATION
</SectionTitle>
    <Paragraph position="0"> The advantages of abstraction in classification are nevertheless striking in syntactical applications. Indeed the possible gains seem so promising that we have bypassed automated morphological classification in our first experiments to investigate the following operations of syntactical classification. Each operation presupposes not only the existence of a set of morphemes, but an assignment of the morphemes to syntactical equivalence classes relative to concatenation, as already described.</Paragraph>
    <Paragraph position="1"> (i) Identification of classes. If, in the outputs of syntactical analysis, it is found that some expression has been (ambiguously) recognized both as an A and as a B, then this coincidence of A and B will be the event counted. Singulary classification will then be performed to determine whether an A and a B are distinguishable distributionally relative to coincidence.</Paragraph>
    <Paragraph position="2"> If not, we will induce that the predicates &amp;quot;A&amp;quot; and &amp;quot;B&amp;quot; are coextensive, i.e. they denote the same objects \[13\]. The two predicates will therefore be replaced (wherever they occur in the syntactical description) by a single predicate.</Paragraph>
    <Paragraph position="3"> (2) Generalization of classes. During the class identification operation (i), the event of being an A will be assigned to a set of (zero or more) clumps. If being an A entails being in the clump C, then we introduce the new predicate &amp;quot;C&amp;quot;. We induce, further, that the predicate &amp;quot;C&amp;quot; comprehends the predicate &amp;quot;A&amp;quot;, i.e. &amp;quot;C&amp;quot; denotes every object that &amp;quot;A&amp;quot; does \[13\]. And, since the extensions of the new predicates are clumps of objects sharing some distributional property, we characterize &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; as ostensive and distributional predicates, respectively, relative to the constitutive relation.</Paragraph>
    <Paragraph position="4"> Taking a new incidence data to describe the relation of comprehension between the distributional and ostensive ,red-Pendergraft, Dale 4-2 icates, we will next perform singulary classification to bring together predicates which are similar relative to comprehension. Thus) we induce that the predicates have similar extensions. The ostensive predicates in each clump will be replaced (in all their occurrences in the syntactical description) by the distributional predicate of that clump, i.e. by the predicate whose extension is the union of the extensions of the extensionally-similar predicates. (Kclumping is convenient for generalizing classes because it provides a parameter for the degree of generalization.) (3) Rul_.__~e generation. The aim of this operation will be to find new syntactic rules (i.e. taxonomic axioms) to be added to the syntactical description. The events to be counted in preparing the incidence data will be those in which an A is found to be concatenated to a B in the outputs of automated syntactic analysis. Binary classification will be used to pair clumps on the basis of dense connections, as explained above. For any resulting pair of densely connected clumps C) D classifying an A as a C) and a B as a D) respectively) we generate the syntactic rules A ~ C, B = D and C D ~ E. The predicate &amp;quot;C D&amp;quot; will have as its extension any C concatenated to any D. &amp;quot;E&amp;quot; will be a new predicate comprehending &amp;quot;C'~D. '' Rules generated inductively will tend to be overly general. There will be an operation) however) by which syntactical classes can be specialized to conform to the empirical  analysis data.</Paragraph>
    <Paragraph position="5"> (4) Specialization of classes. From the rules A ~ C and C D ~ E we may infer the derived rule A D ~ E. Hence the application of A c C to C~D ~ E at the first (left-most) place \['endergraft, Dale 4-3  in the latter may be symbolized algebraically \[14\] as follows: f-a &amp;quot;/'7~ ,.-~ (c D F) (A CO) = (A D c..E) Incidence data, prepared for one particular class Cp will describe tile frequency of application of rules at places mentioniag that class. The events counted, specifically, will be those in which a rule X is found (in the analysis outputs) to be applied at a place p in rule Y (i.e. for the event &amp;quot;~xfPaY, the pair of objects YP~x will be regarded as standing in the constitutive relation). Different places in the same rule will be treated as different objects relative to application. Binary classification will be used to pair densely-connected clumps of distributionally similar (in the domain) places of application~ and (in the counterdomain) rules being applied to the places. The predicate &amp;quot;C&amp;quot; will be replaced (in those particular occurrences in the syntactical description) by a new predicate denoting that subclass of C. These syntactic classification operations will be opera= tional in the combined LRS-ACS programming system before the end of this year. We plan an extension of the system to include automated morphological, semological and semantical classification. The last will be restricted to a distributional semantics without identification of references, i.e. to the restricted form of our theoretical hypothesis \[5\] which assumes that applications at different places in the same rule are independent events.</Paragraph>
    <Paragraph position="6"> Pendergraft, Dale 5-1</Paragraph>
  </Section>
  <Section position="7" start_page="6" end_page="6" type="metho">
    <SectionTitle>
SEMOLOGICAL CLASSIFICATION
</SectionTitle>
    <Paragraph position="0"> Recently we observed \[iS\] that a small informational unit in language data seems convenient for the descriptive linguist, but a large informational unit would optimize linguistic processing. In retrospect it appears likely that, in our project and elsewhere, different approaches to syntactical description have too often been concerned with different informational units rather than different information. As anticipated above, we have come to questions in syntactical classification which are analogous to those in lexical classification which gave rise to morphology; viz. what objects are to be classified semantically? Joos \[16\] has stimulated our thinking about semology, as has La~nb \[12\]. Undoubtedly the latter's own interest in automated syntactical classification \[i\] has contributed to the similarity of our theories; the study of automatic linguistic classification brings one to consider informational units which are small enough to be discovered mechanically.</Paragraph>
    <Paragraph position="1"> Adopting Bloomfield's terminology \[18\], we will refer to the elemental units of syntactical description (i.e.</Paragraph>
    <Paragraph position="2"> those rules conveying minimal units of information) as tasmemes.</Paragraph>
    <Paragraph position="3"> The elementary units to be classified semantically will be semes. Between the two, we will posit semological relations analogous to those which Hockett presented for morphology.</Paragraph>
    <Paragraph position="4"> (i) The first hypothesis would be that sememes are represented by semes, and that semes are composed of tagmemes.</Paragraph>
    <Paragraph position="5"> Within the frame of our classification theories, therefore, the constitutive relation would be application: the semes would be regarded as the representatives of a particular sememe. This is the approach we will take in our first semological experiments.</Paragraph>
    <Paragraph position="6"> Pendergraft, Dale 5-2 (2) Semes would be composed of semotagmemes, in tile second hypothesis, and semotagmemes represented by tagmemes.</Paragraph>
    <Paragraph position="7"> Consequently, for the purposes of automated classification, the members of an identity class relative to application would be regarded as the representatives of a particular semotagmeme. A set of semotagmeme constructs (locally) maximizing the connection entropy would be recognized as semes. This approach to automated semological classification may have the advantage of a higher level of abstraction, like the analogous morphophoneme approach in morphological classification.</Paragraph>
    <Paragraph position="8"> Both semological hypotheses will be tested when we have the additional data-capacity which a magnetic disk will provide in ACS early next year. LRS programs that maintain either type of semological data are already operational.</Paragraph>
    <Paragraph position="9"> Pendergraft, Dale 6-1</Paragraph>
  </Section>
  <Section position="8" start_page="6" end_page="6" type="metho">
    <SectionTitle>
6 SEMANTICAL CLASSIFICATION
</SectionTitle>
    <Paragraph position="0"> Sememes, in the sense which may be formalized as suggested above, are regarded in our working hypothesis as describing signs in the object-language. To be specific they will have two epistemological functions: (a) They will convey the (formational) syntax of the object-language, i.e. the information needed to construct complex signs from the basic ones.</Paragraph>
    <Paragraph position="1"> (b) They will be units substituted in translation, paraphrasing and other transformations based on semantical criteria.</Paragraph>
    <Paragraph position="2"> A fundamental principle leading to distributional semantics was cited by Martin \[15\] in 1958. In discussing &amp;quot;translational&amp;quot; and &amp;quot;non-translational&amp;quot; semantical metalanguages, he presents a thesis which we will paraphrase very roughly for our present purpose: Semantical relations (e.g. denotation, designation), in requiring as their arguments both signs and their objects (denotata, designata), make it necessary that the semantical meta-language itself have signs for the same objects as the object-language. The meta-language signs are, accordingly, translations of the object-language signs, since the two sets have common objects. As a consequence of this, semantical relations in the meta-language will be at least as complex as those in the object-language. However a &amp;quot;non-translational&amp;quot; semantical meta-language may describe a relation between signs, but one defined in semantical terms (e.g. comprehension, where one sign will comprehend another if the former denotes every Pendergraft, Dale 6-2 object the latter does). This second type of meta-language will be semantically less complex than the object-language. Furthermore, as we have suggested above, it is probable that comprehension of signs may be induced from distributional evidence. A distributional semantics, in addition to being a non-translational in Martin's sense, would define comp\[ehension or some alternative relation between signs in purely distributional terms) leaving aside all theoretical references to objects which the signs may or may not have. This is the approach we \]lave taken) by employing the concepts of classification theory to formalize those of distribution.</Paragraph>
    <Paragraph position="3"> With few exceptions the computational strategies in semantical classification will be the same as in the distributional syntactics. Analogous operations of class identification, generalization and specialization will be available. But the members of syntactical classes will be syntactic rules. And the rules in a given class will be required to have the same &amp;quot;degree,&amp;quot; i.e. the same number of those predicates with the equivalence (but not the identity) classes as their extensions \[5\].</Paragraph>
    <Paragraph position="4"> Generation of semantic rules will likewise be analogous to the syntactical operation. But our semantical hypothesis requires that all of the syntactic rules in the extensions of two semantical classes be applied (pairwise) at~laces of application with the same name. For instance A B would describe the applications of the rules in semantical class B to those in the class A at the places named by the numeral 2. When the syntactic rules are first generated, Pendergraft, Dale 6-3 their places of application will be named positionally (from left to right) and, in the restricted theory, uniquely (no two places will have the same name). Binary semantical classification, as part of the rule generation operation, will show how the places should be renamed to satisfy the above semantical convention. (In LRS this is the information conveyed by &amp;quot;superscripts&amp;quot; associated with the appropriate predicates in syntactic rules.) Otherwise the generated semantic rules will be formally the same as the syntactic (e.g. A = C, B c D, C~D c E). The numeral naming the place of application of two semantical classes is given in our notations as part of the connective symbolizing application. Conventions for renaming the places during deductive inference \]lave been reported elsewhere \[19\].</Paragraph>
    <Paragraph position="5"> Pendergraft, Dale 7-1</Paragraph>
  </Section>
  <Section position="9" start_page="6" end_page="6" type="metho">
    <SectionTitle>
SELF-ORGANIZING LINGUISTIC SYSTEMS
</SectionTitle>
    <Paragraph position="0"> Automatic linguistic classification will give us various capabilities for changing language descriptions.</Paragraph>
    <Paragraph position="1"> l%e plan to study each capability separately so that it will receive its own development. Coordination of the capabilities into an integrated system will be approached as a different problem, that of self-organization. The system as a whole must not only change, but change for the better.</Paragraph>
    <Paragraph position="2"> IIomeostasis, as explained by Ashby \[20&amp;quot;\], is the fundamental control principle we will investigate. Roughly speaking, it calls for reorganization when the situation (according to some criterion) is getting worse and stability when it is getting better. Hence the algorithms we described for morphological (or semological) classification were too simple. If a decrease in connection entropy defines &amp;quot;getting worse&amp;quot; in morphological (or semological) classification, the system must be able to deliver smaller as well as larger constructs during its reorganization. In syntactical (or semantical) classification, stability or reorganization (in response to decreasing or increasing entropy, respectively) may be obtained by a choice between the class identification and generalization operations. With K-clumping, class generalization may also be parameterized to specify a greater or lesser reorganization in descriptive categories.</Paragraph>
    <Paragraph position="3"> These basic control techniques will be tried toward the end Of this year. To control class specialization and rule generation, we will use the following processing sequence after each cycle of syntactic (or semantic) analysis.</Paragraph>
    <Paragraph position="4">  Pendergraft, Dale 7-2 (a) Compute the connections and connection entropy for each class.</Paragraph>
    <Paragraph position="5"> (h) Sort the classes so that those with the lowest entropy come first.</Paragraph>
    <Paragraph position="6"> (c) Perform the class specialization operation on the successive classes until one is reached which cannot be specialized. null (d) Use only that class and the ones following it for rule generation.</Paragraph>
    <Paragraph position="7">  Underlying this processing strategy is the assumption that stable classes will be characterized by high connection entropy. (Though plausible, this must be tested.) Rule generation will thus be limited, as a result of the strategy, to those classes which are found to be the most stable. Broadly effective control strategies are our present concern; we believe it will be possible to supplement these with more selective controls later on.</Paragraph>
    <Paragraph position="8"> Incidence data for our first automatic linguistic classification experiments }lave been prepared mechanically from statistics brought directly to ACS from the analysis outputs in LRS. For the self-organizing linguistic system we felt that the statistics should he accumulated from analysis statistics from LRS to the Information ?laintenance System (IHS), a coordinate information storage and retrieval system \[21\] which we have programmed fol the Aeronautical Systems Division, Air Force Systems Command. This system has been released by its sponsor for use in linguistic research. Classification statistics from ACS will also be Pendergraft, Dale 7-3 stored in IMS. A report generator will be added to I~S so that the analysis and classification statistics can be displayed in formats suitable for publication.</Paragraph>
    <Paragraph position="9"> Programming to implement the Self-organizing Linguistic System (SLS) will include the following routines:</Paragraph>
    <Section position="1" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
7.1 LRS-IMS Interface
</SectionTitle>
      <Paragraph position="0"> Transportation of the analysis statistics on coincidence, concatenation and application at the different linguistic levels will be performed by these programs. In addition to collecting and organizing the statistics, they will update the stores in I~|S, also handling the additions and deletions of rules or classes. Normalizing factors will be maintained cumulatively so that statistics collected during different periods of time may be compared. These programs are now almost completed.</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
7.2 IMS-ACS Interface
</SectionTitle>
      <Paragraph position="0"> This set of programs will carry out tile control strategies we have mentioned. They are being written under IBSYS so that they will be compatible with ACS Programming.</Paragraph>
      <Paragraph position="1"> The 151S store has been designed so that it can be manipulated under either the LRS operating system or IBSYS. It is antici~ated that most of these routines will be in operation before the end of 1965.</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
7.3 ACS-IMS Interface
</SectionTitle>
      <Paragraph position="0"> Classification results will be collected, organized and transported to IMS by these routines. They will also update the I~4S store. Their completion will coincide with routines in the IMS-ACS interface.</Paragraph>
      <Paragraph position="1"> Pendergraft, Dale 7-4</Paragraph>
    </Section>
    <Section position="4" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
7.4 IMS-LRS Interface
</SectionTitle>
      <Paragraph position="0"> The same request formats which the linguist uses in adding, changing or deleting language data in LRS will be used by the self-organizing system, llowever, language data processing in LRS may be performed either with mnemonic symbols or numerals as the names of syntactical (or semantical) classes. The automated system will use the numerals, referencing its requests to the results of automatic classification. Because the self-organizing system will be able to make extensive changes in the data base, which would be prohibitive by manual coding, we plan to provide macro-requests (e.g. a request to eliminate the distinction between the predicates grouped together by the generalization operation).</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="6" end_page="6" type="metho">
    <SectionTitle>
AN EXPERIMENT
</SectionTitle>
    <Paragraph position="0"> This experiment in class identification will exemplify the type of research we are performing.</Paragraph>
    <Paragraph position="1"> Although the operations performed are those described above as class generalization, by setting the K-clumping parameter to 1 we obtain component sets as the classification output.</Paragraph>
    <Section position="1" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
8.1 Experimental Design
General Definitions
</SectionTitle>
      <Paragraph position="0"> Given a binary matrix: l(i,j): l(i): l(j): the number of l's in the intersection of columns i and j. the number of l's in column i. the number of l's in column j.  Discussion: The frequency matrix which forms the incidence data for the experiment is a table showing how many times an object i coincides with an object j. Reducing the matrix by removing columns which are all zero on all off-diagonal cells, deletes from the set of objects those objects for Pendergraft, Dale 8-3 which there are no coincidence data. Normalizing the columns of F by the diagonal--which contains the number of instances of the object in the sample--produces the normalized incidence matrix.</Paragraph>
      <Paragraph position="1"> Connection matrix C is a symmetric matrix which describes the relation of object i to object j based on the normalized incidence data. Matrix C constitutes the data for the next stage of processing*</Paragraph>
      <Paragraph position="3"> GR-clump: U set A is a GR-clump of U (----'&gt; it is a local minimum for the following function</Paragraph>
      <Paragraph position="5"> In terms of individual elements, the definition can be stated as follows: A = {xIb(x,A)&gt;_0VxeA and b (y,A)e OVye~} Discussion: There is no known way to predict how many GR-clumps exist in a given space. The GR-clump finding procedures \[22\] produce a set of highly overlapping GR-clumps.</Paragraph>
      <Paragraph position="7"> Discussion: The K-clumps located in F will be those elements which are highly similar in ti~eir distributional properties. The threshold value can be used to vary the amount of similarity.</Paragraph>
      <Paragraph position="8">  syntactically analyzed in LRS. The outputs were on magnetic tape. A computer program was written to take this data and form a binary incidence array as follows: classes strings of text i,j i, j = 1 if string i was in class j The list of classes was generated at the same time. In the six paragraphs, 129 classes were found. Graph 1 shows the rate at which classes were found.</Paragraph>
      <Paragraph position="9"> In the next stage of processing, this incidence array was used to make a co-incidence frequency count. Forty-five of the 129 classes occurred uniquely, i.e. did not coincide with another class. These 45 were deleted from the data set, leaving 84 classes.</Paragraph>
      <Paragraph position="10"> The next step was to normalize the frequency matrix and compute the connection matrix as explained in the experimental design. In the 84 x 84 matrix there were 1012 nonzero entries giving a matrix density of 14.3%. The connection values ranged from zero to 3.33283.</Paragraph>
      <Paragraph position="11">  GR-clumping was done in the connection matrix describe in Phase I. Using the pivot variable method of initial partitioning \[22\] 44 GR-clumps were located. Graph 2 displays the distribution (by size) of the GR-clumps found. Phase 3: The connection matrix was computed as described in the experimental design. K-clumps partitioned the set of 84 categories intocomponent sets. The K-clumps ranged in size from 2-14 classes. Graph 3 shows the number of classes by size of the K-clumps.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>