<?xml version="1.0" standalone="yes"?>
<Paper uid="J79-1036">
  <Title>Grammatical Compression in Notes and Records: Analysis and Computation Barbara B. Anderson, Irwin D. J. Bross, and</Title>
  <Section position="4" start_page="116" end_page="116" type="metho">
    <SectionTitle>
Abstract The problem of constraining the set of inferences added to a set of
</SectionTitle>
    <Paragraph position="0"> beliefs is considered. One method, based on finding a minimal unifying structure, is presented and discussed. The method is meant to provide internal criteria for inference cut-off.</Paragraph>
    <Paragraph position="1"> I. Introduction Natural language processing systems that are sensitive to the semantic and logical content of processed sentences and to the pragmatics of their use generally draw inferences. A set of formulas representing the meaning of a sentence and the 'state of belief' of the system is augmented by other related formulas (the inferences) which are retrieved and/or constructed during the processing. The problem to be investigated here is: How can this process be controlled? Can reasonable criteria be found for restraining the addition of inferences? Top-down inferences following from the meaning of lexical items (often expressed by decomposition into primitives) are clearly bounded, if no interactions are allowed among the generated sub-formulas. This process (which we call EXPANSION) will not be discussed here. Rather, we shall be concerned with SYNTHESIS, i.e., the addition of new formulas based on the  * This work was partially supported by NSF Grant SCC 72-05465A01.</Paragraph>
    <Paragraph position="2"> ** Author's current address: Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, New York 10012.  presence of already generated lower-level formulas, which we shall call beliefs.</Paragraph>
    <Paragraph position="3"> In particular, we are concerned with inferences added because a set of beliefs is recognized as fitting a pre-defined pattern.</Paragraph>
    <Paragraph position="4"> The question we ask is: Given an initial set of beliefs over a set of primitives, what criterion can be used to limit the process of pattern matching and associated inference addition? The major structural feature that we use to provide such a criterion is a partial order over the set of patterns. Before pursuing this suggestion any further, let us examine some of the alternative approaches to inference and inference cut-off.</Paragraph>
    <Paragraph position="5"> To logicians, deductive inference involves rules by which formulas can be added to a set (which initially contains the axioms) in certain ways provided other formulas are already in the set. In general, this sort of inference is quite open-ended in that one can keep applying the rules of inference and come up with more and more formulas, all of which represent 'provable' statements. The termination criterion for a particular invocation of the mechanism might be the appearance of an 'interesting' formula or the loss of interest of the inferencer, but in general the statement of the rules of inference says nothing about when to cease deriving formulas.</Paragraph>
    <Paragraph position="6"> This paradigm from logic has been carried over into Artificial Intelligence systems, where the issue of termination is very real. The usual solution has been to invoke the inferencer under the very strict control of a supervising program which has its own goals programmed in and which makes certain that appropriate criteria are applied to halt the inferencing. This is most apparent in systems written in PLANNER-like languages, each of which has user-programmable mechanisms for controlling the proof process.</Paragraph>
    <Paragraph position="7"> In the work of Schank and Rieger (Sch,75) (Ri,74), inference has more of the flavor of free association; inferences are conceived of as expanding spheres in 'inference space.' Two termination strategies are employed: (1) the discovery of a chain of inferences leading from one of the initial beliefs to another through a shared formula, or 'contact point' in inference space, and (2) the association of numerical 'strengths' to formulas so that a line of inference can be discontinued if the strength falls below a certain threshold.</Paragraph>
    <Paragraph position="8"> Strategy (2) is somewhat unsatisfying in view of the</Paragraph>
    <Paragraph position="9"> arbitrariness and attendant difficulties in evaluating the role of particular numerical constants in the total behavior of a complex system. These constants, presumably, have little to do with the logical structure of the formal inference schema, and as such we would call them 'external criteria.' A strategy like (1) above, on the other hand, is more 'internal' and is to be preferred. A goal of the present work is to formulate a reasonable internal criterion for inference cut-off which can be stated formally as part of the inference rule. To do this, we shall impose a structure on the set of patterns to be used in inferencing, and the rule for adding inferences will be formulated in terms of this structure.</Paragraph>
    <Paragraph position="10"> The operations to be described below are explained more fully in (R,75), where a description of a computer implementation is also presented.</Paragraph>
    <Paragraph position="11">  The inference rule we are aiming for is to depend on the set of input beliefs and the set of patterns. The notion we are trying to formalize is &quot;What does this set of beliefs suggest with respect to this set of patterns?&quot; The particular class of inferences we are concerned with are those obtained by matching beliefs in the input set against a pattern and augmenting the beliefs with additional propositions as dictated by the pattern. We want to find the least instances of patterns which cover (include) the set of input beliefs. We will take as inferences all propositions (an arbitrary number) which are entailed by that instance of the pattern.</Paragraph>
    <Paragraph position="12"> Put another way, the inference operation is to jump to conclusions. However, it is only to jump to those conclusions required to make the resulting set an instance of the least possible pattern in the pattern set. The key concept here is 'least' in that this is what controls how many inferences are added. What would be a suitable ordering relation for patterns and propositional beliefs? One which naturally suggests itself and which is currently under investigation relies on the relations of instantiation and substitution instance and (2) $S \le S'$ if $S \subseteq S'$.</Paragraph>
    <Paragraph position="13"> Combining these two, we say that $\{p_1, \ldots, p_n\} \le \{q_1, \ldots, q_m\}$, where the $p_i$'s and $q_j$</Paragraph>
    <Paragraph position="15"> 's are propositional forms, if there is a substitution, $s$, for the variables of $\{p_1, \ldots, p_n\}$ such that $\{s(p_1), \ldots, s(p_n)\} \subseteq \{q_1, \ldots, q_m\}$. We adopt the notational convention of prefixing variables with '?'. and let Q = {(HAPPY JOHN), (GIVE MR. JONES JOHN TOY), (PARENT MR. JONES JOHN)}.</Paragraph>
    <Paragraph position="16"> Then $P \le Q$ under the substitution ?x → JOHN, ?y → MR. JONES.</Paragraph>
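The ordering just defined can be sketched in Python as a small matcher. This is only an illustrative sketch, not the implementation described in (R,75): propositional forms are written as tuples, variables as strings beginning with '?', and the names is_var, match_form and leq are assumptions introduced here. The set Q is the one given in the text; P_HYP is a hypothetical pattern set chosen so that it matches Q under the substitution quoted above.

```python
def is_var(term):
    """A variable is a string prefixed with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match_form(pattern, target, subst):
    """Extend subst so that pattern instantiates to target; None on failure."""
    if is_var(pattern):
        if pattern in subst:
            return subst if subst[pattern] == target else None
        extended = dict(subst)
        extended[pattern] = target
        return extended
    if isinstance(pattern, tuple):
        if not isinstance(target, tuple) or len(pattern) != len(target):
            return None
        for sub_pattern, sub_target in zip(pattern, target):
            subst = match_form(sub_pattern, sub_target, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == target else None


def leq(forms, targets, subst=None):
    """Return a substitution s with s(forms) included in targets, else None."""
    subst = {} if subst is None else subst
    if not forms:
        return subst
    first, rest = forms[0], forms[1:]
    for target in targets:  # backtrack over every way of placing the first form
        extended = match_form(first, target, subst)
        if extended is not None:
            result = leq(rest, targets, extended)
            if result is not None:
                return result
    return None


# Q is taken from the text; P_HYP is a hypothetical stand-in pattern set.
Q = [("HAPPY", "JOHN"),
     ("GIVE", "MR. JONES", "JOHN", "TOY"),
     ("PARENT", "MR. JONES", "JOHN")]
P_HYP = [("PARENT", "?y", "?x"), ("HAPPY", "?x")]

print(leq(P_HYP, Q))
```

Because target terms are compared by equality, the same function also compares two patterns if the variables of the larger pattern are simply treated as constants.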
    <Paragraph position="17"> The 'less-than-or-equal' relation is also defined for pairs of patterns: Let PAT-1 = {(P ?x ?y), (Q ?y ?z)} and let PAT-2 = {(R ?u ?v ?w), (Q ?w ?v), (P ?u ?w)}.</Paragraph>
    <Paragraph position="18"> Clearly, PAT-1 $\le$ PAT-2 under the substitution ?x → ?u, ?y → ?w, ?z → ?v. This definition of $\le$ is quite straightforward and can be made to accommodate expressions with embeddings and predicate variables. (These are included in the implementation.) Note that the relation $\le$ can be thought of as an information-content comparison; if $S \le S'$ then $S'$ contains at least as much 'information' as $S$ (and possibly more) either by virtue of variables having been replaced by particular constants or by additional formulas having been added to the set. Given $\le$ for relating pairs of belief sets, pairs of patterns, and belief-set/pattern pairs, we can now formulate the belief-set-extending operation. III. The Inference Operation: SYNTHESIZE Given a set P of patterns and an input set Bel of beliefs, SYNTHESIZE returns a set I of instantiated patterns from P such that the following three conditions all hold: (1) (Coverage of input beliefs) For each instantiated pattern p in I, Bel $\le$ p. (2) (Pairwise incomparability) If p, q $\in$ I are distinct, then neither p $\le$ q nor q $\le$ p. (3) (Minimality) There are no other instances r of patterns in P which are not in I and yet which are $\le$ to some element of I and for which Bel $\le$ r.</Paragraph>
    <Paragraph position="19"> The elements of I = SYNTHESIZE(Bel) represent possible minimal extensions.  There are two possible minimal</Paragraph>
    <Paragraph position="20"> extensions, but the set of clear extensions contains no inferences beyond the input set, Bel.</Paragraph>
    <Paragraph position="21"> (Had p1 and pq shared another clause, however, an inference would have been added.) If the input set Bel = {(G JOHN), (B JOHN)} then SYNTHESIZE(Bel) = {{(G JOHN), (A JOHN), (B JOHN), (C JOHN)}}. Pattern pa is the least pattern which when instantiated covers the inputs, and there are two inferred propositions: (A JOHN) and (C JOHN).</Paragraph>
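The three conditions on SYNTHESIZE can be illustrated with a deliberately simplified Python sketch. It assumes the pattern instances are already ground (variable-free), so that the ordering reduces to set inclusion; the function names and the candidate instances below are hypothetical, and only the input set and the resulting inferences follow the example in the text.

```python
def synthesize(instances, beliefs):
    """Return the minimal ground pattern instances that cover the beliefs."""
    beliefs = frozenset(beliefs)
    # Condition 1 (coverage): keep only instances that include every belief.
    covers = {frozenset(i) for i in instances if beliefs.issubset(i)}
    # Conditions 2 and 3 (incomparability, minimality): drop any cover that
    # strictly contains another cover.
    return {c for c in covers
            if not any(other != c and other.issubset(c) for other in covers)}


def inferences(instance, beliefs):
    """Propositions added by jumping to the given instance."""
    return set(instance) - set(beliefs)


# Hypothetical ground instances of an unnamed pattern set.
INSTANCES = [
    {("G", "JOHN"), ("A", "JOHN"), ("B", "JOHN"), ("C", "JOHN")},
    {("G", "JOHN"), ("A", "JOHN"), ("B", "JOHN"), ("C", "JOHN"), ("D", "JOHN")},
    {("A", "JOHN"), ("D", "JOHN")},
]
BEL = {("G", "JOHN"), ("B", "JOHN")}

RESULT = synthesize(INSTANCES, BEL)
for instance in RESULT:
    print(sorted(inferences(instance, BEL)))
```

The larger five-clause instance also covers the input, but it is discarded because a smaller cover exists; this is what bounds the number of inferences drawn.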
    <Paragraph position="22"> The description given here has been necessarily brief and incomplete. A more formal treatment of SYNTHESIZE in terms of lattice-theoretic operations is given in (R,75) and is summarized in (JR,75). One additional technical point should be made: It often happens that for a given input set there are no single pattern instances which cover all the inputs, though patterns  exist whose instances cover subsets of the inputs. In such a case we use an extended SYNTHESIZE operation which is defined in the same spirit as SYNTHESIZE. (See (R,75).) Even without the full formal treatment, several things should now be clear. First, the actual number of inferences drawn (propositions added) for a particular input set may be small or large (depending on the inputs and the pattern set), but it is bounded in a principled way because of the definition of SYNTHESIZE.</Paragraph>
    <Paragraph position="23"> Second, the usual distinction between 'antecedent' and 'consequent' clauses in the pattern is not maintained; a clause in the pattern may serve as an antecedent on one occasion and a consequent on another. Third, if 'defined' lexical items were to be associated with the patterns, noting which variables are to be bound as arguments upon instantiation, then the SYNTHESIZE function can be used to compute summarizing expressions. Thus SYNTHESIZE represents a possible formalism for lexical insertion.</Paragraph>
    <Paragraph position="24"> IV. An Example of the Operation SYNTHESIZE For the sake of illustration, let the primitives be:</Paragraph>
    <Paragraph position="26"> (These primitives and the patterns below may appear somewhat artificial, but we have chosen a simple illustration due to the difficulties in following examples with more than a few clauses.) Let the pattern set consist of the following four patterns:</Paragraph>
    <Paragraph position="28"> and (2) (INTEND JOHNDOE (RETURN JOHNDOE 1000-DOLLARS BANK)) have been reversed. In Situation 2, (1) was an input and (2) was inferred, whereas in Situation 3, (2) was input and (1) inferred.</Paragraph>
    <Paragraph position="29"> The corresponding clauses of the loan pattern were serving as antecedents on one occasion and consequents on the other. This follows naturally from the way SYNTHESIZE was defined.</Paragraph>
    <Paragraph position="30"> In this regard the reader may notice that some input belief sets might yield 'unwarranted' or 'spurious' inferences--jumping to too many conclusions. However, the incremental addition of new patterns corrects this anomaly in a natural way: Patterns which formerly were 'least covers' may cease to be so in the extended pattern set.</Paragraph>
    <Paragraph position="31"> V. Using Definitions to Set Up the Pattern Space We have been particularly interested in using definitions of words to set up pattern spaces in which SYNTHESIZE could work as an inferencer and a lexical insertion technique. Special attention was paid to the 'speech act' verbs, and a brief sample list is presented below.</Paragraph>
    <Paragraph position="32"> (The symbol '?Pr' denotes a predicate variable. Also, primitive predicates are capitalized, while defined predicates are underlined.) Again, the definitions are greatly oversimplified for illustrative purposes.</Paragraph>
    <Paragraph position="33">  (define tell (?x ?y ?p ?t) (and (BEFORE ?t0 ?t) (NOT (KNOW ?y ?p ?t0)) (SAY ?x ?y ?p ?t) (KNOW ?y ?p ?t) (CAUSE (SAY ?x ?y ?p ?t) (KNOW ?y ?p ?t)))) (define request (?x ?y ?p ?t) (tell ?x ?y (WANT ?x ?p ?t) ?t)) (define promise (?x ?y ?Pr ?t) (and (FEELS-OBLIGATED ?x (?Pr ?x) ?t) (tell ?x ?y (INTEND ?x (?Pr ?x) ?t) ?t))) (define command (?x ?y ?Pr ?t) (request ?x ?y (?Pr ?y) ?t)) (define implore (?x ?y ?Pr ?t) (and (WANTS-FAVOR-FROM ?x ?y)  The expansion of these items to patterns over the primitives yields a set in which, for example, KNOW $\le$ tell $\le$ request $\le$ command. The input set Bel = {(BEFORE t1 t2), (SAY JAMES MASTER (INTEND JAMES (OPEN JAMES DOOR) t2) t2),</Paragraph>
  </Section>
  <Section position="5" start_page="116" end_page="116" type="metho">
    <SectionTitle>
(FEELS-OBLIGATED JAMES (OPEN JAMES DOOR) t2)}
</SectionTitle>
    <Paragraph position="0"> would be synthesized to (promise JAMES MASTER (OPEN + DOOR) t2), with added inferences (KNOW MASTER (INTEND JAMES (OPEN JAMES DOOR) t2) t2), etc., as dictated by the pattern instance of promise.</Paragraph>
    <Paragraph position="1"> A method has been proposed for 'free' inferencing by pattern matching in which inference cut-off can be structurally constrained: A pattern is matched if it is one of the minimal patterns whose instantiation covers the input information--even if this necessitates adding an arbitrary amount of additional information. Similarly, on the question of how many inferences to draw: Enough inferences are drawn to enable a coherent pattern to be matched.</Paragraph>
    <Paragraph position="2"> The method we have proposed is general in that it makes no assumptions about the particular predicates to be used in the patterns and beliefs. (Of course, it does make assumptions about what counts as a pattern or a belief.) The inferencing could be done by a general purpose component which accepts a set of patterns as a parameter. Thus, a person designing a system for inference by pattern match need not devise external criteria, and certainly not criteria to be associated with every pattern. Rather, the criteria are implicit in the system as a whole; any set of patterns which can be described in a very general pattern description language will generate its own set of internal criteria for inference cut-off.</Paragraph>
    <Paragraph position="3"> We are continuing to investigate formalisms for structuring pattern sets in the hope of gaining further insights into this class of inferences. D. BECKLES, L. CARRINGTON, AND G. WARNER IN COLLABORATION WITH C. BORELY, H. KNIGHT, P. AQUING, AND J. MARQUE^</Paragraph>
  </Section>
  <Section position="6" start_page="116" end_page="116" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> Linguistic communication in Trinidad and Tobago is characterised by intra- and inter-ideolectal variation in a spectrum ranging from Creole-English to Internationally Acceptable English. The tape-recorded speech of a sample of children is being analysed to determine the structure of their language, its correlation with socio-linguistic factors and their progress in the use of English. A computer system is designed to deal with manually codified data in the form of parse trees with associated grammatical and semantic information. The communication complex does not have readily identifiable norms. The analytical method and computer system effect recognition of stable sub-systems (regardless of the external criteria which determine these sub-systems), comparison of these sub-systems with English as well as state the evolution of the children's language.</Paragraph>
  </Section>
  <Section position="7" start_page="116" end_page="116" type="metho">
    <SectionTitle>
Acknowledgement
</SectionTitle>
    <Paragraph position="0"> The research of which this paper is a working document is partially funded by Ford Foundation Grant 690-06641). The authors acknowledge the kind assistance of the IBM World Trade Corporation, Port of Spain, Trinidad.</Paragraph>
    <Paragraph position="1"> The design and some results of the research to which the computer system relates are described by Carrington, Borely and Knight (1969, 1972, 1974 a + b). Part of the intention of the project is to describe, in terms applicable to curriculum development and teacher education, the structure of the speech of school-children aged 5-11+ in Trinidad and Tobago and to compare this speech with English.</Paragraph>
    <Paragraph position="2"> The official language and medium of instruction is English. However, the medium of daily communication ranges from a type of Creole-English to a modified variety of Internationally Acceptable English (IAE). The term &quot;post-creole dialect continuum&quot; has been used by several researchers, notably Le Page (1957), De Camp (1971) and Bickerton (1973), to refer to apparently analogous situations in Jamaica and Guyana. In addition to Creole, English and variants of both, a large part of the population is exposed to a local variety of Hindi (Bhojpuri). Smaller numbers are exposed to Lesser Antillean French Creole and fewer still to Spanish.</Paragraph>
    <Paragraph position="3"> Communication within the society is characterised by inter-ideolectal variation related to several socio-linguistic factors - ethno-linguistic background, social class, educational level, occupation, sex and age.</Paragraph>
    <Paragraph position="4">  Code-switching and intra-ideolectal variation related to the context, content and purpose of communication complicate the examination of the communication system. Since the variant levels of the complex appear to overlap they are difficult to separate into distinct sub-systems.</Paragraph>
    <Paragraph position="6"> The available corpus comprises 100 hours of the recorded conversation of almost 1,000 children between 5 and 11+ selected randomly from 30 schools.</Paragraph>
    <Paragraph position="7"> The data fall into two pre-determined categories: (a) free (with peer group); (b) controlled (with investigator). Given the nature of the communication complex stated above, variation and contrast are central to the data. In addition to the usual socio-linguistic correlates of variation, these data have the possibility of containing linguistic elements which are not paralleled anywhere else in the community. These elements may occur as a result of the instability intrinsic to the performance of a vulnerable age cohort. We are not dealing with fully learned discrete languages or dialects but with partially learned systems of speech communication being used by children who, by virtue of being in school, are under pressure to abandon part of their communication repertoire in favour of another variety of speech.</Paragraph>
    <Paragraph position="8"> Implications of the Data Type for the</Paragraph>
    <Section position="1" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
Analytical Procedure
</SectionTitle>
      <Paragraph position="0"> English is the only code of the communication complex for which adequate grammatical descriptions are available. It is demonstrably untenable to assume that the informants are attempting to speak English at all times. They are communicating in a set of language varieties which are assumed to be rule-governed. A statement of frequency and type of deviation from English cannot therefore be an adequate analysis. The first task of the analysis must be to determine the structures, both major and minor, used by informants of various socio-linguistic descriptions.</Paragraph>
      <Paragraph position="1"> A preliminary examination of the data shows that at the level of phrase-structure of utterances, the structures will appear to be predominantly identical with English. It is the components of the elements, their meanings and functions that will show the differences from English.</Paragraph>
      <Paragraph position="2"> Consequently, the analysis must note the levels at which derivational trees cease to be compatible with English.</Paragraph>
      <Paragraph position="3"> In view of the variability inherent in the data, the analysis must discover the socio-linguistic correlates of the occurrence of elements, as well as state co-occurrence restrictions of a given element.</Paragraph>
      <Paragraph position="4"> Since it is possible that some elements may be distributed in a way that does not permit correlation with the stated socio-linguistic factors, the analysis must permit grouping of informants based on shared linguistic features for subsequent re-examination. This provision admits the possibility that sets of features may be typical of a language acquisition stage of the informants  regardless of their socio-linguistic descriptions.</Paragraph>
      <Paragraph position="5"> The Analytical Procedure 1. Each utterance is phonetically transcribed and ascribed to an informant by an identification procedure. Doubtful identity is specially coded.</Paragraph>
      <Paragraph position="6"> 2. Each utterance is rewritten in English orthography.</Paragraph>
      <Paragraph position="7"> 3. For each utterance a parse tree is constructed using the following  protocol where each category described below forms the content of a node of the parse tree. The numbers are for reference and indicate the hierarchical relationship of the nodes.</Paragraph>
      <Paragraph position="8"> 0.0 Utterance type S sentence  1.0 surface structure of the clause/phrase occurring first.</Paragraph>
      <Paragraph position="9"> e.g. MC~ → SUBJ + PRED* + IOBJ + DOBJ + PREP P *PRED = predicator - not predicate 1.1 detailed analysis of first occurring element of 1.0. e.g. SUBJ → PRMD + HDW 1.1.1 first element of subject. e.g. PRMD → [HE] PADJ, RD, MASC, SG, NOK; IAE: [HIS] etc. 2.0 surface structure of the clause/phrase occurring  second, ... etc. to 7.0. As exemplified at 1.1.1, the last node of each sub-part states the actual literal being described.</Paragraph>
      <Paragraph position="10"> The acceptability of the item as IAE is noted, OK or NOK, together with a reasonable IAE alternative.</Paragraph>
      <Paragraph position="11"> Apart from the obligatory information required by the procedure, the analyst may make additional comments which may be either in keywords or English.</Paragraph>
      <Paragraph position="12"> e.g. CMNT: probably idiosyncratic or CMNT: double NEG.</Paragraph>
      <Paragraph position="13"> 8.0 is reserved for special idioms.</Paragraph>
      <Paragraph position="14"> e.g. 8.0 [SCRUNT] → scrounge for a living 9.0 is reserved for tags.</Paragraph>
      <Paragraph position="15"> e.g. 9.0 TAG → [YOU HEAR] Fig. 1 shows a sample analysis.</Paragraph>
    </Section>
    <Section position="2" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
Developing the Computer System
</SectionTitle>
      <Paragraph position="0"> The structure of the parse tree is, in general, quite complex and a simple ad hoc approach to validity checking was quickly seen to be inadequate.</Paragraph>
      <Paragraph position="1">  As a result a formal description of the tree was developed and used to construct a (partially) syntax-driven validity checking routine. The output of this routine consists of a listing of the input, with error comments where necessary, together with the internal representation of the valid trees which is written onto a file - the parse-tree file - for the subsequent analyses. Several other files are used in addition to the parse tree file.</Paragraph>
      <Paragraph position="2"> There is the informant file which contains profiles of the informants, (e.g. age, sex, linguistic background, etc), a set of form class files and a set of classification files.</Paragraph>
      <Paragraph position="3"> The form class files are groupings of the various keywords which may occur in the data.</Paragraph>
      <Paragraph position="4"> Thus, for example, one form class file contains all keywords which may occur on the left-hand side of a rewrite. A classification file contains a group number for each informant; for example, one classification file contains 0 for each informant not aged 5, 1 if the informant is aged 5 with a Hindi linguistic background, and 2 otherwise. In any operation on the data the utterances of informants in group 0 of the relevant classification will be ignored.</Paragraph>
      <Paragraph position="5"> Each node of a tree in the parse tree file consists of a name - in the case of a rewrite this is the left-hand side of the rewrite, otherwise it is the level number - and a set of descriptors, e.g. the grammar associated with the name. Thus, in the example of Figure 1, the lines 1.1, 1.1.1, 1.1.2, 1.1.2.1 become the sub-tree of Figure 2 where the descriptors are put in parentheses.</Paragraph>
      <Paragraph position="6">  For any tree, each analysis starts at the root and many of the tasks to be described below may be regarded, in part, as a pattern matching exercise. The difficulties, and interest, arise because each node of the parse tree carries a substantial amount of information, and except for literals, only a partial matching of the nodes is usually required. In addition, some tasks require the matching of disjoint sub-trees within a given parse tree, occasionally subject to side conditions which may involve nodes not lying on the paths between the root and any of the sub-trees of interest. Apart from the pattern matching, there is the problem of classification of the occurrences of the various patterns. This is a simple tabulation complicated, in some cases, by the fact that the total number of categories is unknown. The basic task of the system may be cast in the form: count with respect to a given classification file, and subject to stated side conditions, the occurrences of a given pattern.</Paragraph>
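The basic task stated above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the system's actual design: the Node class, the functions walk, node_matches and tabulate, and the sample trees are all hypothetical. A node matches a pattern when the names agree and every descriptor named in the pattern is present on the node, which is one way to realise partial matching.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    descriptors: frozenset = frozenset()
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants, root first."""
    yield node
    for child in node.children:
        yield from walk(child)


def node_matches(node, name, descriptors):
    """Partial match: descriptors absent from the pattern are ignored."""
    return node.name == name and set(descriptors).issubset(node.descriptors)


def tabulate(utterances, groups, name, descriptors, side_condition=None):
    """Count pattern occurrences per classification group, skipping group 0."""
    counts = Counter()
    for informant, tree in utterances:
        group = groups.get(informant, 0)
        if group == 0:
            continue
        if side_condition is not None and not side_condition(tree):
            continue
        counts[group] += sum(
            1 for node in walk(tree) if node_matches(node, name, descriptors))
    return counts


# Hypothetical data: two small trees and a three-informant classification.
TREE_NOK = Node("SUBJ", frozenset(), [
    Node("PRMD", frozenset({"PADJ", "MASC", "SG", "NOK"})), Node("HDW")])
TREE_OK = Node("SUBJ", frozenset(), [
    Node("PRMD", frozenset({"PADJ", "MASC", "SG", "OK"}))])
UTTERANCES = [("inf1", TREE_NOK), ("inf1", TREE_NOK),
              ("inf2", TREE_OK), ("inf3", TREE_NOK)]
GROUPS = {"inf1": 1, "inf2": 2, "inf3": 0}

COUNTS = tabulate(UTTERANCES, GROUPS, "PRMD", {"NOK"})
print(COUNTS[1], COUNTS[2])
```

Matching of disjoint sub-trees is not attempted in this sketch; it covers only the single-node case with an optional side condition on the whole tree.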
      <Paragraph position="7"> Since there are only 1,000 informants and they fall into a reasonably small number of classes it is economical to pre-classify on the basis of the informant profiles rather than build the classification process into the rest of the analysis. The system is instructed to produce a classification file by a statement of the form: CLASS = &lt;classification file name&gt;, (&lt;expression list&gt;) where &lt;classification file name&gt; is the name by which the file will be known, and each expression in &lt;expression list&gt; is a Boolean expression. For example: CLASS = HINDI, (AGE = 5 &amp; LANG = HINDI, AGE = 5 &amp; LANG ≠ HINDI) will produce the classification file given earlier as an example.</Paragraph>
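A minimal Python sketch of how such a statement might be evaluated is given here. It assumes that an informant receives the number of the first expression in the list that holds, and 0 if none holds; this is consistent with the example but is not spelled out in the text. The function classify and the profile field names are hypothetical.

```python
def classify(informants, expressions):
    """Map each informant id to the 1-based index of the first true expression."""
    groups = {}
    for informant in informants:
        group = 0  # group 0: no expression holds, so the informant is ignored
        for number, expression in enumerate(expressions, start=1):
            if expression(informant):
                group = number
                break
        groups[informant["id"]] = group
    return groups


# The HINDI classification from the text, written as Python predicates.
HINDI = [
    lambda i: i["AGE"] == 5 and i["LANG"] == "HINDI",
    lambda i: i["AGE"] == 5 and i["LANG"] != "HINDI",
]

# Hypothetical informant profiles.
PROFILES = [
    {"id": "a", "AGE": 5, "LANG": "HINDI"},
    {"id": "b", "AGE": 5, "LANG": "ENGLISH"},
    {"id": "c", "AGE": 7, "LANG": "HINDI"},
]

print(classify(PROFILES, HINDI))
```

Pre-computing the mapping once per classification keeps the later tabulation passes independent of the informant profiles, which is the economy the paragraph above describes.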
      <Paragraph position="8"> The side conditions refer to items in the parse trees which must occur if the tree is to be included in a given analysis. For example, if only affirmative active utterances are to be analysed the side condition Q. 7 AFM AC171 is used. The pattern to be used is stated in a manner similar to that used in specifying the input data. Thus, the pattern description PRED → ... + AUX ...; GR: @ CTN, NEUT TM, @ PROG, PATT indicates that the sub-tree PRED (GR: @ CTN, NEUT TM, @ PROG, PATT) is of interest, subject to the convention that both the order of node descriptors (where given) and node descriptors not mentioned in the pattern are to be ignored. The occurrence of keyword FORM = &lt;form class file name&gt; indicates that the contents of the stated form class file are to form an additional dimension to the final tabulations. Thus the pattern AUX → [?] FORM = OKFILE, where OKFILE contains the keywords OK and NOK, is an abbreviation for the pair of patterns.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="116" end_page="116" type="metho">
    <SectionTitle>
AUX → [?] OK
AUX → [?] NOK
</SectionTitle>
    <Paragraph position="0"> The symbol ? indicates that the items found there are also to add an additional dimension to the tabulations. The output of each tabulation may also be used to construct a classification file of the informants, to be used in further analyses.</Paragraph>
  </Section>
  <Section position="9" start_page="116" end_page="116" type="metho">
    <SectionTitle>
CONCLUSION
</SectionTitle>
    <Paragraph position="0"> In respect of performance of groups with different socio-linguistic descriptions, for purposes of this study, it is assumed that the frequency of occurrence of particular basic parse trees is a meaningful indicator of differences in speech patterns.</Paragraph>
    <Paragraph position="1"> A major difficulty is that no two trees in the study are identical, but at the same time if we strip too much information from each node there are too few trees to make an analysis worthwhile. In part, the study aims at determining the degree to which stripping of information at interior nodes is necessary if the computer is to be a useful aid.</Paragraph>
  </Section>
  <Section position="10" start_page="116" end_page="116" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> A set of SAIL programs has been implemented for analyzing large bodies of natural language data in which associations exist between strings and sets of strings. These programs include facilities for compiling information such as frequency of occurrence of strings (e.g. word frequencies) or substrings (e.g. consonant cluster frequencies), and describing relationships among strings (e.g. various phonological realizations of a word). Also, an associative data base may be interactively accessed on the basis of keys corresponding to different types of data elements, and a pattern matcher allows retrieval of incompletely specified elements. Applications of this natural language processing package include analysis of phonological variation for specifying and testing phonological rules, and comparison across languages for historical reconstruction.</Paragraph>
    <Paragraph position="1"> I. NATURAL LANGUAGE PROCESSING PACKAGE A. General characteristics The natural language processing package implemented at the Speech Communications Research Laboratory (SCRL) is currently used in the analysis of associated lists of string data such as discourse transcriptions or pronouncing dictionaries. The package consists of a) a set of &quot;batch&quot; programs which provide frequency and context information on the lexical and phonological forms appearing in the input; and b) a system for interactively accessing the data on the basis of orthographic and phonological patterns.</Paragraph>
    <Paragraph position="2"> All of the programs in this package are written in SAIL, an ALGOL-based language offering extended string and set manipulation operations and an associative data base. The programs run on a DEC PDP-10 at Carnegie-Mellon University via the Advanced Research Projects Agency (ARPA) computer network (ARPANET). The ARPANET is accessed by the ELF operating system developed by SCRL, which runs on a local PDP-11 [1]. While the processing package is applicable to various types of natural language data, it has been used most extensively at SCRL in the analysis of discourse transcriptions. The discourses consist of conversational speech gathered in interviews with adult speakers of various dialects of American English. More than twenty-five discourses, transcribed orthographically and phonologically, have been processed, yielding detailed information on over 28,000 utterances representing about 3,500 distinct lexical items. All examples in this section are taken from a typical discourse.</Paragraph>
    <Paragraph position="3"> B. "Batch" Facility Discourse processing usually begins with the generation of a transcription reference file in which orthographic and phonological representations are listed in discourse order, as illustrated in Figure 1.</Paragraph>
    <Paragraph position="4"> In this example, the phonological realization of TRY is /tray/ (coded TRAY). The phonological code shown is a basic ARPA phonemic alphabet augmented by special symbols indicating some phonetic detail, such as vowel height. The realization of THE, for example, is coded DH$I, indicating that the vowel fell between /i/ and /I/.</Paragraph>
    <Paragraph position="5"> Reference numbers assigned to each utterance serve as an index to the discourse context in which utterances occur, and are used to interpret the output of other programs in the package. Separate reference number sequences are provided for the orthographic and phonological forms in the reference files, since there may not be a one-to-one correspondence between these forms, as in the case of phonological merging which obscures word boundaries. In Figure 1, for example, the two orthographic items WELL and LET'S are realized as a single phonological item /wl E ts/ (coded WELEHTS). The core of the "batch" processing facility is a set of three programs: PROCON, ENVIRN and CLUSTR. PROCON provides frequency and context information on the lexical level, while the other two provide similar information on the phonological level. PROCON output contains an alphabetically sorted list of the utterance types occurring in the input discourse transcription file, as illustrated in Figure 2. Frequency of occurrence of each type is given, along with the various phonological realizations. For each phonological realization, frequency count and reference numbers are provided.</Paragraph>
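The PROCON tally described above can be illustrated with a minimal Python sketch. It is not the SAIL implementation: the function name `procon`, the flat (reference number, orthographic form, phonological form) input layout, and the single reference-number sequence are assumptions made for illustration, and the sample rows are invented apart from the AXV reference numbers 11 and 337 quoted in the text.

```python
from collections import defaultdict

def procon(utterances):
    """Tally utterance types and their phonological realizations.

    `utterances` is an iterable of (reference_number, orthographic,
    phonological) tuples; this layout is an assumption of the sketch.
    Returns a list of (type, total_count, {realization: [reference numbers]})
    sorted alphabetically by orthographic type.
    """
    table = defaultdict(lambda: defaultdict(list))
    for ref, orth, phon in utterances:
        table[orth][phon].append(ref)

    report = []
    for orth in sorted(table):
        realizations = {phon: list(refs) for phon, refs in sorted(table[orth].items())}
        total = sum(len(refs) for refs in realizations.values())
        report.append((orth, total, realizations))
    return report

# Invented sample rows (hypothetical data, not taken from Figure 2).
sample = [
    (11, "HAVE", "AXV"),
    (52, "HAVE", "HHAEV"),
    (337, "HAVE", "AXV"),
    (40, "THE", "DHAX"),
]

for orth, total, realizations in procon(sample):
    print(orth, total, realizations)
```

Keeping the reference numbers alongside each realization is what allows a reader of the listing to go back to the discourse context of any pronunciation.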
    <Paragraph position="6"> In Figure 2, for example, HAVE occurred eight times, and was pronounced AXV (/əv/) three times and HHAEV (/hæv/) three times. Using the reference numbers associated with these pronunciations, it is possible to establish the discourse context. One would find that the three AXV pronunciations (i.e.</Paragraph>
    <Paragraph position="7"> utterances 11, 337 and 703) all involved the auxiliary construction in "...may have felt...seemed to have been...which have since been..." ENVIRN tallies occurrences of phonological segments and environments in the discourse transcriptions. The output of this program lists frequencies of all phonemes appearing in the input file, as illustrated in Figure 3.</Paragraph>
    <Paragraph position="8"> Figure 3 Glottal stop, coded Q, occurred a total of thirty times in the discourse. The immediate environments of Q are listed alphabetically by left context, with word boundaries indicated by slash /, and a frequency count and reference numbers are given for each environment. For example, Q appeared eight times in the context EH--EN (/ɛ/--/n̩/), and a check of the reference list shows that all these occurrences were in the word sentence(s).</Paragraph>
    <Paragraph position="9"> ENVIRN output also provides a frequency ordered list of phonemes, with frequency totals broken down according to occurrence in word initial, medial and final position.</Paragraph>
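The ENVIRN tallies described above can be sketched in a few lines of Python. The function name `envirn`, the representation of each word as a list of phoneme codes, and the sample data are all assumptions for illustration; the original program is written in SAIL and its internal layout is not given in the text. In this sketch a one-segment word is counted as word-initial.

```python
from collections import Counter, defaultdict

def envirn(words):
    """Count phonemes, their word positions, and their immediate environments.

    `words` is a list of (reference_number, segments) pairs, where `segments`
    is a list of phoneme codes (an assumed input format).
    """
    totals = Counter()
    positions = defaultdict(Counter)
    environments = defaultdict(lambda: defaultdict(list))
    for ref, segments in words:
        last = len(segments) - 1
        for i, seg in enumerate(segments):
            totals[seg] += 1
            if i == 0:
                positions[seg]["initial"] += 1
            elif i == last:
                positions[seg]["final"] += 1
            else:
                positions[seg]["medial"] += 1
            # A slash marks a word boundary, as in the ENVIRN listing.
            left = segments[i - 1] if i != 0 else "/"
            right = segments[i + 1] if i != last else "/"
            environments[seg][(left, right)].append(ref)
    return totals, positions, environments

# Invented sample (hypothetical reference numbers and segmentation).
sample = [
    (101, ["S", "EH", "Q", "EN", "S"]),  # "sentence" with a glottal stop
    (102, ["DH", "AX"]),
]
totals, positions, environments = envirn(sample)
print(totals.most_common())          # frequency ordered list of phonemes
print(dict(environments["Q"]))       # environments of the glottal stop
```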
    <Paragraph position="10"> CLUSTR, the third of the "batch" programs, is used in the analysis of phoneme cluster distribution in the discourse data. All clusters are indexed by each of their component phonemes, so that the cluster NDZ (/ndz/), which is listed under D in Figure 4, also appears under N and Z in the full output.</Paragraph>
    <Paragraph position="11"> Separate output may be generated for clusters occurring within words or across word boundaries. Currently, consonant and vowel clusters are tallied, but the program can be easily modified to handle sequences of phonemes belonging to arbitrary user-defined classes (e.g. voiced sounds, nasals, unvoiced stops, etc.).</Paragraph>
    <Paragraph position="12"> For each phoneme belonging to a selected class, CLUSTR provides a count of the number of times that the phoneme appears in clusters, an alphabetically sorted list of those clusters, and a frequency count and reference numbers for each cluster. Figure 4, a sample of CLUSTR output for within-word consonant clusters, shows that D appeared in clusters a total of 70 times, with 32 of these being ND clusters. Reference numbers may be used to establish the discourse context of any cluster. For example, the cluster D Q EN T S (/di?nts/) appears in utterance 486 which is the word students. Like ENVIRN, CLUSTR provides a frequency ordered list of cluster types in addition to the alphabetic list.</Paragraph>
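A rough Python sketch of the cluster indexing described above follows. The name `clustr`, the small `VOWELS` set, the `in_class` parameter standing in for a user-defined phoneme class, and the segmentations of the sample words are assumptions for illustration only; reference number 486 for students comes from the text, the other is invented.

```python
from collections import defaultdict

# Assumed vowel codes; everything else is treated as a consonant here.
VOWELS = {"AX", "EH", "IY", "UW", "AE", "IH"}

def clustr(words, in_class=lambda seg: seg not in VOWELS):
    """Index within-word clusters by each of their component phonemes.

    `words` is a list of (reference_number, segments) pairs. A cluster is a
    run of two or more adjacent segments that all satisfy `in_class`.
    Returns {phoneme: {cluster: [reference numbers]}}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for ref, segments in words:
        run = []
        # The trailing None is a sentinel that flushes a word-final run.
        for seg in list(segments) + [None]:
            if seg is not None and in_class(seg):
                run.append(seg)
                continue
            if len(run) >= 2:
                cluster = " ".join(run)
                for member in set(run):
                    index[member][cluster].append(ref)
            run = []
    return index

sample = [
    (486, ["S", "T", "UW", "D", "Q", "EN", "T", "S"]),  # "students"
    (12, ["HH", "AE", "N", "D", "Z"]),                  # "hands" (invented row)
]
index = clustr(sample)
for cluster in sorted(index["D"]):
    print("D", cluster, index["D"][cluster])
```

Passing a different `in_class` predicate (for example, one that accepts only nasals or only unvoiced stops) corresponds to the user-defined classes mentioned in the text.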
    <Paragraph position="13"> C. Interactive Retrieval Facility The set of "batch" programs is complemented by a language data retrieval system which allows the user to interactively retrieve data items conforming to various orthographic, phonological and syntactic patterns.</Paragraph>
    <Paragraph position="14"> Linguistic data is internally stored in the system as a network of associations between items of various types. These associations are implemented in SAIL as LEAP triples [2], and the element types entering into these associations vary according to the particular application. For example, in analysis of the discourse data described above, triples contain orthographic, phonological and syntactic elements. For study of phonetic-to-phonemic mapping, triples might be orthographic, phonemic and phonetic elements. In comparative linguistic research, triples might consist of an orthographic element and two phonological elements corresponding to two languages or dialects. Data can be accessed on the basis of patterns directed to any one (or any combination) of these elements. For example, if the data base contains associations between orthographic, phonological and syntactic elements, then the query P/ O: THE retrieves the phonological items associated with the spelling THE, and might return DHAX (/ðə/) and DHIY (/ði/). The query O/ P: TUW would return the orthographic items pronounced TUW (/tu/), e.g. two, too, to. Patterns such as THE and TUW completely specify the element to which they are directed, but various special forms allow partial specifications to be expressed also. The symbol $ matches any single segment (in a phonological pattern) or character (in an orthographic pattern), and the symbol = matches any number, including zero, of contiguous segments (or characters). Thus, if N is the syntactic code for Noun, the query O/ P: $$, S: N, O: D= searches for all two-phoneme nouns which begin with the letter D, and might return dye, day, doe, dough.</Paragraph>
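The behaviour of the $ and = pattern symbols described above can be sketched in Python. This is only an illustration under stated assumptions: the functions `matches` and `query`, the tuple layout of the lexicon, and the lexicon entries themselves are hypothetical, and the real system stores its data as LEAP triples in SAIL rather than as Python tuples.

```python
def matches(pattern, item):
    """Return True if `item` fits `pattern`.

    Both are sequences (a string of characters or a list of segments).
    In `pattern`, "$" stands for exactly one element and "=" for any run
    of contiguous elements, including an empty run.
    """
    if not pattern:
        return not item
    head, rest = pattern[0], pattern[1:]
    if head == "=":
        # Try every possible length for the run absorbed by "=".
        return any(matches(rest, item[i:]) for i in range(len(item) + 1))
    if not item:
        return False
    if head == "$" or head == item[0]:
        return matches(rest, item[1:])
    return False

# Hypothetical lexicon: (orthographic, phonological segments, syntactic code).
LEXICON = [
    ("DAY", ["D", "EY"], "N"),
    ("DOE", ["D", "OW"], "N"),
    ("DOG", ["D", "AO", "G"], "N"),
    ("DO", ["D", "UW"], "V"),
    ("TEA", ["T", "IY"], "N"),
]

def query(lexicon, orth="=", phon=("=",), syn=None):
    """Return orthographic items whose elements satisfy all given patterns."""
    return [
        o for o, p, s in lexicon
        if matches(orth, o) and matches(list(phon), p) and (syn is None or s == syn)
    ]

# Counterpart of the query  O/ P: $$, S: N, O: D=
print(query(LEXICON, orth="D=", phon=["$", "$"], syn="N"))
```

Phonological patterns are given here as lists of codes because codes such as EY span two characters; how the original system segments a coded string is not stated in the text.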
    <Paragraph position="15"> Each phonological element is defined in terms of a set of features such as UV (unvoiced) and ST (stop), and these features may be used to specify segments in phonological patterns. To search for phonological realizations containing /i/ between unvoiced stops, one could use the query</Paragraph>
    <Paragraph position="17"> to find /kip/ (keep), /pikɪŋ/ (peeking), and /rɪpitɪd/ (repeated). Boolean operators are also available for specifying pattern segments. For example, the query O/ O: (C OR K)=, P: (NOT K)= returns orthographic items which begin with C or K and are not pronounced with initial /k/, e.g. cite, change, know.</Paragraph>
    <Paragraph position="18"> Several capabilities lacking in the current interactive system will be available in the near future. The user will be able to (1) specify optional segments and sequences of segments in phonological patterns; (2) create and name sets containing items of interest, e.g. monosyllabic function words, and use set operations such as union and intersection; (3) interactively modify feature definitions of phonological symbols; (4) retrieve several elements, e.g. orthographic and phonological forms, simultaneously; (5) display the discourse context of any given item; and (6) write retrieval queries and responses to a file for subsequent analysis.</Paragraph>
  </Section>
  <Section position="11" start_page="116" end_page="116" type="metho">
    <SectionTitle>
II. APPLICATIONS
</SectionTitle>
    <Paragraph position="0"> The processing package can be used in the analysis of various kinds of natural language data, as illustrated in the following examples.</Paragraph>
    <Paragraph position="1"> A. Phonological variation The programs can be used to efficiently index and sort natural language data so that systematic phonological variation can be easily examined. For example, inspection of a PROCON output for a ten minute interview consisting of over 2,000 utterance tokens yields general observations such as -- final /t/ alternates with final glottal stop /?/ under certain conditions; -- alveolar flapping occurs under several stress conditions which appear to be related to noun affixes. These preliminary observations can be systematically investigated using the interactive query system.</Paragraph>
    <Paragraph position="2"> The data base can be queried for all phonological realizations ending in T (/t/) or Q (/?/), and the corresponding orthographic entries, using the queries P/ P: =(T OR Q) and O/ P: =(T OR Q). The resulting list might include  That is, final /t/ appears to vary with final /?/ following vowels and following nasals, but not elsewhere. This hypothesis, represented as a context-sensitive phonological rule, could then be tested against additional data using any of several computer rule testers [3-5].</Paragraph>
    <Paragraph position="3"> Forthcoming modifications will allow queries with set operations, such that the intersection of orthographic entries having final /t/ alternating with /?/ can be requested directly by the query O/ P: =T ∩ P: =Q.</Paragraph>
    <Paragraph position="4"> That is, only entries with /t/ and /?/ alternation would be retrieved, and the entries art, fished and raft would not be returned. In order to determine the conditions under which alveolar flapping occurs, the queries O/ P: =DX= and P/ P: =DX= can be used to retrieve phonological items which contain DX (/ɾ/) and corresponding orthographic items. Such a list might  Flapping occurs in a descending stress pattern, e.g. city, letter, petty, writing, in which a stressed vowel precedes the flap and an unstressed vowel follows. In addition, the flap appears to occur between unstressed vowels when the sequence represents the noun affix -ity, as in ability. To check this, the query P/ O: =ITY, S: N could be used to retrieve all nouns ending in -ity, and the subset involving affixed forms (i.e. excluding city, pity) could be examined for occurrences of flapping.</Paragraph>
    <Paragraph position="5"> B. Word Error Recognition testing The interactive facility can be used to examine the kinds of word recognition errors which might occur in a speech understanding system due to indeterminacies in segment labelling. If a string is completely specified as /likɪŋ/ (coded LIYKIHNX), then it matches a single word, leaking. However, if labelling is less precise, then alternative (and incorrect) word matches might occur. Using the interactive retrieval system, alternative labels and resulting word matches can be examined for any given lexicon.</Paragraph>
    <Paragraph position="6"> In the example above, the labelled string might be L (VOC HIGH ANT) K IH NX with the stressed vowel represented as a set of features: vocalic, high, anterior. Resulting word matches might include leaking and licking.</Paragraph>
    <Paragraph position="7"> If the initial consonant is also specified as a set of features (consonant, sonorant, continuant), as in the string</Paragraph>
  </Section>
  <Section position="12" start_page="116" end_page="116" type="metho">
    <SectionTitle>
(CON SON CONT) (VOC HIGH ANT) K IH NX
</SectionTitle>
    <Paragraph position="0"> then the resulting word matches might be leaking, licking, reeking. If the K is specified less precisely as a voiceless stop, word matches might include leaking, licking, reeking, leaping, ripping.</Paragraph>
    <Paragraph position="1"> The interactive facility allows the system designer to easily determine the nature of possible incorrect matches due to phonological indeterminacy, especially as the size of the lexicon increases.</Paragraph>
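The widening set of word matches under less precise labelling can be reproduced with a small Python sketch. The feature table, the five-word lexicon, and the function names `segment_fits` and `word_matches` are assumptions for illustration; the feature assignments are deliberately simplified and follow only the feature names mentioned in the text (VOC, HIGH, ANT, CON, SON, CONT, UV, ST).

```python
# Simplified, assumed feature definitions for the segments used below.
FEATURES = {
    "L": {"CON", "SON", "CONT"},
    "R": {"CON", "SON", "CONT"},
    "IY": {"VOC", "HIGH", "ANT"},
    "IH": {"VOC", "HIGH", "ANT"},
    "K": {"CON", "UV", "ST"},
    "P": {"CON", "UV", "ST"},
    "NX": {"CON", "SON", "NAS"},
}

# Hypothetical lexicon, coded as lists of segments.
LEXICON = {
    "leaking": ["L", "IY", "K", "IH", "NX"],
    "licking": ["L", "IH", "K", "IH", "NX"],
    "reeking": ["R", "IY", "K", "IH", "NX"],
    "leaping": ["L", "IY", "P", "IH", "NX"],
    "ripping": ["R", "IH", "P", "IH", "NX"],
}

def segment_fits(spec, seg):
    """A label is either a segment code (string) or a set of required features."""
    if isinstance(spec, str):
        return spec == seg
    return set(spec).issubset(FEATURES.get(seg, set()))

def word_matches(labels, lexicon=LEXICON):
    """Return, in alphabetical order, the words compatible with the labelled string."""
    return sorted(
        word for word, segs in lexicon.items()
        if len(segs) == len(labels)
        and all(segment_fits(spec, seg) for spec, seg in zip(labels, segs))
    )

HIGH_V = {"VOC", "HIGH", "ANT"}
LIQUID = {"CON", "SON", "CONT"}
VL_STOP = {"UV", "ST"}

print(word_matches(["L", "IY", "K", "IH", "NX"]))
print(word_matches(["L", HIGH_V, "K", "IH", "NX"]))
print(word_matches([LIQUID, HIGH_V, "K", "IH", "NX"]))
print(word_matches([LIQUID, HIGH_V, VL_STOP, "IH", "NX"]))
```

Each successive call relaxes one more label, so the candidate list can only stay the same or grow, which is the effect the designer would want to inspect as the lexicon becomes larger.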
    <Paragraph position="2"> C. Comparative Linguistic Relationships If the data base is represented as an orthographic list with two associated phonological lists representing two languages or dialects, the interactive system can be used to discover systematic sound correspondences, and to aid in the study of dialect relationships and historical reconstruction.  would retrieve those items in language B which correspond to items in language A with initial /pl-/ clusters, e.g. and paw, indicating that consonant cluster simplification may have occurred in language B. The query B/ A: =IYIY would retrieve those items in language B which correspond to items in language A with final /-ii/, e.g. the diphthongized mia and fia.</Paragraph>
    <Paragraph position="3"> A large data base could be accessed in this way to discover systematic correspondences between languages A and B, such as the correspondences /pl-/:/p-/, m:m, /ph-/:/f/, ii:ia, aa:a, etc.</Paragraph>
    <Paragraph position="4"> The flexibility of the interactive system, combined with the linguistic intuition of the user, can be used to specify and retrieve any set of correspondences, without the need to format the data according to initial consonants or clusters, vowel nuclei, finals, etc. Information such as tonal contours and stress can also be represented and accessed.</Paragraph>
  </Section>
  <Section position="13" start_page="116" end_page="116" type="metho">
    <SectionTitle>
ACKNOWLEDGEMENT
</SectionTitle>
    <Paragraph position="0"> This research was supported in part by the Advanced</Paragraph>
    <Section position="1" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
Research Projects Agency of the Department of Defense through
</SectionTitle>
      <Paragraph position="0"> Contract N00014-73-C-0221 administered by the Office of Naval</Paragraph>
    </Section>
    <Section position="2" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
Research Information Systems Program.
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
  </Section>
  <Section position="14" start_page="116" end_page="116" type="metho">
    <SectionTitle>
1975 ACL Meeting
ON THE ROLE OF WORDS AND PHRASES IN AUTOMATIC TEXT ANALYSIS
</SectionTitle>
    <Paragraph position="0"> Automatic indexing normally consists in assigning to documents either single terms, or more specific entities such as phrases, or more general entities such as term classes. Discrimination value analysis assigns an appropriate role in the indexing operation to the single terms, term phrases, and thesaurus categories. To enhance precision it is useful to form phrases from high-frequency single term components. To improve recall, low-frequency terms should be grouped into affinity classes, assigned as content identifiers instead of the single terms.</Paragraph>
    <Paragraph position="1"> Collections in different subject areas are used in experiments to characterize the type of phrase and word class most effective for content representation.</Paragraph>
    <Paragraph position="2"> The following typical conclusions can be reached: a) the addition of phrases improves performance considerably; b) use of phrases is better with corresponding deletion of single terms in practically all cases; c) the use of both high-frequency and medium-frequency phrases is generally more effective than the use of either phrase-type alone; d) the most effective thesaurus categories are those which include a large number of low-frequency terms; e) the least effective classes either consist of only one or two terms, or else they include terms with unequal frequency characteristics permitting the high-frequency terms to overcome the others.</Paragraph>
    <Paragraph position="3"> The discrimination value theory is developed and appropriate experimental output is supplied.</Paragraph>
  </Section>
</Paper>