<?xml version="1.0" standalone="yes"?>
<Paper uid="C82-1063">
  <Title>COMPUTATIONAL DATA ANALYSIS FOR SYNTAX</Title>
  <Section position="3" start_page="0" end_page="391" type="metho">
    <SectionTitle>
PROPOSITIONS AND SYNTACTIC CODE
</SectionTitle>
    <Paragraph position="0"> The aim of the syntactic analysis, as of the whole project, is to obtain a coherent set of mutually related quantitative parameters concerning the Czech language system, its functioning in different communicative spheres, and its stylistic values. To guarantee the reliability and representativeness of the results, the research is based on a corpus of 540 000 word-forms /occurrences/ chosen from non-narrative /i.e. newspaper, administrative and scientific/ texts. During the preparatory work, all word-forms in 180 samples, each consisting of 3000 running word-forms, were supplied with a special code carrying both morphological and syntactic information: parts of speech; all main morphological categories relevant in Czech /such as case, nominal gender and number, and person, number, tense, mood and voice of verbs, with a very detailed subcategorization/; and all main syntactic categories /such as subject, object, predicate, attributes and types of adverbials, again with the necessary subcategories/.</Paragraph>
    <Paragraph position="1"> The syntactic code was basically numeric, with three exceptions: the characters &quot;+&quot; and &quot;-&quot; expressed whether the dependent sentence element follows or precedes the governing word /sentence member/, and the character &quot;!&quot; was used for special defective sentence constructions. For the purposes of frequency lists, each word-form was also assigned its basic lexical form /lemma/.</Paragraph>
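    As a rough illustration of a mostly numeric code with three special characters, the sketch below splits a code string into its digits and its markers. The layout of the code, the function name parse_code, and the reading of "+" as "follows" and "-" as "precedes" are all assumptions made for this example; the paper does not specify them.

```python
def parse_code(code):
    """Split a hypothetical syntactic code into digits and special markers."""
    digits = "".join(ch for ch in code if ch.isdigit())
    if "+" in code:
        direction = "follows"       # assumed reading of "+"
    elif "-" in code:
        direction = "precedes"      # assumed reading of "-"
    else:
        direction = None            # no direction marker present
    # "!" flags a special defective sentence construction
    return {"digits": digits, "direction": direction, "defective": "!" in code}
```

    Keeping the markers separate from the digits means the numeric part can be counted or sorted without regard to word order.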
    <Paragraph position="2"> The coding and lemmatization were done by linguists as follows. After the boundaries between sentences and clauses had been marked, each word-form was given information about its syntactic function /&quot;membership in sentence&quot;/ and about its governing word in the dependency tree. Each clause was then given information about its structural position in the complex or compound sentence; then coordination between words or between clauses was marked; finally, information about the linear arrangement of words in sentences, of clauses in complex and compound sentences, and of whole sentences in the text was added, so that the running texts can be completely reconstructed by computer at any time, if necessary.</Paragraph>
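    The record structure described above can be sketched roughly as follows: each word-form carries its syntactic function, a pointer to its governing word, and its linear position, and the position alone is enough to rebuild the running text. The class name WordForm, its field names, the function labels and the English sample sentence are invented for illustration and are not taken from the original coding scheme.

```python
from dataclasses import dataclass


@dataclass
class WordForm:
    position: int    # linear position in the sentence, counted from 1
    form: str        # the running word-form
    function: str    # syntactic function label (hypothetical labels)
    governor: int    # position of the governing word, 0 for the root


def reconstruct(words):
    """Rebuild the running text from records stored in any order."""
    return " ".join(w.form for w in sorted(words, key=lambda w: w.position))


records = [
    WordForm(2, "reads", "predicate", 0),
    WordForm(1, "Peter", "subject", 2),
    WordForm(3, "books", "object", 2),
]
print(reconstruct(records))  # prints: Peter reads books
```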
    <Paragraph position="3"> The whole corpus, together with the encoded information, was transferred from punched cards /80 columns/ by a standard input programme to external computer memory on magnetic tapes /each tape library can hold at most 30 texts; the whole corpus is contained on 7 tapes/.</Paragraph>
    <Paragraph position="4"> Each record was given a special translation zone to make possible the obligatory /non-standard/ sorting that respects the different alphabetic ordering of some Czech letters /record size on tape is 130, block size BS=6502/. The automatic processing was carried out on the TESLA 200 computer in the computer centres OTZ~HT and OFPL of the Czechoslovak Academy of Sciences.</Paragraph>
  </Section>
  <Section position="4" start_page="391" end_page="391" type="metho">
    <SectionTitle>
COMPUTATIONAL DATA ANALYSIS FOR SYNTAX
</SectionTitle>
    <Paragraph position="0"> All programmes and their modifications were written in the internal programming language &quot;APS&quot;.</Paragraph>
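    The idea behind a translation zone for Czech sorting can be sketched as a sort key that replaces each letter by its rank in the Czech alphabet, where the digraph "ch" counts as a single letter placed after "h". This is a simplified modern illustration, not the APS implementation: the names CZECH_ALPHABET, RANK and czech_key are assumptions, and real Czech collation treats most accented vowels as secondary differences, which this sketch ignores.

```python
CZECH_ALPHABET = [
    "a", "á", "b", "c", "č", "d", "ď", "e", "é", "ě", "f", "g", "h", "ch",
    "i", "í", "j", "k", "l", "m", "n", "ň", "o", "ó", "p", "q", "r", "ř",
    "s", "š", "t", "ť", "u", "ú", "ů", "v", "w", "x", "y", "ý", "z", "ž",
]
RANK = {letter: index for index, letter in enumerate(CZECH_ALPHABET)}


def czech_key(word):
    """Translate a word into a list of letter ranks usable as a sort key."""
    word = word.lower()
    key = []
    i = 0
    while i != len(word):
        if word[i:i + 2] == "ch":       # the digraph "ch" is one letter
            key.append(RANK["ch"])
            i += 2
        else:
            # characters outside the alphabet sort after all letters
            key.append(RANK.get(word[i], len(CZECH_ALPHABET)))
            i += 1
    return key


words = ["chata", "cena", "hrad", "čas", "idea"]
print(sorted(words, key=czech_key))
# prints: ['cena', 'čas', 'hrad', 'chata', 'idea']
```

    Storing such a translated key alongside each record lets an ordinary sort over the key produce the Czech ordering.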
    <Paragraph position="1"> The survey of the main results given below is ordered from the simplest to the more complicated ones, with respect both to the computer programmes and to the linguistic information obtained.</Paragraph>
  </Section>
  <Section position="5" start_page="391" end_page="391" type="metho">
    <SectionTitle>
SIMPLE COMPUTATIONAL CHARACTERISTICS
</SectionTitle>
    <Paragraph position="0"> The first set of programmes, simple in structure and in programming technique, gives the essential totals of frequencies /occurrences/ of encoded syntactic categories, individual syntactic features and various items under investigation. A number of results of this kind were obtained by reading, or repeatedly reading, the magnetic tape library and simply adding up the different code items.</Paragraph>
    <Paragraph position="1"> The data obtained offer a basic survey of the frequencies of sentence elements and of simple, complex and compound sentences; of types of syntagms /both determinative and coordinative/; of the frequencies of two-element and one-element sentences and their patterns; of the frequencies of types of subordinate clauses; of some word-order and clause-order characteristics; and of the frequency ratio of simple to complex/compound sentences. In addition, some data were collected on how often various sentence elements are expressed by a nominal or an adverbial phrase and how often by a dependent clause, and on the most frequent types of complex sentences, including the frequencies and functions of various syntactic connectors. By using cycles /loops/ within the counting programmes, the distributions of syntactic units were obtained in a similarly simple and prompt way.</Paragraph>
    <Paragraph position="2"> This part of the computer work yielded the length distributions of clauses and sentences expressed in number of words, or in number of clauses.</Paragraph>
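    A length distribution of this kind can be sketched in a few lines. In the example below each sentence is represented as a list of clauses and each clause as a list of word-forms; this representation, the function name length_distributions and the English sample data are assumptions made for illustration only.

```python
from collections import Counter


def length_distributions(sentences):
    """Return sentence-length distributions in words and in clauses."""
    in_words = Counter(sum(len(clause) for clause in s) for s in sentences)
    in_clauses = Counter(len(s) for s in sentences)
    return in_words, in_clauses


sentences = [
    [["Peter", "reads"]],
    [["Mary", "says"], ["that", "Peter", "reads"]],
    [["Mary", "sleeps"]],
]
in_words, in_clauses = length_distributions(sentences)
print(dict(in_words))    # prints: {2: 2, 5: 1}
print(dict(in_clauses))  # prints: {1: 2, 2: 1}
```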
  </Section>
  <Section position="6" start_page="391" end_page="391" type="metho">
    <SectionTitle>
COMPOUND COMPUTATIONAL CHARACTERISTICS
</SectionTitle>
    <Paragraph position="0"> By doubling or chaining the testing subprogrammes and cycles, another set of programmes was constructed for more complicated searching and output of syntactic characteristics. Specially annotated tables, as well as larger sets of numeric data supplying rich material for further steps of analysis, were obtained. Whereas the results of the programme set mentioned above referred to the frequencies of individual syntactic categories, the programmes reported in this section concentrated on their relationships. Attention was paid especially to the relationships between syntactic elements and the parts of speech that express them, to the syntactic relevance of some morphological categories /e.g. of noun cases/, to the correlation between sentence length and the complexity of sentence structure, to the relationship between types of subordinate clauses and their linear position in complex sentences, etc. Some of the statistical data obtained confirmed our intuitive expectations /e.g. concerning the syntactic functions of parts of speech and of noun cases/; others led us to a deeper insight into the interrelations between linguistic levels, especially into the connection between the lexical and the syntactic levels.</Paragraph>
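    The step from single-category totals to relationships between categories amounts to counting pairs of codes rather than single codes. A minimal sketch, assuming each record has already been reduced to a pair of syntactic function and part of speech (the labels and the name cross_tabulate are hypothetical):

```python
from collections import Counter, defaultdict


def cross_tabulate(records):
    """Count how often each syntactic function is filled by each part of speech."""
    table = defaultdict(Counter)
    for function, part_of_speech in records:
        table[function][part_of_speech] += 1
    return table


sample = [
    ("subject", "noun"),
    ("subject", "pronoun"),
    ("subject", "noun"),
    ("predicate", "verb"),
]
table = cross_tabulate(sample)
print(dict(table["subject"]))  # prints: {'noun': 2, 'pronoun': 1}
```

    The same pairing works for any two coded features, for example noun case against syntactic function.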
  </Section>
  <Section position="7" start_page="391" end_page="391" type="metho">
    <SectionTitle>
TYPES OF VERBAL CONTEXTS SEARCHED BY COMPUTER
</SectionTitle>
    <Paragraph position="0"> By using the computer's operating memory we overcame the technical impossibility of reading a magnetic tape in reverse. This enabled us to prepare the third set of programmes, with many variations. As output we received whole sentences, sentence types or their components with the required code combinations or with their immediate verbal contexts. Thus we could study not only abstract syntactic categories as such, as reported above, but could also take into account the concrete lexical manifestations of various syntactic elements, units and categories.</Paragraph>
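    One way to obtain the preceding context of a hit from a medium that can only be read forwards is to keep the most recent word-forms in memory while reading. The sketch below illustrates that general idea and is not the authors' APS programme; the function name find_with_context, the width parameter and the sample data are assumptions.

```python
from collections import deque


def find_with_context(records, target_code, width=2):
    """Yield (left_context, hit) pairs while reading the records once, forwards."""
    recent = deque(maxlen=width)    # the last few word-forms kept in memory
    for form, code in records:
        if code == target_code:
            yield list(recent), form
        recent.append(form)


stream = [
    ("Peter", "subject"),
    ("often", "adverbial"),
    ("reads", "predicate"),
    ("long", "attribute"),
    ("books", "object"),
]
hits = list(find_with_context(stream, "predicate"))
print(hits)  # prints: [(['Peter', 'often'], 'reads')]
```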
    <Paragraph position="1"> Some interesting tendencies were found concerning the insertion of certain lexical types into different syntactic positions, e.g. types of adjectives typical of predicative positions and others typical of attributive positions. The relationship between the semantics of coordinated syntactic elements and their functions in topic-comment structure was studied; the correlation between the frequencies of subordinate clauses and the lexical semantics of the governing predicate was proved /with predicates expressing the attitude of the speaker to the content of communication/; and a correlation between the morphological category of the infinitive and the semantic category of modality was found.</Paragraph>
  </Section>
  <Section position="8" start_page="391" end_page="391" type="metho">
    <SectionTitle>
COMPUTATIONAL DATA ANALYSIS FOR SYNTAX
</SectionTitle>
    <Paragraph position="0"> Special attention was paid to syntactic structures with the verbs &quot;to be&quot; and &quot;to have&quot;.</Paragraph>
  </Section>
</Paper>