File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/73/c73-1014_metho.xml
Size: 19,636 bytes
Last Modified: 2025-10-06 14:11:05
<?xml version="1.0" standalone="yes"?> <Paper uid="C73-1014"> <Title>GÜNEY GÖNENÇ UNIQUE DECIPHERABILITY OF CODES WITH CONSTRAINTS WITH APPLICATION TO SYLLABIFICATION OF TURKISH WORDS</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> Information lossless automata were first studied by D. A. HUFFMAN (1959). Huffman also devised tests for information losslessness (IL) and information losslessness of finite order (ILF). By treating finite state machines as encoders and decoders, the tests for IL and ILF can be applied to coding theory. This was done by S. EVEN (1962, 1963, 1965), who devised testing methods for unique decipherability (UD) and unique decipherability of finite delay (UDF), concepts shown to be parallel to IL and ILF.</Paragraph> <Paragraph position="1"> In this paper, tests for UD and UDF for codes with constraints are investigated. The basis of the proposed method is Even's procedure.</Paragraph> <Paragraph position="2"> The constraints are of the form &quot;code word X never follows code word Y&quot; for specific ordered pairs (X, Y) of code words.</Paragraph> <Paragraph position="3"> The need for testing UD and UDF for codes with constraints originally arose in the syllabification problem for Turkish words. The problem is, essentially, to find an algorithm for syllabification of words for a given printed Turkish text. The construction of syllables in the Turkish language is very regular and hence it is not difficult to find such algorithms intuitively, by trial and error. 
By a thorough analysis of the UD and UDF properties of the conversion from printed word to syllable structure, it is also possible to investigate the effects of the flood of foreign (mostly French) words on the syllable structure of Turkish.</Paragraph> <Paragraph position="4"> In part 2 some basic definitions are given. In part 3 Even's procedure for testing UD and UDF is discussed briefly. The test for codes with constraints is presented in part 4. Finally, in part 5, applications to Turkish syllable structure are discussed briefly.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. NOTATION AND BASIC DEFINITIONS </SectionTitle> <Paragraph position="0"> Source symbols will be shown by capital letters A, B, ..., L, W, X, Y, Z. Code symbols will be shown by 0 and 1. A concatenation of a finite number of code symbols is called a code word. A code consists of a finite number of code words of finite length, each representing a source symbol. A coded message is obtained by concatenating code words, without spacing or any other punctuation. Only variable-length codes, in which code words are not necessarily of the same length, will be considered.</Paragraph> <Paragraph position="1"> A code is said to be uniquely decipherable if and only if every coded message can be decomposed into a sequence of code words in only one way. A code is said to be uniquely decipherable of finite delay N if and only if N is the least integer such that knowledge of the first N symbols of the coded message suffices to determine its first code word.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. TESTS FOR UD AND UDF </SectionTitle> <Paragraph position="0"> By treating finite state machines as encoders and decoders, tests for UD and UDF can be converted into tests for IL and ILF (S. EVEN, 1965; Z. KOHAVI, 1970). 
Without going into tests for IL and ILF, we shall give Even's testing procedure for UD and UDF here. At the same time we shall demonstrate the procedure on a binary code T which consists of 4 code words: A = 0, B = 10, C = 01, and D = 101. Procedure 1.</Paragraph> <Paragraph position="1"> (1a) Insert a separation symbol S at the beginning and end of each code word in the code.</Paragraph> <Paragraph position="2"> (1b) Let code word X be of length n. Insert the separation symbol Xi between the i-th and (i + 1)-th symbols of code word X for 1 ≤ i ≤ n-1. Do</Paragraph> </Section> <Section position="4" start_page="0" end_page="186" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> this for all code words for which n ≥ 2. For example, after steps (1a) and (1b), D = 101 becomes D = S 1 D1 0 D2 1 S.</Paragraph> <Paragraph position="1"> (1c) The separation symbol to the right of the code symbol t is called the t-successor of the separation symbol to the left of the same code symbol. For example, D1 is the 1-successor of S, D2 is the 0-successor of D1, and S is the 1-successor of D2, in code word D.</Paragraph> <Paragraph position="2"> Two separation symbols are said to be compatible if (I) They are t-successors of S, for some code symbol t, or, (II) They are t-successors of two separation symbols which are themselves compatible.</Paragraph> <Paragraph position="3"> If (WX) is a compatible pair, and if Y and Z are t-successors of W and X, respectively, then the compatible pair (YZ) is said to be implied by (WX) under t.</Paragraph> <Paragraph position="4"> Construct a testing table as follows: the column headings are the code symbols. The first row heading is S. The entries in the first row are compatible pairs found by (I) above, under corresponding column t. The other row headings are the compatible pairs. 
The entries in row (WX), column t, are the compatible pairs implied by (WX) under t.</Paragraph> <Paragraph position="5"> The testing table for code T is shown in fig. 1.</Paragraph> <Paragraph position="6"> (1d) If the table contains the pair (SS) then the code is not UD; otherwise it is UD. Since there are (SS) pairs in the testing table for code T, it is not UD. By tracing back the compatibles, starting from an (SS) pair, one can arrive at the symbol S (possibly through several paths).</Paragraph> <Paragraph position="7"> The sequence of code symbols corresponding to this traceback path gives an ambiguous message. In fig. 2 some of these ambiguous messages are shown for code T.</Paragraph> <Paragraph position="9"> (1e) If no (SS) pair is generated, then a testing graph is constructed from the table as follows: corresponding to every row in the table there is a vertex in the graph. If (YZ) is implied by (WX) under t, then a directed arc labeled t leads from vertex (WX) to vertex (YZ) in the graph.</Paragraph> <Paragraph position="10"> (1f) A code is uniquely decipherable of finite delay if and only if its testing graph is loop-free. If the graph is loop-free and the length of the longest path in the graph is r, then the delay is N = r + 1.</Paragraph> <Paragraph position="11"> 4. CONSTRAINTS ON CODE WORD OCCURRENCES In the above discussion, there was no constraint whatsoever regarding the occurrence of any code word at any point of the message. On the other hand, it may happen that, for some specific code, the code word X never follows the code word Y. Such constraints may arise from the physical nature of the encoder (for example, no letter other than u can follow the letter q in an English text) or may be deliberately imposed upon a code to achieve UD or UDF properties.</Paragraph> <Paragraph position="12"> A constraint of the form &quot;code word X never follows code word Y&quot; will be termed a first-order constraint. 
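As a small illustration of how a first-order constraint can remove an ambiguity, the following Python sketch enumerates, by brute force, every way of splitting a coded message into code words of code T from part 3. It is not the table-based testing procedure of this paper. The function name `decompositions`, the representation of constraints as a set of pairs (X, Y) meaning "X never follows Y", and the single constraint used in the second call are assumptions made only for this example.

```python
# Brute-force enumeration of decompositions (illustration only; this is
# not the testing-table procedure described in the paper).
CODE_T = {"A": "0", "B": "10", "C": "01", "D": "101"}


def decompositions(message, code, forbidden=frozenset()):
    """Return every way to split `message` into code words of `code`.

    `forbidden` holds pairs (X, Y) meaning "X never follows Y".
    """
    results = []

    def extend(rest, words):
        if not rest:
            results.append(tuple(words))
            return
        for name, word in code.items():
            if not rest.startswith(word):
                continue
            # Skip `name` if it may not follow the previous code word.
            if words and (name, words[-1]) in forbidden:
                continue
            extend(rest[len(word):], words + [name])

    extend(message, [])
    return results


# No constraints: the message 010 has more than one reading.
print(decompositions("010", CODE_T))
# Illustrative constraint "A never follows C".
print(decompositions("010", CODE_T, forbidden={("A", "C")}))
```

With no constraint the message 010 can be read as A B or as C A, so the first call lists two decompositions. With the illustrative constraint that A never follows C, only A B remains.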
For codes with first-order constraints, a testing procedure is given below: (2a) Insert a separation symbol Px at the beginning and a separation symbol Qx at the end of each code word X in the code.</Paragraph> <Paragraph position="13"> (2b) Insert separation symbols Xi as in (1b). For example, after steps (2a) and (2b), D = 101 becomes PD 1 D1 0 D2 1 QD.</Paragraph> <Paragraph position="14"> (2c) Let a number m(X, Y) be defined for every ordered pair of code words (X, Y) in the following way: m(X, Y) = 1 if the code word X is allowed to occur immediately after the code word Y, and m(X, Y) = 0 otherwise.</Paragraph> <Paragraph position="15"> A constraint matrix M in which there is one row and one column for each code word can be defined such that the element of M in row X, column Y is m(X, Y).</Paragraph> <Paragraph position="16"> For example, consider code T of part 3. Let the following four constraints be imposed on this code: A never follows C, C never follows C, A never follows D, and C never follows D. These four constraints can also be expressed as &quot;a code word starting with a 0 never follows a code word ending with a 1&quot;. The resulting code, called code U, and its constraint matrix are shown in fig. 3.</Paragraph> <Paragraph position="17"> (2d) The separation symbol to the right of the code symbol t is called the t-successor of the separation symbol to the left of the same code symbol. Furthermore, a separation symbol X1 (Qx) is the t-successor of the separation symbol Qy if X1 (Qx) is a t-successor of Px and m(X, Y) = 1. Two separation symbols are said to be compatible if (I) They are t-successors of Px and Py for some t, X, and Y, or (II) They are t-successors of two separation symbols which are themselves compatible.</Paragraph> <Paragraph position="18"> Construct the testing table as in (1c), with the change: the first row heading is P. 
The testing table for code U is shown in fig. 3. (2e) If the table contains any pair (QxQy) for some X and Y (possibly identical), then the code is not UD. Otherwise it is UD. For example, it is seen from fig. 3 that code U is UD. If the code is not UD, then a traceback of the compatibles which implied a pair (QxQy) gives an ambiguous message.</Paragraph> <Paragraph position="19"> (2f) If the code is UD, then one can construct the testing graph as in (1e). The testing graph for code U is shown in fig. 4.</Paragraph> <Paragraph position="20"> Fig. 4. Testing graph for code U. The longest path in this graph has length 3. Hence the code is UDF of order 4; in other words, knowledge of the first 4 code symbols suffices to determine the first code word, but 3 symbols are not sufficient. To demonstrate that knowledge of the first 3 code symbols is not sufficient, consider a path of length 3 in the graph, for example the path 101 from P to (QDD1). When we receive 101 we cannot decide whether this is word D, or whether word B (= 10) has occurred and a word D (= 101) has just started (the last vertex (QDD1) actually points to this ambiguity). But if the fourth symbol received is a 0 we can now decide that the first code word was B, and if the fourth symbol is a 1 we decide that the first code word was D.</Paragraph> <Paragraph position="21"> There may be other types of constraints present on the code. A constraint of the form &quot;code word X never follows YZ&quot;, where Y and</Paragraph> </Section> <Section position="5" start_page="186" end_page="186" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> Z are distinct, will be termed a second-order constraint. If there exists such a constraint, then it can be converted into the following first-order constraints: create a new code word Z', identical in structure to Z. 
Then impose the constraints &quot;X never follows Z', Z never follows Y&quot; (for simplification purposes one can impose the additional constraints &quot;Z' never follows Z', X, or Z&quot;). Higher-order constraints can be handled similarly.</Paragraph> </Section> <Section position="6" start_page="186" end_page="186" type="metho"> <SectionTitle> 5. SYLLABLE STRUCTURE OF TURKISH LANGUAGE </SectionTitle> <Paragraph position="0"> In the Turkish language there are 12 syllable types. These are shown in Table 1.</Paragraph> <Paragraph position="2"> The first six syllable types (types A-F) are the syllable types of proper Turkish. The remaining six types (types G-L) came into Turkish with foreign borrowings. These are somewhat characterized by consonant clusters, which are totally alien to the language. In the spoken language, especially as spoken by less-educated people, these clusters are simplified by the addition of a vowel before or within the</Paragraph> </Section> <Section position="7" start_page="186" end_page="186" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> cluster, thereby increasing the number of syllables in the word (G. L.</Paragraph> <Paragraph position="1"> LEWIS, 1967). Since our main concern is printed texts, we shall not deal with these and other aspects of the spoken language.</Paragraph> <Paragraph position="2"> The treatment of printed Turkish words as messages encoded into a code in which syllables are code words and letters are code symbols enables us to syllabify printed texts automatically. 
This is important for the following reasons: 1) Automatic syllabification makes it possible to recognize and count (mainly for statistical purposes) syllable types and/or syllables from texts read into the computer without any syllable separation markers.</Paragraph> <Paragraph position="3"> 2) Automatic syllabification is necessary in automatic typesetting; without it, words to be broken at line ends cannot be properly divided into syllables.</Paragraph> <Paragraph position="4"> 3) Automatic syllabification gives insight into the syllable structure, its deformation under some effects, and the relation between spoken and printed language, thereby helping linguists working on the subject.</Paragraph> <Paragraph position="5"> The first six syllable types³, without any constraints, obviously form a non-UD code. For example, a word 0110 can be decoded as 01.10 (CB) or as 011.0 (EA). On the other hand, the phonetic rules of the language impose some constraints as to which syllable types cannot follow a given syllable type. The set of constraints inherent in the language can be summarized as &quot;each vowel takes the first consonant before it into its syllable&quot; (T. BANGUOĞLU, 1959). In our notation, the constraint set can be summarized as &quot;no syllable starting with a vowel (0) can follow a syllable ending with a consonant (1)&quot;. The constraint matrix corresponding to this set is shown below.</Paragraph> <Paragraph position="7"/> </Section> <Section position="8" start_page="186" end_page="186" type="metho"> <SectionTitle> 3 The Turkish alphabet consists of eight vowels (a, e, ı, i, o, ö, u, ü) and 21 consonants </SectionTitle> <Paragraph position="0"> (b, c, ç, d, f, g, ğ, h, j, k, l, m, n, p, r, s, ş, t, v, y, z). Only one vowel can be present in any syllable. 
There are no diphthongs in Turkish.</Paragraph> </Section> <Section position="9" start_page="186" end_page="186" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> Now, by constructing the testing table and graph, it can be shown that this code is UDF of order 5.⁴ This simply means that there is an algorithm to syllabify any proper Turkish word which operates in the following manner: 1) The only information required about the characters in the text is whether each is a vowel, a consonant, or &quot;other&quot; (such as blank, comma, numeral, etc.).</Paragraph> <Paragraph position="1"> 2) When a word is being scanned, its first syllable will be decided upon at the fifth character of the word or before. Since the code is UD, the decision process is completed when the word ends (i.e. upon the first blank).</Paragraph> <Paragraph position="2"> The introduction of the syllable types G, H, ..., L of Table 1 into the language causes the &quot;invention&quot; of new constraints. These are not yet thoroughly investigated or explained. One set of constraints can be summarized as: &quot;no syllable starting with two or more consonants can follow a syllable ending with a vowel&quot;.⁵ With the addition of this set of constraints, the constraint matrix becomes</Paragraph> <Paragraph position="4"> 4 It is also interesting to note that the first-order constraints needed to make the code A, B, ..., F uniquely decipherable of finite delay are found to be precisely those constraints inherent in the language.</Paragraph> <Paragraph position="5"> 5 No mention of this kind of constraint is found in the literature. This rule, and the one given before, must clearly be the result of the shape of the vocal organs. 
We should also mention that no exception at all to these two rules exists.</Paragraph> </Section> <Section position="10" start_page="186" end_page="186" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> The code thus generated can be shown to be still non-UD. Some typical ambiguities concerning existing words are shown below: A careful and thorough search (through all borrowings in the language) revealed one fact: if we increase the number of code symbols from two (vowel, consonant) to three (v = vowel, r = the letter &quot;r&quot;, and a third symbol for any consonant other than &quot;r&quot;), then the resulting code becomes UD, and actually UDF of delay 7, for all existing foreign (and of course all native) words. The examples given above hint at this. Simply note that the words in the upper line in each set have an r as the second letter of the second syllable, whereas a letter other than r appears at the same position of the word for the words in the lower lines, e.g. emprime and enstitü.</Paragraph> <Paragraph position="1"> Finally, with these considerations, an algorithm for syllabification was programmed (in FORTRAN). This algorithm is based on the state table of the inverse of the finite state machine which is taken as the encoder device 4,7. The input to the program is a printed text; the output is the same text (numerals etc. skipped), with all the words syllabified. There are minor additions to the program. For example, unsyllabifiable words (due to punching errors, etc.) are printed out as they are, but in brackets. The program was run on an IBM 360/40. An example of input data and corresponding printouts are shown in fig. 
5.</Paragraph> </Section> <Section position="11" start_page="186" end_page="186" type="metho"> <SectionTitle>
HECE AYIRMA PROGRAMI GELENEK AKARYAKIT UYGULAMA
HE*CE A*YIR*MA PROG*RA*MI GE*LE*NEK A*KAR*YA*KIT UY*GU*LA*MA
TORTU KONGRE KORKAK KANGREN TABLDOT KONTRAT TANJANT
TOR*TU KON*GRE KOR*KAK KAN*GREN TABL*DOT KON*TRAT TAN*JANT
STEREOSKOP AHMET RIZA O STRC BB ANI
STE*RE*OS*KOP AH*MET RI*ZA O (STRC) (BB) A*NI
</SectionTitle> <Paragraph position="0"></Paragraph> </Section> <Section position="12" start_page="186" end_page="186" type="metho"> <SectionTitle>
EMPRIME ENSTITU EKSPER ISTRANCA ISTRONGILOS ISFENKS
EM*PRI*ME ENS*TI*TU EKS*PER IS*TRAN*CA IS*TRON*GI*LOS IS*FENKS
FBRKET (KANDIRMACA) .12/MAYIS/1971 GUSULHANE
(FBRKET) KAN*DIR*MA*CA MA*YIS GU*SUL*HA*NE
SAAT TATAR AMFITEATR TELEKS KREOZOT FLAMA FLUOR
SA*AT TA*TAR AM*FI*TE*ATR TE*LEKS KRE*O*ZOT FLA*MA FLU*OR
AERODINAMIK AIT ARAP AORT AVURT ARKEOLOG BABA
A*E*RO*DI*NA*MIK A*IT A*RAP A*ORT A*VURT AR*KE*O*LOG BA*BA
TRAHOM FREKANS STRATEJI STRATOSFER ARTI
TRA*HOM FRE*KANS STRA*TE*JI STRA*TOS*FER AR*TI
KONTRAST EKSKAVATOR ENSTITU
KON*TRAST EKS*KA*VA*TOR ENS*TI*TU
</SectionTitle> <Paragraph position="0"> Fig. 5. The upper line is the input data, the lower line is the output.</Paragraph> </Section> </Paper>
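The rule quoted in part 5, "each vowel takes the first consonant before it into its syllable", can be sketched in a few lines of Python for native words. This is only an illustration under stated assumptions: it is not the FORTRAN state-table program described in the paper, it ignores the borrowed syllable types G-L and the three-symbol refinement, and the names `VOWELS` and `syllabify` are introduced here for the example. It expects upper-case input written without diacritics, as in fig. 5.

```python
# Minimal sketch of the native-word rule only (helper names are assumed,
# not taken from the paper's program).
VOWELS = set("AEIOU")  # fig. 5 input is upper case without diacritics


def syllabify(word, sep="*"):
    """Split `word` so that each vowel takes the consonant just before it."""
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    cuts = []
    # Every vowel after the first one starts a new syllable.
    for pos in vowel_positions[1:]:
        if word[pos - 1] in VOWELS:
            cuts.append(pos)      # vowel directly after a vowel
        else:
            cuts.append(pos - 1)  # take the preceding consonant along
    pieces = []
    start = 0
    for cut in cuts:
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return sep.join(pieces)


for w in ["HECE", "AYIRMA", "GELENEK", "AKARYAKIT", "UYGULAMA", "SAAT"]:
    print(w, syllabify(w))
```

For these six native words the sketch gives HE*CE, A*YIR*MA, GE*LE*NEK, A*KAR*YA*KIT, UY*GU*LA*MA and SA*AT, in agreement with fig. 5. Borrowed words such as KONGRE need the additional constraints discussed in part 5 and are not divided correctly by this simplified rule.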