ANDRÉE TRETIAKOFF
RESULTS OBTAINED WITH A NEW METHOD
FOR THE AUTOMATIC ANALYSIS OF
SENTENCE STRUCTURES
We present in this paper a method for the automatic analysis of 
sentence structures. 
Our purpose is to compile a frequency dictionary of the different
structures used in the language. This dictionary will enable us to
select the most useful sentence structures and to recommend their
exclusive use in texts intended for automatic translation.
We think that automatic translation will be possible only if the
texts are subject to rules which limit the complexity of their
syntax. An author will hardly notice these limitations, since only
the most unusual structures will be left out. Of course the number
of permitted structures will increase as the automatic translation
codes are improved.
The sentence structures are obtained by a statistical analysis of the
word strings, using procedures developed in information theory.
In the present paper we have analysed only groups of two
consecutive words, as an example of our method.
The same type of analysis can be generalized by considering non- 
consecutive words and groups of more than two words. 
1. GROUPS 
The first step of the analysis is to put the words into groups
according to their grammatical properties, for example: noun,
adjective, article and so on. The number of groups has been limited
so that the frequencies remain significant given the length of the
corpus (3500 words). In the text under study, we have used 67 groups.
A list of these groups is given in Table 3.
Of course, our classification is somewhat arbitrary, as it is based
on a preliminary knowledge of the language. We will show later how
the results of the analysis can help us to detect inadequate
classifications.
Each word of the corpus has been replaced by a symbol (a two-digit
integer) representing its grammatical group. We consider the
words inside the sentence, that is to say between two strong
punctuation signs (. ; ! ?). Inside the sentence all punctuation
signs are suppressed. From now on we will call these symbols "words".
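The encoding step described above can be sketched as follows. This is only an illustration in Python, not the program used for the study: the function name `encode` and the small lexicon are assumptions, and a plain word-to-group lookup ignores the fact that one word may belong to several groups depending on its context (her is 55 in one sentence and 44 in another).

```python
import re

# Hypothetical mini-lexicon: word -> two-digit group code (codes as in Table 3).
LEXICON = {
    "she": "44", "loved": "01", "a": "45", "good": "05", "laugh": "04",
}

def encode(text, lexicon):
    """Cut the text at strong punctuation (. ; ! ?) and replace words by groups."""
    sentences = []
    for raw in re.split(r"[.;!?]", text):
        # Only letter sequences are kept, so all other punctuation is suppressed.
        words = re.findall(r"[A-Za-z']+", raw)
        if words:
            sentences.append([lexicon[w.lower()] for w in words])
    return sentences

for groups in encode("She loved a good laugh. She loved a laugh!", LEXICON):
    print(" ".join(groups))
```

Each printed line is one sentence written as a string of group symbols.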
2. DICTIONARY OF STRINGS 
The second step is the construction of a string dictionary.
A sentence containing N words produces (N-1) strings. For instance,
the sentence Her daughter gave me an Italian lesson every day,
represented by the string "55 04 01 44 45 05 04 85 04", produces the
following strings:
First (complete) string   55 04 01 44 45 05 04 85 04
Second string                04 01 44 45 05 04 85 04
etc.                            01 44 45 05 04 85 04
                                   44 45 05 04 85 04
                                      45 05 04 85 04
                                         05 04 85 04
                                            04 85 04
Last string                                    85 04
Each string is obtained by suppressing the first word of the preceding
string.
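The production of the (N-1) strings of a sentence can be written in a few lines. The sketch below is in Python and the function name `suffix_strings` is our own choice for the illustration.

```python
def suffix_strings(groups):
    """Return the N-1 strings obtained by dropping the first word again and again."""
    # The last string kept has two words, so the final single word is excluded.
    return [groups[k:] for k in range(len(groups) - 1)]

sentence = "55 04 01 44 45 05 04 85 04".split()
for string in suffix_strings(sentence):
    print(" ".join(string))
```

For the nine-word sentence above this prints eight strings, from the complete string down to 85 04.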
The dictionary brings out the identical strings, whatever their
position in the sentence might be. A sample of the dictionary is given
in Table 1.
For example, the string 05 04, which means an adjective followed
by a common noun at the end of a sentence, has the rank number 244
and occurs 9 times, in sentences number 9, 10, 35 and so on.
All the strings beginning with the groups 05 04 are also listed.
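A dictionary of this kind can be sketched as below, assuming that the entries are simply sorted by their group numbers so that strings with a common beginning become neighbours. The function name `string_dictionary` is an assumption; sentences 9 and 5 are taken from the text, while sentence 100 is a hypothetical one added only to obtain a frequency greater than 1. The ranks printed are those of this toy corpus, not those of Table 1.

```python
from collections import defaultdict

def string_dictionary(sentences):
    """Map every string of the corpus to the numbers of the sentences producing it."""
    entries = defaultdict(list)
    for number, groups in sentences.items():
        for k in range(len(groups) - 1):          # the N-1 strings of the sentence
            entries[tuple(groups[k:])].append(number)
    # Sorting by group numbers puts strings with the same beginning side by side.
    return sorted(entries.items())

corpus = {
    9: [44, 1, 45, 5, 4],                   # She loved a good laugh
    5: [55, 4, 1, 44, 45, 5, 4, 85, 4],     # Her daughter gave me an Italian lesson every day
    100: [24, 1, 45, 5, 4],                 # hypothetical sentence, for illustration only
}

for rank, (string, where) in enumerate(string_dictionary(corpus), start=1):
    print(rank, len(where), " ".join(map(str, string)), "|", *where)
```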
3. SENTENCE STRUCTURE 
The last step of the analysis is the production of sentence structures, 
using the correlations between two consecutive words. 
We can compare the probability $P_j$ of a word j in the corpus with
the conditional probability $P_j(\text{if } i)$ of the same word when
the preceding word is known to be i. In this paper we shall call
"degree of correlation" the logarithm of the ratio of the conditional
probability to the probability:
$$C_{ij} = \log_2 \frac{P_j(\text{if } i)}{P_j}$$
The degree of correlation is positive when the probability of
getting a word is increased by the knowledge of the preceding word, and
negative when this probability is decreased. It is a measure of the
"affinity" of two consecutive words.
This procedure can be generalized by considering groups of more 
than two words, not necessarily consecutive. 
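The computation of the degrees of correlation can be sketched as follows. Several points are assumptions of this illustration and are not stated in the text: the logarithm is taken to base 2, the probability of a word is estimated by its relative frequency in the whole corpus, and the conditional probability by the relative frequency of the pair among all pairs beginning with the same word. The function name and the three toy sentences are hypothetical.

```python
import math
from collections import Counter

def degrees_of_correlation(sentences):
    """Return {(i, j): log2(P_j(if i) / P_j)} for all pairs of consecutive words."""
    unigrams = Counter(g for s in sentences for g in s)
    pairs = Counter(p for s in sentences for p in zip(s, s[1:]))
    firsts = Counter()                      # number of pairs beginning with i
    for (i, _), n in pairs.items():
        firsts[i] += n
    total = sum(unigrams.values())
    corr = {}
    for (i, j), n in pairs.items():
        p_j = unigrams[j] / total           # probability of j in the corpus
        p_j_if_i = n / firsts[i]            # probability of j when preceded by i
        corr[(i, j)] = math.log2(p_j_if_i / p_j)
    return corr

toy = [[44, 1, 45, 5, 4], [44, 1, 45, 4], [45, 4, 1, 44]]
for pair, c in sorted(degrees_of_correlation(toy).items()):
    print(pair, round(c, 3))
```

Pairs that never occur in the corpus receive no value here; how such pairs were handled in the study is not specified.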
For each sentence of the corpus we can build a structure based 
on the correlation between two consecutive words in the following 
way. Inside the sentence, consecutive words are connected two by two 
in order of decreasing degree of correlation. For instance in the sen- 
tence: 
She loved a good laugh 
we have the following degrees of correlation: 
She loved = 2.56
loved a = 1.23
a good = 2.38
good laugh = 1.86
Therefore the first words to be connected are She and loved, then
a and good. We consider these unions to be the first level. Then
the word laugh is connected to the group a good; this union is
the second level. Finally the two halves of the sentence are
connected, and this union is the third level.
This structure can be represented by the following graph,
automatically produced by the computer, and by the string 1 3 1 2
obtained by writing the sequence of the successive levels.
SENTENCE NO 9
Word no.  Group  Word    Correlation with next word  Level
231       44     SHE     2.564                       1
232       1      LOVED   1.232                       3
233       45     A       2.379                       1
234       5      GOOD    1.860                       2
235       4      LAUGH
[Graph: tree connecting the words at levels 1, 2 and 3.]
Degrees of correlation:
She loved ........ 2.56
loved a .......... 1.23
a good ........... 2.38
good laugh ....... 1.86
String of groups: 44 01 45 05 04
String of levels: 1 3 1 2
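The construction of the string of levels can be sketched as follows. The rule used for numbering is an assumption inferred from the worked examples: a single word has level 0, and each union receives a level one higher than the higher of the two parts it joins. The function name `level_string` is ours.

```python
def level_string(correlations):
    """Levels of the links between consecutive words, from their correlations."""
    n = len(correlations)                   # n links join n + 1 words
    group_of = list(range(n + 1))           # label of the group holding each word
    level_of = {g: 0 for g in group_of}     # a single word has level 0
    levels = [0] * n
    # Links are made in order of decreasing degree of correlation.
    for k in sorted(range(n), key=lambda k: -correlations[k]):
        left, right = group_of[k], group_of[k + 1]
        levels[k] = 1 + max(level_of[left], level_of[right])
        for w in range(n + 1):              # merge the right group into the left one
            if group_of[w] == right:
                group_of[w] = left
        level_of[left] = levels[k]
    return levels

# She loved a good laugh
print(level_string([2.564, 1.232, 2.379, 1.860]))   # [1, 3, 1, 2]
```

With the degrees of correlation of Her daughter gave me an Italian lesson every day the same function returns 1 2 1 4 1 2 3 1.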
4. DICTIONARY OF STRUCTURES 
This procedure has been applied to all the sentences of the text,
producing strings of numbers which represent the structure of these
sentences.
For each string of numbers, suppressing the highest number yields
2 strings representing 2 substructures of the sentence. We carry on
this procedure until the string has only 1 number, that is to say
until it represents the structure of a group of 2 words.
For instance the structure of the sentence Her daughter gave me an 
Italian lesson every day is represented by the following string of 
numbers: 
SENTENCE NO 5
Word no.  Group  Word      Correlation with next word  Level
156       55     HER       2.173                       1
157       4      DAUGHTER  0.024                       2
158       1      GAVE      1.267                       1
159       44     ME        -0.702                      4
160       45     AN        2.379                       1
161       5      ITALIAN   1.860                       2
162       4      LESSON    -0.421                      3
163       85     EVERY     2.194                       1
164       4      DAY
Complete string: 1 2 1 4 1 2 3 1   (level 4)
1 substring:             1 2 3 1   (level 3)
2 substrings:    1 2 1   1 2       (level 2)
3 substrings:    1   1   1     1   (level 1)
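The decomposition into substructures can be sketched recursively. The function name `substructures` is an assumption; the sketch relies on the fact that the highest number of a string of levels occurs only once, since it marks the last union made.

```python
def substructures(levels):
    """The string itself and every substructure found by removing the highest number."""
    found = [levels]
    if len(levels) > 1:
        top = levels.index(max(levels))     # position of the last union made
        for part in (levels[:top], levels[top + 1:]):
            if part:                        # an empty side is a single word
                found.extend(substructures(part))
    return found

for s in substructures([1, 2, 1, 4, 1, 2, 3, 1]):
    print(max(s), s)
```

Each line gives the level of a substructure followed by its string. A sentence with 8 links yields 8 structures in all, one for each union.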
All the structures and substructures are classified in a dictionary,
giving their frequencies and the numbers of the sentences containing
the corresponding word strings (Table 2).
For example, the structure 1 4 2 1 3 has the rank number 65 and is
found 5 times, in sentences number 12, 16, 21, 24 and 41.
5. CLASSIFICATION ERRORS 
If the structure of a sentence is unsatisfactory, this can be due to 
an error in the classification of a word of this sentence. This observa- 
tion is used to detect and correct classification errors. For example 
in the sentence: 
But  I   like  to  come  mother
81   44  01    17  02    24
[Tree diagram of the resulting structure not recoverable.]
the word come had been classified in the wrong group, 02 (indicative
of intransitive verbs). When this is corrected (22 = infinitive of
intransitive verbs), we obtain the following structure:
But  I   like  to  come  mother
81   44  01    17  22    24
[Tree diagram of the corrected structure not recoverable.]
Another way to check the classification of words into groups is
to use the quantity of information associated with the law of
succession of two consecutive words. It is known from communication
theory that the average amount of information per word is reduced when
we know the law of succession of two consecutive words. This reduction
is precisely equal to the average degree of correlation of all the
groups:
$$\sum_{i,j} P_{ij}\, C_{ij}$$
where $P_{ij}$ is the probability of the pair of consecutive groups
i, j. We shall call it the quantity of information associated with the
law of succession of two consecutive words.
In order to check the validity of the choice of the grammatical
group for a word, the quantity of information associated with the law
of succession of the groups is measured. Then the choice of the group
is changed and the quantity of information is measured again for this
new classification. The greater the quantity of information associated
with a law of succession of the groups, the better the distribution of
the words into these groups.
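This comparison of two classifications can be sketched as follows. The estimates are assumptions of the illustration: probabilities are relative frequencies, the logarithm is taken to base 2, and the average degree of correlation is weighted by the relative frequency of each pair of consecutive groups. The function name and the four-sentence toy corpus are hypothetical; only the group codes come from Table 3.

```python
import math
from collections import Counter

def quantity_of_information(sentences):
    """Average degree of correlation over all pairs of consecutive groups."""
    unigrams = Counter(g for s in sentences for g in s)
    pairs = Counter(p for s in sentences for p in zip(s, s[1:]))
    firsts = Counter()                      # number of pairs beginning with i
    for (i, _), n in pairs.items():
        firsts[i] += n
    n_words = sum(unigrams.values())
    n_pairs = sum(pairs.values())
    info = 0.0
    for (i, j), n in pairs.items():
        c_ij = math.log2((n / firsts[i]) / (unigrams[j] / n_words))
        info += (n / n_pairs) * c_ij        # weight: relative frequency of the pair
    return info

# Hypothetical toy corpus; the last sentence is "But I like to come mother".
common = [[44, 1, 17, 22], [44, 1, 17, 22, 24], [24, 2]]
come_as_22 = common + [[81, 44, 1, 17, 22, 24]]   # come = infinitive (22)
come_as_02 = common + [[81, 44, 1, 17, 2, 24]]    # come = indicative (02)

print(quantity_of_information(come_as_22))
print(quantity_of_information(come_as_02))
```

In this toy corpus the classification of come as 22 gives the greater quantity of information, which is the criterion described above. A real check would of course use the whole corpus.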
6. CONCLUSION 
The sample chosen here (3500 words from a novel by S. Maugham)
is too short to obtain significant frequencies for the different
structures. This sample contains 200 sentences with an average length
of 17 words.
In spite of the simplicity of the method of analysis employed, 72
sentences with an average length of 10 words have been correctly
analysed.
This shows that the correlation of 2 consecutive words, although
insufficient by itself, will play an important part in the more
elaborate methods of analysis that we are now developing.
TABLE 1.
Rank  Frequency  String                               Sentence numbers
244   9          5 4                                  9 10 35 37 41 49 54 66 72
245   1          5 4 1 17 22 97 5 4                   35
246   1          5 4 1 35 4                           30
247   1          5 4 1 35 4 7 55 4 85 4               40
248   1          54144954774                          1
249   1          5 4 3 17 23 35 4 97 45 5 4 97 35 4   16
250   1          54944341                             68
251   1          5 4 16 24 9 44 1 54                  14
252   1          5 4 80 95 4 16 55 4 44 2 26          2
253   1          5 4 85 4                             5
254   2          5 4 97 24                            13 42
255   1          5 4 97 35 4                          16
256   1          5 4 97 65 5 4 1 35 4 7 55 4 85 4     40
257   1          5 5 4                                41
258   1          5 5 4 1 35 4                         30
259   1          5 17 21 45 4                         17
260   1          5 17 21 66 85 4                      50
261   1          5 17 22 16 35 4 7 45 5 4             72
262   1          5 17 22 26                           44
263   1          5 27 44                              63
264   1          5 27 55 4
TABLE 2.
Rank  Frequency  Structure                        Sentence numbers
56    1          0 1 3 2 1 4 5 1 2 1 6 1 2 3 1    22
57    1          0 1 3 2 1 6 4 3 1 2 5 1 3 2 1    61
58    1          0 1 4 1 2 1 3                    33
59    1          0 1 4 1 2 1 3 1 2                48
60    1          0 1 4 1 2 1 3 5 1                33
61    1          0 1 4 1 2 1 3 5 1 6 3 1 2 4      33
62    1          0 1 4 1 2 3                      15
63    1          0 1 4 1 2 3 1                    63
64    1          0 1 4 1 2 3 5 1 2 3 4 2 1 3 1    15
65    5          0 1 4 2 1 3                      12 16 21 24 41
66    2          0 1 4 3 1 2                      20
67    1          0 1 4 3 1 2 5 1 2                20
68    1          0 1 5 1 2 1 3 2 1 4              18
69    1          0 1 5 1 2 1 3 2 1 4 6            18
70    1          0 1 5 1 2 1 3 2 1 4 6 7 1        18
71    1          0 1 5 1 4 2 1 3                  24
72    1          0 1 5 1 4 2 1 3 6 1 2 3 4        24
73    1          0 1 5 2 1 3 4                    28
74    1          0 1 5 2 1 3 4 6 1                28
75    1          0 1 6 1 2 5 4 3 1 2              52
76    1          0 1 6 1 2 5 4 3 1 2 7 1
SENTENCE NO 60
Word no.  Group  Word        Correlation with next word  Level
3348      65     THOSE       1.421                       1
3349      4      THINGS      0.268                       3
3350      3      ARE         1.873                       1
3351      36     NOT         1.804                       2
3352      31     DONE
SENTENCE NO 56
Word no.  Group  Word        Correlation with next word  Level
3255      29     WHAT        4.345                       1
3256      3      IS          1.121                       2
3257      5      WRONG       -0.095                      3
3258      27     WITH        2.026                       2
3259      55     HIS         2.173                       1
3260      4      MORALS
SENTENCE NO 58
Word no.  Group  Word        Correlation with next word  Level
3323      54     THAT        2.306                       1
3324      2      SOUNDS      1.570                       2
3325      66     QUITE       2.689                       1
3326      5      IMPOSSIBLE  -2.392                      3
3327      1      CRIED       1.663                       1
3328      24     KITTY
SENTENCE NO 36
Word no.  Group  Word         Correlation with next word  Level
1839      44     SHE          2.564                       1
1840      1      FORCED       1.859                       2
1841      55     HER          2.173                       1
1842      4      LIPS         1.012                       3
1843      7      INTO         1.180                       2
1844      45     A            1.573                       1
1845      4      SMILE
SENTENCE NO 20
Word no.  Group  Word         Correlation with next word  Level
1121      35     THE          2.114                       1
1122      4      MOTHER       0.024                       4
1123      1      GAVE         0.273                       3
1124      7      INTO         2.460                       1
1125      55     HER          2.173                       2
1126      4      [illegible]  -1.604                      5
1127      35     THE          3.863                       1
1128      25     SMALLER      2.087                       2
1129      4      CHILDREN
SENTENCE NO 19
Word no.  Group  Word         Correlation with next word  Level
1079      24     KITTY        1.644                       1
1080      1      FOUND        0.926                       2
1081      35     THE          2.114                       1
1082      4      WORK         -1.198                      4
1083      45     A            1.573                       1
1084      4      REFRESHMENT  0.221                       3
1085      87     TO           2.511                       1
1086      55     HER          2.173                       2
1087      4      SPIRIT
SENTENCE NO 49
Word no.  Group  Word         Correlation with next word  Level
2820      44     SHE          2.564                       1
2821      1      HAD          0.273                       3
2822      7      AMONG        2.460                       1
2823      55     HER          2.173                       2
2824      4      ANCESTORS    -5.241                      4
2825      4      PERSONS      1.012                       1
2826      7      OF           -0.801                      2
2827      5      HISTORIC     1.860                       1
2828      4      IMPORTANCE
SENTENCE NO 50
Word no.  Group  Word           Correlation with next word  Level
2830      94     [illegible]    2.047                       4
2831      68     MUST           5.493                       2
2832      28     HAVE           7.456                       1
2833      33     BEEN           2.742                       3
2834      5      HARD           0.490                       5
2835      17     TO             4.406                       1
2836      21     LEAVE          0.540                       3
2837      66     SO             0.870                       2
2838      85     MUCH           2.194                       1
2839      4      GRANDEUR
SENTENCE NO 52
Word no.  Group  Word           Correlation with next word  Level
2850      44     SHE            2.564                       1
2851      1      HAS            0.926                       6
2852      35     THE            3.863                       1
2853      25     MOSTBEAUTIFUL  2.087                       2
2854      4      HANDS          1.192                       5
2855      9      [illegible]    2.438                       4
2856      44     [illegible]    2.727                       3
2857      8      HAVE           3.709                       1
2858      46     EVER           2.804                       2
2859      31     SEEN           -0.655                      7
2860      1      SAID           1.663                       1
2861      24     KITTY
SENTENCE NO 55
Word no.  Group  Word          Correlation with next word  Level
3216      80     BUT           2.061                       1
3217      46     SOMETIMES     0.265                       2
3218      24     SISTERJOSEPH  3.245                       1
3219      2      THOUGHT       -0.934                      5
3220      44     HE            2.564                       1
3221      1      SPOKE         1.125                       2
3222      6      BADLY         2.227                       1
3223      56     ONPURPOSE     0.505                       3
3224      17     TO            4.406                       1
3225      21     MAKE          1.036                       2
3226      44     YOU           0.302                       4
3227      22     LAUGH
SENTENCE NO 32
Word no.  Group  Word         Correlation with next word  Level
1734      7      BY           2.122                       1
1735      35     THE          2.114                       2
1736      4      MERCY        2.288                       1
1737      97     OF           0.113                       3
1738      4      PROVIDENCE   0.085                       4
1739      44     [illegible]  2.115                       1
1740      3      WAS          1.047                       3
1741      7      AT           2.122                       1
1742      35     THE          2.114                       2
1743      4      DOOR         -1.074                      5
1744      66     JUST         1.696                       3
1745      49     AS           2.983                       1
1746      44     SHE          2.212                       2
1747      2      CAME
SENTENCE NO 14
Word no.  Group  Word     Correlation with next word  Level
728       94     IT       3.529                       1
729       3      WAS      1.873                       2
730       36     NOT      1.760                       3
731       49     TILL     2.983                       1
732       44     I        2.564                       2
733       1      MADE     1.232                       4
734       45     A        2.379                       1
735       5      LONG     1.860                       2
736       4      JOURNEY  1.777                       3
737       16     IN       2.637                       1
738       24     CHINA    0.075                       5
739       9      THAT     2.438                       2
740       44     I        2.564                       1
741       1      FOUND    1.710                       3
742       54     THIS
SENTENCE NO 15
Word no.  Group  Word       Correlation with next word  Level
792       35     THE        2.114                       1
793       4      MOMENT     0.085                       4
794       44     YOU        2.564                       1
795       1      THINK      1.447                       2
796       26     OF         0.329                       3
797       44     HIM        -3.667                      5
798       44     YOU        2.564                       1
799       1      THINK      1.447                       2
800       26     OF         0.329                       3
801       44     HIM        -2.311                      4
802       16     IN         1.721                       2
803       85     SOME       2.194                       1
804       4      SITUATION  1.379                       3
805       41     DOING      5.094                       1
806       89     SOMETHING
SENTENCE NO 22
Word no.  Group  Word       Correlation with next word  Level
1536      44     SHE        2.564                       1
1537      1      FELT       1.430                       3
1538      9      THAT       2.438                       2
1539      44     THEY       2.564                       1
1540      1      LIKED      1.267                       4
1541      44     HER        -0.843                      5
1542      80     AND        0.124                       1
1543      31     FLATTERED  -0.240                      2
1544      80     AND        -0.175                      1
1545      5      PROUD      -1.993                      6
1546      44     SHE        2.564                       1
1547      1      LIKED      1.267                       2
1548      44     THEM       -1.076                      3
1549      7      IN         -0.345                      1
1550      4      RETURN
TABLE 3.
Group  Category
01     INDICATIVE (TRANSITIVE VERBS)
21     INFINITIVE (TRANSITIVE VERBS)
31     PAST PARTICIPLE (TRANSITIVE VERBS)
41     PRESENT PARTICIPLE (TRANSITIVE VERBS)
51     GERUND (TRANSITIVE VERBS)
02     INDICATIVE (INTRANSITIVE VERBS)
22     INFINITIVE (INTRANSITIVE VERBS)
32     PAST PARTICIPLE (INTRANSITIVE VERBS)
42     PRESENT PARTICIPLE (INTRANSITIVE VERBS)
52     GERUND (INTRANSITIVE VERBS)
03     INDICATIVE (STATE VERBS)
23     INFINITIVE (STATE VERBS)
33     PAST PARTICIPLE (STATE VERBS)
43     PRESENT PARTICIPLE (STATE VERBS)
53     GERUND (STATE VERBS)
08     INDICATIVE (AUXILIARY VERBS)
28     INFINITIVE (AUXILIARY VERB)
68     WOULD, SHOULD, WILL, CAN, MAY, HAVE TO
78     INDICATIVE (TO DO, AUXILIARY VERB)
88     INFINITIVE (TO DO, AUXILIARY VERB)
04     COMMON NOUN
14     COMMON NOUN (POSSESSIVE CASE)
24     PROPER NOUN
34     PROPER NOUN (POSSESSIVE CASE)
44     PRONOUN (PERSONAL)
54     PRONOUN (DEMONSTRATIVE)
64     PRONOUN (INDEFINITE)
74     PRONOUN (PERSONAL REFLEXIVE)
94     PRONOUN (IMPERSONAL)
05     ADJECTIVE (QUALIFICATIVE)
15     ADJECTIVE (COMPARATIVE)
25     ADJECTIVE (SUPERLATIVE)
35     ARTICLE (DEFINITE)
45     ARTICLE (INDEFINITE)
55     ADJECTIVE (POSSESSIVE)
65     ADJECTIVE (DEMONSTRATIVE)
85     ADJECTIVE (INDEFINITE)
95     ADJECTIVE (CARDINAL)
72     ADJECTIVE (PRESENT PARTICIPLE)
73     ADJECTIVE (PAST PARTICIPLE)
93     ADJECTIVE (ORDINAL)
39     ADJECTIVE (INTERROGATIVE)
06     ADVERBS MADE FROM ADJECTIVES
16     ADVERB (PLACE)
26     POSTPOSITION
36     NOT
46     ADVERB (TIME)
56     ADVERB (MANNER)
66     ADVERB (QUANTITY)
76     AS, LIKE
86     ADVERB (REPETITION)
96     ADVERB (EXCLAMATIVE)
07     PREPOSITION
17     TO (INFINITIVE)
27     WITH, WITHOUT
87     TO
97     OF
89     NOTHING, SOMETHING
09     THAT
19     PRONOUN (RELATIVE)
29     PRONOUN (INTERROGATIVE)
49     CONJUNCTION (TIME)
59     CONJUNCTION (CAUSE)
69     CONJUNCTION (SUPPOSITION)
79     CONJUNCTION (COMPARISON)
80     AND
81     BUT, OR

