Linguistic Knowledge Generator 
Satoshi SEKINE * 
Tokyo Information and Communications Research Laboratory 
Matsushita Electric Industrial Co.,Ltd. 
Sofia ANANIADOU Jeremy J.CARROLL Jun'ichi TSUJII 
Centre for Computational Linguistics 
University of Manchester Institute of Science and Technology 
PO Box 88, Manchester M60 1QD, United Kingdom 
1 Introduction 
The difficulties in current NLP applications are sel- 
dom due to the lack of appropriate frameworks for 
encoding our linguistic or extra-linguistic knowledge, 
hut rather to the fact that we do not know in advance 
what actual znstances of knowledge should be, even 
though we know in advance what types of knowledge 
are required. 
It normally takes a long time and requires painful 
trial and error processes to adapt knowledge, for ex- 
ample, in existing MT systems in order to translate 
documents of a new text-type and of a new subject 
domain. Semantic classification schemes for words, 
for example, usually reflect ontologies of subject do- 
mains so that we cannot expect a single classifica- 
tion scheme to be effective across different domains. 
To treat different suhlanguages requires different word 
classification schemes. We have to construct appro- 
priate schemes for given sublanguages from scratch 
\[1\]. 
It has also been reported that not only knowledge 
concerned with extra-linguistic domains but also syn- 
tactic knowledge, such as subcategorization frames of 
verbs (which is usually conceived as a part of general 
language knowledge), often varies from one sublan- 
guage to another \[2\]. 
Though re-usability of linguistic knowledge is cur- 
rently and intensively prescribed \[3\], our contention 
is that the adaptation of existing knowledge requires 
processes beyond mere re-use. That is, 
1. There are some types of knowledge which we 
have to discover from scratch, and which should 
be integrated with already existing knowledge. 
2. It is often the case that knowledge, which is nor- 
mally conceived as valid regardless of subject do- 
mains, text types etc., should be revised signifi- 
cantly. 
In practical projects, the ways of achieving such 
adaptation and discovery of knowledge rely heavily 
*SEKINE is currently a visitor at U.M.I.S T. *eki~eOccl.umist. ac.uk 
on human introspection. In the adaptation of exist- 
ing MT systems, linguists add and revise the knowl- 
edge by inspecting a large set of system translation 
results, and then try to translate another set of sen- 
tences from given domains, and so on. The very fact 
that this trial and error process is time consuming 
and not always satisfactory indicates that human in- 
trospection alone cannot effectively reveal regularities 
or closure properties of sublanguages. 
There have been some proposals to aid this pro- 
cedure by using programs in combination with huge 
corpora \[4\] \[51 \[13\] \[7\]. But the acquisition prog .... in 
these reports require huge amounts of sample texts in 
given domains which often makes these methods un- 
realistic in actual application environments. Further- 
more, the input corpora to such learning programs 
are often required to be properly tagged or anno- 
tated, which demands enormous manual effort, mak- 
ing them far less useful. 
In order to overcmne the difficulties of these meth- 
ods, we propose a Linguistic Knowledge Generator 
(LKG) which working on the principle of "Gradual 
Approximation" involving both human introspection 
and discovery programs. 
In the following section, we will explain the Grad- 
ual Approximation approach. Then a scenario which 
embodies the idea and finally we describe an experi- 
ment which illustrates its use. 
2 Gradual Approximation 
Some of the traditional learning programs which are 
to discover linguistic regularities in a certain dimen- 
sion requires some amount of training corpus to be 
represented or structured in a certain way. For ex- 
ample, a program which learns disambiguation rules 
for parts-of-speech may require training data to be 
represented as a sequence of words with their correct 
parts-of-speech. It may count frequencies of trigram 
of parts-of-speech in corpora to learn rules for disam- 
biguation. On the other hand, a program to discover 
the semautic classes of nouns may require input data 
(sentences) to be accompanied by their correct syn- 
AcrEs DE COLING-92, NANTES, 23-28 Ao~r 1992 5 6 0 PROC. OF COLING-92, NANTES, AUG. 23-28. 1992 
tactic structures, and so on. 
This is also the case for statistical programs. Mean- 
ingful statistics can be obtained only when they are 
applied to appropriate units of data. Frequencies of 
characters or trigrams of characters, for example~ arc 
unlikely to be useful for capturing the structures of 
the semantic domains of a given sublanguage. In 
short, discovery processes can be effectively assisted 
or carried out, if corpora are appropriately repre- 
sented for the purpose. However, to represent or tag 
corpora appropriately requires other sorts of linguistic 
or extra-linguistic knowledge or even the very knowl- 
edge whicb is to he discovered by the program. 
For example, though corpora amtotated with syn- 
tactic structures are usefld tbr discovering semantic 
classes, to assign correct syntactic structures to cor- 
pora requires semantic knowledge in order to prevent 
proliferation of Imssible syntactic structures. 
One possible way of avoiding this chicken-and-egg 
situation is 1o use roughly approximated, imperfect 
knowledge of semantic domains in order to hypothe- 
mac correct syntactic structures of sentences ill cor- 
pora. Because such approximated semantic knowl- 
edge will contain errors or lack necessary information, 
syntactic structures &ssiglled to sentences ii1 corpora 
may contain errors or imperfections. 
Ilowever, if a program or Imnran expert could pro- 
duce more accurate, less imperfect knowledge of se- 
mantic domains from descriptions of corpora (as- 
signed syntactic structures), we could use it to pro- 
duce more accurate, less erroneous syntactic descrip- 
tions of corpora, and repeat the same process again 
to gain fllrtber zmprovcment botb in knowledge of se- 
mantic domains and in syntactic descriptions of cor- 
pora. Thus, we may be able to converge gradually 
to botb correct syntactic descriptions of corpora, and 
semantic classifications of words. 
In order to support such convergence processes, 
I,KG has to maintain the following two types of data. 
1. knowledge sets of various dimensions (morphol- 
ogy, syntax, semantics, pragmatics/ontology of 
extra-linguistic domains etc.), which are hypoth- 
esis, eel by humans or by discovery programs, and 
all of which are imperfect in the sense that they 
contain erroneous generalizations, lack specific 
information, etc. 
2. descriptions of corpora at diverse levels, which 
are based on the hypothesised knowledge in 1. 
Because of the hypothetical nature of the knowl- 
edge in 1, descriptions based on it inevitably con- 
tain errors or lack precision. 
Based on these two types of data, both of which 
contain imperfect, the whole process of discovering 
regularities in sublanguage will be performed as a re- 
laxation process or a gradual repetitive approxima- 
tion process. That is, 
1. hunmn specialists or discovery programs make 
hypotheses based on mrperfect descriptions of 
corpora 
2. hypotheses thus proposed result in more accu- 
rate, less imperfect knowledge 
3. tbe more accurate, less imperfect knowledge in 
2., results m a more accurate description of the 
corpora 
The same process will be repeated from 1., but this 
time, based on the more accurate descriptions of col 
pora than the previous cycle. It will yield further, 
more accurate hypothesized knowledge and descrip- 
tions of corpora and so on. 
3 Algorithm 
In this sec.tkm, we describe a scenario to illustrate 
how our idea of the "Gradual Approximation" works 
to obtain knowledge front actual corpora. The goal of 
the scenario is to discover semantic classes of nouns 
which are effective for determining (disambiguating) 
internal structures of compound nouns, which con- 
sist of sequences of nmms. Note that, because there 
is no clear distinction in Japanese between noun 
phrases and conlllOUlld uonns consisting of sequences 
of nouns, we refer to them collectively as compound 
nouns. The scenario is comprised of three programs, 
ie. Japanese tagging program, Automatic Learning 
Program of Semantic Collocations and clustering pro- 
gram. 
There is a phase of human intervention which accel- 
erates the calculation, but in this scenario, we try to 
minimize it. In the following, we first give an overview 
of the scenario, then explain each program briefly, and 
tlnally report on an experiment that fits this scenario. 
Note that, though we use this simple scenario as an 
illustrative example, the same learning program can 
be used in another inore complex scenario whose aim 
is, for example, to discow~r semantic collocation be- 
tween verbs aml noun/prepositional phrases. 
3.1 Scenario 
This scenario Lakes a corpus without any significant 
annotation as tile input data, and generates, ms the 
result, plausibility values of collocational relations be- 
tween two words and word clusters, based oll the cal- 
culated semantic distances between words. 
The diagram illustrating this scenario is shown in 
Figure 1. The first program to be applied is the 
"Japanese tagging program" which divides a sentence 
into words and generates lists of possible parts-of- 
speech for each word. 
Sequences of words with parts-of-speeches are then 
used to extract candidates for compound nouns (or 
noun phrases consisting of noun sequences), which 
are the input for the next program, the "Auto- 
matic Learning Program for Semantic Collocations" 
(ALPSC). This program constitutes the main part of 
Aergs DE COLING-92, NANTES, 23-28 AOf/'r 1992 5 6 1 PROC. OF COLING-92, NANTES, AUO. 23-28, 1992 
the scenario and produces tile ahove-ment, ioned out- 
put 
Tbe output of the program contain errors. Errors 
here mean that the plausibility values assigned to cob 
locations may lead to wrong determinations of com- 
pound noun structures. Such errors are contained in 
the results, because of the errors in the tagged data, 
the insufficient quality of the corpus and inevitable 
imperfections in tile learning system. 
From the word distance results, word clusters are 
computed by the next program, the "Clustering Pro- 
gram". Because of tile errors in tile word distance 
data, the computed clusters may be counter-intuitive. 
We expect human intervention at this stage to formu- 
late more intnitively reasonable clusters of nouns. 
After revision of the clusters by human specialists, 
the scenario enters a second trial. That is, the ALPSC 
re-computes plausibilit.y values of collocations and 
word distances based on tile revised clusters, the 
"Chlstering Program" generates the next generation 
of dusters, and humans intervene to formulate more 
reasonable clusters, and so on, and so forth. It is ex- 
pected that word clusters after the (i+l)-th trial be- 
comes more intuitively understandable than that of 
the i-th trial and that the repetition eventually con- 
verges towards ideal clusters of nouns and plausibility 
values, in the sense that they are consistent both with 
human introspection and the actual corpus. 
It should be noted that, while the overall process 
works as gradual approximation, the key program in 
the scenario, the ALPSC also works in tile mode of 
gradual approximation as explained in Section 3.2.2. 
3.2 Programs and Iluman interven- 
tions 
We will explain each program briefly. Ifowevcr the 
ALPSC is crucial and tmique, so it will be explained 
in greater detail. 
3,2.1 Program: Japanese tagging program 
This program takes Japanese sentences as an input, 
finds word boundaries arid puts all possible parts-of- 
speech for each word under adjacency constraints. 
From the tagged sentences, sequences of nouns are 
extracted for input to the next program. 
3.2.2 Program: Antomatic Learning Program 
of Semantic Collocations (ALPSC) 
This is the key program which computes plausibil- 
ity values and word distances. In this scenario, the 
ALPSC treats only sequences of nouns, but it can 
generally applied for any structure of syntactic rela- 
tionships. It is an unique program with the following 
points \[8\]: 
l. it does not need a training corpus, which is one of 
the bottle necks of some other learning programs 
2. it learns by using a combination of linguistic 
knowledge and statistical analysis 
3. it uses a parser which produces all possible anal- 
yses 
4 it works as a relaxation process 
While it is included as a part of a larger repetitive 
loop, this program itself contains a repetitive loop. 
Overview 
Before formally describing the algorithm, the %l- 
lowing shni)le example illustrates its working. 
A parser produces all possible syntactic descrip- 
tions aluong words in the form of syntactic depen- 
dency structures, The description is represented by a 
set of tupies, for example, \[head uord, syntactic 
relat±on, argument\]. The only syntactic relation 
m a tuple is MOD for this scenario, but it can be ei- 
ther a grammatical relation like MOD, SUB,J, OBJ, 
etc. or a surface preposition like BY, WITH, etc. 
When two or more tupies share tile same argument 
and tile same syntactic-relation, but have different 
head-words, there is an ambiguity. 
For example, tile description of a compound noun: 
"File transfer operation" contains three tuples: 
\[transfer, MOO, file\] 
\[operation, MOD, file\] 
\[operation, HOD, transfer\] 
The first two tup\]es are redundant, because one 
word can only be an argument in one of tile tuples. 
As repeatedly clahned in tile literature of natural lan- 
gnage understanding, in order to resolve this ambi- 
gmty, a system may have to be able to infer extra- 
linguistic knowledge. A practical problem here is 
that there is no systematic way of accumulating such 
extradinguistic knowledge for given subject fields. 
That is, mdess a system ha-s a full range of contex- 
tual understanding abilities it cannot reject either of 
the possihle interpretations as 'hnpossible'. The best 
a system can do, without full understanding abilities, 
is to select more plausible ones or reject less plausi- 
ble ones. This implies that we have to introduce a 
measure by which we can judge plausibility of :inter- 
pretations'. 
The algorithm we propose computes such measures 
from given data. It gives a plausibility value to each 
possible tuple, based on the sample corpus. For exam- 
pie, when tile tuples (transfer, ROD, file) and 
(opera~:ion, MOD, file) are assigned 0.5 and 0.82 
as their plausibility, this would show the latter tuple 
to be more plausible than the former. 
The algorithm is based on the assumption that 
the ontological characteristics of the objects and ac- 
tions denoted by words (or linguistic expressions in 
general), and the nature of the ontological relations 
AcrEs DE COLING-92, NANTES, 23-28 AOOT 1992 5 6 2 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
among tbein, are exhibited, though implicitly in sam- 
pie texts. For example, uouns denoting objects which 
belong to tile same ontological classes tend to appear 
m similar linguistic contexts. 
Note that we talk about extra-linguistic 'ontology' 
for the sake of explaining the basic idea bebind the 
actual algorithm, ttowever, as you will see, we do 
not represent such things as ontological entities in 
the actual algorithm. The algorithm sinrply counts 
frequencies of co-occurrences among words, and word 
similarity algorithms interpret such co-occurrences ms 
contexts. 
The algoritbm in this progranr computes the plan- 
sibility values of hypothesis-tuples like (operation, 
M0D, Iile), etc., basically by counting frequencies of 
instance-tuples \[operation, ROD, f±le\], etc. gen 
crated from the input data. 
Terminology and notation 
instance-.tuple {h, r, a\] : a token of a dependency rc- 
lat, ion; part of the analysis of a sentence in a 
corpus. 
hypothesis-tuple (h,r,a) : a dependency relation; 
an abstraction or type over identical instance- 
tuples 
}:yale-- repeat time of the relaxation cycle. 
(;'r,, : Gredit of instance tuple 7' with identification 
mmfl, er i, {0, 1\] 
V,,~ : Plausibility value of a hypothesis-tuple T in 
cycle 9. \[0, 1\] 
D~ (w.,rvb) : distance between words, w~ and wb m 
cycle 9. TO, 11 
Algorithm 
The following explanation of the algorithm assumes 
that the illlnlts are sentences. 
1 For a sentence we use a simple grammar to find 
all tuples lmssibly used. Each instance-tuple is 
then given credit in proportion to the number of 
conlpeting tuples. 
1 (~ = (1) 
number of competing tuples 
This credit sbows which rules are suitable for this 
sentence. On the first iteration the split of the 
credit between ambiguous analyses is uniform as 
shown above, but on subsequent iterations plau- 
sibility values of tire hypothesis-tuples VT a-1 be- 
fore the iteration are used to give preference to 
credit for some analyses over others. The formula 
for this will be given later. 
2. tlypotbesis-tuples have a plausibility value which 
indicates their reliability by a number between 
0 and 1. If an instance-tuple occurs frequently 
in the corpus or if it occurs where there are no 
alternative tuples, the plausibility value for the 
corresponding hypothesis must be large. After 
analysing all the sentences of tile corpus, wc get 
a set of sentences with weighted instanee-tuples. 
Each instauee-tuple invokes a hypothesis-tuple. 
For each hypothesis-tulflC, we define the plausi- 
bility value by the following formula. This for- 
mula is designed so that tile value does not ex- 
ceed 1. 
v4 = 1. I\] I1- v,,,,) (2) 
i 
At this stage, the word-distances can be used to 
modify the plausibility wdnes of the hypotbesiu- 
tupies. The word-dist>tnces are either defined 
externally using human mtuitiou or calculated 
in the previous cycle with a formula given later 
Disl`ance between words induces a distance be- 
tween bypothesis-tuples, 'lk~ speed up the cal- 
culation and to get, better resull`s, we use sim- 
ilar hypothesis eft~cts. The plausibility value 
of a hypothesis-tuple is modified based on the 
word distance and plausibility value of a simi- 
lar hypothesis. For each bypotbesis-tuple, the 
plausibility-vMue is increased only as a conse- 
quence of the similar hypothesis-tuple which has 
the greatest ellect. The new plausibility value 
with similar hypothesis-tulfie effect is calculated 
10y tile following formula. 
(3) 
llere, the hypotbesis-tuple 7" is the hypothesis- 
tuple which has the greatest effect on the 
hypothesis-tuple "/' (origmM one). Ilypotbems- 
tuple T and T' }lave all the same elements ex- 
cept one, "Fbe distance between ~" and 2/" is 
the distance betweei1 the different elements, w a 
and wb. Ordinarily the dilierence is in the head 
or argument element, but when the relation is 
a preposition, it is possible to consider distance 
front another preposition. 
Distances between words are calculated on the 
basis of similarity between hypothesis-tuples 
about them. Tbe formula is as follows: 
D~ ( ..... ~) Z,~, (v4 - V4,) e (4) 
n 
7' and 7" are bypothesis-tuldes whose arguments 
are w~ and wb, respectively and whose heads and 
relations are the same fl is a constant parame- 
ter 
ACRES DE COLING-92, NANTES, 23-28 AO~" 1992 5 6 3 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
5. Tbis procedure will be repeated from the begin- 
ning, but modifying the credits ofinstance-tupies 
between ambiguous analyses using tim plausibil- 
ity values of hypothesis-tuples. This will hope- 
fully be more accurate than the previous cycle. 
On the first iteration, we used just a constant 
figure for the credits of instanee-tuples. But 
this time we can use the plausibility value of the 
hypothesis-tuple which was deduced in the pre- 
vious iteration. Hence with each iteration we ex- 
pect more reliable figures, q.'o cMcuLate the new 
credit of instance-tuple T, we use: 
c~, - vT~' (s) 
Ilere, V.r J m the numerator, is the plausibility 
value of a hypothesis-tuple which is the same tu- 
pie as the instance-tuple T V,~ in the denom- 
inator are the plausibility values of competing 
hypothesis tnples in the sentence and the plau- 
sibility value of the same hyl)otbesis-tuple itself. 
ct is a coustant paranleter 
6. Iterate step 1 to 5 several times, until the infor- 
matiou is saturated. 
3,2.3 Program: Clustering i)rogram 
Word clusters are produced based on the word dis- 
tance data which are comlmted in the previous pro- 
gram A non-overlapping clusters algorithm with the 
maximum method was used. The level of the clusters 
was adjusted experimentally to get suitable sizes tor 
bu Foau intervention. 
3.2.4 Human zT~lerventzon. Select: clusters 
The clusters may mbcrit errors contained in the word 
distance data The errors can be classified into the 
following two types. 
1 A Correct cluster overlaps with two or more geu- 
crated clusters. 
2 A generated duster overlaps with two or more 
correct chlsters 
Note that 'correcU bere means that it is correct 
m terms of truman intuition. To ease the laborious 
job of correcting these errors by band, we ignore the 
first type of error, which is much harder to remove 
than the second one. It is not ditlieult to remove the 
second type of error, because the number of words 
in a single cluster ranges from two to about thirty, 
and this number is manageable for hmnans. We try 
to extract purely 'correcU clusters or a subset of a 
correct cluster, from a generated cluster. 
It is our contention that, thougll chlsters contain 
errors, and are mixtures of clusters based on human 
intuition and clusters computed by process, we will 
gradually converge on correct clusters by repeating 
tiffs approximation. 
At this stage, some correct clusters in the produced 
clusters are extracted. This information will be an 
input of the next trial of ALPSC. 
4 Experiment 
We conducted an experiment using compound nouns 
from a computer manual according to the scenario. 
The resnlt for other relations, for example preposi- 
tional attachment, would be not so differeut from this 
result. 
The corpus consisted of 8304 sentences. As the 
result of Japanese tagging program, 1881 candidates, 
616 kinds of compound nouns were extracted. 
Then ALPSC took these compound nouns as an in- 
put. Tuple relations were supposed between all words 
of all compound nouns with the syntactic relation 
'MODIFY'. A tuple has to have a preceding argument 
and a following head For example, from a compound 
noun with 4 words, 5 ambiguous tuples and 1 firm tu- 
pie can be extracted, because each element can be the 
argument m only one tuple. An initial credit of 1/3 
was set for each instance-tuple whose arguments are 
the first word of the compound noun. Similarly, a 
credit i/2 was set for each instance-tuple in which 
the second word is an argument. 
No word distance information was introduced in 
the first trial. Then the learning process was started. 
We have shown the results of first trial in Table 1 
and examples m Figure 2 
The results were classified as correct or incorrect 
etc.. 'Correct' means that a hypotbesis-tuple which 
has the highest plausibility value is the correct tu- 
pie within ambiguous tuples. 'Incorrect' means it 
isn't. 'ludefinite' means that plausibility values of 
some hypothesis-tuples have the same value. 'Un- 
certain' means that it is impossible to declare which 
hypothesis tuple is the best without context. 
j 4 tl 41 \[ r/ s r 1 I 
I 5 II 4 I o I o I 2 I 
Table I: Results of experiment after first ALPSC 
Tile clustering program produced 44 clusters based 
on the word distance data. a sample of the clusters is 
shown in Figure 3. The average number of words in 
a cluster was 3.43, Each produced cluster contained 
one to twenty five words. This is good number to 
treat manually. The human intervention to extract 
correct clusters resulted in 26 clusters being selected 
from 44 produced clusters. The average number of 
ACRES DE COLING-92, NANTES, 23-28 Aot'rr 1992 5 6 4 Paoc. oi: COLING-92, NAYrES, AUO. 23-28, 1992 
words in a cluster is 2.96, It took a linguist who is 
familiar with computer 15 minutes. A sample of the 
selected clusters is shown in Figure 4. 
These clusters were used for the second trial of 
ALPSC, The results of second trial are shown in Ta- 
ble 2. 
a0 I 3 ! f~ I 
s I ~/ i t 
0 L-- ° -1 ~ J \[ t~l-U-i~-I-TF- ~- -\[---TC--/ 
\[_~L(r4.1) L (21.1) I (- 4.8 ) _~_ (-)_A 
Table 2: Results of experiment after second trial 
5 Discussion 
The scenario described above embodies a part of our 
ideas. Several other experiments have already been 
conducted, based on other scenarios such a.q a see 
nario for finding clusters of nouns by which we can re- 
solve ambiguities caused by prepositional attachment 
in English. Though this works in a similar fashion 
as the one we discussed, it has to treat more serf 
ous structural ambiguities and store diverse syntactic 
structures. 
Though we have not compared them in detail, it 
call be expected that the organization of semantic 
clusters of nouns tbat emerge in these two scenarios 
will be different from each other. One reflects colloca- 
tional relations among nouns, while the other reflects 
tlmse between nouns and verbs. By merging these two 
scenarios into one larger scenario, we may be able to 
obtain more accurate or intuitively reasonable notnr 
clusters. We are planning to accumulate a number of 
SUCh scenarios and larger scenarios. We hope we can 
report it. soon. 
As t\)r tim result of the particular experiment in the 
previous section, one eighth of the incorrect results 
have progressed after one trial of the gradual approx- 
iruatiou. This is significant progress in the processing. 
For humans it wmdd be a tremendously laborious job 
as they would be required to examine all the results. 
What bunlans did in the experiment is simply divide 
the produced clasters. 
Although the clusters are produced by a no- 
overlapping clustering algorithm in this experiment, 
we are developing an overlapping clustering program. 
l\[opefulty it will produce clusters which involve the 
concept of word sense ambiguity. It will mean that 
a word can belong to several clusters at a time. 3"he 
method to produce overlapping clusters is one of our 
current research topics. 
Examining the results, we can say that the clus- 
ter effect is not enough to explain word relatious of 
compound nouns, q'here might be some structural 
and syntactic restrictions. This feature of compound 
nouns made it hard to get a higher percentage of cor- 
rect answers in our experiment. Extra-processing to 
address these problems can be introduced into our 
system. 
Because tile process concerns huge amount of lin- 
guistic data which also has ambiguity, it is inevitable 
to be experimental. A sort of repetitive progress is 
needed to make the system smarter. We will need to 
perform a lot of experiments in order to determine 
the type of the human intervention required, as there 
seems to be no means of deternfining this theoreti- 
cally. 
This system is aiming not to sinmlate human 
linguists who conveutionally have derived linguis- 
tic knowledge by computer, but to discover a new 
paradigm where automatic knowledge acquisition 
prograrns and human effort are effectively combined 
to generate linguistic knowledge. 
6 Acknowledgements 
Wc wouht like to thank our colleagues at the 
CCIo and Matsushita, in particular, Mr.a.Phillips, 
MrK.Kageura, Mr.P.Olivier and Mr.Y.Kamm, whose 
comments have been very usefifl. 
ACRES DE COL1NG-92, NANKS, 23-28 AoGr 1992 5 6 5 Pgoc. OF COLING-92, NAI, rtEs, AUG. 23-28, 1992 
~_Sentenees ) 
\[-° ..... I 
(c,.,o.) 
(,,,oo.,,on) 
Figure i. Diagram of the scenario 
{Tr4~(file).a~(high speed),:~(discource).~t,--7" 
(loop),~--7"(group).~-~&(table)._Zl~(duplieation). 
~(reverse).~l~surplus).x~-~(schedule). 
8~(pliority).~(paper).~(float).iE~(regular). 
:~-V(charaeter).t~(sentence-structure).~ 
(words-and-phrase).~-~2~, supposition).~(KkNk). 
~ll~(chinese-ehareter), -y~ ~e ~ 0 (directory). 
,~'.~T'/~(pop-up), ~y-- 
(letter), ~.~9 (back-),~(white-)} 
{~4~H(wild),~l-~(calculation)) 
{3:~-(error),~i~(infinite).~(wait-)) 
{c,Iggi(analysis),i'CJl~(object)) 
{:~-~(hard),~(main-),~t~(assistance).14T(finish)} 
{~;~(access),f'~A~(usage)} 
{~g4~-(exeeute),~POT~(refer)) 
I~(middZe-)} 
{~(connection)) 
(~(retrieval)} 
{~Zl~(regulation)} 
IiPl(for~ard-), 3W-(copy)\] 
{~/:i~(multiple).--~7~ (default)} 
{7~-~V(field)} 
{g_~(return).:Yr~(direction).~(~idth). 
~(full-).~(half-)} 
{~T,(display).~(change)} 
{±(upper-).~,~(management), ~,(delete)} 
{~(expression). 7,~V (font)} 
{~g(internal-),:~(grammar)} 
{~(specification),~(input).~2(output)} 
{-Y----:7'(tape)} 
{~#'-~(modification),~(creation),~bW(final-), 
~-~(mode)} 
{~(ite~)} 
(~(right).~(left)} 
Figure 3. Sample of Produced Clusters in first trial 
\[yJl~--7": group\] \[~-?: execu tel 
\[l~:pe~ission\] \[3~:chareter\] 
(~'n,--~" ~) O. 000 
(~%--7" i~) O. 002 (~L,-7" 3L~) O. 997 X 
(~ I~*q) ~. 000 o 
(~l~tY ~t'-l~) O. 000 
\[J:v-:error\] \[~Lw~:management\] \[n,-~-7:routin\] 
(~- ~) 0.505 o 
(~J~---~-7) 0.494 
\[~1~--7": loop\] \[147:finish\] \[-7-x t- : test\] 
Or,--v" 147") O. 02~ 
(147 -Y-~, ~) O. 976 x 
Figure 2. Sample of the Results 
{:~:r(eharacter),~(words-and-phrases),qi~(KANA), 
~(chinese--character), ~y-(letter)} 
I:~:l~(discourse),~(sentence-structure)} 
{vw~/~(file),-7~ ~y ~U (directory)} 
t~(maiu-),~PJ(assistanee)} 
{g(t~(execute),~(refer)} 
{~r~(direction),~(width)} 
{~(full-),~(half)} 
{~-~,(display).~(change)} 
I~Lu~(management).~J~,(delete)} 
{~l~(expression). 7~b(font)} 
1~,2(input).~2(output)} 
I~(modifieation).f~(creation)} 
(~i(right).~(left)} 
Figure 4. Sample of Selected clusters 
AcrEs DE COLING-92, NANTES, 23-28 AO~" 1992 5 6 6 PROCI OF COLING-92, NANTES, AUG. 23-28, 1992 

References

\[1\] Ralph Grishman: Discovery Procedures for Sub- 
language Selectional Patterns: Initial Experi- 
ments Comp. Linguistics Vol. 12 No.3 (1986) 

\[2\] Sofia Auauiadou: Sublanguage studies as the Ba- 
sis for Conqmter Support for Multilingua} Com- 
munication t'roceedin9 s of 7'ermplarL '90, Kuala 
Lumpur (1990) 

\[3\] A Zampolli: Reusable Linguistic Resources (In- 
vited paper) 5th Conference of the E.A.C.L. (199i) 

\[4\] Kenneth Ward Church: A Stochastic Parts Pro- 
gram and Noun Phrase Parser for Unrestricted 
Text 2nd Uon/erer~ce on A.N.L.P (1988) 

\[5\] Donald llindle anti Mats Rooth: Structural Am- 
biguity and l,exical Relations °~9th Conference of 
the A.C.L. (1991) 

\[6\] Smaja and McKeown: Automatically Extracting 
and Representing Collocations for language gen- 
eration 28th (7onfeve,~ce of tt*e A.C.L. (1991) 

\[7\] Uri Zernik and Paul aacobs: Tagging for Learn- 
rag: Collecting Thematic Relations from Corpus 
13th COLING-90 (1990) 

\[8\] S.Sekine, J.J.Carroll, S. Ananiadou, a.Tsujii: 
Automatic Learning for Semantic Collocation 
3rd Conference on ANLP (1992) 
