<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1104">
  <Title>A Statistical Analysis of Morphemes in Japanese Terminology</Title>
  <Section position="3" start_page="0" end_page="638" type="metho">
    <SectionTitle>
2 Terminological Data
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="638" type="sub_section">
      <SectionTitle>
2.1 The Data
</SectionTitle>
      <Paragraph position="0"> We use a list of distinct terms as a sample and observe the quantitative nature of their constituent elements, or morphemes. Quantitative regularities are expected to be observable at this level for two reasons. First, a large proportion of terms are complex (Nomura &amp; Ishii, 1989), and their formation is systematic (Sager, 1990). Second, the quantitative nature of morphemes in terminology is independent of the token frequency of terms, because term formation is a lexical formation.</Paragraph>
      <Paragraph position="1"> With the correspondences between text and terminology, sentences and terms, and words and morphemes, the present work can be regarded as parallel to the quantitative study of words in texts (Baayen, 1991; Baayen, 1993; Mandelbrot, 1962; Simon, 1955; Yule, 1944; Zipf, 1935). Such terms as 'type', 'token', 'vocabulary', etc. will be used in this context.</Paragraph>
      <Paragraph position="2"> Two sets of Japanese terminological data are used in this study: computer science (CS: Aiso, 1993) and psychology (PS: Japanese Ministry of Education, 1986). The basic quantitative data are given in Table 1, where T, N, and V(N) indicate the number of terms, of running morphemes (tokens), and of different morphemes (types), respectively.</Paragraph>
      <Paragraph position="3"> In computer science, the frequencies of the borrowed and the native morphemes are not very different. In psychology, the borrowed morphemes constitute only slightly more than 10% of the tokens. In both domains, the mean frequency N/V(N) of the borrowed morphemes is much lower than that of the native morphemes.</Paragraph>
    </Section>
    <Section position="2" start_page="638" end_page="638" type="sub_section">
      <SectionTitle>
2.2 LNRE Nature of the Data
</SectionTitle>
      <Paragraph position="0"> The LNRE (Large Number of Rare Events) zone (Chitashvili &amp; Baayen, 1993) is defined as the range of sample size where the population events (different morphemes) are far from being exhausted. This is shown by the fact that the numbers of hapax legomena and of dislegomena are increasing (see Figure 1 for hapax).</Paragraph>
      <Paragraph position="1"> A convenient test of whether a sample is located in the LNRE zone is the ratio of loss of the number of morpheme types, calculated using the sample relative frequencies as estimates of the population probabilities. Assuming the binomial model, the ratio of loss is obtained by:</Paragraph>
      <Paragraph position="3"> where f(i, N) is the frequency of a morpheme w_i in a sample of size N;</Paragraph>
      <Paragraph position="4"> p(i, N) = f(i, N)/N is the sample relative frequency; m is the frequency class, i.e. the number of occurrences; and V(m, N) is the number of morpheme types occurring m times (the spectrum elements) in a sample of size N.</Paragraph>
      <Paragraph position="5"> In both data sets, we underestimate the number of morpheme types by more than 20% (CL in Table 1), which indicates that they are clearly located in the LNRE zone.</Paragraph>
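The displayed formula for the ratio of loss did not survive in this copy of the text. As a minimal sketch, assuming it takes the usual coefficient-of-loss form CL = (V(N) - E[V(N)]) / V(N), with E[V(N)] computed under the binomial model from p(i, N) = f(i, N)/N, the test could be computed as follows. The function names and the toy counts are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def spectrum_from_counts(freqs):
    """Map each frequency class m to V(m, N), the number of types seen m times."""
    return dict(Counter(freqs))


def ratio_of_loss(spectrum, n):
    """Assumed form: CL = (V(N) - E[V(N)]) / V(N).

    E[V(N)] is the expected number of types in a new sample of size n when
    the sample relative frequencies m / n are taken as the population
    probabilities under the binomial model.
    """
    v_observed = sum(spectrum.values())
    # A type with probability m / n is missed with probability (1 - m/n)**n,
    # so the expected number of lost types is the sum below.
    expected_missing = sum(v_m * (1 - m / n) ** n for m, v_m in spectrum.items())
    return expected_missing / v_observed


# Toy morpheme frequencies (hypothetical): many rare types, few frequent ones.
toy_freqs = [1] * 6 + [2] * 2 + [5, 5]
toy_n = sum(toy_freqs)  # N = 20 running morphemes
toy_cl = ratio_of_loss(spectrum_from_counts(toy_freqs), toy_n)
print(f"CL = {toy_cl:.3f}")
```

A large value of CL on real data would correspond to the situation described above, where a substantial share of the types would be lost if the sample frequencies were taken at face value.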
    </Section>
  </Section>
  <Section position="4" start_page="638" end_page="638" type="metho">
    <SectionTitle>
3 The LNRE Framework
</SectionTitle>
    <Paragraph position="0"> When a sample is located in the LNRE zone, values of statistical measures such as type-token ratio, the parameters of 'laws' (e.g. of Mandelbrot, 1962) of word frequency distributions, etc. change systematically according to the sample size, due to the unobserved events. To treat LNRE samples, therefore, the factor of sample size should be taken into consideration.</Paragraph>
    <Paragraph position="1"> Good (1953) gives a method of re-estimating the population probabilities of the types in the sample as well as estimating the probability mass of unseen types. There is also work on the estimation of the theoretical vocabulary size (Efron &amp; Thisted, 1976; National Language Research Institute, 1958; Tuldava, 1980). However, these methods do not provide a means of estimating values such as V(N) and V(m, N) for an arbitrary sample size, which is what we need. The LNRE framework (Chitashvili &amp; Baayen, 1993) offers means suitable for the present study.</Paragraph>
    <Section position="1" start_page="638" end_page="638" type="sub_section">
      <SectionTitle>
3.1 Binomial/Poisson Assumption
</SectionTitle>
      <Paragraph position="0"> Assume that there are S different morphemes w_i, i = 1, 2, ..., S, in the terminological population, with a probability p_i associated with each of them. Assuming the binomial distribution and its Poisson approximation, we can express the expected numbers of morphemes and of spectrum elements in a given sample of size N as follows:</Paragraph>
      <Paragraph position="2"> As our data are in the LNRE zone, we cannot estimate p_i. Good (1953) and Good &amp; Toulmin (1956) introduced a method of interpolating and extrapolating the number of types for an arbitrary sample size, but it cannot be used for extrapolating to a very large size.</Paragraph>
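Equations (1) and (2) are missing from this copy of the text. The sketch below assumes the standard forms, E[V(N)] = sum over i of (1 - (1 - p_i)^N) and E[V(m, N)] = sum over i of C(N, m) p_i^m (1 - p_i)^(N - m), together with the Poisson approximations obtained by replacing (1 - p_i)^N with exp(-N p_i). It only illustrates the computation for a small population whose probabilities are known; the probabilities and function names are hypothetical.

```python
import math


def expected_types(probs, n, poisson=False):
    """E[V(N)]: expected number of distinct types in a sample of size n."""
    if poisson:
        return sum(1 - math.exp(-n * p) for p in probs)
    return sum(1 - (1 - p) ** n for p in probs)


def expected_spectrum(probs, n, m, poisson=False):
    """E[V(m, N)]: expected number of types occurring exactly m times."""
    if poisson:
        return sum((n * p) ** m * math.exp(-n * p) / math.factorial(m) for p in probs)
    return sum(math.comb(n, m) * p ** m * (1 - p) ** (n - m) for p in probs)


# Hypothetical population of S = 4 morphemes with known probabilities.
toy_probs = [0.4, 0.3, 0.2, 0.1]
for size in (5, 50):
    print(size,
          round(expected_types(toy_probs, size), 3),
          round(expected_types(toy_probs, size, poisson=True), 3),
          round(expected_spectrum(toy_probs, size, 1), 3))
```

In the LNRE zone the probabilities p_i are not available, which is why such direct sums cannot be applied to the terminological data themselves.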
    </Section>
    <Section position="2" start_page="638" end_page="638" type="sub_section">
      <SectionTitle>
3.2 The LNRE Models
</SectionTitle>
      <Paragraph position="0"> Assume that the distribution of grouped probability p follows a distribution 'law', which can be expressed by some structural type distribution $G(p) = \sum_{i=1}^{S} I[p_i \geq p]$, where $I = 1$ when $p_i \geq p$ and 0 otherwise. Using G(p), the expressions (1) and (2) can be re-expressed as follows:</Paragraph>
      <Paragraph position="2"> same value and indexed by the subscript j that indicates in ascending order the values of p.</Paragraph>
      <Paragraph position="3"> When an explicit expression such as the lognormal 'law' (Carroll, 1967) is used for G(p), we again face the problem that the parameters of these 'laws' depend on the sample size. To overcome the problem, a certain distribution model for the population is assumed, which manifests itself as one of the 'laws' at a pivotal sample size Z. By explicitly incorporating Z as a parameter, the models can be completed, and it becomes possible (i) to represent the distribution of population probabilities by means of G(p) with Z and to estimate the theoretical vocabulary size, and (ii) to interpolate and extrapolate V(N) and V(m, N) to an arbitrary sample size N, by an expression such as: $E[V(m, N)] = \int_0^\infty \frac{\left(\frac{N}{Z}(Zp)\right)^m}{m!} e^{-\frac{N}{Z}(Zp)} \, dG(p)$. The parameters of the model, i.e. the original parameters of the 'laws' of word frequency distributions and the pivotal sample size Z, are estimated by looking for the values that best describe the distributions of spectrum elements and the vocabulary size at the given sample size. In this study, four LNRE models were tried, which incorporate the lognormal 'law' (Carroll, 1967), the inverse Gauss-Poisson 'law' (Sichel, 1986), Zipf's 'law' (Zipf, 1935) and the Yule-Simon 'law' (Simon, 1955).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="638" end_page="641" type="metho">
    <SectionTitle>
4 Analysis of Terminology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="638" end_page="638" type="sub_section">
      <SectionTitle>
4.1 Random Permutation
</SectionTitle>
      <Paragraph position="0"> Unlike in texts, the order of terms in a given terminological sample is basically arbitrary. Thus term-level random permutation can be used to obtain better descriptions of sub-samples.</Paragraph>
      <Paragraph position="1"> In the following, we use the results of 1000 term-level random permutations for the empirical descriptions of sub-samples.</Paragraph>
      <Paragraph position="2"> In fact, the results of the term-level and morpheme-level permutations almost coincide, with no statistically significant difference. From this we can conclude that the binomial/Poisson assumption of the LNRE models in the previous section holds for the terminological data.</Paragraph>
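The permutation procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: each term is a list of morphemes, the list of terms is shuffled, and the running numbers of tokens N and types V(N) are recorded at chosen numbers of terms and averaged over permutations. The function names, the toy terms and the default seed are assumptions.

```python
import random


def growth_points(terms, cut_points, rng):
    """Shuffle the terms once and return (N, V(N)) after each cut point.

    terms is a list of terms, each a list of morphemes. cut_points must be
    increasing numbers of terms between 1 and len(terms).
    """
    order = list(terms)
    rng.shuffle(order)
    seen = set()
    tokens = 0
    points = []
    wanted = set(cut_points)
    for count, term in enumerate(order, start=1):
        tokens += len(term)
        seen.update(term)
        if count in wanted:
            points.append((tokens, len(seen)))
    return points


def mean_growth_curve(terms, cut_points, permutations=1000, seed=0):
    """Average N and V(N) at each cut point over term-level permutations."""
    rng = random.Random(seed)
    totals = [[0.0, 0.0] for _ in cut_points]
    for _ in range(permutations):
        for slot, (n, v) in zip(totals, growth_points(terms, cut_points, rng)):
            slot[0] += n
            slot[1] += v
    return [(n / permutations, v / permutations) for n, v in totals]


# Hypothetical toy terminology of four two-morpheme terms.
toy_terms = [["data", "base"], ["data", "structure"],
             ["file", "structure"], ["file", "system"]]
print(mean_growth_curve(toy_terms, [2, 4], permutations=200))
```

A morpheme-level permutation would instead shuffle the flattened list of morphemes; comparing the two averaged curves is one way to check the randomness assumption discussed above.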
    </Section>
    <Section position="2" start_page="638" end_page="640" type="sub_section">
      <SectionTitle>
4.2 Quantitative Measures
</SectionTitle>
      <Paragraph position="0"> Two measures are used for observing the dynamics of morphemes in terminology. The first is the mean frequency of morphemes: $X(V(N)) = N / V(N)$</Paragraph>
      <Paragraph position="2"> As the samples consist of term types, the repeated occurrence of a morpheme indicates that it is used as a constituent element of several terms. As it is unlikely that the same morpheme occurs twice in a term, the mean frequency indicates the average number of terms that are connected by a common morpheme.</Paragraph>
      <Paragraph position="3"> A more important measure is the growth rate, P(N). If we observe E[V(N)] for changing N, we obtain the growth curve of the morpheme types. The slope of the growth curve gives the growth rate. By taking the first derivative of E[V(N)] given by equation (3), therefore, we obtain the growth rate of the morpheme types: $P(N) = \frac{d}{dN} E[V(N)] = \frac{E[V(1, N)]}{N}$ (6). This &quot;expresses in a very real sense the probability that new types will be encountered when the ... sample is increased&quot; (Baayen, 1991). For convenience, we introduce the notation for the complement of P(N), the reuse ratio: $R(N) = 1 - P(N)$</Paragraph>
      <Paragraph position="5"> which expresses the probability that the existing types will be encountered.</Paragraph>
      <Paragraph position="6"> For each type of morpheme, there are two ways of calculating P(N). The first is on the basis of the total number of the running morphemes (frame sample). For the borrowed morphemes, for instance, it is defined as:</Paragraph>
      <Paragraph position="8"> The second is on the basis of the number of running morphemes of each type (item sample).</Paragraph>
      <Paragraph position="9"> For instance, for the borrowed morphemes:</Paragraph>
      <Paragraph position="11"> Correspondingly, the reuse ratio R(N) is also defined in two ways.</Paragraph>
      <Paragraph position="12"> Pi reflects the growth rate of the morphemes of each type observed separately. Each of them expresses the probability of encountering a new morpheme for the separate sample consisting of the morphemes of the same type, and does not in itself indicate any characteristics in the frame sample.</Paragraph>
      <Paragraph position="13">  On the other hand, Pf and Rf express the quantitative status of the morphemes of each type as a mass in terminology. So the transitions of Pf and Rf, with changing N, express the changes of the status of the morphemes of each type in the terminology. In terminology, Pf can be interpreted as the probability of incorporating new conceptual elements.</Paragraph>
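The displayed definitions of the frame-sample and item-sample growth rates are missing from this copy. The sketch below assumes, from the verbal descriptions, that both use the number of hapax legomena of one origin class as numerator, with Pf dividing by the total number of running morphemes and Pi by the number of running morphemes of that class. Observed hapax counts stand in for the expectation E[V(1, N)], and the function names and toy tokens are hypothetical.

```python
from collections import Counter


def hapax_count(tokens):
    """V(1, N): number of types that occur exactly once in tokens."""
    return sum(1 for freq in Counter(tokens).values() if freq == 1)


def growth_rates(tagged_tokens, origin):
    """Return (Pf, Pi) for the morphemes of one origin class.

    Pf divides by the size of the whole (frame) sample, Pi by the number
    of running morphemes of the class itself (item sample).
    """
    frame_size = len(tagged_tokens)
    items = [morph for morph, tag in tagged_tokens if tag == origin]
    hapaxes = hapax_count(items)
    return hapaxes / frame_size, hapaxes / len(items)


def overall_rates(tagged_tokens):
    """Return (P(N), R(N)) for all morphemes, with R(N) = 1 - P(N)."""
    p = hapax_count([morph for morph, _ in tagged_tokens]) / len(tagged_tokens)
    return p, 1 - p


# Hypothetical running morphemes tagged by origin.
toy_tokens = [("data", "borrowed"), ("kozo", "native"),
              ("kozo", "native"), ("file", "borrowed")]
print(growth_rates(toy_tokens, "borrowed"))
print(growth_rates(toy_tokens, "native"))
print(overall_rates(toy_tokens))
```

With these assumed definitions, Pf for a class can never exceed Pi for the same class, because the frame sample is at least as large as the item sample.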
    </Section>
    <Section position="3" start_page="640" end_page="641" type="sub_section">
      <SectionTitle>
4.3 Application of LNRE Models
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the results of applying the LNRE models, for the models whose mean square errors of V(N) and V(1, N) are minimal over 40 equally spaced intervals of the sample. Figure 1 shows the growth curves of the morpheme types up to the original sample size (LNRE estimations by lines and the empirical values by dots). According to Baayen (1993), a good lognormal fit indicates high productivity, and a large Z in the Yule-Simon model also means richness of the vocabulary. Figure 1 and the chosen models in Table 2 confirm these interpretations. From Figure 1, it is observed that the number of borrowed morpheme types in computer science exceeds that of the native morphemes around N = 15000, while in psychology the number of borrowed morpheme types is much smaller within the given sample range. All the elements are still growing, which implies that the quantitative measures keep changing.</Paragraph>
      <Paragraph position="1"> Figure 2 shows the empirical and LNRE estimation of the spectrum elements, for m = 1 to 10. In both domains, the differences between V(1, N) and V(2, N) of the borrowed morphemes are bigger than those of the native morphemes.</Paragraph>
      <Paragraph position="2"> Both the growth curves in Figure 1 and the distributions of the spectrum elements in Figure 2 show, at least to the eye, reasonable fits of the LNRE models. In the discussions below, we assume that the LNRE-based estimations are valid within a reasonable range of N. The statistical validity will be examined later.</Paragraph>
      <Paragraph position="3"> As the population numbers of morphemes are estimated to be finite with the exception of the borrowed morphemes in psychology, $\lim_{N \to \infty} X(V(N)) = \infty$, which is not of much interest. More important and interesting is the actual transition of the mean frequencies within a realistic range of N, because the size of a terminology in practice is expected to be limited.</Paragraph>
      <Paragraph position="4"> Figure 3 shows the transitions of X(V(N)), based on the LNRE models, up to 2N in computer science and 5N in psychology, plotted according to the size of the frame sample. The mean frequencies are consistently higher in computer science than in psychology. Around N = 70000, X(V(N)) in computer science is expected to be 10, while in psychology it is 9.</Paragraph>
      <Paragraph position="7"> The particularly low value of $X(V(N_{borrowed}))$ in psychology is also notable.</Paragraph>
      <Paragraph position="8"> Figure 4 shows the values of Pf, Pi and Rf, for the same range of N as in Figure 3. The values of $Pi_{b}(N)$ and $Pi_{n}(N)$ in both domains show that, in general, the borrowed morphemes are more 'productive' than the native morphemes, though the actual value depends on the domain. Comparing the two domains by $Pf_{all}(N)$, we can observe that at the beginning the terminology of psychology relies more on new morphemes than that of computer science, but the values are expected to become about the same around N = 70000.</Paragraph>
      <Paragraph position="9"> The values of Pf for the borrowed and native morphemes show interesting characteristics in each domain. In computer science, at a relatively early stage of terminological growth (i.e. $N \approx 3500$), the borrowed morphemes begin to take the bigger role in incorporating new conceptual elements. In psychology, $Pf_{b}(N)$ is expected to become bigger than $Pf_{n}(N)$ around N = 47000. As the model estimates the population number of the borrowed morphemes to be infinite in psychology, it is logically expected that $Pf_{b}(N)$ becomes bigger than $Pf_{n}(N)$ at some stage. What is important here is that, even in psychology, where the overall role of the borrowed morphemes is marginal, $Pf_{b}(N)$ is expected to become bigger around N = 47000, i.e. $T \approx 21000$, which is well within a realistic range for a possible terminological size.</Paragraph>
      <Paragraph position="10"> Unlike Pf, the values of Rf show stable transitions beyond N = 20000 in both domains, gradually approaching the relative token frequencies.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="641" end_page="642" type="metho">
    <SectionTitle>
5 Theoretical Validity
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="641" end_page="642" type="sub_section">
      <SectionTitle>
5.1 Linguistic Validity
</SectionTitle>
      <Paragraph position="0"> We have seen that the LNRE models offer a useful means to observe the dynamics of morphemes, beyond the sample size. As mentioned, what is important in terminological analyses is to obtain the patterns of transitions of some characteristic quantities beyond the sample size but still within the realistic range, e.g. 2N, 3N, etc. Because we have been concerned with the morphemes as a mass, we could safely use N instead of T to discuss the status of morphemes,  implicitly assuming that the average number of constituent morphemes in a term is stable.</Paragraph>
      <Paragraph position="1"> Among the measures we used in the analysis of morphemes, the most important is the growth rate. The growth rate as a measure of the productivity of affixes (Baayen, 1991) was critically examined by van Marle (1991). One of his essential points was the relation between the performance-based measure and the competence-based concept of productivity. As the growth rate is by definition a performance-based measure, it is not unnatural that a competence-based interpretation of the performance-based productivity measure is called for when the object of the analysis is directly related to such a competence-oriented notion as derivation. In terminology, however, this is not the case, because the notion of terminology is essentially performance-oriented (Kageura, 1995). The growth rate, which concerns linguistic performance, directly reflects the inherent nature of terminological structure [1].</Paragraph>
      <Paragraph position="2"> One thing that may also have to be accounted for is the influence of the starting sample size. Although we assumed that the order of terms in a given terminology is arbitrary, this may not be the case, because a smaller sample may well include more 'central' terms. We may need further study concerning the status of the available terminological corpora.</Paragraph>
    </Section>
    <Section position="2" start_page="642" end_page="642" type="sub_section">
      <SectionTitle>
5.2 Statistical Validity
</SectionTitle>
      <Paragraph position="0"> Figure 5 plots the values of the z-score for E[V] and E[V(1)], for the models used in the analyses, at 20 equally spaced intervals for the first half of the sample [2]. In psychology, all but one of the values are within the 95% confidence interval.</Paragraph>
      <Paragraph position="1"> In computer science, however, the fit is not as good as in psychology.</Paragraph>
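The exact definition of the z-score is not given in the surviving text. As a rough sketch only, assuming that a model estimate is standardised against the mean and standard deviation of the corresponding values obtained from the term-level permutations, the check against the 95% band could look like this. The function names and the toy numbers are hypothetical.

```python
import statistics


def z_score(model_value, permuted_values):
    """Standardised distance of a model estimate from permutation results."""
    centre = statistics.mean(permuted_values)
    spread = statistics.stdev(permuted_values)  # sample standard deviation
    return (model_value - centre) / spread


def within_95_percent(z):
    """True when z lies inside the two-sided 95% band of a standard normal."""
    return 1.96 >= abs(z)


# Hypothetical values of V(N) at one sub-sample size over five permutations,
# compared with a hypothetical model estimate.
toy_permuted = [1510, 1495, 1502, 1508, 1490]
toy_z = z_score(1504.0, toy_permuted)
print(round(toy_z, 3), within_95_percent(toy_z))
```

Under this assumption, a point outside the band would mean that the model estimate differs from the permutation average by more than chance variation across permutations would suggest.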
      <Paragraph position="2"> Table 3 shows the $\chi^2$ values calculated on the basis of the first 15 spectrum elements at the original sample size. Unfortunately, the $\chi^2$ values show that the fits obtained by the models are not ideal, and the null hypothesis [1] Note, however, that the level of what is meant by the word 'performance' is different, as Baayen (1991) is text-oriented, while here it is vocabulary-oriented.</Paragraph>
      <Paragraph position="3"> Unlike in the case of texts (Baayen, 1996a; 1996b), the ill fits of the growth curves of the models are not caused by the randomness assumption of the model, because the results of the term-level permutations, used for calculating z-scores, are statistically identical to the results of morpheme-level permutations. This implies that we need better models if we pursue better curve fitting. On the other hand, if we emphasise the theoretical assumption of the models of frequency distributions used in the LNRE analyses, it is necessary to introduce finer distinctions of morphemes.</Paragraph>
    </Section>
  </Section>
</Paper>