<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2017">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Analysis and Synthesis of the Distribution of Consonants over Languages: A Complex Network Approach</Title>
  <Section position="4" start_page="128" end_page="130" type="metho">
    <SectionTitle>
2 PlaNet: The Phoneme-Language Network
</SectionTitle>
    <Paragraph position="0"> We define the network of consonants and languages, PlaNet, as a bipartite graph G = &lt;VL, VC, E&gt;, where VL is the set of nodes labeled by the languages and VC is the set of nodes labeled by the consonants. E is the set of edges that run between VL and VC. There is an edge e ∈ E between two nodes vl ∈ VL and vc ∈ VC if and only if the consonant c occurs in the language l. Figure 1 illustrates the nodes and edges of PlaNet.</Paragraph>
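The bipartite definition above can be sketched in a few lines of Python. This is only an illustration: the three toy inventories are invented (they are not UPSID data), and the names `inventories` and `build_planet` are hypothetical.

```python
# Toy inventories, invented for illustration only (not UPSID data).
inventories = {
    "lang_A": {"p", "t", "k", "m", "n"},
    "lang_B": {"p", "t", "k", "s"},
    "lang_C": {"t", "k", "m", "n", "s", "l"},
}

def build_planet(inventories):
    """Return (VL, VC, E) for the bipartite language-consonant graph."""
    v_l = set(inventories)                      # language nodes
    v_c = set().union(*inventories.values())    # consonant nodes
    # An edge (l, c) exists if and only if consonant c occurs in language l.
    edges = {(l, c) for l, cs in inventories.items() for c in cs}
    return v_l, v_c, edges

v_l, v_c, edges = build_planet(inventories)
print(len(v_l), len(v_c), len(edges))  # prints: 3 7 15
```

Storing E as a set of (language, consonant) pairs keeps the two node sets disjoint by construction, which is what makes the graph bipartite.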
    <Section position="1" start_page="128" end_page="129" type="sub_section">
      <SectionTitle>
2.1 Construction of PlaNet
</SectionTitle>
      <Paragraph position="0"> Many typological studies of segmental inventories (Lindblom and Maddieson, 1988; Ladefoged and Maddieson, 1996; Hinskens and Weijer, 2003) have been carried out in the past on the UCLA Phonological Segment Inventory Database (UPSID) (Maddieson, 1984). UPSID initially had 317 languages and was later extended to 451 languages covering all the major language families of the world. In this work we have used the older version of UPSID, comprising 317 languages and 541 consonants (henceforth UPSID317), to construct PlaNet. Consequently, there are 317 elements (nodes) in the set VL and 541 elements (nodes) in the set VC. The number of elements (edges) in the set E, as computed from PlaNet, is 7022. To avoid any confusion in the construction of PlaNet, we have treated the anomalous and the ambiguous segments (Maddieson, 1984) as follows. We have completely ignored the anomalous segments (since the existence of such segments is doubtful), and included the ambiguous ones as separate segments because there are no descriptive sources explaining how such ambiguities might be resolved. A similar approach has also been described in Pericliev and Valdés-Pérez (2002).</Paragraph>
    </Section>
    <Section position="2" start_page="129" end_page="130" type="sub_section">
      <SectionTitle>
2.2 Degree Distribution of PlaNet
</SectionTitle>
      <Paragraph position="0"> The degree of a node u, denoted by ku, is defined as the number of edges connected to u. The term degree distribution denotes the way the degrees (ku) are distributed over the nodes (u). Degree distribution studies are important for understanding the complex topology of a large network, which is otherwise very difficult to visualize. Since PlaNet is bipartite, it has two degree distribution curves: one corresponding to the nodes in the set VL and the other corresponding to the nodes in the set VC.</Paragraph>
      <Paragraph position="1"> Degree distribution of the nodes in VL: Figure 2 shows the degree distribution of the nodes in VL, where the x-axis denotes the degree of each node expressed as a fraction of the maximum degree and the y-axis denotes the number of nodes having a given degree expressed as a fraction of the total number of nodes in VL.</Paragraph>
      <Paragraph position="2"> It is evident from Figure 2 that the number of consonants appearing in different languages follows a β-distribution (see footnote 2 and (Bulmer, 1979) for reference). The figure shows an asymmetric right-skewed distribution with the values of α and β equal to 7.06 and 47.64 respectively (obtained using the maximum likelihood estimation method). The asymmetry points to the fact that languages usually tend to have a smaller consonant inventory size, the best value being somewhere between 10 and 30. The distribution peaks roughly at 21, indicating that the majority of the languages in UPSID317 have a consonant inventory size of around 21 consonants.</Paragraph>
      <Paragraph position="4"> Figure 2: Degree distribution of the nodes in VL. The figure in the inner box is a magnified version of a portion of the original figure.</Paragraph>
      <Paragraph position="5"> Footnote 2: A random variable is said to have a β-distribution with parameters α &gt; 0 and β &gt; 0 if and only if its probability density function is given by $f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}$ for 0 &lt; x &lt; 1, and f(x) = 0 otherwise.</Paragraph>
      <Paragraph position="6"> Degree distribution of the nodes in VC: Figure 3 illustrates two different types of degree distribution plots for the nodes in VC; Figure 3(a) corresponding to the rank, i.e., the sorted order of degrees, (x-axis) versus degree (y-axis) and Figure 3(b) corresponding to the degree (k) (x-axis) versus Pk (y-axis) where Pk is the fraction of nodes having degree greater than or equal to k.</Paragraph>
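As a minimal sketch of the two plot types described above, the following Python snippet computes the rank-ordered degrees (as in Figure 3(a)) and the cumulative fraction Pk (as in Figure 3(b)) for the consonant nodes. The edge list is a toy example invented for illustration, and the helper name `cumulative_fraction` is hypothetical.

```python
from collections import Counter

# Toy edge list (language, consonant), invented for illustration only.
edges = [
    ("lang_A", "p"), ("lang_A", "t"), ("lang_A", "k"),
    ("lang_B", "t"), ("lang_B", "k"),
    ("lang_C", "t"), ("lang_C", "s"),
]

# Degree of a consonant node = number of languages in which it occurs.
degree = Counter(c for _, c in edges)

# Figure 3(a) style: degrees sorted in decreasing order, indexed by rank.
rank_degree = sorted(degree.values(), reverse=True)

def cumulative_fraction(degrees, k):
    """P_k: fraction of nodes whose degree is greater than or equal to k."""
    return sum(1 for d in degrees if d >= k) / len(degrees)

# Figure 3(b) style: P_k for every degree value k from 1 to the maximum.
p_k = {k: cumulative_fraction(rank_degree, k)
       for k in range(1, max(rank_degree) + 1)}

print(rank_degree)  # prints: [3, 2, 1, 1]
print(p_k)          # prints: {1: 1.0, 2: 0.5, 3: 0.25}
```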
      <Paragraph position="7"> Figure 3 clearly shows that both the curves have two distinct regimes and that the distribution is scale-free. Regime 1 in Figure 3(a) consists of 21 consonants which have a very high frequency of occurrence (i.e., a high degree k). Regime 2 of Figure 3(b) also corresponds to these 21 consonants.</Paragraph>
      <Paragraph position="8"> On the other hand, Regime 2 of Figure 3(a) as well as Regime 1 of Figure 3(b) comprise the rest of the consonants. The point marked as x in both the figures indicates the breakpoint. Each regime in both Figure 3(a) and (b) exhibits a power law of the form $y = A x^{-\alpha}$.</Paragraph>
      <Paragraph position="10"> In Figure 3(a), y represents the degree k of a node and x its rank, whereas in Figure 3(b) y corresponds to Pk and x to the degree k. The values of the parameters A and α for Regime 1 and Regime 2 in both the figures, as computed by the least-squares method, are shown in Table 1.</Paragraph>
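The least-squares fit mentioned above can be illustrated as follows, assuming the power law has the form y = A x^(-alpha) so that it becomes a straight line in log-log space. This is a sketch, not the authors' fitting procedure: the function name `fit_power_law` is hypothetical, the data points are synthetic, and in practice the fit would be run separately on the points of each regime.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log A - alpha * log x; returns (A, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx                     # slope of the log-log line
    intercept = mean_y - slope * mean_x   # equals log A
    return math.exp(intercept), -slope

# Synthetic points generated from y = 100 * x^(-0.5); not data from the paper.
xs = [1, 2, 4, 8, 16]
ys = [100 * x ** -0.5 for x in xs]
A, alpha = fit_power_law(xs, ys)
print(round(A, 3), round(alpha, 3))
```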
      <Paragraph position="11"> [Figure 3: Degree distribution plots for the nodes in VC in a log-log scale.] It is worth mentioning here that such power law distributions, often referred to as Zipf's law (Zipf, 1949), are also observed in an extraordinarily diverse range of phenomena, including the frequency of the use of words in human language (Zipf, 1949), the number of papers scientists write (Lotka, 1926), the number of hits on web pages (Adamic and Huberman, 2000), and so on. Thus our inferences, detailed in the next section, mainly center on this power law behavior.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="130" end_page="131" type="metho">
    <SectionTitle>
3 Inferences Drawn from the Analysis of PlaNet
</SectionTitle>
    <Paragraph position="0"> In most networked systems, such as society, the Internet, the World Wide Web, and many others, a power law degree distribution emerges from the phenomenon of preferential attachment, i.e., when "the rich get richer" (Simon, 1955). With reference to PlaNet, this preferential attachment can be interpreted as the tendency of a language to choose a consonant that has already been chosen by a large number of other languages. We posit that it is this preferential property of languages that results in the power law degree distributions observed in Figure 3(a) and (b).</Paragraph>
    <Paragraph position="1"> Nevertheless, one question remains unanswered. Whereas the power law distribution is well understood, the reason for the two distinct regimes (with a sharp break) remains unexplored. We hypothesize the following. Hypothesis: The typical distribution of the consonant inventory size over languages, coupled with the principle of preferential attachment, forces the two distinct regimes to appear in the power law curves.</Paragraph>
    <Paragraph position="2"> Since the average consonant inventory size in UPSID317 is 21, it follows from the principle of preferential attachment that, on average, the 21 most frequent consonants are much more preferred than the rest. Consequently, the nature of the frequency distribution for the highly frequent consonants is different from that of the less frequent ones, and hence there is a transition from Regime 1 to Regime 2 in Figure 3(a) and (b).</Paragraph>
    <Paragraph position="3"> Support Experiment: In order to establish that the consonant inventory size plays an important role in giving rise to the two regimes discussed above we present a support experiment in which we try to observe whether the breakpoint x shifts as we shift the average consonant inventory size.</Paragraph>
    <Paragraph position="4"> Experiment: In order to shift the average consonant inventory size from 21 to 25, 30 and 38, we neglected the contribution of the languages with consonant inventory size less than n, where n is 15, 20 and 25 respectively, and recorded the degree distributions obtained each time. We did not carry out our experiments for an average consonant inventory size of more than 38 because such languages are very rare in UPSID317.</Paragraph>
    <Paragraph position="5"> Observations: Figure 4 shows the effect of this shifting of the average consonant inventory size on the rank versus degree distribution curves. Table 2 presents the results observed from these curves, with the left column indicating the average inventory size and the right column the breakpoint x. The table clearly indicates that the transition occurs at values corresponding to the average consonant inventory size in each of the three cases. Inferences: It is quite evident from our observations that the breakpoint x has a strong correlation with the average consonant inventory size, which therefore plays a key role in the emergence of the two-regime degree distribution curves.</Paragraph>
    <Paragraph position="6"> In the next section we provide a simplistic mathematical model for explaining the two regime power law with a breakpoint corresponding to the average consonant inventory size.</Paragraph>
  </Section>
  <Section position="6" start_page="131" end_page="131" type="metho">
    <SectionTitle>
4 Theoretical Explanation for the Two Regimes
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="131" end_page="131" type="sub_section">
      <Paragraph position="0"> Let us assume that the inventory of every language comprises 21 consonants. We further assume that the consonants are arranged in their hierarchy of preference. A language traverses the hierarchy of consonants and at every step decides with a probability p to choose the current consonant. It stops as soon as it has chosen all 21 consonants. Since languages must traverse the first 21 consonants regardless of whether the previous consonants are chosen or not, the probability of choosing any one of these 21 consonants must be p. But the case is different for the 22nd consonant, which is chosen by a language only if it has previously chosen zero, one, two, or at most 20, but not all, of the first 21 consonants. Therefore, the probability of the 22nd consonant being chosen is $P(22) = p \sum_{i=0}^{20} P(i)$,</Paragraph>
      <Paragraph position="2"> where $P(i) = \binom{21}{i} p^{i} (1-p)^{21-i}$ denotes the probability of choosing i consonants from the first 21. In general, the probability of choosing the (n+1)th consonant from the hierarchy is $P(n+1) = p \sum_{i=0}^{20} \binom{n}{i} p^{i} (1-p)^{n-i}$.</Paragraph>
      <Paragraph position="4"> Figure 5 shows the plot of the function P(n) for various values of p which are 0.99, 0.95, 0.9, 0.85, 0.75 and 0.7 respectively in log-log scale. All the curves, for different values of p, have a nature similar to that of the degree distribution plot we obtained for PlaNet. This is indicative of the fact that languages choose consonants from the hierarchy with a probability function comparable to P(n).</Paragraph>
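The model described above can be evaluated numerically with a short sketch. It assumes the reading given in the text: each of the first 21 consonants is chosen with probability p, and a later consonant is chosen only if fewer than 21 of the preceding ones were chosen, which is a binomial tail. The function name `choice_probability` and its parameter names are hypothetical.

```python
from math import comb

def choice_probability(rank, p, size=21):
    """Probability that the consonant at position `rank` (1-based) is chosen.

    A consonant is reached with an unfilled inventory only if fewer than
    `size` of the preceding rank - 1 consonants were chosen.
    """
    n = rank - 1  # number of consonants already traversed
    if size > n:
        return p  # the inventory cannot be full yet
    # Probability that at most size - 1 of the n earlier consonants were chosen.
    not_full = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(size))
    return p * not_full

for rank in (1, 21, 22, 30):
    print(rank, choice_probability(rank, p=0.9))
```

For the 22nd consonant the sum reduces to 1 - p^21, so the value drops below p exactly where the text places the breakpoint.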
      <Paragraph position="5"> Owing to the simplifying assumption that all the languages have exactly 21 consonants, the first regime is a straight line; however, we believe that a more rigorous mathematical model, one that takes into account the β-distribution of the inventory size rather than just its mean value, could explain the negative slope of the first regime.</Paragraph>
      <Paragraph position="6"> We plan to build such a model as part of our future work. Here, instead, we investigate the effect of the exact distribution of the language inventory size on the nature of the degree distribution of the consonants through a synthetic approach based on the principle of preferential attachment, which is described in the subsequent section.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="131" end_page="133" type="metho">
    <SectionTitle>
5 The Synthesis Model based on Preferential Attachment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="131" end_page="133" type="sub_section">
      <Paragraph position="0"> Albert and Barabási (1999) observed that a common property of many large networks is that the vertex connectivities follow a scale-free power law distribution. They remarked that two generic mechanisms can be considered to be the cause of this observation: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites (vertices) that are already well connected. They found that a model based on these two ingredients reproduces the observed stationary scale-free distributions, which in turn indicates that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.</Paragraph>
      <Paragraph position="1"> Inspired by their work and by the empirical as well as the mathematical analysis presented above, we propose a preferential attachment model for synthesizing PlaNet (PlaNetsyn henceforth) in which the degree distribution of the nodes in VL is known. Hence the nodes in VL = {L1, L2, . . ., L317} have degrees (consonant inventory sizes) {k1, k2, . . ., k317} respectively. We assume that the nodes in the set VC are unlabeled. At each time step, a node Lj (j = 1 to 317) from VL tries to attach itself to a new node i ∈ VC to which it is not already connected. The probability Pr(i) with which the node Lj gets attached to i depends on the current degree of i and is given by $\Pr(i) = \frac{k_i + \epsilon}{\sum_{i' \in V_j} (k_{i'} + \epsilon)}$,</Paragraph>
      <Paragraph position="3"> where ki is the current degree of the node i, Vj is the set of nodes in VC to which Lj is not already connected, and ε is the smoothing parameter, which is used to reduce bias and favor at least a few attachments with nodes in Vj that do not have a high Pr(i). The above process is repeated until every Lj ∈ VL is connected to exactly kj nodes in VC. The entire idea is summarized in Algorithm 1. Figure 6 shows a partial step of the synthesis process illustrated in Algorithm 1.</Paragraph>
      <Paragraph position="4"> Simulation Results: Simulations reveal that for PlaNetsyn the degree distribution of the nodes belonging to VC fits well with the analytical results we obtained earlier in section 2. Good fits emerge for the range 0.06 ≤ ε ≤ 0.08, with the best being at ε = 0.0701. Figure 7 shows the degree k versus Pk plots for ε = 0.0701 averaged over 100 simulation runs.</Paragraph>
      <Paragraph position="6"> Algorithm 1:
repeat
  for j = 1 to 317 do
    if there is a node Lj ∈ VL with at least one consonant still to be chosen from VC then
      compute Vj = VC − V(Lj), where V(Lj) is the set of nodes in VC to which Lj is already connected;
      for each node i ∈ Vj do
        compute $\Pr(i) = \frac{k_i + \epsilon}{\sum_{i' \in V_j} (k_{i'} + \epsilon)}$, where ki is the current degree of the node i and ε is the model parameter; Pr(i) is the probability of connecting Lj to i;
      end
      connect Lj to a node i ∈ Vj following the distribution Pr(i);
    end
  end
until all languages complete their inventory quota.
Figure 6: A partial step of the synthesis process. When the language L4 has to connect itself with one of the nodes in the set VC, it does so with the one having the highest degree (=3) rather than with others, in order to achieve preferential attachment, which is the working principle of our algorithm.
Figure 7: Degree k versus Pk plots for the nodes in VC for PlaNetsyn, for PlaNet, and for the model with no preferential attachment; for PlaNetsyn, ε = 0.0701 and the results are averaged over 100 simulation runs.</Paragraph>
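The synthesis procedure can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation. It assumes that the attachment probability of a candidate consonant is proportional to its current degree plus the smoothing parameter, normalised over the candidates, that the smoothing parameter is positive, and that no inventory size exceeds the number of consonants. The function name `synthesize` and the toy input are hypothetical.

```python
import random

def synthesize(inventory_sizes, n_consonants, eps, seed=0):
    """Return consonant degrees after preferential attachment with smoothing eps."""
    rng = random.Random(seed)
    degree = [0] * n_consonants                # current degree k_i of each node in VC
    chosen = [set() for _ in inventory_sizes]  # V(Lj): consonants already linked to Lj
    pending = True
    while pending:                             # repeat until every quota is met
        pending = False
        for j, quota in enumerate(inventory_sizes):
            if len(chosen[j]) == quota:
                continue                       # Lj has completed its inventory
            # Vj: consonants to which Lj is not yet connected.
            candidates = [i for i in range(n_consonants) if i not in chosen[j]]
            # Unnormalised weights k_i + eps; random.choices normalises them.
            weights = [degree[i] + eps for i in candidates]
            pick = rng.choices(candidates, weights=weights, k=1)[0]
            chosen[j].add(pick)
            degree[pick] += 1
            pending = True
    return degree

# Hypothetical toy input: five languages and ten consonants (not UPSID317).
sizes = [3, 4, 5, 4, 3]
degrees = synthesize(sizes, n_consonants=10, eps=0.0701)
print(sorted(degrees, reverse=True))
```

Each pass of the outer loop adds at most one consonant per language, mirroring the time steps described in the text; the total degree at the end equals the sum of the inventory sizes.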
      <Paragraph position="7"> The mean error (see footnote 3) between the degree distribution plots of PlaNet and PlaNetsyn is 0.03, which intuitively signifies that on average the variation between the two curves is 3%. By contrast, if no preferential attachment were incorporated in the model (i.e., if all connections were equiprobable), the mean error would have been 0.35 (35% variation on average).</Paragraph>
    </Section>
  </Section>
</Paper>