<?xml version="1.0" standalone="yes"?> <Paper uid="E95-1008"> <Title>Collocation Map for Overcoming Data Sparseness</Title> <Section position="3" start_page="53" end_page="55" type="metho"> <SectionTitle> 2 Collocation Map </SectionTitle>
<Paragraph position="0"> In this section we give a brief introduction to the Collocation map; we refer to (Han 1993) for more discussion of its definition and to (Han 1995) for the inference methods.</Paragraph>
<Paragraph position="1"> A Bayesian model consists of a network and probability tables defined on the nodes of the network. The nodes in the network represent probabilistic variables of a problem domain. The network can compute the probabilistic dependency between any combination of the variables. The model is well documented as subjective probability theory (Pearl 1988).</Paragraph>
<Paragraph position="2"> The Collocation map is an application of the sigmoid belief network (Neal 1992), which belongs to the class of belief networks, themselves a type of Bayesian model. Unlike general belief networks, the Collocation map has no deterministic variables; it consists only of probabilistic variables, which in this case correspond to words.</Paragraph>
<Paragraph position="3"> A sigmoid belief network differs from other belief networks in that it does not keep a probability distribution table at each node but instead puts weights on the edges between nodes. A node takes binary outcomes (1, -1), and the probability that a node takes an outcome given the vector of outcomes of its preceding nodes is a sigmoid function of those outcomes and the weights of the associated edges. In this regard, the sigmoid belief network resembles an artificial neural network. In ordinary Bayesian models such probabilities are stored at the nodes, which makes inference very difficult because the probability tables can be very large. The sigmoid belief network avoids this NP-hard complexity by giving up the tables, at the cost of the expressive generality of the probability distributions that tables can encode.</Paragraph>
<Paragraph position="4"> One who works with the Collocation map has to deal with two problems. The first is how to construct the network, and the other is how to compute probabilities on the network.</Paragraph>
<Paragraph position="5"> The network can be constructed directly from a set of bigrams obtained from a training sample. Because the Collocation map is a directed acyclic graph, an additional node is introduced for a word when an edge would otherwise create a cycle through that word's node. No more than two nodes per word are needed to avoid cycles in any case (Han 1993). Once the network is set up, the edges are assigned weights that are the normalized frequencies of the edges at a node.</Paragraph>
<Paragraph position="8"> Inference on the Collocation map is no different from inference on a sigmoid belief network. The time complexity of inference by graph reduction on sigmoid belief networks is O(N^3) for N nodes (Han 1995). It turned out that inference on networks containing more than a few hundred nodes was not practical using either the node reduction method or the sampling method, so we adopted a hybrid inference method that first reduces the network and then applies Gibbs sampling (Han 1995). Using the hybrid inference method, computation of conditional probabilities took less than a second for a network with 50 nodes, two seconds for a network with 100 nodes, about nine seconds for a network with 200 nodes, and about two minutes for a network with about 1000 nodes.</Paragraph>
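To make the node probabilities concrete, the following is a minimal sketch, written in C since the authors' programs were coded in C, of the conditional probability a sigmoid belief network assigns to a node given the outcomes of its preceding nodes and the edge weights. It is our own illustration, not the paper's code: the function names are invented, the outcomes are taken as +1/-1 as stated above, and no bias term is included.

#include <stdio.h>
#include <math.h>

/* Logistic (sigmoid) function. */
static double sigmoid(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

/* P(node = 1 | outcomes of its preceding nodes).
 * outcomes[i] is +1 or -1, the outcome of the i-th preceding node;
 * weights[i] is the weight on the edge from that node;
 * n is the number of preceding nodes. */
static double prob_node_on(const double *weights, const int *outcomes, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += weights[i] * outcomes[i];
    return sigmoid(sum);
}

int main(void)
{
    /* Two preceding nodes with illustrative weights and outcomes. */
    double w[] = { 0.8, -0.3 };
    int s[] = { 1, -1 };
    printf("P(node = 1 | parents) = %.3f\n", prob_node_on(w, s, 2));
    return 0;
}

The probability of the complementary outcome is one minus this value, and the joint probability of a full assignment factors into one such term per node, following the topological order of the map.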
<Paragraph position="9"> Conditional and marginal probabilities can be approximated by Gibbs sampling. Some conditional probabilities computed from a small network are shown in figure 1. Though the network may not be big enough to model the domain of finance, the resulting values from the small network composed of 9 dependencies seem useful and intuitive. The computation in figure 1 was done using the graph reduction method. As the example inferences show, the association between any combination of variables can be measured.</Paragraph> </Section>
<Section position="4" start_page="55" end_page="55" type="metho"> <SectionTitle> 3 Experiments </SectionTitle>
<Paragraph position="0"> The goal of our experiments is, first, to find how data sparseness is related to frequency-based statistics and, second, to show that the Collocation map based method gives more reliable approximations. In particular, from the experiments we observed that the variances of the statistics may suggest the level of data sparseness. The less frequent data tended to have higher variances, although the values of the statistics themselves (mutual information, for instance) did not distinguish the level of occurrences. The predictive power of the Collocation map is demonstrated by observing the variances of the approximations for the infrequent events.</Paragraph>
<Paragraph position="1"> The tagged Wall Street Journal articles of the Penn Treebank corpus, which contain about 2.6 million word units, were used; in the experiments, about 1.2 million of them were used. Programs were coded in C and run on a Sun Sparc 10 workstation. For the first 1.2 million words, the bigrams consisting of four types of categories (NN, NNS, IN, JJ) were obtained, and the mutual information of each bigram (order insensitive) was computed. The bigrams were classified into 200 sets according to their occurrences. Figure 2 summarizes the average MI value and the variance for each frequency range. Figure 3, which shows the occurrence distribution of the 378,888 unique bigrams, indicates that about 70% of them occur only once. One interesting and important observation is that the bigrams in the 1 to 3 frequency range, which make up about 90% of the population, have very high MI values. This result also agrees with Dunning's argument that infrequent occurrences are overestimated, that is, many infrequent pairs tend to receive high estimates (Dunning 1993). According to Dunning (1993), the problem is due to the assumption of normality in naive frequency-based statistics. The approximated values thus do not indicate the level of data quality. Figure 3 shows that the variances can suggest the level of data sufficiency. From this observation we propose the following definition of the notion of data sparseness.</Paragraph>
<Paragraph position="2"> A set of units belonging to a sample of ordered word units (texts) is α data-sparse if and only if the variance of the measurements on the set is greater than α.</Paragraph>
<Paragraph position="3"> The definition sets the concept of sparseness within the context of a focused set of linguistic units. For a set of units unobserved in a sample, the given sample text is certainly data-sparse. The above definition then gives a way to judge sparseness with respect to observed units. How to measure data sparseness is itself a good issue to study and may depend on the research context.</Paragraph>
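As an illustration of how the definition above might be applied in practice, the following sketch computes an order-insensitive mutual information value for each bigram in a focused set and flags the set as α data-sparse when the variance of those measurements exceeds the threshold α. It is our own example rather than the paper's code; the counts, the threshold value, and all function names are assumed, and the bigram counts are taken to be pooled over both word orders.

#include <stdio.h>
#include <math.h>

/* Order-insensitive pointwise mutual information of a bigram (w1, w2):
 * log2( P(w1, w2) / (P(w1) P(w2)) ), with probabilities estimated as
 * relative frequencies in the sample. */
static double bigram_mi(double pair_count, double c1, double c2, double total)
{
    double p12 = pair_count / total;
    double p1 = c1 / total;
    double p2 = c2 / total;
    return log(p12 / (p1 * p2)) / log(2.0);
}

/* A set of units is alpha data-sparse iff the variance of the
 * measurements (here MI values) on the set exceeds alpha. */
static int is_alpha_data_sparse(const double *mi, int n, double alpha)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++)
        mean += mi[i];
    mean /= n;
    for (int i = 0; i < n; i++)
        var += (mi[i] - mean) * (mi[i] - mean);
    var /= n;
    return var > alpha;
}

int main(void)
{
    /* Illustrative counts for four bigrams in a hypothetical sample. */
    double total  = 1200000.0;                  /* word units in the sample */
    double pair[] = { 3.0, 1.0, 2.0, 1.0 };     /* bigram counts            */
    double c1[]   = { 40.0, 12.0, 55.0, 9.0 };  /* first-word counts        */
    double c2[]   = { 25.0, 7.0, 30.0, 14.0 };  /* second-word counts       */
    double mi[4];
    for (int i = 0; i < 4; i++)
        mi[i] = bigram_mi(pair[i], c1[i], c2[i], total);

    double alpha = 2.0;                         /* assumed threshold        */
    printf("alpha data-sparse: %s\n",
           is_alpha_data_sparse(mi, 4, alpha) ? "yes" : "no");
    return 0;
}

In this reading, the measurement (here MI) can be swapped for any statistic of interest; what the definition fixes is only the variance test against the chosen α.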
<Paragraph position="4"> Here we suggest a simple method, perhaps for the first time in the literature.</Paragraph>
<Paragraph position="5"> Figure 4 compares the results of using the Collocation map with those of the simple frequency statistic. The variances are smaller, and the pairs in the frequency-1 class have nonzero approximations. Because the computational cost of the Collocation map is very high, we chose 2000 unique pairs at random. The network consists of 988 nodes, and computing an approximation (inference) took about 3 minutes.</Paragraph>
<Paragraph position="6"> The test size of 2000 pairs may not be sufficient, but it showed a consistent tendency of graceful degradation of the variances. The overestimation problem was not significant in the approximations by the Collocation map. The average value of the zero-frequency class, to which 50 unobserved pairs belong, was also on the line of smooth degradation; figure 4 shows only the variance.</Paragraph>
<Paragraph position="7"> Table 1 summarizes the details of the performance gain from using the Collocation map.</Paragraph> </Section> </Paper>