<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0613"> <Title>Learning Word Meanings and Descriptive Parameter Spaces from Music</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Automatically Uncovering Description </SectionTitle>
<Paragraph position="0"> We propose an unsupervised model of language feature collection based on description by observation; that is, we learn target classifications by reading about the musical artists in reviews and discussions.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Community Metadata </SectionTitle>
<Paragraph position="0"> Our model is called community metadata (Whitman and Lawrence, 2002) and has been used successfully in style detection (Whitman and Smaragdis, 2002) and artist similarity prediction (Ellis et al., 2002). It creates a machine-understandable representation of artist description by searching the Internet for the artist name and performing light natural language processing on the retrieved pages. We split the returned documents into classes encompassing n-grams (terms of word length n), adjectives (using a part-of-speech tagger (Brill, 1992)), and noun phrases (using a lexical chunker (Ramshaw and Marcus, 1995)). Each retrieved {artist, term} pair is given an associated salience weight, which indicates the relative importance of term as associated with artist. These saliences are computed using a variant of the popular TF-IDF measure, Gaussian-weighted to avoid highly specific and highly general terms. (See Table 2 for an example.) One important feature of community metadata is its time-sensitivity; terms can be crawled once a week, and we can take into account trajectories of community-level opinion about certain artists.</Paragraph>
<Paragraph position="1"> Although it is tempting, since each (albeit unaware) member of the networked community provides a small bit of information and description about the artist in question, we are reluctant to claim that the community metadata vectors computationally approach the "linguistic division of labor" proposed in (Putnam, 1987). We feel that the heavily biased opinion extracted from the Internet is best treated as an approximation of a 'ground truth description.' Factorizing the Internet community into relatively coherent smaller communities to obtain sharpened lexical groundings is left for future work. However, we do find that the huge amount of information we retrieve from these crawls averages out to a good general picture of the artists.</Paragraph>
</Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Time-Aware Machine Listening </SectionTitle>
<Paragraph position="0"> We aim for a representation of audio content that captures as much perceptual content as possible and ask the system to find patterns on its own. Our representation is based on the MPEG-7 (Casey, 2001) standard for content understanding and metadata organization [1]. The result of an MPEG-7 encoding is a discrete state number l (l = 1...n) for each 1/100th of a second of input audio. We histogram the state visits into counts for each n-second piece of audio.</Paragraph>
<Paragraph position="1"> [1] Our audio representation is fully described in (Whitman et al., 2003).</Paragraph>
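<Paragraph position="2"> To make the histogramming step concrete, here is a minimal sketch (our illustration, not the authors' code; the state sequence, state count, and window length are assumed inputs):

import numpy as np

def state_histograms(states, n_states, frames_per_window):
    # states: one integer state label (0-indexed here, rather than the
    # paper's 1...n) per 1/100th-second frame of input audio.
    # Returns one normalized state-visit histogram per window.
    n_windows = len(states) // frames_per_window
    features = np.zeros((n_windows, n_states))
    for w in range(n_windows):
        window = states[w * frames_per_window:(w + 1) * frames_per_window]
        counts = np.bincount(window, minlength=n_states)
        features[w] = counts / counts.sum()
    return features
</Paragraph>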
</Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Relating Audio to Description </SectionTitle>
<Paragraph position="0"> Given an audio model and a text model, we next discuss how to discover relationships between them. The approach we use is the same as in our previous work: we cast the problem as multi-class classification. Our input observations are the audio-derived features, and in training, each audio feature is associated with some salience weight for each of the 200,000 possible terms that our community metadata crawler discovered. In a recent test, training 703 separate SVMs on a small adjective set in the frame-based single-term system took over 10 days.</Paragraph>
<Paragraph position="1"> For most machine learning classifiers, training time depends on the number of classes. Furthermore, due to the unsupervised and automatic nature of the description classes, many are incorrect (such as when an artist is wrongly described) or unimportant (as in the case of terms such as 'talented' or 'cool', which are meaningless in the audio domain). Lastly, because the decision space over the entire artist space is so large, most class outputs are negative, which creates a bias problem for most machine learning algorithms. We next show our attempt at solving these sorts of problems using a classifier technique based on the support vector machine.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Regularized Least-Squares Classification </SectionTitle>
<Paragraph position="0"> Regularized Least-Squares Classification (RLSC) (Rifkin, 2002) allows us to solve 'severe multi-class' problems in which there are a great number of target classes and a fixed set of source observations. It is related to the Support Vector Machine (Vapnik, 1998) in that both are instances of Tikhonov regularization (Evgeniou et al., 2000), but whereas training a Support Vector Machine requires the solution of a constrained quadratic programming problem, training RLSC requires only the solution of a single system of linear equations. Recent work (Fung and Mangasarian, 2001; Rifkin, 2002) has shown that the accuracy of RLSC is essentially identical to that of SVMs.</Paragraph>
<Paragraph position="1"> We arrange our observations in a Gram matrix K, where Kij = Kf(xi, xj) using the kernel function Kf. Kf(xi, xj) is a generalized dot product (in a Reproducing Kernel Hilbert Space (Aronszajn, 1950)) between xi and xj. We use the Gaussian kernel Kf(x1, x2) = exp(-||x1 - x2||^2 / 2s^2), (1) where s is a parameter we keep at 0.5.</Paragraph>
<Paragraph position="2"> Training an RLSC system then consists of solving the system of linear equations (K + I/C)c = y, (2) where C is a user-supplied regularization constant and I is the identity matrix. The resulting real-valued classification function f is f(x) = sum_i ci Kf(x, xi). (3)</Paragraph>
<Paragraph position="3"> The crucial property of RLSC is that if we store the inverse matrix (K + I/C)^-1, then for a new right-hand side y we can compute the new c via a simple matrix multiplication. This allows us to compute new classifiers on the fly (after arranging the data and storing it in memory) with simple matrix multiplications.</Paragraph>
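<Paragraph position="4"> As a minimal sketch of this reuse (our illustration, not the authors' implementation; the label matrix and the value of C are assumptions), one factorization of K + I/C serves every term's right-hand side:

import numpy as np

def gaussian_gram(X, sigma=0.5):
    # Kij = exp(-||xi - xj||^2 / 2 sigma^2), as in Equation 1.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def train_rlsc(K, Y, C=10.0):
    # Solve (K + I/C) c = y (Equation 2) for every column y of Y,
    # where each column holds the +1/-1 labels of one descriptive term.
    n = K.shape[0]
    return np.linalg.solve(K + np.eye(n) / C, Y)

def rlsc_predict(K_test, coeffs):
    # f(x) = sum_i ci Kf(x, xi) (Equation 3); each row of K_test holds
    # kernel values between one test point and all training points.
    return K_test @ coeffs
</Paragraph>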
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Evaluation for a "Query-by-Description" Task </SectionTitle>
<Paragraph position="0"> To evaluate our connection-finding system, we compute the weighted precision P(at) of predicting the label t for audio-derived features of artist a. We train a new ct for each term t against the training set. ft(x) for the test set is then computed over each audio-derived observation frame x and term t. If the sign of ft(x) agrees with our supposed 'ground truth' for that {artist, t} pair (i.e., did the audio frame for an artist correctly resolve to a known descriptive term?), we consider the prediction successful.</Paragraph>
<Paragraph position="1"> Due to the bias problem mentioned earlier, the evaluation on the test set is computed as a 'weighted precision' P(a) = P(ap)P(an), where P(ap) indicates overall positive accuracy (given an audio-derived observation, the probability that a positive association to a term is predicted) and P(an) indicates overall negative accuracy. This product should remain significant even in the face of extreme negative output class bias.</Paragraph>
<Paragraph position="2"> [Figure: Semantically attached terms are discovered by finding strong connections to perception. We then ask a 'professional', in the form of a lexical knowledge base, about antonymial relations, and use those relations to infer gradations in perception.]</Paragraph>
<Paragraph position="3"> We then sort the list of P(at) and set an arbitrary threshold e; in our implementation, we use e = 0.1. Any P(at) greater than e is considered 'grounded.' In this manner we can use training accuracy to throw away badly scoring classes and then figure out which were incorrect or unimportant.</Paragraph>
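<Paragraph position="4"> A short sketch of this evaluation (ours; it assumes +1/-1 ground truth and predictions, and a per-term score dictionary):

def weighted_precision(y_true, y_pred):
    # P(a) = P(ap) * P(an): the product of positive and negative accuracy,
    # which stays informative despite the heavy negative class bias.
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    p_pos = pos.count(1) / max(len(pos), 1)
    p_neg = neg.count(-1) / max(len(neg), 1)
    return p_pos * p_neg

def grounded_terms(score_by_term, epsilon=0.1):
    # Keep only the terms whose weighted precision clears the threshold e.
    return [t for t, p in score_by_term.items() if p > epsilon]
</Paragraph>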
</Section> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Linguistic Experts for Parameter Spaces </SectionTitle>
<Paragraph position="0"> Given a set of 'grounded' single terms, we now discuss our method for uncovering parameter spaces among those terms and learning the knobs to vary their gradation. Our model states that certain knowledge is not inferred from sensory input or intrinsic knowledge but rather obtained by querying a 'linguistic expert.' If we hear 'loud' audio and we hear 'quiet' audio, we need to know that those terms are antonymially related before inferring the gradation space between them.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 WordNet </SectionTitle>
<Paragraph position="0"> WordNet (Miller, 1990) is a lexical database hand-developed by lexicographers. Its main unit of organization is the 'synset', a group of synonymous words that may replace each other in some linguistic context. The meaning of a synset is captured by its lexical relations, such as hyponymy, meronymy, or antonymy, to other synsets.</Paragraph>
<Paragraph position="1"> WordNet has a large community of users and various APIs for accessing the information automatically. Adjectives in WordNet are organized in two polar clusters of synsets, with each focal synset (the head adjective) linking to some antonym adjective. The intended belief is that descriptive relations are stored as polar gradation spaces, implying that we cannot fully understand 'loud' without also understanding 'quiet.' We use these antonymial relations to build up a new relation that encodes as much antonymial expressivity as possible, which we describe below.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Synant Sets </SectionTitle>
<Paragraph position="0"> We define a set of lexical relations called synants, consisting of every antonym of a source term, along with every antonym of each of its synonyms and every synonym of each of its antonyms. In effect, we recurse one extra level through WordNet's tree to uncover as many antonymial relations as possible. For example, "quiet"'s anchor antonym is "noisy," but "noisy" has other synonyms such as "clangorous" and "thundering." By uncovering these second-order antonyms in the synant set, we hope to capture as much gradation expressivity as possible. Some example synants are shown in Table 3.</Paragraph>
<Paragraph position="1"> [Table 3: Example synant pairs. northern - southern; playful - serious; unlimited - limited; naive - sophisticated; foreign - native; consistent - inconsistent; outdoor - indoor; foreign - domestic; dissonant - musical; physical - mental; opposite - alternate; censored - uncensored; unforgettable - forgettable; comfortable - uncomfortable; concrete - abstract; untamed - tame; partial - fair; empirical - theoretical; atomic - conventional; curved - straight; lean - rich; lean - fat]</Paragraph>
<Paragraph position="2"> The obvious downside of computing the synant set is that synonymy can quickly be lost: following the example above, we can go from "quiet" to its synonym "untroubled," which leads to a synantonymial relation with "infested." We also expect problems due to our lack of sense tagging: going from "quiet" to its fourth-sense synonym "restrained" and then to its antonym "demonstrative," for example, probably has little to do with sound. But in both cases we rely again on the sheer size of our example space; with so many possible adjective descriptors and the large potential size of the synant set, we expect our connection-finding machines to do the hard work of throwing away the mistakes.</Paragraph>
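<Paragraph position="3"> A sketch of the synant expansion (ours, written against NLTK's WordNet interface, which postdates this paper; all senses are used, matching the lack of sense tagging noted above):

from nltk.corpus import wordnet as wn

def synant_set(word):
    synants = set()
    for synset in wn.synsets(word):
        # Each lemma of each synset of `word` is a synonym (including the
        # word itself), so these antonyms cover both direct antonyms and
        # antonyms-of-synonyms.
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                synants.add(ant.name())
                # Synonyms of each antonym: the second-order relations.
                for ant_syn in ant.synset().lemmas():
                    synants.add(ant_syn.name())
    return synants - {word}

# For example, synant_set("quiet") should contain "noisy" along with
# second-order relations of the kind described above.
</Paragraph>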
</Section> </Section>
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Innate Dimensionality of Parameters </SectionTitle>
<Paragraph position="0"> Now that we have a set of grounded antonymial adjective pairs, we would like to investigate the mapping in perceptual space between each pair. We can do this with a multidimensional scaling (MDS) algorithm. Let us call all acoustically derived data associated with one adjective X1, and all data associated with its syn-antonym X2. An MDS algorithm can find a multidimensional embedding of the data based on pairwise similarity distances between data points; the similarity distances between music samples are based on the representations described in the previous section. Consider first only the data from X1. The perceptual diversity of this data will reflect the fact that it represents numerous artists and songs. Overall, however, we would predict that a low-dimensional space can embed X1 with low stress (i.e., a good fit to the data), since all samples of X1 share a descriptive label that is well grounded. Now consider the embedding of the combined data set of X1 and X2. In this case, the additional dimensions needed to accommodate the joint data will reflect the relation between the two datasets. Our hypothesis was that the additional perceptual variance of datasets formed by combining pairs of datasets on the basis of adjective pairs that are (1) well grounded and (2) synants would be small compared to combinations in which either of these two conditions did not hold. The following are initial results supporting this hypothesis.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.1 Nonlinear Dimensionality Reduction </SectionTitle>
<Paragraph position="0"> Classical dimensional scaling systems such as MDS or PCA can efficiently learn a low-dimensional weighting but can only use Euclidean or tangent distances between observations to do so. In complex data sets, the distances might be better represented by a nonlinear function that captures inherent structure in the dimensions. Especially in the case of music, time variances among adjacent observations could be encoded as distances and used in the scaling. We use the Isomap algorithm (Tenenbaum et al., 2000) to capture this inherent nonlinearity and structure of the audio features. Isomap scales dimensions given an NxN matrix of distances between every pair of the N observations. It approximates global geodesic distance by adding up a number of short 'neighbor hops' (where the number of neighbors is a tunable parameter; here we use k = 20) to get between two arbitrarily far points in input space. Schemes like PCA or MDS would simply use the Euclidean distance here, whereas Isomap operates on prior knowledge of the structure within the data. For our purposes, we use the same Gaussian kernel function as we do for RLSC (Equation 1) as our distance metric, which has proved to work well for most music classification tasks.</Paragraph>
<Paragraph position="1"> Isomap can embed into a range of dimensions beyond the target dimension to find the best fit. By studying the residual variance of each embedding, we can look for the "elbow" (the point at which the variance falls off to its minimum) and treat that embedding as the innate one. We use this variance to show that our highly grounded parameter spaces can be embedded in fewer dimensions than ungrounded ones.</Paragraph>
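<Paragraph position="2"> As a sketch of this elbow-finding procedure (ours, using scikit-learn's Isomap, which takes Euclidean input distances rather than the Gaussian-kernel metric above), residual variance is computed as one minus the squared correlation between geodesic and embedded distances at each candidate dimensionality:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import Isomap

def residual_variances(X, max_dim=10, k=20):
    variances = []
    for d in range(1, max_dim + 1):
        iso = Isomap(n_neighbors=k, n_components=d)
        Y = iso.fit_transform(X)
        # Geodesic distances are estimated over the k-neighbor graph.
        geodesic = squareform(iso.dist_matrix_, checks=False)
        embedded = pdist(Y)
        r = np.corrcoef(geodesic, embedded)[0, 1]
        variances.append(1.0 - r ** 2)
    # The "elbow" is the dimension at which the variance stops falling.
    return variances
</Paragraph>
</Section> </Section> </Paper>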