<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0613">
  <Title>Learning Word Meanings and Descriptive Parameter Spaces from Music</Title>
  <Section position="9" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
8 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> In the following section we describe our experiments using the aforementioned models and show how we can automatically uncover the perceptual parameter spaces underlying adjective oppositions.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
8.1 Audio dataset
</SectionTitle>
      <Paragraph position="0"> We use audio from the NECI Minnowmatch testbed (Whitman et al., 2001). The testbed includes on average ten songs from each of 1,000 albums from roughly 500 artists. The album list was chosen from the most popular songs on OpenNap, a popular peer-to-peer music sharing service, in August of 2001. We do not separate audio-derived features among separate songs since our connections in language are at the artist level (community meta-data refers to an artist, not an album or song.) Therefore, each artist a is represented as a concatenated matrix of Fa computed from each song performed by that artist.</Paragraph>
      <Paragraph position="1"> Fa contains N rows of 40-dimensional data. Each observation represents 10 seconds of audio data. We choose a random sampling of artists for both training and testing (25 artists each, 5 songs for a total of N observations for testing and training) from the Minnowmatch testbed.</Paragraph>
      <Paragraph position="2"> 8.2 RLSC for Audio to Term Relation Each artist in the testbed has previously been crawled for community metadata vectors, which we associate with the audio vectors as a yt truth vector. In this experiment, we limit our results to adjective terms only. The entire community metadata space of 500 artists ended up with roughly 2,000 unique adjectives, which provide a good sense of musical description. The other term types (ngrams and noun phrases) are more useful in text retrieval tasks, as they contain more specific information such as band members, equipment or song titles. Each audio observation in N is associated with an artist a, which in turn is related to the set of adjectives with pre-defined salience. (Salience is zero if the term is not related, unbounded if related.) We are treating this problem as classification, not regression, so we assign not-related terms a value of -1 and positively related terms are regularized to 1.</Paragraph>
      <Paragraph position="3"> We compute a ct for each adjective term t on the training set after computing the stored kernel. We use a C of 10. After all the cts are stored to disk we then bring out the held-out test set and compute relative adjective weighted prediction accuracy P(a) for each term. The results (in Table 4) are similar to our previous work but we note that our new representation allows us to capture more time- and structure-oriented terms. We see that the time-aware MPEG-7 representation creates a far better sense of perceptual salience than our prior frame-based power spectral density estimation, which threw away all short- and mid-time features.</Paragraph>
      <Paragraph position="4">  aware adjective grounding system. Overall, the attached term list is more musical due to the increased time-aware information in the representation.</Paragraph>
      <Paragraph position="5">  spaces and their weighted precision. The top are the most semantically significant description spaces for music understanding uncovered autonomously by our system.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
8.3 Finding Parameter Spaces using WordNet
Lexical Relations
</SectionTitle>
      <Paragraph position="0"> We now take our new single-term results and ask our professional for help in finding parameters. For all adjectives over our predefined e we retrieve a restricted synant set. This restricted set only retrieves synants that are in our community metadata space: i.e. we would not return 'soft' as a synant to 'loud' if we did not have community-derived 'soft' audio. The point here is to only find synantonymial relations that we have perceptual data to 'ground' with. We rank our synant space by the mean of the P(a) of each polar term. For example, P(asoft) was 0.12 and we found a synant 'loud' in our space with a P(aloud) of 0.26, so our P(aloud...soft) would be 0.19.</Paragraph>
      <Paragraph position="1"> This allows us to sort our parameter spaces by the maximum semantic attachment. We see results of this process in Table 5.</Paragraph>
      <Paragraph position="2"> We consider this result our major finding: from listening to a set of albums and reading about the artists, a computational system has automatically derived the opti- null for different parameter spaces. Note the clear elbows for grounded parameter spaces, while less audio-derived spaces such as &amp;quot;alive - dead&amp;quot; maintain a high variance throughout. Bad antonym relations such as &amp;quot;quiet - soft&amp;quot; also have no inherent dimensionality.</Paragraph>
      <Paragraph position="3"> mal (strongest connection to perception) semantic gradation spaces to describe the incoming observation. These are not the most statistically significant bases but rather the most semantically significant bases for understanding and retrieval.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
8.4 Making Knobs and Uncovering Dimensionality
</SectionTitle>
      <Paragraph position="0"> We would like to show the results of such understanding at work in a classification or retrieval interface, so we then have another algorithm learn the d-dimensional mapping of the two polar adjectives in each of the top n parameter spaces. We also use this algorithm to uncover the natural dimensionality of the parameter space.</Paragraph>
      <Paragraph position="1"> For each parameter space a1 ... a2, we take all observations automatically labeled by the test pass of RLSC as a1 and all as a2 and separate them from the rest of the observations. The observations Fa1 are concatenated together with Fa2 serially, and we choose an equal number of observations from both to eliminate bias. We take this subset of observation Fa12 and embed it into a distance matrix D with the gaussian kernel in Equation 1. We feed D to Isomap and ask for a one-dimensional embedding of the space. The result is a weighting that we can feed completely new unlabeled audio into and retrieve scalar values for each of these parameters. We would like to propose that the set of responses from each of our new 'semantic experts' (weight matrices to determine parameter values) define the most expressive semantic representation possible for music.</Paragraph>
      <Paragraph position="2"> By studying the residual variances of Isomap as in Figure 4, we can see that Isomap finds inherent dimensionality for our top grounded parameter spaces. But for 'ungrounded' parameters or non-antonymial spaces, there is less of a clear 'elbow' in the variances indicating a natural embedding. For example, we see from Figure 4 that the &amp;quot;male - female&amp;quot; parameter (which we construe as gender of artist or vocalist) has a lower inherent dimensionality than the more complex &amp;quot;low - high&amp;quot; parameter and is lower yet than the ungroundable (in audio) &amp;quot;alive - dead.&amp;quot; These results allow us to evaluate our parameter discovery system (in which we show that groundable terms have clearer elbows) but also provide an interesting window into the nature of descriptions of perception.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>