<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1062"> <Title>Learning to predict pitch accents and prosodic boundaries in Dutch</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> TF*IDF </SectionTitle> <Paragraph position="0"> TF*IDF - The TF*IDF metric (Salton, 1989) estimates the relevance of a word in a document. Document frequency counts for all token types were obtained from a subset of the same corpus as used for the IC calculations. TF*IDF and IC (the previous two features) have been successfully tested as features for accent prediction by Pan and McKeown (1999), who assert that IC is a more powerful predictor than TF*IDF.</Paragraph> <Paragraph position="1"> Phrasometer - The phrasometer feature (PM) is the summed log-likelihood of all n-grams the word form occurs in, with n ranging from 1 to 25, computed in an iterative growth procedure: log-likelihoods of (n+1)-grams are computed by expanding all stored n-grams one word to the left and to the right; only those (n+1)-grams with a higher log-likelihood than the original n-gram are stored. Computations are based on the complete ILK Corpus.</Paragraph> <Paragraph position="2"> Distance to previous occurrence - The distance, counted in tokens, to the previous occurrence of the same token within the same article (D2P). Tokens without a previous occurrence were assigned an arbitrarily high default distance of 9999.</Paragraph> <Paragraph position="3"> Distance to sentence boundaries - The distance of the current token to the start of the sentence (D2S) and to the end of the sentence (D2E), both measured as a proportion of the total sentence length in tokens.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 CART: Classification and regression trees </SectionTitle> <Paragraph position="0"> CART (Breiman et al., 1984) is a statistical method for inducing a classification or regression tree from a given set of instances. An instance consists of a fixed-length vector of n feature-value pairs and an information field containing the classification of that particular feature-value vector. Each node in the CART tree contains a binary test on some categorical or numerical feature in the input vector. In the case of classification, the leaves contain the most likely class. The tree-building algorithm starts by selecting the feature test that splits the data in such a way that the mean impurity (entropy times the number of instances) of the two partitions is minimal.</Paragraph> <Paragraph position="1"> The algorithm continues to split each partition recursively until some stopping criterion is met (e.g. a minimal number of instances in the partition). Alternatively, a small stop value can be used to build a tree that is probably overfitted, which is then pruned back to where it best matches some amount of held-out data. In our experiments, we used the CART implementation that is part of the Edinburgh Speech Tools (Taylor et al., 1999).</Paragraph> </Section>
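As an illustration of the split selection just described, the following Python sketch scores candidate thresholds on a single numeric feature using the impurity measure above (entropy times the number of instances, averaged over the two partitions). It is a minimal sketch of the criterion only, not the Edinburgh Speech Tools implementation; the function names and toy data are our own.

```python
# Minimal illustration of CART-style split selection: choose the threshold on a
# numeric feature that minimises the mean impurity (entropy times the number of
# instances) of the two resulting partitions. Illustrative only.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def impurity(labels):
    """Impurity of one partition: entropy times the number of instances."""
    return entropy(labels) * len(labels) if labels else 0.0

def best_threshold(values, labels):
    """Return (mean_impurity, threshold) of the best binary split, or None."""
    best = None
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue
        score = (impurity(left) + impurity(right)) / 2.0
        if best is None or score < best[0]:
            best = (score, t)
    return best

# Toy example: one numeric feature (say, IC) and a binary accent class.
print(best_threshold([0.1, 0.35, 0.4, 0.8, 0.9], [0, 0, 0, 1, 1]))
```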
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Memory-based learning </SectionTitle> <Paragraph position="0"> Memory-based learning (MBL), also known as instance-based, example-based, or lazy learning (Stanfill and Waltz, 1986; Aha et al., 1991), is a supervised inductive learning method for classification tasks. Memory-based learning treats a set of training instances as points in a multi-dimensional feature space and stores them as such in an instance base in memory (rather than performing some abstraction over them). After the instance base is stored, new (test) instances are classified by matching them against all instances in memory and computing for each match the distance between the new instance X and the memory instance Y, given by a distance function; cf. Daelemans et al. (2002) for details. Classification in memory-based learning is performed by the k-NN algorithm (Fix and Hodges, 1951; Cover and Hart, 1967), which searches for the k 'nearest neighbours' according to the distance function. The majority class of the k nearest neighbours then determines the class of the new case. In our k-NN implementation, equidistant neighbours are taken as belonging to the same k, so this implementation is effectively a k-nearest distance classifier.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3 Optimization by iterative deepening </SectionTitle> <Paragraph position="0"> Iterative deepening (ID) is a heuristic search algorithm for the optimization of algorithmic parameter settings and feature selection, combining classifier wrapping (using the training material internally to test experimental variants) (Kohavi and John, 1997) with progressive sampling of training material (Provost et al., 1999). We start with a large pool of experiments, each with a unique combination of input features and algorithmic parameter settings. In the first step, each setting is applied to a small amount of training material and tested on a fixed amount of held-out data (which is a part of the full training set). Only the best settings are kept; all others are removed from the pool of competing settings.</Paragraph> <Paragraph position="1"> In subsequent iterations, this step is repeated, exponentially decreasing the number of settings in the pool while exponentially increasing the amount of training material. The idea is that the increasing amount of time required for training is compensated by running fewer experiments, in effect keeping processing time approximately constant across iterations. This process terminates when only the single best experiment (or the n best experiments) is left.</Paragraph> <Paragraph position="2"> This ID procedure can be embedded in a standard 10-fold cross-validation procedure. In such a 10-fold CV ID experiment, the ID procedure is carried out on the 90% training partition, and the resulting optimal setting is tested on the remaining 10% test partition. The average score over the 10 optimized folds can then be considered, like that of a normal 10-fold CV experiment, a good estimate of the performance of a classifier optimized on the full data set.</Paragraph> <Paragraph position="3"> For current purposes, our specific realization of this general procedure was as follows. We used folds of approximately equal size. Within each ID experiment, the amount of held-out data was approximately 5%; the initial amount of training data was 5% as well. Eight iterations were performed, during which the number of experiments was decreased and the amount of training data was increased, so that in the end only the 3 best experiments used all available training data (i.e. the remaining 95%).</Paragraph>
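The following Python fragment sketches the progressive-sampling loop described above. It is an illustration only, not the authors' implementation: the function names, the evaluate callback (assumed to return the held-out F-score of one candidate setting), and the exponential growth and shrink schedule are our assumptions, chosen to match the description (5% initial training data, a fixed held-out set, eight iterations, three surviving experiments).

```python
# Illustrative sketch of the iterative deepening wrapper, not the authors' code.
# 'experiments' is a list of candidate settings (feature selection + parameters);
# 'evaluate(setting, train, heldout)' is an assumed callback returning the
# F-score on the target class for one setting trained on 'train'.
import random

def iterative_deepening(experiments, train_pool, heldout, evaluate,
                        n_iterations=8, n_final=3, start_fraction=0.05):
    """Prune the experiment pool while exponentially growing the training sample."""
    pool = list(experiments)
    steps = max(1, n_iterations - 1)
    growth = (1.0 / start_fraction) ** (1.0 / steps)   # reach 100% of the data
    shrink = (n_final / len(pool)) ** (1.0 / steps)    # end with n_final settings
    fraction = start_fraction
    for _ in range(n_iterations):
        sample = random.sample(train_pool, max(1, int(fraction * len(train_pool))))
        # Rank the surviving settings by held-out F-score on the current sample.
        ranked = sorted(pool, key=lambda s: evaluate(s, sample, heldout), reverse=True)
        pool = ranked[:max(n_final, int(round(len(pool) * shrink)))]
        fraction = min(1.0, fraction * growth)
    return pool  # the n_final best settings, finally trained on all available data
```

In the actual experiments, the pools started at 480 (CART) and 1184 (MBL) candidate settings, as listed below.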
<Paragraph position="4"> Increasing the training data set was accomplished by random sampling from all available training data. Selection of the best experiments was based on their F-score (van Rijsbergen, 1979) on the target class (accent or break). F-score, the harmonic mean of precision and recall, was chosen because it directly evaluates the tasks (placement of accents or breaks), in contrast with classification accuracy (the percentage of correctly classified test instances), which is biased towards the majority class (placing no accent or break). Moreover, accuracy masks relevant differences between inappropriate classifiers that place no accents or breaks at all and better classifiers that do place them, but partly erroneously.</Paragraph> <Paragraph position="5"> The initial pool of experiments was created by systematically varying the feature selection (the input features to the classifier) and the classifier settings (the parameters of the classifiers). We restricted these selections and settings to reasonable bounds to keep our experiments computationally feasible. In particular, feature selection was limited to varying the size of the window used to model the local context of an instance. A uniform window (i.e. the same size for all features) was applied to all features except DiA, D2P, D2S, and D2E. Its size (win) could be 1, 3, 5, 7, or 9, where win = 1 implies no modeling of context, whereas win = 9 means that during classification not only the features of the current instance are taken into account, but also those of the preceding and following four instances.</Paragraph> <Paragraph position="6"> For CART, we varied the following parameter values, resulting in a first ID step with 480 experiments:
* the minimum number of examples for leaf nodes (stop): 1, 10, 25, 50, and 100
* the number of partitions to split a float feature range into (frs): 2, 5, 10, and 25
* the percentage of training material held out for pruning (held-out): 0, 5, 10, 15, 20, and 25 (0 implies no pruning)
For MBL, we varied the following parameter values, which led to 1184 experiments in the first ID step:
* the number of nearest neighbours (k): 1, 4, 7, 10, 13, 16, 19, 22, 25, and 28
* the type of feature weighting: Gain Ratio (GR) and Shared Variance (SV)
* the feature value similarity metric: Overlap, or Modified Value Difference Metric (MVDM) with back-off to Overlap at value frequency thresholds 1 (L=1, no back-off), 2, and 10
* the type of distance weighting: None, Inverse Distance, Inverse Linear Distance, and Exponential Decay with a = 1.0 (ED1) and a = 4.0 (ED4)</Paragraph> </Section> </Section> </Paper>