<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1027">
  <Title>A Quantitative Evaluation of Linguistic Tests for the Automatic Prediction of Semantic Markedness</Title>
  <Section position="8" start_page="201" end_page="201" type="evalu">
    <SectionTitle>
7 Evaluation of the Complex
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="201" end_page="201" type="sub_section">
      <SectionTitle>
Predictors
</SectionTitle>
      <Paragraph position="0"> For both decision trees and log-linear regression, we repeatedly partitioned the data in each of the two groups into equally sized training and testing sets, constructed the predictors using the training sets, and evaluated them on the testing sets. This process was repeated 200 times, giving vectors of estimates for the performance of the various methods.</Paragraph>
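      <Paragraph> The repeated partitioning described above could be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `repeated_holdout`, the `fit` and `evaluate` callables, the fixed seed, and the toy majority-label predictor are all hypothetical names introduced here for the example.

```python
import random
from collections import Counter


def repeated_holdout(items, fit, evaluate, n_repeats=200, seed=0):
    """Return one test-set score per random half/half split of `items`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(n_repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # With an odd number of items the test half gets the extra item.
        train, test = shuffled[:half], shuffled[half:]
        model = fit(train)
        scores.append(evaluate(model, test))
    return scores


# Hypothetical toy predictor: always guess the majority label of the training set.
def fit_majority(train):
    return Counter(train).most_common(1)[0][0]


def accuracy(label, test):
    return sum(1 for item in test if item == label) / len(test)


# Made-up labels, used only to exercise the function.
toy_labels = ["unmarked"] * 12 + ["marked"] * 8
scores = repeated_holdout(toy_labels, fit_majority, accuracy)
print(len(scores), sum(scores) / len(scores))
```

Each call yields a vector of 200 scores, one per split, which is the kind of vector the comparisons in this section operate on.</Paragraph>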
      <Paragraph position="1"> The simple frequency test was also evaluated in each testing set for comparison purposes. From these vectors, we estimate the density of the distribution of the scores for each method; Figure 1 gives these densities for the frequency test and the log-linear model with smoothing splines on the most difficult case, the morphologically unrelated adjectives.</Paragraph>
      <Paragraph position="2"> Table 3 summarizes the performance of the methods on the two groups of adjective pairs. (Footnote 4: The applicability of all complex methods was 100% in both groups.)</Paragraph>
      <Paragraph position="3"> In order to assess the significance of the differences between the scores, we performed a nonparametric sign test (Gibbons and Chakraborti, 1992) for each complex predictor against the simple frequency variable. The test statistic is the number of runs where the score of one predictor is higher than the other's; as is common in statistical practice, ties are broken by assigning half of them to each category. Under the null hypothesis of equal performance of the two methods that are contrasted, this test statistic follows the binomial distribution with p = 0.5. Table 3 includes the exact probabilities for obtaining the observed (or more extreme) values of the test statistic.</Paragraph>
      <Paragraph position="4"> From the table, we observe that the tree-based methods perform considerably worse than frequency (significant at any conceivable level), even when cross-validation is employed. Both the standard and smoothed log-linear models outperform the frequency test on the morphologically unrelated adjectives (significant at the 5% and 0.1% levels respectively), while the log-linear model's performance is comparable to the frequency test's on the morphologically related adjectives. The best predictor overall is the smoothed log-linear model. The above results indicate that the frequency test contains almost all the information that can be extracted collectively from all linguistic tests. Consequently, even very sophisticated methods for combining the tests can offer only a small improvement. Furthermore, the prominence of one variable can easily lead to overfitting the training data in the remaining variables, which causes the decision tree models to perform badly.</Paragraph>
    </Section>
  </Section>
</Paper>