<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1027">
  <Title>A Quantitative Evaluation of Linguistic Tests for the Automatic Prediction of Semantic Markedness</Title>
  <Section position="4" start_page="197" end_page="198" type="metho">
    <SectionTitle>
3 Tests for Semantic Markedness
</SectionTitle>
    <Paragraph position="0"> Markedness in general and semantic markedness in particular have received considerable attention in the linguistics literature. Consequently, several tests for determining markedness have been proposed by linguists. Most of these tests involve human judgments (Greenberg, 1966; Lyons, 1977; Waugh, 1982; Lehrer, 1985; Ross, 1987; Lakoff, 1987) and are not suitable for computer implementation. However, some proposed tests refer to comparisons between measurable properties of the words in question and are amenable to full automation. These tests are:  1. Text frequency. Since the unmarked term can appear in more contexts than the marked one, and it has both general and specific senses, it should appear more frequently in text than the marked term (Greenberg, 1966).</Paragraph>
    <Paragraph position="1"> 2. Formal markedness. A formal markedness relationship (i.e., a morphology relationship between the two words), whenever it exists, should be an excellent predictor for semantic markedness (Greenberg, 1966; Zwicky, 1978).</Paragraph>
    <Paragraph position="2"> 3. Formal complexity. Since the unmarked word is the more general one, it should also be morphologically the simpler (Jakobson, 1962; Battistella, 1990). The &amp;quot;economy of language&amp;quot; prin null ciple (Zipf, 1949) supports this claim. Note that this test subsumes test (2).</Paragraph>
    <Paragraph position="3"> 4. Morphological produclivity. Unmarked words, being more general and frequently used to describe the whole scale, should be freer to combine with other linguistic elements (Winters, 1990; Battistella, 1990).</Paragraph>
    <Paragraph position="4"> 5. Differentialion. Unmarked terms should exhibit higher differentiation with more subdistinetions (Jakobson, 1984b) (e.g., the present  tense (unmarked) appears in a greater variety of forms than the past), or, equivalently, the marked term should lack some subcategories (Greenberg, 1966).</Paragraph>
    <Paragraph position="5"> The first of the above tests compares the text frequencies of the two words, which are clearly measurable and easily retrievable from a corpus. We use the one-million word Brown corpus of written American English (Ku~era and Francis, 1967) for this purpose. The mapping of the remaining tests to quantifiable variables is not as immediate. We use the length of a word in characters, which is a reasonable indirect index of morphological complexity, for tests (2) and (3). This indicator is exact for the case of test (2), since the formally marked word is derived from the unmarked one through the addition of an affix (which for adjectives is always a prefix). The number of syllables in a word is another reasonable indicator of morphological complexity that we consider, although it is much harder to compute automatically than word length.</Paragraph>
    <Paragraph position="6"> For morphological productivity (test (4)), we measure several variables related to the freedom of the word to receive affixes and to participate in compounds. Several distinctions exist for the definition of a variable that measures the number of words that are morphologically derived from a given word. These distinctions involve: Q Whether to consider the number of distinct words in this category (types) or the total fre- null quency of these words (tokens).</Paragraph>
    <Paragraph position="7"> * Whether to separate words derived through affixation from compounds or combine these types of morphological relationships.</Paragraph>
    <Paragraph position="8"> * If word types (rather than word frequencies) are  measured, we can select to count homographs (words identical in form but with different parts of speech, e.g., light as an adjective and light as a verb) as distinct types or map all homographs of the same word form to the same word type.</Paragraph>
    <Paragraph position="9"> Finally, the differentiation test (5) is the one general markedness test that cannot be easily mapped into observable properties of adjectives. Somewhat arbitrarily, we mapped this test to the number of grammatical categories (parts of speech) that each word can appear under, postulating that the unmarked term should have a higher such number.</Paragraph>
    <Paragraph position="10"> The various ways of measuring the quantities compared by the tests discussed above lead to the consideration of 32 variables. Since some of these variables are closely related and their number is so high that it impedes the task of modeling semantic markedness in terms of them, we combined several of them, keeping 14 variables for the statistical analysis.</Paragraph>
  </Section>
  <Section position="5" start_page="198" end_page="199" type="metho">
    <SectionTitle>
4 Data Collection
</SectionTitle>
    <Paragraph position="0"> In order to measure the performance of the markedness tests discussed in the previous section, we collected a fairly large sample of pairs of antonymous gradable adjectives that can appear in howquestions. The Deese antonyms (Deese, 1964) is the prototypical collection of pairs of antonymous adjectives that have been used for similar analyses in the past (Deese, 1964; Justeson and Katz, 1991; Grefenstette, 1992). However, this collection contains only 75 adjectives in 40 pairs, some of which cannot be used in our study either because they are primarily adverbials (e.g., inside-outside) or not gradable (e.g., alive-dead). Unlike previous studies, the nature of the statistical analysis reported in this paper requires a higher number of pairs.</Paragraph>
    <Paragraph position="1"> Consequently, we augmented the Deese set with the set of pairs used in the largest manual previous study of markedness in adjective pairs (Lehrer, 1985). In addition, we included all gradable adjectives which appear 50 times or more in the Brown corpus and have at least one gradable antonym; the antonyms were not restricted to belong to this set of frequent adjectives. For each adjective collected according to this last criterion, we included all the antonyms (frequent or not) that were explicitly listed in the Collins COBUILD dictionary (Sinclair, 1987) for each of its senses. This process gave us a sample of 449 adjectives (both frequent and infrequent ones) in 344 pairs. 2 We separated the pairs on the basis of the how-test into those that contain one semantically unmarked and one marked term and those that contain two marked terms (e.g., fat-lhin), removing the latter.</Paragraph>
    <Paragraph position="2"> For the remaining pairs, we identified the unmarked member, using existing designations (Lehrer, 1985) whenever that was possible; when in doubt, the pair was dropped from further consideration. We also separated the pairs into two groups according to whether the two adjectives in each pair were morphologically related or not. This allowed us to study the different behavior of the tests for the two groups separately. Table 1 shows the results of this cross-classification of the adjective pairs.</Paragraph>
    <Paragraph position="3"> Our next step was to measure the variables described in Section 3 which are used in the various  cording to morphological relationship and markedness status.</Paragraph>
    <Paragraph position="4"> tests for semantic markedness. For these measurements, we used the MRC Psycholinguistic Database (Coltheart, 1981) which contains a variety of measures for 150,837 entries counting different parts of speech or inflected forms as different words (115,331 distinct words). We implemented an extractor program to collect the relevant measurements for the adjectives in our sample, namely text frequency, number of syllables, word length, and number of parts of speech. All this information except the number of syllables can also be automatically extracted from the corpus. The extractor program also computes information that is not directly stored in the MRC database. Affixation rules from (Quirk et al., 1985) are recursively employed to check whether each word in the database can be derived from each adjective, and counts and frequencies of such derived words and compounds are collected. Overall, 32 measurements are computed for each adjective, and are subsequently combined into the 14 variables used in our study.</Paragraph>
    <Paragraph position="5"> Finally, the variables for the pairs are computed as the differences between the corresponding variables for the adjectives in each pair. The output of this stage is a table, with two strata corresponding to the two groups, and containing measurements on 14 variables for the 279 pairs with a semantically unmarked member.</Paragraph>
  </Section>
  <Section position="6" start_page="199" end_page="199" type="metho">
    <SectionTitle>
5 Evaluation of Linguistic Tests
</SectionTitle>
    <Paragraph position="0"> For each of the variables, we measured how many pairs in each group it classified correctly. A positive (negative) value indicates that the first (second) adjective is the unmarked one, except for two variables (word length and number of syllables) where the opposite is true. When the difference is zero, the variable selects neither the first or second adjective as unmarked. The percentage of nonzero differences, which correspond to cases where the test actually suggests a choice, is reported as the applicability of the variable. For the purpose of evaluating the accuracy of the variable, we assign such cases randomly to one of the two possible outcomes in accordance with common practice in classification (Duda and Hart, 1973).</Paragraph>
    <Paragraph position="1"> For each variable and each of the two groups, we also performed a statistical test of the null hypothesis that its true accuracy is 50%, i.e., equal to the expected accuracy of a random binary classifier. Under the null hypothesis, the number of correct responses follows a binomial distribution with parameter p = 0.5. Since all obtained measurements of accuracy were higher than 50%, any rejection of the null hypothesis implies that the corresponding test is significantly better than chance.</Paragraph>
    <Paragraph position="2"> Table 2 summarizes the values obtained for some of the 14 variables in our data and reveals some surprising facts about their performance. The frequency of the adjectives is the best predictor in both groups, achieving an overall accuracy of 80.64% with high applicability (98.5-99%). This is all the more remarkable in the case of the morphologically related adjectives, where frequency outperforms length of the words; recall that the latter directly encodes the formal markedness relationship, so frequency is able to correctly classify some of the cases where formal and semantic markedness values disagree. On the other hand, tests based on the &amp;quot;economy of language&amp;quot; principle, such as word length and number of syllables, perform badly when formal markedness relationships do not exist, with lower applicability and very low accuracy scores. The same can be said about the test based on the differentiation properties of the words (number of different parts of speech). In fact, for these three variables, the hypothesis of random performance cannot be rejected even at the 5% level. Tests based on the productivity of the words, as measured through affixation and compounding, tend to fall in-between: their accuracy is generally significant, but their applicability is sometimes low, particularly for compounds.</Paragraph>
  </Section>
  <Section position="7" start_page="199" end_page="201" type="metho">
    <SectionTitle>
6 Predictions Based on More than One Test
</SectionTitle>
    <Paragraph position="0"> While the frequency of the adjectives is the best single predictor, we would expect to gain accuracy by combining the answers of several simple tests. We consider the problem of determining semantic markedness as a classification problem with two possible outcomes ("the first adjective is unmarked" and "the second adjective is unmarked"). To design an appropriate classifier, we employed two general statistical supervised learning methods, which we briefly describe in this section.</Paragraph>
    <Paragraph position="1"> Decision trees (Quinlan, 1986) is the first statistical supervised learning paradigm that we explored. A popular method for the automatic construction of such trees is binary recursive partitioning, which constructs a binary tree in a top-down fashion.</Paragraph>
    <Paragraph position="2"> Starting from the root, the variable X which better discriminates among the possible outcomes is selected and a test of the form X &lt; consiant is as- null to or better than the observed one is listed in the P- Value column for each test. sociated with the root node of the tree. All training cases for which this test succeeds (fails) belong to the left (right) subtree of the decision tree. The method proceeds recursively, by selecting a new variable (possibly the same as in the parent node) and a new cutting point for each subtree, until all the cases in one subtree belong to the same category or the data becomes too sparse. When a node cannot be split further, it is labeled with the locally most probable category. During prediction, a path is traced from the root of the tree to a leaf, and the category of the leaf is the category reported.</Paragraph>
    <Paragraph position="3"> If the tree is left to grow uncontrolled, it will exactly represent the training set (including its peculiarities and random variations), and will not be very useful for prediction on new cases. Consequently, the growing phase is terminated before the training samples assigned to the leaf nodes are entirely homogeneous. A technique that improves the quality of the induced tree is to grow a larger than optimal tree and then shrink it by pruning subtrees (Breiman et al., 1984). In order to select the nodes to shrink, we normally need to use new data that has not been used for the construction of the original tree.</Paragraph>
    <Paragraph position="4"> In our classifier, we employ a maximum likelihood estimator based on the binomial distribution to select the optimal split at each node. During the shrinking phase, we optimally regress the probabilities of children nodes to their parent according to a shrinking parameter ~ (Hastie and Pregibon, 1990), instead of pruning entire subtrees. To select the optimal value for (~, we initially held out a part of the training data. In a later version of the classifier, we employed cross-validation, separating our training data in 10 equally sized subsets and repeatedly training on 9 of them and validating on the other.</Paragraph>
    <Paragraph position="5"> Log-linear regression (Santner and Duffy, 1989) is the second general supervised learning method that we explored. In classical linear modeling, the response variable y is modeled as y -- bTx+e where b is a vector of weights, x is the vector of the values of the predictor variables and e is an error term which is assumed to be normally distributed with zero mean and constant variance, independent of the mean of y. The log-linear regression model generalizes this setting to binomial sampling where the response variable follows a Bernoulli distribution (corresponding to a two-category outcome); note that the variance of the error term is not independent of the mean of y any more. The resulting generalized linear model (McCullagh and Nelder, 1989) employs a linear predictor y = bTx + e as before, but the response variable y is non-linearly related to through the inverse logit function,</Paragraph>
    <Paragraph position="7"> Note that y E (0, 1); each of the two ends of that interval is associated with one of the possible choices.</Paragraph>
    <Paragraph position="8"> We employ the iterative reweighted least squares algorithm (Baker and Nelder, 1989) to approximate the maximum likelihood cstimate of the vector b, but first we explicitly drop the constant term (intercept) and most of the variables. The intercept is dropped because the prior probabilities of the two outcomes are known to be equal. 3 Several of the variables are dropped to avoid overfitting (Duda and Hart, 1973); otherwise the regression model will use all available variables, unless some of them are linearly dependent. To identify which variables we should keep in the model, we use the analysis of deviance method with iterative stepwise refinement of the model by iteratively adding or dropping one term if the reduction (increase) in the deviance compares  of the frequency method (dotted line) and the smoothed log-linear model (solid line) on the morphologically unrelated adjectives.</Paragraph>
    <Paragraph position="9"> favorably with the resulting loss (gain) in residual degrees of freedom. Using a fixed training set, six of the fourteen variables were selected for modeling the morphologically unrelated adjectives. Frequency was selected as the only component of the model for the morphologically related ones.</Paragraph>
    <Paragraph position="10"> We also examined the possibility of replacing some variables in these models by smoothing cubic Bsplines (Wahba, 1990). The analysis of deviance for this model indicated that for the morphologically unrelated adjectives, one of the six selected variables should be removed altogether and another should be replaced by a smoothing spline.</Paragraph>
  </Section>
class="xml-element"></Paper>