<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1040">
  <Title>Learning Verb Argument Structure from Minimally Annotated Corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Hypothesis
</SectionTitle>
    <Paragraph position="0"> We create a probabilistic classifier that can automatically classify a set of verbs into argument structure classes with a reasonable error rate. We use the hypothesis introduced by (Stevenson and Merlo, 1999) that although a verb in a particular class can occur in all of the same syntactic contexts as verbs from other classes, the statistical distributions can be distinguished. In other words, verbs from certain classes will be more likely to occur in some syntactic contexts than in others. We identify features that pick out the verb occurrences in these contexts. By using these features, we attempt to determine the classification of those verbs. In the previous section we saw that a noun-phrase argument (NPcauser) can sometimes be the causer of the action denoted by the verb. For example, (Stevenson and Merlo, 1999) show that a classifier can exploit these causativity facts to improve classification.</Paragraph>
    <Paragraph position="1"> We use some new features in addition to the ones proposed and used in (Merlo and Stevenson, 2001) for this task. In particular, we include as a feature the probabilistic classification of the verb as a transitive or intransitive verb. Thus the classifier is simultaneously placing each verb into the appropriate subcategorization frame as well as identifying the underlying thematic roles of the verb arguments.</Paragraph>
    <Paragraph position="2"> In our experiment, we consider the following set of classes (each of which was explained in the previous section): unergative, unaccusative, and object-drop. We test 76 verbs taken from (Levin, 1993) that are in one of these three classes. The particular verbs were chosen to include high-frequency as well as low-frequency verb tokens in our particular corpus of 23M words of WSJ text.2 We used all instances of these verbs from the WSJ corpus.</Paragraph>
    <Paragraph position="3"> The data was annotated with the correct classification for each verb, and the classifier was trained on 90% of the verbs taken from the 23M word corpus and tested on the remaining 10% using 10-fold cross-validation. We describe the experiment in greater detail below. An important part of identifying the argument structure of the verb is to find the verb's subcategorization frame (SF). For this paper, we are interested in whether the verb takes an intransitive SF or a transitive SF.</Paragraph>
    <Paragraph position="4"> In general, the problem of identifying subcategorization frames is to distinguish between arguments and adjuncts among the constituents modifying a verb. For example, in &quot;John saw Mary yesterday at the station&quot;, only &quot;John&quot; and &quot;Mary&quot; are required arguments while the other constituents are optional (adjuncts).3 The problem of SF identification using statistical methods has had a rich discussion in the literature (Ushioda et al., 1993; Manning, 1993; Briscoe and Carroll, 1997; Brent, 1994) (also see the references cited in (Sarkar and Zeman, 2000)). In this paper, we use the method of hypothesis testing to discover the SF for a given verb (Brent, 1994).</Paragraph>
    <Paragraph position="5"> Along with the techniques given in these papers, (Sarkar and Zeman, 2000; Korhonen et al., 2000) also discuss other methods for hypothesis testing, such as the use of the t-score statistic and the likelihood ratio test. After experimenting with all three of these methods, we selected the likelihood ratio test because it performed with higher accuracy on a small set of hand-annotated instances. We use the determination of the verb's SF as an input to our argument structure classifier (see Section 4).</Paragraph>
    <Paragraph position="6"> The method works as follows: for each verb, we need to associate a score with the hypothesis that a particular set of dependents of the verb are arguments of that verb. In other words, we need to assign a value to the hypothesis that the observed frame under consideration is the verb's SF. Intuitively, we either want to test for independence of the observed frame and verb distributions in the data, or we want to test how likely a frame is to be observed with a particular verb without being a valid SF. We develop these intuitions by using the method of hypothesis testing using the likelihood ratio test.3</Paragraph>
    <Paragraph position="7"> For further background on this method of hypothesis testing the reader is referred to (Bickel and Doksum, 1977; Dunning, 1993). 3There is some controversy as to the correct subcategorization of a given verb, and linguists often disagree as to what is the right set of SFs for a given verb. A machine learning approach such as the one followed in this paper sidesteps this issue altogether, since it is left to the algorithm to learn what is an appropriate SF for a verb. The stance taken in this paper is that the efficacy of SF learning is evaluated on some domain, as is done here on learning verb alternations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Likelihood ratio test
</SectionTitle>
      <Paragraph position="0"> Let us take the hypothesis that the distribution of an observed frame f in the training data is independent of the distribution of a verb v. We can phrase this hypothesis as p(f | v) = p(f | !v) = p(f), that is, the distribution of a frame f given that a verb v is present is the same as the distribution of f given that v is not present (written as !v). We use the log likelihood test statistic (Bickel and Doksum, 1977, 209) as a measure to discover particular frames and verbs that are highly associated in the training data.</Paragraph>
      <Paragraph position="2"> Taking these probabilities to be binomially distributed, the log likelihood statistic (Dunning, 1993) is computed as follows.</Paragraph>
      <Paragraph position="4"> Let k1 = c(f, v), n1 = c(v), k2 = c(f, !v), and n2 = c(!v), with p1 = k1/n1, p2 = k2/n2, and p = (k1 + k2)/(n1 + n2). Then -2 log lambda = 2 [log L(p1, n1, k1) + log L(p2, n2, k2) - log L(p, n1, k1) - log L(p, n2, k2)], where log L(p, n, k) = k log p + (n - k) log(1 - p). According to this statistic, the greater the value of -2 log lambda for a particular pair of observed frame and verb, the more likely that frame is to be a valid SF of the verb. If this value is above a certain threshold, it is taken to be a positive value for the binary feature TRAN; otherwise it is a positive value for the binary feature INTRAN in the construction of the classifier.4</Paragraph>
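The likelihood ratio test described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the counts and function names are invented for the example:

```python
import math

def log_L(p, n, k):
    """Binomial log likelihood: log L(p, n, k) = k log p + (n - k) log(1 - p)."""
    eps = 1e-12                      # guard against log(0) at p = 0 or 1
    p = min(max(p, eps), 1 - eps)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k1, n1, k2, n2):
    """-2 log lambda for the association of a frame f with a verb v.

    k1 = c(f, v), n1 = c(v), k2 = c(f, !v), n2 = c(!v).
    """
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (log_L(p1, n1, k1) + log_L(p2, n2, k2)
                - log_L(p, n1, k1) - log_L(p, n2, k2))

# A frame seen in half of a verb's occurrences but rarely elsewhere
# scores much higher than one whose rate matches the background rate.
print(llr(50, 100, 10, 1000) > llr(5, 100, 50, 1000))  # True
```

When the frame's rate with the verb equals its background rate (p1 = p2), the statistic is zero, which is why thresholding it separates TRAN from INTRAN evidence.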
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="3" type="metho">
    <SectionTitle>
4 Steps in Constructing the Classifier
</SectionTitle>
    <Paragraph position="0"> To construct the classifier, we identify features that can be used to accurately distinguish verbs into different classes. Each feature is computed as the probability of observing that feature with the verb to be classified. We use C5.0 (Quinlan, 1992) to generate the decision tree classifier. The features are extracted from a 23M word corpus of WSJ text (LDC WSJ 1988 collection). Note that the training and test data constructed from this set are produced by the classification of individual verbs into their respective classes taken from (Merlo and Stevenson, 2001).</Paragraph>
    <Paragraph position="1"> We prepare the corpus by passing it through Adwait Ratnaparkhi's part-of-speech tagger (Ratnaparkhi, 1996) (trained on the Penn Treebank WSJ corpus) and then running Steve Abney's chunker (Abney, 1997) over the entire text. The output of this stage and the input to our feature extractor is shown below.</Paragraph>
    <Paragraph position="2"> We use the following features to construct the classifier. The first four features were discussed and motivated in (Stevenson and Merlo, 1999; Merlo and Stevenson, 2001). In some cases, we have modified the features to include information about part-of-speech tags. 4See (Sarkar and Zeman, 2000) for information on how the threshold is selected.</Paragraph>
    <Paragraph position="3"> The discussion below clarifies the similarities and changes. The features we used in addition are the last two in the following list: the part-of-speech features and the subcategorization frame features.5
1. simple past (VBD) and past participle (VBN)
2. active (ACT) and passive (PASS)
3. causative (CAUS)
4. animacy (ANIM)
5. part of speech of the subject noun-phrase and object noun-phrase
6. transitive (TRAN) and intransitive (INTRAN)
To calculate the probability values of each feature, we perform the following steps.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Finding the main verb of the sentences
</SectionTitle>
      <Paragraph position="0"> To find the main verb, we constructed a deterministic finite-state automaton (DFA) that finds the main verb within the verb phrase chunks. This DFA is used in two steps. First, it selects a set of main verbs from which we choose the final set of 76 verbs used in our experiment. Second, the actual set of verbs is incorporated into the DFA in the feature selection step.</Paragraph>
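The paper does not spell out the automaton itself; the sketch below is our own simplified stand-in for the idea: scan the POS tags inside one VP chunk, skip auxiliaries and adverbs, and keep the last non-auxiliary verb token as the main verb. The auxiliary list is an assumption of this example:

```python
# Assumed closed class of auxiliaries (not from the paper).
AUX = {"be", "been", "being", "am", "is", "are", "was", "were",
       "have", "has", "had", "do", "does", "did", "will", "would",
       "can", "could", "may", "might", "shall", "should", "must"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def main_verb(chunk):
    """chunk: list of (word, POS-tag) pairs from one VP chunk.

    Returns the last non-auxiliary verb token, or None if the chunk
    contains only auxiliaries/adverbs.
    """
    verb = None
    for word, tag in chunk:
        if tag in VERB_TAGS and word.lower() not in AUX:
            verb = word   # keep the most recent candidate main verb
        # adverbs (RB*) and auxiliaries advance the scan without output
    return verb

print(main_verb([("has", "VBZ"), ("quickly", "RB"), ("dropped", "VBN")]))
# dropped
```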
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Obtaining the frequency distribution of the
features
</SectionTitle>
      <Paragraph position="0"> The general form of the equation we use to find the frequency distribution of each feature of the verb is the following:</Paragraph>
      <Paragraph position="2"> P(Vj) = C(Vj) / N where P(Vj) is the distribution of feature j of the verb, N is the total number of features of the particular type (e.g., the total number of CAUS features or ANIM features as described below) and C(Vj) is the number of times this feature of the verb was observed in the corpus. The features computed using this formula are: ACT, PASS, TRAN, INTRAN, VBD, and VBN.</Paragraph>
      <Paragraph position="3"> 5Note that while (Stevenson and Merlo, 1999; Merlo and Stevenson, 2001) used a TRAN/INTRAN feature, in their case it was estimated in a completely different way using tagged data. Hence, while we use the same name for the feature here, it is not the same kind of feature as the one used in the cited work.</Paragraph>
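The relative-frequency estimate C(Vj)/N above is straightforward to sketch; the counts below are invented for illustration:

```python
from collections import Counter

def feature_distribution(counts):
    """Turn raw feature counts C(Vj) into relative frequencies C(Vj)/N."""
    n = sum(counts.values())           # N: total features of this type
    return {feat: c / n for feat, c in counts.items()}

# Invented voice-feature counts for one verb:
voice = Counter({"ACT": 30, "PASS": 10})
print(feature_distribution(voice))     # {'ACT': 0.75, 'PASS': 0.25}
```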
    </Section>
    <Section position="3" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
4.3 The causative feature: CAUS
</SectionTitle>
      <Paragraph position="0"> To obtain the causative values of the test verbs exactly, we would need to know the meaning of the sentences. In this paper, we approximate the value using the following approach. Note that the causative value is not a probability but a weight, which is subsequently normalized.</Paragraph>
      <Paragraph position="1"> We extract the subjects and objects of verbs and put them into two multisets. We use the last noun of the subject noun phrase and object noun phrase (tagged NN, NNS, NNP, or NNPS) as the subject and object of the sentence. Then the causative value is CAUS = overlap / (total number of elements in the subject and object multisets), where the overlap is defined as the largest multiset of elements belonging to both the subject and object multisets.</Paragraph>
      <Paragraph position="2"> If the subject multiset is {a, a, b, c} and the object multiset is {a, d}, the overlap between the two is {a, a}, and the causative value is 2/(4+2) = 1/3.</Paragraph>
      <Paragraph position="3"> If the subject multiset is {a, a, b, c} and the object multiset is {a, b, d}, the overlap between the two is {a, a, b}, and the causative value is 3/(4+3) = 3/7.</Paragraph>
      <Paragraph position="4"> Note that using this measure, we expect to get higher weights for tokens that occur frequently in the object position and sometimes in the subject position. For example, CAUS({a, b}, {a, b}) = 2/4 while CAUS({a, b}, {a, a, a}) = 3/5. This difference in the weight given by the CAUS feature is exploited in the classifier.</Paragraph>
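The CAUS weight can be sketched as follows. Matching the worked examples, we read the "largest multiset of elements belonging to both" as taking, for each token type shared by the two multisets, the larger of its two counts; that reading is our inference from the examples, not stated explicitly in the text:

```python
from collections import Counter

def caus(subjects, objects):
    """CAUS weight: overlap divided by the total size of both multisets.

    Overlap of each shared token type is the larger of its two counts
    (our reading of the paper's worked examples).
    """
    s, o = Counter(subjects), Counter(objects)
    overlap = sum(max(s[t], o[t]) for t in s.keys() & o.keys())
    return overlap / (sum(s.values()) + sum(o.values()))

print(caus(["a", "a", "b", "c"], ["a", "d"]))       # 2/6 = 1/3
print(caus(["a", "a", "b", "c"], ["a", "b", "d"]))  # 3/7
```

Both printed values reproduce the worked examples in the text.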
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.4 The animate feature: ANIM
</SectionTitle>
      <Paragraph position="0"> As with CAUS, we can only approximate the value of animacy. We use the following formula: ANIM = (number of occurrences of a pronoun in subject position) / (number of occurrences of the verb). The set of pronouns used is I, we, you, she, he, and they. In addition, we use the set of part-of-speech tags associated with animacy in the Penn Treebank tagset as part of the set of features described in the next section.</Paragraph>
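A minimal sketch of the ANIM ratio, assuming one subject-head token per occurrence of the verb (so the list length is the verb-occurrence count; that framing is ours, chosen to match the formula above):

```python
# Pronoun set taken from the text.
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim(subject_heads):
    """ANIM: fraction of a verb's subjects that are animate pronouns.

    subject_heads: one subject-head token per occurrence of the verb.
    """
    if not subject_heads:
        return 0.0
    hits = sum(1 for w in subject_heads if w.lower() in ANIMATE_PRONOUNS)
    return hits / len(subject_heads)

print(anim(["He", "company", "they", "stock"]))  # 0.5
```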
    </Section>
    <Section position="5" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.5 Part of Speech of object and subject
</SectionTitle>
      <Paragraph position="0"> The part-of-speech feature picks up several subtle cues about the di erences in the types of arguments selected by the verb in its subject or object position.</Paragraph>
      <Paragraph position="1"> We count the occurrence of the head nouns of the subject noun phrase and the object noun phrase.</Paragraph>
      <Paragraph position="2"> Then, we find the frequency distribution by using the same formula as before:</Paragraph>
      <Paragraph position="4"> P(Vj) = C(Vj) / N where P(Vj) is the distribution of part of speech j, N is the total number of relevant POS features, and C(Vj) is the number of occurrences of part of speech j. Also, we limit the part of speech to only the following tags: NNP, NNPS, EX, PRP, and SUCH, where NNP is a singular proper noun, NNPS is a plural proper noun, EX is existential 'there', PRP is a personal pronoun, and SUCH is 'such'.</Paragraph>
      <Paragraph position="5"> 4.6 Transitive and intransitive SF of the verb
To find values for this feature we use the technique described in Section 3. For each verb in our list we extract all the subsequent NP and PP chunks and their heads from the chunker output. We then perform subcategorization frame learning with all subsets of these extracted potential arguments. The counts are appropriately assigned to these subsets to provide a well-defined model. Using these counts and the methods in Section 3 we categorize a verb as either transitive or intransitive. For simplicity, any number of arguments above zero is considered to be a candidate for transitivity.</Paragraph>
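The subset-enumeration step can be sketched as below. This is only an illustration of generating candidate argument sets from the chunks observed after a verb; the paper's exact scheme for distributing counts over these subsets is not reproduced here, and the chunk labels are invented:

```python
from itertools import combinations

def candidate_frames(dependents):
    """All subsets of the NP/PP chunks observed after a verb occurrence,
    treated as candidate argument sets (including the empty frame)."""
    frames = []
    for r in range(len(dependents) + 1):
        for combo in combinations(dependents, r):
            frames.append(frozenset(combo))
    return frames

# "saw [NP Mary] [PP at the station]" -> {}, {NP}, {PP-at}, {NP, PP-at}
print(len(candidate_frames(["NP", "PP-at"])))  # 4
```

Counting each subset across a verb's occurrences yields the c(f, v) values fed into the likelihood ratio test of Section 3.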
    </Section>
    <Section position="6" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.7 Constructing the Classifier
</SectionTitle>
      <Paragraph position="0"> After we obtain all the probability distributions of the features of our test verbs, we use C5.0 (Quinlan, 1992) to construct the classifier. The data was annotated with the correct classification for each verb, and the classifier was evaluated using 10-fold cross-validation, testing on 10% of the data in each fold.</Paragraph>
    </Section>
  </Section>
</Paper>