<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2177"> <Title>Statistical Models for Unsupervised Prepositional Phrase Attachment</Title>
<Section position="5" start_page="1079" end_page="1081" type="metho"> <SectionTitle> 3 Unsupervised Prepositional Phrase Attachment </SectionTitle>
<Paragraph position="0"> The exact task of our algorithm is to construct a classifier cl that maps an instance of an ambiguous prepositional phrase (v, n, p, n2) to either N or V, corresponding to noun attachment or verb attachment, respectively. In the full natural language parsing task there are more than two potential attachment sites, but we limit our task to choosing between a verb v and a noun n so that we may compare with previous supervised attempts on this problem.</Paragraph>
<Paragraph position="1"> While we will be given the candidate attachment sites during testing, the training procedure assumes no a priori information about potential attachment sites.</Paragraph>
<Section position="2" start_page="1079" end_page="1079" type="sub_section"> <SectionTitle> 3.1 Generating Training Data From Raw Text </SectionTitle>
<Paragraph position="0"> We generate training data from raw text by using a part-of-speech tagger, a simple chunker, an extraction heuristic, and a morphology database. The order in which these tools are applied to raw text is shown in Table 1. The tagger from (Ratnaparkhi, 1996) first annotates sentences of raw text with a sequence of part-of-speech tags. The chunker, implemented with two small regular expressions, then replaces simple noun phrases and quantifier phrases with their head words. The extraction heuristic then finds head word tuples and their likely attachments from the tagged and chunked text. The heuristic relies on the observed fact that in English, and in languages with similar word order, the attachment site of a preposition is usually located only a few words to the left of the preposition. Finally, numbers are replaced by a single token, the text is converted to lower case, and the morphology database is used to find the base forms of the verbs and nouns.</Paragraph>
<Paragraph position="1"> The extracted head word tuples differ from the training data used in previous supervised attempts in an important way. In the supervised case, both of the potential attachment sites, namely the verb v and the noun n, are known before the attachment is resolved. In the unsupervised case discussed here, the extraction heuristic finds only what it thinks are unambiguous cases of prepositional phrase attachment. Therefore, there is only one possible attachment site for the preposition, and either the verb v or the noun n does not exist, in the case of a noun-attached or a verb-attached preposition, respectively. This extraction heuristic loosely resembles a step in the bootstrapping procedure used to get training data for the classifier of (Hindle and Rooth, 1993). In that step, unambiguous attachments from the FIDDITCH parser's output are initially used to resolve some of the ambiguous attachments, and the resolved cases are iteratively used to disambiguate the remaining unresolved cases. Our procedure differs critically from (Hindle and Rooth, 1993) in that we do not iterate, we extract unambiguous attachments from unparsed input sentences, and we totally ignore the ambiguous cases. It is the hypothesis of this approach that the information in just the unambiguous attachment events can resolve the ambiguous attachment events of the test data.</Paragraph>
</Section> <Section position="3" start_page="1079" end_page="1081" type="sub_section"> <SectionTitle> 3.1.1 Heuristic Extraction of Unambiguous Cases </SectionTitle>
<Paragraph position="0"> Given a tagged and chunked sentence, the extraction heuristic returns head word tuples of the form (v, p, n2) or (n, p, n2), where v is the verb, n is the noun, p is the preposition, and n2 is the object of the preposition. The main idea of the extraction heuristic is that the attachment site of a preposition is usually within a few words to the left of the preposition. We extract (v, p, n2) if:

* p is a preposition
* v is the first verb that occurs within K words to the left of p
* No noun occurs between v and p
* n2 is the head of the first noun phrase that occurs within K words to the right of p
* No verb occurs between p and n2

We extract (n, p, n2) if:

* p is a preposition
* n is the first noun that occurs within K words to the left of p
* No verb occurs within K words to the left of p
* n2 is the head of the first noun phrase that occurs within K words to the right of p
* No verb occurs between p and n2

Table 1 also shows the result of applying the extraction heuristic to a sample sentence (The professional conduct of lawyers in other jurisdictions is guided by American Bar Association rules or by state bar ethics codes, none of which permit non-lawyers to be partners in law firms.). The heuristic ignores cases where p = of, since such cases are rarely ambiguous, and we opt to model them deterministically as noun attachments. We will report accuracies (in Section 5) on both cases where p = of and cases where p ≠ of. Also, the heuristic excludes examples with the verb to be from the training set (but not the test set), since we found them to be unreliable sources of evidence.</Paragraph>
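To make the procedure concrete, here is a minimal sketch of the extraction heuristic, assuming the input sentence is already tagged and chunked into (word, tag) pairs with noun and quantifier phrases replaced by their head words. The window size K, the tag tests, and all function names are our own illustrative assumptions, not fixed by the paper.

```python
# A minimal sketch of the extraction heuristic of Section 3.1.1 (our
# reading of it, not the authors' code). Input: one tagged, chunked
# sentence as (word, tag) pairs. K and the tag tests are assumptions.

K = 5  # window size; the paper leaves K unspecified here
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}

def is_prep(tag): return tag == "IN"
def is_verb(tag): return tag.startswith("VB")
def is_noun(tag): return tag.startswith("NN")

def extract_tuples(tagged):
    """Return unambiguous (head, p, n2, site) tuples from one sentence."""
    out = []
    for i, (p, tag) in enumerate(tagged):
        if not is_prep(tag) or p == "of":  # p = of is modeled deterministically
            continue
        # n2: head of the first noun phrase within K words to the right
        # of p, with no verb occurring between p and n2
        n2 = None
        for w2, t2 in tagged[i + 1:i + 1 + K]:
            if is_verb(t2):
                break
            if is_noun(t2):
                n2 = w2
                break
        if n2 is None:
            continue
        left = tagged[max(0, i - K):i]
        verbs = [(j, w) for j, (w, t) in enumerate(left) if is_verb(t)]
        nouns = [(j, w) for j, (w, t) in enumerate(left) if is_noun(t)]
        if verbs and (not nouns or nouns[-1][0] < verbs[-1][0]):
            # closest left candidate is a verb, with no noun between it and p
            if verbs[-1][1] not in BE_FORMS:  # to be is excluded in training
                out.append((verbs[-1][1], p, n2, "V"))
        elif nouns and not verbs:
            # a noun, with no verb anywhere in the left window
            out.append((nouns[-1][1], p, n2, "N"))
    return out
```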
</Section> <Section position="4" start_page="1081" end_page="1081" type="sub_section"> <SectionTitle> 3.2 Accuracy of Extraction Heuristic </SectionTitle>
<Paragraph position="0"> Applying the extraction heuristic to 970K unannotated sentences from the 1988 Wall St. Journal data yields approximately 910K unique head word tuples of the form (v, p, n2) or (n, p, n2). The extraction heuristic is far from perfect; when applied to and compared with the annotated Wall St. Journal data of the Penn treebank, only 69% of the extracted head word tuples represent correct attachments. The extracted tuples are meant to be a noisy but abundant substitute for the information that one might get from a treebank. Tables 2 and 3 list the most frequent extracted head word tuples for unambiguous verb and noun attachments, respectively. Many of the frequent noun-attached (n, p, n2) tuples, such as num to num, are incorrect. The prepositional phrase to num is usually attached to a verb such as rise or fall in the Wall St. Journal domain, e.g., Profits rose 46% to 52 million.</Paragraph>
</Section> </Section> <Section position="6" start_page="1081" end_page="1082" type="metho"> <SectionTitle> 4 Statistical Models </SectionTitle>
<Paragraph position="0"> While the extracted tuples of the form (n, p, n2) and (v, p, n2) represent unambiguous noun and verb attachments in which either the verb or the noun is known, our eventual goal is to resolve ambiguous attachments in the test data of the form (v, n, p, n2), in which both the noun n and the verb v are always known. We must therefore use any information in the unambiguous cases to resolve the ambiguous cases. A natural way is to use a classifier that compares the probability of each outcome:

cl(v, n, p) = argmax_{a ∈ {N, V}} Pr(v, n, p, a)    (1)

We do not currently use n2 in the probability model, and we omit it from further discussion. We can factor Pr(v, n, p, a) as follows:

Pr(v, n, p, a) = Pr(v) Pr(n) Pr(a | v, n) Pr(p | a, v, n)

The terms Pr(n) and Pr(v) are independent of the attachment a and need not be computed in (1), but the estimation of Pr(a | v, n) and Pr(p | a, v, n) is problematic, since our training data, i.e., the head words extracted from raw text, occur with either n or v, but never with both. This leads us to make some heuristically motivated approximations. Let the random variable φ range over {true, false}, and let it denote the presence or absence of any preposition that is unambiguously attached to the noun or verb in question. Then Pr(φ = true | n) is the conditional probability that a particular noun n in free text has an unambiguous prepositional phrase attachment. (φ = true will be written simply as true.) We approximate Pr(a | v, n) as follows:

Pr(a = N | v, n) ≈ Pr(true | n) / Z(v, n)
Pr(a = V | v, n) ≈ Pr(true | v) / Z(v, n)

The rationale behind this approximation is that the tendency of a (v, n) pair towards a noun (verb) attachment is related to the tendency of the noun (verb) alone to occur with an unambiguous prepositional phrase. The Z(v, n) term exists only to make the approximation a well-formed probability over a ∈ {N, V}. We approximate Pr(p | a, v, n) as follows:

Pr(p | a = N, v, n) ≈ Pr(p | n, true)
Pr(p | a = V, v, n) ≈ Pr(p | v, true)

The rationale behind these approximations is that when generating p given a noun (verb) attachment, only the counts involving the noun (verb) are relevant, assuming also that the noun (verb) has an attached prepositional phrase, i.e., φ = true.</Paragraph>
<Paragraph position="1"> We use word statistics from both the tagged corpus and the set of extracted head word tuples to estimate the probability of generating φ = true, p, and n2. Counts from the extracted set of tuples assume that φ = true, while counts from the corpus itself may correspond to either φ = true or φ = false, depending on whether the noun or verb in question is, or is not, unambiguously attached to a preposition.</Paragraph>
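Read operationally, equation (1) with these approximations reduces to comparing two products. The sketch below assumes the probability estimates of Sections 4.1 and 4.2 are supplied as functions; the function names are ours, not the paper's.

```python
# Sketch of the decision rule in equation (1) under the approximations
# above. pr_true_n(n), pr_true_v(v), pr_p_n(p, n), pr_p_v(p, v) stand for
# Pr(true|n), Pr(true|v), Pr(p|n,true), Pr(p|v,true); see Sections 4.1-4.2.

def classify(v, n, p, pr_true_n, pr_true_v, pr_p_n, pr_p_v):
    """Resolve the ambiguous tuple (v, n, p) to 'N' or 'V'."""
    # Pr(a|v,n): the separate tendencies of n and v to take an unambiguous
    # PP, renormalized; Z(v,n) makes this a proper distribution over {N,V}
    z = pr_true_n(n) + pr_true_v(v)
    pr_a_N = pr_true_n(n) / z
    pr_a_V = pr_true_v(v) / z
    # Pr(p|a,v,n): generated from the noun (verb) counts alone;
    # Pr(n) and Pr(v) are independent of a and drop out of the argmax
    score_N = pr_a_N * pr_p_n(p, n)
    score_V = pr_a_V * pr_p_v(p, v)
    return "N" if score_N > score_V else "V"
```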
<Section position="1" start_page="1082" end_page="1082" type="sub_section"> <SectionTitle> 4.1 Generate φ </SectionTitle>
<Paragraph position="0"> The quantities Pr(true | n) and Pr(true | v) denote the conditional probabilities that n or v will occur with some unambiguously attached preposition, and are estimated as follows:

Pr(true | n) = c(n, true) / c(n)
Pr(true | v) = c(v, true) / c(v)

where c(n) and c(v) are counts from the tagged corpus, and where c(n, true) and c(v, true) are counts from the extracted head word tuples.</Paragraph>
</Section> <Section position="2" start_page="1082" end_page="1082" type="sub_section"> <SectionTitle> 4.2 Generate p </SectionTitle>
<Paragraph position="0"> The terms Pr(p | n, true) and Pr(p | v, true) denote the conditional probability that a particular preposition p will occur as an unambiguous attachment to n or v. We present two techniques to estimate this probability, one based on bigram counts (Section 4.2.1) and another based on an interpolation method (Section 4.2.2).</Paragraph>
<Paragraph position="1"> 4.2.1 Bigram Counts. This technique uses the bigram counts of the extracted head word tuples, and backs off to the uniform distribution when the denominator is zero:

Pr(p | n, true) = c(n, p, true) / c(n, true) if c(n, true) > 0, and 1/|P| otherwise
Pr(p | v, true) = c(v, p, true) / c(v, true) if c(v, true) > 0, and 1/|P| otherwise

where P is the set of possible prepositions and where all the counts c(...) are from the extracted head word tuples.</Paragraph>
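The following sketch implements the estimates of Sections 4.1 and 4.2.1 over count tables; the dict-based table layout is our own assumption. Two such tables, one built from the noun-attached tuples and one from the verb-attached tuples, would supply the pr_true_* and pr_p_* arguments of the classify sketch above.

```python
# Sketch of the count-based estimates (Sections 4.1 and 4.2.1). c_w holds
# corpus counts c(n) or c(v); c_w_true and c_wp_true hold c(.,true) and
# c(.,p,true) from the extracted tuples. Table layout is our assumption.

class BigramEstimates:
    def __init__(self, c_w, c_w_true, c_wp_true, prepositions):
        self.c_w = c_w              # word -> count in the tagged corpus
        self.c_w_true = c_w_true    # word -> count in extracted tuples
        self.c_wp_true = c_wp_true  # (word, prep) -> count in extracted tuples
        self.n_preps = len(prepositions)

    def pr_true(self, w):
        # Pr(true|w) = c(w, true) / c(w); assumes w was seen in the corpus
        return self.c_w_true.get(w, 0) / self.c_w[w]

    def pr_p(self, p, w):
        # Pr(p|w, true) = c(w, p, true) / c(w, true), backing off to the
        # uniform distribution over prepositions when c(w, true) = 0
        denom = self.c_w_true.get(w, 0)
        if denom == 0:
            return 1.0 / self.n_preps
        return self.c_wp_true.get((w, p), 0) / denom
```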
<Paragraph position="2"> 4.2.2 Interpolation. This technique is similar to the one in (Hindle and Rooth, 1993), and interpolates between the tendencies of the (v, p) and (n, p) bigrams and the tendency of the type of attachment (e.g., N or V) towards a particular preposition p. First, define c_N(p) = Σ_n c(n, p, true) as the number of noun-attached tuples with the preposition p, and define c_N = Σ_p c_N(p) as the number of noun-attached tuples. Analogously, define c_V(p) = Σ_v c(v, p, true) and c_V = Σ_p c_V(p). The counts c(n, p, true) and c(v, p, true) are from the extracted head word tuples. Using the above notation, we can interpolate as follows:

Pr(p | n, true) = (c(n, p, true) + c_N(p)/c_N) / (c(n, true) + 1)
Pr(p | v, true) = (c(v, p, true) + c_V(p)/c_V) / (c(v, true) + 1)</Paragraph>
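Below is a sketch of this interpolated estimate for the noun case (the verb case is symmetric). The dict-based count tables are our own representation, and the closed form mirrors the Hindle and Rooth (1993) style of adding the attachment-wide preposition prior as a single pseudo-count.

```python
# Sketch of the Section 4.2.2 interpolation, noun case (verb symmetric).
# c_np_true[(n, p)] = c(n, p, true) and c_n_true[n] = c(n, true), both
# from the extracted tuples; the table layout is our assumption.

def make_interp_pr_p(c_np_true, c_n_true):
    # cN(p) = sum over n of c(n, p, true); cN = sum over p of cN(p)
    cN_p = {}
    for (n, p), cnt in c_np_true.items():
        cN_p[p] = cN_p.get(p, 0) + cnt
    cN = sum(cN_p.values())

    def pr_p(p, n):
        # Pr(p|n,true) = (c(n,p,true) + cN(p)/cN) / (c(n,true) + 1):
        # the attachment-wide prior cN(p)/cN acts as one pseudo-count
        prior = cN_p.get(p, 0) / cN
        return (c_np_true.get((n, p), 0) + prior) / (c_n_true.get(n, 0) + 1)

    return pr_p
```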
</Section> </Section> <Section position="7" start_page="1082" end_page="1083" type="metho"> <SectionTitle> 5 Evaluation in English </SectionTitle>
<Paragraph position="0"> Approximately 970K unannotated sentences from the 1988 Wall St. Journal were processed in a manner identical to the example sentence in Table 1. The result was approximately 910,000 head word tuples of the form (v, p, n2) or (n, p, n2). Note that while the head word tuples represent correct attachments only 69% of the time, their quantity is about 45 times greater than the quantity of data used in previous supervised approaches. The extracted data was used as training material for the three classifiers cl_base, cl_interp, and cl_bigram. Each classifier is constructed as follows:</Paragraph>
<Paragraph position="1"> cl_base: This is the &quot;baseline&quot; classifier, which predicts N if p = of, and V otherwise.</Paragraph>
<Paragraph position="2"> cl_interp: This classifier has the form of equation (1), uses the method in Section 4.1 to generate φ, and uses the method in Section 4.2.2 to generate p.</Paragraph>
<Paragraph position="3"> cl_bigram: This classifier has the form of equation (1), uses the method in Section 4.1 to generate φ, and uses the method in Section 4.2.1 to generate p.</Paragraph>
<Paragraph position="4"> Table 4 shows the accuracies of the classifiers on the test set of (Ratnaparkhi et al., 1994), which is derived from the manually annotated attachments in the Penn Treebank Wall St. Journal data. The Penn Treebank is drawn from the 1989 Wall St. Journal data, so there is no possibility of overlap with our training data. Furthermore, the extraction heuristic was developed and tuned on a &quot;development set&quot;, i.e., a set of annotated examples that did not overlap with either the test set or the training set.</Paragraph>
<Paragraph position="5"> Table 5 shows the two probabilities Pr(a | v, n) and Pr(p | a, v, n), using the same approximations as cl_bigram, for the ambiguous example rise num to num. (Recall that Pr(v) and Pr(n) are not needed.) While the tuple (num, to, num) is more frequent than (rise, to, num), the conditional probabilities prefer a = V, which is the choice that maximizes Pr(v, n, p, a).</Paragraph>
<Paragraph position="6"> Both classifiers cl_interp and cl_bigram clearly outperform the baseline, but cl_interp does not outperform cl_bigram, even though it interpolates between the less specific evidence (the preposition counts) and the more specific evidence (the bigram counts). This may be due to the errors in our extracted training data; supervised classifiers that train from clean data typically benefit greatly from combining less specific evidence with more specific evidence.</Paragraph>
<Paragraph position="7"> Despite the errors in the training data, the performance of the unsupervised classifiers (81.9%) begins to approach the best performance of the comparable supervised classifiers (84.5%). (Our goal is to replicate the supervision of a treebank, but not of a semantic dictionary, so we do not compare against (Stetina and Nagao, 1997).) Furthermore, we do not use the second noun n2, whereas the best supervised methods use this information. Our result shows that the information in imperfect but abundant data from unambiguous attachments, as shown in Tables 2 and 3, is sufficient to resolve ambiguous prepositional phrase attachments at accuracies just under the supervised state of the art.</Paragraph>
</Section> </Paper>