<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1636"> <Title>Learning Phrasal Categories</Title> <Section position="4" start_page="301" end_page="302" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> A PCFG is a tuple $(V, M, u_0, R, q : R \to [0,1])$, where $V$ is a set of terminal symbols; $M = \{u_i\}$ is a set of nonterminal symbols; $u_0$ is a start or root symbol; $R$ is a set of productions of the form $u_i \to r$, where $r$ is a sequence of terminals and nonterminals; and $q$ is a family of probability distributions over rules conditioned on each rule's left-hand side.</Paragraph> <Paragraph position="1"> As in (Johnson, 1998) and (Klein and Manning, 2003), we annotate the Penn Treebank non-terminals with various context information. Suppose $u$ is a Treebank non-terminal. Let $l = u[a]$ denote the non-terminal category annotated with a vector of context features $a$. A PCFG is derived from the trees in the usual manner, with production rules taken directly from the annotated trees, and the probability of an annotated rule estimated by relative frequency: $q(l \to r) = C(l \to r) / C(l)$, where $C(l \to r)$ and $C(l)$ are the numbers of observations of the production and of its left-hand side, respectively.</Paragraph> <Paragraph position="2"> We refer to the grammar resulting from extracting annotated productions directly out of the treebank as the base grammar.</Paragraph> <Paragraph position="3"> Our goal is to partition the set of annotated non-terminals into clusters $\Phi = \{\phi_i\}$. Each possible clustering corresponds to a PCFG, with the set of non-terminals corresponding to the set of clusters.</Paragraph> <Paragraph position="4"> The probability of a production under this PCFG</Paragraph> <Paragraph position="6">
where $\phi_s \in \Phi$ are clusters of annotated non-terminals and where:</Paragraph> <Paragraph position="8"> We refer to the PCFG of some clustering as the clustered grammar.</Paragraph> <Section position="1" start_page="301" end_page="302" type="sub_section"> <SectionTitle> 2.1 Features </SectionTitle> <Paragraph position="0"> Most of the features we use are fairly standard.</Paragraph> <Paragraph position="1"> These include the label of the parent and grandparent of a node, its lexical head, and the part of speech of the head.</Paragraph> <Paragraph position="2"> Klein and Manning (2003) find it helpful to mark non-terminals that have unary rewrites. They also find it useful to annotate two preterminals (DT, RB) when they are the product of a unary production. We generalize this via two width features: the first marks a node with the number of non-terminals to which it rewrites; the second marks each preterminal with the width of its parent.</Paragraph> <Paragraph position="3"> Another feature is the span of a nonterminal, that is, the number of terminals it dominates, which we normalize by dividing by the length of the sentence. Hence preterminals have normalized spans of 1/(length of the sentence), while the root has a normalized span of 1.</Paragraph> <Paragraph position="4"> Extending the notion of a base NP introduced by Collins (1996), we mark any nonterminal that dominates only preterminals as Base. Collins inserts a unary NP over any base NP without an NP parent. However, Klein and Manning (2003) find that this hurts performance relative to just marking the NPs, and so our Base feature does not insert such a node.</Paragraph> <Paragraph position="5"> We have two features describing a node's position in the expansion of its parent. The first, which we call the inside position, specifies the nonterminal's position relative to the heir of its parent's head (to the left or to the right), or whether the nonterminal is itself the heir. (By &quot;heir&quot; we mean the constituent that donates its head, e.g.
the heir of an S is typically the VP under the S.) The second feature, outside position, specifies the nonterminal's position relative to the boundary of the constituent: whether it is the leftmost child, the rightmost child, or neither.</Paragraph> <Paragraph position="6"> Related to this, we noticed that several of Klein and Manning's (2003) features, such as marking NPs as right-recursive or possessive, have the property of annotating a node with the label of its rightmost child (NP and POS, respectively). We generalize this by marking every node with the labels of both its rightmost child and, analogously, its leftmost child.</Paragraph> <Paragraph position="7"> We also mark whether or not a node borders the end of a sentence, save for ending punctuation.</Paragraph> <Paragraph position="8"> (For instance, in this sentence, all the constituents with the second &quot;marked&quot; rightmost in their span would be marked.)</Paragraph> <Paragraph position="9"> Another Klein and Manning (2003) feature we try is the temporal NP feature, in which TMP markings in the treebank are retained and propagated down the head inheritance path of the tree. It is worth mentioning that all the features here come directly from the treebank. For instance, the part of speech of the head feature takes values only from the raw treebank tag set. When a preterminal cluster is split, this assignment does not change the value of this feature.</Paragraph> </Section> </Section> </Paper>