<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2031"> <Title>Assigning Function Tags to Parsed Text*</Title> <Section position="3" start_page="234" end_page="236" type="intro"> <SectionTitle> 2 Features </SectionTitle> <Paragraph position="0"> We have found it useful to define our statistical model in terms of features. A 'feature', in this context, is a boolean-valued function, generally over parse tree nodes and either node labels or lexical items. Features can be fairly simple and easily read off the tree (e.g. 'this node's label is X', 'this node's parent's label is Y'), or slightly more complex ('this node's head's part-of-speech is Z'). This is concordant with the usage in the maximum entropy literature (Berger et al., 1996).</Paragraph> <Paragraph position="1"> When using a number of known features to guess an unknown one, the usual procedure is to calculate the value of each feature, and then essentially look up the empirically most probable value for the feature to be guessed based on those known values. Due to sparse data, some of the features later in the list may need to be ignored; thus the probability of an unknown feature value would be estimated as</Paragraph> <Paragraph position="3"> where/3 refers to an empirically observed probability. Of course, if features 1 through i only co-occur a few times in the training, this value may not be reliable, so the empirical probability is usually smoothed:</Paragraph> <Paragraph position="5"> The values for )~i can then be determined according to the number of occurrences of features 1 through i together in the training.</Paragraph> <Paragraph position="6"> One way to think about equation 1 (and specifically, the notion that j will depend on the values of fl... fn) is as follows: We begin with the prior probability of f. If we have data indicating P(flfl), we multiply in that likelihood, while dividing out the original prior. If we have data for/3(flfl, f2), we multiply that in while dividing out the P(flfl) term. This is repeated for each piece of feature data we have; at each point, we are adjusting the probability</Paragraph> <Paragraph position="8"> we already have estimated. If knowledge about feature fi makes S more likely than with just fl... fi-1, the term where fi is added will be greater than one and the running probability will be adjusted upward. This gives us the new probability shown in equation 3, which is exactly equivalent to equation 1 since everything except the last numerator cancels out of the equation. The value of j is chosen such that features fl..-fj are sufficiently represented in the training data; sometimes all n features are used, but often that would cause sparse data problems. Smoothing is performed on this equation exactly as before: each term is interpolated between the empirical value and the prior estimated probability, according to a value of Ai that estimates confidence. But aside from perhaps providing a new way to think about the problem, equation 3 is not particularly useful as it is--it is exactly the same as what we had before. Its real usefulness comes, as shown in (Charniak, 1999), when we move from the notion of a feature chain to a feature tree.</Paragraph> <Paragraph position="9"> These feature chains don't capture everything we'd like them to. 
<Paragraph position="9"> These feature chains don't capture everything we'd like them to. If there are two independent features that are each relatively sparse but occasionally carry a lot of information, then putting one before the other in a chain will effectively block the second from having any effect, since its information is (uselessly) conditioned on the first one, whose sparseness will completely dilute any gain. What we'd really like is to be able to have a feature tree, whereby we can condition those two sparse features independently on one common predecessor feature. As we said before, equation 3 represents, for each feature $f_i$, the probability of f based on $f_i$ and all its predecessors, divided by the probability of f based only on the predecessors. In the chain case, this means that the denominator is conditioned on every feature from 1 to i - 1, but if we use a feature tree, it is conditioned only on those features along the path to the root of the tree.</Paragraph>
<Paragraph position="10"> A notable issue with feature trees as opposed to feature chains is that the terms do not all cancel out. Every leaf on the tree will be represented in the numerator, and every fork in the tree (from which multiple nodes depend) will be represented at least once in the denominator. For example: in figure 3 we have a small feature tree that has one target feature and four conditioning features. Features b and d are independent of each other, but each depends on a; c depends directly only on b. The unsmoothed version of the corresponding equation would be</Paragraph>
<Paragraph position="11"> $$P(f \mid a, b, c, d) \approx \hat{P}(f)\,\frac{\hat{P}(f \mid a)}{\hat{P}(f)}\,\frac{\hat{P}(f \mid a, b)}{\hat{P}(f \mid a)}\,\frac{\hat{P}(f \mid a, b, c)}{\hat{P}(f \mid a, b)}\,\frac{\hat{P}(f \mid a, d)}{\hat{P}(f \mid a)} \quad (4)$$ </Paragraph>
<Paragraph position="12"> Note that strictly speaking the result is not a probability distribution. It could be made into one with an appropriate normalisation (the so-called partition function in the maximum-entropy literature). However, if the independence assumptions made in the derivation of equation 4 are good ones, the partition function will be close to 1.0. We assume this to be the case for our feature trees.</Paragraph>
<Paragraph position="13"> Now we return the discussion to function tagging. There are a number of features that seem to condition strongly for one function tag or another; we have assembled them into the feature tree shown in figure 4. This figure should be relatively self-explanatory, except for the notion of an 'alternate head'; currently, an alternate head is only defined for prepositional phrases, and is the head of the object of the prepositional phrase. This data is very important in distinguishing, for example, 'by John' (where John might be a logical subject) from 'by next year' (a temporal modifier) and 'by selling it' (an adverbial indicating manner).</Paragraph>
</Section>
</Paper>
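For concreteness, here is a minimal sketch of the unsmoothed feature-tree product of equation 4: each tree node contributes the ratio of $\hat{P}$ conditioned on its root path plus itself over $\hat{P}$ conditioned on its root path alone, so leaves end up in numerators and forks reappear in denominators. This is not the authors' implementation; the data structures, the count-based estimates, and the toy data are all assumptions made for the sketch.

```python
# A minimal sketch of the unsmoothed feature-tree product in equation 4.
# Not the paper's implementation: data structures, count-based estimates,
# and toy data are illustrative assumptions.
from collections import Counter

class FeatureNode:
    """One conditioning feature; its children are conditioned on it and its ancestors."""
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def root_paths(roots):
    """All root-to-node paths in the tree, plus the empty path for the prior."""
    def walk(node, prefix):
        path = prefix + (node.name,)
        yield path
        for child in node.children:
            yield from walk(child, path)
    return [()] + [p for root in roots for p in walk(root, ())]

def train(roots, samples):
    """Count (path, values-along-path, target) triples and the corresponding contexts."""
    joint, context = Counter(), Counter()
    paths = root_paths(roots)
    for values, target in samples:
        for path in paths:
            key = tuple(values[name] for name in path)
            joint[(path, key, target)] += 1
            context[(path, key)] += 1
    return joint, context

def p_hat(path, values, target, joint, context):
    """Empirical P(target | features on `path`), or None if the context is unseen."""
    key = tuple(values[name] for name in path)
    n = context[(path, key)]
    return joint[(path, key, target)] / n if n else None

def tree_estimate(roots, values, target, joint, context):
    """Product of per-node ratios: leaves in numerators, forks repeated in denominators."""
    prob = p_hat((), values, target, joint, context) or 0.0     # prior P_hat(f)

    def walk(node, parent_path):
        nonlocal prob
        path = parent_path + (node.name,)
        num = p_hat(path, values, target, joint, context)
        den = p_hat(parent_path, values, target, joint, context)
        if num is not None and den:                             # skip sparse terms
            prob *= num / den
        for child in node.children:
            walk(child, path)

    for root in roots:
        walk(root, ())
    return prob

# The tree of figure 3: b and d each depend on a; c depends only on b.
tree = [FeatureNode("a", [FeatureNode("b", [FeatureNode("c")]), FeatureNode("d")])]
data = [({"a": 1, "b": 0, "c": 1, "d": 0}, "X"),
        ({"a": 1, "b": 1, "c": 0, "d": 0}, "Y"),
        ({"a": 0, "b": 0, "c": 1, "d": 1}, "X")]
joint, context = train(tree, data)
print(tree_estimate(tree, {"a": 1, "b": 0, "c": 1, "d": 0}, "X", joint, context))
```

Smoothing in the style of equation 2 could be layered on top of this sketch by interpolating each per-node ratio's numerator with its parent's estimate, weighted by a confidence value tied to the context count.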