<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0505"> <Title>Efficient Hierarchical Entity Classifier Using Conditional Random Fields</Title> <Section position="5" start_page="33" end_page="34" type="metho"> <SectionTitle> 3 Conditional Random Fields </SectionTitle>
<Paragraph position="0"> Conditional random fields (CRFs) (Lafferty et al., 2001; Jordan, 1999; Wallach, 2004) are a statistical method based on undirected graphical models. Let X be a random variable over data sequences to be labeled and Y a random variable over corresponding label sequences. All components Y_i of Y are assumed to range over a finite label alphabet K.</Paragraph>
<Paragraph position="1"> In this paper X ranges over the sentences of a text, tagged with POS labels, and Y ranges over the synsets to be recognized in these sentences.</Paragraph>
<Paragraph position="2"> We define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element Y_v of Y. If each random variable Y_v obeys the Markov property with respect to G (e.g., in a first-order model the transition probability depends only on the neighboring state), then the model (Y, X) is a conditional random field. Although the structure of the graph G may be arbitrary, we limit the discussion here to graph structures in which the nodes corresponding to elements of Y form a simple first-order Markov chain.</Paragraph>
<Paragraph position="3"> A CRF defines a conditional probability distribution p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length and use x = (x_1, ..., x_T) and y = (y_1, ..., y_T) for an input sequence and label sequence respectively. Instead of defining a joint distribution over both label and observation sequences, the model defines a conditional probability over labeled sequences. A novel observation sequence x is labeled with y so that the conditional probability p(y|x) is maximized.</Paragraph>
<Paragraph position="4"> We define a set of K binary-valued features or feature functions f_k(y_{t-1}, y_t, x) that each express some characteristic of the empirical distribution of the training data that should also hold in the model distribution. An example of such a feature is an indicator function that fires when the current label is a particular synset and the current token carries a particular POS tag, for instance

  f_k(y_{t-1}, y_t, x) = 1 if y_t = "ENTITY" and the POS tag of x_t is NN, and 0 otherwise.   (1)

</Paragraph>
<Paragraph position="5"> Feature functions can depend on the previous (y_{t-1}) and the current (y_t) state. Considering K feature functions, the conditional probability distribution defined by the CRF is

  p(y|x) = (1/Z(x)) exp( Σ_{t=1}^{T} Σ_{k=1}^{K} λ_k f_k(y_{t-1}, y_t, x) )   (2)

where λ_k is a parameter to model the observed statistics and Z(x) is a normalizing constant computed as

  Z(x) = Σ_{y'} exp( Σ_{t=1}^{T} Σ_{k=1}^{K} λ_k f_k(y'_{t-1}, y'_t, x) ).

</Paragraph>
<Paragraph position="6"> This method can be thought of as a generalization of both the maximum entropy Markov model (MEMM) and the hidden Markov model (HMM). It brings together the best of discriminative and generative models: (1) it can accommodate many statistically correlated features of the inputs, in contrast with generative models, which often require conditional independence assumptions to make the computations tractable, and (2) it allows context-dependent learning by trading off decisions at different sequence positions to obtain a globally optimal labeling. Because CRFs adhere to the maximum entropy principle, they offer a valid solution when learning from incomplete information. Given that in information extraction tasks we often lack an annotated training set that covers all possible extraction patterns, this is a valuable asset.</Paragraph>
<Paragraph position="7"> Lafferty et al. (2001) have shown that CRFs outperform both MEMM and HMM on synthetic data and on a part-of-speech tagging task. Furthermore, CRFs have been used successfully in information extraction (Peng and McCallum, 2004), named entity recognition (Li and McCallum, 2003; McCallum and Li, 2003) and shallow parsing (Sha and Pereira, 2003).</Paragraph> </Section>
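To make the preceding definitions concrete, the following is a minimal brute-force sketch of a linear-chain CRF, not the paper's implementation: the toy labels, feature functions, weights and sentence are invented for illustration, the position t is passed to the feature functions explicitly, and Z(x) is computed by enumerating all label sequences, which is only viable for tiny examples (real implementations use the forward-backward and Viterbi algorithms discussed below).

    # Brute-force illustration of eq. (2): p(y|x) = exp(score(y,x)) / Z(x).
    from itertools import product
    from math import exp

    LABELS = ["NONE", "PERSON", "LOCATION"]                          # toy label alphabet
    sentence = [("John", "NNP"), ("lives", "VBZ"), ("here", "RB")]   # (token, POS) pairs

    # Binary feature functions f_k(y_prev, y_cur, x, t); purely illustrative.
    def f_cap_person(y_prev, y_cur, x, t):
        return 1.0 if x[t][0][0].isupper() and y_cur == "PERSON" else 0.0

    def f_rb_none(y_prev, y_cur, x, t):
        return 1.0 if x[t][1] == "RB" and y_cur == "NONE" else 0.0

    def f_person_then_none(y_prev, y_cur, x, t):
        return 1.0 if y_prev == "PERSON" and y_cur == "NONE" else 0.0

    features = [f_cap_person, f_rb_none, f_person_then_none]
    lambdas = [1.5, 1.0, 0.5]                                        # weights lambda_k

    def score(y, x):
        """Unnormalized log-score: sum over positions and feature functions."""
        s = 0.0
        for t in range(len(x)):
            y_prev = y[t - 1] if t > 0 else "START"
            s += sum(l * f(y_prev, y[t], x, t) for l, f in zip(lambdas, features))
        return s

    def p_y_given_x(y, x):
        """p(y|x) with Z(x) computed by explicit enumeration (exponential cost)."""
        Z = sum(exp(score(y2, x)) for y2 in product(LABELS, repeat=len(x)))
        return exp(score(y, x)) / Z

    best = max(product(LABELS, repeat=len(sentence)), key=lambda y: score(y, sentence))
    print(best, p_y_given_x(best, sentence))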
<Section position="6" start_page="34" end_page="34" type="metho"> <SectionTitle> 4 Parameter estimation </SectionTitle>
<Paragraph position="0"> In this section we explain in some detail how to derive the parameters θ = {λ_k}, given the training data. The problem can be considered as a constrained optimization problem, where we have to find a set of parameters which maximizes the log-likelihood of the conditional distribution (McCallum, 2003). We are confronted with the problem of efficiently calculating the expectation of each feature function with respect to the CRF model distribution for every observation sequence x in the training data. Formally, we are given a set of training examples D = {(x^(i), y^(i))}_{i=1}^{N}, where each x^(i) = {x^(i)_1, x^(i)_2, ..., x^(i)_T} is a sequence of inputs and y^(i) = {y^(i)_1, y^(i)_2, ..., y^(i)_T} is a sequence of the desired labels. We estimate the parameters by penalized maximum likelihood, optimizing the function

  l(θ) = Σ_{i=1}^{N} log p(y^(i)|x^(i)) − Σ_{k=1}^{K} λ_k² / 2σ².   (3)

</Paragraph>
<Paragraph position="1"> After substituting the CRF model (2) in the likelihood (3), we get the following expression:

  l(θ) = Σ_{i=1}^{N} [ Σ_{t=1}^{T} Σ_{k=1}^{K} λ_k f_k(y^(i)_{t-1}, y^(i)_t, x^(i)) − log Z(x^(i)) ] − Σ_{k=1}^{K} λ_k² / 2σ².

The function l(θ) cannot be maximized in closed form, so numerical optimization is used. The partial derivatives are:

  ∂l(θ)/∂λ_k = Σ_{i=1}^{N} Σ_{t=1}^{T} f_k(y^(i)_{t-1}, y^(i)_t, x^(i)) − Σ_{i=1}^{N} Σ_{t=1}^{T} Σ_{y,y'} f_k(y', y, x^(i)) p(y', y|x^(i)) − λ_k / σ².   (4)

</Paragraph>
<Paragraph position="2"> Using these derivatives, we can iteratively adjust the parameters θ (with limited-memory BFGS (Byrd et al., 1994)) until l(θ) has reached an optimum. During each iteration we have to calculate p(y', y|x^(i)). This can be done, as for the hidden Markov model, using the forward-backward algorithm (Baum and Petrie, 1966; Forney, 1996). This algorithm has a computational complexity of O(TM²), where T is the length of the sequence and M the number of labels. We have to execute the forward-backward algorithm once for every training instance during every iteration. The total cost of training a linear-chain CRF is thus O(TM²NG), where N is the number of training examples and G the number of iterations. We have found that this complexity is an important limiting factor when learning a large collection of labels. Employing CRFs to learn the 95076 WordNet synsets with 20133 training examples was not feasible on current hardware. In the next section we describe the method we have implemented to drastically reduce this complexity.</Paragraph> </Section>
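The expensive quantity in the gradient (4) is the pairwise marginal p(y', y|x^(i)). The sketch below is illustrative only (the authors' implementation is based on Mallet): it shows how one forward-backward pass over precomputed log-potentials G[t][i][j] = Σ_k λ_k f_k(i, j, x) yields log Z(x) and these marginals in O(TM²) time. The potential values are arbitrary toy numbers, and position 0 uses a conventional START row.

    # Forward-backward for a linear-chain CRF: log Z(x) and pairwise marginals.
    import math

    def log_sum_exp(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))

    def pairwise_marginals(G):
        """G: list of T matrices, G[t][i][j] = log-potential of (y_{t-1}=i, y_t=j).
        At t = 0 only row i = 0 (the START state) is used.
        Returns (logZ, marg) with marg[t-1][i][j] = p(y_{t-1}=i, y_t=j | x)."""
        T, M = len(G), len(G[0])
        # Forward: alpha[t][j] = log-sum over all prefixes ending in label j at t.
        alpha = [[-math.inf] * M for _ in range(T)]
        alpha[0] = [G[0][0][j] for j in range(M)]
        for t in range(1, T):
            for j in range(M):
                alpha[t][j] = log_sum_exp([alpha[t-1][i] + G[t][i][j] for i in range(M)])
        # Backward: beta[t][i] = log-sum over all suffixes starting with label i at t.
        beta = [[0.0] * M for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in range(M):
                beta[t][i] = log_sum_exp([G[t+1][i][j] + beta[t+1][j] for j in range(M)])
        logZ = log_sum_exp(alpha[T-1])
        marg = [[[math.exp(alpha[t-1][i] + G[t][i][j] + beta[t][j] - logZ)
                  for j in range(M)] for i in range(M)] for t in range(1, T)]
        return logZ, marg

    # Toy potentials: T = 3 positions, M = 2 labels.
    G = [[[0.2, 1.0], [0.0, 0.0]],
         [[0.5, 0.1], [0.3, 0.9]],
         [[0.4, 0.2], [0.8, 0.6]]]
    logZ, marg = pairwise_marginals(G)
    print(logZ, marg[0][0][1])   # log Z(x) and p(y_0=0, y_1=1 | x)

Both the forward and the backward pass loop over all M² label pairs at each of the T positions, which is exactly the per-instance, per-iteration O(TM²) cost discussed above.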
<Section position="7" start_page="34" end_page="37" type="metho"> <SectionTitle> 5 Reducing complexity </SectionTitle>
<Paragraph position="0"> In this section we show how we create groups of features for every label that enable an important reduction in the complexity of both labeling and training. We first discuss how these groups of features are created (section 5.1) and then how both labeling (section 5.2) and training (section 5.3) are performed using these groups.</Paragraph>
<Section position="1" start_page="35" end_page="36" type="sub_section"> <SectionTitle> 5.1 Hierarchical feature selection </SectionTitle>
<Paragraph position="0"> To reduce the complexity of CRFs, we assign a selection of features to every node in the hierarchical tree. As discussed in section 2, WordNet defines a relation between synsets which organises the synsets in a tree. In its current form this tree does not meet our needs: we need a tree where every label used for labeling corresponds to exactly one leaf node, and no label corresponds to a non-leaf node. We therefore modify the existing tree. We create a new top node ("top") and add the original tree as defined by WordNet as a subtree to this top node. We add leaf nodes corresponding to the labels "NONE", "ADJ", "ADV" and "VERB" to the top node, and for the other labels (the noun synsets) we add a leaf node to the node representing the corresponding synset. For example, we add a node corresponding to the label "ENTITY" to the node "entity". Fig. 3 pictures a fraction of this tree. Nodes corresponding to a label have an uppercase name; nodes not corresponding to a label have a lowercase name.</Paragraph>
<Paragraph position="1"> We use v to denote nodes of the tree. We call the top concept v_top; we write v+ for the parent of a concept v and v- for a child of v. We call A_v the collection of ancestors of a concept v, including v itself.</Paragraph>
<Paragraph position="2"> We will now show how we transform a regular CRF into a CRF that uses hierarchical feature selection. We first notice that we can rewrite eq. 2 as

  p(y|x) = (1/Z(x)) Π_{t=1}^{T} G(y_{t-1}, y_t, x), where G(y_{t-1}, y_t, x) = exp( Σ_{k=1}^{K} λ_k f_k(y_{t-1}, y_t, x) ).

We rewrite this equation because it will enable us to reduce the complexity of CRFs and because it has the property that p(y_t|y_{t-1}, x) ∝ G(y_{t-1}, y_t, x), which we will use in section 5.3.</Paragraph>
<Paragraph position="3"> We now define a collection of features F_v for every node v. If v is a leaf node, we define F_v as the collection of features f_k(y_{t-1}, y_t, x) for which it is possible to find a node v_{t-1} and input x such that f_k(v_{t-1}, v, x) ≠ 0. If v is a non-leaf node, we define F_v as the collection of features f_k(y_{t-1}, y_t, x) (1) which are elements of F_{v-} for every child node v- of v and (2) for which, for every pair of children v-_1 and v-_2 of v, f_k(v_{t-1}, v-_1, x) = f_k(v_{t-1}, v-_2, x) for every previous label v_{t-1} and input x.</Paragraph>
<Paragraph position="4"> Informally, F_v is the collection of features which are useful to evaluate for a certain node. For the leaf nodes, this is the collection of features that can possibly return a non-zero value. For non-leaf nodes, it is useful to evaluate the features belonging to F_v since they have the same value for all the descendants of that node (which we can put to good use, see further).</Paragraph>
<Paragraph position="5"> We define F'_v = F_v \ F_{v+}, where v+ is the parent of label v. For the top node v_top we define F'_{v_top} = F_{v_top}. We also set

  G'(y_{t-1}, v, x) = exp( Σ_{f_k ∈ F'_v} λ_k f_k(y_{t-1}, v, x) ).

</Paragraph>
<Paragraph position="6"> We have now organised the collection of features in such a way that we can use the hierarchical relations defined by WordNet when determining the probability of a certain labeling y. We first see that

  p(y|x) = (1/Z(x)) Π_{t=1}^{T} Π_{v ∈ A_{y_t}} G'(y_{t-1}, v, x).   (5)

This formula has exactly the same result as eq. 2.</Paragraph>
<Paragraph position="7"> Because we assigned a collection of features to every node, we can discard parts of the search space when searching for possible labelings, obtaining an important reduction in complexity. We elaborate this idea in the following sections for both labeling and training.</Paragraph> </Section>
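The following sketch illustrates the feature grouping of section 5.1 under a simplifying assumption not made in the paper: each feature fires, with a value that depends only on the input and the previous label, exactly when the current label lies in a fixed set of leaf labels (its "support"). Under that assumption F_v for an internal node is simply the set of features whose support covers all leaves below v, and F'_v = F_v \ F_{v+}. The toy tree and feature supports are invented; the paper works with the full WordNet-derived tree.

    # Sketch of hierarchical feature selection over a toy label tree.
    TREE = {                      # node -> children; leaves are the actual labels
        "top": ["NONE", "entity"],
        "entity": ["PERSON", "LOCATION"],
    }

    FEATURE_SUPPORT = {           # feature name -> leaf labels it can fire on
        "is_capitalised": {"PERSON", "LOCATION"},
        "pos_is_NNP": {"PERSON", "LOCATION", "NONE"},
        "in_first_name_gazetteer": {"PERSON"},
        "follows_preposition": {"LOCATION"},
    }

    def leaves(node):
        kids = TREE.get(node, [])
        if not kids:
            return {node}
        return set().union(*(leaves(c) for c in kids))

    def F(node):
        """F_v: features that are useful (and constant across descendants) at v."""
        lv = leaves(node)
        return {f for f, sup in FEATURE_SUPPORT.items() if lv <= sup}

    def F_prime(node, parent=None):
        """F'_v = F_v minus the features already accounted for at the parent."""
        return F(node) - (F(parent) if parent is not None else set())

    def show(node, parent=None, indent=0):
        print(" " * indent + node, sorted(F_prime(node, parent)))
        for child in TREE.get(node, []):
            show(child, node, indent + 2)

    show("top")
    # 'pos_is_NNP' is assigned at "top", 'is_capitalised' at "entity", and the
    # gazetteer / preposition features only at the PERSON / LOCATION leaves.

Evaluating a feature once at the highest node where it is shared, instead of once per leaf, is what makes the per-node potentials G' cheap to combine along the tree.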
<Section position="2" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 5.2 Labeling </SectionTitle>
<Paragraph position="0"> The standard method to label a sentence with CRFs is the Viterbi algorithm (Forney, 1973; Viterbi, 1967), which has a computational complexity of O(TM²). The basic idea for reducing this computational complexity is to select the best labeling in a number of iterations. In the first iteration, we label every word in a sentence with a label chosen from the top-level labels. After choosing the best labeling, we refine our choice (choose a child label of the previously chosen label) in subsequent iterations, until we arrive at a synset which has no children. In every iteration we only have to choose from a very small number of labels, thus breaking down the problem of selecting the correct label from a large number of labels into a number of smaller problems.</Paragraph>
<Paragraph position="1"> Formally, when labeling a sentence we find the label sequence y such that y has the maximum probability of all labelings. We estimate the best labeling in an iterative way: we start with the best labeling y^{top-1} = {y^{top-1}_1, ..., y^{top-1}_T}, choosing only from the children y^{top-1}_t of the top node. The probability of this labeling y^{top-1} is

  p(y^{top-1}|x) = (1/Z'(x)) Π_{t=1}^{T} G'(y^{top-1}_{t-1}, y^{top-1}_t, x),

where Z'(x) is an appropriate normalizing constant. We now select a labeling y^{top-2} so that at every position t the node y^{top-2}_t is a child of y^{top-1}_t. The probability of this labeling is (following eq. 5)

  p(y^{top-2}|x) = (1/Z''(x)) Π_{t=1}^{T} Π_{v ∈ A_{y^{top-2}_t}} G'(y^{top-2}_{t-1}, v, x).

After selecting a labeling y^{top-2} with maximum probability, we proceed by selecting a labeling y^{top-3} with maximum probability, and so on. We proceed until we reach a labeling in which every y_t is a node which has no children, and we return this labeling as the final labeling.</Paragraph>
<Paragraph position="2"> The assumption we make here is that if a node v is selected at position t of the most probable labeling y^{top-s}, its children v- have a larger probability of being selected at position t in the most probable labeling y^{top-s-1}. We reduce the number of labels we take into consideration by stating that for every concept v for which v ≠ y^{top-s}_t, we set G'(y_{t-1}, v-_t, x) = 0 for every child v- of v.</Paragraph>
<Paragraph position="3"> This reduces the space of possible labelings drastically, reducing the computational complexity of labeling. If q is the average number of children of a concept, the depth of the tree is log_q(M). On every level we have to execute the Viterbi algorithm for q labels, thus resulting in a total complexity of

  O(T q² log_q M).

</Paragraph> </Section>
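The coarse-to-fine decoding of section 5.2 can be sketched as repeated Viterbi passes over small per-position candidate sets, as below. The toy tree, the scoring function log_potential (a hypothetical stand-in for log G'(y_{t-1}, v, x)) and the example sentence are invented for illustration; nodes without children simply keep their label in later iterations.

    # Coarse-to-fine labeling: refine each chosen node into its children and
    # rerun Viterbi, until only leaves remain.
    TREE = {
        "top": ["NONE", "entity"],
        "entity": ["PERSON", "LOCATION"],
    }

    def children(v):
        return TREE.get(v, [])

    def log_potential(prev, node, sentence, t):
        """Toy stand-in for log G'(y_{t-1}, v, x); purely illustrative scores."""
        token = sentence[t]
        s = 0.0
        if token[0].isupper() and node in ("entity", "PERSON"):
            s += 2.0
        if not token[0].isupper() and node == "NONE":
            s += 1.0
        if prev == "PERSON" and node == "PERSON":
            s -= 0.5          # arbitrary transition preference
        return s

    def viterbi(candidates, sentence):
        """Standard Viterbi over per-position candidate sets (O(T * q^2))."""
        T = len(sentence)
        best = [{c: (log_potential("START", c, sentence, 0), [c]) for c in candidates[0]}]
        for t in range(1, T):
            layer = {}
            for c in candidates[t]:
                score, path = max(
                    (best[t-1][p][0] + log_potential(p, c, sentence, t),
                     best[t-1][p][1] + [c]) for p in candidates[t-1])
                layer[c] = (score, path)
            best.append(layer)
        return max(best[-1].values())[1]

    def hierarchical_label(sentence):
        labeling = ["top"] * len(sentence)
        while any(children(v) for v in labeling):
            # Each position may only choose among the children of its current
            # node; nodes that are already leaves stay as they are.
            candidates = [children(v) or [v] for v in labeling]
            labeling = viterbi(candidates, sentence)
        return labeling

    print(hierarchical_label(["John", "sleeps", "here"]))   # ['PERSON', 'NONE', 'NONE']

Each Viterbi pass costs O(T q²) and there are at most log_q(M) passes, matching the complexity bound given above.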
<Section position="3" start_page="36" end_page="37" type="sub_section"> <SectionTitle> 5.3 Training </SectionTitle>
<Paragraph position="0"> We will now discuss how we reduce the computational complexity of training. As explained in section 4, we have to estimate the parameters λ_k that optimize the function l(θ). We will show here how we can reduce the computational complexity of the calculation of the partial derivatives ∂l(θ)/∂λ_k (eq. 4). The predominant factor with regard to the computational complexity in the evaluation of this equation is the calculation of p(y_{t-1}, y|x^(i)).</Paragraph>
<Paragraph position="1"> Recall that we do this with the forward-backward algorithm, which has a computational complexity of O(TM²). We reduce the number of labels to improve performance. We do this by making the same assumption as in the previous section: for every concept v at level s for which v ≠ y^{top-s}_t, we set G'(y_{t-1}, v-_t, x) = 0 for every child v- of v. Since (as noted in section 5.1) p(v_t|y_{t-1}, x) ∝ G(y_{t-1}, v_t, x), this has the consequence that p(v_t|y_{t-1}, x) = 0 and that p(v_t, y_{t-1}|x) = 0. Fig. 4 gives a graphical representation of this reduction of the search space. The correct label here is "LABEL1"; the grey nodes have a non-zero p(v_t, y_{t-1}|x) and the white nodes have a zero p(v_t, y_{t-1}|x).</Paragraph>
<Paragraph position="2"> In the forward-backward algorithm we only have to take into account every node v that has a non-zero p(v, y_{t-1}|x). As can easily be seen from fig. 4, the number of such nodes is q log_q(M), where q is the average number of children of a concept. The total complexity of running the forward-backward algorithm is O(T (q log_q M)²). Since we have to run this algorithm once for every training instance during every iteration of the gradient computation, the total cost of training becomes

  O(T (q log_q M)² N G).

</Paragraph> </Section> </Section>
<Section position="8" start_page="37" end_page="37" type="metho"> <SectionTitle> 6 Implementation </SectionTitle>
<Paragraph position="0"> To implement the described method we need two components: an interface to the WordNet database and an implementation of CRFs using a hierarchical model. JWordNet is a Java interface to WordNet developed by Oliver Steele (which can be found at http://jwn.sourceforge.net/). We used this interface to extract the WordNet hierarchy.</Paragraph>
<Paragraph position="1"> An implementation of CRFs using the hierarchical model was obtained by adapting the Mallet package. The Mallet package (McCallum, 2002) is an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, and information extraction. It also offers an efficient implementation of CRFs. We have adapted this implementation so that it creates hierarchical selections of features which are then used for training and labeling.</Paragraph>
<Paragraph position="2"> We used the Semcor corpus (Fellbaum et al., 1998; Landes et al., 1998) for training. This corpus, which was created by Princeton University, is a subset of the English Brown corpus containing almost 700,000 words. Every sentence in the corpus is noun-phrase chunked. The chunks are tagged by POS, and both noun and verb phrases are tagged with their WordNet sense. Since we do not want to learn a classification for verb synsets, we replace the tags of the verbs with one tag ("VERB").</Paragraph> </Section> </Paper>