Composition of Conditional Random Fields for Transfer Learning

2 Linear-chain CRFs

Conditional random fields (CRFs) (Lafferty et al., 2001) are undirected graphical models that are conditionally trained. In this section, we describe CRFs for the linear-chain case. Linear-chain CRFs can be roughly understood as conditionally-trained finite state machines. A linear-chain CRF defines a distribution over state sequences $s = \{s_1, s_2, \ldots, s_T\}$ given an input sequence $x = \{x_1, x_2, \ldots, x_T\}$ by making a first-order Markov assumption on states. These Markov assumptions imply that the distribution over sequences factorizes in terms of pairwise functions $\Phi_t(s_{t-1}, s_t, x)$ as:

    p(s \mid x) = \frac{\prod_{t=1}^{T} \Phi_t(s_{t-1}, s_t, x)}{Z(x)}

The partition function $Z(x)$ is defined to ensure that the distribution is normalized:

    Z(x) = \sum_{s'} \prod_{t=1}^{T} \Phi_t(s'_{t-1}, s'_t, x)

The potential functions $\Phi_t(s_{t-1}, s_t, x)$ can be interpreted as the cost of making a transition from state $s_{t-1}$ to state $s_t$ at time $t$, similar to a transition probability in an HMM.

Computing the partition function $Z(x)$ requires summing over all of the exponentially many possible state sequences $s'$. By exploiting the Markov assumptions, however, $Z(x)$ (as well as the node marginals $p(s_t \mid x)$ and the Viterbi labeling) can be calculated efficiently by variants of the standard dynamic programming algorithms used for HMMs.

We assume the potentials factorize according to a set of features $\{f_k\}$, which are given and fixed, so that

    \Phi_t(s_{t-1}, s_t, x) = \exp\left( \sum_k \lambda_k f_k(s_{t-1}, s_t, x, t) \right)

The model parameters are a set of real-valued weights $\Lambda = \{\lambda_k\}$, one for each feature.

Feature functions can be arbitrary. For example, one feature function could be a binary test $f_k(s_{t-1}, s_t, x, t)$ that has value 1 if and only if $s_{t-1}$ has the label SPEAKERNAME, $s_t$ has the label OTHER, and the word $x_t$ begins with a capital letter. The chief practical advantage of conditional models, in fact, is that we can include arbitrary, highly dependent features without needing to estimate their distribution, as would be required to learn a generative model.

Given fully labeled training instances $\{(s_j, x_j)\}_{j=1}^{M}$, CRF training is usually performed by maximizing the penalized log likelihood

    \mathcal{L}(\Lambda) = \sum_{j=1}^{M} \log p(s_j \mid x_j) - \sum_k \frac{\lambda_k^2}{2\sigma^2}

where the final term is a zero-mean Gaussian prior placed on the parameters to avoid overfitting. Although this maximization cannot be done in closed form, it can be optimized numerically. Particularly effective are gradient-based methods that use approximate second-order information, such as conjugate gradient and limited-memory BFGS (Byrd et al., 1994). For more information on current training methods for CRFs, see Sha and Pereira (2003).
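To make the factorization and the dynamic-programming computation of $Z(x)$ above concrete, here is a minimal Python sketch (not from the paper). It assumes the log potentials have already been assembled from the features and weights as dense arrays: a vector of shape (S,) for the first position and an array of shape (T-1, S, S) for the remaining transitions, where S is the number of states; the function names and array layout are illustrative assumptions.

import numpy as np
from scipy.special import logsumexp


def forward_log_partition(log_phi0, log_phi):
    # log Z(x) by the forward algorithm.
    # log_phi0 : (S,)         log Phi_1(s_1, x) for the first position
    # log_phi  : (T-1, S, S)  log Phi_t(s_{t-1}, s_t, x) for t = 2..T
    alpha = log_phi0
    for t in range(log_phi.shape[0]):
        # alpha_{t+1}(s_t) = logsumexp over previous states of alpha_t + log potential
        alpha = logsumexp(alpha[:, None] + log_phi[t], axis=0)
    return logsumexp(alpha)


def sequence_log_score(log_phi0, log_phi, s):
    # Unnormalized log score of a particular state sequence s (list of state indices).
    score = log_phi0[s[0]]
    for t in range(1, len(s)):
        score += log_phi[t - 1, s[t - 1], s[t]]
    return score


def viterbi(log_phi0, log_phi):
    # Most probable state sequence under the same potentials.
    T = log_phi.shape[0] + 1
    delta = log_phi0
    backptr = np.zeros((T - 1, log_phi0.shape[0]), dtype=int)
    for t in range(T - 1):
        scores = delta[:, None] + log_phi[t]      # rows: previous state, cols: current
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    states = [int(delta.argmax())]
    for t in range(T - 2, -1, -1):
        states.append(int(backptr[t, states[-1]]))
    return states[::-1]

With these pieces, log p(s | x) is sequence_log_score(...) - forward_log_partition(...); a binary feature such as the SPEAKERNAME/OTHER capitalization test above simply contributes its weight $\lambda_k$ to the corresponding entries of log_phi at the positions where it fires.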
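The training objective can be sketched in the same style. The fragment below computes the penalized negative log likelihood over a set of instances, assuming (as an illustration, not the paper's representation) that each instance provides dense per-position feature tensors, and hands it to SciPy's L-BFGS-B optimizer with finite-difference gradients; a real implementation would supply the analytic gradient (empirical minus expected feature counts, computed by forward-backward).

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp


def neg_penalized_loglik(lam, data, sigma2):
    # - sum_j log p(s_j | x_j) + sum_k lambda_k^2 / (2 sigma^2)
    # Each instance is (feats0, feats, s), where
    #   feats0 : (S, K)         features f_k(START, s_1, x, 1)
    #   feats  : (T-1, S, S, K) features f_k(s_{t-1}, s_t, x, t) for t = 2..T
    #   s      : length-T list of gold state indices
    nll = np.sum(lam ** 2) / (2.0 * sigma2)       # Gaussian prior term
    for feats0, feats, s in data:
        log_phi0 = feats0 @ lam                   # (S,)
        log_phi = feats @ lam                     # (T-1, S, S)
        # unnormalized score of the gold sequence
        score = log_phi0[s[0]]
        for t in range(1, len(s)):
            score += log_phi[t - 1, s[t - 1], s[t]]
        # forward algorithm for log Z(x)
        alpha = log_phi0
        for t in range(log_phi.shape[0]):
            alpha = logsumexp(alpha[:, None] + log_phi[t], axis=0)
        nll -= score - logsumexp(alpha)
    return nll


# Hypothetical usage (K = number of features, train_data = list of instances):
# result = minimize(neg_penalized_loglik, x0=np.zeros(K),
#                   args=(train_data, 10.0), method="L-BFGS-B")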