<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1069">
<Title>Text Chunking using Regularized Winnow</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Recently there has been considerable interest in applying machine learning techniques to problems in natural language processing. One method that has been quite successful in many applications is the SNoW architecture (Dagan et al., 1997; Khardon et al., 1999). This architecture is based on the Winnow algorithm (Littlestone, 1988; Grove and Roth, 2001), which in theory is suitable for problems with many irrelevant attributes. In natural language processing, one often encounters a very high-dimensional feature space in which most of the features are irrelevant. Therefore the robustness of Winnow to high-dimensional feature spaces is considered an important reason why it is suitable for NLP tasks.</Paragraph>
<Paragraph position="1"> However, the convergence of the Winnow algorithm is guaranteed only for linearly separable data. In practical NLP applications, data are often linearly non-separable. Consequently, a direct application of Winnow may lead to numerical instability. A remedy for this, called regularized Winnow, has recently been proposed in (Zhang, 2001). This method modifies the original Winnow algorithm so that it solves a regularized optimization problem. It converges both in the linearly separable case and in the linearly non-separable case. Its numerical stability implies that the new method can be more suitable for practical NLP problems that may not be linearly separable.</Paragraph>
<Paragraph position="2"> In this paper, we compare the regularized Winnow and Winnow algorithms on text chunking (Abney, 1991). In order to rigorously compare our system with others, we use the CoNLL-2000 shared task dataset (Sang and Buchholz, 2000), which is publicly available from http://lcgwww.uia.ac.be/conll2000/chunking. An advantage of using this dataset is that a large number of state-of-the-art statistical natural language processing methods have already been applied to the data. Therefore we can readily compare our results with other reported results.</Paragraph>
<Paragraph position="3"> We show that state-of-the-art performance can be achieved by using the newly proposed regularized Winnow method. Furthermore, we can achieve this result with significantly less computation than earlier systems of comparable performance. The paper is organized as follows. In Section 2, we describe the Winnow algorithm and the regularized Winnow method. Section 3 describes the CoNLL-2000 shared task. In Section 4, we give a detailed description of our system that employs the regularized Winnow algorithm for text chunking. Section 5 contains experimental results for our system on the CoNLL-2000 shared task.</Paragraph>
<Paragraph position="4"> Some final remarks will be given in Section 6.</Paragraph>
<Paragraph position="5"> 2 Winnow and regularized Winnow for binary classification We review the Winnow algorithm and the regularized Winnow method. Consider the binary classification problem: to determine a label $y \in \{-1, +1\}$ associated with an input vector $x$. A useful method for solving this problem is through linear discriminant functions, which consist of linear combinations of the components of the input variable.
Specifically, we seek a weight vector $w$ and a threshold $\theta$ such that $w^T x \geq \theta$ if its label $y = +1$ and $w^T x < \theta$ if its label $y = -1$.</Paragraph>
<Paragraph position="7"> For simplicity, we shall assume $\theta = 0$ in this paper. The restriction does not cause problems in practice since one can always append a constant feature to the input data $x$, which offsets the effect of $\theta$. Given a set of labeled training examples $(x^1, y^1), \ldots, (x^n, y^n)$, a number of approaches to finding linear discriminant functions have been advanced over the years. We are especially interested in the Winnow multiplicative update algorithm (Littlestone, 1988). This algorithm updates the weight vector $w$ by going through the training data repeatedly. It is mistake driven in the sense that the weight vector is updated only when the algorithm is not able to correctly classify an example.</Paragraph>
<Paragraph position="8"> The Winnow algorithm (with positive weight) employs a multiplicative update: if the linear discriminant function misclassifies an input training vector $x^i$ with true label $y^i$, then we update each component $j$ of the weight vector $w$ as
$$w_j \leftarrow w_j \exp(\eta x^i_j y^i), \qquad (1)$$
where $\eta > 0$ is a parameter called the learning rate. The initial weight vector can be taken as $w_j = \mu_j > 0$, where $\mu$ is a prior which is typically chosen to be uniform.</Paragraph>
<Paragraph position="13"> There can be several variants of the Winnow algorithm. One is called balanced Winnow, which is equivalent to an embedding of the input space into a higher-dimensional space as $\tilde{x} = [x, -x]$. This modification allows the positive-weight Winnow algorithm for the augmented input $\tilde{x}$ to have the effect of both positive and negative weights for the original input $x$.</Paragraph>
<Paragraph position="14"> One problem of the Winnow online update algorithm is that it may not converge when the data are not linearly separable. One may partially remedy this problem by decreasing the learning rate parameter $\eta$ during the updates. However, this is rather ad hoc since it is unclear how best to do so. Therefore, in practice it can be quite difficult to implement this idea properly.</Paragraph>
<Paragraph position="15"> In order to obtain a systematic solution to this problem, we shall first examine a derivation of the Winnow algorithm in (Gentile and Warmuth, 1998), which motivates a more general solution to be presented later.</Paragraph>
<Paragraph position="16"> Following (Gentile and Warmuth, 1998), we consider the loss function $\max(-w^T x^i y^i, 0)$, which is often called &quot;hinge loss&quot;.</Paragraph>
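Before turning to the derivation below, the following is a minimal sketch of the multiplicative update (1) combined with the balanced embedding $\tilde{x} = [x, -x]$ described above. It is an illustration only, not the system of Section 4; the function name, the uniform prior, and the learning-rate and epoch defaults are assumptions.

```python
import numpy as np

def winnow_train(X, y, eta=0.1, epochs=10):
    """Positive-weight Winnow with the balanced embedding x~ = [x, -x] (sketch).

    X : (n, d) array of input vectors x^i.
    y : (n,) array of labels in {-1, +1}.
    eta : learning rate (eta > 0).
    """
    n, d = X.shape
    # Balanced Winnow: augment the input so that positive weights on [x, -x]
    # act as positive and negative weights on the original x.
    Xb = np.hstack([X, -X])
    # Initial weights w_j = mu_j > 0 with a uniform prior mu.
    w = np.full(2 * d, 1.0 / (2 * d))
    for _ in range(epochs):
        for i in range(n):
            # Mistake driven: update only on a misclassified example
            # (threshold theta = 0, as assumed in the text).
            if y[i] * (w @ Xb[i]) <= 0:
                # Multiplicative update (1): w_j <- w_j * exp(eta * x^i_j * y^i).
                w *= np.exp(eta * y[i] * Xb[i])
    return w
```

Only misclassified examples trigger an update, which reflects the mistake-driven character of the algorithm noted above.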
<Paragraph position="17"> For each data point $(x^i, y^i)$, we consider an online update rule such that the weight $w^{i+1}$ after seeing the $i$-th example is given by the solution to
$$\min_{w^{i+1}} \Big[ \sum_j \Big(w^{i+1}_j \ln\frac{w^{i+1}_j}{w^i_j} - w^{i+1}_j\Big) + \eta \max\big(-(w^{i+1})^T x^i y^i, 0\big) \Big]. \qquad (2)$$
Setting the gradient of the above formula to zero, we obtain
$$\ln\frac{w^{i+1}_j}{w^i_j} + \eta \Big[\nabla_{w^{i+1}} \max\big(-(w^{i+1})^T x^i y^i, 0\big)\Big]_j = 0. \qquad (3)$$
In the above equation, $\nabla_{w^{i+1}}$ denotes the gradient (or more rigorously, a subgradient) of $\max(-(w^{i+1})^T x^i y^i, 0)$. The Winnow update (1) can be regarded as an approximate solution to (3).</Paragraph>
<Paragraph position="23"> Although the above derivation does not solve the non-convergence problem of the original Winnow method when the data are not linearly separable, it does provide valuable insights which can lead to a more systematic solution of the problem.</Paragraph>
<Paragraph position="24"> The basic idea was given in (Zhang, 2001), where the original Winnow algorithm was converted into a numerical optimization problem that can handle linearly non-separable data.</Paragraph>
<Paragraph position="25"> The resulting formulation is closely related to (2). However, instead of looking at one example at a time as in an online formulation, we incorporate all examples at the same time. In addition, we add a margin condition into the &quot;hinge loss&quot;. Specifically, we seek a linear weight $\hat{w}$ that solves
$$\min_w \Big[ \sum_j \Big(w_j \ln\frac{w_j}{\mu_j} - w_j\Big) + C \sum_i \max\big(1 - w^T x^i y^i, 0\big) \Big],$$
where $C > 0$ is a given parameter called the regularization parameter. The optimal solution $\hat{w}$ of the above optimization problem can be derived from the solution $\hat{\alpha}$ of the following dual optimization problem:
$$\hat{\alpha} = \arg\max_{\alpha} \Big[ \sum_i \alpha_i - \sum_j \mu_j \exp\Big(\sum_i \alpha_i x^i_j y^i\Big) \Big] \quad \text{s.t. } \alpha_i \in [0, C],$$
with the $j$-th component of $\hat{w}$ given by $\hat{w}_j = \mu_j \exp\big(\sum_i \hat{\alpha}_i x^i_j y^i\big)$.</Paragraph>
<Paragraph position="27"> A Winnow-like update rule can be derived for the dual regularized Winnow formulation. At each data point $(x^i, y^i)$, we fix all $\alpha_k$ with $k \neq i$, and update $\alpha_i$ to approximately maximize the dual objective functional using gradient ascent:
$$\alpha_i \leftarrow \min\Big(C,\ \max\big(0,\ \alpha_i + \eta(1 - w^T x^i y^i)\big)\Big), \quad \text{where } w_j = \mu_j \exp\Big(\sum_k \alpha_k x^k_j y^k\Big).$$</Paragraph>
<Paragraph position="29"> Learning bounds for regularized Winnow that are similar to the mistake bound of the original Winnow have been given in (Zhang, 2001). These results imply that the new method, while able to properly handle non-separable data, shares the theoretical advantages of Winnow in that it is also robust to irrelevant features. This theoretical insight implies that the algorithm is suitable for NLP tasks with large feature spaces.</Paragraph>
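To make the dual update above concrete, here is a minimal sketch of one pass of the gradient-ascent rule, fixing all $\alpha_k$ with $k \neq i$ and clipping $\alpha_i$ to $[0, C]$. It is a reading of the formulation rather than the implementation used in the experiments; the function name, the uniform prior, and the hyperparameter defaults are assumptions.

```python
import numpy as np

def regularized_winnow_dual(X, y, C=1.0, eta=0.1, epochs=10):
    """Dual regularized Winnow by coordinate-wise gradient ascent on alpha (sketch).

    X : (n, d) array of input vectors x^i.
    y : (n,) array of labels in {-1, +1}.
    C : regularization parameter (box constraint alpha_i in [0, C]).
    eta : learning rate for the gradient-ascent step.
    """
    n, d = X.shape
    mu = np.full(d, 1.0 / d)   # uniform prior mu
    alpha = np.zeros(n)        # one dual variable per training example

    def weights(alpha):
        # w_j = mu_j * exp(sum_i alpha_i * x^i_j * y^i)
        return mu * np.exp(X.T @ (alpha * y))

    for _ in range(epochs):
        for i in range(n):
            w = weights(alpha)
            # Fix alpha_k for k != i; take a gradient step in alpha_i and clip:
            # alpha_i <- min(C, max(0, alpha_i + eta * (1 - y^i w^T x^i))).
            alpha[i] = min(C, max(0.0, alpha[i] + eta * (1.0 - y[i] * (w @ X[i]))))
    # Recover the primal weight vector from the dual solution.
    return weights(alpha), alpha
```

Recomputing $w$ from $\alpha$ at every step keeps the sketch close to the dual objective; a practical implementation would update $w$ incrementally, since only one $\alpha_i$ changes per step.
</Section>
</Paper>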