<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1504"> <Title>Low-cost Named Entity Classification for Catalan: Exploiting Multilingual Resources and Unlabeled Data</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Learning Algorithms </SectionTitle> <Paragraph position="0"> As previously said, we compare two learning approaches when learning from Catalan examples: supervised (using the AdaBoost algorithm), and unsupervised (using the Greedy Agreement Algorithm).</Paragraph> <Paragraph position="1"> Both of them are briefly described below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Supervised Learning </SectionTitle> <Paragraph position="0"> We use the multilabel multiclass AdaBoost.MH algorithm (with confidence-rated predictions) for learning the classification models. The idea of this algorithm is to learn an accurate strong classifier by linearly combining, in a weighted voting scheme, many simple and moderately-accurate base classifiers or rules. Each base rule is sequentially learned by presenting the base learning algorithm a weighting over the examples (denoting importance of examples), which is dynamically adjusted depending on the behaviour of the previously learned rules. We refer the reader to (Schapire and Singer, 1999) for details about the general algorithm, and to (Schapire, 2002) for successful applications to many areas, including several NLP tasks. Additionally, a NERC system based on the AdaBoost algorithm obtained the best results in the CoNLL'02 Shared Task competition (Carreras et al., 2002).</Paragraph> <Paragraph position="1"> In our setting, the boosting algorithm combines several small fixed-depth decision trees. Each branch of a tree is, in fact, a conjunction of binary features, allowing the strong boosting classifier to work with complex and expressive rules.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Unsupervised Learning </SectionTitle> <Paragraph position="0"> We have implemented the Greedy Agreement Algorithm (Abney, 2002) which, based on two independent views of the data, is able to learn two binary classifiers from a set of hand-typed seed rules. Each classifier is a majority vote of several atomic rules, which abstains when the voting ends in a tie. The atomic rules are just mappings of a single feature into a class (e.g., if suffix &quot;lez&quot; then PER). When learning, the atomic rule that maximally reduces the disagreement on unlabelled data between both classifiers is added to one of the classifiers, and the process is repeated alternating the classifiers. See (Abney, 2002) for a formal proof that this algorithm tends to gradually reduce the classification error given the adequate seed rules.</Paragraph> <Paragraph position="1"> For its extreme simplicity and potentially good results, this algorithm is very appealing for the NEC task. In fact, results are reported to be competitive against more sophisticated methods (Co-DL, Co-Boost, etc.) for this specific task in (Abney, 2002).</Paragraph> <Paragraph position="2"> Three important questions arise from the algorithm. First, what features compose each view. 
Three important questions arise from the algorithm. First, which features compose each view. Second, how seed rules should be selected, and whether this selection strongly affects the final classifiers. Third, how the algorithm, presented in (Abney, 2002) for binary classification, can be extended to a multiclass problem.

In order to answer these questions and gain some empirical insight into how the algorithm works, we performed initial experiments on the large labelled portion of the Spanish data.

Regarding view selection, we tried two alternatives. The first, suggested in (Collins and Singer, 1999; Abney, 2002), divides the features into one view capturing internal features of the NE and another capturing features of its left and right contexts (hereafter referred to as Greedy Agreement pure, or GA_p). Since the contextual view turned out to be quite limited in performance, we interchanged some feature groups between the views. Specifically, we moved the lexical features independent of their position to the contextual view, and the Bag-of-Words features to the internal one (we will refer to this division as Greedy Agreement mixed, or GA_m). The latter, containing redundant and conditionally dependent features, yielded slightly better results in terms of the precision-coverage trade-off.

As for seed rule selection, we tried two different strategies: on the one hand, blindly choosing as many atomic rules as possible that decide for a class in at least 98% of the cases on a small validation set of labelled data; on the other, manually selecting from these atomic rules only those likely to remain valid for a larger data set. The second approach proved empirically better, as it provided a much higher starting point on the test set in terms of precision, at the cost of only a slightly lower coverage, resulting in a better learning curve.

Finally, we approached the multiclass setting by a one-vs-all binarization, that is, dividing the classification problem into four binary decisions (one per class) and combining the resulting rules. Several combination techniques were tested, from making a prediction only when one classifier assigns positive for the given instance and all other classifiers assign negative (very high precision, low coverage), to much less restrictive approaches, such as combining all votes from each classifier (lower precision, higher coverage). Results showed that the best approach is to sum all votes from all non-abstaining binary classifiers, where a vote of a given classifier for the negative class is converted into one vote for each of the other classes, as sketched below.
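A minimal sketch of this winning combination scheme (our own naming, not the paper's code); each one-vs-all classifier is assumed to output +1, -1, or None for abstention:

```python
CLASSES = ["LOC", "ORG", "PER", "MIS"]

def combine_one_vs_all(binary_preds):
    """binary_preds: class -> +1, -1, or None, one entry per binary classifier."""
    votes = {c: 0 for c in CLASSES}
    for cls, pred in binary_preds.items():
        if pred is None:          # abstaining classifiers contribute nothing
            continue
        if pred > 0:              # a positive vote goes to the classifier's class
            votes[cls] += 1
        else:                     # a negative vote becomes one vote per other class
            for other in CLASSES:
                if other != cls:
                    votes[other] += 1
    best = max(votes.values())
    winners = [c for c, v in votes.items() if v == best]
    return winners[0] if len(winners) == 1 and best > 0 else None  # tie: abstain
```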
The best results obtained in terms of coverage/precision, evaluated over the whole set of training data (and thus more significant than over a small test set), are 80.7/84.9. These results are comparable to the ones presented in (Abney, 2002), taking into account, apart from the change of language, that we have introduced a fourth class treated the same as the other three. Results when using Catalan data are presented in section 4.

4 Using only Catalan resources

This section describes the results obtained by using only the Catalan resources, comparing the fully unsupervised Greedy Agreement algorithm with the supervised AdaBoost learning algorithm.

4.1 Unsupervised vs. supervised learning

In this experiment, we used the Catalan training set to extract seed rules for the GA algorithm and to train an AdaBoost classifier. The whole unlabelled Catalan corpus was used for bootstrapping the GA algorithm. All results were computed over the Catalan test set.

Figure 1 shows a precision-coverage plot of AdaBoost (noted as CA, for CAtalan training) and the two GA models. The curve for CA has been computed by varying a confidence threshold: CA abstains when the highest prediction of AdaBoost is lower than this threshold. On the one hand, it can be seen that GA_m is more precise than GA_p for low values of coverage, but their asymptotic behaviour is quite similar. Stopping at the best point on the validation set, the Greedy Agreement algorithm (GA_m) achieves a precision of 76.53% with a coverage of 83.62% on the test set.

On the other hand, the AdaBoost classifier clearly outperforms both GA models at all levels of coverage, indicating that supervised training is preferable even with really small training sets (an accuracy of around 70% is obtained by training AdaBoost with only 20% of the learning examples, i.e., 270 examples).

The first three rows of table 2 contain the accuracy of these systems (i.e., precision when coverage is 100%), detailed at the NE type level (best results printed in boldface).[2] The fourth row (BTS) corresponds to the best results obtained when additional unlabelled Catalan examples are taken into account, as explained below.

[2] In order to obtain 100% coverage with the GA models, we introduced a naive algorithm that breaks ties in favour of the most frequent categories in the cases in which the algorithm abstains.

It can be observed that the GA models are highly biased towards the most frequent NE types (ORG and PER), and that the accuracy achieved on the less represented categories is very low for LOC and negligible for MIS. The MIS category is rather difficult to learn (also for the supervised algorithm), probably because it does not correspond to any concrete NE type and does not show many regularities. Considering this, we learned the models using only the LOC, ORG, and PER categories and treated MIS as a default value (assigned whenever the classifier does not have enough evidence for any of the other categories). The results obtained were even worse.
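For concreteness, here is a sketch of the evaluation mechanics used in this subsection: the threshold-based abstention that produces the CA curve, the precision/coverage computation, and the naive tie-breaking of footnote [2]. All names are ours, and the fallback category is a placeholder:

```python
def ca_predict(scores, threshold):
    """scores: class -> AdaBoost confidence; abstain below the threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

def precision_coverage(predictions, gold):
    """Precision over answered examples; coverage = fraction answered."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    precision = sum(p == g for p, g in answered) / max(len(answered), 1)
    return precision, coverage

# Footnote [2]: force 100% coverage by falling back to a most frequent
# category when a GA model abstains (hypothetical choice shown here).
MOST_FREQUENT = "ORG"

def force_full_coverage(prediction):
    return prediction if prediction is not None else MOST_FREQUENT
```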
4.2 Bootstrapping AdaBoost models using unlabelled examples

In principle, the supervised approach can be improved by using the unlabelled Catalan examples in a kind of iterative bootstrapping procedure. We have tested a fairly simple bootstrapping strategy. The unlabelled data in Catalan has been randomly divided into a number of equal-size disjoint subsets U_1, ..., U_n, each containing 1,000 sentences. Given the initial training set for Catalan, noted as L_0, the process is as follows:

1. Learn the initial classification model M_0 from L_0.
2. For i = 1 ... n do:
   (a) Classify the Named Entities in U_1, ..., U_i using model M_{i-1}.
   (b) Select a subset U of the previously classified examples.
   (c) Learn a new model M_i from L_0 together with U.

At each iteration, a new unlabelled fold is included in the learning process. First, the folds are labelled by the current model, and then a new model is learned using the base training data plus the label-predicted folds.

We devised two variants for selecting the subset of labelled instances to include at each iteration. The first consists of simply selecting all the examples; the second consists of choosing only the most confident ones (in order to avoid the addition of many training errors). For the latter, we used a confidence measure based on the difference between the first and second highest predictions for the example (after normalization into [-1, +1]). The confidence parameter has been empirically set to 0.3. These two variants lead to bootstrapping algorithms that will be referred to as CA_bts1 and CA_bts2.
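The loop above, combined with the CA_bts2 confidence selection, can be sketched as follows. The trainer and the model's scoring interface (`train_model`, `predict_scores`) are assumed names for illustration, not an actual API from the paper:

```python
CONFIDENCE = 0.3   # empirically set margin threshold

def is_confident(scores):
    """scores: per-class predictions, normalized into [-1, +1]."""
    top1, top2 = sorted(scores.values(), reverse=True)[:2]
    return top1 - top2 >= CONFIDENCE

def bootstrap(train_model, L0, folds, select=is_confident):
    """train_model: list of (example, label) pairs -> model;
    folds: the disjoint unlabelled subsets U_1 ... U_n."""
    model = train_model(L0)                              # M_0
    for i in range(1, len(folds) + 1):
        relabelled = []
        for example in (x for fold in folds[:i] for x in fold):
            scores = model.predict_scores(example)       # assumed interface
            if select(scores):
                label = max(scores, key=scores.get)
                relabelled.append((example, label))
        model = train_model(L0 + relabelled)             # M_i
    return model
```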
Finally, a third variant of the bootstrapping algorithm has been tested, consisting of training the M_0 model on the Catalan training set L_0 plus a set of examples (of comparable size and distribution over NE types) selected from those most confidently labelled by the GA_m model. This strategy, applied in combination with the CA_bts2 selection scheme, will be referred to as CA_bts3.

The left-hand side of table 3 contains the results obtained by these bootstrapping techniques for up to 10 iterations. Figures improving the baseline CA model are printed in boldface.

It can be observed that the bootstrapping procedure frequently decreases the accuracy of the system. This is probably due to two main factors: the supervised learning algorithm cannot recover from the almost 20% of errors introduced by the initial CA model, and the recognition errors (mostly in segmentation) present in the Catalan unlabelled corpus take their toll (recall that our NE recogniser is far from perfect, achieving an F_1 measure of 91.5). However, significant differences can be observed between the three variants. Firstly, simply adding all the examples (CA_bts1) systematically decreases performance. Secondly, selecting confident examples (CA_bts2) minimises the loss but does not improve results (probably because most of the selected examples do not provide new information). Finally, adding the examples labelled by GA_m in the first learning step, though starting from a less accurate classifier, obtains better results in the majority of cases (though the bootstrapping process is certainly unstable). This seems to indicate that the information introduced by these examples is somehow complementary to that of CA.

It is worth noting that the GA_m examples do not cover the most frequent cases: if we use them alone to train an AdaBoost classifier, we obtain a very low accuracy of 33%. The best result achieved by CA_bts3 is detailed in the last row of table 2.

More complex variations of the above bootstrapping strategy have also been tested. Basically, we concentrated on selecting a right-sized set of confident examples from the unlabelled material by considering the cases in which the CA and GA models agree on the prediction. In all cases, the results led to conclusions similar to those described above.

5 Using Spanish resources

In this section we extend our previous work on NE recognition (Carreras et al., 2003) to obtain a bilingual NE classification model. The idea is to exploit the large annotated Spanish corpus by learning a Spanish-Catalan bilingual model from the joint set of Spanish and Catalan learning examples. In order to make the model bilingual, we only have to deal with the features that are language dependent, namely the lexical ones (word forms appearing in context patterns and in the Bag-of-Words). All other features are left unchanged.

A translation dictionary from Spanish to Catalan and vice versa has been automatically built for the word-form features. It contains a list of translation pairs between Spanish and Catalan words. For instance, an entry in the dictionary is "calle ↔ carrer", meaning that the Spanish word "calle" ("street" in English) corresponds to the Catalan word "carrer". In order to obtain the relevant vocabulary for the NEC task, we ran several trainings on the Spanish and Catalan training sets varying the learning parameters, and extracted from the learned models all the lexical features involved. This set of relevant words contains 8,042 words (80% coming from Spanish and 20% from Catalan).

The translation of these words has been done automatically by applying the InterNOSTRUM Spanish-Catalan machine translation system developed by the Software Department of the University of Alacant. The translations have been resolved without any context information (so the MT system is often mistaken), and the entries not recognised by InterNOSTRUM have been left unchanged. A light manual post-correction has been applied in order to fix some minor errors arising from different segmentations of the translation pairs.
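A sketch of this dictionary construction might look as follows; `translate` stands in for a call to the InterNOSTRUM system (whose real interface we do not reproduce here), returning None for unrecognised entries:

```python
def build_translation_pairs(relevant_words, translate):
    """relevant_words: (word, lang) pairs extracted from the learned models,
    with lang in {"es", "ca"}. Returns a set of (es_w, ca_w) entries."""
    pairs = set()
    for word, lang in relevant_words:
        target = translate(word, lang)       # context-free MT translation
        if target is None:                   # unrecognised: leave unchanged
            target = word
        es_w, ca_w = (word, target) if lang == "es" else (target, word)
        pairs.add((es_w, ca_w))
    return pairs
```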
5.1 Cross-Linguistic features

In order to train bilingual classification models, we make use of what we call cross-linguistic features, instead of the monolingual word forms specified in section 2.2. This technique is exactly the same one we proposed for learning a Catalan-Spanish bilingual NE recognition module (Carreras et al., 2003). Assume a feature lang which takes the value es or ca, depending on the language under consideration. A cross-linguistic feature is simply a binary feature corresponding to an entry "es_w ↔ ca_w" in the translation dictionary, which is satisfied as follows: the feature "es_w ↔ ca_w" holds for a given occurrence if and only if (lang = es and the word form is es_w) or (lang = ca and the word form is ca_w).

This representation allows learning from a corpus consisting of mixed Spanish and Catalan examples. When an example, say in Spanish, is codified, each occurrence of a word form is checked in the dictionary, and all translation pairs that match the Spanish entry are codified as cross-linguistic features.

The idea here is to take advantage of the fact that the concept of NE is mostly shared by both languages and differs mainly in the lexical information, which is precisely what the cross-linguistic features abstract away. In this way, we can learn a bilingual model which is able to classify NEs both for Spanish and Catalan, but that may be trained with few (or even no) data of one language, in our case Catalan.
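A minimal sketch of this codification, under the same assumptions as above (a set of (es_w, ca_w) translation pairs; function and feature names are ours):

```python
def cross_linguistic_features(word, lang, pairs):
    """Features fired by one word-form occurrence in a lang-tagged example."""
    fired = []
    for es_w, ca_w in pairs:
        # "es_w <-> ca_w" is satisfied iff (lang = es and word = es_w)
        # or (lang = ca and word = ca_w)
        if (lang == "es" and word == es_w) or (lang == "ca" and word == ca_w):
            fired.append(f"{es_w}<->{ca_w}")
    return fired

# Both the Spanish and the Catalan side of a pair fire the same feature:
pairs = {("calle", "carrer")}
assert cross_linguistic_features("calle", "es", pairs) == ["calle<->carrer"]
assert cross_linguistic_features("carrer", "ca", pairs) == ["calle<->carrer"]
```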