File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-1504_intro.xml
Size: 6,406 bytes
Last Modified: 2025-10-06 14:01:59
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1504"> <Title>Low-cost Named Entity Classification for Catalan: Exploiting Multilingual Resources and Unlabeled Data</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Setting </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Corpus and data resources </SectionTitle> <Paragraph position="0"> The experiments in this work have been carried out on two corpora, one for each language. Both corpora consist of sentences extracted from news articles from the year 2000. The Catalan data, extracted from the Catalan edition of the daily newspaper El Periódico de Catalunya, has been randomly divided into three sets: a training set (to train a system) and a test set (to perform evaluation), both manually annotated, and a remaining set left unlabelled. The Spanish data corresponds to the CoNLL 2002 Shared Task Spanish data, the original source being the EFE Spanish Newswire Agency. The Spanish training set has been used to improve classification for Catalan, whereas the test set has been used to evaluate the bilingual classifier. The original development set has not been used. Table 1 shows the number of sentences, words and Named Entities in each set. Although a large amount of Catalan unlabelled NEs is available, it must be observed that these are automatically recognised by an NER module with 91.5% accuracy, which introduces a certain amount of error that might undermine the bootstrapping results.</Paragraph> <Paragraph position="1"> The classes considered include the MUC categories PER, LOC and ORG, plus a fourth category MIS, which includes named entities such as documents, measures and taxes, sport competitions, titles of art works and others. 
For Catalan, the 2,570 manually annotated NEs comprise 33.0% PER, 17.1% LOC, 43.5% ORG and 6.4% MIS, whereas for Spanish, out of the 22,355 labelled NEs, 22.6% are PER, 26.8% are LOC, 39.4% are ORG and the remaining 11.2% are MIS.</Paragraph> <Paragraph position="2"> Additionally, we used a Spanish list of 7,427 trigger words that typically accompany persons, organizations, locations, etc., and an 11,951-entry gazetteer containing geographical and person names. These lists have been semi-automatically extracted from lexical resources and manually enriched afterwards.</Paragraph> <Paragraph position="3"> They have been used in previous work, where they allowed significant improvements on the Spanish NERC task (Carreras et al., 2002; Carreras et al., 2003).</Paragraph> <Paragraph position="4"> Trigger words are annotated with the corresponding Spanish synsets in the EuroWordNet lexical knowledge base. Since there are translation links between Spanish and Catalan (and other languages) for the majority of these words, an equivalent version of the trigger-word list for Catalan has been automatically derived. In this work, we consider the gazetteer a language-independent resource and use it indistinctly for training Catalan and Spanish models.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Feature codification </SectionTitle> <Paragraph position="0"> The features that characterise the NE examples are defined in a window W anchored at a word w, representing the local context used by a classifier to make a decision. In the window, each word around w is codified with a set of primitive features, requiring no linguistic pre-processing, together with its relative position to w. Each primitive feature with each relative position and each possible value forms a final binary feature for the classifier (e.g., &quot;the word form at position(-2) is street&quot;). 
The information coded in these features may be grouped into the following kinds: • Lexical: Word forms and their position in the window (e.g., W(3)=&quot;bank&quot;), as well as word forms appearing in the named entity under consideration, independent of their position.</Paragraph> <Paragraph position="1"> • Orthographic: Word properties regarding how the word is capitalised (initial-caps, all-caps), the kind of characters that form the word (contains-digits, all-digits, alphanumeric, roman-number), the presence of punctuation marks (contains-dots, contains-hyphen, acronym), single character patterns (lonely-initial, punctuation-mark, single-char), or the membership of the word in a predefined class (functional-word, see footnote 1) or pattern (URL).</Paragraph> <Paragraph position="2"> • Affixes: The prefixes and suffixes of up to 4 characters of the NE being classified and its internal components.</Paragraph> <Paragraph position="3"> • Word Type Patterns: Type pattern of consecutive words in the context. The type of a word is either functional (f), capitalised (C), lower-cased (l), punctuation mark (.), quote (') or other (x).</Paragraph> <Paragraph position="4"> • Bag-of-Words: Form of the words in the window, without considering positions (e.g., &quot;bank&quot; ∈ W).</Paragraph> <Paragraph position="5"> • Trigger Words: Triggering properties of window words, using an external list to determine whether a word may trigger a certain Named Entity (NE) class (e.g., &quot;president&quot; may trigger class PER). 
Context patterns to the left of the NE are also considered, where each word is marked with its triggering properties, or with a functional-word tag, if appropriate (e.g., the phrase &quot;the president of United Nations&quot; produces pattern f ORG f for the NE &quot;United Nations&quot;, assuming that &quot;president&quot; is listed as a possible trigger for ORG).</Paragraph> <Paragraph position="6"> Footnote 1: Functional words are determiners and prepositions which typically appear inside NEs.</Paragraph> <Paragraph position="7"> • Gazetteer Features: Gazetteer information for window words. A gazetteer entry consists of a set of possible NE categories.</Paragraph> <Paragraph position="8"> • Additionally, binary features encode the length in words of the NE being classified.</Paragraph> <Paragraph position="9"> All features are computed for a [-3,+3] window around the NE being classified, except for the Bag-of-Words, for which a [-5,+5] window is used.</Paragraph> </Section> </Section> </Paper>