<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-3004">
  <Title>Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists</Title>
  <Section position="3" start_page="16" end_page="18" type="metho">
    <SectionTitle>
3 Feature extraction
</SectionTitle>
    <Paragraph position="0"> Recently, diverse machine learning techniques have been applied to a variety of NLP tasks. For all of them, the feature extraction and selection module plays a crucial role in reaching optimal classifier performance. This section describes the features used for our Named Entity Recognition task.</Paragraph>
    <Paragraph position="1"> Feature vectors phi={f1,...,fn} are constructed.</Paragraph>
    <Paragraph position="2"> The total number of features is denoted by n, and the number of feature vectors corresponds to the number of examples in the data. In our experiment the features represent contextual, lexical and gazetteer information. Below we number each feature and its corresponding argument:
 f1: all letters of w0 are capitals;
 f2-f8: w-3, w-2, w-1, w0, w+1, w+2, w+3 start with a capital letter;
 f9: position of w0 in the current sentence;
 f21: w+1 is a trigger word for location, person or organization;
 f22: w0 belongs to the location gazetteer list;
 f23: w0 belongs to the first-name gazetteer list;
 f24: w0 belongs to the family-name gazetteer list;
 f25: 0 if the majority of the words in an entity are locations, 1 if the majority are persons, and 2 otherwise.</Paragraph>
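As a rough illustration of the capitalization features f1-f8 above, the following sketch computes them for a 7-token window centered on w0. The function name and the window layout (w0 at index 3) are our own assumptions for illustration, not the authors' implementation.

```python
def capitalization_features(window):
    """Compute features f1-f8 for a 7-token window [w-3 ... w+3],
    where the current word w0 sits at index 3 (an assumed layout)."""
    w0 = window[3]
    f1 = int(w0.isupper())                             # f1: all letters of w0 are capitals
    f2_to_f8 = [int(w[:1].isupper()) for w in window]  # f2-f8: token starts with a capital
    return [f1] + f2_to_f8
```

For the window ["vive", "en", "la", "ONU", "desde", "ayer", "hoy"], only f1 and the initial-capital feature for w0 fire, since "ONU" is fully uppercase.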
    <Paragraph position="3"> Features f22, f23, f24 were automatically extracted by a simple pattern validation method we propose below.</Paragraph>
    <Paragraph position="4"> The corpus from which the gazetteer lists were extracted forms part of the Efe94 and Efe95 Spanish corpora provided for the CLEF competitions. We conducted a simple preprocessing step, in which all SGML documents were merged into a single file and only the content between the text tags was extracted and considered for further processing. As a result, we obtained 1 gigabyte of unlabeled data, containing 173,468,453 words. The text was tokenized and the frequency of every unigram in the corpus was gathered.</Paragraph>
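The tokenization and unigram counting just described can be sketched as follows; the naive regex tokenizer is a simplifying assumption, not the one the authors used.

```python
import re
from collections import Counter

def unigram_frequencies(text):
    """Tokenize raw corpus text and count the frequency of every unigram."""
    tokens = re.findall(r"\w+", text)  # naive word tokenizer (assumption)
    return Counter(tokens)
```

On real data this would be run over the merged 1 GB file; the resulting counts feed the frequency thresholds used below.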
    <Paragraph position="5"> The algorithm we propose and use to obtain the location and person gazetteer lists is very simple. It consists of finding and validating common patterns, which can also be constructed and utilized for languages other than Spanish.</Paragraph>
    <Paragraph position="6"> The location pattern &lt;prepi, wj&gt; looks for a preposition i which indicates location in Spanish and all corresponding capitalized right-context words wj for preposition i. The dependency relation between prepi and wj conveys the semantic information on the selection restrictions imposed by the two related words. In a walk-through example, the pattern &lt;en, *&gt; extracts all capitalized right-context words wj, such as {Argentina, Barcelona, Madrid, Valencia}, placed next to the preposition &amp;quot;en&amp;quot;. These words are taken as location candidates. The selection restriction implies searching for words appearing after the preposition &amp;quot;en&amp;quot; (e.g. en Madrid) and not before it (e.g. Madrid en).</Paragraph>
    <Paragraph position="7"> The termination of the pattern extraction for &lt;en, *&gt; initiates the extraction phase for the next preposition in prepi = {en, En, desde, Desde, hacia, Hacia}. This process is repeated until the complete set of prepositions has been validated. The extracted capitalized words are then passed through a filtering process. Bigrams &amp;quot;prepi Capitalized wordj&amp;quot; with frequency lower than 20 were automatically discarded, because we observed that this threshold removes words that do not tend to appear often with the location prepositions. In this way, misspelled words such as Bacelona instead of Barcelona were filtered out. On the other hand, every capitalized word composed of two or three characters, for instance &amp;quot;La, Las&amp;quot;, was placed in a trigram validation pattern &lt;prepi, Capitalized wordj, Capitalized wordj+1&gt;. If these words were seen in combination with other capitalized words and their trigram frequency was higher than 20, they were included in the location gazetteer file. With this trigram validation pattern, locations such as &amp;quot;Los Angeles&amp;quot;, &amp;quot;Las Palmas&amp;quot;, &amp;quot;La Coruña&amp;quot; and &amp;quot;Nueva York&amp;quot; were extracted.</Paragraph>
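A minimal sketch of the bigram pattern extraction just described, assuming a pre-tokenized corpus. The threshold is parametrized (the paper's value of 20 is only the default), and the trigram validation for short capitalized words is omitted for brevity.

```python
from collections import Counter

# Spanish location prepositions listed in the text
LOCATION_PREPS = {"en", "En", "desde", "Desde", "hacia", "Hacia"}

def extract_location_candidates(tokens, threshold=20):
    """Collect capitalized words that follow a location preposition and
    keep only those whose bigram frequency reaches the threshold."""
    counts = Counter()
    for prev, word in zip(tokens, tokens[1:]):
        if prev in LOCATION_PREPS and word[:1].isupper():
            counts[word] += 1
    return {w for w, c in counts.items() if c >= threshold}
```

On a large corpus, a rare misspelling such as "Bacelona" falls below the threshold and is dropped, while frequent bigrams like "en Madrid" survive.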
    <Paragraph position="8"> In total, 16,819 entities with no repetition were automatically obtained. The words represent countries around the world, European capitals and, mostly, Spanish cities. Some noisy elements found in the file were person names which were accompanied by the preposition &amp;quot;en&amp;quot;. As these person names were capitalized and had a frequency higher than the threshold we set, it was impossible for them to be automatically detected as erroneous and filtered. However, we kept these names, since the gazetteer attributes we maintain are mutually nonexclusive. This means the name &amp;quot;Jordan&amp;quot; can be seen in the location gazetteer, indicating the country Jordan, and at the same time can be seen in the person name list, indicating the person Jordan. In a real NE application such a case does occur, but for determining the right category, named entity disambiguation is needed, as in (Pedersen et al., 2005).</Paragraph>
    <Paragraph position="9"> The person gazetteer is constructed with a graph exploration algorithm. The nodes of the graph are First Names and Family Names.</Paragraph>
    <Paragraph position="10"> The graph connects Family Names with First Names, and vice versa. In practice, such a graph is not necessarily connected, as there can be unusual first names and surnames which have no relation to other names in the corpus. Although the corpus is supposed to contain mostly common names in one and the same language, names from other languages might be present too. In this case, if a foreign name is not connected with a Spanish name, it will never be included in the name list. Therefore, starting from some common Spanish name will very probably place us in the largest connected component11. If there exist other connected components in the graph, these will be outliers, corresponding to names pertaining to some other language, or to combinations of both a very unusual first name and family name. The larger the corpus is, the smaller the presence of such additional connected components will be.</Paragraph>
    <Paragraph position="11"> The algorithm performs an uninformed breadth-first search. As the graph is not a tree, the stop condition occurs when no more new nodes are found.</Paragraph>
    <Paragraph position="12"> Nodes and connections are found following the pattern &lt;First name, Family name&gt;. The node from which we start the search can be a common Spanish first or family name. In our example we started from the common Spanish first name José.</Paragraph>
    <Paragraph position="13"> The notation &lt;i,j&gt; ∈ C refers to finding in the corpus C the regular expression [A-Z][a-z]* [A-Z][a-z]*. This regular expression indicates a possible relation between a first name and a family name. The scheme of the algorithm is the following: let C be the corpus, F be the set of first names, and S be the set of family names.</Paragraph>
    <Paragraph position="14">
 1. F = {&amp;quot;José&amp;quot;}
 2. ∀i ∈ F do Snew = Snew ∪ {j}, ∀j | &lt;i,j&gt; ∈ C
 3. S = S ∪ Snew
 4. ∀j ∈ S do Fnew = Fnew ∪ {i}, ∀i | &lt;i,j&gt; ∈ C
 5. F = F ∪ Fnew
 6. if Snew or Fnew contains new names, then goto 2; else finish.</Paragraph>
    <Paragraph position="16"> 11A connected component refers to a maximal connected subgraph, in graph theory. A connected graph is a graph containing only one connected component.</Paragraph>
    <Paragraph position="20"> Suppose we have a corpus containing the following person names: {&amp;quot;José García&amp;quot;, &amp;quot;José Martínez&amp;quot;, &amp;quot;Manolo García&amp;quot;, &amp;quot;María Martínez&amp;quot;, &amp;quot;María Fernández&amp;quot;, &amp;quot;John Lennon&amp;quot;} ⊂ C. Initially we have F = {&amp;quot;José&amp;quot;} and S = ∅. After the 3rd step we would have S = {&amp;quot;García&amp;quot;, &amp;quot;Martínez&amp;quot;}, and after the 5th step F = {&amp;quot;José&amp;quot;, &amp;quot;Manolo&amp;quot;, &amp;quot;María&amp;quot;}. During the next iteration &amp;quot;Fernández&amp;quot; would also be added to S, as &amp;quot;María&amp;quot; is already present in F. Neither &amp;quot;John&amp;quot; nor &amp;quot;Lennon&amp;quot; is connected to the rest of the names, so these will never be added to the sets. This can be seen in Figure 1 as well.</Paragraph>
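The breadth-first exploration above can be sketched as follows. The capitalized-pair regex stands in for the paper's pattern, and the function as a whole is an illustrative reconstruction under those assumptions, not the authors' code.

```python
import re

def explore_name_graph(corpus_text, seed_first_name):
    """Alternately grow the sets of first names (F) and family names (S)
    from capitalized-word pairs until no new nodes are found."""
    # Pairs of adjacent capitalized words approximate (first name, family name).
    pairs = re.findall(r"([A-Z]\w+)\s([A-Z]\w+)", corpus_text)
    first, family = {seed_first_name}, set()
    changed = True
    while changed:
        changed = False
        for i, j in pairs:
            if i in first and j not in family:
                family.add(j)          # step 2: new family name found
                changed = True
            if j in family and i not in first:
                first.add(i)           # step 4: new first name found
                changed = True
    return first, family
```

Run on the example corpus above (with unaccented spellings), the search reaches Manolo, María and Fernández from the seed José, while John and Lennon remain in a separate connected component and are never added.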
    <Paragraph position="21"> In our implementation, we filtered out relations appearing fewer than 10 times. Thus rare combinations like &amp;quot;Jose Madrid, Mercedes Benz&amp;quot; are filtered. Noise was introduced by names related to both person and organization names. For example, the Spanish girl's name Mercedes led to the node Benz, and as &amp;quot;Mercedes Benz&amp;quot; also refers to the car-producing company, noisy elements started to be added through the node &amp;quot;Benz&amp;quot;. In total, 13,713 first names and 103,008 surnames were automatically extracted.</Paragraph>
    <Paragraph position="22"> We believe, and our results show, that constructing automatic location and person-name gazetteer lists with the pattern search and validation model we propose is a very easy and practical task. With our approach, thousands of names can be obtained, especially given the ample presence of unlabeled data and the World Wide Web.</Paragraph>
    <Paragraph position="23"> The purpose of our gazetteer construction was not to make complete gazetteer lists, but rather to generate, in a quick and automatic way, lists of names that can help our feature construction module.</Paragraph>
  </Section>
  <Section position="4" start_page="18" end_page="19" type="metho">
    <SectionTitle>
4 Experiments for delimitation process
</SectionTitle>
    <Paragraph position="0"> In this section we describe the experiments conducted for named entity detection. Previously, (Kozareva et al., 2005b) demonstrated that in supervised learning only superficial features, such as context and orthographic information, are sufficient to identify the boundaries of a Named Entity. In our experiment the superficial features f1-f10 were used by the supervised and semi-supervised classifiers. Table 2 shows the obtained results for the Begin and Inside tags, which actually detect the entities, and the totals. On the first row are the results of the supervised method, and on the second row are the highest results of bootstrapping, achieved in its seventeenth iteration. For the supervised learning, 91.88% of the entity boundaries were correctly identified, and for bootstrapping, 81.62% were correctly detected. The lower performance of bootstrapping is due to the noise introduced during the learning. Some examples were learned with the wrong class and others did not introduce new information into the training data.</Paragraph>
    <Paragraph position="1"> Figure 2 presents the learning curve of the bootstrapping process over 25 iterations. On each iteration 1000 examples were tagged, but only the examples whose classes coincided between the two classifiers were later included in the training data.</Paragraph>
    <Paragraph position="2"> We should note that for each iteration the same amount of B, I and O classes was included. Thus the balance among the three different classes in the training data is maintained.</Paragraph>
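The agreement-based, class-balanced selection described in the last two paragraphs might look roughly like this sketch; the function name and the per-class quota parameter are our assumptions.

```python
def select_agreed_examples(examples, preds_a, preds_b, per_class_quota):
    """Keep only examples on which both classifiers agree, taking at most
    per_class_quota examples per class so B, I and O stay balanced."""
    taken = {"B": 0, "I": 0, "O": 0}
    selected = []
    for x, a, b in zip(examples, preds_a, preds_b):
        if a == b and per_class_quota > taken[a]:
            selected.append((x, a))      # both classifiers assigned class a
            taken[a] += 1
    return selected
```

Examples on which the two classifiers disagree are discarded, which is one source of the noise and the stalled learning the text mentions.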
    <Paragraph position="3"> According to the z-prime statistic (Dietterich, 1998), the highest score reached by bootstrapping cannot outperform the supervised method; however, when both methods were evaluated on a small amount of data, the results were similar.</Paragraph>
  </Section>
  <Section position="5" start_page="19" end_page="19" type="metho">
    <SectionTitle>
5 Experiments for classification process
</SectionTitle>
    <Paragraph position="0"> In a Named Entity classification process, a predefined category of interest, such as person, organization, location or miscellaneous name, should be assigned to each previously detected Named Entity. To obtain a better idea of the performance of the classification methods, several experiments were conducted. The influence of the automatically extracted gazetteers was studied, and a comparison of the supervised and semi-supervised methods was made.</Paragraph>
    <Paragraph position="1"> Table 3 shows the obtained results for each of the experimental settings. The first row indicates the performance of the supervised classifier when no gazetteer information is present. The classifier used the f1, f2, f3, f4, f5, f6, f7, f8, f18, f19, f20 and f21 attributes. The second row concerns the same classifier, but with the gazetteer information included by adding the f22, f23, f24 and f25 attributes. The third row relates to the bootstrapping process. The attributes used for the supervised and semi-supervised learning were the same.</Paragraph>
    <Paragraph position="2"> Results show that among all classes, miscellaneous is the one with the lowest performance. This is related to the heterogeneous information of the category. The other three categories performed above 70%. As expected, gazetteer information contributed to a better distinction of person and location names. Organization names benefited from the contextual information, the organization trigger words, and the attribute which treats an entity as an organization if it is neither a person nor a location. Bootstrapping performance was not high, due on the one hand to the fact that only 81% of the named entity boundaries had been correctly detected, and on the other to training examples which were incorrectly classified and included in the training data.</Paragraph>
    <Paragraph position="3"> In our experiment, unlabeled data was used to construct, in an easy and effective way, person and location gazetteer lists. With their help, the supervised and semi-supervised classifiers improved performance. Although the semi-supervised method cannot reach the performance of a supervised classifier, we can say that the results are promising. We call them promising with respect to constructing a NE recognizer for languages with no resources, or even adapting the present Spanish Named Entity system to another domain.</Paragraph>
  </Section>
</Paper>