<?xml version="1.0" standalone="yes"?> <Paper uid="W02-2023"> <Title>Named Entity Learning and Verification: Expectation Maximization in Large Corpora</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Exhaustion Cycle </SectionTitle> <Paragraph position="0"> The overall performance of the algorithm can be estimated as follows: for simplicity, let us assume that the average number of items (in our task: name elements) findable by any unused item equals N. Then the number of items initially grows exponentially. Sooner or later, however, the total number of unseen entities decreases, and most of the N items found are already known.</Paragraph> <Paragraph position="1"> The number of new items found in each turn decreases, until no more items can be reached.</Paragraph> <Paragraph position="2"> We therefore discriminate between a phase of growth and a phase of exhaustion.</Paragraph> <Paragraph position="3"> The following figures visualize the number of new items per turn and the accumulated total number of items for each turn. The data was taken from an experiment with 19 items of knowledge (see appendix 2). The test was performed on the German corpus and was designed to find first and last names only. The phase of growth lasts until the 5th cycle; then exhaustion takes over, as can be seen in figure 1 [footnote 1].</Paragraph> <Paragraph position="4"> Footnote 1: Additional runs with more start items produced the same number of total items in fewer cycles.</Paragraph> <Paragraph position="5"> Footnote 2: Note that 25,000 name elements are responsible for the detection of over 150,000 full names.</Paragraph> <Paragraph position="6"> Natural growth in the number of items takes place under the following conditions: * sufficient size of the corpus, * sufficient frequency of the start items, * a suitable relation, e.g. names. If the corpus is not large enough or it is not possible to find enough candidates from the start items, exhaustion takes place immediately.</Paragraph>
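<Paragraph position="7"> To make the growth and exhaustion phases concrete, the following minimal simulation (an illustration of the estimate above, not the code used in the experiments; the pool size, the per-item yield N and the start set are hypothetical parameters) draws, for every unused item, N random items from a finite pool and counts how many of them are new:

import random

def simulate(pool_size=25000, yield_n=4, start_items=19, seed=0):
    # Each unused item finds about N = yield_n items from a finite pool;
    # only previously unseen items fuel the next turn.
    rng = random.Random(seed)
    known = set(range(start_items))
    frontier = set(known)
    cycle = 0
    while frontier:
        cycle += 1
        found = set()
        for _item in frontier:
            found.update(rng.randrange(pool_size) for _ in range(yield_n))
        frontier = found - known   # growth lasts while this stays large
        known |= frontier
        print(f"cycle {cycle}: {len(frontier)} new, {len(known)} total")

simulate()

The printed counts first grow roughly exponentially and then collapse once most of the pool is known, mirroring the two phases described above.</Paragraph>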
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Examples </SectionTitle> <Paragraph position="0"> Let us examine the learning of items more closely by example: from the known first name John, the last-name candidate Hauberg was found in the fragment &quot;...by John Hauberg and..&quot; by the rule FN-UC-LC => FN-LN-LC, and it was verified in occurrences like &quot;Robert Hauberg, ...&quot; and &quot;Robert Hauberg urges...&quot; using the already known first name Robert.</Paragraph> <Paragraph position="1"> Errors occur with words that are mainly used in positions often occupied by first names. In German, the algorithm extracts and verifies &quot;Ara&quot; (era) and &quot;Transportpanzer&quot; (army transportation tank) because of the common usage &quot;Ara Kohl&quot; and the proper name &quot;Transportpanzer Fuchs&quot; (fox tank). In the case of &quot;Ara&quot;, this false first name supports the classification of the proper last names Hinrichs, Strauss, Bangemann, Albrecht, Gorbatchow, Jelzin and many more.</Paragraph>
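<Paragraph position="2"> The extract-and-verify step of this example can be sketched as follows (a minimal reconstruction, not the authors' implementation; the toy tagger, the in-memory corpus and the support threshold are assumptions):

import re

KNOWN_FIRST_NAMES = {"John", "Robert"}

def tag(word):
    # Toy stand-in for the real tagging: FN = known first name,
    # UC = unknown capitalized word, LC = lowercase word.
    if word in KNOWN_FIRST_NAMES:
        return "FN"
    return "UC" if word[:1].isupper() else "LC"

def candidates(sentence):
    # Rule FN-UC-LC => FN-LN-LC: a capitalized word between a known
    # first name and a lowercase word is a last-name candidate.
    words = sentence.split()
    tags = [tag(w) for w in words]
    for i in range(len(words) - 2):
        if tags[i:i + 3] == ["FN", "UC", "LC"]:
            yield words[i + 1]

def verified(candidate, corpus_sentences, min_support=2):
    # Accept the candidate if known first names precede it often enough.
    support = 0
    for sentence in corpus_sentences:
        for match in re.finditer(r"(\w+)\s+" + re.escape(candidate), sentence):
            if match.group(1) in KNOWN_FIRST_NAMES:
                support += 1
    return support >= min_support

corpus = ["Robert Hauberg, a spokesman, said so.", "Robert Hauberg urges caution."]
for c in candidates("...by John Hauberg and others"):
    print(c, verified(c, corpus))

Run on the two corpus sentences above, the sketch extracts Hauberg from the fragment and verifies it through the occurrences with Robert.</Paragraph>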
</Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Precision and Recall </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Precision </SectionTitle> <Paragraph position="0"> Note that precision differs for the different types of name elements. Surnames are usually recognized with high precision; first names, for instance, may be confused with titles.</Paragraph> <Paragraph position="1"> Moreover, precision is language dependent, mainly because of the different usage of capital letters: in German, nouns start with capital letters and are therefore much more easily confused with names.</Paragraph> <Paragraph position="2"> For German first names in the run mentioned above, the algorithm yielded a precision of 84.1%. Noise items are mainly titles and profession names, which are spelled with a capital letter in German. Using the additional fact that first names usually do not exceed 10 letters in length, the precision for first names rose to 92.7%.</Paragraph> <Paragraph position="3"> For last names, results were excellent, with a precision of more than 99%. The same holds for titles, as further experiments showed.</Paragraph> <Paragraph position="4"> Since the ratio of first names to last names happens to be about 1:3, the overall precision for German was 97.5%.</Paragraph> <Paragraph position="5"> Because English has fewer capitalized words, the precision for English first names is higher: 92.6% without further filtering. The overall precision for English first and last names was 98.7%.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Recall </SectionTitle> <Paragraph position="0"> Recall mainly depends on the pattern rules used.</Paragraph> <Paragraph position="1"> The experiments were performed with the 14 handmade rules given in appendix 1, which are surely not sufficient.</Paragraph> <Paragraph position="2"> Calculating the recall is not at all straightforward, because we do not know how many names are contained in our corpora, and experiments on small corpora fail to show the natural growth of items described in the previous section. Furthermore, recall rises with a growing knowledge size.</Paragraph> <Paragraph position="3"> We therefore modified the algorithm so that it takes plain text as input, applies the rules to find candidates, and checks the candidates in the big corpus.</Paragraph> <Paragraph position="4"> Providing a large set of knowledge items, 71.4% of the person names were extracted correctly in an experiment processing 1000 sentences.</Paragraph> <Paragraph position="5"> To increase the coverage of the rules, it is possible to add rules manually or to start a process of rule learning as described below.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Propagation of Errors </SectionTitle> <Paragraph position="0"> During the run, the error rate increases because candidates are found and verified through misclassified items. However, as the &quot;Ara&quot; example (see section 4) illustrates, even misclassified items support the classification of goal items.</Paragraph> <Paragraph position="1"> The amount of deterioration depends highly on the pattern rules: strict rules mean low recall but high precision, whereas general rules have greater coverage but find too much, resulting in a trade-off between precision and recall.</Paragraph> <Paragraph position="2"> Table 1 shows the error rate for first names over the course of the illustrated run (see section 3).</Paragraph> <Paragraph position="3"> From this we conclude that the algorithm is robust against errors and that the quality of the classifications remains relatively stable during the run.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Classification on character level </SectionTitle> <Paragraph position="0"> In German, most words misclassified as first names were titles and professions. While they cannot be distinguished by the rules used, they differ strongly from the morphological point of view: German titles are usually longer because they are compounds, and the parts of these compounds are used very frequently.</Paragraph> <Paragraph position="1"> In this section, we introduce a method to distinguish between titles and first names at the character level, using the fact that the formation of words follows language-dependent rules.</Paragraph> <Paragraph position="2"> This procedure is implemented in the following classifiers. Assume the property we are interested in is visible at the beginning or at the ending of a word (for word endings this is basically true for different word classes in languages like English, French or German). We build a decision tree [cf. McCarthy & Lehnert 1995] that reads a word character by character and stops as soon as the feature is uniquely determined. Classifier A reads a word from its beginning, classifier B from its end. Finally, we can use any connected substring of the word instead of substrings containing the beginning or the end (classifier C).</Paragraph> <Paragraph position="3"> If the training set is large enough and the algorithm of the classifier is appropriate, it will cover general rules as well as many exceptions.</Paragraph> <Paragraph position="4"> Classifiers A and B differ only in the direction in which a word is analyzed. We build decision trees with additional default features as follows.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Classifier A: Considering Prefixes </SectionTitle> <Paragraph position="0"> Step 1: Building the ordinary decision tree: Given the training word list, we construct a prefix tree [cf. Gusfield 1999, Navarro 2001:38ff]. The leaves of the tree correspond to the ends of words; there we store the feature of the corresponding word.</Paragraph> <Paragraph position="1"> Step 2: Reduction of the decision tree: If all children of a given node have the same feature, this feature is lifted to the parent node and the children are deleted.</Paragraph> <Paragraph position="2"> Step 3: Insertion of default features: If a node does not yet have a feature, but one of the features is very dominant among its children (say, present in 80% of them), this feature is assigned to the node as its default feature.</Paragraph> <Paragraph position="3"> For classification, the decision tree is used as follows: Step 1: Searching the tree: Reading the given word from left to right, we follow the tree as far as possible. The reading process stops in a certain node N.</Paragraph> <Paragraph position="4"> Step 2: Classification: If the node N has an assigned feature F, then return F. Otherwise return no decision.</Paragraph> <Paragraph position="5"> Figure 2 shows part of the decision tree built from the first names Theoardis, Theobald, Theoderich, Theodor, Theresa, Therese, ... and the single title Theologe (which happens to be the only title in our training list starting with Theo). After reduction, almost all children of the node Theo carry the feature firstname; only the branch leading to Theologe carries the feature title. Since title is singular among the children, the parent node Theo gets the default feature firstname. As a consequence, the unseen word Theophil will correctly be classified as firstname, while the exception Theologe will still be classified as title.</Paragraph> <Paragraph position="6"> As mentioned above, algorithm B works in the same way as algorithm A, using suffixes instead of prefixes for the decision tree.</Paragraph>
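<Paragraph position="7"> One way to implement classifier A is sketched below (a minimal reconstruction of steps 1-3 and of the classification procedure, not the authors' implementation; the 80% default threshold is the value suggested above):

class Node:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.feature = None  # leaf feature, lifted or default feature

def build_tree(training_pairs, default_ratio=0.8):
    root = Node()
    # Step 1: ordinary prefix tree; the node where a word ends
    # stores that word's feature.
    for word, feature in training_pairs:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.feature = feature
    reduce_tree(root, default_ratio)
    return root

def reduce_tree(node, default_ratio):
    for child in node.children.values():
        reduce_tree(child, default_ratio)
    feats = [c.feature for c in node.children.values() if c.feature is not None]
    if not feats:
        return
    if len(set(feats)) == 1 and len(feats) == len(node.children):
        # Step 2: all children agree; lift the feature, delete the children.
        if node.feature is None:
            node.feature = feats[0]
            node.children = {}
    else:
        # Step 3: a very dominant feature becomes the default feature.
        top = max(set(feats), key=feats.count)
        if node.feature is None and feats.count(top) >= default_ratio * len(node.children):
            node.feature = top

def classify(root, word):
    node = root
    for ch in word:              # follow the tree as far as possible
        if ch not in node.children:
            break
        node = node.children[ch]
    return node.feature          # None means: no decision

For a tree behaving like the one in figure 2, a word like Theophil leaves the tree below the node Theo and receives that node's default feature firstname, while Theologe still stops in a node carrying title.</Paragraph>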
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Classifier C: Considering Substrings </SectionTitle> <Paragraph position="0"> Instead of concentrating on prefixes or suffixes, we consider all relevant continuous substrings of a given word. Unfortunately, there is no natural tree structure for this set. Hence, we construct a decision list without default features.</Paragraph> <Paragraph position="1"> Given a training list containing pairs (word, feature), the decision list is constructed as follows: Step 1: Collecting all substring information: We produce the following list L: for all pairs (wordN, featureN) from the training list, we generate all possible pairs of the kind (continuous substring of wordN, featureN). If wordN has length n, it has n(n+1)/2 continuous substrings. Finally, the list is sorted alphabetically and duplicates are removed.</Paragraph> <Paragraph position="2"> Step 2: Removing contradictions: If a substring occurs with more than one feature, its entries are deleted from L.</Paragraph> <Paragraph position="3"> Step 3: Removing tails: If a certain string now has a unique feature, all extensions of this string must have the same feature, since every word containing the extension also contains the string itself; the corresponding entries are therefore removed from L.</Paragraph> <Paragraph position="4"> For classification, the decision list is used as follows: Step 1: Look-up of substrings: For a word to be classified, we generate its continuous substrings and collect their features from L.</Paragraph> <Paragraph position="5"> Step 2: Classification: If all collected features are equal, then return this feature. Otherwise, return no decision.</Paragraph>
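<Paragraph position="6"> A compact sketch of this decision list is given below (again a reconstruction of the three construction steps and the look-up, not the authors' implementation):

def substrings(word):
    n = len(word)
    return {word[i:j] for i in range(n) for j in range(i + 1, n + 1)}

def build_decision_list(training_pairs):
    # Step 1: map every continuous substring to the set of its features.
    feats = {}
    for word, feature in training_pairs:
        for s in substrings(word):
            feats.setdefault(s, set()).add(feature)
    # Step 2: drop contradictory substrings (more than one feature).
    L = {s: fs.pop() for s, fs in feats.items() if len(fs) == 1}
    # Step 3: drop redundant extensions of an already unique substring.
    for s in sorted(L, key=len):
        if s in L:
            for t in list(L):
                if t != s and s in t:
                    del L[t]
    return L

def classify(L, word):
    found = {L[s] for s in substrings(word) if s in L}
    return found.pop() if len(found) == 1 else None  # None = no decision

If, for instance, the substring &quot;loge&quot; occurs only in titles of the training list, it keeps the feature title and all of its longer extensions are removed as redundant in step 3.</Paragraph>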
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Properties of the classifiers </SectionTitle> <Paragraph position="0"> In the following, we assume that the classifiers are trained with non-contradictory data. The classifiers then have the following properties: * The classifiers reproduce the results given in the training set. Hence, they can also be trained with rare exceptions. * It is necessary to have a training set covering all aspects of the data; otherwise the decision tree will be confused. * It is appropriate to return no decision if the classifier stops in the decision tree at a point where the children have mixed features.</Paragraph> <Paragraph position="1"> Bagging [cf. Breiman 1996] the three classifiers, we achieved a precision of 94.7% with 94.5% recall, distinguishing between the three classes, using a training set of merely 1368 examples and a test set of 683 items.</Paragraph> <Paragraph position="2"> This method of postprocessing is applicable to all features visible to the three classifiers, which are: * features represented by word suffixes or prefixes, such as inflection and some word formation rules; * features shared by words that are similar as strings; candidates are all kinds of proper names, as well as distinguishing parts of speech; * words of languages for special purposes, which are often built by combining parts, some of which are very typical for a given domain; examples are chemical substances, professions and titles, or industrial goods.</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Rule Learning </SectionTitle> <Paragraph position="0"> Unlike most tasks in Inductive Logic Programming (ILP) [cf. Dzeroski, 2001], our method needs rules-of-thumb that find many candidates, as in boosting [cf. Collins, 1999], rather than rules with a precision of 100%.</Paragraph> <Paragraph position="1"> For automatic rule induction we used a training set of 236 sentences, found automatically by taking sentences containing known first names and last names from the corpus. After extensive annotation, all possible rules were built according to the contexts of the known items and were afterwards tested on the training set. To avoid overly general rules like UC-UC => FN-UC, the patterns had to contain at least one problem-specific tag (i.e. FN, LN or TIT). The rules performing above a certain precision threshold (0.7 in our experiments) were taken as input for our algorithm.</Paragraph> <Paragraph position="2"> We obtained 106 rules for first names, 67 for last names and 4 for titles, ranging from very specific rules to very general ones like TI-UC => TI-FN.</Paragraph> <Paragraph position="3"> In the table below, some rules found by automatic induction are shown.</Paragraph> <Paragraph position="4"> Using those rules as input for our algorithm, we gained both higher recall and higher precision compared to the handmade rules when starting with the same knowledge. Table 3 shows precision rates for the three classes of name elements; the data was taken from a run with 19 start elements, with the length filter for first names applied and without the string classifiers. Due to the less strict rules, precision decreases over the course of the run. [Table 3: precision for FN, LN and TIT per interval of total items.] First-name items accounted for 23.3% of the total items, last-name items for 75.2%, and title items for only 1.4%, owing to the low number of title rules.</Paragraph> </Section> </Paper>