<?xml version="1.0" standalone="yes"?> <Paper uid="W02-2023"> <Title>Named Entity Learning and Verification: Expectation Maximization in Large Corpora</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 The Algorithm </SectionTitle> <Paragraph position="0"> Our learning algorithm starts with a set of patterns and initial name elements. A large corpus of more than 10 million sentences [cf. Quasthoff & Wolff 2000], taken from newspapers of the last 10 years, is used both for identifying candidates for new name elements and for verifying the candidates found. The algorithm stops when no new name elements are found.</Paragraph> <Paragraph position="1"> The algorithm implements expectation maximization (EM) [cf. Dempster, 1977, Collins, 1999] in the following way: The combination of a learning step and a verification step is iterated. As more name elements are found, the recall of the verification step increases. The key property of this algorithm is that it ensures high precision while still achieving massive recall.</Paragraph> <Paragraph position="2"> From another point of view, our algorithm implements bootstrapping [cf. Riloff 99], as it starts from a small number of seed words and uses knowledge found during the run to find more candidates.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Patterns and Pattern Rules </SectionTitle> <Paragraph position="0"> In a first step, the text to be analysed is tagged in the following way: We have two types of tags.</Paragraph> <Paragraph position="1"> The first type is problem dependent. In the case of persons, we have tags for title or profession (TI), first name (FN) and surname (LN). The second tag set is problem independent, but language dependent. In our experiments, we marked words as lower case (LC) or upper case (UC) depending on their first letter.
Punctuation marks are marked as PM, determiners as DET.</Paragraph> <Paragraph position="2"> Words can have multiple tags, e.g. UC and FN at the same time.</Paragraph> <Paragraph position="3"> The next step is to find tag sequences that are typical for names, like TI-FN-LN. From here, we can create rules like TI-UC-LN = TI-FN-LN, which means that an upper case word between a title and a last name is a candidate for a first name. An overview of the handmade start rules is given in appendix 1.</Paragraph> <Paragraph position="4"> Looking at the rules, it is possible to argue that a rule like UC-LN = FN-LN is a massive overgeneralization. This would be true if we learned new name elements simply by applying rules. However, the verification step ensures that false friends are eliminated at a high rate.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The Outer Loop </SectionTitle> <Paragraph position="0"> The algorithm is described as follows: Load pattern rules.</Paragraph> <Paragraph position="1"> Let unused name elements = initial set of name elements. Loop: For each unused name element, do the learning step and collect new candidates. For each new candidate, do the verification step. Output verified candidates. Let unused name elements = verified candidates.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 The Learning Step: Finding Candidates </SectionTitle> <Paragraph position="0"> Using the pattern rules and the current name elements, new candidates are searched for.
Here we use the corpus.</Paragraph> <Paragraph position="1"> Search 255 random sentences containing the unused name element (or all of them, if there are fewer than 255).</Paragraph> <Paragraph position="2"> Use the pattern rules to identify new candidates as described above.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 The Verification Step </SectionTitle> <Paragraph position="0"> In a verification step, each candidate is tested before it is used to generate new candidates. We test the following property: Does the candidate appear often enough together with verified name elements? Again, we use the corpus.</Paragraph> <Paragraph position="1"> Search 30 random sentences containing the name element to be verified (or all of them, if there are fewer than 30).</Paragraph> <Paragraph position="2"> If the proportion of these sentences that match the right side of at least one pattern rule is above some threshold, the candidate is accepted.</Paragraph> </Section> </Section> </Paper>