File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/99/w99-0612_intro.xml
Size: 3,951 bytes
Last Modified: 2025-10-06 14:07:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0612"> <Title>Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence</Title> <Section position="3" start_page="90" end_page="90" type="intro"> <SectionTitle> I </SectionTitle> <Paragraph position="0"> institutions are named after persons, such as &quot;Washington&quot; or &quot;Madison&quot;, or, in some cases, vice-versa: &quot;Ion Popescu Topolog&quot; is the name of a Romanian writer, who added to his name the name of the river &quot;Topolog&quot;).</Paragraph> <Paragraph position="1"> Clearly, in many: cases, the context for only one occurrence of a new word and its morphological information is not enough to make a decision. But, as noted in Katz (1996), a newly introduced entity will be repeated, &quot;if not for breaking the monotonous effect of pronoun use~ then for emphasis and clarity&quot;. Moreover, he claims that the number of instances of the new entity is not associated with the document length but with the importance of the entity with regard to the subject/discourse. We will use this prop-erty in conjunction with the one sense per discourse tendency noted by Gale, Church and Yarowsky (1992b), who showed that words strongly tend to exhibit only one sense in a document/discourse. By gathering contextual information about the entity from each of its occurrences in the text and using morphological clues as well, we expect to classify entities more effectively than if they are considered in isolation, especially those that are very important with regard to the' subject. When analyzing large texts, a segmentation phase should be considered, so that all the instances of a name in a segment have a high probability of belonging to the same class, and thus the contextual information for all instances within a segment c'an be used jointly when making a decision. Since the precision of the segmentation is not critical, a language independent segmentation system like the one presented by Amithay, Richmond and Smith (1997) is adequately reliable for this task.</Paragraph> <Paragraph position="2"> 1.2 Tokenized Text vs. Plain Text There are two basic alternatives for handling a text.</Paragraph> <Paragraph position="3"> The first one is to tokenize it and classify the individual tokens or group of tokens. This alternative works for languages that use word separators (such as spaces or punctuation), where a relatively simple set of separator patterns can adequately tokenize the text. The second alternative is to classify entities simply with respect to a given starting and ending character position, without knowing the word boundaries, but just the probability (that can be learned automatically) of a boundary given the neighboring contexts. This second alternative works for languages like Chinese, where no separators between the words are typically used. Since for the first class of languages we can define a priori probabilities for boundaries that will match the actual separators, this second approach represents a generalization of the one using tokenized text.</Paragraph> <Paragraph position="4"> However, the first method, in which the text is tokenized, presents! the advantage that statistics for both tokens and types can be kept and, as the results show, the statistics for types seem to be more reliable than those for tokens. 
<Paragraph position="4"> However, the first method, in which the text is tokenized, has the advantage that statistics can be kept for both tokens and types and, as the results show, the statistics for types appear more reliable than those for tokens. Using the second method, there is no single definition of &quot;type&quot;, given that there are multiple possible boundaries for each token instance, but there are ways to gather statistics, such as considering what we may call &quot;probable types&quot; according to the boundary probabilities, or keeping statistics on sistrings (semi-infinite strings).</Paragraph> <Paragraph position="5"> Some other advantages and disadvantages of the two methods will be discussed below.</Paragraph> </Section> </Paper>