<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1122"> <Title>Named Entity Discovery Using Comparable News Articles</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation and Discussion </SectionTitle> <Paragraph position="0"> To evaluate the performance, we ranked 966 single words and 810 consecutive word pairs, both randomly selected, and measured how many Named Entities were included among the highly ranked words.</Paragraph> <Paragraph position="1"> We manually classified as names the words in the following categories used in IREX (Sekine and Isahara, 2000): PERSON, ORGANIZATION, LOCATION, and PRODUCT. In both experiments, we regarded a name that can stand by itself as a correct Named Entity, even if it does not span the entire noun phrase.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Single-word Experiment </SectionTitle> <Paragraph position="0"> Table 1 shows an excerpt of the ranking result. For each word, the type of the word, the document frequency, and the similarity score sim(w) are listed.</Paragraph> <Paragraph position="1"> Obvious typos are classified as &quot;typo&quot;. One can observe that a highly ranked word is more likely to be a Named Entity than a lower-ranked one. To show this correlation clearly, we plot the score of the words against the likelihood of being a Named Entity in Figure 2. Since the actual number of the words is discrete, we computed the likelihood by counting Named Entities in a 50-word window around each score.</Paragraph> <Paragraph position="2"> Table 3 shows the number of obtained Named Entities. [Figure 2: the score of a word (horizontal axis) versus the likelihood of being a Named Entity (vertical axis) in the single-word experiment. The likelihood of NE increases as the score of a word goes up, but there is a huge peak near score zero.]</Paragraph> <Paragraph position="3"> By taking highly ranked words (sim(w) >= 0.6), we can discover rare Named Entities with 90% accuracy. However, one can notice that there is a huge peak near the score sim(w) = 0. This means that many Named Entities still remain at the lower scores. Most such Named Entities appeared in only one newspaper. Named Entities given a score less than zero were likely to refer to completely different entities. For example, the word &quot;Stan&quot; can be used as a person name but was given a negative score, because it was used as the first name of more than 10 different people in several overlapping periods.</Paragraph> <Paragraph position="4"> We also examined highly ranked words that are not Named Entities, as shown in Table 2.</Paragraph> <Paragraph position="5"> The words &quot;carseats&quot;, &quot;tiremaker&quot;, and &quot;neurotripic&quot; happened to appear in a small number of articles.</Paragraph> <Paragraph position="6"> Each of these articles and its comparable counterpart report the same event, and both use the same word, probably because there was no other succinct expression to paraphrase these rare words.</Paragraph> <Paragraph position="7"> In this way, these three words made a high spike. Another word was ranked high because of the repetition of articles: it appeared many times, which made its spike very sharp, but it turned out that its document frequency was undesirably inflated by identical articles. The word &quot;mishandle&quot; was used in a quote by a person in both articles, which also produced an undesirable spike.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Multi-word Experiment </SectionTitle> <Paragraph position="0"> In the multi-word experiment, the accuracy of the obtained Named Entities was lower than in the single-word experiment, as shown in Table 4, although a correlation was still found between the score and the likelihood. 
This is partly because there were far fewer Named Entities in the test data. Also, many word pairs included in the test data incorrectly capture a noun phrase boundary and may therefore contain an incomplete Named Entity. We think that this problem can be solved by using a chunk of words instead of two consecutive words. Another notable example in the multi-word ranking is a word pair quoted from the same speech. Since a news article sometimes quotes a person's speech literally, such word pairs are likely to appear at the same time in both newspapers. However, since multi-word expressions are much more varied than single-word ones, their overall frequency is lower, which makes such coincidences stand out easily. We think that this kind of problem can be alleviated to some degree by eliminating completely identical sentences from comparable articles.</Paragraph> <Paragraph position="1"> The obtained ranking of word pairs is listed in Table 5. The relationship between the score of word pairs and the likelihood of being Named Entities is plotted in Figure 3.</Paragraph> </Section> </Section> </Paper>
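Section 4.1 estimates the likelihood of being a Named Entity by counting NEs in a 50-word window around each score, since individual words give only discrete data points. The windowed estimate can be sketched as follows; the function name and the list-of-pairs input layout are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the sliding-window likelihood estimate from Section 4.1.
# Input layout (list of (score, is_named_entity) pairs) is an assumption.

def ne_likelihood(scored_words, window=50):
    """For each word, estimate P(NE) as the fraction of Named Entities
    among the `window` nearest-ranked words in score order.
    Returns a list of (score, likelihood) pairs sorted by score."""
    ranked = sorted(scored_words, key=lambda x: x[0])
    half = window // 2
    result = []
    for i, (score, _) in enumerate(ranked):
        # Take up to `window` words centered on position i in score order.
        lo = max(0, i - half)
        hi = min(len(ranked), i + half)
        span = ranked[lo:hi]
        ne_count = sum(1 for _, is_ne in span if is_ne)
        result.append((score, ne_count / len(span)))
    return result
```

Plotting the resulting pairs reproduces the kind of curve shown in Figure 2: a likelihood that rises with the score, with local peaks wherever NEs cluster at a particular score.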