<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1122"> <Title>Named Entity Discovery Using Comparable News Articles</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Synchronicity of Names </SectionTitle> <Paragraph position="0"> In this paper we propose another method to strengthen the lexical knowledge for Named Entity tagging by using the synchronicity of names in comparable documents. One can view a &quot;comparable&quot; document as an alternative expression of the same content. Two document sets in which each document of one set is associated with a document in the other are called a &quot;comparable&quot; corpus. A comparable corpus is less restricted than a parallel corpus and usually easier to obtain. Different newspapers published on the same day report many of the same events and therefore contain a number of comparable documents. One can also view a comparable corpus as a set of paraphrased documents. By exploiting this feature, one can extract paraphrastic expressions automatically from parallel corpora (Barzilay and McKeown, 2001) or comparable corpora (Shinyama and Sekine, 2003).</Paragraph> <Paragraph position="1"> Named Entities in comparable documents have one notable characteristic: they tend to be preserved across comparable documents, because it is generally difficult to paraphrase names. We expect that product names and disease names are similarly hard to paraphrase, so they will also be preserved.
Therefore, if one Named Entity appears in one document, it should also appear in the comparable document.</Paragraph> <Paragraph position="2"> Consequently, if one has two sets of documents which are associated with each other, the distribution of a certain name in one document set should look similar to the distribution of the name in the other document set.</Paragraph> <Paragraph position="3"> We tried to use this characteristic of Named Entities to discover rare names from comparable news articles. We particularly focused on the time series distribution of a certain word in two newspapers.</Paragraph> <Paragraph position="4"> We hypothesized that if a Named Entity is used in two newspapers, it should appear in both newspapers synchronously, whereas other words need not.</Paragraph> <Paragraph position="5"> Since news articles are divided day by day, it is easy to obtain the time series distribution of words appearing in each newspaper.</Paragraph> <Paragraph position="6"> Figure 2 shows the time series distributions of the two words &quot;yigal&quot; and &quot;killed&quot;, which appeared in several newspapers in 1995. The word &quot;yigal&quot; (the name of the man who killed Israeli Prime Minister Yitzhak Rabin on Nov. 4, 1995) has a clear spike.</Paragraph> <Paragraph position="7"> There were a total of 363 documents which included the word that year, and its occurrences are synchronous between the two newspapers. In contrast, the word &quot;killed&quot;, which appeared in 21591 documents, is spread over the whole year and has no clear characteristic.
</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiment </SectionTitle> <Paragraph position="0"> To verify our hypothesis, we conducted an experiment to measure the correlation between a word's being a Named Entity and the similarity of its time series distributions in two newspapers.</Paragraph> <Paragraph position="1"> First, we picked a rare word, then obtained its document frequency, which is the number of articles that contain the word. Since newspaper articles are provided separately day by day, we sampled the document frequency for each day. These numbers form, for one year for example, a 365-element integer vector per newspaper. Because the actual number of news articles oscillates weekly, we normalized each count by dividing the number of articles containing the word by the total number of articles on that day. At the end we get a vector of fractions which range from 0.0 to 1.0.</Paragraph> <Paragraph position="2"> Next we compared these vectors and calculated the similarity of their time series distributions across different news sources. Our basic strategy was to use the cosine similarity of two vectors as the likelihood of the word's being a Named Entity. However, several issues arose in trying to apply this directly.</Paragraph> <Paragraph position="3"> Firstly, it is not always true that the same event is reported on the same day. An actual newspaper sometimes has a one- or two-day time lag depending on the news. To alleviate this effect, we applied a simple smoothing to each vector. Secondly, we needed to focus on the salient use of each word; otherwise a common noun which appears almost every day has an undesirably high similarity between newspapers. To avoid this, we tried to intensify the effect of a spike by comparing the deviation of the frequency instead of the frequency itself.
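The daily document-frequency normalization described above can be sketched as follows (a minimal illustration; the input format, a mapping from dates to tokenized articles, is a hypothetical stand-in for the newspaper data):

```python
def daily_frequency(articles_by_day, word, dates):
    """Normalized document frequency: for each day t, the fraction of
    that day's articles containing the word. Dividing by the day's total
    article count compensates for the weekly oscillation in volume."""
    freq = []
    for t in dates:
        docs = articles_by_day.get(t, [])
        df = sum(1 for doc in docs if word in doc)  # articles containing the word
        freq.append(df / len(docs) if docs else 0.0)
    return freq
```

Each entry of the resulting vector lies between 0.0 and 1.0, one entry per day.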
This way we can degrade the similarity of a word which has a &quot;flat&quot; distribution.</Paragraph> <Paragraph position="4"> In this section we first explain a single-word experiment which detects Named Entities that consist of one word. Next we explain a multi-word experiment which detects Named Entities that consist of exactly two words.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Single-word Experiment </SectionTitle> <Paragraph position="0"> In the single-word experiment, we used one year (1995) of two newspapers, the Los Angeles Times and Reuters. First we picked a rare word which appeared in either newspaper fewer than 100 times throughout the year. We only used a simple tokenizer and converted all words into lower case. A part of speech tagger was not used. Then we obtained the document frequency vector for the word. For each word w which appeared in newspaper A, we got the document frequency at date t: fA(w, t) = dfA(w, t) / NA(t), where dfA(w, t) is the number of documents which contain the word w at date t in newspaper A. The normalization constant NA(t) is the number of all articles at date t. However, comparing this value between two newspapers directly cannot capture a time lag. So we apply smoothing by the following formula to get an improved version of fA: f'A(w, t) = Σ_{i = −W..W} r^|i| fA(w, t + i)</Paragraph> <Paragraph position="2"> Here we give each occurrence of a word a &quot;stretch&quot; which sustains for W days. This way we can capture two occurrences which appear on slightly different days. In this experiment, we used W = 2 and r = 0.3, which sums up the numbers in a 5-day window.
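The smoothing step can be sketched in Python. Since the original formula image is not preserved in this extraction, the symmetric, exponentially decaying window below is a reconstruction from the surrounding description (W = 2 and r = 0.3 giving a 5-day window):

```python
def smooth(freq, W=2, r=0.3):
    """Stretch each occurrence over neighboring days:
    f'(t) = sum over i in [-W, W] of r^|i| * f(t + i).
    With W = 2 and r = 0.3 this sums a 5-day window with exponentially
    decreasing weights, so reports one or two days apart still overlap."""
    n = len(freq)
    smoothed = []
    for t in range(n):
        total = 0.0
        for i in range(-W, W + 1):
            if 0 <= t + i < n:  # skip days outside the sampled year
                total += (r ** abs(i)) * freq[t + i]
        smoothed.append(total)
    return smoothed
```

A single spike thus spreads to its neighbors with weights 0.09, 0.3, 1.0, 0.3, 0.09.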
It gives each occurrence a 5-day stretch which is exponentially decreasing.</Paragraph> <Paragraph position="3"> Then we make another modification to f'A by computing its deviation, to intensify a spike: FA(w, t) = (f'A(w, t) − mA(w)) / σA(w)</Paragraph> <Paragraph position="5"> where mA(w) and σA(w) are the average and the standard deviation of f'A(w, t) over all dates t: mA(w) = (1/T) Σ_t f'A(w, t), σA(w) = sqrt((1/T) Σ_t (f'A(w, t) − mA(w))²)</Paragraph> <Paragraph position="7"> Similarly, we calculated another time series FB(w) for newspaper B. Finally we computed sim(w), the cosine similarity of the two distributions of the word w, with the following formula: sim(w) = (FA(w) · FB(w)) / (|FA(w)| |FB(w)|)</Paragraph> <Paragraph position="9"> Since this is the cosine of the angle formed by the two vectors, the obtained similarity ranges from −1.0 to 1.0. We used sim(w) as the Named Entity score of the word and ranked the words by this score. Then we took the highly ranked words as Named Entities.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Multi-word Experiment </SectionTitle> <Paragraph position="0"> We also tried a similar experiment for compound words. To avoid chunking errors, we picked all consecutive two-word pairs which appeared in both newspapers, without using any part of speech tagger or chunker. Word pairs which include a pre-defined stop word such as &quot;the&quot; or &quot;with&quot; were eliminated. As with the single-word experiment, we measured the similarity between the time series distributions for a word pair in two newspapers. One different point is that we compared three newspapers rather than two, to gain more accuracy. Now the ranking score sim(w) given to a word pair is calculated as follows: sim(w) = simAB(w) · simBC(w) · simCA(w)</Paragraph> <Paragraph position="2"> where simXY(w) is the similarity of the distributions between two newspapers X and Y, which can be computed with the formula used in the single-word experiment.
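The remaining steps can be sketched as follows, under the reconstruction above: z-scoring the smoothed series, taking the cosine of the two resulting vectors, and combining the three pairwise similarities for the multi-word score (negative pairwise similarities are clipped to zero before multiplying, as described in the text). Function names are illustrative, not from the paper:

```python
import math

def deviation(smoothed):
    """Z-score the smoothed series to intensify spikes:
    F(t) = (f'(t) - mean) / standard deviation."""
    n = len(smoothed)
    m = sum(smoothed) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in smoothed) / n)
    return [(x - m) / sd for x in smoothed] if sd else [0.0] * n

def cosine(u, v):
    """Cosine of the angle between two vectors; ranges from -1.0 to 1.0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def multiword_score(FA, FB, FC):
    """Product of the three pairwise similarities; a negative similarity
    is treated as zero so two negatives cannot multiply to a positive."""
    score = 1.0
    for s in (cosine(FA, FB), cosine(FB, FC), cosine(FC, FA)):
        score *= max(0.0, s)
    return score
```

A flat series z-scores to values near zero everywhere, which is what degrades the similarity of common nouns.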
To avoid incorrectly multiplying two negative similarities, a negative similarity is treated as zero.</Paragraph> </Section> </Section> </Paper>