<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2025">
<Title>Unsupervised Discrimination and Labeling of Ambiguous Names</Title>
<Section position="4" start_page="145" end_page="145" type="metho">
<SectionTitle>
3 Feature Identification
</SectionTitle>
<Paragraph position="0"> We start by identifying features from a corpus of text which we refer to as the feature selection data.</Paragraph>
<Paragraph position="1"> This data can be the test data, i.e., the contexts to be clustered (each of which contains an occurrence of the ambiguous name), or it may be a separate corpus. The identified features are used to translate each context in the test data into a vector form.</Paragraph>
<Paragraph position="2"> We are exploring the use of bigrams as our feature type. These are lexical features that consist of an ordered pair of words which may occur next to each other or have one intervening word. We are interested in bigrams since they tend to be less ambiguous and more specific than individual unigrams.</Paragraph>
<Paragraph position="3"> In order to reduce the amount of noise in the feature set, we discard all bigrams that occur only once, or that have a log-likelihood ratio of less than 3.841 (the critical value of the chi-squared distribution at the 95% level with one degree of freedom). The latter criterion indicates that the words in the bigram are not independent (i.e., are associated) with 95% certainty. In addition, bigrams in which either word is a stop word are filtered out.</Paragraph>
</Section>
<Section position="5" start_page="145" end_page="146" type="metho">
<SectionTitle>
4 Context Representation
</SectionTitle>
<Paragraph position="0"> We employ both first and second order representations of the contexts to be clustered. The first order representation is a vector that indicates which of the features identified during the feature selection process occur in this context.</Paragraph>
<Paragraph position="1"> The second order context representation is adapted from (Schütze, 1998). First a co-occurrence matrix is constructed from the features identified in the earlier stage, where the rows represent the first word in the bigram, and the columns represent the second word. Each cell contains the value of the log-likelihood ratio for its respective row and column word pair.</Paragraph>
<Paragraph position="2"> This matrix is both large and sparse, so we use Singular Value Decomposition (SVD) to reduce the dimensionality and smooth the sparsity. SVD has the effect of compressing similar columns together, and then reorganizing the matrix so that the most significant of these columns come first in the matrix. This allows the matrix to be represented more compactly by a smaller number of these compressed columns.</Paragraph>
<Paragraph position="3"> The matrix is reduced to the smaller of 10% of the original number of columns or 300 dimensions: if the original number of columns is less than 3,000, the matrix is reduced to 10% of that number; if it has more than 3,000 columns, it is reduced to 300.</Paragraph>
<Paragraph position="4"> Each row in the resulting matrix is a vector for the word the row represents. For the second order representation, each context in the test data is represented by a vector which is created by averaging the word vectors for all the words in the context.</Paragraph>
<Paragraph position="5"> The philosophy behind the second order representation is that it captures indirect relationships between bigrams which cannot be captured using the first order representation. For example, if the word ergonomics occurs along with science, and workplace occurs with science but not with ergonomics, then workplace and ergonomics are second order co-occurrences by virtue of their respective co-occurrences with science.</Paragraph>
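<Paragraph> The following minimal Python sketch illustrates one possible implementation of the feature selection and second order representation steps described above. It is an illustration rather than our actual implementation: the function names and toy stop list are made up for the example, numpy's SVD stands in for whatever SVD routine is used, and the contingency counts reflect one straightforward reading of the bigram log-likelihood test. </Paragraph>

```python
import math
from collections import Counter

import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}  # toy stop list

def extract_bigrams(tokens):
    """Ordered word pairs that are adjacent or have one intervening word."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in (i + 1, i + 2) if j < len(tokens)]

def log_likelihood(n11, n1p, np1, n):
    """G^2 statistic for the 2x2 contingency table of a word pair:
    n11 = pair count, n1p = count of word1 in first position,
    np1 = count of word2 in second position, n = total pair tokens."""
    g2 = 0.0
    for obs, exp in [(n11, n1p * np1 / n),
                     (n1p - n11, n1p * (n - np1) / n),
                     (np1 - n11, (n - n1p) * np1 / n),
                     (n - n1p - np1 + n11, (n - n1p) * (n - np1) / n)]:
        if obs > 0:
            g2 += obs * math.log(obs / exp)
    return 2.0 * g2

def select_features(contexts):
    """Keep bigrams occurring more than once, containing no stop word,
    and scoring G^2 >= 3.841 (chi-squared, p = .05, one d.f.)."""
    pairs = Counter(p for c in contexts for p in extract_bigrams(c))
    first, second = Counter(), Counter()
    for (w1, w2), k in pairs.items():
        first[w1] += k
        second[w2] += k
    n = sum(pairs.values())
    features = {}
    for (w1, w2), k in pairs.items():
        if k < 2 or w1 in STOP_WORDS or w2 in STOP_WORDS:
            continue
        g2 = log_likelihood(k, first[w1], second[w2], n)
        if g2 >= 3.841:
            features[(w1, w2)] = g2
    return features

def second_order_vectors(features, contexts):
    """Co-occurrence matrix -> SVD-reduced word vectors -> context vectors."""
    rows = sorted({w1 for w1, _ in features})
    cols = sorted({w2 for _, w2 in features})
    ri = {w: i for i, w in enumerate(rows)}
    ci = {w: i for i, w in enumerate(cols)}
    m = np.zeros((len(rows), len(cols)))
    for (w1, w2), g2 in features.items():
        m[ri[w1], ci[w2]] = g2
    # Reduce to 10% of the original columns, capped at 300 dimensions.
    k = min(max(len(cols) // 10, 1), 300)
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    word_vecs = u[:, :k] * s[:k]          # one vector per row word
    # Each context vector is the average of the vectors of its words.
    ctx_vecs = []
    for c in contexts:
        vs = [word_vecs[ri[w]] for w in c if w in ri]
        ctx_vecs.append(np.mean(vs, axis=0) if vs
                        else np.zeros(word_vecs.shape[1]))
    return np.array(ctx_vecs)

# contexts: list of token lists, one per occurrence of the ambiguous name
# feats = select_features(contexts); X = second_order_vectors(feats, contexts)
```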
<Paragraph position="6"> Once the context is represented by either a first order or a second order vector, clustering can follow. A hybrid method known as Repeated Bisections is employed, which tries to balance the quality of agglomerative clustering with the speed of partitional methods. In our current approach the number of clusters to be discovered must be specified; making it possible to identify this number automatically is one of our high priorities for future work.</Paragraph>
</Section>
<Section position="6" start_page="146" end_page="146" type="metho">
<SectionTitle>
5 Labeling
</SectionTitle>
<Paragraph position="0"> Once the clusters are created, we assign each cluster a descriptive and discriminating label. A label is a list of bigrams that acts as a simple summary of the contents of the cluster.</Paragraph>
<Paragraph position="1"> Our current approach for descriptive labels is to select the top N bigrams from the contexts grouped in a cluster. We use techniques similar to those used for feature identification, except that we now apply them to the clustered contexts. In particular, we select the top 5 or 10 bigrams as ranked by the log-likelihood ratio.</Paragraph>
<Paragraph position="2"> We discard bigrams if either of the words is a stop word, or if the bigram occurs only once. For discriminating labels we pick the top 5 or 10 bigrams that are unique to the cluster and thus capture the content that separates one cluster from another.</Paragraph>
</Section>
<Section position="7" start_page="146" end_page="146" type="metho">
<SectionTitle>
6 Experimental Data
</SectionTitle>
<Paragraph position="0"> Our experimental data consists of two or more unambiguous names whose occurrences in a corpus have been conflated in order to create ambiguity.</Paragraph>
<Paragraph position="1"> These conflated forms are sometimes known as pseudo-words. For example, we take all occurrences of Tony Blair and Bill Clinton and conflate them into a single name that we then attempt to discriminate.</Paragraph>
<Paragraph position="2"> Further, we believe that the use of artificial pseudo-words is suitable for the problem of name discrimination, perhaps more so than is the case in word sense disambiguation in general. For words there is always a debate as to what constitutes a word sense and how finely a sense distinction should be drawn. However, when given an ambiguous name there are distinct underlying entities associated with that name, so evaluation relative to such true categories is realistic.</Paragraph>
<Paragraph position="3"> Our source of data is the New York Times (January 2000 to June 2002) portion of the English Gigaword corpus.</Paragraph>
<Paragraph position="4"> In creating the contexts that include our conflated names, we retain 25 words of text to the left and to the right of the ambiguous conflated name. We also preserve the original names in a separate tag for the evaluation stage.</Paragraph>
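<Paragraph> A minimal sketch of this conflation step is given below, assuming plain-text input already stripped of markup; the pseudo-name format, the function name, and the dictionary layout are illustrative choices, not the actual tagging format used for evaluation. </Paragraph>

```python
import re

def conflate(text, names, window=25):
    """Replace each occurrence of any target name with a single conflated
    pseudo-name and keep `window` words of text on either side.  The true
    name is stored alongside each context for the evaluation stage."""
    pseudo = "_".join(n.replace(" ", "") for n in names)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    contexts = []
    for m in pattern.finditer(text):
        left = text[:m.start()].split()[-window:]
        right = text[m.end():].split()[:window]
        contexts.append({"context": " ".join(left + [pseudo] + right),
                         "true_name": m.group(0)})
    return contexts

# e.g. conflate(nyt_text, ["Tony Blair", "Bill Clinton"])
# yields contexts around the pseudo-name "TonyBlair_BillClinton"
```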
<Paragraph position="5"> We have created three levels of ambiguity: 2-way, 3-way, and 4-way. In each of the three categories we have 3-4 examples that represent a variety of different degrees of ambiguity. We have created several examples of intra-category disambiguation, including Bill Clinton and Tony Blair (political leaders), and Mexico and India (countries). We also have inter-category disambiguation such as Bayer, Bank of America, and John Grisham (two companies and an author).</Paragraph>
<Paragraph position="6"> The 3-way examples have been chosen by adding one more dimension to the 2-way examples; for example, Ehud Barak is added to Bill Clinton and Tony Blair. The 4-way examples are selected along similar lines.</Paragraph>
</Section>
</Paper>