File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/c04-1116_intro.xml
Size: 5,395 bytes
Last Modified: 2025-10-06 14:02:10
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1116"> <Title>Term Aggregation: Mining Synonymous Expressions using Personal Stylistic Variations</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The replacement of words with a representative synonymous expression dramatically enhances text analysis systems. We developed a text mining system called TAKMI (Nasukawa, 2001), which finds valuable patterns and rules in text that indicate trends and significant features about specific topics, using not only word frequency but also predicate-argument pairs that indicate dependencies among terms. The dependency information helps to distinguish between sentences by their meaning.</Paragraph> <Paragraph position="1"> Here are some examples of sentences from a PC call center's logs, along with the extracted dependency pairs: &quot;customer broke a tp&quot; yields customer...break and break...tp; &quot;end user broke a ThinkPad&quot; yields end user...break and break...ThinkPad. In these examples, &quot;customer&quot; and &quot;end user&quot;, and likewise &quot;tp&quot; and &quot;ThinkPad&quot;, can be assumed to have the same meaning in terms of this analysis for the call center's operations. Thus, these two sentences have the same meaning, but the differences in expressions prevent us from recognizing their identity. The variety of synonymous expressions causes a lack of consistency in expressions. Other examples of synonymous expressions are:</Paragraph> <Paragraph position="3"> One way to address this problem is to assign canonical forms to synonymous expressions and variations of inconsistent expressions. The goal of this paper is to find those synonymous expressions and inconsistent variations that can be replaced with a canonical form for text analysis. We call this operation &quot;term aggregation&quot;. Term aggregation is different from general synonym finding.
For instance, &quot;customer&quot; and &quot;end user&quot; may not be synonyms in general, but we recognize both words as &quot;customer&quot; in the context of a manufacturer's call center logs. Thus, the words we want to aggregate may not be synonyms, but their roles in the sentences are the same in the target domain from the mining perspective. Even so, we can perform term aggregation using the same methods as in synonym finding, such as using word feature similarities.</Paragraph> <Paragraph position="4"> There are several approaches for the automatic extraction of synonymous expressions, such as using word context features, but the results of such approaches tend to contain some antonymous expressions as noise. For instance, a system may extract &quot;agent&quot; as a synonymous expression for &quot;customer&quot;, since they share the same feature of being human, and since both words appear as subjects of the same predicates, such as &quot;talk&quot;, &quot;watch&quot;, and &quot;ask&quot;.</Paragraph> <Paragraph position="5"> In general, it is difficult to distinguish synonymous expressions from antonymous expressions based on their context. However, if we have a coherent corpus, one in which the use of expressions is consistent for the same meaning, the words extracted from that corpus are guaranteed to have different meanings from each other.</Paragraph> <Paragraph position="6">
Words with similar contexts within incoherent corpora consist of various expressions including synonyms and antonyms, as in the left-hand side of Figure 1, because of the use of synonymous expressions as in the upper right box of the figure.</Paragraph> <Paragraph position="7"> In contrast, words with similar contexts within each coherent corpus do not contain synonymous expressions, as in the lower right box of the figure.</Paragraph> <Paragraph position="8"> By using the information about non-synonymous expressions with similar contexts, we can deduce the synonymous expressions from the words with similar contexts within incoherent corpora by removing the non-synonymous expressions.</Paragraph> <Paragraph position="9"> In this paper, we use a set of textual data written by the same author as a coherent corpus. Our assumption is that one person tends to use one expression to represent one meaning. For example, one author may consistently write &quot;user&quot; for &quot;customer&quot; and &quot;agt&quot; for &quot;agent&quot;, as in Figure 1. Our method has three steps: extraction of synonymous expression candidates, extraction of noise candidates, and re-evaluation with these candidates.</Paragraph> <Paragraph position="10"> To evaluate the performance of our method, we conducted experiments on extracting term aggregation sets. The experimental results indicate that our method achieves better precision than the basic synonym extraction approach, though the recall rates are slightly reduced.</Paragraph> <Paragraph position="11"> The rest of this paper is organized as follows.</Paragraph> <Paragraph position="12"> Section 2 describes the personal stylistic variations in each author's text, and Section 3 gives an overview of our system. We present the experimental results and discussion in Section</Paragraph> </Section> </Paper>