<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1116">
  <Title>Term Aggregation: Mining Synonymous Expressions using Personal Stylistic Variations</Title>
  <Section position="3" start_page="0" end_page="1" type="metho">
    <SectionTitle>
Authors' Corpora
</SectionTitle>
    <Paragraph position="0"> According to our assumption, each author uses a unique expression to represent one semantic concept, even though various expressions can be used for representing the same meaning. To evaluate this assumption, we analyzed a call center's corpus, which was typed in by the call takers in a personal computer service call center  tomer&amp;quot; in each Call Taker's Text.</Paragraph>
    <Paragraph position="1"> Table1 shows variations of the expressions for &amp;quot; customer&amp;quot; which were used by the call takers. This table shows that each call taker mainly used one  The IBM PC Help Center unique expression to represent one meaning with a consistency ratio of about 80%, but the other 20% are other expressions.</Paragraph>
    <Paragraph position="2"> These results show our assumption holds for the tendency for one expression to have one meaning within the same author's corpus. However, it also demonstrated that multiple expressions for the same meaning appear within the same author's corpus even though the distribution of the appearences clearly leans toward one expression. Thus, we should consider this fact when we apply this assumption. null</Paragraph>
  </Section>
  <Section position="4" start_page="1" end_page="3" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Data Overview
</SectionTitle>
      <Paragraph position="0"> In our experiments we used one month's worth of data stored in the call center, containing about five million words. The number of unique nouns was 29,961, and the number of unique verbs was 11,737, and 3,350,200 dependency pairs were extracted from the data. We then created ten subcorpora in such a manner that each of them contains data provided by the same call taker. The average number of predicate-argument pairs in each subcorpus was 37,454. In our experiments, we selected ten authors' corpus according to their size from the larger one.</Paragraph>
      <Paragraph position="1"> To evaluate the experiments, we manually created some evaluation data sets. The evaluation data sets were made for ten target words, and the average number of variants was 7.8 words for each target word. Some examples are shown in Table2.</Paragraph>
      <Paragraph position="2"> target concept variants  customer customer, cu, cus, cust, end user, user, eu HDD harddisk, hdd drive, HD, HDD, hdds, harddrive, hd, H.D battery Battery, batteyr, battery, battary, batt, bat screen display, monitor, moniter, Monitor  For the cannonical expressions for each target word, we simply selected the most frequent expression from the variants.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Text Analysis Tool for Noisy Data
</SectionTitle>
      <Paragraph position="0"> In the call center data there are some difficulties for natural language processing because the data contains a lot of informal writing. The major problems are; AF Words are often abbreviated AF There are many spelling errors AF Case is used inconsistently Shallow processing is suitable for such noisy data, so we used a Markov-model-based tagger, essentially the same as the one described in (Charniak, 1993) in our experiments  . This tagger assigns a POS based on the distribution of the candidate POSs for each word and the probability of POS transitions extracted from a training corpus, and we used a manually annotated corpus of articles from the Wall Street Journal in the Penn Treebank corpus  as a training corpus. This tagger treats an unknown word that did not appear in the training corpus as a noun. In addition, it assigns a canonical form to words without inflections.</Paragraph>
      <Paragraph position="1"> After POS tagging for each sequence of words in a document, it is possible to apply a cascaded set of rules, successively identifying more and more complex phrasal groups. Therefore, simple patterns will be identified as simple noun groups and verb groups, and these can be composed into a variety of complex NP configurations. At a still higher level, clause boundaries can be marked, and even (nominal) arguments for (verb) predicates can be identified. The accuracy of these analyses is lower than the accuracy of the POS assignment.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Term Aggregation using Personal Stylistic
Variations
</SectionTitle>
      <Paragraph position="0"> In this section we explain how to aggregate words using these word features. We have three steps for the term aggregation: creating noun feature vectors, extracting synonymous expressions and noise candidates, and a re-evaluation.</Paragraph>
      <Paragraph position="1">  There is a number of research reports on word similarities, and the major approach is comparing their contexts in the texts. Contexts can be defined in two different ways: syntactic-based and window-based techniques. Syntactic-based techniques consider the linguistic information about part-of-speech categories and syntactic groupings/ relationships. Window-based techniques consider an arbitrary number of words around the given  This shallow syntactic parser is called CCAT based on the TEXTRACT architecture (Neff, 2003) developed at IBM Watson Research Center.</Paragraph>
      <Paragraph position="2">  word. The words we want to aggregate for text analysis are not rigorous synonyms, but the &amp;quot;role&amp;quot; is the same, so we have to consider the syntactic relation based on the assumptions that words with the same role tend to modify or be modified by similar words (Hindle, 1990; Strzalkowski, 1992). On the other hand, window-based techniques are not suitable for our data, because the documents are written by several authors who have a variety of different writing styles (e.g. selecting different prepositions and articles). Therefore we consider only syntactic features: dependency pairs, which consist of nouns, verbs, and their relationships. A dependency pair is written as (noun, verb(with its relationship)) as in the following examples.</Paragraph>
      <Paragraph position="3"> (customer, bootAZ) (customer, shut offAZ) (tp, shut offAY) The symbol AZ means the noun modifies the verb, and AY means the verb modifies the noun. By using these extracted pairs, we can assign a frequency value to each noun and verb as in a vector space model. We use a noun feature vector (NFV) to evaluate the similarities between nouns. The NFVs are made for each authors' corpora and for the entire corpus, which contains all of the author's corpora.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3.2 Extracting Synonymous Expression Candidates and Noise Candidates
</SectionTitle>
      <Paragraph position="0"> The similarity between two nouns that we used in our approach is defined as the cosine coefficient of the two NFVs. Then we can get the relevant candidate lists that are sorted by word similarities between nouns and the target word. The noun list from the entire corpus is based on the similarities between the target's NFV in the entire corpus and the NFVs in the entire corpus. These words are the synonymous expression candidates, which is the base-line system. The noun lists from the authors' corpora are extracted based on the similarities between the target's NFV in the entire corpus and the NFVs in each authors' corpora. The most similar word in an author's corpus is accepted as a synonymous expression for the target word, and the other similar words in the author's corpus are taken to not have the same meaning as the target word, even though the features are similar. These words are then taken as the noise candidates, except for the most relevant words in each candidate list. If there are N authors, then N lists are extracted.</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3.3 Re-evaluation
</SectionTitle>
      <Paragraph position="0"> On the basis of our assumption, we propose a simple approach for re-evaluation: deleting the noise candidates in the synonymous expression candidates.</Paragraph>
      <Paragraph position="1"> However, as shown in Section 2, each author does not necessarily use only one expression for one meaning. For instance, while the call taker B in Table 1 mostly uses &amp;quot;cust&amp;quot;, he/she also uses other expressions to a considerable degree. Accordingly if we try to delete all noise candidates, such synonymous expressions will be eliminated from the final result. To avoid this kind of over-deleting, we classified words into three types, &amp;quot;Absolute Term&amp;quot;, &amp;quot;Candidate Term&amp;quot;, and &amp;quot;Noise Candidate&amp;quot;. First, we assigned the &amp;quot;Candidate Term&amp;quot; type to all of the extracted terms from the entire corpus. Second, the most relevant word extracted from each author's corpus was turned into an &amp;quot;Absolute Term&amp;quot;. Third, the words extracted from all of the authors' corpora, except for the most relevant word in each author's corpus, were turned into the &amp;quot;Noise Candidate&amp;quot; type. In this step an &amp;quot;Absolute Term&amp;quot; does not change if the word is a noise candidate. Then the words listed as &amp;quot;Absolute Term&amp;quot; or &amp;quot;Candidate Term&amp;quot; are taken as the final results of the reevaluation. null</Paragraph>
    </Section>
    <Section position="6" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.4 An Actual Example
</SectionTitle>
      <Paragraph position="0"> In this section we will show an actual example of how our system works. In this example, the target word is &amp;quot;battery&amp;quot;. First, the synonymous expression candidates are extracted from the entire corpus using the NFV of the target word in the entire corpus and the NFVs in the entire corpus. The relevant list is shown in Table 3. In this candidate list, we can find many synonymous expressions for &amp;quot;battery&amp;quot;, such as &amp;quot;batt&amp;quot;, &amp;quot;batterie&amp;quot;, etc, however we also see some noise, such as &amp;quot;cover&amp;quot;, &amp;quot;adapter&amp;quot;, etc. In this step these words are tentavely assigned as &amp;quot;Candidate Term&amp;quot;.</Paragraph>
      <Paragraph position="1"> Second, the noise candidates are extracted from each authors' corpora by estimating the similarities between the target word's NFV in the entire corpus and the NFVs in the author's corpora. The noise candidate lists from two authors are shown in Table 4. The most relevant words in each author's corpora are &amp;quot;battery&amp;quot; and &amp;quot;batt&amp;quot;, so the same words in the extracted &amp;quot;Candidate Term&amp;quot; list are turned into &amp;quot;Absolute Term&amp;quot; and remain undeleted even when &amp;quot;battery&amp;quot; and &amp;quot;batt&amp;quot; appear in the same author's corpus. The rest of the words in the noise candidate lists are noise, so the same words in the &amp;quot;Candidate Term&amp;quot; list are turned into &amp;quot;Noise Candidate&amp;quot;, such as &amp;quot;cover&amp;quot;, &amp;quot;adapter&amp;quot;, &amp;quot;cheque&amp;quot;, and &amp;quot;screw&amp;quot;. Finally, we can get the term aggregation result as a list consisting of the words marked &amp;quot;Absolute Term&amp;quot; and &amp;quot;Candidate Term&amp;quot;. The results are shown in Table 5.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="6" type="metho">
    <SectionTitle>
4 Experimental Results and Discussion
</SectionTitle>
    <Paragraph position="0"> For the evaluation, we used general evaluation metrics, precision  .To measure the system's performance, we calculated the precision and the recall for the top N significant words of the baseline system and the re-evaluated system.</Paragraph>
    <Section position="1" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.1 Estimating the Cut-off Term Rank
</SectionTitle>
      <Paragraph position="0"> In our experiments, we used the metrics of precision and recall to evaluate our method. These metrics are based on the number of synonymous expressions correctly extracted in the top N ranking. To define this cut-off term rank N for the data, we did some preliminary experiments with a small amount of data.</Paragraph>
      <Paragraph position="1"> With the simple noise deletion approach we expect to increase the precision, however, the recall is not expected to be increased by using this method.</Paragraph>
      <Paragraph position="2"> We defined the maximum top value of N as satiation. null Figure 2 shows the performance against rank N for the entire corpus. We can see the satiation point at 20 in the figure. Therefore, we set N equal to 20 in our experiments for synonymous expression extraction from the entire corpus.</Paragraph>
      <Paragraph position="3"> At the same time, we want to know the highest value of n to obtain the noise candidates. In each author's corpus a lower recall is acceptable, because we will remove these words as noise from the results of the entire corpus.</Paragraph>
      <Paragraph position="4"> These results lead to the conclusion that the window size of the rank N for the entire corpus and the  rank n for each corpus should have the same value, 20. During the evaluation, we extracted the synonymous expressions with the top 20 similarities from the entire corpus and removed the noise candidates with the top 20 similarities from each author's corpora. null</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.2 Most Relevant Word Approach
</SectionTitle>
      <Paragraph position="0"> The basic idea of this method is that one author mostly uses a unique expression to represent one meaning. According to this idea, the most similar words in each authors' corpora tend to be synonymous expression candidates. Comparing these two methods, one is a system for removing noise and the other is a system for extracting the most similar word.</Paragraph>
      <Paragraph position="1"> According to the assumption of one person mostly using one unique expression to represent one meaning, we can extract the synonymous expressions that are the most similar word to the target word in each author's corpus. In comparison with the approach using the most similar word in each author's corpus and removing the noise, we calculated the recall rates for the most similar word approach. Table 6 shows the recall rates for the system with the entire corpus, the system using the top word from three authors' corpora, five authors' corpora, and ten authors' corpora.</Paragraph>
      <Paragraph position="2"> entire 3 5 10 corpus authors authors authors  in the authors' corpora are not necessarily synonymous expressions for the target word, since some authors use other expressions in their corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.3 Noise Deletion Approach
</SectionTitle>
      <Paragraph position="0"> For evaluating the deleting noise approach, the performance against the number of authors is shown in Figure 3. We extracted the top 20 synonymous expression candidates from the entire corpus, and removed the top 20 (except for the most similar words) noise candidates from the authors' corpora.</Paragraph>
      <Paragraph position="1"> Figure 3 contains the entire corpus result, and the results after removing the noise from three authors' corpora, five authors' corpora, and ten authors' corpora. null This figure shows that the noise reduction approach leads to better precision than the basic approach, but the recall rates are slightly reduced. This is because they sometimes remove words that are not noise, when an author used several expressions for the same word. In spite of that, the F-measures are increased, showing the method improves the accuracy by 37% (when using 10 authors' corpora).</Paragraph>
      <Paragraph position="2"> In addition, the table indicates that the improvement relative to the number of authors is not yet at a maximum. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>