
<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3221">
  <Title>Attribute-Based and Value-Based Clustering: An Evaluation</Title>
  <Section position="3" start_page="0" end_page="3" type="intro">
    <SectionTitle>
2 Methods
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Using Text Patterns to Build Concept
Descriptions
</SectionTitle>
      <Paragraph position="0"> Our techniques for extracting concept descriptions are simpler than those used in other work in at least two respects. First of all, we only extracted values expressed as nominal modifiers, ignoring properties expressed by verbal constructions in which the concept occurred as an argument (e.g., Lin's (dog obj-of have)).</Paragraph>
      <Paragraph position="1"> (We originally made this simplification to concentrate on the comparison between attributes and values (many verbal relations express more complex properties), but found that the resulting descriptions were still adequate for clustering.) Secondly, our data were not parsed or POS-tagged prior to extracting concept properties; our patterns are word-based. Full parsing is essential when complete descriptions are built (see below) and allows the specification of much more general patterns (e.g., matching descriptions modified in a variety of ways, see below), but is computationally much more expensive, particularly when Web data are used, as done here. We also found that when using the Web, simple text patterns not requiring parsing or POS tagging were sufficient to extract large numbers of instances of properties with a good degree of precision.</Paragraph>
      <Paragraph position="2"> Our methods for extracting 'values' are analogous to those used in the previous literature, apart from the two simplifications just mentioned: i.e., we just consider every nominal modifier as expressing a potential property. The pattern we use to extract values is as follows:</Paragraph>
      <Paragraph position="4"> where C is a concept, and the wildcard (*) stands for an unspecified value. The restriction to instances containing is or was to ensure that the C actually stands for a concept (i.e., avoiding modifiers) proved adequate to ensure precision. An example of text matching this pattern is: * ... an inexpensive car is ...</Paragraph>
      <Paragraph position="5"> The pattern we use for extracting concept attributes is based on linguistic tests for attributes already discussed, e.g., in (Woods, 1975).</Paragraph>
      <Paragraph position="6"> According to Woods, A is an attribute of C if we can say [V is a/the A of C]: e.g., brown is a color of dogs. If no V can be found which is a value of A, then A can not be an attribute for the concept C. This test only selects attributes that have values, and is designed to exclude other functions defined over concepts, such as parts. But some of these functions can be (and have been) viewed as defining attributes of concepts as well; so for the moment we used more general patterns identifying all relational nouns taking a particular concept as arguments. (We return on the issue of the characterization of attributes below.) Our pattern for attributes is shown below: * &amp;quot;the * of the C [is|was]&amp;quot; where again C is a concept, but the wildcard denotes an unspecified attribute. Again, is/was is used to increase precision. An example of text matching this pattern is: * ... the price of the car was ...</Paragraph>
      <Paragraph position="7"> Both of the patterns we use satisfy Hearst's desiderata for good patterns (Hearst, 1998): they are (i) frequent, (ii) precise, and (iii) easy to recognize. Patterns similar to our attribute pattern were used by Berland and Charniak (1999) and Poesio et al (2002) to find object parts only; after collecting their data, Berland and Charniak filtered out words ending with &amp;quot;ness&amp;quot;, &amp;quot;ing&amp;quot;, and &amp;quot;ity&amp;quot;, because these express qualities of objects, and used a ranking method to rank the remaining words.</Paragraph>
      <Paragraph position="8"> (An accuracy of 55% for the top 50 proposed parts was reported.) We found that these patterns can be used to collect other sorts of 'attributes', as well.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
2.2 Web Data Collection through Google
</SectionTitle>
      <Paragraph position="0"> In recent years there has been growing evidence that using the Web as a corpus greatly reduces the problem of data sparseness, and its size more than compensates the lack of balance (e.g., (Keller and Lapata, 2003)). The benefits from using the Web over even large corpora like the BNC for extracting semantic relations, particularly when using simple text patterns, were informally pointed out in (Poesio, 2003) and demonstrated more systematically by Markert et al (submitted). These findings were confirmed by our experiments. A comparison of numbers of instances of some patterns using the Web and the BNC is shown in  patterns in BNC and the Web. Web frequency is based on Google counts We collect our data from the Web using the Google search engine, accessed via the freely available Google Web API  . The API only allows to retrieve the first 1,000 results per search request; to overcome this restriction, we use the daterage feature of the Google search request. This feature allows the user to fragment the search space into a number of periods, hence retrieving only pages that have been updated during a specified period. In the two experiments presented here, we aimed to collect up to 10,000 matches per search request using the daterage feature: we divided the search space into 100 days starting from January, 1990 until mid 2004. (The procedure we used does not guarantee collecting all the instances in the accessed periods, because if there are more than  are two strings and the wildcard denotes an unspecified single word. For example, the search request &amp;quot;a * car is&amp;quot; catches instances such as: [a red car is], [a small car is], and [a sport car is]. It is worth mentioning that Google does not pay attention to punctuation marks; this is one area in which parsing would help.</Paragraph>
      <Paragraph position="1"> When receiving results from Google, we do not access the actual Web pages, but instead we process the snippets that are returned by Google.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
2.3 Clustering Methods
</SectionTitle>
      <Paragraph position="0"> The task that we use to compare concept descriptions is lexical acquisition via clustering. We experimented with clustering systems such as COBWEB (Fisher, 1987) and SUBDUE (Cook and Holder, 2000) before settling on CLUTO 2.1 (Karypis, 2002). CLUTO is a general-purpose clustering tool that implements three different clustering algorithms: partitional, agglomerative, and graph partitioning algorithms. CLUTO produces both flat and hierarchical clusters. It uses a hard clustering technique, where each concept can be assigned to only one cluster. The software allows to choose a similarity metric between a set including extended Jaccard and cosine. CLUTO was optimized to cluster data of large sizes in a reasonable time. The software also provides analysis and visualization tools.</Paragraph>
      <Paragraph position="1"> In this paper, we use extended Jaccard, which was found to produce more accurate results than the cosine function in similar tasks (Karypis, 2002; Curran and Moens, 2003). In CLUTO, the extended Jaccard function works only with the graph partitioning algorithm.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
2.4 Evaluation Measures
</SectionTitle>
      <Paragraph position="0"> We used two types of measures to evaluate the clusters produced by CLUTO using the concept descriptions discussed above, both of which compare the clusters produced by the system to model clusters. Accuracy is computed by dividing the number of correctly clustered concepts by the total number of concepts. The number of correctly clustered concepts is determined by examining  Also, registered users of the API can send up to 1,000 requests per day, but our daily limit was increased by Google to 20,000 requests per day.</Paragraph>
      <Paragraph position="1">  Snippets are text excerpts captured from the actual web pages with embedded HTML tags. We process the snippets by removing the HTML tags and extracting the targeted piece of text that was specified in the request. each system cluster, finding the class of each concept in the model clusters, and determining the majority class. The cluster is then labeled with this class; the concepts belonging to it are taken to be correctly clustered, whereas the remaining concepts are judged to be incorrectly clustered. In the contingency table evaluation (Swets, 1969; Hatzivassiloglou and McKeown, 1993), the clusters are converted into two lists (one for the system clusters and one for the model clusters) of yes-no answers to the question &amp;quot;Does the pair of concepts occur in the same cluster?&amp;quot; for each pair of concepts. A contingency table is then built, from which recall (R), precision (P), fallout, and F measures can be computed. For example, if the model clusters are: (A, B, C) and (D), and the system clusters are: (A, B) and (C, D), the yes-no lists are as in Table 2, and the contingency table is as in Table 3.</Paragraph>
      <Paragraph position="3"/>
    </Section>
  </Section>
class="xml-element"></Paper>