File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-2602_metho.xml

Size: 31,509 bytes

Last Modified: 2025-10-06 14:09:22

<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2602">
  <Title>Towards Full Automation of Lexicon Construction</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Co-clustering to define surrogate syntactic tags
</SectionTitle>
    <Paragraph position="0"> syntactic tags Many applications of text processing rely on or bene t from information regarding the parts of speech of individual terms. While part of speech is a somewhat uid notion, the computational linguistics community has converged on a handful of standard tag sets, and taggers are now available in a number of languages. Since some high-quality taggers are in the public domain, any application that could bene t from part-of-speech information should have access to it.</Paragraph>
    <Paragraph position="1"> However, using a speci c tagger and its tag set entails adopting the assumptions it embodies, which may not be appropriate for the target application. In the worst case, the domain of interest may include text in a language not covered by available taggers. Even when a tagger is available, the domain may involve usages substantially different from those in the corpus for which the tagger was developed. Many current taggers are tuned to relatively formal corpora, such as newswire, while many interesting domains, such as email, netnews, or physicians' notes, are replete with elisions, jargon, and neologisms.</Paragraph>
    <Paragraph position="2"> Fortunately, using distributional characteristics of term contexts, it is feasible to induce part-of-speech categories directly from a corpus of suf cient size, as several papers have made clear (Brown et al., 1992; Schcurrency1utze, 1993; Clark, 2000).</Paragraph>
    <Paragraph position="3"> Distributional information has uses beyond part of speech induction. For example, it is possible to augment a xed syntactic or semantic taxonomy with such information to good effect (Hearst and Schcurrency1utze, 1993). Our objective is, where possible, to work directly with the inferred syntactic categories and their underlying distributions. There are many applications of computational linguistics, particularly those involving shallow processing, such as information extraction, which can bene t from such automatically derived information, especially as research into acquisition of grammar matures (e.g., (Clark, 2001)).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Co-clustering Algorithm
</SectionTitle>
      <Paragraph position="0"> Our approach to inducing syntactic clusters is closely related to that described in Brown, et al, (1992) which is one of the earliest papers on the subject. We seek to nd a partition of the vocabulary that maximizes the mutual information between term categories and their contexts.</Paragraph>
      <Paragraph position="1"> We achieve this in the framework of information theoretic co-clustering (Dhillon et al., 2003), in which a space of entities, on the one hand, and their contexts, on the other, are alternately clustered in a way that maximizes mutual information between the two spaces. By treating the space of terms and the space of contexts as separate, we part ways with Brown, et al. This allows us to experiment with the notion of context, as well as to investigate whether pooling contexts is useful, as has been assumed.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Definitions
</SectionTitle>
      <Paragraph position="0"> Given a corpus, and some notion of term and context, we derive co-occurrence statistics. More formally, the input to our algorithm is two nite sets of symbols, say a0a2a1a4a3a6a5a8a7a10a9a11a5a13a12a14a9a16a15a17a15a17a15a18a9a19a5a21a20a23a22a25a24 and  a1a27a3a29a28a30a7a30a9a19a28a31a12a32a9a29a15a18a15a17a15a18a9a19a28a30a20a23a33a34a24 , together with a set of co-occurrence count data consisting of a non-negative integer a35a37a36a29a38a40a39a42a41 for every pair of symbols a43 a5a21a44a19a9a11a28a46a45a6a47 , that can be drawn from a0 and a26 . The output is two sets of sets: a0a49a48a50a1a51a3a6a5a13a48a7 a9a16a15a17a15a18a15a17a9a11a5a52a48a20 a22a54a53 a24 and</Paragraph>
      <Paragraph position="2"> cluster ), none of the a5 a48a44 intersect each other, the union of all the a5 a48a44 is a0 (similar remarks apply to the a28 a48a45 and a26 ). The co-clustering algorithm chooses the partitions a0 a48 and  a48 to (locally) maximize the expected mutual information between them.</Paragraph>
      <Paragraph position="3"> The multinomial parameters a56a13a36a29a39 of a joint distribution over a0 and a26 may be estimated from this co-occurrence data as a56a21a36a6a39 a1 a35a54a36a6a39a30a57a34a58 a36a30a59a39 a35a54a36a29a39 , using the naive maximum likelihood method. We follow a more fully Bayesian procedure to obtain pseudo-counts a60a54a36a6a39 that are added to the counts a35a54a36a6a39 to obtain smoothed estimates. Due to space limitations, we de ne but do not fully discuss our procedure here. We apply the Evidence method in the Dice Factory setting of MacKay and Peto (1994), to obtain a pseudo-count a60 a36a30a61 for every symbol a5a63a62a64a0 by treating each a28a4a62 a26 as a sample of (not from) a random process a65a66a43 a5a68a67a28a69a47 , in a Multinomial/Dirichlet setting. By a symmetric procedure, we also obtain pseudo-counts a60a37a61a39 for each a28a70a62 a26 . These are combined according to a60a13a36a6a39</Paragraph>
      <Paragraph position="5"> then the totals a35a54a36a6a39a66a77a79a60a13a36a29a39 are rescaled by a35a37a57a83a43a74a60a84a77a64a35 a47 , where a60a13a75 a1 a58</Paragraph>
      <Paragraph position="7"> This quanti es average improvement in one's knowledge upon learning the speci c value of an event drawn from a0 . It is large or small depending on whether a0 has many or few probable values.</Paragraph>
      <Paragraph position="8"> The mutual information between random variables a0 and a26 can be written:</Paragraph>
      <Paragraph position="10"> This quanti es the amount that one expects to learn indirectly about a0 upon learning the value of a26 , or vice versa. The following relationship holds between the information of a joint distribution and the information of the marginals and mutual information:</Paragraph>
      <Paragraph position="12"> From this we see that the expected amount one can learn upon hearing of a joint event a43 a5a37a9a11a28a69a47 is bounded by what one can learn about a0 and a26 separately. Combined with another elementary result, a103 a75a78a104a106a105a107a75a97a80a108a104a108a109 and symmetrically a103a16a80 a104a64a105 a75a102a80 a104a79a109 , we see that a joint event a43 a5a82a9a11a28a69a47 yields at least as much information as either event alone, and that one cannot learn more about an event a28 from a26 by hearing about an event a5 from a0 than one would know by hearing about a28 explicitly.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 The Algorithm
</SectionTitle>
      <Paragraph position="0"> The co-clustering algorithm seeks partitions a0 a48 of a0 and</Paragraph>
      <Paragraph position="2"> der a constraint limiting the total number of clusters in each partition. The mutual information is computed from the distributions estimated as discussed in Section 2.2, by summing a65a66a43 a5a37a9a11a28a69a47 over the elements within each cluster to obtain a65a66a43 a5 a48 a9a19a28 a48 a47 .</Paragraph>
      <Paragraph position="3"> We perform an approximate maximization of a105a110a75 a53 a80 a53 using a simulated annealing procedure in which each trial move takes a symbol a5 or a28 out of the cluster to which it is tentatively assigned and places it into another. It is straightforward to obtain a formula for the change in</Paragraph>
      <Paragraph position="5"> a53 under this operation that does not involve its complete re-computation. We use an ad hoc adaptive cooling schedule that seeks to continuously reduce the rejection rate of trial moves from an initial level near 50%, staying at each target rejection rate long enough to visit a xed fraction of the possible moves with high probability.</Paragraph>
      <Paragraph position="6"> After achieving one rejection rate target for the required number of moves, the target is lowered. The temperature is also lowered, but will be raised again to an intermediate value if the resulting rejection rate is below the next target, or lowered further if the rejection rate remains above the next target.</Paragraph>
      <Paragraph position="7"> Candidate moves are chosen by selecting a non-empty cluster uniformly at random, randomly selecting one of its members, then randomly selecting a destination cluster other than the source cluster. When temperature 0 is reached, all possible moves are repeatedly attempted until no move leads to an increase in the objective function.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Co-clustering for Term Categorization
</SectionTitle>
      <Paragraph position="0"> Applying co-clustering to the problem of part of speech induction is straightforward. We de ne a0 to be the space of terms under some tokenization of the corpus, and a26 to be the space of contexts of those terms, which are a function of the close neighborhood of occurrences from a0 .</Paragraph>
      <Paragraph position="1">  experimented with concatenations of terms, and more complex de nitions based on relative position. The results reported here are based on the simple context de nition of one term to the left and one to the right, regarded as separate events.</Paragraph>
      <Paragraph position="2"> Given a particular tokenization and method for de ning context, we can derive input for the co-clustering algorithm. Sparse co-occurrence tables are created for each term of interest; each entry in such a table records a context identi er and the number of times the corresponding context occurred with the reference term. For expediency, and to avoid problems with sparse statistics, we retain only the most frequent terms and contexts. (We chose the top 5000 of each.) In Section 3.1, we show that we can overcome this limitation through subsequent processing.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5 Experimental details and results
</SectionTitle>
      <Paragraph position="0"> We conducted experiments with the Reuters-21578 corpus a relatively tiny one for such experiments.</Paragraph>
      <Paragraph position="1"> Clark (2000) reports results on a corpus containing 12 million terms, Schcurrency1utze (1993) on one containing 25 million terms, and Brown, et al, (1992) on one containing 365 million terms. In contrast, we count approximately</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.8 million terms in Reuters-21578.
</SectionTitle>
      <Paragraph position="0"> Only the bodies of articles in the corpus were considered. Each such article was segmented into paragraphs, but not sentences. Paragraphs were then converted into token arrays, with each token corresponding to one of the following: an unbroken string of alphabetic characters or hyphens, possibly terminated by an apostrophe and additional alphabetic characters; a numeric expression; or a single occurrence and unit of punctuation presumed to be syntactically signi cant (e.g., periods, commas, and question marks). Alphabetic tokens were casenormalized, and all numeric expressions were replaced with the special token &lt;num&gt;. For the purposes of constructing context distributions, special contexts (&lt;bop&gt; and &lt;eop&gt;) were inserted at the beginnings and endings of each such array.</Paragraph>
      <Paragraph position="1"> We applied the co-clustering algorithm to the most frequent 5000 terms and most frequent 5000 contexts in the corpus, clustering each into 200 categories.</Paragraph>
      <Paragraph position="2"> Co-clustering alternately clustering terms and contexts is faster than simple clustering against the full set of contexts. Table 1 presents computation times for experiments with one grounding space on the same  corpus. Clusters are ordered according to their impact on mutual information, least to greatest ascending. Within each cluster, terms are ordered most frequent to least.</Paragraph>
      <Paragraph position="3"> machine under similar loads. While the exact time to completion is a function of particularities such as machine speed, cluster count, and annealing schedule, the relative durations (co-clustering nishes in 1/6 the time) are representative. This may be counter-intuitive, since co-clustering involves two parallel clustering runs, instead of a single one. However, the savings in the time it takes to compute the objective function (in this case, mutual information with 200 contexts, instead of 5000) typically more than compensates for the additional algorithmic steps.</Paragraph>
      <Paragraph position="4"> Table 2 lists clusters that illustrate both strengths and weaknesses of our approach. While many of the clusters correspond unambiguously to some part of speech, we can identify four phenomena that sometimes prevent the clusters from corresponding to unique part-of-speech tags: 1. Lack of distributional evidence. In several cases, the grounding space chosen provides no evidence for a distinction made by the tagger. Examples of this are cluster 199, where the is equated with the possessive form of many nouns; cluster 145, where present tense and past tense verbs are both represented; and cluster 96, where personal pronouns are equated with surnames.1 2. Predominant idioms and contexts. If a term is used predominantly in a particular idiom, then the context supplied by that idiom may have the strongest in uence on its cluster assignment, occa1Far from a bad thing, however, this last identi cation suggests some avenues for research in unsupervised pronominal reference resolution.</Paragraph>
      <Paragraph position="5"> sionally leading to counter-intuitive clusters. An obvious example of this is cluster 71. All of the terms in this cluster are typically followed by the context  of.</Paragraph>
      <Paragraph position="6"> 3. Lexical ambiguity. If a term has two or more frequent syntactic categories, the algorithm assigns it (in the best case) to a cluster corresponding to its more frequent sense, or (in the worst case) to a junk or singleton cluster. This happens with the word may (cluster 37, above) in all our experiments. null 4. Multi-token lexemes. In order to tally context dis- null tributions, we must commit to an initial xed segmentation of the corpus. While English orthography insures that this is not dif cult, there exist nevertheless xed collocations (commonly called multi-word units, MWUs), such as New York, which inject statistical noise under the default segmentation.</Paragraph>
      <Paragraph position="7"> Of these four problems, the last two are probably more serious, since they give rise to specious distinctions. Depending on the application, problems 1 and 2 may not be problems at all. In this corpus, for example, the term regarding (cluster 180) may never be used in any but a quasi-prepositional sense. And proper nouns in the possessive arguably do share a syntactic function with the.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Refinements
</SectionTitle>
    <Paragraph position="0"> Lexical categorizations, such as those provided by a part of speech tagger or a semantic resource like Wordnet, are usually a means to an end, almost never applications in their own right. While it is interesting to measure how faithfully an unsupervised algorithm can reconstruct prior categories, we neither expect to achieve anything like perfect performance on this task, nor believe that it is necessary to do so. In fact, adherence to a speci c tag set can be seen as an impediment, inasmuch as it introduces brittleness and susceptibility to noise in categorization.</Paragraph>
    <Paragraph position="1"> It is nevertheless interesting to ignore the confounding factors enumerated in Section 2.5 and measure the agreement between term categories induced by co-clustering and the tags assigned by a tagger. Using the tagger from The XTag Project (Project, 2003), we measured the agreement between our clusters and the tagger output over the terms used in clustering. We found that the clusters captured 85% of the information in the tagged text (the tagged data had an entropy of 2.68, while mutual information between clusters and tags is 2.23). In a theoretical optimal classi er, this yields a ninefold increase in accuracy over the default rule of always choosing the most frequent tag.</Paragraph>
    <Paragraph position="2"> In order to make our distributional lexicon useful, however, we need to extend its reach beyond the few thousand most frequent terms, on the one hand, and adjust for phenomena that lead to sub-optimal performance, on the other. We call the process of expanding and adjusting the lexicon after its initial creation lexicon optimization.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Increasing Lexicon Coverage
</SectionTitle>
      <Paragraph position="0"> For tractability, the initial classes are induced using only the most frequent terms in a corpus. (While we cluster using only the 5000 most frequent terms, the corpus contains approximately 41,000 distinct word-forms.) This yields consistent results and broad coverage of the corpus, but leaves us unable to categorize about 5% of tokens. Clearly, in order for our automatically constructed resource to be useful, we must introduce these uncovered terms into the lexicon, or better still, nd a way to apply it to individual novel tokens.</Paragraph>
      <Paragraph position="1">  In light of the current state of the art in part of speech tagging, the occurrence of these unknown terms does not pose a signi cant problem. It has been known for some years that good performance can be realized with partial tagging and a hidden Markov model (Cutting et al., 1992). Note that the notion of partial tagging described in Cutting, et al, is essentially different from what we consider here. Whereas they assume a lexicon which, for every term in the vocabulary, lists its possible parts of speech, we construct a lexicon which imposes a single sense (or a few senses; see Section 3.2) on each of the few thousand most frequent terms, but provides no information about other terms.</Paragraph>
      <Paragraph position="2"> As in Cutting, et al, however, we can use Baum-Welch re-estimation to extract information from novel terms, and apply the Viterbi algorithm to dispose of a particular occurrence. While the literature suggests that Baum-Welch training can degrade performance on the tagging task (Elworthy, 1994; Merialdo, 1994), we have found in early experiments that agreement between a tagger trained in this way and the tagger from the XTag Project consistently increases with each iteration of Baum-Welch, eventually reaching a plateau, but not decreasing. We attribute this discrepancy to the different structure of our problem.</Paragraph>
      <Paragraph position="3">  Note that a HMM is under no constraint to handle a given term in a consistent fashion. A single model can and often does assign a single term to multiple classes, even in a single document. When a term is suf ciently frequent, a more robust approach may be to assign it to a category using only its summary co-occurrence statistics. The idea is straightforward: Create an entry in the lexicon for the novel term and measure the change in mutual information associated with assigning it to each of the Term Freq. Cluster Example Terms weizsaecker 30 baker morgan shearson provoke 20 take buy make glut 10 price level volume councils 5 prices markets operations stockbuilding 3 earnings income profits ammonia 2 energy computer petroleum unwise 2 expected likely scheduled  tual information objective function. Each row shows a term not present in the initial clustering, its corpus frequency, and example terms from the cluster to which it is assigned.</Paragraph>
      <Paragraph position="4"> available categories. Assign it to the category for which this change is maximized.</Paragraph>
      <Paragraph position="5"> As Table 3 demonstrates, this procedure works surprisingly well, even for words with low corpus frequencies. Of course, as frequencies are reduced, the likelihood of making a sub-optimal assignment increases. At some point, the decision is better made on an individual basis, by a classi er trained to account for the larger context in which a novel term occurs, such as an HMM. We are currently investigating how to strike this trade-off, in a way that best exploits the two available techniques for accommodating novel tokens.</Paragraph>
      <Paragraph position="6"> Lexical ambiguity (or polysemy) and xed collocations (multi-word units) are two phenomena which clearly lead to sub-optimal clusters. We have achieved promising results resolving these problems while remaining within the co-clustering framework. The basic idea is as follows: If by treating a term as two distinct lexemes (or, respectively, a pair of commonly adjacent terms as a distinct lexeme), we can realize an increase in mutual information, then the term is lexically ambiguous (respectively, a xed collocation). In the case of polysemy resolution, this involves factoring the context distribution into two or more clusters. In the case of a xed collocation, we consider the effect of treating an n-gram as a lexical unit.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Polysemy Resolution
</SectionTitle>
      <Paragraph position="0"> To determine whether a term is polysemous we must determine whether the lexicon's mutual information can be increased by treating the term as two distinct lexemes.</Paragraph>
      <Paragraph position="1"> Given a particular term, we make this determination by attempting to factor its context distribution into every possible pair of distinct clusters.2 Faced with a candidate pair, we posit two senses of the target term, one in each  representative terms. The third column lists sample terms from the two clusters into which each term is divided.</Paragraph>
      <Paragraph position="2"> cluster. The probability mass associated with each event type in the term's context distribution is then assigned to one or the other hypothetical sense, always to the one that improves mutual information the most (or hurts it the least). Once the probability mass of the original term has been re-apportioned in this way, the resulting change in mutual information re ects the quality of the hypothetical sense division. The maximum change in mutual information over all such cluster pairs is then taken to be the polysemy score for the target term.</Paragraph>
      <Paragraph position="3"> Table 4 shows how this procedure handles selected terms from the Reuters corpus. Positive changes in mutual information clearly correspond to polysemy in the target term. In the Reuters corpus, there are a fair number of terms that have a noun and a verb sense, such as act and vote in the table. Note, too, the result of polysemy resolution run on unambiguous terms either a nonsensical division, as with france, or division into two closely related clusters, in both cases, however, with a decrease in mutual information.</Paragraph>
      <Paragraph position="4"> Note that the problem of lexical ambiguity has been studied elsewhere. Schcurrency1utze (1993; 1995) proposes two distinct methods by which ambiguity may be resolved. In one paper, a separate model (a neural network) is trained on the results of clustering in order to classify individual term occurrences. In the other, the individual occurrences of a term are tagged according to the distributional properties of their neighbors. Clark (2000) presents a framework which in principle should accommodate lexical ambiguity using mixtures, but includes no evidence that it does so. Furthermore, a mixture distribution species the proportion of occurrences of a term that should be tagged one way or another, but does not prescribe what to do with every individual event. In contrast to the above approaches, we derive a lexicon which succinctly lists the possible syntactic senses for a term and provides a means to disambiguate the sense of a single occurrence. More-Phrase Example Cluster Terms cubic feet francs barrels ounces hong kong london tokyo texas pointed out added noted disclosed los angeles london tokyo texas merrill lynch texaco chrysler ibm we don't we i you saudi arabia japan canada brazil morgan stanley texaco chrysler ibm managing director president chairman smith barney underwriters consumers  units in Reuters, along with example terms from the cluster to which each was assigned.</Paragraph>
      <Paragraph position="5"> over, a shortcoming of occurrence-based methods of polysemy resolution is that a given term may be assigned to an implausibly large number of categories. By analyzing this behavior at the type level, rather than the token level, we not only can exploit the corpus-wide behavior of a term, but we can enforce the linguistically defensible constraint that it have only a few senses.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Multi-Word Units
</SectionTitle>
      <Paragraph position="0"> In English, orthography provides a convenient clue to textual word segmentation. Doing little more than breaking the text on whitespace boundaries, it is possible to perform a linguistically meaningful statistical analysis of a corpus. Multi-word units (MWUs) are the exception to this rule. Treating terms such as York terms which in a particular corpus may not be meaningful in isolation gives rise to highly idiosyncratic context distributions, which in turn add noise to cluster statistics or lead to the production of junk clusters.</Paragraph>
      <Paragraph position="1"> In order to recognize such cases, we apply a variant of our by now familiar lexicon optimization rule: We posit a lexical entry for a given candidate MWU, nd the cluster to which it is best suited, and ask whether creating the lexeme improves the situation. In principle, we can conduct this process in the same way as with novel terms and polysemy. Here, however, we report the results of a simple surrogate technique. After assembling the context distribution of the candidate MWU (an n-gram), we compute the Hellinger distance between this distribution and that of each cluster. The Hellinger distance between two distributions a65 and a112 is de ned as:</Paragraph>
      <Paragraph position="3"> The candidate MWU is then tentatively assigned to the cluster for which this quantity is minimized and its distance to this cluster is noted (call this distance  Wordnet in each MWU score band.</Paragraph>
      <Paragraph position="4"> a113a123a122a16a124a42a125a40a126a42a127 ). We then compute the distance between each of the n-gram's constituent terms and its respective cluster (a113a129a128a131a130a132a125a40a127 a71a68a133a16a133a16a133 a113a129a128a131a130a132a125a40a127a135a134 ). The MWU score is the difference between the maximum term distance and the n-gram distance, or a136a66a137a46a138</Paragraph>
      <Paragraph position="6"> score of a candidate MWU increases with its closeness of t to its cluster and the lack of t of its constituent terms.</Paragraph>
      <Paragraph position="7"> Table 5 shows the ten bi-grams that score highest using this heuristic. Note that they come from a number of syntactic categories. In this list, the only error is the phrase we don't, which is determined to be syntactically substitutable for pronouns. Note, however, that this is the only collocation in this list consisting entirely of closed-class terms. To the extent that we can recognize such terms, it is easy to rule out such cases.</Paragraph>
      <Paragraph position="8"> Table 6 benchmarks this technique against Wordnet.</Paragraph>
      <Paragraph position="9"> Breaking the range of MWU scores into bands, we ask what fraction of n-grams in each band can be found in Wordnet. The result is a monotonic decrease in Wordnet representation. Investigating further, we nd that almost all of the missing n-grams that score high are absent because they are corpus-speci c concepts, such as Morgan Stanley and Smith Barney. On the other end, we nd that low-scoring n-grams present in Wordnet are typically included for reasons other than their ability to serve as independent lexemes. For example, on that appears to have been included in Wordnet because it is a synonym for thereon.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Directions
</SectionTitle>
      <Paragraph position="0"> We have begun research into characterizing more precisely the grammatical roles of the clusters found by our methods, with an eye to identifying the lowest-level expansions in the grammar responsible for the text. Inasmuch as information extraction can rely on shallow methods, the ability to produce a shallow parser without supervision should enable rapid creation of information extraction systems for new subject domains and languages.</Paragraph>
      <Paragraph position="1"> We have had some success distinguishing open-class clusters from closed-class clusters, on the one hand, and head clusters from modi er clusters, on the other.</Paragraph>
      <Paragraph position="2">  among the 5000 most frequent terms, using the -1 a140 +1 grounding space. In general, closed-class terms have higher entropy.</Paragraph>
      <Paragraph position="3"> Schone and Jurafsky (2001) list several universal characteristics of language that can serve as clues in this process, some of which we exploit. However, their use of perfect clusters renders some of their algorithmic suggestions problematic. For example, they propose using the tendency of a cluster to admit new members as an indication that it contains closed-class (or function) terms. While we do nd large clusters corresponding to open classes and small clusters to closed classes, the separation is not always clean (e.g., cluster 199 in Table 2). Small clusters often contain open-class terms with predominant corpus-speci c idiomatic usages. For example, Reuters-21578 has special usages for the terms note, net, and pay, in additional to their usual usages.</Paragraph>
      <Paragraph position="4"> While the size of its cluster is a useful clue to the open- or closed-class status of a term, we are forced to search for other sources of evidence. Once such indicator is the entropy of the term's context distribution. Table 7 lists the ve most and least entropic among the 5000 most frequent terms in Reuters-21578. Function terms have higher entropy not only because they are more frequent than non-function terms, but also because a function term must participate syntactically with a wide variety of content-carrying terms. While entropy alone also does not yield a clean separation between function and content terms, it may be possible to use it in combination with the suggestion of Schone and Jurafsky to produce a reliable separation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>