<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1003">
  <Title>A Preliminary Study of Word Clustering Based on Syntactic Behavior</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Headwords and Dependencies
</SectionTitle>
    <Paragraph position="0"> The data we extract are based on the concept of headwords. Such headwords are chosen for every constituent in the parse tree by means of a simple set of rules. These have been used in various studies in this field, see (Collins, 1996, Magerman, 1995, Jelinek et el., 1994). Every headword is propagated up through the tree such that every parent receives a headword from the head-child. Figure 1 gives an example of a parse tree with headwords.</Paragraph>
    <Paragraph position="1"> Following the techniques suggested by (Collins, 1996), a parse tree can subsequently be described as a set of dependencies. Every word except the head-word of the sentence depends on the lowest head-word it is covered by. The syntactic relation is then given by the triple of nonterminals: the modifying nonterminal, the modified nonterminal, and the non-terminal that covers the joint phrase. Table 1 gives an example of such a description.</Paragraph>
    <Paragraph position="2"> On one point our method is different from the method suggested by Collins. Collins uses a reduced sentence in which every basic noun phrase (i.e., a noun phrase that has no noun phrase as a child) is reduced to its headword. The reason for this is that it improves co-occurrence counts and adjacency statistics. We however do not reduce the sentence Hogenhout ~ Matsumoto 16 Word Clustering from Syntactic Behavior Wide R. Hogenhout and Yuji Matsumoto (1997) Word Clustering from Syntactic Behavior.  (ed.) CoNLL97: Computational Natural Language Learning, ACL pp 16-24. (~) 1997 Association for Computational Linguistics  In T.M. EUison since we do. not need to consider adjacency statistics or unresolved ambiguities, and therefore never face the problem that a word in a basic noun phrase, that is not the headword, is adjacent to or modifies something outside of the basic noun phrase.</Paragraph>
    <Paragraph position="3"> Table 1 gives the relations for one sentence, but instead of considering one sentence we collect such patterns for the whole corpus and study statistics for individual words. In this way it can be discovered that, for example, a particular verb is often used transitively, or that a particular preposition is mostly used to produce locative prepositional phrases. Words can be distinct or similar in this respect, but note that this is not related to semantic similarity. Words such as eat and drink have a semantic similarity, but may be completely different in their syntactic behavior, whereas tend and appear do not have an obvious semantic relation, but they do have a similarity since they can both be used as raising verbs, as will be exemplified later.</Paragraph>
    <Paragraph position="4"> Throughout this paper we will use the term &amp;quot;word&amp;quot; to refer to words for which the part of speech has already been disambiguated. In tables and figures we emphasize this by indicating the part of speech together with the word.</Paragraph>
    <Paragraph position="5">  The next step we take, is eliminating one of the two words in this table of dependencies. Consider tables 2 and 3. These show we can take three &amp;quot;observations&amp;quot; from the sentence by eliminating either the headword or the dependent word. If headwords are eliminated we obtain three observations, for the words John, Smith and fast. If dependent words are eliminated we also obtain three observations, two for works and one for Smith.</Paragraph>
    <Paragraph position="6"> By collecting the observations over the entire corpus we can see to/by what sort of words and with what kind of relations a word modifies or is modified. We consider the following distributions: p(R, talwdt~) (1) p(R, tdlWhth) (2) where R indicates the triple representing the syntactic relation, Wd a dependent word that modifies headword Wh, and td and th their respective part of speech tags. For example, in the second line of table 3, which corresponds to distribution 1, R is (NP,S,VP), th is &amp;quot;verb&amp;quot;, wd is &amp;quot;Smith&amp;quot; and td is &amp;quot;proper noun&amp;quot;.</Paragraph>
    <Paragraph position="7"> Statistics of the distributions (1) and (2) can easily be taken from a treebank. We took such data from the Wall Street Journal Treebank, calculating the probabilities with the Maximum Likelihood Estimator: null f(R, th, Wdtd) p(R, th\]Wdtd) = ER',t' f(R',t', Wdtd) where f stands for frequency. Note that we only extract the dependency relations, and ignore the structure of the sentence beyond these relations. This shows the equation for distribution (1), distribution (2) is calculated likewise.</Paragraph>
    <Paragraph position="8"> Compare the dependency behavior of the proper nouns Nippon and Rep. in table 4. The word Nippon is Japanese for Japan, and mainly occurs in names of companies. The word Rep. is the abbreviation of Republic, and obviously occurs mainly in names of countries. As can be seen, the word Rep. occurs far more frequently, but the distributions are highly similar. Both always modify another proper noun, about 33% of the time forming an NP-SBJ and 67% of the time an NP. Both are a particular kind of proper noun that almost always modifies other proper nouns and almost never appears by itself.</Paragraph>
    <Paragraph position="9"> It also became clear that the noun company is very different from a noun such as hostage, since company often is the subject of a verb, while hostage is rarely in the subject position. Both are also very different from the noun year, which is frequently used as the object of a preposition.</Paragraph>
    <Paragraph position="10"> The gerund including has an extremely strong tendency to produce prepositional phrases, as in &amp;quot;Safety advocates, including some members of Congress,...&amp;quot;, making it different from most other gerunds. A past tense such as fell has an unusual high frequency as the head of a sentence rather than a verb phrase, which is probably a peculiarity of the Wall Street Journal ( &amp;quot;Stock prices fell... &amp;quot;).</Paragraph>
    <Paragraph position="11"> Our observation is that among words which have the same part of speech, some word groups exhibit behavior that is extremely similar, while others display large differences. The method we suggest aims Hogenhout E4 Matsumoto 17 Word Clustering from Syntactic Behavior at making a clustering based on such behavior. By using this technique any number of clusters can be obtained, sometimes far beyond what humans can be expected to recognize as distinct categories.</Paragraph>
    <Paragraph position="13"> Clustering of words based on syntactic behavior has to our knowledge not been carried out before, but clustering has been applied with the goal of obtaining classes based on co-occurrences. Such clusters were used in particular for interpolated n-gram language models.</Paragraph>
    <Paragraph position="14"> By looking at co-occurrences it is possible to find groups of words such as \[director, chief, professor, commissioner, commander, superintendent\]. The most prominent method for discovering their similarity is by finding words that tend to co-occur with these words. In this case they may for example co-occur with words such as decide and lecture.</Paragraph>
    <Paragraph position="15"> The group of verbs \[tend, plan, continue, want, need, seem, appear\] also share a similarity, but one has to look at structures rather than meaning or co-occurrences to see why. All these verbs tend to occur in the same kind of structures, as can be seen in the following examples from the Wall Street Journal.</Paragraph>
    <Paragraph position="16"> The funds' share prices tend to swing more than the broader market.</Paragraph>
    <Paragraph position="17"> Investors continue to pour cash into money funds.</Paragraph>
    <Paragraph position="18"> Cray Research did not want to fund a project that did not include Seymour.</Paragraph>
    <Paragraph position="19"> No one has worked out the players' average age, but most appear to be in their late 30s.</Paragraph>
    <Paragraph position="20"> What these verbs share is the property that they often modify an entire clause (marked as 'S' in the Wall Street Journal Treebank) rather than noun phrases or prepositional phrases, usually forming a subject raising construction. This is only a tendency, since all of them can be used in a different way as well, but the tendency is strong enough to make their usage quite similar. Co-occurrence based clustering ignores the structure in which the word occurs, and would therefore not be the right method to find related similarities.</Paragraph>
    <Paragraph position="21"> As mentioned, co-occurrence based clustering methods often also aim at producing semantically meaningful clusters. Various methods are based on Mutual Information between classes, see (Brown et al., 1992, McMahon and Smith, 1996, Kneser and Ney, 1993, Jardino and Adda, 1993, Martin, Liermann, and Ney, 1995, Ueberla, 1995). This measure cannot be applied in our case since we look at structure and ignore other words, and consequently algorithms using that measure cannot be applied to the problem we deal with.</Paragraph>
    <Paragraph position="22"> The mentioned studies use word-clusters for interpolated n-gram language models. Another application of hard clustering methods (in particular bottom-up variants) is that they can also produce a binary tree, which can be used for decision-tree based systems such as the SPATTER parser (Magerman, 1995) or the ATR Decision-Tree Part-Of-Speech Tagger (Black et al., 1992, Ushioda, 1996). Hogenhout ~ Matsumoto 18 Word Clustering from Syntactic Behavior In this case a decision tree contains binary questions to decide the properties of a word.</Paragraph>
    <Paragraph position="23"> We present a hard clustering algorithm, in the sense that every word belongs to exactly one cluster (or is one leaf in the binary word-tree of a particular part of speech). Besides hard algorithms there have also been studies to soft clustering (Pereira, Tishby, and Lee, 1993, Dagan, Pereira, and Lee, 1994) where the distribution of every word is smoothed with the nearest k words rather than placed in a class which supposedly has a uniform behavior. In fact, in (Dagan, Markus, and Markovitch, 1993) it was argued that reduction to a relatively small number of pre-determined word classes or clusters may lead to substantial loss of information. On the other hand, when using soft clusteringit is not possible to give a yes/no answer about class membership, and binary word trees cannot be constructed.</Paragraph>
  </Section>
class="xml-element"></Paper>