<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-3002">
  <Title>Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering</Title>
  <Section position="5" start_page="7" end_page="9" type="metho">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"> The method employed here follows the coarse methodology as described in the introduction, but differs from other works in several respects.</Paragraph>
    <Paragraph position="1"> Although we use 4-word context windows and the top frequency words as features (as in Schutze 1995), we transform the cosine similarity values between the vectors of our target words into a graph representation.</Paragraph>
    <Paragraph position="2"> Additionally, we provide a methdology to identify and incorporate POS-ambiguous words as well as low-frequency words into the lexicon.</Paragraph>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
2.1 The Graph-Based View
</SectionTitle>
      <Paragraph position="0"> Let us consider a weighted, undirected graph</Paragraph>
      <Paragraph position="2"> ). Vertices represent entities (here: words); the weight of an edge between two vertices indicates their similarity.</Paragraph>
      <Paragraph position="3"> As the data here is collected in feature vectors, the question arises why it should be transformed into a graph representation. The reason is, that graph-clustering algorithms such as e.g. (van Dongen, 2000; Biemann 2006), find the number of clusters automatically  . Further, outliers are handled naturally in that framework, as they are represented as singleton nodes (without edges) and can be excluded from the clustering. A threshold s on similarity serves as a parameter to influence the number of non-singleton nodes in the resulting graph.</Paragraph>
      <Paragraph position="4"> For assigning classes, we use the Chinese Whispers (CW) graph-clustering algorithm, which has been proven useful in NLP applications as described in (Biemann 2006). It is time-linear with respect to the number of edges, making its application viable even for graphs with several million nodes and edges. Further, CW is parameter-free, operates locally and results in a partitioning of the graph, excluding singletons (i.e. nodes without edges).</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="9" type="sub_section">
      <SectionTitle>
2.2 Obtaining the lexicon
</SectionTitle>
      <Paragraph position="0"> Partitioning 1: High and medium frequency words Four steps are executed in order to obtain  partitioning 1: 1. Determine 200 feature and 10.000 target words from frequency counts 2. construct graph from context statistics 3. Apply CW on graph.</Paragraph>
      <Paragraph position="1"> 4. Add the feature words not present in the  partitioning as one-member clusters. The graph construction in step 2 is conducted by adding an edge between two words a and b  This is not an exclusive characteristic for graph clustering algorithms. However, the graph model deals with that naturally while other models usually build some meta-mechanism on top for determining the optimal number of clusters.</Paragraph>
      <Paragraph position="2">  with weight w=1/(1-cos(a,b)), if w exceeds a similarity threshold s. The latter influences the number of words that actually end up in the graph and get clustered. It might be desired to cluster fewer words with higher confidence as opposed to running in the danger of joining two unrelated clusters because of too many ambiguous words that connect them.</Paragraph>
      <Paragraph position="3"> After step 3, we already have a partition of a subset of our target words. The distinctions are normally more fine-grained than existing tag sets.</Paragraph>
      <Paragraph position="4"> As feature words form the bulk of tokens in corpora, it is clearly desired to make sure that they appear in the final partitioning, although they might form word classes of their own  . This is done in step 4. We argue that assigning separate word classes for high frequency words is a more robust choice then trying to disambiguate them while tagging.</Paragraph>
      <Paragraph position="5"> Lexicon size for partitioning 1 is limited by the computational complexity of step 2, which is time-quadratic in the number of target words. For adding words with lower frequencies, we pursue another strategy.</Paragraph>
      <Paragraph position="6">  words As noted in (Dunning, 1993), log-likelihood statistics are able to capture word bi-gram regularities. Given a word, its neighbouring co-occurrences as ranked by the log-likelihood reflect the typical immediate contexts of the word. Regarding the highest ranked neighbours as the profile of the word, it is possible to assign similarity scores between two words A and B according to how many neighbours they share, i.e. to what extent the profiles of A and B overlap. This directly induces a graph, which can be again clustered by CW.</Paragraph>
      <Paragraph position="7"> This procedure is parametrised by a log-likelihood threshold and the minimum number of left and right neighbours A and B share in order to draw an edge between them in the resulting graph. For experiments, we chose a minimum log-likelihood of 3.84 (corresponding to statistical dependence on 5% level), and at least four shared neighbours of A and B on each side. Only words with a frequency rank higher than 2,000 are taken into account. Again, we obtain several hundred clusters, mostly of open word classes. For computing partitioning 2, an efficient algorithm like CW is crucial: the graphs  This might even be desired, e.g. for English not. as used for the experiments consisted of  nodes/edges.</Paragraph>
      <Paragraph position="8"> The procedure to construct the graphs is faster than the method used for partitioning 1, as only words that share at least one neighbour have to be compared and therefore can handle more words with reasonable computing time.</Paragraph>
      <Paragraph position="9"> Combination of partitionings 1 and 2 Now, we have two partitionings of two different, yet overlapping frequency bands. A large portion of these 8,000 words in the overlapping region is present in both partitionings. Again, we construct a graph, containing the clusters of both partitionings as nodes; weights of edges are the number of common elements, if at least two elements are shared. And again, CW is used to cluster this graph of clusters. This results in fewer clusters than before for the following reason: While the granularities of partitionings 1 and 2 are both high, they capture different aspects as they are obtained from different sources. Nodes of large clusters (which usually consist of open word classes) have many edges to the other partitioning's nodes, which in turn connect to yet other clusters of the same word class. Eventually, these clusters can be grouped into one.</Paragraph>
      <Paragraph position="10"> Clusters that are not included in the graph of clusters are treated differently, depending on their origin: clusters of partition 1 are added to the result, as they are believed to contain important closed word class groups. Dropouts from partitioning 2 are left out, as they mostly consist of small, yet semantically motivated word sets. Combining both partitionings in this way, we arrive at about 200-500 clusters that will be further used as a lexicon for tagging.</Paragraph>
      <Paragraph position="11"> Lexicon construction A lexicon is constructed from the merged partitionings, which contains one possible tag (the cluster ID) per word. To increase text coverage, it is possible to include those words that dropped out in the distributional step for partitioning 1 into the lexicon. It is assumed that these words dropped out because of ambiguity. From a graph with a lower similarity threshold s (here: such that the graph contained 9,500 target words), we obtain the neighbourhoods of these words one at a time. The tags of those neighbours - if known - provide a distribution of possible tags for these words.</Paragraph>
    </Section>
    <Section position="3" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
2.3 Constructing the tagger
</SectionTitle>
      <Paragraph position="0"> Unlike in supervised scenarios, our task is not to train a tagger model from a small corpus of hand-tagged data, but from our clusters of derived syntactic categories and a considerably large, yet unlabeled corpus.</Paragraph>
      <Paragraph position="1"> Basic Trigram Model We decided to use a simple trigram model without re-estimation techniques. Adopting a standard POS-tagging framework, we maximize the probability of the joint occurrence of tokens</Paragraph>
      <Paragraph position="3"> estimated from word trigrams in the corpus whose elements are all present in our lexicon.</Paragraph>
      <Paragraph position="4"> The last term of the product, namely P(c</Paragraph>
      <Paragraph position="6"> dependent on the lexicon  . If the lexicon does not</Paragraph>
      <Paragraph position="8"> ) only depends on neighbouring categories. Words like these are called out-of-vocabulary (OOV) words.</Paragraph>
    </Section>
    <Section position="4" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
Morphological Extension
</SectionTitle>
      <Paragraph position="0"> Morphologically motivated add-ons are used e.g.</Paragraph>
      <Paragraph position="1"> in (Clark, 2003) and (Freitag 2004) to guess a more appropriate category distribution based on a word's suffix or its capitalization for OOV words. Here, we examine the effects of Compact Patricia Trie classifiers (CPT) trained on prefixes and suffixes. We use the implementation of (Witschel and Biemann, 2005). For OOV words, the category-wise product of both classifier's distributions serve as probabilities P(c</Paragraph>
      <Paragraph position="3"> w=ab=cd be a word, a be the longest common prefix of w that can be found in all lexicon words, and d be the longest common suffix of w that can be found in all lexicon words. Then</Paragraph>
      <Paragraph position="5"> sentences, tokens, tagger and tagset size, corpus coverage of top 200 and 10,000 words.</Paragraph>
      <Paragraph position="6"> CPTs do not only smoothly serve as a substitute lexicon component, they also realize capitalization, camel case and suffix endings naturally.</Paragraph>
      <Paragraph position="7">  Although (Charniak et al. 1993) report that using</Paragraph>
      <Paragraph position="9"> ) instead leads to superior results in the supervised setting, we use the 'direct' lexicon probability. Note that our training material size is considerably larger than hand-labelled POS corpora.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="9" end_page="9" type="metho">
    <SectionTitle>
3 Evaluation methodology
</SectionTitle>
    <Paragraph position="0"> We adopt the methodology of (Freitag 2004) and measure cluster-conditional tag perplexity PP as the average amount of uncertainty to predict the tags of a POS-tagged corpus, given the tagging with classes from the unsupervised method. Let  be the mutual information between two random variables X and Y. Then the cluster-conditional tag perplexity for a gold-standard tagging T and a tagging resulting from clusters C is computed as )exp()exp(</Paragraph>
  </Section>
  <Section position="7" start_page="9" end_page="9" type="metho">
    <SectionTitle>
 |TCTCT
MIIPP [?]== .
</SectionTitle>
    <Paragraph position="0"> Minimum PP is 1.0, connoting a perfect congruence on gold standard tags.</Paragraph>
    <Paragraph position="1"> In the experiment section we report PP on lexicon words and OOV words separately. The objective is to minimize the total PP.</Paragraph>
  </Section>
class="xml-element"></Paper>