<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0503">
  <Title>Using Morphology and Syntax Together in Unsupervised Learning</Title>
  <Section position="6" start_page="24" end_page="25" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> Our parameters are by design restrictive, so that we declare only a few signatures to be similar, and therefore the cliques that we find in the graph are relatively small. One way to enlarge the size of collapsed signatures would be to loosen the similarity criterion. This, however, introduces too many new edges in the signatures graph, leading in turn to spurious collapses of signatures. We take a different approach, and apply our algorithms iteratively. The idea is that if, in the first iteration, two cliques did not have enough edges between their elements to become a single new signature, they may be more strongly connected in the second iteration if many of their elements are sufficiently similar. On the other hand, cliques that were dissimilar in the first iteration remain weakly connected in the second.</Paragraph>
    <Paragraph position="1"> We obtain the morphological analysis of the Brown corpus (Kucera and Francis, 1967) using the Linguistica software (http://linguistica.uchicago.edu), and we use the TreeTagger to assign a Penn TreeBank-style part-of-speech tag to each token in the corpus. We then carry out our experiment using the Brown corpus modified in the way we described above.</Paragraph>
    <Paragraph position="2"> Thus, for each token of the Brown corpus that our morphology analyzer analyzed, we have the following information: its stem, its signature (i.e., the signature to which the stem is assigned), the suffix that the stem takes in this occurrence of the word (hence, the signature transform), and the POS tag. For example, the token polymeric is analyzed into the stem polymer and the suffix ic, the stem is assigned to the signature O.ic.s, and thus this particular token has the signature transform O.ic.s_ic. Furthermore, it was assigned the POS tag JJ, so that we have the following entry: &quot;polymeric JJ O.ic.s_ic&quot;.</Paragraph>
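The per-token entry format illustrated by "polymeric JJ O.ic.s_ic" (token, POS tag, signature transform) can be read back into its parts with a few lines of Python. This is only a minimal sketch: the names TokenEntry, parse_entry and stem_of are hypothetical, and it assumes that fields are whitespace-separated, that the suffix follows the last underscore, and that "O" marks the empty suffix.

```python
from typing import NamedTuple


class TokenEntry(NamedTuple):
    token: str      # surface word form, e.g. "polymeric"
    pos_tag: str    # Penn TreeBank-style tag, e.g. "JJ"
    signature: str  # signature the stem is assigned to, e.g. "O.ic.s"
    suffix: str     # suffix realized in this token, e.g. "ic"


def parse_entry(line: str) -> TokenEntry:
    """Split 'token POS signature_suffix' into its parts."""
    token, pos_tag, transform = line.split()
    # Assumption: the suffix follows the last underscore of the transform.
    signature, suffix = transform.rsplit("_", 1)
    return TokenEntry(token, pos_tag, signature, suffix)


def stem_of(entry: TokenEntry) -> str:
    """Recover the stem by stripping the suffix from the token.

    Assumption: "O" denotes the empty suffix, so the token is its own stem.
    """
    if entry.suffix == "O" or not entry.token.endswith(entry.suffix):
        return entry.token
    return entry.token[: len(entry.token) - len(entry.suffix)]


entry = parse_entry("polymeric JJ O.ic.s_ic")
```

Keeping the signature and the suffix as separate fields makes it straightforward to group tokens either by signature or by full signature transform.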
    <Paragraph position="3"> Before performing signature collapsing, we calculate the description length of the morphology and the compressed length of the words that our algorithm analyzes, and call the result the baseline description length (DL).</Paragraph>
    <Paragraph position="4"> Now we apply our signature collapsing algorithm under several different settings of the similarity threshold th, and calculate the description length DL_th of the resulting morphological and lexical analysis using equation (3). We know that the smaller the set of signatures, the smaller the cost of the model. However, a signature collapse that combines signatures with different distributions over the lexical categories will result in a high cost for the data term (3c). The goal is therefore to find a method of collapsing signatures such that the reduction in the model cost is greater than the increase in the compressed length of the data, so that the total cost decreases.</Paragraph>
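The trade-off between model cost and data cost can be illustrated with a toy calculation. The sketch below is not the paper's equation (3): it assumes a flat, hypothetical cost in bits per signature for the model, and it measures the data cost as the number of bits needed to encode each POS tag under the empirical tag distribution of its signature. All counts and signature names are invented for illustration.

```python
import math
from collections import Counter


def data_cost_bits(tag_counts_by_signature):
    """Bits needed to encode every POS tag given its signature."""
    total = 0.0
    for counts in tag_counts_by_signature.values():
        n = sum(counts.values())
        for c in counts.values():
            # Each of the c occurrences costs -log2(c / n) bits.
            total -= c * math.log2(c / n)
    return total


def collapse(tag_counts_by_signature, names, merged_name):
    """Return a new mapping in which the signatures in `names` are merged."""
    merged = Counter()
    result = {}
    for name, counts in tag_counts_by_signature.items():
        if name in names:
            merged.update(counts)
        else:
            result[name] = Counter(counts)
    result[merged_name] = merged
    return result


def total_cost_bits(tag_counts_by_signature, bits_per_signature):
    """Toy total: flat model cost per signature plus the data cost."""
    model = bits_per_signature * len(tag_counts_by_signature)
    return model + data_cost_bits(tag_counts_by_signature)


# Hypothetical counts, invented purely for illustration.
counts = {
    "O.s": Counter({"NN": 90, "VB": 10}),
    "O.es": Counter({"NN": 85, "VB": 15}),
    "O.ed": Counter({"VB": 95, "NN": 5}),
}
similar = collapse(counts, {"O.s", "O.es"}, "O.s+O.es")
dissimilar = collapse(counts, {"O.s", "O.ed"}, "O.s+O.ed")
```

With these invented counts and a flat 20 bits per signature, merging the two noun-dominated signatures lowers the toy total, whereas merging a noun-dominated signature with a verb-dominated one raises it, because the data cost grows by more than the model cost shrinks.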
    <Paragraph position="5"> As noted above, we perform this operation iteratively, and refer to the description length of the i-th iteration, using a threshold th, as</Paragraph>
    <Paragraph position="7"> We used random collapsing in our experiments to verify the expected relationship between appropriate collapses and description length. For each signature collapse, we created a parallel situation in which the number of signatures collapsed is the same, but their choice is random. We calculate the description length using this &quot;random&quot; analysis as</Paragraph>
    <Paragraph position="9"> We predict that this random collapsing will not produce an improvement in the total description length.</Paragraph>
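The random-collapsing control can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes that the control should reproduce the sizes of the real collapsed groups while drawing their members at random without replacement. The function name random_collapse_control, the seed parameter and the signature names are hypothetical.

```python
import random


def random_collapse_control(signatures, collapsed_groups, seed=0):
    """Draw disjoint random groups with the same sizes as the real groups."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    pool = list(signatures)
    rng.shuffle(pool)
    control = []
    start = 0
    for group in collapsed_groups:
        size = len(group)
        # Consecutive slices of the shuffled pool never overlap.
        control.append(set(pool[start:start + size]))
        start += size
    return control


# Hypothetical signature inventory and real collapses, for illustration only.
signatures = ["O.s", "O.ed", "O.ing", "O.ly", "O.ic.s", "e.ing", "O.er", "O.es"]
real_groups = [{"O.s", "O.es"}, {"O.ed", "O.ing", "e.ing"}]
control_groups = random_collapse_control(signatures, real_groups)
```

The description length of the analysis obtained from control_groups can then be compared with that of the real collapse at the same threshold and iteration.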
  </Section>
</Paper>