<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0606">
  <Title>Unsupervised discovery of morphologically related words based on orthographic and semantic similarity</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In recent years, there has been much interest in computational models that learn aspects of the morphology of a natural language from raw or structured data. Such models are of great practical interest as tools for descriptive linguistic analysis and for minimizing the expert resources needed to develop morphological analyzers and stemmers. From a theoretical point of view, morphological learning algorithms can help answer questions related to human language acquisition.</Paragraph>
    <Paragraph position="1"> In this study, we present a system that, given a corpus of raw text from a language, returns a ranked list of probable morphologically related word pairs.</Paragraph>
    <Paragraph position="2"> For example, when run with the Brown corpus as its input, our system returned a list with pairs such as pencil/pencils and structured/unstructured at the top.</Paragraph>
    <Paragraph position="3"> Our algorithm is completely knowledge-free, in the sense that it processes raw corpus data, and it does not require any form of a priori information about the language it is applied to. The algorithm performs unsupervised learning, in the sense that it does not require a correctly-coded standard to (iteratively) compare its output against.</Paragraph>
    <Paragraph position="4"> The algorithm is based on the simple idea that a combination of formal and semantic cues should be exploited to identify morphologically related pairs.</Paragraph>
    <Paragraph position="5"> In particular, we use minimum edit distance to measure orthographic similarity,[1] and mutual information to measure semantic similarity. The algorithm does not rely on the notion of affix, and it does not depend on global distributional properties of substrings (such as affix frequency). Thus, at least in principle, the algorithm is well-suited to discover pairs that are related by rare and/or non-concatenative morphological processes.</Paragraph>
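    The two cues named above can be illustrated with a minimal sketch. It assumes plain Levenshtein distance with unit costs for the orthographic cue and pointwise mutual information over sentence-level co-occurrence for the semantic cue. The function names, the sentence-level window, and the absence of any weighting or normalization are assumptions made for illustration; this section does not specify how the system itself computes or combines the two scores.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def pointwise_mi(sentences, w1, w2):
    """log2 of P(w1, w2) / (P(w1) * P(w2)), where each probability is the
    fraction of sentences containing the word(s). Sentence-level
    co-occurrence is an assumption of this sketch. Returns None when the
    two words never co-occur."""
    n = len(sentences)
    sets = [set(s) for s in sentences]
    c1 = sum(w1 in s for s in sets)
    c2 = sum(w2 in s for s in sets)
    c12 = sum(w1 in s and w2 in s for s in sets)
    if c12 == 0:
        return None
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))
```

    Under this sketch, pencil/pencils are one edit apart and structured/unstructured are two, so both pairs score as orthographically close; a pair would be ranked highly only if its mutual information is also high.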
    <Paragraph position="6"> The algorithm returns a list of related pairs, but it does not attempt to extract the patterns that relate the pairs. As such, it can be used as a tool to pre... [1] Given phonetically transcribed input, our model would compute phonetic similarity instead of orthographic similarity. (July 2002, pp. 48-57. Association for Computational Linguistics.)</Paragraph>
  </Section>
</Paper>