<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0606"> <Title>Unsupervised discovery of morphologically related words based on orthographic and semantic similarity</Title> <Section position="7" start_page="0" end_page="0" type="concl"> <SectionTitle> 5 Conclusion and Future Directions </SectionTitle> <Paragraph position="0"> We presented an algorithm that takes a raw corpus as its input and produces a ranked list of morphologically related word pairs as its output. The algorithm finds morphologically related pairs by looking at the degree of orthographic similarity (measured by minimum edit distance) and semantic similarity (measured by mutual information) between words from the input corpus.</Paragraph> <Paragraph position="1"> Experiments with German and English inputs gave encouraging results, both in terms of precision and in terms of the nature of the morphological patterns found within the output set.</Paragraph> <Paragraph position="2"> In work in progress, we are exploring various possible improvements to our basic algorithm, including iterative re-estimation of edit costs, addition of a context-similarity-based measure, and extension of the output set by morphological transitivity, i.e., the idea that if word a is related to word b, and word b is related to word c, then word a and word c should also form a morphological pair.</Paragraph> <Paragraph position="3"> Moreover, we plan to explore ways to relax the requirement that all pairs must have a certain degree of semantic similarity to be treated as morphologically related (there is evidence that humans treat certain kinds of semantically opaque forms as morphologically complex; see Baroni (2000) and the references quoted there).
This will probably involve taking distributional properties of word substrings into account.</Paragraph> <Paragraph position="4"> From the point of view of the evaluation of the algorithm, we should design an assessment scheme that would make our experimental results more directly comparable to those of Yarowsky and Wicentowski (2000), Schone and Jurafsky (2000) and others. Moreover, a more in-depth qualitative analysis of the results should concentrate on identifying specific classes of morphological processes that our algorithm can or cannot identify correctly.</Paragraph> <Paragraph position="5"> We envisage a number of possible uses for the ranked list that constitutes the output of our model. First, the model could provide the input for a more sophisticated rule extractor, along the lines of those proposed by Albright and Hayes (1999) and Neuvel (2002). Such models extract morphological generalizations in terms of correspondence patterns between whole words, rather than in terms of affixation rules, and are thus well suited to identifying patterns involving non-concatenative morphology and/or morphophonological changes. A list of related words constitutes a more suitable input for them than a list of words segmented into morphemes. Rules extracted in this way would have a number of practical uses: for example, they could be used to construct stemmers for information retrieval applications, or they could be integrated into morphological analyzers.</Paragraph> <Paragraph position="6"> Our procedure could also be used to replace the first step of algorithms, such as those of Goldsmith (2001) and Snover and Brent (2001), where heuristic methods are employed to generate morphological hypotheses, and then an information-theoretically/probabilistically motivated measure is used to evaluate or improve such hypotheses.
More generally, our algorithm can help reduce the size of the search space that all morphological discovery procedures must explore.</Paragraph> <Paragraph position="7"> Last but not least, the ranked output of (an improved version of) our algorithm can be of use to the linguist analyzing the morphology of a language, who can treat it as a way to pre-process her/his data, while still relying on her/his analytical skills to extract the relevant morphological generalizations from the ranked pairs.</Paragraph> </Section> </Paper>