File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/04/c04-1152_concl.xml
Size: 3,300 bytes
Last Modified: 2025-10-06 13:53:57
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1152"> <Title>Efficient Unsupervised Recursive Word Segmentation Using Minimum Description Length</Title> <Section position="8" start_page="0" end_page="0" type="concl"> <SectionTitle> 6 Conclusions </SectionTitle> <Paragraph position="0"> We have given a firmer foundation for the use of minimal description length (MDL) criteria for morphological analysis by giving a novel local formulation of the change in description length (DL) upon resegmentation of the corpus on a prefix (or suffix), which enables an efficient algorithm for greedy construction of a morph dictionary using an MDL criterion.</Paragraph> <Paragraph position="1"> The algorithm we have devised is generic, in that it may easily be applied to any local description length model. Early results of our method, as evaluated by examination of the morphs it extracts, show high accuracy in finding meaningful morphs based solely on orthographic considerations. In fact, we find that Model 1, which depends only on the number of morphs in the dictionary (and not on frequencies in the corpus at all), gives surprisingly good results, though Model 2 may generally be preferable; more experiments on varied and larger corpora remain to be run.</Paragraph> <Paragraph position="2"> We see two immediate directions for future work.</Paragraph> <Paragraph position="3"> The first comprises direct improvements to the techniques presented here. Rather than segmenting prefixes and suffixes separately, the data structures and algorithms should be extended to segment both prefixes and suffixes in the current morph list, depending on which gives the best overall DL improvement.
Related is the need to enable approximate matching of 'boundary' characters due to orthographic shifts such as -y to -i-, as well as incorporating other orthographic filters on possible morphs (such as requiring prefixes to contain a vowel). Another algorithmic extension will be to develop an efficient beam-search algorithm (avoiding copying the entire data structure), which may improve accuracy over the current greedy search method. In addition, we will investigate the use of more sophisticated DL models, including, for example, semantic similarity between candidate affixes and stems, using the probability of occurrence of individual characters for coding, or using n-gram probabilities for coding the corpus as a sequence of morphs (instead of the unigram coding model used here and previously). The second direction involves integrating the current algorithm into a larger system for more comprehensive morphological analysis. As noted above, due to the greedy nature of the search, a recombination step may be needed to 'glue' morphs that got incorrectly separated (such as un- and -der-).</Paragraph> <Paragraph position="4"> More fundamentally, we intend to use the algorithm presented here (with the above extensions) as a sub-routine in a paradigm construction system along the lines of Goldsmith (2001). It seems likely that efficient and accurate MDL segmentation as we present here will enable more effective search through the space of possible morphological signatures.</Paragraph> </Section> </Paper>