<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1241"> <Title>Reconciliation of Unsupervised Clustering, Segmentation and Cohesion</Title> <Section position="7" start_page="0" end_page="0" type="concl"> <SectionTitle> 6. Reconciling the Methods </SectionTitle> <Paragraph position="0"> The Harris (1960) approach works on the insight that within a unit, particularly a closed-class functional unit such as an affix, there is less freedom of choice than at the boundary of units. This depends strongly on the fact that the number of affixes is much lower than individual characters, whilst their frequency is so much higher than that of the morphs they collate with. That is, they define large cosets.</Paragraph> <Paragraph position="1"> The Powers (1992) approach works by finding the groups of segments which have the largest cosets, and thus have high frequency and low information, their information content tending to be more syntactic than semantic. The segmentation and classification occur simultaneously, and it seems there is no advantage to doing perplexity-based segmentation before doing the classification, although this has not yet been investigated. The segmentation process may, however, be repeated, finding the subsequent perplexity or information maxima. In addition, even the initial functional segments found may be used directly to learn or check a grammar (Entwisle and Groves, 1994; Powers, 1997b), although this already makes use of the known word segmentation and the assumption, which for English is an excellent first approximation, that affixes are either word initial or word final, and that it is these prefixes and suffixes which determine the syntactic roles of the words.</Paragraph> <Paragraph position="2"> 7. Augmenting the Methods The approach used by Entwisle and Groves (1994) is only semi-automatic, and was not originally conceived as a learning system. 
When a sentence fails to parse, it means that a constraint must be relaxed, and this constraint is identified manually, this being a system which involves no statistics, which is being trained on text which may contain errors (e.g. one error was discovered in the first chapter of the Alice Corpus, Carroll, 1865), and where the relaxation may involve the supplying of new roles or the removal of a constraint at any one of a number of possible points.</Paragraph> <Paragraph position="3"> The approach used by Powers (1997b) is only intended to identify typing errors and substitution errors (e.g.</Paragraph> <Paragraph position="4"> 'there' for 'their') and builds and stores a differential grammar only when the word can be disambiguated from its closed-class context, but already constraints based on the closed-class words and functional affixes suffice to perform better than commercial grammar checkers.</Paragraph> <Paragraph position="5"> The segmentation and classification methods on their own do not attempt to check cohesive constraints, such as agreement, but doing so could be expected to reduce the ambiguity which is so rife. Powers (1992) reports one word with around 5000 different 'parses'.</Paragraph> <Paragraph position="6"> The specific approach we are using in our current work is to extend the structure determined by a version of the approach of Powers (1992 and 1997a) which produces binary grammar rules. The extended structure augments a higher level unit with features constructed from or inherited from the lower level units. 
This construction is being carried out virtually at present, while we examine the best way to propagate information, and we investigate and seek to differentiate the specific hypotheses that (a) the more frequent, or (b) the higher perplexity, segments play the morpho-syntactic cohesive roles, whilst their binary siblings hold the primary content to be retained and passed on.</Paragraph> <Paragraph position="7"> Whilst this strategy is the one suggested by the primary morphological cohesion, and could straightforwardly be applied after a single segmentation pass, using the hierarchical classification approach produces a far stronger hypothesis, predicting that vowels in English, where they are strongest under both conditions (a) and (b), play a primarily structural or phonological role, and that affixes, prepositions, articles, relatives, conjunctions and the like act as the heads of their superordinate structures.</Paragraph> <Paragraph position="8"> An additional aim of the present project is to seek to tease apart homonyms and their manifestations at the other levels, including the dual role of the letter 'y' (sometimes clearly a vowel as in 'xylophone', sometimes ambiguously consonantal as in 'play, playing, played'), the suffix '-s' and the word 'to'. In Powers (1997a) both 'y' and space were identified as vowels using certain clustering techniques and methods (and the issues are discussed in that paper). We are generalizing the approach of identifying a class, such as the vowels, and then identifying those units, such as 'y', which atypically have a larger coset than the class which has been selected as having maximal coverage (resolving the Powers (1992) dilemma in favour of coverage as the preferred metric).</Paragraph> <Paragraph position="9"> 8. 
Discussion and Conclusion This extended abstract documents work in progress, contrasting existing approaches in recent publications and setting out the direction we are following.</Paragraph> <Paragraph position="10"> Preliminary results should be available at the workshop, but the paper is mainly intended to provoke discussion of the pros and cons of the two approaches to segmentation.</Paragraph> </Section> </Paper>