<?xml version="1.0" standalone="yes"?> <Paper uid="C92-4195"> <Title>BROAD COVERAGE AUTOMATIC MORPHOLOGICAL SEGMENTATION OF GERMAN WORDS</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 3 EVALUATION </SectionTitle> <Paragraph position="0"> The experimental morph list used during the development of versions 1, 2, and 3 of our syntax and classification scheme consisted of 2,200 morphs, mainly selected from Ortmann (1985). In the fourth system the morph list was extended to almost 11,000 morphs (cf. Table 1). To evaluate the different versions (each consisting of a syntax and a classification scheme), two word sets were used, each containing 3,000 words. The first set consisted of ranks 1-1,000, 300,001-301,000, and 600,001-601,000 of a frequency list sorted in descending order, which was created from a corpus containing articles of a German business newspaper with about 31,000,000 running words. This set was used for the iterative improvement of our system. It is called the test set. The second set, serving as a control set, contained ranks 1,001-2,000, 200,001-201,000, and 400,001-401,000 of the frequency list of a corpus obtained from a common newspaper with about 13,200,000 running words. The control set was necessary because of the risk that the later versions were designed in such a way as to cope only with those errors which arose when applying the earlier versions to the test set.</Paragraph> <Paragraph position="1"> Table 2 shows the improvement in coverage, mainly achieved by extending the morph list: referring to the control set only, the third system segmented only 1,425 of the input words (= 47.5%), whereas the fourth system segmented 2,492 words (= 83%). The quality of the segmentations of the fourth system is a little worse. The reason for this effect is the larger morph list, which allows more nonsense concatenations of morphs. 
Although we made the grammar and the classification system more restrictive, it would have been too costly to strive for equal or better segmentation quality.</Paragraph> <Paragraph position="2"> Table 3 gives an overview of the words which were not segmented by the fourth system.</Paragraph> <Paragraph position="3"> Many of them are proper names.</Paragraph> <Paragraph position="4"> percentage related to total number of 3,000; Number of segmentations and ratio of segmentations obtained per segmented word on average; Number of correct segmentations and percentage related to total number of segmentations; Number of wrong segmentations and percentage related to total number of segmentations; Number of words with at least one correct segmentation and percentage related to number of segmented words</Paragraph> </Section> </Paper>