<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2208"> <Title>The Automatic Extraction of Open Compounds from Text Corpora</Title> <Section position="6" start_page="1145" end_page="1145" type="evalu"> <SectionTitle> 4.2 Results of extraction </SectionTitle> <Paragraph position="0"> Thailand-Japan (40,401 bytes) The results of extraction, examined on both large and small files, are very satisfactory. Very few illegible strings are extracted even though the threshold of the difference value is set as low as 10 in Figure 3 and 4 in Figure 4. The suitable threshold of difference varies with the size of the text corpus file: to obtain more meaningful strings from a large file, we have to set a relatively high threshold of extraction. One advantage of our method is that the threshold provides an inherent trade-off between the quantity and the quality of the extracted strings. In the case of Figure 3, to keep illegible strings to no more than 15% of the total extracted strings, we set the threshold to 30. As a result, we obtained 154 words, 114 fixed expressions and only 46 illegible strings. Furthermore, of the 154 words appearing in the text, 84 were not found in a standard Thai dictionary.</Paragraph> </Section> </Paper>
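<!--
The counts reported for Figure 3 at threshold 30 can be checked against the stated 15% limit with a short calculation. This is a minimal sketch in Python: the variable names are illustrative and do not come from the paper, and only the figures stated in the paragraph are used.

```python
# Counts reported in the paragraph for Figure 3 with the threshold set to 30.
words = 154
fixed_expressions = 114
illegible = 46

# Total number of extracted strings and the share that is illegible.
total = words + fixed_expressions + illegible
illegible_share = illegible / total

# Words reported as missing from a standard Thai dictionary.
words_not_in_dictionary = 84
new_word_share = words_not_in_dictionary / words

print(f"total extracted strings: {total}")                  # 314
print(f"illegible share: {illegible_share:.1%}")            # 14.6%
print(f"words not in dictionary: {new_word_share:.1%}")     # 54.5%
```

The illegible share stays just under the 15% limit, and a little over half of the extracted words are absent from the dictionary.
-->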