File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/99/w99-0626_concl.xml

Size: 2,553 bytes

Last Modified: 2025-10-06 13:58:34

<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0626">
  <Title>Automatic Construction of Weighted String Similarity Measures</Title>
  <Section position="6" start_page="217" end_page="217" type="concl">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> In this paper, three approaches were introduced with the common goal of automatically generating language-dependent string matching functions in order to improve the recognition of string similarity. The first two approaches, however, differ from the third in their general principle.</Paragraph>
    <Paragraph position="1"> Both the first and the second approach produce an independent string matching function that does not rely on a direct comparison of characters.</Paragraph>
    <Paragraph position="2"> These approaches are therefore independent of the character sets used in each language.</Paragraph>
    <Paragraph position="3"> The difference between the first and the second approach concerns segmentation. While approach 1 uses a simple segmentation into sequences of characters, approach 2 groups vowels and consonants into n-grams. Because of the large variety of possible n-grams, a hit is much less likely when matching word pairs, so a much lower threshold has to be used in approach 2 in order to obtain cognate candidates. The drawback is a much higher risk of finding wrong candidates, especially for short strings. Nevertheless, both approaches produce results with high precision, between 92% and 97%. Their recall is lower than what can be reached with LCSR scores: compared at a similar level of precision, the first approach returns roughly 87% and the second approach 39% as many candidates as LCSR extraction.</Paragraph>
    <Paragraph position="4"> The third approach is based on LCS calculations.</Paragraph>
    <Paragraph position="5"> The goal is to add matching values for common non-identical characters and n-grams. This approach is less flexible when applied to languages with different character sets, but it produced the best result in the experiments carried out with Swedish/English word pairs. The basic set of cognates obtained by LCSR extraction was extended by about 21%. The estimated precision of the resulting list even improved slightly, from 92.5% for LCSR extraction to 95.5%. The third approach is therefore by far the best of the three methods when the languages considered share a largely common character set.</Paragraph>
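    <Paragraph position="6"> As an illustration only, and not the procedure used in the experiments, the following Python sketch shows how an LCSR score (LCS length divided by the length of the longer string) can be computed and how matching values for non-identical characters might be added to it. The function names, the weight table, and the value 0.5 are hypothetical assumptions; the sketch handles single characters only, not n-grams, and assumes weights between 0 and 1.

```python
def weighted_lcs(a, b, weights=None):
    """Best monotone alignment score of strings a and b.

    Identical characters score 1.0. A non-identical pair (x, y) scores
    weights[(x, y)] if it is listed, otherwise 0.0. The weight table is
    a hypothetical input; without it the result is the plain LCS length.
    """
    weights = weights or {}
    # table[i][j] holds the best score for the prefixes a[:i] and b[:j]
    table = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            match = 1.0 if x == y else weights.get((x, y), 0.0)
            table[i][j] = max(
                table[i - 1][j],              # skip a character of a
                table[i][j - 1],              # skip a character of b
                table[i - 1][j - 1] + match,  # align x with y
            )
    return table[len(a)][len(b)]


def lcsr(a, b, weights=None):
    """Alignment score divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return weighted_lcs(a, b, weights) / longest if longest else 0.0


# Swedish/English pair; the v/w weight is an assumed, illustrative value.
plain = lcsr("vatten", "water")
weighted = lcsr("vatten", "water", {("v", "w"): 0.5})
```

    For the pair "vatten"/"water", the plain score is 0.5 (common subsequence "ate" over a length of 6). With the assumed weight of 0.5 for the pair (v, w), the score rises to 3.5/6, so the pair would pass a threshold that the plain LCSR score misses.</Paragraph>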
  </Section>
</Paper>