<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3208">
  <Title>Morphology Induction from Limited Noisy Data Using Approximate String Matching</Title>
  <Section position="6" start_page="63" end_page="67" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="63" end_page="63" type="sub_section">
      <SectionTitle>
6.1 Dictionaries
</SectionTitle>
      <Paragraph position="0"> The BRIDGE system (Ma et al., 2003) processes scanned and OCRed dictionaries to reproduce electronic versions and extract information from dictionary entries. We used the BRIDGE system to process two bilingual dictionaries, a Cebuano-English (CebEng) dictionary (Wolff, 1972) and a Turkish-English (TurEng) dictionary (Avery et al., 1974), and extract a list of headword-example of usage pairs for our experiments. The extracted data is not perfect: it has mistagged information, i.e. it may include some information that is not the headword or example of usage, or some useful information may be missing, and OCR errors may occur. OCR errors can be in different forms: Two words can be merged into one, one word can be split into two, or charac-</Paragraph>
    </Section>
    <Section position="2" start_page="63" end_page="63" type="sub_section">
      <SectionTitle>
Data Used in Experiments
</SectionTitle>
      <Paragraph position="0"> Along with the headword-example of usage pairs from more than 1000 pages, we randomly selected 20 pages for detailed analysis. Table 1 provides details of the data from two dictionaries we use in our experiments.</Paragraph>
      <Paragraph position="1"> Both Cebuano and Turkish are morphologically rich. Cebuano allows prefixes, suffixes, circumfixes, infixes, while Turkish is an agglunative language. The two dictionaries have different characteristics. The example of usages in CebEng are complete sentences given in italic font while TurEng has phrases, idioms, or complete sentences as examples of usages indicated in bold font.</Paragraph>
    </Section>
    <Section position="3" start_page="63" end_page="63" type="sub_section">
      <SectionTitle>
6.2 Protocol
</SectionTitle>
      <Paragraph position="0"> We ran our algorithm first on all of the data and then on a randomly selected 20 pages from each dictionary. We manually extracted the affixes from each of the 20 pages. We then evaluated the MIND results with this ground truth. During the evaluation, even if the number of an affix in the ground truth and result are same, if they were extracted from different words, this is counted as an error. We also examined the cause of each error in this data.</Paragraph>
      <Paragraph position="1"> We then compare our results from the whole TurEng data with the state-of-the-art Linguistica (Goldsmith, 2001) algorithm. Finally, we used the suffixes extracted by MIND and Linguistica to segment words in a Turkish treebank.</Paragraph>
    </Section>
    <Section position="4" start_page="63" end_page="64" type="sub_section">
      <SectionTitle>
6.3 Analysis
</SectionTitle>
      <Paragraph position="0"> number of affixes and number of different types of  affixes (in parenthesis) are presented for two dictionaries, CebEng and TurEng, and two data sets, the whole dictionary and 20 randomly selected pages. The top part of the table gives the exact match results and the bottom part shows the approximate match results. For Cebuano, approximate match part of the framework finds many more affixes than it does for Turkish. This is due to the different structures in the two dictionaries. We should note that although MIND incorrectly finds a few prefixes, circumfixes, and infixes for Turkish, these all have count one.</Paragraph>
      <Paragraph position="1"> Table 3 contains some of the most frequent extracted affixes along with their exact and approximate counts, and samples of headword-example of usage word pairs they were extracted from. Each word is segmented into one root and one suffix, therefore when a word takes multiple affixes, they are all treated as a compound affix.</Paragraph>
      <Paragraph position="2"> Dictionary GT cnt. Res.cnt. Misses Additions  truth and MIND results along with number of missed and incorrectly added affixes on 20 of these pages of data. MIND only missed 5% of the affixes in the ground truth in both data sets.</Paragraph>
      <Paragraph position="3"> We also examined the causes of each miss and addition. Table 5 presents the causes of errors in the output of MIND with an example for each cause. We should emphasize that a valid affix such as Turkish suffix -mi is counted as an error since the suffix ini should be extracted for that particular headword-example of usage pair. An OCR error such as the misrecognition of a as d, causes both the miss of the prefix mag- and incorrect addition of mdg- for Cebuano. There are some cases that cannot be correctly identified by the framework. These usually involve dropping the last vowel because of morphophonemic rules. For the Cebuano dictionary, merge and split caused several errors, while Turkish data does not have any such errors. Main reason is the different structure and format of the original dictionaries. In the Cebuano dictionary, an italic font which may result in merge and split is used to indicate example of usages.</Paragraph>
      <Paragraph position="4"> For the Cebuano data, five invalid suffixes, three invalid prefixes, and two invalid circumfixes are found, while one valid suffix and one valid circumfix are missed. For the Turkish data, three invalid suffixes, one invalid prefix, and two valid suffixes are found while two valid suffix are missed. When we look at the invalid affixes in the data, most of them (six of the Cebuano, and all of the Turkish ones) have count one, and maximum count in an invalid affix is five. Therefore, if we use a low threshold, we can eliminate many of the invalid affixes.</Paragraph>
    </Section>
    <Section position="5" start_page="64" end_page="65" type="sub_section">
      <SectionTitle>
6.4 Comparison to Linguistica
</SectionTitle>
      <Paragraph position="0"> We compared our system with Linguistica, a publicly available unsupervised corpus-based morphology learner (Goldsmith, 2001). Linguistica induces paradigms in a noise-free corpus, while MIND makes use of string searching algorithms and allows one to deal with noise at the cost of correctness.</Paragraph>
      <Paragraph position="1"> MIND emphasize segmenting a word into its root and affixes. We trained Linguistica using two different data sets from TurEng7: 1) Whole headword7We would like to do the same comparison in Cebuano. For the time being, we could not find a treebank and native speakers  example of usage sentence pairs, and 2) Headwordcandidate example words that our algorithm returns. In the first case (Ling-all), Linguistica uses more data than our algorithm, so to avoid any biases resulting from this, we also trained Linguistica using the headword and candidate example word (Lingcand). We only used the suffixes, since Turkish is a suffix-based language. The evaluation is done by a native speaker.</Paragraph>
      <Paragraph position="2"> Figure 1 presents the analysis of the suffix lists produced by Linguistica using two sets of training data, and MIND. The suffix lists are composed of suffixes the systems return that have counts more than a threshold. The results are presented for six threshold values for all of the data. We use thresholding to decrease the number of invalid affixes caused primarily by the noise in the data. For the MIND results, the suffixes over threshold are the ones that have positive exact counts and total counts (sum of exact and approximate counts) more than the threshold. Although Linguistica is not designed for thresholding, the data we use is noisy, and we explored if suffixes with a corpus count more than a threshold will eliminate invalid suffixes. The table on the left gives the total number of suffixes, the percentage of suffixes that have a count more than a threshold value, the percentage of invalid suffixes, and percentage of missed suffixes that are discarded by thresholding for the whole TurEng dictionary. The number of affixes MIND finds are much more than that of Linguistica. Furthermore, number of invalid affixes are lower. On the other hand, the number of missed affixes is also higher for MIND since, for this particular data, there are many affixes with counts less than 5. 41% of the affixes have an exact count of 1. The main reason for this is the agglunative nature of Turkish language. The effect of thresholding can also be examined in the graph for Cebuano.</Paragraph>
      <Paragraph position="3"> on the right in Figure1 which gives the percentage of valid suffixes as a function of threshold values.</Paragraph>
      <Paragraph position="4"> MIND takes advantage of thresholding, and percentage of valid suffixes rapidly decrease for threshold  Threshold, Invalid, and Missed Suffixes Found by Linguistica and MIND for Different Threshold Values for 20 pages of Turkish Data Table 6 presents the same results for 20 pages from TurEng for three threshold values. MIND performs well even with very small data and finds many valid affixes. Linguistica on the other hand finds very few.</Paragraph>
    </Section>
    <Section position="6" start_page="65" end_page="67" type="sub_section">
      <SectionTitle>
6.5 Stemming
</SectionTitle>
      <Paragraph position="0"> To test the utility of the results, we perform a simple word segmentation, with the aim of stripping the inflectional suffixes, and find the bare form of the word. A word segmenter takes a list of suffixes, and their counts from the morphology induction system (Linguistica or MIND), a headword list as a dictionary, a threshold value, and the words from a treebank. For each word in the treebank, there is a root form (rf), and a usage form (uf). The suffixes with a count more than the threshold are indexed according to their last letters. For each word in the treebank, we first check if uf is already in the dictionary, i.e. in the headword list. If we cannot find it  Linguistica and MIND for Different Threshold Values in the dictionary, we repeatedly attempt to find the longest suffix that matches the end of uf, and check the dictionary again. The process stops when a dictionary word is found or when no matching suffixes can be found at the end of the word. If the word the segmenter returns is same as rf in the treebank, we increase the correct count. Otherwise, this case is counted as an error.</Paragraph>
      <Paragraph position="1"> In our stemming experiments we used METU-Sabanci Turkish Treebank8, a morphologically and syntactically annotated treebank corpus of 7262 grammatical sentences (Atalay et al., 2003; Oflazer et al., 2003). We skipped the punctuation and multiple parses,9 and ran our word segmentation on 14950 unique words. We also used the headword list extracted from TurEng as the dictionary. Note that, the headword list is not error-free, it has OCR errors. Therefore even if the word segmenter returns the correct root form, it may not be in the dictionary and the word may be stripped further.</Paragraph>
      <Paragraph position="2"> The percentage of correctly segmented words are presented in Figure 2. We show results for six threshold values. Suffixes with counts more than the threshold are used in each case. Again for MIND results, we require that the exact match counts are more than zero, and the total of exact match and ap- null by Different Systems for Different Threshold Values proximate match counts are more than the threshold. For Linguistica, suffixes with a corpus count more than the threshold are used. For each threshold value, MIND did much better than Ling-cand.</Paragraph>
      <Paragraph position="3"> MIND outperformed Ling-all for thresholds 0 and 1. For the other values, the difference is small. We should note that Ling-all uses much more training data than MIND (503 vs. 1849 example of words), and even with this difference the performance of MIND is close to Ling-all. We believe the reason for the close performance of MIND and Ling-all in segmentation despite the huge difference in the number of correct affixes they found due to the fact that affixes Ling-all finds are shorter, and more frequent. In its current state, MIND does not segment compound affixes, and find several long and less frequent affixes. These long affixes can be composed  by shorter affixes Linguistica finds.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>