<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3208">
  <Title>Morphology Induction from Limited Noisy Data Using Approximate String Matching</Title>
  <Section position="4" start_page="60" end_page="60" type="metho">
    <SectionTitle>
3 Approach
</SectionTitle>
    <Paragraph position="0"> Dictionary entries contain headwords and examples of how these words are used in context (i.e., examples of usage). Our algorithm assumes that each example of usage contains at least one instance of the headword, either in its root form or as one of its morphological variants. For each headword-example of usage pair, we find the headword occurrence in the example of usage and extract the affix if the headword appears as one of its morphological variants. We do not require the data to be perfect: it may contain noise such as OCR errors, and our approach still identifies the affixes in such noisy data.</Paragraph>
  </Section>
  <Section position="5" start_page="60" end_page="63" type="metho">
    <SectionTitle>
4 Framework
</SectionTitle>
    <Paragraph position="0"> Our framework has two stages, exact match and approximate match, and uses three string distance metrics: the longest common substring (LCS), approximate string matching with k differences (k-DIFF), and string edit distance (SED). We differentiate between exact and approximate matches and assign two counts to each identified affix: an exact count and an approximate count. We require each affix to have a positive exact count in order to appear in the final affix list. Although approximate match can also find exact matches and thereby identify prefixes, suffixes, and circumfixes, it cannot differentiate between infixes and OCR errors. For these reasons, we process the two cases separately.</Paragraph>
    <Paragraph position="1"> First we briefly describe the three metrics we use and the adaptations we made to find the edit operations in SED, and then we explain how we use these metrics in our framework.</Paragraph>
    <Section position="1" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
4.1 String Searching Algorithms
</SectionTitle>
      <Paragraph position="0"> Longest Common Substring (LCS) Given two strings p = p1...pn and q = q1...qm, LCS finds the longest contiguous sequence appearing in p and q.</Paragraph>
      <Paragraph position="1"> The longest common substring is not the same as the longest common subsequence, because the longest common subsequence need not be contiguous.</Paragraph>
      <Paragraph position="2"> There is a dynamic programming solution for LCS that finds the longest common substring of two strings of lengths n and m in O(nm) time.</Paragraph>
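      <Paragraph> The dynamic programming idea for LCS can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name longest_common_substring is an assumption made for this example. It keeps only two rows of the table, where each cell holds the length of the common suffix ending at that pair of positions.
<![CDATA[
```python
def longest_common_substring(p, q):
    """Return one longest contiguous substring shared by p and q.

    Hypothetical helper for illustration; runs in O(len(p) * len(q)) time.
    """
    best_len, best_end = 0, 0
    # prev[j] holds the length of the longest common suffix of p[:i-1] and q[:j].
    prev = [0] * (len(q) + 1)
    for i in range(1, len(p) + 1):
        curr = [0] * (len(q) + 1)
        for j in range(1, len(q) + 1):
            if p[i - 1] == q[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return p[best_end - best_len:best_end]


print(longest_common_substring("abtik", "naabtikan"))
```
]]>
      When several substrings share the maximum length, this sketch returns the one that ends earliest in p.</Paragraph>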
      <Paragraph position="3"> String Edit Distance (SED) Given two strings p and q, SED is the minimum number of edit operations that transform p to q. The edit operations allowed are insertions, deletions, and substitutions. In our algorithm, we set the cost of each edit operation to 1. A solution based on dynamic programming computes the distance between strings in O(mn), where m and n are the lengths of the strings (Wagner and Fischer, 1974).</Paragraph>
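      <Paragraph> A minimal sketch of the unit-cost dynamic programming recurrence for SED is given below. The function name string_edit_distance is an assumption for this example; the recurrence itself is the classical one cited above.
<![CDATA[
```python
def string_edit_distance(p, q):
    """Unit-cost edit distance between p and q (hypothetical helper name)."""
    n, m = len(p), len(q)
    # dist[i][j] = distance between the prefixes p[:i] and q[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # remove all i characters of p[:i]
    for j in range(m + 1):
        dist[0][j] = j  # add all j characters of q[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if p[i - 1] == q[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # skip a character of p
                dist[i][j - 1] + 1,        # skip a character of q
                dist[i - 1][j - 1] + sub,  # match or substitute
            )
    return dist[n][m]


print(string_edit_distance("abtik", "naabtikan"))
```
]]>
      </Paragraph>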
      <Paragraph position="4"> Approximate string matching with k differences (k-DIFF) Given two strings p and q, the problem of approximate string matching with k differences is finding all the substrings of q which are at a distance less than or equal to a given value k from p. Insertions, deletions and substitutions are all allowed. A dynamic programming solution to this problem is the same as the classical string edit distance solution with one difference: the values of the first row of the table are initialized to 0 (Sellers, 1980). This initialization means that the cost of insertions of letters of q at the beginning of p is zero. The solutions are all the values of the last row of the table which are less than or equal to k. Consequently, the minimum value on the last row gives us the distance of the closest occurrence of the pattern.</Paragraph>
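      <Paragraph> The zero-initialized first row can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name k_diff and the returned tuple (minimum distance, end positions in q) are choices made for this example.
<![CDATA[
```python
def k_diff(p, q, k):
    """Approximate matching of pattern p inside text q with at most k differences.

    Returns (minimum value on the last row, list of end positions j in q
    whose last-row value is at most k). Hypothetical helper for illustration.
    """
    n, m = len(p), len(q)
    # First row is all zeros: an occurrence of p may start anywhere in q.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        # First column still costs i: p[:i] matched against an empty substring.
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            sub = 0 if p[i - 1] == q[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + sub)
        prev = curr
    ends = [j for j in range(m + 1) if prev[j] <= k]
    return min(prev), ends


print(k_diff("abtik", "naabtikan", 0))
```
]]>
      Each end position j means that some substring of q ending just before index j is within k differences of p.</Paragraph>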
      <Paragraph position="5"> String Edit Distance with Edit Operations (SED-path) In our framework, we are also interested in tracing back the editing operations performed in achieving the minimum cost alignment.</Paragraph>
      <Paragraph position="6"> In order to obtain the sequence of edit operations, we can work backwards from the complete distance matrix. For two strings p and q with lengths n and m respectively, the cell L[n,m] of the distance matrix L gives us the SED between p and q. To get to the cell L[n,m], we had to come from one of 1) L[n-1,m], 2) L[n,m-1], or 3) L[n-1,m-1].</Paragraph>
      <Paragraph position="8"> Which of these three options was chosen can be reconstructed given these costs, the edit operation costs, and the characters p[n], q[m] of the strings. By working backwards, we can trace the entire path and thus reconstruct the alignment. However, there are ambiguous cases; the same minimum cost may be obtained by a number of edit operation sequences. We adapted the trace of the path for our purposes as explained below.</Paragraph>
      <Paragraph position="9"> Let path be the list of editing operations to obtain minimum distance, and SED-path be the SED algorithm that also returns a path. The length of the path is max(n,m), and path[j] contains the edit operation to change q[j] (or p[j] if n &gt; m). Path can contain four different types of operations: Match (M), substitution (S), insertion (I), and deletion (D).</Paragraph>
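      <Paragraph> A plain backward trace through the distance matrix can be sketched as follows. This is a generic backtrace with a simple tie-breaking rule (prefer the diagonal step), not the Case 1 to Case 4 heuristics of the paper; the function name sed_path and the convention that I marks a character of q with no counterpart in p are assumptions made for this example.
<![CDATA[
```python
def sed_path(p, q):
    """Return (distance, path) where path is a list of 'M', 'S', 'I', 'D'.

    'I' marks a character of q with no counterpart in p, 'D' the reverse.
    Ties are broken in favor of the diagonal (match/substitution) step,
    which is an assumption of this sketch, not the paper's heuristics.
    """
    n, m = len(p), len(q)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if p[i - 1] == q[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + sub)
    # Walk back from the bottom-right cell to the origin.
    path = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if p[i - 1] == q[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                path.append("M" if sub == 0 else "S")
                i, j = i - 1, j - 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            path.append("I")
            j -= 1
        else:
            path.append("D")
            i -= 1
    path.reverse()
    return dist[n][m], path


print(sed_path("abtik", "abtikan"))
```
]]>
      Unlike the path defined above, whose length is max(n,m), the list returned by this sketch has one entry per alignment column, so it can be longer when both I and D occur.</Paragraph>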
      <Paragraph position="10"> Our goal is to find affixes, and in case of ambiguity we employed the following heuristics for finding the SED operations leading to the minimum distance: ... with the last I. Case 1 ensures that if one word has more characters than the other, an insertion operation is selected for those characters.</Paragraph>
      <Paragraph position="11"> If there is an ambiguity, and an M/S and an I operation have the same minimum cost, Case 2 gives priority to the insertion operation until a match case is encountered, while Case 3 gives priority to match/substitution operations if a match case was seen previously.</Paragraph>
      <Paragraph position="12"> The example below shows how Case 4 helps us localize all the insertion operations. For the headword-candidate example word pair abirids -- makaabir'ids, the path changes from (1) to (2) using Case 4, and the correct prefix is identified, as we explain in the next section.</Paragraph>
      <Paragraph position="13"> (1) I M I I I M M M S M M = Prefix m
 (2) I I I I M M M M S M M = Prefix maka</Paragraph>
      <SectionTitle>
5 Morphology Induction from Noisy Data (MIND)
</SectionTitle>
      <Paragraph position="14"> The MIND framework consists of two stages. In the exact match stage, the MIND framework checks whether the headword occurs without any changes or errors (i.e., whether the headword occurs exactly in the example of usage). If no such occurrence is found, an approximate match search is performed in the second stage. Below we describe these two stages in detail.</Paragraph>
    </Section>
    <Section position="2" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
5.1 Exact Match
</SectionTitle>
      <Paragraph position="0"> Given a list of (noisy) headword-example of usage pairs (w, E), the exact match stage first checks whether the headword occurs in E in its root form. If the headword cannot be found in E in its root form, then for each e_i in E the longest common substring, LCS(w, e_i), is computed (footnote 3). Let e_l be the e_i that has the longest common substring (l) with w (footnote 4). If w = l, and for some suffix s and some prefix p one of the following conditions is true, the affix is extracted.</Paragraph>
      <Paragraph position="1">
 1. e_l = ws (suffix), or
 2. e_l = pw (prefix), or
 3. e_l = pws (circumfix)
 The extracted affixes are added to the induced affix list, and their exact counts are incremented. In the third case, p-s is treated together as a circumfix. For infixes, there is one further step. If w = w'l and e_l = e'_l l, we compute LCS(w', e'_l). If e'_l = w's for some suffix s, then s is added as an infix to the induced affix list. (This means e_l = w'sl, where w = w'l.) The following sample run illustrates how the exact match stage identifies affixes. Given the Cebuano headword-example of usage pair (abtik) -- (naabtikan sad ku sa b'at'a), the word naabtikan is marked as the candidate that has the longest common substring with the headword abtik. These two words have the following alignment, and we extract the circumfix na-an. In the illustration below, straight lines represent matches, and short lines ending in square boxes represent insertions.</Paragraph>
      <Paragraph position="2"> Footnote 3: ... example words that are shorter than the headword. Although there are some languages, such as Russian, in which headwords may be longer than the inflected forms, such cases are not in the scope of this paper.</Paragraph>
      <Paragraph position="3"> Footnote 4: Note that the length of the longest common substring can be at most the length of the headword, in which case the longest common substring is the headword itself.</Paragraph>
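      <Paragraph> The suffix, prefix, and circumfix conditions of the exact match stage can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name exact_match_affix is an assumption, the example of usage is split on whitespace, the first example word that contains the headword is taken as the candidate (any such word attains the maximum possible LCS length), and the infix step is omitted.
<![CDATA[
```python
def exact_match_affix(w, example):
    """Return (kind, affix) for headword w in an example of usage, or None.

    Hypothetical helper: covers only the cases where the whole headword
    occurs inside an example word (suffix, prefix, circumfix).
    """
    words = example.split()
    if w in words:
        # Headword occurs in its root form: no affix to extract.
        return None
    for e in words:
        start = e.find(w)
        if start == -1:
            continue
        prefix = e[:start]
        suffix = e[start + len(w):]
        if prefix and suffix:
            return ("circumfix", prefix + "-" + suffix)
        if prefix:
            return ("prefix", prefix)
        return ("suffix", suffix)
    return None


print(exact_match_affix("abtik", "naabtikan sad ku sa b'at'a"))
```
]]>
      For the Cebuano pair discussed above, the sketch reports the circumfix na-an.</Paragraph>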
    </Section>
    <Section position="3" start_page="62" end_page="63" type="sub_section">
      <SectionTitle>
5.2 Approximate Match
</SectionTitle>
      <Paragraph position="0"> When we cannot find an exact match, there may be an approximate match resulting from an OCR error or from morphophonemic rules, and we deal with such cases separately in the second part of the algorithm. For each e_i in E, we compute the difference between the headword and the example word, k-DIFF(w, e_i). The example word that has the minimum difference from the headword is selected as the most likely candidate (e_cand). We then find the sequence of edit operations performed in achieving the minimum distance alignment that transforms e_cand to w, using the SED-path algorithm described above. Let cnt(X) be the count of operation X in the computed path. If cnt(I) = 0, the case is considered an approximate root form (with OCR errors). The following conditions are considered possible errors, and no further analysis is done for such cases:</Paragraph>
      <Paragraph position="2"> Otherwise, we use the insertion operations at the beginning and at the end of the path to identify the type of the affix (prefix, suffix, or circumfix) and the length of the affix (number of insertion operations).</Paragraph>
      <Paragraph position="3"> The identified affix is added to the affix list, and its approximate count is incremented. All the other cases are dismissed as errors. In its current state, the algorithm does not handle infixes in the approximate match case.</Paragraph>
      <Paragraph position="4"> The following sample shows how approximate match works with noisy data. In the Cebuano input pair (ambihas) -- (ambsh'asa pagbutang ang duha ka silya arun makakit'a ang maglingkud sa luyu), the first word in the example of usage has an OCR error: i is misrecognized as s. Moreover, there is a vowel change in the word caused by the affix. An exact match of the headword cannot be found in the example of usage. The k-DIFF algorithm returns ambsh'asa as the candidate example of usage word, with a distance of 2. Then, the SED-path algorithm returns the path M M M S M S M I, and the algorithm successfully concludes that a is the suffix, as shown below in the illustration (dotted lines represent substitutions).</Paragraph>
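      <Paragraph> Reading the affix off the insertion operations at the two ends of a path can be sketched as follows. This is an illustration under stated assumptions: the function name affix_from_path is hypothetical, paths that contain deletions or insertions away from the ends are simply rejected (the paper's own error conditions are not reproduced here), and the example words are written without diacritics so that each path symbol lines up with one character.
<![CDATA[
```python
def affix_from_path(path, e_cand):
    """Classify the affix of candidate word e_cand from an edit path.

    path is a string or list over 'M', 'S', 'I', 'D'. Returns (kind, affix),
    ('root', '') when there are no insertions, or None when the path is
    rejected. Rejection rules here are assumptions of this sketch.
    """
    if "D" in path:
        # With deletions the path no longer lines up with e_cand one-to-one.
        return None
    if "I" not in path:
        return ("root", "")
    # Count insertions at the beginning of the path.
    lead = 0
    while lead < len(path) and path[lead] == "I":
        lead += 1
    # Count insertions at the end of the path (not overlapping the lead run).
    trail = 0
    while trail < len(path) - lead and path[-1 - trail] == "I":
        trail += 1
    if path.count("I") != lead + trail:
        # Insertions in the middle: possible infix or noise, not handled.
        return None
    prefix = e_cand[:lead]
    suffix = e_cand[len(e_cand) - trail:]
    if prefix and suffix:
        return ("circumfix", prefix + "-" + suffix)
    if prefix:
        return ("prefix", prefix)
    return ("suffix", suffix)


print(affix_from_path("MMMSMSMI", "ambshasa"))
print(affix_from_path("IIIIMMMMSMM", "makaabirids"))
```
]]>
      With the two paths quoted in the text, the sketch yields the suffix a and the prefix maka.</Paragraph>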
    </Section>
  </Section>
</Paper>