File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-2208_metho.xml

Size: 8,256 bytes

Last Modified: 2025-10-06 14:14:21

<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2208">
  <Title>The Automatic Extraction of Open Compounds from Text Corpora</Title>
  <Section position="3" start_page="0" end_page="1143" type="metho">
    <SectionTitle>
2 Problem Description
</SectionTitle>
    <Paragraph position="0"> It is a non-trivial task to identify a word in the text of a language which has no specific punctuation to mark word boundaries. Up to the present, lexicographers' efforts have been inhibited by insufficient corpora and limited computational facilities. Almost all lexicon knowledge bases have been created with reliance on human intuition. In recent years, a large amount of text corpora have become available, and it is now becoming possible to conduct more rigorous experiments on text corpora. We address the following problems in such a way that they can be solved by way of statistical methods.</Paragraph>
    <Paragraph position="1"> 1. There is no good evidence to support the itemization of a word in a dictionary. In traditional dictionary making, lexicographers have had to rely on citations collected by human readers from limited text corpora. More rare words than common words are found even in standard dictionaries (Church and Hanks, 1990). This is the problem of making a lexical entry list in dictionary construction. 2. It is hard to decide where to segment a string into its component words. It is also hard to enumerate words from a text, though it is reported that the accuracy of recent word segmentation methods using a dictionary and heuristic methods is higher than 95% in the case of Thai (Virach, 1993). The accuracy depends mostly on the word entries in the dictionary, and on the priority for selecting between candidate words when there is more than one solution for word segmentation. This is the problem of assigning priority information for selection.</Paragraph>
  </Section>
  <Section position="4" start_page="1143" end_page="1145" type="metho">
    <SectionTitle>
3 Word Extraction from Text Corpora
</SectionTitle>
    <Paragraph position="0"> We used a window size of 4 to 32 for n-gram data accumulation. The range is arbitrary, but it has proven sufficient to avoid collecting illegible strings.</Paragraph>
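The windowed accumulation described above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it counts every substring whose length falls in the window range, whereas the paper shifts by Thai spelling clusters rather than single characters.

```python
from collections import Counter

def collect_ngrams(text, min_len=4, max_len=32):
    """Count every substring of length min_len..max_len starting at
    every position in the text. A sketch of windowed n-gram
    accumulation; the paper's cluster-based shifting is omitted."""
    counts = Counter()
    for start in range(len(text)):
        for length in range(min_len, max_len + 1):
            # stop once the window would run past the end of the text
            if start + length > len(text):
                break
            counts[text[start:start + length]] += 1
    return counts

# small window sizes here purely for demonstration
counts = collect_ngrams("abcabcabc", min_len=2, max_len=4)
```

With real corpora the 4-to-32 window keeps the table size manageable while still capturing open compounds of realistic length.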
    <Section position="1" start_page="1143" end_page="1143" type="sub_section">
      <SectionTitle>
3.1 Algorithm
</SectionTitle>
      <Paragraph position="0"> Define |a| as the number of clusters in the string 'a', n(a) as the number of occurrences of the string 'a', and n(a+1) as the number of occurrences of the string 'a' with one additional cluster added. As the length of a string increases, the number of occurrences of that string decreases. Therefore,</Paragraph>
      <Paragraph position="2"> For the string 'a', n(a+1) decreases significantly from n(a) when 'a' is a frequently used string in contrast to 'a+1'. From this, it can be seen that 'a' is a rigid expression of an open compound when it satisfies the condition n(a+1) &lt;&lt; n(a). (2) In such a case, 'a' is considered a rigid expression that is used frequently in the text, and 'a+1' is just a string that occurs in limited contexts.</Paragraph>
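Condition (2) says only that n(a+1) must be "much smaller" than n(a); one way to make that operational is a ratio threshold. The cut-off value below is an assumed illustration, not a figure from the paper.

```python
def is_rigid(n_a, n_a_plus_1, ratio=10.0):
    """Approximate condition (2), n(a+1) much smaller than n(a),
    with a ratio test. ratio=10.0 is an assumed threshold chosen
    for illustration only."""
    return n_a > ratio * n_a_plus_1

# a sharp drop in count suggests 'a' is a rigid expression
print(is_rigid(500, 20))   # large drop
print(is_rigid(500, 400))  # small drop
```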
      <Paragraph position="4"> Since we count the occurrences of strings generated from an arbitrary position in the text, the above observation alone determines only the right end position of a rigid expression. To identify the correct starting position of a string, we apply the same observation to the leftward extension of the string. Therefore, we have to include the direction in the string observation. Further define +a as the rightward observation of the string 'a', and -a as the leftward observation of the string 'a'. Then, n(+a+1) is the number of occurrences of the string 'a' with one cluster added to its right, and n(-a+1) is the number of occurrences of the string 'a' with one cluster added to its left. Following the same reasoning as above, we obtain</Paragraph>
      <Paragraph position="6"> A string 'a' is a rigid expression if it satisfies the following conditions,</Paragraph>
      <Paragraph position="8"/>
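The bidirectional test can be sketched as a single predicate: a string is accepted only when extending it by one cluster on the right and on the left both cause a sharp drop in counts. The example strings, the dictionary lookup with a zero default for unseen extensions, and the ratio threshold are all illustrative assumptions.

```python
def is_open_compound(counts, a, right_ext, left_ext, ratio=10.0):
    """Sketch of the paired rightward/leftward conditions: 'a' is a
    rigid expression only if n(+a+1) and n(-a+1) are both much
    smaller than n(a). Unseen extensions default to count 0; the
    ratio threshold is an assumed stand-in for 'much smaller'."""
    n_a = counts.get(a, 0)
    drop_right = n_a > ratio * counts.get(right_ext, 0)
    drop_left = n_a > ratio * counts.get(left_ext, 0)
    return drop_right and drop_left

# hypothetical counts for an English-like example
counts = {"new york": 120, "new york c": 5, "a new york": 3,
          "new yor": 130}
```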
    </Section>
    <Section position="2" start_page="1143" end_page="1143" type="sub_section">
      <SectionTitle>
3.2 Data preparation
</SectionTitle>
      <Paragraph position="0"> The following are the steps for creating n-gram text data according to the fundamental features of Thai text corpora. The results are shown in Table 1 and Table 2. In each table, &amp;quot;n&amp;quot; is the number of occurrences and &amp;quot;d&amp;quot; is the difference in occurrences relative to the next string.</Paragraph>
      <Paragraph position="1">  1. Tokenize the text at locations of spaces, tabs and newline characters.</Paragraph>
      <Paragraph position="2"> 2. Produce n-gram strings following Thai spelling rules. Only strings that have possible boundaries are generated, and their occurrences counted. For example, when shifting a string from 'a+6' to 'a+7' in Table 1, the string at 'a+7' is '~z~t~ff.~' and not 'fl~g~|~'l.lfll~l\], despite the first character after 'a+6' being '~'. According to the Thai spelling rules, that character can never stand by itself: it requires both an initial consonant and a final consonant. We call this three-character unit a cluster.</Paragraph>
      <Paragraph position="3">  3. Create both rightward (Table 1) and leftward (Table 2) sorted strings. The frequency of each string is the same but the strings are lexically reversed and ordered based on this reversed state.</Paragraph>
      <Paragraph position="4"> 4. Calculate the difference between the occurrences of adjoining strings in the sorted lists. Let d(a) be the difference value of the string 'a'; then</Paragraph>
      <Paragraph position="6"> The difference value (d) is generated separately for the rightward and leftward sorted string tables.</Paragraph>
      <Paragraph position="7"> The occurrences (n) in both Table 1 and Table 2 apparently support conditions (3) and (4).</Paragraph>
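Step 4 can be sketched as a pass over one sorted table. This assumes, as in the tables, that a string's extensions sort immediately after it; the leftward table would be built the same way from lexically reversed strings, and the handling of the final entry is an assumption.

```python
def difference_values(counts):
    """Sketch of step 4: sort the strings lexically, then set
    d(a) = n(a) - n(next adjoining string) in the sorted list.
    The last string keeps d(a) = n(a) (an assumed convention).
    Run once per direction: as-is for the rightward table, and on
    reversed strings for the leftward table."""
    items = sorted(counts.items())
    diffs = {}
    for i, (s, n) in enumerate(items):
        if i + 1 == len(items):
            diffs[s] = n
        else:
            diffs[s] = n - items[i + 1][1]
    return diffs

# toy counts: 'a' extends to 'ab', which extends to 'abc'
d = difference_values({"a": 10, "ab": 7, "abc": 6, "b": 2})
```

A large d at a string marks a sharp drop to its extension, which is exactly the signal conditions (3) and (4) look for.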
    </Section>
    <Section position="3" start_page="1143" end_page="1143" type="sub_section">
      <SectionTitle>
3.3 Extraction
</SectionTitle>
      <Paragraph position="0"> According to condition (5), the string 'a' ( a~ un ) in Table 3 is considered an open compound because the difference between n(a) and n(a+1) is as high as 450. However, 'a~u~l' is an illegible string and cannot be used on an individual basis in general text. Observing the same string 'a' in Table 1, the difference between n(a) and n(a+1) is only 68, which is not high enough for the string to be selected. Therefore, we have to determine the minimum value of the difference when there is more than one branch extended from a string.</Paragraph>
      <Paragraph position="1"> In Figure 1, we obtain the string '~o~ ~ ~1~ a,lnl~a~' by observing the significant change in d just before the next string '~l~u~l~:l.l~lA~i~a{fi'. The string could be wrongly selected if we did not observe its behaviour in the leftward sorted string table to determine the correct left boundary. Thus, we observe the count of the string '~itlSg~l~\],ltll~li~,~' when it is extended leftward, as shown in Figure 2.</Paragraph>
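The minimum-difference filter discussed above can be sketched as follows. The threshold value and the representation of branches as a list of counts are assumptions for illustration; the paper's 450 and 68 examples motivate the test cases.

```python
def significant_boundary(n_a, branch_counts, min_diff=100):
    """Sketch of the Section 3.3 filter: when several candidate
    extensions branch from 'a', accept a boundary at 'a' only if the
    drop from n(a) to the largest branch count exceeds a minimum
    difference. min_diff=100 is an assumed threshold."""
    biggest_branch = max(branch_counts) if branch_counts else 0
    return n_a - biggest_branch > min_diff

# drop of 450 (accepted) versus drop of 68 (rejected), as in the text
print(significant_boundary(500, [50, 20]))
print(significant_boundary(120, [52]))
```

The same predicate would be evaluated against the leftward table before finally accepting a left boundary.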
    </Section>
    <Section position="4" start_page="1143" end_page="1145" type="sub_section">
      <SectionTitle>
an Arbitrary String
</SectionTitle>
      <Paragraph position="0"> By unifying the results of both observations, we finally obtain the word</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1145" end_page="1145" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We have applied our method to actual Thai text corpora without word segmentation preprocessing.</Paragraph>
    <Section position="1" start_page="1145" end_page="1145" type="sub_section">
      <SectionTitle>
4.1 Natural language data
</SectionTitle>
      <Paragraph position="0"> We selected the 'Thai Revenue Code (1995)', which is 705,513 bytes, and the 'Convention for Avoidance of Double Taxation between Thailand and Japan', which has a smaller size of 40,401 bytes. The purpose is to show that our method is effective regardless of the size of the data file.</Paragraph>
    </Section>
  </Section>
</Paper>