<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1239">
  <Title>Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora Hervé Déjean GREYC - CNRS - UPRESA 6072 Université de Caen - Basse Normandie</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The General Structure of Sentences
</SectionTitle>
    <Paragraph position="0"> Natural Languages are a linear object. It means that sentences are sequences of sounds. In the case of written sentences, we consider them as sequences of letters (or characters). We also consider that languages are not only sequences of sound but are structured in several structural levels. We claim that these different levels are formally indicated in the sentences. How? Since sentences are unidimensional object, a simple way is the use of boundaries indicators between the elements which composed the sentences. Applying this principle on several Languages, we find out three multilingual and hierarchical levels: the morpheme level, the chunk level and the clause level. One usehll formal criterium in the discovery of these structures is the position of words and morphemes relatively to beginnings and ends of sentences.</Paragraph>
    <Paragraph position="1"> The morpheme level is already well known in Linguistics. The morphemes are the basic elements of the structure. In this paper, we call morphemes the atTlxes of the language. These elements are discovered during the operation of words segmentation.</Paragraph>
    <Paragraph position="2"> The morphemes contain as much structural informations as grammatical words and are essential to the discovery of the syntactic structures. Section 3 explains how we list them.</Paragraph>
    <Paragraph position="3"> The higher level is the chunk one. We note that  some elements have a specific behaviour: they never occur at the beginning or at the end of the sentences. For example, the English word the never ends the sentences. There exists in all the studied languages similar elements (words or morphemes) that we can consider as indicating the beginning or the end of structures, the grammatical words as well as the morphemes are consider as boundaries indicators.</Paragraph>
    <Paragraph position="4"> We systematically consider grammatical words either as beginning indicator or as ending indicator. In practice, we tie them to their nearest lexical element (the following lexical for beginnings and preceding lexical for endings) In the same way, prefixes are considered as beginning and suffixes as ending. For example, both postpositions and inflexional suffixes are consider as ending indicators. The structures generated by these elements correspond to a lexical element (the nucleus of the chunk) surrounded by grammatical elements (words or morphemes, generally a combination of both).</Paragraph>
    <Paragraph position="5"> The chunks may be viewed as non recursive phrases. Though each chunk of the corpus has not systematically boundaries indicators, there generally exists enough chunks which are delimited in order to allow the discovery of these boundaries. The discovery of such indicators is automatically realized for a large part.</Paragraph>
    <Paragraph position="6"> The last level is the clause level. By working on boundaries indicators, we have noted that some indicators have a more specific behaviour. They mainly occur at the beginning or at the end of sentences. Furthermore, since some chunks have the same behaviour, clause boundaries are indicated either by morpheme, sole grammatical words or chunks. These elements always characterize elements of clauses: conjunctions or verbal phrases.</Paragraph>
    <Paragraph position="7"> For instance, English conjunction but begins sentences 672 times out of 760 occurrences. This behaviour is specific to clause boundaries indicators. German clause is, most of the times, dosed either by grammatical words such as her, zur~ck, verbal particles, or by verbal phrases. In Turkish, the conjunction area (but) occurs 763 times and begins 743 times. The Turkish clause is closed by verbal chunks, which implies that all the verbal morphemes (-tit, yor) axe well marked as absolute endings. All the languages which have a SOV or OSV structure offer obvious end boundaries for clauses, and languages which have VSO or VOS structure offer beginnings boundaries for clauses. These formal informations do not cover all the formal characteristics of the languages, but they offer enough informations in order to discover the different syntactic relations between chunks, and offer a good starting point in order to find specific structures of a language, the position of the finite verb in German for instance.</Paragraph>
    <Paragraph position="8"> In practice, we note that some languages privilege beginning indicators (prepositional languages as many European ones), others privilege ending indicators (postpositional languages as Turkish or Japanese) either at chunk level as at clause level, but they generally use the two methods. Some languages (Asian tonal languages) have a low number of boundaries indicators that complicates the chunks and clauses discovery. For the moment, we have stopped this study at clause level (or sequences of clause), but there perhaps exists higher levels.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Morphemes Discovery
</SectionTitle>
    <Paragraph position="0"> We now explain in details how the morphemes of a particular language are found. We refer the readers to other works dealing with this problem (Brent, Murthy, and Lunsberg, 1995), (de Marcken, 1995).</Paragraph>
    <Paragraph position="1"> Our aim is not the realization of a morphological analysis of each word of the corpus, but the production of the list of the morphemes for a given language. We do not try to discover all the morphemes contained in the corpus, since only the hundred most frequent ones are necessary in order to climb to chunks level. The method is inspired by the works of Zellig Harris. His algorithm is based on the number of different letters which follow a given sequence of letters. The increase of this number indicates a morpheme boundary. For instance, after the English sequence direc, we only find, in our corpus, one letter t. After direct, we find four letters: /, l, o, and e (directly, director, directed, direction). This increase indicates a boundary between the root (direct and the SUffLxes (-ion, -ly, -or and -ed). The algorithm works well when the corpus contains enough occurrences of a stem family. But, it may generate wrong segmentations. For example from the list started, startled, startling, the algorithm outputs this segmentation: start-ed, start-Ied~ start-ling. The errors occur when two kinds of stem families are used for the segmentation. (Harris, 1955) exposes several variations more or less complex. Their implementation does not furnish great improvements.</Paragraph>
    <Paragraph position="2"> Our idea for improving the segmentation is to divide into three steps this operation. The first step computes the list of the most frequent morphemes.</Paragraph>
    <Paragraph position="3"> The second steps extends the list by segmenting words with the help of the morphemes already generated. The third step consists in the segmentation of all the words with the morphemes obtained at the second step. The algorithm is illustrated with the suffixes segmentation, but the discovery of prefixes is totally symmetric: we just reverse the letters of</Paragraph>
    <Paragraph position="5"> the words.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The discovery of the most frequent morphemes
</SectionTitle>
      <Paragraph position="0"> The discovery of the most frequent morphemes is based on Harris algorithm. We try to find beginnings or endings of words which have the following property: after a given sequence of letters, we count the number of different letters. If this number is higher than a threshold (half of the letters of the alphabet), we arrive at a morpheme boundary, except in the case we are in the sequence which corresponds to a longer morpheme, a case we can detect. For example, before the sequence on, we found 18 different letters, thus on may be a morpheme. But 292 of these words in the corpus end with ion out of 367 which end with on. Since the longest sequence ion represents more than 50% of the word ended by on, we consider that on is a part of the morpheme -ion I. We only keep on the sequences which have a frequency higher than 100.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The discovery of other morphemes
</SectionTitle>
      <Paragraph position="0"> Once these morphemes are found, we use them in order to segment words and to find out other morphemes thanks to the following rule: For a given sequence of letters (light in Table 3.2), we check on if the next sequences of letters correspond to morphemes already found. If half of them belongs to the morphemes found (like -s -ed -ing -ly -er, then the others (-hess -en -est) are also considered as morphemes. null This algorithm also generates wrong morphemes, but the frequency of them is very low (1 or 2). Thus, we only keep on new morphemes which have a frequency higher than a given threshold (5 in practice). The morphemes with a frequency lower than this threshold are not found. The morphemes list may greatly depend on the type of corpus used. The number of morphemes depends on the morphology of the language. In Vietnamese, no morpheme is found.</Paragraph>
      <Paragraph position="1">  In English, a list of fifty morphemes is generated (Table 3). The Turkish list contains more than 500 morphemes. We note that morphemes have a similar behaviour as words: a small number of them possesses a high frequency and corresponds to the major occurrences of the corpus. We do not try to generate all the morphemes of the corpus, since the hundred most frequent morphemes are sufficient for the construction of the higher level (the chunk level). Some morphemes of the list given in Table 3 are composed of a sequence of morphemes (ful-ly, ence-s). In highly morphological languages, most of the morphemes correspond to sequence of elementary morphemes. We do not try to resegment these elements now. Because of the presence of one letter morphemes, the resegmentation inevitably lead to the segmentation of the morphemes in letters. We wait the chunk level in order to refine these morphemes (Section 4).</Paragraph>
      <Paragraph position="2">  suflLxes:-y -ward -ure -s -ry -ously -ous -ors -or -ness -ments -ment -ly -less -ively -ive -ity -ious -ions -ion -ings -ingly -ing -in -ily -ies -ic -ible -fully -ful -est -es -ers -er -ence -ences -en -ement -ements -ely -ed -e -ations -ation -ance -ances -ally -al -age -ably -able -'S prefixes: dis- in- pro- re- un-</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 The segmentation of the words
</SectionTitle>
      <Paragraph position="0"> Once the list of the morphemes is found, we use it for segmenting all the words of the corpus. We segment the words by applying the longest match algorithm: we segment each word with the longest morpheme which matches beginning or ending of the word. In order to allow the chunks discovery, there are some words which are not segmented: the most frequent ones (5% of the words). They generally cot-D~\]ean 297 Morphemes for Structures Discovery respond to grammatical words, and we do not segment them in order to make easier the chunks discovery. The following section explains how the lexical words which appear in this list are segmented. We check on the segmentation of 500 words randomly selected and we obtain 8 segmentations we consider as wrong (as compla-in, forse-en or in German word antwortest 2 segmented in antwor-test with the morpheme -test (correct in lern-test 3 , preterit 2 pers.). Harris' algorithm realizes the segmentation of words during the discovery of morphemes. The dissociation of the two phase allows a more correct segmentation. With Harris algorithm, the words startling, startled and started generate the following segmentation: start-ed, start-led, start-ling (Section 3). With our method, the segmentation is startl-ing, startl-ed and start-ed since -ling and -led are not morphemes. It may be generated some errors (as antwortest) but only for few words.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The correction of word segmentation
</SectionTitle>
    <Paragraph position="0"> segmentation We now explain how the frequent lexical words and the morphemes composed of a sequence of other morphemes are segmented. The method use the contextual informations discovered in the chunk level. During the construction of chunks, we generate bi-grams of morphemes (Table 4). We use these bi-grams in order to refine the segmentation. Each word or morpheme occurring in a context corresponding to chunk structure will be segmented. For example, the German word Hauses (house) occurring in des Hauses is segmented in des Haus.es thanks to the context des S-es 4. The algorithm is the same for sequences of morphemes. The French sequence antes is segmented is ante-s thanks to the  ~(you) answer. Antwort-en: to answer Z(you) learned. Lern-en: to learn.</Paragraph>
    <Paragraph position="1"> aS for Stem</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 The necessity of morphemes in a procedure of discovery
</SectionTitle>
    <Paragraph position="0"> procedure of discovery The morphemes level allows the emergence of structures which hardly appear at word level: structures which are marked by morphemes like the concordance structures. For example, the French structure (les-S-s S-s) or German one ( des-S-en S-es) are easily found thanks to their frequencies. Other structures are also easily found like adverb-verb structure in English, characterized by the high frequency of the bigrams (S-ly S-ed). Another useful morphemes are inflectional ones which mark relations between chunks at clause level. The relations between chunks are discover since bigrams composed of grammatical words and morphemes belonging to contiguous chunks. Frequent bigrams generally correspond to relations between two chunks (like S-ed S-ly). A positional criterium allows the elimination of bad frequent bigrams like (o\]-S S-ed) (Noun Complement - Verb sequence): since this bigram ne/rer begins a sentence, we consider that the structure is not complete and requires another chunk in order to complete the relational structure (the-S of-S S-ed).</Paragraph>
    <Paragraph position="1"> We conclude by claiming that morphemic level is essential and unavoidable in a procedure of syntactic structures discovery.</Paragraph>
  </Section>
class="xml-element"></Paper>