<?xml version="1.0" standalone="yes"?>
<Paper uid="C73-1002">
  <Title>Let A and B be sets. Then denotes: P(A) : the set of all subsets of A (the powerset of A) A* : the free monoid over A A x B: the cartesian product of A and B</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
WERNER BRECHT
</SectionTitle>
    <Paragraph position="0"> for all texts of a language. There remains only one possibility. One has to assign one or more descriptions (more than one in the case of homography) to the words and sequences of words of which every text consists. Then one can try to derive the descriptions of the text from the descriptions of its words or sequences of words.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. BASIC DEFINITIONS
2.1. Remark.
</SectionTitle>
    <Paragraph position="0"> Let A and B be sets. We use the following notation: P(A): the set of all subsets of A (the power set of A); A*: the free monoid over A; A × B: the Cartesian product of A and B; |x|: the &quot;length&quot; of x ∈ A* (the number of elements of A of which x consists); |x| ∈ ℕ.</Paragraph>
    <Paragraph position="1"> Now we define the expressions &quot;character&quot; and &quot;string&quot; for a language. We use five basic sets:</Paragraph>
    <Paragraph position="3"> From these five sets we derive: a) CHAR := LETT ∪ DIG ∪ BLANK ∪ PS ∪ SS, the set of &quot;characters&quot;. If x ∈ CHAR, we say: &quot;x is a character&quot;. b) CHAR*: the free monoid over CHAR.</Paragraph>
    <Paragraph position="4"> If x ∈ CHAR*, we say: &quot;x is a string&quot; or &quot;x is a sequence of characters&quot; or &quot;x is a text&quot;. c) LISS := LETT ∪ DIG ∪ SS, the set of &quot;characters without blank and punctuation signs&quot;. d) LISS*: the free monoid over LISS.</Paragraph>
    <Paragraph position="5"> If x ∈ LISS*, we say: &quot;x is a string without blank and punctuation signs&quot;.</Paragraph>
    <Paragraph position="6"> MORPHOLOGICAL ANALYSIS 11 2.2. Remark.</Paragraph>
    <Paragraph position="7"> [LISS ⊂ CHAR ⇒ LISS* ⊂ CHAR*] ⇒ [e ∈ LISS*, e empty element ⇒ e ∈ CHAR*, e empty element]. Now we define the expression &quot;word&quot;. 2.3. Definition.</Paragraph>
    <Paragraph position="8"> WORD1 := {x | x ∈ BLANK* ∧ |x| &gt; 0}; WORD2 := {x | x ∈ LISS* ∧ |x| &gt; 0}; WORD := WORD1 ∪ WORD2 ∪ PS. If x ∈ WORD, we say: &quot;x is a word&quot;. 2.4. Examples.</Paragraph>
    <Paragraph position="9"> ␣␣␣ ∈ WORD (␣ denoting a blank), because ␣␣␣ ∈ WORD1, |␣␣␣| = 3. WHEN ∈ WORD, because WHEN ∈ WORD2, |WHEN| = 4. ! ∈ WORD, because ! ∈ PS (|!| = 1, '!' regarded as an element of PS*). But WHEN␣ ∉ WORD, STOP! ∉ WORD, ␣! ∉ WORD.</Paragraph>
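    <Paragraph> Definition 2.3 can be checked mechanically. The following Python sketch is only an illustration: the contents of the five basic sets are not given here, so the concrete values of LETT, DIG, BLANK, PS and SS below are assumptions, and is_word is a hypothetical helper name.

```python
import string

# Assumed contents of the five basic sets (the text does not list them).
LETT = set(string.ascii_letters)
DIG = set(string.digits)
BLANK = {" "}
PS = set(".,;:!?")
SS = set("-+*/=%")

CHAR = LETT | DIG | BLANK | PS | SS
LISS = LETT | DIG | SS


def is_word(x):
    """Return True if x is an element of WORD = WORD1 U WORD2 U PS."""
    if len(x) == 0:
        return False
    in_word1 = all(c in BLANK for c in x)  # non-empty string of blanks
    in_word2 = all(c in LISS for c in x)   # non-empty string over LISS
    in_ps = len(x) == 1 and x in PS        # a single punctuation sign
    return in_word1 or in_word2 or in_ps
```

With these assumed sets, is_word("WHEN") and is_word("!") hold, while is_word("STOP!") does not, in line with the examples in 2.4.</Paragraph>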
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. THE INPUT FOR THE MORPHOLOGICAL ANALYSIS
</SectionTitle>
    <Paragraph position="0"> Our analysis will accept every x ∈ CHAR*.</Paragraph>
    <Paragraph position="1"> 3.1. Remark.</Paragraph>
    <Paragraph position="2"> x ∈ CHAR*, |x| = 0 (x = e) is a trivial case because there is nothing to analyse.</Paragraph>
    <Paragraph position="3"> Let x be a text, x ∈ CHAR*. If we want to analyse x, we say: &quot;x is the input for the analysis&quot; or for short: &quot;x is the input&quot;. 3.2. Examples.</Paragraph>
    <Paragraph position="5"/>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. THE SEGMENTATION OF THE INPUT
</SectionTitle>
    <Paragraph position="0"> We want to divide a text into a well-defined sequence of words and then remove the blanks.</Paragraph>
    <Paragraph position="1"> 4.1. Definition.</Paragraph>
    <Paragraph position="2"> Let segm1 be a mapping between CHAR* and (P(WORD))*</Paragraph>
    <Paragraph position="4"> 4.2. Example.</Paragraph>
    <Paragraph position="5"> Let x be: x := GOOD␣␣␣DAY!</Paragraph>
    <Paragraph position="7"> Let x be: x := w_1 w_2 ... w_m ∈ CHAR*. Then segm1(x) = y. c) Hence segm1 is a bijection.</Paragraph>
    <Paragraph position="8"> 4.4. Definition.</Paragraph>
    <Paragraph position="9"> Let segm2 be a mapping between (P(WORD))* and (P(WORD))*</Paragraph>
    <Paragraph position="11"> k_i ∈ {1, 2, ..., m} (i = 1, 2, ..., n) ∧ 1 ≤ k_1 &lt; k_2 &lt; ... &lt; k_n ≤ m. We change our notation u_i := w_{k_i} (i = 1, 2, ..., n) and get</Paragraph>
    <Paragraph position="13"/>
    <Paragraph position="15"/>
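    <Paragraph> The two segmentation steps can be sketched in Python. This is a hedged illustration and not the original definition, which is not reproduced here: it assumes that segm1 cuts a text into maximal runs of blanks, maximal runs of LISS characters and single punctuation signs, that segm2 drops the blank words, and that segm is the composition of the two. The character sets and all function names are assumptions.

```python
import string
from itertools import groupby

# Assumed contents of the basic sets (the text does not list them).
LETT = set(string.ascii_letters)
DIG = set(string.digits)
BLANK = {" "}
PS = set(".,;:!?")
SS = set("-+*/=%")
LISS = LETT | DIG | SS


def char_class(c):
    """Classify one character as blank, LISS character or punctuation sign."""
    if c in BLANK:
        return "blank"
    if c in LISS:
        return "liss"
    if c in PS:
        return "ps"
    raise ValueError(f"not an element of CHAR: {c!r}")


def segm1(text):
    """Map a text to a sequence of one-element sets of words (assumed rule)."""
    words = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "ps":
            words.extend(run)   # every punctuation sign is a word of its own
        else:
            words.append(run)   # a maximal run of blanks or LISS characters
    return tuple(frozenset([w]) for w in words)


def segm2(seq):
    """Remove the elements that consist of blanks only."""
    def is_blank_word(w):
        return all(c in BLANK for c in w)
    return tuple(s for s in seq if not any(is_blank_word(w) for w in s))


def segm(text):
    """Assumed composition: first segm1, then segm2."""
    return segm2(segm1(text))
```

For the text of example 4.2, segm1 yields {GOOD} {␣␣␣} {DAY} {!} and segm yields {GOOD} {DAY} {!}. Concatenating the words returned by segm1 gives back the input, which is the sense in which this sketch of segm1 loses no information.</Paragraph>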
  </Section>
  <Section position="6" start_page="0" end_page="17" type="metho">
    <SectionTitle>
5. REMARKS TO THE CONCEPT &quot;LEXICON&quot;
</SectionTitle>
    <Paragraph position="0"> By using the expression &quot;lexicon&quot;, all actions of identifying and describing words, sentences and texts can be concentrated in a single concept.</Paragraph>
    <Paragraph position="1"> In a formal sense any lexicon is a set of &quot;items&quot;.</Paragraph>
    <Paragraph position="2"> Definition.</Paragraph>
    <Paragraph position="3"> LEX := {(w, B)_λ}, λ ∈ Λ (Λ: any index set). The pair (w, B)_λ is called an &quot;item&quot; of the lexicon. For every item the following holds:</Paragraph>
    <Paragraph position="5"> a language.</Paragraph>
    <Paragraph position="6"> c) B is any description of w. Let 𝔅 be the set of all intended descriptions of all sequences of words and punctuation signs of some language. Then LEX ⊂ (P(WORD))* × 𝔅 such that (w, B) ∈ LEX :⇔ a), b) and c) hold. LEX is a relation between (P(WORD))* and 𝔅. In general a sequence {w_1} {w_2} ... {w_m} has more than one description (by homography, for example). Hence LEX cannot be a map between (P(WORD))* and 𝔅.</Paragraph>
    <Paragraph position="7"> There are two ways to define a lexicon.</Paragraph>
    <Paragraph position="8"> a) The extensional definition. All the elements of the lexicon are listed. In this case we often call such a lexicon a &quot;list&quot;.  u_i ∈ WORD2 ∪ PS (i = 1, 2, ..., n).</Paragraph>
    <Paragraph position="9"> Let {t_1, t_2, ..., t_p} ⊆ {1, 2, ..., n}; p ≥ 1. Let k be a map k: {t_1, t_2, ..., t_p} → {t_1, t_2, ..., t_p} such that k is one-to-one and onto.</Paragraph>
    <Paragraph position="10"> Then (k(t_1), k(t_2), ..., k(t_p)) is a permutation of (t_1, t_2, ..., t_p). We call {u_k(t_1)} {u_k(t_2)} ... {u_k(t_p)} a &quot;subsequence&quot; of {u_1} ... {u_n}</Paragraph>
    <Paragraph position="12"> are subsequences of {u_1} {u_2} {u_3}.  Definition.</Paragraph>
    <Paragraph position="13"> Let T be the set of all subsequences (derived in the manner shown above) of a given x = {u_1} {u_2} ... {u_n}. Let t ∈ T be such a subsequence.</Paragraph>
    <Paragraph position="14"> Then a &quot;morphological analysing step&quot; related to the subsequence t (for short: mas_t) is a relation between {t} and LEX. mas_t ⊂ {t} × LEX such that (t, (w, B)) ∈ mas_t :⇔ t = w. Case 1. mas_t ≠ ∅. Then we say: we have identified the subsequence t in our lexicon and all related B's are descriptions of t. Case 2. mas_t = ∅. Then we say: our lexicon does not (yet) contain the subsequence t. We are not able to give any description of t.</Paragraph>
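    <Paragraph> The notions of subsequence and morphological analysing step can be illustrated with a small Python sketch. It rests on simplifying assumptions: each one-element set {u_i} is written as a plain string, a sequence is a tuple of strings, a description B is a string, and the toy lexicon LEX and the names subsequences and mas are hypothetical.

```python
from itertools import combinations, permutations


def subsequences(x):
    """All subsequences of x in the sense used here.

    Choose p positions (p at least 1) and arrange them in any order.
    """
    n = len(x)
    result = set()
    for p in range(1, n + 1):
        for positions in combinations(range(n), p):
            for order in permutations(positions):
                result.add(tuple(x[i] for i in order))
    return result


def mas(t, lex):
    """mas_t: all pairs (t, (w, B)) with (w, B) in the lexicon and w == t."""
    return {(t, (w, b)) for (w, b) in lex if w == t}


# Toy lexicon with placeholder descriptions (an assumption for illustration).
LEX = {
    (("lauf",), 'description of "lauf"'),
    (("zu",), 'first description of "zu"'),
    (("zu",), 'second description of "zu"'),
    (("zu", "lauf"), 'description of "zulauf"'),
}
```

Here mas(("zu",), LEX) contains two pairs (Case 1, with two descriptions of the same word), while mas(("Hund",), LEX) is empty (Case 2). Because every ordering of the chosen positions counts, the number of subsequences grows quickly with n, so the sketch is only suitable for short sequences.</Paragraph>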
  </Section>
  <Section position="7" start_page="17" end_page="17" type="metho">
    <SectionTitle>
7. THE MORPHOLOGICAL ANALYSIS
</SectionTitle>
    <Paragraph position="0"> The concept of the &quot;morphological analysing step&quot; is related to one and only one subsequence t ∈ T.</Paragraph>
    <Paragraph position="1"> The concept of the &quot;morphological analysis&quot;, however, is more general.</Paragraph>
    <Paragraph position="2"> Let x ∈ segm(CHAR*), x ≠ e_P, x = {u_1} {u_2} ... {u_n}. Let T' be a subset of T (T' ⊂ T) such that for every {u_i} (i = 1, 2, ..., n) there exists at least one t ∈ T' which contains {u_i}. 7.1. Definition.</Paragraph>
    <Paragraph position="3">  A &quot;morphological analysis related to T'&quot; (for short: ma_T') of {u_1} {u_2} ... {u_n} is the set of all mas_t such that t ∈ T'. ma_T' := {mas_t | t ∈ T'}. Remark.</Paragraph>
    <Paragraph position="4"> Let i ∈ {1, 2, ..., n}. Let t_i be a subsequence of {u_1} {u_2} ... {u_n} containing {u_i}. Let T_i be the set of all t_i.</Paragraph>
    <Paragraph position="5"> We say that our analysis failed if there exists one {u_i} such that mas_{t_i} = ∅ for all t_i ∈ T_i. Otherwise we say that our analysis was successful. In general there is more than one T' such that ma_T' is successful. It is left to the user to fix the sets T' according to his particular intentions and possibilities.</Paragraph>
    <Paragraph position="6"> 7.2. Definition.</Paragraph>
    <Paragraph position="7"> Let AT be the set of all T' (T' defined as above). A &quot;morphological analysis&quot; (for short: ma) is the set of all mas_t such that there exists a T' ∈ AT with t ∈ T'. ma := {mas_t | ∃ T' ∈ AT ∧ t ∈ T'}. Remark.</Paragraph>
    <Paragraph position="8"> Let z ∈ CHAR* be a text such that there exists an x = segm(z) with x ≠ e_P. Then in practice we say: ma is a morphological analysis of the text z.</Paragraph>
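    <Paragraph> A morphological analysis related to T' and the success criterion of the remark above can be sketched as follows. This is an illustration under assumptions: words are plain strings, subsequences are tuples, ma_T' is stored as a dictionary from t to mas_t instead of a set of sets, and the names mas, ma_related_to and is_successful are hypothetical. Words are compared by value, so repeated words in x are not distinguished by position.

```python
def mas(t, lex):
    """mas_t: all pairs (t, (w, B)) with (w, B) in the lexicon and w == t."""
    return {(t, (w, b)) for (w, b) in lex if w == t}


def ma_related_to(t_prime, lex):
    """ma_T': one analysing step mas_t for every t in T'."""
    return {t: mas(t, lex) for t in t_prime}


def is_successful(x, analysis):
    """True if every word of x lies in some t whose step mas_t is non-empty."""
    return all(
        any(u in t and len(steps) > 0 for t, steps in analysis.items())
        for u in x
    )
```

For example, with x = ("ein", "Hund") and T' = {("ein",), ("Hund",)}, the analysis is successful only if the lexicon holds an item for each of the two words. A different T', such as {("ein", "Hund")}, succeeds only if the lexicon holds an item for the whole sequence, which shows why the choice of T' is left to the user.</Paragraph>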
    <Paragraph position="9"> 8. EXAMPLE FOR A PRACTICAL MORPHOLOGICAL ANALYSIS This example shows the practice of a morphological analysis of a German text and has indeed been programmed in Bonn to be the basis of the above-mentioned syntax analysis.</Paragraph>
    <Paragraph position="10"> Let x ∈ segm(CHAR*), x ≠ e_P, x = {u_1} {u_2} ... {u_n}. Let T' be the following set:</Paragraph>
    <Paragraph position="12"/>
  </Section>
  <Section position="8" start_page="17" end_page="17" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> Then holds: mas_{t_i} ⊂ {{u_i}} × LEX (i = 1, 2, ..., n) such that ({u_i}, (w, B)) ∈ mas_{t_i} :⇔ w = {u_i}. This simple case of a morphological analysis we call a 'word-by-word-analysis' of a given text.</Paragraph>
    <Paragraph position="1"> We get ma_T' = {mas_{t_1}, mas_{t_2}, ..., mas_{t_n}}. Each mas_{t_i} (i = 1, 2, ..., n) is a set too. Hence we have to write: ma_T' = { {({u_1}, (w_1, B_11)), ..., ({u_1}, (w_1, B_1L_1))}, {({u_2}, (w_2, B_21)), ..., ({u_2}, (w_2, B_2L_2))}, ..., {({u_n}, (w_n, B_n1)), ..., ({u_n}, (w_n, B_nL_n))} }</Paragraph>
    <Paragraph position="2"> For short we can write: ma_T': [{u_i} ↔ ((w_i, B_i1), ..., (w_i, B_iL_i)) (i = 1, 2, ..., n)]. Or: ma_T': [{u_i} ↔ (B_i1, B_i2, ..., B_iL_i) (i = 1, 2, ..., n)]. Now one can see that the result of a word-by-word-analysis can easily be represented with the following matrix concept:
      u_1, B_11, B_12, ..., B_1L_1
      u_2, B_21, B_22, ..., B_2L_2
      ...
      u_n, B_n1, B_n2, ..., B_nL_n</Paragraph>
    <Paragraph position="3"> In our syntax-analysis in Bonn a great deal of the morphological analysis is done by word-by-word-analysis. We are successful in describing articles, nouns, adverbs, adjectives and so on, but we have some trouble with our verbs.</Paragraph>
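    <Paragraph> The matrix representation of a word-by-word-analysis can be sketched in Python. The sketch assumes that words are plain strings, that lexicon entries for single words are keyed by one-element tuples, and that descriptions are placeholder strings; the toy lexicon and the name word_by_word are hypothetical.

```python
def word_by_word(x, lex):
    """Return one row [u_i, B_i1, ..., B_iL_i] for every word u_i of x.

    A row that holds only the word itself means the lexicon has no item
    for that word.
    """
    rows = []
    for u in x:
        descriptions = sorted(b for (w, b) in lex if w == (u,))
        rows.append([u] + descriptions)
    return rows


# Toy lexicon with placeholder descriptions (an assumption for illustration).
LEX = {
    (("zu",), "description 1 of zu"),
    (("zu",), "description 2 of zu"),
    (("lauf",), "description of lauf"),
}
```

Each row may have a different length L_i, so the result is a ragged matrix rather than a rectangular one. Sorting the descriptions only makes the output reproducible; the order of the B_ij carries no meaning in the definition.</Paragraph>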
    <Paragraph position="4"> In the German language the prefix of some verbs may be found far away from the stem of the verb.</Paragraph>
    <Paragraph position="5"> Example.</Paragraph>
    <Paragraph position="6"> The verbs zulaufen and laufen are two quite different verbs. We will regard the following three German sentences:  1) Ein Hund ist mir zugelaufen.</Paragraph>
    <Paragraph position="7"> 2) Lauf mir nur nicht zu.</Paragraph>
    <Paragraph position="8"> 3) Zu ist er mir gelaufen.</Paragraph>
    <Paragraph position="9">  A word-by-word-analysis will succeed only with sentence 1). In 2) and 3) we will find the verb laufen instead of zulaufen. That means that we get a wrong description of our verb and a wrong description of zu, which also exists in the German language without any relation to a verb. Therefore a word-by-word-analysis cannot be used to analyse our verbs.</Paragraph>
    <Paragraph position="10"> In our analysis we distinguish between two parts of the lexicon. The first one allows word-by-word-analysis; it is intensionally defined for proper names, nouns and adjectives and extensionally defined for all other words except verbs. The extensionally defined part of this lexicon consists at this time of nearly 2000 items. The second one is our verb-lexicon, which is intensionally defined. There exists an extensionally defined verb-stem-lexicon which contains at this moment the stems, with their prefixes, of nearly 400 German verbs. This stem-lexicon is quickly increasing and is coded in the following manner: [{stem} [description of &quot;stem&quot;]] [{prefix} {stem} [description of &quot;prefix stem&quot;]] [{lauf} [description of &quot;lauf&quot;]] [{zu} {lauf} [description of &quot;zulauf&quot;]] We start the morphological analysis with a word-by-word-analysis. If our analysis was successful, we have obtained at least one description for each word of the text. Some of these descriptions may be wrong, because of homography and because of the verbs. We cannot solve the homography problem in this early part of the analysis. If we have identified a word as a verb, we look whether the same sentence contains a word which can be a prefix of this verb. If we find a possible prefix, the verb gets the descriptions resulting from the prefix as well as the descriptions without this prefix. Working in this way we get a lot of information for the words of our text. Some information is wrong but we can be sure that the right information</Paragraph>
  </Section>
  <Section position="9" start_page="17" end_page="17" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> is among the descriptions. It is left to the syntax (or maybe to the semantics) to isolate the right descriptions.</Paragraph>
    <Paragraph position="1"> Formally we can describe the verb-analysis as a set of mas_t such that t := {u'} {u''} where {u''} has been recognized as a verb and {u'} can be every word (other than {u''}) of the same sentence in which {u''} occurs.</Paragraph>
    <Paragraph position="2"> Given some text {u_1} {u_2} ... {u_n}.</Paragraph>
    <Paragraph position="3"> Given a word-by-word-analysis which shows that {u_i} may be a verb.</Paragraph>
    <Paragraph position="4"> T' := {t | t = {u'} {u_i} ∧ u' ∈ {u_1, ..., u_{i-1}, u_{i+1}, ..., u_n}}. Then holds: mas_t ⊂ {{u'} {u_i}} × LEX such that ({u'} {u_i}, (w, B)) ∈ mas_t :⇔ w = {u'} {u_i}. One might call this procedure a &quot;two-word-analysis&quot;. We can imagine a &quot;three-word-analysis&quot; and so on too, but up to now in our practice in Bonn the morphological analysis consists only of a &quot;word-by-word-analysis&quot; and a &quot;two-word-analysis&quot; in the manner shown above.</Paragraph>
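    <Paragraph> The two-word-analysis can be sketched in Python along the lines of the formal description above. The sketch is an illustration under assumptions: words are plain strings that already match the stems of the stem-lexicon (how inflected forms are reduced to stems is not covered here), descriptions are placeholder strings, and the names two_word_analysis, verb_descriptions and the toy lexicon are hypothetical.

```python
def two_word_analysis(sentence, verb_index, lex):
    """Look up every pair t = (u', u_i), u_i being the candidate verb.

    u' runs over all other words of the sentence. Only pairs with a
    non-empty step mas_t are returned, as a dictionary from t to mas_t.
    """
    verb = sentence[verb_index]
    result = {}
    for j, other in enumerate(sentence):
        if j == verb_index:
            continue
        t = (other, verb)
        steps = {(t, (w, b)) for (w, b) in lex if w == t}
        if steps:
            result[t] = steps
    return result


def verb_descriptions(sentence, verb_index, lex):
    """Descriptions of the verb without a prefix plus those with a prefix."""
    verb = sentence[verb_index]
    found = {b for (w, b) in lex if w == (verb,)}
    for steps in two_word_analysis(sentence, verb_index, lex).values():
        found |= {b for (_t, (_w, b)) in steps}
    return found


# Toy stem-lexicon coded like the entries quoted earlier (placeholders only).
LEX = {
    (("lauf",), 'description of "lauf"'),
    (("zu", "lauf"), 'description of "zulauf"'),
}
```

For sentence 2), written as ("lauf", "mir", "nur", "nicht", "zu") with the verb at index 0, the only pair found in this toy lexicon is ("zu", "lauf"), and the verb receives both the description of "lauf" and the description of "zulauf". Choosing between them is left to later stages, as stated above.</Paragraph>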
  </Section>
</Paper>