File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/96/c96-2202_intro.xml
Size: 4,945 bytes
Last Modified: 2025-10-06 14:06:03
<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2202">
<Title>Word Extraction from Corpora and Its Part-of-Speech Estimation Using Distributional Analysis</Title>
<Section position="3" start_page="0" end_page="1119" type="intro">
<SectionTitle> 2 Hypothesis </SectionTitle>
<Paragraph position="0"> In this section, we first define the environment of a string occurring in a corpus. We then propose a hypothesis that provides the foundation for our word extraction method.</Paragraph>
<Section position="1" start_page="0" end_page="1119" type="sub_section">
<SectionTitle> 2.1 Environment of a String in a Corpus </SectionTitle>
<Paragraph position="0"> We define the &quot;environment&quot; of a type (for example, a character string or a group of morphemes) as the probability distribution of the elements preceding and following occurrences of that type in a corpus. The elements that precede the type are described by the left probability distribution, and those that follow it by the right probability distribution. For instance, Table 1 shows the one-character environment of the string &quot;.~-L&quot; in the EDR corpus (Jap, 1993). This string occurs 181 times, with 12 different characters appearing to its left and 10 to its right.</Paragraph>
<Paragraph position="1"> In general, a probability distribution can be regarded as a vector, and the concatenation of two vectors is also a vector. Thus, the concatenation of the left and right probability distributions for a type is what we call the &quot;environment&quot; of that type, and we denote it by D in the rest of this paper.</Paragraph>
<Paragraph position="2"> freq. prob. str. str. freq. 
prob.</Paragraph>
<Paragraph position="3">
13 7.2% , :~b v,, 16 8.9%
6 3.3% o &lt; 3 1.6%
13 7.2% ~ ~ 8 4.4%
10 5.6% ~ ~: 10 5.6%
8 4.4% &lt;&quot; J\[ 7 3.8%
14 7.8% if- &amp; 41 22.6%
19 10.4% m ~ 38 21.0%
4 2.2% ?&amp; ab 16 8.9%
7 3.8% ~ % 4 2.2%
4 2.2% C/0 L/ 38 21.0%
83 45.9%
181 100.0% total 181 100.0%
</Paragraph>
</Section>
<Section position="2" start_page="1119" end_page="1119" type="sub_section">
<SectionTitle> 2.2 Hypothesis Concerning Environment </SectionTitle>
<Paragraph position="0"> In general, if a string a is a word that belongs to a POS, the environment D(a) of the string in a particular corpus is expected to be similar to the environment D(pos) of that POS. Since a word can belong to more than one POS, the environment of the string is expected to be similar to the sum, over all POSs, of the environment of each POS multiplied by the probability that the string occurs as that POS.</Paragraph>
<Paragraph position="1"> Therefore, we obtain the following formula:</Paragraph>
<Paragraph position="2"> $$D(a) \approx \sum_{k} p(\mathrm{pos}_k \mid a)\, D(\mathrm{pos}_k) \qquad (1)$$</Paragraph>
<Paragraph position="3"> where p(pos_k|a) is the probability that the string a belongs to pos_k, and D(pos_k) is the environment of pos_k. In this formula, the summation is taken over the set of POSs under consideration.</Paragraph>
<Paragraph position="4"> As an example, let us take the string &quot;~.1_.&quot;, which is used in the corpus only as a verb and an adjective. If p(Adj|~.1_.) and p(Verb|~.1_.) are the probabilities that a particular instance of the string is used as an adjective and as a verb, respectively, then the environment of the string is described by the following formula: D(~.1_.) ≈ p(Adj|~.1_.)D(Adj) + p(Verb|~.1_.)D(Verb). In most cases, however, formula (1) cannot be solved as a linear equation, since the dimension of the probability distribution vector D is greater than the number of independent variables. In addition, we need to minimize the effects of the sample bias inherent in statistical estimates of this sort.
We therefore recast the problem as finding the set of p(pos_k|a) that minimizes the difference between the two sides of formula (1) under some measure. As this measure, we use the squared Euclidean distance between the vectors. The problem is thus formalized as a minimization problem. The decision variables are the elements of the probability distribution vector p, which expresses the likelihood that the string is used as each POS:</Paragraph>
<Paragraph position="5"> $$p = (p_1, \ldots, p_n), \qquad \min_{p \in V} F(p), \qquad F(p) = \left\| D(a) - \sum_{k} p_k\, D(\mathrm{pos}_k) \right\|^2$$</Paragraph>
<Paragraph position="6"> Here, p_k stands for p(pos_k|a), and n is the number of POSs under consideration. Since each element of p represents a probability, the feasible region V is given as follows:</Paragraph>
<Paragraph position="7"> $$V = \left\{ p \;\middle|\; \sum_{k} p_k = 1,\ p_k \geq 0 \ (k = 1, \ldots, n) \right\}$$</Paragraph>
<Paragraph position="8"> The minimum value of F(p) will be relatively small when the environment of the string can be decomposed into a weighted sum of POS environments, and relatively large when no such decomposition exists. Since every true word must belong to one or more POSs, the minimum value of F(p) can be used to decide whether a string is a word. We call this value the &quot;word measure&quot; and accept as words all strings whose word measure is less than a certain threshold.</Paragraph>
</Section>
</Section>
</Paper>
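The definitions in Sections 2.1 and 2.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`environment`, `project_to_simplex`, `word_measure`), the choice of projected gradient descent as the solver, the handling of corpus boundaries, and the toy vectors are all assumptions made for illustration. The sketch builds the environment of a string as the concatenation of its left and right one-character distributions, then minimizes F(p) over the probability simplex V and returns the minimum as the word measure.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def environment(corpus: str, s: str, alphabet: Sequence[str]) -> Vector:
    """Concatenated left/right one-character distributions of s in corpus.

    Assumption: occurrences at the corpus boundary, or next to a character
    outside `alphabet`, contribute nothing on that side.
    """
    left = {c: 0 for c in alphabet}
    right = {c: 0 for c in alphabet}
    start = corpus.find(s)
    while start != -1:
        end = start + len(s)
        if start > 0 and corpus[start - 1] in left:
            left[corpus[start - 1]] += 1
        if end < len(corpus) and corpus[end] in right:
            right[corpus[end]] += 1
        start = corpus.find(s, start + 1)

    def normalize(counts: dict) -> Vector:
        total = sum(counts.values())
        return [counts[c] / total if total else 0.0 for c in alphabet]

    return normalize(left) + normalize(right)


def project_to_simplex(v: Sequence[float]) -> Vector:
    """Euclidean projection of v onto {p : p_k >= 0, sum(p) = 1}."""
    cumulative, theta = 0.0, 0.0
    for i, u_i in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_i
        t = (cumulative - 1.0) / i
        if u_i - t > 0:
            theta = t  # keep the shift from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def squared_distance(d_a: Vector, pos_envs: Sequence[Vector], p: Vector) -> float:
    """F(p) = || D(a) - sum_k p_k D(pos_k) ||^2."""
    total = 0.0
    for j, target in enumerate(d_a):
        mix = sum(p_k * env[j] for p_k, env in zip(p, pos_envs))
        total += (target - mix) ** 2
    return total


def word_measure(d_a: Vector, pos_envs: Sequence[Vector],
                 steps: int = 2000) -> Tuple[float, Vector]:
    """Minimize F(p) over the simplex by projected gradient descent.

    Assumption: this solver is a stand-in; the source does not say which
    optimization method is used.
    """
    n = len(pos_envs)
    p = [1.0 / n] * n
    # 2 * (squared Frobenius norm) bounds the gradient's Lipschitz constant.
    lipschitz = 2.0 * sum(x * x for env in pos_envs for x in env)
    step = 1.0 / lipschitz if lipschitz > 0 else 1.0
    for _ in range(steps):
        residual = [
            sum(p_k * env[j] for p_k, env in zip(p, pos_envs)) - d_a[j]
            for j in range(len(d_a))
        ]
        grad = [2.0 * sum(r * x for r, x in zip(residual, env)) for env in pos_envs]
        p = project_to_simplex([p_k - step * g for p_k, g in zip(p, grad)])
    return squared_distance(d_a, pos_envs, p), p


if __name__ == "__main__":
    # Hypothetical POS environments (left distribution + right distribution).
    d_adj = [1.0, 0.0, 0.8, 0.2]
    d_verb = [0.0, 1.0, 0.2, 0.8]
    mixed = [0.3, 0.7, 0.38, 0.62]   # 0.3 * d_adj + 0.7 * d_verb
    print(word_measure(mixed, [d_adj, d_verb]))
    print(word_measure([0.5, 0.5, 0.9, 0.1], [d_adj, d_verb]))
```

In this toy setting, a vector that is an exact mixture of the two POS environments should yield a word measure near zero, whereas a vector that cannot be decomposed this way yields a clearly larger value. That gap is what a threshold on the word measure would exploit.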