<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1068">
  <Title>Analyzing Language in Restricted Domains: Sublanguage Description and Processing. Lawrence Erlbaum Assoc.,</Title>
  <Section position="6" start_page="347" end_page="349" type="evalu">
    <SectionTitle>
5 Lewis and Croft \[7\] define the syntactic phrase as &quot;any pair of
</SectionTitle>
    <Paragraph position="0"> non-function words in a sentence that are heads of syntactic structures connected by a grammatical relation.&amp;quot; 6 Partial processing may include tagging and/or a limited parsing, see, for example \[7\], and also \[9\] for a more comprehensive view.  SKN'r~Nc~: The techniques are discussed and related to a general tape manipulation routine.</Paragraph>
    <Paragraph position="1">  may not be straightforwardly applicable to the analysis of text where the basic tokens are words of natural language. Church and Hanks \[2\] used Fano's mutual information to compute word co-occurrence patterns in a 44 million word corpus of Associated Press news stories, but they also noted that this measure often produces counterintuitive results. The reason is that the observed frequencies of many words remain low even in very large corpora. For very small counts the mutual information becomes unstable and fails to produce credible results. ~ Ideally, a measure of relation between words should be stable even at low counts and more sensitive to fluctuations in frequency among different words. We are particularly interested in the low and medium frequency words became of their high indexing value. An interesting comparison among different functions used to study word co-occurrences in the Longmen dictionary is presented by Wilks et al. \[11\]. They assumed that the best function would most closely reflect a correlation between a chance co-occurrence and a minimum relatedness between words, on the one hand, and between the maximum observed frequency of co-occurrence and a maximum relatedness, on the other, s Another question is whether the relatedness measure should be symmetric. In other words, for any given pair of words, can we assume that they contribute equally to their mutual relationship7 We felt that the words making up a syntactic phrase do not contribute equally to the informational value of the phrase and that their contributions depend upon the distribution characteristics of each word within a particular type of text. For example, in a general computer science text the information attached to the phrase parallel system is more significantly related to the word * This may be contrasted with a distribution of symbols from a small finim alphabet.</Paragraph>
    <Paragraph position="2"> s A chance co-occurrence of a pair of words is when the probability of their occurring together is the product of the probabilities of their being observed independently. Two words have the largest possible frequency of co-occunence if they never occur separately. Unfo~onately, a chance co-occurrence is very difficult to observe.</Paragraph>
    <Paragraph position="3"> parallel than to the word system. This relationship can change if the phrase is found in a different type of text where parallel is more commonplace than system, for example, in a text from a parallel computation subdomain.</Paragraph>
    <Paragraph position="4"> Based on these considerations, we introduce an asymmetric measure of informational contribution of words in syntactic phrases. This measure IC (x, \[x,y \]) is based on (an estimate of) the conditional probability of seeing a word y to the right of the word x, 9 modified with a dispersion parameter for x. The dispersion parameter, d,, understood as the number of distinct words with which x is paired, has been defined as follows (f~y is the observed frequency of the pair \[x,y\]): Y where iff~y&gt;O For each word x occurring in any of the selected syntactic phrases, the informational contribution of this word in a pair of words Ix, y\] is calculated according to the following formula: lC(x, \[x,y \])= far th+d,- 1 where n z is the number of pairs in which x occurs at the same position as in Ix, y\]. IC(x, Ix, y\]) takes values from the &lt;0,1&gt; interval; it is equal to 0 when x and y never occur together (i.e., fay = 0), and it is equal to 1 when x occurs only with y (i.e., fay = nx and d~ = 1). Empirical tests with this formula on the CACM-3204 collection give generally satisfactory results, and a further improvement may be possible if larger corpora are used (perhaps 1 million words or more). For each pair of words Ix,y\] two informational contribution values are calculated: IC(x, \[x,y \]) and IC(y, \[x,y\]), and they may differ considerably as seen in Table 1.1deg The relative similarity between any two words is measured in terms of their occurrence in common contexts end is the sum of the informational contributions of the context weighted with the informational contribution of the less significant of the two words. A partial similarity for words xl and x2 in the context of another word y is therefore given as: si, n,(xx,xz) = p,(xl,x2) (IC (y, \[xx,y \]) + IC (y, \[x2,y \])) where p,(xl,X2) = min (IC (xi,\[x x,y \]),IC (x2, \[x2,y \])) The total similarity between two words xx and x2 is given as a sum of all partial similarities, normalized with a logarithmic function. null</Paragraph>
    <Paragraph position="6"> We calculated the similarity measure for any two words which occurred in at least two common contexts, that is, those which have been paired with a common word in at least two distinct occasions. The results are summarized in Tables 1 to 3. In Table  and 3 show the top elements in the similarity classes generated for words graramar and memory. We noted that the similarity value of about 2.0 or more usually coincided with a high degree of correlation in meaning, while the smaller values were generally less interesting.</Paragraph>
    <Paragraph position="7"> This first classification can be further improved by distinguishing among word senses. Many words have multiple senses, and these, rather than the lexical words themselves, should be used in indexing a text. However, obtaining a right dissociation between different senses of a word presents a separate research problem which is beyond the scope of this paper.</Paragraph>
  </Section>
class="xml-element"></Paper>