<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1052">
  <Title>CONCEPTUAL ASSOCIATION FOR COMPOUND NOUN ANALYSIS</Title>
  <Section position="4" start_page="0" end_page="337" type="metho">
    <SectionTitle>
METHOD
</SectionTitle>
    <Paragraph position="0"> Extraction Process: The corpus used to collect information about compound nouns (CNs) consists of some 7.8 million words from Grolier's multimedia on-line encyclopedia. The University of Pennsylvania morphological analyser provides a database of more than 315,000 inflected forms and their parts of speech. The Grolier's text was searched for consecutive words listed in the database as always being nouns and separated only by white space. This prevented comma-separated lists and other non-compound noun sequences from being included. However, it also eliminated many CNs from consideration, because many nouns are occasionally used as verbs and are thus ambiguous for part of speech. The search yielded 35,974 noun sequences, of which all but 655 were pairs. The first 1000 of the sequences were examined manually to check that they were not incidentally adjacent nouns (as in direct and indirect objects, say). Only 2% did not form CNs, thus establishing a reasonable utility for the extraction method. The pairs were then used as a training set, on the assumption that a two word noun compound is unambiguously bracketed [footnote 1]. Thesaurus Categories: The 1911 version of Roget's Thesaurus contains 1043 categories, with an average of 34 single word nouns in each. These categories were used to define concepts in the sense of Resnik and Hearst (1993). Each noun in the training set was tagged with a list of the categories in which it appeared [footnote 2]. All sequences containing nouns not listed in Roget's were discarded from the training set.</Paragraph>
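The extraction step can be illustrated with a minimal Python sketch. It is not the original implementation: the function name, the toy noun set standing in for the morphological database, and the punctuation handling (a noun followed by punctuation may end a run but not continue it) are all assumptions made for illustration.

```python
# Toy stand-in for "words listed in the database as always being nouns" (hypothetical).
ALWAYS_NOUNS = {"pottery", "coffee", "mug", "apples", "pears", "stone", "wall"}


def extract_noun_sequences(text, always_nouns):
    """Return runs of two or more always-noun words separated only by white space."""
    sequences, run = [], []

    def flush():
        # Keep the current run only if it contains at least two nouns.
        if len(run) >= 2:
            sequences.append(tuple(run))
        run.clear()

    for token in text.split():
        core = token.rstrip(".,;:!?")
        ends_run = core != token  # trailing punctuation closes the run (assumption)
        if core.isalpha() and core.lower() in always_nouns:
            run.append(core.lower())
            if ends_run:
                flush()
        else:
            flush()
    flush()
    return sequences


sample = "The pottery coffee mug broke. We bought apples, pears and a stone wall."
sequences = extract_noun_sequences(sample, ALWAYS_NOUNS)
# Only two-word sequences would go into the training set.
training_pairs = [seq for seq in sequences if len(seq) == 2]
```

In this toy input the comma after "apples" keeps "apples, pears" out of the result, mirroring the stated aim of excluding comma-separated lists.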
    <Paragraph position="1"> Gathering Associations: The remaining 24,285 pairs of category lists were then processed to find a conceptual association (CA) between every ordered pair of thesaurus categories (t1, t2) using the formula below. CA(t1, t2) is the mutual information between the categories, weighted for ambiguity. It measures the degree to which the modifying category predicts the modified category and vice versa. When categories predict one another, we expect them to be attached in the syntactic analysis.</Paragraph>
    <Paragraph position="2"> Let AMBIG(w) = the number of thesaurus categories w appears in (the ambiguity of w).</Paragraph>
    <Paragraph position="3"> Let COUNT(w1, w2) = the number of instances of w1 modifying w2 in the training set.</Paragraph>
    <Paragraph position="5"> where i ranges over all possible thesaurus categories.</Paragraph>
    <Paragraph position="6"> Note that this measure is asymmetric. CA(t1, t2) measures the tendency for t1 to modify t2 in a compound noun, which is distinct from CA(t2, t1).</Paragraph>
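A hedged sketch of how such an ambiguity-weighted association could be computed is given here. The exact form used is an assumption, not a quotation: each training pair contributes 1 / (AMBIG(w1) * AMBIG(w2)) to every category pairing of its two words, and CA(t1, t2) is that weighted count divided by the product of its sums over all modified categories and over all modifying categories. The category labels and training pairs are invented toy data, not Roget's categories.

```python
from collections import defaultdict

# Hypothetical thesaurus lookup: word -> categories it appears in.
categories_of = {
    "coffee": ["FOOD"],
    "mug": ["RECEPTACLE"],
    "cup": ["RECEPTACLE"],
    "pottery": ["MATERIAL", "ART"],
    "kiln": ["FURNACE"],
}

# Toy training pairs (modifier, modified); repeats would count as extra instances.
pairs = [("coffee", "mug"), ("coffee", "cup"), ("pottery", "mug"), ("pottery", "kiln")]


def weighted_counts(pairs, categories_of):
    """Spread each pair's count evenly over the category pairings of its words."""
    mu = defaultdict(float)
    for w1, w2 in pairs:
        cats1, cats2 = categories_of[w1], categories_of[w2]
        weight = 1.0 / (len(cats1) * len(cats2))  # 1 / (AMBIG(w1) * AMBIG(w2))
        for t1 in cats1:
            for t2 in cats2:
                mu[(t1, t2)] += weight
    return mu


def conceptual_association(mu):
    """Assumed form: mu(t1, t2) / (sum_i mu(t1, ti) * sum_i mu(ti, t2))."""
    as_modifier = defaultdict(float)  # sum over all modified categories
    as_modified = defaultdict(float)  # sum over all modifying categories
    for (t1, t2), value in mu.items():
        as_modifier[t1] += value
        as_modified[t2] += value
    return {
        (t1, t2): value / (as_modifier[t1] * as_modified[t2])
        for (t1, t2), value in mu.items()
    }


mu = weighted_counts(pairs, categories_of)
ca = conceptual_association(mu)
```

Because the table is keyed by ordered pairs, ("FOOD", "RECEPTACLE") and ("RECEPTACLE", "FOOD") are separate entries, which is one way to reflect the asymmetry noted above.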
    <Paragraph position="7"> Automatic Compound Noun Analysis: The following procedure can be used to syntactically analyse ambiguous CNs. Suppose the compound consists of three nouns: w1 w2 w3. A left-branching analysis, [[w1 w2] w3], indicates that w1 modifies w2, while a right-branching analysis, [w1 [w2 w3]], indicates that w1 modifies something denoted primarily by w3. A modifier should be associated with words it modifies. [Footnote 1: This introduces some additional noise, since extraction cannot guarantee to produce complete noun compounds.] [Footnote 2: Some simple morphological rules were used at this point to reduce plural nouns to singular forms.]</Paragraph>
    <Paragraph position="8"> So, when CA(pottery, mug) &gt;&gt; CA(pottery, coffee), we prefer (pottery (coffee mug)). First though, we must choose concepts for the words. For each wi (i = 2 or 3), choose categories Si (with w1 in Si) and Ti (with wi in Ti) so that CA(Si, Ti) is greatest. These categories represent the most significant possible word meanings for each possible attachment. Then choose wi so that CA(Si, Ti) is maximum and bracket w1 as a sibling of wi. We have then chosen the attachment having the most significant association in terms of mutual information between thesaurus categories.</Paragraph>
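The three-noun decision can be sketched as follows. This is an illustration under assumptions, not the original code: the CA values and category labels are invented, unseen category pairs are scored 0.0, and ties are resolved in favour of the left-branching analysis, a choice the text does not specify.

```python
# Hypothetical CA table keyed by ordered category pairs (toy values).
toy_ca = {
    ("MATERIAL", "RECEPTACLE"): 0.9,
    ("MATERIAL", "FOOD"): 0.1,
    ("FOOD", "RECEPTACLE"): 0.8,
}

# Hypothetical thesaurus lookup: word -> categories it appears in.
toy_categories = {
    "pottery": ["MATERIAL", "ART"],
    "coffee": ["FOOD"],
    "mug": ["RECEPTACLE"],
}


def best_association(ca, categories_of, modifier, modified):
    """Greatest CA(S, T) with the modifier in S and the modified word in T."""
    return max(
        (
            ca.get((s, t), 0.0)
            for s in categories_of[modifier]
            for t in categories_of[modified]
        ),
        default=0.0,
    )


def bracket_three(ca, categories_of, w1, w2, w3):
    """Attach w1 to whichever of w2, w3 has the greater best association."""
    left = best_association(ca, categories_of, w1, w2)   # w1 modifies w2
    right = best_association(ca, categories_of, w1, w3)  # w1 modifies w3
    if right > left:
        return (w1, (w2, w3))
    return ((w1, w2), w3)  # ties fall back to left-branching (assumption)


analysis = bracket_three(toy_ca, toy_categories, "pottery", "coffee", "mug")
```

With these toy values the best pairing for (pottery, mug) is 0.9 against 0.1 for (pottery, coffee), so the right-branching analysis is returned.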
    <Paragraph position="9"> In compounds longer than three nouns, this procedure can be generalised by selecting, from all possible bracketings, that for which the product of greatest conceptual associations is maximized.</Paragraph>
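One possible reading of this generalisation is sketched here. Several points are assumptions rather than details given in the text: every binary bracketing is enumerated, the word a constituent "primarily denotes" is taken to be its rightmost noun, each bracket contributes the association between the two words it joins, and the word-level scores in the toy table stand in for the greatest CA over category pairs.

```python
# Hypothetical word-level scores standing in for the greatest CA of each pair.
TOY_SCORES = {
    ("coffee", "mug"): 0.8,
    ("pottery", "mug"): 0.6,
    ("pottery", "coffee"): 0.1,
    ("mug", "handle"): 0.7,
    ("pottery", "handle"): 0.2,
    ("coffee", "handle"): 0.1,
}


def toy_assoc(modifier, modified):
    """Look up the toy score, treating unseen pairs as 0.0."""
    return TOY_SCORES.get((modifier, modified), 0.0)


def bracketings(words):
    """Yield every binary bracketing of a word list as nested tuples."""
    if len(words) == 1:
        yield words[0]
        return
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                yield (left, right)


def head(tree):
    """Rightmost noun of a constituent (assumed to be what it primarily denotes)."""
    return tree if isinstance(tree, str) else head(tree[1])


def score(tree, assoc):
    """Product of the associations at every bracket in the tree."""
    if isinstance(tree, str):
        return 1.0
    left, right = tree
    return assoc(head(left), head(right)) * score(left, assoc) * score(right, assoc)


def best_bracketing(words, assoc):
    """Bracketing whose product of associations is greatest."""
    return max(bracketings(list(words)), key=lambda tree: score(tree, assoc))


best = best_bracketing(["pottery", "coffee", "mug", "handle"], toy_assoc)
```

Exhaustive enumeration grows with the Catalan numbers (five bracketings for four nouns), so this sketch is only practical for short compounds.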
  </Section>
</Paper>