<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0203">
  <Title>Desiderata for Tagging with WordNet Synsets or MCCA Categories</Title>
  <Section position="5" start_page="13" end_page="13" type="metho">
    <SectionTitle>
SCIENTIST, SOCIOLOGIST
</SectionTitle>
    <Paragraph position="0"> These words are a subset of the WordNet synsets headed at PERSON, in particular, synsets headed by</Paragraph>
  </Section>
  <Section position="6" start_page="13" end_page="13" type="metho">
    <SectionTitle>
CREATOR;
EXPERT: AUTHORITY: PROFESSION/~,
</SectionTitle>
    <Paragraph position="0"> Other synsets under EXPERT and AUTHORITY do not fall into this category. Thus, the heuristic Detached roles is like a Hearst &amp; Schütze super-category, but it is constructed on underlying semantic components rather than on a statistical metric.</Paragraph>
    <Paragraph position="1"> Other categories do not fall out so neatly. The category Sanction (120 words) has an average expected frequency of .08 percent, with a range over the four contexts of .06 to .10 percent. It includes the following words (and their inflected forms):  APPLAUD, APPLAUSE, APPROVE, CONGRATULATE, CONGRATULATION, CONVICT, CONVICTION, DISAPPROVAL, DISAPPROVE, HONOR, JUDGE, JUDGMENT, JUDGMENTAL, MERIT, MISTREAT, REJECT, REJECTION, RIDICULE, SANCTION, SCORN, SCORNFUL, SHAME, SHAMEFULLY  Examination of the WordNet synsets is similarly successful here, identifying many words (particularly verbs) in a subtree rooted at JUDGE. However, the definition of the set must also include a derivational lexical rule to allow forms in other parts of speech. Another meaning component is seen in APPROVE and DISAPPROVE, namely, the negative or pejorative prefix, again requiring a lexical rule as part of the category's definition. Such lexical rules would be encoded as described in (Copestake &amp; Briscoe, 1991). This set of words (rooted primarily in the verbs of the set) corresponds to the (Levin, 1993) Characterize (class 29.2), Declare (29.4), Admire (31.2), and Judgment verbs (33) and hence may have particular syntactic and semantic patterning. The verb frames attached to WordNet verb synsets are not sufficiently detailed to cover the granularity necessary to characterize an MCCA category. Instead, the definition of this class might, following (Davis, 1996), inherit a sort notion-rel, which has a "perceiver" and a "perceived" argument (thus capturing syntactic patterning), with perhaps a selectional restriction on the "perceiver" that the type of action is an evaluative one (thus providing semantic patterning).</Paragraph>
    <Paragraph position="2"> Footnote 7: Identification of these synsets facilitates extension of the MCCA dictionary to include further hyponyms of these synsets.</Paragraph>
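    <Paragraph> The idea of defining a category by a small set of base words plus lexical rules can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rule names, the string-based rules, and the choice of which rules apply to which word are hypothetical, and this is not the typed-feature-structure encoding of Copestake and Briscoe. Only the example words come from the Sanction list.

```python
# Hypothetical toy rules; real lexical rules would operate on lexical entries,
# not on bare strings.
RULES = {
    "neg_prefix": lambda w: "DIS" + w,      # APPROVE -> DISAPPROVE
    "nominal_al": lambda w: w[:-1] + "AL",  # DISAPPROVE -> DISAPPROVAL (drops final E)
    "nominal_ion": lambda w: w + "ION",     # REJECT -> REJECTION
}

# Each base word lists the rule chains assumed to apply to it.
SANCTION_ENTRIES = {
    "APPROVE": [("neg_prefix",), ("neg_prefix", "nominal_al")],
    "REJECT": [("nominal_ion",)],
    "CONVICT": [("nominal_ion",)],
}


def derive(base, chain):
    """Apply a chain of named rules to a base word, left to right."""
    word = base
    for rule_name in chain:
        word = RULES[rule_name](word)
    return word


def category_members(entries):
    """Collect every base word together with its licensed derived forms."""
    members = set()
    for base, chains in entries.items():
        members.add(base)
        for chain in chains:
            members.add(derive(base, chain))
    return members


print(sorted(category_members(SANCTION_ENTRIES)))
```

    Listing the licensed rule chains per entry, rather than applying every rule to every word, is one way to avoid overgenerating forms such as DISJUDGE.</Paragraph>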
    <Paragraph position="3"> Another complex category is Normative, consisting of 76 words, with an average expected frequency of .60 percent and a range over the four contexts of .37 to .79 percent. This category also has words from all parts of speech and thus will entail the use of derivational lexical rules in its definition. This category includes the following (along with various inflectional forms):</Paragraph>
  </Section>
  <Section position="7" start_page="13" end_page="13" type="metho">
    <SectionTitle>
ABSOLUTE, ABSOLUTELY, CONSEQUENCE,
CONSEQUENTLY, CORRECT, CORRECTLY,
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="8" start_page="13" end_page="14" type="metho">
    <SectionTitle>
UNEQUIVOCALLY, UNUSUAL, UNUSUALLY
</SectionTitle>
    <Paragraph position="0"> The use of the heuristic Normative to label this category clearly reflects the presence in these words of a semantic component oriented around characterizing something in terms of expectations. Of particular interest here are the adverb forms. McTavish has also used the heuristic Reasoning for this category. These adverbs are content disjuncts (Quirk, et al., 1985: 8.127-33), that is, words betokening a speaker's comment on the content of what the speaker is saying, in this case, compared to some norm or standard. Thus, part of the defining characteristics for this category is a specification for lexical items that have a [contentdisjunct +] feature.</Paragraph>
    <Paragraph position="1"> These examples of words in the Sanction and Normative categories (repeated in other categories) indicate a need to define categories not only in terms of supercategories using the Hearst &amp; Schütze model, but also with additional lexical semantic information not present in WordNet or MCCA categories. In particular, we see the need for encoding derivational and morphological relations, finer-grained characterization of government patterns, feature specifications, and primitive semantic components.</Paragraph>
    <Paragraph position="2"> In any event, we have seen that MCCA categories are consistent with WordNet synsets. They recapitulate the WordNet synsets by acting as supercategories similar to those identified in Hearst &amp; Schütze. To this extent, results from MCCA tagging would be similar to those of Hearst &amp; Schütze. The MCCA methods suggest further insights, depending on the purposes that tagging is meant to serve.</Paragraph>
    <Paragraph position="4"/>
  </Section>
  <Section position="9" start_page="14" end_page="15" type="metho">
    <SectionTitle>
5 Analysis of Tagged Texts
</SectionTitle>
    <Paragraph position="0"> The important questions at this point are why there is value in having additional lexical semantic information associated with tagging and why MCCA categories and WordNet synsets are insufficient. The answer to these questions begins to emerge by considering the further analysis performed after a text has been "classified" on the basis of the MCCA tagging. As described above, MCCA produces a set of C-scores and E-scores for each text. These scores are then subjected to analysis to provide additional results useful in social science and information retrieval applications.</Paragraph>
    <Paragraph position="1"> The two sets of scores are used for computing the distance among texts. This distance is used directly or in exploration of the differences between texts. Unlike other content analysis techniques (or classification techniques used for measuring the distance between documents in information retrieval), MCCA uses the non-agglomerative technique of multidimensional scaling (MDS; see footnote 8). This technique (Kruskal &amp; Wish, 1977) produces a map when given a matrix of distances.</Paragraph>
    <Paragraph position="2"> MDS does not presume that a 2-dimensional representation displays the distances between texts.</Paragraph>
    <Paragraph position="3"> Rather, it unfolds the dimensions one-by-one, starting with 2, examines statistically how "stressed" the solution is, and then adds further dimensions until the stress shows signs of reaching an asymptote. Output from the scaling provides "rotation" maps at each dimension projected onto 2-dimensional space.</Paragraph>
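    <Paragraph> The stress check described above can be sketched in Python under stated assumptions. The sketch computes Euclidean distances among score vectors and a metric simplification of Kruskal's stress-1 (the target distances are used directly as disparities, with no monotone regression), and it picks the dimensionality at which stress stops improving. The function names, the tolerance, and the sample numbers are hypothetical, and fitting the MDS configuration itself is omitted.

```python
import math


def euclidean_matrix(vectors):
    """Pairwise Euclidean distances among equal-length score vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


def stress1(target, config):
    """Stress-1 of a configuration against a target distance matrix.

    Metric simplification: target distances serve as the disparities.
    """
    fitted = euclidean_matrix(config)
    n = len(target)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            num += (target[i][j] - fitted[i][j]) ** 2
            den += fitted[i][j] ** 2
    return math.sqrt(num / den)


def first_plateau(stress_by_dim, tol=0.01):
    """Smallest dimensionality after which stress improves by at most tol."""
    dims = sorted(stress_by_dim)
    for low, high in zip(dims, dims[1:]):
        gain = stress_by_dim[low] - stress_by_dim[high]
        if tol >= gain:
            return low
    return dims[-1]


# Three made-up score vectors and a 2-dimensional projection of them.
vectors = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
target = euclidean_matrix(vectors)
flat = [v[:2] for v in vectors]

print(stress1(target, flat))     # positive: two dimensions distort the distances
print(stress1(target, vectors))  # 0.0: three dimensions reproduce them exactly
print(first_plateau({2: 0.30, 3: 0.12, 4: 0.115, 5: 0.114}))
```

    In practice the per-dimension configurations would come from an MDS solver; the plateau test here only mimics the "signs of reaching an asymptote" criterion with a fixed tolerance.</Paragraph>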
    <Paragraph position="4"> McTavish et al. illustrate both the simple and the more complex use of these distance metrics. In the simple use, the distance between transcripts of nursing home patients, staff, and administrators was used as a measure of social distance among these three groups.</Paragraph>
    <Paragraph position="5"> This measure was combined with various characteristics of nursing homes (size, type, location, etc.) for further analysis, using standard statistical techniques such as correlation and discriminant analysis.</Paragraph>
    <Paragraph position="6"> In the more complex use, the MDS results identify the concepts and themes that are different and similar in the transcripts. This is accomplished by visually inspecting the MDS graphical output. Examination of the 4-dimensional context vectors provides an initial characterization of the texts. The analyst identifies the contextual focus (traditional, practical, emotional, or analytic) and the ways in which the texts differ from one another. This provides general themes and pointers for identifying the conceptual differences among the texts.</Paragraph>
    <Paragraph position="7"> Footnote 8: Agglomerative techniques cluster the two closest texts (with whatever distance metric) and then successively add texts one-by-one as they are closest to the existing cluster.</Paragraph>
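    <Paragraph> For contrast, the agglomerative procedure that the text sets against MDS (join the two closest texts, then add the remaining texts one by one) can be sketched as follows. This is a minimal illustration, not part of MCCA: the function name is hypothetical, and measuring closeness to the cluster by the nearest member (single linkage) is an assumption, since the text leaves the distance metric open.

```python
def agglomerate(dist):
    """Return the order in which texts join a single growing cluster.

    dist is a symmetric matrix of pairwise distances for at least two texts.
    """
    n = len(dist)
    # Seed the cluster with the closest pair of texts.
    pairs = [(dist[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    _, first, second = min(pairs)
    cluster = [first, second]
    remaining = set(range(n)) - {first, second}
    # Repeatedly add the text nearest to any member of the cluster.
    while remaining:
        nearest = min(remaining, key=lambda t: min(dist[t][c] for c in cluster))
        cluster.append(nearest)
        remaining.remove(nearest)
    return cluster


# Made-up distances among four texts.
distances = [
    [0, 1, 4, 9],
    [1, 0, 3, 8],
    [4, 3, 0, 5],
    [9, 8, 5, 0],
]
print(agglomerate(distances))  # [0, 1, 2, 3]
```

    The result is a single join order rather than a map, which is the information MDS adds by placing all texts in a common space.</Paragraph>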
    <Paragraph position="8"> MDS analysis of the E-score vectors identifies the major concepts that differentiate the texts. The analyst examines the graphical output to label points with the dominant MCCA categories. The "meaning" (that is, the underlying concepts) of the MDS graph is then described in terms of category and word emphases.</Paragraph>
    <Paragraph position="9"> These are the results an investigator uses in reporting on the content analysis using MCCA.</Paragraph>
    <Paragraph position="10"> This is the point at which the insufficiency of MCCA categories (and WordNet synsets) becomes visible. In examining the MDS output, the analysis is subjective and based only on identification of particular sets of words that distinguish the concepts in each text (much like the techniques described in (Kilgarriff, 1996) that are used in authorship attribution). If the MCCA categories had richer definitions based on additional lexical semantic information, the analysis could be performed based on less subjective and more rigorously defined principles.</Paragraph>
    <Paragraph position="11"> (Burstein, et al., 1996) describe techniques for using lexical semantics to classify responses to test questions. An essential component of this classification process is the identification of sublexicons that cut across parts of speech, along with concept grammars based on collapsing phrasal and constituent nodes into a generalized XP representation. As seen above in the procedures for defining MCCA categories, addition of lexical semantic information in the form of derivational and morphological relations and semantic components common across part of speech boundaries--information now lacking in WordNet synsets--would facilitate the development of concept grammars.</Paragraph>
    <Paragraph position="12"> (Briscoe &amp; Carroll, 1997) describe novel techniques for constructing a subcategorization dictionary from analysis of corpora. They note that their system needs further refinement, suggesting that adding information to lexical entries about diathesis alternation possibilities and semantic selectional preferences on argument heads is likely to improve their results. Again, the procedures for analyzing MCCA categories seem to require this type of information.</Paragraph>
    <Paragraph position="13"> We have discussed elsewhere (Litkowski &amp; Harris, 1997) extension of a discourse analysis algorithm incorporating lexical cohesion principles. In this extension, we found it necessary to require use of the AGENTIVE and CONSTITUTIVE qualia of nouns (see (Pustejovsky, 1995: 76)) as selectional specifications on verbs to maintain lexical cohesion. With such information, we were able not only to provide a more coherent discourse analysis of a text segment, but also possibly to summarize the text better.</Paragraph>
  </Section>
</Paper>