<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2213">
  <Title>Using a Hybrid System of Corpus- and Knowledge-Based Techniques to Automate the Induction of a Lexical Sublanguage Grammar</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Porting Bottleneck
</SectionTitle>
    <Paragraph position="0"> The trMitional gramnmr knowledgebase is the product of a never-ending attempt by linguists to impose order on something that refuses to be pinned down because it is a living thing. To a great extent, of course, these linguists are able to point to regularities, because language is first of all a practical thing, a means to communicate, and there must be a colnmon base for such transfer to take phtce. But all rules have exceptions, and often it turns out these exception s are not isolated or random, so tile rule is finetuned. The problem is that what is &amp;quot;grannnatical&amp;quot; depends on the tmwritten rules of a certain domain.</Paragraph>
    <Paragraph position="1"> When the core grammar is augmented to acconnnodate all these idiosyncracies, the danger is not that an ungrammatical sentence might slip through, but that perfectly legitimate input receives an incorrect analysis that is sanctioned by some peripheral grammar rule that doesn't apply to the domain under investigation. The semantic cmnponent which gets this false positive may reject it and request a second reading, and the correct parse will most probably come down the pipeline eventually if the grammar is truly broad-coverage, but a semantic module is not always well equipped to detect such errors and may have a difficult time enough trying to resolve attachment problems, anaphoric references, etc., even when presented with the &amp;quot;right&amp;quot; parse.</Paragraph>
    <Paragraph position="2"> In systems that use a lexical grammar, i.e., whore part of the grammatical &amp;quot;knowledge&amp;quot; is stored outside the non-terminals of the grammar proper, using subcategorization frames associated with terminals (words in Ihe lexicon), the peril likewise {s that this resource becomes bhmted over time with options exercised only in certain settings or when the word is used in a marginal sense.</Paragraph>
    <Paragraph position="3"> Clearly something must be done to separate the wheat from the chaff; the problem is twofold: getting the grammar and lcxicon to a ccrtain level of competence was a laborious and timc-consmning process, and undoing this (i.e., eliminating unwanted options) is ahnost as difficult and painfifl as the constant augmenting in the first place. And secondly, what constitutes wheat and chaff is different for each domain, so this &amp;quot;dieting&amp;quot; must bc repeated lot every port.</Paragraph>
    <Paragraph position="4"> Corpus-based techniques can help automate this filtering, i.e., the source text should be viewed not only as an &amp;quot;obstacle&amp;quot; to be tamed (parsed), but as a resource that is best authority on what is grammatical for the domain.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="1165" type="metho">
    <SectionTitle>
2. Data-Driven Attuning
</SectionTitle>
    <Paragraph position="0"> Since the carly 90s, there has been a surge of intcrest in corpus-based NLP rescarch; some researchers have tackled the grammar proper, making it a probabilistic system,'or doing away with a rule-based system altogethcr and inducing a customizcd grammar from scratch using stochastic methods.</Paragraph>
    <Paragraph position="1"> Dcspite the shortcomings of knowlcdgc-based systems, it seems wrong to throw away all that has been gained, imperfect as it is. Rather, a hybrid system shoukl be developed where the strengths of both paradigms arc combined. A good example el that is a probabilistic Contcxt Free Grammar.</Paragraph>
    <Paragraph position="2"> Both Brcnt (1993) and Manning (1993), who attempt to induce a lexicon of subcategorization features do so by completely discarding all pre-existing knowledge; both systems are stand-ahmc, without a parsing engine to test or use the &amp;quot;learned&amp;quot; information. Brcnt in fact takes the &amp;quot;fronl scratch&amp;quot; to an extreme, and models his system aftcr the way a child learns to understand hmguage. The algorithm of both authors basically inw)lves a pattern matcher that scans the input for a verb, and once an anchor is found, its right context is searched for cues lot subcategorization fi'ames. Brent's cues are very primitive, but because hc only picks up frmnes when the indicators are mmmbiguous, his results are very reliable, albeit sparse (unless a very large training corpus is used). Manning's triggers on the other hand are more sophisticated, but because they are less dependable he must rely on heavy statistical filtering to reduce the &amp;quot;noise.&amp;quot; Although Manning's work in inducing features certainly accomplishes the goal of customizing the lexicon to a particulm&amp;quot; domain, the  porting process is still very much a manual enterprise in that he must write a mini-parser, a finite state machine that includes an NP recognizer &amp;quot;and various other rules to recognize certain cases that appear frequently&amp;quot; (1993, 237).</Paragraph>
    <Paragraph position="3"> The dilemma of any pattern matching approach is in essence a bootstrapping problem; if the goal is to induce syntactic information (in the form of lexical features), then paradoxically some heavy syntactic processing power is needed to &amp;quot;parse&amp;quot; the training data to mine for evidence that a particular verb subcategorizes for an object option, while avoiding false triggers (imposter patterns). Manning has built into his finite state device a panic mode to skip over ambiguous elements, but the trick is to recognize when things get hairy; that is where a lot of programming el'tort takes place, and this finetuning is  never over (and must be repeated for every port to a new domain) as Manning himself admits (1993, 238).</Paragraph>
    <Paragraph position="4"> 3. Category Space of Context Digests  The category space described in this paper uses a very different approach to induce subcategorization frames; instead of starting fi'om scratch, the existing rich lexicon is exploited and features are assigned to new words based on their paradigmatic relatedness to known words. Thus instead of having to &amp;quot;hunt&amp;quot; for evidence, this approach is able to exploit the expertise of seasoned linguists who constructed the initial lexicon, which was intentionally designed to be broad-coverage. Such a strategy not only avoids having to distinguish good cues from irrelevant triggers, but is capable of inducing some features like ASSERTION for which there is no marker that would indicate its presence.</Paragraph>
    <Paragraph position="5"> A category space is a multi-dimensional space in which the syntactic category of words is represented by a vector of co-occurrence counts (Schiitze 1993). Proximity between two such vectors, or context digests, can be used to measure the paradigmatic relatedness of the words they represent (Schtitze and Pedersen 1993). Paradigmatic relatedness indicates how well two words can be substituted lbr each other, i.e., how similar their syntactic behavior is. This is not the same as the synonym relationship, which is based on semantic similarity.</Paragraph>
    <Paragraph position="6"> There are two general approaches in the literature to collecting distributional information: window-based and syntactically-based (Charniak 1993). In the latter scheme the text is scanned until a section is found that is deemed to be relevant. The &amp;quot;rough&amp;quot; structure of the sentence is computed, a process known as partial parsing. This produces a flat tree with phrase boundaries marked and identified by type, but without much internal detail.</Paragraph>
    <Paragraph position="7"> A second approach to collecting relevant distributional information is to keep co-occurrence counts of the nearest lexical neighbors of a word, usually within a fixed distance or &amp;quot;window.&amp;quot; Markov models, for example, predict the POS of a word based on the tags of the two or three words preceding it (bigrams and trigrams respectively). Schatze has experimented with window lengths of four words, two hundred letter lburgrams and two thousand characters (Schtitze 1993).</Paragraph>
    <Paragraph position="8"> In the research presented here, a window of tour was adopted, i.e., for words of interest in the domain of physical chemistry, co-occurrence counts were kept between those words and their immediate left neighbors (Wi_l wi), immediate right neighbors (wi Wi+l), and left and right neighbors that are two words away (wi-2 wi and wi wi+2 respectively).</Paragraph>
    <Paragraph position="9"> One importance difference between the category space reported here fi'om the one in Schiitze and Pedersen (1993) is that words were disambiguated by part of speech so as not to mix up context information of unrelated tokens, a problem Sch/itze acknowledges plagues his system (1993, 254). The corpus was tagged using Brill's tagger (Brill 1993), which is based on what he calls transformatiombased error-driven learning. 1430 word types tagged as verbs occurred frequently enough (&gt;10x) in the training corpus to warrant constructing a vector or context digest. As Zipf's law would predict, there is a long tail of word types which occur too infrequently to permit gathering useful statistics.</Paragraph>
    <Paragraph position="10"> Each window of the context digests tracks co-occurrence counts with word types of ~ POS, provided these types have a minimum frequency of 100 in the training corpus. For &amp;quot;rare&amp;quot; neighbors, the algorithm simply records the neighbor's POS, a compromise to keep the size of the arrays manageable, while providing some information on the syntactic context.</Paragraph>
    <Paragraph position="11"> Context digests are formed by combining the 4 fixed windows, each consisting of co-occurrence counts with 5,509 possible neighbors. In addition, some limited long(er)-distance information is appended to the vector: the training corpus has been augmented with bracketing information, that is, with implicit trees that exhibit binary branching, but whose nonterminals are unlabelled. This is another application of Brill's transformation-based error-driven learner (Brill 1993), which was trained on 32,000 bracketed sentences from the Penn Treebank. These phrasal boundaries are of variable length, and can in fact span the whole sentence. Ideally, the name of the type phrase that the verb occurred in should be used as a clustering feature, but since this information is unavailable (the non-terminals in the trees implicit in the bracketing are unlabelled) the next best thing is used, and each boundary is marked by a pair of tags occurring on either side of the bracket.</Paragraph>
    <Paragraph position="12"> Each context digest for verbs, then, contains  27,654 possible entries. The resulting matrix is very sparse, however; the density for the verb category space is only 1.5 percent. Hence the distributional information is generalized by means of a matrix manipulation method called Singular Value Decomposition (SVD). This technique is el'ten used in factor analysis, because reducing the representation to a low dimensionality allows one to better visualize the space, lit is exactly this compactness of representation that has led Schtitze to apply SVD to the field of NLP, to reduce the number of input parameters to a neural net, without sacrificing too many of the fine distinctions in the original text (Schiitze 1993). Deerweester et al. (1990) introduced SVD to the field of inlk)rmation retrieval lor improved docmnent representations; the original term-document matrix is decomposed into linearly independent factors, many (5t' which are very small. An approximate model with fewer dimensions can be constructed by ignoring these small components. By combining only the first k linearly independent components, a reduced model is built which disregards lesser terminology variations, because k is smaller than the number of rows (terms).</Paragraph>
    <Paragraph position="13"> To generalize the associational patterns in the category space that was bootstrapped from the physical chcmistry corpus, SVD was applied with it conservative value for k of 350. The tool used for this purpose was a slightly modified version of the las2 module from the SVDPACKC package (Berry et al.</Paragraph>
    <Paragraph position="14"> 1993). Tim generalizing effect of SVD causes the category space for verbs to become much less sparse: 35.4 percent of the entries now have non-zero %ounts.&amp;quot; Most of these are new counts, i.e. SVD infers context similarities between words that may not be apparent in the original co-occurrence matrix due to the natural randomness in any corpus sample. The average number (51' context digests that are very similar (greater than 97 percent confidence) remains fairly constant alter SVD, but the dimension reduction provides a lot more information about syntactic behavior when a less strict cutoff value is adopted (say  90 percent).</Paragraph>
    <Paragraph position="15"> 4. Induction based on Neighborhoods  Proximity in this reduced space is then used to find for all the context digests a neighborhood of words that are paradigmatically related. Proximity can be computed by using the cosine similarity measure, which was a major feature of the SMART information retrieval system (Salton 1983). This measures the cosine of the angle between two context digests, which can be viewed as vectors in a sdimensional space.</Paragraph>
    <Paragraph position="16"> The category space can be clustered by comparing pairs of context digests using the cosine similarity measure; such clusters contain words whose syntactic behavior is substantially similar. The degree of similarity depends on the adopted threshold value. However, these neighborhoods are not traditional clnsters; each verb has its own individual representation in a multi-dimensional space, i.e. is the center of its own neighborhood. Typically any given verb is a vector which silnultaneously belongs in several neighborhoods.</Paragraph>
    <Paragraph position="17"> Verbal subcatcgorization frames like transitivity, or the ability to take a that-complement or to-infinitive can be induced for new words based on a &amp;quot;composite&amp;quot; of features associated with &amp;quot;similar&amp;quot; verbs that are. defined in the lexicon. The knowledgebase used in this research is the domain-independent lexicon of PUNDIT, a broad-coverage symbolic NLP system, which contains 164 verbs with detailed subcategorization inl'ormation (Hirsehman et al. 1989). PUNDIT's features are a subset (51&amp;quot; Sager's Linguistic String Project (Sager 1981), which include sebctional restrictions, features that license constructs, and object options that affect the interpretation of a sentence.</Paragraph>
    <Paragraph position="18"> The induction works as follows: each verb lms its own neighborhood, formed by computing the cosine similarity weight between it and all other verbs in the category space, and by retaining those whosc weight excecds a certain threshold. If there are no nearby verbs with known teatures, more remote words can be used for deciding on whether a certain feature should apply to the verb being examined, especially if a substantial majority of these &amp;quot;distant relatives&amp;quot; are in agreement. If the features are treated as boolean values (present/not present), it will most certainly happen in neighborhoods with liberal cutoff points that there will be some disagreement for individual options, so a heuristic must negotiate these &amp;quot;eonllicts&amp;quot; and settle for the best abstraction. Such a heuristic should have the following three characteristics: null 1) verbs that are close to the word being examined should carry more weight in the decision process than verbs that are closer to the perimeter.</Paragraph>
    <Paragraph position="19"> 2) both positive and negative evidence (the absence of a feature for a particular verb) should be considered.</Paragraph>
    <Paragraph position="20"> 3) given the fact that the presence of a feature is the result of a positive decision/action (by a linguist), whereas the absence may be an oversight, there should be it (slight) bias in favor of the former; the sensitivity threshold can bc adjusted by shifting the point at which the weight of evidence is considered sufficient to decide in favor of adopting the feature.</Paragraph>
    <Paragraph position="21"> The existing verbs in the lexicon themselves undergo a similar process whereby they are fitted to the domain: some of their &amp;quot;generic&amp;quot; features which me not appropriate m'e dropped, whereas &amp;quot;gaps&amp;quot; in object options are filled. The net result is that the grammar  becomes attuned to the sublanguage: parses become possible because the enabling features are present, while the search space is pruned of many false positives because unnecessary features are omitted.</Paragraph>
  </Section>
class="xml-element"></Paper>