<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2121">
  <Title>McMahon J.G., Smith F.J.: Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies.</Title>
  <Section position="4" start_page="0" end_page="836" type="metho">
    <SectionTitle>
2 Related work
</SectionTitle>
    <Paragraph position="0"> Three main approaches have been proposed for the automatic extraction of lexical semantic knowledge: syntax-based, n-gram-based and window-based. Syntax-based methods (also referred to as knowledge-rich, in contrast to the other, knowledge-poor, methods) (Pereira and Tishby, 1992; Grefenstette, 1993; Li and Abe, 1997) represent the words under consideration as vectors containing statistics of their syntactic properties in relation to a given set of words (e.g. statistics of object syntactic relations with respect to a set of verbs) and cluster the considered words according to the similarity of the corresponding vectors. Methods that use bigrams (Brown et al., 1992) or trigrams (Martin et al., 1998) cluster words by considering as a word's context only the one or two immediately adjacent words, employing as clustering criteria the minimal loss of average mutual information and the perplexity improvement respectively. Such methods are oriented towards language modeling and aim primarily at rough but fast clustering of large vocabularies. Brown et al. (1992) also proposed a window method introducing the concept of "semantic stickiness" of two words, defined as their relatively frequent close co-occurrence (within a distance of less than 500 words). Although this is an efficient and entirely knowledge-poor method for extracting both semantic relations and clusters, the extracted relations are not restricted to semantic similarity but extend to thematic roles. Moreover, its applicability to small and specialized corpora is uncertain.</Paragraph>
  </Section>
  <Section position="5" start_page="836" end_page="836" type="metho">
    <SectionTitle>
3 A knowledge-poor approach
</SectionTitle>
    <Paragraph position="0"> In order to achieve portability we approach the issue from a knowledge-poor perspective. Syntax-based methods employ partial parsers, which require highly language-dependent resources (morphological/grammatical analysis) and/or a properly tagged training corpus in order to detect syntactic relations between sentence constituents.</Paragraph>
    <Paragraph position="1"> On the other hand, n-gram methods operate on large corpora and, in order to reduce computational resources, consider as context words only the immediately adjacent ones.</Paragraph>
    <Paragraph position="2"> Medium-distance word context is not exploited.</Paragraph>
    <Paragraph position="3"> Since large corpora are available for only a few domains, we aimed at developing a method for processing small or medium-sized corpora that exploits as much contextual information as possible, that is, the full sentential context of words. Our approach was driven by the observation that in domain-constrained corpora, unlike fiction or general journalese, the vocabulary is limited, the syntactic structures are not complex, and medium-distance lexical patterns are frequently used to express similar facts.</Paragraph>
    <Paragraph position="4"> Specifically, we have developed two different algorithms with respect to the notion of context they employ: word-based and pattern-based. The former acquires word-based contextual data (extended up to sentence boundaries) and extracts word similarity relations according to the distributional similarity of these data. The latter detects common patterns throughout the corpus that indicate possible word similarities. For example, consider the sentence fragments: "...while the S&amp;P index inched up 0.3%." and "The DAX index inched up 0.70 point to close...". Although their syntactic structures are different, the common contextual pattern (appearing beyond the immediately adjacent words) indicates a possible similarity between the tokens 'S&amp;P' and 'DAX'.</Paragraph>
    <Paragraph position="5"> Word pairs that persistently exhibit such context similarity throughout the corpus (a situation frequently observed in technical texts) are confidently indicated as semantically similar. Our method captures such context similarity and extracts from it a proportionate measure of the semantic similarity between lexical items.</Paragraph>
    <Paragraph position="6"> Most approaches (Brown et al., 1992; Li and Abe, 1997) inherently extract semantic knowledge in the abstracted form of semantic clusters. Our method produces semantic similarity relations as an intermediate (and more information-rich) semantic representation formalism, from which cluster hierarchies can be generated. Of great importance is that soft clustering methods can also be applied to this set of relations, clustering polysemous words into more than one class.</Paragraph>
    <Paragraph position="7"> Stock market/financial news and Modern Greek were used as the domain and language test case respectively. However, illustrative examples taken from the WSJ corpus are also used throughout the paper.</Paragraph>
  </Section>
  <Section position="6" start_page="836" end_page="837" type="metho">
    <SectionTitle>
4 Context Similarity Estimation
</SectionTitle>
    <Paragraph position="0"> The main idea supporting context-based word clustering is that two words that can substitute for one another in several different contexts, always producing meaningful word sequences, are probably semantically similar. Present n-gram-based methods exploit this assumption by considering as the context of a focus word only the one or two immediately adjacent parameter words.</Paragraph>
    <Paragraph position="1"> In the present work, we consider as word context the whole sentence in which the examined word appears, excluding only the semantically empty (i.e. functional) words such as articles, conjunctions, particles and auxiliaries. Adopting this notion of word context we proceed to the following analysis. Let Tc be a text corpus with vocabulary Vc, and let Vs ⊆ Vc be the set of words between which we are interested in extracting semantic similarity relations. Vs comprises the non-functional words of Vc appearing in Tc with a frequency higher than a threshold (set to 20 in the presented experiments), so that sufficient data are acquired for every focus word. Let Vp ⊆ Vc be the set of words that will be used as context parameters. Ideally, any word appearing at least twice in the corpus could be used as a context parameter; however, we set this word frequency threshold to 10 in order to diminish computational time. Consider a sentence of Tc: Sm = w1, w2, ..., w(j-1), wj, w(j+1), ..., wk. We define as the sentential context of wj in Sm the set of pairs of sentence words that are members of Vp, each accompanied by its distance from wj: C_Sm(wj) = {(i - j, wi): i = 1..k, i ≠ j, wi ∈ Vp} (1). More formally, C_Sm(wj) can be represented as a binary-valued matrix defined over the set Ω = δ × ω, where δ = {-1, 1, -2, 2, ..., -Lm, Lm}, Lm being the maximum word distance that we regard as carrying useful contextual information (for full-sentence distance Lm = Lmax - 1, where Lmax is the maximum sentence length in Tc), and ω is the ordered set Vp: C_Sm(wj) = {c_j,m(d, w)}, d ∈ δ, w ∈ ω, where c_j,m(d, w) = 1 if w = wi, wi ∈ ω and d = i - j, and 0 otherwise (2). Summing over all corpus sentences we obtain the contextual data matrix for every wj ∈ Vs: C_Tc(wj) = {c_j(d, w)} = Σ_m C_Sm(wj), d ∈ δ, w ∈ ω (3). The estimation of word semantic similarity has thus been reduced to the estimation of matrix similarity. The obtained contextual matrices are compared using a weighted Tanimoto measure (Charniak, 1993), yielding a word similarity measure Sr(wi, wj):</Paragraph>
    <Paragraph position="3"> Sr(wi, wj) = Σ_d Σ_w [h(d) · c_i(d, w) · c_j(d, w)] / [Σ_d Σ_w c_i(d, w)^2 + Σ_d Σ_w c_j(d, w)^2 - Σ_d Σ_w c_i(d, w) · c_j(d, w)] (4). The weight function h(d) defines the desired influence of the distance between words on the context similarity estimation; in this experiment we set h(d) = 1/|d|. In order to reduce computational time the denominator was set to Σ_d Σ_w c_i(d, w) + Σ_d Σ_w c_j(d, w), a modification that has minimal effect on the final result. Experimental results of this method (Word-based Context Similarity Estimation, WCSE) are shown in Table 1. Note that, since the C_Tc(wj) matrix is sparse, (1) was used as the data-storage format instead of (2), in order to diminish computational cost.</Paragraph>
    <Paragraph position="4"> The previously described algorithm handles all contextual data in a uniform way. However, study of the results showed that preference should be given to hits derived from many different similar contexts rather than from a few contexts appearing many times. This clearly gives better results, since the latter case may be due to often-used stereotyped expressions or repeated facts. In order to achieve this we modified (4) accordingly.</Paragraph>
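    <Paragraph> To make the above concrete, the following Python sketch (not part of the original paper; the function and variable names are ours, and a tokenized corpus with given focus and parameter word sets is assumed) accumulates the sparse contextual data of equations (1)-(3) and scores a word pair with the simplified form of equation (4), i.e. h(d) = 1/|d| in the numerator and the modified additive denominator:

from collections import Counter, defaultdict

def contextual_data(sentences, focus_words, param_words):
    """Accumulate c_j(d, w): counts of parameter word w at signed distance d
    from each focus word, summed over all sentences (equations (1)-(3))."""
    data = defaultdict(Counter)          # focus word -> Counter over (d, w) pairs
    for sent in sentences:               # each sentence: list of non-functional tokens
        for j, wj in enumerate(sent):
            if wj not in focus_words:
                continue
            for i, wi in enumerate(sent):
                if i != j and wi in param_words:
                    data[wj][(i - j, wi)] += 1
    return data

def wcse_similarity(data, w1, w2, h=lambda d: 1.0 / abs(d)):
    """Simplified weighted Tanimoto score of equation (4): shared (d, w) context
    elements weighted by h(d), divided by the sum of both words' context counts."""
    c1, c2 = data[w1], data[w2]
    num = sum(h(d) * c1[(d, w)] * c2[(d, w)] for (d, w) in c1 if (d, w) in c2)
    den = sum(c1.values()) + sum(c2.values())
    return num / den if den else 0.0

# Toy usage on the paper's illustrative fragments (functional words removed):
sents = [["S&P", "index", "inched", "up", "0.3%"],
         ["DAX", "index", "inched", "up", "0.70", "point"]]
vocab = {w for s in sents for w in s}
data = contextual_data(sents, focus_words=vocab, param_words=vocab)
print(wcse_similarity(data, "S&P", "DAX"))
    </Paragraph>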
  </Section>
  <Section position="7" start_page="837" end_page="838" type="metho">
    <SectionTitle>
5 Dynamic pattern detection for context similarity estimation
</SectionTitle>
    <Paragraph position="0"> In the previously described method the notion of word context is based on independent intra-sentential word co-occurrences. However, similarity of contextual patterns is a much more reliable word similarity criterion than word-based context similarity. That is, if the sentential contexts C_Sm(wi) and C_Sn(wj) have at least two common elements, we count this as a much more confident hit regarding the similarity of wi and wj, and a measure expressing the weight of the common pattern is obtained. Since the patterns under detection vary across languages and domains, we need a method that extracts them dynamically, regardless of the text genre, domain or language.</Paragraph>
    <Paragraph position="1"> For this purpose we propose an algorithm that performs a sentence-by-sentence comparison along the corpus. This comparison is based on the cross-correlation concept as used in digital signal processing. A sentence can be considered as a digital signal in which every semantic token corresponds to a signal sample. In order to detect words with common contexts, every sentence is checked for partial matches against every other sentence (i.e. matching the semantic category of one or more tokens) at every possible relative position between the two sentences. Wherever common patterns of semantic tokens are found, the neighboring respective tokens of the two sentences are stored as candidate semantic relatives.</Paragraph>
    <Paragraph position="2"> During this process contextual data are not maintained in memory; instead, the detection of a common pattern in both sentences results in the storage of several hits (i.e. candidate similar word pairs) or in the increase of their corresponding similarity measure, according to the pattern similarity of their contexts.</Paragraph>
    <Paragraph position="3"> Let Sm and Sn be two sentences that undergo the cross-correlation procedure. If {dx, x = 1..x1} (x1 ≥ 2) is the set of word distances that satisfy the equality c_i,m(dx, wy) = c_j,n(dx, wy) = 1, then the pair (wi, wj) is stored as a hit accompanied by the context similarity measure of equation (6). Keeping only the first term of (6), namely the sum of 1/|dx| over the matching distances, we obtain the same result as in the WCSE method with weight function h(d) = 1/|d|; the second term augments the score in proportion to the cohesion and the size of the detected pattern, depending on the position of wi (or, equivalently, wj). Dividing (6) by the product of the lengths of Sm and Sn we obtain a normalized measure (7) of the cross-correlation of the two sentences, which is applied throughout the corpus.</Paragraph>
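    <Paragraph> The following is a minimal Python sketch of one cross-correlation step between two tokenized sentences (our own naming, not the paper's code): every pair of focus tokens whose sentential contexts share at least two (distance, word) elements is recorded as a pattern hit and scored with the first, 1/|d| term of (6); the cohesion term and the normalization by sentence lengths are omitted here:

from collections import defaultdict
from itertools import product

def sentential_context(sent, j):
    """Set of (distance, word) pairs around position j (equation (1))."""
    return {(i - j, w) for i, w in enumerate(sent) if i != j}

def pattern_hits(sent_m, sent_n, min_shared=2):
    """Cross-correlate two sentences: for every token pair whose contexts share
    at least `min_shared` (d, w) elements, add a hit scored by the sum of 1/|d|
    over the shared distances (the first term of (6))."""
    hits = defaultdict(float)
    for (i, wi), (j, wj) in product(enumerate(sent_m), enumerate(sent_n)):
        if wi == wj:
            continue                      # identical tokens give no new information
        shared = sentential_context(sent_m, i) & sentential_context(sent_n, j)
        if len(shared) >= min_shared:
            hits[(wi, wj)] += sum(1.0 / abs(d) for d, _ in shared)
    return hits

# Applied over all sentence pairs of the corpus, the accumulated hits are pruned
# at regular intervals, as described below, to bound time and memory.
print(pattern_hits(["S&P", "index", "inched", "up", "0.3%"],
                   ["DAX", "index", "inched", "up", "0.70", "point"]))
    </Paragraph>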
    <Paragraph position="6"> In order to reduce the search time and the memory required during the whole process, a pruning mechanism is applied at regular time intervals to eliminate word pairs with a relatively very low semantic similarity score.</Paragraph>
    <Paragraph position="7"> Dividing (8) by the product of the word probabilities P(wi) · P(wj) we obtain the normalized similarity measure FN(wi, wj).</Paragraph>
    <Paragraph position="8"> In order to constrain the degradation of our results due to sparse data for less frequent words, we multiply (8) by Pc, a data sufficiency measure that is a function of P(wi) and P(wj), obtaining Fu, a more reliable measure. In the Pc measure we employed, we used a probability threshold PTh = 30/|Tc|, |Tc| being the size of the corpus.</Paragraph>
    <Paragraph position="11"> Finally, sorting the resulting pairs by Fu and keeping the N-best scoring pairs, we obtain the preponderant semantically related candidates.</Paragraph>
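    <Paragraph> A minimal sketch of this final ranking step, under our own naming (raw_scores, token_counts and corpus_size are assumed inputs; the data-sufficiency factor Pc is omitted):

from heapq import nlargest

def n_best_pairs(raw_scores, token_counts, corpus_size, n=1000):
    """Normalize the accumulated pair scores (8) by the product of the word
    probabilities P(wi) * P(wj) and return the n best-scoring candidate pairs."""
    total = float(corpus_size)

    def normalized(item):
        (wi, wj), score = item
        return score / ((token_counts[wi] / total) * (token_counts[wj] / total))

    ranked = nlargest(n, raw_scores.items(), key=normalized)
    return [(pair, normalized((pair, score))) for pair, score in ranked]
    </Paragraph>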
  </Section>
  <Section position="8" start_page="838" end_page="839" type="metho">
    <SectionTitle>
6 Preprocessing
</SectionTitle>
    <Paragraph position="0"> In order to apply the algorithms described above, some preprocessing is necessary: 1. A trainable sentence splitter and a rule-based chunker are applied. Sentence boundaries confine the scope of context, while phrase boundaries determine the maximum extent of semantic tokens (see below).</Paragraph>
    <Paragraph position="1"> 2. The next step of the preprocessing is what we call "semantic tokenization". We try to reduce the number of context parameters and simultaneously to increase the volume of contextual data, either by reducing the size of both the focus and parameter word sets or by discarding or merging lexical items, thereby reducing the distance between semantic tokens. Words or word sequences are thus classified into common semantic categories employing syntactic, morphological and collocational information: a. Functionals (auxiliaries, determiners) are discarded since they do not semantically modify their head words. Words of indeterminable semantic content (pronouns, low-frequency words) are treated as empty tokens.</Paragraph>
    <Paragraph position="2"> b. Known domain-independent lexical patterns incorporating arithmetic and temporal expressions (e.g. dates, numbers, amounts, etc.) are regarded as a single semantic token and tagged accordingly. Their information content is indifferent to semantic knowledge acquisition; therefore we preserve only class information. c. Frequently appearing lexical patterns which represent single semantic entities in the specific domain are treated as a single (albeit composite) "semantic token". Their detection is based on the following algorithm (cf. Smadja, 1993; a sketch is given after this list): 1. Extract "significant bigrams" confined inside noun phrases, i.e. immediately adjacent words that exhibit a relatively high degree of statistical association.</Paragraph>
    <Paragraph position="4"> 2. Combine significant bigrams to obtain "significant n-grams", likewise found in the corpus and confined inside noun phrases. Discard subsumed m-grams (m&lt;n) only if they do not occur independently in the corpus.</Paragraph>
    <Paragraph position="5"> 3. Tag the significant n-grams throughout the corpus as single semantic tokens, starting from the higher-order ones.</Paragraph>
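    <Paragraph> The sketch below (Python, our own naming; pointwise mutual information and the thresholds are our assumptions, standing in for the association statistic of the original algorithm) illustrates steps 1 and 2 over chunked noun phrases:

import math
from collections import Counter

def significant_bigrams(noun_phrases, min_count=3, min_assoc=3.0):
    """Step 1 (sketch): score immediately adjacent word pairs inside noun phrases
    with pointwise mutual information and keep the strongly associated ones."""
    unigrams, bigrams = Counter(), Counter()
    for np in noun_phrases:                      # each noun phrase: list of tokens
        unigrams.update(np)
        bigrams.update(zip(np, np[1:]))
    total = sum(unigrams.values())
    keep = set()
    for (a, b), c in bigrams.items():
        if c >= min_count:
            pmi = math.log((c * total) / (unigrams[a] * unigrams[b]))
            if pmi >= min_assoc:
                keep.add((a, b))
    return keep

def significant_ngrams(noun_phrases, bigrams):
    """Step 2 (sketch): chain overlapping significant bigrams inside each noun
    phrase into maximal 'significant n-grams'. Step 3 (tagging them as single
    semantic tokens) and the removal of subsumed m-grams that never occur
    independently are applied afterwards over the whole corpus."""
    ngrams = set()
    for np in noun_phrases:
        i = 0
        while i < len(np) - 1:
            j = i
            while j < len(np) - 1 and (np[j], np[j + 1]) in bigrams:
                j += 1
            if j > i:
                ngrams.add(tuple(np[i:j + 1]))
            i = max(j, i + 1)
    return ngrams
    </Paragraph>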
    <Paragraph position="6"> Semantic entities that are lexically represented as sticky word chains may either be standard named entities in the framework of the information extraction task, such as "Latin America" (location), "Russian president Boris Yeltsin" (person), "Τράπεζα Μακεδονίας-Θράκης" ("Bank of Macedonia and Thrace"; organization), or representations of domain-specific typical events ("αύξηση μετοχικού κεφαλαίου" = rise of equity capital), abstract concepts ("Dow Jones industrials"), etc. To ensure that the detected "sticky" phrases actually represent semantic entities, human inspection is necessary for discarding the spurious ones, since repeated word sequences that do not always constitute single semantic entities often appear in specialized texts.</Paragraph>
    <Paragraph position="7"> From the above it is apparent that we use the term "semantic token" to refer to a recognized semantic pattern (e.g. &lt;date&gt;), a rigid word chain (e.g. "Dow Jones industrials") or a single content word. The context similarity estimation algorithms were run using vocabularies of focus and parameter words derived from the extracted set of semantic tokens.</Paragraph>
  </Section>
  <Section position="9" start_page="839" end_page="839" type="metho">
    <SectionTitle>
7 Incorporating heuristics
</SectionTitle>
    <Paragraph position="0"> From the study of the erroneously extracted semantic relations, certain systematic errors were detected. For example, adjectives, adverbs or adjunctive nouns that are interpolated in otherwise similar contexts lead to the extraction of spurious pairs. Consider for example the phrases "η αύξηση της τιμής της βενζίνης" (= the increase of the benzine price) and "η αύξηση της τιμής διάθεσης της βενζίνης" (= the increase of the disposal price of the benzine). Every algorithm based on word adjacency data outputs as erroneous hits the pairs {benzine, disposal} and {increase, disposal}.</Paragraph>
    <Paragraph position="1"> A rule that was applied to deal with this problem is: if wi ∈ Sm and wj ∈ Sn have similar contexts, count the pair (wi, wj) as a hit only if wi does not also match w(j+1) and wj does not also match w(i+1).</Paragraph>
    <Paragraph position="2"> Such contextual rules can be applied only when the cross-correlation method is used for the context similarity estimation (either pattern-based or word-based).</Paragraph>
  </Section>
  <Section position="10" start_page="839" end_page="840" type="metho">
    <SectionTitle>
8 Word Clustering
</SectionTitle>
    <Paragraph position="0"> Although the obtained N-best list of semantically related pairs already constitutes a thesaurus-like and information-rich form of semantic knowledge representation, many NLP applications (e.g. language modeling) require word clusters instead of word relations. However, since a word similarity measure has been extracted, the formation of clusters is a rather trivial problem, although it is more complex for "soft clustering" (i.e. when a word can be classified into more than one class).</Paragraph>
    <Paragraph position="2"> In order to construct word classes we applied the unsupervised agglomerative hard clustering algorithm shown in Figure 1 over the set of semantic relations. Each distinct lexical item is initially assigned to its own cluster, and clusters are then merged into larger ones according to the average-linkage measure. Merging of clusters stops when the distance between the closest clusters exceeds a threshold proportional to the average distance between words. Tracking the successive merges we obtain sub-cluster hierarchies, such as the one shown in Figure 2.</Paragraph>
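    <Paragraph> A minimal sketch of the hard clustering step under our own naming, working directly with the extracted similarity scores (so "closest" here means highest average-linkage similarity, and the stopping test compares that similarity against a threshold):

def agglomerative_clusters(words, similarity, threshold):
    """Start with one cluster per word and repeatedly merge the two clusters with
    the highest average-linkage similarity, stopping when the best available merge
    falls below `threshold`. `similarity` maps frozenset({wi, wj}) to the extracted
    score (0 if the pair was not extracted)."""
    clusters = [{w} for w in words]

    def avg_link(c1, c2):
        scores = [similarity.get(frozenset((a, b)), 0.0) for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        best, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = avg_link(clusters[i], clusters[j])
                if best is None or s > best:
                    best, best_pair = s, (i, j)
        if best < threshold:
            break                                  # the closest clusters are now too far apart
        i, j = best_pair
        clusters[i] |= clusters[j]                 # recording each merge here yields the
        del clusters[j]                            # sub-cluster hierarchy of Figure 2
    return clusters
    </Paragraph>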
  </Section>
</Paper>