<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0630">
  <Title>Automatically Merging Lexicons that have Incompatible Part-of-Speech Categories</Title>
  <Section position="4" start_page="249" end_page="251" type="metho">
    <SectionTitle>
3 Anti-lexicon
</SectionTitle>
    <Paragraph position="0"> Based on the above intuition about utilizing negative information, we propose an improved model using what we call an anti-lexicon, which indicates the POS tags that a lemma cannot have; we call these its anti-tags. A POS tag p is called an anti-tag of a lemma m if p can never be a tag of m.</Paragraph>
    <Paragraph position="1"> The anti-lexicon consists of a set of pieces of this negative information, each called an anti-lexeme, defined as a pair (m, ¬p), where ¬p denotes an anti-tag of the lemma m and p is a POS tag used in the lexicon.</Paragraph>
    <Paragraph position="2"> Some examples of anti-lexemes are:</Paragraph>
    <Paragraph position="4"> where "IN", "JJ", and "VB" are the preposition, adjective, and verb tags in the Brill lexicon, respectively.</Paragraph>
    <Paragraph position="5"> Similar to a traditional lexicon which contains lexemes in the form of pairs of lemmas and their corresponding possible POS tag(s), an anti-lexicon contains anti-lexemes which are simple pairs that associate a lemma with an anti-tag.</Paragraph>
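This pairing can be sketched in Python. The lemmas here are illustrative (borrowing the "IN", "JJ", and "VB" tags mentioned above), not entries from the paper's actual lexicons:

```python
# Toy lexicon: each lemma maps to the set of POS tags it can take.
lexicon = {
    "apple": {"NN", "NP"},
    "calculate": {"VB"},
}

# Anti-lexicon: a set of (lemma, anti-tag) pairs, each asserting
# that the lemma can NEVER take that tag.
anti_lexicon = {
    ("apple", "IN"),
    ("apple", "VB"),
    ("calculate", "JJ"),
}

def ruled_out(lemma, tag):
    """True if the anti-lexicon forbids this (lemma, tag) pairing."""
    return (lemma, tag) in anti_lexicon

print(ruled_out("apple", "VB"))   # True: "VB" is an anti-tag of "apple"
print(ruled_out("apple", "NN"))   # False: not ruled out
```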
    <Paragraph position="6"> The anti-lexicon can be generated automatically, quickly and easily. To illustrate the idea, suppose we add the lemma "Central" and the POS tag "NP" to the example lexicon we have been working with.</Paragraph>
    <Paragraph position="7">  Suppose we want to know whether "Central" can be an "NN", and whether "calculate" can be an "NN". The fact that "apple" can be tagged as both "NN" and "NP", but not "VB", gives "Central" (a lemma already known to serve as an "NP") a higher likelihood of also serving as an "NN" than "calculate" (a lemma not known to serve as an "NP").</Paragraph>
    <Paragraph position="8"> Based on this assumption that lexemes with similar semantics will have similar POS tags, we conceptualize this kind of pattern in terms of "cohesion" between lemmas and POS tags in a lexicon. The cohesion of a lemma l and a POS tag p measures the likelihood of p being a possible tag of l, and is defined as follows. [Algorithm box: SimScore(p, q). Input: two POS feature vectors of integers with values {0, 1, 2} (representing POS tags that may come from different tagsets), p = {p1, p2, ..., pn} and q = {q1, q2, ..., qn}. Output: a score in the range (0, 1) representing the similarity between p and q.]</Paragraph>
    <Paragraph position="10"> cohesion(l, p) = 1, if (l, p) is in the lexicon; Pr(p | p1, p2, ..., pn), if F(p1, p2, ..., pn) &gt; 0; and 0 otherwise, where p1, p2, ..., pn are the POS tags that the lexicon already lists for l.</Paragraph>
    <Paragraph position="12"> Here F(p1, p2, ..., pn) denotes the total number of lemmas l in the lexicon for which p1, p2, ..., pn are all legal POS tags, and the probability Pr(p | p1, p2, ..., pn) = F(p, p1, p2, ..., pn) / F(p1, p2, ..., pn) is just the relative frequency of the lemmas that can have all the POS tags in the set {p, p1, p2, ..., pn} among those that can have all of {p1, p2, ..., pn}.</Paragraph>
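A minimal sketch of this cohesion measure, assuming a toy lexicon (the entries are illustrative, not the paper's data):

```python
def F(lexicon, tags):
    """Number of lemmas for which every tag in `tags` is a legal POS."""
    return sum(1 for pos_set in lexicon.values() if set(tags) <= pos_set)

def cohesion(lexicon, lemma, p):
    """Likelihood that POS tag p is a possible tag of `lemma`,
    following the piecewise definition in the text."""
    known = lexicon.get(lemma, set())
    if p in known:
        return 1.0                # (l, p) already in the lexicon
    if F(lexicon, known) > 0:
        # Relative frequency: lemmas taking all of known | {p},
        # out of lemmas taking all the known tags of the lemma.
        return F(lexicon, known | {p}) / F(lexicon, known)
    return 0.0

# Toy lexicon (illustrative)
lexicon = {
    "apple":     {"NN", "NP"},
    "Central":   {"NP"},
    "calculate": {"VB"},
}

print(cohesion(lexicon, "Central", "NN"))    # 0.5: "apple" shares "NP" and also takes "NN"
print(cohesion(lexicon, "calculate", "NN"))  # 0.0: no "VB" lemma also takes "NN"
```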
    <Paragraph position="14"> Therefore "NN" is more likely to be associated with "Central" than with "calculate"; that is, "NN" is less likely to be a valid POS tag for "calculate" than for "Central". Under this intuition, we create an anti-lexicon by considering the cohesion of all possible combinations of lemmas and POS tags. Entries with low cohesion are considered anti-lexemes and inserted into the anti-lexicon.</Paragraph>
    <Paragraph position="15"> An anti-lexicon A is created by: A = A ∪ {(l, a)} iff cohesion(l, a) &lt; λ, where λ is a threshold called the anti-threshold, usually a very small real number between 0 and 1.</Paragraph>
    <Paragraph position="16"> In our example, if we set the anti-threshold to 0.4, "NN" becomes an anti-tag for "calculate" but not for "Central". Since the lemmas in actual lexicons usually have many possible POS tags, their cohesion with any single POS tag will in turn be smaller than in our simple example. To create a more accurate anti-lexicon, we should therefore set the anti-threshold to a smaller value.</Paragraph>
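The thresholding step can be sketched end to end. The toy lexicon and the cohesion helper below follow the definitions in the text (they are illustrative, not the paper's implementation), and with an anti-threshold of 0.4 they reproduce the example's outcome:

```python
def F(lexicon, tags):
    """Number of lemmas for which every tag in `tags` is a legal POS."""
    return sum(1 for pos_set in lexicon.values() if set(tags) <= pos_set)

def cohesion(lexicon, lemma, p):
    """Piecewise cohesion of a lemma and a POS tag, as in the text."""
    known = lexicon.get(lemma, set())
    if p in known:
        return 1.0
    if F(lexicon, known) > 0:
        return F(lexicon, known | {p}) / F(lexicon, known)
    return 0.0

def build_anti_lexicon(lexicon, anti_threshold):
    """Insert (lemma, tag) as an anti-lexeme whenever its cohesion
    falls below the anti-threshold (lambda in the text)."""
    all_tags = set().union(*lexicon.values())
    return {(lemma, tag)
            for lemma in lexicon
            for tag in all_tags
            if cohesion(lexicon, lemma, tag) < anti_threshold}

lexicon = {
    "apple":     {"NN", "NP"},
    "Central":   {"NP"},
    "calculate": {"VB"},
}

anti = build_anti_lexicon(lexicon, 0.4)
print(("calculate", "NN") in anti)   # True:  cohesion 0.0 < 0.4
print(("Central", "NN") in anti)     # False: cohesion 0.5 >= 0.4
```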
  </Section>
  <Section position="5" start_page="251" end_page="251" type="metho">
    <SectionTitle>
4 Lexicon merging algorithm
</SectionTitle>
    <Paragraph position="0"> Given a POS mapping table B between the POS tagset P used by the original lexicon and the POS tagset Q used by the additional lexicon, we merge the entries from the additional lexicon into the original lexicon using the algorithm shown in Algorithm 3.</Paragraph>
    <Paragraph position="1"> This algorithm does not exclude m-to-n POS mappings; that is, a lexeme in the additional lexicon can generate more than one lexeme, all of which are merged into the original lexicon.</Paragraph>
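Algorithm 3 itself is not reproduced in this extraction. The sketch below, with a hypothetical mapping table and toy lexicons, only illustrates how an m-to-n mapping lets one additional-lexicon lexeme generate several lexemes in the original tagset (the anti-lexicon filtering is omitted):

```python
# Hypothetical m-to-n POS mapping table B from the additional
# lexicon's tagset Q to the original lexicon's tagset P.
B = {
    "NOUN": ["NN", "NP"],   # one Q tag maps to two P tags
    "VERB": ["VB"],
}

original = {"apple": {"NN"}}
additional = {"apple": {"NOUN"}, "run": {"VERB", "NOUN"}}

def merge(original, additional, mapping):
    """Map each additional-lexicon lexeme through the table and add
    every resulting (lemma, P-tag) pair to the original lexicon."""
    merged = {lemma: set(tags) for lemma, tags in original.items()}
    for lemma, q_tags in additional.items():
        for q in q_tags:
            for p in mapping.get(q, []):
                merged.setdefault(lemma, set()).add(p)
    return merged

result = merge(original, additional, B)
print(sorted(result["apple"]))   # ['NN', 'NP']
print(sorted(result["run"]))     # ['NN', 'NP', 'VB']
```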
  </Section>
</Paper>