<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3007">
  <Title>Exploiting Aggregate Properties of Bilingual Dictionaries For Distinguishing Senses of English Words and Inducing English Sense Clusters</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Resources
</SectionTitle>
    <Paragraph position="0"> First, we collected 82 dictionaries between English and a total of 44 distinct foreign languages from a variety of language families, drawing on Internet sources and on print dictionaries that we scanned and processed with OCR.</Paragraph>
    <Paragraph position="1"> [Figure caption, opening missing] ... relationships among 3 words. The derived synonymy relation S holds between fair and blond, and between fair and just. S does not hold between blond and just. We can infer that fair has at least 2 senses and, further, that we can represent them by blond and just.</Paragraph>
    <Paragraph position="2"> [Figure caption, opening missing] ... the aggregate bilingual dictionary provides for partitioning the meanings of fair into distinct senses: blond and just.</Paragraph>
    <Paragraph position="3"> Over 213K distinct English word types were present in a total of 5.5M bilingual dictionary entries, for an average of 26 and a median of 3 foreign entries per English word. Roughly 15K English words had at least 100 foreign entries; over 64K had at least 10 entries.</Paragraph>
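The per-word entry statistics reported in this section (average, median, and the number of English words with at least a given number of foreign entries) could be computed along the following lines. This is a minimal sketch, not the authors' code: it assumes that each (foreign word, English word) pair counts as one entry, and the function name entry_stats, the thresholds, and the toy pairs are all hypothetical.

```python
from collections import Counter
from statistics import mean, median


def entry_stats(pairs, thresholds=(10, 100)):
    """Summarize how many foreign entries each English word has.

    `pairs` is an iterable of (foreign_word, english_word) tuples.
    Assumption: every pair counts as one dictionary entry.
    """
    # Count entries per English word type.
    counts = Counter(english for _foreign, english in pairs)
    per_word = list(counts.values())
    return {
        "word_types": len(per_word),
        "entries": sum(per_word),
        "mean": mean(per_word),
        "median": median(per_word),
        # Number of English words with at least t entries, for each t.
        "at_least": {t: sum(1 for c in per_word if c >= t) for t in thresholds},
    }


# Made-up toy pairs for illustration; not taken from the paper's data.
toy_pairs = [
    ("blond", "fair"), ("juste", "fair"), ("foire", "fair"),
    ("blond", "blond"), ("juste", "just"), ("seulement", "just"),
]
stats = entry_stats(toy_pairs, thresholds=(2, 3))
print(stats)
```

As a quick consistency check on the rounded totals quoted above, 5.5M entries spread over 213K word types gives roughly 25.8 entries per word, which agrees with the reported average of 26. The much lower median of 3 indicates a heavily skewed distribution, with a minority of English words accounting for most entries.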
    <Paragraph position="4"> No complex or hierarchical structure was assumed or used in our input dictionaries. Each was initially parsed into a &quot;lowest common denominator&quot; form: a list of pairs of the form (foreign word, English word). Because bilingual dictionary structure varies widely, and even the availability and compatibility of part-of-speech tags for entries is uncertain, we decided to compile the aggregate resource only from data that could be extracted from every individual dictionary into a universally compatible format. The unique pairs extracted from each dictionary were then converted to 4-tuples of the form &lt;foreign language, dictionary name, foreign word, English word&gt; before being inserted into the final, combined dictionary data set.</Paragraph>
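The conversion from per-dictionary pairs to 4-tuples described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the Entry field names, the helper functions to_entries and combine, the dictionary names, and the sample pairs are all hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One row of the combined data set (field names are illustrative)."""
    foreign_language: str
    dictionary_name: str
    foreign_word: str
    english_word: str


def to_entries(language, dictionary, pairs):
    """Convert one dictionary's (foreign word, English word) pairs to 4-tuples.

    Duplicate pairs within the same dictionary are dropped, since only
    unique pairs are carried forward; first-seen order is preserved.
    """
    unique_pairs = dict.fromkeys(pairs)  # dict keys keep insertion order
    return [Entry(language, dictionary, f, e) for f, e in unique_pairs]


def combine(dictionaries):
    """Concatenate the entries of every dictionary into one list."""
    combined = []
    for language, name, pairs in dictionaries:
        combined.extend(to_entries(language, name, pairs))
    return combined


# Hypothetical dictionary names and made-up pairs, for illustration only.
sources = [
    ("French", "fr_dict_a", [("blond", "fair"), ("juste", "fair"), ("blond", "fair")]),
    ("French", "fr_dict_b", [("juste", "just"), ("juste", "fair")]),
]
combined = combine(sources)
for entry in combined:
    print(entry)
```

In this sketch, a pair that appears in two different dictionaries yields two distinct 4-tuples, because the dictionary name is part of the tuple. Deduplication happens only within a single dictionary, which matches the description of unique pairs being extracted from each dictionary before insertion.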
  </Section>
</Paper>