<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3007"> <Title>Exploiting Aggregate Properties of Bilingual Dictionaries For Distinguishing Senses of English Words and Inducing English Sense Clusters</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Resources </SectionTitle> <Paragraph position="0"> First, we collected 82 dictionaries between English and a total of 44 distinct foreign languages from a variety of language families, both from Internet sources and by scanning print dictionaries and running OCR on them.</Paragraph> <Paragraph position="1"> Over 213K distinct English word types were present in a total of 5.5M bilingual dictionary entries, for an average of 26 and a median of 3 foreign entries per English word. Roughly 15K English words had at least 100 foreign entries; over 64K had at least 10 entries.</Paragraph> <Paragraph position="2"> [Figure caption] ... relationships among 3 words. The derived synonymy relation S holds between fair and blond, and between fair and just. S does not hold between blond and just. We can infer that fair has at least 2 senses and, further, that we can represent them by blond and just.</Paragraph> <Paragraph position="3"> [Figure caption] ... the aggregate bilingual dictionary provides for partitioning the meanings of fair into distinct senses: blond and just.</Paragraph> <Paragraph position="4"> No complex or hierarchical structure was assumed or used in our input dictionaries. Each was initially parsed into the &quot;lowest common denominator&quot; form: a list of pairs of the form (foreign word, English word). Because bilingual dictionary structure varies widely, and even the availability and compatibility of part-of-speech tags for entries is uncertain, we compiled the aggregate resource only from data that could be extracted from every individual dictionary into a universally compatible format.
The unique pairs extracted from each dictionary were then converted to 4-tuples of the form &lt;foreign language, dictionary name, foreign word, English word&gt; before being inserted into the final, combined dictionary data set.</Paragraph> </Section> </Paper>
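
The conversion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Entry` field names, the helper functions, the dictionary names (`fr-demo`, `es-demo`), and the toy word pairs are all assumptions made for illustration. It also assumes that "foreign entries per English word" counts 4-tuples across all dictionaries, which the text does not state explicitly.

```python
from collections import Counter
from statistics import mean, median
from typing import Iterable, List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One row of the combined data set (field names are hypothetical)."""
    foreign_language: str
    dictionary_name: str
    foreign_word: str
    english_word: str


def to_entries(language: str, dictionary_name: str,
               pairs: Iterable[Tuple[str, str]]) -> List[Entry]:
    """Convert (foreign word, English word) pairs from one dictionary
    into unique 4-tuples, dropping duplicate pairs."""
    unique_pairs = sorted(set(pairs))
    return [Entry(language, dictionary_name, foreign, english)
            for foreign, english in unique_pairs]


def entries_per_english_word(entries: Iterable[Entry]) -> Counter:
    """Count how many foreign entries each English word has
    (assumed here to be counted across all dictionaries)."""
    return Counter(entry.english_word for entry in entries)


# Toy input, invented for illustration; the first list contains a duplicate pair.
french = [("blond", "fair"), ("juste", "fair"), ("juste", "just"), ("blond", "fair")]
spanish = [("rubio", "fair"), ("rubio", "blond"), ("justo", "just")]

combined = (to_entries("French", "fr-demo", french)
            + to_entries("Spanish", "es-demo", spanish))
counts = entries_per_english_word(combined)

print(len(combined), dict(counts))
print(mean(counts.values()), median(counts.values()))
```

Keeping the language and dictionary name in every tuple means the combined set can still be filtered or grouped by source after merging, while the flat pair format makes no demands on the structure of any individual dictionary.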