<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2011">
  <Title>Word-for-Word Glossing with Contextually Similar Words</Title>
  <Section position="3" start_page="0" end_page="79" type="intro">
    <SectionTitle>
2. Resources
</SectionTitle>
    <Paragraph position="0"> The input to our algorithm includes a collocation database (Lin, 1998b) and a corpus-based thesaurus (Lin, 1998a), which are both available on the Internet. In addition, we require a bilingual thesaurus. Below, we briefly describe these resources.</Paragraph>
    <Section position="1" start_page="0" end_page="78" type="sub_section">
      <SectionTitle>
2.1. Collocation database
</SectionTitle>
      <Paragraph position="0"> Given a word w in a dependency relationship (such as subject or object), the collocation database can be used to retrieve the words that occurred in that relationship with w in a large corpus, along with their frequencies. Figure 1 shows excerpts of the entries in the collocation database for the words corporate, duty, and fiduciary. The database contains a total of 11 million unique dependency relationships.</Paragraph>
      <Paragraph position="1"> The term collocation here refers to words that occur in a dependency relationship (rather than the linear proximity of a pair of words).</Paragraph>
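The lookup described above can be sketched as a keyed table. This is a minimal illustration, not the actual database interface of Lin (1998b): the dictionary layout and the function name collocates are assumptions, and the counts are only the excerpts shown in Figure 1.

```python
# Hypothetical in-memory stand-in for the collocation database.
# Keys are (word, relation) pairs; values map each co-occurring word
# to its frequency. Counts are the excerpts given in Figure 1.
COLLOCATIONS = {
    ("duty", "object-of"): {
        "assume": 177, "breach": 111, "carry out": 71, "do": 114,
        "have": 257, "impose": 114, "perform": 151,
    },
    ("duty", "subject-of"): {
        "affect": 4, "apply": 6, "include": 42, "involve": 8,
        "keep": 5, "officer": 22, "protect": 8, "require": 13,
    },
    ("fiduciary", "modifier-of"): {
        "act": 2, "behavior": 1, "breach": 2, "claim": 1, "company": 2,
        "duty": 317, "irresponsibility": 2, "obligation": 32,
        "requirement": 1, "responsibility": 89, "role": 2,
    },
}


def collocates(word, relation, db=COLLOCATIONS):
    """Return (collocate, frequency) pairs, most frequent first.

    Ties are broken alphabetically; an unknown key yields an empty list.
    """
    entry = db.get((word, relation), {})
    return sorted(entry.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(collocates("duty", "object-of")[:3])
    # prints [('have', 257), ('assume', 177), ('perform', 151)]
```

Keying on the (word, relation) pair mirrors the description in the text, where retrieval is always relative to one dependency relationship.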
    </Section>
    <Section position="2" start_page="78" end_page="79" type="sub_section">
      <SectionTitle>
2.2. Corpus-based thesaurus
</SectionTitle>
      <Paragraph position="0"> Using the collocation database, Lin used an unsupervised method to construct a corpus-based thesaurus (Lin, 1998a) consisting of 11839 nouns, 3639 verbs and 5658 adjectives/adverbs. Given a word w, the thesaurus returns a clustered list of similar words of w along with their similarity to w. For example, the clustered similar words of duty are shown in Table 1.</Paragraph>
      <Paragraph position="1"> 2.3. Bilingual thesaurus. Using the corpus-based thesaurus and a bilingual dictionary, we manually constructed a bilingual thesaurus. The entry for a source language word w is constructed by manually associating one or more clusters of similar words of w with each candidate translation of w. We refer to the assigned clusters as Words Associated with a Translation (WAT). For example, Figure 2 shows an excerpt of our English-French bilingual thesaurus for the words account and duty.</Paragraph>
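One way to picture a bilingual thesaurus entry is as a mapping from each candidate translation to its WAT. The sketch below is an assumption about representation only: the nested-dictionary layout and the helper translations_for are hypothetical, the words come from the Figure 2 excerpt, and the helper is a plain membership lookup rather than the translation selection method of the paper.

```python
# Hypothetical representation of bilingual thesaurus entries:
# source word, then candidate translation, then its WAT
# (Words Associated with a Translation). Words are the Figure 2 excerpt.
BILINGUAL_THESAURUS = {
    "account": {
        "compte": {
            "investment", "transaction", "payment", "saving", "money",
            "contract", "Budget", "reserve", "security", "contribution",
            "debt", "property holding",
        },
        "rapport": {
            "report", "statement", "testimony", "card", "story", "record",
            "document", "data", "information", "view", "check", "figure",
            "article", "description", "estimate", "assessment", "number",
            "statistic", "comment", "letter", "picture", "note",
        },
    },
    "duty": {
        "devoir": {
            "responsibility", "obligation", "task", "function", "role",
            "post", "position", "job", "chore", "mission", "assignment",
            "liability",
        },
        "taxe": {
            "tariff", "restriction", "tax", "regulation", "requirement",
            "procedure", "penalty", "quota", "rule",
        },
    },
}


def translations_for(word, related_word, thesaurus=BILINGUAL_THESAURUS):
    """Return the candidate translations of word whose WAT holds related_word."""
    entry = thesaurus.get(word, {})
    return sorted(t for t, wat in entry.items() if related_word in wat)


if __name__ == "__main__":
    print(translations_for("duty", "tax"))         # prints ['taxe']
    print(translations_for("duty", "obligation"))  # prints ['devoir']
```

Only source words with multiple translations need an entry, which is why the table can stay small relative to a full dictionary.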
      <Paragraph position="2"> Although the WAT assignment is a manual process, it is a considerably easier task than providing lexicographic definitions. Also, we only require entries for source language words that have multiple translations. In Section 7, we</Paragraph>
      <Paragraph position="3"> Figure 1. Excerpts of entries in the collocation database for the words corporate, duty, and fiduciary.
        corporate:
          modifier-of: client 196, debt 236, development 179, fee 6, function 16, headquarter 316, IOU 128, levy 3, liability 14, manager 203, market 195, obligation 1, personnel 7, profit 595, responsibility 27, rule 7, staff 113, tax 201, training 2, vice president 231, ...
        duty:
          object-of: assume 177, breach 111, carry out 71, do 114, have 257, impose 114, perform 151, ...
          subject-of: affect 4, apply 6, include 42, involve 8, keep 5, officer 22, protect 8, require 13, ...
          adj-modifier: active 202, additional 46, administrative 44, fiduciary 317, official 66, other 83, ...
        fiduciary:
          modifier-of: act 2, behavior 1, breach 2, claim 1, company 2, duty 317, irresponsibility 2, obligation 32, requirement 1, responsibility 89, role 2, ...
      </Paragraph>
      <Paragraph position="4"> Figure 2. Excerpt of the bilingual thesaurus for the words account and duty.
        account:
          1. compte: investment, transaction, payment, saving, money, contract, Budget, reserve, security, contribution, debt, property holding
          2. rapport: report, statement, testimony, card, story, record, document, data, information, view, check, figure, article, description, estimate, assessment, number, statistic, comment, letter, picture, note, ...
        duty:
          1. devoir: responsibility, obligation, task, function, role, post, position, job, chore, mission, assignment, liability, ...
          2. taxe: tariff, restriction, tax, regulation, requirement, procedure, penalty, quota, rule, ...
      </Paragraph>
      <Paragraph position="5"> 3. Contextually Similar Words</Paragraph>
      <Paragraph position="6"> The contextually similar words of a word w are words similar to the intended meaning of w in its context. Figure 3 gives the data flow diagram for our algorithm for identifying the contextually similar words of w. Data are represented by ovals, external resources by double ovals and processes by rectangles.</Paragraph>
      <Paragraph position="7"> By parsing a sentence with Minipar, we extract the dependency relationships involving w. For each dependency relationship, we retrieve from the collocation database the words that occurred in the same dependency relationship as w. We refer to this set of words as the cohort of w for that dependency relationship. Consider the word duty in the contexts corporate duty and fiduciary duty. The cohort of duty in corporate duty consists of the nouns modified by corporate in Figure 1 (e.g. client, debt, development, ...) and the cohort of duty in fiduciary duty consists of the nouns modified by fiduciary in Figure 1 (e.g. act, behavior, breach, ...).</Paragraph>
      <Paragraph position="8"> Intersecting the set of similar words with the cohort then yields the set of contextually similar words of w. For example, Table 2 shows the contextually similar words of duty in the contexts corporate duty and fiduciary duty. The words in the first row are retrieved by intersecting the words in Table 1 with the nouns modified by corporate in Figure 1. Similarly, the second row represents the intersection of the words in Table 1 and the nouns modified by fiduciary in Figure 1.</Paragraph>
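The intersection step can be illustrated with a few lines of set arithmetic. This is a sketch under stated assumptions: Table 1 is not reproduced in this section, so the similar-word list for duty below borrows the duty clusters excerpted in Figure 2 as a stand-in, the cohorts are the Figure 1 excerpts, and the function name contextually_similar is hypothetical. The output therefore need not match Table 2 exactly.

```python
# Stand-in for the Table 1 similar words of "duty" (assumption: taken
# from the duty clusters excerpted in Figure 2).
SIMILAR_TO_DUTY = {
    "responsibility", "obligation", "task", "function", "role", "post",
    "position", "job", "chore", "mission", "assignment", "liability",
    "tariff", "restriction", "tax", "regulation", "requirement",
    "procedure", "penalty", "quota", "rule",
}

# Cohorts of "duty": nouns modified by each adjective (Figure 1 excerpts).
COHORT_CORPORATE = {
    "client", "debt", "development", "fee", "function", "headquarter",
    "IOU", "levy", "liability", "manager", "market", "obligation",
    "personnel", "profit", "responsibility", "rule", "staff", "tax",
    "training", "vice president",
}
COHORT_FIDUCIARY = {
    "act", "behavior", "breach", "claim", "company", "duty",
    "irresponsibility", "obligation", "requirement", "responsibility",
    "role",
}


def contextually_similar(similar_words, cohort):
    """Intersect the similar words of w with the cohort of w in one context."""
    return sorted(set(similar_words).intersection(cohort))


if __name__ == "__main__":
    print(contextually_similar(SIMILAR_TO_DUTY, COHORT_CORPORATE))
    # prints ['function', 'liability', 'obligation', 'responsibility', 'rule', 'tax']
    print(contextually_similar(SIMILAR_TO_DUTY, COHORT_FIDUCIARY))
    # prints ['obligation', 'requirement', 'responsibility', 'role']
```

On this excerpted data, the corporate context keeps the tax sense words (rule, tax) that the fiduciary context filters out, which is the effect the intersection is meant to achieve.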
      <Paragraph position="9"> The first set of contextually similar words in</Paragraph>
    </Section>
  </Section>
</Paper>