File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/e06-1029_intro.xml
Size: 5,509 bytes
Last Modified: 2025-10-06 14:03:18
<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1029"> <Title>Compiling French-Japanese Terminologies from the Web</Title> <Section position="3" start_page="0" end_page="2" type="intro"> <SectionTitle> 2 Related Term Collection </SectionTitle> <Paragraph position="0"> Given a translation pair of seed terms (s_f, s_j</Paragraph> <Paragraph position="2"> ), we use a search engine to gather a set F of French terms related to s_f, and a set J of Japanese terms related to s_j. The methods applied to both languages follow the framework proposed by Sato and Sasaki (2003), outlined in Figure 1. We proceed in three steps: corpus collection, automatic term recognition (ATR), and filtering.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Corpus Collection </SectionTitle> <Paragraph position="0"> For each language, we collect a corpus C from web pages by selecting passages that contain the seed.</Paragraph> <Paragraph position="1"> Web page collection. In French, we use Google to find relevant web pages by entering the following three queries: are). In Japanese, we do the same with the queries &quot;s_j toha&quot;, &quot;s_j ha&quot;, &quot;s_j toiu&quot;, and &quot;s_j</Paragraph> <Paragraph position="3"> no&quot;, where toha, ha, toiu, and no are Japanese function words that are often used for defining or explaining a term. We retrieve the top pages for each query, and parse those pages looking for hyperlinks whose anchor text contains the seed. If such links exist, we retrieve the linked pages as well.</Paragraph> <Paragraph position="4"> Sentence extraction. From the retrieved web pages, we remove HTML tags and other noise. 
Then, we keep only properly structured sentences containing the seed, as well as the preceding and following sentences; that is, we use a window of three sentences around the seed.</Paragraph> </Section> <Section position="2" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 2.2 Automatic Term Recognition </SectionTitle> <Paragraph position="0"> The next step is to extract candidate related terms from the corpus. Because the sentences composing the corpus are related to the seed, the same should be true for the terms they contain. The process of extracting terms is highly language-dependent.</Paragraph> <Paragraph position="1"> French ATR. We use the C-value method (Frantzi and Ananiadou, 2003), which extracts compound terms and ranks them according to their termhood. It consists of a linguistic part followed by a statistical part.</Paragraph> <Paragraph position="2"> The linguistic part applies a linguistic filter that constrains the structure of the extracted terms. We base our filter on a morphosyntactic pattern for French proposed by Daille et al. It defines the structure of multi-word units (MWUs) that are likely to be terms. Although their work focused on MWUs limited to two content words (nouns, adjectives, verbs or adverbs), we extend our filter to MWUs of greater length. The pattern is defined as follows:</Paragraph> <Paragraph position="4"> The statistical part measures the termhood of each compound that matches the linguistic pattern. It is given by the C-value: $\text{C-value}(a) = \log_2 |a| \cdot f(a)$ if $a$ is not nested, and $\text{C-value}(a) = \log_2 |a| \left( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right)$ otherwise, where a is the candidate string, |a| is its length in words, f(a) is its frequency of occurrence in all the web pages retrieved, T_a is the set of extracted candidate terms that contain a, and P(T_a) is the number of these candidate terms.</Paragraph> <Paragraph position="5"> The nature of our variable-length pattern is such that if a long compound matches the pattern, all the shorter compounds it includes also match. 
For example, for a longer compound of the form N-Prep-N-[...], the nested candidate système à base (based system) also matches, although we would prefer not to extract it.</Paragraph> <Paragraph position="6"> Fortunately, the strength of the C-value is that it handles nested multi-word terms (MWTs) effectively. When we calculate the termhood of a string, we subtract from its total frequency its frequency as a substring of longer candidate terms. In other words, a shorter compound that almost always appears nested in a longer compound will have a comparatively smaller C-value, even if its total frequency is higher than that of the longer compound. Hence, we discard any MWT whose C-value is smaller than that of a longer candidate term in which it is nested.</Paragraph> <Paragraph position="7"> Japanese ATR. Because compound nouns represent the bulk of Japanese technical MWTs, we extract them as candidate related terms. Unlike Sato and Sasaki, we ignore single nouns. Also, we do not limit the number of candidates output by ATR, as they did.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.3 Filtering </SectionTitle> <Paragraph position="0"> Finally, from the output set of ATR, we select only the technical terms that are part of the seed's semantic domain. Numerous measures have been proposed to gauge the semantic similarity between two words (van Rijsbergen, 1979). We choose the Jaccard coefficient, which we calculate based on search engine hit counts. The similarity between a seed term s and a candidate term x is given by: $\text{Jaccard}(s, x) = \frac{H(s \wedge x)}{H(s \vee x)}$</Paragraph> <Paragraph position="2"> where $H(s \wedge x)$ is the hit count of pages containing both s and x, and $H(s \vee x)$ is the hit count of pages containing s or x. The latter can be calculated as follows: $H(s \vee x) = H(s) + H(x) - H(s \wedge x)$. Candidates that have a high enough coefficient are considered related terms of the seed.</Paragraph> </Section> </Section> </Paper>
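<!--
The two scoring steps of Sections 2.2 and 2.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard C-value formulation of Frantzi and Ananiadou, candidates represented as tuples of words, and hit counts supplied by the caller. The function names and the toy frequencies are hypothetical.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_value(a, freq):
    """C-value of candidate `a` (a tuple of words).

    `freq` maps every extracted candidate to its frequency f(a).
    Candidates are assumed to be multi-word, so log2 of the length is positive.
    """
    nesting = [b for b in freq if contains(b, a)]  # T_a: longer candidates containing a
    weight = math.log2(len(a))
    if not nesting:
        return weight * freq[a]
    # Subtract the average frequency of the longer candidates that contain a.
    return weight * (freq[a] - sum(freq[b] for b in nesting) / len(nesting))


def jaccard(hits_s, hits_x, hits_both):
    """Jaccard coefficient from hit counts; H(s or x) comes from inclusion-exclusion."""
    hits_either = hits_s + hits_x - hits_both
    return hits_both / hits_either if hits_either > 0 else 0.0


# Hypothetical toy frequencies, not taken from the paper.
freq = {
    ("logiciel", "de", "traitement"): 10,
    ("logiciel", "de", "traitement", "de", "texte"): 8,
}
scores = {a: c_value(a, freq) for a in freq}
```

With these toy frequencies, the nested three-word candidate scores lower than the five-word term that contains it, so it would be discarded under the rule of Section 2.2. The threshold applied to the Jaccard coefficient is left to the caller, since the text only asks for a "high enough" value.
-->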