<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1127">
  <Title>Cross-lingual Information Extraction System Evaluation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Query-Driven Information Extraction
</SectionTitle>
    <Paragraph position="0"> One approach to IE portability is to have a system that takes the description of the event type from the user as input and acquires extraction patterns for the given scenario. Throughout the paper, we call this kind of IE system a QDIE (Query-Driven Information Extraction) system; its typical procedure is illustrated in Figure 1.</Paragraph>
    <Paragraph position="1"> QDIE (e.g. (Sudo et al., 2003a)) consists of three phases to learn extraction patterns from the source documents for a scenario specified by the user.</Paragraph>
    <Paragraph position="2"> First, it applies morphological analysis, dependency parsing and Named Entity (NE) tagging to the entire source document set, and converts all the sentences in the source document set into dependency trees. The NE tagging replaces named entities by their class, so the resulting dependency trees contain some NE class names as leaf nodes. This is crucial to identifying common patterns, and to applying these patterns to new text.</Paragraph>
    <Paragraph position="3"> Second, the user provides a set of narrative sentences describing the scenario (the events of interest). Using these sentences as a retrieval query, the information retrieval component of QDIE retrieves representative documents of the scenario specified by the user (relevant documents).</Paragraph>
    <Paragraph position="4"> Then from among all the possible connected sub-trees of all the sentences in the relevant documents, the system calculates the score for each pattern candidate. The scoring function is based on TF/IDF scoring in IR literature; a pattern is more relevant when it appears more in the relevant documents and less across the entire collection of source documents. The final output is the ordered list of pattern candidates.</Paragraph>
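    <Paragraph> As a rough illustration of the TF/IDF-style ranking described above, the following Python sketch scores each pattern candidate by its frequency in the relevant documents multiplied by the log inverse document frequency over the whole collection. The exact scoring function of QDIE is not given in this section, so the formula, the function name rank_patterns, and the representation of patterns as plain strings are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def rank_patterns(
    relevant_docs: List[List[str]],
    all_docs: List[List[str]],
) -> List[Tuple[str, float]]:
    """Rank pattern candidates with a TF/IDF-style score (illustrative only).

    Each document is a list of the pattern candidates (as strings) found in it.
    relevant_docs is assumed to be a subset of all_docs.
    """
    n_docs = len(all_docs)

    # Document frequency: in how many documents of the whole collection
    # does each pattern occur at least once?
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))

    # Term frequency: how often does each pattern occur in the relevant documents?
    term_freq = Counter()
    for doc in relevant_docs:
        term_freq.update(doc)

    # Hypothetical score: frequent in relevant documents, rare in the collection.
    scores = {
        pattern: term_freq[pattern] * math.log(n_docs / doc_freq[pattern])
        for pattern in term_freq
        if doc_freq[pattern] > 0
    }

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Under these assumptions, a pattern that is frequent in the relevant documents but rare in the collection rises to the top of the ordered list.</Paragraph>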
    <Paragraph position="5"> Note that a pattern candidate contains at least one NE, so that it can be used to match a portion of a sentence which contains an instance of the same NE type. The matched NE instance is then extracted.</Paragraph>
    <Paragraph position="6"> The pattern candidates may be simple predicate-argument structures (e.g. (resign from &lt;C-POST&gt;) in business domain) or even a complicated subtree of a sentence which commonly appears in the relevant documents (e.g. (&lt;C-ORG&gt; report personnel affair (that &lt;C-PERSON&gt; resigns))).</Paragraph>
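    <Paragraph> To make the matching step concrete, here is a minimal Python sketch of how a pattern containing an NE class could be matched against a dependency tree so that the NE instance is extracted. The Node class, the tree shapes, and the convention that NE class labels start with "C-" are assumptions for illustration; the actual QDIE data structures are not described in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                       # word, or an NE class name such as "C-POST"
    text: str = ""                   # surface string of the NE instance, if any
    children: List["Node"] = field(default_factory=list)


def match(pattern: Node, tree: Node) -> Optional[List[Tuple[str, str]]]:
    """Match pattern with its root placed on the root of tree.

    Returns the (NE class, surface text) pairs covered by the pattern,
    or None if the pattern does not match at this position.
    """
    if pattern.label != tree.label:
        return None
    found = []
    if tree.label.startswith("C-"):  # assumed naming convention for NE classes
        found.append((tree.label, tree.text))
    # Every child of the pattern must match some child of the tree node.
    for p_child in pattern.children:
        for t_child in tree.children:
            sub = match(p_child, t_child)
            if sub is not None:
                found.extend(sub)
                break
        else:
            return None
    return found


def extract(pattern: Node, tree: Node) -> List[Tuple[str, str]]:
    """Try the pattern at every node of the tree and collect the extracted NEs."""
    results = []
    hit = match(pattern, tree)
    if hit is not None:
        results.extend(hit)
    for child in tree.children:
        results.extend(extract(pattern, child))
    return results
```

In this sketch, a pattern such as (resign (from C-POST)) applied to a sentence tree rooted at "resign" would return the post filling the C-POST slot, while NE nodes outside the pattern are ignored.</Paragraph>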
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Cross-lingual Information Extraction
</SectionTitle>
    <Paragraph position="0"> (Riloff et al., 2002) present several approaches to cross-lingual information extraction (CLIE). They describe the use of &quot;cross-language projection&quot; for CLIE, exploiting the word alignment of documents in one language and the same documents translated into a different language by a machine translation (MT) system. They conducted experiments between two relatively close languages, English and French.</Paragraph>
    <Paragraph position="1"> In the experiment reported here, we will explore CLIE for two more disparate languages, English and Japanese.</Paragraph>
    <Paragraph position="2"> The QDIE system can be used in a cross-lingual setting, and thus, the resulting cross-lingual version of the QDIE system can minimize the requirement of the user's knowing the source language. Figure 2 shows two possible ways to achieve this goal.</Paragraph>
    <Paragraph position="3"> It may be realized by translating all the documents of the source language into the target language, and then running the monolingual version of the QDIE system for the target language (Translation-based QDIE). In our experiment, we translated all the source Japanese documents into English. Then we ran the English-QDIE system to get the extraction patterns, which are used to extract the entities by pattern matching.</Paragraph>
    <Paragraph position="4"> On the other hand, one can first translate the scenario description into the source language and use it for the monolingual QDIE system for the source language, assuming that we have access to the tools for pattern acquisition in the source language. Each entity in the extracted table is translated into the target language (Crosslingual-QDIE). In Figure 2, we implemented this procedure by first translating the English query into Japanese (see footnote 1). Then we ran the Japanese-QDIE system to identify Japanese extraction patterns. The extraction patterns are used to extract items to fill the Japanese table. Finally, each item in the extracted table is separately translated into English. Note that translating names is easier than translating whole sentences.</Paragraph>
    <Paragraph position="5"> As we shall demonstrate, the errors introduced by the MT system impose a significant cost in extraction performance both in accuracy and coverage of the target event. However, if basic linguistic analysis tools are available for the source language, it is possible to boost CLIE performance by learning patterns in the source language. In the next section, we describe an experiment which compares these two approaches. In the following section, we assess the difficulty of learning extraction patterns from the translated source language document set caused by the errors of the MT system and/or the differences of grammatical structure of the translated sentences.</Paragraph>
    <Paragraph position="6"> We address specifically: 1. The accuracy of NE tagging on MT-ed source documents and the use of cross-language projection. 2. How the structural difference in source and target language affects the extracted patterns. 3. The reduced frequency of the extracted patterns, which makes it difficult for any measurement of pattern relevance to distinguish the effective patterns of low frequency from the noise patterns.</Paragraph>
    <Paragraph position="7"> Footnote 1: Note that our current implementation uses the output from query translation by the MT system. As we note in Section 7, we plan to investigate the possibility of additional performance gain by using current crosslingual information retrieval techniques. [Figure 2 caption fragment: ... the source document (Japanese) and the target extracted table (English) are highlighted.]</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> To evaluate the relevance of extraction patterns automatically learned for CLIE, we conducted experiments for the Translation-based QDIE system and the Cross-lingual QDIE system on the entity extraction task, which is to identify all the entities participating in relevant events in a given set of Japanese texts.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Experimental Setting
</SectionTitle>
      <Paragraph position="0"> Since general NE taggers either are trained on English sentences or use manually created rules for English sentences, the deterioration of NE tagger's performance cannot be avoided if it is applied to the MT-ed English sentences. This causes the Translation-based QDIE system to identify fewer pattern candidates from the relevant documents since a pattern candidate must contain at least one of the NE types.</Paragraph>
      <Paragraph position="1"> To remedy this problem, we incorporated &quot;cross-language projection&quot; (Riloff et al., 2002) only for Named Entities. We used word alignment obtained by using Giza++ (Och and Ney, 2003) to get names in the English translation from names in the original Japanese sentences. Note that it is extremely difficult to make an alignment of case markers where one language explicitly renders a marker as a word and the other does not. So, direct application of (Riloff et al., 2002) is not suitable for this experiment. We compare the following three systems in this experiment.</Paragraph>
      <Paragraph position="2"> 1. Crosslingual QDIE system 2. Translation-based QDIE system with word alignment 3. Translation-based QDIE system without word alignment</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Data
</SectionTitle>
      <Paragraph position="0"> The scenario for this experiment is the Management Succession scenario of MUC-6 (muc, 1995), in which corporate managers assumed and/or left their posts.</Paragraph>
      <Paragraph position="1"> We used a much simpler template structure than the one used in MUC-6, with Person, Organization, and Post slots. To assess system performance, we measure the accuracy of the system at identifying the participating entities in a management succession event. This task does not involve grouping entities associated with the same event into a single template, in order to avoid possible effects of merging failure on extraction performance for entities.</Paragraph>
      <Paragraph position="2"> The source document set from which the extraction patterns are learned consists of 132,996 Yomiuri Newspaper articles from 1998. For our Crosslingual QDIE system, all the documents are morphologically analyzed by JUMAN (Kurohashi, 1997) and converted into dependency trees by KNP (Kurohashi and Nagao, 1994). For the Translation-based QDIE system, all the documents are translated into English by a commercial machine translation system (IBM &quot;King of Translation&quot;), and converted into dependency trees by a corpus-based parser. We retrieved 1500 documents as relevant documents.</Paragraph>
      <Paragraph position="3"> We accumulated the test set of documents by a simple keyword search. The test set consists of 100 Yomiuri Newspaper articles from 1999, out of which only 61 articles contain at least one management succession event. Note that all NEs in the test documents, both in the original Japanese and in the translated English sentences, were identified manually, so that the task measures only how well extraction patterns can distinguish the participating entities from the entities that are not related to any succession events. Table 1 shows the details of the test data.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> Each pattern acquisition system outputs a list of the pattern candidates ordered by the ranking function.</Paragraph>
      <Paragraph position="1"> The resulting performance is shown as a precision-recall graph for each subset of top-n ranked patterns, where n ranges from 1 to the number of pattern candidates. The parameters for each system are tuned to maximize the performance on separate validation data.</Paragraph>
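      <Paragraph> The evaluation over growing subsets of top-ranked patterns can be sketched as follows. This Python example is an illustration under stated assumptions, not the evaluation code used in the experiment: the function name precision_recall_curve is hypothetical, entities are represented as hashable identifiers, and an extracted entity counts as correct when it appears in the gold set.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall_curve(
    extractions_per_pattern: Sequence[Set[str]],
    gold: Set[str],
) -> List[Tuple[int, float, float]]:
    """Compute (n, precision, recall) for the top-n patterns, n = 1 .. N.

    extractions_per_pattern[i] holds the entities extracted by the pattern
    at rank i + 1; gold holds the entities that truly participate in
    relevant events.
    """
    points = []
    extracted: Set[str] = set()
    for n, entities in enumerate(extractions_per_pattern, start=1):
        # The top-n system extracts everything matched by any of its n patterns.
        extracted |= set(entities)
        correct = len(extracted.intersection(gold))
        precision = correct / len(extracted) if extracted else 0.0
        recall = correct / len(gold) if gold else 0.0
        points.append((n, precision, recall))
    return points
```

Plotting recall against precision for the returned points gives one curve per system, which is the kind of graph the comparison relies on.</Paragraph>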
      <Paragraph position="2"> The association of NE classes in the matched patterns and slots in the template is made automatically; Person, Organization, Post (slots) correspond to C-PERSON, C-ORG, C-POST (NE-classes), respectively, in the Management Succession scenario. Figure 3 shows the precision-recall curve for the top 1000 patterns acquired by each system on the entity extraction task. Crosslingual QDIE system reaches a maximum recall of 60%, which is significantly better than Translation-based QDIE with word alignment (52%) and Translation-based QDIE without word alignment (41%). Within the high recall range, Crosslingual QDIE system generally had  better precision at the same recall than Translation-based QDIE systems. At the low recall range (a14 a15a17a16a19a18 ), the performance is rather noisy.</Paragraph>
      <Paragraph position="3"> Translation-based QDIE without word alignment performs similarly to Translation-based QDIE with word alignment up to its maximum recall (41%). Translation-based QDIE with word alignment reached a maximum recall about 10 percentage points higher (52%).</Paragraph>
      <Paragraph position="4"> 5 Problems in Translation  The detailed analysis of the results revealed the effect of several problems caused by the MT system. The output of the current off-the-shelf MT system proved difficult to use as a source of extraction patterns. In this section we discuss the types of differences between the source and target languages, and their effect on pattern discovery.</Paragraph>
      <Paragraph position="5"> Lexical differences Abbreviations in the source language may not have their corresponding short form in the target language. For example, &quot;Kei-Dan-Ren&quot; is an abbreviation of &quot;Keizai Dantai Rengo-kai&quot; which is an organization whose English translation is &quot;Japan Federation of Economic Organizations&quot;. Such abbreviations may not be listed in the dictionary of the MT system. In such cases, the literal translation of the abbreviation may be difficult to recognize as a name and is likely to be treated as a common noun phrase.</Paragraph>
      <Paragraph position="6"> Structural differences Some phrases in the source language may have more than one relevant translation. Depending upon the context where a phrase appears, the MT system has to choose one among the possible translations. Moreover, the MT system may make a mistake, of course, and output an erroneous translation. This results in a diverse distribution of extraction patterns in the target language. Figure 4 shows an example of such a case. Suppose an extraction pattern ((&lt;C-POST&gt;-ni) shuninsuru) appears 20 times in the original Japanese document set, out of which</Paragraph>
      <Paragraph position="8"> Translation: The translation of a Japanese expression into several different English expressions, including erroneous ones.</Paragraph>
      <Paragraph position="9"> Figure 5 shows an example of the case where the context around the name did not seem to be translated properly, so the dependency tree for the sentence was not correct. The right translation is &quot;Okajima announced that President Hiroyuki Okajima, 40 years old, resigned formally ...&quot; which results in the dependency between the main verb &quot;announce&quot; and the company &quot;Okajima&quot;. The translation shown in Figure 5 not only shows incorrect word-translations, but also shows ungrammatical structure, including too many relative clauses. The structural error causes the errors in the dependency parse tree including having &quot;end&quot; as a root of the entire tree and the wrong dependency from &quot;announced&quot; to &quot;the major department&quot; in Figure 5. Thus, the accumulation of the errors resulted in missing the organization name &quot;Okajima&quot;.</Paragraph>
      <Paragraph position="10"> Also, the conjunctions in Japanese sentences could not be translated properly, and therefore, the English dependency parser's output is significantly deteriorated. The example in Figure 6 shows the case where both &quot;Mr. Suzuki&quot; and &quot;Mr. Asada&quot; were inaugurated. In the original Japanese sentence, &quot;Mr. Suzuki&quot; is closer to the verb &quot;be inaugurated&quot;. So, it seems that the MT system tries to find another verb for &quot;Mr. Asada&quot;, and attaches it (incorrectly) to &quot;unofficially arranged&quot;.</Paragraph>
      <Paragraph position="11"> Out-of-Vocabulary Words The MT system may not have a word in the source language dictionary, in which case some MT systems output it in the original script in the source language. This happens not only for names but also for sentences which are erroneously segmented into words. Such problems, of course, may make it hard to detect Named Entities and get a correct dependency tree of the sentence.</Paragraph>
      <Paragraph position="12"> However, translation of names is easier than translation of contexts; the MT system can output the transliteration of an unknown word. In fact, name translation of the MT system we used for this experiment is better than the sentence translation of the same MT system. The names appropriately extracted from Japanese documents by the Crosslingual QDIE system, in most cases, are correctly translated or transliterated if no equivalent translation exists.</Paragraph>
    </Section>
  </Section>
</Paper>