<?xml version="1.0" standalone="yes"?> <Paper uid="X93-1017"> <Title>Tokyo Marine &amp; Fire 17th N Prt NP</Title> <Section position="3" start_page="0" end_page="165" type="metho"> <SectionTitle> METHODOLOGY </SectionTitle> <Paragraph position="0"> The argument outlined in this paper is based upon a discourse analysis of two portions of the entire 1297-article JJV corpus: the 150-article JJV test set and 100 randomly selected development-set articles.</Paragraph> <Paragraph position="1"> In addition, a descriptive analysis was performed on approximately 50 JJV test articles and corresponding template results for varying combinations of the six systems that participated in MUC-5; all six systems, however, were analyzed on a subset of 12 selected articles, or a total of 72 individual template results. The entire descriptive examination is motivated by a desire to understand better the various systems' capabilities in order to make the numerical results more tangible to potential users. The assumption is that one can construct a composite performance-based description for each system derived from the analysis of individual templates, and that the resulting snapshot -- what the system actually does -- will be more comprehensible to users than the theoretical model of a system outlined in a technical summary -- what it should do.</Paragraph> <Paragraph position="2"> Although the discourse analysis has not yielded a full-blown discourse structure for the JJV corpus, the most essential element of the evolving top-down paradigm, the topic sentence, is identified. Any attempt to formulate a complete discourse paradigm for JJV must first deal with this sentence. It contains much information significant in its own right and -- more to the point for data extraction -- relevant to template insertion. 
In fact, most of the time the topic sentence contains all the minimally required data for instantiating and tracking a tie-up relationship.</Paragraph> <Paragraph position="3"> This paper first examines the stereotypical nature of this topic sentence -- hereafter referred to as an article's &quot;Impact Line&quot; -- before moving on to a discussion of the &quot;default&quot; mechanism. The Impact Line prototype operating in conjunction with the instantiation of certain high-percentage set fills (&quot;defaults&quot;) provides a proficient extraction heuristic and a corresponding salubrious quantitative effect upon system performance.</Paragraph> <Paragraph position="4"> JJV DOMAIN AND THE IMPACT LINE The JV application focuses on tracking tie-ups between at least two entities. It is necessary, therefore, 1) to identify the entities engaged in some business activity or development project and 2) to confirm that the arrangement between them is a tie-up relationship. Therefore, for the Impact Line to have any &quot;impact&quot; at all in this application, its prototype should at least contain the information necessary in fulfilling the above criteria.</Paragraph> <Paragraph position="5"> Two definitions of the prototypical Impact Line, version 1 and version 2, are presented below. Version 1 discusses the data items necessary to meet the above-mentioned criteria for generating a tie-up: two entities and the indication of a tie-up. 
In order to show how the structure of this version-1 Impact Line facilitates the identification and extraction of these data items, moreover, the first definition discusses the grammatical role of the Japanese topic marker (は) &quot;wa,&quot; its importance in marking relevant proper nouns in the JJV corpus, and the Impact Line's verbal element.</Paragraph> <Paragraph position="6"> By this definition, 81% of the JJV test set is Impact Line prototypical.</Paragraph> <Paragraph position="7"> Version 2 is a more restrictive definition requiring the presence of two more extractable data elements in the Impact Line in addition to the criteria of version 1. The second definition, therefore, discusses the types and distribution of Impact Line data items. This version of the prototype occurs 65% of the time.</Paragraph> </Section> <Section position="4" start_page="165" end_page="167" type="metho"> <SectionTitle> DEFINITION OF THE PROTOTYPICAL IMPACT LINE (VERSION 1) (1) IMPACT LINE TOPIC MARKER (GRAMMATICAL FORCE) </SectionTitle> <Paragraph position="0"> In the same way that the Impact Line is crucial to developing a complete discourse paradigm for JJV, or perhaps any domain of Japanese newspaper articles [1], any discussion about what constitutes a prototypical Impact Line must start with the Japanese topic marker (は) &quot;wa,&quot; whose role as designator of the Impact Line's grammatical subject is predominant in the JJV test corpus. [Footnote 1: I am just beginning to analyze newspaper &quot;announcement&quot; articles in other domains, such as JME, to see if the Impact Line prototype has validity and can form the basis for a metamodel that is not domain specific.]</Paragraph> <Paragraph position="1"> 
The &quot;wa&quot;-designated subject sets the tone for the Impact Line as the Impact Line does for the JJV article.</Paragraph> <Paragraph position="2"> In Japanese discourse generally, &quot;wa&quot; is a particle that indicates the theme or topic of a sentence and as such often, but not always, corresponds to the subject of the sentence. Perhaps just as often &quot;wa&quot; serves to highlight or topicalize other pieces of information, while the particle &quot;ga&quot; marks the subject. For example: Kono hon wa Ken ga yonda.</Paragraph> <Paragraph position="3"> (Speaking of this book, Ken has read it.) Eigo wa Ken ga umai desu.</Paragraph> <Paragraph position="4"> (With regard to English, Ken is skillful.) The subject Ken is designated by ga and the topic by wa. However, when the subject or agent of the action is also the sentence topic, wa marks the grammatical subject. For example: Ken wa kono hon o yonda.</Paragraph> <Paragraph position="5"> (Speaking of Ken, he read this book.) It is this latter grammatical function of &quot;wa&quot; as the sentence topic and agent-of-action designator that predominates in the JJV test articles. Example 1 below is #2638 from the JJV test set: Co.] announced on the 17th that it has concluded a business tie-up with a large English general insurance company, Commercial Union (headquarters London).</Paragraph> <Paragraph position="6"> Given the grammatical importance of &quot;wa&quot; in indicating the subject of the Impact Line, this function takes on added significance in the JV domain where the identification of tie-up entities in a tie-up relationship triggers the extraction process. The Impact Line topic marker in JJV articles is a reliable designator of proper nouns that are valid tie-up partners to be extracted and inserted into the template. 
In fact, in 117 Impact Lines out of 145 [2] JJV test-set articles (81%), &quot;wa&quot; marks at least one tie-up partner; [3] and this tie-up partner is not simply the Impact Line topic, but the agent of action as well. [Footnote 2: Five of the 150 test-set articles produced a template but not any tie-ups because they were about either sister-city relationships or talks that were broken off. Therefore, the baseline figure that will be used hereafter in discussing the JJV test set is 145.]</Paragraph> <Paragraph position="7"> Furthermore, in 19 instances out of those 117, the topic marker is preceded immediately by two proper nouns designating two principal tie-up partners. Typically the structure will look like Example 2 below:</Paragraph> <Paragraph position="9"> The conjunction と (&quot;to&quot;) binds the two entities IBM Japan and Sumitomo Electric as co-subjects. Alternately this paradigm allows for modifiers before either or both of the Thus far, the prototypical Impact Line can be encapsulated in the following short notation: ..... X where X is a principal tie-up entity and the ellipsis marks allow inclusion of multiple subjects as shown in Examples 2 -- 5. It is important to note, moreover, that whether modifiers precede an ENTITY designate or not, or whether a conjunction is present or not, the topic marker &quot;wa&quot; is preceded immediately -- in the grammatical sense -- by an entity that is a principal tie-up partner. Twenty-one of the 117 &quot;wa&quot;-designated entities are preceded immediately by information about the entity -- such as location -- enclosed in parentheses, rather than the entity name itself. For example: Nikko Securities (hqs. 
Tokyo) は Orthographically this may be misleading, but grammatically the topic marker indicates the entity, not its headquarters location.</Paragraph> <Paragraph position="10"> Therefore, such cases retain their prototypical validity.</Paragraph> </Section> <Section position="5" start_page="167" end_page="168" type="metho"> <SectionTitle> (2) IMPACT LINE TOPIC MARKER (PRACTICAL FORCE) </SectionTitle> <Paragraph position="0"> The Impact Line topic marker exerts a force that extends beyond the scope of a JJV article's first sentence. In instances of ellipsis, which occurs frequently throughout the JJV corpus, the appropriate subject can be supplied by inserting the Impact Line &quot;wa&quot;-designated subject. Article #1747 is a classic that [ ] had concluded a comprehensive business tie-up with Nomura Securities. 2) In the securities area, [ ] already has a tie-up arrangement with Nikko Securities, but in order to meet the diverse needs of [ ] regional customers, [ ] is making up for the lack of securities-related services through tie-ups with several companies .... 4) As far as the tie-up with Nomura is concerned, M &amp; A (company mergers and acquisitions) business is included, and Joyo is poised to move aggressively into this area.</Paragraph> <Paragraph position="1"> Note that the Impact Line subject, Joyo Bank, does not appear again until the fourth sentence, which is the last line of the article. Until it reappears as the subject, it is omitted and one needs to supply a pronoun or proper name -- &quot;it&quot;, &quot;its&quot;, &quot;Joyo&quot; -- in order to read the passage understandably in English. In other words, the heuristic, which states that ellipsis can be filled by the subject marked by the Impact Line topic marker, works quite well here. Admittedly this is an easy case because stylistically Japanese allows ellipsis in a sentence that follows one in which the subject was introduced originally. 
In fact, using the term heuristic for what is a grammatically and stylistically acceptable convention may be inappropriate. However, in numerous other instances when convenience dominates and ellipsis is propagated throughout a text beyond the decent bounds of style, assigning the proper subject is less clear-cut.</Paragraph> <Paragraph position="2"> Particularly troublesome are those cases in which ellipsis continues for several sentences before the introduction of a new subject appropriately designated by another topic marker. Thereafter, the subject -- which one? -- is again omitted, and one must decide between calling upon the proximate &quot;wa&quot;-designated subject or the original Impact Line &quot;wa&quot;-designated agent. When coding or checking 100 of the 150 test-set articles, I noted only one instance (#2111) in which context demanded that the subject of a particularly complex sentence was not the default Impact Line &quot;wa&quot;-designated one. It is, therefore, a powerful heuristic, especially in the JJV corpus where the articles are on average short and the &quot;protagonist&quot; principal tie-up entity is highlighted at the outset by the Impact Line &quot;wa.&quot; The protagonist entity usually announces the tie-up to the public, and in this sense, &quot;has the action&quot; throughout the remainder of the text. In short, when in doubt one should revert to the initial topic subject.</Paragraph> </Section> <Section position="6" start_page="168" end_page="169" type="metho"> <SectionTitle> INVALID USES OF &quot;WA&quot; </SectionTitle> <Paragraph position="0"> Before turning to the Impact Line verbal element and finishing the prototype version-1 definition, the two types of occurrences below help illustrate further the legitimate uses of &quot;wa&quot; by showing what does not qualify as prototypical: 1. In the JJV test set, there are three instances in which the Impact Line topic marker is not preceded by an ENTITY but by a PERSON who is announcing a tie-up. 
The entity name is present as a modifier, e.g., Japan Development Bank's Takahashi Hajime president は Such instances are eliminated from consideration as a prototype because the initial &quot;wa&quot; is not preceded by a principal tie-up partner.</Paragraph> <Paragraph position="1"> 2. In one instance the initial &quot;wa&quot; marks a valid entity for extraction, but it is not a principal tie-up partner; it is the PARENT of one of the principals.</Paragraph> <Paragraph position="2"> (3) IMPACT LINE: OTHER</Paragraph> </Section> <Section position="7" start_page="169" end_page="170" type="metho"> <SectionTitle> REQUISITE ELEMENTS </SectionTitle> <Paragraph position="0"> As mentioned above under GRAMMATICAL FORCE, the JV application tracks tie-up relationships between two or more entities. And it has already been demonstrated that the Impact Line topic marker is a reliable indicator (81% of the JJV test set) of at least one of those entities.</Paragraph> <Paragraph position="1"> The next question is: Does the prototypical Impact Line also contain the other elements required for instantiating a tie-up? That is: 1) Is the name of the other tie-up entity (or entities) present in the Impact Line, and 2) is there any explicit indication that the arrangement between the two entities is in fact a tie-up relationship? 1) Remarkably, there are only seven instances -- over and above the previously cited 117 -- in which an Impact Line would otherwise be considered prototypical except that the other tie-up partner name(s) is not specified until later in the text. In other words, 81% of JJV test-set Impact Lines not only indicate clearly at least one tie-up entity by virtue of the topic marker, but also introduce the name of the other principal partner.</Paragraph> <Paragraph position="2"> 2) In order to confirm that any two or more entities present in the Impact Line are in a tie-up relationship, the Impact Line must state specifically that this is the case. 
The verbal elements at the end of the Impact Line are important to look at, therefore, in determining whether there is a tie-up or not.</Paragraph> <Paragraph position="3"> Typically, Japanese text will stipulate &quot;teikei,&quot; which is the most frequent term for tie-up, but will also use other phrases that are either synonymous or describe an arrangement or activity that presupposes a tie-up, such as: (agreed to join) (signed contract to establish JV company) (announced the formalization of an R&amp;D contract) All of the previously judged 117 prototypical instances meet this standard, and not surprisingly, given the formulaic nature of the camp. hqs. London with</Paragraph> <Paragraph position="5"> Example 1 is reprised above to review the elements of a prototypical Impact Line. It must contain all the elements required by a valid tie-up. Therefore, the Impact Line must state that there is a tie-up (or was, in the case of dissolution) between at least two entities who are named; more if the partnership so stipulates. [4] Furthermore, at least one of the named tie-up entities -- the &quot;protagonist&quot; -- must be followed immediately by the topic marker onerous burden for a prototypical structure to bear. But it is the discourse nature of Impact Lines in the JJV domain to be replete with pertinent information, much of it suitable for extraction. In view of the fact that the Impact Line introduces much data at the outset of an article, a more restrictive definition (version 2) requiring the Impact Line to contain additional extractable data items is presented below.</Paragraph> </Section> <Section position="8" start_page="170" end_page="170" type="metho"> <SectionTitle> DEFINITION OF PROTOTYPICAL IMPACT LINE (VERSION 2) </SectionTitle> <Paragraph position="0"> The definition of version 2 requires [Footnote 4: Two articles with 3 tie-up partners and one with 4 are included in the 117 prototypical cases.] 
the presence of two extractable data items in the Impact Line in addition to the minimum criteria of version 1. As the Impact Line in Example 1 above shows, a valid tie-up relationship exists between Tokyo Marine &amp; Fire and Commercial Union.</Paragraph> <Paragraph position="1"> Moreover, the statement presents two additional pieces of information that are relevant for extraction: Commercial Union is an English company (NATIONALITY) and its headquarters is in London (ENTITY LOCATION). One is also told that Commercial Union is, indeed, a company (ENTITY TYPE), but this is considered less an item that is extracted discretely than one that follows automatically from the identification of the entity itself.</Paragraph> <Paragraph position="2"> This slot will be discussed later as a &quot;default&quot; fill.</Paragraph> <Paragraph position="3"> The types of extractable data items that occur in the 117 prototypical</Paragraph> </Section> <Section position="9" start_page="170" end_page="170" type="metho"> <SectionTitle> CHILD COMPANY (11), ECONOMIC ACTIVITY SITE (9), INVESTMENT (1), FACILITY NAME (1), FACILITY LOCATION </SectionTitle> <Paragraph position="0"> (1), and JV COMPANY (1).</Paragraph> <Paragraph position="1"> The *-marked slots indicate that when these particular data items appear in a JJV test-set article, they are more apt to appear in the Impact Line than in the remainder of the text. For example, ENTITY LOCATION information occurs in the Impact Line in 79 cases out of a total of 118 instantiations in the JJV test set, or 67% for the JJV test corpus; the percentages for</Paragraph> </Section> <Section position="10" start_page="170" end_page="171" type="metho"> <SectionTitle> PERSON NAME, PERSON ENTITY AFFILIATION, PERSON POSITION, AND </SectionTitle> <Paragraph position="0"> NATIONALITY are 59%, 53%, 53%, and 44% respectively. 
There are, moreover, orthographic consistencies in the textual presentation of certain information that should be noted: All but three of the 79 ENTITY LOCATION items are enclosed in parentheses; all but six for the ALIAS; and all of the PERSON NAME, POSITION, ENTITY AFFILIATION data. Viewed another way, out of 117 version-1 prototypical Impact Lines, eight have no additional data items; 15 have just one; 27 have two; 19 have three; 17 have four; and 31 Impact Lines have five or more data items. In other words, if the version-2 definition of a prototypical Impact Line were to require the presence of two additional data elements, such as NATIONALITY and ENTITY LOCATION as in the case of Example 1 above, then there are 94 instances (117 minus the 23 that have fewer than two additional items) out of the 145-article JJV test corpus that qualify, or 65% of the JJV test corpus. Viewed from either version of the Impact Line prototype, articles in the JJV test corpus possess at the outset a wealth of potential information for the extraction task -- 81% in its most lenient interpretation and 65% in its more restrictive.</Paragraph> <Paragraph position="1"> Two Impact Line examples from the JJV test corpus are given below to highlight the requirements of the On the 21st, Asahi Beer announced the decision that it will do the licensed production and selling</Paragraph> </Section> <Section position="11" start_page="171" end_page="174" type="metho"> <SectionTitle> TEMPLATE DEFAULTS </SectionTitle> <Paragraph position="0"> Given the fact that the JJV topic sentence is stereotypical in both the amount of data contained (magnitude) and the way in which it is presented (Impact Line prototype), how this discourse structure might jump-start a system by providing top-level information which can be propagated throughout the template is examined next. One needs to discuss first, however, the notion of template &quot;default&quot; fills. 
Default fills can be classified as either de jure, de facto, or logical. De jure defaults include the top-level or TEMPLATE OBJECT fills, such as the DOC-NR, DOC-DATE and DOC-SOURCE, whose slots are filled by SGML-tagged data items.</Paragraph> <Paragraph position="1"> They are what one might call &quot;gimmes&quot; by design and, therefore, are not incorporated in the scoring algorithm that measures system performance. The de facto and logical defaults need some explanation.</Paragraph> <Paragraph position="2"> De facto defaults correspond to those set fills instantiated with a very high percentage of one type of data. Judging by actual systems' output and the patterns of certain answer-key template fills, no one will dispute that, in the end, data fell out of text into some set fills at a much higher frequency than was intuited originally when the template was being designed. [5] Below is a snapshot of high-percentage JJV test-set set fills. (The second figure represents percentages for 100 randomly selected development-set articles.) [Footnote 5: Some of the distinctions that were made at design time over the course of processing approximately 50 articles became blurred unavoidably as the fill rules evolved. Therefore, the initial random distribution between, e.g., the ENTITY TYPE set fills of COMPANY, GOVERNMENT, INDIVIDUAL, and OTHER became lopsided in favor of COMPANY.] Given these percentages, how did the systems actually perform? Is there any indication that these de facto default fills were instantiated? The figures below seem to offer evidence for this. Every system evaluated on the TIPSTER JJV test corpus for MUC-5 showed substantially lower error rates for each of the above set fills versus their overall (All-Objects) error scores.</Paragraph> <Paragraph position="3"> [Table residue: SYS- TIE- ENTI- REL- ER OVER-] The descriptive analysis of the 12 templates mentioned above in METHODOLOGY shows a similarly distinctive trend in actual systems' output. 
The 12 templates were not randomly selected: All of them meet the version-1 definition for the Impact Line prototype, and only four do not meet the restrictive one; six articles are short -- six lines or less in length; one article specifies three principal tie-up partners in the Impact Line rather than the usual two; two articles contain multiple tie-ups rather than the usual (84% of JJV test corpus) one tie-up; one article specifically mentions the formation of a JV company in the Impact Line; two Impact Lines introduce a principal tie-up entity marked by the topic marker &quot;wa&quot; that is clausally modified by the name of its parent company; and one article's Impact Line marks two tie-up entities. In short, whenever a correct ENTITY was instantiated by any system, the above-mentioned default fills cascaded throughout the template, even if -- practically speaking -- the resulting fills indicated that a lone COMPANY was in a CURRENT PARTNER relationship with itself.</Paragraph> <Paragraph position="4"> The discussion of article 1528 below shows such an instance of this.</Paragraph> <Paragraph position="5"> Other template fills can be regarded as logical defaults, or those that are a logical consequence of the template object-oriented design. If the keyword &quot;teikei&quot; confirms that there is a tie-up and its status is, as mentioned above, EXISTING, then obviously the template has a tie-up event; i.e., a TIE-UP OBJECT must be instantiated to accommodate the extraction of such information as TIE-UP STATUS, ENTITY, etc.</Paragraph> <Paragraph position="6"> Similarly, if there is a tie-up event and two entities are in a relationship defined as PARTNER, then obviously there is an ENTITY RELATIONSHIP. If there is an INDUSTRY TYPE identified, there must be an ECONOMIC ACTIVITY OBJECT to accommodate the INDUSTRY OBJECT, which in turn accommodates the INDUSTRY TYPE. 
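As a minimal sketch of this containment reasoning (not taken from any MUC-5 system; the `PARENT` table and the function name `implied_objects` are hypothetical and cover only the links named in the text), each fill can be mapped to the object that must exist to hold it, and the chain followed upward:

```python
# Hypothetical containment links, limited to those named in the text.
# This is an illustrative assumption, not the official template definition.
PARENT = {
    "INDUSTRY TYPE": "INDUSTRY",
    "INDUSTRY": "ECONOMIC ACTIVITY",
    "TIE-UP STATUS": "TIE-UP",
    "ENTITY": "TIE-UP",
}


def implied_objects(fill):
    """Return the container objects that a single fill presupposes, innermost first."""
    chain = []
    current = PARENT.get(fill)
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


if __name__ == "__main__":
    print(implied_objects("INDUSTRY TYPE"))
    print(implied_objects("TIE-UP STATUS"))
```

A table-driven lookup is used here only because it keeps the assumed links visible in one place; a real template definition would supply them.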
The template structure and other logical effects for inserting extracted data items into it will be outlined further below in the discussion of #1528.</Paragraph> </Section> <Section position="12" start_page="174" end_page="175" type="metho"> <SectionTitle> THE COMBINED EFFECTS OF PROTOTYPICAL DISCOURSE AND THE DEFAULT MECHANISM </SectionTitle> <Paragraph position="0"> To illustrate the potential effects that stereotypical JJV discourse structure has on template fills and overall performance when the de facto defaults are considered as well, the example of article #1528 is submitted below.</Paragraph> </Section> <Section position="13" start_page="175" end_page="175" type="metho"> <SectionTitle> ENTITY RELATIONSHIP STATUS, ECONOMIC </SectionTitle> <Paragraph position="0"> ACTIVITY, etc., there are a total of 47 possible fills that are scored.</Paragraph> </Section> <Section position="14" start_page="175" end_page="175" type="metho"> <SectionTitle> SYSTEM I: MINIMUM CASE SCENARIO </SectionTitle> <Paragraph position="0"> Given the plethora of data items in the Impact Line and its prototypical structure, minimally a system should be able to identify and extract an ENTITY NAME (Shiseido) by the topic marker &quot;wa&quot; because this element of the Impact Line is the most consistent part of the prototype.</Paragraph> <Paragraph position="1"> Suppose, moreover, a system confirms the existence of a tie-up event (CONTENT) by identifying the keyword &quot;teikei,&quot; which is another consistent element of the Impact Line prototype, and one other data item from the Impact Line such as the INDUSTRY TYPE SALES, which also has a keyword associated with it, &quot;hanbai.&quot; This system would have in effect identified and extracted three data items from the Impact Line. 
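The minimum case just described might be sketched as follows. This is an illustration only: it assumes the Impact Line has already been romanized and split into tokens, it ignores parenthesized material between the entity and the topic marker, and the function name `minimum_case_extract` and the sample token list are hypothetical rather than taken from article #1528.

```python
def minimum_case_extract(tokens):
    """Extract up to three fills from a romanized, pre-tokenized Impact Line."""
    fills = {}
    # The token immediately before the first topic marker "wa" is taken as the entity.
    if "wa" in tokens:
        idx = tokens.index("wa")
        if idx > 0:
            fills["ENTITY NAME"] = tokens[idx - 1]
    # The keyword "teikei" confirms that a tie-up event is reported.
    if "teikei" in tokens:
        fills["TIE-UP"] = True
    # The keyword "hanbai" signals the INDUSTRY TYPE SALES.
    if "hanbai" in tokens:
        fills["INDUSTRY TYPE"] = "SALES"
    return fills


if __name__ == "__main__":
    # Made-up token list for illustration only; PARTNER stands in for a second entity.
    sample = ["Shiseido", "wa", "PARTNER", "to", "hanbai", "teikei"]
    print(minimum_case_extract(sample))
```
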
The default instantiations associated with the extraction of these items would be: TIE-UP STATUS (EXISTING), the named ENTITY (is a constituent of the TIE-UP), ENTITY TYPE (COMPANY), an ENTITY RELATIONSHIP, the named ENTITY (is a constituent of the ER), an ECONOMIC</Paragraph> </Section> </Paper>