<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0504">
  <Title>Ontology Population from Textual Mentions: Task Definition and Benchmark</Title>
  <Section position="4" start_page="26" end_page="26" type="metho">
    <SectionTitle>
ACE (Automatic Content Extraction) initiative
</SectionTitle>
    <Paragraph position="0"> (Ferro et al. 2005, Linguistic Data Consortium 2004), which makes the exploitation of machine learning based approaches possible. Finally, having a limited scope with respect to OLP, the OPTM task allows for a better estimation of performance; in particular, it is possible to evaluate more easily the recall of the task, i.e. the proportion of information correctly assigned to an entity out of the total amount of information provided by a certain mention.</Paragraph>
    <Paragraph position="1"> In this paper we both define the OPTM task and describe an OPTM benchmark, i.e. a document collection annotated with mentions together with an ontology into which information from mentions has been manually extracted. The general architecture of the OPTM task has been sketched above and comprises three sub-tasks. The document collection we use consists of about 500 Italian news items. Currently, mentions referring to PERSON, ORGANIZATION and GEO-POLITICAL_ENTITY have been annotated and co-references among such mentions have been established. As for the RCEA sub-task, we have considered mentions referring to PERSON and have built a knowledge base of instances, each described with a number of attribute-value pairs.</Paragraph>
    <Paragraph position="2"> The paper is structured as follows. Section 2 provides the necessary background on mentions and entities. Section 3 defines the OPTM task and introduces the dataset we have used, as well as the annotation procedures and guidelines we have defined for the realization of the OPTM benchmark corpus. Section 4 reports on a number of quantitative and qualitative analyses of the OPTM benchmark aimed at determining the difficulty of the task.</Paragraph>
    <Paragraph position="3"> Finally, Section 5 proposes future extensions and developments of our work.</Paragraph>
  </Section>
  <Section position="5" start_page="26" end_page="27" type="metho">
    <SectionTitle>
2 Mentions and Entities
</SectionTitle>
    <Paragraph position="0"> As indicated in the ACE Entity Detection task, the annotation of entities (e.g. PERSON, ORGANIZATION, LOCATION and GEO-POLITICAL_ENTITY) requires that the entities mentioned in a text be detected, their syntactic head marked, their sense disambiguated, and that selected attributes of these entities be extracted and merged into a unified representation for each entity.</Paragraph>
    <Paragraph position="1"> As it often happens that the same entity is mentioned more than once in the same text, two inter-connected levels of annotation have been defined: the level of the entity, which provides a representation of an object in the world, and the level of the entity mention, which provides information about the textual references to that object. For instance, if the entity GEORGE_W._BUSH (i.e. the individual in the world who is the current president of the U.S.) is mentioned in two different sentences of a text as "the U.S. president" and as "the president", these two expressions are considered as two co-referring entity mentions.</Paragraph>
    <Paragraph position="2"> The kinds of reference made by entities to something in the world are described by the following four classes:
* specific referential entities are those where the entity being referred to is a unique object or set of objects (e.g. "The president of the company is here");
* generic referential entities refer to a kind or type of entity and not to a particular object (or set of objects) in the world (e.g. "The president is elected every 5 years");
* under-specified referential entities are non-generic, non-specific references, including imprecise quantifications (e.g. "everyone") and estimates (e.g. "more than 10.000 people");
* negatively quantified entities refer to the empty set of the mentioned type of object (e.g. "No lawyer").</Paragraph>
    <Paragraph position="3"> The textual extent of mentions is defined as the entire nominal phrase used to refer to an entity, thus including modifiers (e.g. "a big family"), prepositional phrases (e.g. "the President of the Republic") and dependent clauses (e.g. "the girl who is working in the garden").</Paragraph>
    <Paragraph position="4"> The classification of entity mentions is based on syntactic features; among the most significant categories defined by the LDC (Linguistic Data Consortium 2004) there are:
- NAM: proper names (e.g. "Ciampi", "the UN");
- NOM: nominal constructions (e.g. "good children", "the company");
- PRO: pronouns, e.g. personal ("you") and indefinite ("someone");
- WHQ: wh-words, such as relatives and interrogatives (e.g. "Who's there?");
- PTV: partitive constructions (e.g. "some of them", "one of the schools");
- APP: appositive constructions (e.g. "Dante, famous poet", "Juventus, Italian football club").</Paragraph>
    <Paragraph position="5"> Since the dataset presented in this paper has been developed for Italian, some new types of mentions have been added to those listed in the LDC guidelines; for instance, we have created a specific tag, ENCLIT, to annotate clitics whose extension cannot be identified at the word level (e.g. "veder[lo]"/"to see him"). Some types of mentions, on the other hand, have been eliminated; this is the case for pre-modifiers, due to syntactic differences between English, where both adjectives and nouns can be used as pre-modifiers, and Italian, which only admits adjectives in that position.</Paragraph>
    <Paragraph position="6"> In extending the annotation guidelines, we have decided to annotate all conjunctions of entities, not only those which share the same modifiers as indicated in the ACE guidelines, and to mark them with a specific new tag, CONJ (e.g. "mother and child"). Notice that the corpus is in Italian, but we present English examples for the sake of readability.</Paragraph>
    <Paragraph position="7"> Appositive and conjoined mentions are complex constructions. Although LDC does not identify heads for complex constructions, we have decided to annotate the whole extent as the head.</Paragraph>
    <Paragraph position="8"> According to the ACE standards, each distinct person or set of people mentioned in a document refers to an entity of type PERSON. For example, people may be specified by name ("John Smith"), occupation ("the butcher"), family relation ("dad"), pronoun ("he"), etc., or by some combination of these.</Paragraph>
    <Paragraph position="9"> PERSON (PE), the class we have considered for the Ontology Population from Textual Mentions task, is further classified into the following subtypes:
* INDIVIDUAL_PERSON: PEs which refer to a single person (e.g. "George W. Bush");
* GROUP_PERSON: PEs which refer to more than one person (e.g. "my parents", "your family", etc.);
* INDEFINITE_PERSON: a PE is classified as indefinite when it is not possible to judge from the context whether it refers to one or more persons (e.g. "I wonder who came to see me").</Paragraph>
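    <Paragraph> To make the two-level annotation scheme concrete, the following is a minimal sketch, in Python, of how entities, their mentions and the type inventories described above could be represented; the class and field names are ours and are not part of the ACE or LDC specifications.

      from dataclasses import dataclass, field
      from typing import List

      # Mention types from the LDC guidelines plus the Italian-specific
      # additions discussed above (ENCLIT, CONJ); pre-modifiers are omitted.
      MENTION_TYPES = {"NAM", "NOM", "PRO", "WHQ", "PTV", "APP", "ENCLIT", "CONJ"}

      # Subtypes used for entities of type PERSON.
      PERSON_SUBTYPES = {"INDIVIDUAL_PERSON", "GROUP_PERSON", "INDEFINITE_PERSON"}

      @dataclass
      class Mention:
          """A textual reference to an entity: the full nominal phrase and its head."""
          extent: str        # e.g. "the U.S. president"
          head: str          # syntactic head, e.g. "president"
          mention_type: str  # one of MENTION_TYPES

      @dataclass
      class Entity:
          """A world object, linked to all of its co-referring mentions in a text."""
          entity_type: str   # e.g. "PERSON"
          subtype: str       # e.g. "INDIVIDUAL_PERSON"
          reference: str     # "specific", "generic", "under-specified" or "negative"
          mentions: List[Mention] = field(default_factory=list)

      # Two co-referring mentions of the same entity (GEORGE_W._BUSH).
      bush = Entity("PERSON", "INDIVIDUAL_PERSON", "specific")
      bush.mentions.append(Mention("the U.S. president", "president", "NOM"))
      bush.mentions.append(Mention("the president", "president", "NOM"))
    </Paragraph>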
  </Section>
  <Section position="6" start_page="27" end_page="29" type="metho">
    <SectionTitle>
3 Task definition
</SectionTitle>
    <Paragraph position="0"> In Section 3.1 we first describe the document collection we have used for the creation of the OPTM benchmark. Then, Section 3.2 provides details about RCEA, the first step in OPTM.</Paragraph>
    <Section position="1" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
3.1 Document collection
</SectionTitle>
      <Paragraph position="0"> The OPTM benchmark is built on top of I-CAB, the Italian Content Annotation Bank, a corpus of Italian news stories.</Paragraph>
      <Paragraph position="1">  I-CAB is further divided into training and test sections, which contain 335 and 190 documents respectively. In total, I-CAB consists of around 182,500 words: 113,500 and 69,000 words in the training and the test sections respectively (the average length of a news story is around 339 words in the training section and 363 words in the test section).</Paragraph>
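      <Paragraph> As a quick check, the average story lengths follow directly from the word and document counts reported above:

        \[ \frac{113{,}500}{335} \approx 339 \qquad\qquad \frac{69{,}000}{190} \approx 363 \]
      </Paragraph>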
      <Paragraph position="2"> The annotation of I-CAB is being carried out manually, as we intend I-CAB to become a benchmark for various automatic Information Extraction tasks, including recognition and normalization of temporal expressions, entities, and relations between entities (e.g. the relation affiliation connecting a person to the organization to which he or she is affiliated).</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
3.2 Recognition and Classification
</SectionTitle>
      <Paragraph position="0"> As stated in Section 1, we assume that for each type of entity there is a set of attribute-value pairs which are typically used for mentioning that entity type. The same entity may have different values for the same attribute and, at this point, no normalization of the data is made, so no relationship is established between different values of the same attribute, e.g. there is no stipulation regarding the relationship between "politician" and "political leader". Finally, we currently assume a totally flat structure among the possible values for the attributes.</Paragraph>
      <Paragraph position="1"> The work we describe in this section and in the next one concerns a pilot study on entities of type PERSON. After an empirical investigation of the dataset described in Section 3.1, we have assumed that the attributes listed in the first column of Table 2 constitute a proper set for this type of entity. The second column lists some possible values for each attribute.</Paragraph>
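      <Paragraph> As an illustration, a knowledge-base instance of type PERSON can be represented as a collection of attribute-value pairs. The following minimal sketch in Python uses the attributes discussed in this section (the full inventory is the one given in Table 2); the example values are taken from examples used throughout this section and do not describe a single real entity.

        # One instance of the knowledge base for an entity of type PERSON.
        # Values are raw textual extents: since no normalization is applied,
        # "politician" and "political leader" would simply coexist as
        # distinct, unrelated values of ACTIVITY.
        person_instance = {
            "TITLE": ["Professor"],
            "SEX": ["Mrs."],
            "ACTIVITY": ["sport journalist"],
            "ROLE": ["leader of the Labour Party"],
            "AFFILIATION": ["Radio Liberty"],
            "PROVENIENCE": ["Egyptian"],
            "AGE_CATEGORY": ["thirty-year-old"],
        }
      </Paragraph>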
      <Paragraph position="2"> The textual extent of a value is defined as the maximal extent containing pertinent information.</Paragraph>
      <Paragraph position="3"> For instance, if we have a person mentioned as "the thirty-year-old sport journalist", we will select "sport journalist" as the value for the attribute ACTIVITY. In fact, the age of the journalist is not pertinent to the activity attribute and is left out, whereas "sport" contributes to specifying the activity performed.</Paragraph>
      <Paragraph position="4"> As there are always values that are less paradigmatic for a given attribute, we briefly present below the guidelines for making a decision in such cases. Generally, articles and prepositions are not admitted at the beginning of the textual extent of a value, an exception being made for articles in nicknames.</Paragraph>
      <Paragraph position="5"> Typical examples for the TITLE attribute are "Mister", "Miss", "Professor", etc. We consider as TITLE the words which are used to address people with special status, but which do not refer specifically to their activity. In Italian, professions are often used to address people (e.g. "avvocato/lawyer", "ingegnere/engineer"). In order to avoid a possible overlap between the TITLE attribute and the ACTIVITY attribute, professions are considered values for TITLE only if they appear in abbreviated forms ("avv.", "ing.", etc.) before a proper name.</Paragraph>
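      <Paragraph> A minimal sketch of this decision rule, in Python; the list of abbreviations and the helper names are ours and purely illustrative ("avv." and "ing." come from the guidelines, the others are hypothetical additions).

        ABBREVIATED_PROFESSIONS = {"avv.", "ing.", "dott.", "prof."}

        def profession_is_title(token, followed_by_proper_name):
            """A profession counts as a TITLE value only when it appears in
            abbreviated form immediately before a proper name; otherwise it is
            treated as a candidate ACTIVITY value."""
            return token.lower() in ABBREVIATED_PROFESSIONS and followed_by_proper_name
      </Paragraph>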
      <Paragraph position="6"> With respect to the SEX attribute, we consider as values all the portions of text carrying this information. In most cases, first and middle names are relevant. In addition, the values of the SEX attribute can be gendered words (e.g. "Mister" vs. "Mrs.", "husband" vs. "wife") and words from grammatical categories carrying information about gender (e.g. adjectives).</Paragraph>
      <Paragraph position="7"> The attributes ACTIVITY, ROLE and AFFILIATION are strictly connected.</Paragraph>
      <Paragraph position="8"> ACTIVITY refers to the actual activity performed by a person, while ROLE refers to the position they occupy. So, for instance, "politician" is a possible value for ACTIVITY, while "leader of the Labour Party" refers to a ROLE. Each group of these three attributes is associated with a mention and all the information within a group has to be derived from the same mention. If different pieces of information derive from distinct mentions, we will have two separate groups. Consider the following three mentions of the same entity:
(1) "the journalist of Radio Liberty"
(2) "the redactor of breaking news"
(3) "a spare time astronomer"
These three mentions lead to three different groups of ACTIVITY, ROLE and AFFILIATION.</Paragraph>
      <Paragraph position="9"> The obvious inference that the first two mentions conceptually belong to the same group is not drawn. This step is to be taken at a later stage.</Paragraph>
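      <Paragraph> A minimal sketch of this per-mention grouping, in Python; the field names and the assignment of the example values to ACTIVITY as opposed to ROLE are only indicative.

        # Each mention contributes its own ACTIVITY/ROLE/AFFILIATION group;
        # groups coming from different mentions are kept separate.
        groups = [
            # (1) "the journalist of Radio Liberty"
            {"ACTIVITY": "journalist", "ROLE": None, "AFFILIATION": "Radio Liberty"},
            # (2) "the redactor of breaking news"
            {"ACTIVITY": None, "ROLE": "redactor of breaking news", "AFFILIATION": None},
            # (3) "a spare time astronomer"
            {"ACTIVITY": "spare time astronomer", "ROLE": None, "AFFILIATION": None},
        ]
        # The inference that groups (1) and (2) describe the same position is
        # deliberately not drawn here; it is left to a later stage.
      </Paragraph>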
      <Paragraph position="10"> The PROVENIENCE attribute can have as values all phrases denoting geographical/racial origin or provenience and religious affiliation.</Paragraph>
      <Paragraph position="11"> The attribute AGE_CATEGORY can have either numerical values, such as "three years old", or words indicating age, such as "middle-aged", etc. In the next section we will analyze the occurrences of the values of these attributes in a news corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="29" end_page="31" type="metho">
    <SectionTitle>
4 Data analysis
</SectionTitle>
    <Paragraph position="0"> The difficulty of the OPTM task is directly correlated to four factors: (i) the extent to which the linguistic form of mentions varies; (ii) the perplexity of the values of the attributes; (iii) the size of the set of the potential co-references and (iv) the number of different mentions per entity.</Paragraph>
    <Paragraph position="1"> In this section we present the work we have undertaken so far and the results we have obtained regarding the above four factors.</Paragraph>
    <Paragraph position="2"> We started with a set of 175 documents belonging to the I-CAB corpus (see Section 3.1).</Paragraph>
    <Paragraph position="3"> Each document has been manually annotated following the specifications described in Section 3.2. We focused on mentions referring to INDIVIDUAL_PERSON (Mentions in Table 3), excluding from the dataset both mentions referring to different entity types (e.g. ORGANIZATION) and those referring to GROUP_PERSON. In addition, for the purposes of this work we decided to filter out the following mentions: (i) mentions consisting of a single pronoun; (ii) nested mentions (in particular, where a larger mention, e.g. "President Ciampi", contained a smaller one, e.g. "Ciampi", only the larger mention was considered).</Paragraph>
    <Paragraph position="4"> The total number of remaining mentions (Meaningful mentions in Table 3) is 2343. Finally, we filtered out repetitions of mentions (i.e. string-identical mentions) that co-refer inside the same document, obtaining a set of 1139 distinct mentions.</Paragraph>
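    <Paragraph> The filtering steps just described can be summarized with the following sketch in Python; the data layout (one dictionary per mention with its extent, mention type, character span and entity identifier) is a hypothetical convenience, not the actual I-CAB format.

      def filter_mentions(mentions):
          """Return (meaningful, distinct) mention lists for one document."""
          # (i) drop mentions consisting of a single pronoun
          kept = [m for m in mentions if m["type"] != "PRO"]

          # (ii) drop nested mentions: when a larger mention contains a smaller
          # one, only the larger mention is considered
          def is_nested(m):
              return any(
                  o is not m
                  and o["start"] <= m["start"] and m["end"] <= o["end"]
                  and (o["start"], o["end"]) != (m["start"], m["end"])
                  for o in kept
              )
          meaningful = [m for m in kept if not is_nested(m)]

          # (iii) drop string-identical repetitions that co-refer in the document
          seen, distinct = set(), []
          for m in meaningful:
              key = (m["entity_id"], m["extent"])
              if key not in seen:
                  seen.add(key)
                  distinct.append(m)
          return meaningful, distinct
    </Paragraph>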
    <Paragraph position="5"> The average number of mentions for an entity in a document is 2.09, while the mentions/entity proportion within the whole collection is 2.68.</Paragraph>
    <Paragraph position="6"> The detailed distribution of mentions with respect to document entities is presented in Table 4. Columns 1 and 3 list the number of mentions and columns 2 and 4 list the number of entities which are mentioned for the respective number of times (from 1 to 9 and more than 10). For instance, in the dataset there are 741 entities which, within a single document, have just one mention, while there are 27 entities which are mentioned more than 10 times in the same document. As an indication of variability, only 14% of document entities have been mentioned in two different ways.</Paragraph>
    <Section position="1" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
4.1 Co-reference density
</SectionTitle>
      <Paragraph position="0"> We can estimate the a priori probability that two entities selected from different documents co-refer. Actually, this is an estimate of the probability that two entities co-refer, conditioned on the fact that they have been correctly identified inside the documents. We can compute this probability as the complement of the ratio between the number of different entities and the number of document entities in the collection. From Table 3 we read these values as 873 and 1117 respectively; therefore, for this corpus, the probability of cross-document co-reference is approximately 0.22.</Paragraph>
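      <Paragraph> In formula, writing E_distinct for the number of distinct entities in the collection and E_doc for the number of document entities (873 and 1117 in Table 3), this is simply:

        \[ P(\textrm{co-reference}) \;=\; 1 - \frac{E_{\mathrm{distinct}}}{E_{\mathrm{doc}}} \;=\; 1 - \frac{873}{1117} \;\approx\; 0.22 \]
      </Paragraph>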
      <Paragraph position="1"> A further factor in estimating the difficulty of the co-reference task is the ratio between the number of different entities and the number of mentions. We call this ratio the co-reference density; it expresses the a priori expectation that a correctly identified mention refers to a new entity. The co-reference density takes values in the interval [0, 1]. A value tending to 0 means that all the mentions refer to the same entity, while a value tending to 1 means that each mention in the collection refers to a different entity; both limits render the co-reference task superfluous. The co-reference density we found in our corpus is 873/2343 ≈ 0.37, far from either extreme.</Paragraph>
      <Paragraph position="2"> A last measure we introduce is the ratio between the number of different entities and the number of distinct mentions, which we call the pseudo co-reference density. It corresponds to the co-reference density conditioned on knowing in advance whether two identical mentions also co-refer.</Paragraph>
      <Paragraph position="3"> The pseudo co-reference density for our corpus is 873/1139 ≈ 0.76. This information is not directly expressed in the collection, so it has to be approximated. The difference between co-reference density and pseudo co-reference density (see Table 5) shows the increase in recall obtained if one assumes that two identical mentions refer to the same entity with probability 1. On the other hand, the loss in accuracy might be too large (consider, for example, the case in which two different people happen to have the same first name).</Paragraph>
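      <Paragraph> Using the same notation, with M the number of meaningful mentions and M_distinct the number of distinct mentions (2343 and 1139 respectively), the two measures are:

        \[ \textrm{co-reference density} = \frac{E_{\mathrm{distinct}}}{M} = \frac{873}{2343} \approx 0.37 \qquad\quad \textrm{pseudo co-reference density} = \frac{E_{\mathrm{distinct}}}{M_{\mathrm{distinct}}} = \frac{873}{1139} \approx 0.76 \]
      </Paragraph>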
    </Section>
    <Section position="2" start_page="30" end_page="31" type="sub_section">
      <SectionTitle>
4.2 Attribute variability
</SectionTitle>
      <Paragraph position="0"> The estimation of the variability of the values for a certain attribute is given in Table 6. The first column indicates the attribute under consideration; the second column lists the total number of mentions of the attribute found in the corpus; the third column lists the number of different values that the attribute actually takes and, between parentheses, its proportion over the total number of values; the fourth column indicates the proportion of the occurrences of the attribute with respect to the total number of mentions (distinct mentions are considered).</Paragraph>
      <Paragraph position="1"> Table 6. Variability of values for attributes.</Paragraph>
      <Paragraph position="2"> In Table 7 we show the distribution of the attributes inside one mention, that is, we calculate how many times one mention contains more than one attribute. Columns 1 and 3 list the number of attributes found in a mention, and columns 2 and 4 list the number of mentions that actually contain that number of attribute values.</Paragraph>
      <Paragraph position="3"> An example of a mention from our dataset that includes values for eight attributes is the following: "The correspondent of Al Jazira, Amr Abdel Hamid, an Egyptian of Russian nationality..."</Paragraph>
      <Paragraph position="4"> We conclude this section with a statistic regarding the coverage of attributes (miscellanea excluded). There are 7275 words used in 1139  distinct mentions, out of which 3606, approximately 49%, are included in the values of the attributes.</Paragraph>
    </Section>
  </Section>
</Paper>