<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2012">
  <Title>High-precision Identification of Discourse New and Unique Noun Phrases</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Most coreference resolution systems proceed in the following way: they first identify all the possible markables (for example, noun phrases) and then check one by one candidate pairs a24a26a25a28a27 a13a30a29 a27a32a31a34a33 a15a19a35a6a36 a25a28a27 a13a4a29 a27a32a31a37a33 a15a37a38a40a39 , trying to find out whether the members of those pairs can be coreferent. As the final step, the pairs are ranked using a scoring algorithm in order to find an appropriate partition of all the markables into coreference classes.</Paragraph>
    <Paragraph position="1"> Those approaches require substantial processing: in the worst case one has to check a41a43a42a44a41a43a45a20a46a48a47 a49 candidate pairs, where a16 is the total number of markables found by the system. However, R. Vieira and M. Poesio have recently shown in (Vieira and Poesio, 2000) that such an exhaustive search is not needed, because many noun phrases are not anaphoric at all -- about a50a52a51a54a53 of definite NPs in their corpus have no prior referents. Obviously, this number is even higher if one takes into account all the other types of NPs -- for example, indefinites are almost always non-anaphoric.</Paragraph>
    <Paragraph position="2"> We can conclude that a coreference resolution engine might benefit a lot from a pre-filtering algorithm for identifying non-anaphoric entities. First, we save much processing time by discarding at least half of the markables. Second, we can hope to reduce the number of mistakes: without pre-filtering, our coreference resolution system might misclassify a discourse new entity as coreferent to some previous one.</Paragraph>
    <Paragraph position="3"> However, such a pre-filtering can also decrease the system's performance if too many anaphoric NPs are classified as discourse new: as those NPs are not processed by the main coreference resolution module at all, we cannot find correct antecedents for them. Therefore, we are interested in an algorithm with a good precision, possibly sacrificing its recall to a reasonable extent. V. Ng and C. Cardie analysed in (Ng and Cardie, 2002) the impact of such a prefiltering on their coreference resolution engine. It turned out that an automatically induced a0a2a1a54a3a48a5a8a7a10a9a12a11a14a13a4a5a12a15 a16a17a15a19a18 classifier did not help to improve the overall performance and even decreased it. However, when more NPs were considered anaphoric (that is, the precision for the a55 a1a54a3a48a5a8a7a10a9a12a11a14a13a4a5a12a15 a16a17a15a19a18 class increased and the recall decreased), the prefiltering resulted in improving the coreference resolution.</Paragraph>
    <Paragraph position="4"> Several algorithms for identifying discourse new entities have been proposed in the literature.</Paragraph>
    <Paragraph position="5"> R. Vieira and M. Poesio use hand-crafted heuristics, encoding syntactic information. For example, the noun phrase &amp;quot;the inequities of the current land-ownership system&amp;quot; is classified by their system as a55 a1a54a3a48a5a40a7a34a9a12a11 a13a30a5a12a15 a16a17a15a19a18 , because it contains the restrictive postmodification &amp;quot;of the current land-ownership system&amp;quot;. This approach leads to 72% precision and 69% recall for definite discourse new NPs.</Paragraph>
    <Paragraph position="6"> The system described in (Bean and Riloff, 1999) also makes use of syntactic heuristics. But in addition the authors mine discourse new entities from the corpus. Four types of entities can be classified as non-anaphoric:  1. having specific syntactic structure, 2. appearing in the first sentence of some text in the training corpus, 3. exhibiting the same pattern as several expressions of type (2), 4. appearing in the corpus at least 5 times and  always with the definite article (&amp;quot;definitesonly&amp;quot;). null Using various combinations of these methods, D. Bean and E. Riloff achieved an accuracy for definite non-anaphoric NPs of about a1a3a2a5a4a6a1a8a7 a53 (Fmeasure), with various combinations of precision and recall.1 This algorithm, however, has two limitations. First, one needs a corpus consisting of many small texts. Otherwise it is impossible to find enough non-anaphoric entities of type (2) and, hence, to collect enough patterns for the entities of type (3). Second, for an entity to be recognized as &amp;quot;definite-only&amp;quot;, it should be found in the corpus at least 5 times. This automatically results in the data sparseness problem, excluding many infrequent nouns and NPs.</Paragraph>
    <Paragraph position="7"> 1Bean and Riloff's non-anaphoric NPs do not correspond to our +discourse new ones, but rather to the union of our +discourse new and +unique classes.</Paragraph>
    <Paragraph position="8"> In our approach we use machine learning to identify non-anaphoric noun-phrases. We combine syntactic heuristics with the &amp;quot;definite probability&amp;quot;. Unlike Bean and Riloff, we model definite probability using the Internet instead of the training corpus itself. This helps us to overcome the data sparseness problem to a large extent. As it has been shown recently in (Keller et al., 2002), Internet counts produce reliable data for linguistic analysis, correlating well with corpus counts and plausibility judgements.</Paragraph>
    <Paragraph position="9"> The rest of the paper is organised as follows: first we discuss our NPs classification. In Section 3, we describe briefly various data sources we used. Section 4 provides an explanation of our learning strategy and evaluation results. The approach is summarised in Section 5.</Paragraph>
  </Section>
class="xml-element"></Paper>