
<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0709">
  <Title>The Impact of Morphological Stemming on Arabic Mention Detection and Coreference Resolution</Title>
  <Section position="8" start_page="67" end_page="68" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="67" end_page="67" type="sub_section">
      <SectionTitle>
6.1 Data
</SectionTitle>
      <Paragraph position="0"> The system is trained on the Arabic ACE 2003 and part of the 2004 data. We introduce here a clearly defined and replicable split of the ACE 2004 data, so that future investigations can accurately and correctly compare against the results presented here.</Paragraph>
      <Paragraph position="1"> There are 689 Arabic documents in LDC's 2004 release (version 1.4) of ACE data from three sources: the Arabic Treebank, a subset of the broadcast (bnews) and newswire (nwire) TDT-4 documents.</Paragraph>
      <Paragraph position="2"> The 178-document devtest is created by taking the last (in chronological order) 25% of documents in each of three sources: 38 Arabic tree-bank documents dating from &amp;quot;20000715&amp;quot; (i.e., July 15, 2000) to &amp;quot;20000815,&amp;quot; 76 bnews documents from &amp;quot;20001205.1100.0489&amp;quot; (i.e., Dec. 05 of 2000 from 11:00pm to 04:89am) to &amp;quot;20001230.1100.1216,&amp;quot; and 64 nwire documents from &amp;quot;20001206.1000.0050&amp;quot; to &amp;quot;20001230.0700.0061.&amp;quot; The time span of the test set is intentionally non-overlapping with that of the training set within each data source, as this models how the system will perform in the real world.</Paragraph>
    </Section>
    <Section position="2" start_page="67" end_page="68" type="sub_section">
      <SectionTitle>
6.2 Mention Detection
</SectionTitle>
      <Paragraph position="0"> We want to investigate the usefulness of stem n-gram features in the mention detection system. As stated before, the experiments are run in the ACE'04 framework (NIST, 2004) where the system will identify mentions and will label them (cf. Section 4) with a type (person, organization, etc), a sub-type (OrgCommercial, OrgGovernmental, etc), a mention level (named, nominal, etc), and a class (specific, generic, etc). Detecting the mention boundaries (set of consecutive tokens) and their main type is one of the important steps of our mention detection system. The score that the ACE community uses (ACE value) attributes a higher importance (outlined by its weight) to the main type compared to other subtasks, such as the mention level and the class. Hence, to build our mention detection system we spent a lot of effort in improving the first step: detecting the mention boundary and their main type. In this paper, we report the results in terms of precision, recall, and F-measure3.</Paragraph>
      <Paragraph position="1">  tem using lexical features only.</Paragraph>
      <Paragraph position="2"> To assess the impact of stemming n-gram features on the system under different conditions, we consider two cases: one where the system only has access to lexical features (the tokens and direct derivatives including standard n-gram features), and one where the system has access to a richer set of information, including lexical features, POS tags, text chunks, parse tree, and gazetteer information. The former framework has the advantage of being fast (making it more appropriate for deployment in commercial systems). The number of parameters to optimize in the MaxEnt framework we use when only lexical features are explored is around 280K parameters. This number increases to 443K approximately when all information is used except the stemming feature. The number of parameters introduced by the use of stemming is around 130K parameters. Table 1 reports experimental results using lexical features only; we observe that the stemming n-gram features boost the performance by one point (64.7 vs. 65.8). It is important to notice the stemming n-gram features improved the performance of each category of the main type.</Paragraph>
      <Paragraph position="3"> In the second case, the systems have access to a large amount of feature types, including lexical, syntactic, gazetteer, and those obtained by running other  relative complexity, due to different weights associated with the subparts, makes for a hard comparison, while the F-measure is relatively easy to interpret.</Paragraph>
      <Paragraph position="4">  tem using lexical, syntactic, gazetteer features as well as features obtained by running other named-entity classifiers named-entity classifiers (with different semantic tag sets). Features are also extracted from the shallow parsing information associated with the tokens in window of 3, POS, etc. The All-features system incorporates all the features except for the stem ngrams. Table 2 shows the experimental results with and without the stem n-grams features. Again, Table 2 shows that using stem n-grams features gave a small boost to the whole main-type classification system4. This is true for all types. It is interesting to note that the increase in performance in both cases (Tables 1 and 2) is obtained from increased recall, with little change in precision. When the prefix and suffix n-gram features are removed from the feature set, we notice in both cases (Tables 1 and 2) a insignificant decrease of the overall performance, which is expected: what should a feature of preceeding (or following) prepositions or finite articles captures? As stated in Section 4.1, the mention detection system uses a cascade approach. However, we were curious to see if the gain we obtained at the first level was successfully transfered into the overall performance of the mention detection system. Table 3 presents the performance in terms of precision, recall, and F-measure of the whole system. Despite the fact that the improvement was small in terms of F-measure (59.4 vs. 59.7), the stemming n-gram features gave</Paragraph>
    </Section>
    <Section position="3" start_page="68" end_page="68" type="sub_section">
      <SectionTitle>
6.3 Coreference Resolution
</SectionTitle>
      <Paragraph position="0"> In this section, we present the coreference results on the devtest defined earlier. First, to see the effect of stem matching features, we compare two coreference systems: one with the stem features, the other without. We test the two systems on both &amp;quot;true&amp;quot; and system mentions of the devtest set. &amp;quot;True&amp;quot; mentions mean that input to the coreference system are mentions marked by human, while system mentions are output from the mention detection system. We report results with two metrics: ECM-F and ACE-Value. ECM-F is an entity-constrained mention F-measure (cf. (Luo et al., 2004) for how ECM-F is computed), and ACE-Value is the official ACE evaluation metric. The result is shown in Table 4: the baseline numbers without stem features are listed under &amp;quot;Base,&amp;quot; and the results of the coreference system with stem features are listed under &amp;quot;Base+Stem.&amp;quot; On true mention, the stem matching features improve ECM-F from 77.7% to 80.0%, and ACE-value from 86.9% to 88.2%. The similar improvement is also observed on system mentions.The overall ECM-F improves from 62.3% to 64.2% and the ACE value improves from 61.9 to 63.1%. Note that the increase on the ACE value is smaller than ECM-F. This is because ACE-value is a weighted metric which emphasizes on NAME mentions and heavily discounts PRONOUN mentions. Overall the stem features give rise to consistent gain to the coreference system.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>