<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0709">
  <Title>The Impact of Morphological Stemming on Arabic Mention Detection and Coreference Resolution</Title>
  <Section position="4" start_page="63" end_page="64" type="metho">
    <SectionTitle>
2 Why is Arabic Information Extraction Difficult?
</SectionTitle>
    <Paragraph position="0"> The Arabic language, the mother tongue of more than 300 million people (Center, 2000), presents significant challenges to many natural language processing applications. Arabic is a highly inflected and derived language. In Arabic morphology, most morphemes comprise a basic word form (the root or stem) to which many affixes can be attached to form Arabic words. The Arabic alphabet consists of 28 letters that can be extended to ninety by additional shapes, marks, and vowels (Tayli and Al-Salamah, 1990). Unlike Latin-based alphabets, Arabic is written from right to left. In written Arabic, short vowels are often omitted.</Paragraph>
    <Paragraph position="1"> Also, because variety of expression is appreciated as part of good writing style, synonyms are widespread. Arabic nouns encode information about gender, number, and grammatical case. There are two genders (masculine and feminine), three numbers (singular, dual, and plural), and three grammatical cases (nominative, genitive, and accusative). A noun takes the nominative case when it is a subject, the accusative case when it is the object of a verb, and the genitive case when it is the object of a preposition.</Paragraph>
    <Paragraph position="2"> The form of an Arabic noun is consequently determined by its gender, number, and grammatical case.</Paragraph>
    <Paragraph position="3"> Definite nouns are formed by attaching the Arabic definite article &amp;quot; @ to the immediate front of the noun, as in the word O&gt;&gt;Q , @ (the company).</Paragraph>
    <Paragraph position="4"> Also, prepositions such as H. (by), and &amp;quot; (to) can be attached as a prefix as in O&gt;&gt;Q @, (to the company).</Paragraph>
    <Paragraph position="5"> A noun may carry a possessive pronoun as a suffix, such as in D&gt;&gt;Q (their company). For the EDR task, in this previous example, the Arabic blank-delimited word D&gt;&gt;Q should be split into two tokens: O&gt;&gt;Q and o . The first token O&gt;&gt;Q is a mention that refers to an organization, whereas the second token o is also a mention, but one that may refer to a person. Also, the prepositions (i.e., H. and &amp;quot;) should not be considered a part of the mention.</Paragraph>
    <Paragraph position="6"> Arabic has two kinds of plurals: broken plurals and sound plurals (Wightwick and Gaafar, 1998; Chen and Gey, 2002). The formation of broken plurals is common, more complex and often irregular. As an example, the plural form of the noun g. P (man) is&amp;quot;A g . P (men), which is formed by inserting the infix@ . The plural form of the noun H. A J&gt;&gt; (book) is I. J&gt;&gt; (books), which is formed by deleting the infix @. The plural form and the singular form may also be completely different (e.g. L @Q @ for woman, but ZA for women). The sound plurals are formed by adding plural suffixes to singular nouns (e.g., IkA K. meaning researcher): the plural suffix is H@ for feminine nouns in grammatical cases (e.g., HA JkA K.), for masculine nouns in the nominative case (e.g., ae JkA K.), and AEK for masculine nouns in the genitive and accusative cases (e.g., AE JkA K.). The dual suffix is @ for the nominative case (e.g., A JkA K.), and AEK for the genitive or accusative (e.g., AE JkA K.).</Paragraph>
    <Paragraph position="7"> Because we consider pronouns and nominals as mentions, it is essential to segment Arabic words into these subword tokens. We also believe that the information denoted by these affixes can help with the coreference resolution task1.</Paragraph>
    <Paragraph position="8"> Arabic verbs have perfect and imperfect tenses (Abbou and McCarus, 1983). The perfect tense denotes completed actions, while the imperfect denotes ongoing actions. Arabic verbs in the perfect tense consist of a stem followed by a subject marker, denoted as a suffix. The subject marker indicates the person, gender, and number of the subject. As an example, the verb K.A fl (to meet) has the perfect tense I@ K.A fl for the third-person feminine singular and @ae @ K.A fl for the third-person masculine plural. Note also that a verb with a subject marker and a pronoun suffix can by itself be a complete sentence, as in the word D@K.A fl: it has a third-person feminine singular subject marker H (she) and a pronoun suffix o (them). It is a complete sentence meaning &amp;quot;she met them.&amp;quot; The subject markers are often suffixes, but a subject marker may also be a combination of a prefix and a suffix, as in OE@K.A fi K (she meets them). In this example, the EDR system should be able to separate OE@K.A fi K to create two mentions ( H and o). Because the two mentions belong to different entities, the EDR system should not chain them together. An Arabic word can potentially have a large number of variants, and some of the variants can be quite complex. As an example, consider the word A D JkA J., (and to her researchers), which contains two prefixes and one suffix (A o + oe kA K. + &amp;quot; + ).</Paragraph>
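The prefix/stem/suffix splitting described above can be sketched as a toy greedy affix stripper. This is not the paper's actual segmenter (which is described in Section 3); the romanized affix inventories and the example word `walibaHithha` (roughly "and to her researchers") are invented stand-ins, since the Arabic glyphs are garbled in this rendering.

```python
# Toy illustration of splitting an Arabic word into prefix, stem, and
# suffix tokens, as the EDR task requires. Affix lists and romanization
# are hypothetical placeholders, not the system's real rule set.
PREFIXES = ["wa", "li", "bi", "al"]   # and, to, by, the
SUFFIXES = ["hm", "ha", "at", "wn"]   # their, her, fem. plural, masc. plural

def segment(word):
    """Greedily strip known prefixes and suffixes from a romanized word."""
    prefixes = []
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p) + 2:
                prefixes.append(p + "+")
                word = word[len(p):]
                changed = True
    suffixes = []
    changed = True
    while changed:
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.insert(0, "+" + s)
                word = word[:-len(s)]
                changed = True
    return prefixes + [word] + suffixes

# "wa + li + baHith + ha" ~ two prefixes, a stem, and a pronoun suffix
print(segment("walibaHithha"))  # -> ['wa+', 'li+', 'baHith', '+ha']
```

The greedy strategy is only for illustration; real Arabic affix stripping is ambiguous, which is exactly why the next section scores candidate segmentations with a language model.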
  </Section>
  <Section position="5" start_page="64" end_page="65" type="metho">
    <SectionTitle>
3 Arabic Segmentation
</SectionTitle>
    <Paragraph position="0"> Lee et al. (2003) demonstrates a technique for segmenting Arabic text and uses it as a morphological processing step in machine translation. A trigram language model was used to score and select among hypothesized segmentations determined by a set of prefix and suffix expansion rules.</Paragraph>
    <Paragraph position="1"> In our latest implementation of this algorithm, we have recast this segmentation strategy as the composition of three distinct finite state machines. The first machine, illustrated in Figure 1, encodes the prefix and suffix expansion rules, producing a lattice of possible segmentations. The second machine is a dictionary that accepts characters and produces identifiers corresponding to dictionary entries. The final machine is a trigram language model, specifically a Kneser-Ney (Chen and Goodman, 1998) based back-off language model. Differing from (Lee et al., 2003), we have also introduced an explicit model for unknown words based upon a character unigram model, although this model is dominated by an empirically chosen unknown-word penalty. Using 0.5M words from the combined Arabic Treebanks 1V2, 2V2 and 3V1, the dictionary-based segmenter achieves an exact</Paragraph>
    <Paragraph position="2"> Footnote 1: As an example, we do not chain mentions with different gender, number, etc.</Paragraph>
    <Section position="1" start_page="64" end_page="65" type="sub_section">
      <SectionTitle>
3.1 Bootstrapping
</SectionTitle>
      <Paragraph position="0"> In addition to the model based upon a dictionary of stems and words, we also experimented with models based upon character n-grams, similar to those used for Chinese segmentation (Sproat et al., 1996). For these models, both Arabic characters and spaces and the inserted prefix and suffix markers appear on the arcs of the finite state machine. Here, the language model is conditioned to insert prefix and suffix markers based upon the frequency of their appearance in n-gram character contexts in the training data. The character-based model alone achieves a 94.5% exact match segmentation accuracy, considerably less accurate than the dictionary-based model. However, an analysis of the errors indicated that the character-based model is more effective at segmenting words that do not appear in the training data.</Paragraph>
      <Paragraph position="1"> We sought to exploit this ability to generalize in order to improve the dictionary-based model. As in (Lee et al., 2003), we used unsupervised training data, segmented automatically, to discover previously unseen stems. In our case, the character n-gram model is used to segment a portion of the Arabic Gigaword corpus. From this, we create a vocabulary of stems and affixes by requiring that tokens appear more than twice in the supervised training data or more than ten times in the unsupervised, segmented corpus.</Paragraph>
      <Paragraph position="2"> The resulting vocabulary, predominantly of word stems, is 53K words, or about six times the vocabulary observed in the supervised training data.</Paragraph>
      <Paragraph position="3"> This represents only about 18% of the total number of unique tokens observed in the aggregate training data. With the addition of the automatically acquired vocabulary, the segmenter achieves 98.1% exact match accuracy.</Paragraph>
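The vocabulary-building rule above (keep a token if it occurs more than twice in the supervised data or more than ten times in the automatically segmented corpus) reduces to simple frequency counting. The token streams in this sketch are invented stand-ins for the Treebank and Gigaword segmentations.

```python
from collections import Counter

# Sketch of the Section 3.1 bootstrapping filter. Thresholds (>2 supervised,
# >10 unsupervised) are the ones stated in the text; tokens are placeholders.
def build_vocabulary(supervised_tokens, unsupervised_tokens):
    sup = Counter(supervised_tokens)
    unsup = Counter(unsupervised_tokens)
    keep = {t for t, c in sup.items() if c > 2}
    keep |= {t for t, c in unsup.items() if c > 10}
    return keep

sup = ["ktab"] * 3 + ["rare"] * 2          # "rare" misses the >2 threshold
unsup = ["mktb"] * 11 + ["noise"] * 10     # "noise" misses the >10 threshold
print(sorted(build_vocabulary(sup, unsup)))  # -> ['ktab', 'mktb']
```

The two thresholds act as a noise filter: a token found by the character model must recur often enough in the large unsupervised corpus before it is trusted as a stem.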
    </Section>
    <Section position="2" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
3.2 Preprocessing of Arabic Treebank Data
</SectionTitle>
      <Paragraph position="0"> Because the Arabic treebank and the Gigaword corpora are based upon news data, we apply a small amount of regular expression based preprocessing. Arabic-specific processing includes removal of the tatweel character ( ) and of vowels. Also, the following characters are treated as an equivalence class during all lookups and processing: (1) l ,l , and (2) @ ,@ , @ , @. We define tokens by introducing whitespace boundaries around every span of one or more alphabetic or numeric characters. Each punctuation symbol is considered a separate token. Character classes, such as punctuation, are defined according to the Unicode Standard (Aliprand et al., 2004).</Paragraph>
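A minimal sketch of this preprocessing, assuming the standard Unicode code points for tatweel (U+0640), the short-vowel diacritics (U+064B through U+0652), and the alef variants; the exact equivalence classes used by the authors are garbled in this rendering, so only the alef class is shown, as an assumption.

```python
import re

# Sketch of Section 3.2 preprocessing: strip tatweel and vowel diacritics,
# normalize alef variants to plain alef, and put whitespace around each
# punctuation symbol so it becomes its own token.
TATWEEL = "\u0640"
DIACRITICS = re.compile("[\u064B-\u0652]")          # fathatan .. sukun
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef + madda/hamza forms

def preprocess(text):
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # plain alef
    # separate punctuation: every non-word, non-space char gets boundaries
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return " ".join(text.split())

# alef-hamza + tatweel + beh + parenthesized Latin text
print(preprocess("\u0623\u0640\u0628(test)"))  # punctuation split out
```

In Python 3, `\w` matches Arabic letters by default, so alphabetic and numeric spans survive intact while each punctuation symbol is isolated, matching the tokenization rule in the text.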
    </Section>
  </Section>
  <Section position="6" start_page="65" end_page="66" type="metho">
    <SectionTitle>
4 Mention Detection
</SectionTitle>
    <Paragraph position="0"> The mention detection task we investigate identifies, for each mention, four pieces of information:
1. the mention type: person (PER), organization (ORG), location (LOC), geopolitical entity (GPE), facility (FAC), vehicle (VEH), and weapon (WEA)
2. the mention level (named, nominal, pronominal, or premodifier)
3. the mention class (generic, specific, negatively quantified, etc.)
4. the mention sub-type, which is a sub-category of the mention type (ACE, 2004) (e.g., OrgGovernmental, FacilityPath, etc.).</Paragraph>
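The per-token labeling scheme used for this task (each token starts a mention, is inside one, or is outside any mention) can be illustrated with a standard B/I/O encoding; the English tokens and gold spans below are invented stand-ins for segmented Arabic text.

```python
# Illustration of the token labeling described in Section 4.1: B- marks a
# token that starts a mention, I- a token inside one, O a token outside all.
def tokens_to_labels(tokens, mentions):
    """mentions: list of (start, end_exclusive, type) spans over tokens."""
    labels = ["O"] * len(tokens)
    for start, end, mtype in mentions:
        labels[start] = "B-" + mtype
        for i in range(start + 1, end):
            labels[i] = "I-" + mtype
    return labels

toks = ["the", "office", "of", "the", "party"]
print(tokens_to_labels(toks, [(1, 2, "ORG"), (4, 5, "ORG")]))
# -> ['O', 'B-ORG', 'O', 'O', 'B-ORG']
```

In the cascade described below, the first-stage classifier predicts these boundary/type labels, and the second-stage classifiers then predict sub-type, level, and class over the resulting mention spans.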
    <Section position="1" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
4.1 System Description
</SectionTitle>
      <Paragraph position="0"> We formulate the mention detection problem as a classification problem, which takes as input segmented Arabic text. We assign to each token in the text a label indicating whether it starts a specific mention, is inside a specific mention, or is outside any mentions. We use a maximum entropy Markov model (MEMM) classifier. The principle of maximum entropy states that when one searches among probability distributions that model the observed data (evidence), the preferred one is the one that maximizes the entropy (a measure of the uncertainty of the model) (Berger et al., 1996). One big advantage of this approach is that it can combine arbitrary and diverse types of information in making a classification decision.</Paragraph>
      <Paragraph position="1"> Our mention detection system predicts the four label types associated with a mention through a cascade approach. It first predicts the boundary and the main entity type for each mention. Then, it uses the information regarding the type and boundary in different second-stage classifiers to predict the sub-type, the mention level, and the mention class. After the first stage, when the boundary (starting, inside, or outside a mention) has been determined, the other classifiers can use this information to analyze a larger context, capturing patterns around entire mentions rather than individual words. As an example, the token sequence that refers to a mention becomes a single recognized unit and, consequently, lexical and syntactic features occurring inside or outside of the entire mention span can be used in prediction.</Paragraph>
      <Paragraph position="2"> In the first stage (entity type detection and classification), Arabic blank-delimited words, after segmentation, become a series of tokens representing prefixes, stems, and suffixes (cf. section 2). We allow any contiguous sequence of tokens to represent a mention. Thus, prefixes and suffixes can be, and often are, labeled with a different mention type than the stem of the word that contains them as constituents.</Paragraph>
    </Section>
    <Section position="2" start_page="65" end_page="66" type="sub_section">
      <SectionTitle>
4.2 Stem n-gram Features
</SectionTitle>
      <Paragraph position="0"> We use a large set of features to improve the prediction of mentions. This set can be partitioned into four categories: lexical, syntactic, gazetteer-based, and those obtained by running other named-entity classifiers (with different tag sets). We use features such as the shallow parsing information associated with the tokens in a window of three tokens, part-of-speech tags, etc.</Paragraph>
      <Paragraph position="1"> The context of a current token t_i is clearly one of the most important features in predicting whether t_i is a mention or not (Florian et al., 2004). We denote these features as backward token tri-grams and forward token tri-grams for the previous and next context of t_i, respectively. For a token t_i, the backward token n-gram feature contains the previous n-1 tokens in the history (t_{i-n+1}, ..., t_{i-1}) and the forward token n-gram feature contains the next n-1 tokens (t_{i+1}, ..., t_{i+n-1}).</Paragraph>
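The backward and forward token n-gram features can be computed directly from the token stream. The padding symbol is an assumption for tokens near the sentence boundary; the English placeholder tokens stand in for segmented Arabic tokens.

```python
# Backward/forward token n-gram features for token t_i: the previous n-1
# and the next n-1 tokens, padded at sentence boundaries (pad symbol is a
# hypothetical choice, not specified in the paper).
def token_ngram_features(tokens, i, n=3, pad="<PAD>"):
    padded = [pad] * (n - 1) + tokens + [pad] * (n - 1)
    j = i + n - 1  # position of t_i in the padded stream
    backward = padded[j - (n - 1):j]
    forward = padded[j + 1:j + n]
    return backward, forward

toks = ["this", "represents", "location-of", "the", "office"]
print(token_ngram_features(toks, 4))
# -> (['location-of', 'the'], ['<PAD>', '<PAD>'])
```

With n = 3 this yields exactly the trigram contexts the paper uses; the next subsection shows why these contexts become uninformative once words are split into many affix tokens.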
      <Paragraph position="2"> Because we are segmenting Arabic words into multiple tokens, there is some concern that tri-gram contexts will no longer convey as much contextual information. Consider the following sentence extracted from the development set: H. Qj@, oe A J ,@ I. J&amp;quot; @, Q fi V@ J @ Yo (translation &amp;quot;This represents the location for Political Party Office&amp;quot;). The &amp;quot;Political Party Office&amp;quot; is tagged as an organization and, as a word-for-word translation, is expressed as &amp;quot;to the Office of the political to the party&amp;quot;. It is clear in this example that the word Q fi (location for) contains crucial information in distinguishing between a location and an organization when tagging the token I. J&amp;quot;  (office). After segmentation, the sentence becomes: + I. J&amp;quot; + &amp;quot; @ + &amp;quot; + Q fi + &amp;quot; @ + J + l + @ Yo .H. Qk + &amp;quot; @ + &amp;quot; + oe A J + &amp;quot; @  When predicting whether the token I. J&amp;quot; (office) is the beginning of an organization or not, backward and forward token n-gram features contain only &amp;quot; @ + &amp;quot; (for the) and oe A J + &amp;quot; @ (the political). This is most likely not enough context, and addressing the problem by increasing the size of the n-gram context quickly leads to a data sparseness problem.</Paragraph>
      <Paragraph position="3"> We propose in this paper the stem n-gram features as additional features to the lexical set. If the current token t_i is a stem, the backward stem n-gram feature contains the previous n-1 stems and the forward stem n-gram feature contains the following n-1 stems. We proceed similarly for prefixes and suffixes: if t_i is a prefix (or suffix, respectively) we take the previous and following prefixes (or suffixes)2. In the sentence shown above, when the system is predicting whether the token I. J&amp;quot; (office) is the beginning of an organization or not, the backward and forward stem n-gram features contain Q fi J (represent location of) and H. Qk oe A J (political office). The stem features contain enough information in this example to make a decision that I. J&amp;quot; (office) is the beginning of an organization. In our experiments, n is 3; therefore, we use stem trigram features.</Paragraph>
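The stem n-gram feature amounts to filtering the token stream down to tokens of the current token's type before taking the n-gram context. The (text, type) pairs below are romanized placeholders for the garbled Arabic tokens; the lookup by `index` assumes each stem appears once in the sentence, which is a simplification for illustration.

```python
# Stem n-gram features (Section 4.2): drop tokens whose type differs from
# t_i's, then take the previous and next n-1 tokens of that filtered stream.
def stem_ngram_features(tokens, i, n=3):
    text, ttype = tokens[i]
    same = [t for t, ty in tokens if ty == ttype]  # stream of one type only
    j = same.index(text)  # simplification: assumes text is unique
    backward = same[max(0, j - (n - 1)):j]
    forward = same[j + 1:j + n]
    return backward, forward

# romanized stand-in for the segmented example sentence
seg = [("ydl", "stem"), ("al", "prefix"), ("mqr", "stem"),
       ("al", "prefix"), ("mktb", "stem"), ("al", "prefix"),
       ("syasy", "stem")]
print(stem_ngram_features(seg, 4))  # context of the stem "mktb" (office)
# -> (['ydl', 'mqr'], ['syasy'])
```

Compared with plain token trigrams, the filtered context reaches across the intervening prefixes to the semantically informative neighboring stems, which is the effect the paragraph above describes.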
    </Section>
  </Section>
  <Section position="7" start_page="66" end_page="67" type="metho">
    <SectionTitle>
5 Coreference Resolution
</SectionTitle>
    <Paragraph position="0"> Coreference resolution (or entity recognition) is defined as grouping together mentions that refer to the same object or entity. For example, in the following text, (I) &amp;quot;John believes Mary to be the best student&amp;quot;, the three mentions &amp;quot;John&amp;quot;, &amp;quot;Mary&amp;quot;, and &amp;quot;student&amp;quot; are underlined. &amp;quot;Mary&amp;quot; and &amp;quot;student&amp;quot; are in the same entity since both refer to the same person.</Paragraph>
    <Paragraph position="1"> The coreference system is similar to the Bell tree algorithm described by Luo et al. (2004).</Paragraph>
    <Paragraph position="2"> In our implementation, the link model between a candidate entity e and the current mention m is computed as: P_L(L = 1 | e, m) = max_{m_k in e} ^P_L(L = 1 | e, m_k, m) (1)</Paragraph>
    <Paragraph position="4"> Footnote 2: Thus, the difference from token n-grams is that tokens of a different type are removed from the streams before the features are created.</Paragraph>
    <Paragraph position="5"> where m_k is one mention in entity e, and the basic model building block ^P_L(L = 1 | e, m_k, m) is an exponential or maximum entropy model (Berger et al., 1996).</Paragraph>
    <Paragraph position="6"> For the start model, we use the following approximation: ^P_S(S = 1 | e_1, ..., e_t, m) ≈ 1 - max_{1 &lt;= i &lt;= t} max_{m_k in e_i} ^P_L(L = 1 | e_i, m_k, m) (2)</Paragraph>
    <Paragraph position="8"> The start model (cf. equation 2) says that the probability of starting a new entity, given the current mention m and the previous entities e_1, e_2, ..., e_t, is simply 1 minus the maximum link probability between the current mention and one of the previous entities.</Paragraph>
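The link and start models described above reduce to two max operations over a mention-pair probability. In this sketch, `pair_prob` is a toy stand-in for the trained maximum-entropy pairwise model, which the paper does not specify in detail.

```python
# Sketch of the Bell-tree style link/start models: the entity-level link
# probability is the best mention-pair probability within the entity, and
# the start probability is 1 minus the best link score over all entities.
def link_prob(entity, mention, pair_prob):
    return max(pair_prob(mk, mention) for mk in entity)

def start_prob(entities, mention, pair_prob):
    if not entities:
        return 1.0  # first mention always starts a new entity
    return 1.0 - max(link_prob(e, mention, pair_prob) for e in entities)

# toy pair model: identical strings link with probability 0.9, else 0.2
pair = lambda a, b: 0.9 if a == b else 0.2
ents = [["Mary", "student"], ["John"]]
print(round(start_prob(ents, "Mary", pair), 2))  # -> 0.1
```

A new mention is therefore started as its own entity only when no existing entity contains a mention that links to it strongly, which matches the approximation in equation (2).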
    <Paragraph position="9"> The maximum-entropy model provides us with a flexible framework to encode features into the system. Our Arabic entity recognition system uses many language-independent features such as strict and partial string match, and distance features (Luo et al., 2004). In this paper, however, we focus on the addition of Arabic stem-based features.</Paragraph>
    <Section position="1" start_page="66" end_page="67" type="sub_section">
      <SectionTitle>
5.1 Arabic Stem Match Feature
</SectionTitle>
      <Paragraph position="0"> Features using the word context (left and right tokens) have been shown to be very helpful in coreference resolution (Luo et al., 2004). For Arabic, since words are morphologically derived from a list of roots (stems), we expected that a feature based on the right and left stems would lead to improvement in system accuracy.</Paragraph>
      <Paragraph position="1"> Let m1 and m2 be two candidate mentions where a mention is a string of tokens (prefixes, stems, and suffixes) extracted from the segmented text.</Paragraph>
      <Paragraph position="2"> To decide whether or not to link the two mentions, we use additional features such as: whether the stems in m1 and m2 match, whether the stems in m1 match all stems in m2, and whether the stems in m1 partially match the stems in m2. We proceed similarly for prefixes and suffixes. Since prefixes and suffixes can belong to different mention types, we build a parse tree on the segmented text so that we can explore features dealing with the gender and number of each token. In the following example, we give word-for-word translations in parentheses in order to better explain our stemming feature. Let us take the two mentions H. Qj@, oe A J ,@ I. J&amp;quot; @, (to-the-office the-political to-the-party) and oe</Paragraph>
      <Paragraph position="4"> development corpus, these two mentions are chained to the same entity. The stemming match feature in this case will contain information such as all stems of m2 match, which is a strong indicator that these mentions should be chained together.</Paragraph>
      <Paragraph position="5"> Features based on the words alone would not help this specific example, because the two strings m1 and m2 do not match.</Paragraph>
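The stem-match features described above can be sketched over mentions represented as lists of (text, type) tokens. The romanized token values are invented placeholders for the garbled Arabic, and the three boolean features are a simplified rendering of the match/partial-match tests the paragraph lists.

```python
# Sketch of the Arabic stem-match coreference features (Section 5.1):
# compare only the stem tokens of two candidate mentions.
def stems(mention):
    return [t for t, ty in mention if ty == "stem"]

def stem_match_features(m1, m2):
    s1, s2 = stems(m1), stems(m2)
    return {
        "all_stems_match": s1 == s2,                      # full stem match
        "m2_stems_in_m1": all(s in s1 for s in s2),       # m2 subsumed by m1
        "partial_stem_match": any(s in s1 for s in s2),   # any overlap
    }

# "to-the-office the-political to-the-party" vs. "the-office" (romanized)
m1 = [("li", "prefix"), ("mktb", "stem"), ("al", "prefix"),
      ("syasy", "stem"), ("li", "prefix"), ("Hzb", "stem")]
m2 = [("al", "prefix"), ("mktb", "stem")]
print(stem_match_features(m1, m2))
# -> {'all_stems_match': False, 'm2_stems_in_m1': True, 'partial_stem_match': True}
```

As the example in the text notes, the full token strings of the two mentions differ because of their affixes, yet every stem of the shorter mention is found in the longer one, which is the signal that lets the resolver chain them.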
    </Section>
  </Section>
</Paper>