<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1002"> <Title>TAGARAB: A Fast, Accurate Arabic Name Recognizer Using High-Precision Morphological Analysis</Title> <Section position="3" start_page="8" end_page="10" type="metho"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"> Figure 1 contains the architecture of TAGARAB.</Paragraph> <Paragraph position="1"> Our system has two major modules: a Morphological Tokenizer and a Name Finder. In addition to performing the lexical scanning that establishes word-level units, the Morphological Tokenizer adds morphological features to tokens. Text encoded in ISO-8859-6 is first passed through this tokenizer, and the tokenized stream is then processed by the Name Finder module, which identifies names and other extraction targets and annotates the text with appropriate SGML tags for each extracted item.</Paragraph> <Section position="1" start_page="8" end_page="10" type="sub_section"> <SectionTitle> 2.1 Morphological Tokenizer 2.1.1 Description </SectionTitle> <Paragraph position="0"> The Morphological Tokenizer's basic task is to identify the sequences of words, punctuation symbols, numbers, existing SGML tags, etc., that comprise the input text. For each such &quot;token,&quot; a description of what it has found is returned as a vector of up to 32 application-definable bits (e.g., PUNCTUATION, WORD, NUMBER). The Tokenizer is a very fast program, generated using the Flex scanner generator from a tokenizer specification.</Paragraph> <Paragraph position="1"> We decided to augment the tokenizer's usual role.</Paragraph> <Paragraph position="2"> While it still finds numbers and punctuation tokens, it treats an Arabic word (a contiguous sequence of Arabic letters) as a collection of one or more morpheme tokens, each with its own bit-vector of properties.
The properties include those listed above, as well as morphology-specific properties whose nature and linguistic motivation are discussed in the next section. Making the morphological analysis part of tokenization has the advantage of maintaining the high speed of SRA's TurboTag. An external morphology module -- with a high computation overhead -- would degrade performance.</Paragraph> <Paragraph position="3"> Table 1 contains the features identified by the tokenizer. Each token in the text receives some subset of the lexical types in Table 1. For example, a string such as $rbh, phonetically [šaribahu], &quot;he drank it,&quot; receives the tokenizer types ARABIC, PERFECT, and SUFFIX. The first type means that the token comprises Arabic letters, the second that it is a Perfect verb form, and the third that there is a suffix attached. Note that in this case, the string is not broken up into pieces, such as stem and suffix. It remains a single token with information being added as to what the component pieces are. The only cases where the tokenizer splits off pieces of a string are where there is an attached conjunction (wa or fa) or an attached preposition (la, ba, ka), or both. In these cases, in place of an original string such as orthographic wq$1, phonetic [waqāla], &quot;and he said,&quot; there will be two separate tokens: [wa] with the type information ARABIC and CONJ, and [qāla] with the type information ARABIC and PERFECT.</Paragraph> <Paragraph position="4"> Some of these tokenizer types are exclusive, such as PERFECT and IMPERFECT. A token cannot be both simultaneously. Others, however, such as NOUN and DEF_ART, can both be applied to a token.</Paragraph> <Paragraph position="5"> We initially developed the morphological analysis module as a sequence of 31 patterns in Perl's regular expression language. This allowed us to efficiently develop and refine the patterns needed to recognize the various morphological word-shapes.
When we plugged this version of the morphological analyzer into the original tokenizer, however, processing was quite slow due to the sequential nature of our morphological patterns and the backtracking nature of Perl's regular expression matcher. To compensate, we incorporated the morphological functionality directly into the Flex specification of the tokenizer. Whenever the Flex-generated scanner identifies an Arabic word, it dispatches the appropriate regular expression to extract the separate morphemes from that word -- a task that is beyond the capability of Flex.</Paragraph> <Paragraph position="6"> The result is the fastest Arabic morphological analyzer we are aware of: the overall processing rate for TAGARAB is approximately 46 megabytes/hour on a Sun Ultra 1. Morphological processing by itself runs at about 190 megabytes/hour.</Paragraph> <Paragraph position="7"> We had originally planned to develop a morphological capability that would be helpful in improving name recognition, as discussed in Section 1.2. In the following, we discuss the linguistic design of the morphological analysis.</Paragraph> <Paragraph position="8"> Arabic is a highly inflected language. We believed that there are frequently enough surface cues in the shape of an Arabic word 7 to allow the assignment of the kind of morphological information described in Section 2.1.1. For example, inflected forms of derived verb stems such as [yaftatiHu], &quot;he inaugurates,&quot; would seem to have an orthographic &quot;shape&quot; that is fairly distinctive in an Arabic text.
We felt that this information could be exploited to successfully identify tokens as nouns, verbs (perfects or imperfects), etc., to a sufficiently reliable extent that the later name-recognition patterns could effectively make use of it.</Paragraph> <Paragraph position="9"> The morphological analysis process consists of a series of regular expressions partially supported by lists of noun, verb, and adjective stems, as well as closed-class items. The regular expressions cover all allowable prefixes and suffixes for each stem type.</Paragraph> <Paragraph position="10"> Infixation phenomena, however, such as the infixed t of the Eighth Verbal Form, 8 are handled as variant forms in the verb stem list, e.g., 'gtbr, &quot;he considered&quot; and 9br, &quot;he crossed.&quot; No attempt is made to handle co-occurrence constraints among prefixes and suffixes, nor to assign voice. Likewise, no attempt is made to include contextual information, as is done with standard part-of-speech taggers. There is no attempt to handle ambiguity: the regular expression patterns are ordered, and the search for an analysis of a word stops at the first match. The token types are then assigned, and the form is not submitted to any other regular expressions.</Paragraph> <Paragraph position="11"> Not all Arabic tokens are matched by one of these regular-expression patterns that provide morphological features. Although there is a mix of patterns supported by lexical information and patterns that operate entirely by rule (no supporting lexical data), the vast majority of matches appear to occur with the former set of patterns. In other words, the coverage of the morphological analysis is crucially dependent on the lexical data. There are 1051 noun forms, 813 verb forms, and 241 adjective forms.
There is also a comprehensive list of closed-class items.</Paragraph> <Paragraph position="12"> The notion of &quot;lexical item&quot; in TAGARAB's lists is somewhat similar to the listing principle found in Landau: broken plural forms for nouns and adjectives receive an independent entry, much like different stems of verbs as mentioned previously. We make no effort to distinguish I and II forms of verbs, as these are not usually distinguished orthographically in Al-Hayat. In general, if there is no visible orthographic distinction in normal Arabic prose, we do not make a distinction in the lexical data.</Paragraph> <Paragraph position="13"> 7 TAGARAB deals exclusively with the written form, i.e., without indication of short vowels.</Paragraph> <Paragraph position="14"> 8 We use the usual terms for these forms found in Western grammars.</Paragraph> <Paragraph position="15"> We also entered forms both with and without hamza, as in Al-Hayat any form that may receive a hamza may also appear without it (even in the same text!).</Paragraph> <Paragraph position="16"> Another important feature is that we entered only what seemed to us to be frequent lexical items in the lexical lists, and tried to do it in such a manner that what seemed intuitively the most likely reading of a form would be the one selected. This makes sense in the case of such a highly deterministic morphology and also given our time and resource constraints.</Paragraph> <Paragraph position="17"> We wanted to ensure that we got the right readings for a large number of highly frequent items, as this would be the most useful way to constrain the name-recognition patterns. Many high-frequency common nouns, verbs, adjectives, and other parts of speech in Arabic do not usually form part of names.
As it turned out (see Section 4 below), this strategy worked quite well with person names but was less significant for organization names.</Paragraph> <Paragraph position="18"> We also decided not to enter items that are used directly by the later name-recognition patterns, such as locations and given names, as these are accessed by the patterns through that module's own word lists and therefore having a morphological reading for them is not important (see Section 2.2).</Paragraph> <Paragraph position="19"> In addition to the supporting lexical data, the ordering of the regular expressions also aided in determining the analysis selected. The regular expressions are grouped functionally, and in general the ones pertaining to closed-class items apply first, then nouns, adjectives, perfects, and imperfects in that order. This had the effect that items which are highly &quot;marked&quot; as belonging to one category (e.g., verbs with a double pronominal object) would be captured appropriately by a verb-recognizing regular expression that looks for such suffixes, but that items that are not so highly marked (e.g., a simple third-person masculine perfect form) would be biased towards a reading according to the order of the regular expressions.</Paragraph> <Paragraph position="20"> We were pleasantly surprised to learn that this kind of approach -- although obviously simplistic in some ways -- produces a very high level of precision in analysis (i.e., the parts of speech assigned tend strongly to be correct) and surprisingly good recall (i.e., there is good coverage of the corpus). We discuss our empirical results in Section 3.</Paragraph> <Paragraph position="21"> In addition, the morphological information thus produced contributed substantially to the effectiveness of the name-recognition patterns.
We discuss this in Section 4.</Paragraph> </Section> <Section position="2" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 2.2 Name Finder </SectionTitle> <Paragraph position="0"> The Name Finder module of TAGARAB (see Figure 1) uses as input the tokens found by the Morphological Tokenizer with the basic and morphological features attached. The pattern-matching engine is SRA's NetOwl TurboTag™. It uses data consisting of a set of Pattern-Action rules supported by Word Lists. The latter consist of items such as personal titles that are used by the patterns to recognize names. The Pattern-Action rules use contextual and structural information about names to recognize them dynamically. They also make extensive use of the feature information coming from the Morphological Tokenizer. There is minimal permanent storage of names.</Paragraph> <Paragraph position="1"> The Pattern-Action rules are written in a convenient specification language. They are not compiled, but are read at run time as part of engine initialization.</Paragraph> </Section> </Section> <Section position="4" start_page="10" end_page="74" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 3.1 Morphology 3.1.1 Preparation </SectionTitle> <Paragraph position="0"> To evaluate the quality of the morphological analysis, we used SRA's tagging tool, TagTool™, to manually tag a set of documents for morphological analysis and part of speech. 9 For this test document set, we randomly selected fourteen texts from the Al-Hayat CD-ROM not belonging to the name recognition training or testing sets.
In addition to manually tagging them, we also ran TAGARAB over these fourteen texts and used a standard MUC-style scoring program to compare the morphological output of TAGARAB with the &quot;answers&quot; in the hand-tagged version.</Paragraph> <Paragraph position="1"> We hand-tagged every token in the text, except for: tives marked by fatHatayn.</Paragraph> <Paragraph position="2"> 9 Because of staffing constraints and the need for knowledge of Arabic, the same person developed the morphology component and the name patterns and also hand-tagged the test set. To remove as far as possible any taint, we did not change the system or any supporting data once the manual tagging began.</Paragraph> <Paragraph position="3"> These exceptions exist because the morphology component did not attach features to these items for the reasons given in Section 2.1.3. As a result of not hand-tagging them, the scoring program judged as spurious any morphological features found by the system for such items.</Paragraph> <Paragraph position="4"> We tagged the test set contextually, again in accordance with the design of the morphological component. The most important effect was on the feature PART. We found participles that usually act as a noun (e.g., [al-mašrūʕ], &quot;the plan,&quot; [al-muwaẓẓaf], &quot;the employee&quot;), usually act as an adjective (e.g., [al-bayt al-majhūl], &quot;the unknown house&quot;), or seem to be freely used in either reading (e.g., [muslim], &quot;Muslim&quot;). We tagged these participles contextually as nouns or adjectives. One effect of this was that the number of items tagged as participles was quite low (in effect, only when they are used predicatively).</Paragraph> <Paragraph position="5"> In all, the evaluation corpus contains 3214 tokens, of which 2324 are Arabic words.
1879 of the latter received morphological features when hand-tagged.</Paragraph> <Paragraph position="6"> The scores for the morphology component are given in Table 2. 10 Since we did not have access to a morphological analyzer that produces all possible readings for forms based on a large lexicon, we do not have a picture of the total morpho-lexical ambiguities in our evaluation texts. However, despite the small lexicon we manually built, the overall recall is reasonable (73.0%), and it also holds up well for most of the major open class items: perfects (72.7%), imperfects (60.2%), and nouns (66.8%). The low recall in adjectives (28.8%) is due to the fact that we did not make many lexical entries for adjectives. Since adjectives do not come first in the Arabic noun phrase, and since we use the morphological information to constrain the name patterns, tagging the head noun in a noun phrase is what is generally necessary, not tagging the adjective.</Paragraph> <Paragraph position="7"> What is striking in the above table is the high precision across all the categories, with the exception of adjectives and participles, the latter a very small set for the reasons set out in Section 3.1.2. Precision is consistently above 90%. We interpret this to mean that a manually built system with a moderate lexicon, having the capacity to only select one reading for a given form and not paying any attention to a word's context, is capable of a very significant amount of morphological disambiguation in Arabic.</Paragraph> <Paragraph position="8"> 10 The column headings are the standard ones from MUC: POS: possible number of points (one point for identifying a constituent boundary, another for identifying its category); ACT: actual responses given; COR: correct answers; PAR: boundary errors; INC: category labelling errors; SPU: responses given that are not in answer key; MIS: items in key missing from response; REC: recall (COR/POS); PRE: precision (COR/ACT); F-M: f-measure ((2 × PRE × REC)/(PRE + REC)).</Paragraph> <Paragraph position="9"> Our results are also consistent with the results of Levinger et al. for the structurally similar Hebrew. Levinger et al. discovered that non-context-based morphological analysis preferring the most likely morpho-lexical analysis (generated using a statistical algorithm) gives extremely good results.</Paragraph> <Paragraph position="10"> Table 3 shows the collisions among the tags.</Paragraph> <Paragraph position="11"> The most common confusions were between perfects and nouns in both directions. The system tagged the following tokens as nouns where the human tagged them as perfects: n~r (2x), &quot;he/it published,&quot; bgJ, &quot;he/it sent,&quot; Hd_t, &quot;it happened,&quot; w.sflhm, &quot;we described them,&quot; 11 .sdrt, &quot;was issued, appeared,&quot; wSs.l, &quot;he/it continued.&quot; 12 Conversely, there were 16 cases where TAGARAB considered a token as a perfect, and the human tagged it as a noun. As with the previous case, the great majority were confusions of the perfect with a derivationally related noun or verbal noun (e.g., qtl, &quot;he killed&quot; or &quot;killing&quot;). Despite the small numbers of such collisions in the sample, it seems to us that this is the most difficult disambiguation task, at least at the part of speech level, since verbal nouns plus the semantic subject or object in an iḍāfa construction can look much like a finite verb plus subject or object. Clearly, context or a higher level of syntactic/semantic understanding is required to differentiate the two readings.</Paragraph> <Paragraph position="12"> On the other hand, the other major confusion revealed by this table, Noun/Adjective and Adjective/Noun, is one that seems easily remedied by building in some knowledge of short context into the morphology component.
For example, the following three examples (and the rest resemble these) are cases where the system selects a noun reading for the adjective within the scope of a noun phrase: [lāji'ūna muslimūn], &quot;Muslim refugees,&quot; [biṣūratin xāṣṣa], &quot;in a special form,&quot; [mu'tamaruhu al-ṣiHāfī], &quot;his news conference.&quot; In these cases, the adjectives also have noun readings, but the local context shows clearly which reading is correct.</Paragraph> <Paragraph position="13"> These results identify specific areas of morpho-lexical ambiguity, bringing into focus where additional contextual cues are needed for better ambiguity resolution.</Paragraph> </Section> <Section position="2" start_page="11" end_page="74" type="sub_section"> <SectionTitle> 3.2 Evaluation of Name Patterns </SectionTitle> <Paragraph position="0"> The scores for the name recognition in TAGARAB over the training set of texts are given in Table 4.</Paragraph> <Paragraph position="1"> The blind scores are given in Table 5.</Paragraph> <Paragraph position="2"> 11 In this case, TAGARAB had identified the initial w as the conjunction wa and the rest of the string as a noun, #flhm, &quot;their property.&quot; 12 Similar to w.sflhm. TAGARAB took the initial w as the conjunction and took the rest of the string as a noun. Pattern performance followed our experience with other languages, except for the recognition of time expressions. Usually, these scores run in the mid- to high-nineties on a test corpus, but the rich variety of time and date formulas hindered scoring very high here. The scores for Arabic are consistent with scores for other languages (Thai, Chinese, Japanese) where there is no orthographic case information.</Paragraph> </Section> </Section> </Paper>
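<!--
Editorial sketch, kept in an XML comment so the file stays well-formed. It illustrates the MUC-style score arithmetic defined in the footnote on the Table 2 column headings (REC = COR/POS, PRE = COR/ACT, F-M = 2 x PRE x REC / (PRE + REC)). The class and function names are hypothetical, the counts in the usage line are invented for illustration and are not TAGARAB results, and partial-credit handling (PAR) is left out on the assumption that recall and precision are computed from COR alone, as that footnote states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MucCounts:
    """Tallies named after the MUC column headings (hypothetical container)."""
    pos: int  # POS: possible points in the answer key
    act: int  # ACT: responses the system actually gave
    cor: int  # COR: responses that match the key


def recall(c: MucCounts) -> float:
    # REC = COR / POS; defined as 0.0 when the key is empty.
    return c.cor / c.pos if c.pos else 0.0


def precision(c: MucCounts) -> float:
    # PRE = COR / ACT; defined as 0.0 when the system gave no responses.
    return c.cor / c.act if c.act else 0.0


def f_measure(c: MucCounts) -> float:
    # F-M = (2 * PRE * REC) / (PRE + REC); 0.0 when both terms are zero.
    p, r = precision(c), recall(c)
    return (2 * p * r) / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Invented counts, for illustration only.
    example = MucCounts(pos=200, act=160, cor=150)
    print(f"REC={recall(example):.3f} "
          f"PRE={precision(example):.3f} "
          f"F-M={f_measure(example):.3f}")
```

Because the f-measure is a harmonic mean, a category with high precision but low recall (as reported for adjectives) still receives a low combined score.
-->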