<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1001"> <Title>Discovering Lexical Information by Tagging Arabic Newspaper Text</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. TAGGING VERB AND NOUN </SectionTitle> <Paragraph position="0"> Several signs in the Arabic language indicate whether a word is a noun or a verb. One of them is the affix of the word: some affixes are used with verbs, some with nouns, and some with both verbs and nouns. Many research projects have used this technique to find the part of speech of a word. Andrei Mikheev [1997] used a technique for fully automatic acquisition of rules that guess possible part-of-speech tags for unknown words from their starting and ending segments. The guessing rules include prefix morphological rules and suffix morphological rules. Zhang and Kim [1990] developed a system for automated learning of morphological word function rules. This system divided a string into three regions and inferred from training examples their correspondence to underlying morphological features. More advanced word-guessing methods use word features such as leading and trailing word segments to determine possible tags for unknown words. Such methods can achieve better performance, reaching a tagging accuracy of up to 85% on unknown words for English [Brill 1992; Weischedel et al., 1993]. Another sign that indicates whether a word is a noun or a verb is the pattern. In Arabic, patterns are an important guide to recognizing the type of a word: some patterns are used only for nouns, some only for verbs, and others for both nouns and verbs. 
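As a rough illustration of the affix cue described above, the following Python sketch intersects the word types allowed by each affix of a word. It assumes the affixes have already been isolated from the word. The transliterated affixes in the table and the names AFFIX_TYPES and guess_from_affixes are illustrative assumptions, not the rule set used in this work.

```python
# Toy affix table (an assumption for illustration, not the paper's rule set).
# Affixes are transliterated and assumed to be already isolated from the word.
NOUN, VERB = "noun", "verb"

AFFIX_TYPES = {
    "al-": {NOUN},          # definite article, attaches to nominals
    "sa-": {VERB},          # future marker, attaches to imperfect verbs
    "-uuna": {NOUN, VERB},  # sound masculine plural; also an imperfect plural ending
}


def guess_from_affixes(affixes):
    """Intersect the types allowed by each known affix.

    Returns a set of candidate types, or None when no affix is known.
    """
    candidates = None
    for affix in affixes:
        allowed = AFFIX_TYPES.get(affix)
        if allowed is None:
            continue  # an unknown affix carries no evidence
        if candidates is None:
            candidates = set(allowed)
        else:
            candidates = candidates.intersection(allowed)
    return candidates


if __name__ == "__main__":
    print(guess_from_affixes(["al-"]))
    print(guess_from_affixes(["sa-", "-uuna"]))
```

A result with a single type corresponds to an affix that decides the tag on its own, while a result containing both types means the affix evidence is ambiguous and another cue, such as the pattern, is still needed.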
One more sign comes from grammatical rules: several grammatical rules can be used to distinguish between nouns and verbs. Some letters in the Arabic language (letters of signification, which are similar to prepositions in English) mark nouns; others mark verbs.</Paragraph> </Section> <Section position="4" start_page="0" end_page="3" type="metho"> <SectionTitle> 2. TAGGING PROPER NOUNS </SectionTitle> <Paragraph position="0"> Constructing lexical entries for proper nouns is no less important than defining and analyzing common nouns, verbs, and adjectives for supporting natural language applications. The semantic categories of proper nouns are crucial information for text understanding [Wolinski et al., 1995] and information extraction [Cowie and Lehnert, 1996]. They are also used in information retrieval systems [Paik et al., 1993]. A number of studies have shown the usefulness of lexical-semantic relationships in information retrieval systems [Evens et al., 1985; Nutter et al., 1990; Abu-Salem, 1992]. Lexical-semantic relationships are also important in other applications such as question-answering systems [Evens and Smith, 1978]. Rau [1991] argues that proper nouns not only account for a large percentage of the unknown words in a text, but are also recognized as a crucial source of information for extracting content, identifying the topic of a text, or detecting relevant documents in information retrieval. Wacholder [1997] analyzed the types of ambiguity (structural and semantic) that make the discovery of proper names in text difficult. 
Jong-Sun Kim and Evens [1995] built a natural language processing system for extracting personal names and other proper nouns from the Wall Street Journal.</Paragraph> <Paragraph position="1"> We have classified the proper nouns that we found in the Al-Raya newspaper as follows: Category (nationality, language, religion, ethnic group, party, etc.), with columns proper noun, type, related-to: American, nationality, America; Arabic, language, Arabs. Unlike English, Arabic does not distinguish between upper-case and lower-case letters, so proper nouns do not begin with a capital letter. This makes them much harder to locate in Arabic text than in English text. For this reason we use another technique for tagging the proper nouns in the text. This technique depends on keywords. We have studied, analyzed, and classified these keywords, and we use them to guide us in tagging the proper nouns in the text and in determining their types and features. We have classified these keywords as follows: We have also developed a set of grammatical rules to identify the proper noun phrases in the text. Example: Tokenizer System: This system locates a document and isolates the words (tokens).</Paragraph> <Paragraph position="2"> Type-Finder System: The main function of this system is to get the token from the tokenizer system, get some information about it from the morphology analyzer system, and run several tests one by one until it finds the part of speech of the word.</Paragraph> <Paragraph position="3"> Feature-Finder System: This system is responsible for finding the features of the word (gender, number, person, tense). 
It sends the word to the morphology analyzer system, gets back information about it, analyzes this information, and determines the features of the word.</Paragraph> <Paragraph position="4"> Morphology analyzer system: This system is used by both the type-finder system and the feature-finder system to analyze the suffix and prefix of the word. It contains three subprograms: one for nouns, one for verbs, and one for particles. The main function of these subprograms is to isolate the affixes of the word and find the gender, number, person, and tense. Database (the lexicon): We started from a hand-built lexicon created by Khalid Alsamara [1996], which our system uses and constantly updates. The lexicon consists of the main table and several tables connected to it: one for verbs, one for nouns, and one for particles. We added several tables for proper nouns.</Paragraph> </Section> <Section position="5" start_page="3" end_page="3" type="metho"> <SectionTitle> 4. TOKENIZER SYSTEM </SectionTitle> <Paragraph position="0"> We have implemented an algorithm that isolates the punctuation marks as well as the extra particles that are attached to the beginning of a word but are not part of it. We have classified the words in the Arabic language into eight categories with respect to their prefix. This system carries out three main steps: it isolates the word from the text, passes it to an algorithm that classifies it, and, depending on this classification, runs a corresponding algorithm to generate the token.</Paragraph> <Paragraph position="1"> 5. 
TYPE-FINDER SYSTEM This system goes through several tests: checking the database, identifying the phrases, analyzing the affixes of the word, and analyzing its pattern.</Paragraph> </Section> <Section position="6" start_page="3" end_page="3" type="metho"> <SectionTitle> 5.1 PHRASE-FINDER TEST </SectionTitle> <Paragraph position="0"> After we check the database and discover that our token is absent, we move to the second test.</Paragraph> <Paragraph position="1"> The phrase-finder test uses a set of grammatical rules that identify the phrases in the text. It looks for phrases (verb phrase, noun phrase, and proper noun phrase) in the text, analyzes them, and determines the part of speech of the word.</Paragraph> <Paragraph position="2"> Example: Mrs. Diana made her visit to the computer conference that is being held for the first time in Chicago. This test determines the part of speech of all the underlined words.</Paragraph> <Paragraph position="4"> So the word for Diana is a proper noun denoting a female human being.</Paragraph> </Section> <Section position="7" start_page="3" end_page="3" type="metho"> <SectionTitle> 5.2 CHECKING THE AFFIX PATTERNS </SectionTitle> <Paragraph position="0"> If the second test fails to identify the part of speech of our token, we continue to the third and fourth tests in sequence. For these two tests we use two techniques: analyzing the affixes of the word and finding its pattern.</Paragraph> <Paragraph position="1"> First, we have classified affixes into two groups: * affix rule (A): if such an affix occurs in a word, we can determine the type of the word with certainty.</Paragraph> <Paragraph position="2"> Second, we have collected one hundred and sixty-three patterns that cover the patterns in the Arabic language. We have classified these patterns according to the type of word they are used for (noun, verb, or noun and verb).</Paragraph> <Paragraph position="3"> Example:</Paragraph> <Paragraph position="5"> 
used with noun and verb. The third test gets the affixes of the word from the morphology analyzer system and checks them against affix rule (A). If there is a match, the test succeeds; otherwise we continue to the fourth test.</Paragraph> <Paragraph position="6"> Example: Mrs. Diana made her visit to the computer conference that is being held for the first time in Chicago.</Paragraph> <Paragraph position="7"> This test determines the part of speech of all the underlined words.</Paragraph> <Paragraph position="8"> Columns: word, prefix, suffix, result. Two example words (Arabic text not recoverable), each with the result noun.</Paragraph> </Section> <Section position="8" start_page="3" end_page="5" type="metho"> <SectionTitle> TEST </SectionTitle> <Paragraph position="0"> The fourth test uses a combination of affix rule (B) and the pattern of the word. Affix rule (B) is used to support the decision taken from the pattern technique. The test gets the affixes of the word from the morphology analyzer system, checks them against affix rule (B) to find out whether there is a match, finds the pattern of the word, and analyzes it to find the type it is used for. We then go through the following table to get the final result for this test.</Paragraph> <Paragraph position="2"> So if the pattern of a token shows that it is a NOUN or VERB, and its affixes show that it is a NOUN with respect to affix rule (B), then according to our table this token should be a NOUN.</Paragraph> <Paragraph position="3"> Example: Mrs. 
Diana made her visit to the computer conference that is being held for the first time in Chicago.</Paragraph> <Paragraph position="4"> This test determines the part of speech of all the underlined words.</Paragraph> </Section> <Section position="9" start_page="5" end_page="5" type="metho"> <SectionTitle> RESULT FROM </SectionTitle> <Paragraph position="0"> W: word, S: suffix, P: prefix, R: result, T: token, PT: pattern</Paragraph> </Section> <Section position="10" start_page="5" end_page="5" type="metho"> <SectionTitle> 6. FEATURE-FINDER SYSTEM </SectionTitle> <Paragraph position="0"> This system sends the word and its type to the morphology analyzer system, gets back morphological information about it (the affixes and the gender, number, person, and tense of the word), gets the pattern of the word, analyzes this information using rules we have developed for this system, and finds the features.</Paragraph> </Section> </Paper>