<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0506"> <Title>Building a Shallow Arabic Morphological Analyzer in One Day</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation and Discussion </SectionTitle> <Paragraph position="0"> To evaluate Sebawai, it was compared to ALPNET. A random set of a 100 word-root pairs produced by ALPNET was manually examined to verify their correctness and consequently verify the correctness of ALPNET. ALPNET produces some possible roots for each given word in unranked order, but all pairs were correct.</Paragraph> <Paragraph position="1"> Three experiments were preformed. In the first and second experiments, Sebawai is trained on a large list and a small list of word-root pairs respectively. After the training, a list of words is fed into Sebawai and ALPNET for analysis. The correctness of analysis and coverage of both systems are compared. In the third experiment, a document collection is indexed using roots produced by both systems. Retrieval effectiveness of indexing using roots produced from each system is examined.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Using a Large Training Set </SectionTitle> <Paragraph position="0"> A list of 270K words was used for training the system and a list of 9,606 Arabic words was used for evaluation. Of the small test set, ALPNET analyzed all the words, while Sebawai analyzed 9,497 and failed on 112. For the generated roots, three different automatic evaluations were done: First (Auto-Eval-1): The top generated root is compared to the roots generated by ALPNET. If the root is on the list, it is considered correct.</Paragraph> <Paragraph position="1"> Using this method, 8,206 roots were considered correct.</Paragraph> <Paragraph position="2"> Second (Auto-Eval-2): The top two generated roots from Sebawai were compared to the list of roots that were generated by ALPNET. If either root appeared in the list then the morphological analysis was considered correct. Using this evaluation method, 8,861 roots were considered correct.</Paragraph> <Paragraph position="3"> Third (Auto-Eval-n): All the generated roots are compared to the ones generated by ALPNET. If any match is found, the analysis is considered correct. Using this method, 9,136 roots were considered correct.</Paragraph> <Paragraph position="4"> However, this automatic evaluation has two flaws: 1. The number of Arabic roots in ALPNET's inventory are only 4,600 roots while the number of roots used by Sebawai are more than 10,000. This could result in a correct roots being missed by ALPNET.</Paragraph> <Paragraph position="5"> 2. ALPNET often under-analyzes. For example the word a130 &quot;fy&quot; could be the particle a130 &quot;fy&quot; or could be a stem with the root a139a140a127 a96 &quot;fyy&quot;. ALPNET only generates the particle a130 &quot;fy&quot;, but not the other root a139a140a127 a96 &quot;fyy&quot;. This could lead to false negatives. Therefore manual examination of reject roots was necessary. However, due to the large number of rejected roots, 100 rejected roots from the evaluation Auto-Eval-1 and Auto-Eval-2 were selected at random for examination to estimate the shortfall of the automatic evaluation. Of the 100 rejected roots: Another list of 292,216 words that ALPNET was unable to recognize were fed to Sebawai. Sebawai analyzed 128,169 words (43.9%), and failed otherwise. 
<Paragraph position="4"> However, this automatic evaluation has two flaws: 1. ALPNET's inventory contains only 4,600 Arabic roots, while Sebawai uses more than 10,000. A correct root produced by Sebawai could therefore be absent from ALPNET's output.</Paragraph>
<Paragraph position="5"> 2. ALPNET often under-analyzes. For example, the word &quot;fy&quot; could be the particle &quot;fy&quot; or a stem with the root &quot;fyy&quot;; ALPNET generates only the particle reading, not the root &quot;fyy&quot;. This could lead to false negatives. Manual examination of the rejected roots was therefore necessary. However, due to the large number of rejected roots, 100 roots rejected under Auto-Eval-1 and Auto-Eval-2 were selected at random for examination, to estimate the shortfall of the automatic evaluation. Of the 100 rejected roots:</Paragraph>
<Paragraph position="6"> Another list of 292,216 words that ALPNET was unable to recognize was fed to Sebawai. Sebawai analyzed 128,169 of these words (43.9%) and failed on the rest. To verify the correctness of these analyses, 100 words were taken at random from the list for manual examination. Of the 100, 47 were analyzed correctly; many of the failures were named entities. Extrapolating from the manual examination, Sebawai would correctly analyze an estimated 60,000 of these words (128,169 × 0.47 ≈ 60,000). The failure of ALPNET on these words and the low accuracy of Sebawai warrant further investigation: a quick review of the list shows a high frequency of named entities, misspelled words, and obscure words.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Using a Small Training Set </SectionTitle>
<Paragraph position="0"> The 9,606-word list was used for training and the 270K-word list was used for evaluation. Of the 270,468 words, the system analyzed 186,047 and was unable to analyze 84,421. As in the experiment with the large training set, the three automatic evaluations Auto-Eval-1, Auto-Eval-2, and Auto-Eval-n were used. For Auto-Eval-1 and Auto-Eval-2, 100 of the rejected roots were manually examined to verify correctness. Of the 100 roots examined:</Paragraph>
<Paragraph position="1"> Also, the 292,216 words that ALPNET was unable to recognize were fed to Sebawai. Sebawai analyzed 92,929 words (31.8%). To verify the correctness of these analyses, 100 words were taken at random from the list for manual examination. Of the 100, 55 were analyzed correctly. Extrapolating from the manual examination, Sebawai would correctly analyze an estimated 51,000 of these words (92,929 × 0.55 ≈ 51,000).</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Retrieval Effectiveness </SectionTitle>
<Paragraph position="0"> In the third part of the evaluation, the Zad document collection, which contains 4,000 documents, was used for retrieval evaluation.</Paragraph>
<Paragraph position="1"> Associated with the collection was a set of 25 queries and their relevance judgments. Sebawai was trained using the 270K-word list, and InQuery was used as the retrieval engine.</Paragraph>
<Paragraph position="2"> Four different runs were performed. In the first two, the collection was indexed using one root and two roots produced by ALPNET. In the latter two, the collection was indexed using the top root and the top two roots generated by Sebawai. Mean average precision was used as the figure of merit in comparing the runs. For statistical significance, a paired two-tailed t-test was used; a difference was considered statistically significant if the p-value of the t-test was below 0.05.</Paragraph>
<Paragraph position="3"> Results summary: Using Sebawai's guess of the most likely root resulted in a higher mean average precision than using one root produced by ALPNET (note that ALPNET orders the possible roots randomly). Using two roots from ALPNET slightly improved mean average precision, but the improvement was not statistically significant.</Paragraph>
<Paragraph position="4"> Using the top two roots from Sebawai, however, significantly harmed retrieval. A likely reason for the drop in mean average precision when the second root was introduced is that the second root amounted to noise.</Paragraph>
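<Paragraph> The run comparison can be reproduced with standard tools. The sketch below assumes the per-query average-precision scores of two runs have already been computed; the function names, data layout, and use of Python with SciPy are illustrative assumptions, not details from the paper.

from scipy import stats

def mean_average_precision(ap_scores):
    """Mean of the per-query average-precision values for one run."""
    return sum(ap_scores) / len(ap_scores)

def compare_runs(ap_run_a, ap_run_b, alpha=0.05):
    """Compare two retrieval runs over the same queries (here, the
    25 Zad queries) using a paired t-test on per-query AP scores."""
    t_stat, p_value = stats.ttest_rel(ap_run_a, ap_run_b)  # two-tailed by default
    significant = bool(alpha > p_value)  # significant when p falls below alpha
    return (mean_average_precision(ap_run_a),
            mean_average_precision(ap_run_b),
            significant)
</Paragraph>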
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Success and Limitations </SectionTitle>
<Paragraph position="0"> The evaluation clearly shows the effectiveness of Sebawai. In fact, Sebawai significantly outperformed ALPNET in the retrieval experiments, and the analyzer is often able to detect roots that are missed by a commercially available system. Because its rules are derived automatically, Sebawai was also developed very rapidly: it was built in less than 12 hours using about 200 lines of Perl code [21]. Further, the analyzer derives the roots of 40,000 words per minute on a Pentium-class machine with 256 MB of RAM running Linux, which is twice as fast as ALPNET on the same machine.</Paragraph>
<Paragraph position="1"> Rewriting Sebawai in a compiled language such as C is likely to improve the analysis speed further.</Paragraph>
<Paragraph position="2"> Furthermore, the method used to develop this Arabic morphological analyzer can potentially be used to rapidly develop morphological analyzers for other languages. Some languages, such as Hebrew, exhibit morphological properties similar to those of Arabic [12].</Paragraph>
<Paragraph position="3"> However, the system is restricted in the following aspects: 1. Since it limits the choice of roots to a fixed set, it does not stem words transliterated from other languages, such as transliterated named entities. For example, the English word Britain is transliterated as &quot;bryTAnyA&quot;, from which several candidate stems may be generated, none of which corresponds to a root in the fixed set.</Paragraph>
<Paragraph position="4"> 2. Not all words have 3-letter roots. For example, the word &quot;q&quot; means &quot;protect (in the form of a command)&quot;. Since such words are very rare, they may not appear in the training set.</Paragraph>
<Paragraph position="5"> 3. Some individual words in Arabic constitute complete sentences. For example, a single word can mean &quot;shall we forcefully bind you to it?&quot; These words are also rare and may not appear in a training set.</Paragraph>
<Paragraph position="6"> 4. The analyzer lacks the ability to decide which prefix-suffix combinations are legal. Although learning the legal combinations is feasible using statistics, the process would potentially require a huge number of examples to ensure that the system does not disallow legal combinations.</Paragraph>
</Section> </Section> </Paper>