File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/04/w04-0409_evalu.xml
Size: 2,586 bytes
Last Modified: 2025-10-06 13:59:16
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0409"> <Title>Integrating Morphology with Multi-word Expression Processing in Turkish</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> To improve and evaluate our multi-word expression extraction processor, we used two corpora of news text. We used a corpus of about 730,000 tokens to incrementally test and improve our semi-lexicalized rule base, by searching for compound verb formations, etc. Once such rules were extracted, we tested our processor on this corpus, and on a small corpus of about 4200 words to measure precision and recall. Table 2 provides some statistics on these corpora. Table 3 shows the result of multi-word expression extraction on the large (training) corpus. It should be noted that we only mark multi-word named-entities, not all named-entities. Thus many references to persons by [...] this extraction, the average number of morphological parses per token goes from 1.760 down to 1.745. Table 4 shows the result of multi-word expression extraction on the small corpus. We also manually marked up the small corpus into a gold-standard corpus to test precision and recall. The results in Table 4 correspond to an overall recall of 65.2% and a precision of 98.9%, over all classes of multi-word expressions. When we consider all classes except named-entities, we have a recall of 60.1% and a precision of 100%. An analysis of the errors and missed multi-word expressions indicates that the test corpus had a certain variant of a compound verb construction that we had failed to extract from the larger corpus we used for compiling rules. Failing to extract the multi-word expressions for that compound verb accounted for most of the drop in recall. 
Since we are currently using a rather naive named-entity extraction scheme,9 recall is rather low as there are quite a number of foreign multi-word named-entities (persons and organizations mostly) that do not exist in our database of named-entities. On the other hand, since named-entity extraction for English is a relatively mature technology, we can easily integrate an existing tool to improve our recall.</Paragraph> <Paragraph position="1"> 8 Since this is a very large corpus, we have no easy way of obtaining accurate precision and recall figures.</Paragraph> <Paragraph position="2"> 9 As opposed to a general purpose statistical NE extractor that we have developed earlier (Tür et al., 2003).</Paragraph> </Section> </Paper>