File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-1302_metho.xml
Size: 6,406 bytes
Last Modified: 2025-10-06 14:09:20
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1302"> <Title>On Statistical Parameter Setting</Title> <Section position="3" start_page="13" end_page="14" type="metho"> <SectionTitle> 2 The experimental setting </SectionTitle> <Paragraph position="0"> In the following we discuss the experimental setting. We used the Brown corpus, the child- null Due to space restrictions we do not formalize this further. A complete documentation and the source code is available at: http://jones.ling.indiana.edu/~abugi/. The Brown Corpus of Standard American English, consisting of 1,156,329 words from American texts printed in 1961 organized into 59,503 utterances and compiled by W.N. Francis and H. Kucera at Brown University.</Paragraph> <Paragraph position="1"> oriented speech portion of the CHILDES Peter corpus, and Caesar's &quot;De Bello Gallico&quot; in Latin. From the Brown corpus we used the files ck01 ck09, with an average number of 2000 words per chapter. The total number of words in these files is 18071. The randomly selected portion of &quot;De Bello Gallico&quot; contained 8300 words. The randomly selected portion of the Peter corpus contains 58057 words.</Paragraph> <Paragraph position="2"> The system reads in each file and dumps log information during runtime that contains the information for online and offline evaluation, as described below in detail.</Paragraph> <Paragraph position="3"> The gold standard for evaluation is based on human segmentation of the words in the respective corpora. We create for every word a manual segmentation for the given corpora, used for online evaluation of the system for accuracy of hypothesis generation during runtime. Due to complicated cases, where linguist are undecided about the accurate morphological segmentation, a team of 5 linguists was cooperating with this task.</Paragraph> <Paragraph position="4"> The offline evaluation is based on the grammar that is generated and dumped during runtime after each input file is processed. The grammar is manually annotated by a team of linguists, indicating for each construction whether it was segmented correctly and exhaustively. An additional evaluation criterion was to mark undecided cases, where even linguists do not agree. This information was however not used in the final evaluation.</Paragraph> <Section position="1" start_page="13" end_page="14" type="sub_section"> <SectionTitle> 2.1 Evaluation </SectionTitle> <Paragraph position="0"> We used two methods to evaluate the performance of the algorithm. The first analyzes the accuracy of the morphological rules produced by the algorithm after an increment of n words.</Paragraph> <Paragraph position="1"> The second looks at how accurately the algorithm parsed each word that it encountered as it progressed through the corpus.</Paragraph> <Paragraph position="2"> The morphological rule analysis looks at each grammar rule generated by the algorithm and judges it on the correctness of the rule and the resulting parse. A grammar rule consists of a stem and the suffixes and prefixes that can be attached to it, similar to the signatures used in Goldsmith (2001). The grammar rule was then marked as to whether it consisted of legitimate suffixes and prefixes for that stem, and also as to whether the Documented in L. Bloom (1970) and available at This was taken from the Gutenberg archive at: http://www.gutenberg.net/etext/10657. The Gutenberg header and footer were removed for the experimental run.</Paragraph> <Paragraph position="3"> stem of the rule was a true stem, as opposed to a stem plus another morpheme that wasn't identified by the algorithm. The number of rules that were correct in these two categories were then summed, and precision and recall figures were calculated for the trial. The trials described in the graph below were run on three increasingly large portions of the general fiction section of the Brown Corpus. The first trial was run on one randomly chosen chapter, the second trial on two chapters, and the third on three chapters. The graph shows the harmonic average (F-score) of precision and recall.</Paragraph> <Paragraph position="4"> The second analysis is conducted as the algorithm is running and examines each parse the system produces. The algorithm's parses are compared with the &quot;correct&quot; morphological parse of the word using the following method to derive a numerical score for a particular parse. The first part of the score is the distance in characters between each morphological boundary in the two parses, with a score of one point for each character space. The second part is a penalty of two points for each morphological boundary that occurs in one parse and not the other. These scores were examined within a moving window of words that progressed through the corpus as the algorithm ran. The average scores of words in each such window were calculated as the window advanced. The purpose of this method was to allow the performance of the algorithm to be judged at a given point without prior performance in the corpus affecting the analysis of the current window. The following graph shows how the average performance of the windows of analyzed words as the algorithm progresses through five randomly chosen chapters of general fiction in the Brown Corpus amounting to around 10,000 words.</Paragraph> <Paragraph position="5"> The window size for the following graph was set to 40 words.</Paragraph> <Paragraph position="6"> The evaluations on Latin were based on the initial 4000 words of &quot;De Bello Gallico&quot; in a pretest. In the very initial phase we reached a precision of 99.5% and a recall of 13.2%. This is however the preliminary result for the initial phase only. We expect that for a larger corpus the recall will increase much higher, given the rich morphology of Latin, potentially with negative consequences for precision.</Paragraph> <Paragraph position="7"> The results on the Peter corpus are shown in the following table: We notice a more or less stable precision value with decreasing recall, due to a higher number of words. The Peter corpus contains also many very specific transcriptions and tokens that are indeed unique, thus it is rather surprising to get such results at all. The following graphics shows the F-score for the Peter corpus:</Paragraph> </Section> </Section> class="xml-element"></Paper>