<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1059"> <Title>Finite State Transducers Approximating Hidden Markov Models</Title> <Section position="10" start_page="463" end_page="464" type="evalu"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> This section compares different n-type and s-type transducers with each other and with the underlying HMM.</Paragraph> <Paragraph position="1"> The FSTs perform tagging faster than the HMMs. Since all transducers are approximations of HMMs, they give a lower tagging accuracy than the corresponding HMMs. However, improvement in accuracy can be expected since these transducers can be composed with transducers encoding correction rules for frequent errors (sec. 1).</Paragraph> <Paragraph position="2"> Table 1 compares different transducers on an English test case.</Paragraph> <Paragraph position="3"> The s+nl-type transducer containing all possible subsequences up to a length of three classes is the most accurate (table 1, last line, s+nl-FST (~ 3): 95.95 %) but Mso the largest one. A similar rate of accuracy at a much lower size can be achieved with the s+nl-type, either with all subsequences up to a nO, nl n0-type (with only lexical probabilities) or nl-type (sec. 2) s+nl (100K, F2) s-type (sec. 3), with subsequences of frequency > 2, from a training corpus of 100 000 words (sec. 3.2 a), completed with nl-type (sec. 3.3) s+nl (< 2) s-type (sec. 3), with all possible subsequences of length _< 2 classes (sec. 3.2 b), completed with nl-type (sec. 3.3) Computer: ultra2, 1 CPU, 512 MBytes physical RAM, 1.4 GBytes virtual RAM length of two classes (s+nl-FST (5 2): 95.06 %) or with subsequences occurring at least once in a training corpus of 100 000 words (s+nl-FST (lOOK, F1): 95.05 %).</Paragraph> <Paragraph position="4"> Increasing the size of the training corpus and the frequency limit, i.e. the number of times that a sub-sequence must at least occur in the training corpus in order to be selected (sec. 3.2 a), improves the relation between tagging accuracy and the size of the transducer. E.g. the s+nl-type transducer that encodes subsequences from a training corpus of 20 000 words (table 1, s+nl-FST (20K, F1): 94.74 %, 927 states, 203 853 arcs), performs less accurate tagging and is bigger than the transducer that encodes sub-sequences occurring at least eight times in a corpus of 1 000 000 words (table 1, s+nl-FST (1M, F8): 95.09 %, 432 states, 96 712 arcs).</Paragraph> <Paragraph position="5"> Most transducers in table 1 are faster then the underlying HMM; the n0-type transducer about five times s. There is a large variation in speed between SSince n0-type and nl-type transducers have deterministic states only, a particular fast matching algorithm can be used for them.</Paragraph> <Paragraph position="6"> the different transducers due to their structure and size.</Paragraph> <Paragraph position="7"> Table 2 compares the tagging accuracy of different transducers and the underlying HMM for different languages. In these tests the highest accuracy was always obtained by s-type transducers, either with all subsequences up to a length of two classes 9 or with subsequences occurring at least once in a corpus of 100 000 words.</Paragraph> </Section> class="xml-element"></Paper>