File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/06/p06-1011_concl.xml
Size: 1,388 bytes
Last Modified: 2025-10-06 13:55:14
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1011"> <Title>Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora</Title> <Section position="7" start_page="86" end_page="87" type="concl"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> We have presented a simple and effective method for extracting parallel sub-sentential fragments from comparable corpora. We also presented a method for computing a probabilistic lexicon based on the LLR statistic, which produces a higher-quality lexicon. We showed that using this lexicon improves the precision of our extraction method.</Paragraph> <Paragraph position="1"> Our approach can be improved in several respects. The signal filtering function is very simple; more advanced filters might work better and eliminate the need for additional heuristics (such as our requirement that the extracted fragments have at least 3 words). The fact that the source and target signals are filtered separately is also a weakness; a joint analysis should produce better results. Despite the better lexicon, the greatest source of errors is still false word correspondences, generally involving punctuation and very common, closed-class words. Giving special attention to such cases should help eliminate these errors and improve the precision of the method.</Paragraph> </Section> </Paper>