<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2003"> <Title>An Extensive Empirical Study of Collocation Extraction Methods</Title> <Section position="6" start_page="16" end_page="17" type="concl"> <SectionTitle> 5 Conclusions and future work </SectionTitle> <Paragraph position="0"> We implemented 84 automatic collocation extraction methods and performed a series of experiments on morphologically and syntactically annotated data. The methods were evaluated against a reference set of collocations manually extracted from the same source.</Paragraph> <Paragraph position="1"> The best method (pointwise mutual information) achieved 68.3 % recall with 73.0 % precision (F-measure 70.6) on this data. We proposed to combine the association scores of each candidate bigram and employed logistic linear regression to find a linear combination of the association scores of all the basic methods. Thus we constructed a collocation extraction method which achieved 80.8 % recall with 84.8 % precision (F-measure 82.8). Furthermore, we applied an attribute selection technique in order to lower the high dimensionality of the classification problem and reduced the number of regressors from 87 to 17 with comparable performance. This result can be viewed as a kind of evaluation of basic collocation extraction techniques: it identifies the smallest subset of measures that still gives the best result. The other measures therefore become uninteresting and need not be further processed and evaluated.</Paragraph> <Paragraph position="2"> The research presented in this paper is in progress. The list of collocation extraction methods and association measures is far from complete. 
Our long-term goal is to collect, implement, and evaluate all available methods suitable for this task, and to release the toolkit for public use.</Paragraph> <Paragraph position="3"> In the future, we will focus especially on improving the quality of the training and testing data, employing other classification and attribute-selection techniques, and performing experiments on English data. A necessary part of the work will be a rigorous theoretical study of all applied methods and the appropriateness of their usage. Finally, we will attempt to demonstrate the contribution of collocations in selected application areas, such as machine translation or information retrieval.</Paragraph> </Section> </Paper>